## News Pigeon Proof of Concept

### Overview
This notebook aims to show that news can be collected and summarized by an
LLM using retrieval-augmented generation (RAG).

To use it, you will need to have Ollama downloaded: https://ollama.ai/

Ollama makes it possible to run an LLM locally without crashing your device.  For
context, I built and tested this notebook on a 2020 MacBook Air with an Apple
silicon (M1) chip and 8 GB of memory.

I am using an untuned Mistral 7B as my LLM.  I still need to experiment with other
LLMs, but I wanted to start with a 7B model, and, from what Twitter says, Mistral's
is the best.

### Details
The notebook can be split into the following sections:

0) Downloads, library imports, and feed selection

1) Scraping the RSS feeds specified in section 0

2) Embedding the scraped content and setting up parts of the chain

3) Creating the prompt and initializing the chain

4) Testing

### Downloads, library imports, and feed selection

In [2]:
pip install langchain bs4 sentence_transformers feedparser newspaper3k --quiet


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.3.1[0m[39;49m -> [0m[32;49m23.3.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip3 install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


In [3]:
from langchain.llms import Ollama
from langchain.vectorstores import Chroma
from langchain.document_loaders import WebBaseLoader
from langchain.embeddings import HuggingFaceBgeEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableParallel, RunnablePassthrough
from langchain.prompts import ChatPromptTemplate

import feedparser
from requests.exceptions import Timeout
import json
from datetime import datetime, timedelta

from newspaper import Article


In [4]:
feeds = [
    ["NYT World", "https://rss.nytimes.com/services/xml/rss/nyt/World.xml"],
    ["NYT Africa", "https://rss.nytimes.com/services/xml/rss/nyt/Africa.xml"],
    ["NYT Americas", "https://rss.nytimes.com/services/xml/rss/nyt/Americas.xml"],
    ["NYT APAC", "https://rss.nytimes.com/services/xml/rss/nyt/AsiaPacific.xml"],
    ["NYT EU", "https://rss.nytimes.com/services/xml/rss/nyt/Europe.xml"],
    ["NYT ME", "https://rss.nytimes.com/services/xml/rss/nyt/MiddleEast.xml"],
    ["NYT US", "https://rss.nytimes.com/services/xml/rss/nyt/US.xml"],
    ["NYT Education", "https://rss.nytimes.com/services/xml/rss/nyt/Education.xml"],
    ["NYT Business", "https://rss.nytimes.com/services/xml/rss/nyt/Business.xml"],
    ["NYT Energy", "https://rss.nytimes.com/services/xml/rss/nyt/EnergyEnvironment.xml"],
    ["NYT Small Business", "https://rss.nytimes.com/services/xml/rss/nyt/SmallBusiness.xml"],
    ["NYT Economy", "https://rss.nytimes.com/services/xml/rss/nyt/Economy.xml"],
    ["NYT Deals", "https://rss.nytimes.com/services/xml/rss/nyt/Dealbook.xml"],
    ["NYT Tech", "https://rss.nytimes.com/services/xml/rss/nyt/Technology.xml"],
    ["NYT Personal Tech", "https://rss.nytimes.com/services/xml/rss/nyt/PersonalTech.xml"],
    ["NYT Sports", "https://rss.nytimes.com/services/xml/rss/nyt/Sports.xml"],
    ["NYT Baseball", "https://rss.nytimes.com/services/xml/rss/nyt/Baseball.xml"],
    ["NYT College Basketball", "https://rss.nytimes.com/services/xml/rss/nyt/CollegeBasketball.xml"],
    ["NYT College Football", "https://rss.nytimes.com/services/xml/rss/nyt/CollegeFootball.xml"],
    ["NYT Golf", "https://rss.nytimes.com/services/xml/rss/nyt/Golf.xml"],
    ["NYT Hockey", "https://rss.nytimes.com/services/xml/rss/nyt/Hockey.xml"],
    ["NYT Basketball", "https://rss.nytimes.com/services/xml/rss/nyt/ProBasketball.xml"],
    ["NYT Football", "https://rss.nytimes.com/services/xml/rss/nyt/ProFootball.xml"],
    ["NYT Soccer", "https://rss.nytimes.com/services/xml/rss/nyt/Soccer.xml"],
    ["NYT Tennis", "https://rss.nytimes.com/services/xml/rss/nyt/Tennis.xml"],
    ["NYT Science", "https://rss.nytimes.com/services/xml/rss/nyt/Science.xml"],
    ["NYT Environment", "https://rss.nytimes.com/services/xml/rss/nyt/Climate.xml"],
    ["NYT Space", "https://rss.nytimes.com/services/xml/rss/nyt/Space.xml"],
    ["NYT Health", "https://rss.nytimes.com/services/xml/rss/nyt/Health.xml"],
    ["NYT Well Blog", "https://rss.nytimes.com/services/xml/rss/nyt/Well.xml"],
    ["NYT Arts", "https://rss.nytimes.com/services/xml/rss/nyt/Arts.xml"],
    ["NYT Arts2", "https://rss.nytimes.com/services/xml/rss/nyt/ArtandDesign.xml"],
    ["NYT Books", "https://rss.nytimes.com/services/xml/rss/nyt/Books/Review.xml"],
    ["NYT Dance", "https://rss.nytimes.com/services/xml/rss/nyt/Dance.xml"],
    ["NYT Movies", "https://rss.nytimes.com/services/xml/rss/nyt/Movies.xml"],
    ["NYT Music", "https://rss.nytimes.com/services/xml/rss/nyt/Music.xml"],
    ["NYT TV", "https://rss.nytimes.com/services/xml/rss/nyt/Television.xml"],
    ["NYT Theater", "https://rss.nytimes.com/services/xml/rss/nyt/Theater.xml"],
    ["NYT Style", "https://rss.nytimes.com/services/xml/rss/nyt/FashionandStyle.xml"],
    ["NYT Dining", "https://rss.nytimes.com/services/xml/rss/nyt/DiningandWine.xml"],
    ["NYT Love", "https://rss.nytimes.com/services/xml/rss/nyt/Weddings.xml"],
    ["NYT Travel", "https://www.nytimes.com/services/xml/rss/nyt/Travel.xml"],
    ["CNN Top Stories", "http://rss.cnn.com/rss/cnn_topstories.rss"],
    ["CNN World", "http://rss.cnn.com/rss/cnn_world.rss"],
    ["CNN US", "http://rss.cnn.com/rss/cnn_us.rss"],
    ["CNN Business", "http://rss.cnn.com/rss/money_latest.rss"],
    ["CNN Politics", "http://rss.cnn.com/rss/cnn_allpolitics.rss"],
    ["CNN Tech", "http://rss.cnn.com/rss/cnn_tech.rss"],
    ["CNN Health", "http://rss.cnn.com/rss/cnn_health.rss"],
    ["CNN Entertainment", "http://rss.cnn.com/rss/cnn_showbiz.rss"],
    ["CNN Travel", "http://rss.cnn.com/rss/cnn_travel.rss"],
    ["WashPo Politics", "http://feeds.washingtonpost.com/rss/politics?itid=lk_inline_manual_2"],
    ["WashPo Sports", "http://feeds.washingtonpost.com/rss/sports?itid=lk_inline_manual_20"],
    ["WashPo Tech", "http://feeds.washingtonpost.com/rss/business/technology?itid=lk_inline_manual_31"],
    ["WashPo US", "http://feeds.washingtonpost.com/rss/national?itid=lk_inline_manual_32"],
    ["WashPo World", "http://feeds.washingtonpost.com/rss/world?itid=lk_inline_manual_36"],
    ["WashPo Business", "http://feeds.washingtonpost.com/rss/business?itid=lk_inline_manual_37"],
    ["WashPo Lifestyle", "http://feeds.washingtonpost.com/rss/lifestyle?itid=lk_inline_manual_38"],
    ["WashPo Entertainment", "http://feeds.washingtonpost.com/rss/entertainment?itid=lk_inline_manual_39"],
    ["Reuters via Google", "https://news.google.com/rss/search?q=when:24h+allinurl:reuters.com&ceid=US:en&hl=en-US&gl=US"]
   ]

### Scraping RSS Feeds
This step relies mostly on the feedparser library.  For now we only need each
entry's link, which is later passed to Newspaper3k.

In [5]:
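links = []

# Cutoff for the (currently disabled) recency filter below
one_day_ago = datetime.now() - timedelta(days=1)

for feed_name, feed_url in feeds:
    feed = feedparser.parse(feed_url)

    recent_items = []
    for entry in feed.entries:
        try:
            # Fails if the entry has no parsable publish date
            published = datetime(*entry.published_parsed[:6])
            # Recency filter is disabled, so every dated entry is kept
            # if published > one_day_ago:
            recent_items.append(entry)
        except (AttributeError, TypeError, ValueError):
            # No usable publish date: keep the entry only if it has a title
            if entry.get("title", "") != "":
                recent_items.append(entry)

    for item in recent_items:
        try:
            links.append(item.links[0].href)
        except Exception as e:
            # Typically an entry without any link; no network request happens here
            print("ARTICLE COLLECTION ERROR:", e)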
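
The recency check is commented out in the cell above. If it were switched back on, it could look roughly like the sketch below. The helper name `is_recent` and the example timestamps are made up for illustration. As far as I know, feedparser reports `published_parsed` in UTC, so the `now` value passed in should be in UTC as well.

```python
import time
from datetime import datetime, timedelta


def is_recent(published_parsed, now, max_age=timedelta(days=1)):
    """Return True if a feedparser-style time tuple is at most max_age older than now.

    Entries without a date return False here. The notebook instead keeps
    undated entries when they have a title, so adjust as needed.
    """
    if published_parsed is None:
        return False
    published = datetime(*published_parsed[:6])
    return now - published <= max_age


# Hypothetical timestamps, both interpreted as UTC
now = datetime(2023, 4, 19, 12, 0, 0)
fresh = time.strptime("2023-04-19 08:30:00", "%Y-%m-%d %H:%M:%S")
stale = time.strptime("2023-04-16 08:30:00", "%Y-%m-%d %H:%M:%S")

print(is_recent(fresh, now))  # True
print(is_recent(stale, now))  # False
```

Passing `now` in explicitly, rather than calling `datetime.now()` inside the helper, keeps the cutoff identical for every feed and makes the check easy to test.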

### Embedding content from RSS feeds

We use langchain heavily from here on.  Depending on the amount of text scraped from the RSS feeds,
this cell might take some time.  Generally, each link takes around 45 seconds to embed, and there are usually around 30 links per feed.

First, we use Newspaper3k to get the text from the link.

Next, we use "RecursiveCharacterTextSplitter" to split the text into semantically
meaningful chunks.

Then, we prepend source metadata (title, URL, publish date) to each chunk, which helps the model cite sources.

Finally, we vectorize all the documents and set up other parts of the langchain
chain.
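
To make the chunking and metadata steps concrete, here is a simplified sketch that does not use langchain. It is not how "RecursiveCharacterTextSplitter" works internally: that class tries to split on separators such as paragraphs and lines, while this stand-in cuts at fixed character offsets. The function names and sample values are hypothetical; only the header format is copied from the notebook.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Naive fixed-width splitter: each chunk starts
    chunk_size - chunk_overlap characters after the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


def add_source_header(chunks, title, url, publish_date):
    """Prefix every chunk with the same source header the notebook uses."""
    header = (
        f"Source \n Title: {title} \n Url: {url} \n"
        f" Publish Date: {publish_date} \n  Excerpt from source: "
    )
    return [header + chunk for chunk in chunks]


# Toy input: 30 characters, split into 12-character chunks overlapping by 4
chunks = split_with_overlap("abcdefghij" * 3, chunk_size=12, chunk_overlap=4)
labelled = add_source_header(chunks, "Example title", "https://example.com", "2023-04-18")

print(len(chunks))   # 4
print(labelled[0])
```

Because the header is repeated on every chunk, whichever chunks are retrieved later carry their own title, link, and date into the prompt.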

In [6]:
# Using Ollama to host Mistral 7B, which, from what I have read, is the
# best model that I can run locally
model = Ollama(model='mistral')

# I chose this embedder because it is small and well performing according to 
# HF's MTEB leaderboard: https://huggingface.co/spaces/mteb/leaderboard
embedder_name = "BAAI/bge-base-en-v1.5"
model_kwargs = {"device": "cpu"}
encode_kwargs = {"normalize_embeddings": True}

embedder = HuggingFaceBgeEmbeddings(
    model_name=embedder_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)

# Recursive text splitter splits text into semantically meaningful chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)

# This is where we will store the different chunks of text
split_data = []

for link in links:
    try:
        # Using newspaper3k to download and parse the article
        article = Article(link)
        article.download()
        article.parse()
    except Exception:
        # There are certain articles that newspaper3k can't parse
        # I log those here to see if they are of any value (they usually aren't)
        print("ERROR DOWNLOADING ARTICLE:", link)
        continue

    # Skip articles where no text could be extracted
    if not article.text:
        continue

    # This metadata will be added to the beginning of each chunk
    # Doing this decreases hallucinations in citations
    meta_data = f"Source \n Title: {article.title} \n Url: {link} \n Publish Date: {article.publish_date} \n  Excerpt from source: "

    # Splitting (create_documents already returns the text as chunks)
    temp_split_data = text_splitter.create_documents([article.text])

    # Adding metadata to each chunk
    for split in temp_split_data:
        split.page_content = meta_data + split.page_content

    split_data += temp_split_data



ERROR DOWNLOADING ARTICLE: https://www.cnn.com/cnn-underscored/home/editors-favorite-sustainable-products?iid=CNNUnderscoredHPcontainer
ERROR DOWNLOADING ARTICLE: https://www.cnn.com/cnn-underscored/gifts/best-mothers-day-gifts-2023?iid=CNNUnderscoredHPcontainer
ERROR DOWNLOADING ARTICLE: https://www.cnn.com/cnn-underscored/fashion/mens-spring-fashion-style-guide?iid=CNNUnderscoredHPcontainer
ERROR DOWNLOADING ARTICLE: https://www.cnn.com/cnn-underscored/travel/amazon-travel-products?iid=CNNUnderscoredHPcontainer
ERROR DOWNLOADING ARTICLE: https://www.cnn.com/cnn-underscored/money/high-yield-savings-accounts?iid=CNNUnderscoredHPcontainer
ERROR DOWNLOADING ARTICLE: https://www.cnn.com/cnn-underscored/money/how-to-file-taxes?iid=CNNUnderscoredHPcontainer
ERROR DOWNLOADING ARTICLE: https://www.cnn.com/cnn-underscored/home/how-to-compost-at-home?iid=CNNUnderscoredHPcontainer
ERROR DOWNLOADING ARTICLE: https://www.cnn.com/cnn-underscored/reviews/mmmat-silicone-mats?iid=CNNUnderscoredHPconta
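
Every failure in the log shown above is a CNN Underscored link. One option, not used in this notebook, would be to drop such links before downloading. The sketch below assumes a hypothetical blocklist `SKIPPED_PATH_PREFIXES` and helper `should_skip`; the only prefix listed is the one seen in the log.

```python
from urllib.parse import urlparse

# Hypothetical blocklist, based only on the failures logged above
SKIPPED_PATH_PREFIXES = ("/cnn-underscored/",)


def should_skip(link, skipped_prefixes=SKIPPED_PATH_PREFIXES):
    """Return True if the URL path starts with one of the blocked prefixes.

    skipped_prefixes must be a tuple, since str.startswith accepts a tuple.
    """
    return urlparse(link).path.startswith(skipped_prefixes)


example_links = [
    "https://www.cnn.com/cnn-underscored/money/how-to-file-taxes?iid=CNNUnderscoredHPcontainer",
    "https://www.cnn.com/business/live-news/fox-news-dominion-trial-04-18-23/index.html",
]
kept_links = [link for link in example_links if not should_skip(link)]
print(kept_links)
```

Matching on the parsed path rather than the raw string avoids false positives when the prefix only appears in a query parameter.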

In [7]:
# vectorstore is the vector database using lightweight Chromadb
vectorstore = Chroma.from_documents(documents=split_data, embedding=embedder)

# Yet another part of langchain's abstractions
# Used to parse the output of the model
output_parser = StrOutputParser()


In [16]:
vectorstore_k = 6
vectorstore_score_threshold = 0.5

# Return at most k chunks, and only those scoring above the threshold
retriever = vectorstore.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={"score_threshold": vectorstore_score_threshold, "k": vectorstore_k},
)

# Lets us query the db to fill out the prompt template below
setup_and_retrieval = RunnableParallel(
    {"context": retriever, "question": RunnablePassthrough()}
)
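
Conceptually, a score-threshold retriever keeps the best `k` chunks and discards any whose score falls below the threshold. The sketch below illustrates that idea with plain cosine similarity on toy two-dimensional vectors. It is an assumption-laden illustration: Chroma and langchain derive their relevance scores from the store's distance metric, so real scores may be computed differently. The function names and toy data are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_vec, docs, k, score_threshold):
    """docs is a list of (text, vector) pairs.

    Returns up to k (score, text) pairs with score >= score_threshold,
    best match first.
    """
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in docs]
    kept = [pair for pair in scored if pair[0] >= score_threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return kept[:k]


# Toy "embeddings"; real ones from bge-base have hundreds of dimensions
toy_docs = [
    ("sports scores", [0.0, 1.0]),
    ("related coverage", [0.8, 0.6]),
    ("settlement reached", [1.0, 0.0]),
]
results = retrieve([1.0, 0.0], toy_docs, k=6, score_threshold=0.5)
for score, text in results:
    print(f"{score:.2f}  {text}")
```

With a threshold of 0.5, the unrelated "sports scores" chunk is dropped even though `k` would have allowed six results, which is the behavior that keeps weak matches out of the prompt.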

In [9]:
vectorstore.get()

{'ids': ['db7cfe6c-9e17-11ee-b386-9a3cf8b0e268',
  'db7cff02-9e17-11ee-b386-9a3cf8b0e268',
  'db7cff16-9e17-11ee-b386-9a3cf8b0e268',
  'db7cff2a-9e17-11ee-b386-9a3cf8b0e268',
  'db7cff3e-9e17-11ee-b386-9a3cf8b0e268',
  'db7cff48-9e17-11ee-b386-9a3cf8b0e268',
  'db7cff5c-9e17-11ee-b386-9a3cf8b0e268',
  'db7cff7a-9e17-11ee-b386-9a3cf8b0e268',
  'db7cff8e-9e17-11ee-b386-9a3cf8b0e268',
  'db7cff98-9e17-11ee-b386-9a3cf8b0e268',
  'db7cffac-9e17-11ee-b386-9a3cf8b0e268',
  'db7cffb6-9e17-11ee-b386-9a3cf8b0e268',
  'db7cffc0-9e17-11ee-b386-9a3cf8b0e268',
  'db7cffd4-9e17-11ee-b386-9a3cf8b0e268',
  'db7cffde-9e17-11ee-b386-9a3cf8b0e268',
  'db7cffe8-9e17-11ee-b386-9a3cf8b0e268',
  'db7cfffc-9e17-11ee-b386-9a3cf8b0e268',
  'db7d0006-9e17-11ee-b386-9a3cf8b0e268',
  'db7d001a-9e17-11ee-b386-9a3cf8b0e268',
  'db7d0024-9e17-11ee-b386-9a3cf8b0e268',
  'db7d002e-9e17-11ee-b386-9a3cf8b0e268',
  'db7d0042-9e17-11ee-b386-9a3cf8b0e268',
  'db7d004c-9e17-11ee-b386-9a3cf8b0e268',
  'db7d0056-9e17-11ee-b386-

### Template and initializing chain

In [17]:
# Prompt engineering is a constant process, especially with a smaller model
template = """
You are a news expert who answers questions about news.  

You must only use the most recent data from this context:
{context}

Do not rely on your historical records.

Answer as concisely as possible, but make sure that your information
lines up with the sources.  

Your answer should be in the following format:
    Your answer to the question here.

    Sources: [article title, source link, publish date]

Cite all possible sources and put each on a new line.
If you don't have the relevant information to answer the question or a source,
tell the user so.  Err on the side of caution.

If the question or topic is off topic and not about news at all, tell the user so.

Here is the question: {question}
"""

prompt = ChatPromptTemplate.from_template(template)

# The final runnable in the chain
chain = setup_and_retrieval | prompt | model | output_parser
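
In the chain above, the retriever's output (a list of Document objects) is handed to the prompt as-is, so the model most likely sees the objects' printed form rather than clean text. A common refinement, not used in this notebook, is to join the chunk texts before filling the template. The sketch below shows the idea without langchain; `Doc`, `format_docs`, `fill_prompt`, and the toy template are hypothetical stand-ins.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Stand-in for a langchain Document; only page_content is modelled."""
    page_content: str


def format_docs(docs):
    """Join chunk texts with blank lines so the prompt contains plain text."""
    return "\n\n".join(doc.page_content for doc in docs)


def fill_prompt(template, docs, question):
    """Fill the {context} and {question} slots of a template string."""
    return template.format(context=format_docs(docs), question=question)


toy_template = "Context:\n{context}\n\nHere is the question: {question}"
sample_docs = [Doc("Source A: first excerpt"), Doc("Source B: second excerpt")]

filled = fill_prompt(toy_template, sample_docs, "What happened?")
print(filled)
```

In langchain terms this would amount to piping the retriever through a formatting function before the prompt, which I have not tested against this notebook's library version.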

### Up and running!

In [15]:

question = "What happened with Fox's lawsuit against Dominion?"

# I want to see what context the model is using to answer the question
print("CONTEXT FETCHED:")
print(setup_and_retrieval.invoke(question))
print({"question": question})

CONTEXT FETCHED:
{'context': [Document(page_content='Source \n Title: Dominion still has pending lawsuits against election deniers such as Rudy Giuliani and Sidney Powell \n Url: https://www.cnn.com/business/live-news/fox-news-dominion-trial-04-18-23/h_8d51e3ae2714edaa0dace837305d03b8 \n Publish Date: 2023-04-18 15:02:55+00:00 \n  Excerpt from source: Reporters and members of the public outside of the Leonard Williams Justice Center where Dominion Voting Systems is suing Fox News in Delaware Superior Court today in Wilmington, Delaware. (Chip Somodevilla/Getty Images)\n\nA last-second settlement has been reached in Dominion Voting Systems’ historic defamation lawsuit against Fox News, the parties announced Tuesday in court.'), Document(page_content='Source \n Title: Settlement reached in Fox vs Dominion lawsuit \n Url: https://www.cnn.com/business/live-news/fox-news-dominion-trial-04-18-23/index.html \n Publish Date: 2023-04-18 15:02:55+00:00 \n  Excerpt from source: He said the settle

In [18]:
print("---------------")

print("\nANSWER:")
print(chain.invoke(question))

### Analysis

This notebook is more of a proof of concept that we can use LLMs to parse through news.  It's not meant to be perfect and it certainly isn't.

The design could be improved in the following ways (no particular order):

 - Improving from a 7B model.  Probably the biggest barrier is that a 7B model's outputs can only be so good.  It often doesn't follow instructions, and sometimes it avoids producing output at all.  Perhaps this behavior would change with a better model, though I'm skeptical that I'd be able to run any larger model on my M1 8GB.
 - Fine-tuning on a custom dataset.  Might be possible on a slightly better computer using some kind of PEFT technique.  I'd have to use GPT to generate some training data, though.
 - Improving the news collection pipeline.  It's important to get a variety of sources, and maybe the best way to do that is not just through arbitrary RSS feeds.
 - LLM memory.  The ability to ask follow-up questions would be nice, but I doubt this model's ability to handle very specific questions.
 - Better prompt engineering.