## News Pigeon Proof of Concept

### Overview
With this notebook, I aim to show that news can be collected and summarized by an
LLM using retrieval-augmented generation (RAG).

To use it, you will need to have Ollama downloaded: https://ollama.ai/

Ollama makes it possible to run an LLM locally without crashing your device.  For
context, I built and tested this notebook on a 2020 MacBook Air with an Apple
silicon chip and 8 GB of memory.

I am using an untuned Mistral 7B as my LLM.  I still need to experiment with other
LLMs, but I wanted to start with a 7B model, and, from what Twitter says, Mistral's
is the best.

### Details
The notebook can be split into the following sections:

0) Downloads, library imports, and feed selection

1) Scraping the RSS feeds specified in section 0

2) Embedding the scraped content and setting up parts of the chain

3) Creating the prompt and initializing the chain

4) Testing

### Downloads, library imports, and feed selection

In [1]:
pip install langchain bs4 sentence_transformers feedparser newspaper3k --quiet

Note: you may need to restart the kernel to use updated packages.


In [2]:
from langchain.llms import Ollama
from langchain.vectorstores import Chroma
from langchain.document_loaders import WebBaseLoader
from langchain.embeddings import HuggingFaceBgeEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableParallel, RunnablePassthrough
from langchain.prompts import ChatPromptTemplate

import feedparser
from requests.exceptions import Timeout
import json
from datetime import datetime, timedelta

from newspaper import Article


In [3]:
feeds = [
    ["CNN Top Stories", "http://rss.cnn.com/rss/cnn_topstories.rss"],
    ["CNN World", "http://rss.cnn.com/rss/cnn_world.rss"],
    ["CNN US", "http://rss.cnn.com/rss/cnn_us.rss"],
    ["CNN Business", "http://rss.cnn.com/rss/money_latest.rss"],
    ["CNN Politics", "http://rss.cnn.com/rss/cnn_allpolitics.rss"],
    ["CNN Tech", "http://rss.cnn.com/rss/cnn_tech.rss"],
    ["CNN Health", "http://rss.cnn.com/rss/cnn_health.rss"],
    ["CNN Entertainment", "http://rss.cnn.com/rss/cnn_showbiz.rss"],
    ["CNN Travel", "http://rss.cnn.com/rss/cnn_travel.rss"],
   ]

### Scraping RSS Feeds
This section relies mostly on the feedparser library.  All we need from each feed
are the article links, which are later run through Newspaper3k.

In [4]:
links = []

for sub_feed in feeds:
    feed = feedparser.parse(sub_feed[1])
    one_day_ago = datetime.now() - timedelta(days=1)

    recent_items = []
    for entry in feed.entries:
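        try:
            published = datetime(*entry.published_parsed[:6])
            # The one-day filter is disabled, so every dated entry is kept
            # if published > one_day_ago:
            recent_items.append(entry)
        except Exception:
            # No usable publish date: keep the entry only if it has a title
            if entry.title != "":
                recent_items.append(entry)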

    for item in recent_items:
        try:
            links.append(item.links[0].href)

        except Timeout:
            print("ARTICLE COLLECTION TIMED OUT:", item.links[0].href)
        except Exception as e:
            print("ARTICLE COLLECTION ERROR:", e)
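The date filter in the cell above is commented out, so `one_day_ago` is never used. Below is a minimal sketch of how a one-day cutoff could be applied. It assumes `published_parsed` is a UTC `time.struct_time`, as feedparser documents it, or is missing. The helper name `is_recent` is hypothetical and not part of the notebook.

```python
import calendar
import time
from datetime import datetime, timedelta, timezone


def is_recent(published_parsed, now=None, max_age=timedelta(days=1)):
    """Return True if a UTC struct_time falls within max_age of now.

    Entries without a parsed date return False here; the notebook
    instead keeps them when they have a non-empty title.
    """
    if published_parsed is None:
        return False
    now = now or datetime.now(timezone.utc)
    # calendar.timegm reads the struct_time as UTC (time.mktime would use local time)
    published = datetime.fromtimestamp(
        calendar.timegm(published_parsed), tz=timezone.utc
    )
    return now - published <= max_age


# Usage with hand-made timestamps, so no network access is needed
now = datetime(2023, 10, 20, 12, 0, tzinfo=timezone.utc)
fresh = time.gmtime(calendar.timegm((2023, 10, 20, 0, 0, 0)))
stale = time.gmtime(calendar.timegm((2023, 10, 18, 0, 0, 0)))
print(is_recent(fresh, now), is_recent(stale, now))
```

Comparing in UTC avoids the offset that would appear if a UTC timestamp were compared against a naive local `datetime.now()`.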

### Embedding content from RSS feeds

We use LangChain heavily from here on.  Depending on the amount of text scraped from the RSS feeds,
this cell might take some time.  Generally, each link in the feed takes around 40 seconds to embed.

First, we use Newspaper3k to download each link and extract the article text.

Next, we use "RecursiveCharacterTextSplitter" to split the text into chunks of up
to 500 characters with a 50-character overlap, breaking at natural boundaries
where it can.

Then, we prepend the source metadata (title, URL, publish date) to each chunk,
which helps the model cite sources.

Finally, we vectorize all the chunks and set up the other parts of the LangChain
chain.
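To make the splitting and metadata steps concrete, here is a simplified sketch in plain Python. It is not what "RecursiveCharacterTextSplitter" does internally: that splitter prefers paragraph, line, and word boundaries, while this sketch only cuts fixed-size overlapping windows. The function names `chunk_text` and `add_source_header` are hypothetical.

```python
def chunk_text(text, chunk_size=500, chunk_overlap=50):
    """Split text into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


def add_source_header(chunks, title, url, publish_date):
    """Prefix every chunk with the same source header used in the notebook."""
    header = (
        f"Source \n Title: {title} \n Url: {url} "
        f"\n Publish Date: {publish_date} \n  Excerpt from source: "
    )
    return [header + chunk for chunk in chunks]


# Tiny example so the overlap is easy to see
chunks = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1)
docs = add_source_header(chunks, "Example title", "https://example.com/a", "2023-10-19")
print(chunks)
print(docs[0])
```

Because every chunk carries its own title, URL, and date, any chunk the retriever returns already contains what the model needs to cite it.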

In [5]:
# Using Ollama to host Mistral 7B, which, from what I have read, is the
# best model that I can run locally
model = Ollama(model='mistral')

# I chose this embedder because it is small and well performing according to 
# HF's MTEB leaderboard: https://huggingface.co/spaces/mteb/leaderboard
embedder_name = "BAAI/bge-base-en-v1.5"
model_kwargs = {"device": "cpu"}
encode_kwargs = {"normalize_embeddings": True}

embedder = HuggingFaceBgeEmbeddings(
    model_name=embedder_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)

# Recursive text splitter splits text into semantically meaningful chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)

# This is where we will store the different chunks of text
split_data = []

for link in links:
    good_2_go = True
    try:
        # Using newspaper3k to download and parse the article
        article = Article(link)
        article.download()
        article.parse()
    except Exception:
        # There are certain articles that newspaper3k can't parse
        # I catch those here and see if they are of any value (They usually aren't)
        good_2_go = False
        print("ERROR DOWNLOADING ARTICLE:", link)
    
    if good_2_go and article.text is not None:
        # This metadata will be added to the beginning of each chunk
        # Doing this decreases hallucinations in citations
        meta_data = (
            f"Source \n Title: {article.title} \n Url: {link} "
            f"\n Publish Date: {article.publish_date} \n  Excerpt from source: "
        )

        # Splitting
        candidate = text_splitter.create_documents([article.text])

        temp_split_data = text_splitter.split_documents(candidate)

        # Adding metadata to each chunk
        for split in temp_split_data:
            split.page_content = meta_data + split.page_content
        
        split_data += temp_split_data

# vectorstore is the vector database using lightweight Chromadb
vectorstore = Chroma.from_documents(documents=split_data, embedding=embedder)

# Yet another part of langchain's abstractions
# Used to parse the output of the model
output_parser = StrOutputParser()

# Lets us query the db to fill out the prompt template below
setup_and_retrieval = RunnableParallel(
    {"context": vectorstore.as_retriever(), "question": RunnablePassthrough()}
)



ERROR DOWNLOADING ARTICLE: https://www.cnn.com/cnn-underscored/reviews/best-bidets?iid=CNNUnderscoredHPcontainer
ERROR DOWNLOADING ARTICLE: https://www.cnn.com/cnn-underscored/home/editors-favorite-sustainable-products?iid=CNNUnderscoredHPcontainer
ERROR DOWNLOADING ARTICLE: https://www.cnn.com/cnn-underscored/gifts/best-mothers-day-gifts-2023?iid=CNNUnderscoredHPcontainer
ERROR DOWNLOADING ARTICLE: https://www.cnn.com/cnn-underscored/fashion/mens-spring-fashion-style-guide?iid=CNNUnderscoredHPcontainer
ERROR DOWNLOADING ARTICLE: https://www.cnn.com/cnn-underscored/travel/amazon-travel-products?iid=CNNUnderscoredHPcontainer
ERROR DOWNLOADING ARTICLE: https://www.cnn.com/cnn-underscored/money/high-yield-savings-accounts?iid=CNNUnderscoredHPcontainer
ERROR DOWNLOADING ARTICLE: https://www.cnn.com/cnn-underscored/money/how-to-file-taxes?iid=CNNUnderscoredHPcontainer
ERROR DOWNLOADING ARTICLE: https://www.cnn.com/cnn-underscored/home/how-to-compost-at-home?iid=CNNUnderscoredHPcontainer
ERR
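The embedder is configured with `normalize_embeddings: True`, so every vector has unit length. For unit vectors the dot product equals the cosine similarity, and ranking by Euclidean distance gives the same order. The sketch below illustrates that idea with toy two-dimensional vectors. It does not reproduce how Chroma indexes or searches, and `normalize` and `top_k` are hypothetical helpers that assume non-zero vectors.

```python
import math


def normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def top_k(query_vec, doc_vecs, k=2):
    """Return the indices of the k documents most similar to the query."""
    q = normalize(query_vec)
    scores = []
    for idx, vec in enumerate(doc_vecs):
        d = normalize(vec)
        # For unit vectors, the dot product is the cosine similarity
        scores.append((sum(a * b for a, b in zip(q, d)), idx))
    scores.sort(reverse=True)
    return [idx for _, idx in scores[:k]]


# Toy "embeddings": the query points mostly along the first axis
doc_vecs = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]
print(top_k([0.9, 0.1], doc_vecs, k=2))
```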

### Template and initializing chain

In [6]:
# Prompt engineering is a constant process, especially with a smaller model
template = """
You are a news expert who answers questions about news.  

You must only use the most recent data from this context:
{context}

Do not rely on your historical records.

Answer as concisely as possible, but make sure that your information
lines up with the sources.  

Your answer should be in the following format:
    Your answer to the question here.

    Sources: [article title, source link, publish date]

Cite all possible sources and put each on a new line.
If you don't have the relevant information to answer the question or a source,
tell the user so.  Err on the side of caution.

If the question or topic is off topic and not about news at all, tell the user so.

Here is the question: {question}
"""

prompt = ChatPromptTemplate.from_template(template)

# The final runnable in the chain
chain = setup_and_retrieval | prompt | model | output_parser
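The `|` operator feeds each step's output into the next step. As a rough mental model only, the same data flow can be written with plain functions. Every function below is a hypothetical stand-in (no retriever, prompt class, or model is actually called), and the shortened template is an assumption made for brevity.

```python
def retrieve_stub(question):
    # Stand-in for vectorstore.as_retriever(): returns canned chunks
    return ["Source \n Title: Example \n Url: https://example.com \n  Excerpt from source: text"]


def setup_and_retrieval_stub(question):
    # Mirrors the dict built by RunnableParallel: the context is looked up,
    # while the question is passed through unchanged
    return {"context": retrieve_stub(question), "question": question}


def fill_prompt_stub(inputs):
    template = "Context:\n{context}\n\nHere is the question: {question}"
    return template.format(**inputs)


def model_stub(prompt_text):
    # A real LLM call would go here
    return f"  (reply to a prompt of {len(prompt_text)} characters)  "


def parse_output_stub(raw):
    return raw.strip()


def run_chain(question, steps):
    """Pass the value through each step in order, like chaining with |."""
    value = question
    for step in steps:
        value = step(value)
    return value


steps = [setup_and_retrieval_stub, fill_prompt_stub, model_stub, parse_output_stub]
print(run_chain("What happened today?", steps))
```

Printing the output of `setup_and_retrieval_stub` on its own shows exactly what would be substituted into the prompt, which is the same check the notebook performs before invoking the full chain.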

### Up and running!

In [9]:

question = "What is the most recent news in Israel?"

# I want to see what context the model is using to answer the question
print("CONTEXT FETCHED:")
print(setup_and_retrieval.invoke(question))

print("\nANSWER:")
print(chain.invoke(question))

{'context': [Document(page_content='Source \n Title: EU asks Meta for more details on efforts to stop Israel-Hamas war misinformation \n Url: https://www.cnn.com/2023/10/19/tech/eu-meta-tiktok-israel-content-disinformation/index.html \n Publish Date: 2023-10-19 00:00:00 \n  Excerpt from source: London CNN —\n\nThe European Union has told Meta it has a week to explain in greater detail how it is fighting the spread of illegal content and disinformation on its Facebook and Instagram platforms following the attacks across Israel by Hamas.\n\nThe European Commission, the bloc’s executive arm, said it had sent the formal request for information to Meta (META) Thursday.'), Document(page_content='Source \n Title: EU asks Meta for more details on efforts to stop Israel-Hamas war misinformation \n Url: https://www.cnn.com/2023/10/19/tech/eu-meta-tiktok-israel-content-disinformation/index.html \n Publish Date: 2023-10-19 00:00:00 \n  Excerpt from source: On Friday, Meta said its teams had been w