## Project: Summarize and Consult Multiple News Articles

- Given a list of news articles, summarize them and answer questions about them accurately

#### Skills

- LangChain
- Vector Database and Semantic Search
- Web Scraping
- Streamlit (for proof of concept app)

#### Techniques

- Chunk articles and send only the relevant chunks to the LLM, instead of the entire article (saves costs)

In [None]:
# Langchain
!pip install langchain_community

# Text Splitter
!pip3 install unstructured libmagic python-magic

# Vector DB
!pip install faiss-cpu
!pip install sentence-transformers

# Files from GitHub
!wget https://raw.githubusercontent.com/codebasics/langchain/main/2_news_research_tool_project/notebooks/nvda_news_1.txt -O nvda_news_1.txt
!wget https://raw.githubusercontent.com/codebasics/langchain/main/2_news_research_tool_project/notebooks/movies.csv -O movies.csv
!wget https://raw.githubusercontent.com/codebasics/langchain/main/2_news_research_tool_project/notebooks/sample_text.csv -O sample_text.csv

## 1. Document Loader

In [15]:
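# Loaders live in langchain_community (installed above); the langchain.document_loaders path is deprecated
from langchain_community.document_loaders import TextLoader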

loader = TextLoader('/kaggle/working/nvda_news_1.txt')
data = loader.load()
type(data)

list

In [16]:
data[0].metadata

{'source': '/kaggle/working/nvda_news_1.txt'}

In [22]:
from langchain_community.document_loaders import CSVLoader

loader = CSVLoader('/kaggle/working/movies.csv',
                  source_column = 'title')
data = loader.load()

data[0].metadata

{'source': 'K.G.F: Chapter 2', 'row': 0}

In [30]:
from langchain_community.document_loaders import UnstructuredURLLoader

loader = UnstructuredURLLoader(
    urls = [
        "https://www.moneycontrol.com/news/business/banks/hdfc-bank-re-appoints-sanmoy-chakrabarti-as-chief-risk-officer-11259771.html",
        "https://www.moneycontrol.com/news/business/markets/market-corrects-post-rbi-ups-inflation-forecast-icrr-bet-on-these-top-10-rate-sensitive-stocks-ideas-11142611.html"
    ]
)

data = loader.load()

print(data[0].metadata)
data[0].page_content[:250]

{'source': 'https://www.moneycontrol.com/news/business/banks/hdfc-bank-re-appoints-sanmoy-chakrabarti-as-chief-risk-officer-11259771.html'}


'English\n\nHindi\n\nGujarati\n\nSpecials\n\nHello, Login\n\nHello, Login\n\nLog-inor Sign-Up\n\nMy Account\n\nMy Profile\n\nMy Portfolio\n\nMy Watchlist\n\nMy Alerts\n\nMy Messages\n\nPrice Alerts\n\nMy Profile\n\nMy PRO\n\nMy Portfolio\n\nMy Watchlist\n\nMy Alerts\n\nMy Messages\n\nPrice '

## 2. Text Splitter (Chunking)

In [32]:
# Taking some sample text from Wikipedia
text = """Interstellar is a 2014 epic science fiction film co-written, directed, and produced by Christopher Nolan. It stars Matthew McConaughey, Anne Hathaway, Jessica Chastain, Bill Irwin, Ellen Burstyn, Matt Damon, and Michael Caine. Set in a dystopian future where humanity is embroiled in a catastrophic blight and famine, the film follows a group of astronauts who travel through a wormhole near Saturn in search of a new home for humankind.
Brothers Christopher and Jonathan Nolan wrote the screenplay, which had its origins in a script Jonathan developed in 2007 and was originally set to be directed by Steven Spielberg. Kip Thorne, a Caltech theoretical physicist and 2017 Nobel laureate in Physics,[4] was an executive producer, acted as a scientific consultant, and wrote a tie-in book, The Science of Interstellar. Cinematographer Hoyte van Hoytema shot it on 35 mm movie film in the Panavision anamorphic format and IMAX 70 mm. Principal photography began in late 2013 and took place in Alberta, Iceland, and Los Angeles. Interstellar uses extensive practical and miniature effects, and the company Double Negative created additional digital effects.
Interstellar premiered in Los Angeles on October 26, 2014. In the United States, it was first released on film stock, expanding to venues using digital projectors. The film received generally positive reviews from critics and grossed over $677 million worldwide ($715 million after subsequent re-releases), making it the tenth-highest-grossing film of 2014. It has been praised by astronomers for its scientific accuracy and portrayal of theoretical astrophysics.[5][6][7] Interstellar was nominated for five awards at the 87th Academy Awards, winning Best Visual Effects, and received numerous other accolades."""


In [34]:
# Say the LLM limit is 100 (counted in characters here for simplicity). Slicing works, but it cuts words in half
text[0:100]

'Interstellar is a 2014 epic science fiction film co-written, directed, and produced by Christopher N'

In [35]:
# We could write a for loop instead, but it is tedious and easy to get wrong
words = text.split(" ")
chunks = []  # append strings built word by word in a for loop, keeping len(string) < 100
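
The manual loop hinted at above could look roughly like this. It is a minimal sketch in plain Python, not anything from LangChain; the name `chunk_by_words`, the `max_len` parameter and the shortened sample sentence are made up for illustration.

```python
def chunk_by_words(text, max_len=100):
    """Greedily pack whole words into chunks of at most max_len characters."""
    chunks, current = [], ""
    for word in text.split():
        candidate = word if not current else current + " " + word
        if len(candidate) <= max_len:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A single word longer than max_len would still become its own oversized chunk.
            current = word
    if current:
        chunks.append(current)
    return chunks


sample = ("Interstellar is a 2014 epic science fiction film co-written, "
          "directed, and produced by Christopher Nolan.")
pieces = chunk_by_words(sample, max_len=40)
print(pieces)
```

Even this small version has to handle overlong words and the trailing chunk, and it ignores sentence and paragraph boundaries, which is why the text splitters below are preferable.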

In [44]:
from langchain.text_splitter import CharacterTextSplitter

splitter = CharacterTextSplitter(
    separator = '\n',
    chunk_size = 200,
    chunk_overlap = 0,
)

chunks = splitter.split_text(text)

chunks[0]

'Interstellar is a 2014 epic science fiction film co-written, directed, and produced by Christopher Nolan. It stars Matthew McConaughey, Anne Hathaway, Jessica Chastain, Bill Irwin, Ellen Burstyn, Matt Damon, and Michael Caine. Set in a dystopian future where humanity is embroiled in a catastrophic blight and famine, the film follows a group of astronauts who travel through a wormhole near Saturn in search of a new home for humankind.'

In [46]:
# CharacterTextSplitter takes a single separator. The text has only two newlines,
# so each paragraph stays whole and every chunk exceeds chunk_size = 200.

[len(c) for c in chunks]

[437, 716, 611]

In [52]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    separators = ['\n\n', '\n', '.', ' '],
    chunk_size = 200,
    chunk_overlap = 0,
)

chunks = splitter.split_text(text)

chunks[0]

'Interstellar is a 2014 epic science fiction film co-written, directed, and produced by Christopher Nolan'

In [53]:
[len(c) for c in chunks]

[104, 121, 197, 13, 1, 180, 198, 114, 94, 130, 162, 194, 106, 149]
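
To make the idea behind recursive splitting concrete, here is a simplified sketch in plain Python. It is not LangChain's actual implementation: `recursive_split` is a hypothetical helper that tries the coarsest separator first, merges neighbouring pieces while they still fit, and only falls back to finer separators for pieces that are too long. The shortened sample text and `chunk_size=60` are also just for illustration.

```python
def recursive_split(text, separators, chunk_size):
    """Split text into pieces of at most chunk_size characters,
    trying coarse separators before fine ones."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # No separator left: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate          # keep merging while it fits
            continue
        if current.strip():
            chunks.append(current)       # the separator at this cut point is dropped
        current = ""
        if len(part) <= chunk_size:
            current = part
        else:
            # This piece alone is too long: retry with the finer separators.
            chunks.extend(recursive_split(part, finer, chunk_size))
    if current.strip():
        chunks.append(current)
    return chunks


sample = (
    "Interstellar is a 2014 epic science fiction film. It stars Matthew McConaughey.\n"
    "Brothers Christopher and Jonathan Nolan wrote the screenplay. "
    "Kip Thorne was an executive producer."
)
demo_chunks = recursive_split(sample, ["\n\n", "\n", ".", " "], chunk_size=60)
print([len(c) for c in demo_chunks])
```

Every piece respects the size limit because a chunk is only kept once it fits, and anything longer is handed to the next separator in the list.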

## 3. Vector Database

- Industry standard: Pinecone, Milvus, Chroma
- Lightweight: FAISS (Facebook AI Similarity Search) - can be used as a Vector DB for small projects

Convert each chunk into an embedding and store it in a FAISS index

In [56]:
import pandas as pd

pd.set_option('display.max_colwidth', 100)

df = pd.read_csv('/kaggle/working/sample_text.csv')
df

Unnamed: 0,text,category
0,Meditation and yoga can improve mental health,Health
1,"Fruits, whole grains and vegetables helps control blood pressure",Health
2,These are the latest fashion trends for this week,Fashion
3,Vibrant color jeans for male are becoming a trend,Fashion
4,The concert starts at 7 PM tonight,Event
5,Navaratri dandiya program at Expo center in Mumbai this october,Event
6,Exciting vacation destinations for your next trip,Travel
7,Maldives and Srilanka are gaining popularity in terms of low budget vacation places,Travel


In [58]:
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer('all-mpnet-base-v2')
encoder

SentenceTransformer(
  (0): Transformer({'max_seq_length': 384, 'do_lower_case': False}) with Transformer model: MPNetModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)

In [59]:
vectors = encoder.encode(df.text)
vectors

array([[-0.00247395,  0.03626722, -0.05290459, ..., -0.09152356,
        -0.03970001, -0.04330489],
       [-0.03357267,  0.00980519, -0.03250129, ..., -0.05165466,
         0.02245887, -0.03156182],
       [-0.01865322, -0.04051318, -0.01235387, ...,  0.00610586,
        -0.07179645,  0.02773851],
       ...,
       [-0.00066458,  0.04252127, -0.05645508, ...,  0.0131547 ,
        -0.03183567, -0.04357665],
       [-0.03317153,  0.03252455, -0.02484838, ...,  0.0117442 ,
         0.05747124,  0.00571023],
       [-0.00166395,  0.00413828, -0.04597083, ...,  0.02008527,
         0.05656243, -0.00161596]], dtype=float32)

In [60]:
vectors.shape

(8, 768)

In [61]:
dim = vectors.shape[1]

In [76]:
import faiss

index = faiss.IndexFlatL2(dim)  # Uses the 2-norm (euclidean distance)

# "Flat" means it does a brute-force search over all the vectors stored in the index. 
# Slow for large datasets, but it is exact and guarantees accurate results.

index

<faiss.swigfaiss_avx2.IndexFlatL2; proxy of <Swig Object of type 'faiss::IndexFlatL2 *' at 0x7f5b9025afd0> >

In [78]:
index.add(vectors)
index.ntotal

8
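
What a "Flat" (brute-force) index does at query time can be sketched in a few lines of plain Python. This is an illustration only, not FAISS code: `flat_l2_search` and the toy 2-D vectors are made up. The sketch returns squared Euclidean distances, which, as far as the FAISS documentation describes, is also what `IndexFlatL2` reports.

```python
def flat_l2_search(vectors, query, k):
    """Exhaustive nearest-neighbour search: compare the query with every stored vector."""
    scored = []
    for i, vec in enumerate(vectors):
        squared_dist = sum((a - b) ** 2 for a, b in zip(vec, query))
        scored.append((squared_dist, i))
    scored.sort()                      # smallest distance first
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]


stored = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
distances, ids = flat_l2_search(stored, [0.9, 0.1], k=2)
print(distances, ids)
```

The cost grows linearly with the number of stored vectors, which is fine for the 8 rows here but is the reason approximate indexes exist for large collections.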

In [80]:
search_query = "I want to buy a polo t-shirt"
svec = encoder.encode(search_query)
svec.shape

(768,)

In [81]:
svec = svec.reshape(1,-1)
svec.shape

(1, 768)

In [82]:
index.search(svec, k = 2)

(array([[1.3844836, 1.4039096]], dtype=float32), array([[3, 2]]))

In [83]:
# Rows 3 and 2 are both in the Fashion category, which checks out
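
The raw distances are easier to interpret as cosine similarities. Two assumptions go into the conversion below: that `IndexFlatL2` returns squared Euclidean distances (FAISS's documented behaviour, to my knowledge), and that the embeddings are unit length, which the `Normalize()` layer in the encoder printout suggests. For unit vectors, ||a - b||² = 2 - 2·cos(a, b), so cos(a, b) = 1 - d/2. The helper name `cosine_from_squared_l2` is made up for this sketch.

```python
def cosine_from_squared_l2(squared_distance):
    """Cosine similarity of two unit-length vectors, given their squared L2 distance."""
    return 1.0 - squared_distance / 2.0


# Distances returned by index.search for the polo t-shirt query
sims = [cosine_from_squared_l2(d) for d in (1.3844836, 1.4039096)]
print([round(s, 3) for s in sims])  # [0.308, 0.298]
```

Under those assumptions, both matches are only loosely similar to the query (about 0.3), even though they are the closest rows in this small dataset.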

In [90]:
search_query = "I want to improve my health"
svec = encoder.encode(search_query).reshape(1,-1)
locs = index.search(svec, k =  2)[1][0]

[df['text'][i] for i in locs]

['Meditation and yoga can improve mental health',
 'Fruits, whole grains and vegetables helps control blood pressure']