# RAG-based spoiler blocker agent
- **Purpose**: to learn RAG and LangChain, not to build a SOTA model.
- **VectorDB**: indexes text embeddings for fast retrieval at query time.
- **Text_Splitter**: LLMs have a finite context length, so text needs to be split into suitably sized chunks before it is inserted into the vector database.
- **Encoder-based Transformer**: the vector database relies on an NLU-oriented transformer encoder (BERT is one example) to embed the text; here we use OpenAI's embedding model.
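
To make the retrieval idea concrete, here is a minimal sketch of what a vector store does at query time: it compares the query vector with every stored vector and returns the closest texts. The three-dimensional vectors and the names `toy_index` and `search` are invented for illustration; real embeddings come from the encoder model and have far more dimensions. Cosine similarity is used here for simplicity, and the distance metric used by an actual FAISS index may differ.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical hand-written "embeddings"; a real encoder would produce these.
toy_index = {
    "A detective hunts a killer": [0.9, 0.1, 0.0],
    "Two friends open a bakery": [0.0, 0.2, 0.9],
    "A spy escapes an assassin": [0.8, 0.3, 0.1],
}

def search(query_vector, index, k=1):
    """Return the k texts whose vectors are most similar to the query."""
    ranked = sorted(
        index,
        key=lambda text: cosine_similarity(query_vector, index[text]),
        reverse=True,
    )
    return ranked[:k]

print(search([1.0, 0.0, 0.0], toy_index, k=2))
```

A brute-force scan like this is fine for a handful of vectors; a dedicated index exists to make the same lookup fast over many thousands of chunks.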

In [1]:
#importing necessary libraries
from operator import itemgetter
from langchain.chat_models import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.prompts import ChatPromptTemplate
from langchain.schema.output_parser import StrOutputParser
from langchain.schema.runnable import RunnableLambda, RunnablePassthrough
from langchain.vectorstores import FAISS
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
from langchain.text_splitter import RecursiveCharacterTextSplitter
from tqdm import tqdm
import math
from langchain.prompts import PromptTemplate
# Read the API key from the environment instead of hard-coding it in the notebook
assert os.environ.get("OPENAI_API_KEY"), "Set OPENAI_API_KEY before running this notebook"
embeddings=OpenAIEmbeddings()


ModuleNotFoundError: No module named 'langchain'

1. Loading historical data from Kaggle
- It is in CSV format
- It contains **plot_summary** & **plot_synopsis**

In [19]:
df=pd.read_csv("details_movie.csv")

In [20]:
df.dropna().shape

(1339, 8)

In [21]:
df.head()

| Unnamed: 0.1 | Unnamed: 0 | movie_id | plot_summary | duration | genre | rating | release_date | plot_synopsis |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | tt0105112 | Former CIA analyst, Jack Ryan is in England wi... | 1h 57min | ['Action', 'Thriller'] | 6.9 | 1992-06-05 | Jack Ryan (Ford) is on a "working vacation" in... |
| 1 | 1 | tt1204975 | Billy (Michael Douglas), Paddy (Robert De Niro... | 1h 45min | ['Comedy'] | 6.6 | 2013-11-01 | Four boys around the age of 10 are friends in ... |
| 2 | 2 | tt0243655 | The setting is Camp Firewood, the year 1981. I... | 1h 37min | ['Comedy', 'Romance'] | 6.7 | 2002-04-11 | |
| 3 | 3 | tt0040897 | Fred C. Dobbs and Bob Curtin, both down on the... | 2h 6min | ['Adventure', 'Drama', 'Western'] | 8.3 | 1948-01-24 | Fred Dobbs (Humphrey Bogart) and Bob Curtin (T... |
| 4 | 4 | tt0126886 | Tracy Flick is running unopposed for this year... | 1h 43min | ['Comedy', 'Drama', 'Romance'] | 7.3 | 1999-05-07 | Jim McAllister (Matthew Broderick) is a much-a... |


2. Before indexing the synopsis and summary into the **vector db**, the text needs to be split because the LLM has a finite context window.
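
As a rough illustration of what "chunk size" and "overlap" mean, here is a minimal sketch of fixed-window splitting in plain Python. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which prefers to break on separators such as paragraphs, lines, and spaces; the function name `chunk_text` and the sample string are assumptions made for this example.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=50):
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks

sample = "abcdefghij" * 3  # 30 characters
pieces = chunk_text(sample, chunk_size=12, chunk_overlap=2)
print([len(p) for p in pieces])
# The last characters of one chunk are repeated at the start of the next.
print(pieces[0][-2:], pieces[1][:2])
```

The overlap means a sentence cut at a chunk boundary still appears partly in both neighbouring chunks, which can help retrieval when the relevant detail sits near the cut.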

In [22]:
arr_synopsis=df['plot_synopsis'].dropna().tolist()
arr_summary=df['plot_summary'].dropna().tolist()


In [23]:
#this function takes an array of documents and splits any document longer than 1000 characters into multiple overlapping chunks

def split_pip(arr):
    #defining the text splitter
    text_splitter=RecursiveCharacterTextSplitter(chunk_size=1000,chunk_overlap=50)
    chunks=[]
    #iterating through the array; extend() keeps the result a flat 1D list
    for text in tqdm(arr):
        chunks.extend(text_splitter.split_text(text))
    return chunks

In [24]:
#calling helper function to split text
synopsis=split_pip(arr_synopsis)
summary=split_pip(arr_summary)

100%|██████████████████████████████████████████████████████████████| 1339/1339 [00:01<00:00, 751.60it/s]
100%|████████████████████████████████████████████████████████████| 1572/1572 [00:00<00:00, 39225.69it/s]


In [25]:
print("length of synopsis",len(synopsis)," array")

print("length of summary",len(summary)," array")

length of synopsis 14344  array
length of summary 1940  array


##### 3. The FAISS vector DB uses OpenAI embeddings to store the synopsis and summary chunks in embedding space. These two vector DBs are used for RAG.

In [26]:
#initializing NLU embeddings
embeddings=OpenAIEmbeddings()

In [12]:
#vectorizing the synopsis array & saving it to local disk as a vector db
vectorStore=FAISS.from_texts(synopsis,embedding=OpenAIEmbeddings())
vectorStore.save_local("synopsis_final_all")

In [27]:
#loading the stored vector db
synopsis_db=FAISS.load_local("synopsis_final_all", embeddings)
summary_db=FAISS.load_local("summary_final", embeddings)


In [28]:
len(synopsis_db.docstore._dict)

14344

In [29]:
len(summary_db.docstore._dict)

1639

##### - **Example of how the vector db works: given a query, it retrieves the related text.**
#### - Badmash 😂: Hindi for bandit or goon; also an endearing way of calling someone a bully

In [30]:
    
from IPython.display import Image, display
display(Image(url="https://upload.wikimedia.org/wikipedia/en/8/84/Munna_Bhai_M.B.B.S._poster.jpg"))


In [31]:
synopsis_db.similarity_search("Badmash")[0]

Document(page_content="game!Munna is busy memorizing the answers for Dr. Asthana's quiz, kindly provided by Dr. Pavri in gratitude for saving his father, when Dr. Suman brings distressing news: Zaheer is in a terminal condition and wishes to see him. With his last breath, Zaheer begs Munna to save him, eventually dying in Munnas arms. The incident shakes Munna so much that he can't get through Dr. Asthana's quiz. Disgraced by Dr. Asthana a second time, he leaves the Institute. Then the whole Institute is shaken as Anand Banerjee, whom had been given up as a lost cause, attempts to follow Munna and bring him back. Encouraged by this, Dr. Suman publicly rebukes her father for chasing away a man who only tried to make the patients happier and more cheerful.That night, Munna and Circuit drown their sorrows in alcohol, but when they reach home they find a surprise: Munna's parents wait for them at their headquarters with open arms. The Sharmas have been told of their son's exploits, and Har

#### 4. Drafting the prompt for LLM querying

In [32]:
prompt_template = PromptTemplate.from_template(
"""You are an AI assistant. Your task is to block movie spoilers.
    
    Given the movie {summary} & {synopsis}, answer whether the given {review} is a spoiler or not.
    
    Answer in binary format, 1: Yes & 0: No, followed by an explanation of why it is or is not a spoiler:
    """
)


llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
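
To show how the three placeholders end up in the final prompt, here is a minimal plain-Python stand-in that uses `str.format` instead of `PromptTemplate`. The shortened template text, the helper name `build_prompt`, and the example values are all invented for illustration.

```python
# Plain-Python stand-in for the prompt template: str.format fills the same
# three placeholders. The wording and values below are invented examples.
template = (
    "Summary: {summary}\n"
    "Synopsis: {synopsis}\n"
    "Review: {review}\n"
    "Is the review a spoiler? Answer 1 or 0, then explain."
)

def build_prompt(summary, synopsis, review):
    """Return the final prompt string that would be sent to the model."""
    return template.format(summary=summary, synopsis=synopsis, review=review)

filled = build_prompt(
    summary="A son seeks his father's approval.",
    synopsis="(retrieved synopsis chunks go here)",
    review="The father dies at the end.",
)
print(filled)
```

In the notebook, the summary and synopsis values are not typed by hand; they are the chunks the retrievers return for the review text.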

#### 5. Setting the FAISS vector db as a retriever object so that relevant documents are retrieved at query time

In [33]:
synopsis_retri=synopsis_db.as_retriever()
summary_retri=summary_db.as_retriever()

#### 6. Defining functions for scraping the IMDb website

In [34]:
from langchain.document_loaders import AsyncHtmlLoader
from langchain.document_transformers import BeautifulSoupTransformer
import re
#this helper function uses Beautiful Soup for web scraping; it scrapes the synopsis of a movie given its movie id
def add_syno(movie_id):
    url=f"https://www.imdb.com/title/{movie_id}/plotsummary/?ref_=tt_stry_pl"
    html=AsyncHtmlLoader(url).load()
    docs_transformed = BeautifulSoupTransformer().transform_documents(html, tags_to_extract=["section"])
    pattern = re.compile(r'(?:Synopsis.+?){2}(.+)', re.IGNORECASE | re.DOTALL)
    text=pattern.search(docs_transformed[0].page_content).group(1).strip()
    pattern2=re.compile(r'^(.*?)(?=(?:Begin INLINE40|$))', re.IGNORECASE | re.DOTALL)
    return pattern2.search(text).group(1).strip()

#this helper function uses Beautiful Soup for web scraping; it scrapes the summary of a movie given its movie id
def add_summary(movie_id):
    url=f"https://www.imdb.com/title/{movie_id}/plotsummary/?ref_=tt_stry_pl"
    html=AsyncHtmlLoader(url).load()
    docs_transformed = BeautifulSoupTransformer().transform_documents(html, tags_to_extract=["section"])
    pattern = re.compile(r'(?:Summaries.+?){2}(.+)', re.IGNORECASE | re.DOTALL)
    text=pattern.search(docs_transformed[0].page_content).group(1).strip()
    pattern2=re.compile(r'^(.*?)(?=(?:Synopsis|$))', re.IGNORECASE | re.DOTALL)
    return pattern2.search(text).group(1).strip()

#function to add text into vector db
def add_into_vectorDb(db,text):
    db.merge_from(FAISS.from_texts(text,embedding=OpenAIEmbeddings()))
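
The two-step regex extraction used by the scraping helpers can be tried offline on a toy string. This is a sketch under assumptions: `page_text` is an invented stand-in for the extracted page content (the real page layout may differ), and `extract_summary` is a hypothetical helper that returns `None` when the heading is missing, whereas the helpers above would raise an `AttributeError` in that case.

```python
import re

# Invented stand-in for the text extracted from the page.
page_text = "Summaries Edit Summaries A short plot outline. Synopsis The full plot."

# Step 1: skip past the second occurrence of "Summaries" and keep the rest.
after_heading = re.compile(r"(?:Summaries.+?){2}(.+)", re.IGNORECASE | re.DOTALL)
# Step 2: keep everything up to the next "Synopsis" heading (or the end of the text).
before_next = re.compile(r"^(.*?)(?=(?:Synopsis|$))", re.IGNORECASE | re.DOTALL)

def extract_summary(text):
    """Return the summary section, or None when the heading is not found."""
    first = after_heading.search(text)
    if first is None:
        return None
    return before_next.search(first.group(1).strip()).group(1).strip()

print(extract_summary(page_text))
print(extract_summary("no headings here"))
```

Because the patterns depend on heading words appearing in the page text, a layout change on the site can silently break the extraction, so checking for a failed match is a cheap safeguard.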


In [35]:
print("Summary of Animal 2023 Movie \n",add_summary("tt13751694"))

Fetching pages: 100%|#####################################################| 1/1 [00:00<00:00,  1.33it/s]


Summary of Animal 2023 Movie 
 A son's love for his father, who is often away due to work and hence unable to comprehend the intensity of his son's love. Ironically, this fervent love and admiration for his father and family creates conflict between the father and son. Balbir Singh is a rich industrialist but has no time for his family.His son Ranvijay loves him to the core and considers him a superhero.But differences develop between the father and son at a very young age of Ranvijay and he is sent to boarding school.Years later a Ranvijay returns to celebrate 60th birthday of Balbir but things turn ugly on that day and he asked to leave the house.While Ranjavijay is leaving he is surprised to see Geetaanjali who has broken her engagement and wants to be with him.They both get married in a private ceremony and shift to US.Eight years later Balbir is attacked by unknown assailants but survives, Ranvijay returns with Geetaanjali and his kids to be with his family and starts a war with p

In [36]:
print("Synopsis of Animal 2023 Movie \n",add_syno("tt13751694"))

Fetching pages: 100%|#####################################################| 1/1 [00:00<00:00,  1.81it/s]


Synopsis of Animal 2023 Movie 
 Ranvijay "Vijay" Singh is the son of Balbir Singh, a Delhi-based business magnate who heads the generational steel company, "Swastik Steels". Vijay resides in a palatial mansion with Balbir, his mother, Jyoti, and sisters Reet and Roop. Vijay has adored Balbir from his childhood, but Balbir does not spend time with the family due to his busy schedule. One day, Reet gets bullied by her collegemates, and Vijay learns about this and warns them with an AK-47 and chases them. Balbir learns about this and sends him away to a boarding school in the US. Vijay returns to Delhi after completing his education, where he attends the engagement of his former school classmate Geetanjali. Vijay always has feelings for Geetanjali and confesses to her that she would be much better off with Vijay as he is an alpha male, which leads to Geetanjali instantly falling for him and breaking her engagement. During Balbir's 60th birthday, Vijay fights with Reet's husband Varun Prat

In [37]:
len(synopsis_db.docstore._dict)

14344

In [38]:
len(summary_db.docstore._dict)

1639

- adding the scraped text into the vector dbs

In [39]:
add_into_vectorDb(synopsis_db,split_pip([add_syno("tt13751694")]))
add_into_vectorDb(summary_db,split_pip([add_summary("tt13751694")]))

Fetching pages: 100%|#####################################################| 1/1 [00:00<00:00,  2.08it/s]
100%|███████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1175.20it/s]
Fetching pages: 100%|#####################################################| 1/1 [00:01<00:00,  1.57s/it]
100%|███████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 3196.88it/s]
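
The merge step can be pictured with a toy in-memory store. This is only a sketch: `ToyStore` is a hypothetical class with parallel lists of texts and vectors, not how FAISS stores data internally, and the vectors are made up.

```python
class ToyStore:
    """Hypothetical minimal store: parallel lists of texts and vectors."""

    def __init__(self, texts, vectors):
        if len(texts) != len(vectors):
            raise ValueError("each text needs exactly one vector")
        self.texts = list(texts)
        self.vectors = list(vectors)

    def merge_from(self, other):
        """Append the other store's entries to this one in place."""
        self.texts.extend(other.texts)
        self.vectors.extend(other.vectors)

    def __len__(self):
        return len(self.texts)

main = ToyStore(["old chunk 1", "old chunk 2"], [[0.1, 0.2], [0.3, 0.4]])
new = ToyStore(["new chunk"], [[0.5, 0.6]])

before = len(main)
main.merge_from(new)
print(before, "->", len(main))
```

Comparing the document count before and after a merge, as in the sketch, is a quick sanity check that the new chunks were actually added.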


#### Reinitializing the retrievers because we added new data points

In [40]:
synopsis_retri=synopsis_db.as_retriever()
summary_retri=summary_db.as_retriever()

#### 7. Defining the RAG chain

In [41]:
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


rag_chain = (
    {"summary":summary_retri|format_docs,"synopsis":synopsis_retri|format_docs, "review": RunnablePassthrough()}
    | prompt_template
    | llm
    | StrOutputParser()
)
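
To see the data flow of the chain without calling any API, here is a minimal plain-Python sketch. Every function below is a stub invented for illustration (`fake_summary_retriever`, `fake_synopsis_retriever`, `fake_prompt`, `fake_llm`); the sketch only mirrors the shape of the pipeline: the review fans out to both retrievers and is also passed through unchanged, then the combined inputs flow through the prompt and the model.

```python
# Stub components standing in for the retrievers, prompt, and model.
def fake_summary_retriever(review):
    return "summary chunks related to: " + review

def fake_synopsis_retriever(review):
    return "synopsis chunks related to: " + review

def fake_prompt(inputs):
    return "Summary: {summary} | Synopsis: {synopsis} | Review: {review}".format(**inputs)

def fake_llm(prompt_text):
    # A real model would read the prompt; this stub always answers the same way.
    return "1: Yes (prompt had %d characters)" % len(prompt_text)

def run_chain(review):
    """Fan the review out to each branch, then pipe the result through the steps."""
    inputs = {
        "summary": fake_summary_retriever(review),
        "synopsis": fake_synopsis_retriever(review),
        "review": review,  # passthrough: the input is forwarded unchanged
    }
    return fake_llm(fake_prompt(inputs))

result = run_chain("The hero dies.")
print(result)
```

Note that the same review text serves both as the retrieval query and as the text being judged, which is why a review about a movie missing from the stores can only be compared against unrelated chunks.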



#### 8. Invoking the chain
- Let's try it on the movie Animal, the worst movie I have ever watched.
- It was hyped because of the song.

In [42]:
display(Image(url="https://media.ptcpunjabi.co.in/wp-content/uploads/2023/06/animal-movie_03dac46d0d2ea838eccf15a0c7880e70_1280X720.webp"))


In [43]:
query="""Vijay & his cousins will chase abrar and finally vijay will kill abrar on the airport after fight. 
        Balbir will develop cancer and will die. 
      His wife geetanjali will leave vijay for usa heartbroken vijay will hug his son for last time. 
      """

In [44]:
synopsis_retri.get_relevant_documents(query)

[Document(page_content="A brutal fight ensues, where Vijay finally kills Abrar and returns to India. During their Diwali celebration, Balbir reveals to Vijay that he has terminal cancer and realizes that his upbringing is the reason behind Vijay's aggression. Balbir finally apologizes and reconciles with Vijay, who is overjoyed after finally receiving familial love and affection from Balbir. In the post-credits scene, Asrar, Abid, and Abrar's other younger brother Aziz, a professional assassin in Istanbul, learn that Vijay was responsible for killing Asrar and Abrar. After successfully undergoing plastic surgery to become Vijay's doppelgänger, Aziz and Abid set out to exact vengeance on Vijay and his family."),
 Document(page_content="Asrar was the actual perpetrator behind Balbir's assassination attempt. Asrar's younger brothers Abid and Abrar found out everything about Vijay and sent her to him. Vijay reveals that he knew earlier that she was sent by someone and that he wanted to kno

In [45]:
summary_retri.get_relevant_documents(query)

[Document(page_content="A son's love for his father, who is often away due to work and hence unable to comprehend the intensity of his son's love. Ironically, this fervent love and admiration for his father and family creates conflict between the father and son. Balbir Singh is a rich industrialist but has no time for his family.His son Ranvijay loves him to the core and considers him a superhero.But differences develop between the father and son at a very young age of Ranvijay and he is sent to boarding school.Years later a Ranvijay returns to celebrate 60th birthday of Balbir but things turn ugly on that day and he asked to leave the house.While Ranjavijay is leaving he is surprised to see Geetaanjali who has broken her engagement and wants to be with him.They both get married in a private ceremony and shift to US.Eight years later Balbir is attacked by unknown assailants but survives, Ranvijay returns with Geetaanjali and his kids to be with his family and starts a war with people w

In [46]:
rag_chain.invoke(query)

"1: Yes\n\nExplanation: The given text contains plot details and reveals important events and twists in the movies. It gives away the conflicts, relationships, and outcomes of the characters, which can spoil the viewing experience for someone who hasn't watched the movies yet."

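Since the prompt asks for a binary answer followed by an explanation, the reply can be turned into a label for downstream use. The helper below is a hypothetical sketch (`parse_verdict` is not part of the notebook) and assumes the reply starts with `1` or `0`, as in the outputs shown here; a model is not guaranteed to follow that format every time.

```python
import re

def parse_verdict(answer):
    """Return (is_spoiler, explanation) from a reply that starts with 1 or 0."""
    match = re.match(r"\s*([01])\b", answer)
    if match is None:
        raise ValueError("reply does not start with 0 or 1")
    is_spoiler = match.group(1) == "1"
    # Drop the separator characters between the digit and the explanation.
    explanation = answer[match.end():].lstrip(" :\n")
    return is_spoiler, explanation

reply = "1: Yes\n\nExplanation: The given text contains plot details."
flag, why = parse_verdict(reply)
print(flag)
print(why)
```

Raising an error on an unexpected reply, rather than guessing, makes format drift visible instead of silently mislabeling reviews.
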
### 9. Now the fun part

- I just changed the names (Mirza to Mike Singh, Sahiba to Preeto 🤭) in a tragic love story to check whether the model will hallucinate or not.
- GPT-3.5 already knows such legendary story plots, so it worked.

In [47]:
from IPython.display import Image, display
display(Image(url="https://i.ytimg.com/vi/AI1g_UgTifo/mqdefault.jpg"))
print("Mike singh ❤️ preeto  😹-Modern Mirza")

Mike singh ❤️ preeto  😹-Modern Mirza


In [48]:
rag_chain.invoke("""In 16th Century Punjab there was Mike Singh a rich landlord from jatt community
                     he fell in love with Preeto from neighbouring village. 
                     Finally they decided to marry and agaisnt will of family. 
                     They ran away and but preeto's brother caught and killed mike singh with axes
                     """)

'0: No, it is not a spoiler. The given information does not reveal any crucial plot points or twists in the movie. It only provides a brief summary of the story without giving away any major surprises or developments.'

In [49]:
rag_chain.invoke("Mike singh will be killed at end by preeto's brother. Preeto will cry in vain")

"0: No, this is not a spoiler. The given text does not reveal any specific plot details or twists that would spoil the movie for someone who hasn't seen it. It only provides a general overview of the characters and their relationships."

In [50]:
rag_chain.invoke("Mike singh will be killed at end by preeto's brother")

"0: No, it is not a spoiler. The given information does not reveal any major plot twists or significant events that would spoil the movie for someone who hasn't seen it yet."