# RAG

## Requirements

In [None]:
%%capture
!pip install transformers accelerate bitsandbytes langchain langchain-community sentence-transformers faiss-gpu pandas gdown contractions unidecode tqdm
!pip install -U langchain-huggingface

## Dataset

In [None]:
!gdown --fuzzy https://drive.google.com/file/d/1Lq2zVJlN_B4kUAu4VafQ4jXMIQiAR9vI/view?usp=sharing

Downloading...
From (original): https://drive.google.com/uc?id=1Lq2zVJlN_B4kUAu4VafQ4jXMIQiAR9vI
From (redirected): https://drive.google.com/uc?id=1Lq2zVJlN_B4kUAu4VafQ4jXMIQiAR9vI&confirm=t&uuid=466cf316-66e3-45b1-b60a-a457b8d01a88
To: /content/IMDB_crawled.json
100% 292M/292M [00:01<00:00, 196MB/s]


## Config

In [None]:
class Config:
    EMBEDDING_MODEL_NAME = "thenlper/gte-base"
    LLM_MODEL_NAME = "HuggingFaceH4/zephyr-7b-beta"
    K = 5  # number of documents to retrieve (top K)

## Preprocessing

In [None]:
import pandas as pd

df = pd.read_json('IMDB_crawled.json')

In [None]:
df.head()

Unnamed: 0,id,title,first_page_summary,release_year,mpaa,budget,gross_worldwide,rating,directors,writers,stars,related_links,languages,countries_of_origin,summaries,synposis,reviews,genres
0,tt0071562,The Godfather Part II,The early life and career of Vito Corleone in ...,1974,R,"$13,000,000 (estimated)","$47,962,683",9.0,[Francis Ford Coppola],,"[Al Pacino, Robert De Niro, Robert Duvall]",[https://imdb.com/title/tt0068646/?ref_=tt_sim...,"[English, Italian, Spanish, Latin, Sicilian]",[United States],[The early life and career of Vito Corleone in...,[The Godfather Part II presents two parallel s...,"[[Coppola's masterpiece is rivaled only by ""Th...","[Crime, Drama]"
1,tt0120737,The Lord of the Rings: The Fellowship of the Ring,A meek Hobbit from the Shire and eight compani...,2001,PG-13,"$93,000,000 (estimated)","$884,041,698",8.9,[Peter Jackson],,"[Elijah Wood, Ian McKellen, Orlando Bloom]",[https://imdb.com/title/tt0167261/?ref_=tt_sim...,"[English, Sindarin]","[New Zealand, United States]",[A meek Hobbit from the Shire and eight compan...,[Galadriel (Cate Blanchett) (The Elven co-rule...,"[[Here is one film that lived up to its hype, ...","[Action, Adventure, Drama]"
2,tt0110912,Pulp Fiction,"The lives of two mob hitmen, a boxer, a gangst...",1994,R,"$8,000,000 (estimated)","$213,928,762",8.9,[Quentin Tarantino],,"[John Travolta, Uma Thurman, Samuel L. Jackson]",[https://imdb.com/title/tt0137523/?ref_=tt_sim...,"[English, Spanish, French]",[United States],"[The lives of two mob hitmen, a boxer, a gangs...",[Narrative structure\nPulp Fiction's narrative...,[[I like the bit with the cheeseburger. It mak...,"[Crime, Drama]"
3,tt0068646,The Godfather,The aging patriarch of an organized crime dyna...,1972,R,"$6,000,000 (estimated)","$250,342,030",9.2,[Francis Ford Coppola],,"[Marlon Brando, Al Pacino, James Caan]",[https://imdb.com/title/tt0071562/?ref_=tt_sim...,"[English, Italian, Latin]",[United States],[The aging patriarch of an organized crime dyn...,"[In late summer 1945, guests are gathered for ...",[['The Godfather' is the pinnacle of flawless ...,"[Crime, Drama]"
4,tt0111161,The Shawshank Redemption,"Over the course of several years, two convicts...",1994,R,"$25,000,000 (estimated)","$28,904,232",9.3,[Frank Darabont],"[Stephen King, Frank Darabont]","[Tim Robbins, Morgan Freeman, Bob Gunton]",[https://imdb.com/title/tt0468569/?ref_=tt_sim...,[English],[United States],"[Over the course of several years, two convict...","[In 1947, Andy Dufresne (Tim Robbins), a banke...",[[The Shawshank Redemption is written and dire...,[Drama]


In [None]:
import os
import re
import contractions
import string
from unidecode import unidecode

pd.set_option('display.max_colwidth', None)
os.makedirs('data', exist_ok=True)

# Preprocess the data and keep only the needed fields, since the embedding model's context window is limited
STOPWORDS = {'a', 'an', 'the', 'this', 'that', 'about', 'whom', 'being', 'where', 'why', 'had', 'should', 'each'}

def preprocess(text):
  # collapse runs of whitespace and trim the ends
  text = re.sub(r'\s+', ' ', text).strip()
  # strip ASCII punctuation (this also joins hyphenated words, e.g. "Middle-earth" becomes "Middleearth")
  text = text.translate(str.maketrans('', '', string.punctuation))
  # transliterate to ASCII, lowercase, and drop the stopwords
  lowered = unidecode(text).lower()
  return ' '.join(word for word in lowered.split() if word not in STOPWORDS)


df = df[['title','first_page_summary', 'genres', 'release_year']]
df = df.dropna(subset=['title', 'first_page_summary', 'genres', 'release_year'])
# copy so that the column assignments below operate on an independent frame
df = df.drop_duplicates(subset=['first_page_summary']).copy()
df['first_page_summary'] = df['first_page_summary'].apply(preprocess)
df = df.dropna(subset=['first_page_summary'])
# cast the year to str so the concatenation works whether the column holds strings or numbers
df['data'] = df['title'] + '(' + df['release_year'].astype(str) + ')' + ': ' + df['first_page_summary']

df.to_csv('data/imdb.csv', index=False)
df.head()

Unnamed: 0,title,first_page_summary,genres,release_year,data
0,The Godfather Part II,early life and career of vito corleone in 1920s new york city is portrayed while his son michael expands and tightens his grip on family crime syndicate,"[Crime, Drama]",1974,The Godfather Part II(1974): early life and career of vito corleone in 1920s new york city is portrayed while his son michael expands and tightens his grip on family crime syndicate
1,The Lord of the Rings: The Fellowship of the Ring,meek hobbit from shire and eight companions set out on journey to destroy powerful one ring and save middleearth from dark lord sauron,"[Action, Adventure, Drama]",2001,The Lord of the Rings: The Fellowship of the Ring(2001): meek hobbit from shire and eight companions set out on journey to destroy powerful one ring and save middleearth from dark lord sauron
2,Pulp Fiction,lives of two mob hitmen boxer gangster and his wife and pair of diner bandits intertwine in four tales of violence and redemption,"[Crime, Drama]",1994,Pulp Fiction(1994): lives of two mob hitmen boxer gangster and his wife and pair of diner bandits intertwine in four tales of violence and redemption
3,The Godfather,aging patriarch of organized crime dynasty transfers control of his clandestine empire to his reluctant son,"[Crime, Drama]",1972,The Godfather(1972): aging patriarch of organized crime dynasty transfers control of his clandestine empire to his reluctant son
4,The Shawshank Redemption,over course of several years two convicts form friendship seeking consolation and eventually redemption through basic compassion,[Drama],1994,The Shawshank Redemption(1994): over course of several years two convicts form friendship seeking consolation and eventually redemption through basic compassion
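
To make the effect of the cleaning step concrete, here is a minimal standalone sketch of the same pipeline applied to a single string. It is an approximation rather than the cell above: `clean_summary` is a hypothetical name, and `unicodedata` from the standard library stands in for `unidecode` so the snippet runs without extra packages (the two can differ on non-Latin text).

```python
import re
import string
import unicodedata

# Same stopword list as the notebook cell.
STOPWORDS = {"a", "an", "the", "this", "that", "about", "whom",
             "being", "where", "why", "had", "should", "each"}


def clean_summary(text: str) -> str:
    """Hypothetical stand-in for the notebook's preprocessing function."""
    # Collapse whitespace runs and trim.
    text = re.sub(r"\s+", " ", text).strip()
    # Remove ASCII punctuation; hyphenated words get fused together.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Assumption: NFKD + ASCII encoding approximates unidecode for accented Latin text.
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    # Lowercase and drop stopwords.
    return " ".join(w for w in text.lower().split() if w not in STOPWORDS)


cleaned = clean_summary("The  aging patriarch of an organized-crime dynasty.")
print(cleaned)
```

Note that removing punctuation fuses `organized-crime` into `organizedcrime`, the same effect that produces `middleearth` in the table above.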


## Vectorizer

Load the CSV file and vectorize the rows using HuggingFaceEmbeddings.
Store the results in a FAISS vectorstore.
Save the vectorstore to a pickle file for future use.

In [None]:
import pickle

from langchain_community.vectorstores import FAISS
from langchain.text_splitter import CharacterTextSplitter
from langchain_core.documents import Document

from langchain_huggingface import HuggingFaceEmbeddings

from tqdm.notebook import tqdm, trange

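# load the csv and wrap each row in a Document
data = pd.read_csv('data/imdb.csv').dropna()
documents = []
max_words = 0  # length in words of the longest document, to compare against the embedding model's context window
for index, row in data.iterrows():
  max_words = max(max_words, len(row['data'].split()))
  d = Document(
      page_content=row['data'],
      metadata={"genres": row['genres']}
    )
  documents.append(d)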
# load the embeddings model
embedder = HuggingFaceEmbeddings(model_name=Config.EMBEDDING_MODEL_NAME)

# embed the documents with the model and store them in a vectorstore
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)
vectorstore = await FAISS.afrom_documents(docs, embedder)

with open("data/vectorstore.pkl", "wb") as f:
    pickle.dump(vectorstore, f)



Load the vectorstore and use it as a retriever.

In [None]:
with open("data/vectorstore.pkl", "rb") as f:
    vectorstore = pickle.load(f)

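# load the retriever from the vectorstore; k must be passed through search_kwargs, otherwise the default of 4 is used
retriever = vectorstore.as_retriever(search_kwargs={"k": Config.K})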

In [None]:
query = "What are some great batman and joker stories?"
docs = retriever.invoke(query)
for d in docs:
  print(d.page_content)

Batman vs Joker: Final Joke(2008): batman is trying to give urgent message to people of gotham when his greatest rival takes over broadcast and turns it into mayhem
Batman: Dead End(2003): joker has escaped from arkham and batman must once again bring him in once and for all unfortunately for bat there is something even more sinister than joker waiting in read all
Batman(1989): dark knight of gotham city begins his war on crime with his first major enemy jack napier criminal who becomes clownishly homicidal joker
Batman Forever(1995): batman must battle former district attorney harvey dent who is now twoface and edward nygma riddler with help from amorous psychologist and young circus acrobat who becomes his s read all
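
What the retriever does can be illustrated with a brute-force nearest-neighbour search. The sketch below is a toy model, not the FAISS implementation: the three-dimensional vectors are made up for illustration (real `gte-base` embeddings have hundreds of dimensions), `top_k` is a hypothetical helper, and it assumes Euclidean (L2) distance, which to my knowledge is the default of LangChain's FAISS wrapper.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vec, index, k):
    """Return the k document texts whose vectors are closest to query_vec."""
    ranked = sorted(index, key=lambda item: l2_distance(query_vec, item[1]))
    return [text for text, _ in ranked[:k]]


# Made-up toy embeddings: (document text, vector).
toy_index = [
    ("Batman(1989)", [0.9, 0.1, 0.0]),
    ("The Godfather(1972)", [0.1, 0.9, 0.1]),
    ("Batman Forever(1995)", [0.8, 0.2, 0.1]),
]

query_vec = [1.0, 0.0, 0.0]  # pretend embedding of a Batman-related query
nearest = top_k(query_vec, toy_index, k=2)
print(nearest)
```

A real index avoids the full sort, but the ranking idea is the same: embed the query once, then return the `k` closest stored vectors.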


## LLM

Load the quantized LLM.

In [None]:
import torch

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from transformers import pipeline

from langchain_huggingface import HuggingFacePipeline

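# load the quantization config: 4-bit NF4 weights with float16 compute
# (an empty BitsAndBytesConfig() leaves the model unquantized)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)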

model = AutoModelForCausalLM.from_pretrained(Config.LLM_MODEL_NAME, quantization_config=bnb_config, device_map="cuda:0")
tokenizer = AutoTokenizer.from_pretrained(Config.LLM_MODEL_NAME)

# init the pipeline; return_full_text=False keeps the prompt out of the generated text
READER_LLM = pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=50, return_full_text=False)

llm = HuggingFacePipeline(
    pipeline=READER_LLM,
)


Initialize the prompt template for the query chain. The query chain derives a search query from the chat history. You may change the prompt to get better results.

In [None]:
from langchain.prompts import PromptTemplate

from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

class LoggerStrOutputParser(StrOutputParser):
    def parse(self, text: str) -> str:
        # process the LLM output
        print(f"QUERY: {text}")
        return text

query_transform_prompt = PromptTemplate(
    input_variables=["messages"],
    template="""<|system|>You are a helpful assistant.
{messages}
<|user|>
give me the search query about the above conversation.
<|assistant|>"""
)

# init the query chain
query_transforming_retriever_chain = (
    {"messages": RunnablePassthrough()}
    | query_transform_prompt
    | llm
    | LoggerStrOutputParser()
)
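
The pipe operator above composes four stages that each transform the output of the previous one. A rough plain-Python picture of that data flow is sketched below. None of these functions are LangChain APIs: `wrap_input`, `fill_prompt`, `stub_llm`, `parse`, and `run_chain` are hypothetical names, and the stub model returns a fixed string instead of generating text.

```python
QUERY_TEMPLATE = (
    "<|system|>You are a helpful assistant.\n"
    "{messages}\n"
    "<|user|>\n"
    "give me the search query about the above conversation.\n"
    "<|assistant|>"
)


def wrap_input(conversation: str) -> dict:
    # Stage 1: map the raw input onto the template variable name.
    return {"messages": conversation}


def fill_prompt(variables: dict) -> str:
    # Stage 2: substitute the variables into the template.
    return QUERY_TEMPLATE.format(**variables)


def stub_llm(prompt: str) -> str:
    # Stage 3: hypothetical model that always answers with the same query.
    return "stylish gangster movie"


def parse(text: str) -> str:
    # Stage 4: log the query and pass it through unchanged.
    print(f"QUERY: {text}")
    return text


def run_chain(conversation: str) -> str:
    value = conversation
    for stage in (wrap_input, fill_prompt, stub_llm, parse):
        value = stage(value)
    return value


search_query = run_chain("user: give me a cool gangster movie")
```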

Initialize the main retrieval chain, which passes the retrieved documents to the LLM and returns its output.

In [None]:
from langchain_core.runnables import RunnablePassthrough

prompt = PromptTemplate(
    input_variables=["context", "messages"],
    template="""<|system|>You are a helpful assistant.

Here are the movies you MUST choose from:

{context}
-----------------
{messages}
<|assistant|>""")

# format the retrieved documents and init the retrieval chain
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

retrieval_chain = (
    {"context": retriever | format_docs, "messages": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
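
To see what the model receives, it can help to build the final prompt by hand. The sketch below does that with the standard library only. `Doc` is a hypothetical stand-in for LangChain's `Document`, `build_prompt` is an illustrative helper, and the two documents are rows copied from the preprocessed table rather than real retriever output.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    # Minimal stand-in for langchain_core.documents.Document.
    page_content: str


ANSWER_TEMPLATE = (
    "<|system|>You are a helpful assistant.\n"
    "\n"
    "Here are the movies you MUST choose from:\n"
    "\n"
    "{context}\n"
    "-----------------\n"
    "{messages}\n"
    "<|assistant|>"
)


def format_docs(docs):
    # Same joining rule as the notebook: one blank line between documents.
    return "\n\n".join(doc.page_content for doc in docs)


def build_prompt(docs, messages):
    return ANSWER_TEMPLATE.format(context=format_docs(docs), messages=messages)


docs = [
    Doc("The Godfather(1972): aging patriarch of organized crime dynasty transfers "
        "control of his clandestine empire to his reluctant son"),
    Doc("Pulp Fiction(1994): lives of two mob hitmen boxer gangster and his wife and "
        "pair of diner bandits intertwine in four tales of violence and redemption"),
]

prompt_text = build_prompt(docs, "cool gangster movie")
print(prompt_text)
```

Printing the assembled prompt like this is a quick way to check that the context and the user request end up where the template expects them.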

Write a conversation helper class for easier testing.

In [None]:
class Conversation:
    def __init__(self):
        self.messages = []

    def add_assistant_message(self, message):
        self.messages.append(('assistant', message))

    def add_user_message(self, message):
        self.messages.append(('user', message))

    def get_messages(self):
        # concatenate the messages with the roles in the instruction format
        formatted_messages = "\n".join([f"{role}: {message}" for role, message in self.messages])
        return formatted_messages

    def chat(self, message):
        self.add_user_message(message)
        messages = self.get_messages()
        # invoke the chain
        query = query_transforming_retriever_chain.invoke(messages)
        response = retrieval_chain.invoke(query)
        self.add_assistant_message(response)
        return response
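
`get_messages` joins the turns as plain `role: message` lines. One possible alternative is to emit the role tags the model was fine-tuned on. The sketch below assumes the Zephyr layout as I read it from the model card (`<|role|>`, a newline, the content, then `</s>`). `format_zephyr` is a hypothetical helper, the assistant reply in the example is made up, and `tokenizer.apply_chat_template` remains the authoritative way to build this string.

```python
EOS = "</s>"  # assumed end-of-sequence marker for Zephyr


def format_zephyr(messages, system="You are a helpful assistant."):
    """Render (role, content) pairs with Zephyr-style role tags."""
    parts = [f"<|system|>\n{system}{EOS}"]
    for role, content in messages:
        parts.append(f"<|{role}|>\n{content}{EOS}")
    # Leave an open assistant turn for the model to complete.
    parts.append("<|assistant|>\n")
    return "\n".join(parts)


history = [
    ("user", "give me a cool gangster movie"),
    ("assistant", "Try Goodfellas."),  # made-up reply, for illustration only
    ("user", "give me a newer one"),
]

chat_prompt = format_zephyr(history)
print(chat_prompt)
```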

## Test

Talk with the RAG pipeline to see how well it performs.

In [None]:
c = Conversation()
A = c.chat('give me a cool gangster movie')
print(A)



QUERY: <|system|>You are a helpful assistant.
user: give me a cool gangster movie
<|user|>
give me the search query about the above conversation.
<|assistant|>
"recommend a stylish and intense gangster movie with captivating characters and thrilling action scenes that will leave me on the edge of my seat."
<|system|>You are a helpful assistant.

Here are the movies you MUST choose from:

Q: The Winged Serpent(1982): nypd detectives shepard and powell are working on bizarre case of ritualistic aztec murder meanwhile something big is attacking people of new york and only greedy small time crook jimm read all

Goodfellas 2(2020): two gangsters meet and fall madly in love and have little bit of gay romance but their italian brothers wont let one slide so they all hit gritty hop on fortnite and win som read all

Wanted(2009): radhe is ruthless gangster who will kill anyone for money he is attracted towards jhanvi middle class girl who does not approve of his work and wants him to change

Ji

In [None]:
A = c.chat('give me a newer one')
print(A)

QUERY: <|system|>You are a helpful assistant.
user: give me a cool gangster movie
assistant: <|system|>You are a helpful assistant.

Here are the movies you MUST choose from:

Q: The Winged Serpent(1982): nypd detectives shepard and powell are working on bizarre case of ritualistic aztec murder meanwhile something big is attacking people of new york and only greedy small time crook jimm read all

Goodfellas 2(2020): two gangsters meet and fall madly in love and have little bit of gay romance but their italian brothers wont let one slide so they all hit gritty hop on fortnite and win som read all

Wanted(2009): radhe is ruthless gangster who will kill anyone for money he is attracted towards jhanvi middle class girl who does not approve of his work and wants him to change

Jigarthanda(2014): aspiring director targets ruthless gangster because he wants to make violent gangster film his discreet attempts to research gangster fail miserably finally when he gets caught read all
------------