# RAG

## Requirements

In [1]:
%%capture
!pip install transformers accelerate bitsandbytes langchain langchain-community sentence-transformers faiss-gpu pandas gdown

## Dataset

In [2]:
!gdown --fuzzy https://drive.google.com/file/d/1Lq2zVJlN_B4kUAu4VafQ4jXMIQiAR9vI/view?usp=sharing

Downloading...
From (original): https://drive.google.com/uc?id=1Lq2zVJlN_B4kUAu4VafQ4jXMIQiAR9vI
From (redirected): https://drive.google.com/uc?id=1Lq2zVJlN_B4kUAu4VafQ4jXMIQiAR9vI&confirm=t&uuid=01f31d97-8a3b-4c80-bcc8-fd01d1f19b33
To: /content/IMDB_crawled.json
100% 292M/292M [00:09<00:00, 32.1MB/s]


## Config

In [3]:
class Config:
    EMBEDDING_MODEL_NAME="thenlper/gte-base"
    LLM_MODEL_NAME="HuggingFaceH4/zephyr-7b-beta"
    K = 5 # top K retrieval

## Preprocessing

In [15]:
import pandas as pd

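df = pd.read_json('IMDB_crawled.json')
# the crawled file misspells this column as 'synposis'; copy it under the correct name, then drop the original
df['synopsis'] = df['synposis']
df = df.drop(columns=['synposis'])
df.head(5)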

Unnamed: 0,id,title,first_page_summary,release_year,mpaa,budget,gross_worldwide,rating,directors,writers,stars,related_links,languages,countries_of_origin,summaries,reviews,genres,synopsis
0,tt0071562,The Godfather Part II,The early life and career of Vito Corleone in ...,1974,R,"$13,000,000 (estimated)","$47,962,683",9.0,[Francis Ford Coppola],,"[Al Pacino, Robert De Niro, Robert Duvall]",[https://imdb.com/title/tt0068646/?ref_=tt_sim...,"[English, Italian, Spanish, Latin, Sicilian]",[United States],[The early life and career of Vito Corleone in...,"[[Coppola's masterpiece is rivaled only by ""Th...","[Crime, Drama]",[The Godfather Part II presents two parallel s...
1,tt0120737,The Lord of the Rings: The Fellowship of the Ring,A meek Hobbit from the Shire and eight compani...,2001,PG-13,"$93,000,000 (estimated)","$884,041,698",8.9,[Peter Jackson],,"[Elijah Wood, Ian McKellen, Orlando Bloom]",[https://imdb.com/title/tt0167261/?ref_=tt_sim...,"[English, Sindarin]","[New Zealand, United States]",[A meek Hobbit from the Shire and eight compan...,"[[Here is one film that lived up to its hype, ...","[Action, Adventure, Drama]",[Galadriel (Cate Blanchett) (The Elven co-rule...
2,tt0110912,Pulp Fiction,"The lives of two mob hitmen, a boxer, a gangst...",1994,R,"$8,000,000 (estimated)","$213,928,762",8.9,[Quentin Tarantino],,"[John Travolta, Uma Thurman, Samuel L. Jackson]",[https://imdb.com/title/tt0137523/?ref_=tt_sim...,"[English, Spanish, French]",[United States],"[The lives of two mob hitmen, a boxer, a gangs...",[[I like the bit with the cheeseburger. It mak...,"[Crime, Drama]",[Narrative structure\nPulp Fiction's narrative...
3,tt0068646,The Godfather,The aging patriarch of an organized crime dyna...,1972,R,"$6,000,000 (estimated)","$250,342,030",9.2,[Francis Ford Coppola],,"[Marlon Brando, Al Pacino, James Caan]",[https://imdb.com/title/tt0071562/?ref_=tt_sim...,"[English, Italian, Latin]",[United States],[The aging patriarch of an organized crime dyn...,[['The Godfather' is the pinnacle of flawless ...,"[Crime, Drama]","[In late summer 1945, guests are gathered for ..."
4,tt0111161,The Shawshank Redemption,"Over the course of several years, two convicts...",1994,R,"$25,000,000 (estimated)","$28,904,232",9.3,[Frank Darabont],"[Stephen King, Frank Darabont]","[Tim Robbins, Morgan Freeman, Bob Gunton]",[https://imdb.com/title/tt0468569/?ref_=tt_sim...,[English],[United States],"[Over the course of several years, two convict...",[[The Shawshank Redemption is written and dire...,[Drama],"[In 1947, Andy Dufresne (Tim Robbins), a banke..."


In [16]:
import os
from string import punctuation

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
os.makedirs('data', exist_ok=True)

# build the English stopword set once instead of on every call
ENGLISH_STOPWORDS = set(stopwords.words('english'))

# Preprocess the data and keep only the needed fields, since the embedding model's context window is limited.
def preprocess_text(text, minimum_length=1, stopword_removal=True, stopwords_domain=None, lower_case=True,
                    punctuation_removal=True):
    # leave missing values untouched
    if text is None:
        return text

    if lower_case:
        text = text.lower()

    if punctuation_removal:
        text = ''.join(char for char in text if char not in punctuation)

    tokens = text.split()

    if stopword_removal:
        # drop English stopwords plus any caller-supplied domain stopwords
        stop_words = ENGLISH_STOPWORDS | set(stopwords_domain or [])
        tokens = [token for token in tokens if token not in stop_words]

    # drop tokens shorter than minimum_length
    tokens = [token for token in tokens if len(token) >= minimum_length]
    return ' '.join(tokens)

def join_list(x):
    # join a list of strings into one string; missing values become ''
    if x is None:
        return ''
    return ' '.join(x)

def join_reviews(x):
    # each entry is a (review, score) pair; keep only the review text
    reviews = ''
    if x is not None:
        for review, _score in x:
            reviews = reviews + review + ' '
    return reviews

for col in ['synopsis', 'summaries']:
    df[f'pre_{col}'] = df[col].apply(join_list)

df['pre_reviews'] = df['reviews'].apply(join_reviews)
df['pre_title'] = df['title']

columns = ['synopsis', 'summaries', 'reviews', 'title']

for col in columns:
    df[f'pre_{col}'] = df[f'pre_{col}'].apply(preprocess_text)
df['pre_first_page_summary'] = df['first_page_summary'].apply(preprocess_text)

# needed_df = pd.DataFrame(df, columns= ['id', 'pre_synopsis', 'pre_summaries', 'pre_reviews', 'pre_title', 'genres'])
needed_df = pd.DataFrame(df, columns= ['id', 'pre_first_page_summary', 'pre_title', 'genres'])

df.to_csv('data/imdb.csv', index=False)
needed_df.to_csv('data/imdb_needed_data.csv', index=False)

needed_df.head(5)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Unnamed: 0,id,pre_first_page_summary,pre_title,genres
0,tt0071562,early life career vito corleone 1920s new york...,godfather part ii,"[Crime, Drama]"
1,tt0120737,meek hobbit shire eight companions set journey...,lord rings fellowship ring,"[Action, Adventure, Drama]"
2,tt0110912,lives two mob hitmen boxer gangster wife pair ...,pulp fiction,"[Crime, Drama]"
3,tt0068646,aging patriarch organized crime dynasty transf...,godfather,"[Crime, Drama]"
4,tt0111161,course several years two convicts form friends...,shawshank redemption,[Drama]
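
To make the preprocessing steps concrete, here is a minimal, self-contained sketch of the same three operations (lowercasing, punctuation removal, stopword filtering) applied to one title. The tiny `TOY_STOPWORDS` set is an assumption standing in for NLTK's English list, and `toy_preprocess` is a hypothetical name, not part of the notebook.

```python
from string import punctuation

# assumption: a tiny stand-in for NLTK's English stopword list
TOY_STOPWORDS = {"the", "of", "a", "an", "and"}

def toy_preprocess(text):
    """Lowercase, strip punctuation, and drop stopwords (illustrative only)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in punctuation)
    return " ".join(tok for tok in text.split() if tok not in TOY_STOPWORDS)

print(toy_preprocess("The Lord of the Rings: The Fellowship of the Ring"))
# prints: lord rings fellowship ring
```

The result matches the `pre_title` value shown for that movie in the table above.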


## Vectorizer

Load the CSV file and embed each row with HuggingFaceEmbeddings.
Store the resulting vectors in a FAISS vector store.
Save the vector store to a pickle file for future use.

In [17]:
import pickle

from langchain_community.document_loaders.csv_loader import CSVLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

# load the csv
loader = CSVLoader('data/imdb_needed_data.csv')
docs = loader.load()

# load the embeddings model
embedding_model = HuggingFaceEmbeddings(model_name=Config.EMBEDDING_MODEL_NAME, encode_kwargs={'normalize_embeddings':True})
# embed the documents with the model and store them in a vectorstore
vectorstore = FAISS.from_documents(docs, embedding_model)
with open("data/vectorstore.pkl", "wb") as f:
    pickle.dump(vectorstore, f)
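
Because the embeddings are created with `normalize_embeddings=True`, every vector has unit length, so ranking by Euclidean distance and ranking by cosine similarity give the same order (for unit vectors, ||a - b||^2 = 2 - 2 a.b). The sketch below checks this on made-up 3-dimensional vectors. The vectors and helper names are assumptions for illustration and come neither from the notebook nor from FAISS; the sketch also assumes the index ranks by Euclidean (L2) distance.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_squared(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# made-up "document embeddings" and one "query embedding"
docs = {"gangster": [0.9, 0.1, 0.0], "hobbit": [0.1, 0.9, 0.2], "prison": [0.5, 0.2, 0.7]}
query = normalize([1.0, 0.2, 0.1])
unit_docs = {name: normalize(vec) for name, vec in docs.items()}

# highest cosine similarity first vs. smallest L2 distance first
by_cosine = sorted(unit_docs, key=lambda n: dot(query, unit_docs[n]), reverse=True)
by_l2 = sorted(unit_docs, key=lambda n: l2_squared(query, unit_docs[n]))

print(by_cosine)           # ['gangster', 'prison', 'hobbit']
print(by_cosine == by_l2)  # True
```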

Load the vector store as a retriever.

In [18]:
with open("data/vectorstore.pkl", "rb") as f:
    vectorstore = pickle.load(f)

# build a retriever from the vectorstore that returns the top K documents
retriever = vectorstore.as_retriever(search_kwargs={'k': Config.K})

## LLM

Load the LLM with 4-bit quantization.

In [19]:
import torch

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from transformers import pipeline

from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline

# load the quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)


model = AutoModelForCausalLM.from_pretrained(Config.LLM_MODEL_NAME, quantization_config=bnb_config, device_map="cuda:0")
tokenizer = AutoTokenizer.from_pretrained(Config.LLM_MODEL_NAME)

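# init the pipeline; cap the number of generated tokens rather than the total length
# (max_length=80000 exceeds the model's context window; 512 is an assumed budget, adjust as needed)
READER_LLM = pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=512)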

llm = HuggingFacePipeline(pipeline=READER_LLM)
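
A rough back-of-the-envelope estimate shows why 4-bit loading matters for a 7B model. The parameter count of 7 billion is an approximation taken from the model name, and the estimate ignores quantization constants, activations, and the KV cache, so treat the numbers as a lower bound rather than a measurement.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

NUM_PARAMS = 7e9  # assumption: "7b" in the model name, rounded

print(f"fp16: {weight_memory_gb(NUM_PARAMS, 16):.1f} GB")
print(f"nf4:  {weight_memory_gb(NUM_PARAMS, 4):.1f} GB")
# prints:
# fp16: 14.0 GB
# nf4:  3.5 GB
```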

Initialize the prompt template for the query chain, which turns the chat history into a search query. You may change the prompt to get better results.

In [20]:
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from langchain_core.output_parsers import StrOutputParser

class LoggerStrOutputParser(StrOutputParser):
    def parse(self, text: str) -> str:
        # log the generated query before returning it unchanged
        print(f"QUERY: {text}")
        return text

query_transform_prompt = PromptTemplate(
    input_variables=["messages"],
    template="""<|system|>You are a helpful assistant.
{messages}
<|user|>
give me the search query about the above conversation.
<|assistant|>"""
)

# init the query chain
query_transforming_retriever_chain = LLMChain(
    llm=llm,
    prompt=query_transform_prompt,
    output_parser=LoggerStrOutputParser()
)
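
To see what the model actually receives, the template can be rendered with plain string formatting. This sketch copies the template text from the cell above and fills `{messages}` with a made-up two-line conversation; it uses `str.format` instead of LangChain, purely for illustration.

```python
# template text copied from the query-transform prompt
TEMPLATE = """<|system|>You are a helpful assistant.
{messages}
<|user|>
give me the search query about the above conversation.
<|assistant|>"""

# made-up conversation for illustration
history = "User: give me a cool gangster movie\nAssistant: You could try Scarface."

rendered = TEMPLATE.format(messages=history)
print(rendered)
```

The rendered text ends with `<|assistant|>`, so the model continues from there and its continuation is treated as the search query.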

Initialize the main retrieval chain, which passes the retrieved documents to the LLM and returns its output.

In [21]:
from langchain_core.runnables import RunnablePassthrough

prompt = PromptTemplate(
    input_variables=["context", "messages"],
    template="""<|system|>You are a helpful assistant.

Here are the movies you MUST choose from:

{context}
-----------------
{messages}
<|assistant|>""")

# init the retriever chain
retrieval_chain = (
    {'context': retriever, 'messages': RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
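
The chain above sends the whole conversation to the retriever. If the query chain were placed in front of the retriever, the data flow might look like the sketch below. Every function here is a hypothetical stand-in written in plain Python (`rewrite_query` for the query chain, `retrieve` for the retriever, `answer` for the LLM call); none of them is a LangChain API, and the corpus is a toy example.

```python
def rewrite_query(history):
    """Stand-in for the query chain: here, simply use the last user line."""
    user_lines = [line for line in history.splitlines() if line.startswith("User: ")]
    return user_lines[-1][len("User: "):]

def retrieve(query, corpus, k=2):
    """Stand-in for the retriever: rank documents by words shared with the query."""
    query_words = set(query.lower().split())
    scored = sorted(corpus, key=lambda doc: len(query_words & set(doc.split())), reverse=True)
    return scored[:k]

def answer(context, history):
    """Stand-in for the LLM: return the prompt text that would be sent to the model."""
    return ("Here are the movies you MUST choose from:\n\n"
            + "\n".join(context)
            + "\n-----------------\n"
            + history)

# toy corpus of already-preprocessed summaries
corpus = [
    "violent gangster climbs ladder success mob",
    "meek hobbit shire journey ring",
    "gangster promise dying father go straight",
]
history = "User: give me a cool gangster movie"

query = rewrite_query(history)       # history -> search query
context = retrieve(query, corpus)    # search query -> top-k documents
print(answer(context, history))      # documents + history -> prompt
```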

Write a conversation helper class for easier testing.

In [22]:
class Conversation:
    def __init__(self):
        self.messages = []

    def add_assistant_message(self, message):
        self.messages.append(('assistant', message))

    def add_user_message(self, message):
        self.messages.append(('user', message))

    def get_messages(self):
        # concatenate the messages, each prefixed with its capitalized role
        concatenated_messages = ""
        for role, message in self.messages:
            concatenated_messages += f"{role.capitalize()}: {message}\n"
        return concatenated_messages.strip()

    def chat(self, message):
        self.add_user_message(message)
        messages = self.get_messages()
        # invoke the chain
        response = retrieval_chain.invoke(messages)
        self.add_assistant_message(response)
        return response
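
The helper re-sends the full history on every turn, which is how a follow-up such as "give me a newer one" keeps its context. The sketch below reproduces only the history handling, with a fake `reply` function in place of `retrieval_chain.invoke`; `ToyConversation` and `reply` are hypothetical names for illustration.

```python
class ToyConversation:
    def __init__(self, reply):
        self.messages = []
        self.reply = reply  # stand-in for retrieval_chain.invoke

    def get_messages(self):
        # one "Role: text" line per message
        return "\n".join(f"{role.capitalize()}: {text}" for role, text in self.messages)

    def chat(self, message):
        self.messages.append(("user", message))
        response = self.reply(self.get_messages())
        self.messages.append(("assistant", response))
        return response

# fake model: reports how many lines of history it was given
c = ToyConversation(reply=lambda history: f"saw {len(history.splitlines())} line(s)")
print(c.chat("give me a cool gangster movie"))  # saw 1 line(s)
print(c.chat("give me a newer one"))            # saw 3 line(s)
print(c.get_messages())
```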

## Test

Talk with the RAG pipeline to see how well it performs.

In [23]:
c = Conversation()
A = c.chat('give me a cool gangster movie')
print(A)

<|system|>You are a helpful assistant.

Here are the movies you MUST choose from:

[Document(page_content="id: tt3569782\npre_first_page_summary: aspiring director targets ruthless gangster wants make violent gangster film discreet attempts research gangster fail miserably finally gets caught read\npre_title: jigarthanda\ngenres: ['Action', 'Comedy', 'Crime']", metadata={'source': 'data/imdb_needed_data.csv', 'row': 4675}), Document(page_content="id: tt0023427\npre_first_page_summary: ambitious nearly insane violent gangster climbs ladder success mob weaknesses prove downfall\npre_title: scarface\ngenres: ['Action', 'Crime', 'Drama']", metadata={'source': 'data/imdb_needed_data.csv', 'row': 369}), Document(page_content="id: tt0102603\npre_first_page_summary: gangster attempts keep promise made dying father would give life crime go straight\npre_title: oscar\ngenres: ['Comedy', 'Crime']", metadata={'source': 'data/imdb_needed_data.csv', 'row': 9856}), Document(page_content="id: tt242264

In [24]:
A = c.chat('give me a newer one')
print(A)

<|system|>You are a helpful assistant.

Here are the movies you MUST choose from:

[Document(page_content="id: tt3569782\npre_first_page_summary: aspiring director targets ruthless gangster wants make violent gangster film discreet attempts research gangster fail miserably finally gets caught read\npre_title: jigarthanda\ngenres: ['Action', 'Comedy', 'Crime']", metadata={'source': 'data/imdb_needed_data.csv', 'row': 4675}), Document(page_content="id: tt0023427\npre_first_page_summary: ambitious nearly insane violent gangster climbs ladder success mob weaknesses prove downfall\npre_title: scarface\ngenres: ['Action', 'Crime', 'Drama']", metadata={'source': 'data/imdb_needed_data.csv', 'row': 369}), Document(page_content="id: tt24226474\npre_first_page_summary: 1975 filmmaker agrees collaborate film gangster wishes become famous actor\npre_title: jigarthanda double x\ngenres: ['Action', 'Comedy', 'Drama']", metadata={'source': 'data/imdb_needed_data.csv', 'row': 1992}), Document(page_con