# RAG

## Requirements

In [1]:
%%capture
!pip install transformers accelerate bitsandbytes langchain langchain-community sentence-transformers faiss-gpu pandas gdown

## Dataset

In [2]:
!gdown --fuzzy https://drive.google.com/file/d/1Lq2zVJlN_B4kUAu4VafQ4jXMIQiAR9vI/view?usp=sharing

Downloading...
From (original): https://drive.google.com/uc?id=1Lq2zVJlN_B4kUAu4VafQ4jXMIQiAR9vI
From (redirected): https://drive.google.com/uc?id=1Lq2zVJlN_B4kUAu4VafQ4jXMIQiAR9vI&confirm=t&uuid=c0cc8d8d-b863-41a9-b66e-ff66014e6f38
To: /content/IMDB_crawled.json
100% 292M/292M [00:01<00:00, 156MB/s]


## Config

In [3]:
class Config:
    EMBEDDING_MODEL_NAME="thenlper/gte-base"
    LLM_MODEL_NAME="HuggingFaceH4/zephyr-7b-beta"
    K = 5 # top K retrieval

## Preprocessing

In [4]:
import pandas as pd

# path_to_file = "../../IMDB_crawled.json"
path_to_file = "/content/IMDB_crawled.json"
df = pd.read_json(path_to_file)[:1000]
df.head(5)

Unnamed: 0,id,title,first_page_summary,release_year,mpaa,budget,gross_worldwide,rating,directors,writers,stars,related_links,languages,countries_of_origin,summaries,synposis,reviews,genres
0,tt0071562,The Godfather Part II,The early life and career of Vito Corleone in ...,1974,R,"$13,000,000 (estimated)","$47,962,683",9.0,[Francis Ford Coppola],,"[Al Pacino, Robert De Niro, Robert Duvall]",[https://imdb.com/title/tt0068646/?ref_=tt_sim...,"[English, Italian, Spanish, Latin, Sicilian]",[United States],[The early life and career of Vito Corleone in...,[The Godfather Part II presents two parallel s...,"[[Coppola's masterpiece is rivaled only by ""Th...","[Crime, Drama]"
1,tt0120737,The Lord of the Rings: The Fellowship of the Ring,A meek Hobbit from the Shire and eight compani...,2001,PG-13,"$93,000,000 (estimated)","$884,041,698",8.9,[Peter Jackson],,"[Elijah Wood, Ian McKellen, Orlando Bloom]",[https://imdb.com/title/tt0167261/?ref_=tt_sim...,"[English, Sindarin]","[New Zealand, United States]",[A meek Hobbit from the Shire and eight compan...,[Galadriel (Cate Blanchett) (The Elven co-rule...,"[[Here is one film that lived up to its hype, ...","[Action, Adventure, Drama]"
2,tt0110912,Pulp Fiction,"The lives of two mob hitmen, a boxer, a gangst...",1994,R,"$8,000,000 (estimated)","$213,928,762",8.9,[Quentin Tarantino],,"[John Travolta, Uma Thurman, Samuel L. Jackson]",[https://imdb.com/title/tt0137523/?ref_=tt_sim...,"[English, Spanish, French]",[United States],"[The lives of two mob hitmen, a boxer, a gangs...",[Narrative structure\nPulp Fiction's narrative...,[[I like the bit with the cheeseburger. It mak...,"[Crime, Drama]"
3,tt0068646,The Godfather,The aging patriarch of an organized crime dyna...,1972,R,"$6,000,000 (estimated)","$250,342,030",9.2,[Francis Ford Coppola],,"[Marlon Brando, Al Pacino, James Caan]",[https://imdb.com/title/tt0071562/?ref_=tt_sim...,"[English, Italian, Latin]",[United States],[The aging patriarch of an organized crime dyn...,"[In late summer 1945, guests are gathered for ...",[['The Godfather' is the pinnacle of flawless ...,"[Crime, Drama]"
4,tt0111161,The Shawshank Redemption,"Over the course of several years, two convicts...",1994,R,"$25,000,000 (estimated)","$28,904,232",9.3,[Frank Darabont],"[Stephen King, Frank Darabont]","[Tim Robbins, Morgan Freeman, Bob Gunton]",[https://imdb.com/title/tt0468569/?ref_=tt_sim...,[English],[United States],"[Over the course of several years, two convict...","[In 1947, Andy Dufresne (Tim Robbins), a banke...",[[The Shawshank Redemption is written and dire...,[Drama]


In [13]:
import os

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

nltk.download('stopwords')
nltk.download('wordnet')

os.makedirs('./data', exist_ok=True)

# preprocess your data and only store the needed data as the context window for embedding model is limited
def preprocess_text(text):
    text = text.lower()
    tokens = word_tokenize(text)
    tokens = [word for word in tokens if word.isalnum()]
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    preprocessed_text = ' '.join(tokens)
    return preprocessed_text

df = df.loc[:, df.columns.intersection(['id', 'title', 'genres', 'first_page_summary'])]
df = df.dropna(subset=['id', 'title', 'genres', 'first_page_summary'])
df['first_page_summary'] = df['first_page_summary'].apply(preprocess_text)

df.to_csv('./data/imdb.csv', index=False)
df.head()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Unnamed: 0,id,title,first_page_summary,genres
0,tt0071562,The Godfather Part II,early life career vito corleone 1920s new york...,"[Crime, Drama]"
1,tt0120737,The Lord of the Rings: The Fellowship of the Ring,meek hobbit shire eight companion set journey ...,"[Action, Adventure, Drama]"
2,tt0110912,Pulp Fiction,life two mob hitman boxer gangster wife pair d...,"[Crime, Drama]"
3,tt0068646,The Godfather,aging patriarch organized crime dynasty transf...,"[Crime, Drama]"
4,tt0111161,The Shawshank Redemption,course several year two convict form friendshi...,[Drama]


## Vectorizer

load the CSV file and vectorize the rows using HuggingFaceEmbeddings.
Store the results using FAISS vectorstore.
Save the vectorestore in a pickle file for future usages.

In [12]:
!pip install -U langchain_huggingface

Collecting langchain_huggingface
  Downloading langchain_huggingface-0.0.3-py3-none-any.whl (17 kB)
Installing collected packages: langchain_huggingface
Successfully installed langchain_huggingface-0.0.3


In [15]:
import pickle

from langchain.document_loaders.csv_loader import CSVLoader
from langchain.vectorstores.utils import DistanceStrategy
from langchain.vectorstores.faiss import FAISS
# from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_huggingface import HuggingFaceEmbeddings

# load the csv
csv_loader = CSVLoader(file_path='./data/imdb.csv')
# TODO: change this 1000
documents = csv_loader.load()[:1000]

# load the embeddings model
embedding_model = HuggingFaceEmbeddings(model_name=Config.EMBEDDING_MODEL_NAME)

# save embed the documents using the model in a vectorstore
vectorstore = FAISS.from_documents(documents, embedding_model, distance_strategy=DistanceStrategy.COSINE)

with open("data/vectorstore.pkl", "wb") as f:
    pickle.dump(vectorstore, f)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


modules.json:   0%|          | 0.00/385 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/68.1k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/57.0 [00:00<?, ?B/s]



config.json:   0%|          | 0.00/618 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/219M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/314 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/712k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/125 [00:00<?, ?B/s]

1_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

load the vectorstore as a retriever.

In [16]:
with open("data/vectorstore.pkl", "rb") as f:
    vectorstore = pickle.load(f)

# load the retriever from the vectorstore
retriever = vectorstore.as_retriever(K=Config.K)
print("Vectorstore and retriever initialized successfully.")


Vectorstore and retriever initialized successfully.


## LLM

load the quantized LLM.

In [None]:
import torch

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from transformers import pipeline

from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline

# load the quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16
)

model = AutoModelForCausalLM.from_pretrained(Config.LLM_MODEL_NAME, quantization_config=bnb_config, device_map="cuda:0")
tokenizer = AutoTokenizer.from_pretrained(Config.LLM_MODEL_NAME)

# init the pipeline
generation_pipeline = pipeline(task="text-generation",
                               model=model,
                               tokenizer=tokenizer,
                               max_new_tokens=500)

READER_LLM = HuggingFacePipeline(pipeline=generation_pipeline)

llm = HuggingFacePipeline(
    pipeline=READER_LLM,
)
print("LLM and pipeline initialized successfully")

Loading checkpoint shards:   0%|          | 0/8 [00:00<?, ?it/s]

LLM and pipeline initialized successfully


initialize the prompt template for the query chain. query chain is used to get a query from the chat history. you may change the prompt as you like to get better results.

In [None]:
from langchain.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
# from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.runnables import RunnablePassthrough


class LoggerStrOutputParser(StrOutputParser):
    def parse(self, text: str) -> str:
        # process the LLM output
        print(f"QUERY: {text}")
        return text

query_transform_prompt = PromptTemplate(
    input_variables=["messages"],
    template="""<|system|>You are a helpful assistant.
{messages}
<|user|>
give me the search query about the above conversation.
<|assistant|>"""
)

# init the query chain
query_transforming_retriever_chain = (
    {"messages": RunnablePassthrough()}
    | query_transform_prompt
    | llm
    | StrOutputParser()
)

initialize the main retrieval chain that gives the resulting documents to LLM and gets the output back.

In [None]:
from langchain.chains.combine_documents import create_stuff_documents_chain

from langchain_core.runnables import RunnablePassthrough

prompt = PromptTemplate(
    input_variables=["context", "messages"],
    template="""You are a helpful assistant with the role of helping to make recommendations and answer questions.

Here are the movies that you must select from them:
{context}
-----------------
User Queries:
{messages}
-----------------

Based on the above movies and the user queries, please generate a response that is about the most relevant movie to the user queries. Your answer should be in followin from:
Title: [the title of the movie (The year the movie was made)]
Genres: [the genres of the movie]
Plot : [A brief summary of the movie]

""" + "|SEP|")

# init the retriver chain
retrieval_chain = (
    {"context" : retriever, "messages": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

write the conversation helper class for easier testing.

In [None]:
class Conversation:
    def __init__(self):
        self.messages = []

    def add_assistant_message(self, message):
        self.messages.append(('assistant', message))

    def add_user_message(self, message):
        self.messages.append(('user', message))

    def get_messages(self):
        # concatenate the messages with the roles in the instruction format
        formatted_messages = "\n \n".join([f"{role}: {msg}" for role, msg in self.messages])
        return formatted_messages

    def chat(self, message):
        self.add_user_message(message)
        messages = self.get_messages()

        # invoke the chain
        search_query = query_transforming_retriever_chain.invoke(input=messages).split("|SEP|")[-1]
        print("Current message : " , message)
        print("*****************************************")
        print("Current Query : ", search_query)
        print("*****************************************")

        response = retrieval_chain.invoke(search_query).split("|SEP|")[-1]

        self.add_assistant_message(response)
        return response

c = Conversation()
A = c.chat('give me a cool gangster movie')
print(A)

ValueError: Argument `prompt` is expected to be a string. Instead found <class 'list'>. If you want to run the LLM on multiple prompts, use `generate` instead.

## Test

talk with the RAG to see how good it performs.

In [None]:
c = Conversation()
A = c.chat('give me a cool gangster movie')
print(A)

In [None]:
A = c.chat('give me a newer one')
print(A)