# RAG

## Requirements

In [1]:
# %%capture
%pip install transformers accelerate bitsandbytes langchain langchain-community langchain-huggingface sentence-transformers faiss-gpu pandas gdown nltk tqdm

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


## Dataset

In [11]:
!gdown --fuzzy "https://drive.google.com/file/d/1Lq2zVJlN_B4kUAu4VafQ4jXMIQiAR9vI/view?usp=sharing"

Downloading...
From (original): https://drive.google.com/uc?id=1Lq2zVJlN_B4kUAu4VafQ4jXMIQiAR9vI
From (redirected): https://drive.google.com/uc?id=1Lq2zVJlN_B4kUAu4VafQ4jXMIQiAR9vI&confirm=t&uuid=da964a90-b52b-4f9f-a896-f3a33731083a
To: /home/danial/SUT/Term6/MIR/Project/Phase2/IMDb-IR-System/Logic/core/rag/IMDB_crawled.json
100%|████████████████████████████████████████| 292M/292M [03:59<00:00, 1.22MB/s]


## Config

In [1]:
class Config:
    EMBEDDING_MODEL_NAME="thenlper/gte-base"
    LLM_MODEL_NAME="HuggingFaceH4/zephyr-7b-beta"
    K = 5 # top K retrieval

## Preprocessing

In [3]:
import pandas as pd
df = pd.read_json('IMDB_crawled.json')
df.head()

Unnamed: 0,id,title,first_page_summary,release_year,mpaa,budget,gross_worldwide,rating,directors,writers,stars,related_links,languages,countries_of_origin,summaries,synposis,reviews,genres
0,tt0071562,The Godfather Part II,The early life and career of Vito Corleone in ...,1974,R,"$13,000,000 (estimated)","$47,962,683",9.0,[Francis Ford Coppola],,"[Al Pacino, Robert De Niro, Robert Duvall]",[https://imdb.com/title/tt0068646/?ref_=tt_sim...,"[English, Italian, Spanish, Latin, Sicilian]",[United States],[The early life and career of Vito Corleone in...,[The Godfather Part II presents two parallel s...,"[[Coppola's masterpiece is rivaled only by ""Th...","[Crime, Drama]"
1,tt0120737,The Lord of the Rings: The Fellowship of the Ring,A meek Hobbit from the Shire and eight compani...,2001,PG-13,"$93,000,000 (estimated)","$884,041,698",8.9,[Peter Jackson],,"[Elijah Wood, Ian McKellen, Orlando Bloom]",[https://imdb.com/title/tt0167261/?ref_=tt_sim...,"[English, Sindarin]","[New Zealand, United States]",[A meek Hobbit from the Shire and eight compan...,[Galadriel (Cate Blanchett) (The Elven co-rule...,"[[Here is one film that lived up to its hype, ...","[Action, Adventure, Drama]"
2,tt0110912,Pulp Fiction,"The lives of two mob hitmen, a boxer, a gangst...",1994,R,"$8,000,000 (estimated)","$213,928,762",8.9,[Quentin Tarantino],,"[John Travolta, Uma Thurman, Samuel L. Jackson]",[https://imdb.com/title/tt0137523/?ref_=tt_sim...,"[English, Spanish, French]",[United States],"[The lives of two mob hitmen, a boxer, a gangs...",[Narrative structure\nPulp Fiction's narrative...,[[I like the bit with the cheeseburger. It mak...,"[Crime, Drama]"
3,tt0068646,The Godfather,The aging patriarch of an organized crime dyna...,1972,R,"$6,000,000 (estimated)","$250,342,030",9.2,[Francis Ford Coppola],,"[Marlon Brando, Al Pacino, James Caan]",[https://imdb.com/title/tt0071562/?ref_=tt_sim...,"[English, Italian, Latin]",[United States],[The aging patriarch of an organized crime dyn...,"[In late summer 1945, guests are gathered for ...",[['The Godfather' is the pinnacle of flawless ...,"[Crime, Drama]"
4,tt0111161,The Shawshank Redemption,"Over the course of several years, two convicts...",1994,R,"$25,000,000 (estimated)","$28,904,232",9.3,[Frank Darabont],"[Stephen King, Frank Darabont]","[Tim Robbins, Morgan Freeman, Bob Gunton]",[https://imdb.com/title/tt0468569/?ref_=tt_sim...,[English],[United States],"[Over the course of several years, two convict...","[In 1947, Andy Dufresne (Tim Robbins), a banke...",[[The Shawshank Redemption is written and dire...,[Drama]


In [4]:
from tqdm import tqdm
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import stopwords
import re, string

def preprocess_text(text: str, lowercase=True, stopword_removal=True, stopwords_domain=None, min_length=2,  punctuation_removal=True,
                    does_stem=False, does_lemm=False):
    """Normalize text; needs the NLTK 'punkt' and 'stopwords' data (plus 'wordnet' for lemmatization)."""
    if text is None:
        return ""
    if lowercase:
        text = text.lower()
    if punctuation_removal:
        # strip ASCII punctuation before tokenizing
        text = text.translate(str.maketrans('', '', string.punctuation))
    tokens = word_tokenize(text)
    if stopword_removal:
        # default is None rather than a mutable []; domain-specific stopwords are optional
        stop_words = set(stopwords.words('english')) | set(stopwords_domain or [])
        tokens = [word for word in tokens if word not in stop_words]
    if does_stem:
        stemmer = PorterStemmer()
        tokens = [stemmer.stem(word) for word in tokens]
    if does_lemm:
        lemmatizer = WordNetLemmatizer()
        tokens = [lemmatizer.lemmatize(word) for word in tokens]
    tokens = [word for word in tokens if len(word) >= min_length]

    return " ".join(tokens)


In [6]:
import os

os.makedirs('data', exist_ok=True)

# keep only the columns needed for retrieval, since the embedding model's context window is limited
selected_columns = ['id', 'title', 'first_page_summary', 'genres']
# drop rows with missing values; .copy() makes the column assignment below unambiguous
df_preprocessed = df[selected_columns].dropna(subset=selected_columns).copy()

df_preprocessed['first_page_summary'] = df_preprocessed['first_page_summary'].apply(
    lambda x: preprocess_text(x, lowercase=True, stopword_removal=True, does_stem=True)
)

df_preprocessed.to_csv('data/imdb.csv', index=False)

df_preprocessed.head()

Unnamed: 0,id,title,first_page_summary,genres
0,tt0071562,The Godfather Part II,earli life career vito corleon 1920 new york c...,"[Crime, Drama]"
1,tt0120737,The Lord of the Rings: The Fellowship of the Ring,meek hobbit shire eight companion set journey ...,"[Action, Adventure, Drama]"
2,tt0110912,Pulp Fiction,live two mob hitmen boxer gangster wife pair d...,"[Crime, Drama]"
3,tt0068646,The Godfather,age patriarch organ crime dynasti transfer con...,"[Crime, Drama]"
4,tt0111161,The Shawshank Redemption,cours sever year two convict form friendship s...,[Drama]


## Vectorizer

Load the CSV file and embed each row with HuggingFaceEmbeddings.
Store the embeddings in a FAISS vector store.
Pickle the vector store so it can be reused later.

In [5]:
import torch
torch.cuda.empty_cache()

In [7]:
import pickle

from langchain_community.document_loaders.csv_loader import CSVLoader
from langchain_community.vectorstores.utils import DistanceStrategy
from langchain_community.vectorstores.faiss import FAISS

# from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_huggingface import HuggingFaceEmbeddings
# load the csv
loader = CSVLoader("data/imdb.csv", encoding="utf-8")
documents = loader.load()

# def extract_first_page_summary(page_content):
#     lines = page_content.split("\n")
#     for line in lines:
#         if line.startswith("first_page_summary"):
#             return line[len("first_page_summary")+1:].strip()
#     return ""

# print(documents[0].page_content)
# print(extract_first_page_summary(documents[0].page_content))

# load the embedding model named in Config

embedding_model = HuggingFaceEmbeddings(model_name=Config.EMBEDDING_MODEL_NAME)

# embed the documents and index them in a FAISS vector store (cosine distance)
faiss_vector_store = FAISS.from_documents(documents, embedding_model, distance_strategy=DistanceStrategy.COSINE)

# save the vector store to disk

vectorstore_path = "data/vectorstore.pkl"
with open(vectorstore_path, "wb") as f:
    pickle.dump(faiss_vector_store, f)

print(f"Vector store saved to {vectorstore_path}")


Vector store saved to data/vectorstore.pkl
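
The loading step above turns every row of `data/imdb.csv` into one document. As far as I understand the loader's default behaviour, the document text is simply the row rendered as `column: value` lines, which is also what the commented-out `extract_first_page_summary` helper assumes. A minimal sketch with the standard library only (`row_to_text` is a hypothetical helper, and the CSV content is a shortened toy example):

```python
import csv
import io

# Toy CSV with the same columns as data/imdb.csv (values shortened for illustration).
csv_text = (
    "id,title,first_page_summary,genres\n"
    "tt0068646,The Godfather,age patriarch organ crime dynasti,\"['Crime', 'Drama']\"\n"
)


def row_to_text(row: dict) -> str:
    """Render one CSV row as 'column: value' lines (hypothetical helper)."""
    return "\n".join(f"{key.strip()}: {value.strip()}" for key, value in row.items())


rows = list(csv.DictReader(io.StringIO(csv_text)))
doc_text = row_to_text(rows[0])
print(doc_text)
```

The whole rendered row, including `id` and `genres`, is what gets embedded, so every selected column influences retrieval.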


Load the vector store as a retriever.

In [2]:
# load the pickled vector store from disk
import pickle


vectorstore_path = "data/vectorstore.pkl"
with open(vectorstore_path, "rb") as f:
    faiss_vector_store = pickle.load(f)

# return the top K documents per query (K is set in Config)
retriever = faiss_vector_store.as_retriever(search_kwargs={"k": Config.K})

print("Retriever initialized successfully")


Retriever initialized successfully
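
Conceptually, the retriever embeds the query and returns the K stored documents whose vectors are most similar to it. The sketch below illustrates that idea with cosine similarity on hand-made 3-dimensional vectors. It is not how FAISS is implemented; the vectors, document names, and helper functions are all made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k):
    """Return the ids of the k documents most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]


# Toy "embeddings"; real ones come from the embedding model and have hundreds of dimensions.
toy_vectors = {
    "gangster_movie": [0.9, 0.1, 0.0],
    "fantasy_movie": [0.0, 0.2, 0.9],
    "prison_drama": [0.6, 0.7, 0.1],
}
query = [1.0, 0.2, 0.0]
result = top_k(query, toy_vectors, k=2)
print(result)
```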


## LLM

Load the quantized LLM.

In [61]:
# import os
# os.environ['HF_HOME'] = '~/SUT/Term6/MIR/Project/Phase2/IMDb-IR-System/Logic/core/rag/cache'

In [3]:
import torch

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from transformers import pipeline

from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline


# load the quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    # bnb_4bit_use_double_quant=True,
    # bnb_4bit_quant_type='nf4'
)

model = AutoModelForCausalLM.from_pretrained(
    Config.LLM_MODEL_NAME,
    quantization_config=bnb_config,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(Config.LLM_MODEL_NAME)


Loading checkpoint shards: 100%|██████████| 8/8 [00:18<00:00,  2.29s/it]
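
A rough back-of-the-envelope calculation shows why 4-bit loading matters here. The sketch assumes the "7b" in the model name means roughly 7 billion parameters and counts the weights only; activations, the KV cache, and quantization overhead are ignored, so real usage will be higher.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 7e9  # assumption: "7b" read as roughly 7 billion parameters

fp16_gb = weight_memory_gb(NUM_PARAMS, 16)
int4_gb = weight_memory_gb(NUM_PARAMS, 4)
print(f"fp16: ~{fp16_gb:.1f} GB, 4-bit: ~{int4_gb:.1f} GB")
```

Under these assumptions the 4-bit weights take about a quarter of the fp16 footprint, which is what makes a single consumer GPU plausible for this model.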


In [4]:

# init the pipeline
READER_LLM = pipeline(
    task="text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=500,
)

llm = HuggingFacePipeline(
    pipeline=READER_LLM,
)

print("LLM and pipeline initialized successfully")

LLM and pipeline initialized successfully


Initialize the prompt template for the query chain. The query chain turns the chat history into a single search query. You may change the prompt to get better results.

In [39]:
from langchain.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

class LoggerStrOutputParser(StrOutputParser):
    def parse(self, text: str) -> str:
        # process the LLM output
        print(f"QUERY: {text}")
        return text

query_transform_prompt = PromptTemplate(
    input_variables=["messages"],
    template="""
"{messages}"
Please generate a search query for the llm engine for movie about the above conversation. Your query should not be more than 15 words and just give one query as the output.
""" + "|SEP|"
)

# init the query chain
query_transforming_retriever_chain = (
    {"messages": RunnablePassthrough()}
    | query_transform_prompt
    | llm
    | StrOutputParser()
)

print("Query transforming retriever chain initialized successfully")

Query transforming retriever chain initialized successfully
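
The prompt ends with the marker `|SEP|`, and the chain output is later split on that marker. This presumably works because the text-generation pipeline returns the prompt followed by the completion, so everything after the last marker is the model's own text. A minimal sketch of that mechanism, where `fake_llm` is a stand-in for the model, the template is a shortened version of the one above, and the extra `.strip()` is my addition:

```python
SEPARATOR = "|SEP|"

# Shortened stand-in for the query prompt above; only the trailing marker matters here.
template = (
    '\n"{messages}"\n'
    "Write one short search query for the conversation above.\n"
) + SEPARATOR


def fake_llm(prompt: str) -> str:
    """Stand-in for the model: echoes the prompt, then appends a canned completion."""
    return prompt + '\n"stylish gangster film"'


def extract_completion(raw_output: str) -> str:
    """Keep only the text after the last separator, without surrounding whitespace."""
    return raw_output.split(SEPARATOR)[-1].strip()


prompt_text = template.format(messages="user: give me a cool gangster movie")
raw = fake_llm(prompt_text)
query = extract_completion(raw)
print(query)
```

If the pipeline were configured to return only the completion, the split would be a no-op, so the same code keeps working either way.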


Initialize the main retrieval chain, which passes the retrieved documents to the LLM and returns its output.

In [40]:
from langchain.chains.combine_documents import create_stuff_documents_chain

from langchain_core.runnables import RunnablePassthrough

prompt = PromptTemplate(
    input_variables=["context", "messages"],
    template="""You are a helpful assistant with the role of helping to make recommendations and answer questions.

Here are the movies that you must select from them:
{context}
-----------------
User Queries:
{messages}
-----------------

Based on the above movies and the user queries, please generate a response that is about the most relevant movie to the user queries. Your answer must be in the following form:

Title: [the title of the movie (The year the movie was made)]

Genres: [the genres of the movie]

Plot : [A brief summary of the movie]


Just one single movie recommendation in this format.
""" + "|SEP|")

# init the retrieval chain
retrieval_chain = (
    {"context" : retriever, "messages": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
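
In the chain above the retriever output is handed to the prompt as-is, so `{context}` presumably receives the string form of a list of document objects, metadata included. A common alternative is to join only the document texts before filling the prompt. The sketch below shows that formatting step on its own; `Doc` is a minimal stand-in class, not the LangChain document class, and `format_docs` is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Minimal stand-in for a retrieved document (hypothetical, for illustration only)."""
    page_content: str


def format_docs(docs) -> str:
    """Join the text of each retrieved document, separated by a blank line."""
    return "\n\n".join(doc.page_content for doc in docs)


retrieved = [
    Doc("title: The Godfather\ngenres: ['Crime', 'Drama']"),
    Doc("title: Pulp Fiction\ngenres: ['Crime', 'Drama']"),
]
context = format_docs(retrieved)
print(context)
```

In the chain this would likely be wired in as `retriever | format_docs` for the `"context"` entry, which keeps the prompt shorter and easier for the model to read.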

Write a conversation helper class for easier testing.

In [41]:
class Conversation:
    def __init__(self):
        self.messages = []
        self.retriever = retriever
        
    def add_assistant_message(self, message):
        self.messages.append(('assistant', message))

    def add_user_message(self, message):
        self.messages.append(('user', message))

    def get_messages(self):
        # concatenate the messages with the roles in the instruction format
        formatted_messages = "\n**********\n".join(
            f"{role}: {msg}" for role, msg in self.messages
        )
        return formatted_messages

    def chat(self, message):
        self.add_user_message(message)
        messages = self.get_messages()

        # print(messages)
        # print("***************")
        # invoke the chain

        query = query_transforming_retriever_chain.invoke(messages).split("|SEP|")[-1]

        print("Current message : " , message)
        print("*****************************************")
        print("Current Query : ", query)
        print("*****************************************")

        response = retrieval_chain.invoke(query).split("|SEP|")[-1]
        # keep only the first lines of the answer, stored as one string rather than a list
        self.add_assistant_message("\n".join(response.split("\n")[0:6]))
        return response
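
To see what the query chain actually receives on a follow-up turn, the history formatting can be reproduced on its own. The sketch mirrors the join in `get_messages` with the same delimiter; `format_history` and the sample history are illustrative only.

```python
def format_history(messages, delimiter="\n**********\n"):
    """Render (role, text) pairs as 'role: text' blocks joined by the delimiter."""
    return delimiter.join(f"{role}: {text}" for role, text in messages)


history = [
    ("user", "give me a cool gangster movie"),
    ("assistant", "Title: Scarface (1983)"),
    ("user", "give me a newer one"),
]
formatted = format_history(history)
print(formatted)
```

Each stored message should be a plain string. A list would be rendered with its brackets and quotes, which adds noise to the prompt.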


## Test

Talk with the RAG system to see how well it performs.

In [42]:
c = Conversation()
A = c.chat('give me a cool gangster movie')
print(A)

Current message :  give me a cool gangster movie
*****************************************
Current Query :  >
"recommend a stylish gangster film with charismatic antiheroes"
*****************************************
>
Title: Scarface (1983)

Genres: Action, Crime, Drama

Plot: In 1980 Miami, a Cuban refugee named Tony Montana (Al Pacino) flees his homeland after the Marxist revolution, settling in with his guerrilla fighter girlfriend and his cousin, Manolo. In a few years, Montana manages to get a job working for ruthless Miami drug lord Frank Lopez (Robert Loggia). Montana proves himself to be a prosperous worker, ultimately replacing Lopez's right-hand man, Ganz (Harris Yulin). As Montana's power grows, however, so do his greed and narcissism, and his blind ambition puts everyone he loves in danger.

Charismatic Antiheroes: Tony Montana, played by Al Pacino, is a charismatic antihero who rises from a Cuban refugee to a powerful drug lord in Miami. His charm and charisma make him a c

In [43]:
A = c.chat('give me a newer one')
print(A)

Current message :  give me a newer one
*****************************************
Current Query :  >
"recommend a modern gangster movie"

*****************************************
>
Title: "Gangster Land" (2017)

Genres: Crime, Drama, Thriller

Plot: In the 1990s, the Irish mafia fights for control of the streets of Boston, and a young man, Colin, gets caught in the middle. As he rises through the ranks, he must decide whether to follow his dreams or to betray his friends and become a ruthless criminal himself. With a cast including Chris Coppola, Peter Greene, and Milo Gibson, "Gangster Land" is a gritty and intense portrayal of the criminal underworld.
