# RAG

## Requirements

In [6]:
%%capture
!pip install transformers accelerate bitsandbytes langchain langchain-community sentence-transformers faiss-gpu pandas gdown

## Dataset

In [7]:
!gdown --fuzzy https://drive.google.com/file/d/1Lq2zVJlN_B4kUAu4VafQ4jXMIQiAR9vI/view?usp=sharing

Downloading...
From (original): https://drive.google.com/uc?id=1Lq2zVJlN_B4kUAu4VafQ4jXMIQiAR9vI
From (redirected): https://drive.google.com/uc?id=1Lq2zVJlN_B4kUAu4VafQ4jXMIQiAR9vI&confirm=t&uuid=a1a8abb9-59e0-49ab-9dd7-bdffd261a4db
To: /kaggle/working/IMDB_crawled.json
100%|█████████████████████████████████████████| 292M/292M [00:01<00:00, 176MB/s]


## Config

In [8]:
class Config:
    EMBEDDING_MODEL_NAME="thenlper/gte-base"
    LLM_MODEL_NAME="HuggingFaceH4/zephyr-7b-beta"
    K = 5 # top K retrieval

## Preprocessing

In [10]:
import pandas as pd

df = pd.read_json('IMDB_crawled.json')

In [9]:
import os

os.makedirs('data', exist_ok=True)

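# keep only the columns needed for retrieval, since the embedding model's context window is limited
df = df[['title', 'genres', 'rating', 'first_page_summary', 'summaries']].copy()
# keep the first summary when the cell holds a list; leave strings untouched so re-running this cell is safe
df["summaries"] = df["summaries"].apply(lambda x: (x[0] if x else None) if isinstance(x, list) else x)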

df.to_csv('data/imdb.csv', index=False)
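
One thing to watch when taking the first summary with `x[0]`: if a cell already holds a plain string (for example after the preprocessing cell has been run twice on the same dataframe), indexing returns only its first character. The one-letter `summaries` values in the sample context printed at the end of this notebook are consistent with that. A minimal sketch of a type-checked variant; `first_summary` is a hypothetical helper and the sample strings are made up:

```python
def first_summary(value):
    """Return the first item of a list, or the value unchanged if it is not a list."""
    if isinstance(value, list):
        return value[0] if value else None
    return value  # already a string (or None): leave it alone


once = first_summary(["A cop hunts a drug lord.", "Second summary."])
twice = first_summary(once)  # simulates running the preprocessing cell again

print(once)
print(twice)
```

With plain indexing the second call would return `"A"` instead of the full sentence.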

## Vectorizer

Load the CSV file and embed each row with HuggingFaceEmbeddings.
Store the resulting vectors in a FAISS vectorstore.
Save the vectorstore to a pickle file for later use.
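
Judging by the sample context printed at the end of this notebook, each CSV row reaches the embedding model as one text of `column: value` lines. A rough standard-library sketch of that row-to-text step, not the loader's actual implementation; `rows_to_texts` is a hypothetical name and the two rows reuse values from that sample output:

```python
import csv
import io

CSV_TEXT = '''title,genres,rating
Gangster Squad,"['Action', 'Crime', 'Drama']",6.7
American Gangster,"['Biography', 'Crime', 'Drama']",7.8
'''


def rows_to_texts(csv_text):
    """Turn every CSV row into one 'column: value' text, one line per column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return ["\n".join(f"{key}: {value}" for key, value in row.items()) for row in reader]


texts = rows_to_texts(CSV_TEXT)
print(texts[0])
```

Every column kept in preprocessing ends up in the embedded text, which is why trimming the dataframe matters.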

In [11]:
import pickle

from langchain_community.document_loaders.csv_loader import CSVLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores.faiss import FAISS
from langchain_community.vectorstores.utils import DistanceStrategy

# load the CSV; each row becomes one document
csv_loader = CSVLoader("data/imdb.csv")
data = csv_loader.load()

# load the embedding model
embeddings = HuggingFaceEmbeddings(model_name=Config.EMBEDDING_MODEL_NAME)

# embed the documents and index them in a FAISS vectorstore
vectorstore = FAISS.from_documents(data, embeddings, distance_strategy=DistanceStrategy.COSINE)

# save the vectorstore for later runs
with open("data/vectorstore.pkl", "wb") as f:
    pickle.dump(vectorstore, f)

Load the vectorstore as a retriever.

In [12]:
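# to reuse a saved vectorstore instead of rebuilding it:
# with open("data/vectorstore.pkl", "rb") as f:
#     vectorstore = pickle.load(f)

# expose the vectorstore as a retriever returning the top K documents;
# the number of results is passed through search_kwargs
retriever = vectorstore.as_retriever(search_kwargs={"k": Config.K})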
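
Conceptually, the retriever embeds the query and returns the documents whose vectors are most similar to it. A toy pure-Python sketch of cosine top-k search; the two-dimensional vectors and the names `cosine` and `top_k` are made up for illustration, and FAISS does this with an optimized index rather than a full sort:

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, doc_vecs, k):
    """Return the names of the k documents most similar to the query."""
    ranked = sorted(doc_vecs.items(), key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [name for name, _ in ranked[:k]]


toy_vectors = {"doc_a": [1.0, 0.0], "doc_b": [0.7, 0.7], "doc_c": [0.0, 1.0]}
nearest = top_k([1.0, 0.1], toy_vectors, k=2)
print(nearest)
```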


## LLM

Load the LLM with 4-bit quantization.

In [13]:
import torch

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from transformers import pipeline

from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline

# 4-bit NF4 quantization config with double quantization and bfloat16 compute
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(Config.LLM_MODEL_NAME, quantization_config=bnb_config, device_map="cuda:0")
tokenizer = AutoTokenizer.from_pretrained(Config.LLM_MODEL_NAME)

# init the pipeline
READER_LLM = pipeline(
    task="text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=200
)


llm = HuggingFacePipeline(
    pipeline=READER_LLM,
)
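
As a rough illustration of why the model is loaded in 4-bit: weight memory scales with the number of bits stored per parameter. The sketch below assumes about 7 billion parameters (an assumption based on the model name, not a measured count) and ignores quantization constants, activations, and the KV cache, so real usage will be somewhat higher.

```python
def weight_memory_gib(n_params, bits_per_param):
    """Approximate memory needed for the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 2**30


N_PARAMS = 7e9  # assumption: roughly 7B parameters

fp16_gib = weight_memory_gib(N_PARAMS, 16)
nf4_gib = weight_memory_gib(N_PARAMS, 4)
print(f"16-bit weights: {fp16_gib:.1f} GiB, 4-bit weights: {nf4_gib:.1f} GiB")
```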


Initialize the prompt template for the query chain. The query chain turns the chat history into a single search query. You may change the prompt to get better results.

In [14]:
# prompt template and chain that turn the chat history into a search query
from langchain.prompts import PromptTemplate

from langchain_core.output_parsers import StrOutputParser

class LoggerStrOutputParser(StrOutputParser):
    def parse(self, text: str) -> str:
        # keep only the text after the last role tag, dropping any echoed prompt
        text = text.split('|>')[-1]
        return text

query_transform_prompt = PromptTemplate(
    input_variables=["messages"],
    template="""<|system|>You are a helpful assistant.
{messages}
<|user|>
expand a one line descriptive search query for movie search with keywords from the conversation above.
<|assistant|>"""
)

# init the query chain
query_transforming_retriever_chain = query_transform_prompt | llm | LoggerStrOutputParser()
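
The output parser relies on the role tags ending in `|>`: splitting on that marker and keeping the last piece leaves whatever follows the final tag. A small standalone sketch, assuming the raw LLM output contains the prompt followed by the completion; `strip_prompt_echo` is a hypothetical name and the sample string is made up:

```python
def strip_prompt_echo(text):
    """Keep only the text after the last '|>' marker."""
    return text.split("|>")[-1]


raw = (
    "<|system|>You are a helpful assistant.\n"
    "<|user|>\ngive me a cool gangster movie\n"
    "<|assistant|>\nA gritty gangster movie"
)
cleaned = strip_prompt_echo(raw)
print(repr(cleaned))
```

The result keeps the newline that follows the tag, and a completion that itself contains `|>` would be cut at that point, so this is a heuristic rather than a robust parser.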



Initialize the main retrieval chain, which passes the retrieved documents to the LLM and returns its answer.

In [15]:
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains.retrieval import create_retrieval_chain

from langchain_core.runnables import RunnablePassthrough

prompt = PromptTemplate(
    input_variables=["context", "messages"],
    template="""<|system|>You are a helpful assistant.

Here are the movies you MUST choose from:

{context}
-----------------
{messages}
<|assistant|>""")

# init the retriever chain
def format_docs(docs):
    # join the retrieved documents into one context string
    return "\n\n".join(doc.page_content for doc in docs)

retriever_chain = (
    {"context": retriever | format_docs, "messages": RunnablePassthrough()}
    | prompt
    | llm
    | LoggerStrOutputParser()
)
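
To see what the LLM actually receives, the prompt assembly can be reproduced without any model. In this sketch `Doc` is a hypothetical stand-in for a retrieved document (only `page_content` is used), the document texts are shortened, and plain `str.format` stands in for the prompt template:

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical stand-in for a retrieved document."""
    page_content: str


TEMPLATE = """<|system|>You are a helpful assistant.

Here are the movies you MUST choose from:

{context}
-----------------
{messages}
<|assistant|>"""


def format_docs(docs):
    """Join the retrieved documents into one context string."""
    return "\n\n".join(doc.page_content for doc in docs)


docs = [Doc("title: Scarface"), Doc("title: Mean Streets")]
filled = TEMPLATE.format(context=format_docs(docs), messages="a gritty gangster movie")
print(filled)
```

Note that the same input string is used both as the retrieval query and as the `{messages}` slot of the prompt.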
 

Write a conversation helper class for easier testing.

In [16]:

class Conversation:
    def __init__(self):
        self.messages = []

    def add_assistant_message(self, message):
        self.messages.append(('assistant', message))

    def add_user_message(self, message):
        self.messages.append(('user', message))

    def get_messages(self):
        # concatenate the messages with the roles in the instruction format
        messages = "\n".join([f"<|{role}|>{message}" for role, message in self.messages])
        return messages

    def chat(self, message):
        self.add_user_message(message)
        messages = self.get_messages()
        # turn the chat history into a search query, then retrieve and answer with that query
        query = query_transforming_retriever_chain.invoke(messages)
        print(f'Query: {query}')
        response = retriever_chain.invoke(query)
        self.add_assistant_message(response)

        return response
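
The two-stage flow (history to query, query to answer) can be checked without loading any model by swapping the chains for plain functions. A minimal sketch; `StubConversation`, `fake_query_chain`, and the lambda are hypothetical stand-ins, not part of the notebook's pipeline:

```python
class StubConversation:
    """Same message bookkeeping as Conversation, with injectable chain functions."""

    def __init__(self, make_query, answer):
        self.messages = []
        self.make_query = make_query  # stands in for the query chain
        self.answer = answer          # stands in for the retrieval chain

    def get_messages(self):
        # concatenate the messages with their role tags
        return "\n".join(f"<|{role}|>{text}" for role, text in self.messages)

    def chat(self, message):
        self.messages.append(("user", message))
        query = self.make_query(self.get_messages())
        response = self.answer(query)
        self.messages.append(("assistant", response))
        return response


seen_histories = []


def fake_query_chain(history):
    # record the history the query chain would receive
    seen_histories.append(history)
    return f"query #{len(seen_histories)}"


conv = StubConversation(fake_query_chain, lambda q: f"answer to {q}")
conv.chat("give me a cool gangster movie")
last = conv.chat("give me a newer one")
print(seen_histories[1])
```

On the second turn the query stage sees the whole history, including the previous answer, which is what lets a follow-up like "a newer one" be resolved.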

## Test

Talk to the RAG pipeline to see how well it performs.

In [17]:
c = Conversation()
A = c.chat('give me a cool gangster movie')
print(A)

Query: 
"Looking for a gritty and intense gangster movie with a charismatic lead character who rises through the ranks of a criminal organization, facing dangerous enemies and moral dilemmas along the way. Bonus points for a stylish and atmospheric cinematography that captures the seedy underworld of the city."

Based on your preferences, I would recommend "Scarface" as the movie that best fits your criteria. While "American Gangster" is also a biographical crime drama, "Scarface" has a more intense and gritty portrayal of a gangster's rise to power, with a charismatic lead character facing dangerous enemies and moral dilemmas. Additionally, "Scarface" has a stylish and atmospheric cinematography that captures the seedy underworld of the city. "Gangster Squad" and "Mean Streets" are also worth considering, but they may not have the same level of intensity and grittiness that you're looking for.


In [18]:
A = c.chat('give me a newer one')
print(A)


Query: 
"Search for a gritty and intense gangster movie with a charismatic lead character facing dangerous enemies and moral dilemmas, similar to 'Scarface,' but released in the past decade." Keywords: gritty, intense, gangster movie, charismatic lead character, dangerous enemies, moral dilemmas, released in the past decade.

Based on your search criteria, I would recommend "Gone" (2012) as a movie that fits your description. It follows the story of Jules, a former FBI agent who is forced to team up with a notorious drug lord to bring down a ruthless cartel. The movie has a gritty and intense tone, with Jules facing dangerous enemies and moral dilemmas as he tries to protect his loved ones and bring justice to the community. The lead character, played by Amanda Seyfried, is charismatic and determined, making her a compelling protagonist. While it may not be a traditional gangster movie, "Gone" has elements of the genre and is a thrilling and engaging watch.


('assistant', '<|system|>You are a helpful assistant.\n\nHere are the movies you MUST choose from:\n\ntitle: Gangster Squad\ngenres: [\'Action\', \'Crime\', \'Drama\']\nrating: 6.7\nfirst_page_summary: It\'s 1949 Los Angeles, the city is run by gangsters and a malicious mobster, Mickey Cohen. Determined to end the corruption, John O\'Mara assembles a team of cops, ready to take down the ruth... Read all\nsummaries: I\n\ntitle: American Gangster\ngenres: [\'Biography\', \'Crime\', \'Drama\']\nrating: 7.8\nfirst_page_summary: An outcast New York City cop is charged with bringing down Harlem drug lord Frank Lucas, whose real life inspired this partly biographical film.\nsummaries: A\n\ntitle: A Bronx Tale\ngenres: [\'Crime\', \'Drama\']\nrating: 7.8\nfirst_page_summary: Robert De Niro and Chazz Palminteri give captivating performances in this intense drama about a boy torn between his tough, hard-working father and a violent yet charismatic crime boss.\nsummaries: R\n\ntitle: Scarface\nge