# A movie recommendation agent using the IMDB dataset, powered by RAG

A RAG (retrieval-augmented generation) system is composed of the following:
- An LLM to generate the answer
- An encoder that transforms queries and documents into vectors
- A vector database to store the output vectors from the encoder
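To make the three parts concrete, here is a toy sketch of how they fit together. It is not the pipeline built in this notebook: the encoder is a plain bag-of-words counter, the "vector database" is a Python list, and the LLM call is left out, so every name in it is an assumption made for illustration.

```python
import math
import re
from collections import Counter

def encode(text):
    """Toy encoder: bag-of-words counts (a stand-in for a neural embedder)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Toy "vector database": a list of (vector, text) pairs.
documents = [
    "A cowboy doll feels threatened by a new spaceman toy.",
    "Two siblings discover an enchanted board game.",
]
store = [(encode(doc), doc) for doc in documents]

def retrieve(query, k=1):
    """Return the k stored texts most similar to the query."""
    query_vec = encode(query)
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    return [text for _, text in ranked[:k]]

def build_prompt(query):
    """Combine the retrieved context and the question into one prompt for an LLM."""
    context = "\n".join(retrieve(query))
    return f"Context:\n{context}\n\nQuestion: {query}"

print(build_prompt("enchanted board game"))
```

The rest of the notebook swaps each toy piece for a real one: a sentence-transformer as the encoder, Chroma as the store, and a small instruction-tuned model as the generator.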

This project is a good example of using an LLM powered by RAG because the LLM has no specific knowledge of the movies in the dataset and has no idea about the latest releases.

In [16]:
import nltk
import chromadb
import transformers
import pandas as pd
from sentence_transformers import SentenceTransformer
from langchain_text_splitters import NLTKTextSplitter
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

In [2]:
df = pd.read_csv("../data/movies_metadata.csv")
df.head()


Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0


Since the recommendations are based only on the overview, we just need the title and the overview. (We could use the genres too, but the genres column holds a list of dictionaries rather than plain labels, and the genres are not well specified.)
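If we did want the genres, one possible approach is sketched below. It assumes each `genres` cell is a string containing a Python-style list of dictionaries with a `name` key, which is what the preview above suggests; `parse_genres` is a hypothetical helper, not part of this notebook.

```python
import ast

def parse_genres(raw):
    """Return the genre names from a stringified list of dicts, or [] if unparsable."""
    if not isinstance(raw, str):
        return []  # NaN or other non-string cell
    try:
        parsed = ast.literal_eval(raw)  # safe parsing of Python literals only
    except (ValueError, SyntaxError):
        return []
    if not isinstance(parsed, list):
        return []
    return [item["name"] for item in parsed if isinstance(item, dict) and "name" in item]

# Illustrative sample in the shape seen in the preview above.
sample = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"
print(parse_genres(sample))
```

Applied to the full frame (before the columns are dropped), this would be something like `df["genres"].apply(parse_genres)`.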

In [3]:
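# .copy() makes this an independent frame, so adding columns later does not trigger SettingWithCopyWarning
df = df[["original_title", "overview"]].copy()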
df.head()

Unnamed: 0,original_title,overview
0,Toy Story,"Led by Woody, Andy's toys live happily in his ..."
1,Jumanji,When siblings Judy and Peter discover an encha...
2,Grumpier Old Men,A family wedding reignites the ancient feud be...
3,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom..."
4,Father of the Bride Part II,Just when George Banks has recovered from his ...


Now let's create chunks of the text in the overview column.

In [4]:
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to
[nltk_data]     /home/sajalpaudyal/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [5]:
text_splitter = NLTKTextSplitter(chunk_size=1500)
def split_overview(overview):
    if pd.isna(overview):
        return []
    return text_splitter.split_text(str(overview))

df['chunks'] = df['overview'].apply(split_overview)
df.head()

Unnamed: 0,original_title,overview,chunks
0,Toy Story,"Led by Woody, Andy's toys live happily in his ...","[Led by Woody, Andy's toys live happily in his..."
1,Jumanji,When siblings Judy and Peter discover an encha...,[When siblings Judy and Peter discover an ench...
2,Grumpier Old Men,A family wedding reignites the ancient feud be...,[A family wedding reignites the ancient feud b...
3,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...","[Cheated on, mistreated and stepped on, the wo..."
4,Father of the Bride Part II,Just when George Banks has recovered from his ...,[Just when George Banks has recovered from his...
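To show roughly what the splitter is doing, here is a simplified stand-in. It splits on sentence-ending punctuation with a regular expression instead of NLTK, ignores chunk overlap, and packs sentences greedily up to `chunk_size` characters. It is a sketch of the idea under those assumptions, not LangChain's actual implementation.

```python
import re

def split_into_chunks(text, chunk_size=1500, separator="\n\n"):
    """Greedily pack sentences into chunks of up to chunk_size characters.

    A single sentence longer than chunk_size is kept whole as its own chunk.
    """
    # Naive sentence split on ., ! or ? followed by whitespace (NLTK is more robust).
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = sentence if not current else current + separator + sentence
        if len(candidate) <= chunk_size or not current:
            current = candidate      # sentence still fits (or starts a new chunk)
        else:
            chunks.append(current)   # close the full chunk
            current = sentence       # start the next one
    if current:
        chunks.append(current)
    return chunks

text = "Film about filmmaking. It takes place during one day. Ultimate tribute."
print(split_into_chunks(text))                 # everything fits in one chunk
print(split_into_chunks(text, chunk_size=30))  # a small limit forces several chunks
```

Because most overviews are far shorter than 1500 characters, nearly every movie ends up with a single chunk.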


In [6]:
chunked_df = df.explode('chunks').reset_index(drop=True)

Now we need to transform the chunks into vectors using an embedder. A list of available embedders can be found on this [leaderboard](https://huggingface.co/spaces/mteb/leaderboard). *e5-small-v2* is a small model that offers a good balance between quality and speed.
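One detail worth knowing: the E5 model card recommends prefixing every input with `query: ` or `passage: ` so the model can tell questions from documents. The cells in this notebook encode raw text without those prefixes. A minimal sketch of the convention, with `as_passage` and `as_query` as hypothetical helper names:

```python
def as_passage(text):
    """Prefix a document chunk as recommended for E5 models when indexing."""
    return f"passage: {text}"

def as_query(text):
    """Prefix a user question as recommended for E5 models when searching."""
    return f"query: {text}"

# With the notebook's embedder, usage would look like:
#   embedder.encode(as_passage(chunk))   # when building the index
#   embedder.encode(as_query(query))     # when retrieving
print(as_query("What are some good movies to watch on a rainy day?"))
```

Whether the prefixes noticeably change the retrieved movies here has not been measured in this notebook.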

In [7]:
embedder = SentenceTransformer("intfloat/e5-small-v2")

In [8]:
def encode_chunk(chunk):
    if not isinstance(chunk, str) or chunk.strip() == "":
        return None
    return embedder.encode(chunk).tolist()


In [9]:
chunked_df['embeddings'] = chunked_df['chunks'].apply(encode_chunk)

chunked_df.dropna(subset=['embeddings'], inplace=True)


In [11]:
client = chromadb.Client()
collection = client.create_collection(name = 'movies')

In [12]:
chunked_df['embeddings']

0        [-0.06718820333480835, 0.03433913737535477, 0....
1        [-0.0452115535736084, 0.02868107706308365, 0.0...
2        [-0.032773420214653015, 0.031449250876903534, ...
3        [-0.06683981418609619, 0.03363468497991562, -0...
4        [-0.07878788560628891, 0.039820197969675064, 0...
                               ...                        
45461    [-0.08558125793933868, 0.03270949050784111, -0...
45462    [-0.02992500551044941, 0.055029258131980896, -...
45463    [-0.10009708255529404, -0.005005186889320612, ...
45464    [-0.06996455788612366, 0.061901289969682693, 0...
45465    [-0.0455254465341568, 0.017802070826292038, 0....
Name: embeddings, Length: 44507, dtype: object

In [13]:
chunked_df[chunked_df['embeddings'].isna()]

Unnamed: 0,original_title,overview,chunks,embeddings


In [14]:
for idx, row in chunked_df.iterrows():
    collection.add(
        ids=[str(idx)],
        embeddings=[row['embeddings']],
        metadatas=[{
            'original_title': row['original_title'],
            'chunk': row['chunks']
        }]
    )
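Adding rows one at a time means one call per chunk. Since `collection.add` accepts lists, the rows could instead be sent in batches. The helper below is a sketch of that idea; `batched` and the batch size are assumptions, and the actual `collection.add` call is shown only as a comment so the example runs without a database.

```python
def batched(items, batch_size):
    """Yield consecutive slices of items with at most batch_size elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Stand-in ids; in the notebook these would come from chunked_df.index.
ids = [str(i) for i in range(5)]

for id_batch in batched(ids, batch_size=2):
    # With the real collection, each batch would become a single call, e.g.
    #   collection.add(ids=id_batch, embeddings=..., metadatas=...)
    print(id_batch)
```

The embeddings and metadata lists would be sliced with the same `start` offsets so that the three lists stay aligned.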

In [15]:
# Reuse the embedder loaded above so queries and chunks share one model instance
sentence_model = embedder

In [20]:
model_name = "Qwen/Qwen2.5-1.5B-Instruct"

In [23]:
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
)

In [24]:
text_generation_pipeline = pipeline(
    model=model,
    tokenizer=tokenizer,
    task="text-generation",
    return_full_text=True,
    max_new_tokens=800
)


Device set to use cpu


To find relevant chunks, the query has to be encoded with the same embedder that encoded the chunks, so we need a function that:
- Creates a new vector for the query
- Finds the most similar documents
- Returns the associated text

In [25]:
def retrieve_documents(query, collection, top_k=5):
    query_embedding = sentence_model.encode(query).tolist()
    
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=top_k
    )
    
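    # Results are nested: one inner list per query, and an empty inner list means no hits.
    # Only metadata was stored in the collection, so check 'metadatas' rather than 'documents'.
    if not results['metadatas'] or not results['metadatas'][0]:
        print("No results found for the query.")
        return [], []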
    
    chunks = []
    titles = []
    for document in results['metadatas'][0]:
        chunks.append(document['chunk'])
        titles.append(document['original_title'])
    
    return chunks, titles
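The indexing with `[0]` is easier to follow with the result shape in view. The dictionary below is hand-built to mimic what a single-query result is expected to look like (one inner list per query); the ids, distances, and shortened chunk texts are made up for illustration, and `unpack_results` is a hypothetical helper.

```python
def unpack_results(results):
    """Flatten a single-query result into (title, chunk, distance) tuples."""
    metadatas = results.get("metadatas") or [[]]
    distances = results.get("distances") or [[]]
    rows = []
    for meta, dist in zip(metadatas[0], distances[0]):
        rows.append((meta["original_title"], meta["chunk"], dist))
    return rows

# Hand-built stand-in for a query result with one query and two hits.
fake_results = {
    "ids": [["12", "7"]],
    "documents": [[None, None]],  # nothing was stored under 'documents'
    "metadatas": [[
        {"original_title": "Pumzi", "chunk": "A sci-fi film about futuristic Africa."},
        {"original_title": "Living in Oblivion", "chunk": "Film about filmmaking."},
    ]],
    "distances": [[0.31, 0.35]],  # made-up values; smaller means closer
}

for title, chunk, distance in unpack_results(fake_results):
    print(f"{distance:.2f}  {title}: {chunk}")
```

Returning the distances alongside the text makes it possible to drop weak matches before they reach the prompt.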

Now we call the LLM with a prompt that clearly explains the task and gives the model both the context and the question.

In [26]:
def generate_answer(query, chunks, titles, text_generation_pipeline):
    context = "\n\n".join([f"Title: {title}\nChunk: {chunk}" for title, chunk in zip(titles, chunks)])
    
    prompt = f"""[INST]
    Instruction: You're an expert in movie suggestions. Your task is to analyze carefully the context and come up with an exhaustive answer to the following question:
    {query}
    
    Here is the context to help you:

    {context}

    [/INST]"""
    
    generated_text = text_generation_pipeline(prompt)[0]['generated_text']
    
    return generated_text
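Two things about this cell are worth flagging. The `[INST] ... [/INST]` markers come from the Llama-2/Mistral prompt style, while Qwen2.5-Instruct models ship their own chat template, and `return_full_text=True` makes the pipeline echo the prompt back in its output. One alternative, sketched here as an assumption rather than as what this notebook ran, is to build chat messages and let the tokenizer format them; `build_messages` is a hypothetical helper.

```python
def build_messages(query, chunks, titles):
    """Build chat-style messages from a question and the retrieved chunks."""
    context = "\n\n".join(
        f"Title: {title}\nOverview: {chunk}" for title, chunk in zip(titles, chunks)
    )
    system = (
        "You are an expert in movie suggestions. "
        "Recommend movies using only the context provided."
    )
    user = f"Question: {query}\n\nContext:\n{context}"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]

messages = build_messages(
    "What are some good movies to watch on a rainy day?",
    ["Film about filmmaking."],
    ["Living in Oblivion"],
)
# With the notebook's tokenizer, the prompt string would be produced by:
#   prompt = tokenizer.apply_chat_template(
#       messages, tokenize=False, add_generation_prompt=True
#   )
# Passing return_full_text=False to the pipeline would return only the new text.
print(messages[1]["content"])
```

The system message wording is only an example; the point is that the model receives the question and context in the format it was tuned on.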


In [27]:
client = chromadb.Client()
collection = client.get_collection(name='movies')

query = "What are some good movies to watch on a rainy day?"
top_k = 5

chunks, titles = retrieve_documents(query, collection, top_k)
print(f"Retrieved Chunks: {chunks}")
print(f"Retrieved Titles: {titles}")

Retrieved Chunks: ["There are some movies that are so bad they're good.\n\nAnd there are some movies that are so bad- that they're just bad...", 'Film about filmmaking.\n\nIt takes place during one day on set of non-budget movie.\n\nUltimate tribute to all independent filmmakers.', "A sci-fi film about futuristic Africa, 35 years after World War III, 'The Water War'.", "A fine day in the life of a fly presented completely from the fly's point of view.\n\nA fine day until something dreary happens, that is.", 'Hilarious, sad, absurd, eerie and beautiful, "FAST, CHEAP &amp; OUT OF CONTROL" is a film like no other.\n\nStarting as a darkly funny contemplation of the Sisyphus-like nature of human striving, it ultimately becomes a profoundly moving meditation on the very nature of existence.']
Retrieved Titles: ['The 50 Worst Movies Ever Made', 'Living in Oblivion', 'Pumzi', 'A Légy', 'Fast, Cheap & Out of Control']


In [28]:
if chunks and titles:
    answer = generate_answer(query, chunks, titles, text_generation_pipeline)
    print(answer)
else:
    print("No relevant documents found to generate an answer.")

[INST]
    Instruction: You're an expert in movie suggestions. Your task is to analyze carefully the context and come up with an exhaustive answer to the following question:
    What are some good movies to watch on a rainy day?

    Here is the context to help you:

    Title: The 50 Worst Movies Ever Made
Chunk: There are some movies that are so bad they're good.

And there are some movies that are so bad- that they're just bad...

Title: Living in Oblivion
Chunk: Film about filmmaking.

It takes place during one day on set of non-budget movie.

Ultimate tribute to all independent filmmakers.

Title: Pumzi
Chunk: A sci-fi film about futuristic Africa, 35 years after World War III, 'The Water War'.

Title: A Légy
Chunk: A fine day in the life of a fly presented completely from the fly's point of view.

A fine day until something dreary happens, that is.

Title: Fast, Cheap & Out of Control
Chunk: Hilarious, sad, absurd, eerie and beautiful, "FAST, CHEAP &amp; OUT OF CONTROL" is a film

As we can see, even a very small model, when fed the required data through the vector database, can work with movies it has no built-in knowledge of and base its answer on the supplied context. This RAG mechanism can be applied to many small language models, making them more useful while keeping energy consumption low and the system fast.