
# 🎥 Build a RAG question answering system for movie recommendations

Our goal for this project is to build a movie question answering system using RAG.

We will use a dataset from the Internet Movie Database (IMDb).

Users will interact with the dataset by asking questions about movies, and the chatbot will retrieve relevant information from our dataset to answer those questions.


# 💻 Install `datasets` library to access IMDb dataset from Hugging Face Hub

[Hugging Face Hub](https://huggingface.co/docs/hub/en/index) is to machine learning what GitHub is to software development: a centralized platform that promotes open sharing, testing, and collaboration.

The models and datasets on the Hub are hosted as Git repositories, allowing for versioning and reproducibility.

The Hub provides a simple way for developers to discover, download, and use these pre-trained models and datasets through Python libraries such as `huggingface_hub` and `datasets`. You will install `datasets` below.

In addition to models, the Hub also hosts a variety of machine learning applications and demos created by the community, called "Spaces".

[Here is documentation](https://huggingface.co/docs/datasets/en/index) on `datasets` if you want to read more about its capabilities.

In [1]:
%pip install -q -U datasets

Note: you may need to restart the kernel to use updated packages.


# 💻 Generate a Hugging Face token

Generate a new token on Hugging Face using the instructions below. A token with "Read" permissions is sufficient for downloading datasets; "Write" permissions are only needed if you plan to upload to the Hub.

[How to generate a new Hugging Face token](https://huggingface.co/docs/hub/en/security-tokens)


# 💻 Add your Hugging Face token to your Google Colab Secrets

Use the sidebar on the left to add your Hugging Face token `HF_TOKEN` to your Google Colab secrets.

NOTE: The secrets are persisted for all future Colab sessions.

![Screenshot](https://drive.google.com/uc?export=view&id=10U0nesFSgXdCR4ywPk18mHERG47T_rRt)
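
Once the secret is stored, a notebook cell can read it. The sketch below is one possible way to do that, assuming the secret is named `HF_TOKEN` as above. The helper name `get_hf_token` is made up for illustration, and the fallback to an environment variable is an assumption for running outside Colab.

```python
import os


def get_hf_token(name="HF_TOKEN"):
    """Return the token from Colab secrets, or from the environment outside Colab."""
    try:
        # google.colab is only importable inside a Colab runtime
        from google.colab import userdata
    except ImportError:
        return os.environ.get(name)
    try:
        return userdata.get(name)
    except Exception:
        # Secret missing or notebook access not granted: fall back to the environment
        return os.environ.get(name)


token = get_hf_token()
print("Token found" if token else "Token not found")
```

Printing only whether the token was found, rather than the token itself, keeps the secret out of the notebook output.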

# 💻 Download an IMDb dataset from Hugging Face Hub

Use the `datasets` documentation to load the [ShubhamChoksi/IMDB_Movies](https://huggingface.co/datasets/ShubhamChoksi/IMDB_Movies) dataset.

[Datasets documentation](https://pypi.org/project/datasets/)

In [27]:
from datasets import load_dataset

dataset = load_dataset("ShubhamChoksi/IMDB_Movies")

# 💻 Store the IMDb dataset locally as a CSV file

We will be using [LangChain](https://www.langchain.com/) to build our RAG question answering system.

[LangChain](https://www.langchain.com/) is an open-source framework designed to simplify the development of applications powered by LLMs. It provides a set of tools, components, and interfaces that make it easier to build LLM-centric applications, allowing you to focus on the core functionality rather than the complexities of integrating language models.

To build our RAG system, we will first store the IMDb dataset in a local CSV file so that it is in a format we can pass to LangChain.


In [28]:
dataset_dict = dataset
dataset_dict["train"].to_csv("imdb.csv")

Creating CSV from Arrow format:   0%|          | 0/7 [00:00<?, ?ba/s]

10753100

# [Optional] 💻 Load the IMDb CSV data into a Pandas DataFrame


Pandas is a powerful open-source Python library for data manipulation and analysis.

It provides two main data structures:
* DataFrame
  * A two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns)
* Series
  * A one-dimensional labeled array capable of holding data of any data type.

The DataFrame is the primary and most widely used data structure in Pandas. It is similar to a spreadsheet or a SQL table, with rows and columns.

Each column in a DataFrame can have a different data type, making it a flexible and powerful tool for working with diverse data.

We don't **need** to use a DataFrame for this example because we can convert the `dataset` to a CSV using `to_csv`. However, you will likely encounter a DataFrame while processing data for LLM applications, so we want to briefly introduce it here.

[Pandas DataFrame Documentation](https://pandas.pydata.org/docs/reference/io.html)

In [11]:
import pandas as pd # the Colab runtime will already have this library installed - no need to `pip install`

movies_dataframe = pd.read_csv("imdb.csv")

print(movies_dataframe.head())

                   Name  rating No_of_ratings  user  critics  \
0          First Knight     6.0           77K   226     54.0   
1             First Man     7.3          198K  1.4K    496.0   
2  First Man into Space     5.4          1.7K    40     31.0   
3          First of May     6.8           454    13      3.0   
4      The First of May     6.8           454    13      3.0   

                                          Movie_Info  
0  Mel Gibson was attached to this project at one...  
1  Mark Armstrong and Rick Armstrong said that th...  
2  The pilot in the stock footage sequences is Ch...  
3  Charles Nelson Reilly survived the worst circu...  
4  Charles Nelson Reilly survived the worst circu...  
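
The printed rows show that `No_of_ratings` and `user` hold abbreviated strings such as `77K` and `1.4K` rather than numbers. If numeric comparisons are needed later, a small helper could expand them. This is a sketch only: the helper name `parse_count` is hypothetical, and it assumes that `K` and `M` are the only suffixes in the data, which has not been verified against the full dataset.

```python
from decimal import Decimal

# Assumed suffixes; the full dataset may contain others
MULTIPLIERS = {"K": 1_000, "M": 1_000_000}


def parse_count(text):
    """Convert strings such as '77K' or '1.4K' to integers."""
    text = text.strip().upper()
    if text and text[-1] in MULTIPLIERS:
        # Decimal avoids floating-point rounding for values like 1.4
        return int(Decimal(text[:-1]) * MULTIPLIERS[text[-1]])
    return int(Decimal(text))


for raw in ["77K", "1.4K", "454"]:
    print(raw, "->", parse_count(raw))
```

With the DataFrame above, the helper could be applied to a column with `movies_dataframe["No_of_ratings"].astype(str).map(parse_count)`, assuming the column has no missing values.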


# 💻 Install LangChain

In [3]:
%pip install -q -U langchain
%pip install -U langchain-community
%pip install python-dotenv

Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.
Collecting python-dotenv
  Downloading python_dotenv-1.0.1-py3-none-any.whl.metadata (23 kB)
Downloading python_dotenv-1.0.1-py3-none-any.whl (19 kB)
Installing collected packages: python-dotenv
Successfully installed python-dotenv-1.0.1
Note: you may need to restart the kernel to use updated packages.


[LangChain Document Loaders documentation](https://python.langchain.com/docs/integrations/document_loaders/)

In [12]:
# Document loaders live in the langchain-community package installed above
from langchain_community.document_loaders.csv_loader import CSVLoader
loader = CSVLoader(file_path="./imdb.csv")
data = loader.load()

len(data) # ensure we have actually loaded data into a format LangChain can recognize


6591
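
The retriever output later in this notebook suggests that each CSV row becomes one document whose text is a list of `column: value` lines, with the file path and row index kept as metadata. The standard-library sketch below imitates that shape for illustration. It is not LangChain's implementation, and `rows_to_documents` is a hypothetical helper.

```python
import csv
import io

SAMPLE_CSV = """Name,rating,No_of_ratings
First Knight,6.0,77K
First Man,7.3,198K
"""


def rows_to_documents(csv_text, source):
    """Turn each CSV row into a dict with 'page_content' and 'metadata'."""
    documents = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_index, row in enumerate(reader):
        # One "column: value" line per field, joined by newlines
        content = "\n".join(f"{column}: {value}" for column, value in row.items())
        documents.append(
            {"page_content": content, "metadata": {"source": source, "row": row_index}}
        )
    return documents


documents = rows_to_documents(SAMPLE_CSV, source="./imdb.csv")
print(len(documents))
print(documents[0]["page_content"])
```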

# 💻 Chunk the loaded data to improve retrieval performance

In a RAG system, the model needs to be able to quickly and accurately retrieve relevant information from a knowledge base or other data sources to assist in generating high-quality responses. However, working with large, unstructured datasets can be computationally expensive and time-consuming, especially during the retrieval process.

Chunking addresses this by splitting the data into smaller, overlapping pieces. The RAG system can then search these chunks and retrieve only the most relevant ones to include in the prompt, so the model does not have to process the entire dataset at once. The overlap helps preserve context that would otherwise be cut at a chunk boundary.
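
To illustrate what overlapping chunks look like, here is a minimal sketch in plain Python. It is a simplification: it cuts fixed character windows, whereas `RecursiveCharacterTextSplitter` also tries to split on separators such as paragraphs and sentences. The function name `chunk_text` and the sample review are made up for illustration.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split text into fixed-size pieces that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


review = "A visually striking film with a slow plot but a memorable ending."
for chunk in chunk_text(review, chunk_size=30, chunk_overlap=10):
    print(repr(chunk))
```

Each chunk starts `chunk_size - chunk_overlap` characters after the previous one, so the last `chunk_overlap` characters of one chunk reappear at the start of the next.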

[LangChain `RecursiveCharacterTextSplitter` documentation](https://sj-langchain.readthedocs.io/en/latest/text_splitter/langchain.text_splitter.RecursiveCharacterTextSplitter.html)


In [13]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Create a text splitter with 1000 character chunks and 100 character overlap
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)

# Split the data into chunks
chunked_documents = text_splitter.split_documents(data)

# Print the length of the chunked documents to ensure the data has been split
print(len(chunked_documents))

19077


# 💻 Build a way for your RAG system to understand the relationship between words and their meanings

You can think of embeddings as secret codes that capture the essence of words, allowing your system to understand their true meanings and relationships. This semantic understanding is crucial for the system to provide precise and contextually relevant answers.

Embeddings are represented as dense, continuous vectors in a high-dimensional space and serve as a semantic map that guides your RAG system to the most relevant answers.

A good analogy is a compass: embeddings help your system navigate through a vast sea of information, delivering accurate and contextual responses to user queries.
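
One common way to compare two embeddings is cosine similarity, which measures how closely two vectors point in the same direction. The sketch below uses made-up three-dimensional vectors purely for illustration; real embedding models produce vectors with hundreds or thousands of dimensions, and the values here do not come from any model.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors: the first two are meant to be related, the third unrelated
toy_embeddings = {
    "spaceship": [0.9, 0.1, 0.0],
    "rocket": [0.8, 0.2, 0.1],
    "romance": [0.0, 0.2, 0.9],
}

print(cosine_similarity(toy_embeddings["spaceship"], toy_embeddings["rocket"]))
print(cosine_similarity(toy_embeddings["spaceship"], toy_embeddings["romance"]))
```

The first score is much closer to 1 than the second, which is how a system can tell that "spaceship" is nearer in meaning to "rocket" than to "romance".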

# 💻 Use OpenAI embeddings to create a vector store

The first step in creating a vector store is to create embeddings from the data that you want the RAG system to be able to retrieve.

This is done using an embedding model, which transforms text data into a high-dimensional vector representation. Each piece of text (such as a document, paragraph, or sentence) is converted into a vector that captures its semantic meaning.

For this exercise, we will use OpenAI's embedding model.

In [14]:
%pip install -q -U langchain-openai

Note: you may need to restart the kernel to use updated packages.


[LangChain `OpenAIEmbeddings` documentation](https://python.langchain.com/docs/integrations/text_embedding/openai/)

In [16]:
import os
from dotenv import load_dotenv
from langchain_openai import OpenAIEmbeddings

# Load the environment variables from the .env file
load_dotenv()

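# Get the OpenAI API key from the environment variables
openai_api_key = os.getenv("OPENAI_API_KEY")

# Fail early with a clear message if the key is missing
if not openai_api_key:
    raise ValueError("OPENAI_API_KEY is not set; add it to your .env file or environment")

# Set the OpenAI API key in the environment variables
os.environ["OPENAI_API_KEY"] = openai_api_key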

# Initialize the embedding model
embedding_model = OpenAIEmbeddings(model="text-embedding-3-large")

# 💻 Create the embedder

We will create our embedder using the `CacheBackedEmbeddings` class.

This class is designed to optimize the process of generating embeddings by caching the results of expensive embedding computations.

This caching mechanism prevents the need to recompute embeddings for the same text multiple times, which can be computationally expensive and time-consuming.
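
The idea behind a cache-backed embedder can be sketched in a few lines of plain Python. This is an illustration of the caching principle only, not how `CacheBackedEmbeddings` is implemented; the class `CachedEmbedder` and the stand-in function `fake_embedding` are hypothetical.

```python
import hashlib


class CachedEmbedder:
    """Wrap an embedding function and reuse results for texts seen before."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn
        self.cache = {}
        self.calls = 0  # how many times the underlying function actually ran

    def embed(self, text):
        # Hash the text so that the cache key has a fixed size
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self.cache:
            self.calls += 1
            self.cache[key] = self.embed_fn(text)
        return self.cache[key]


def fake_embedding(text):
    # Stand-in for a real (slow, paid) embedding call
    return [len(text), text.count(" ")]


embedder_sketch = CachedEmbedder(fake_embedding)
embedder_sketch.embed("First Man")
embedder_sketch.embed("First Man")  # served from the cache
print(embedder_sketch.calls)
```

The second call returns the stored vector without invoking the underlying function again, which is the saving the cache is meant to provide.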

[LangChain walkthrough of caching embeddings](https://python.langchain.com/docs/modules/data_connection/text_embedding/caching_embeddings/)

[`CacheBackedEmbeddings` documentation](https://api.python.langchain.com/en/latest/embeddings/langchain.embeddings.cache.CacheBackedEmbeddings.html)

In [17]:
from langchain.embeddings import CacheBackedEmbeddings
from langchain.storage import LocalFileStore

store = LocalFileStore("./cache/")
underlying_embeddings = OpenAIEmbeddings()
embedder = CacheBackedEmbeddings.from_bytes_store(
    underlying_embeddings, store, namespace=underlying_embeddings.model
)

# 💻 Create vector store using Facebook AI Similarity Search (FAISS)

[FAISS](https://ai.meta.com/tools/faiss/) is specifically designed for efficient similarity search in large datasets of high-dimensional vectors.

By using vector embeddings and storing them in a FAISS index, you can significantly reduce the computational cost associated with real-time embedding generation and similarity calculations.

Retrieval from a FAISS index is much faster than linear search across high-dimensional vectors, speeding up the response time of the system.

You may have also noticed that we cache embeddings using [`LocalFileStore`](https://python.langchain.com/docs/integrations/stores/file_system/). In the code below, we also save the vector store itself to disk with `save_local`.

Saving the vector store locally ensures that the embeddings are persistent across sessions, reducing the need to recompute embeddings and rebuild the index each time the system is used.

[LangChain FAISS documentation](https://python.langchain.com/docs/integrations/vectorstores/faiss/)

In [18]:
%pip install -q faiss-cpu tiktoken

Note: you may need to restart the kernel to use updated packages.


In [19]:
from langchain_community.vectorstores import FAISS

# Create the vector store: embed every chunk and index the vectors with FAISS
vector_store = FAISS.from_documents(chunked_documents, embedder)

# Save the vector store locally so it can be reused in later sessions
vector_store.save_local("vector_store")

# 💻 Ask your RAG system a question!

Now that we have the embeddings for our chunked IMDb data saved locally in our vector store, we are ready to ask it a question.

To accomplish this task, we first transform a question like "What are some good sci-fi movies from the 1980s?" into a vector representation using our embedding model.

After that, we perform a similarity search to grab the relevant documents from our vector store.
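
Conceptually, a similarity search ranks stored vectors by their distance to the query vector and returns the closest few. The brute-force sketch below shows that idea with made-up two-dimensional vectors and Euclidean distance. The choice of metric, the helper name `top_k`, and the sample entries are assumptions for illustration; a real index such as FAISS works on the actual embeddings and uses optimized routines.

```python
import heapq
import math


def top_k(query_vector, store, k=2):
    """Return the k (distance, text) pairs closest to query_vector."""
    scored = (
        (math.dist(query_vector, vector), text) for text, vector in store.items()
    )
    return heapq.nsmallest(k, scored)


# Made-up 2-dimensional vectors standing in for real embeddings
toy_store = {
    "Blade Runner review": [0.9, 0.1],
    "Alien review": [0.8, 0.3],
    "Notting Hill review": [0.1, 0.9],
}
query_vector = [1.0, 0.0]  # pretend embedding of a sci-fi question

for distance, text in top_k(query_vector, toy_store):
    print(f"{distance:.3f}  {text}")
```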

[LangChain Text Embedding Models documentation](https://python.langchain.com/docs/modules/data_connection/text_embedding/)

[LangChain Vector Store documentation](https://python.langchain.com/docs/modules/data_connection/vectorstores/)

In [20]:
query = "What are some good sci-fi movies from the 1980s?"

# Embed the query with the same embedder that was used to build the vector store
embedded_query = embedder.embed_query(query)

# Find the chunks whose embeddings are most similar to the query embedding
similar_documents = vector_store.similarity_search_by_vector(embedded_query)

# Print the text of each retrieved chunk
for page in similar_documents:
    print(page.page_content)

'realistic' images from 1950s' speculative magazines (fictional and 'factual') but are neither as novel nor as effective the contemporaneous Czech sci-fi film 'Ikarie XB-1' (1963) (with which 'Andromeda Nebula' shares a number of tropes). Watching as I did, most of the back-story (which occurs in a futuristic 'socialist' world in which Earth is part of an enlightened galactic alliance) was lost on me but again the images were interesting (a mix of Soviet-style monumental architecture and neoclassical 'future-tropes'). The film is based on a novel by Ivan Yefremov and was originally intended to the first in a film series. Worth watching for the imagery but unless you speak Russian, I'd suggest investing in a DVD, waiting for a subtitled version to show up on-line, or skipping to the 'special effects' sequences.,
Compare this movie to the 1956 movie Forbidden Planet, and think about which one gives you a better 'futuristic' portrayal of how mankind has advanced in 'the future'. Even allo

# 💻 Combine the retrieved data with the output of the LLM using [`Runnable`](https://python.langchain.com/docs/expression_language/interface/) interface

To understand the Runnable interface in [LangChain](https://python.langchain.com/docs/expression_language/interface/), let's use the analogy of a kitchen staff in a restaurant.

In a kitchen, you have different chefs who specialize in various tasks: there's a pastry chef, a grill chef, a sauce chef, and so on.

Each chef is responsible for preparing a specific part of the meal, and they must work in a certain order to ensure the dish comes out correctly.

The head chef oversees the process, ensuring that each part is ready at the right time and that everything comes together in the end.

In this analogy:
* Each chef represents a component in LangChain that implements the Runnable interface.
* The dish being prepared is the final output from the LangChain system, such as an answer to a user's question.
* The head chef is like the LangChain framework, which coordinates the execution of each Runnable component.

## Preparing a Multi-Course Meal

We are using LangChain to build a system that answers questions about movies.

We will have a sequence of Runnables that:

1. Retrieves documents related to the query (like finding the right ingredients)
2. Parses the documents to extract relevant information (like prepping the ingredients)
3. Generates a response based on the information (like cooking the ingredients to create a dish)
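
The three steps above can be sketched in plain Python to show how components chain together with the `|` operator. This is a toy illustration of the composition idea, not LangChain's implementation: the `Step` class is hypothetical, and the retriever and model are replaced by stubs that return fixed text.

```python
class Step:
    """A tiny stand-in for a runnable: wraps a function and supports `|`."""

    def __init__(self, func):
        self.func = func

    def invoke(self, value):
        return self.func(value)

    def __or__(self, other):
        # step_a | step_b -> a new step that runs step_a, then step_b
        return Step(lambda value: other.invoke(self.invoke(value)))


# Stub components; a real system would call a retriever and an LLM here
retrieve = Step(
    lambda question: {"question": question, "docs": ["Name: Bats", "Name: The Bat People"]}
)
prepare = Step(
    lambda state: f"Context: {'; '.join(state['docs'])}\nQuestion: {state['question']}"
)
generate = Step(lambda prompt: f"(model answer based on)\n{prompt}")

pipeline = retrieve | prepare | generate
print(pipeline.invoke("Which movies feature bats?"))
```

Each step receives the output of the previous one, so the order in which the steps are piped together is the order in which they run.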


# 💻 Create the components (chefs)

In [21]:
%pip install -q langchain_openai

Note: you may need to restart the kernel to use updated packages.


[LangChain ChatPromptTemplate quick reference](https://python.langchain.com/docs/modules/model_io/prompts/quick_start/)

[LangChain `VectorStoreRetriever` documentation](https://python.langchain.com/docs/modules/data_connection/retrievers/vectorstore/)

[LangChain `ChatOpenAI` documentation](https://python.langchain.com/docs/integrations/chat/openai/)

[LangChain `StrOutputParser` documentation](https://api.python.langchain.com/en/latest/output_parsers/langchain_core.output_parsers.string.StrOutputParser.html)

In [22]:
from langchain_core.runnables.base import RunnableSequence
from langchain_core.runnables.passthrough import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

# Create the components (chefs)

# Prompt template: note that it only takes the user's input and does not yet
# include any documents from the retriever
prompt_template = ChatPromptTemplate.from_template(
    "What are good movies for people who like {user_input}"
)

# Preview the formatted prompt
print(prompt_template.invoke({"user_input": "bats"}))

# Retriever: returns the chunks most similar to a query
retriever = vector_store.as_retriever()
print(retriever.invoke("What is a good movie for someone who likes bats"))

# Chat model (LLM); temperature=0 reduces randomness in the output
chat_model = ChatOpenAI(model="gpt-3.5-turbo-0125", temperature=0)

parser = StrOutputParser()


messages=[HumanMessage(content='What are good movies for people who like bats')]
[Document(page_content='Name: Bats in the Belfry\nrating: 5.5\nNo_of_ratings: 176\nuser: 4\ncritics: \nMovie_Info:', metadata={'source': './imdb.csv', 'row': 2210}), Document(page_content='Name: Bats\nrating: 4.0\nNo_of_ratings: 11K\nuser: 177\ncritics: 60.0', metadata={'source': './imdb.csv', 'row': 2209}), Document(page_content='"The Batman vs. Dracula" is a DC animated movie that is well-worth the time, money and effort. This is without a doubt the best animated superhero movie I had seen, and it was quite a pleasant surprise in terms of entertainment.\n\nMy rating of "The Batman vs. Dracula" lands on an eight our of ten stars.,', metadata={'source': './imdb.csv', 'row': 2197}), Document(page_content='Name: The Bat People\nrating: 2.8\nNo_of_ratings: 2.6K\nuser: 55\ncritics: 26.0\nMovie_Info: The first feature film for makeup artist Stan Winston.,Dr. Beck\'s field changes from caves, to bats, to "preven
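
The prompt template above takes only `user_input`, so the retrieved documents never reach the model. For retrieval-augmented answers, the document text would also need to be placed in the prompt. The plain-Python sketch below shows one way such a prompt could be assembled. The template wording, the `---` separator, and the helper `build_prompt` are assumptions for illustration, and the sample documents are shortened from the retriever output above.

```python
PROMPT_WITH_CONTEXT = (
    "Answer the question using only the movie data below.\n\n"
    "Movie data:\n{context}\n\n"
    "Question: {question}"
)


def build_prompt(question, documents):
    """Join the retrieved document texts and place them in the prompt."""
    context = "\n---\n".join(documents)
    return PROMPT_WITH_CONTEXT.format(context=context, question=question)


retrieved = [
    "Name: Bats\nrating: 4.0",
    "Name: The Bat People\nrating: 2.8",
]
print(build_prompt("What is a good movie for someone who likes bats?", retrieved))
```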

# 💻 Create the sequence (recipe)

[LangChain Runnable interface documentation](https://python.langchain.com/docs/expression_language/interface/)

In [23]:

runnable_chain = (
    prompt_template
    | chat_model
    | parser
)


# 💻 Execute the sequence (prepare the meal)

In [24]:
# Synchronous execution
output = runnable_chain.invoke({"user_input": "bats"})
# The StrOutputParser at the end of the chain returns a plain string
print(output)

1. Batman Begins (2005) - A reboot of the Batman franchise that explores the origins of the Dark Knight.

2. The Dark Knight (2008) - The sequel to Batman Begins, this film features an iconic performance by Heath Ledger as the Joker.

3. Dracula (1992) - A classic vampire film that features bats prominently in its imagery.

4. Interview with the Vampire (1994) - A gothic horror film that follows the story of a vampire named Louis, played by Brad Pitt.

5. Bram Stoker's Dracula (1992) - A visually stunning adaptation of the classic vampire novel, featuring bats as a recurring motif.

6. The Lost Boys (1987) - A cult classic vampire film that follows a group of teenage vampires in a California beach town.

7. Nosferatu (1922) - A silent film classic that features a vampire character inspired by Dracula.

8. Bats (1999) - A horror film about genetically engineered bats that terrorize a small Texas town.

9. The Batman (2022) - An upcoming film starring Robert Pattinson as the Caped Crusad

In [26]:
# Asynchronous execution (e.g., for a better chatbot user experience)
import asyncio

async def main():
    # astream yields the response piece by piece as the model generates it
    output_stream = runnable_chain.astream({"user_input": "bats"})

    async for chunk in output_stream:
        # end="" keeps consecutive chunks on the same line
        print(chunk, end="", flush=True)

# Top-level await works in a notebook; in a script, use asyncio.run(main())
await main()


1. Batman Begins (2005) - A reboot of the Batman franchise that explores the origins of the iconic superhero.

2. The Dark Knight (2008) - The sequel to Batman Begins, this film features an iconic performance by Heath Ledger as the Joker.

3. Dracula (1931) - A classic horror film featuring the iconic vampire Count Dracula.

4. Interview with the Vampire (1994) - A film adaptation of Anne Rice's novel about a vampire who tells his life story to a journalist.

5. The Lost Boys (1987) - A cult classic film about a group of teenage vampires in a California beach town.

6. Blade (1998) - A superhero film about a half-vampire, half-human who hunts vampires.

7. Let the Right One In (2008) - A Swedish horror film about a young boy who befriends a vampire girl.

8. Nosferatu (1922) - A silent film ada

# 🎉 Congratulations on Completing Your Project!

You have successfully built a Retrieval Augmented Generation (RAG) question answering system for movie recommendations using the IMDb dataset.

This system allows users to interactively ask questions about movies, and it retrieves relevant information to provide insightful answers.

Great job on reaching this milestone!

From here, we can move on to our [Week 2: Introduction to Retrieval-Augmented Generation (RAG) Experiment Notebook](https://colab.research.google.com/drive/1pyzWvYDCKEmORN_yIsKwki6BQ5I9kXWK#scrollTo=Grd0XoS-tLgs).

# 📝 Submission

Submit your experiment notebook for Week 2 using the form [here](https://docs.google.com/forms/d/1l935d2L3YN3Kj3ovNf3CKWB2EyxvDMkYY_sYte-NYWI/edit).

Please make sure sharing permissions are turned on for everyone with the link.
