## QA Tutorial LangChain - OpenAI

This tutorial uses the LangChain framework to build a question-answering (QA) pipeline that combines a Qdrant vector store with OpenAI embeddings and chat models.

This tutorial requires:
- pip install 'langchain[all]'
- pip install qdrant-client

In [1]:
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Qdrant
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.llms import OpenAI
from langchain.chains import RetrievalQA

In [2]:
import os 
import pandas as pd

In [3]:
# Read the OpenAI API key from the environment
openai_api_key = os.getenv("OPENAI_API_KEY")
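
`os.getenv` returns `None` when the variable is unset, so a missing key goes unnoticed in this cell and typically only shows up later, when the OpenAI classes are used. A minimal sketch of an earlier check, assuming a hypothetical helper named `require_env` (it is not part of LangChain or of the original notebook):

```python
import os


def require_env(name: str) -> str:
    """Return the value of the environment variable `name`.

    Hypothetical helper for illustration: raises instead of
    silently returning None when the variable is missing or empty.
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value
```

With this helper, the assignment above could be written as `openai_api_key = require_env("OPENAI_API_KEY")`.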

### Steps to create movies dataset

In [4]:
# Create the dataset
df = pd.read_csv("/Users/KT71KC/Trainings_and_trials/generative_ai/movies.csv")

df['release_date'] = pd.to_datetime(df['release_date'])

# Keep only movies released on or after Jan 1, 2022, because ChatGPT's training data ends in Sep 2021.
# Also keep only movies whose status is 'Released'.
df_filtered = df[df['release_date']>='2022-01-01']
df_filtered = df_filtered[df_filtered['status']=='Released']

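# Replace the hyphen separators in credits and genres with commas
# (note: this also splits hyphenated names, e.g. "Anya Taylor-Joy" becomes "Anya Taylor, Joy")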
df_filtered['credits'] = df_filtered['credits'].str.replace("-",", ")
df_filtered['genres'] = df_filtered['genres'].str.replace("-",", ")

# Create a context for the prompt from the dataset
df_filtered['context'] = df_filtered.apply(lambda row: f"The title of the movie is {row['title']}. The genre of this movie is {row['genres']}. This movie is released on {row['release_date']}. The budget of the movie is ${row['budget']/1000000} million and total revenue of the movie is ${row['revenue']/1000000} million. The plot of the movie is {row['overview']}. Average vote of the movie is {row['vote_average']} out of 10 in {row['vote_count']} votes.", axis =1) 

df_filtered.shape

(12573, 21)
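
To make the shape of the `context` column concrete, the sketch below applies the same template to a single record held in a plain dictionary. The function name `build_context` and the variable `example_row` are introduced here for illustration only; the values are copied from the "Knock at the Cabin" row of the `df_filtered.head()` output, and the overview is left truncated exactly as it appears there.

```python
def build_context(row: dict) -> str:
    """Format one movie record with the same template as the lambda in the cell above."""
    return (
        f"The title of the movie is {row['title']}. "
        f"The genre of this movie is {row['genres']}. "
        f"This movie is released on {row['release_date']}. "
        f"The budget of the movie is ${row['budget'] / 1000000} million "
        f"and total revenue of the movie is ${row['revenue'] / 1000000} million. "
        f"The plot of the movie is {row['overview']}. "
        f"Average vote of the movie is {row['vote_average']} out of 10 "
        f"in {row['vote_count']} votes."
    )


# Illustrative record; in the notebook release_date is a pandas Timestamp,
# a plain string is used here to keep the sketch free of dependencies.
example_row = {
    "title": "Knock at the Cabin",
    "genres": "Horror, Mystery, Thriller",
    "release_date": "2023-02-01",
    "budget": 20000000.0,
    "revenue": 52000000.0,
    "overview": "While vacationing at a remote cabin a young gi...",
    "vote_average": 6.457,
    "vote_count": 888.0,
}

context = build_context(example_row)
print(context)
```

Dividing by 1000000 is what turns the raw revenue of 52000000.0 into the "$52.0 million" wording that the model repeats in its answers.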

In [5]:
df_filtered.head()

Unnamed: 0,id,title,genres,original_language,overview,popularity,production_companies,release_date,budget,revenue,...,status,tagline,vote_average,vote_count,credits,keywords,poster_path,backdrop_path,recommendations,context
0,76600,Avatar: The Way of Water,"Science Fiction, Adventure, Action",en,Set more than a decade after the events of the...,9366.788,20th Century Studios-Lightstorm Entertainment,2022-12-14,350000000.0,2312336000.0,...,Released,Return to Pandora.,7.751,6748.0,"Sam Worthington, Zoe Saldaña, Sigourney Weaver...",loss of loved one-dying and death-alien life-f...,/t6HIqrRAclMCA60NsSmeqe9RmNV.jpg,/s16H6tpK2utvwDtzZ8Qy4qm5Emw.jpg,183392-111332-702432-505642-1064215-436270-874...,The title of the movie is Avatar: The Way of W...
1,502356,The Super Mario Bros. Movie,"Animation, Adventure, Family, Fantasy, Comedy",en,While working underground to fix a water main ...,5132.098,Universal Pictures-Illumination-Nintendo,2023-04-05,100000000.0,58000000.0,...,Released,,7.556,332.0,"Chris Pratt, Anya Taylor, Joy, Charlie Day, Ja...",video game-plumber-magic mushroom-based on vid...,/qNBAXBIQlnOThrVvA6mA2B5ggV6.jpg,/iw0Na1UBHgA5BgifwmQ8vKhlWgA.jpg,,The title of the movie is The Super Mario Bros...
2,640146,Ant-Man and the Wasp: Quantumania,"Action, Adventure, Science Fiction",en,Super-Hero partners Scott Lang and Hope van Dy...,4704.903,Marvel Studios-Kevin Feige Productions,2023-02-15,200000000.0,473237900.0,...,Released,Witness the beginning of a new dynasty.,6.448,1547.0,"Paul Rudd, Evangeline Lilly, Jonathan Majors, ...",hero-ant-sequel-superhero-based on comic-famil...,/ngl2FKBlU4fhbdsrtdom9LVLBXw.jpg,/3CxUndGhUcZdt1Zggjdb2HkLLQX.jpg,965839-734048-267805-1035806-823999-842942-772...,The title of the movie is Ant-Man and the Wasp...
3,677179,Creed III,"Drama, Action",en,After dominating the boxing world Adonis Creed...,3994.342,Metro-Goldwyn-Mayer-Proximity Media-Balboa Pro...,2023-03-01,75000000.0,269000000.0,...,Released,You can't run from your past.,7.262,1129.0,"Michael B. Jordan, Tessa Thompson, Jonathan Ma...",philadelphia pennsylvania-husband wife relatio...,/cvsXj3I9Q2iyyIo95AecSd1tad7.jpg,/5i6SjyDbDWqyun8klUuCxrlFbyw.jpg,965839-267805-943822-842942-1035806-823999-107...,The title of the movie is Creed III. The genre...
4,631842,Knock at the Cabin,"Horror, Mystery, Thriller",en,While vacationing at a remote cabin a young gi...,3422.537,Blinding Edge Pictures-Universal Pictures-Film...,2023-02-01,20000000.0,52000000.0,...,Released,Save your family or save humanity. Make the ch...,6.457,888.0,"Dave Bautista, Jonathan Groff, Ben Aldridge, K...",based on novel or book-sacrifice-cabin-faith-e...,/dm06L9pxDOL9jNSK4Cb6y139rrG.jpg,/zWDMQX0sPaW2u0N2pJaYA8bVVaJ.jpg,1058949-646389-772515-505642-143970-667216-104...,The title of the movie is Knock at the Cabin. ...


In [6]:
df_filtered.columns

Index(['id', 'title', 'genres', 'original_language', 'overview', 'popularity',
       'production_companies', 'release_date', 'budget', 'revenue', 'runtime',
       'status', 'tagline', 'vote_average', 'vote_count', 'credits',
       'keywords', 'poster_path', 'backdrop_path', 'recommendations',
       'context'],
      dtype='object')

In [7]:
df_test = df_filtered[['title','genres','overview','revenue','vote_average', 'vote_count', 'credits', 'context']]
df_test.to_csv('movies_dataset.csv')  # the index is written as well, so the file has 9 columns when reloaded

In [8]:
df_test.shape

(12573, 8)

### Forming QA System

In [5]:
# Load the dataset
df = pd.read_csv("movies_dataset.csv")
df.shape

(12573, 9)

In [6]:
# Wrap each row as a LangChain Document: 'context' becomes the page content, the other columns become metadata
from langchain.document_loaders import DataFrameLoader
loader = DataFrameLoader(df, page_content_column='context')
data = loader.load()

In [8]:
# Split the documents into chunks of at most 1000 tokens before loading them into the vector store
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(data)
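
`from_tiktoken_encoder` measures chunk length in tokens rather than characters, and `chunk_overlap=0` means consecutive chunks share no tokens. The sketch below illustrates only the roles of `chunk_size` and `chunk_overlap` with a simplified sliding window. It is not the recursive, separator-aware algorithm that `RecursiveCharacterTextSplitter` implements, and the function `chunk_tokens` is a hypothetical name; any list of items can stand in for real tiktoken tokens.

```python
from typing import List


def chunk_tokens(tokens: List[str], chunk_size: int, chunk_overlap: int = 0) -> List[List[str]]:
    """Split a token list into windows of at most chunk_size tokens.

    Simplified illustration: consecutive windows share chunk_overlap tokens.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the current window already reaches the end of the input.
        if start + chunk_size >= len(tokens):
            break
    return chunks


letters = list("abcdefghij")
no_overlap = chunk_tokens(letters, chunk_size=4, chunk_overlap=0)
with_overlap = chunk_tokens(letters, chunk_size=4, chunk_overlap=1)
```

An input shorter than `chunk_size` comes back as a single chunk, which is what happens to any context string in this dataset that fits within 1000 tokens.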

In [10]:

# Embed the chunks with OpenAI and store the vectors in an in-memory Qdrant collection
embeddings = OpenAIEmbeddings(openai_api_key=openai_api_key)
docsearch = Qdrant.from_documents(texts, 
                                  embeddings,
                                  location = ":memory:", # Local mode with in-memory storage only
                                  collection_name="my_documents")
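
Conceptually, the vector store keeps one embedding vector per chunk and answers a query by returning the chunks whose vectors are closest to the query vector. The toy sketch below shows that idea with hand-made 3-dimensional vectors and cosine similarity. The vectors, the labels, and the functions `cosine_similarity` and `top_k` are all assumptions made for illustration; real OpenAI embeddings have far more dimensions, and the metric Qdrant actually applies depends on how the collection is configured.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: List[float], store: Dict[str, List[float]], k: int) -> List[Tuple[str, float]]:
    """Return the k stored items most similar to the query, best first."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in store.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up vectors standing in for embedded chunks.
toy_store = {
    "horror chunk": [0.9, 0.1, 0.0],
    "comedy chunk": [0.0, 0.2, 0.9],
    "thriller chunk": [0.8, 0.3, 0.1],
}
toy_query = [1.0, 0.2, 0.0]

best = top_k(toy_query, toy_store, k=2)
print([name for name, _ in best])
```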

In [11]:
from langchain.chat_models import ChatOpenAI

qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model_name='gpt-3.5-turbo'),
    chain_type="stuff",
    retriever=docsearch.as_retriever(),
    return_source_documents=True,
)
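
The `"stuff"` chain type places all retrieved documents into a single prompt and sends that prompt to the chat model in one call. A rough sketch of the idea follows. The function `build_stuff_prompt` and the template wording are hypothetical and do not reproduce LangChain's actual prompt.

```python
from typing import List


def build_stuff_prompt(question: str, documents: List[str]) -> str:
    """Concatenate the retrieved texts and the question into one prompt string."""
    context = "\n\n".join(documents)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Placeholder texts standing in for retrieved chunks.
retrieved = [
    "The title of the movie is Knock at the Cabin. ...",
    "The title of the movie is Creed III. ...",
]
prompt = build_stuff_prompt("What is the genre of Knock at the Cabin?", retrieved)
print(prompt)
```

Because everything goes into one prompt, this approach only works while the retrieved chunks together fit in the model's context window.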

In [12]:
query = """What is the plot of the movie Knock at the Cabin 
and what was the revenue generated from this movie? 
What is the genre of this movie and accordingly, can you recommend me 3 other popular movies in the same genre by adding one sentence summary of plots of these movies? """
result = qa({"query":query})

In [13]:
result["result"]

"The plot of the movie Knock at the Cabin is about a family consisting of a young girl and her two fathers who are taken hostage by four armed strangers while vacationing at a remote cabin. The strangers demand that the family make an unthinkable choice to avert the apocalypse, and the family must decide what they believe before all is lost. The total revenue generated from this movie is $52.0 million.\n\nThe genre of this movie is Horror, Mystery, Thriller. Based on this genre, I can recommend the following three popular movies:\n\n1. Get Out (2017) - A young African-American man visits his white girlfriend's family estate, only to become ensnared in a more sinister real reason for the invitation.\n2. A Quiet Place (2018) - In a post-apocalyptic world, a family is forced to live in silence while hiding from monsters with ultra-sensitive hearing.\n3. The Silence of the Lambs (1991) - A young FBI cadet must confide in an incarcerated and manipulative killer to receive his help on catchi

In [14]:
query = """What is the plot of the movie Knock at the Cabin 
and what was the revenue generated from this movie? 
"""
result = qa({"query":query})

In [15]:
result["result"]

'The plot of the movie Knock at the Cabin is about a young girl and her two fathers who are taken hostage by four armed strangers while vacationing at a remote cabin. The kidnappers demand that the family make an unthinkable choice to avert the apocalypse. With limited access to the outside world, the family must decide what they believe before all is lost.\n\nThe total revenue generated from the movie Knock at the Cabin is $52.0 million.'
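
Because the chain was created with `return_source_documents = True`, the result dictionary also carries the retrieved documents under the `"source_documents"` key, and since `DataFrameLoader` stores the non-content columns as metadata, each document's `metadata` should include the movie `title`. The sketch below shows one way to list those titles. `FakeDocument` is a stand-in class so the example runs without LangChain or an API call, and `source_titles` is a hypothetical helper name.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class FakeDocument:
    """Stand-in exposing the two attributes used here: page_content and metadata."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def source_titles(result: Dict[str, Any]) -> List[str]:
    """Collect the 'title' metadata of every retrieved source document."""
    return [
        doc.metadata.get("title", "<unknown>")
        for doc in result.get("source_documents", [])
    ]


# Hand-built result mimicking the structure returned by the chain.
fake_result = {
    "result": "The total revenue generated from the movie is $52.0 million.",
    "source_documents": [
        FakeDocument("The title of the movie is Knock at the Cabin. ...",
                     {"title": "Knock at the Cabin"}),
        FakeDocument("A chunk without title metadata."),
    ],
}

titles = source_titles(fake_result)
print(titles)
```

In the notebook, calling `source_titles(result)` on the real result is a quick way to check which movies the answer was grounded in.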