## QA Tutorial LangChain - OpenAI

This tutorial uses the LangChain framework to build a question-answering (QA) pipeline that combines ChromaDB with OpenAI embeddings and chat models.

This tutorial requires:
- pip install 'langchain[all]'
- pip install chromadb

In [None]:
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.llms import OpenAI
from langchain.chains import RetrievalQA

In [None]:
import os 
import pandas as pd

In [None]:
# Read the OpenAI API key from the environment (the LangChain OpenAI classes also read OPENAI_API_KEY themselves)
openai_api_key = os.getenv("OPENAI_API_KEY")
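
`os.getenv` returns `None` when the variable is unset, so a missing key only surfaces later, at the first OpenAI call. A minimal sketch of an earlier check follows; `require_env` is a hypothetical helper introduced here for illustration, not part of LangChain or of this notebook.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if it is missing or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Usage (assumes the key has been exported in the shell that started the notebook):
# openai_api_key = require_env("OPENAI_API_KEY")
```

Failing at this point gives a clearer error than a failed request further down the notebook.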

### Steps to create movies dataset

In [None]:
# Create the dataset
df = pd.read_csv("/Users/KT71KC/Trainings_and_trials/generative_ai/movies.csv")

df['release_date'] = pd.to_datetime(df['release_date'])

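# Keep only movies released from January 2022 onward, because ChatGPT's training data ends in September 2021.
# Also keep only movies whose status is 'Released'.
df_filtered = df[df['release_date'] >= '2022-01-01']
# .copy() gives an independent frame, so the column assignments below do not act on a slice of df
df_filtered = df_filtered[df_filtered['status'] == 'Released'].copy()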

# Some preprocessing
df_filtered['credits'] = df_filtered['credits'].str.replace("-",", ")
df_filtered['genres'] = df_filtered['genres'].str.replace("-",", ")

# Create a context for the prompt from the dataset
df_filtered['context'] = df_filtered.apply(
    lambda row: (
        f"The title of the movie is {row['title']}. "
        f"The genre of this movie is {row['genres']}. "
        f"This movie is released on {row['release_date']}. "
        f"The budget of the movie is ${row['budget']/1000000} million "
        f"and total revenue of the movie is ${row['revenue']/1000000} million. "
        f"The plot of the movie is {row['overview']}. "
        f"Average vote of the movie is {row['vote_average']} out of 10 in {row['vote_count']} votes."
    ),
    axis=1,
)

df_filtered.shape
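
To make the shape of the `context` column concrete, the sketch below applies the same sentence template to a single record in plain Python. The function name `build_context` and every value in `example_row` are invented for illustration and do not come from `movies.csv`.

```python
def build_context(row: dict) -> str:
    """Format one movie record with the sentence template used for the 'context' column."""
    return (
        f"The title of the movie is {row['title']}. "
        f"The genre of this movie is {row['genres']}. "
        f"This movie is released on {row['release_date']}. "
        f"The budget of the movie is ${row['budget']/1000000} million "
        f"and total revenue of the movie is ${row['revenue']/1000000} million. "
        f"The plot of the movie is {row['overview']}. "
        f"Average vote of the movie is {row['vote_average']} out of 10 in {row['vote_count']} votes."
    )


# Hypothetical record, made up purely for illustration.
example_row = {
    "title": "Example Movie",
    "genres": "Drama, Thriller",
    "release_date": "2022-03-01",
    "budget": 20_000_000,
    "revenue": 55_000_000,
    "overview": "A placeholder plot",
    "vote_average": 6.5,
    "vote_count": 1200,
}

print(build_context(example_row))
```

In the notebook itself `release_date` is a pandas timestamp rather than a string, so the rendered date also carries a time component (for example `2022-03-01 00:00:00`).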

In [None]:
df_filtered.head()

In [None]:
df_filtered.columns

In [None]:
df_test = df_filtered[['title','genres','overview','revenue','vote_average', 'vote_count', 'credits', 'context']]
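# Write without the index so that reloading the file does not add an "Unnamed: 0" column
df_test.to_csv('movies_dataset.csv', index=False)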

In [None]:
df_test.shape

### Forming QA System

In [None]:
# Load the dataset
df = pd.read_csv("movies_dataset.csv")
df.shape

In [None]:
# Wrap each row as a LangChain Document: 'context' becomes the page content, the other columns become metadata
from langchain.document_loaders import DataFrameLoader
loader = DataFrameLoader(df, page_content_column='context')
data = loader.load()

In [None]:
# Split the documents into chunks of at most 1000 tokens (no overlap) before loading them into the vector store
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(data)

In [None]:

# Compute the embeddings and store them in a Chroma vector store
embeddings = OpenAIEmbeddings()
docsearch = Chroma.from_documents(texts, embeddings)
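
The idea behind a vector store is that the query is embedded into the same space as the documents and the nearest documents are returned. The toy sketch below illustrates that idea in plain Python, with cosine similarity as an assumed metric. The three-dimensional vectors and the names `cosine_similarity`, `top_k` and `toy_store` are made up for illustration; real embeddings have far more dimensions, and Chroma's actual index and distance metric may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=2):
    """Return the ids of the k stored vectors most similar to the query, best first."""
    ranked = sorted(
        store,
        key=lambda doc_id: cosine_similarity(query_vec, store[doc_id]),
        reverse=True,
    )
    return ranked[:k]


# Toy "embeddings", invented for illustration only.
toy_store = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.7, 0.7, 0.0],
    "doc_c": [0.0, 1.0, 0.0],
}

print(top_k([1.0, 0.05, 0.0], toy_store, k=2))
```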

In [None]:
from langchain.chat_models import ChatOpenAI
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model_name='gpt-3.5-turbo'),
    chain_type="stuff",
    retriever=docsearch.as_retriever(),
    return_source_documents=True,
)
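
With `chain_type="stuff"`, all retrieved documents are placed ("stuffed") into a single prompt that is sent to the chat model in one call. The sketch below shows that idea only. The function `stuff_prompt` and the prompt wording are assumptions made for illustration and are not LangChain's actual template.

```python
from typing import List


def stuff_prompt(question: str, documents: List[str]) -> str:
    """Join all retrieved documents into one context block and append the question."""
    context = "\n\n".join(documents)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Placeholder documents standing in for retrieved 'context' strings.
print(stuff_prompt("What is the plot?", ["First retrieved text.", "Second retrieved text."]))
```

Because everything goes into one prompt, this approach only works while the retrieved documents together fit within the model's context window.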

In [None]:
query = """What is the plot of the movie Knock at the Cabin 
and what was the revenue generated from this movie? 
What is the genre of this movie and accordingly, can you recommend me 3 other popular movies in the same genre by adding one sentence summary of plots of these movies? """
result = qa({"query":query})

In [None]:
result["result"]

In [None]:
query = """What is the plot of the movie Knock at the Cabin 
and what was the revenue generated from this movie? 
"""
result = qa({"query":query})

In [None]:
result["result"]
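
Because the chain was built with `return_source_documents=True`, the result also carries the retrieved documents under the `"source_documents"` key, and each document has the `DataFrameLoader` columns in its `metadata`. The sketch below lists the titles of those sources. `FakeDocument`, `list_source_titles` and the sample values are stand-ins invented so that the example runs without an API call.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDocument:
    """Stand-in for a LangChain Document: text plus a metadata dict."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def list_source_titles(result: dict) -> list:
    """Collect the 'title' metadata of each source document, skipping documents without one."""
    return [
        doc.metadata["title"]
        for doc in result.get("source_documents", [])
        if "title" in doc.metadata
    ]


# Hypothetical result, shaped like the dictionary returned by the chain.
fake_result = {
    "result": "placeholder answer",
    "source_documents": [
        FakeDocument("context text A", {"title": "Example Movie A"}),
        FakeDocument("context text B", {"title": "Example Movie B"}),
    ],
}

print(list_source_titles(fake_result))
```

In the notebook, `list_source_titles(result)` would show which movies the answer was grounded in, which helps to spot answers drawn from the wrong title.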