# Question answering on the Pulp Fiction film script


### Sources
- Blog: https://towardsdatascience.com/4-ways-of-question-answering-in-langchain-188c6707cc5a

### Contents
0. Install packages
1. Imports & settings
2. Get the data and split into chunks
3. Question answering with LangChain

## 0. Install packages

In [7]:
!pip install langchain
!pip install openai
!pip install PyPDF2
!pip install faiss-cpu
!pip install tiktoken



## 1. Imports & settings

In [1]:
# We need several imports to set everything up.
from PyPDF2 import PdfReader
#import the embeddings
from langchain.embeddings.openai import OpenAIEmbeddings 
#Textsplitter
from langchain.text_splitter import CharacterTextSplitter 
#Import the vectorstores
from langchain.vectorstores import ElasticVectorSearch, Pinecone, Weaviate, FAISS 
#Import the chains
from langchain.chains.question_answering import load_qa_chain
# Import the LLMs
from langchain.llms import OpenAI
#Import the summarizer function
from langchain.chains.summarize import load_summarize_chain

In [2]:
# Get your API key from OpenAI; you will need to create a (paid) account.
# Here is the link to get the keys: https://platform.openai.com/account/billing/overview
# I store my API keys in a separate config file.
import os
import config
os.environ["OPENAI_API_KEY"] = config.openai_key
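
The `config` module imported above is the author's own file and is not shown here. As a hedged alternative for readers who do not have such a file, the sketch below reads the key from the environment first and only then falls back to a supplied value. The helper name `load_openai_key` and its parameters are hypothetical and not part of any library.

```python
import os


def load_openai_key(env=None, fallback=None):
    """Return the OpenAI API key from the environment, else a fallback.

    `env` defaults to os.environ; passing a plain dict makes testing easy.
    This helper is illustrative only (a hypothetical name, not a library API).
    """
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY") or fallback
    if not key:
        raise RuntimeError(
            "No OpenAI API key found: set OPENAI_API_KEY or pass a fallback."
        )
    return key
```

With the author's setup this could be used as `os.environ["OPENAI_API_KEY"] = load_openai_key(fallback=config.openai_key)`.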

In [3]:
# Use OpenAI embeddings (other embedding providers are available as well)
embeddings = OpenAIEmbeddings()

In [4]:
llm = OpenAI(temperature=0)

## 2. Get the data and split into chunks

In [6]:
# Select the .pdf file to read 
reader = PdfReader('./PDF/pulp-fiction-1994.pdf')
reader

<PyPDF2._reader.PdfReader at 0x7f77c0d4bac0>

In [7]:
# Read the text of every page and concatenate it into raw_text
raw_text = ''
for page in reader.pages:
    text = page.extract_text()
    if text:  # skip pages from which no text could be extracted
        raw_text += text
print(raw_text[:200])
print(50 * '-')
print(f'The text is {len(raw_text)} characters')

PULP FICTION
by
Quentin Tarantino & Roger AvaryPULP [pulp] n.
1. A soft, moist, shapeless
mass or matter.
2. A magazine or book containing lurid
subject matter and being characteristically
printed on 
--------------------------------------------------
The text is 152213 characters


In [8]:
# Split the text into smaller chunks so that we don't hit the
# token size limits during information retrieval.
text_splitter = CharacterTextSplitter(
    separator="\n",       # split on line breaks
    chunk_size=1000,      # maximum chunk length, measured by length_function
    chunk_overlap=200,    # text shared between consecutive chunks
    length_function=len,  # measure length in characters
)
chunks = text_splitter.split_text(raw_text)

In [9]:
# Show the number of chunks
len(chunks)

190
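
To illustrate what `chunk_size` and `chunk_overlap` do, here is a minimal pure-Python sketch of greedy line-based chunking with overlap. It is a simplified stand-in written for this explanation, not LangChain's actual implementation, and the function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text, separator="\n", chunk_size=1000, chunk_overlap=200):
    """Greedily pack separator-delimited pieces into chunks of at most
    chunk_size characters, carrying up to chunk_overlap characters of
    trailing pieces into the next chunk. Illustrative sketch only."""
    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], []
    for piece in pieces:
        if current and len(separator.join(current + [piece])) > chunk_size:
            # The current chunk is full: emit it.
            chunks.append(separator.join(current))
            # Drop leading pieces until what remains fits the overlap budget
            # and still leaves room for the incoming piece.
            while current and (
                len(separator.join(current)) > chunk_overlap
                or len(separator.join(current + [piece])) > chunk_size
            ):
                current.pop(0)
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


# Toy example: 50 short lines, small limits so the overlap is easy to see.
demo_text = "\n".join(f"line {i:03d}" for i in range(50))
demo_chunks = split_with_overlap(demo_text, chunk_size=40, chunk_overlap=17)
```

In the toy example each chunk repeats the last two lines of the previous one, so a sentence that straddles a chunk boundary still appears whole in at least one chunk.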

In [10]:
# Show the first chunk
chunks[0]

'PULP FICTION\nby\nQuentin Tarantino & Roger AvaryPULP [pulp] n.\n1. A soft, moist, shapeless\nmass or matter.\n2. A magazine or book containing lurid\nsubject matter and being characteristically\nprinted on rough, unfinished paper.\nAmerican Heritage Dictionary: New College Edition\nINT. COFFEE SHOP – MORNING\nA normal Denny\'s, Spires-like coffee shop in Los Angeles. It\'s\nabout 9:00 in the morning. While the place isn\'t jammed, there\'s a\nhealthy number of people drinking coffee, munching on bacon and\neating eggs.\nTwo of these people are a YOUNG MAN and a YOUNG WOMAN. The Young\nMan has a slight working-class English accent and, like his fellow\ncountryman, smokes cigarettes like they\'re going out of style.\nIt is impossible to tell where the Young Woman is from or how old\nshe is; everything she does contradicts something she did. The boy\nand girl sit in a booth. Their dialogue is to be said in a rapid-\npace "HIS GIRL FRIDAY" fashion.\nYOUNG MAN\nNo, forget it, it\'s too ri

In [17]:
# Select the chain type. The default, chain_type="stuff", puts ALL of the text
# from the documents it receives into a single prompt. Expensive!
chain = load_qa_chain(OpenAI(), chain_type="stuff")
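
Conceptually, the "stuff" strategy concatenates every document it is given into one prompt, so the prompt grows with the number and size of the documents. The sketch below shows the idea in plain Python. The prompt wording and the name `build_stuff_prompt` are assumptions made for illustration and are not LangChain's exact template.

```python
def build_stuff_prompt(docs, question):
    """Concatenate all document texts into a single prompt string.

    Illustrative only: the wording is an assumed template, not the one
    LangChain ships with.
    """
    context = "\n\n".join(docs)
    return (
        "Use the following pieces of context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_stuff_prompt(["chunk one", "chunk two"], "Who is Butch?")
```

Because every document lands in the same prompt, passing only the few most relevant chunks keeps both cost and prompt length under control.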

## 3. Question answering with LangChain

Faiss is a library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size, up to ones that possibly do not fit in RAM. It also contains supporting code for evaluation and parameter tuning.
source: https://faiss.ai/
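
To make "similarity search over dense vectors" concrete, here is a brute-force sketch in pure Python that ranks stored vectors by their distance to a query vector. It assumes Euclidean (L2) distance as the metric and uses toy 2-dimensional vectors. Real embeddings have many more dimensions, and FAISS uses far more efficient index structures. The function names are hypothetical.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_texts(query_vec, vectors, texts, k=4):
    """Return the k texts whose vectors lie closest to query_vec.

    Brute force: compares the query against every stored vector.
    """
    ranked = sorted(
        range(len(vectors)),
        key=lambda i: l2_distance(query_vec, vectors[i]),
    )
    return [texts[i] for i in ranked[:k]]


# Toy data: three "chunks" with made-up 2-D embeddings.
toy_vectors = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]]
toy_texts = ["a", "b", "c"]
closest = nearest_texts([0.9, 0.1], toy_vectors, toy_texts, k=2)
```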

In [22]:
# Embed every chunk and build a FAISS index for similarity search
docsearch = FAISS.from_texts(chunks, embeddings)

In [23]:
question = "What do the Dutch eat on their french fries"
docs = docsearch.similarity_search(question)
chain.run(input_documents=docs, question=question)

" Jules and Vincent discuss the differences between the US and Europe, such as the metric system and the fact that you can buy beer in a movie theatre in Amsterdam. They also discuss the fact that in Paris, you can buy beer at McDonald's and that a Quarter Pounder with Cheese is called a Royale with Cheese. Jules also mentions that in Holland, they put mayonnaise on french fries instead of ketchup."

In [25]:
question = "Who is Butch Coolidge? "
docs = docsearch.similarity_search(question)
chain.run(input_documents=docs, question=question)

" Capt. Koons, a friend of Butch's father, gives Butch a watch that was passed down from his father. Butch is a 27-year-old boxer preparing for a big fight. He is shaken by the memory of the watch and is helped into his boxing robe by his trainer Klondike. As Butch steps into the hallway, the crowd goes wild. Later, Butch finds a submachine gun on his kitchen counter and is surprised by Vincent Vega coming out of the bathroom. Capt. Koons then explains to Butch that he is responsible for passing down his father's watch."

In [27]:
question = "What is the joke about the tomato? "
docs = docsearch.similarity_search(question)
chain.run(input_documents=docs, question=question)

' Vincent and Mia share a joke, but neither of them laugh. Mia then tells a joke about three tomatoes walking down the street, and they both smile. Vincent then blows Mia a kiss as she walks inside her house. Meanwhile, Lance is watching the Three Stooges on TV when the phone rings. His wife, Jody, wakes up and scolds him for letting people call late. Lance answers the phone and Jimmie and Jules have a conversation about a dead nigger in the garage. Finally, Butch is stopped at a light when he sees Marsellus Wallace crossing the street in front of him. Butch quickly drives away.'

In [28]:
question = "How do Vincent and Mia dance? "
docs = docsearch.similarity_search(question)
chain.run(input_documents=docs, question=question)

' Mia and Vincent dance to Chuck Berry\'s "YOU NEVER CAN TELL" and then stand face to face looking at each other. Mia moves away to attend to music and drinks while Vincent goes to the bathroom. Mia then selects a CD and dances to it. Vincent and Mia then enter a dance competition and dance to Chuck Berry\'s song. After the competition, Mia and Vincent agree to keep the incident a secret and shake on it. Vincent then leaves to go home.'

In [24]:
question = "What Bible verse is mentioned?  "
docs = docsearch.similarity_search(question)
chain.run(input_documents=docs, question=question)

' Ezekiel 25:17: "The path of the righteous man is beset on all sides by the inequities of the selfish and the tyranny of evil men. Blessed is he who, in the name of charity and good will, shepherds the weak through the valley of darkness, for he is truly his brother\'s keeper and the finder of lost children. And I will strike down upon thee with great vengeance and furious anger those who attempt to poison and destroy my brothers. And you will know my name is the Lord when I lay my vengeance upon you."'
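
Every query cell above repeats the same two steps: retrieve the most similar chunks, then run the chain on them. As a small convenience, these steps could be wrapped in a helper. The name `ask` is hypothetical. It relies only on the `similarity_search` and `chain.run` calls already used in this notebook and exposes `k`, the number of chunks to retrieve.

```python
def ask(docsearch, chain, question, k=4):
    """Retrieve the k most similar chunks and answer the question from them.

    docsearch: a vector store exposing similarity_search(query, k=...)
    chain:     a QA chain exposing run(input_documents=..., question=...)
    """
    question = question.strip()  # drop stray leading/trailing spaces
    docs = docsearch.similarity_search(question, k=k)
    return chain.run(input_documents=docs, question=question)
```

With the objects defined earlier, a call would look like `ask(docsearch, chain, "Who is Butch Coolidge?")`. A smaller `k` sends less text to the model, while a larger `k` gives it more context at a higher cost.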