# Question answering on the Pulp Fiction film script


### Sources
- Blog: https://towardsdatascience.com/4-ways-of-question-answering-in-langchain-188c6707cc5a

### Contents
0. Install packages
1. Imports & settings
2. Get the data and split into chunks
3. Do the question answering with LangChain

## 0. Install packages

In [7]:
!pip install langchain
!pip install openai
!pip install PyPDF2
!pip install faiss-cpu
!pip install tiktoken



## 1. Imports & settings

In [20]:
# We will do multiple imports to get all the settings right.
from PyPDF2 import PdfReader
#import the embeddings
from langchain.embeddings.openai import OpenAIEmbeddings 
#Textsplitter
from langchain.text_splitter import CharacterTextSplitter 
#Import the vectorstores
from langchain.vectorstores import ElasticVectorSearch, Pinecone, Weaviate, FAISS 
#Import the chains
from langchain.chains.question_answering import load_qa_chain
#Import the LLM's
from langchain.llms import OpenAI
#Import the summarizer function
from langchain.chains.summarize import load_summarize_chain

In [21]:
# Get your API keys from OpenAI; you will need to create a (paid) account.
# Here is the link to get the keys: https://platform.openai.com/account/billing/overview
# I store my API keys in a config file as well
import os
import config
os.environ["OPENAI_API_KEY"] = config.openai_key
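
A missing key usually only shows up later, as an authentication error on the first API call, so it can help to check for it up front. A minimal sketch, assuming a hypothetical helper called `require_env` (it is not part of LangChain or the OpenAI package) and a made-up variable name for the demo:

```python
import os

def require_env(name):
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; set it before creating the embeddings or the LLM")
    return value

# Demo with a made-up variable so the snippet runs on its own.
# In the notebook you would call require_env("OPENAI_API_KEY") instead.
os.environ["DEMO_API_KEY"] = "demo-value"
key = require_env("DEMO_API_KEY")
print(f"key loaded ({len(key)} characters)")
```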

In [22]:
# Select OpenAI embeddings (other embedding providers are available as well)
embeddings = OpenAIEmbeddings()

In [30]:
llm = OpenAI(temperature=0)

In [32]:
# Select the chain type. chain_type="stuff" (the default) puts ALL of the text from the documents into the prompt. Expensive!
chain = load_qa_chain(OpenAI(), chain_type="stuff")
# The chain needs both the documents and a question, so it is run further below,
# once the documents have been retrieved: chain.run(input_documents=docs, question=question)
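
To see why the "stuff" chain type needs both the documents and a question, and why it gets expensive, here is a rough sketch of the idea in plain Python. The function name `build_stuff_prompt` and the prompt wording are assumptions for illustration; this is not LangChain's actual prompt template.

```python
def build_stuff_prompt(documents, question):
    """Concatenate every document into one prompt, followed by the question."""
    context = "\n\n".join(documents)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Two short placeholder documents; real chunks are around 1000 characters each.
docs_demo = ["First chunk of the script.", "Second chunk of the script."]
prompt = build_stuff_prompt(docs_demo, "Who wrote the script?")
print(prompt)
print(f"Prompt length: {len(prompt)} characters")
```

Every extra document makes the prompt longer, so the token cost grows with the number of documents that are stuffed in.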

## 2. Get the data and split into chunks

In [24]:
# Select the .pdf file to read 
reader = PdfReader('./PDF/pulp-fiction-1994.pdf')
reader

<PyPDF2._reader.PdfReader at 0x7fbb48917a00>

In [25]:
# Read the text from every page and collect it in a variable called raw_text
raw_text = ''
for page in reader.pages:
    text = page.extract_text()
    if text:
        raw_text += text
print(raw_text[:200])
print(50 * '-')
print(f'The text is {len(raw_text)} characters')

PULP FICTION
by
Quentin Tarantino & Roger AvaryPULP [pulp] n.
1. A soft, moist, shapeless
mass or matter.
2. A magazine or book containing lurid
subject matter and being characteristically
printed on 
--------------------------------------------------
The text is 152213 characters


In [26]:
# We need to split the text that we read into smaller chunks so that during information retrieval we don't hit the token size limits.
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
)
chunks = text_splitter.split_text(raw_text)

In [27]:
#print the number of chunks
len(chunks)

190
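
The number of chunks follows roughly from the chunk size and the overlap: each new chunk starts about `chunk_size - chunk_overlap` = 800 characters after the previous one. A simplified sketch, assuming a fixed-width character window (the real `CharacterTextSplitter` cuts on the `\n` separator, so its chunks vary in length and its count differs a little). The function names are made up for this example.

```python
import math

def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-width windows that overlap by chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while True:
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks

def estimate_chunk_count(text_length, chunk_size, chunk_overlap):
    """Chunk count of the fixed-width window above, without building the chunks."""
    if text_length <= chunk_size:
        return 1
    step = chunk_size - chunk_overlap
    return 1 + math.ceil((text_length - chunk_size) / step)

# 152213 is the length of raw_text reported earlier in the notebook.
print(estimate_chunk_count(152213, 1000, 200))
```

The estimate lands close to the 190 chunks reported above.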

In [28]:
#print the first chunk
chunks[0]

'PULP FICTION\nby\nQuentin Tarantino & Roger AvaryPULP [pulp] n.\n1. A soft, moist, shapeless\nmass or matter.\n2. A magazine or book containing lurid\nsubject matter and being characteristically\nprinted on rough, unfinished paper.\nAmerican Heritage Dictionary: New College Edition\nINT. COFFEE SHOP – MORNING\nA normal Denny\'s, Spires-like coffee shop in Los Angeles. It\'s\nabout 9:00 in the morning. While the place isn\'t jammed, there\'s a\nhealthy number of people drinking coffee, munching on bacon and\neating eggs.\nTwo of these people are a YOUNG MAN and a YOUNG WOMAN. The Young\nMan has a slight working-class English accent and, like his fellow\ncountryman, smokes cigarettes like they\'re going out of style.\nIt is impossible to tell where the Young Woman is from or how old\nshe is; everything she does contradicts something she did. The boy\nand girl sit in a booth. Their dialogue is to be said in a rapid-\npace "HIS GIRL FRIDAY" fashion.\nYOUNG MAN\nNo, forget it, it\'s too ri

In [34]:
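# load_summarize_chain expects Document objects, not plain strings. Passing the raw
# chunks raises: AttributeError: 'str' object has no attribute 'page_content'
from langchain.docstore.document import Document
summarize_chain = load_summarize_chain(llm, chain_type="stuff")
# Only a few chunks: "stuff" sends all of them in a single prompt.
summary_docs = [Document(page_content=chunk) for chunk in chunks[:3]]
#summarize_chain.run(summary_docs)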

## 3. Do the question answering with LangChain

Faiss is a library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size, up to ones that possibly do not fit in RAM. It also contains supporting code for evaluation and parameter tuning.
source: https://faiss.ai/
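
Conceptually, similarity search means: turn the question into a vector and return the stored vectors closest to it. The sketch below does this by brute force in plain Python, assuming L2 distance and tiny made-up 3-dimensional vectors. Faiss itself uses far more optimized index structures, so this only illustrates the idea.

```python
import math

def l2_distance(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_vectors(query_vector, vectors, k=2):
    """Return the indices of the k stored vectors closest to the query."""
    ranked = sorted(range(len(vectors)), key=lambda i: l2_distance(query_vector, vectors[i]))
    return ranked[:k]

# Toy "embeddings"; real OpenAI embeddings have many more dimensions.
texts = ["chunk A", "chunk B", "chunk C"]
vectors = [[0.9, 0.1, 0.0], [0.7, 0.3, 0.1], [0.0, 0.2, 0.9]]
query = [0.8, 0.2, 0.0]

for index in nearest_vectors(query, vectors, k=2):
    print(texts[index])
```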

In [18]:
#We will use FAISS to do similarity search
docsearch = FAISS.from_texts(chunks, embeddings)

In [19]:
question = "What do the Dutch eat on their french fries"
docs = docsearch.similarity_search(question)
chain.run(input_documents=docs, question=question)

' Mayonnaise.'
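
Each query follows the same two steps: retrieve the most similar chunks, then let the chain answer from them. This can be wrapped in a small helper. A minimal sketch, assuming a hypothetical function named `ask`; the two `Echo` classes are stand-ins so the snippet runs without an API key, and in the notebook you would pass the real `docsearch` and `chain` instead.

```python
def ask(question, docsearch, chain, k=4):
    """Retrieve the k most similar chunks, then let the chain answer from them."""
    docs = docsearch.similarity_search(question, k=k)
    return chain.run(input_documents=docs, question=question).strip()

class EchoSearch:
    """Stand-in for the vector store: returns the first k stored texts."""
    def __init__(self, texts):
        self.texts = texts

    def similarity_search(self, query, k=4):
        return self.texts[:k]

class EchoChain:
    """Stand-in for the QA chain: reports what it received instead of calling an LLM."""
    def run(self, input_documents, question):
        return f" {len(input_documents)} documents used for: {question}"

answer = ask("What do the Dutch eat on their french fries", EchoSearch(["a", "b", "c"]), EchoChain(), k=2)
print(answer)
```

The `.strip()` removes the leading space that the answers in this notebook come back with.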

In [26]:
question = "Who is Butch Coolidge? "
docs = docsearch.similarity_search(question)
chain.run(input_documents=docs, question=question)

" Butch Coolidge is a 27-year old boxer who is preparing for a big fight. He is receiving a watch from Capt. Koons, who was a good friend of his father's in a POW camp."

In [23]:
question = "What is the joke about the tomato? "
docs = docsearch.similarity_search(question)
chain.run(input_documents=docs, question=question)

' Three tomatoes are walking down the street, a poppa tomato, a momma tomato, and a little baby tomato. The baby tomato is lagging behind the poppa and momma tomato. The poppa tomato gets mad, goes over to the momma tomato and stamps on him – (stamps on the ground) – and says: catch up.'

In [28]:
question = "How do Vincent and Mia dance? "
docs = docsearch.similarity_search(question)
chain.run(input_documents=docs, question=question)

' Vincent and Mia dance to Chuck Berry\'s "YOU NEVER CAN TELL" and make hand movements as they dance.'

In [24]:
question = "What Bible verse is mentioned?  "
docs = docsearch.similarity_search(question)
chain.run(input_documents=docs, question=question)

' Ezekiel 25:17: "The path of the righteous man is beset on all sides by the inequities of the selfish and the tyranny of evil men. Blessed is he who, in the name of charity and good will, shepherds the weak through the valley of darkness, for he is truly his brother\'s keeper and the finder of lost children. And I will strike down upon thee with great vengeance and furious anger those who attempt to poison and destroy my brothers. And you will know my name is the Lord when I lay my vengeance upon you."'