# Question answering on the Pulp Fiction film script

In this notebook we download a .pdf from the internet and do question answering on it. Here we use the Pulp Fiction film script (1994).

### Contents
0. Install packages
1. Imports & settings & getting the data
2. Get the data and split into chunks
3. Do the question answering with Langchain

### Sources
- Blog: https://towardsdatascience.com/4-ways-of-question-answering-in-langchain-188c6707cc5a

## 0. Install packages

In [7]:
!pip install langchain
!pip install openai
!pip install PyPDF2
!pip install faiss-cpu
!pip install tiktoken



## 1. Imports & settings & getting the data

In [2]:
# We do multiple imports to get all the settings right.
from PyPDF2 import PdfReader
#import the embeddings
from langchain.embeddings.openai import OpenAIEmbeddings 
#Textsplitter
from langchain.text_splitter import CharacterTextSplitter 
#Import the vectorstores
from langchain.vectorstores import ElasticVectorSearch, Pinecone, Weaviate, FAISS 
#Import the chains
from langchain.chains.question_answering import load_qa_chain
#Import the LLMs
from langchain.llms import OpenAI
#Import the summarizer function
from langchain.chains.summarize import load_summarize_chain

In [3]:
# Get your API key from OpenAI; you will need to create a (paid) account.
# Here is the link to get the keys: https://platform.openai.com/account/billing/overview
# I store my API key in a config file.
import os
import config
os.environ["OPENAI_API_KEY"] = config.openai_key
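
The `config` module itself is not shown in this notebook. As a minimal sketch, assuming you would rather read the key from an environment variable that is already set, a small helper can fail early with a clear message when the key is missing. The name `load_openai_key` is hypothetical and not part of any library.

```python
import os


def load_openai_key(env=None):
    """Return the OpenAI API key from the environment, or raise if it is missing.

    `load_openai_key` is a hypothetical helper for illustration only.
    """
    # Allow a plain dict to be passed in so the helper is easy to try out.
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set")
    return key
```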

In [4]:
# Select OpenAI embeddings (alternatives exist)
embeddings = OpenAIEmbeddings()

In [4]:
# Set the temperature of the model. A temperature close to 0 gives more deterministic, conservative answers; values near 1 give more varied, creative answers.
llm = OpenAI(temperature=0.2)
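
To get a feeling for what the temperature does, here is a small sketch. It assumes the common formulation in which token scores (logits) are divided by the temperature before a softmax. The function name and the toy logits are made up for illustration, and this is not how the OpenAI API is implemented internally.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities after dividing them by `temperature`.

    Illustrative only; `temperature` must be > 0 here to avoid dividing by zero.
    """
    scaled = [x / temperature for x in logits]
    # Subtract the maximum for numerical stability before exponentiating.
    highest = max(scaled)
    exps = [math.exp(s - highest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate tokens
low_temp = softmax_with_temperature(toy_logits, 0.2)
high_temp = softmax_with_temperature(toy_logits, 1.0)
print(low_temp)
print(high_temp)
```

With the lower temperature almost all of the probability goes to the highest-scoring token, so the output varies less between runs.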

## 2. Get the data and split into chunks

In [1]:
import wget
document = wget.download('https://artificialintelligenceact.eu/wp-content/uploads/2022/05/AIA-COM-Proposal-21-April-21.pdf')
document

100% [......................................................] 1351521 / 1351521

'AIA-COM-Proposal-21-April-21.pdf'

In [5]:
# Select the .pdf file to read 
reader = PdfReader('./AIA-COM-Proposal-21-April-21.pdf')
reader

<PyPDF2._reader.PdfReader at 0x7f81ab4bf400>

In [6]:
# read data from the file and put them into a variable called raw_text
raw_text = ''
for i, page in enumerate(reader.pages):
    text = page.extract_text()
    if text:
        raw_text += text
print(raw_text[:200])
print(50 * '-')
print(f'The text is {len(raw_text)} characters')

EN   EN 
 
 
 EUROPEAN  
COMMISSION   
Brussels, 21.4.2021  
COM(2021) 206 final  
2021/0106 (COD)  
 
Proposal for a  
REGULATION OF THE EUROPEAN PARLIAMENT AND OF THE COUNCIL  
LAYING DOWN HARMONISE
--------------------------------------------------
The text is 316717 characters


In [7]:
# We need to split the text that we read into smaller chunks so that during information retrieval we don't hit the token size limits.
text_splitter = CharacterTextSplitter(        
    separator = "\n",
    chunk_size = 1000,
    chunk_overlap  = 200,
    length_function = len,
)
chunks = text_splitter.split_text(raw_text)

In [8]:
#print the number of chunks
len(chunks)

401

In [9]:
#print the first chunk
chunks[0]

'EN   EN \n \n \n EUROPEAN  \nCOMMISSION   \nBrussels, 21.4.2021  \nCOM(2021) 206 final  \n2021/0106 (COD)  \n \nProposal for a  \nREGULATION OF THE EUROPEAN PARLIAMENT AND OF THE COUNCIL  \nLAYING DOWN HARMONISED RULES ON ARTIFICIAL INTELLIGENCE \n(ARTIFICIAL INTELLIGENCE ACT) AND AMENDING CERTAIN UNION \nLEGISLATIVE ACTS  \n{SEC(2021)  167 final}  - {SWD(2021)  84 final}  - {SWD(2021)  85 final}   EN 1  EN EXPLANATORY MEMORANDUM  \n1. CONTEXT  OF THE  PROPOSAL  \n1.1. Reasons for and objectives of the proposal  \nThis explanatory memorandum accompanies the proposal for a Regulation laying down \nharmonised rules on artificial intelligence (Artificial Intelligence Act). Artificial Intelligence \n(AI) is a fast evolving family of technologies that can bring a wide array of economic and \nsocietal benefits across the entire s pectrum of industries and social activities. By improving \nprediction, optimising operations and resource allocation, and personalising service delivery,'
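
To illustrate what `chunk_size` and `chunk_overlap` do, here is a simplified sketch of splitting on a separator and carrying some text over between chunks. It is not LangChain's actual implementation. The function name `split_with_overlap` and the toy text are assumptions for illustration only.

```python
def split_with_overlap(text, separator="\n", chunk_size=1000, chunk_overlap=200):
    """Greedily merge separator-delimited pieces into chunks of at most
    `chunk_size` characters, carrying up to `chunk_overlap` characters of
    whole pieces over into the next chunk.

    A single piece longer than `chunk_size` is kept as one oversized chunk.
    """
    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], []
    for piece in pieces:
        if current and len(separator.join(current + [piece])) > chunk_size:
            # The current chunk is full: store it, then keep only a short tail.
            chunks.append(separator.join(current))
            while current and (
                len(separator.join(current)) > chunk_overlap
                or len(separator.join(current + [piece])) > chunk_size
            ):
                current.pop(0)
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


# Toy input: ten short lines, split into chunks of at most 20 characters.
toy_text = "\n".join(f"line{i:02d}" for i in range(10))
toy_chunks = split_with_overlap(toy_text, chunk_size=20, chunk_overlap=7)
for chunk in toy_chunks:
    print(repr(chunk))
```

In the toy output each chunk starts with the last line of the previous chunk. That shared text is the overlap, and it helps when an answer sits on the border between two chunks.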

In [10]:
# Select the chain type. The default chain_type="stuff" puts ALL of the text from the documents passed to the chain into a single prompt. Expensive!
chain = load_qa_chain(OpenAI(), chain_type="stuff")
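
As a rough sketch of what "stuffing" means: all documents are joined into one context block and sent together with the question in a single prompt. The template below is hypothetical and is not the prompt that LangChain actually uses. `build_stuff_prompt` is a made-up name.

```python
def build_stuff_prompt(docs, question):
    """Join every document into one context block and append the question.

    Hypothetical template for illustration; the real chain has its own prompt.
    """
    context = "\n\n".join(docs)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


toy_docs = ["First retrieved chunk.", "Second retrieved chunk."]
prompt = build_stuff_prompt(toy_docs, "What do the chunks say?")
print(prompt)
print(len(prompt), "characters")
```

The prompt grows with every document you pass in, which is why this chain type gets expensive and can hit the token limit when the chunks are large or numerous.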

Faiss is a library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size, up to ones that possibly do not fit in RAM. It also contains supporting code for evaluation and parameter tuning.
source: https://faiss.ai/

In [11]:
#We will use FAISS to do similarity search
docsearch = FAISS.from_texts(chunks, embeddings)
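
Conceptually, similarity search means: embed the question, then return the chunks whose vectors are closest to it. Below is a brute-force sketch in plain Python, assuming Euclidean distance and tiny made-up 2-dimensional vectors. Real embeddings have many more dimensions, and FAISS does this far more efficiently. `top_k_by_l2` is a hypothetical name.

```python
import math


def top_k_by_l2(query_vec, vectors, k=4):
    """Return the indices of the k vectors closest to `query_vec`.

    Brute-force Euclidean search, for illustration only.
    """
    distances = [(math.dist(query_vec, vec), idx) for idx, vec in enumerate(vectors)]
    distances.sort()
    return [idx for _, idx in distances[:k]]


# Made-up "embeddings" for four chunks and one question.
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
toy_query = [1.0, 0.0]
nearest = top_k_by_l2(toy_query, toy_vectors, k=2)
print(nearest)
```

Here chunks 0 and 2 come back because their vectors point in almost the same direction as the question vector.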

## 3. Do the question answering with Langchain

In [13]:
question = "Is face recognition allowed in the EU?"
docs = docsearch.similarity_search(question)
chain.run(input_documents=docs, question=question)

' No, face recognition is not allowed in the EU. Article 5(1), point (d), (2) and (3) of this Regulation adopted on the basis of Article 16 of the TFEU specifically prohibits the use of AI systems for ‘real -time’ remote biometric identification in publicly accessible spaces for the purpose of law enforcement.'
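
The same three lines (question, search, run) are repeated for every query in this section. As a small convenience, they could be wrapped in a helper. This is a sketch: `answer_question` is a hypothetical name, and it only assumes the `similarity_search` and `run` calls that are already used in this notebook.

```python
def answer_question(question, docsearch, chain, k=4):
    """Retrieve the k most similar chunks and let the chain answer the question.

    Hypothetical helper; `docsearch` and `chain` are the objects created earlier
    in the notebook (a vector store and a question-answering chain).
    """
    question = question.strip()
    docs = docsearch.similarity_search(question, k=k)
    answer = chain.run(input_documents=docs, question=question)
    # The model output often starts with a space, so strip it.
    return answer.strip()
```

In this notebook it would be called as `answer_question("Who is Butch Coolidge?", docsearch, chain)`.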

In [25]:
question = "Who is Butch Coolidge? "
docs = docsearch.similarity_search(question)
chain.run(input_documents=docs, question=question)

" Capt. Koons, a friend of Butch's father, gives Butch a watch that was passed down from his father. Butch is a 27-year-old boxer preparing for a big fight. He is shaken by the memory of the watch and is helped into his boxing robe by his trainer Klondike. As Butch steps into the hallway, the crowd goes wild. Later, Butch finds a submachine gun on his kitchen counter and is surprised by Vincent Vega coming out of the bathroom. Capt. Koons then explains to Butch that he is responsible for passing down his father's watch."

In [27]:
question = "What is the joke about the tomato? "
docs = docsearch.similarity_search(question)
chain.run(input_documents=docs, question=question)

' Vincent and Mia share a joke, but neither of them laugh. Mia then tells a joke about three tomatoes walking down the street, and they both smile. Vincent then blows Mia a kiss as she walks inside her house. Meanwhile, Lance is watching the Three Stooges on TV when the phone rings. His wife, Jody, wakes up and scolds him for letting people call late. Lance answers the phone and Jimmie and Jules have a conversation about a dead nigger in the garage. Finally, Butch is stopped at a light when he sees Marsellus Wallace crossing the street in front of him. Butch quickly drives away.'

In [28]:
question = "How do Vincent and Mia dance? "
docs = docsearch.similarity_search(question)
chain.run(input_documents=docs, question=question)

' Mia and Vincent dance to Chuck Berry\'s "YOU NEVER CAN TELL" and then stand face to face looking at each other. Mia moves away to attend to music and drinks while Vincent goes to the bathroom. Mia then selects a CD and dances to it. Vincent and Mia then enter a dance competition and dance to Chuck Berry\'s song. After the competition, Mia and Vincent agree to keep the incident a secret and shake on it. Vincent then leaves to go home.'

In [24]:
question = "What Bible verse is mentioned?  "
docs = docsearch.similarity_search(question)
chain.run(input_documents=docs, question=question)

' Ezekiel 25:17: "The path of the righteous man is beset on all sides by the inequities of the selfish and the tyranny of evil men. Blessed is he who, in the name of charity and good will, shepherds the weak through the valley of darkness, for he is truly his brother\'s keeper and the finder of lost children. And I will strike down upon thee with great vengeance and furious anger those who attempt to poison and destroy my brothers. And you will know my name is the Lord when I lay my vengeance upon you."'