# Question answering on multiple files using Quentin Tarantino scripts

In this notebook we download several PDFs (Quentin Tarantino film scripts), convert them to text, and use them for question answering.


### Contents
0. Install packages
1. Setting up
2. Get the PDFs and convert them to txt
3. Create the vector database and store the data in it
4. Retrieve the data (search the source)
5. Make a chain to ask the LLM the question
6. Delete the database (optional)

### Sources
- Video: https://www.youtube.com/watch?v=3yPBVii7Ct0
- Adapted from: https://colab.research.google.com/drive/1gyGZn_LZNrYXYXa-pltFExbptIe7DAPe?usp=sharing#scrollTo=XHVE9uFb3Ajj

## 1. Setting up


In [2]:
!pip show langchain

Name: langchain
Version: 0.0.136
Summary: Building applications with LLMs through composability
Home-page: https://www.github.com/hwchase17/langchain
Author: 
Author-email: 
License: MIT
Location: /Users/michielbontenbal/anaconda3/lib/python3.10/site-packages
Requires: aiohttp, async-timeout, dataclasses-json, numpy, openapi-schema-pydantic, pydantic, PyYAML, requests, SQLAlchemy, tenacity
Required-by: 


In [1]:
import os
import config
os.environ["OPENAI_API_KEY"] = config.openai_key
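
The cell above assumes a local `config.py` module that defines `openai_key`. As a minimal sketch of an alternative, assuming you would rather not keep the key in a file next to the notebook, the helper below reads it from the environment first and only then falls back to a supplied value. `resolve_api_key` is a hypothetical name used for illustration; it is not part of LangChain or the OpenAI client.

```python
import os


def resolve_api_key(fallback=None, env=None, var_name="OPENAI_API_KEY"):
    """Return the key found in the environment, or the fallback if it is unset.

    Hypothetical helper for illustration only.
    """
    env = os.environ if env is None else env
    key = env.get(var_name) or fallback
    if not key:
        raise RuntimeError(f"{var_name} is not set and no fallback was given")
    return key


# Example with an explicit mapping, so the sketch runs without a real key
print(resolve_api_key(env={"OPENAI_API_KEY": "sk-example"}))
```

In the notebook this could be used as `os.environ["OPENAI_API_KEY"] = resolve_api_key(fallback=config.openai_key)`.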

In [2]:
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.llms import OpenAI
from langchain.chains import RetrievalQA
from langchain.document_loaders import TextLoader
from langchain.document_loaders import DirectoryLoader

## 2. Get multiple PDFs, convert them to txt documents and split them into chunks


In [10]:
import wget
wget.download('https://assets.scriptslug.com/live/pdf/scripts/pulp-fiction-1994.pdf')
wget.download('https://assets.scriptslug.com/live/pdf/scripts/reservoir-dogs-1992.pdf')
wget.download('https://assets.scriptslug.com/live/pdf/scripts/jackie-brown-1997.pdf')

100% [........................................................] 198596 / 198596

'jackie-brown-1997 (1).pdf'
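
The output `'jackie-brown-1997 (1).pdf'` suggests the file already existed, so re-running the cell produced a second copy that `glob('*.pdf')` would also pick up. A minimal sketch of one way to avoid that, using only the standard library: `download_if_missing` is a hypothetical helper, and whether the server accepts `urllib`'s default request headers is not something this sketch verifies.

```python
import os
from urllib.parse import urlparse
from urllib.request import urlretrieve


def download_if_missing(url, dest_dir=".", fetch=urlretrieve):
    """Download url into dest_dir unless a file with that name already exists."""
    filename = os.path.basename(urlparse(url).path)
    target = os.path.join(dest_dir, filename)
    if not os.path.exists(target):
        # fetch(url, target) is expected to save the response body to target
        fetch(url, target)
    return target
```

Calling `download_if_missing(url)` for each of the three script URLs would then leave exactly one PDF per script, however often the cell is run.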

In [15]:
import glob
my_pdfs = glob.glob('*.pdf')
my_pdfs

['pulp-fiction-1994.pdf', 'jackie-brown-1997.pdf', 'reservoir-dogs-1992.pdf']

In [16]:
# Convert each PDF to a .txt file with the same base name
from PyPDF2 import PdfReader
import os

for pdf_path in my_pdfs:
    reader = PdfReader(pdf_path)
    file_name, ext = os.path.splitext(pdf_path)

    # The context manager closes the file even if text extraction fails
    with open(file_name + ".txt", "w") as textfile:
        for page in reader.pages:
            textfile.write(page.extract_text())
            # '}' plus a newline marks the end of each page in the output
            textfile.write('}\n')

In [17]:
import glob
my_txts = glob.glob('*.txt')
my_txts

['pulp-fiction-1994.txt',
 'reservoir-dogs-1992.txt',
 'state_of_the_union.txt',
 'jackie-brown-1997.txt']
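
Note that the listing also contains `state_of_the_union.txt`, which is not one of the scripts, so a `*.txt` glob over the directory will load it too. A minimal sketch of one way to restrict the list to the converted scripts, assuming the `.txt` names are derived from the PDF names as in the conversion cell; `txt_paths_for` is a hypothetical helper.

```python
import os


def txt_paths_for(pdf_paths):
    """Map each PDF path to the .txt path that the conversion loop writes."""
    return [os.path.splitext(path)[0] + ".txt" for path in pdf_paths]


my_pdfs = ['pulp-fiction-1994.pdf', 'jackie-brown-1997.pdf', 'reservoir-dogs-1992.pdf']
script_txts = txt_paths_for(my_pdfs)
print(script_txts)
# ['pulp-fiction-1994.txt', 'jackie-brown-1997.txt', 'reservoir-dogs-1992.txt']
```

Each of these paths could then be loaded individually with `TextLoader(path).load()` instead of globbing the whole directory.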

In [18]:
# Load every .txt file in the current directory
# (for a single file, use: loader = TextLoader('single_text_file.txt'))
loader = DirectoryLoader('.', glob="*.txt", loader_cls=TextLoader)

documents = loader.load()

In [19]:
# Split the documents into chunks of up to 1000 characters, with 200 characters of overlap
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
texts = text_splitter.split_documents(documents)

In [20]:
len(texts)

591
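
To make `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified sketch: a fixed-size sliding window over characters. This is not how `RecursiveCharacterTextSplitter` works internally (it prefers to break on separators such as newlines); it only illustrates how consecutive chunks share text. `sliding_window_chunks` is a hypothetical name.

```python
def sliding_window_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Cut text into fixed-size windows; neighbours share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
for chunk in sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4):
    print(chunk)
# abcdefghij
# ghijklmnop
# mnopqrstuv
# stuvwxyz
```

The overlap means a line of dialogue that falls on a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.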

In [21]:
# show the fourth chunk (index 3)
texts[3]

Document(page_content='bank. You take more of a risk. Banks are\neasier! Federal banks aren\'t supposed to\nstop you anyway, during a robbery. They\'re\ninsured, why should they care? You don\'t\neven need a gun in a federal bank. I heard\nabout this guy, walked into a federal bank\nwith a portable phone, handed the phone to\nthe teller, the guy on the other end of\nthe phone said: "We got this guy\'s little\ngirl, and if you don\'t give him all your\nmoney, we\'re gonna kill \'er."\nYOUNG WOMAN\nDid it work?}\nYOUNG MAN\nFuckin\' A it worked, that\'s what I\'m\ntalkin\' about! Knucklehead walks in a bank\nwith a telephone, not a pistol, not a\nshotgun, but a fuckin\' phone, cleans the\nplace out, and they don\'t lift a fuckin\'\nfinger.\nYOUNG WOMAN\nDid they hurt the little girl?\nYOUNG MAN\nI don\'t know. There probably never was a\nlittle girl – the point of the story isn\'t\nthe little girl. The point of the story is\nthey robbed the bank with a telephone.\nYOUNG WOMAN\nYou wanna 

## 3. Create the vector database and store the texts

In [22]:
# Embed and store the texts
# Supplying a persist_directory will store the embeddings on disk
persist_directory = 'db'

# We use OpenAI embeddings here; local embeddings could be swapped in later
embedding = OpenAIEmbeddings()

vectordb = Chroma.from_documents(documents=texts, 
                                 embedding=embedding,
                                 persist_directory=persist_directory)

Using embedded DuckDB with persistence: data will be stored in: db


In [23]:
# Persist the db to disk, then drop the in-memory reference
vectordb.persist()
vectordb = None

In [24]:
# Now we can load the persisted database from disk, and use it as normal. 
vectordb = Chroma(persist_directory=persist_directory, 
                  embedding_function=embedding)

Using embedded DuckDB with persistence: data will be stored in: db


## 4. Retrieve the data from the database

In [25]:
retriever = vectordb.as_retriever()

In [27]:
docs = retriever.get_relevant_documents("Who is vincent vega?")
docs

[Document(page_content='Blessed is he who, in the name of charity\nand good will, shepherds the weak through\nthe valley of darkness, for he is truly\nhis brother\'s keeper and the finder of\nlost children. And I will strike down upon\nthee with great vengeance and furious\nanger those who attempt to poison and\ndestroy my brothers. And you will know my\nname is the Lord when I lay my vengeance\nupon you."\nThe two men EMPTY their guns at the same time on the sitting\nBrett.\nAGAINST BLACK, TITLE CARD:\n "VINCENT VEGA AND MARSELLUS WALLACE\'S WIFE"\nFADE IN:\nMEDIUM SHOT – BUTCH COOLIDGE\nWe FADE UP on BUTCH COOLIDGE, a white, 26-year-old\nprizefighter.Butch sits at a table wearing a red and blue high\nschool athletic jacket. Talking to him OFF SCREEN is everybody\'s\nboss MARSELLUS WALLACE. The black man sounds like a cross between\na gangster and a king.\nMARSELLUS (O.S.)\nI think you\'re gonna find – when all this\nshit is over and done – I think you\'re\ngonna find yourself one smi

In [28]:
len(docs)

4

In [29]:
retriever = vectordb.as_retriever(search_kwargs={"k": 2})

In [21]:
retriever.search_type

'similarity'

In [22]:
retriever.search_kwargs

{'k': 2}
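
A `'similarity'` search with `k = 2` returns the two stored chunks whose embeddings are closest to the query embedding. The sketch below illustrates the idea with toy two-dimensional vectors and cosine similarity. This is an assumption made for illustration: the distance measure Chroma actually applies depends on its configuration, and real embeddings have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return the indices of the k vectors most similar to query_vec, best first."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine_similarity(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.5, 0.5]]
print(top_k([1.0, 0.0], toy_vectors, k=2))
# [0, 1]
```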

## 5. Make a chain to ask the LLM the question

In [30]:
# create the chain to answer questions 
qa_chain = RetrievalQA.from_chain_type(llm=OpenAI(), 
                                  chain_type="stuff", 
                                  retriever=retriever, 
                                  return_source_documents=True)
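
With `chain_type="stuff"`, all retrieved chunks are placed ("stuffed") into a single prompt together with the question. The sketch below shows the idea in plain Python; the prompt wording is a made-up placeholder, not LangChain's actual template (that can be printed with `qa_chain.combine_documents_chain.llm_chain.prompt.template`), and `build_stuff_prompt` is a hypothetical name.

```python
def build_stuff_prompt(question, chunks):
    """Join all retrieved chunks into one context block placed before the question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"  # placeholder wording
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


print(build_stuff_prompt("Who is Mr. Orange?", ["first chunk", "second chunk"]))
```

Because everything goes into one prompt, the number of chunks (`k`) times the chunk size has to stay within the model's context window.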

In [31]:
## Cite sources
def process_llm_response(llm_response):
    print(llm_response['result'])
    print('\n\nSources:')
    for source in llm_response["source_documents"]:
        print(source.metadata['source'])

In [43]:
# first question
query = "Who is Vincent Vega?"
llm_response = qa_chain(query)
process_llm_response(llm_response)

 Vincent Vega is the person that Missus Mia Wallace introduces to Ed Sullivan and dances with to Chuck Berry's "YOU NEVER CAN TELL".


Sources:
pulp-fiction-1994.txt
pulp-fiction-1994.txt
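
Both retrieved chunks come from the same file, so the source is printed twice. A minimal sketch of how the listing could be de-duplicated while keeping first-seen order; `unique_sources` is a hypothetical helper, and the `SimpleNamespace` objects merely stand in for the returned documents, which expose a `.metadata` dict.

```python
from types import SimpleNamespace


def unique_sources(llm_response):
    """Return the source file names in first-seen order, without repeats."""
    seen = []
    for doc in llm_response["source_documents"]:
        source = doc.metadata["source"]
        if source not in seen:
            seen.append(source)
    return seen


# Stand-in response that mimics the structure used by process_llm_response
fake_response = {
    "result": "...",
    "source_documents": [
        SimpleNamespace(metadata={"source": "pulp-fiction-1994.txt"}),
        SimpleNamespace(metadata={"source": "pulp-fiction-1994.txt"}),
    ],
}
print(unique_sources(fake_response))
# ['pulp-fiction-1994.txt']
```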


In [35]:
# Second question
query = "Who is Jackie Brown?"
llm_response = qa_chain(query)
process_llm_response(llm_response)

 Jackie Brown is a woman who was arrested and is being bailed out by Max Cherry.


Sources:
jackie-brown-1997.txt
jackie-brown-1997.txt


In [37]:
# Third question
query = "Who is Mr. Orange?"
llm_response = qa_chain(query)
process_llm_response(llm_response)

 Mr. Orange is a cop.


Sources:
reservoir-dogs-1992.txt
reservoir-dogs-1992.txt


In [None]:
qa_chain.retriever.search_type , qa_chain.retriever.vectorstore

In [None]:
print(qa_chain.combine_documents_chain.llm_chain.prompt.template)

## 6. Deleting the DB

In [None]:
!zip -r db.zip ./db

In [None]:
# To cleanup, you can delete the collection
vectordb.delete_collection()
vectordb.persist()

# delete the directory
!rm -rf db/
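
The `!zip` and `!rm -rf` shell commands assume a Unix-like environment. As a rough, portable sketch of the same two steps using only the standard library (`archive_and_remove` is a hypothetical helper, not part of Chroma or LangChain):

```python
import os
import shutil


def archive_and_remove(directory, archive_base):
    """Zip `directory` into `archive_base` + '.zip', then delete the directory."""
    directory = os.path.abspath(directory)
    parent, name = os.path.split(directory)
    # base_dir keeps the top-level folder name inside the archive,
    # so unpacking recreates the folder itself
    archive_path = shutil.make_archive(
        os.path.abspath(archive_base), "zip", root_dir=parent, base_dir=name
    )
    shutil.rmtree(directory)
    return archive_path
```

For example, `archive_and_remove("db", "db")` would write `db.zip` and remove `db/`; `shutil.unpack_archive("db.zip", ".")` restores the folder later.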

## Starting again: loading the DB from disk

Restart the runtime, then run the cells below to restore the database from the zip archive.

In [None]:
!unzip db.zip

In [None]:
import os

os.environ["OPENAI_API_KEY"] = ""  # fill in your OpenAI API key

In [None]:
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings

from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

In [None]:
persist_directory = 'db'
embedding = OpenAIEmbeddings()

vectordb2 = Chroma(persist_directory=persist_directory, 
                  embedding_function=embedding,
                   )

retriever = vectordb2.as_retriever(search_kwargs={"k": 2})

In [None]:
# Set up the turbo LLM
turbo_llm = ChatOpenAI(
    temperature=0,
    model_name='gpt-3.5-turbo'
)

In [None]:
# create the chain to answer questions 
qa_chain = RetrievalQA.from_chain_type(llm=turbo_llm, 
                                  chain_type="stuff", 
                                  retriever=retriever, 
                                  return_source_documents=True)

In [None]:
## Cite sources
def process_llm_response(llm_response):
    print(llm_response['result'])
    print('\n\nSources:')
    for source in llm_response["source_documents"]:
        print(source.metadata['source'])

In [None]:
# full example, using a question the scripts in the database can answer
query = "Who is Vincent Vega?"
llm_response = qa_chain(query)
process_llm_response(llm_response)

### Chat prompts

In [None]:
print(qa_chain.combine_documents_chain.llm_chain.prompt.messages[0].prompt.template)

In [None]:
print(qa_chain.combine_documents_chain.llm_chain.prompt.messages[1].prompt.template)