# Q&A with multiple text files

With this notebook you can do question answering on three film scripts (Pulp Fiction, Reservoir Dogs and Jackie Brown).

### Sources
https://www.youtube.com/watch?v=DXmiJKrQIvg



### Contents
0. Install packages
1. Settings
2. Download PDFs and convert to txt
3. Load multiple txt files with Langchain into Chroma vectorstore
4. Q&A with the documents

## 0. Install packages

In [1]:
!pip install langchain
!pip install unstructured
!pip install openai
!pip install chromadb
!pip install Cython
!pip install tiktoken



Collecting requests>=2.28
  Downloading requests-2.31.0-py3-none-any.whl (62 kB)
Installing collected packages: requests
  Attempting uninstall: requests
    Found existing installation: requests 2.25.1
    Uninstalling requests-2.25.1:
      Successfully uninstalled requests-2.25.1
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
weaviate-client 3.0.0 requires requests<2.26.0,>=2.23.0, but you have requests 2.31.0 which is incompatible.
datasets 2.11.0 requires tqdm>=4.62.1, but you have tqdm 4.61.2 which is incompatible.
conda-repo-cli 1.0.27 requires clyent==1.2.1, but you have clyent 1.2.2 which is incompatible.
conda-repo-cli 1.0.27 requires nbformat==5.4.0, but you have nbformat 5.7.0 which is incompatible.
conda-repo-cli 1.0.27 requires requests==2.28.1, but you have requests 2.31.0 which is incompatible.
Successfully installed requests-2.31.0


In [2]:
# PyPDF2 converts the PDFs to text; wget downloads them in section 2
%pip install PyPDF2 wget

Note: you may need to restart the kernel to use updated packages.


## 1. Settings

In [4]:
#Load Required Packages
#from langchain.document_loaders import UnstructuredPDFLoader
from langchain.indexes import VectorstoreIndexCreator
from langchain.document_loaders import TextLoader

In [5]:
# Get your API key from OpenAI; you will need to create an account.
# Here is the link to get the key: https://platform.openai.com/account/billing/overview
import os
import config  # expected to be a local config.py that defines openai_key
os.environ["OPENAI_API_KEY"] = config.openai_key
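
As an alternative to a local `config.py`, the key can be read from an environment variable that is set outside the notebook. The sketch below is only one way to do this; `get_openai_key` is a hypothetical helper name and is not part of the OpenAI or LangChain packages.

```python
import os

def get_openai_key(env_var="OPENAI_API_KEY"):
    """Return the API key stored in an environment variable (hypothetical helper)."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        # Fail early with a clear message instead of a later authentication error
        raise RuntimeError(f"Set the {env_var} environment variable before running the notebook")
    return key
```

Failing early keeps a missing key from surfacing later as a less obvious error inside a chain call.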

## 2. Download PDFs and convert to txt

In [3]:
import wget
wget.download('https://assets.scriptslug.com/live/pdf/scripts/pulp-fiction-1994.pdf')
wget.download('https://assets.scriptslug.com/live/pdf/scripts/reservoir-dogs-1992.pdf')
wget.download('https://assets.scriptslug.com/live/pdf/scripts/jackie-brown-1997.pdf')

100% [........................................................] 198596 / 198596

'jackie-brown-1997 (1).pdf'
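
The returned name `'jackie-brown-1997 (1).pdf'` suggests the file had already been downloaded once, so a second copy was saved under a new name. Below is a minimal sketch of skipping files that already exist, using only the standard library. `download_if_missing` and its `fetch` parameter are hypothetical names introduced here, and `fetch` defaults to `urllib.request.urlretrieve` rather than the `wget` package used above.

```python
import os
from urllib.parse import urlparse
from urllib.request import urlretrieve

def download_if_missing(url, folder=".", fetch=urlretrieve):
    """Download url into folder unless a file with that name already exists."""
    # Use the last path segment of the URL as the local file name
    file_name = os.path.basename(urlparse(url).path)
    target = os.path.join(folder, file_name)
    if not os.path.exists(target):
        fetch(url, target)
    return target
```

Re-running the cell would then leave the existing PDFs in place instead of creating numbered copies.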

In [1]:
import glob
my_pdfs = glob.glob('*.pdf')
my_pdfs

['pulp-fiction-1994.pdf', 'jackie-brown-1997.pdf', 'reservoir-dogs-1992.pdf']

In [3]:
# Convert each PDF to a text file with the same base name
from PyPDF2 import PdfReader
import os

for pdf_path in my_pdfs:
    reader = PdfReader(pdf_path)
    file_name, ext = os.path.splitext(pdf_path)

    # The context manager closes the file even if extraction fails
    with open(file_name + ".txt", "w") as textfile:
        for page in reader.pages:
            # Guard against a page with no extractable text
            textfile.write(page.extract_text() or "")
            # '}' marks the end of each page in the output
            textfile.write('}\n')

In [4]:
import glob
my_txts = glob.glob('*.txt')
my_txts

['pulp-fiction-1994.txt', 'reservoir-dogs-1992.txt', 'jackie-brown-1997.txt']

In [5]:
pwd

'/Users/michielbontenbal/Documents/GitHub/AI_advanced'

## 3. Load multiple txt files with Langchain into Chroma vectorstore

In [6]:
from langchain.document_loaders import DirectoryLoader
# Load every .txt file in the current directory, one Document per file
loader = DirectoryLoader('.', glob="*.txt", loader_cls=TextLoader)

documents = loader.load()

In [8]:
documents

[Document(page_content='Quentin Tarantino\'s\nR E S E R V O I R  D O G S\nOctober 22, 1990}\nWRITTEN AND DIRECTED\nBY\nQUENTIN TARANTINOii.}\nThis movie is dedicated to these following sources of\ninspiration:\nTIMOTHY CAREY\nROGER CORMANANDRE DeTOTH\n   CHOW YUEN FAT JEAN LUC GODDARD\n  JEAN PIERRE MELVILLE    LAWRENCE TIERNEY\n   LIONEL WHITEiii.}\nINT. UNCLE BOB\'S PANCAKE HOUSE - MORNING 1 1\nEight men dressed in BLACK SUITS, sit around a table at a \nbreakfast cafe.  They are MR. WHITE, MR. PINK, MR. BLUE,\nMR. BLONDE, MR. ORANGE, MR. BROWN, NICE GUY EDDIE CABOT,and the big boss, JOE CABOT.  Most are finished eating and \nare enjoying coffee and conversation.  Joe flips through a small address book.  Mr. Pink is telling a long and involved story about Madonna.\nMR. PINK\n"Like a Virgin" is all about a girl who digs a guy with a big dick.  The whole song is a metaphor for big dicks.\nMR. BLUE\nNo it\'s not.  It\'s about a girl who is very vulnerable and she\'s been fucked over a fe

In [10]:
#splitting the documents into text chunks
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
texts = text_splitter.split_documents(documents)
len(texts)

653
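
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of fixed-size chunking with overlap. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraphs and newlines). `split_with_overlap` is a hypothetical function that only illustrates how consecutive chunks share text.

```python
def split_with_overlap(text, chunk_size=1000, chunk_overlap=200):
    """Cut text into chunks of at most chunk_size characters that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

The overlap means a sentence cut at a chunk boundary still appears whole in one of the neighbouring chunks.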

### What is a Chroma vector store?
Chroma is a vector store: it indexes embeddings and lets you search them by similarity.


There are three main steps after the documents are loaded:

- Splitting the documents into chunks

- Creating an embedding for each chunk

- Storing the chunks and their embeddings in the vector store
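
The embedding and storage steps can be sketched with a toy in-memory store. This is an illustration only and not Chroma's implementation: `embed` is a hypothetical stand-in that counts words instead of calling an embedding model, `ToyVectorStore` is a made-up class name, and the three sentences are invented example chunks.

```python
import math
from collections import Counter

def embed(text):
    """Hypothetical stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class ToyVectorStore:
    def __init__(self):
        self.items = []  # list of (chunk, embedding) pairs

    def add(self, chunks):
        # Embed each chunk and store the vector next to its text
        for chunk in chunks:
            self.items.append((chunk, embed(chunk)))

    def search(self, query, k=1):
        # Rank stored chunks by similarity to the query embedding
        query_vec = embed(query)
        ranked = sorted(self.items, key=lambda item: cosine(query_vec, item[1]), reverse=True)
        return [chunk for chunk, _ in ranked[:k]]

store = ToyVectorStore()
store.add(["vincent and jules drive to the apartment",
           "mr pink refuses to tip the waitress",
           "jackie brown works as a flight attendant"])
print(store.search("who refuses to tip", k=1))  # ['mr pink refuses to tip the waitress']
```

A real embedding model places texts with similar meaning close together even when they share no words, which a word count cannot do.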


In [11]:
from langchain.vectorstores import Chroma
vectordb = Chroma.from_documents(documents=texts)

Using embedded DuckDB without persistence: data will be transient
No embedding_function provided, using default embedding function: SentenceTransformerEmbeddingFunction


## 4. Q&A with the documents

In [12]:
from langchain.chains.question_answering import load_qa_chain
from langchain.llms import OpenAI

In [13]:
# Create the chain; chain_type="stuff" puts every input document into one prompt
chain = load_qa_chain(OpenAI(temperature=0), chain_type="stuff")
# Create your query
query = "Who is Vincent Vega?"
# Passing all 653 chunks at once exceeds the model's 4097-token context limit
# (InvalidRequestError), so retrieve only the chunks most similar to the query
docs = vectordb.similarity_search(query, k=4)
chain.run(input_documents=docs, question=query)

In [14]:
index.query_with_sources('Who is vincent vega?')

NameError: name 'index' is not defined

In [15]:
pwd

'/Users/michielbontenbal/Documents/GitHub/chat_with_documents'

In [19]:
pdf_folder_path = '/Users/michielbontenbal/Documents/GitHub/chat_with_documents/PDF'
os.listdir(pdf_folder_path)

['pulp-fiction-1994.pdf',
 'jackie-brown-1997.pdf',
 '.ipynb_checkpoints',
 'reservoir-dogs-1992.pdf']

In [20]:
!rm .ipynb_checkpoints

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
rm: .ipynb_checkpoints: is a directory


In [17]:
# location of the pdf file/files. 
from langchain.document_loaders import UnstructuredPDFLoader
loaders = [UnstructuredPDFLoader(os.path.join(pdf_folder_path, fn)) for fn in os.listdir(pdf_folder_path)]
index = VectorstoreIndexCreator().from_loaders(loaders)
index

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

IsADirectoryError: [Errno 21] Is a directory: '/Users/michielbontenbal/Documents/GitHub/chat_with_documents/PDF/.ipynb_checkpoints'
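
The `IsADirectoryError` above arises because `os.listdir` returns every entry in the folder, including the `.ipynb_checkpoints` directory, and each entry is handed to a PDF loader. A minimal sketch of one way around this is to keep only regular files ending in `.pdf`; `list_pdf_paths` is a hypothetical helper name.

```python
import os

def list_pdf_paths(folder):
    """Return sorted paths of the regular .pdf files directly inside folder."""
    paths = []
    for name in os.listdir(folder):
        path = os.path.join(folder, name)
        # Skip directories such as .ipynb_checkpoints and any non-PDF files
        if os.path.isfile(path) and name.lower().endswith(".pdf"):
            paths.append(path)
    return sorted(paths)
```

The loaders could then be built from `list_pdf_paths(pdf_folder_path)` instead of joining each name from `os.listdir(pdf_folder_path)`.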

In [18]:
index.query('How was the GPT4all model trained?')

NameError: name 'index' is not defined

In [None]:
index.query_with_sources('How was the GPT4all model trained?')

{'question': 'How was the GPT4all model trained?',
 'answer': ' The GPT4all model was trained with LoRA on 437,605 post-processed examples for four epochs. The data was collected using the GPT-3.5-Turbo OpenAI API and was curated to ensure a diverse distribution of prompt topics and model responses.\n\n',
 'sources': '/content/gdrive/My Drive/data_2/2023_GPT4All_Technical_Report.pdf'}

In [None]:
index.query_with_sources('Who wrote the lip sync paper? ')

{'question': 'Who wrote the lip sync paper? ',
 'answer': ' The lip sync paper was written by K R Prajwal, Vinay P. Namboodiri, Rudrabha Mukhopadhyay, and C V Jawahar.\n',
 'sources': '/content/gdrive/My Drive/data_2/2008.10010.pdf'}

In [None]:
index.query_with_sources('How was the GPT4all model trained?')

{'question': 'How was the GPT4all model trained?',
 'answer': ' The GPT4all model was trained with LoRA on 437,605 post-processed examples for four epochs. Detailed model hyper-parameters and training code can be found in the associated repository and model training log.\n',
 'sources': '/content/gdrive/My Drive/data_2/2023_GPT4All_Technical_Report.pdf'}

## Disclaimer:
Note: OpenAI provides a free API key for initial testing. Once you move to a paid subscription, calling the API as demonstrated in this example will incur charges. Refer to OpenAI's pricing information for details.

Be aware that information you send to OpenAI's LLM, such as files, can become public when used in the way this demo demonstrates. Refer to OpenAI's usage policy for details.

Be careful about using this with your own (private) data, because you will be handing that data over to OpenAI. This demo is for educational purposes only and for demonstrating machine learning methods. The author makes no claims that the outcomes shown here, or any outcomes that could be produced by this method, are accurate or reliable.