# **For the Goonies: Learn How To Chat with your PDF Using OpenAI, LangChain, and Faiss**
From the [Blueprint Technologies](https://www.bpcs.com) LLM Center of Excellence.

![The Goonies](https://i.gifer.com/4y4.gif)

*This notebook assumes you know how to navigate a Google Colab notebook. If you need an overview, check [this](https://web.eecs.umich.edu/~justincj/teaching/eecs442/WI2021/colab.html) out.*

*This notebook requires an OpenAI API key. You can get one via a free trial with OpenAI [here](https://platform.openai.com/account/api-keys).*

This is a notebook for those who want to learn about working with LLMs. It's entirely self-contained and only requires you to upload your PDF to the Files panel on the left.

![Example of uploading](https://miro.medium.com/v2/resize:fit:846/1*TYvbH2G9G6JtLUVlDsyp9w.png)





# I. The Adventure Starts Here
![It's our time down here](https://y.yarn.co/6d1f234b-cbcd-4fd3-8e4b-4c9aa62d64b3_text.gif)

#### **Background**
It's time to install the packages and import the libraries needed. The main ones to be aware of:
- [LangChain](https://python.langchain.com/en/latest/index.html) - for chunking and creating our question/answer chain
- [Faiss](https://github.com/facebookresearch/faiss) - for similarity search and vector index
- [OpenAI](https://platform.openai.com/docs/models/overview) - for creating our embeddings and natural language interaction

### Step 1: Install the packages we need

In [None]:
!pip install -q langchain pypdf pandas matplotlib tiktoken textract transformers openai faiss-cpu

### Step 2: Import the libraries we need

In [None]:
import os
import pandas as pd
import matplotlib.pyplot as plt
from transformers import GPT2TokenizerFast
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.chains.question_answering import load_qa_chain
from langchain.llms import OpenAI
from langchain.chains import ConversationalRetrievalChain

### Step 3: Set your OpenAI API key

In [None]:
os.environ["OPENAI_API_KEY"] = "PASTE YOUR OPENAI API KEY HERE"

# II. The Chunking

![Truffle Shuffle](https://media0.giphy.com/media/GHcm2aWIczatG/200w.gif?cid=6c09b95282hparla8to2yq6klqeddy1xkorzztyimkupr7o7&ep=v1_gifs_search&rid=200w.gif&ct=g)


#### **Background**
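Basically, chunking a file means breaking its text into smaller pieces (chunks). Here each chunk's size is measured in tokens, the units of text a language model reads.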

![The basics of chunking](https://www.pinecone.io/images/chunking-doc.png)


#### **Terms**
- **Chunking**: in this notebook, splitting a long text into smaller, slightly overlapping pieces that fit within a model's context window. (The term also has an older NLP meaning: extracting phrases such as noun groups and verb groups from a sentence. [Read more](https://towardsdatascience.com/chunking-in-nlp-decoded-b4a71b2b4e24))
- **LangChain Documents**: objects that hold a piece of text (`page_content`) plus optional `metadata`; they are what we pass to the vector index and the language model.
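
The splitting idea can be sketched in a few lines of plain Python. This is a simplified illustration, not what LangChain does internally: it assumes whitespace-separated words stand in for tokens, and the helper name `chunk_words` is hypothetical. `RecursiveCharacterTextSplitter` instead splits on separators such as paragraph breaks and newlines, and measures length with whatever `length_function` you give it.

```python
def chunk_words(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into chunks of at most chunk_size words.

    Neighboring chunks share chunk_overlap words so that context
    is not lost at the boundaries.
    """
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")

    words = text.split()              # assumption: one word ~ one token
    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break                      # the last window reached the end
    return chunks


sample = "one two three four five six seven eight nine ten"
toy_chunks = chunk_words(sample, chunk_size=4, chunk_overlap=1)
for chunk in toy_chunks:
    print(chunk)
```

Each chunk starts with the last word of the previous one. That shared word is the overlap, which plays the same role as `chunk_overlap` in the splitter used in this notebook.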

In [None]:
# Upload your PDF to this workspace Files folder.

# That's right, make sure your PDF is uploaded. You should see it to the left under Files. I used the PDF of the whitepaper at https://www.cidrdb.org/cidr2023/papers/p92-jain.pdf.

In [None]:
# Update with name (no .pdf extension) of your PDF file
originalPDF = "./p92-jain"

# Convert PDF to text
import textract
doc = textract.process(originalPDF + '.pdf')

# Save text to .txt and reopen (explicit UTF-8 so this behaves the same on any platform)
with open(originalPDF + '.txt', 'w', encoding='utf-8') as f:
    f.write(doc.decode('utf-8'))

with open(originalPDF + '.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# Count tokens with the GPT-2 tokenizer from transformers imported earlier
# (an approximation of the token counts OpenAI models see)
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

def count_tokens(text: str) -> int:
    return len(tokenizer.encode(text))

# Split text into chunks via LangChain
text_splitter = RecursiveCharacterTextSplitter(
    # Aim for chunks of at most 512 tokens, with a 24-token overlap between neighbors.
    chunk_size = 512,
    chunk_overlap = 24,
    length_function = count_tokens,
)
# Convert chunks into LangChain document objects
chunks = text_splitter.create_documents([text])

# Result is a list of LangChain Document objects, each with a token count around the chunk_size you defined (LangChain's RecursiveCharacterTextSplitter sometimes allows more tokens to retain context)
print("Your PDF has been chunked into", len(chunks), "documents.")

In [None]:
# Let's visualize The Chunking

# Create list of token counts
token_counts = [count_tokens(chunk.page_content) for chunk in chunks]

# Create a pandas DataFrame from the token counts
df = pd.DataFrame({'Token Count': token_counts})

# Create a histogram of the token count distribution
df.hist(bins=40)

# Show the plot
plt.show()

# III. The Data
![Data on zipline](https://64.media.tumblr.com/2d5ffe083dd70a4c1660aa839774482e/tumblr_osppf6rYBe1r59rp1o5_r1_250.gif)

#### **Background**
ML algorithms need numbers to work with. Vector embeddings are basically content (such as text chunks) converted/reduced into lists of numbers. Learn more about vector embeddings [here](https://www.pinecone.io/learn/vector-embeddings/).

Specifically for our PDF chat, we have taken the PDF file and put it through The Chunking. Now we will convert those chunks into embeddings and store them in a vector index.

![](https://pbs.twimg.com/media/Ftb3YhiX0AMl5m8.jpg:large)

#### **Terms**
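- **Vector embeddings**: lists of numbers that represent content (such as our text chunks), arranged so that similar content ends up with similar numbers.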
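
To make "similar content, similar numbers" concrete, here is a minimal sketch using cosine similarity. The three-dimensional vectors below are made up for illustration (real embeddings come from the model and have hundreds or thousands of dimensions), and `cosine_similarity` is a helper written for this example, not a LangChain or Faiss function.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-made toy vectors: NOT produced by any embedding model.
toy_embeddings = {
    "treasure map": [0.9, 0.1, 0.0],
    "pirate gold": [0.8, 0.2, 0.1],
    "database index": [0.0, 0.1, 0.9],
}

sim_related = cosine_similarity(toy_embeddings["treasure map"], toy_embeddings["pirate gold"])
sim_unrelated = cosine_similarity(toy_embeddings["treasure map"], toy_embeddings["database index"])
print(f"related: {sim_related:.3f}, unrelated: {sim_unrelated:.3f}")
```

The two treasure-themed vectors point in nearly the same direction, so their score is close to 1, while the unrelated pair scores close to 0.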

In [None]:
# Set up the OpenAI embedding model (the embeddings themselves are computed when the index is built)
embeddings = OpenAIEmbeddings()

In [None]:
# Create a Faiss vector index by embedding each chunk with the model above
db = FAISS.from_documents(chunks, embeddings)

# IV. The Ride
![Get on the bike](https://j.gifs.com/KYzMln.gif)

Time to get rolling! Here we will test the similarity search over the embeddings created in the previous step, then try asking a question of our PDF using LangChain and OpenAI.
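
Before running the real thing, here is a rough sketch of the two steps involved: find the chunks closest to the question, then "stuff" them into one prompt for the model. Everything here is a toy under stated assumptions: the vectors are hand-made, the helpers `top_k` and `build_stuffed_prompt` are hypothetical, ranking by squared L2 distance is assumed for simplicity, and the prompt wording is not LangChain's actual template.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two vectors (smaller = more similar)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(query_vec, index, k=2):
    """Return the texts of the k index entries closest to query_vec."""
    ranked = sorted(index, key=lambda entry: squared_l2(query_vec, entry[1]))
    return [text for text, _ in ranked[:k]]


def build_stuffed_prompt(question, passages):
    """Place all retrieved passages into a single prompt (hypothetical wording)."""
    context = "\n\n".join(passages)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )


# Toy index of (chunk text, hand-made vector) pairs.
toy_index = [
    ("The map shows where the treasure is buried.", [0.9, 0.1]),
    ("The weather was rainy all week.", [0.1, 0.9]),
    ("The treasure is hidden in a cave.", [0.8, 0.3]),
]

question = "Where is the treasure?"
query_vec = [1.0, 0.0]  # assumption: pretend this is the question's embedding

hits = top_k(query_vec, toy_index, k=2)
prompt = build_stuffed_prompt(question, hits)
print(prompt)
```

Only the two treasure chunks make it into the prompt; the weather chunk is too far from the query vector. In the notebook, the embedding model and Faiss handle the first step, and the question answering chain handles the second.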

In [None]:
# Setup and check the Faiss similarity search of the PDF
query = "What is the name of this?"
docs = db.similarity_search(query)
docs[0]

In [None]:
# Create a question answering chain with LangChain and run it on the Faiss similarity search results
# (chain_type="stuff" places all retrieved documents into a single prompt)
chain = load_qa_chain(OpenAI(temperature=0), chain_type="stuff")

# Now run it. If all goes well, you will see a more natural response to the query we set up.
chain.run(input_documents=docs, question=query)

# V. The Mouth
![Mouth](https://64.media.tumblr.com/cc419bf2f758986e69dda3b4643bcfb8/tumblr_osmoo8EqYc1r59rp1o9_r2_250.gif)

It has come down to this. We will combine the vector index we created earlier with some LangChain and OpenAI goodness to create a chatbot. With this chatbot, we can keep asking questions of our PDF, and it will even keep a history of the questions we ask.
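
Why does the history matter? A follow-up such as "Where did he hide it?" only makes sense alongside the earlier turns. One common approach, sketched below, is to ask the model to rewrite the follow-up as a standalone question before searching the index. This is a simplified illustration: the helper names and the prompt wording are hypothetical, not the exact text LangChain sends.

```python
def format_history(chat_history):
    """Render a list of (question, answer) tuples as plain text."""
    return "\n".join(f"Human: {q}\nAssistant: {a}" for q, a in chat_history)


def build_condense_prompt(chat_history, follow_up):
    """Build a prompt asking the model to make the follow-up self-contained."""
    return (
        "Given the conversation below, rephrase the follow-up question "
        "as a standalone question.\n\n"
        f"Chat history:\n{format_history(chat_history)}\n\n"
        f"Follow-up question: {follow_up}\nStandalone question:"
    )


# Same shape as the chat_history list used in this notebook: (question, answer) tuples.
toy_history = [("Who hid the treasure?", "A pirate captain.")]

condense_prompt = build_condense_prompt(toy_history, "Where did he hide it?")
print(condense_prompt)
```

The printed prompt contains both the earlier exchange and the new question, which is what lets a model resolve "he" and "it" before the retriever runs.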

In [None]:
from IPython.display import display
import ipywidgets as widgets

# Create a conversation chain that uses our vector index as the retriever; this also allows for chat history management
qa = ConversationalRetrievalChain.from_llm(OpenAI(temperature=0.1), db.as_retriever())

In [None]:
chat_history = []

def on_submit(_):
    query = input_box.value
    input_box.value = ""

    if query.lower() == 'exit':
        print("Hey, if you find a 50 dollar bill, let me know...else, have a nice day.")
        return

    result = qa({"question": query, "chat_history": chat_history})
    chat_history.append((query, result['answer']))

    display(widgets.HTML(f'<b>User:</b> {query}'))
    display(widgets.HTML(f'<b><font color="blue">Data:</font></b> {result["answer"]}'))

print("Welcome to your PDF chatbot! Type 'exit' to stop.")

input_box = widgets.Text(placeholder='Enter your question:')
input_box.on_submit(on_submit)

display(input_box)