## Ask questions to your PDF

This notebook lets you ask questions about the contents of a PDF.

We will leverage LangChain's abstractions to perform the following workflow:

### Extract text from a PDF
Both local and remote files are supported. We will use UnstructuredPDFLoader, which returns the whole file as a single document, so the text is only chunked once, in the splitting step.

### Text Splitting
When dealing with large text files, text splitting is a must to avoid exceeding the token limit of the LLM. We will divide the text into smaller pieces, leaving some overlap between them so that context is preserved across chunk boundaries.

### Stage our Embedding model
The open-source Instructor embedding model is a strong choice for text embedding. We will use it to embed our text chunks, which gives us a vector representation of each chunk. Each vector has 768 dimensions and captures the semantic content of the text.

### Vector Database

We will use a vector database to store our vectors, which lets us run fast similarity searches over them. We will use Pinecone for this; FAISS is a self-hosted alternative.
Once our Pinecone index is created, we can store the embeddings produced by the embedding model in it.
These vectors hold the information from our PDF file.

### Question Embedding
We will use the same embedding model to embed our question, which gives us a vector representation of the question. With it we can run a similarity search to retrieve the text chunks most similar to the question.
These chunks serve as the context the LLM uses to predict the answer.
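
As a rough illustration of the similarity check described above, here is a minimal sketch in plain Python. It assumes cosine similarity as the metric (the metric actually used depends on how the vector index is configured), and the 3-dimensional vectors are made-up placeholders, not real Instructor embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=2):
    """Return the indices of the k chunk vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), i) for i, vec in enumerate(chunk_vecs)]
    scored.sort(reverse=True)  # highest similarity first
    return [i for _, i in scored[:k]]


# Hypothetical toy vectors standing in for chunk and question embeddings.
chunk_vectors = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
question_vector = [1.0, 0.05, 0.0]

nearest = top_k(question_vector, chunk_vectors, k=2)
print(nearest)
```

A vector database performs the same kind of ranking, but with an index that avoids comparing the question against every stored vector.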

### Language Model

We will create a chain that performs the following steps:
embed a question -> perform a similarity check against the text corpus -> feed the context to the llm -> get a prediction



## Project dependencies


In [None]:

# !pip install pypdf unstructured langchain huggingface-hub openai transformers sentence-transformers InstructorEmbedding pinecone-client python-dotenv


In [53]:
# PDF Loaders. If unstructured gives you a hard time, try PyPDFLoader
from langchain.document_loaders import UnstructuredPDFLoader, OnlinePDFLoader, PyPDFLoader
from langchain.vectorstores import Chroma, Pinecone
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.embeddings.huggingface import HuggingFaceInstructEmbeddings
from langchain.llms import OpenAI, HuggingFaceHub
from langchain.chains.question_answering import load_qa_chain
import pinecone
from langchain.text_splitter import RecursiveCharacterTextSplitter
from dotenv import load_dotenv
import os

### Load your PDF file

Using a LangChain PDF loader, we can load our PDF file. The loader returns a list of documents, each holding a piece of the PDF's text in its page_content attribute. PyPDFLoader and its variants return one document per page, while UnstructuredPDFLoader returns the whole file as a single document.

In [54]:
loader = UnstructuredPDFLoader("./data/constitution.pdf")

# Other options for loaders:
# loader = PyPDFLoader("../../data/constitution.pdf")
# loader = OnlinePDFLoader("www.example.com/constitution.pdf")
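
Since both local and remote files are supported, the choice of loader can be made from the source string itself. The sketch below is one possible way to do that; `is_remote_pdf` and `make_loader` are hypothetical helper names, not part of LangChain, and the rule "an http or https scheme means remote" is an assumption.

```python
from urllib.parse import urlparse


def is_remote_pdf(source: str) -> bool:
    """Treat a source as remote only if it has an explicit http(s) scheme."""
    return urlparse(source).scheme in ("http", "https")


def make_loader(source: str):
    """Pick a LangChain loader for the source (hypothetical helper)."""
    # Imported lazily so is_remote_pdf can be used without LangChain installed.
    from langchain.document_loaders import OnlinePDFLoader, UnstructuredPDFLoader

    if is_remote_pdf(source):
        return OnlinePDFLoader(source)
    return UnstructuredPDFLoader(source)


print(is_remote_pdf("./data/constitution.pdf"))
print(is_remote_pdf("https://www.example.com/constitution.pdf"))
```

Under this rule a string without a scheme, such as "www.example.com/constitution.pdf", is treated as a local path, so remote sources need to be written as full URLs.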

In [55]:
data = loader.load()

In [56]:
# Note: If you're using PyPDFLoader then it will split by page for you already
print(f'You have {len(data)} document(s) in your data')
print(f'There are {len(data[0].page_content)} characters in your document')

print(f'Content data example (first 100 characters): \n{data[0].page_content[:100]}')

You have 1 document(s) in your data
There are 52378 characters in your document
Content data example (first 100 characters): 
THE

CONSTITUTION of the United States

NATIONAL CONSTITUTION CENTER

We the People of the United St


### Chunk your data up into smaller documents

We will use the LangChain RecursiveCharacterTextSplitter to split our text into smaller chunks, which keeps us from exceeding the token limit of the LLM. We will use a chunk size of 300 and an overlap of 50. Note that by default the splitter measures length in characters, not tokens, so each chunk holds at most 300 characters and shares up to 50 characters with the previous chunk.


In [57]:
# Note: If you're using PyPDFLoader then we'll be splitting for the 2nd time.
# This is optional, test out on your own data.

text_splitter = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=50)
texts = text_splitter.split_documents(data)

In [58]:
print(f'Now you have {len(texts)} documents')

Now you have 254 documents
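
To make the effect of the chunk size and overlap concrete, here is a simplified sketch of fixed-size chunking with overlap. It is not how RecursiveCharacterTextSplitter works internally (that splitter prefers to break on separators such as paragraphs and line breaks, so its chunks are usually shorter than the limit); `chunk_with_overlap` is a hypothetical helper that only illustrates the sliding window.

```python
def chunk_with_overlap(text: str, chunk_size: int = 300, chunk_overlap: int = 50) -> list[str]:
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


# Hypothetical sample text: 1000 characters of repeating digits.
sample = "".join(str(i % 10) for i in range(1000))
chunks = chunk_with_overlap(sample)

print(len(chunks))
print(chunks[0][-50:] == chunks[1][:50])
```

With these settings each window starts 250 characters after the previous one, so the last 50 characters of one chunk are repeated at the start of the next.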


## Create embeddings of your documents

### Load environment variables

In [60]:


load_dotenv()
# Read your API keys from environment variables; load_dotenv() above also loads them from a .env file if one exists
HUGGINGFACEHUB_API_TOKEN = os.environ.get("HUGGINGFACEHUB_API_TOKEN")
OPENAI_API_KEY = os.environ.get('OPENAI_API_KEY')

PINECONE_API_KEY = os.environ.get('PINECONE_API_KEY')
PINECONE_API_ENV = os.environ.get('PINECONE_API_ENV')
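
Because os.environ.get returns None silently when a variable is not set, it can help to check for missing keys before going further. The following is a small sketch of such a check; `missing_env_vars` and `REQUIRED_KEYS` are hypothetical names, and the list of required keys assumes the Hugging Face and Pinecone path used in this notebook.

```python
import os

# Keys this notebook reads for the Hugging Face + Pinecone path (assumed list).
REQUIRED_KEYS = ["HUGGINGFACEHUB_API_TOKEN", "PINECONE_API_KEY", "PINECONE_API_ENV"]


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(REQUIRED_KEYS)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```

The optional env argument makes the helper easy to try with a plain dictionary instead of the real process environment.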

### Stage the embedding

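Underneath LangChain's abstractions, what happens depends on the model we are using: OpenAIEmbeddings makes API calls to OpenAI endpoints, while HuggingFaceInstructEmbeddings wraps sentence_transformers and loads the Instructor model locally. The HuggingFaceHub LLM used later in this notebook calls the Hugging Face Inference API.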

In [42]:
#embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)
embeddings = HuggingFaceInstructEmbeddings(model_name="hkunlp/instructor-xl")

load INSTRUCTOR_Transformer
max_seq_length  512


{'title': 'HuggingFaceInstructEmbeddings',
 'description': 'Wrapper around sentence_transformers embedding models.\n\nTo use, you should have the ``sentence_transformers``\nand ``InstructorEmbedding`` python packages installed.\n\nExample:\n    .. code-block:: python\n\n        from langchain.embeddings import HuggingFaceInstructEmbeddings\n\n        model_name = "hkunlp/instructor-large"\n        model_kwargs = {\'device\': \'cpu\'}\n        encode_kwargs = {\'normalize_embeddings\': True}\n        hf = HuggingFaceInstructEmbeddings(\n            model_name=model_name,\n            model_kwargs=model_kwargs,\n            encode_kwargs=encode_kwargs\n        )',
 'type': 'object',
 'properties': {'client': {'title': 'Client'},
  'model_name': {'title': 'Model Name',
   'default': 'hkunlp/instructor-large',
   'type': 'string'},
  'cache_folder': {'title': 'Cache Folder', 'type': 'string'},
  'model_kwargs': {'title': 'Model Kwargs', 'type': 'object'},
  'encode_kwargs': {'title': 'Enco

### Initialize Pinecone client

Create an instance of the Pinecone client using your credentials and your index name.

In [45]:
# initialize pinecone
pinecone.init(
    api_key=PINECONE_API_KEY,  # find at app.pinecone.io
    environment=PINECONE_API_ENV  # next to api key in console
)
index_name = "book-questions"  # put in the name of your pinecone index here
index = pinecone.Index(index_name)

Using the LangChain Pinecone adapter, we can create a reference to the index that exposes methods for querying the text corpus. The *from_texts* method lets us embed the text chunks with an existing embedding model and store them in the index.

In [64]:
text = [t.page_content for t in texts]
docsearch = Pinecone.from_texts(texts=text, index_name=index_name, embedding=embeddings)

Let's run a test query to see if our index is working properly.

In [74]:
query = "What is the First Article of the Constitution?"
docs = docsearch.similarity_search(query)
print(docs)

[Document(page_content='THE AMENDMENTS TO THE CONSTITUTION OF THE UNITED STATES AS RATIFIED BY THE STATES\n\nAmendment I.\n\nPreamble to the Bill of Rights', metadata={}), Document(page_content='(Note: The first 10 amendments to the Constitution were ratified December 15, 1791, and form what is known as the “Bill of Rights.”)\n\nC O N S T I T U T I O N O F T H E U N I T E D S T A T E S\n\nAmendment VI.\n\nAmendment XII.', metadata={}), Document(page_content='Passed by Congress June 13, 1866. Ratified July 9, 1868.\n\n(Note: Article I, Section 2 of the Constitution was modified by Section 2 of the 14th Amendment.)\n\nSECTION 1', metadata={}), Document(page_content='Passed by Congress July 2, 1909. Ratified February 3, 1913.\n\n(Note: Article I, Section 9 of the Constitution was modified by the 16 h Amendment.)\n\nSECTION 3', metadata={})]


### Create the context for the LLM
LangChain's abstractions allow us to create a chain that performs the following steps regardless of the LLM chosen:
embed a question -> perform a similarity check against the text corpus -> feed the context to the llm -> get a prediction


In [73]:
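#llm = OpenAI(temperature=0, openai_api_key=OPENAI_API_KEY)
# Note: HuggingFaceHub does not accept temperature as a direct argument (it raises a
# ValidationError); pass generation settings through model_kwargs instead.
llm = HuggingFaceHub(repo_id="tiiuae/falcon-7b-instruct")
chain = load_qa_chain(llm, chain_type="stuff")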

ValidationError: 1 validation error for HuggingFaceHub
temperature
  extra fields not permitted (type=value_error.extra)

In [71]:
query = "What is the second article of the constitution about?"
docs = docsearch.similarity_search(query, k=5)

In [72]:
chain.run(input_documents=docs, question=query)

' The second article of the Constitution deals with the election of the President and Vice President.'
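
To give an intuition for what the "stuff" chain type does with the retrieved documents, here is a minimal sketch in plain Python: all retrieved chunks are placed ("stuffed") into a single prompt together with the question. The prompt wording and the `build_stuff_prompt` helper are illustrative assumptions, not LangChain's actual template, and the chunk texts are placeholders.

```python
def build_stuff_prompt(question: str, chunks: list[str]) -> str:
    """Concatenate every retrieved chunk into one prompt followed by the question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say that you don't know.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Placeholder chunks standing in for the documents returned by the similarity search.
retrieved_chunks = [
    "placeholder text of the first retrieved chunk",
    "placeholder text of the second retrieved chunk",
]
prompt = build_stuff_prompt(
    "What is the second article of the constitution about?",
    retrieved_chunks,
)
print(prompt)
```

Because every chunk ends up in one prompt, the number of retrieved documents (k) and the chunk size together determine how close the prompt gets to the LLM's token limit.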