# Project: Question-Answering System on Private Documents Using OpenAI, Pinecone, and LangChain

GPT models are great at answering questions, but only about topics covered by their training data. What if you want GPT to answer questions about topics it hasn't been trained on, such as events after September 2021 (the training cutoff of the original GPT-3.5 and GPT-4 models) or your own non-public documents?

**LLMs can learn new knowledge in two ways:**

**1) Fine-tuning on a training set:** This is the most natural way to teach the model new knowledge, but it can be time-consuming and expensive. It also builds long-term memory, which is not always necessary.

**2) Model inputs:** This means inserting the knowledge into the input message. For example, we can send the text of a document to the model as part of the input message and then ask questions about topics found in it. This is a good way to give the model short-term memory. With a large corpus of text, however, model inputs alone are difficult to use, because each model is limited to a maximum number of tokens per request (around 4,000 for the original gpt-3.5-turbo; newer models allow more, but the limit is still finite). We cannot simply send the text of a 500-page document to the model, because it would exceed the maximum number of tokens the model supports.

**The recommended approach is to use model inputs with embedding-based search.** Embeddings are simple to implement and work especially well with questions.
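To make the token limit concrete, here is a rough back-of-the-envelope sketch. It assumes the common rule of thumb of about 4 characters per token for English text and a hypothetical average of 2,000 characters per page; both numbers are illustrative assumptions, not measurements, and `estimated_tokens` is a helper invented for this example.

```python
def estimated_tokens(num_chars, chars_per_token=4):
    """Rough token estimate; ~4 characters per token is only a rule of thumb."""
    return num_chars // chars_per_token

pages = 500
chars_per_page = 2_000   # hypothetical average, for illustration only
context_limit = 4_000    # the approximate limit quoted above

tokens = estimated_tokens(pages * chars_per_page)
print(f"Estimated tokens: {tokens}")
print(f"Fits in one request: {tokens <= context_limit}")
```

Even with generous assumptions, the document is far larger than a single request allows, which is why only the most relevant chunks are sent to the model.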


## Question-Answering Pipeline

**1) Prepare the document (once per document)**

   a) Load the data into LangChain Documents.

   b) Split the documents into chunks (short, self-contained sections).

   c) Embed the chunks into numeric vectors using an embedding model such as OpenAI's text-embedding-ada-002.

   d) Save the chunks and their embeddings to a vector database (such as Pinecone, Chroma, Milvus, or Qdrant).

**2) Search (once per query)**

   a) Embed the user's question, using the same embedding model that was used for the chunk embeddings.

   b) Rank the chunk embeddings by similarity to the question's embedding (using cosine similarity or Euclidean distance). The nearest vectors represent the chunks most similar to the question.

**3) Ask (once per query)**

   a) Insert the question and the most relevant chunks (obtained in step 2b) into a message to a GPT model.

   b) Return GPT's answer.

   
In this project we build a complete question-answering application on custom data that follows the above pipeline. This technique is also called retrieval augmentation, because we retrieve relevant information from an external knowledge base and give that information to our LLM. The external knowledge base is our window into the world beyond the LLM's training data.
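The ranking in step 2b can be sketched in plain Python. This is a minimal illustration with toy 3-dimensional vectors standing in for real embeddings; `cosine_similarity` and `rank_chunks` are hypothetical helper names, and a vector database such as Pinecone performs this search far more efficiently at scale.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot product over the product of norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_chunks(question_vec, chunk_vecs, top_k=3):
    """Return (chunk_index, score) pairs, most similar first."""
    scored = [(i, cosine_similarity(question_vec, v)) for i, v in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy vectors; real embeddings have many more dimensions.
chunk_vectors = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.9, 0.1, 0.0],
]
question_vector = [1.0, 0.0, 0.0]

for index, score in rank_chunks(question_vector, chunk_vectors, top_k=2):
    print(f"chunk {index}: similarity {score:.3f}")
```

The chunks at the top of the ranking are the ones inserted into the prompt in step 3.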

### 1) Prepare the document (Once per document)
#### Loading Your Custom (Private) PDF Documents into LangChain Documents
Private data can come in many formats, such as Pandas DataFrames, PDFs, CSV or JSON files, HTML, or office documents.
**LangChain provides document loaders that load this data into documents.** A document loader reads data from a source and returns it as Documents. A Document is a piece of text with associated metadata. For example, there are document loaders for loading a simple .txt file, for loading the text contents of any web page, or even for loading a transcript of a YouTube video.
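To picture what "a piece of text and associated metadata" looks like, here is a small stand-in written with a dataclass. `SimpleDocument` is a hypothetical class for illustration only (it is not LangChain's own class), and the metadata keys and file path are assumed examples.

```python
from dataclasses import dataclass, field

@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a loaded document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)

# Two pages of an imaginary PDF; the keys 'source' and 'page' are illustrative.
pages = [
    SimpleDocument("Text of the first page ...", {"source": "files/example.pdf", "page": 0}),
    SimpleDocument("Text of the second page ...", {"source": "files/example.pdf", "page": 1}),
]

for doc in pages:
    print(doc.metadata["page"], len(doc.page_content))
```

The loaders used below return lists of objects with the same two attributes, `page_content` and `metadata`.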





In [1]:
import os
from dotenv import load_dotenv, find_dotenv
load_dotenv(find_dotenv(), override=True)

True

In [None]:
pip install --upgrade pip

In [2]:
pip install -r ./requirements.txt -q

Note: you may need to restart the kernel to use updated packages.


In [3]:
pip install -q pinecone-client

Note: you may need to restart the kernel to use updated packages.


To load PDF files, install the `pypdf` library.

In [None]:
pip install pypdf -q

In [None]:
pip install docx2txt -q

In [4]:
pip install wikipedia -q

Note: you may need to restart the kernel to use updated packages.


In [None]:
# The following function takes the path of a PDF file and returns its content. It loads the PDF with the pypdf library into a list of documents, where each document contains the page_content and metadata (including the page number).

# def load_document(file):
#     # Imports normally go at the top of the file, but importing inside the function keeps it self-contained: it keeps working when moved to another module.
#     from langchain.document_loaders import PyPDFLoader
#     print(f'Loading {file}')
#     loader = PyPDFLoader(file)    # also able to load online PDFs: just pass the URL of the PDF to PyPDFLoader()
#     data = loader.load()          # returns a list of LangChain documents, one per page
#     return data





The code above loads PDF files into LangChain documents. However, private unstructured data isn't limited to PDFs; it can come in many other formats, such as office documents and Google Docs. The following code loads only the pdf and docx formats: it checks the file's extension and picks the matching LangChain loader.

In [5]:
# File loaders (pdf, docx)
# Each one loads data from a specific file format into the LangChain document format.
def load_document(file):
    import os
    name, extension = os.path.splitext(file)   # split the path into name and extension, e.g. 'files/report' and '.pdf'

    if extension == '.pdf':
        from langchain.document_loaders import PyPDFLoader
        print(f'Loading {file}')
        loader = PyPDFLoader(file)  
    elif extension == '.docx':
        from langchain.document_loaders import Docx2txtLoader
        print(f'Loading {file}')
        loader = Docx2txtLoader(file)
    else:
        print('Document format is not supported!')
        return None
        
    data = loader.load()            
    return data



# Public service loader (Wikipedia)
# Loads data from an online public service into LangChain. Here we deal with service APIs rather than files.
# Since the code differs for each service, it is cleaner to write one loader function per service the application supports.
def load_from_wikipedia(query, lang='en', load_max_docs=2):
    from langchain.document_loaders import WikipediaLoader
    loader = WikipediaLoader(query=query, lang=lang, load_max_docs=load_max_docs)    # load_max_docs limits the number of downloaded documents
    data = loader.load() 
    return data
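As the number of supported formats grows, the extension check can be expressed as a lookup table instead of an if/elif chain. The sketch below only selects a loader name and does not import LangChain; `LOADER_NAMES` and `pick_loader_name` are hypothetical names for this illustration. It also lower-cases the extension, so `Report.PDF` is treated like `report.pdf`, which the function above does not do.

```python
import os

# Lower-cased extensions mapped to the loader class names used in load_document.
LOADER_NAMES = {
    ".pdf": "PyPDFLoader",
    ".docx": "Docx2txtLoader",
}

def pick_loader_name(file):
    """Return the loader name for a file path, or None if the format is unsupported."""
    _, extension = os.path.splitext(file)
    return LOADER_NAMES.get(extension.lower())

for path in ["files/Learn_Java.pdf", "files/Report.PDF", "files/java_notes.docx", "notes.txt"]:
    print(path, "->", pick_loader_name(path))
```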

#### Split the documents into chunks

After loading our custom or private data into LangChain documents, the next step in building an LLM application is to split (chunk) the documents into smaller parts.

Chunking is the breaking down of large pieces of text into smaller segments. It is an essential technique that helps optimize the relevance of the content we get back from a vector database.

By applying an effective chunking strategy, we can make sure that our search results accurately capture the essence of the user's query. If our chunks are too small or too large, the search results may be imprecise, or relevant content may never surface.

As a rule of thumb, if a chunk of text makes sense to a human without the surrounding context, it will also make sense to the language model. Finding a good chunk size for the documents in the corpus is therefore crucial to ensure that the search results are accurate and relevant.

When a full paragraph or document is embedded, the embedding process considers both the overall context and the relationships between the sentences and phrases within the text. This can result in a more comprehensive vector representation that captures the broader meaning of the text.
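To get a feel for how chunk size and overlap interact, here is a deliberately simplified fixed-window splitter. It is not the algorithm used by LangChain's splitters (which prefer to break on separators such as newlines); `split_with_overlap` is a hypothetical helper that cuts purely by character count.

```python
def split_with_overlap(text, chunk_size=256, chunk_overlap=0):
    """Cut text into windows of chunk_size characters, each sharing
    chunk_overlap characters with the previous window."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be at least 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
        start += step
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
print(split_with_overlap(sample, chunk_size=10, chunk_overlap=0))
print(split_with_overlap(sample, chunk_size=10, chunk_overlap=3))
```

With an overlap, text near a chunk boundary appears in two chunks, which reduces the chance that a sentence relevant to a query is cut in half.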

In [6]:
def chunk_data(data, chunk_size=256):
    from langchain.text_splitter import RecursiveCharacterTextSplitter     # LangChain provides many text splitters; RecursiveCharacterTextSplitter is the recommended one for generic text. By default it tries to split on "\n\n", then "\n", then " ", and finally "".
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=0)
    chunks = text_splitter.split_documents(data)   # returns a list of documents
   # chunks = text_splitter.create_documents(data)    # use create_documents() instead of split_documents() when the input is a list of plain strings rather than Document objects. It depends on how you have loaded the data.
    return chunks

#### Calculating the embedding cost

In [None]:
def print_embedding_cost(texts):
    import tiktoken
    enc = tiktoken.encoding_for_model('text-embedding-ada-002')
    total_tokens = sum(len(enc.encode(page.page_content)) for page in texts)
    print(f'Total Tokens: {total_tokens}')
    # 0.0004 USD per 1,000 tokens is a hard-coded rate; check OpenAI's current pricing for the model you actually use.
    print(f'Embedding cost in USD: {total_tokens / 1000 * 0.0004:.6f}')
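The cost formula is simply tokens divided by 1,000, multiplied by the price per 1,000 tokens. The worked example below uses a made-up total of 250,000 tokens together with the 0.0004 USD rate hard-coded above; `embedding_cost_usd` is a hypothetical helper that takes the rate as a parameter so it can be swapped for the current price.

```python
def embedding_cost_usd(total_tokens, usd_per_1k_tokens):
    """Cost = (tokens / 1000) * price per 1,000 tokens."""
    return total_tokens / 1000 * usd_per_1k_tokens

# 250,000 tokens is an illustrative figure, not a measurement from this notebook.
cost = embedding_cost_usd(250_000, 0.0004)
print(f"Embedding cost in USD: {cost:.6f}")
```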

#### Embedding the chunks into numeric vectors and uploading the chunks and the embeddings to a vector database (Pinecone)

In [7]:
# This function creates the index if it doesn't exist, embeds the chunks, and adds both the chunks and their embeddings to the Pinecone index for fast retrieval and similarity search.
# If the index already exists, the function loads the embeddings from that index.

def insert_or_fetch_embeddings(index_name, chunks):
    import pinecone
    from langchain_community.vectorstores import Pinecone
    from langchain_openai import OpenAIEmbeddings
    from pinecone import ServerlessSpec
   

    pc = pinecone.Pinecone()   # reads the API key from the environment (.env file); alternatively pass it explicitly: pinecone.Pinecone(api_key='YOUR_API_KEY')
    embeddings = OpenAIEmbeddings(model='text-embedding-3-small')

    if index_name in pc.list_indexes().names():
        print(f'Index {index_name} already exists. Loading embeddings ...', end='')
        vector_store = Pinecone.from_existing_index(index_name, embeddings)
        print('Ok')
    else:
        print(f'Creating index {index_name} and embeddings ...', end='')
        pc.create_index(
            name=index_name,
            dimension=1536, 
            metric='cosine', 
            spec=pinecone.ServerlessSpec(
                    cloud="aws",
                    region="us-east-1"
            ) 
        )
        vector_store = Pinecone.from_documents(chunks, embeddings, index_name=index_name)  # Embeds the chunks with the given OpenAIEmbeddings instance, inserts the embeddings into the index, and returns a new Pinecone vector store object.
        print('Ok')

    return vector_store

# To create a pod-based index instead of a serverless one, use the following configuration:
# from pinecone import PodSpec
# pc.create_index(
#     name=index_name,
#     dimension=1536,
#     metric='cosine',
#     spec=PodSpec(environment='gcp-starter')  # gcp stands for Google Cloud Platform
# )
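The `dimension` passed to `create_index` has to match the length of the vectors produced by the embedding model, otherwise inserts into the index are rejected. The sketch below shows a small guard for that; `EMBEDDING_DIMENSIONS` and `check_index_dimension` are hypothetical helpers, and the table lists the default output sizes of three OpenAI embedding models.

```python
# Default output dimensions of some OpenAI embedding models.
EMBEDDING_DIMENSIONS = {
    "text-embedding-ada-002": 1536,
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
}

def check_index_dimension(model, index_dimension):
    """Raise if the index dimension does not match the model's vector size."""
    expected = EMBEDDING_DIMENSIONS.get(model)
    if expected is None:
        raise KeyError(f"Unknown embedding model: {model}")
    if expected != index_dimension:
        raise ValueError(
            f"{model} produces {expected}-dimensional vectors, "
            f"but the index expects {index_dimension}"
        )
    return True

print(check_index_dimension("text-embedding-3-small", 1536))
```

Calling such a check before `pc.create_index(...)` would catch a mismatch early, for example when switching to a larger embedding model.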
        
        
        

In [8]:
# This function deletes one Pinecone index or all of them. Because the Pinecone free tier supports only one index, it may be necessary to delete the existing index frequently.

# On the free tier, creating a new index fails if another index already exists, so we remove all indexes first.

def delete_pinecone_index(index_name='all'):
    import pinecone
    pc = pinecone.Pinecone()
    if index_name == 'all':
        indexes = pc.list_indexes().names()
        print('Deleting all indexes ....')
        for index in indexes:
            pc.delete_index(index)
        print('Ok')
    else:
        print(f'Deleting index {index_name} ....', end='')
        pc.delete_index(index_name)
        print('Ok')
            


##### Asking and Getting Answers
The retrieved chunks contain the answer, but we can't hand them to users as they are; we need the answer in natural language. That's where the LLM comes in. The following function first retrieves the most relevant chunks of text from our vector database and then feeds those chunks to the LLM to get the final answer.

Note: Also check the previous project (Question-Answer about content of private document)

In [None]:
def ask_and_get_answer(vector_store, q):
    from langchain.chains import RetrievalQA
    from langchain_openai import ChatOpenAI
    
    llm = ChatOpenAI(model='gpt-4o-mini', temperature=1)
    
    # Expose the index through a retriever interface, a generic interface that makes it easy to combine documents with language models.
    # search_kwargs takes a dictionary; the key 'k' holds an integer.
    # {'k': 3} means the retriever returns the three chunks most similar to the user's query.
    retriever = vector_store.as_retriever(search_type='similarity', search_kwargs={'k': 3})
    
    # Finally, create a chain to answer the questions. The default chain_type="stuff" puts all of the text from the retrieved documents into the prompt.
    chain = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=retriever)

    #Now run the chain
    answer = chain.invoke(q)
    return answer


#index = Pinecone.from_existing_index(index_name=index_name, embedding=embeddings)
# docs = index.similarity_search('Enter query here', include_metadata = True)
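Conceptually, the "stuff" chain type pastes every retrieved chunk into one prompt. The sketch below mimics that with plain strings; the system text is copied from the verbose chain output shown later in this notebook, while `build_stuff_messages` and the message dictionaries are a hypothetical simplification of what the chain does internally.

```python
SYSTEM_TEMPLATE = (
    "Use the following pieces of context to answer the user's question.\n"
    "If you don't know the answer, just say that you don't know, "
    "don't try to make up an answer.\n"
    "----------------\n"
    "{context}"
)

def build_stuff_messages(question, chunks):
    """Join all retrieved chunks into one context block and pair it with the question."""
    context = "\n\n".join(chunks)
    return [
        {"role": "system", "content": SYSTEM_TEMPLATE.format(context=context)},
        {"role": "user", "content": question},
    ]

retrieved = [
    "First retrieved chunk of text.",
    "Second retrieved chunk of text.",
]
for message in build_stuff_messages("What do the chunks say?", retrieved):
    print(f"[{message['role']}]\n{message['content']}\n")
```

Because all chunks share one prompt, the value of `k` and the chunk size together determine how much of the model's token budget the context consumes.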

##### Running Code

In [None]:
data = load_document('files/Learn_Java.pdf')                # PyPDFLoader can also load online PDFs: just pass a URL instead of a local path.

print(data[20].page_content)         # The data is split by pages; use an index to display a specific page. Indexes start at zero, so data[20] is the 21st page.
print(data[20].metadata)             # metadata is a dictionary
print(f'You have {len(data)} pages in your data')         # number of pages
print(f'There are {len(data[20].page_content)} characters in the page')                      # number of characters in that page




In [None]:
data = load_document('files/java_notes.docx')     # here data is a list with a single element and content is the page_content attribute

print(data[0].page_content)

In [9]:
# data = load_from_wikipedia('Chandrayaan-3')
# Important note: the training data of the original GPT-3.5 and GPT-4 models was cut off in September 2021, while 'gpt-4o-mini' (used in this notebook) has a knowledge cut-off of October 2023.
# Chandrayaan-3 was launched in July 2023, so it was not included in the original GPT-4 training data. Without loading the data from an external source, a model has no knowledge of events after its cut-off.
data = load_from_wikipedia('Olympics', 'en')
print(data[0].page_content)

The modern Olympic Games (OG; or Olympics; French: Jeux olympiques, JO) are the leading international sporting events featuring summer and winter sports competitions in which thousands of athletes from around the world participate in a variety of competitions. The Olympic Games are considered the world's foremost sports competition with more than 200 teams, representing sovereign states and territories, participating. By default, the Games generally substitute for any world championships during the year in which they take place (however, each class usually maintains its own records). The Olympic Games are held every four years. Since 1994, they have alternated between the Summer and Winter Olympics every two years during the four-year Olympiad.
Their creation was inspired by the ancient Olympic Games, held in Olympia, Greece from the 8th century BC to the 4th century AD. Baron Pierre de Coubertin founded the International Olympic Committee (IOC) in 1894, leading to the first modern Gam

In [10]:
chunks = chunk_data(data)
print(len(chunks))
print(chunks[20].page_content)

39
Sydney was selected as the host city for the 2000 Games in 1993. Teams from 199 countries participated in the 2000 Games, which were the first to feature at least 300 events in its official sports program. The Games' cost was estimated to be A$6.6


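OpenAI's embedding models are billed per token (the notebook embeds with `text-embedding-3-small`, while `print_embedding_cost` tokenizes with the `text-embedding-ada-002` encoding and a hard-coded rate). The following cell uses the tiktoken library to estimate the embedding cost in advance and avoid any surprises.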

In [None]:
print_embedding_cost(chunks)

In [None]:
delete_pinecone_index()

In [11]:
index_name = 'olympics'
vector_store = insert_or_fetch_embeddings(index_name, chunks)
print(vector_store)

Index olympics already exists. Loading embeddings ...Ok
<langchain_community.vectorstores.pinecone.Pinecone object at 0x0000017DE650BD90>


In [None]:
# q = "when the propulsion module returned to a high Earth orbit from lunar orbit?"
q = "from where chandrayaan 3 was launched?"
print(vector_store)
answer = ask_and_get_answer(vector_store, q)
print(answer['result'])

In [None]:
#creating a loop so that the user can send questions to the application continuously
import time
i = 1
print('Write Quit or Exit to quit.')
while True:
    q = input(f'Question #{i}: ')
    i = i + 1
    if q.lower() in ['quit', 'exit']:
        print('Quitting.....bye bye!')
        time.sleep(2)
        break

    answer = ask_and_get_answer(vector_store, q)
    print(f"\nAnswer: {answer['result']}")
    print(f'\n{"-" * 50}\n')

##### Adding Memory (Chat History)

In [12]:
# Earlier, ask_and_get_answer() used the RetrievalQA chain to ask questions against a vector store. That works well but has one disadvantage: it does not preserve the conversation history, so follow-up questions that refer to earlier turns cannot be answered. In the following code we save the chat history so that follow-up questions work.
# Instead of the RetrievalQA chain, we use another chain called ConversationalRetrievalChain, which holds a conversation based on the retrieved documents.
# The ConversationBufferMemory class acts as a buffer that stores the conversation history.

from langchain_openai import ChatOpenAI
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory

llm = ChatOpenAI(model='gpt-4o-mini', temperature=0)

# In a RAG system, external data is retrieved and passed to the LLM during the generation step. The retriever is the component that finds and returns the relevant information. In this example it searches by similarity and retrieves the top k most similar chunks.
retriever = vector_store.as_retriever(search_type='similarity', search_kwargs={'k': 5})
memory = ConversationBufferMemory(memory_key='chat_history', return_messages=True)  # The memory is updated automatically with each question and answer. memory_key='chat_history' is the key under which the stored conversation is passed to the chain and returned in the result.

crc = ConversationalRetrievalChain.from_llm(
        llm=llm,
        retriever=retriever,
        chain_type="stuff",   #chain_type="stuff" means use all of the text from the documents.
        memory=memory,
        verbose=True
    )

In [13]:
def ask_question(q, chain):
    result = chain.invoke({'question': q})
    return result
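Before retrieving, the conversational chain rewrites a follow-up such as "Multiply that number by 3." into a standalone question using the chat history. The sketch below rebuilds that rewriting prompt with plain strings; the wording follows the verbose output shown further down in this notebook, while `build_condense_prompt` and the exact whitespace are assumptions made for illustration.

```python
CONDENSE_TEMPLATE = (
    "Given the following conversation and a follow up question, rephrase the "
    "follow up question to be a standalone question, in its original language.\n\n"
    "Chat History:\n{chat_history}\n"
    "Follow Up Input: {question}\n"
    "Standalone question:"
)

def build_condense_prompt(history, question):
    """history is a list of (human_message, assistant_message) pairs."""
    lines = []
    for human, assistant in history:
        lines.append(f"Human: {human}")
        lines.append(f"Assistant: {assistant}")
    return CONDENSE_TEMPLATE.format(chat_history="\n".join(lines), question=question)

history = [
    ("How many sports were there?", "There were 40 different sports."),
]
print(build_condense_prompt(history, "Multiply that number by 3."))
```

The LLM's reply to this prompt is the standalone question that is then embedded and sent to the retriever.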

In [14]:
q = "In how many sports, athletes competed at the 2020 Summer Olympics and 2022 Winter Olympics combined?"
result = ask_question(q, crc)
print(result)
print('-' * 50)
print(result['answer'])
print('-' * 50)



> Entering new StuffDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
System: Use the following pieces of context to answer the user's question. 
If you don't know the answer, just say that you don't know, don't try to make up an answer.
----------------
torch, and opening and closing ceremonies. Over 14,000 athletes competed at the 2020 Summer Olympics and 2022 Winter Olympics combined, in 40 different sports and 448 events. The first-, second-, and third-place finishers in each event receive Olympic

competitions. The Olympic Games are considered the world's foremost sports competition with more than 200 teams, representing sovereign states and territories, participating. By default, the Games generally substitute for any world championships during

The modern Olympic Games (OG; or Olympics; French: Jeux olympiques, JO) are the leading international sporting events featuring summer and winter sports competitions in whic

In [15]:
q = "Multiply that number by 3."
result = ask_question(q, crc)
print(result)
print('-' * 50)
print(result['answer'])
print('-' * 50)



> Entering new LLMChain chain...
Prompt after formatting:
Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question, in its original language.

Chat History:

Human: In how many sports, athletes competed at the 2020 Summer Olympics and 2022 Winter Olympics combined?
Assistant: Athletes competed in 40 different sports at the 2020 Summer Olympics and 2022 Winter Olympics combined.
Follow Up Input: Multiply that number by 3.
Standalone question:[0m

[1m> Finished chain.[0m


[1m> Entering new StuffDocumentsChain chain...[0m


[1m> Entering new LLMChain chain...[0m
Prompt after formatting:
[32;1m[1;3mSystem: Use the following pieces of context to answer the user's question. 
If you don't know the answer, just say that you don't know, don't try to make up an answer.
----------------
torch, and opening and closing ceremonies. Over 14,000 athletes competed at the 2020 Summer Olympics and 2022 Winter O

In [16]:
q = "divide that number by 100."
result = ask_question(q, crc)
print(result)
print('-' * 50)
print(result['answer'])
print('-' * 50)



> Entering new LLMChain chain...
Prompt after formatting:
Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question, in its original language.

Chat History:

Human: In how many sports, athletes competed at the 2020 Summer Olympics and 2022 Winter Olympics combined?
Assistant: Athletes competed in 40 different sports at the 2020 Summer Olympics and 2022 Winter Olympics combined.
Human: Multiply that number by 3.
Assistant: The number of sports at the 2020 Summer Olympics and 2022 Winter Olympics combined is 40. Multiplying this by 3 gives you 120.
Follow Up Input: divide that number by 100.
Standalone question:[0m

[1m> Finished chain.[0m


[1m> Entering new StuffDocumentsChain chain...[0m


[1m> Entering new LLMChain chain...[0m
Prompt after formatting:
[32;1m[1;3mSystem: Use the following pieces of context to answer the user's question. 
If you don't know the answer, just say that you don't k

In [17]:
# To display the chat history, which contains all the questions and their answers, iterate over the content of the 'chat_history' key:
for item in result['chat_history']:
    print(item)

content='In how many sports, athletes competed at the 2020 Summer Olympics and 2022 Winter Olympics combined?'
content='Athletes competed in 40 different sports at the 2020 Summer Olympics and 2022 Winter Olympics combined.'
content='Multiply that number by 3.'
content='The number of sports at the 2020 Summer Olympics and 2022 Winter Olympics combined is 40. Multiplying this by 3 gives you 120.'
content='divide that number by 100.'
content='120 divided by 100 is 1.2.'
