# Question-Answering System on Private Documents Using OpenAI, Pinecone, and LangChain

GPT models are great at answering questions, but only on topics they have been trained on. What if you want GPT to answer questions about topics it hasn't been trained on? For example, questions about recent events after September 2021 (not included in the training data of GPT-3.5 or GPT-4) or about your non-public documents.

**LLMs can learn new knowledge in two ways:**

**1) Fine-Tuning on a training set:-** It is the most natural way to teach the model knowledge, but it can be time-consuming and expensive. It also builds long-term memory, which is not always necessary.
   
**2) Model Inputs:-** Using model inputs means inserting the knowledge into an input message. For example, we can send an entire book or PDF document to the model as an input message and then ask questions on topics found in it. This is a good way to build short-term memory for the model. With a large corpus of text, however, model inputs become difficult to use because each model is limited to a maximum number of tokens, which in most cases is around 4000. We cannot simply send the text of a 500-page document to the model because it would exceed the maximum number of tokens the model supports.
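The size problem can be checked with back-of-the-envelope arithmetic. The sketch below assumes roughly 4 characters per token (a common rule of thumb for English text; real counts depend on the model's tokenizer) and a hypothetical average of 2,000 characters per page. Both numbers are illustrative assumptions, not measurements.

```python
# Rough estimate only: real token counts depend on the model's tokenizer.
CHARS_PER_TOKEN = 4        # rule-of-thumb assumption for English text
CHARS_PER_PAGE = 2000      # hypothetical average page length
CONTEXT_LIMIT = 4000       # approximate token limit mentioned above


def estimate_tokens(num_chars: int) -> int:
    """Estimate the token count of a text from its character count."""
    return num_chars // CHARS_PER_TOKEN


def fits_in_context(num_chars: int, limit: int = CONTEXT_LIMIT) -> bool:
    """Return True if the estimated token count fits within the limit."""
    return estimate_tokens(num_chars) <= limit


book_chars = 500 * CHARS_PER_PAGE          # a 500-page document
print(estimate_tokens(book_chars))         # 250000
print(fits_in_context(book_chars))         # False
```

Under these assumptions the document is dozens of times larger than the limit, which is why only the most relevant pieces should be sent to the model.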

**The recommended approach is to use model inputs with embedding-based search.** Embeddings are simple to implement and work especially well with questions.


## Question-Answering Pipeline

**1) Prepare the document (Once per document)**

   a) Load the data into LangChain Documents.
   
   b) Split the documents into chunks (short, self-contained sections).
   
   c) Embed the chunks into numeric vectors (using an embedding model such as OpenAI's text-embedding-ada-002).
   
   d) Save the chunks and the embeddings to a vector database (such as Pinecone, Chroma, Milvus or Qdrant).

**2) Search (Once per Query)**

   a) Embed the user's question. (Given a user query, generate an embedding for the question using the same embedding model that was used for the chunk embeddings.)
   
   b) Using the question's embedding and the chunk embeddings, rank the chunk vectors by similarity to the question's embedding (using cosine similarity or Euclidean distance). The nearest vectors represent the chunks most similar to the question.

**3) Ask (Once per query)**

   a) Insert the question and the most relevant chunks (obtained in step 2b) into a message to a GPT model.
   
   b) Return GPT's answer.
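
The three stages above can be sketched end to end with the standard library only. This is a toy illustration, not the implementation used in this project: `embed` is a hypothetical stand-in that counts words instead of calling a real embedding model, the "vector database" is a plain Python list, and the final call to a GPT model is left out, so the sketch stops at the assembled prompt.

```python
import math
import re
from collections import Counter


def split_into_chunks(text):
    """Step 1b: split text into short chunks (here, one sentence per chunk)."""
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]


def embed(text):
    """Steps 1c and 2a: toy stand-in for an embedding model (word counts)."""
    return Counter(re.findall(r'[a-z0-9]+', text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(question, index, k=2):
    """Step 2b: rank the stored chunks by similarity to the question."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine_similarity(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(question, chunks):
    """Step 3a: insert the question and the retrieved chunks into one message."""
    context = "\n".join(f"- {c}" for c in chunks)
    return (f"Answer using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}")


text = ("Pinecone is a vector database. "
        "LangChain provides document loaders. "
        "Embeddings map text to numeric vectors.")

# Step 1d: store each chunk together with its vector (in-memory stand-in for a vector database).
index = [(chunk, embed(chunk)) for chunk in split_into_chunks(text)]

question = "What is a vector database?"
top_chunks = search(question, index, k=1)
prompt = build_prompt(question, top_chunks)
print(prompt)
```

In a real system, `embed` would call an embedding model, the list would be replaced by a vector database, and `prompt` would be sent to a GPT model to obtain the answer.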

   
In this project we build a complete question-answering application on custom data that follows the above pipeline. This technique is also called Retrieval Augmentation because we retrieve relevant information from an external knowledge base and give that information to our LLM. The external knowledge base is our window into the world beyond the LLM's training data.

### 1) Prepare the document (Once per document)
#### Loading Your Custom (Private) PDF Documents into LangChain
The private data can be provided in different formats, such as Pandas DataFrames, PDFs, CSV or JSON files, HTML, or office documents.
**LangChain provides Document Loaders, which load this data into documents.** Document loaders are used to load data from a source as Documents. A Document is a piece of text and associated metadata. For example, there are document loaders for loading a simple .txt file, for loading the text contents of any web page, or even for loading a transcript of a YouTube video.
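
To make the "text plus metadata" idea concrete, here is a minimal sketch. `SimpleDocument` and `load_text_file` are hypothetical names introduced only for illustration; they are not LangChain's classes, they merely mirror the two attributes used later in this section (`page_content` and `metadata`).

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a LangChain Document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text_file(path):
    """Toy 'document loader': read a .txt file into a one-document list."""
    text = Path(path).read_text(encoding="utf-8")
    return [SimpleDocument(page_content=text, metadata={"source": str(path)})]


# Write a small file first so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "notes.txt"
    sample.write_text("Private notes about the project.", encoding="utf-8")
    docs = load_text_file(sample)

print(docs[0].page_content)
print(docs[0].metadata)
```

Real loaders follow the same pattern: whatever the source format, the result is a list of objects that carry the text and a dictionary describing where it came from.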





In [3]:
import os
from dotenv import load_dotenv, find_dotenv
load_dotenv(find_dotenv(), override=True)

True
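
The `True` above suggests that a `.env` file was loaded, but it does not say which keys it defined. A small check like the following can catch a missing key early. The variable names `OPENAI_API_KEY` and `PINECONE_API_KEY` are assumptions based on the services used in this project; adjust them to match your own `.env` file.

```python
import os

# Assumed variable names; adjust them to match the keys in your own .env file.
REQUIRED_KEYS = ("OPENAI_API_KEY", "PINECONE_API_KEY")


def missing_keys(required=REQUIRED_KEYS, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


missing = missing_keys()
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
else:
    print("All required keys are set.")
```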

To load PDF files, install the pypdf library:

In [None]:
pip install pypdf -q

In [None]:
# The following function takes the path of a PDF file and loads it with the pypdf library.
# It returns a list of LangChain Documents, one per page. Each Document holds the page text
# in page_content and a metadata dictionary that includes the page number.

def load_document(file):
    # Import statements normally go at the top of the file. Importing inside the function
    # keeps it self-contained, so it continues to work if you move it to another module.
    from langchain.document_loaders import PyPDFLoader
    print(f'Loading {file}')
    loader = PyPDFLoader(file)    # Also able to load online PDFs: just pass a URL to the PDF.
    data = loader.load()          # Returns a list of LangChain Documents, one for each page.
    return data



##### Running Code

In [None]:
pdf_path = 'files/your_document.pdf'    # hypothetical path; replace it with the path (or URL) of your own PDF
data = load_document(pdf_path)
print(data[1].page_content)         # The data is split by page, so an index selects a page. Index 1 is the second page because indexing starts at zero.
print(data[1].metadata)             # metadata is a dictionary.
print(f'You have {len(data)} pages in your data')         # Number of pages
print(f'There are {len(data[1].page_content)} characters in the page')     # Number of characters in that page
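
The same per-page information can be gathered for the whole document in one pass, which is handy before deciding how to split it into chunks. The helper `summarize_pages` below is a hypothetical addition, and the stand-in pages exist only so the sketch runs without a PDF; with real data you would pass the `data` list returned by `load_document`.

```python
from types import SimpleNamespace


def summarize_pages(pages):
    """Return (page_count, total_characters, index_of_longest_page)."""
    lengths = [len(p.page_content) for p in pages]
    if not lengths:
        return 0, 0, None
    return len(lengths), sum(lengths), lengths.index(max(lengths))


# Stand-in pages with the same attributes as the loaded documents.
fake_pages = [
    SimpleNamespace(page_content="Title page", metadata={"page": 0}),
    SimpleNamespace(page_content="A much longer second page of text.", metadata={"page": 1}),
]

count, total, longest = summarize_pages(fake_pages)
print(f'{count} pages, {total} characters, longest page index: {longest}')
```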


