# Project: Question-Answering System on Private Documents Using OpenAI, Pinecone, and LangChain

GPT models are great at answering questions, but only on topics they have been trained on. What if you want GPT to answer questions about topics it hasn't been trained on? Examples include events after September 2021 for GPT-3.5 or GPT-4 (not included in the training data) and your own non-public documents.

**LLMs can learn new knowledge in two ways:**

**1) Fine-Tuning on a training set:-** It is the most natural way to teach the model knowledge, but it can be time-consuming and expensive. It also builds long-term memory, which is not always necessary.
   
**2) Model Inputs:-** Using model inputs means inserting the knowledge into an input message. For example, we can send an entire book or PDF document to the model as an input message and then ask questions on topics found in that message. This is a good way to build short-term memory for the model. With a large corpus of text, however, model inputs become difficult to use because each model is limited to a maximum number of tokens, which in most cases is around 4000. We cannot simply send the text of a 500-page document to the model, because it would exceed the maximum number of tokens that the model supports.

**The recommended approach is to use model inputs with embedding-based search.** Embeddings are simple to implement and work especially well with questions.


## Question-Answering Pipeline

**1) Prepare the document (Once per document)**

   a) Load the data into LangChain Documents.
   
   b) Split the documents into chunks (short and self-contained sections).
   
   c) Embed the chunks into numeric vectors (using an embedding model such as OpenAI's text-embedding-ada-002).
   
   d) Save the chunks and the embeddings to a vector database (such as Pinecone, Chroma, Milvus or Qdrant).

**2) Search (Once per Query)**

   a) Embed the user's question. (Given a user query, generate an embedding for the question using the same embedding model that was used for the chunk embeddings.)
   
   b) Using the question's embedding and the chunk embeddings, rank the vectors by similarity to the question's embedding (using cosine similarity or Euclidean distance). The nearest vectors represent chunks similar to the question.

**3) Ask (Once per Query)**

   a) Insert the question and the most relevant chunks (obtained in step 2b) into a message to a GPT model.
   
   b) Return GPT's answer.

   
In this project we are building a complete question-answering application on custom data that follows the above pipeline. This technique is also called retrieval augmentation, because we retrieve relevant information from an external knowledge base and give that information to our LLM. The external knowledge base is our window into the world beyond the LLM's training data.
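
The search and ask steps of the pipeline can be sketched in plain Python. This is only an illustration under stated assumptions: the three-dimensional vectors are made up and stand in for real embeddings, and the helper names `cosine_similarity`, `top_k_chunks` and `build_prompt` are hypothetical, not part of LangChain or Pinecone.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_chunks(question_vec, chunk_vecs, chunks, k=2):
    """Rank chunks by similarity to the question and keep the best k."""
    scored = sorted(
        zip(chunks, chunk_vecs),
        key=lambda pair: cosine_similarity(question_vec, pair[1]),
        reverse=True,
    )
    return [chunk for chunk, _ in scored[:k]]

def build_prompt(question, relevant_chunks):
    """Insert the retrieved chunks and the question into one message."""
    context = "\n\n".join(relevant_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

# Toy vectors standing in for real embeddings (assumed values).
chunks = ["JVM executes bytecode.", "Pinecone stores vectors.", "JIT compiles hot code."]
chunk_vecs = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.9], [0.7, 0.3, 0.1]]
question_vec = [1.0, 0.0, 0.0]

best = top_k_chunks(question_vec, chunk_vecs, chunks, k=2)
prompt = build_prompt("What does the JVM do?", best)
print(prompt)
```

In the real application the vector database performs the ranking, and the prompt is sent to a GPT model instead of being printed.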

### 1) Prepare the document (Once per document)
#### Loading Your Custom (Private) PDF Documents into LangChain Documents
The private data can come in different formats, such as Pandas DataFrames, PDFs, CSV or JSON files, HTML, or office documents.
**LangChain provides Document Loaders, which load this data into documents.** Document loaders load data from a source as Documents. A Document is a piece of text with associated metadata. For example, there are document loaders for loading a simple .txt file, for loading the text contents of any web page, or even for loading a transcript of a YouTube video.





In [132]:
import os
from dotenv import load_dotenv, find_dotenv
load_dotenv(find_dotenv(), override=True)

True

In [133]:
pip install -r ./requirements.txt -q

Note: you may need to restart the kernel to use updated packages.


In [134]:
pip install -q pinecone-client

Note: you may need to restart the kernel to use updated packages.


To load PDF files, install the pypdf library.

In [135]:
pip install pypdf -q

Note: you may need to restart the kernel to use updated packages.


In [136]:
pip install docx2txt -q

Note: you may need to restart the kernel to use updated packages.


In [137]:
pip install wikipedia -q

Note: you may need to restart the kernel to use updated packages.


In [None]:
# The following function takes a PDF file as an argument and returns its content. It loads the PDF with the pypdf library into a list of documents, where each document contains the page_content and metadata with a page number.

# def load_document(file):
#     from langchain.document_loaders import PyPDFLoader       # By the way, the standard  recommendation is to put import statements at the top of the file, However there are cases when putting import statements inside the function is even better. When you move a function from one module to another, you will know that the function will continue to work, because it contains everything inside it.
#     print(f'Loading {file}')
#     loader = PyPDFLoader(file)    # note that it is also able to load online PDFs. just pass a URL to the PDF to PyPDFLoader()
#     data = loader.load()            # This will return a list of langchain documents, one document for each page.
#     return data





The code above loads PDF files into LangChain documents. However, our private unstructured data isn't limited to the PDF format; it can come in various other formats such as office documents, Google Docs, and many more. The following code loads only the PDF and DOCX formats into LangChain documents. To do this, we check the file's extension and load the file with the LangChain loader that matches it.

In [138]:
# Transform loaders (pdf, docx)
    # (these load data from a specific file format into the LangChain document format)
def load_document(file):
    import os
    name, extension = os.path.splitext(file)   # splitting the file name into name and extension. We can print name and extension if we want to see their values.

    if extension == '.pdf':
        from langchain.document_loaders import PyPDFLoader
        print(f'Loading {file}')
        loader = PyPDFLoader(file)  
    elif extension == '.docx':
        from langchain.document_loaders import Docx2txtLoader
        print(f'Loading {file}')
        loader = Docx2txtLoader(file)
    else:
        print('Document format is not supported!')
        return None
        
    data = loader.load()            
    return data



#Public Service loader (Wikipedia)
    #(Loading data from online public services into langchain. Here we don't deal with files but with different protocols or APIs that connect to those services. Since the format and code differ for each service, I would create a unique function for each dataset or service loader that I want to support in my application.)
def load_from_wikipedia(query, lang='en', load_max_docs=2):
    from langchain.document_loaders import WikipediaLoader
    loader = WikipediaLoader(query=query, lang=lang, load_max_docs=load_max_docs)    #load_max_docs can be used to limit the number of downloaded documents. for this we can use the hard-coded value or add a third argument to the function.
    data = loader.load() 
    return data
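
One way to support more formats without growing the `if`/`elif` chain is a lookup table from extension to loader function. The sketch below is an assumption about how that could look, not code from this project: `load_txt`, `LOADERS` and `load_any` are hypothetical names, and plain dictionaries stand in for LangChain documents so the example runs with the standard library only.

```python
import os
import tempfile

def load_txt(path):
    # Hypothetical loader: returns one "document" holding the whole file.
    with open(path, encoding="utf-8") as f:
        return [{"page_content": f.read(), "metadata": {"source": path}}]

# Hypothetical registry mapping file extensions to loader functions.
LOADERS = {".txt": load_txt}

def load_any(path):
    """Pick a loader by file extension; return None if the format is unsupported."""
    _, extension = os.path.splitext(path)
    loader = LOADERS.get(extension.lower())
    if loader is None:
        print(f"Document format {extension!r} is not supported!")
        return None
    return loader(path)

# Demo with a temporary file; the upper-case extension is matched case-insensitively.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "notes.TXT")
    with open(sample_path, "w", encoding="utf-8") as f:
        f.write("hello")
    docs = load_any(sample_path)

unsupported = load_any("slides.pptx")
print(docs[0]["page_content"], unsupported)
```

Adding a format then means registering one more entry in the table, for example a function that wraps `PyPDFLoader` for `.pdf`.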

#### Split the documents into chunks

After loading our custom or private data into LangChain documents, the next step in building an LLM application is to split the documents into smaller parts, or chunks.

Chunking is the breaking down of large pieces of text into smaller segments. It is an essential technique that helps optimize the relevance of the content we get back from a vector database.  

By applying an effective chunking strategy, we can make sure that our search results accurately capture the essence of the user's query. If our chunks are too small or too large, the search results may be imprecise, or we may miss opportunities to surface relevant content.

As a rule of thumb, if a chunk of text makes sense without the surrounding context to a human, it will also make sense to the language model. Therefore, finding the optimal chunk size for the documents in the corpus is crucial to ensure that the search results are accurate and relevant.

When a full paragraph or document is embedded, the embedding process considers both the overall context and the relationships between the sentences and phrases within the text. This can result in a more comprehensive vector representation that captures the broader meaning of the text.

In [139]:
def chunk_data(data, chunk_size=256):
    from langchain.text_splitter import RecursiveCharacterTextSplitter     # LangChain provides many text splitters, but RecursiveCharacterTextSplitter is recommended for generic text. By default it tries to split on "\n\n", then "\n", then " ", and finally "".
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=0)
    chunks = text_splitter.split_documents(data)   # it returns a list of documents.
   # chunks = text_splitter.create_documents(data)    # use "text_splitter.create_documents()" instead of "text_splitter.split_documents()" when the input is raw text that has not already been split into pages. It depends on how you have loaded the data.
    return chunks
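
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified fixed-width splitter. It is a sketch only and not what `RecursiveCharacterTextSplitter` does internally: the real splitter prefers to break on separators, while the hypothetical `split_with_overlap` below cuts purely by character count.

```python
def split_with_overlap(text, chunk_size=256, chunk_overlap=0):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]

sample = "abcdefghij" * 3  # 30 characters of toy text
no_overlap = split_with_overlap(sample, chunk_size=10, chunk_overlap=0)
with_overlap = split_with_overlap(sample, chunk_size=10, chunk_overlap=5)
print(len(no_overlap), len(with_overlap))
```

Overlap repeats the tail of one chunk at the head of the next, which costs extra tokens to embed but reduces the chance that a sentence relevant to a query is cut in half at a chunk boundary.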

#### Calculating the embedding cost

In [116]:
def print_embedding_cost(texts):
    import tiktoken
    enc = tiktoken.encoding_for_model('text-embedding-ada-002')
    total_tokens = sum([len(enc.encode(page.page_content)) for page in texts])
    print(f'Total Tokens: {total_tokens}')
    print(f'Embedding cost in USD: {total_tokens / 1000 * 0.0004:.6f}')

#### Embedding the chunks into numeric vectors and uploading the chunks and the embeddings to a vector database (Pinecone)

In [152]:
# This function will create an index, if the index doesn't exist, embed the chunks and add both the chunks and embeddings into the pinecone index for fast retrieval and similarity search.
# If the index already exists, the function will load the embeddings from that index.

def insert_or_fetch_embeddings(index_name, chunks):
    import pinecone
    from langchain_community.vectorstores import Pinecone
    from langchain_openai import OpenAIEmbeddings
    from pinecone import ServerlessSpec
   

    pc = pinecone.Pinecone()   # if the API key is not provided in .env file then, we can write as follows:  pc = pinecone.Pinecone(api_key='YOUR_API_KEY') 
    embeddings = OpenAIEmbeddings(model='text-embedding-ada-002')

    if index_name in pc.list_indexes().names():
        print(f'Index {index_name} already exists. Loading embeddings ...', end='')
        vector_store = Pinecone.from_existing_index(index_name, embeddings)
        print('Ok')
    else:
        print(f'Creating index {index_name} and embeddings ...', end='')
        pc.create_index(
            name=index_name,
            dimension=1536, 
            metric='cosine', 
            spec=pinecone.ServerlessSpec(
                    cloud="aws",
                    region="us-east-1"
            ) 
        )
        vector_store = Pinecone.from_documents(chunks, embeddings, index_name=index_name)  # This method is processing the input documents(the chunks), generating the embeddings using the provided OpenAI's embeddings instance, inserting the embeddings into the index and returning a new pinecone vector store object.
        print('Ok')
    return vector_store   # returned in both branches, so an existing index also yields a usable vector store

## If you want to create a new pod-based index instead of a serverless one, use the following configuration:
# from pinecone import PodSpec
# pc.create_index(
#             name=index_name,
#             dimension=1536,
#             metric='cosine',
#             spec=PodSpec(environment='gcp-starter')  # gcp stands for Google Cloud Platform
#         )
        
        
        

In [147]:
# This function deletes a single Pinecone index or all indexes. Because the Pinecone free tier supports only one index, it may be necessary to delete the existing index frequently.

# On the Pinecone free tier, creating a new index fails if another index already exists, so we remove all indexes first.

def delete_pinecone_index(index_name='all'):
    import pinecone
    pc = pinecone.Pinecone()
    if index_name == 'all':
        indexes = pc.list_indexes().names()
        print('Deleting all indexes ....')
        for index in indexes:
            pc.delete_index(index)
        print('Ok')
    else:
        print(f'Deleting index {index_name} ....', end='')
        pc.delete_index(index_name)
        print('Ok')
            


##### Running Code

In [55]:
data = load_document('files/Learn_Java.pdf')                # note that it is also able to load online PDFs. just pass a URL to the PDF to PyPDFLoader().
                                                                        
print(data[20].page_content)         # The data is split by pages, and you can use an index to display a specific page. Indexing starts at zero, so this is the 21st page.
print(data[20].metadata)             # metadata is a dictionary.
print(f'You have {len(data)} pages in your data')         # Number of pages
print(f' There are {len(data[20].page_content)} characters in the page')                      # Number of characters in this page




Loading files/Learn_Java.pdf
Teach Yourself JAVA in 21 DaysMTWTFSS
21
xxiv
P2/V4SQC6    TY  Java in 21 Days   030-4    louisa  12.31.95    FM   LP#4nnDay 16 covers interfaces and packages, useful for abstracting protocols of methods to
aid reuse and for the grouping and categorization of classes.
generated either by the system or by you in your programs.
nnDay 18 builds on the thread basics you learned on Day 10 to give a broad overview of
multithreading and how to use it to allow different parts of your Java programs to runin parallel.
nnOn Day 19, you’ll learn all about the input and output streams in Java’s I/O library.
nnDay 20 teaches you about native code—how to link C code into your Java programs
to provide missing functionality or to gain performance.
nnFinally, on Day 21, you’ll get an overview of some of the “behind-the-scenes” techni-
cal details of how Java works: the bytecode compiler and interpreter, the techniquesJava uses to ensure the integrity and security of your pro

In [153]:
data = load_document('files/java_notes.docx')     # here data is a list with a single element and content is the page_content attribute

print(data[0].page_content)

Loading files/java_notes.docx
Java Virtual Machine, or JVM, loads, verifies and executes Java bytecode. It is known as the interpreter or the core of Java programming language because it executes Java programming.



Java can be considered both a compiled and an interpreted language because its source code is first compiled into a binary byte-code. This byte-code runs on the Java Virtual Machine (JVM), which is usually a software-based interpreter



JIT compiler overview

Last Updated: 2021-02-28

The Just-In-Time (JIT) compiler is a component of the Java™ Runtime Environment that improves the performance of Java applications at run time.

Java programs consists of classes, which contain platform-neutral bytecodes that can be interpreted by a JVM on many different computer architectures. At run time, the JVM loads the class files, determines the semantics of each individual bytecode, and performs the appropriate computation. The additional processor and memory usage during interpretat

In [125]:
#data = load_from_wikipedia('Chandrayaan-3')
data = load_from_wikipedia('Chandrayaan-3', 'en')  #Important Note: The training data for GPT-4 was cut off in September 2021. Chandrayaan-3 was launched in July 2023. So it was not included in the GPT-4 training data. Without loading the data from external sources, LLMs like gpt-3.5-turbo or gpt-4 have no knowledge of it.
print(data[0].page_content)

Chandrayaan-3 ( CHUN-drə-YAHN) is the third mission in the Chandrayaan programme, a series of lunar-exploration missions developed by the Indian Space Research Organisation (ISRO). The mission consists of a Vikram lunar lander and a Pragyan lunar rover  was launched from Satish Dhawan Space Centre on 14 July 2023. The spacecraft entered lunar orbit on 5 August, and India became the first country to touch down near the lunar south pole, at 69°S, the southernmost lunar landing  on 23 August 2023 at 18:03 IST (12:33 UTC), made ISRO the fourth space agency to successfully land on the Moon, after Roscosmos, NASA, and the CNSA. 
The lander was not built to withstand the cold temperatures of the lunar night, and sunset over the landing site ended the surface mission twelve days after landing. The propulsion module, still operational, transited back to a high Earth orbit from lunar orbit on 22 November 2023 for continued scientific observations of Earth.
Chandrayaan-3 was launched from Satish 

In [154]:
chunks = chunk_data(data)
print(len(chunks))
print(chunks[20].page_content)

112
Note JVM is a specification (sun microsystem has said that it should be this way) that can be implemented by anyone.  If someone wants to create their own version of java then it should be aligned to that specification given by sun microsystem.   The


We are using OpenAI's "text-embedding-ada-002" model, which is not free. In the following cell we therefore calculate the embedding cost in advance with the tiktoken library, to avoid any surprises.

In [129]:
print_embedding_cost(chunks)

Total Tokens: 1835
Embedding cost in USD: 0.000734
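
The figure above is plain arithmetic: the token count divided by 1000 and multiplied by the price per 1000 tokens. The sketch below reproduces it with the rate passed in as a parameter. The hypothetical helper `embedding_cost_usd` is not part of the notebook, and the 0.0004 USD rate is simply the value hard-coded in `print_embedding_cost`; provider prices change, so treat it as an assumption and check the current price list.

```python
def embedding_cost_usd(total_tokens, usd_per_1k_tokens):
    """Estimated cost of embedding total_tokens at the given price per 1000 tokens."""
    return total_tokens / 1000 * usd_per_1k_tokens

# Token count from the output above; the rate is the notebook's hard-coded value.
cost = embedding_cost_usd(1835, 0.0004)
print(f"Embedding cost in USD: {cost:.6f}")
```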


In [155]:
delete_pinecone_index()

Deleting all indexes ....
Ok


In [156]:
index_name = 'java-notes'
insert_or_fetch_embeddings(index_name, chunks)

Creating index java-notes and embeddings ...Ok


<langchain_community.vectorstores.pinecone.Pinecone at 0x1a0f5862ed0>