# Question-Answering System on Private Documents Using OpenAI, Pinecone, and LangChain

GPT models are great at answering questions, but only on topics covered by their training data. What if you want GPT to answer questions about topics it hasn't been trained on? For example, recent events after September 2021, which are not included in the training data of GPT-3.5 or GPT-4, or your own non-public documents.

**LLMs can learn new knowledge in two ways:**

**1) Fine-tuning on a training set:** Fine-tuning is the most natural way to teach the model new knowledge, but it can be time-consuming and expensive. It also builds long-term memory, which is not always necessary.
   
**2) Model inputs:** This means inserting the knowledge into an input message. For example, we can send an entire book or PDF document to the model as an input message and then ask questions about topics found in it. This is a good way to build short-term memory for the model. With a large corpus of text, however, model inputs become difficult to use because each model is limited to a maximum number of tokens, which in most cases is around 4000. We cannot simply send the text of a 500-page document to the model, because it would exceed the maximum number of tokens the model supports.
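To make the token limit concrete, here is a rough budget check. It is only a sketch: the ratio of 4 characters per token is a common rule of thumb for English text rather than a real tokenizer count, and the 2,000 characters per page, `estimate_tokens`, and `fits_in_context` are hypothetical values and names chosen for illustration.

```python
# Rough token-budget check (heuristic, not an exact tokenizer count).
CHARS_PER_TOKEN = 4          # assumption: rule-of-thumb average for English text
CONTEXT_LIMIT_TOKENS = 4000  # the approximate limit mentioned above


def estimate_tokens(text: str) -> int:
    """Estimate the token count of `text` from its length in characters."""
    return len(text) // CHARS_PER_TOKEN


def fits_in_context(text: str, limit: int = CONTEXT_LIMIT_TOKENS) -> bool:
    """Return True if the estimated token count is within the limit."""
    return estimate_tokens(text) <= limit


# Hypothetical 500-page document at roughly 2,000 characters per page.
document = "x" * (500 * 2000)
print(estimate_tokens(document))   # 250000
print(fits_in_context(document))   # False
```

Even with generous rounding, the estimate is far beyond the limit, which is why the document has to be split and searched instead of being sent whole.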

**The recommended approach is to use model inputs with embedding-based search.** Embeddings are simple to implement and work especially well with questions.


## Question-Answering Pipeline

**1) Prepare the document (Once per document)**

   a) Load the data into LangChain Documents.
   
   b) Split the documents into chunks (short, self-contained sections).
   
   c) Embed the chunks into numeric vectors (using an embedding model such as OpenAI's text-embedding-ada-002).
   
   d) Save the chunks and the embeddings to a vector database (such as Pinecone, Chroma, Milvus, or Qdrant).
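The splitting step (1b) can be sketched in plain Python as follows. This is not LangChain's splitter, only a minimal character-based illustration; `split_into_chunks`, `chunk_size`, and `overlap` are hypothetical names, and the overlap is an assumption added so that a sentence cut at a chunk boundary still appears whole in one of the neighbouring chunks.

```python
from typing import List


def split_into_chunks(text: str, chunk_size: int = 200, overlap: int = 20) -> List[str]:
    """Split `text` into character chunks of at most `chunk_size`,
    where consecutive chunks share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


print(split_into_chunks("abcdefghij", chunk_size=4, overlap=1))
# ['abcd', 'defg', 'ghij']
```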

**2) Search (Once per query)**

   a) Embed the user's question, using the same embedding model that was used for the chunk embeddings.
   
   b) Rank the chunk embeddings by similarity to the question's embedding (using cosine similarity or Euclidean distance). The nearest vectors represent the chunks most similar to the question.
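The ranking in step 2b can be illustrated with a small cosine-similarity sketch. The two-dimensional vectors are made-up toy values (real embeddings have many more dimensions), and `cosine_similarity` and `rank_chunks` are hypothetical helper names. A vector database performs this search for us, so this only shows the idea.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between vectors a and b (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_chunks(question_vec: Sequence[float],
                chunk_vecs: Sequence[Sequence[float]],
                top_k: int = 2) -> List[int]:
    """Return the indexes of the top_k chunks most similar to the question."""
    scored = [(cosine_similarity(question_vec, vec), index)
              for index, vec in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [index for _, index in scored[:top_k]]


question = [1.0, 0.0]
chunks = [[0.0, 1.0], [1.0, 0.1], [0.5, 0.5]]
print(rank_chunks(question, chunks))   # [1, 2]
```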

**3) Ask (Once per query)**

   a) Insert the question and the most relevant chunks (obtained in step 2b) into a message to a GPT model.
   
   b) Return GPT's answer.
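Step 3a amounts to assembling one message from the question and the retrieved chunks. The sketch below shows one possible layout; the instruction wording and the `build_prompt` helper are assumptions for illustration, not a fixed template required by any library.

```python
from typing import List


def build_prompt(question: str, chunks: List[str]) -> str:
    """Combine the retrieved chunks and the question into a single message."""
    # Number the chunks so they are easy to tell apart in the message.
    context = "\n\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(chunks))
    return (
        "Answer the question using only the context below. "
        "If the answer is not in the context, say that you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


prompt = build_prompt(
    "What does the JVM do?",
    ["The JVM loads, verifies and executes Java bytecode.",
     "The JIT compiler improves performance at run time."],
)
print(prompt)
```

The resulting string is what would be sent to the GPT model as the input message.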

   
In this project we build a complete question-answering application on custom data that follows the pipeline above. This technique is also called retrieval augmentation, because we retrieve relevant information from an external knowledge base and give it to our LLM. The external knowledge base is our window into the world beyond the LLM's training data.

### 1) Prepare the document (Once per document)
#### Loading Your Custom (Private) PDF Documents into LangChain
The private data can come in different formats, such as Pandas DataFrames, PDFs, CSV or JSON files, HTML, or office documents.
**LangChain provides Document Loaders, which load this data from a source into Documents.** A Document is a piece of text with associated metadata. For example, there are document loaders for loading a simple .txt file, for loading the text contents of any web page, or even for loading a transcript of a YouTube video.
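To picture what "a piece of text with associated metadata" looks like, here is a minimal stand-in written with a dataclass. `SimpleDocument` is a hypothetical class for illustration, not LangChain's own class; it only mirrors the two attributes used later in this notebook, `page_content` and `metadata`, and the file name and page value are made up.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a document: text plus a metadata dictionary."""
    page_content: str
    metadata: dict = field(default_factory=dict)


doc = SimpleDocument(
    page_content="Java bytecode runs on the Java Virtual Machine.",
    metadata={"source": "files/example.txt", "page": 0},  # made-up values
)
print(doc.page_content)
print(doc.metadata["page"])   # 0
```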





In [31]:
import os
from dotenv import load_dotenv, find_dotenv
load_dotenv(find_dotenv(), override=True)

True

To load PDF files, install the pypdf library.

In [18]:
pip install pypdf -q

Note: you may need to restart the kernel to use updated packages.


In [19]:
pip install docx2txt -q



In [20]:
pip install wikipedia -q



In [None]:
# The following function takes a PDF file as its argument and returns its content as a list of LangChain Documents.
# It loads the PDF with the pypdf library, and each Document contains the page_content and the metadata (with the page number) of one page.

# Note on the import inside the function: the standard recommendation is to put import statements at the top of the file.
# Importing inside the function can still be convenient, because the function keeps working when it is moved from one module to another.

# def load_document(file):
#     from langchain.document_loaders import PyPDFLoader
#     print(f'Loading {file}')
#     loader = PyPDFLoader(file)      # PyPDFLoader can also load online PDFs: just pass it the URL of the PDF
#     data = loader.load()            # returns a list of LangChain Documents, one per page
#     return data





The code above loads PDF files into LangChain Documents. However, our private unstructured data isn't limited to PDF; it can come in various other formats such as office documents, Google Docs, and many more. The following code loads only the pdf and docx formats: it checks the file's extension and uses the matching LangChain loader.

In [23]:
# File loaders (pdf, docx)
#   Each loader reads a file in a specific format and converts it into LangChain Documents.
def load_document(file):
    import os
    name, extension = os.path.splitext(file)   # split the file name into name and extension, e.g. ('files/Learn_Java', '.pdf')

    if extension == '.pdf':
        from langchain.document_loaders import PyPDFLoader
        print(f'Loading {file}')
        loader = PyPDFLoader(file)  
    elif extension == '.docx':
        from langchain.document_loaders import Docx2txtLoader
        print(f'Loading {file}')
        loader = Docx2txtLoader(file)
    else:
        print('Document format is not supported!')
        return None
        
    data = loader.load()            
    return data



# Public service loader (Wikipedia)
#   Loads data from an online public service into LangChain. Here we don't deal with files but with the protocols or APIs of those services.
#   Since the format and code differ for each service, each supported service gets its own loader function.
def load_from_wikipedia(query, lang='en', load_max_docs=2):
    from langchain.document_loaders import WikipediaLoader
    loader = WikipediaLoader(query=query, lang=lang, load_max_docs=load_max_docs)    # load_max_docs limits the number of downloaded documents
    data = loader.load()
    return data
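One detail of the extension check in `load_document` is that it is case-sensitive, so a file named `REPORT.DOCX` would be reported as unsupported. A possible variation, shown here only as a sketch, is to normalize the extension before the lookup. `SUPPORTED_FORMATS` and `detect_format` are hypothetical names, and the file paths are made up.

```python
import os

# Hypothetical mapping from lowercase extension to a format label.
SUPPORTED_FORMATS = {'.pdf': 'pdf', '.docx': 'docx'}


def detect_format(file):
    """Return the format label for `file`, or None if the format is unsupported."""
    _, extension = os.path.splitext(file)
    # Lowercase the extension so that '.PDF' and '.pdf' are treated the same.
    return SUPPORTED_FORMATS.get(extension.lower())


print(detect_format('files/Learn_Java.pdf'))   # pdf
print(detect_format('files/REPORT.DOCX'))      # docx
print(detect_format('files/data.csv'))         # None
```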

##### Running Code

In [24]:
data = load_document('files/Learn_Java.pdf')                # PyPDFLoader can also load online PDFs: just pass it the URL of the PDF

print(data[20].page_content)         # The data is split by pages, and an index selects a specific page. Indexing starts at zero, so data[20] is the 21st page.
print(data[20].metadata)             # metadata is a dictionary
print(f'You have {len(data)} pages in your data')         # number of pages
print(f'There are {len(data[20].page_content)} characters in the page')                      # number of characters in that page




Loading files/Learn_Java.pdf
Teach Yourself JAVA in 21 DaysMTWTFSS
21
xxiv
P2/V4SQC6    TY  Java in 21 Days   030-4    louisa  12.31.95    FM   LP#4nnDay 16 covers interfaces and packages, useful for abstracting protocols of methods to
aid reuse and for the grouping and categorization of classes.
generated either by the system or by you in your programs.
nnDay 18 builds on the thread basics you learned on Day 10 to give a broad overview of
multithreading and how to use it to allow different parts of your Java programs to runin parallel.
nnOn Day 19, you’ll learn all about the input and output streams in Java’s I/O library.
nnDay 20 teaches you about native code—how to link C code into your Java programs
to provide missing functionality or to gain performance.
nnFinally, on Day 21, you’ll get an overview of some of the “behind-the-scenes” techni-
cal details of how Java works: the bytecode compiler and interpreter, the techniquesJava uses to ensure the integrity and security of your pro

In [25]:
data = load_document('files/java_notes.docx')     # here data is a list with a single element; its text is in the page_content attribute

print(data[0].page_content)

Loading files/java_notes.docx
Java Virtual Machine, or JVM, loads, verifies and executes Java bytecode. It is known as the interpreter or the core of Java programming language because it executes Java programming.



Java can be considered both a compiled and an interpreted language because its source code is first compiled into a binary byte-code. This byte-code runs on the Java Virtual Machine (JVM), which is usually a software-based interpreter



JIT compiler overview

Last Updated: 2021-02-28

The Just-In-Time (JIT) compiler is a component of the Java™ Runtime Environment that improves the performance of Java applications at run time.

Java programs consists of classes, which contain platform-neutral bytecodes that can be interpreted by a JVM on many different computer architectures. At run time, the JVM loads the class files, determines the semantics of each individual bytecode, and performs the appropriate computation. The additional processor and memory usage during interpretat

In [30]:
# data = load_from_wikipedia('Chandrayaan-3')
# Important note: the training data for GPT-4 was cut off in September 2021, and Chandrayaan-3 was launched in July 2023, so it is not part of the GPT-4 training data.
# Without loading the data from external sources, LLMs like gpt-3.5-turbo or gpt-4 have no knowledge of it.
data = load_from_wikipedia('Chandrayaan-3', 'hi')  # 'hi' loads the Hindi Wikipedia article
print(data[0].page_content)

चंद्रयान-3 चाँद पर खोजबीन करने के लिए भारतीय अंतरिक्ष अनुसंधान संगठन (इसरो) द्वारा भेजा गया तीसरा भारतीय चंद्र मिशन है। इसमें चंद्रयान-2 के समान एक लैंडर और एक रोवर है, लेकिन इसमें कक्षित्र (ऑर्बिटर) नहीं है।
यह मिशन चंद्रयान-2 की अगली कड़ी है, क्योंकि पिछला मिशन सफलता पूर्वक चाँद की कक्षा में प्रवेश करने के बाद अंतिम समय में मार्गदर्शन सॉफ्टवेयर में गड़बड़ी के कारण उतरने की नियंत्रित प्रकिया में विफल हो गया था, सॉफ्ट लैंडिंग का पुनः सफल प्रयास करने हेतु इस नए चंद्र परियोजना को प्रस्तावित किया गया था।
चंद्रयान-3 का प्रक्षेपण सतीश धवन अंतरिक्ष केंद्र (शार), श्रीहरिकोटा से 14 जुलाई, 2023 शुक्रवार को भारतीय समय अनुसार दोपहर 2:35 बजे हुआ था। यह यान चंद्रमा के दक्षिणी ध्रुव के पास की सतह पर 23 अगस्त 2023 को भारतीय समय अनुसार सायं 06:04 बजे के आसपास सफलतापूर्वक उतर चुका है। इसी के साथ भारत चंद्रमा के दक्षिणी ध्रुव पर सफलतापूर्वक अंतरिक्ष यान उतारने वाला पहला और चंद्रमा पर उतरने वाला चौथा देश बन गया।


== इतिहास ==
चंद्रमा पर उतरने की नियंत्रित प्रक्रिया (सॉफ्ट लैंडिंग) की क्षमता प्रदर्शित कर