# Table of Contents
## Building a Question Answering System

1. [Introduction](#Building-a-Question-Answering-System:)
   1. [Our Primary Aim](#Introduction)
2. [Synergizing Components for an Effective System](#Synergizing-Components-for-an-Effective-System)
   1. [System Overview](#Overview)
   2. [Embeddings: The Core](#Embeddings:-The-Core)
   3. [Vector Databases: Enhancing Retrieval Efficiency](#Vector-Databases:-Enhancing-Retrieval-Efficiency)
   4. [Langchain Library: Simplifying Language Tasks](#Langchain-Library:-Simplifying-Language-Tasks)
   5. [Prompt Engineering: Directing ChatGPT](#Prompt-Engineering:-Directing-ChatGPT)
   6. [ChatGPT: The Conversational Heart](#ChatGPT:-The-Conversational-Heart)

3. [Seamless Workflow Overview](#Seamless-Workflow-Overview)

4. [Project Steps Breakdown](#Project-Steps-Breakdown)
   1. [Step 1: Importing Libraries](#Step-1:-Importing-Libraries)
   2. [Step 2: PDF Text Extraction](#Step-2:-PDF-Text-Extraction)
   3. [Step 3: Preparing for Subsequent Steps](#Step-3:-Preparing-for-Subsequent-Steps)
   4. [Step 4: Initializing Document Loader](#Step-4:-Initializing-Document-Loader)
   5. [Step 5: Preparing Text Splitter](#Step-5:-Preparing-Text-Splitter)
   6. [Step 6: Text Splitting Process](#Step-6:-Text-Splitting-Process)
   7. [Step 7: API Key Setup](#Step-7:-API-Key-Setup)
   8. [Step 8: Creating Chroma Vector Store](#Step-8:-Creating-Chroma-Vector-Store)
   9. [Step 9: Setting up VectorDBQA](#Step-9:-Setting-up-VectorDBQA)
   10. [Step 10: Executing the First Query](#Step-10:-Executing-the-First-Query)
   11. [Step 11: Conducting Additional Queries](#Step-11:-Conducting-Additional-Queries)

5. [Conclusion](#Conclusion)


# Building a Question Answering System: 
### Utilizing ChatGPT to Answer Questions Based on Your Personal Data
## Introduction


Our primary aim is to develop a system capable of effectively answering a wide range of questions, regardless of their complexity. This challenge involves working with data, language understanding, and delivering user-friendly responses.

Imagine being able to request key information from an email thread or locate specific details within your company's documents effortlessly. This is made possible by harnessing the power of embeddings, vectors, vector databases, and prompt engineering in conjunction with large language models like ChatGPT.

This system refines user queries into structured prompts, amalgamating the query with relevant text passages from the vector database. This approach furnishes ChatGPT with a rich context, enhancing the relevance and accuracy of its responses.



# Synergizing Components for an Effective System

## Overview

Our quest to develop a robust question-answering system integrates embeddings, vector databases, the Langchain library, prompt engineering, and ChatGPT.
In this system, the material needed to answer a user's query lives in the vector database built from our data file. The system uses vector search to find the text passages in the database that are most similar to the query. It then combines those passages with the query to form a prompt for ChatGPT, which generates a response grounded in the supplied context while drawing on its general language abilities to phrase the answer.

Let's delve into how these elements synergize.

### Embeddings: The Core

Embeddings are central to our system, converting text into numerical representations. They serve as the bridge between language and mathematics, facilitating analysis and processing.
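
To make the idea of turning text into numbers concrete, here is a minimal sketch that counts words over a fixed vocabulary. It is only an illustration: the function name `toy_embed` and the vocabulary are hypothetical, and real embedding models (such as the OpenAI model used later) produce learned dense vectors that capture meaning rather than raw word counts.

```python
from collections import Counter

def toy_embed(text, vocabulary):
    """Return a count vector with one number per vocabulary word (toy example)."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

# Hypothetical vocabulary chosen only for this illustration
vocabulary = ["language", "model", "tuning", "pizza"]
vector = toy_embed("Fine tuning a language model improves the model", vocabulary)
print(vector)  # [1, 2, 1, 0]
```

Texts that use similar words end up with similar vectors, which is the property the rest of the system relies on.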

### Vector Databases: Enhancing Retrieval Efficiency

Vector databases are pivotal for storing and retrieving embeddings. They optimize searches in high-dimensional spaces, ensuring fast and precise results from user queries.
Vectors, or numerical text fingerprints, transform words or documents into numerical arrays, capturing their essence. These embeddings, akin to stars in a galaxy, are positioned in a mathematical space, signifying textual similarities and differences. Closer points indicate related texts, making information retrieval efficient and relevant.
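
The notion of "closer points" is commonly measured with cosine similarity. The sketch below ranks a few hand-written vectors against a query vector. The vectors and labels are made up for illustration, and a real vector database typically relies on specialized indexes rather than the linear scan shown here.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_a * norm_b)

def most_similar(query_vector, stored, k=1):
    """Return the k labels whose vectors score highest against the query."""
    ranked = sorted(
        stored,
        key=lambda label: cosine_similarity(query_vector, stored[label]),
        reverse=True,
    )
    return ranked[:k]

# Made-up vectors standing in for stored passage embeddings
stored = {
    "fine-tuning passage": [0.9, 0.1, 0.0],
    "cooking passage": [0.0, 0.2, 0.9],
}
print(most_similar([1.0, 0.0, 0.1], stored))  # ['fine-tuning passage']
```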

### Langchain Library: Simplifying Language Tasks

Langchain aids in text processing and transforming text into embeddings, streamlining complex language operations.

### Prompt Engineering: Directing ChatGPT

Crafting structured prompts merges user queries with relevant text, providing ChatGPT with direction for generating contextually relevant responses.
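
A structured prompt of this kind can be sketched in a few lines. The template wording and the function name `build_prompt` are assumptions made for illustration; they are not the prompt that Langchain sends internally.

```python
def build_prompt(question, passages):
    """Combine retrieved passages and the user's question into one prompt."""
    context = "\n\n".join(passages)
    return (
        "Answer the question using only the context below. "
        "If the context is not enough, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Placeholder passage; in the real system this comes from the vector database
prompt = build_prompt(
    "What is the document about?",
    ["<relevant passage retrieved from the vector database>"],
)
print(prompt)
```

Keeping the context and the question in clearly labeled sections makes it easier for the model to tell the supplied material apart from the request.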

### ChatGPT: The Conversational Heart

ChatGPT, the core language model, processes these prompts to deliver human-like, coherent responses, making the system interactive and user-friendly.


## Seamless Workflow Overview

1. **Embedding Generation**: Langchain converts text into numerical embeddings.
2. **Vector Database Storage**: These embeddings are stored for optimized retrieval.
3. **Query Processing**: User queries are transformed into embeddings.
4. **Similarity Search**: The system identifies relevant matches for the query embedding.
5. **Prompt Crafting**: A context-rich prompt is formulated.
6. **Response Generation**: ChatGPT interprets the prompt to generate a relevant response.

# Project Steps Breakdown

### Step 1: Importing Libraries
I started by installing and importing the libraries I needed for my project.

##### pip install requests PyPDF2:
This installs requests for making HTTP requests (to download the PDF file) and PyPDF2 for handling PDF files.
##### pip install langchain:
Installs the Langchain library. 
##### pip install openai:
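Installs the OpenAI library, which is needed for the embeddings and chat models used through Langchain.
##### pip install chromadb unstructured tiktoken:
Installs the packages that the later steps rely on: chromadb backs the Chroma vector store, unstructured backs `UnstructuredFileLoader`, and tiktoken is used by `OpenAIEmbeddings` to count tokens.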


### Step 2: PDF Text Extraction

I defined a function called `extract_text_from_pdf` to extract text from a PDF using its URL. This function sends a GET request to the PDF URL, saves the PDF content to a local file, uses PyPDF2 to extract the text from the PDF, and returns that text. I then wrote the returned text to "ml_competition.txt" and printed a message to confirm that it had been saved.

In [13]:
# Import necessary libraries
import requests
import PyPDF2

# Define a function to extract text from a PDF given its URL
def extract_text_from_pdf(pdf_url):
    # Send a GET request to the PDF URL and fail early on an HTTP error
    response = requests.get(pdf_url, timeout=60)
    response.raise_for_status()
    
    # Save the PDF content to a local file named "ML-Competition.pdf"
    with open("ML-Competition.pdf", "wb") as pdf_file:
        pdf_file.write(response.content)
    
    # Initialize an empty string to store extracted text
    extracted_text = ""
    
    # Reopen the saved PDF for reading; the file is closed automatically
    with open("ML-Competition.pdf", "rb") as pdf_file:
        # PdfReader replaces PdfFileReader, which PyPDF2 3.0 removed
        pdf_reader = PyPDF2.PdfReader(pdf_file)
        
        # Loop through each page in the PDF and extract text
        for page in pdf_reader.pages:
            extracted_text += page.extract_text()
    
    # Return the extracted text
    return extracted_text

# Call the function with a PDF URL and save the extracted text to a text file
pdf_text = extract_text_from_pdf("https://arxiv.org/pdf/2310.02263.pdf")
file_path = "ml_competition.txt"
with open(file_path, "w", encoding="utf-8") as file:
    file.write(pdf_text)

# Print a message indicating where the extracted text is saved
print(f"Extracted text saved to {file_path}")


Extracted text saved to ml_competition.txt


### Step 3: Preparing for Subsequent Steps

I imported additional libraries and modules that I'll need for the next parts of my project, which involve working with the extracted text data.

In [14]:
# Import necessary libraries for the next part
from langchain.document_loaders.unstructured import UnstructuredFileLoader
from langchain.text_splitter import CharacterTextSplitter
import os
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.chains import RetrievalQA, VectorDBQA
from langchain.chat_models import ChatOpenAI


### Step 4: Initializing Document Loader

I initialized a document loader with the extracted text file "ml_competition.txt" and then loaded its contents into a variable for later use. This is a typical pattern in data processing: first set up a mechanism for reading the data (a loader), then use it to bring the data into the program.

In [18]:
# Initialize a document loader with the extracted text file
loader = UnstructuredFileLoader('ml_competition.txt')

# Load documents using the document loader
documents = loader.load()
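
To show roughly what a loader hands back, here is a simplified sketch using only the standard library. `SimpleDocument` and `load_text_file` are hypothetical stand-ins, not Langchain classes; they mimic the general shape of a loaded document, namely the text plus some metadata about where it came from.

```python
import os
import tempfile
from dataclasses import dataclass, field

@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a loaded document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def load_text_file(path):
    """Read a text file and wrap it in a one-element list of documents."""
    with open(path, "r", encoding="utf-8") as handle:
        text = handle.read()
    return [SimpleDocument(page_content=text, metadata={"source": path})]

# Demonstrate on a temporary file so the sketch runs anywhere
with tempfile.TemporaryDirectory() as folder:
    sample_path = os.path.join(folder, "sample.txt")
    with open(sample_path, "w", encoding="utf-8") as handle:
        handle.write("Hello from a text file.")
    docs = load_text_file(sample_path)

print(len(docs), docs[0].page_content)  # 1 Hello from a text file.
```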


### Step 5: Preparing Text Splitter

I initialized a text splitter object with specific configuration settings for how text should be divided into chunks. This object is then used for processing the text data in smaller, more manageable parts.

**Initialize Text Splitter**:
   - `text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)`
   - This line creates a `CharacterTextSplitter` object, which is used to split text into smaller parts. The `chunk_size` parameter is set to 1000, so the splitter aims for chunks of at most about 1000 characters; a chunk can be longer when the text contains no separator at which to split. The `chunk_overlap` parameter is set to 0, meaning consecutive chunks share no text.


In [None]:
# Initialize a text splitter
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
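
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified fixed-window splitter. It is a sketch under a stated assumption: `CharacterTextSplitter` itself splits on a separator and merges the pieces up to the size limit, whereas the hypothetical `split_fixed` below simply cuts every `chunk_size` characters. It is meant only to show what the two parameters control.

```python
def split_fixed(text, chunk_size, chunk_overlap=0):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]

print(split_fixed("abcdefghij", chunk_size=4))
# ['abcd', 'efgh', 'ij']
print(split_fixed("abcdefghij", chunk_size=4, chunk_overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

With an overlap of 0, as used in this project, each character appears in exactly one chunk. A non-zero overlap repeats some text so that a sentence cut at a boundary still appears whole in one of the chunks.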


### Step 6: Text Splitting Process


**Split Documents into Chunks**:
   - `texts = text_splitter.split_documents(documents)`
   - This line applies the `split_documents` method of the `text_splitter` object to the `documents`. The result is the division of the `documents` text into smaller, more manageable chunks, stored in the `texts` variable. This process makes the data suitable for further analysis or processing steps.


In [19]:
# Split documents into smaller chunks
texts = text_splitter.split_documents(documents)

### Step 7: API Key Setup

I set the environment variable OPENAI_API_KEY to a specific value, which is the API key provided by OpenAI.
This environment variable is crucial for authenticating requests made to the OpenAI API, such as generating embeddings or interacting with language models like GPT-3 or GPT-4.

#### Initializing OpenAI Embeddings
I initialized OpenAI embeddings to work with the text data effectively.
The call `OpenAIEmbeddings()` creates an instance of Langchain's `OpenAIEmbeddings` class, which interfaces with OpenAI's service for generating text embeddings.
Because no configuration parameters are passed, the object uses the default settings and reads the API key from the `OPENAI_API_KEY` environment variable, so the key has to be set before this call.
The resulting object, `embeddings`, is now ready to be used for generating embeddings of text data. Embeddings are numerical representations that capture the semantic essence of the text, which can be used in various natural language processing applications.
This step prepares the text data for operations like semantic search or similarity analysis in later stages of the pipeline.

#### Printing API Key
I printed the API key from the environment variable to ensure it was set correctly.
It retrieves the value of the environment variable OPENAI_API_KEY using Python's os.environ.get method.
This method looks up the value of the environment variable named OPENAI_API_KEY, which was previously set in the script.
print is then used to display this value. This can be useful for verification purposes to ensure that the correct API key is being used.

In [None]:
# Set the OPENAI_API_KEY environment variable (replace with your API key)
os.environ["OPENAI_API_KEY"] = "******My-API-KEY*****"

# Print the API key from the environment variable
print(os.environ.get("OPENAI_API_KEY"))

# Initialize OpenAI embeddings (reads the key set above)
embeddings = OpenAIEmbeddings()
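
Printing the full key is handy for a quick check, but it also leaves the key visible in the notebook output. A possible alternative, sketched below, confirms that the variable is set while showing only its first and last few characters. The helper name `masked_api_key` and the `DEMO_API_KEY` variable are hypothetical and exist only for this illustration.

```python
import os

def masked_api_key(variable="OPENAI_API_KEY"):
    """Return a masked form of the key, or None when the variable is unset."""
    key = os.environ.get(variable)
    if not key:
        return None
    if len(key) <= 8:
        return "*" * len(key)  # too short to reveal any part safely
    return key[:3] + "*" * (len(key) - 7) + key[-4:]

# Placeholder value for demonstration only; never hard-code a real key
os.environ["DEMO_API_KEY"] = "sk-demo-1234567890"
print(masked_api_key("DEMO_API_KEY"))
```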

### Step 8: Creating Chroma Vector Store

I created a Chroma vector store from the text chunks and embeddings, which enables efficient similarity searches.


#### Initializing a Chroma Vector Store

- **Initializing a Chroma Vector Store**: The store is created with Langchain's `Chroma` class, a wrapper around Chroma, an open-source database for storing and searching vector representations of text.

- **Using the `from_documents` Method**: `from_documents` is a class method of `Chroma`. It creates a new vector store from a given set of documents or text chunks.

- **Input Parameters**:
   - `texts`: The list of document chunks produced by the text splitter in Step 6. Each chunk is embedded and stored in the vector database.
   - `embeddings`: The `OpenAIEmbeddings` instance created in Step 7. It generates the embedding for each chunk. Embeddings are numerical representations that capture the semantic meaning of the text.

- **Storing in Variable `db`**: The result of `Chroma.from_documents(texts, embeddings)`—the Chroma vector store—is then stored in the variable `db`. This vector store (`db`) can be used for various operations like similarity searches, where you can find the most semantically relevant documents or text chunks for a given query.

In short, `Chroma.from_documents(texts, embeddings)` creates a searchable database of text data in which each piece of text is represented by its semantic embedding, so that search and retrieval are driven by meaning rather than exact wording.

In [None]:
# Create a Chroma vector store from the text chunks
db = Chroma.from_documents(texts, embeddings)

### Step 9: Setting up VectorDBQA

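I initialized a VectorDBQA chain for question answering, using ChatOpenAI as the language model. Newer Langchain releases deprecate `VectorDBQA` in favor of `RetrievalQA`, which is already imported in Step 3.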

#### Initializing a Vector-based Question Answering System

`qa = VectorDBQA.from_chain_type(llm=ChatOpenAI(), chain_type="map_reduce", vectorstore=db, k=1)`

- **Initialize a VectorDBQA**: This line initializes a vector-based Question Answering (QA) chain. Question answering is a natural language processing task in which a system answers questions posed in human language.

- **Input Parameters**:
   - `llm=ChatOpenAI()`: Sets the language model to `ChatOpenAI()`, Langchain's wrapper around OpenAI's chat models, which generates the text of the answer.
   - `chain_type="map_reduce"`: Specifies how the retrieved text is passed to the model. With "map_reduce," the model is first applied to each retrieved chunk separately (map), and the partial results are then combined into a final answer (reduce).
   - `vectorstore=db`: Points the chain at `db`, the Chroma vector store created in Step 8.
   - `k=1`: Specifies how many of the most similar chunks to retrieve from the vector store for each query. With 1, only the single closest chunk is passed to the model as context.

- **Functionality**:
   - This line creates a QA chain that answers questions from a vector store of text data, using the specified language model (`ChatOpenAI`), chain type, and vector store (`db`).
   - The `k` parameter controls how much retrieved context the model sees for each question, not how many answers it returns.

This code segment sets up the foundation for a vector-based question answering system, enabling it to respond to questions based on the information stored in the vector database.



In [21]:
# Initialize a VectorDBQA
qa = VectorDBQA.from_chain_type(llm=ChatOpenAI(), chain_type="map_reduce", vectorstore=db, k=1)
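
The map-reduce idea can be sketched without any API calls by passing in a stand-in for the model. Everything here is an assumption made for illustration: `map_reduce_answer` and `fake_llm` are hypothetical names, and the prompt wording is not what Langchain uses internally.

```python
def map_reduce_answer(question, chunks, llm):
    """Answer per chunk (map), then merge the partial answers (reduce)."""
    # Map step: ask the model about each retrieved chunk on its own
    partial_answers = [
        llm(f"Context:\n{chunk}\n\nQuestion: {question}\nAnswer:")
        for chunk in chunks
    ]
    # Reduce step: ask the model to combine the partial answers
    combined = "\n".join(partial_answers)
    return llm(f"Partial answers:\n{combined}\n\nQuestion: {question}\nFinal answer:")

# Stub model used only so the sketch runs without an API key
calls = []
def fake_llm(prompt):
    calls.append(prompt)
    return f"reply {len(calls)}"

result = map_reduce_answer("What is the document about?", ["chunk A", "chunk B"], fake_llm)
print(result, len(calls))  # reply 3 3
```

Two chunks lead to three model calls: one per chunk and one to combine. With `k=1`, as in this project, only one chunk is retrieved, so the map step runs once.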


### Step 10: Executing the First Query

`query = "What is the document about"`
`qa.run(query)`

- **Define a Query**: In this line of code, a query is defined as a text string. The query serves as the question or input for the Question Answering (QA) system. In this case, the query is set to "What is the document about."

- **Run the Query**: The code `qa.run(query)` runs the defined query through the Question Answering (QA) system. The system retrieves the most similar chunk from the vector store and has the language model generate a response from that context.

- **Output**: The output of this operation is the response generated by the QA system in answer to the query. The response will typically be a text string that provides an answer or explanation related to the query.

- **Usage**: This code segment demonstrates how to use the initialized QA system to obtain answers to specific questions or queries. You can replace the query with different questions to get relevant responses from the system.



In [9]:
# Define a query and run it through the QA system
query = "What is the document about"
qa.run(query)



'The document is about fine-tuning language models using human preferences.'

### Step 11: Conducting Additional Queries
I defined another query, "Explain fine-tuning language models in the language and tone of a rapper," and ran it through the QA system. The system provided a creative rapper-themed explanation of fine-tuning language models.

In [8]:
# Define another query and run it through the QA system
query = "Explain fine-tuning language models in the language and tone of a rapper"
qa.run(query)


"Yo, listen up, let me drop some knowledge in a rapper's style\nFine-tuning language models, it's like takin' somethin' dope and makin' it wild\nWe start with these Large Language Models, they're the real deal\nTrained on trillions of tokens, they got that mass appeal\n\nBut we ain't satisfied, we want 'em to be the best\nSo we put 'em to the test, pushin' 'em beyond the rest, no time for rest\nWe got these procedures, like supervised instruction tuning\nTeachin' 'em specific tasks, get 'em groovin', no misassumptions\n\nThen there's Reinforcement Learning from Human Feedback\nLet the people guide 'em, give 'em that street cred, that real-life street beat\nThey learn from feedback, adjustin' and adaptin'\nSpittin' rhymes that'll blow your mind, no misconceptions, just straight snappin'\n\nBut hold up, there's more, we ain't done yet\nContrastive post-training, yo, don't you forget\nIt's all about alignment, findin' that perfect match\nStartin' easy and movin' to the hard batch, ain't n

**Isn't it amazing?** The answer in the language and tone of a rapper is creative, and it shows how adaptable the question-answering system is. It turns a complex concept like "fine-tuning language models" into a hip-hop narrative that is engaging and easy to follow. The response shows that the system can tailor its answers to the style and tone the user asks for, which adds a layer of fun and accessibility to the interaction and makes technical explanations more relatable.

# Conclusion
In conclusion, our project shows how modern NLP techniques and tools can be combined into a versatile question-answering system that applies to diverse use cases. By combining embeddings, vector databases, and prompt engineering, we put a large language model like ChatGPT to work on our own data, with search and retrieval driven by meaning rather than exact wording. This project is a step towards more accessible and powerful natural language understanding and interaction systems.