# Embedding with Retrieval QA

This notebook uses the Pinecone, OpenAI, and LangChain libraries to extract question-answer pairs from a document. The main technique is Retrieval QA through indexing. Retrieval QA still relies on a large language model (LLM) to generate the answers; the difference is that the model is given relevant chunks retrieved from a vector store, here Pinecone, which stores the chunk embeddings and returns the ones most similar to each question.

Reference:
1. Notebook - https://colab.research.google.com/drive/17r8QmiH8m7irJ08r9-QkjOaBF9e9nZVd : the indexing pipeline is adapted from this notebook

By Sacha Heyer - Heyer, S. (2023) ‘Generative AI - Document Retrieval and Question Answering with LLMs’, Google Cloud - Community, 27 July. Available at: https://medium.com/google-cloud/generative-ai-document-retrieval-and-question-answering-with-llms-2b0fb80ae76d 

Advantages of indexing over tuning:
- New documents are available in real time, whereas tuning might require a couple of hours.
- We circumvent the context size limitations. Most LLMs allow around 4000 tokens per request, which makes it impossible to provide a large amount of data in one prompt. With the indexing approach, the document collection can be arbitrarily large because we retrieve similar documents and send only the relevant ones to the LLM.
- Restricted documents that aren’t supposed to be available to everyone can be filtered at runtime, whereas a tuned model knows nothing about access restrictions on documents.
- Cheaper, because no LLM fine-tuning is required.
- Explainable: every answer can be traced back to the retrieved source data, which helps to verify whether it is correct.
- Combined with prompt engineering, we can reduce hallucinations.
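
The runtime filtering point can be illustrated with a minimal sketch. It assumes each chunk carries a hypothetical allowed_groups metadata field and that candidate chunks have already been retrieved; the Chunk class, the field name, and the filter_by_access helper are illustrative and do not come from this notebook or from any library.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """A retrieved text chunk with illustrative access metadata."""
    text: str
    # Hypothetical metadata field: groups allowed to read this chunk.
    allowed_groups: set = field(default_factory=set)


def filter_by_access(chunks, user_groups):
    """Keep only chunks that share at least one group with the user."""
    groups = set(user_groups)
    return [c for c in chunks if c.allowed_groups & groups]


candidates = [
    Chunk("Public launch schedule.", {"public", "staff"}),
    Chunk("Internal cost estimate.", {"finance"}),
]

# Only chunks the user may see would be passed on to the LLM.
visible = filter_by_access(candidates, user_groups=["public"])
print([c.text for c in visible])
```

Because the filter runs per request, changing a user's groups takes effect immediately, without retraining anything.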

In [1]:
# Loading environment variables using the dotenv library.
import os
from dotenv import load_dotenv
load_dotenv()

True

In [2]:
# Importing the necessary libraries
import openai
import langchain
import pinecone



## Loading Document

In [3]:
filename = "data/authorize_doc/Kuiper_FCC-20-102A1.txt"

In [4]:
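# Load the document, trying several encodings in turn
def import_document(filename):
    # Note: ISO-8859-1 can decode any byte sequence, so the encodings listed after it are never tried
    encodings = ['utf-8', 'ISO-8859-1', 'utf-16', 'ascii', 'cp1252']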
    for enc in encodings:
        try:
            with open(filename, 'r', encoding=enc) as file:
                document_text = file.read()
            return document_text
        except UnicodeDecodeError:
            continue
        except FileNotFoundError:
            print(f"Error: File '{filename}' not found.")
            return None
        except Exception as e:
            print(f"Error occurred while importing the document: {e}")
            return None
    print(f"Error: Could not decode file with any of the tried encodings: {encodings}")
    return None

document = import_document(filename)
if document is not None:
    print("Document content:")
    print(document)

Document content:
﻿	Federal Communications Commission	FCC 20-102


Before the
FEDERAL COMMUNICATIONS COMMISSION
WASHINGTON, D.C. 20554


In the Matter of

Kuiper Systems, LLC

Application for Authority to Deploy and Operate a Ka-band Non-Geostationary Satellite Orbit System
 
)
)
)
)
)
)





IBFS File No. SAT-LOA-20190704-00057

Call Sign S3051

ORDER AND AUTHORIZATION

Adopted:  July 29, 2020	Released:  July 30, 2020

By the Commission:

I. INTRODUCTION
1. In this Order and Authorization, we grant, to the extent set forth below, the request of Kuiper Systems LLC (Kuiper or Amazon) to deploy a non-geostationary satellite orbit (NGSO) system to provide service using certain Fixed-Satellite Service (FSS) and Mobile-Satellite Service (MSS) Ka-band frequencies with conditions adopted herein.  
2. Specifically, we grant Kuiper’s application for authority to deploy and operate its NGSO FSS system in the 17.7-17.8 GHz, 17.8-18.6 GHz, 18.8-19.3 GHz, 19.3-19.7 GHz, 19.7-20.2 GHz, 27.5-28.6 GHz

## Chunking and Overlapping the document

The same chunk size and overlap parameters are used for all three approaches: In-Context Learning, Indexing, and the Tune QA model.
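
To show how a chunk size and an overlap interact, here is a simplified sketch using fixed-size character windows. It is not the algorithm of RecursiveCharacterTextSplitter, which also tries to break on separators such as paragraph and line breaks, so its chunks are at most chunk_size long and the overlap is not exact. The sliding_chunks helper is hypothetical.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = sliding_chunks(sample, chunk_size=10, chunk_overlap=3)
print(pieces)
```

Each window repeats the last chunk_overlap characters of the previous one, so a sentence cut at a chunk boundary still appears whole in at least one chunk when it is short enough.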

In [5]:
from langchain.schema import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [6]:
# Create the text splitter with specific parameters
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,           # Maximum chunk length, measured by length_function
    chunk_overlap=100,        # Length shared between consecutive chunks
    length_function=len,      # Measure length in characters
    keep_separator=True       # Keep the separators in the chunk text
)

# Instantiate the Document with the content
doc = Document(page_content=document)

# Split the document into smaller chunks
split_docs = text_splitter.split_documents([doc])



In [7]:
# The document is now separated into chunks, which may overlap
split_docs

[Document(page_content='\ufeff\tFederal Communications Commission\tFCC 20-102\n\n\nBefore the\nFEDERAL COMMUNICATIONS COMMISSION\nWASHINGTON, D.C. 20554\n\n\nIn the Matter of\n\nKuiper Systems, LLC\n\nApplication for Authority to Deploy and Operate a Ka-band Non-Geostationary Satellite Orbit System\n \n)\n)\n)\n)\n)\n)\n\n\n\n\n\nIBFS File No. SAT-LOA-20190704-00057\n\nCall Sign S3051\n\nORDER AND AUTHORIZATION\n\nAdopted:  July 29, 2020\tReleased:  July 30, 2020\n\nBy the Commission:', metadata={}),
 Document(page_content='I. INTRODUCTION\n1. In this Order and Authorization, we grant, to the extent set forth below, the request of Kuiper Systems LLC (Kuiper or Amazon) to deploy a non-geostationary satellite orbit (NGSO) system to provide service using certain Fixed-Satellite Service (FSS) and Mobile-Satellite Service (MSS) Ka-band frequencies with conditions adopted herein.', metadata={}),
 Document(page_content='2. Specifically, we grant Kuiper’s application for authority to deploy an

## Embedding the document chunks and storing them

We create an embedding for every document chunk using the OpenAI text-embedding-ada-002 ("ada") model and store the embeddings in the Pinecone vector database.

In [8]:
#from langchain.llms import ChatOpenAI
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Pinecone

In [9]:
# List the OpenAI models available to this account
openai.Model.list()

<OpenAIObject list at 0x18f0eae0090> JSON: {
  "object": "list",
  "data": [
    {
      "id": "davinci",
      "object": "model",
      "created": 1649359874,
      "owned_by": "openai",
      "permission": [
        {
          "id": "modelperm-uJaD4C9nXA6tPNoBII9hcYF4",
          "object": "model_permission",
          "created": 1692634268,
          "allow_create_engine": false,
          "allow_sampling": true,
          "allow_logprobs": true,
          "allow_search_indices": false,
          "allow_view": true,
          "allow_fine_tuning": false,
          "organization": "*",
          "group": null,
          "is_blocking": false
        }
      ],
      "root": "davinci",
      "parent": null
    },
    {
      "id": "gpt-4-0314",
      "object": "model",
      "created": 1687882410,
      "owned_by": "openai",
      "permission": [
        {
          "id": "modelperm-eJxIX8r0jTNkOTBy09sEo3Cp",
          "object": "model_permission",
          "created": 1693326339,
    

### Embedding model

In [10]:
# Embedding model (the parameter is named model; modelName is not recognized and triggers a warning)
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")

# Find the embedding dimension by embedding a short test string
query_result = embeddings.embed_query("Mak Kau Hijau")
len(query_result)


1536

Here Pinecone serves as our vector store; the index is configured with 1536 dimensions, matching the embedding size above, and cosine similarity as the metric. Other vector stores, such as Chroma or a custom vector store, could be used instead.
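
For reference, cosine similarity compares the direction of two vectors and ignores their length. The sketch below is a plain-Python version of the standard formula, not Pinecone's implementation; the cosine_similarity name is illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


print(cosine_similarity([1, 2, 3], [2, 4, 6]))  # same direction
print(cosine_similarity([1, 0], [0, 1]))        # orthogonal
```

The dimension check mirrors why the index dimension must match the embedding size: vectors of different lengths cannot be compared.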

### Store Embedding

In [21]:
# Initialize Pinecone
pinecone_api_key = os.getenv("PINECONE_API_KEY")
pinecone_env = os.getenv("PINECONE_ENV")

if not pinecone_api_key or not pinecone_env:
    raise ValueError("Environment variables not set.")

pinecone.init(api_key=pinecone_api_key, environment=pinecone_env)

index_name = "embedding"
# Pinecone only allows lowercase index names

In [24]:
# Create index if it doesn't exist
if index_name not in pinecone.list_indexes():
    pinecone.create_index(name=index_name, metric='cosine', dimension=len(query_result))

# Store documents in Pinecone
docsearch = Pinecone.from_documents(split_docs, embeddings, index_name=index_name)



Feel free to check the Indexes tab in Pinecone at this point to confirm the presence of vectors.

## Querying the embedded chunks with questions

In [25]:
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

In [26]:
# Initialize the chat model
openai_api_key = os.getenv("OPENAI_API_KEY")
if not openai_api_key:
    raise ValueError("OPENAI_API_KEY not set.")

llm = ChatOpenAI(
    model_name="gpt-3.5-turbo",
    temperature=0,  # Don't be creative and make up an answer
    request_timeout=120,
    openai_api_key=openai_api_key
)


Setting temperature to 0 means the model sticks to the most probable tokens rather than generating creative answers.
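
The effect of temperature can be sketched with the standard softmax-with-temperature formulation. This is an assumption about how sampling temperature is commonly defined, not a description of OpenAI's internal implementation; the logits are made-up numbers and softmax_with_temperature is a hypothetical helper.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, sharpened by a low temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; the limit at 0 is argmax")
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # illustrative scores for three candidate tokens
warm = softmax_with_temperature(logits, 1.0)
cold = softmax_with_temperature(logits, 0.1)
print([round(p, 3) for p in warm])
print([round(p, 3) for p in cold])
```

As the temperature approaches 0, nearly all probability mass moves to the highest-scoring token, which is why a temperature of 0 behaves like always picking the most probable token.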

We now build the QA chain and configure how it behaves. Its components are described below.

### Retrieving the embedded docs

Setting up the retriever to get the most relevant documents for a given query.

In [27]:
# Setting up the retriever

# Alternative: only retrieve documents whose relevance score is above a threshold
#retriever = docsearch.as_retriever(search_type="similarity_score_threshold", search_kwargs={'score_threshold': 0.8})

# docsearch is the vector store, in this case Pinecone
# Only get the two most similar document chunks from the vector database
retriever = docsearch.as_retriever(search_kwargs={'k': 2})
# k: how many document chunks are returned


Resource (creating a chain, retriever, and custom prompt): https://python.langchain.com/docs/use_cases/question_answering/how_to/vector_db_qa

We can also modify the search by passing specific search arguments through the retriever to the search function, using the search_kwargs keyword argument.

- k defines how many documents are returned; defaults to 4.
- score_threshold allows you to set a minimum relevance for documents returned by the retriever, if you are using the "similarity_score_threshold" search type.
- fetch_k determines the number of documents to pass to the MMR algorithm; defaults to 20.
- lambda_mult controls the diversity of results returned by the MMR algorithm, with 1 being minimum diversity and 0 being maximum. Defaults to 0.5.
- filter allows you to define a filter on what documents should be retrieved, based on the documents' metadata. This has no effect if the Vectorstore doesn't store any metadata.
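
The search arguments listed above can be sketched in plain Python. This is a simplified illustration on tiny 2-D vectors, not LangChain's or Pinecone's code: it scores with raw cosine similarity, passes every document to MMR instead of limiting candidates with fetch_k, and uses a common greedy formulation of MMR. The top_k and mmr helpers and the vectors are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, doc_vecs, k, score_threshold=None):
    """Indices of the k most similar vectors, optionally above a threshold."""
    scored = sorted(
        ((cosine(query_vec, v), i) for i, v in enumerate(doc_vecs)),
        reverse=True,
    )
    if score_threshold is not None:
        scored = [(s, i) for s, i in scored if s >= score_threshold]
    return [i for _, i in scored[:k]]


def mmr(query_vec, doc_vecs, k, lambda_mult=0.5):
    """Greedy maximal marginal relevance: trade relevance against redundancy."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Similarity to the closest already-selected document.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


query = [1.0, 0.0]
docs = [[1.0, 0.1], [1.0, 0.12], [0.5, -0.85]]  # docs 0 and 1 are near-duplicates

print(top_k(query, docs, k=2))
print(mmr(query, docs, k=2, lambda_mult=0.5))
```

With lambda_mult=1 the MMR score reduces to pure relevance, so it returns the same documents as top_k; lower values penalize near-duplicates of documents already chosen.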

### Setting up the prompt - chain_type_kwargs

In [28]:
# Setting up the prompt 
from langchain.prompts import PromptTemplate

# Define the template without direct variable embedding
prompt_template = """
Follow exactly those 3 steps:
1. Read the context below and aggregate this data
Context: {context}
2. Answer the question using only this context
3. Show the source for your answers
User Question: {question}

If you don't have any context and are unsure of the answer, reply that you don't know about this topic.
"""

PROMPT = PromptTemplate(
    template=prompt_template, 
    input_variables=["context", "question"]
)

chain_type_kwargs = {"prompt": PROMPT}

In [29]:
# Reference: https://python.langchain.com/docs/use_cases/question_answering/how_to/vector_db_qa (shows how this chain is constructed)
qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=retriever, chain_type_kwargs=chain_type_kwargs, return_source_documents=True)

qa is the question-answering chain: it retrieves the relevant chunks and has the LLM answer from that context (the documents).
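
Conceptually, the "stuff" chain type places all retrieved chunks into the {context} slot of the prompt in a single request. The sketch below imitates that with plain string formatting; the shortened template, the separator, and the stuff_prompt helper are assumptions for illustration, not LangChain's implementation.

```python
# Shortened, illustrative template with the same two placeholders as PROMPT.
TEMPLATE = (
    "Context: {context}\n"
    "Answer the question using only this context.\n"
    "User Question: {question}\n"
)


def stuff_prompt(template, chunks, question, separator="\n\n"):
    """Join all chunks into one context string and fill in the template."""
    context = separator.join(chunks)
    return template.format(context=context, question=question)


prompt = stuff_prompt(TEMPLATE, ["chunk A", "chunk B"], "What?")
print(prompt)
```

Because every chunk ends up in one prompt, the number of retrieved chunks (k) times the chunk size has to fit within the model's context window.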

In [30]:
qa

RetrievalQA(memory=None, callbacks=None, callback_manager=None, verbose=False, tags=None, metadata=None, combine_documents_chain=StuffDocumentsChain(memory=None, callbacks=None, callback_manager=None, verbose=False, tags=None, metadata=None, input_key='input_documents', output_key='output_text', llm_chain=LLMChain(memory=None, callbacks=None, callback_manager=None, verbose=False, tags=None, metadata=None, prompt=PromptTemplate(input_variables=['context', 'question'], output_parser=None, partial_variables={}, template="\nFollow exactly those 3 steps:\n1. Read the context below and aggregate this data\nContext: {context}\n2. Answer the question using only this context\n3. Show the source for your answers\nUser Question: {question}\n\nIf you don't have any context and are unsure of the answer, reply that you don't know about this topic.\n", template_format='f-string', validate_template=True), llm=ChatOpenAI(cache=None, verbose=False, callbacks=None, callback_manager=None, tags=None, metad

### Defining the list of questions we want to ask

Retrieving relevant documents involves embedding the question internally: the retriever embeds the question with the same embedding model and compares it to the precomputed embeddings of the document chunks to find the most relevant ones. The chat model does not use these embeddings; it receives the question and the retrieved chunks as plain text in the prompt and generates the answer from them.
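
The retrieval step can be imitated without any API calls. In this sketch a word-count vector stands in for the embedding model, which is a strong simplification: real embeddings capture meaning, not just shared words. The toy_embed and retrieve helpers are hypothetical; the two chunks are short strings taken from the document shown earlier.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(question, chunks, k=2):
    """Embed the question and return the k most similar chunks."""
    question_vec = toy_embed(question)
    # In a real vector store the chunk vectors are computed once and stored.
    index = [(toy_embed(chunk), chunk) for chunk in chunks]
    ranked = sorted(index, key=lambda pair: cosine(question_vec, pair[0]), reverse=True)
    return [chunk for _, chunk in ranked[:k]]


chunks = [
    "Adopted: July 29, 2020 Released: July 30, 2020",
    "Application for Authority to Deploy and Operate a Ka-band "
    "Non-Geostationary Satellite Orbit System",
]
best = retrieve("On which date was the document released?", chunks, k=1)
print(best)
```

The key design point carries over to the real pipeline: the question and the chunks must be embedded by the same model, otherwise their vectors are not comparable.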

The questions dictionary contains question identifiers as keys and the actual questions as values.

In [31]:
# Dictionary of questions (identifier -> question text)
questions = {
    "const_name": "What's the name of the satellite constellation the company seeks to deploy or operate?",
    "date_release": "On which date was the document released?",
    "date_50": "By which date must the company launch and operate half of its satellites?",
    "date_100": "By which date is the company expected to have all its satellites operational?",
    "total_sat_const": "How many satellites is the company authorized to deploy and operate for this constellation?",
    "altitude": "At which authorized altitudes will the company deploy its satellites?",
    "inclination": "What are the authorized satellite inclinations within the corresponding altitudes?",
    "number_orb_plane": "How many orbital planes, corresponding to given altitudes and inclinations, has the company been authorized for?",
    "total_sat_per_orb_plane": "How many satellites are allocated to each orbital plane?",
    "total_sat_per_alt_incl": "How many satellites, for each altitude and inclination, are there across all matching orbital planes?",
    "operational_lifetime": "What is the satellite's expected operational lifetime in years?"
}

# Store the results
results_content = {}
#Store the source_document
source_documents = {}

### Processing the questions using the qa model

Processing each question, retrieving the relevant documents, extracting the answer, and printing the results

In [32]:
for key, query in questions.items():
    # Optional: inspect what the retriever returns for this question
    matched_docs = retriever.get_relevant_documents(query)
    # The chain runs the same retrieval internally, so it only needs the query
    result = qa({"query": query})
    results_content[key] = result['result']
    source_documents[key] = result['source_documents']  # Present because return_source_documents=True

for key, answer in results_content.items():
    print(f"\n## Question ({key}): {questions[key]}\n")
    print(f"Answer: {answer}\n")
    print(f"Source Documents: {source_documents[key]}\n")



## Question (const_name): What's the name of the satellite constellation the company seeks to deploy or operate?

Answer: Based on the given context, the name of the satellite constellation the company seeks to deploy or operate is Swarm.

Source:
Licensed Non-Geostationary Orbit Satellite Network, FCC 20-56, para. 30, 2020 WL 1977107, Order and Declaratory Ruling (April 22, 2020) (ViaSat Order)

Source Documents: [Document(page_content='or an applicant whose entry raised interference concerns for earlier authorized operators; or an applicant whose system proposed deployment of thousands of satellites; Lockheed involved one satellite.  Northrop Grumman involved three NGSO satellites and four GSO satellites. DigitalGlobe involved three NGSO satellites.  Space Imaging involved one satellite.  Swarm involved 150 satellites.', metadata={}), Document(page_content='Licensed Non-Geostationary Orbit Satellite Network, FCC 20-56, para. 30, 2020 WL 1977107, Order and Declaratory Ruling (April 2

Explanation of the code above:

----------------------------------------
1. The retriever object, built above, finds the most relevant document chunks for each question in the vector store.
2. The qa object, built above, is the question-answering chain that produces answers from the provided context (documents).
3. The results_content dictionary stores the answers produced by the qa chain, and the source_documents dictionary stores the chunks the answers were drawn from.

----------------------------------------------
1. The first loop iterates over each entry of the questions dictionary:
- For each question, the retriever finds the most relevant document chunks (matched_docs) in the vector store.
- The qa chain then answers the question (query) from the most relevant chunks, which it retrieves itself through the same retriever.
- The answer and source documents are then stored in the results_content and source_documents dictionaries, respectively.
2. The second loop iterates over each answer in the results_content dictionary and prints the question, answer, and source documents.
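
The loop logic can be exercised without calling OpenAI or Pinecone by swapping in a stub. In this sketch stub_qa is a hypothetical stand-in that only mimics the shape of the chain's result (a dictionary with result and source_documents keys); run_questions is an illustrative helper, not part of the notebook.

```python
def stub_qa(inputs):
    """Hypothetical stand-in for the qa chain; returns a result-shaped dict."""
    return {
        "query": inputs["query"],
        "result": f"(stub answer to: {inputs['query']})",
        "source_documents": ["(stub chunk)"],
    }


def run_questions(qa_chain, questions):
    """Ask every question and collect answers and sources by question key."""
    answers, sources = {}, {}
    for key, query in questions.items():
        result = qa_chain({"query": query})
        answers[key] = result["result"]
        sources[key] = result["source_documents"]
    return answers, sources


questions_demo = {"date_release": "On which date was the document released?"}
answers, sources = run_questions(stub_qa, questions_demo)
for key, answer in answers.items():
    print(key, "->", answer, sources[key])
```

Passing the chain in as an argument makes it easy to replace the stub with the real qa object once the keys and output handling behave as expected.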

---------------------------------------------------
Understanding the embedding of the questions: the roles of the retriever and the qa chain

Note that the questions are not embedded explicitly in the code. The retriever handles this internally: it converts the text of each query into a numerical vector with the same embedding model used for the chunks and compares it against the stored vectors. The chat model inside the qa chain receives the question and the retrieved chunks as text in the prompt.

- The qa chain is based on the GPT-3.5 model (gpt-3.5-turbo); it processes each question and extracts the answer from the chunks returned by the retriever.
- The document chunks are embedded with the OpenAI text-embedding-ada-002 model and stored in Pinecone.

