# Purpose

Build a RAG (retrieval-augmented generation) application for question answering over a document. The work includes extracting information, retrieving the relevant context, and using this context to generate accurate results.

![](attachment:image.png)


Typical RAG Application

Step 1: extract information from this document.

Step 2: break the document into smaller chunks, to fit into LLM context windows.

Step 3: two strategies to save documents for future retrieval:
- store the text as-is for keyword based retrieval.
- convert text into vector embeddings, for more efficient retrieval.

Step 4: save this to a relevant database.

Step 5: obtain relevant chunks based on user inputs.

Step 6: incorporate relevant document chunks as part of LLM context, for generating the output.

`Note:` 
- Steps 1 - 4 are referred to as the indexing pipeline, wherein documents are indexed in a database offline, prior to user interactions. 
- Steps 5 - 6 happen in real-time as the user is querying the application.
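The split between the offline indexing pipeline and the real-time query path can be sketched with a toy example. This is only an illustration under simplifying assumptions: the function names (`build_index`, `retrieve`, `build_prompt`) are hypothetical, chunks are fixed runs of words, and retrieval is plain word overlap rather than a real database lookup.

```python
# Toy illustration of the two phases; every name here is hypothetical.

def build_index(document: str, chunk_size: int = 8) -> list[list[str]]:
    """Indexing pipeline (steps 1-4): split the text into word chunks and store them."""
    words = document.split()
    return [words[i:i + chunk_size] for i in range(0, len(words), chunk_size)]

def retrieve(index: list[list[str]], query: str) -> str:
    """Query time (step 5): return the chunk sharing the most words with the query."""
    query_words = {w.lower() for w in query.split()}
    best = max(index, key=lambda chunk: len(query_words & {w.lower() for w in chunk}))
    return " ".join(best)

def build_prompt(query: str, context: str) -> str:
    """Query time (step 6): place the retrieved chunk into the LLM prompt."""
    return f"Question: {query}\nContext: {context}"

# Offline: index once. Online: retrieve and build the prompt for each query.
index = build_index(
    "Net sales increased 9% in the first quarter. "
    "Operating income increased to $4.8 billion in the quarter."
)
print(build_prompt("net sales increased", retrieve(index, "net sales increased")))
```

The index is built once, while `retrieve` and `build_prompt` run again for every user query.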


In [None]:
import requests
import fitz # PyMuPDF module
import io

url = "https://s2.q4cdn.com/299287126/files/doc_financials/2023/q1/Q1-2023-Amazon-Earnings-Release.pdf"
request = requests.get(url)
# request.content

In [44]:
filestream = io.BytesIO(request.content)
file = fitz.open(stream=filestream, filetype='pdf')
page_content = [page.get_text() for page in file]
page_content

['AMAZON.COM ANNOUNCES FIRST QUARTER RESULTS\nSEATTLE—(BUSINESS WIRE) April 27, 2023—Amazon.com, Inc. (NASDAQ: AMZN) today announced financial results \nfor its first quarter ended March 31, 2023. \n•\nNet sales increased 9% to $127.4 billion in the first quarter, compared with $116.4 billion in first quarter 2022.\nExcluding the $2.4 billion unfavorable impact from year-over-year changes in foreign exchange rates throughout the\nquarter, net sales increased 11% compared with first quarter 2022.\n•\nNorth America segment sales increased 11% year-over-year to $76.9 billion.\n•\nInternational segment sales increased 1% year-over-year to $29.1 billion, or increased 9% excluding changes\nin foreign exchange rates.\n•\nAWS segment sales increased 16% year-over-year to $21.4 billion.\n•\nOperating income increased to $4.8 billion in the first quarter, compared with $3.7 billion in first quarter 2022. First\nquarter 2023 operating income includes approximately $0.5 billion of charges related 

In [45]:
text = ''.join(page_content)
text



## Chunking Data

LLMs typically have a token limit. Chunking involves dividing a lengthy text into smaller sections that an LLM can process more efficiently.

These chunks should be of a standard size, large enough to contain answers to common questions. Sometimes the answer to a question is spread across multiple locations in the document. Ideally, we capture all the disparate parts of the document(s) containing the answer, link them together, and pass these filtered and concatenated chunks to an LLM.

![image.png](attachment:image.png)

The maximum context length is the upper limit on everything handled in a single call: the question, the concatenated context chunks, and the answer.
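As a rough arithmetic sketch of that budget, the tokens left for retrieved chunks are whatever remains after reserving room for the question and the answer. The window size, question length, and answer reservation below are hypothetical numbers chosen for illustration, and `context_budget` is not a library function.

```python
def context_budget(max_context_tokens: int, question_tokens: int, answer_tokens: int) -> int:
    """Tokens left for retrieved chunks after reserving room for question and answer."""
    budget = max_context_tokens - question_tokens - answer_tokens
    return max(budget, 0)  # never report a negative budget

# Hypothetical numbers: a 4096-token window, a 20-token question,
# and 256 tokens reserved for the generated answer.
remaining = context_budget(4096, 20, 256)
print(remaining)          # 3820
print(remaining // 1024)  # 3 chunks of at most 1024 tokens fit
```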

5 levels of chunking:
- Fixed Size Chunking: 
    - without considering the content or structure. 
    - simple to implement but may result in chunks that lack coherence or context.
- Recursive Chunking: splits the text into smaller chunks using a set of separators (like newlines or spaces) in a hierarchical and iterative manner.
- Document Based Chunking: split based on inherent structure, such as markdown formatting, code syntax, or table layouts.
- Semantic Chunking: extract semantic meaning from embeddings and assess the semantic relationship between chunks. 
- Agentic Chunking: use a language model to determine how much and what text should be included in a chunk based on the context.
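Recursive chunking can be sketched in a few lines of plain Python. This is a simplified illustration, not a library implementation: it measures length in characters rather than tokens, and the separator order (`"\n\n"`, `"\n"`, `" "`) is an assumption.

```python
def recursive_split(text: str, max_chars: int,
                    separators: tuple[str, ...] = ("\n\n", "\n", " ")) -> list[str]:
    """Split on the coarsest separator first, recursing with finer ones on oversized pieces."""
    if len(text) <= max_chars:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to a hard cut (fixed-size chunking).
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    head, *rest = separators
    chunks: list[str] = []
    current = ""
    for piece in text.split(head):
        candidate = piece if not current else current + head + piece
        if len(candidate) <= max_chars:
            current = candidate  # keep merging pieces while they fit
            continue
        if current:
            chunks.append(current)
        if len(piece) <= max_chars:
            current = piece
        else:
            # The piece alone is too long: retry with the finer separators.
            chunks.extend(recursive_split(piece, max_chars, tuple(rest)))
            current = ""
    if current:
        chunks.append(current)
    return chunks

sample = "First paragraph here.\n\nSecond paragraph is a bit longer than the first one.\n\nThird."
for chunk in recursive_split(sample, max_chars=30):
    print(repr(chunk))
```

Paragraph boundaries are kept where possible, and only the oversized middle paragraph is broken further at spaces.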

`Cosine similarity metric` is used to compare the question with document chunks and find the top chunks most likely to contain the answer. It can be combined with a keyword metric that gives extra weight to contexts containing certain keywords.

Ex: weight contexts that contain the words “abstract” or “summary” when asking the question to summarize a document.
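One way to combine the two signals is to add a small bonus to the cosine similarity for each keyword found in a chunk. The additive form, the `boost` value, and the toy 2-dimensional vectors below are all assumptions made for illustration. Real embeddings would come from an embedding model.

```python
import math

def cosine_sim(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def weighted_score(query_vec, chunk_vec, chunk_text, keywords, boost=0.1):
    """Cosine similarity plus a fixed bonus per keyword present in the chunk (hypothetical scheme)."""
    hits = sum(1 for kw in keywords if kw.lower() in chunk_text.lower())
    return cosine_sim(query_vec, chunk_vec) + boost * hits

query_vec = [1.0, 0.0]  # toy embedding of "summarize the document"
keywords = ["abstract", "summary"]
score_intro = weighted_score(query_vec, [1.0, 1.0], "Introduction to the topic", keywords)
score_summary = weighted_score(query_vec, [1.0, 1.0], "Summary: the main findings", keywords)
print(score_intro, score_summary)
```

Both chunks have the same embedding similarity here, so the keyword bonus alone moves the summary chunk ahead.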

### Fixed Size Chunk
Split the text into chunks, starting a new chunk whenever the provided maximum token length would be exceeded.

In [46]:
import re
import pandas as pd
import tiktoken
tokenizer = tiktoken.get_encoding('cl100k_base')

In [47]:
def split_into_many(text: str, tokenizer: tiktoken.Encoding, max_tokens: int = 1024) -> list:
    """ Function to split a string into many strings of a specified number of tokens """
    #A Split the text into sentences
    #B Get the number of tokens for each sentence
    #C Loop through the sentences and tokens joined together in a tuple
    #D If the number of tokens so far plus the number of tokens in the current sentence is greater than the max number of tokens, then start a new chunk
    #E add the sentence to the chunk and add the number of tokens to the total
    sentences = re.split(r'(?<=\n)(?=[A-Z])', text) #A
    n_tokens = [len(tokenizer.encode(_)) for _ in sentences] #B
    chunks = [['', 0]]
    for sentence, n_token in zip(sentences, n_tokens): #C
        if chunks[-1][1] + n_token > max_tokens: #D
            chunks.append(['', 0])
        #E
        chunks[-1][0] += sentence
        chunks[-1][1] += n_token

    return pd.DataFrame(chunks, columns=['chunk', 'n_token'])

df_chunks = split_into_many(text, tokenizer)
df_chunks

| | chunk | n_token |
|---|---|---|
| 0 | `AMAZON.COM ANNOUNCES FIRST QUARTER RESULTS\nSE...` | 1009 |
| 1 | `Continued to delight customers with convenient...` | 1019 |
| 2 | `Broadridge’s LTX electronic trading platform c...` | 1005 |
| 3 | `Expanded the Ring lineup with new devices. Rin...` | 1002 |
| 4 | `Syria. The company turned its fulfillment cent...` | 878 |
| 5 | `These forward-looking statements are inherentl...` | 1006 |
| 6 | `Acquisitions, net of cash acquired, and other\...` | 1022 |
| 7 | `Basic\n \n10,171 \n10,250 \nDiluted\n \n10,17...` | 1005 |
| 8 | `Accumulated other comprehensive income (loss)\...` | 1014 |
| 9 | `F/X impact -- favorable\n$ \n57 \n$ \n126 \n$ ...` | 985 |


## Retrieval Methods

After chunking, store the chunks in an appropriate format so that relevant ones can be easily retrieved in response to future queries. There are 2 main methods for retrieving relevant LLM context: keyword-based retrieval and vector-embedding-based retrieval.

### Keyword Based Retrieval

Rank documents by keyword match, using Term Frequency (TF) and Inverse Document Frequency (IDF).

`TF` measures how often a term appears in a document. The more times a term occurs in a document, the more relevant that document is to the term.

- $TF(t,d) = \frac{\text{number of times term } t \text{ appears in document } d}{\text{total number of terms in document } d}$

`IDF` measures the importance of a term across the entire corpus of documents. It gives higher importance to terms that are rare in the corpus and lower importance to terms that are common.

- $IDF(t) = \log\frac{\text{total number of documents}}{\text{number of documents containing term }t}$

The `TF-IDF` score: $TF\text{-}IDF(t,d) = TF(t,d) \times IDF(t)$
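A small worked example may help make these scores concrete. The sketch below is plain Python on a three-sentence toy corpus. It takes the logarithm of the document ratio for IDF, which is the usual convention and an assumption of this sketch, and the helper names are hypothetical.

```python
import math

def tf(term: str, doc_tokens: list[str]) -> float:
    """Share of the document's tokens that are `term`."""
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term: str, corpus_tokens: list[list[str]]) -> float:
    """Log of (number of documents / number of documents containing `term`)."""
    n_containing = sum(1 for doc in corpus_tokens if term in doc)
    return math.log(len(corpus_tokens) / n_containing) if n_containing else 0.0

def tf_idf(term: str, doc_tokens: list[str], corpus_tokens: list[list[str]]) -> float:
    return tf(term, doc_tokens) * idf(term, corpus_tokens)

toy_corpus = ["hello there how are you",
              "it is quite windy in boston",
              "how is the weather tomorrow"]
toy_tokens = [doc.split() for doc in toy_corpus]

# "windy" appears in one document, "is" in two, so "windy" scores higher.
print(round(tf_idf("windy", toy_tokens[1], toy_tokens), 4))  # 0.1831
print(round(tf_idf("is", toy_tokens[1], toy_tokens), 4))     # 0.0676
```

Both terms occur once in the second document, so the difference comes entirely from IDF: the rarer term gets the larger weight.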

Given a query $Q$ containing keywords $q_1, \dots, q_n$, the `Okapi BM25 score` of a document $D$ is:

- $score(D,Q)=\sum_{i=1}^{n}IDF(q_i) \cdot \frac{f(q_i,D)(k_1+1)}{f(q_i,D)+k_1(1-b+b\frac{|D|}{avgdl})}$

    - $f(q_i, D)$: the number of times $q_i$ occurs in $D$.
    - $k_1, b$: constants.
    - $avgdl$: the average document length.

BM25 characteristics:
- the score is not normalized to $[0, 1]$: it is 0 when the query and the document share no keywords, and it grows with the number and rarity of the matching keywords.
- uses IDF to weigh the importance of terms across the corpus.
- The term frequency $f(q_i,D)$ is normalized using a saturation function, which prevents the score from increasing linearly with term frequency. This addresses a limitation of basic TF-IDF.
- Document length normalization $\frac{|D|}{avgdl}$ adjusts for the fact that longer documents are more likely to have higher term frequencies.
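The saturation and length-normalization effects can be checked numerically on the term-frequency part of the formula alone. In the sketch below the IDF factor is left out, `bm25_term_weight` is a hypothetical helper, and $k_1 = 1.5$, $b = 0.75$ are commonly used values picked here as an assumption.

```python
def bm25_term_weight(freq: int, doc_len: int, avgdl: float,
                     k1: float = 1.5, b: float = 0.75) -> float:
    """Term-frequency part of the BM25 formula (IDF factor omitted)."""
    return freq * (k1 + 1) / (freq + k1 * (1 - b + b * doc_len / avgdl))

# Saturation: the weight rises with frequency but stays below k1 + 1 = 2.5.
for freq in (1, 2, 5, 50):
    print(freq, round(bm25_term_weight(freq, doc_len=100, avgdl=100), 3))

# Length normalization: the same frequency counts for less in a longer document.
print(round(bm25_term_weight(1, doc_len=200, avgdl=100), 3))
```

Going from 1 to 2 occurrences adds far more weight than going from 5 to 50, which is the behaviour that plain TF lacks.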

e.g.

In [48]:
from rank_bm25 import BM25Okapi
corpus = ["Hello there how are you!",
          "It is quite windy in Boston",
          "How is the weather tomorrow?"]
tokenized_corpus = [doc.split(" ") for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)
query = "windy day"
tokenized_query = query.split(" ")
doc_scores = bm25.get_scores(tokenized_query)
doc_scores

array([0.        , 0.48362189, 0.        ])

`Note:` the 3rd document is also related to the query (both concern the weather), so ideally it would receive a non-zero score. Keyword matching cannot see this because the document shares no exact term with the query.

-> this is where `semantic similarity` and `vector embeddings` come in.

`Semantic search` means that the algorithm is intelligent enough to know that '*cowboys*' and the '*wild west*' are similar concepts.

This becomes important for RAG when the user types a query whose wording does not appear exactly in the document.

### Vector Embeddings

Encode data into high-dimensional vectors and use distance metrics to measure the similarity between these vectors, e.g.

![image.png](attachment:image.png)

In [49]:
texts = ['the boy went to a party',
         'the boy went to a party',
         """We found evidence of bias in our models via running the SEAT (May et al, 2019) and the Winogender (Rudinger et al, 2018) benchmarks. Together, these benchmarks consist of 7 tests that measure whether models contain implicit biases when applied to gendered names, regional names, and some stereotypes. 
         For example, we found that our models more strongly associate (a) European American names with positive sentiment, when compared to African American names, and (b) negative stereotypes with black women."""]

- paid version

In [50]:
# import openai
# openai.api_key = 'API key'
# def get_embedding(text, model="text-embedding-ada-002"):
#     # openai>=1.0 returns a response object; the embeddings are in its .data attribute
#     return openai.embeddings.create(input=text, model=model).data

# response = get_embedding(texts)
# embeddings = [item.embedding for item in response]

- free version

In [51]:
import spacy
nlp = spacy.load("en_core_web_md") # en_core_web_md or en_core_web_trf
def get_embedding(text, model=nlp):
    return model(text).vector

embeddings = [get_embedding(text) for text in texts]
len(embeddings[0])

300

In [56]:
(embeddings[0] != embeddings[1]).sum()

0

Cosine distance between 2 vectors (SciPy's `cosine` returns a distance, not a similarity):

In [67]:
from scipy.spatial.distance import cosine
print(cosine(embeddings[0], embeddings[1]))
print(cosine(embeddings[0], embeddings[2]))

1.483545730707192e-09
0.4857179978363477


In [64]:
from sklearn.metrics.pairwise import cosine_similarity
print(cosine_similarity([embeddings[0]], [embeddings[1]]))
print(cosine_similarity([embeddings[0]], [embeddings[2]]))

[[1.0000001]]
[[0.514282]]


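`Note:` `cosine` (scipy) returns the cosine distance, i.e. 1 minus the cosine similarity, so cosine (scipy) + cosine_similarity (sklearn) = 1, up to floating-point rounding.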
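Written out for two vectors $u$ and $v$ (symbols introduced here for illustration), the two quantities are:

- $\text{cosine similarity}(u,v) = \frac{u \cdot v}{\|u\| \, \|v\|}$
- $\text{cosine distance}(u,v) = 1 - \frac{u \cdot v}{\|u\| \, \|v\|}$

With the outputs above: $0.4857 + 0.5143 \approx 1$ for the first and third texts, and $1.48 \times 10^{-9} + 1.0000001 \approx 1$ for the two identical texts.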

### Vector Embeddings For Finding Relevant Context

In [31]:
prompt = """What was the sales increase for Amazon in the first quarter?"""

- paid version

In [32]:
# def get_completion(prompt, model="gpt-3.5-turbo"):
#     response = openai.chat.completions.create(
#         model=model,
#         temperature=0,
#         messages=[{"role": "user", "content": prompt}]
#     ) # calling the OpenAI Chat Completions endpoint (openai>=1.0)
#     return response.choices[0].message.content.strip()

# get_completion(prompt)

- free version

In [80]:
from transformers import GPTNeoForCausalLM, GPT2Tokenizer, GPT2LMHeadModel

def get_completion(prompt: str, model_name: str = 'distilgpt2') -> str:
    # Load model and tokenizer
    tokenizer = GPT2Tokenizer.from_pretrained(model_name)
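    if model_name == 'EleutherAI/gpt-neo-125M':
        model = GPTNeoForCausalLM.from_pretrained(model_name) 
    elif model_name == 'distilgpt2':
        model = GPT2LMHeadModel.from_pretrained(model_name) 
    else:
        # Fail early instead of hitting an undefined `model` below
        raise ValueError(f"Unsupported model name: {model_name}")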

    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(
        inputs.input_ids,
        # attention_mask=inputs['attention_mask'],
        # pad_token_id=tokenizer.eos_token_id,
        max_length=200, 
        num_return_sequences=1)
    return tokenizer.decode(outputs[0], skip_special_tokens=True).strip()

print(get_completion(prompt, 'EleutherAI/gpt-neo-125M'))
print(get_completion(prompt, 'distilgpt2'))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


What was the sales increase for Amazon in the first quarter based on the context below?
Context:
```
Net income was $3.2 billion in the first quarter, or $0.31 per diluted share, compared with net loss of $3.8 billion, or
$0.38 per diluted share, in first quarter 2022. All share and per share information for comparable prior year periods
throughout this release have been retroactively adjusted to reflect the 20-for-1 stock split effected on May 27, 2022.
• First quarter 2023 net income includes a pre-tax valuation loss of $0.5 billion included in non-operating
expense from the common stock investment in Rivian Automotive, Inc., compared to a pre-tax valuation loss
of $7.6 billion from the investment in first quarter 2022.
```

###


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


What was the sales increase for Amazon in the first quarter based on the context below?
Context:
```
Net income was $3.2 billion in the first quarter, or $0.31 per diluted share, compared with net loss of $3.8 billion, or
$0.38 per diluted share, in first quarter 2022. All share and per share information for comparable prior year periods
throughout this release have been retroactively adjusted to reflect the 20-for-1 stock split effected on May 27, 2022.
• First quarter 2023 net income includes a pre-tax valuation loss of $0.5 billion included in non-operating
expense from the common stock investment in Rivian Automotive, Inc., compared to a pre-tax valuation loss
of $7.6 billion from the investment in first quarter 2022.
```
Net income was $3.2 billion in the first quarter, or $0.31 per diluted


`Note:` the answer above may not be wrong, but it is not the one we are looking for.

-> It is important to feed the right context to the LLM. In this case, that is the sales performance in Q1 2023.

We have a choice of the three contexts below to append to the LLM prompt:

In [34]:
context1="""Net sales increased 9% to $127.4 billion in the first quarter, compared with $116.4 billion in first quarter 2022.
Excluding the $2.4 billion unfavorable impact from year-over-year changes in foreign exchange rates throughout the
quarter, net sales increased 11% compared with first quarter 2022.
North America segment sales increased 11% year-over-year to $76.9 billion.
International segment sales increased 1% year-over-year to $29.1 billion, or increased 9% excluding changes
in foreign exchange rates.
AWS segment sales increased 16% year-over-year to $21.4 billion."""

context2="""Operating income increased to $4.8 billion in the first quarter, compared with $3.7 billion in first quarter 2022. First
quarter 2023 operating income includes approximately $0.5 billion of charges related to estimated severance costs.
North America segment operating income was $0.9 billion, compared with operating loss of $1.6 billion in
first quarter 2022.
International segment operating loss was $1.2 billion, compared with operating loss of $1.3 billion in first
quarter 2022.
AWS segment operating income was $5.1 billion, compared with operating income of $6.5 billion in first
quarter 2022.
"""

context3="""Net income was $3.2 billion in the first quarter, or $0.31 per diluted share, compared with net loss of $3.8 billion, or
$0.38 per diluted share, in first quarter 2022. All share and per share information for comparable prior year periods
throughout this release have been retroactively adjusted to reflect the 20-for-1 stock split effected on May 27, 2022.
• First quarter 2023 net income includes a pre-tax valuation loss of $0.5 billion included in non-operating
expense from the common stock investment in Rivian Automotive, Inc., compared to a pre-tax valuation loss
of $7.6 billion from the investment in first quarter 2022."""

In [68]:
for _ in [context1, context2, context3]:
    print(cosine_similarity([get_embedding(prompt)], [get_embedding(_)]))

[[0.85571593]]
[[0.9209387]]
[[0.9801308]]


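`Note:` context3 has the highest cosine similarity with the query embedding, although context1 is the passage that actually reports net sales. This is likely a limitation of the averaged word vectors used here. A stronger embedding model may rank the contexts differently.

-> the retrieval step appends the highest-scoring context to the user input and sends it to the LLM, so the answer can only be as relevant as the retrieved context. Here that is context3: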

In [70]:
prompt = f"""What was the sales increase for Amazon in the first quarter based on the context below?
Context:
```
{context3}
```
"""
print(get_completion(prompt, 'EleutherAI/gpt-neo-125M'))
print(get_completion(prompt, 'distilgpt2'))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


What was the sales increase for Amazon in the first quarter based on the context below?
Context:
```
Net income was $3.2 billion in the first quarter, or $0.31 per diluted share, compared with net loss of $3.8 billion, or
$0.38 per diluted share, in first quarter 2022. All share and per share information for comparable prior year periods
throughout this release have been retroactively adjusted to reflect the 20-for-1 stock split effected on May 27, 2022.
• First quarter 2023 net income includes a pre-tax valuation loss of $0.5 billion included in non-operating
expense from the common stock investment in Rivian Automotive, Inc., compared to a pre-tax valuation loss
of $7.6 billion from the investment in first quarter 2022.
```

###                  


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


What was the sales increase for Amazon in the first quarter based on the context below?
Context:
```
Net income was $3.2 billion in the first quarter, or $0.31 per diluted share, compared with net loss of $3.8 billion, or
$0.38 per diluted share, in first quarter 2022. All share and per share information for comparable prior year periods
throughout this release have been retroactively adjusted to reflect the 20-for-1 stock split effected on May 27, 2022.
• First quarter 2023 net income includes a pre-tax valuation loss of $0.5 billion included in non-operating
expense from the common stock investment in Rivian Automotive, Inc., compared to a pre-tax valuation loss
of $7.6 billion from the investment in first quarter 2022.
```
Net income was $3.2 billion in the first quarter, or $0.31 per diluted


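`Note:` with context appended, the model's continuation draws on the supplied text. Since context3 covers net income rather than net sales, the question itself is still not answered.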

## Augmented Generation

This step is to retrieve the context in real-time based on user input, and use the retrieved context to generate the LLM output.

![image.png](attachment:image.png)

Step 1: get the embeddings for the question (query).

Step 2: compute pairwise distances between the input query embedding, and context embeddings.

Step 3: append these contexts, ranked by similarity. If the running context length is greater than the maximum context length, the context is truncated.

Step 4: the user query and relevant context are sent to the LLM, for generating the output.

In [None]:
df_chunks.head()

In [None]:
def create_context(query: str, df_chunks: pd.DataFrame, max_len: int = 1800) -> str:
    """Create a context for a question by finding the most similar chunks in the dataframe"""
    #A Get the embedding for the question
    #B Get the similarity between each chunk embedding and the question embedding
    #C Sort by similarity (most similar first) and add chunks until the context is too long
    #D If adding the chunk would exceed the maximum length, stop
    #E Else add its length to the running total and keep its text
    #F Return the context
    embedded_query = get_embedding(query) #A
    df_chunks['embedded_chunk'] = df_chunks['chunk'].apply(get_embedding)
    # cosine_similarity returns a 1x1 array; [0][0] extracts the scalar so the column can be sorted
    df_chunks['similarity'] = df_chunks['embedded_chunk'].apply(
        lambda embedded_chunk: cosine_similarity([embedded_chunk], [embedded_query])[0][0]) #B

    sum_len = 0
    contexts = []
    for _, row in df_chunks.sort_values('similarity', ascending=False).iterrows(): #C
        if sum_len + row['n_token'] > max_len: #D
            break
        sum_len += row['n_token'] #E
        contexts.append(row['chunk'])

    return '\n\n###\n\n'.join(contexts) #F

create_context("What was the sales increase for Amazon in the first quarter", df_chunks)

Use the LLM to answer questions from the created context.

In [None]:
def answer_question(df_chunks: pd.DataFrame, query: str) -> str:
    """Answer a question based on the most similar context from the dataframe texts"""
    context = create_context(query, df_chunks)
    prompt = f"""Answer the question based on the context provided.
    Question:
    ```{query}.```
    Context:
    ```{context}```
    """
    return get_completion(prompt)

answer_question(df_chunks, query='What was the sales increase for Amazon in the first quarter')

In [None]:
answer_question(df_chunks, query='What was the Comprehensive income (loss) for Amazon for the Three Months Ended March 31, 2022?')

## Evaluation