# Overview of embeddings-based retrieval

In this lesson we'll index a PDF in a vector database. We'll have the vector database calculate the embeddings for each chunk of text, then pass the retrieved chunks to the LLM as additional context.

## Installation

In [4]:
%pip install -q -r requirements.txt


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.0[0m[39;49m -> [0m[32;49m25.0.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


## Reading the text from a PDF

You can view the PDF in your browser [here](./microsoft_annual_report_2022.pdf) if you would like.

In [5]:
from pypdf import PdfReader

reader = PdfReader("data/microsoft_annual_report_2022.pdf")

# Read the pages and extract the text
pdf_texts = [p.extract_text().strip() for p in reader.pages]

# Filter out the empty text strings
pdf_texts = [text for text in pdf_texts if text]

# Print the text of the first page
print(pdf_texts[0])

1 
Dear shareholders, colleagues, customers, and partners:  
We are living through a period of historic economic, societal, and geopolitical change. The world in 2022 looks nothing like 
the world in 2019. As I write this, inflation is at a 40 -year high, supply chains are stretched, and the war in Ukraine is 
ongoing. At the same time, we are entering a technological era with the potential to power awesome advancements 
across every sector of our economy and society. As the world’s largest software company, this places us at a historic 
intersection of opportunity and responsibility to the world around us.  
Our mission to empower every person and every organization on the planet to achieve more has never been more 
urgent or more necessary. For all the uncertainty in the world, one thing is clear: People and organizations in every 
industry are increasingly looking to digital technology to overcome today’s challenges and emerge stronger. And no 
company is better positioned to help t

**Note:** Take a look at the first page of the document (`pdf_texts[0]` is the first page).

## Load LangChain text splitter tools

### Recursive Character splitter
This splitter splits text by recursively looking at characters.

It tries a list of separators in order until it finds one that produces chunks that are small enough.
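The idea can be sketched in plain Python. This is a simplified illustration, not LangChain's actual implementation: the function name `recursive_split` and its separator list are assumptions made for this example, and it ignores chunk overlap and whitespace stripping.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", ". ", " ", "")):
    """Split text into pieces of at most chunk_size characters (illustrative sketch)."""
    if len(text) <= chunk_size:
        return [text] if text else []

    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]

    # Split on the separator, keeping it attached to the piece before it.
    parts = text.split(sep)
    pieces = [p + sep for p in parts[:-1]] + [parts[-1]]

    chunks, current = [], ""
    for piece in pieces:
        if len(current) + len(piece) <= chunk_size:
            # Merge small neighbouring pieces while they still fit.
            current += piece
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Still too long: try again with the finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


sample = (
    "First paragraph. It has two sentences.\n\n"
    "Second paragraph is here.\n\n" + "x" * 50
)
for chunk in recursive_split(sample, chunk_size=30):
    print(repr(chunk))
```

Paragraph breaks are tried first, then line breaks, sentences, words, and finally a hard cut, so a chunk only breaks mid-word when nothing coarser fits.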


In [6]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

*Note: Because of the way we read the PDF as plain text, there are no paragraph breaks, so chunking by paragraph doesn't really work.*

In [7]:
character_splitter = RecursiveCharacterTextSplitter(
    #separators=["\n\n", "\n", ". ", " ", ""],
    chunk_size=1000,
    chunk_overlap=0,
)
character_split_texts = character_splitter.split_text('\n\n'.join(pdf_texts))

# Print the first 5 chunks
for i in range(5):
    print(f"Chunk#={i} Chunk={character_split_texts[i]}", "\n")

print(f"\nTotal chunks: {len(character_split_texts)}")

Chunk#=0 Chunk=1 
Dear shareholders, colleagues, customers, and partners:  
We are living through a period of historic economic, societal, and geopolitical change. The world in 2022 looks nothing like 
the world in 2019. As I write this, inflation is at a 40 -year high, supply chains are stretched, and the war in Ukraine is 
ongoing. At the same time, we are entering a technological era with the potential to power awesome advancements 
across every sector of our economy and society. As the world’s largest software company, this places us at a historic 
intersection of opportunity and responsibility to the world around us.  
Our mission to empower every person and every organization on the planet to achieve more has never been more 
urgent or more necessary. For all the uncertainty in the world, one thing is clear: People and organizations in every 
industry are increasingly looking to digital technology to overcome today’s challenges and emerge stronger. And no 

Chunk#=1 Chunk=company


For example, chunk 0 contains:

- 960 characters
- 155 words
- 193 tokens

[Check out the OpenAI tokenizer](https://platform.openai.com/tokenizer)
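Character and word counts can be computed directly from the text, while the token count depends on which tokenizer is used. The sketch below uses a hypothetical helper, `text_stats`, and copies the chunk 0 figures from the list above (they are not recomputed here) to derive rough ratios.

```python
def text_stats(text):
    """Return the character count and whitespace-delimited word count of a text."""
    return {"characters": len(text), "words": len(text.split())}


# Figures reported above for chunk 0 (copied from the text, not recomputed).
chunk0 = {"characters": 960, "words": 155, "tokens": 193}

chars_per_token = chunk0["characters"] / chunk0["tokens"]
tokens_per_word = chunk0["tokens"] / chunk0["words"]

print(f"{chars_per_token:.2f} characters per token")
print(f"{tokens_per_word:.2f} tokens per word")

# The helper works on any string, for example a fragment of the first page.
print(text_stats("inflation is at a 40 -year high"))
```

Ratios like these only describe this one chunk; other text and other tokenizers will give different values.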



### SentenceTransformersTokenTextSplitter
Splits text into tokens using the tokenizer of a sentence-transformers model. The default model is `sentence-transformers/all-mpnet-base-v2`.
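The windowing step behind a token-based splitter can be sketched as follows. This is an illustration under stated assumptions, not the library's code: `split_by_tokens` is a hypothetical helper, and whitespace-separated words stand in for the tokens that the model's real tokenizer would produce.

```python
def split_by_tokens(tokens, tokens_per_chunk, chunk_overlap=0):
    """Group a token sequence into windows of at most tokens_per_chunk tokens."""
    if not 0 <= chunk_overlap < tokens_per_chunk:
        raise ValueError("chunk_overlap must be >= 0 and smaller than tokens_per_chunk")

    step = tokens_per_chunk - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + tokens_per_chunk])
        if start + tokens_per_chunk >= len(tokens):
            break  # the last window already reaches the end of the sequence
    return chunks


# Whitespace "tokens" are only a stand-in for a real model tokenizer.
words = "we are living through a period of historic economic change".split()
for chunk in split_by_tokens(words, tokens_per_chunk=4):
    print(" ".join(chunk))
```

Capping each chunk at a fixed number of tokens keeps it within the embedding model's input limit, which is the reason for running this splitter after the character splitter.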

In [8]:
from langchain.text_splitter import SentenceTransformersTokenTextSplitter

token_splitter = SentenceTransformersTokenTextSplitter(chunk_overlap=0, tokens_per_chunk=256)

token_split_texts = []
for text in character_split_texts:
    token_split_texts += token_splitter.split_text(text)

# Print the first 5 chunks
for i in range(5):
    print(f"Chunk#={i} Chunk=", token_split_texts[i], "\n")
print(f"\nTotal chunks: {len(token_split_texts)}")



Chunk#=0 Chunk= 1 dear shareholders, colleagues, customers, and partners : we are living through a period of historic economic, societal, and geopolitical change. the world in 2022 looks nothing like the world in 2019. as i write this, inflation is at a 40 - year high, supply chains are stretched, and the war in ukraine is ongoing. at the same time, we are entering a technological era with the potential to power awesome advancements across every sector of our economy and society. as the world ’ s largest software company, this places us at a historic intersection of opportunity and responsibility to the world around us. our mission to empower every person and every organization on the planet to achieve more has never been more urgent or more necessary. for all the uncertainty in the world, one thing is clear : people and organizations in every industry are increasingly looking to digital technology to overcome today ’ s challenges and emerge stronger. and no 

Chunk#=1 Chunk= company i

Note: for cleanup rm -rf ~/.cache/huggingface/hub/models--sentence-transformers--a

## Setting up the embeddings

In [9]:
import chromadb
from chromadb.utils.embedding_functions import SentenceTransformerEmbeddingFunction

embedding_function = SentenceTransformerEmbeddingFunction()
model = embedding_function.models
print("Model used for embeddings:")
print(model)

Model used for embeddings:
{'all-MiniLM-L6-v2': SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)}
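The printed model ends with a mean-pooling layer (`pooling_mode_mean_tokens: True`) followed by `Normalize()`. A minimal sketch of those two steps on toy 3-dimensional token vectors is shown below; the helper names are hypothetical, the real model produces 384-dimensional vectors, and a real implementation would also skip padding tokens.

```python
import math


def mean_pool(token_vectors):
    """Average the per-token vectors into a single sentence vector."""
    count = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]


def l2_normalize(vector):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


# Two toy "token embeddings" standing in for the transformer output.
token_vectors = [[1.0, 2.0, 2.0], [3.0, 0.0, 4.0]]

sentence_vector = l2_normalize(mean_pool(token_vectors))
print(sentence_vector)
print(math.sqrt(sum(x * x for x in sentence_vector)))  # length of the final vector
```

Because the final vector has unit length, its dot product with another normalized embedding is their cosine similarity.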


## Calculate embedding of the first chunk

In [10]:
embedding = embedding_function([token_split_texts[0]])
print(embedding)

[array([-5.13007939e-02, -1.17733087e-02,  4.94027101e-02, -7.97345638e-02,
        1.48219615e-02, -1.70856435e-02, -3.71209905e-02,  1.02431932e-02,
        2.94433925e-02, -1.15510272e-02, -2.93060150e-02,  5.98760024e-02,
        4.19336706e-02,  9.10995353e-04, -2.48187600e-04,  5.57056069e-02,
       -9.50447619e-02, -1.12392426e-01, -4.46705669e-02, -2.80091935e-03,
       -2.43211202e-02, -1.56625286e-02, -4.96853776e-02,  1.72557365e-02,
       -4.19295765e-02,  2.93179043e-02, -2.14536637e-02, -7.14381263e-02,
       -3.47528681e-02,  2.16383133e-02, -1.52904131e-02,  5.06820604e-02,
        2.92439535e-02,  4.42516655e-02,  3.54527347e-02,  2.26889979e-02,
        6.83844984e-02, -2.03300379e-02,  1.90997757e-02, -9.92355794e-02,
       -1.11536924e-02, -1.27988830e-01, -3.55314650e-02,  1.63593609e-02,
        6.48760647e-02, -1.84402391e-02,  3.01106423e-02, -6.47870940e-04,
       -6.11424632e-02,  6.25531375e-03, -1.04723796e-01, -1.23378366e-01,
        7.17751011e-02, 

## Setting up the ChromaDB vector database

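Note: Beware of similarity metrics. Chroma's default distance is squared L2 (Euclidean); the collection below is configured to use cosine distance instead, which changes how the returned distances should be read.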
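To see why the metric matters, the sketch below compares squared L2 distance with cosine distance on toy 2-dimensional vectors. The function names and vectors are made up for illustration. On raw vectors the two metrics can disagree about which neighbour is closer; on unit-length vectors they produce the same ranking, because squared L2 then equals twice the cosine distance.

```python
import math


def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_distance(a, b):
    """One minus the cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def normalize(vector):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


query = [1.0, 0.0]
same_direction_but_long = [10.0, 0.0]
nearby_but_rotated = [0.9, 0.5]

for name, vec in [("long", same_direction_but_long), ("rotated", nearby_but_rotated)]:
    print(name, "squared L2:", squared_l2(query, vec), "cosine:", cosine_distance(query, vec))

# On unit-length vectors: squared L2 == 2 * cosine distance.
u, v = normalize(query), normalize(nearby_but_rotated)
print(squared_l2(u, v), 2 * cosine_distance(u, v))
```

Since the embedding model used here applies `Normalize()` as its last step, both metrics should rank the chunks in the same order, but the distance values themselves are on different scales.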

In [11]:
#chroma_client = chromadb.Client()
#chroma_client = chromadb.Client(persist_directory="../data/chroma_db/")
from chromadb.config import DEFAULT_TENANT, DEFAULT_DATABASE, Settings

chroma_client = chromadb.PersistentClient(
    path="data/chroma_db/",
    settings=Settings(),
    tenant=DEFAULT_TENANT,
    database=DEFAULT_DATABASE,
)

COLLECTION_NAME = "microsoft_annual_report_2022"

# To remove an existing collection before re-running this cell:
# chroma_client.delete_collection(COLLECTION_NAME)

# Chroma's default distance is squared L2 (Euclidean):
# chroma_collection = chroma_client.create_collection(COLLECTION_NAME, embedding_function=embedding_function)

# Use the collection metadata to select cosine distance instead
chroma_collection = chroma_client.create_collection(
    COLLECTION_NAME,
    embedding_function=embedding_function,
    metadata={"hnsw:space": "cosine"},
)


## Index all chunks in the vector database

In [12]:
ids = [str(i) for i in range(len(token_split_texts))]

chroma_collection.add(ids=ids, documents=token_split_texts)

# Print the number of documents indexed
chroma_collection.count()

349

## Helper function to print the results of our vector search

In [13]:
def print_results_and_documents(results, retrieved_documents, word_wrap):
    """
    Prints keys and values from the results dictionary and documents with word wrapping.

    Args:
        results (dict): A dictionary where keys are strings and values are either strings or lists.
        retrieved_documents (list): A list of documents to be printed.
        word_wrap (function): A function to apply word wrapping to the documents.

    Returns:
        None
    """
    # Iterate through the dictionary and print each key with its associated value
    for key, value in results.items():
        print(f"{key}:")

        # Check if the value is a list and print its elements
        if isinstance(value, list):
            for i, item in enumerate(value):
                print(f"  Item {i+1}: {item}")
        else:
            # Directly print the value if it's not a list
            print(f"  {value}")

        print()  # Add a newline for better readability

    # Iterate through the list of documents and print each one with word wrapping
    for document in retrieved_documents:
        print(word_wrap(document))
        print('\n')

## Searching the vector database

In [14]:
query = "What was the total revenue?"

results = chroma_collection.query(
    query_texts=[query],
    n_results=5,
    include=["documents", "embeddings", "distances"],
)

retrieved_documents = results['documents'][0]

from helper_utils import word_wrap
print_results_and_documents(results, retrieved_documents, word_wrap)

ids:
  Item 1: ['293', '331', '319', '194', '320']

embeddings:
  Item 1: [[-0.03879594 -0.00056903 -0.00237735 ... -0.09649955 -0.03808513
  -0.07317519]
 [ 0.00312779  0.01599092 -0.0443514  ... -0.06616331  0.02492891
  -0.01841382]
 [ 0.01663858 -0.05533892  0.0223398  ... -0.09261022 -0.02912271
  -0.03096254]
 [-0.00725529 -0.00870073  0.02653533 ... -0.08715806 -0.03856301
  -0.07218203]
 [ 0.052216   -0.03315055  0.03330245 ... -0.07793505 -0.00250688
  -0.02969674]]

documents:
  Item 1: ['74 note 13 — unearned revenue unearned revenue by segment was as follows : ( in millions ) june 30, 2022 2021 productivity and business processes $ 24, 558 $ 22, 120 intelligent cloud 19, 371 17, 710 more personal computing 4, 479 4, 311 total $ 48, 408 $ 44, 141 changes in unearned revenue were as follows : ( in millions ) year ended june 30, 2022 balance, beginning of period $ 44, 141 deferral of revenue 110, 455 recognition of unearned revenue ( 106, 188 ) balance, end of period $ 48, 408

**Questions for understanding:**
- What are the ChromaDB query output fields?
- How does a ChromaDB query find the documents?
- Can you explain how the similarity search works?
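As a starting point for these questions, here is a brute-force sketch of a similarity search over toy 2-dimensional embeddings. It is an illustration only: `brute_force_query` and the sample data are hypothetical, and Chroma itself uses an approximate HNSW index rather than comparing the query against every stored vector. The result mirrors the nested shape seen in the output above, with one inner list per query, which is why the notebook reads `results['documents'][0]`.

```python
import math


def cosine_distance(a, b):
    """One minus the cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def brute_force_query(query_embeddings, ids, documents, embeddings, n_results=5):
    """Rank every stored vector against each query and keep the closest ones."""
    results = {"ids": [], "documents": [], "distances": []}
    for query_vec in query_embeddings:
        distances = [cosine_distance(query_vec, emb) for emb in embeddings]
        ranked = sorted(range(len(ids)), key=lambda i: distances[i])[:n_results]
        results["ids"].append([ids[i] for i in ranked])
        results["documents"].append([documents[i] for i in ranked])
        results["distances"].append([distances[i] for i in ranked])
    return results


# Toy collection: made-up documents with made-up 2-D embeddings.
ids = ["0", "1", "2"]
documents = ["revenue note", "risk factors", "cash flow"]
embeddings = [[1.0, 0.0], [0.0, 1.0], [0.8, 0.6]]

results = brute_force_query([[0.9, 0.1]], ids, documents, embeddings, n_results=2)
print(results["ids"][0])
print(results["documents"][0])
print(results["distances"][0])
```

In the real pipeline the query text is first embedded with the same embedding function as the documents, and only then compared against the stored vectors.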


## Using the RAG documents in the LLM prompt

### Setting up the LLM


In [14]:
from openai import OpenAI
openai_client = OpenAI()

### Mixing the retrieved documents into the prompt

In [39]:
def rag(query, retrieved_documents, model="gpt-4o"):
    information = "\n\n".join(retrieved_documents)

    messages = [
        {
            "role": "system",
            "content": "You are a helpful expert financial research assistant. Your users are asking questions about information contained in an annual report. "
            "You will be shown the user's question, and the relevant information from the annual report. Answer the user's question using only this information."
        },
        {"role": "user", "content": f"Question: {query}\nInformation: {information}"}
    ]
    
    response = openai_client.chat.completions.create(
        model=model,
        messages=messages,
    )
    content = response.choices[0].message.content
    return content

In [40]:
output = rag(query=query, retrieved_documents=retrieved_documents)

In [41]:
print(word_wrap(output))

The total revenue for the year ended June 30, 2022, was $198,270
million.
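When an answer looks wrong, it can help to inspect the exact prompt before it is sent. The sketch below rebuilds the message list without calling the API; `build_messages` and the sample documents are hypothetical, and the system prompt is shortened for the example.

```python
def build_messages(query, retrieved_documents):
    """Assemble the chat messages for a RAG request without sending them."""
    information = "\n\n".join(retrieved_documents)
    system_prompt = (
        "You are a helpful expert financial research assistant. "
        "Answer the user's question using only the information provided."
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"Question: {query}\nInformation: {information}"},
    ]


# Made-up documents standing in for the retrieved chunks.
sample_documents = ["total revenue table ...", "unearned revenue by segment ..."]
messages = build_messages("What was the total revenue?", sample_documents)

for message in messages:
    print(f"[{message['role']}] {len(message['content'])} characters")
    print(message["content"], "\n")
```

Printing the prompt this way shows which chunks the model actually saw and how long the combined context is.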
