<h2 style="text-align: center;">Pre-requisite: RAG101</h2>

## Background

We will address the context window issue we mentioned in RAG101. First, what is a context window? 

### Context Window

According to [Lark](https://www.larksuite.com/en_us/topics/ai-glossary/context-window-llms):

> A context window refers to a defined span of words within a text sequence that the Large Language Model (LLM) utilizes to extract contextual information.

The larger the context window, the more tokens the LLM can draw on at once to extract contextual information. As a result, a model with a larger context window can sustain a longer conversation or read a longer document. According to the Hugging Face documentation, the Mixtral-8x7B model has a theoretical attention span of 128k tokens (source: [Mixtral on 🤗](https://huggingface.co/docs/transformers/en/model_doc/mixtral#model-details)).

Consider a scenario where the context you pass into your prompt template is a whole textbook. If the textbook contains more tokens than your model's context window allows, the model can only consider the most recent tokens, up to that limit; everything earlier is ignored.
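
To make the truncation concrete, here is a minimal sketch. It assumes whitespace-separated words stand in for real tokens, and the helper name `fit_to_context_window` is hypothetical. How a given model or library truncates in practice depends on its settings, so treat this as an illustration of the idea only.

```python
def fit_to_context_window(tokens, max_tokens):
    """Keep only the most recent `max_tokens` tokens (hypothetical helper)."""
    if max_tokens <= 0:
        return []
    # Negative slicing keeps the tail of the list, i.e. the latest tokens.
    return tokens[-max_tokens:]


# Assumption: splitting on whitespace is a stand-in for a real tokenizer.
textbook = "chapter one introduces vectors and chapter two covers matrices".split()

window = fit_to_context_window(textbook, max_tokens=4)
print(window)  # ['chapter', 'two', 'covers', 'matrices']
```

With a window of 4 tokens, the first five words of the "textbook" never reach the model, even if the answer to the query is in them.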

What if there is a way to make your model consider only the relevant portion of your textbook? This is where embeddings come in.

## Aim

Our aim in RAG103 is to build a system that uses embeddings to extract the most relevant part of the provided context to answer a query.

## Token, Tokenizer, Tokenization

* **Token**: In Natural Language Processing (NLP), a token is a unit of text, such as a word, part of a word (a subword), a punctuation mark, or a number, obtained by breaking down text through a process called tokenization.

* **Tokenizer**: A tokenizer is an algorithm used in NLP to split text into tokens.

* **Tokenization**: Tokenization is the process of dividing text into a sequence of tokens.

Let's see a sample tokenization using the Mixtral-8x7B tokenizer.

In [1]:
import os
import torch
import transformers
from dotenv import load_dotenv
from utils.rag101 import NaiveRAG
from langchain_community.embeddings import HuggingFaceBgeEmbeddings

load_dotenv() # take environment variables from .env.
api_key = os.getenv("Huggingface_API_key")

In [2]:
rag = NaiveRAG(model=None, tokenizer=None)  # initialize RAG

sequence = "Using a Transformer network is simple."
tokens = rag.tokenizer.tokenize(sequence)

print(f"This is how the sequence '{sequence}' is tokenized using the Mixtral-8x7B tokenizer:\n")

for idx, token in enumerate(tokens):  # `idx` avoids shadowing the built-in `id`
    print(f"Token {idx}: {token}")

print("\nTokenization complete.")

This is how the sequence 'Using a Transformer network is simple.' is tokenized using the Mixtral-8x7B tokenizer:

Token 0: ▁Using
Token 1: ▁a
Token 2: ▁Trans
Token 3: former
Token 4: ▁network
Token 5: ▁is
Token 6: ▁simple
Token 7: .

Tokenization complete.
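
Notice that "Transformer" was split into two pieces, `▁Trans` and `former`. Subword tokenizers do this when a whole word is not in their vocabulary but its parts are. The sketch below imitates that behaviour with a greedy longest-match rule over a tiny, made-up vocabulary. Both `toy_vocab` and `greedy_subword_tokenize` are hypothetical, and this is not the algorithm the Mixtral-8x7B tokenizer actually uses (real tokenizers learn their vocabulary from data); it only shows why a word can come out as several tokens.

```python
def greedy_subword_tokenize(word, vocab):
    """Split `word` into the longest vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate until it matches a vocabulary entry.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # Nothing matched: fall back to a single character.
            end = start + 1
        pieces.append(word[start:end])
        start = end
    return pieces


# Assumption: a made-up vocabulary, far smaller than any real one.
toy_vocab = {"Trans", "former", "network", "simple"}

print(greedy_subword_tokenize("Transformer", toy_vocab))  # ['Trans', 'former']
print(greedy_subword_tokenize("network", toy_vocab))      # ['network']
```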


You can learn more about tokenizers [here](https://huggingface.co/learn/nlp-course/en/chapter2/4). Now that we are familiar with tokenizers and context windows, let's discuss embeddings.

## What is an embedding?

In NLP, an embedding is a representation of text in which words, phrases, or even entire documents are mapped to vectors of real numbers, so that texts with similar meaning end up close together in the vector space. In our implementation, we will use a [BGE](https://huggingface.co/BAAI) open-source embedding model from the [Hugging Face Hub](https://huggingface.co/docs/hub/index), which is home to more than 350k open-source models. I have chosen BGE because it ranks among the [best open-source embedding models](https://huggingface.co/spaces/mteb/leaderboard) on the MTEB leaderboard.


Let's explore how embeddings work!


In [5]:
embedding_model_id = "BAAI/bge-small-en-v1.5"
model_kwargs = {"device": "cpu"}
encode_kwargs = {"normalize_embeddings": True}

hf = HuggingFaceBgeEmbeddings(
    model_name=embedding_model_id, model_kwargs=model_kwargs, encode_kwargs=encode_kwargs
)

In [8]:
embedding = hf.embed_query("This is an introduction to Embeddings.")
len(embedding)

384
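
The model maps the sentence to a vector of 384 numbers. Because `normalize_embeddings` is set to `True`, each vector has unit length, so the dot product of two embeddings equals their cosine similarity. The sketch below shows how that similarity can be used to pick the most relevant chunk for a query. It is an illustration under stated assumptions: the 3-dimensional vectors are invented for readability and are not outputs of the BGE model, and the chunk names and the `top_k` helper are hypothetical.

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    """Dot product; equals cosine similarity for unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k(query_vec, chunk_vecs, k=1):
    """Return the names of the k chunks most similar to the query."""
    scored = [(dot(query_vec, vec), name) for name, vec in chunk_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:k]]


# Assumption: made-up 3-d vectors standing in for real 384-d embeddings.
chunks = {
    "chunk_about_tokenizers": normalize([0.9, 0.1, 0.0]),
    "chunk_about_embeddings": normalize([0.1, 0.9, 0.2]),
    "chunk_about_cooking": normalize([0.0, 0.1, 0.9]),
}
query = normalize([0.2, 0.8, 0.1])

print(top_k(query, chunks, k=1))  # ['chunk_about_embeddings']
```

In a real pipeline, the chunk vectors would come from embedding pieces of the provided context and the query vector from `hf.embed_query`; only the top-scoring chunks would then be placed in the prompt.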