#### Step 1: Tokenization
First, we lowercase each document and split it on whitespace to get its tokens.

In [1]:
documents = [
    "The cat sat on the mat",
    "The dog sat on the log",
    "The cat chased the dog"
]

# Tokenize the documents
tokenized_documents = [doc.lower().split() for doc in documents]
print(tokenized_documents)

[['the', 'cat', 'sat', 'on', 'the', 'mat'], ['the', 'dog', 'sat', 'on', 'the', 'log'], ['the', 'cat', 'chased', 'the', 'dog']]
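Splitting on whitespace leaves punctuation attached to words, so `mat.` and `mat` would be counted as different terms. A minimal sketch of a regex-based alternative, assuming that runs of lowercase letters, digits, and apostrophes are an acceptable definition of a token (the `tokenize` helper is illustrative and not part of the cells above):

```python
import re

def tokenize(text):
    """Lowercase the text and return runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

print(tokenize("The cat sat on the mat."))
# ['the', 'cat', 'sat', 'on', 'the', 'mat']
```

For the three example documents, which contain no punctuation, this gives the same tokens as `doc.lower().split()`.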


#### Step 2: Calculate Term Frequency (TF)
Next, we count how often each term occurs in each document. This raw count is the term frequency.

In [2]:
from collections import Counter

# Calculate term frequency for each document
tf = [Counter(doc) for doc in tokenized_documents]
print(tf)

[Counter({'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1}), Counter({'the': 2, 'dog': 1, 'sat': 1, 'on': 1, 'log': 1}), Counter({'the': 2, 'cat': 1, 'chased': 1, 'dog': 1})]
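The counts above are raw term frequencies. A common variant divides each count by the document length so that longer documents do not receive larger weights simply for containing more words. A minimal sketch, assuming that kind of normalization is wanted (the `normalized_tf` helper is hypothetical):

```python
from collections import Counter

def normalized_tf(tokens):
    """Return each term's count divided by the total number of tokens."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}

print(normalized_tf(["the", "cat", "chased", "the", "dog"]))
# {'the': 0.4, 'cat': 0.2, 'chased': 0.2, 'dog': 0.2}
```

The remaining steps work unchanged with either raw or normalized frequencies, because cosine similarity divides out each vector's length.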


#### Step 3: Calculate Inverse Document Frequency (IDF)
Now, we calculate the inverse document frequency for each term: `log(N / df)`, where `N` is the number of documents and `df` is the number of documents that contain the term.

In [4]:
import math

# Calculate document frequency for each term
df = Counter()
for doc in tokenized_documents:
    df.update(set(doc))

# Calculate IDF for each term
idf = {term: math.log(len(documents) / df[term]) for term in df}
print(idf)

{'on': 0.4054651081081644, 'cat': 0.4054651081081644, 'the': 0.0, 'sat': 0.4054651081081644, 'mat': 1.0986122886681098, 'dog': 0.4054651081081644, 'log': 1.0986122886681098, 'chased': 1.0986122886681098}
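With `log(N / df)`, a term that appears in every document (`the` here) gets an IDF of exactly 0 and therefore contributes nothing to any score. A smoothed variant adds one to both the numerator and the denominator and one to the result, which keeps every weight positive. A sketch, assuming this particular smoothing suits your use case (the `smoothed_idf` helper is illustrative):

```python
import math
from collections import Counter

def smoothed_idf(tokenized_docs):
    """Compute log((1 + N) / (1 + df)) + 1 for every term in the corpus."""
    n_docs = len(tokenized_docs)
    df = Counter()
    for doc in tokenized_docs:
        df.update(set(doc))  # count each term once per document
    return {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}

docs = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "log"],
    ["the", "cat", "chased", "the", "dog"],
]
idf_smooth = smoothed_idf(docs)
print(idf_smooth["the"], idf_smooth["cat"], idf_smooth["mat"])
```

Rare terms such as `mat` still receive the largest weight, while `the` now gets a small positive weight instead of zero.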


#### Step 4: Calculate TF-IDF
Multiply TF by IDF for each term in each document.

In [5]:
# Calculate TF-IDF for each document
tf_idf = []
for doc_tf in tf:
    doc_tf_idf = {term: freq * idf[term] for term, freq in doc_tf.items()}
    tf_idf.append(doc_tf_idf)
print(tf_idf)

[{'the': 0.0, 'cat': 0.4054651081081644, 'sat': 0.4054651081081644, 'on': 0.4054651081081644, 'mat': 1.0986122886681098}, {'the': 0.0, 'dog': 0.4054651081081644, 'sat': 0.4054651081081644, 'on': 0.4054651081081644, 'log': 1.0986122886681098}, {'the': 0.0, 'cat': 0.4054651081081644, 'chased': 1.0986122886681098, 'dog': 0.4054651081081644}]
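As a worked check of two values in this output, take the first document: `mat` occurs once and appears in 1 of the 3 documents, while `the` occurs twice but appears in all 3, so its weight is zero.

$$
\text{tf-idf}(\text{mat}, d_1) = 1 \times \ln\frac{3}{1} \approx 1.0986,
\qquad
\text{tf-idf}(\text{the}, d_1) = 2 \times \ln\frac{3}{3} = 0
$$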


#### Step 5: Query Matching
Given a query, compute its TF-IDF vector using the IDF values from the documents, then score each document by its cosine similarity to the query.

In [6]:
# Tokenize the query
query = "cat sat"
tokenized_query = query.lower().split()

# Calculate term frequency for the query
query_tf = Counter(tokenized_query)

# Calculate TF-IDF for the query
query_tf_idf = {term: freq * idf.get(term, 0) for term, freq in query_tf.items()}
print(query_tf_idf)

# Calculate cosine similarity between query and each document
def cosine_similarity(doc_tf_idf, query_tf_idf):
    dot_product = sum(doc_tf_idf.get(term, 0) * query_tf_idf.get(term, 0) for term in query_tf_idf)
    doc_magnitude = math.sqrt(sum(value ** 2 for value in doc_tf_idf.values()))
    query_magnitude = math.sqrt(sum(value ** 2 for value in query_tf_idf.values()))
    if doc_magnitude == 0 or query_magnitude == 0:
        return 0.0
    return dot_product / (doc_magnitude * query_magnitude)

# Calculate similarity for each document
similarities = [cosine_similarity(doc_tf_idf, query_tf_idf) for doc_tf_idf in tf_idf]
print(similarities)

# Find the most relevant document
most_relevant_doc_index = similarities.index(max(similarities))
print(f"The most relevant document is: {documents[most_relevant_doc_index]}")

{'cat': 0.4054651081081644, 'sat': 0.4054651081081644}
[0.43976863279651823, 0.21988431639825912, 0.23135443112611218]
The most relevant document is: The cat sat on the mat
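`similarities.index(max(similarities))` returns only the single best match. Retrieval pipelines usually pass several top documents on to the next stage, so a ranked list is often more useful. A minimal sketch, assuming the scores are in the same order as the documents (the `rank_documents` helper and its `top_k` parameter are hypothetical, and the scores are rounded from the output above):

```python
def rank_documents(documents, scores, top_k=2):
    """Return the top_k (score, document) pairs, highest score first."""
    ranked = sorted(zip(scores, documents), key=lambda pair: pair[0], reverse=True)
    return ranked[:top_k]

documents = [
    "The cat sat on the mat",
    "The dog sat on the log",
    "The cat chased the dog",
]
scores = [0.4398, 0.2199, 0.2314]  # rounded cosine similarities for the query "cat sat"

for score, doc in rank_documents(documents, scores):
    print(f"{score:.4f}  {doc}")
# 0.4398  The cat sat on the mat
# 0.2314  The cat chased the dog
```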


#### RAG Retrieval in Python with the `transformers` Library

#### Step 1: Install Required Libraries
First, install the necessary libraries:

In [None]:
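# RagRetriever needs datasets and faiss; the models themselves need PyTorch
%pip install transformers datasets torch faiss-cpu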

#### Step 2: Import Libraries and Load Models
Import the required libraries and load the pre-trained models.

In [8]:
from transformers import RagTokenizer, RagRetriever, RagTokenForGeneration

# Load the tokenizer, retriever, and model
tokenizer = RagTokenizer.from_pretrained("facebook/rag-token-nq")
retriever = RagRetriever.from_pretrained("facebook/rag-token-nq", index_name="exact", use_dummy_dataset=True)
model = RagTokenForGeneration.from_pretrained("facebook/rag-token-nq", retriever=retriever)

  from .autonotebook import tqdm as notebook_tqdm
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'RagTokenizer'. 
The class this function is called from is 'DPRQuestionEncoderTokenizer'.

ImportError: 
RagRetriever requires the 🤗 Datasets library but it was not found in your environment. You can install it with:
```
pip install datasets
```
In a notebook or a colab, you can install it by executing a cell with
```
!pip install datasets
```
then restarting your kernel.

Note that if you have a local folder named `datasets` or a local python file named `datasets.py` in your current
working directory, python may try to import this instead of the 🤗 Datasets library. You should rename this folder or
that python file if that's the case. Please note that you may need to restart your runtime after installation.


#### Step 3: Define the Knowledge Base
Define a small knowledge base for demonstration purposes.

In [None]:
knowledge_base = [
    "The cat sat on the mat.",
    "The dog sat on the log.",
    "The cat chased the dog.",
    "Cats are small, carnivorous mammals.",
    "Dogs are domesticated mammals, not natural wild animals."
]

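# Note: the retriever's index has no `index_data` method, so
# `retriever.index.index_data(knowledge_base)` would raise AttributeError.
# To retrieve from these passages, build a `datasets.Dataset` with "title", "text",
# and "embeddings" columns, add a FAISS index on "embeddings", and reload the
# retriever with index_name="custom" and indexed_dataset=<that dataset>.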

#### Step 4: Generate Responses Using RAG
Generate responses to a query using the RAG model.

In [None]:
# Define the query
query = "Tell me about cats and dogs."

# Tokenize the query
input_ids = tokenizer(query, return_tensors="pt").input_ids

# Generate the response
outputs = model.generate(input_ids)

# Decode the response
response = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
print(response)