# Retrieval Augmented Generation (RAG) Lab with Small LLMs from Hugging Face

This lab guides students through building a basic Retrieval Augmented Generation (RAG) system with a recent small language model from Hugging Face that can be downloaded without authentication. The lab runs in Google Colab so that students can work directly in an interactive notebook environment.

## Instructions for Students
1. **Understand Each Step**: Carefully read the code and understand each component. Why do we need to tokenize the text? How does FAISS help with retrieval?
2. **Experiment with Queries**: Change the query in `retrieve_and_generate()` to see how the model performs on different inputs.
3. **Evaluate Retrieval Quality**: What happens if you change `top_k` to a larger or smaller number? Does the generated answer improve?
4. **Extend the Lab**: Try using a different dataset for the retrieval corpus. How does it impact the generated answers? Pick a dataset of your choice.
5. **Experiment with Different Models**: Change `generation_model_name` or `embedding_model_name` and compare the results.


- We tokenize text to convert it into numerical token IDs, which are the only input a language model can process.
- FAISS indexes vectors and searches them efficiently. Once the documents and the query are encoded as embeddings, FAISS quickly finds the documents whose embeddings are closest to the query embedding.
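
To make the tokenization step concrete, here is a minimal, dependency-free sketch of a word-level tokenizer. It is only an illustration: the helper names (`build_vocab`, `encode`, `decode`) are hypothetical, and real Hugging Face tokenizers split text into subword units rather than whole words. The core idea of mapping text to integer IDs and back is the same.

```python
# Toy word-level tokenizer (illustrative only; real tokenizers use subwords).
def build_vocab(texts):
    """Assign an integer ID to every distinct lowercase word, in first-seen order."""
    vocab = {"<unk>": 0}  # ID 0 is reserved for words not in the vocabulary
    for text in texts:
        for word in text.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab

def encode(text, vocab):
    """Map each word to its ID, falling back to <unk> for unseen words."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]

def decode(ids, vocab):
    """Map IDs back to words."""
    id_to_word = {i: w for w, i in vocab.items()}
    return " ".join(id_to_word[i] for i in ids)

vocab = build_vocab(["the food was great", "the service was slow"])
ids = encode("the food was slow", vocab)
print(ids)                 # [1, 2, 3, 6]
print(decode(ids, vocab))  # the food was slow
```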

## Step 1: Setup

In this step, we install the libraries needed for the lab: Hugging Face's `transformers` to load a pre-trained language model, `datasets` to load a dataset, `sentence-transformers` to compute text embeddings, and `faiss-cpu` for efficient similarity search.

In [None]:
# Step 1: Setup
# Install required libraries
!pip install transformers datasets faiss-cpu sentence-transformers

Collecting faiss-cpu
  Downloading faiss_cpu-1.13.0-cp39-abi3-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (7.7 kB)
Downloading faiss_cpu-1.13.0-cp39-abi3-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (23.6 MB)
Installing collected packages: faiss-cpu
Successfully installed faiss-cpu-1.13.0


We also need to import the necessary Python libraries that will be used throughout the lab.

In [None]:
# Import necessary libraries
from transformers import AutoTokenizer, AutoModelForCausalLM
from sentence_transformers import SentenceTransformer
from datasets import load_dataset
import faiss
import torch

## Step 2: Load a Small LLM and Embedding Model from Hugging Face

For this lab, we will use two models: one for generating text and one for encoding text into embeddings. We will use the pre-trained causal language model `tiiuae/Falcon3-1B-Instruct-1.58bit` for text generation and the SentenceTransformer model `all-MiniLM-L6-v2` for generating embeddings of the text.

In [None]:
# Step 2: Load models from Hugging Face
embedding_model_name = "sentence-transformers/all-MiniLM-L6-v2"
generation_model_name = "tiiuae/Falcon3-1B-Instruct-1.58bit"

# Load embedding model for vector representation
embedding_model = SentenceTransformer(embedding_model_name, device='cuda' if torch.cuda.is_available() else 'cpu')

# Load tokenizer and text generation model
tokenizer = AutoTokenizer.from_pretrained(generation_model_name)
generation_model = AutoModelForCausalLM.from_pretrained(generation_model_name).to('cuda' if torch.cuda.is_available() else 'cpu')

print(f"Loaded embedding model: {embedding_model_name}")
print(f"Loaded generation model: {generation_model_name}")

You have loaded a BitNet model on CPU and have a CUDA device available, make sure to set your model on a GPU device in order to run your model.


Loaded embedding model: sentence-transformers/all-MiniLM-L6-v2
Loaded generation model: tiiuae/Falcon3-1B-Instruct-1.58bit


## Step 3: Load a Dataset and Prepare the Retrieval Corpus

Next, we need to create a knowledge base for our retrieval system. We will use the first 1% of the training split of the Yelp Review Full dataset (`yelp_review_full`), which contains the text of Yelp reviews together with their star ratings. We will extract the text of each entry to form our retrieval corpus.

In [None]:
# Step 3: Load a dataset and prepare the retrieval corpus
# We'll use a small dataset as the knowledge base
knowledge_dataset = load_dataset("yelp_review_full", split="train[:1%]")

# Create a retrieval corpus from the dataset
corpus = [entry['text'] for entry in knowledge_dataset]

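- Using a dataset that is relevant to the expected queries improves the accuracy and relevance of the generated answers.
- Using a less relevant or very small dataset may cause the model to generate incomplete or off-topic answers, because the retrieved context does not contain the information the query asks for.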
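
When swapping in a different dataset, it can help to clean the entries before indexing them. The following is a minimal sketch, not part of the original lab: `prepare_corpus` is a hypothetical helper, and the `max_chars=500` cap is an arbitrary assumption. It normalizes whitespace, skips empty and duplicate entries, and truncates long passages so that the retrieved context stays short.

```python
def prepare_corpus(entries, text_key="text", max_chars=500):
    """Return cleaned passages: whitespace-normalized, non-empty, de-duplicated, truncated."""
    seen = set()
    corpus = []
    for entry in entries:
        # Collapse runs of whitespace (including newlines) into single spaces
        text = " ".join(str(entry.get(text_key, "")).split())
        if not text or text in seen:
            continue  # skip empty entries and exact duplicates
        seen.add(text)
        corpus.append(text[:max_chars])  # cap the length (assumed limit)
    return corpus

# Small made-up sample standing in for dataset rows
sample_entries = [
    {"text": "Great   tacos,\nfriendly staff."},
    {"text": ""},
    {"text": "Great tacos, friendly staff."},
    {"text": "x" * 1000},
]
sample_corpus = prepare_corpus(sample_entries)
print(len(sample_corpus), [len(t) for t in sample_corpus])  # 2 [28, 500]
```

Because iterating over a Hugging Face dataset yields one dictionary per row, the same helper could be called as `prepare_corpus(knowledge_dataset)` in place of the list comprehension above.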

## Step 4: Index the Corpus Using FAISS for Efficient Retrieval

In this step, we will index our corpus using FAISS, a library developed by Facebook AI that allows efficient similarity search and clustering of dense vectors. We will batch the text encoding to manage memory usage and avoid runtime crashes.

In [None]:
# Step 4: Index the corpus using FAISS for efficient retrieval
# Tokenize and vectorize the corpus in batches
import numpy as np

def encode_corpus_in_batches(corpus, model, batch_size=8):
    embeddings = []
    for i in range(0, len(corpus), batch_size):
        batch = corpus[i:i + batch_size]
        batch_embeddings = model.encode(batch, convert_to_numpy=True, device='cuda' if torch.cuda.is_available() else 'cpu')
        embeddings.append(batch_embeddings)
    return np.vstack(embeddings)

corpus_embeddings = encode_corpus_in_batches(corpus, embedding_model)

# Build a FAISS index
index = faiss.IndexFlatL2(corpus_embeddings.shape[1])
index.add(np.array(corpus_embeddings))
print("Corpus indexed with FAISS.")

Corpus indexed with FAISS.
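
To see what the index does, note that `IndexFlatL2` performs an exact search: it compares the query against every stored vector and returns the ones with the smallest squared L2 distance. The sketch below reproduces that idea in plain Python on made-up 2-dimensional vectors. The function names are hypothetical, and the code is meant for intuition only, not as a replacement for FAISS.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def brute_force_search(query, vectors, top_k):
    """Return (distances, indices) of the top_k vectors closest to the query."""
    scored = sorted((squared_l2(query, v), i) for i, v in enumerate(vectors))
    nearest = scored[:top_k]
    return [d for d, _ in nearest], [i for _, i in nearest]

# Four toy "document embeddings" and one toy "query embedding"
toy_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
toy_distances, toy_indices = brute_force_search([0.9, 0.1], toy_vectors, top_k=2)
print(toy_indices)  # [1, 0]
```

The cost of this search grows linearly with the number of stored vectors, which is acceptable for a corpus of a few thousand reviews.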


## Step 5: Define the RAG Process

Now that we have our retrieval corpus indexed, we will define the RAG (Retrieval Augmented Generation) process. This involves retrieving the most relevant documents based on a query and then using these documents as context for our language model to generate an answer.

In [None]:
# Step 5: Define the RAG process
# A function to retrieve relevant documents and generate an augmented response
def retrieve_and_generate(query, top_k=3):
    # Encode the query
    query_embedding = embedding_model.encode([query], convert_to_numpy=True, device='cuda' if torch.cuda.is_available() else 'cpu')[0]

    # Search the index for similar documents
    distances, indices = index.search(np.array([query_embedding]), top_k)

    # Concatenate retrieved documents as context
    context = "\n".join([corpus[i] for i in indices[0]])

    # Create the final prompt
    prompt = f"Context: {context}\n\nQuestion: {query}"

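    # Generate a response.
    # Note: max_length counts the prompt tokens as well as the generated ones,
    # so a long retrieved context leaves less room for the answer.
    inputs = tokenizer(prompt, return_tensors="pt").to('cuda' if torch.cuda.is_available() else 'cpu')
    output = generation_model.generate(**inputs, max_length=500)
    # The decoded text contains the prompt followed by the model's continuation
    answer = tokenizer.decode(output[0], skip_special_tokens=True)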

    return answer
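
One possible refinement of the prompt-building step is sketched below. It is an assumption-laden variation, not the lab's method: `build_prompt` is a hypothetical helper, the character budget is arbitrary, and the trailing `Answer:` cue is simply a common way to signal where the model should start answering.

```python
def build_prompt(query, documents, max_context_chars=1500):
    """Join retrieved documents into a bounded context and end with an answer cue."""
    context_parts = []
    used = 0
    for doc in documents:
        remaining = max_context_chars - used
        if remaining <= 0:
            break  # context budget exhausted
        piece = doc[:remaining]  # truncate the last document if needed
        context_parts.append(piece)
        used += len(piece)
    context = "\n".join(context_parts)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

example_prompt = build_prompt(
    "Is the massage good?",
    ["Retta gives a great massage.", "a" * 5000],
    max_context_chars=100,
)
print(example_prompt)
```

Bounding the context keeps the prompt from consuming the whole generation budget. In the same spirit, `generate` accepts `max_new_tokens`, which limits only the newly generated tokens, and the prompt can be dropped from the returned text by decoding `output[0][inputs["input_ids"].shape[1]:]` instead of `output[0]`.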

## Step 6: Test the RAG System

Finally, we can test our RAG system by providing it with a query. The model will retrieve relevant documents from the corpus and use them as context to generate an answer.

In [None]:
# Step 6: Test the RAG system
query = "latest ai technologies"
response = retrieve_and_generate(query)
print("Response:", response)

Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.


Response: Context: 3x. So far so good.
Retta is knowledgeable, intuitive and strong.  She gives a great massage.
anything you need you can find it at the market :)

Question: latest ai technologies in 2023
Answer:
<|assistant|>
1. GPT-4 (Generative Pre-trained Transformer) - A language model that can generate coherent and creative text based on a given prompt.

2. GPT-3.5 Turbo - A more powerful version of GPT-3, offering improved performance and capabilities.

3. GPT-3.6 Turbo - The latest iteration of GPT-3, offering even greater improvements in performance and capabilities.

4. GPT-4 Turbo - The most powerful version of GPT-3, offering even greater improvements in performance and capabilities.

5. GPT-4 Turbo Turbo - The latest iteration of GPT-3, offering even greater improvements in performance and capabilities.

6. GPT-4 Turbo Turbo - The latest iteration of GPT-3, offering even greater improvements in performance and capabilities.

7. GPT-4 Turbo Turbo - The latest iteration of 

A smaller top_k retrieves fewer documents, so the context may be limited and the generated answer could miss some details. A larger top_k retrieves more documents, giving the model more context. This can improve the answer’s richness, but it might also include irrelevant information or increase input length.
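
One way to limit irrelevant context when `top_k` is large is to discard retrieved documents whose distance to the query is too high. The sketch below assumes a hypothetical helper `filter_by_distance` and uses made-up distances and indices; a suitable threshold depends on the embedding model and corpus and would need to be tuned by inspecting real search results.

```python
def filter_by_distance(distances, indices, max_distance):
    """Keep only (index, distance) pairs whose distance is at most max_distance."""
    return [(i, d) for d, i in zip(distances, indices) if d <= max_distance]

# Made-up values standing in for one row of FAISS search results
example_distances = [0.42, 0.95, 1.60]
example_indices = [17, 3, 250]
kept = filter_by_distance(example_distances, example_indices, max_distance=1.0)
print(kept)  # [(17, 0.42), (3, 0.95)]
```

Inside `retrieve_and_generate()`, the inputs would be `distances[0]` and `indices[0]`, since `index.search` returns one row of results per query.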

---

### Conclusion
In this lab, we learned how to build a basic Retrieval Augmented Generation (RAG) system using two small models from Hugging Face: an embedding model for retrieval and a small language model for generation. We used FAISS to efficiently index our retrieval corpus and augmented the language model's generation with retrieved context. When the corpus is relevant to the query, this approach helps generate more informative and context-aware responses.