# Lesson 1 Hack-A-Rag

>This notebook is based on the open-source project [wow-rag](https://github.com/datawhalechina/wow-rag) by Datawhale China.  
>I’ve adapted and annotated parts of it for personal learning and experimentation.

## 1. Introduction to Retrieval-Augmented Generation (RAG)

**RAG** (Retrieval-Augmented Generation) is a technique that enhances a language model by combining it with a retriever component. It allows the model to fetch relevant documents or knowledge **at inference time**, making it more accurate, up-to-date, and less prone to hallucination.


###  How RAG Works

1. **Retriever**: Finds relevant documents or chunks based on the input question.
2. **Generator**: A language model (e.g., GPT) that generates a response using both the query and the retrieved documents.

User Query → [Retriever] → Top-k Text Chunks → [Generator] → Final Answer
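
The flow above can be sketched end to end without any API calls. This is a toy illustration only: the word-overlap scorer stands in for a real embedding-based retriever, and `generate` is a hypothetical placeholder marking where the LLM call would go.

```python
def retrieve(query, chunks, k=2):
    """Rank chunks by how many query words they share (toy stand-in for vector search)."""
    query_words = set(query.lower().split())
    scored = [(len(query_words & set(chunk.lower().split())), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:k]]


def generate(query, context_chunks):
    """Hypothetical placeholder: a real system would send this prompt to an LLM."""
    context = "\n".join(context_chunks)
    return f"Context:\n{context}\n\nQuestion: {query}"


toy_chunks = [
    "FAISS is a library for vector similarity search.",
    "RAG combines a retriever with a generator.",
    "Bananas are rich in potassium.",
]
top_chunks = retrieve("What does RAG combine?", toy_chunks, k=1)
print(generate("What does RAG combine?", top_chunks))
```

The rest of this notebook replaces the toy scorer with embeddings plus FAISS, and the placeholder with a chat completion call.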





### Use Cases of RAG
| Domain           | Example                                              |
| ---------------- | ---------------------------------------------------- |
| Customer Support | Answering product questions using internal docs      |
| Legal / Finance  | Q&A over contracts or compliance manuals             |
| Healthcare       | Accessing medical guidelines or patient records      |
| Education        | Personalized tutoring using a curated knowledge base |
| Research         | Summarizing and answering from scientific papers     |


### RAG vs SFT 

| Aspect              | RAG                                               | SFT (Fine-tuning)                                 |
| ------------------- | ------------------------------------------------- | ------------------------------------------------- |
| 🔧 Setup            | Plug external knowledge into a frozen model       | Retrain the model with task-specific labeled data |
| 📚 Knowledge Source | External documents (updated any time)             | Internal weights (static knowledge)               |
| 🔄 Updatable?       | ✅ Yes – update documents, no retraining needed    | ❌ No – requires retraining to update knowledge    |
| 🧠 Generalization   | Strong with relevant context                      | Strong if trained well, but fixed to domain       |
| 🧪 Sample Use       | Open-book QA, long documents, real-time knowledge | Closed-domain tasks, chatbots, classification     |
| 💰 Cost             | Inference-time retrieval + LLM API                | Expensive training, but cheap inference           |
| 📈 Scalability      | Easily scales to new domains via document updates | Requires labeled data for every new task          |


## 2. Build-a-RAG!

### 2.1 Install required Libraries

####  What is `faiss`?

- **What it is:** Facebook AI Similarity Search
- **Use:** Efficient similarity search for high-dimensional vectors (e.g., embeddings)
- **Why we need it:** We use FAISS to index and retrieve the most relevant chunks based on vector similarity.

| Feature          | `faiss-cpu`                          | `faiss-gpu`                            |
| ---------------- | ------------------------------------ | -------------------------------------- |
| Platform         | CPU                                  | CUDA-enabled GPU                       |
| Speed            | Slower (especially for large data)   | Much faster on large datasets          |
| Memory usage     | Uses system RAM                      | Uses GPU VRAM                          |
| Dataset size     | Good for small/medium (<1M vectors)  | Suitable for large-scale search (10M+) |
| Installation     | Easy via pip                         | Requires NVIDIA GPU + CUDA setup       |
| Typical use case | Research, notebooks, lightweight RAG | Production, web-scale retrieval        |
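
To make "similarity search" concrete, here is a minimal pure-Python sketch of what an exhaustive (flat) inner-product search does conceptually: score every stored vector against the query and keep the top k. The function name `flat_ip_search` and the toy vectors are illustrative assumptions, not part of the FAISS API. FAISS performs the same kind of search far more efficiently.

```python
def flat_ip_search(query, vectors, k=2):
    """Return (scores, indices) of the k stored vectors with the largest inner product."""
    scores = [sum(q * v for q, v in zip(query, vec)) for vec in vectors]
    order = sorted(range(len(vectors)), key=lambda i: scores[i], reverse=True)[:k]
    return [scores[i] for i in order], order


stored = [
    [1.0, 0.0],  # index 0
    [0.0, 1.0],  # index 1
    [0.6, 0.8],  # index 2
]
top_scores, top_ids = flat_ip_search([0.0, 1.0], stored, k=2)
print(top_ids, top_scores)
```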


In [1]:
%pip install faiss-cpu scikit-learn scipy 
%pip install openai 
%pip install python-dotenv 

Looking in indexes: https://mirrors.cernet.edu.cn/pypi/web/simple
Note: you may need to restart the kernel to use updated packages.



### 2.2 Environment and Model Configuration

In [16]:
import os
from dotenv import load_dotenv

# Load env
load_dotenv()
api_key = os.getenv('API_KEY')

base_url = "https://api.openai.com/v1"   # We use OpenAI's models here
chat_model = "gpt-4.1-nano-2025-04-14"   # We use a cheaper model here to keep costs down
emb_model = "text-embedding-3-small"
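
If `API_KEY` is missing from the `.env` file, `os.getenv` quietly returns `None` and the problem only surfaces later, when the client is used. A small guard can fail earlier with a clearer message. This is a sketch; `require_env` is a hypothetical helper name, not part of `python-dotenv`.

```python
import os


def require_env(name):
    """Return the value of an environment variable, or raise a clear error if unset."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name!r} is not set; add it to your .env file.")
    return value
```

After `load_dotenv()`, it could be used as `api_key = require_env("API_KEY")`.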



### 2.3 Construct Client

```python
from openai import OpenAI

client = OpenAI(
    api_key=api_key,       # Required: Your OpenAI API key
    base_url=None,         # Optional: Override endpoint (e.g. Azure or self-hosted)
    organization=None,     # Optional: Your OpenAI organization ID
    timeout=None           # Optional: Set custom timeout (in seconds)
)
```


In [None]:
from openai import OpenAI
client = OpenAI(
    api_key = api_key,
    #base_url = base_url
)

is_valid = bool(client.models.list())  # Succeeds if the setup is correct; an invalid key raises an error here

### 2.4 Construct the Document

In [21]:
embedding_text = """

Multimodal Agent AI systems have many applications. In addition to interactive AI, grounded multimodal models could help drive content generation for bots and AI agents, and assist in productivity applications, helping to re-play, paraphrase, action prediction or synthesize 3D or 2D scenario. Fundamental advances in agent AI help contribute towards these goals and many would benefit from a greater understanding of how to model embodied and empathetic in a simulate reality or a real world. Arguably many of these applications could have positive benefits.

However, this technology could also be used by bad actors. Agent AI systems that generate content can be used to manipulate or deceive people. Therefore, it is very important that this technology is developed in accordance with responsible AI guidelines. For example, explicitly communicating to users that content is generated by an AI system and providing the user with controls in order to customize such a system. It is possible the Agent AI could be used to develop new methods to detect manipulative content - partly because it is rich with hallucination performance of large foundation model - and thus help address another real world problem.

For examples, 1) in health topic, ethical deployment of LLM and VLM agents, especially in sensitive domains like healthcare, is paramount. AI agents trained on biased data could potentially worsen health disparities by providing inaccurate diagnoses for underrepresented groups. Moreover, the handling of sensitive patient data by AI agents raises significant privacy and confidentiality concerns. 2) In the gaming industry, AI agents could transform the role of developers, shifting their focus from scripting non-player characters to refining agent learning processes. Similarly, adaptive robotic systems could redefine manufacturing roles, necessitating new skill sets rather than replacing human workers. Navigating these transitions responsibly is vital to minimize potential socio-economic disruptions.

Furthermore, the agent AI focuses on learning collaboration policy in simulation and there is some risk if directly applying the policy to the real world due to the distribution shift. Robust testing and continual safety monitoring mechanisms should be put in place to minimize risks of unpredictable behaviors in real-world scenarios. Our “VideoAnalytica" dataset is collected from the Internet and considering which is not a fully representative source, so we already go through-ed the ethical review and legal process from both Microsoft and University Washington. Be that as it may, we also need to understand biases that might exist in this corpus. Data distributions can be characterized in many ways. In this workshop, we have captured how the agent level distribution in our dataset is different from other existing datasets. However, there is much more than could be included in a single dataset or workshop. We would argue that there is a need for more approaches or discussion linked to real tasks or topics and that by making these data or system available.

We will dedicate a segment of our project to discussing these ethical issues, exploring potential mitigation strategies, and deploying a responsible multi-modal AI agent. We hope to help more researchers answer these questions together via this paper.

"""

### 2.5 Chunking Document

### 2.5.1 Why do we need to chunk the document? 


In a RAG (Retrieval-Augmented Generation) system, long documents are **split into smaller pieces**, called **chunks**, before being embedded and stored. This process is essential for several reasons:


####  1. Context Window Limitations

Embedding models and language models (e.g., GPT) have a maximum input size, called the **context window**. The limit varies widely by model, from around **512 tokens** for small embedding models to **several thousand tokens or more** for larger ones.  
If a document is too long, it won't fit, so we must split it into smaller parts.


####  2. Improved Retrieval Accuracy

When a query is made, the retriever searches for **the most relevant chunks**, not entire documents.  
Smaller, focused chunks help the model retrieve **more precise** and **contextually relevant** information.

- Too large: one irrelevant section may dominate the similarity score  
- Too small: might lose meaning or break context


####  3. Embedding Models Perform Better on Shorter Text

Most embedding models (like `text-embedding-ada-002` or `MiniLM`) are trained to represent **short text units**, such as sentences or paragraphs.  
Chunking ensures we stay within the range the model was optimized for.


####  4. Flexibility in Retrieval Granularity

Chunked documents allow the system to:
- Retrieve only the relevant **subsections**
- Combine multiple chunks from different sources
- Avoid returning large blocks of irrelevant content



Here we simply split the document into sequential slices of 150 characters, regardless of word or sentence boundaries.


#### Other chunking strategies

| Strategy           | When to Use                          |
| ------------------ | ------------------------------------ |
| Fixed-size cutting | Fast prototype, basic use cases      |
| Sentence-based     | Better for language tasks & meaning  |
| Sliding window     | When continuity/context is important |
| Semantic chunking  | For high-quality production systems  |
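
Two of the strategies in the table can be sketched in a few lines. These are minimal illustrations, not production splitters: the sentence splitter uses a naive regex that will also break on abbreviations such as "e.g.", and the function names and default sizes are assumptions chosen for this example.

```python
import re


def sentence_chunks(text):
    """Split text into sentences at '.', '!' or '?' followed by whitespace."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if s]


def sliding_window_chunks(text, size=150, overlap=30):
    """Fixed-size character chunks; each chunk repeats the last `overlap` characters of the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]


sample = "RAG retrieves first. Then it generates! Does chunking matter? Yes."
print(sentence_chunks(sample))
print(sliding_window_chunks("abcdefghij", size=4, overlap=2))
```

The overlap means a sentence cut at one chunk boundary is likely to appear whole in the neighbouring chunk, at the cost of embedding some text twice.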



### 2.5.2 Does the size matter? (Why `chunk_size = 150` here?)

#### Chunk size affects:

| Aspect                | Impact                                                    |
| --------------------- | --------------------------------------------------------- |
|  Retrieval quality  | Small chunks = more precise, large chunks = more context  |
|  Embedding accuracy | Too long → diluted meaning, too short → incomplete        |
|  Model performance  | LLMs prefer well-formed input (sentences, not fragments)  |
|  API efficiency     | Larger chunks = fewer API calls, but more tokens per call |

#### Chunk strategy for different document types

| Document Type           | Recommended Chunk Strategy           |
| ----------------------- | ------------------------------------ |
| Short FAQs, tweets      | 100–150 characters (1–2 sentences)   |
| Web articles, emails    | 200–300 characters (2–3 sentences)   |
| Technical docs, papers  | 400–600 characters or sentence-based |
| Legal/Medical documents | Sentence-based or 512–1024 tokens    |


In [20]:
chunk_size = 150 # Try other sizes!

chunks = [embedding_text[i:i + chunk_size] for i in range(0, len(embedding_text), chunk_size)]

### 2.6 Vectorization

We embed each chunk with the `emb_model` configured earlier. With `text-embedding-3-small`, each chunk becomes a 1536-dimensional vector by default. We store these vectors in a FAISS index, which plays the role of the vector database. At query time we embed the question the same way and use cosine similarity to find the chunks closest to it.

### Q: Why Normalize Embeddings After Chunking?

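Even if all text chunks are of equal size (e.g., 150 characters), their **embedding magnitudes (vector norms)** can vary with content, depending on the embedding model. OpenAI documents its embeddings as already normalized to length 1, so for `text-embedding-3-small` this step is mostly a safeguard. It matters more if you swap in a model that does not return unit-length vectors.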


### Why It Matters

- `faiss.IndexFlatIP` computes inner product as similarity.
- Without normalization: vectors with larger norms may score higher regardless of meaning.
- With normalization: inner product becomes cosine similarity, focusing on semantic similarity.
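
The effect is easy to check on toy vectors. The sketch below uses plain Python rather than `sklearn` or FAISS, and the vectors are made up for illustration: the raw inner product prefers the vector with the larger norm, while after L2 normalization the inner product equals cosine similarity and prefers the vector pointing in the same direction as the query.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity computed directly from the definition."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


query = [1.0, 1.0]
same_direction_short = [2.0, 2.0]   # same direction as the query, small norm
other_direction_long = [10.0, 0.0]  # different direction, much larger norm

# Raw inner product: the long vector wins even though it points elsewhere.
print(dot(query, same_direction_short), dot(query, other_direction_long))

# Normalized inner product: identical to cosine similarity, direction wins.
print(dot(l2_normalize(query), l2_normalize(same_direction_short)))
print(dot(l2_normalize(query), l2_normalize(other_direction_long)))
```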




In [None]:
from sklearn.preprocessing import normalize
import numpy as np
import faiss

embeddings = []

for chunk in chunks:
    response = client.embeddings.create(
        model=emb_model,
        input=chunk,
    )
    embeddings.append(response.data[0].embedding)


normalized_embeddings = normalize(np.array(embeddings).astype('float32'))

d = len(embeddings[0])
index = faiss.IndexFlatIP(d) # Create a Faiss index for storing and retrieving embedding vectors
index.add(normalized_embeddings) # Add normalized embedding to the Faiss index 
n_vectors = index.ntotal


print(n_vectors)

23
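
The loop above sends one request per chunk. The OpenAI embeddings endpoint also accepts a list of strings as `input`, so chunks can be sent in batches instead. Below is a minimal sketch, assuming a hypothetical `embed_fn` callable that takes a list of strings and returns one vector per string. A fake implementation is used here so the snippet runs offline.

```python
def embed_in_batches(texts, embed_fn, batch_size=16):
    """Embed texts in batches; embed_fn maps a list of strings to a list of vectors."""
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        vectors.extend(embed_fn(batch))
    return vectors


def fake_embed(batch):
    """Offline stand-in for a real embedding call: one 2-d vector per text."""
    return [[float(len(text)), 1.0] for text in batch]


vectors = embed_in_batches(["a", "bb", "ccc"], fake_embed, batch_size=2)
print(len(vectors))
```

With the notebook's client, `embed_fn` could plausibly be `lambda batch: [item.embedding for item in client.embeddings.create(model=emb_model, input=batch).data]`. The batch size of 16 is an arbitrary choice for this example.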


In [23]:

def match_text(input_text, index, chunks, k=2):
    """
    Given a set of chunks, find the top k chunks that are most similar to the input text.

    Parameters:

    input_text (str): Input text to match.
    index (faiss.Index): Faiss index to search.
    chunks (list of str): List of chunks.
    k (int, optional): Number of most similar chunks to return. Default is 2.

    Returns:

    str: Formatted string containing the most similar chunks and their similarity.

    """
    # Make sure K doesn't exceed the total chunks
    k = min(k, len(chunks))

    
    response = client.embeddings.create(
        model=emb_model,
        input=input_text,
    )


    input_embedding = response.data[0].embedding
    input_embedding = normalize(np.array([input_embedding]).astype('float32'))

    # Search the index for the k vectors most similar to the input embedding vector
    distances, indices = index.search(input_embedding, k)


    matching_texts = ""

    for i, idx in enumerate(indices[0]): 
        # Print every similar text
        print(f"similarity: {distances[0][i]:.4f}\nmatching text: \n{chunks[idx]}\n")
        # Add similarity and text content to the matching text string
        matching_texts += f"similarity: {distances[0][i]:.4f}\nmatching text: \n{chunks[idx]}\n"


    return matching_texts

In [24]:
input_text = "What are the applications of Agent AI systems ?"

matched_texts = match_text(input_text=input_text, index=index, chunks=chunks, k=2)

similarity: 0.6790
matching text: 

Multimodal Agent AI systems have many applications. In addition to interactive AI, grounded multimodal models could help drive content generation for

similarity: 0.5948
matching text: 
ystem and providing the user with controls in order to customize such a system. It is possible the Agent AI could be used to develop new methods to de



The matched text is cut off mid-word because we chunk by a fixed character count. The pipeline still works, since the model generates its answer from whatever fragments are retrieved, but truncated chunks can leave out useful context.
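
One way to avoid cuts in the middle of a word is to pack whole words into each chunk up to the size limit. This is a sketch of that idea, not the chunking used in this notebook. The name `word_boundary_chunks` is hypothetical, and note that it collapses runs of whitespace (including newlines) into single spaces.

```python
def word_boundary_chunks(text, max_chars=150):
    """Greedily pack whole words into chunks of at most max_chars characters.

    A single word longer than max_chars becomes its own oversized chunk.
    """
    chunks, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


print(word_boundary_chunks("the quick brown fox jumps", max_chars=10))
```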

### Construct the question prompt
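
A prompt for this step needs two ingredients: the retrieved text and the user's question. As a hedged sketch of one way to assemble them, the hypothetical helper `build_prompt` below takes the retrieved chunks as a list (rather than the pre-formatted string that `match_text` returns) and numbers them so the model can tell the fragments apart. The exact wording of the instructions is an assumption. The notebook's own prompt cell follows.

```python
def build_prompt(question, contexts):
    """Assemble a RAG prompt from a question and a list of retrieved text chunks."""
    numbered = "\n".join(f"[{i}] {text.strip()}" for i, text in enumerate(contexts, start=1))
    return (
        "Answer the question using only the documents below.\n\n"
        f"Documents:\n{numbered}\n\n"
        f"Question: {question}\n"
        "Use the original wording of the documents where possible. "
        "Do not restate the question."
    )


print(build_prompt("What is RAG?", ["RAG pairs a retriever with a generator. "]))
```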

In [34]:
prompt = f"""
According to the documents found
{matched_texts}
generate
{input_text}
When answering questions, use the original text of the document if possible. Do not restate the question, just start answering it.
"""

In [35]:
def get_completion_stream(prompt):
    """
    Generates a streaming text reply using OpenAI's Chat Completions API.

    Parameters:
    prompt (str): The prompt text to generate the reply.

    Returns:
    None: This function directly prints the generated reply content.

    """
    
    response = client.chat.completions.create(
        model=chat_model,  
        messages=[
            {"role": "user", "content": prompt},
        ],
        stream=True,
    )
    
    if response:
        
        for chunk in response:
            
            content = chunk.choices[0].delta.content
            
            if content:
                
                print(content, end='', flush=True)
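
`get_completion_stream` prints the reply as it arrives and returns `None`. If you also want the text as a value, for example to log or evaluate it, the deltas can be collected into a string. This is a minimal sketch: `collect_stream` and `fake_chunk` are hypothetical names, and the fake objects only mimic the `chunk.choices[0].delta.content` shape used above so that the snippet runs without an API call.

```python
from types import SimpleNamespace


def collect_stream(stream):
    """Join the text deltas of a chat-completions stream into one string."""
    parts = []
    for chunk in stream:
        if not chunk.choices:  # skip chunks that carry no choices
            continue
        content = chunk.choices[0].delta.content
        if content:  # content can be None, e.g. on the final chunk
            parts.append(content)
    return "".join(parts)


def fake_chunk(text):
    """Mimic the shape of a streamed chunk: chunk.choices[0].delta.content."""
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])


fake_stream = [fake_chunk("Hello"), fake_chunk(", "), fake_chunk("world"), fake_chunk(None)]
print(collect_stream(fake_stream))
```

With the real client, the same function could be passed the object returned by `client.chat.completions.create(..., stream=True)`.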


In [36]:
get_completion_stream(prompt)

Multimodal Agent AI systems have many applications. In addition to interactive AI, grounded multimodal models could help drive content generation.