# Module 3: Local RAG (Retrieval Augmented Generation)

**Goal**: Give the LLM access to knowledge it doesn't have (private data) without fine-tuning.

**The Problem**: An LLM knows nothing about *your* private documents, so when asked about them it either admits ignorance or hallucinates an answer.

**The Solution (RAG)**:
1.  **Retrieve**: Find relevant info from your documents based on the user's question.
2.  **Augment**: Paste that info into the prompt.
3.  **Generate**: Ask the LLM to answer using *only* that info.
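
The three steps can be sketched end to end in plain Python before any real models are involved. This is only a toy: `retrieve` ranks chunks by word overlap instead of embeddings, `generate` is a placeholder rather than an LLM call, and the sample chunks are invented for illustration.

```python
import re

def tokenize(text):
    """Lowercase a string and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, chunks, k=1):
    """Rank chunks by Jaccard word overlap with the question (a stand-in for embeddings)."""
    q = tokenize(question)

    def score(chunk):
        c = tokenize(chunk)
        return len(q & c) / len(q | c)

    return sorted(chunks, key=score, reverse=True)[:k]

def augment(question, context_chunks):
    """Paste the retrieved chunks into the prompt ahead of the question."""
    context = "\n".join(context_chunks)
    return f"Context:\n{context}\n\nQuestion: {question}"

def generate(prompt):
    """Placeholder for the LLM call; it only reports the prompt size."""
    return f"[LLM would answer here, given a {len(prompt)}-character prompt]"

# Invented sample data, not part of this notebook's document.
chunks = [
    "The office coffee machine is descaled every Friday.",
    "Parking permits are renewed in January.",
]
question = "When is the coffee machine descaled?"

top_chunks = retrieve(question, chunks, k=1)
prompt = augment(question, top_chunks)
print(prompt)
print(generate(prompt))
```

The rest of this module swaps each toy piece for a real one: embeddings for the word overlap, a vector database for the list, and a quantized LLM for the placeholder.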

**Components**:
- **Embedding Model**: `sentence-transformers/all-MiniLM-L6-v2` (Converts text to numbers/vectors).
- **Vector Database**: `ChromaDB` (Stores these vectors and finds similar ones).
- **LLM**: `Qwen/Qwen2.5-1.5B-Instruct` (Generates the answer).
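
To make "finds similar ones" concrete, here is a minimal sketch of comparing vectors with cosine similarity, one common similarity measure. The 3-dimensional vectors are hand-made, hypothetical values; `all-MiniLM-L6-v2` produces 384-dimensional vectors, and the metric a vector database actually uses depends on its configuration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made "embeddings" (hypothetical values, for illustration only).
stored = {
    "side effects of the crystals": [0.9, 0.1, 0.0],
    "colour of the lab coat": [0.0, 0.2, 0.9],
}
query_vector = [0.8, 0.2, 0.1]

# Score every stored vector against the query and keep the closest one.
scores = {text: cosine_similarity(query_vector, vec) for text, vec in stored.items()}
best_match = max(scores, key=scores.get)
print(best_match)
```

Vectors that point in nearly the same direction score close to 1, while unrelated ones score close to 0.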

In [4]:
!pip install langchain langchain-community langchain-text-splitters langchain-huggingface chromadb sentence-transformers

Collecting langchain-huggingface
  Downloading langchain_huggingface-1.2.0-py3-none-any.whl.metadata (2.8 kB)
Downloading langchain_huggingface-1.2.0-py3-none-any.whl (30 kB)
Installing collected packages: langchain-huggingface
Successfully installed langchain-huggingface-1.2.0


In [5]:
!pip install tf-keras

Collecting tf-keras
  Downloading tf_keras-2.20.1-py3-none-any.whl.metadata (1.8 kB)
Downloading tf_keras-2.20.1-py3-none-any.whl (1.7 MB)
Installing collected packages: tf-keras
Successfully installed tf-keras-2.20.1


In [1]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
import gc

# Clean up memory from previous runs
gc.collect()
torch.cuda.empty_cache()

## 1. Prepare "Secret" Data
We create a fictional document containing facts the LLM cannot have seen during training.
If the model later answers questions about it correctly, the answer must have come from retrieval.

In [2]:
# A fictional document
secret_document = """
Project "Crystal Weaver" is a top-secret initiative by the Antigravity Corporation started in 2042.
The goal of Crystal Weaver is to synthesize edible data crystals that allow humans to learn Python instantly by eating them.
The lead scientist is Dr. Xylar, who prefers to wear a neon green lab coat.
Unexpected side effects of eating the crystals include speaking in SQL queries during sleep and a craving for silicon chips.
"""

# 1. Split text into chunks (simulating a large document)
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=200,
    chunk_overlap=50
)
texts = text_splitter.split_text(secret_document)

print(f"Split into {len(texts)} chunks.")
print("Sample chunk:", texts[0])

Split into 4 chunks.
Sample chunk: Project "Crystal Weaver" is a top-secret initiative by the Antigravity Corporation started in 2042.
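
To see what `chunk_size` and `chunk_overlap` control, here is a simplified sketch that cuts text into fixed-size character windows. It is not how `RecursiveCharacterTextSplitter` works internally: that class first splits on separators (paragraph breaks, newlines, spaces) and then merges pieces up to `chunk_size`, which is why the chunks above line up with whole sentences. The helper name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
for chunk in sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4):
    print(chunk)
```

Each window repeats the last 4 characters of the previous one, so a fact that straddles a boundary still appears intact in at least one chunk.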


## 2. Initialize Vector Database (ChromaDB)
We use a small embedding model to turn each chunk into a vector, then store the vectors in an in-memory ChromaDB collection.

In [5]:
print("Loading embedding model...")
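# Pin embeddings to the CPU to save VRAM for the LLM
# (without an explicit device, sentence-transformers uses the GPU when one is available)
embedding_model = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    model_kwargs={"device": "cpu"},
)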

print("Creating Vector DB...")
# Create ChromaDB instance in memory
vector_db = Chroma.from_texts(
    texts=texts,
    embedding=embedding_model,
    collection_name="secret_project"
)

print("Data ingested!")

Loading embedding model...
Creating Vector DB...
Data ingested!


## 3. Test Retrieval
Let's ask a question and see if we can find the right chunk.

In [6]:
query = "What are the side effects of Crystal Weaver?"
docs = vector_db.similarity_search(query, k=2) # Get top 2 matches

print("--- Retrieved Context ---")
for i, doc in enumerate(docs):
    print(f"Content {i+1}: {doc.page_content}")

--- Retrieved Context ---
Content 1: Unexpected side effects of eating the crystals include speaking in SQL queries during sleep and a craving for silicon chips.
Content 2: The goal of Crystal Weaver is to synthesize edible data crystals that allow humans to learn Python instantly by eating them.
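
Note that `similarity_search` always returns `k` chunks, even when some of them are only loosely related to the question. LangChain's Chroma wrapper also provides `similarity_search_with_score`, which returns `(document, score)` pairs where the score is a distance (lower means closer). The sketch below shows one way such scores could be used to drop weak matches. It works on plain `(text, distance)` tuples, and both the distances and the `max_distance` cutoff are made-up values for illustration.

```python
def filter_by_distance(scored_chunks, max_distance):
    """Keep chunks whose distance to the query is at most max_distance, closest first."""
    kept = [(text, dist) for text, dist in scored_chunks if dist <= max_distance]
    return sorted(kept, key=lambda pair: pair[1])

# Hypothetical (text, distance) pairs; the numbers are not from a real run.
scored = [
    ("chunk about side effects", 0.62),
    ("chunk about the lab coat", 1.48),
]

for text, dist in filter_by_distance(scored, max_distance=1.0):
    print(f"{dist:.2f}  {text}")
```

A sensible cutoff depends on the embedding model and the distance metric, so it is best chosen by inspecting scores on a few known questions.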


## 4. Load the LLM (Again)
We reload our 4-bit Qwen model to generate the final answer.

In [7]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4", 
    bnb_4bit_compute_dtype=torch.float16
)

model_id = "Qwen/Qwen2.5-1.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map="auto")
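
As a rough worked example of why 4-bit loading helps, the arithmetic below estimates the memory taken by the weights alone. It assumes about 1.5 billion parameters (read off the model name) and ignores quantization constants, layers kept in higher precision, activations, and the KV cache, so real usage will be higher.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate memory needed for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3

# Assumption: roughly 1.5 billion parameters, taken from the model name.
num_params = 1.5e9

fp16_gib = weight_memory_gib(num_params, 16)
nf4_gib = weight_memory_gib(num_params, 4)

print(f"fp16 weights:  ~{fp16_gib:.2f} GiB")
print(f"4-bit weights: ~{nf4_gib:.2f} GiB")
```

Under these assumptions the 4-bit weights need about a quarter of the fp16 footprint.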

## 5. The RAG Loop (Retrieval + Generation)
We combine everything into a function.

In [8]:
def query_rag(question):
    # 1. Retrieve
    docs = vector_db.similarity_search(question, k=2)
    context_text = "\n".join([doc.page_content for doc in docs])
    
    # 2. Augment (Create Prompt)
    # We tell the model to ONLY use the context.
    prompt_template = [
        {"role": "system", "content": "You are a helpful assistant. Answer the user's question STRICTLY using the provided context. If the answer is not in the context, say 'I don't know'."},
        {"role": "user", "content": f"Context:\n{context_text}\n\nQuestion: {question}"}
    ]
    
    text = tokenizer.apply_chat_template(prompt_template, tokenize=False, add_generation_prompt=True)
    model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
    
    # 3. Generate
    with torch.no_grad():
        generated_ids = model.generate(**model_inputs, max_new_tokens=150)
        
    # Strip the prompt tokens so that only the newly generated answer is decoded
    generated_ids = [
        output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]
    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
    return response

# Test it!
q1 = "Who is the lead scientist of Project Crystal Weaver?"
print(f"Q: {q1}")
print(f"A: {query_rag(q1)}")

print("-" * 20)

q2 = "What happens if you eat the crystals?"
print(f"Q: {q2}")
print(f"A: {query_rag(q2)}")

Q: Who is the lead scientist of Project Crystal Weaver?
A: I don't have enough information to determine who the lead scientist of Project Crystal Weaver is. The provided context does not mention any specific scientists or their roles within the project. Therefore, I cannot answer this question based on the given information.
--------------------
Q: What happens if you eat the crystals?
A: If you eat the crystals, unexpected side effects such as speaking in SQL queries during sleep and a craving for silicon chips occur.
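
The first answer is a refusal even though the document names Dr. Xylar. The model reports that its context mentions no scientist, which suggests (this run does not verify it) that the chunk naming Dr. Xylar was not among the top 2 retrieved chunks; that chunk never mentions "Crystal Weaver". One way to check is to see where the expected fact lands in the ranking. The sketch below uses a hypothetical helper, `rank_of_fact`, and a placeholder ranking. In this notebook the ranked texts could instead come from `vector_db.similarity_search(question, k=4)`.

```python
def rank_of_fact(ranked_chunks, expected_text):
    """Return the 1-based rank of the first chunk containing expected_text, or None."""
    for rank, chunk in enumerate(ranked_chunks, start=1):
        if expected_text.lower() in chunk.lower():
            return rank
    return None

# Placeholder ranking, ordered from most to least similar (made up for illustration).
ranked = [
    "chunk that mentions the project name",
    "chunk that mentions the project goal",
    "chunk that names Dr. Xylar",
]
k = 2

rank = rank_of_fact(ranked, "Dr. Xylar")
if rank is None:
    print("The fact is not in any retrieved chunk.")
elif rank > k:
    print(f"The fact is at rank {rank}, outside the top {k}: raise k or rephrase the question.")
else:
    print(f"The fact is at rank {rank}, inside the top {k}.")
```

If the fact turns out to sit just outside the top `k`, raising `k` or using larger chunks that keep the scientist and the project name together are two options worth trying.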
