In [None]:

### Key Features Implemented:
- **Model**: `unsloth/Qwen2.5-3B-bnb-4bit`, pre-quantized with Unsloth's optimized 4-bit (NF4) format
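- **Dynamic Quantization**: Unsloth's dynamic 4-bit quants selectively keep higher precision for critical parameters (e.g., attention weights, outliers) while compressing the rest to 4 bits → better accuracy than standard bitsandbytes 4-bit, at the cost of slightly more VRAM. Unsloth publishes these quants under repository names ending in `-unsloth-bnb-4bit`, so the `-bnb-4bit` checkpoint used here may be a standard bitsandbytes export; check the model card.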
- **Domain-specific documents**: Custom text files on Python, Machine Learning, and RAG
- **Chunking**: 200-word chunks with 50-word overlap for better context retention
- **Embedding**: `all-MiniLM-L6-v2` sentence transformer
- **Vector Store**: FAISS (FlatL2 for exact search on small corpus)
- **Retrieval**: Top-k similar chunks fetched and injected into prompt
- **Generation**: Grounded responses using the quantized LLM
- **Memory Optimization**: Designed to run efficiently on free Colab T4 GPU (~6-7GB VRAM peak)

**Why Unsloth Dynamic 4-bit?**
Standard 4-bit quantization (bitsandbytes) compresses all weights uniformly → can hurt accuracy.
Unsloth uses **dynamic scaling**: it detects outlier weights and keeps them in higher precision → close to FP16 accuracy at roughly a 4-bit memory footprint.
Result: a 3B model runs in ~5GB VRAM, versus roughly 6GB for the FP16 weights alone (3B parameters × 2 bytes) or 12+GB in FP32.
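
As a back-of-the-envelope check on these figures, weight storage scales with parameter count times bits per weight. The sketch below assumes a nominal 3 billion parameters (an assumption, not a measured count) and ignores quantization constants, activations, the KV cache, and CUDA context overhead, so real usage will be higher than these numbers.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes), ignoring all overheads."""
    return num_params * bits_per_weight / 8 / 1e9

NUM_PARAMS = 3e9  # assumption: nominal parameter count for a "3B" model

for label, bits in [("FP32", 32), ("FP16", 16), ("4-bit", 4)]:
    print(f"{label:>5}: {weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```

Under these assumptions the weights alone come to 12.0 GB in FP32, 6.0 GB in FP16, and 1.5 GB at 4 bits; the rest of the observed footprint comes from the overheads listed above.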

In [None]:
from huggingface_hub import login
login()  # Prompts for your Hugging Face token; avoids hard-coding a secret in the notebook

In [14]:
from unsloth import FastLanguageModel
import torch

model_name = "unsloth/Qwen2.5-3B-bnb-4bit"

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = model_name,
    max_seq_length = 2048,  # Max input length
    dtype = torch.float16,  # FP16 compute dtype (the T4 does not support bfloat16)
    load_in_4bit = True,    # Load the 4-bit quantized weights
)

# Enable faster inference
FastLanguageModel.for_inference(model)

print("Model loaded successfully!")

==((====))==  Unsloth 2026.1.2: Fast Qwen2 patching. Transformers: 4.57.3.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.1+cu128. CUDA: 7.5. CUDA Toolkit: 12.8. Triton: 3.5.1
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.33.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Model loaded successfully!


In [15]:
!mkdir docs

# Sample document 1: python.txt
with open("docs/python.txt", "w") as f:
    f.write("""
Python is a high-level programming language known for its simplicity and readability.
It is widely used in web development, data analysis, artificial intelligence, and automation.
Key libraries include NumPy for numerical computations, Pandas for data manipulation, and TensorFlow for machine learning.
Python's syntax emphasizes code readability with indentation instead of braces.
""")

# Sample document 2: machine_learning.txt
with open("docs/machine_learning.txt", "w") as f:
    f.write("""
Machine learning is a subset of artificial intelligence that enables systems to learn from data.
Common types include supervised learning (e.g., regression, classification), unsupervised learning (e.g., clustering), and reinforcement learning.
Algorithms like decision trees, neural networks, and support vector machines are fundamental.
Overfitting is a common issue where models perform well on training data but poorly on new data.
""")

# Sample document 3: rag.txt (about RAG itself, for testing)
with open("docs/rag.txt", "w") as f:
    f.write("""
Retrieval-Augmented Generation (RAG) combines retrieval from external documents with generative AI models.
It improves accuracy by grounding responses in real data, reducing hallucinations.
Steps include indexing documents, embedding queries, retrieving chunks, and generating responses.
Unsloth's 4-bit quantization makes RAG efficient on limited hardware.
""")

print("Documents created! You can add more by uploading to the 'docs' folder.")

mkdir: cannot create directory ‘docs’: File exists
Documents created! You can add more by uploading to the 'docs' folder.
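
The `mkdir` message above is harmless; it only means the cell was re-run after `docs` already existed. One way to make the cell idempotent is `os.makedirs(..., exist_ok=True)`. The sketch below uses a temporary directory (an assumption made so it does not touch the notebook's real `docs` folder).

```python
import os
import tempfile

# Throwaway base directory so this sketch leaves the real docs/ folder alone.
base = tempfile.mkdtemp()
docs_dir = os.path.join(base, "docs")

os.makedirs(docs_dir, exist_ok=True)  # creates the folder
os.makedirs(docs_dir, exist_ok=True)  # second call is a no-op and raises no error

with open(os.path.join(docs_dir, "python.txt"), "w", encoding="utf-8") as f:
    f.write("Python is a high-level programming language.\n")

print(sorted(os.listdir(docs_dir)))
```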


In [16]:
from sentence_transformers import SentenceTransformer
import os

# Load embedding model
embedder = SentenceTransformer('all-MiniLM-L6-v2')

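# Word-based chunking with overlap between consecutive chunks
def chunk_text(text, chunk_size=200, overlap=50):
    if not 0 <= overlap < chunk_size:
        # overlap >= chunk_size would make the window stall or move backwards
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    chunks = []
    step = chunk_size - overlap  # Move forward by chunk_size minus overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # This chunk reached the end; skip a redundant tail chunk
    return chunks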

# Read and chunk all documents
texts = []
sources = []  # Track which file each chunk came from

for file in sorted(os.listdir("docs")):  # sorted for a reproducible chunk order
    path = os.path.join("docs", file)
    if not file.endswith(".txt") or not os.path.isfile(path):
        continue  # skip folders and non-text files such as .ipynb_checkpoints
    with open(path, "r", encoding="utf-8") as f:
        content = f.read()
    chunks = chunk_text(content, chunk_size=200, overlap=50)
    texts.extend(chunks)
    sources.extend([file] * len(chunks))

embeddings = embedder.encode(texts)

print(f"Created {len(texts)} chunks (with 50-word overlap) from {len(set(sources))} domain-specific documents.")
print("Domain: Artificial Intelligence, Python, and RAG systems")

Created 3 chunks (with 50-word overlap) from 3 domain-specific documents.
Domain: Artificial Intelligence, Python, and RAG systems
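
Each sample document is far shorter than 200 words, so every file yields exactly one chunk and the overlap never comes into play. To see the sliding window at work, the sketch below applies the same stepping rule to a hypothetical 450-word document; the helper `chunk_words` and the placeholder words are illustrative, not part of the notebook.

```python
def chunk_words(words, chunk_size=200, overlap=50):
    """Sliding window over a word list; each window starts
    chunk_size - overlap words after the previous one."""
    step = chunk_size - overlap
    return [words[i:i + chunk_size] for i in range(0, len(words), step)]

# Hypothetical 450-word document: w0, w1, ..., w449
words = [f"w{i}" for i in range(450)]
chunks = chunk_words(words)

for n, chunk in enumerate(chunks):
    print(f"chunk {n}: {chunk[0]}..{chunk[-1]} ({len(chunk)} words)")

# Consecutive chunks share exactly `overlap` words.
shared = set(chunks[0]) & set(chunks[1])
print(len(shared))
```

With these settings the windows start at words 0, 150, and 300, and the first two chunks share words `w150` through `w199`.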


In [17]:
!pip install faiss-cpu



In [18]:
import faiss
import numpy as np

# Convert embeddings to numpy float32
embeddings = np.array(embeddings).astype('float32')

# Get dimension
dimension = embeddings.shape[1]

# Use simple FlatL2 index (best for small number of documents/chunks)
index = faiss.IndexFlatL2(dimension)

# Add embeddings directly (no training needed)
index.add(embeddings)

print(f"FAISS index built successfully with {len(texts)} chunks! 🎉")
print("Using exact search (FlatL2) – perfect for small/medium document sets.")

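# Retrieval function: embed the query and return the top_k nearest chunks
def retrieve_chunks(query, top_k=3):
    query_emb = embedder.encode([query]).astype('float32')
    top_k = min(top_k, index.ntotal)  # FAISS pads results with -1 when top_k exceeds the index size
    distances, indices = index.search(query_emb, top_k)
    retrieved = [texts[i] for i in indices[0]]
    return retrieved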

print("Retrieval ready!")

FAISS index built successfully with 3 chunks! 🎉
Using exact search (FlatL2) – perfect for small/medium document sets.
Retrieval ready!
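
To make "exact search" concrete, here is a minimal NumPy sketch of what a flat L2 index computes: the squared Euclidean distance from the query to every stored vector, followed by a sort. The function name `flat_l2_search` and the toy 2-D vectors are made up for illustration. Unlike FAISS, which pads with `-1` when `top_k` exceeds the number of stored vectors, this sketch simply clamps `top_k`.

```python
import numpy as np

def flat_l2_search(vectors: np.ndarray, queries: np.ndarray, top_k: int):
    """Brute-force nearest neighbours by squared L2 distance."""
    # Pairwise squared distances, shape (num_queries, num_vectors)
    diffs = queries[:, None, :] - vectors[None, :, :]
    sq_dists = (diffs ** 2).sum(axis=-1)
    top_k = min(top_k, vectors.shape[0])
    indices = np.argsort(sq_dists, axis=1)[:, :top_k]
    distances = np.take_along_axis(sq_dists, indices, axis=1)
    return distances, indices

# Toy 2-D "embeddings" (made-up values, purely illustrative)
vectors = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]], dtype="float32")
query = np.array([[0.9, 0.1]], dtype="float32")

distances, indices = flat_l2_search(vectors, query, top_k=2)
print(indices)
print(distances)
```

Because every stored vector is compared, cost grows linearly with corpus size, which is acceptable for a handful of chunks but is the reason approximate indexes exist for larger collections.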


In [19]:
def rag_generate(query, top_k=3, max_tokens=200):
    # Retrieve and display chunks (retrieve_and_show is defined in the next cell,
    # so run that cell before calling rag_generate)
    chunks = retrieve_and_show(query, top_k)
    context = "\n\n".join(chunks)

    prompt = f"""Use ONLY the following context to answer the question. Be accurate, concise, and professional.

Context:
{context}

Question: {query}
Answer:"""

    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_tokens,
        temperature=0.7,
        do_sample=True
    )
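    # Decode only the newly generated tokens; slicing the decoded string by
    # len(prompt) breaks if decoding does not reproduce the prompt exactly
    prompt_len = inputs["input_ids"].shape[1]
    answer = tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True).strip()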

    return answer

print("Full RAG pipeline ready!")

Full RAG pipeline ready!
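
The prompt assembly can also be pulled out into a small function so it can be inspected without loading the model. The sketch below is one possible variation, not the notebook's method: the helper `build_prompt` is hypothetical, and tagging each chunk with its source file name is an added design choice intended to make grounded answers easier to trace.

```python
def build_prompt(query, chunks, chunk_sources):
    """Assemble a grounded prompt, tagging each chunk with its source file."""
    context = "\n\n".join(
        f"[{src}] {chunk.strip()}" for src, chunk in zip(chunk_sources, chunks)
    )
    return (
        "Use ONLY the following context to answer the question. "
        "Be accurate, concise, and professional.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n"
        "Answer:"
    )

prompt = build_prompt(
    "What is Python used for?",
    ["Python is widely used in web development and data analysis."],
    ["python.txt"],
)
print(prompt)
```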


In [20]:
# Enhanced retrieval with visibility
def retrieve_and_show(query, top_k=3):
    query_emb = embedder.encode([query]).astype('float32')
    top_k = min(top_k, index.ntotal)  # avoid -1 padding when top_k exceeds the index size
    distances, indices = index.search(query_emb, top_k)

    print(f"\n🔍 Query: {query}")
    print(f"📚 Retrieved {top_k} most relevant chunks:\n")
    retrieved_chunks = []
    for idx, dist in zip(indices[0], distances[0]):
        chunk = texts[idx]
        source = sources[idx]
        print(f"Source: {source} | Distance: {dist:.4f}")
        print(f"Chunk: {chunk.strip()}\n")
        print("-" * 80)
        retrieved_chunks.append(chunk)
    return retrieved_chunks

In [21]:
# Test queries
queries = [
    "What is Python used for?",
    "Explain machine learning in simple terms.",
    "How does Retrieval-Augmented Generation work?",
    "What are key Python libraries for AI?"
]

for q in queries:
    print(f"\nQuestion: {q}")
    print(f"Answer: {rag_generate(q)}\n")
    print("-" * 50)


Question: What is Python used for?

🔍 Query: What is Python used for?
📚 Retrieved 3 most relevant chunks:

Source: python.txt | Distance: 0.3753
Chunk: Python is a high-level programming language known for its simplicity and readability. It is widely used in web development, data analysis, artificial intelligence, and automation. Key libraries include NumPy for numerical computations, Pandas for data manipulation, and TensorFlow for machine learning. Python's syntax emphasizes code readability with indentation instead of braces.

--------------------------------------------------------------------------------
Source: rag.txt | Distance: 1.6343
Chunk: Retrieval-Augmented Generation (RAG) combines retrieval from external documents with generative AI models. It improves accuracy by grounding responses in real data, reducing hallucinations. Steps include indexing documents, embedding queries, retrieving chunks, and generating responses. Unsloth's 4-bit quantization makes RAG efficie
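
A note on reading the `Distance` values: `IndexFlatL2` reports squared Euclidean distance, so smaller means more similar. If the embeddings are unit-length (an assumption here; it should hold when the embedding model normalizes its output), the squared distance maps directly to cosine similarity via ||a - b||^2 = 2 - 2·cos(a, b). The sketch below applies that identity to the two distances printed above and checks it on random unit vectors.

```python
import numpy as np

def cosine_from_squared_l2(sq_dist: float) -> float:
    """For unit-length vectors, ||a - b||^2 = 2 - 2 * cos(a, b)."""
    return 1.0 - sq_dist / 2.0

# Distances taken from the retrieval output above
for source, dist in [("python.txt", 0.3753), ("rag.txt", 1.6343)]:
    print(f"{source}: cosine ~ {cosine_from_squared_l2(dist):.3f}")

# Sanity check of the identity on two random unit vectors
rng = np.random.default_rng(0)
a, b = rng.normal(size=(2, 384))
a, b = a / np.linalg.norm(a), b / np.linalg.norm(b)
sq = float(((a - b) ** 2).sum())
print(np.isclose(cosine_from_squared_l2(sq), float(a @ b)))
```

Under that assumption, the `python.txt` chunk sits at a cosine similarity of about 0.81 to the query, while the `rag.txt` chunk is at about 0.18, which suggests the second hit is filler returned only because `top_k` is 3.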

In [22]:
# VRAM Monitoring
import torch

def print_vram():
    if torch.cuda.is_available():
        print(f"GPU: {torch.cuda.get_device_name(0)}")
        print(f"VRAM Allocated: {torch.cuda.memory_allocated(0)/1024**3:.2f} GB")
        print(f"VRAM Reserved:  {torch.cuda.memory_reserved(0)/1024**3:.2f} GB")
    else:
        print("No GPU detected")

print("=== VRAM Usage After Model Load ===")
print_vram()

# Run !nvidia-smi for full view
!nvidia-smi

=== VRAM Usage After Model Load ===
GPU: Tesla T4
VRAM Allocated: 4.12 GB
VRAM Reserved:  4.27 GB
Thu Jan  8 15:47:42 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   79C    P0             69W /   70W |    4520MiB /  15360MiB |     42%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------
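
The three memory figures above measure different things. `print_vram()` divides by `1024**3`, so its "GB" values are GiB, and they cover only PyTorch's caching allocator, whereas `nvidia-smi` reports everything resident on the device. A short arithmetic sketch reconciles them; attributing the remainder to the CUDA context and other non-PyTorch allocations is an assumption, not something the output proves.

```python
MIB_PER_GIB = 1024

nvidia_smi_used_mib = 4520   # "4520MiB" from the nvidia-smi table
torch_reserved_gib = 4.27    # "VRAM Reserved" printed by print_vram()
torch_allocated_gib = 4.12   # "VRAM Allocated" printed by print_vram()

nvidia_smi_used_gib = nvidia_smi_used_mib / MIB_PER_GIB
outside_allocator_gib = nvidia_smi_used_gib - torch_reserved_gib
cached_but_free_gib = torch_reserved_gib - torch_allocated_gib

print(f"nvidia-smi total:       {nvidia_smi_used_gib:.2f} GiB")
print(f"outside PyTorch's pool: {outside_allocator_gib:.2f} GiB")
print(f"reserved but unused:    {cached_but_free_gib:.2f} GiB")
```

These are point-in-time readings, so they do not capture the peak reached during generation.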