<a href="https://colab.research.google.com/github/Akhilesh-Banke/Basic_RAG/blob/main/First_RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#  My First RAG System: From Scratch
> A hands-on Retrieval Augmented Generation (RAG) project built using only open-source tools.

---

### Overview
In this notebook, I'll build my own **RAG (Retrieval Augmented Generation)** system step-by-step.

Goal: Reduce LLM hallucinations by grounding answers in real data (my data!).

I'll cover:
1.  Loading custom knowledge
2.  Chunking it smartly
3.  Embedding with Sentence Transformers
4.  Storing & retrieving with FAISS
5.  Generating grounded answers

---

**Libraries used:**
- transformers – for our LLM
- sentence-transformers – for embeddings
- faiss-cpu – for vector search
- langchain – for text splitting

In [1]:
!pip install -q transformers sentence-transformers faiss-cpu langchain torch


### **Dependencies**

In [18]:
import os
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer
from transformers import pipeline
from langchain.text_splitter import RecursiveCharacterTextSplitter

### **Our Data**
First, we need some custom knowledge. Let’s load it from the file my_knwledge.txt.

In [19]:
# Load our document
with open("/content/my_knwledge.txt", encoding="utf-8") as f:
    knowledge_text = f.read()

print("✅ Knowledge base loaded.")


✅ Knowledge base loaded.


In [20]:
knowledge_text

'Company Policy Manual:\n- WFH Policy: All employees are eligible for a hybrid WFH schedule. Employees must be in the office on Tuesdays, Wednesdays, and Thursdays. Mondays and Fridays are optional remote days.\n- PTO Policy: Full-time employees receive 20 days of Paid Time Off (PTO) per year. PTO accrues monthly.\n- Tech Stack: The official backend language is Python, and the official frontend framework is React. For mobile development, we use React Native.'

### **Chunking**
We can’t feed the whole book to the model at once. We need to split it into index cards (chunks). Splitting blindly, at fixed positions or on every newline (\n), risks cutting sentences in half or separating related text. Instead, we’ll use a splitter that prefers natural boundaries:

In [21]:

# RecursiveCharacterTextSplitter tries to split on paragraphs ("\n\n"),
# then newlines ("\n"), then spaces (" "), to keep semantically
# related text together as much as possible.

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=20,
    length_function=len
)

# Split into chunks
chunks = text_splitter.split_text(knowledge_text)

print(f"We have {len(chunks)} chunks:\n")
for i, chunk in enumerate(chunks):
    print(f"--- Chunk {i+1} ---\n{chunk}\n")


We have 5 chunks:

--- Chunk 1 ---
Company Policy Manual:

--- Chunk 2 ---
- WFH Policy: All employees are eligible for a hybrid WFH schedule. Employees must be in the office on Tuesdays, Wednesdays, and Thursdays. Mondays

--- Chunk 3 ---
Thursdays. Mondays and Fridays are optional remote days.

--- Chunk 4 ---
- PTO Policy: Full-time employees receive 20 days of Paid Time Off (PTO) per year. PTO accrues monthly.

--- Chunk 5 ---
- Tech Stack: The official backend language is Python, and the official frontend framework is React. For mobile development, we use React Native.



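It broke our file into small pieces, with an overlap where a single line had to be split. Note that chunk 2 still ends mid-sentence ("... Mondays"): the WFH line is longer than our chunk_size of 150 characters, so the splitter fell back to splitting on spaces.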
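
To make the fall-back order concrete, here is a simplified sketch of recursive splitting in plain Python. It is not LangChain's implementation: `recursive_split` is a hypothetical helper, it ignores `chunk_overlap`, and it only illustrates how the splitter moves from coarse separators to finer ones when a piece is still too long.

```python
def recursive_split(text, chunk_size=150, separators=("\n\n", "\n", " ", "")):
    """Simplified recursive character splitting (no overlap handling)."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []

    sep, *rest = separators
    if sep == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    if sep not in text:
        # This separator does not occur: try the next, finer one.
        return recursive_split(text, chunk_size, tuple(rest))

    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate              # keep growing the current chunk
            continue
        if current:
            chunks.append(current)           # flush what we have so far
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, tuple(rest)))
    if current:
        chunks.append(current)
    return [c for c in chunks if c.strip()]


sample = "alpha beta gamma\ndelta epsilon"
for chunk in recursive_split(sample, chunk_size=12):
    print(repr(chunk))
```

With a tiny `chunk_size` of 12, neither line fits, so both are split again on spaces. That is the same thing that happened to the WFH line in our real chunks.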

### **Embeddings**
Now we turn those text chunks into numbers (vectors). We’ll use a popular, lightweight sentence-transformer model that maps sentences with similar meanings to nearby vectors:

In [22]:
from sentence_transformers import SentenceTransformer

# Load the embedding model
# 'all-MiniLM-L6-v2' is a fast, small model that produces 384-dimensional vectors.
# Once the weights are downloaded, it runs inside this runtime with no API calls.
model = SentenceTransformer('all-MiniLM-L6-v2')

# Embed all our chunks
# This will take a moment as it "reads" and "understands" each chunk.
chunk_embeddings = model.encode(chunks)

print(f"Shape of our embeddings: {chunk_embeddings.shape}")

Shape of our embeddings: (5, 384)
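
What does "similar meaning, nearby vectors" look like in numbers? One common way to compare two embeddings is cosine similarity. The sketch below uses made-up 3-dimensional vectors instead of real 384-dimensional model output, and `cosine_similarity` is a hypothetical helper written for illustration, not part of sentence-transformers.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy "embeddings" (invented numbers, not produced by the model).
pto_chunk = [0.9, 0.1, 0.0]
pto_query = [0.8, 0.2, 0.1]
tech_chunk = [0.0, 0.2, 0.9]

print(cosine_similarity(pto_query, pto_chunk))   # close to 1: similar direction
print(cosine_similarity(pto_query, tech_chunk))  # much lower: different direction
```

A query about PTO should land closer to the PTO chunk than to the tech-stack chunk, and that gap is what retrieval relies on.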


### **Vector Store with FAISS**
We have our vectors. Now we need a database that stores them so we can search by similarity. This is where FAISS comes in.

In [23]:
import faiss
import numpy as np

# Get the dimension of our vectors (e.g., 384)
d = chunk_embeddings.shape[1]

# Create a FAISS index
# IndexFlatL2 is the simplest, most basic index. It calculates
# the exact distance (L2 distance) between our query and all vectors.
index = faiss.IndexFlatL2(d)

# Add our chunk embeddings to the index
# We must convert to float32 for FAISS
index.add(np.array(chunk_embeddings).astype('float32'))

print(f"FAISS index created with {index.ntotal} vectors.")

FAISS index created with 5 vectors.


That’s it. We just created an in-memory vector database.
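
Under the hood, a flat L2 index does nothing clever: it compares the query against every stored vector and keeps the closest ones. Here is a plain-Python sketch of that idea, assuming toy 2-dimensional vectors. `flat_l2_search` is a hypothetical helper, not a FAISS function. It returns squared L2 distances, which is also what `IndexFlatL2` reports.

```python
def flat_l2_search(vectors, query, k):
    """Brute-force nearest-neighbour search by squared L2 distance."""
    scored = []
    for i, vec in enumerate(vectors):
        dist = sum((v - q) ** 2 for v, q in zip(vec, query))
        scored.append((dist, i))
    scored.sort()                      # smallest distance first
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]


# Toy vectors standing in for chunk embeddings.
stored = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
distances, indices = flat_l2_search(stored, query=[0.9, 1.2], k=2)
print(indices)    # [1, 0]
```

For five chunks this exhaustive scan is instant. FAISS becomes interesting when the index holds millions of vectors and approximate index types are needed.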

### **Retrieve, Augment, Generate**
This is the final part: the user asks a question, and we trace it through the full pipeline:

In [24]:
from transformers import pipeline

# Load a text-to-text generation model.
# We'll use a small, instruction-tuned model from Google (FLAN-T5 small).
generator = pipeline('text2text-generation', model='google/flan-t5-small')

# --- This is our RAG pipeline function ---
def answer_question(query):
    # 1. RETRIEVE
    # Embed the user's query
    query_embedding = model.encode([query]).astype('float32')

    # Search the FAISS index for the top k (e.g., k=2) most similar chunks
    k = 2
    distances, indices = index.search(query_embedding, k)

    # Get the actual text chunks from our original 'chunks' list
    retrieved_chunks = [chunks[i] for i in indices[0]]
    context = "\n\n".join(retrieved_chunks)

    # 2. AUGMENT
    # This is the "magic prompt." We combine the retrieved context
    # with the user's query.
    prompt_template = f"""
    Answer the following question using *only* the provided context.
    If the answer is not in the context, say "I don't have that information."

    Context:
    {context}

    Question:
    {query}

    Answer:
    """

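    # 3. GENERATE
    # Feed the augmented prompt to our generative model.
    # max_new_tokens caps the answer length; passing max_length instead
    # conflicts with the model's default max_new_tokens and is ignored.
    answer = generator(prompt_template, max_new_tokens=100)
    print(f"--- CONTEXT ---\n{context}\n")
    return answer[0]['generated_text']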
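
One detail worth noticing: the triple-quoted f-string above carries the function's indentation into the prompt, so every template line starts with four spaces. A small helper is one way to keep the prompt flush-left and easy to test on its own. `build_prompt` is a hypothetical name, and I have not measured whether the cleaner prompt changes the model's answers.

```python
def build_prompt(context, query):
    """Assemble the RAG prompt without leading indentation."""
    return (
        "Answer the following question using only the provided context.\n"
        'If the answer is not in the context, say "I don\'t have that information."\n\n'
        f"Context:\n{context}\n\n"
        f"Question:\n{query}\n\n"
        "Answer:"
    )


# Example with a made-up context string.
print(build_prompt("PTO Policy: 20 days per year.", "How many PTO days do I get?"))
```

Inside `answer_question`, the call would then be `generator(build_prompt(context, query), ...)`.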



Now, let’s ask our system some questions:

In [25]:
query_1 = "What is the WFH policy?"
print(f"Query: {query_1}")
print(f"Answer: {answer_question(query_1)}\n")

Query: What is the WFH policy?



--- CONTEXT ---
- WFH Policy: All employees are eligible for a hybrid WFH schedule. Employees must be in the office on Tuesdays, Wednesdays, and Thursdays. Mondays

Company Policy Manual:

Answer: All employees are eligible for a hybrid WFH schedule. Employees must be in the office on Tuesdays, Wednesdays, and Thursdays. Mondays Company Policy Manual:



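It worked, mostly! The top hit is the WFH chunk, and the model answered from it rather than guessing. The answer is incomplete, though: it stops at "Mondays" and tacks on the "Company Policy Manual:" header, because the second retrieved chunk was the header rather than chunk 3, which holds the rest of the sentence. A larger chunk_size or a larger k would likely fix this.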

Now, let’s ask a question the context cannot answer:

In [26]:
query_2 = "What is the company's dental plan?"
print(f"Query: {query_2}")
print(f"Answer: {answer_question(query_2)}\n")

Query: What is the company's dental plan?
--- CONTEXT ---
Company Policy Manual:

- WFH Policy: All employees are eligible for a hybrid WFH schedule. Employees must be in the office on Tuesdays, Wednesdays, and Thursdays. Mondays

Answer: I don't have that information.



This is critical. Because our prompt tells the model to use only the provided context, the LLM didn’t hallucinate here. It correctly stated that it doesn’t have that information, even though the retrieved chunks had nothing to do with the question.
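
The prompt is our only safeguard so far, and the retriever always returns k chunks, relevant or not. A possible extra guard, sketched below, is to drop chunks whose distance is too large and skip the LLM call when nothing is left. Everything here is an assumption for illustration: `select_relevant`, `answer_or_refuse`, and the `max_distance` value are hypothetical, and a real threshold would have to be tuned on actual queries.

```python
NO_ANSWER = "I don't have that information."


def select_relevant(chunks, distances, indices, max_distance):
    """Keep only retrieved chunks whose distance is within max_distance."""
    return [chunks[i] for d, i in zip(distances, indices) if d <= max_distance]


def answer_or_refuse(relevant_chunks, generate):
    """Call generate(context) only when at least one chunk survived the filter."""
    if not relevant_chunks:
        return NO_ANSWER               # nothing relevant: skip the LLM call
    return generate("\n\n".join(relevant_chunks))


# Toy data: distances and indices shaped like one row of a search result.
toy_chunks = ["WFH policy text", "PTO policy text", "Tech stack text"]
kept = select_relevant(toy_chunks, distances=[0.4, 1.7], indices=[1, 0], max_distance=1.0)
print(kept)                                            # ['PTO policy text']
print(answer_or_refuse([], generate=lambda ctx: ctx))  # I don't have that information.
```

With the FAISS index from earlier, the inputs would be `distances[0]` and `indices[0]` from `index.search`.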

### **Summary:**
Take a step back. What we just built in a few dozen lines of Python is the foundation of many modern AI applications. It addresses three common problems with LLMs:

- **Hallucinations:** We grounded the model's answers in retrieved text.
- **Stale Knowledge:** We can update the knowledge! We just have to re-run the indexing on new documents.
- **Data Privacy:** No data was sent to an external LLM API. The embedding model and the LLM both ran inside the notebook's own runtime (a Colab VM here; the same code runs on a local machine).