<a href="https://colab.research.google.com/github/Priyanshu-Srivastva/RAG-demo-Hugging-Face/blob/main/RAG_demo_Hugging_Face_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Retrieval-Augmented Generation**

**Install dependencies**

The following libraries set up the environment for the project:-

FAISS :- fast similarity search over dense vectors.

Sentence-Transformers :- converts text into embeddings.

Transformers :- provides pre-trained models (pinned to version 4.44.2 in the install command).

Gradio :- builds the user interface to demo the model.

tqdm :- shows progress bars for long-running loops.

In [1]:
# Install required libraries
!pip install -q faiss-cpu sentence-transformers transformers==4.44.2 gradio tqdm


**Imports & device detection**

This section imports the required libraries and sets up the environment. It covers:-

File handling (os, glob, pathlib, json).

Numerical arrays and progress bars (numpy, tqdm).

Deep learning (PyTorch, SentenceTransformers, Hugging Face Transformers).

Similarity search (FAISS).

Device selection (GPU if available, else CPU).

os :- Provides functions to interact with the operating system (like file paths, directories, environment variables).

glob :- Helps find all file paths matching a specific pattern (like *.txt).

SentenceTransformer :- Loads pre-trained models that convert text into semantic embeddings (vectors).

faiss :- Facebook AI Similarity Search library, used for fast similarity search and clustering in embedding space.

pipeline (from Hugging Face Transformers) :- Provides ready-to-use NLP pipelines (e.g., summarization, sentiment analysis, question answering).

In [2]:
import os
import glob
import json
from pathlib import Path
from typing import List
from tqdm.auto import tqdm

import numpy as np
import torch

from sentence_transformers import SentenceTransformer
import faiss
from transformers import pipeline

# Device info
USE_CUDA = torch.cuda.is_available()
device = 0 if USE_CUDA else -1
print("CUDA available:", USE_CUDA)


CUDA available: True


**Load / upload dataset**

This block:-

Uploads a zip file (countries.zip).

Extracts it in Colab.

Reads all .txt files inside.

Stores their content in documents for later processing.

Prints confirmation with previews.

In [3]:
import zipfile

from google.colab import files

uploaded = files.upload()   # Select countries.zip

zip_path = "countries.zip"  # uploaded file name
extract_path = "/content/"

with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall(extract_path)

print("Extracted files to:", extract_path)
print("Contents:", os.listdir(extract_path))

documents = []
for fname in os.listdir(extract_path):
    if fname.endswith(".txt"):
        with open(os.path.join(extract_path, fname), "r", encoding="utf-8") as f:
            text = f.read()
            documents.append((fname, text))

print(f"Loaded {len(documents)} documents")
for name, text in documents:
    print(" -", name, ":", text[:80], "...")


Saving countries.zip to countries.zip
Extracted files to: /content/
Contents: ['.config', 'China.txt', 'India.txt', 'countries.zip', 'Japan.txt', 'Russia.txt', 'USA.txt', 'sample_data']
Loaded 5 documents
 - China.txt : China,[i] officially the People's Republic of China (PRC),[j] is a country in Ea ...
 - India.txt : India, officially the Republic of India,[j][20] is a country in South Asia. It i ...
 - Japan.txt : Japan[a] is an island country in East Asia. Located in the Pacific Ocean off the ...
 - Russia.txt : Russia,[b] or the Russian Federation,[c] is a country spanning Eastern Europe an ...
 - USA.txt : The United States of America (USA), also known as the United States (U.S.) or Am ...
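
The same loading step can also be written with `pathlib`, which is imported above. The sketch below is one possible variant, not the notebook's own loader: the helper name `load_text_files` and the demo file names are hypothetical, and sorting by file name is an added choice that makes the document order reproducible.

```python
import tempfile
from pathlib import Path
from typing import List, Tuple

def load_text_files(folder: str) -> List[Tuple[str, str]]:
    """Return (file name, text) pairs for every .txt file in `folder`, sorted by name."""
    docs = []
    for path in sorted(Path(folder).glob("*.txt")):
        docs.append((path.name, path.read_text(encoding="utf-8")))
    return docs

# Demo on a throwaway directory with made-up file names and contents.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "b_demo.txt").write_text("second file", encoding="utf-8")
    Path(tmp, "a_demo.txt").write_text("first file", encoding="utf-8")
    Path(tmp, "notes.md").write_text("ignored", encoding="utf-8")  # not a .txt file
    demo_docs = load_text_files(tmp)

print(demo_docs)
```

In the notebook, calling the helper on `extract_path` would produce the same kind of `(fname, text)` list that `documents` holds.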


**Chunking function and building chunks**

This block:-

Defines a chunking function to split text into overlapping word-based segments.

Builds a list of chunks + metadata from all documents.

Prints the total chunk count, plus a preview of each chunk when there are 10 chunks or fewer.

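This step matters because embedding and generation models accept only a limited number of input tokens. Chunking keeps each piece within reach of that limit, and the overlap (40 of every 200 words here) keeps text that straddles a boundary visible in both neighboring chunks.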

In [4]:
# Simple word-based chunker (configurable)
CHUNK_MAX_WORDS = 200
CHUNK_OVERLAP = 40

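def chunk_text(text: str, max_words=CHUNK_MAX_WORDS, overlap=CHUNK_OVERLAP) -> List[str]:
    """Split text into chunks of at most max_words words; consecutive chunks share `overlap` words."""
    if not 0 <= overlap < max_words:
        # Otherwise the window would never advance and the loop would not terminate
        raise ValueError("overlap must satisfy 0 <= overlap < max_words")
    words = text.split()
    chunks = []
    i = 0
    n = len(words)
    while i < n:
        j = min(i + max_words, n)
        chunk = " ".join(words[i:j])
        chunks.append(chunk)
        if j == n:
            break
        # Step back by `overlap` words so the next chunk repeats the tail of this one
        i = j - overlap
    return chunks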

# Build chunks & metadata
chunks = []
meta = []  # list of dicts: {doc, chunk_idx, global_idx}
gidx = 0
for fname, text in documents:
    doc_chunks = chunk_text(text)
    for idx, c in enumerate(doc_chunks):
        chunks.append(c)
        meta.append({"doc": fname, "chunk_idx": idx, "global_idx": gidx})
        gidx += 1

print(f"Total chunks: {len(chunks)}")
if len(chunks) <= 10:
    for i, c in enumerate(chunks):
        print(i, "→", c[:120].replace("\n"," "), "...")


Total chunks: 20
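
To see where the chunk boundaries fall, the sliding-window arithmetic can be traced on word indices alone. This is a minimal sketch that mirrors the loop in `chunk_text`; the helper name `chunk_spans` and the 500-word document length are made up for illustration.

```python
from typing import List, Tuple

def chunk_spans(n_words: int, max_words: int, overlap: int) -> List[Tuple[int, int]]:
    """Return the (start, end) word indices produced by a sliding window with overlap."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must satisfy 0 <= overlap < max_words")
    spans = []
    start = 0
    while start < n_words:
        end = min(start + max_words, n_words)
        spans.append((start, end))
        if end == n_words:
            break
        # The next window starts `overlap` words before the end of this one
        start = end - overlap
    return spans

spans = chunk_spans(500, 200, 40)
print(spans)  # [(0, 200), (160, 360), (320, 500)]
```

Each window advances by `max_words - overlap` = 160 words, so a 500-word text yields three chunks, and words 160 to 199 and 320 to 359 appear in two chunks each.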


**Load embedding model and embed chunks**

This block:-

Loads a pre-trained embedding model (all-MiniLM-L6-v2).

Defines a batch embedding function to avoid memory issues.

Converts all text chunks into dense vector embeddings.

Prepares data for similarity search and NLP tasks.

This is the core transformation step where documents become machine-understandable vectors.

In [5]:
EMBED_MODEL_NAME = "all-MiniLM-L6-v2"  # fast & small

print("Loading embedding model:", EMBED_MODEL_NAME)
embed_model = SentenceTransformer(EMBED_MODEL_NAME, device='cuda' if USE_CUDA else 'cpu')

# Embed in batches to avoid memory spikes
def embed_texts(texts, batch_size=64):
    embs = []
    for i in tqdm(range(0, len(texts), batch_size), desc="Embedding"):
        batch = texts[i:i+batch_size]
        e = embed_model.encode(batch, convert_to_numpy=True, show_progress_bar=False)
        embs.append(e)
    return np.vstack(embs).astype("float32")

embeddings = embed_texts(chunks)
print("Embeddings shape:", embeddings.shape)


Loading embedding model: all-MiniLM-L6-v2



Embeddings shape: (20, 384)
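
The next section indexes these vectors with L2 distance. When vectors have unit length, squared L2 distance and cosine similarity are tied by the identity ‖a − b‖² = 2 − 2·cos(a, b), so both give the same ranking. The sketch below checks the identity on two made-up vectors in plain Python. Whether this model's embeddings are unit length is an assumption worth verifying in the notebook, for example with `np.linalg.norm(embeddings, axis=1)`.

```python
import math

def l2_squared(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

# Two arbitrary illustrative vectors, scaled to unit length
a = normalize([3.0, 4.0, 0.0])
b = normalize([1.0, 2.0, 2.0])

lhs = l2_squared(a, b)
rhs = 2 - 2 * cosine(a, b)
print(math.isclose(lhs, rhs))
```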


**Build FAISS index**

This block:-

Creates a flat FAISS index (IndexFlatL2), which compares a query against every stored vector using L2 distance.

Adds all embeddings to the index.

Saves both the index and metadata to disk for reuse.

This is the retrieval backbone: it turns the dataset into a searchable vector index.

In [6]:
INDEX_FILE = "/content/faiss.index"
META_FILE = "/content/chunks_meta.json"

dim = embeddings.shape[1]
index = faiss.IndexFlatL2(dim)
index.add(embeddings)
print("FAISS index: ntotal =", index.ntotal)

# Save index and metadata
faiss.write_index(index, INDEX_FILE)
with open(META_FILE, "w", encoding="utf-8") as fh:
    json.dump({"chunks": chunks, "meta": meta}, fh, ensure_ascii=False, indent=2)
print("Saved FAISS index and metadata to disk.")


FAISS index: ntotal = 20
Saved FAISS index and metadata to disk.
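
Saving is only useful if the files can be read back in a later session. The sketch below shows the metadata half of that round trip with made-up stand-ins for `chunks` and `meta`, written to a temporary directory rather than `/content/`. The index file itself would be restored separately with `faiss.read_index(INDEX_FILE)`, which is not exercised here.

```python
import json
import tempfile
from pathlib import Path

# Made-up stand-ins for the notebook's `chunks` and `meta` lists
demo_chunks = ["first chunk text", "second chunk text"]
demo_meta = [
    {"doc": "a_demo.txt", "chunk_idx": 0, "global_idx": 0},
    {"doc": "a_demo.txt", "chunk_idx": 1, "global_idx": 1},
]

with tempfile.TemporaryDirectory() as tmp:
    meta_file = Path(tmp) / "chunks_meta.json"

    # Same payload layout as the notebook's save step
    with open(meta_file, "w", encoding="utf-8") as fh:
        json.dump({"chunks": demo_chunks, "meta": demo_meta}, fh, ensure_ascii=False, indent=2)

    # Later: read the payload back
    with open(meta_file, "r", encoding="utf-8") as fh:
        payload = json.load(fh)

restored_chunks = payload["chunks"]
restored_meta = payload["meta"]
print(restored_chunks == demo_chunks, restored_meta == demo_meta)
```

Row `i` of the index corresponds to `chunks[i]` and `meta[i]`, so the two files only make sense when saved and loaded as a pair.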


**Retriever function**

This function:-

Converts a query into an embedding.

Searches FAISS index for the most similar chunks.

Returns both the chunk text and its metadata.

This is the retrieval step in a Retrieval-Augmented Generation (RAG) pipeline — it connects user queries to the right knowledge chunks.

In [7]:
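def retrieve(query: str, top_k: int = 4):
    """Return the top_k chunks closest to the query, each with its metadata and global index."""
    qvec = embed_model.encode([query], convert_to_numpy=True).astype("float32")
    # D holds the distances, I the row indices of the matching chunks
    D, I = index.search(qvec, top_k)
    results = []
    for idx in I[0]:
        # FAISS pads with -1 when fewer than top_k vectors are available
        if idx < 0:
            continue
        results.append({"chunk": chunks[idx], "meta": meta[idx], "idx": int(idx)})
    return results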
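
Conceptually, the `index.search` call inside `retrieve` on a flat L2 index is an exhaustive comparison: compute the squared L2 distance from the query to every stored vector and keep the smallest `top_k`. The plain-Python sketch below illustrates that idea on made-up 2-D vectors. The function name `brute_force_search` is hypothetical, and this is not how FAISS is implemented internally, only what it computes.

```python
from typing import List, Sequence, Tuple

def brute_force_search(
    query: Sequence[float],
    vectors: Sequence[Sequence[float]],
    top_k: int,
) -> List[Tuple[int, float]]:
    """Return (index, squared L2 distance) for the top_k vectors nearest to the query."""
    scored = []
    for idx, vec in enumerate(vectors):
        dist = sum((q - v) ** 2 for q, v in zip(query, vec))
        scored.append((idx, dist))
    # Smallest distance first, as in a nearest-neighbor search
    scored.sort(key=lambda pair: pair[1])
    return scored[:top_k]

# Made-up 2-D "embeddings" and a query that sits closest to vector 1
vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
hits = brute_force_search([0.9, 0.1], vectors, top_k=2)
print([idx for idx, _ in hits])  # [1, 0]
```

With 20 chunks this exhaustive scan is cheap; approximate index types only start to matter for much larger collections.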


**Hugging Face LLM**

This block:-

Loads a text-to-text generation model (google/flan-t5-small by default).

Defines generate_answer, which builds a prompt from the retrieved context and the question, then calls the generator to produce a natural-language answer.

This is the final stage of a RAG pipeline — retrieval (FAISS) + generation (Flan-T5).

In simple words: this is where your system becomes a QA assistant that answers based on your dataset.

In [8]:
# Choose a model:
# - "google/flan-t5-small" -> lightweight, good for Colab free
# - "google/flan-t5-base"  -> better, still reasonable
# - "mistralai/Mistral-7B-Instruct-v0.2" -> much better but requires large GPU / RAM

HF_MODEL = "google/flan-t5-small"   # change here if you want a different HF model

print("Loading HF generator:", HF_MODEL)
# We use text2text-generation pipeline (Flan-T5 expects text2text)
generator = pipeline("text2text-generation", model=HF_MODEL, device=device, max_new_tokens=200, do_sample=False)

def generate_answer(query: str, retrieved, max_new_tokens=200):
    context = "\n\n".join([r["chunk"] for r in retrieved])
    prompt = f"Use the context below to answer the question. If the answer is not contained in the context, say you don't know.\n\nContext:\n{context}\n\nQuestion: {query}\nAnswer:"
    out = generator(prompt, max_new_tokens=max_new_tokens, do_sample=False)
    # pipeline returns list of dicts with 'generated_text'
    return out[0]["generated_text"].strip()


Loading HF generator: google/flan-t5-small
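
With `top_k=4` and 200-word chunks, the context passed to the generator can reach roughly 800 words, which may be more than a small model handles well. One possible mitigation is to cap the context before building the prompt. The sketch below reuses the prompt wording from `generate_answer`; the helper name `build_prompt`, the `max_context_words` parameter, and its default of 300 are assumptions for illustration, and a word count is only a rough proxy for the model's token count.

```python
from typing import List

PROMPT_TEMPLATE = (
    "Use the context below to answer the question. "
    "If the answer is not contained in the context, say you don't know.\n\n"
    "Context:\n{context}\n\nQuestion: {query}\nAnswer:"
)

def build_prompt(query: str, chunk_texts: List[str], max_context_words: int = 300) -> str:
    """Join chunks in rank order, stopping once the word budget is used up."""
    kept = []
    used = 0
    for text in chunk_texts:
        remaining = max_context_words - used
        if remaining <= 0:
            break
        words = text.split()
        # Keep the whole chunk if it fits, otherwise only its first `remaining` words
        kept.append(" ".join(words[:remaining]))
        used += min(len(words), remaining)
    return PROMPT_TEMPLATE.format(context="\n\n".join(kept), query=query)

# Made-up chunks with a tiny budget to make the truncation visible
demo_prompt = build_prompt(
    "What is the capital?",
    ["alpha beta gamma", "delta epsilon"],
    max_context_words=4,
)
print(demo_prompt)
```

Because the chunks arrive ranked by distance, trimming from the end drops the least relevant text first.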



**Simple Gradio chat UI**

This is a complete Retrieval-Augmented Generation (RAG) chatbot pipeline:-

User asks a question in Gradio UI.

retrieve finds relevant chunks from your documents.

generate_answer prompts Flan-T5 (or whichever model is configured) to answer from the retrieved context, and to say it doesn't know when the context lacks the answer.

UI shows answer + evidence snippets.

In [9]:
import gradio as gr

def rag_chat(user_input):
    if not user_input.strip():
        return "Please ask a question."
    retrieved = retrieve(user_input, top_k=4)
    answer = generate_answer(user_input, retrieved)
    # Include short retrieved snippets in the UI as evidence
    snippets = "\n\n".join([f"[{r['meta']['doc']}#{r['meta']['chunk_idx']}]: {r['chunk'][:200]}..." for r in retrieved])
    return f"**Answer:**\n{answer}\n\n**Retrieved snippets:**\n{snippets}"

with gr.Blocks() as demo:
    gr.Markdown("## RAG demo (Hugging Face) — ask questions about the uploaded docs")
    inp = gr.Textbox(lines=2, placeholder="Ask a question...")
    out = gr.Markdown()
    btn = gr.Button("Ask")
    btn.click(rag_chat, inputs=inp, outputs=out)
    # Allow hitting Enter to submit
    inp.submit(rag_chat, inputs=inp, outputs=out)

print("Launching Gradio UI (press the public link or open inline).")
demo.launch(share=True)


Launching Gradio UI (press the public link or open inline).
Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://e93eae32e7484cfa3a.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)


