<a href="https://colab.research.google.com/github/BPALAN-USD/AAI-520/blob/main/AAI_520_Assignment_4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Instructions**

In this assignment, you will explore how retrieval-augmented generation (RAG) improves language model responses by grounding them in real data. Using TED Talk transcripts, you'll combine semantic search with a transformer model to generate accurate, context-aware answers.

The goal is to build a simple question-answering (QA) system using RAG techniques. You will use LangChain and Hugging Face tools to load a TED Talks dataset, embed the document chunks and store them in a FAISS vector store, and query them with a pretrained transformer model. Along the way, you will gain hands-on experience building QA systems over open-domain documents.

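# Overall Activities in This Notebook

1. Mount Google Drive and log in to Hugging Face
2. Install the required packages
3. Import the required packages
4. Load the dataset
5. Translate the dataset from Hindi to English
6. Chunk the documents for embedding
7. Embed the documents and store them in a vector store
8. Build the RAG chain
9. Save the model
10. Test the model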

**1. Mount Google Drive and Log In to Hugging Face**

In [1]:
from google.colab import drive
drive.mount('/content/drive')


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
from huggingface_hub import login
login()  # this will prompt you for a token


**2. Install the Required Packages**

In [3]:
!pip install IndicTransToolkit
!pip install langchain
!pip install faiss-cpu

!pip install sentence-transformers
!pip install -U langchain-community



**3. Import Required Packages**

In [4]:
from transformers import pipeline
import torch
from datasets import load_dataset
from langchain.schema import Document
import ast
from tqdm import tqdm

**4. Load Dataset**

In [5]:


# --------------------------
# Setup translation pipeline
# --------------------------
DEVICE = 0 if torch.cuda.is_available() else -1

translator = pipeline(
    "translation",
    model="ai4bharat/indictrans2-indic-en-1B",
    trust_remote_code=True,
    device=DEVICE
)

# --------------------------
# Load dataset
# --------------------------
dataset = load_dataset("bigscience-data/roots_indic-hi_ted_talks_iwslt", split="train[:50]")

documents = []
for item in dataset:
    text = item.get("text", "").strip()
    if text:
        # parse meta string into dict
        meta_raw = item.get("meta", "{}")
        try:
            meta_dict = ast.literal_eval(meta_raw) if isinstance(meta_raw, str) else meta_raw
        except Exception:
            meta_dict = {"raw_meta": meta_raw}

        doc = Document(
            page_content=text,
            metadata={
                "file": meta_dict.get("file", "unknown"),
                "element": meta_dict.get("element", None),
                "dataset": "roots_indic-hi_ted_talks_iwslt"
            }
        )
        documents.append(doc)

print(f"Loaded {len(documents)} documents")


Device set to use cuda:0



Loaded 50 documents
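
The metadata handling in the loading cell can be isolated into a small helper. The sketch below assumes, as the cell above does, that `meta` arrives either as a dict or as a string holding a Python dict literal. The function name `parse_meta` and the sample values are hypothetical and are not taken from the dataset.

```python
import ast

def parse_meta(meta_raw):
    """Return metadata as a dict, falling back to the raw value when parsing fails."""
    if isinstance(meta_raw, dict):
        return meta_raw
    try:
        parsed = ast.literal_eval(meta_raw)
    except (ValueError, SyntaxError):
        # Not a valid Python literal: keep the raw value for later inspection
        return {"raw_meta": meta_raw}
    # A valid literal that is not a dict (for example a list) is also kept raw
    return parsed if isinstance(parsed, dict) else {"raw_meta": meta_raw}

# Hypothetical sample values for illustration only
print(parse_meta("{'file': 'talk_001.xml', 'element': 'transcript'}"))
print(parse_meta("not a dict literal"))
```

Returning a dict in every case means the later `.get("file", "unknown")` lookups cannot fail on an unexpected metadata shape.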


**5. Translate the Dataset from Hindi to English**

In [6]:
# --------------------------
# Helper: split long Hindi text
# --------------------------
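def chunk_text(text, max_chars=200):
    """Split text into chunks of roughly max_chars, breaking at the Hindi full stop."""
    # Drop empty pieces so a trailing "।" does not produce a stray chunk
    sentences = [s.strip() for s in text.split("।") if s.strip()]
    chunks = []
    current_chunk = ""
    for s in sentences:
        sentence = s + "।"  # restore the full stop removed by split()
        if current_chunk and len(current_chunk) + len(sentence) + 1 > max_chars:
            # Current chunk is full: store it and start a new one
            chunks.append(current_chunk)
            current_chunk = sentence
        else:
            current_chunk = f"{current_chunk} {sentence}".strip()
    if current_chunk:
        chunks.append(current_chunk)
    return chunks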

# --------------------------
# Translate documents while keeping source text
# --------------------------
translated_documents = []

for doc in tqdm(documents, desc="Translating documents"):
    chunks = chunk_text(doc.page_content, max_chars=200)
    translated_text = ""

    # Keep track of both source and translated chunks
    chunk_pairs = []

    for chunk in chunks:
        prefixed_chunk = f"hin_Deva eng_Latn {chunk}"
        translation = translator(
            prefixed_chunk,
            max_new_tokens=512,
            max_length=512,
            truncation=True,
            use_cache=False
        )
        translated_chunk = translation[0]['translation_text']
        translated_text += translated_chunk + " "

        chunk_pairs.append({
            "source_text": chunk,
            "translated_text": translated_chunk
        })

    translated_documents.append(
        Document(
            page_content=translated_text.strip(),
            metadata={
                **doc.metadata,
                "source_chunks": chunk_pairs  # store source + translation pairs
            }
        )
    )

print(f"Translated {len(translated_documents)} documents")

# --------------------------
# Example output
# --------------------------
example_doc = translated_documents[0]
print("Full Translated Text:\n", example_doc.page_content)
print("\nSource + Translated Chunks:")
for pair in example_doc.metadata["source_chunks"]:
    print("HI:", pair["source_text"])
    print("EN:", pair["translated_text"])
    print("-"*50)


Both `max_new_tokens` (=512) and `max_length`(=512) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)

Translated 50 documents
Full Translated Text:
 You may not realize it , but there are more bacteria in your body than there are stars in our entire galaxy . This wonderful world of bacteria within us is an integral part of our health and our technology is so rapidly evolving that today we can program these bacteria in the same way that we can program a computer.Now this diagram that you see here I know it looks like some kind of game . Now , in addition to programming these beautiful patterns , what else can we do with these bacteria ? And I decided to find out how we can program bacteria to detect and treat cancer-like diseases in our bodies . One of the amazing things about bacteria is that they can grow naturally inside tumors . That 's because the immune system doesn 't usually have access to tumors . So by finding these tumors and using them as safe places for bacteria to grow and thrive , we started using safe and health-benefiting probiotic bacteria and found that when they were




**6. Document Chunking for Embedding**

In [7]:
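from langchain.docstore.document import Document
from langchain.text_splitter import CharacterTextSplitter

# Optional: further split long translated text into smaller chunks for embeddings.
# Note: CharacterTextSplitter splits only on its separator (default "\n\n").
# Text without blank lines stays as one chunk even when it exceeds chunk_size;
# RecursiveCharacterTextSplitter is the usual choice when chunk_size must be enforced.
text_splitter = CharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50
)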

rag_documents = []
for doc in translated_documents:
    chunks = text_splitter.split_text(doc.page_content)
    for chunk in chunks:
        rag_documents.append(Document(page_content=chunk, metadata=doc.metadata))

print(f"Total chunks for RAG: {len(rag_documents)}")


Total chunks for RAG: 50
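
The `chunk_size` and `chunk_overlap` settings describe a sliding window over the text. The sketch below illustrates that idea with plain string slicing. It is a simplified stand-in, not LangChain's splitting algorithm, and the function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text, chunk_size=500, chunk_overlap=50):
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks

# Hypothetical sample: 1500 characters of filler text
sample_text = "word " * 300
pieces = sliding_window_chunks(sample_text)
print(len(pieces), [len(p) for p in pieces])
```

Unlike a separator-based splitter, this window always enforces the size limit, at the cost of sometimes cutting a sentence in the middle.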


**7. Embed the Documents and Store Them in a Vector Store**

In [8]:
# langchain-community (installed in step 2) provides these integrations
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

# Initialize embeddings
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# Build FAISS index from documents
vector_store = FAISS.from_documents(rag_documents, embeddings)

# Optional: save index
vector_store.save_local("faiss_index")
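
Conceptually, the vector store answers a query by finding the stored vectors closest to the query vector. The sketch below shows an exact nearest-neighbour search with Euclidean (L2) distance in plain Python, which is understood to be the default behaviour of LangChain's FAISS wrapper. The helper names and the 3-dimensional toy vectors are hypothetical; real `all-MiniLM-L6-v2` embeddings have 384 dimensions.

```python
import math

def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def top_k(query_vec, index, k=3):
    """Return the k (distance, doc_id) pairs closest to query_vec."""
    scored = sorted((l2_distance(query_vec, vec), doc_id) for doc_id, vec in index.items())
    return scored[:k]

# Hypothetical toy "embeddings" for illustration only
toy_index = {
    "bacteria_talk": [0.9, 0.1, 0.0],
    "climate_talk": [0.1, 0.8, 0.2],
    "music_talk": [0.0, 0.2, 0.9],
}
query_vec = [0.8, 0.2, 0.1]

for dist, doc_id in top_k(query_vec, toy_index, k=2):
    print(f"{doc_id}: {dist:.3f}")
```

A smaller distance means a closer match, so the retriever returns the `k` documents with the smallest distances.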




**8. Build a RAG Chain with a LangChain LLM**

In [9]:
from langchain.chains import RetrievalQA
from langchain_community.llms import HuggingFacePipeline
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline
import torch

# --------------------------
# Load model
# --------------------------
model_name = "google/flan-t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

llm_pipeline = pipeline(
    task="text2text-generation",
    model=model,
    tokenizer=tokenizer,
    device=0 if torch.cuda.is_available() else -1,  # GPU if available
    max_length=512
)

llm = HuggingFacePipeline(pipeline=llm_pipeline)

# --------------------------
# Build RetrievalQA chain
# --------------------------
llm_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vector_store.as_retriever(search_type="similarity", search_kwargs={"k": 3}),
    return_source_documents=True
)
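
`RetrievalQA.from_chain_type` defaults to the "stuff" chain type, which places every retrieved document into a single prompt. With `k=3` and long passages, that prompt can exceed the 512-token input that `flan-t5-base` was configured for. The sketch below shows the idea of stuffing under a context budget. The function name `build_stuffed_prompt`, the prompt wording, and the character budget are hypothetical; characters are only a rough proxy for tokens, and this is not LangChain's actual template.

```python
def build_stuffed_prompt(question, passages, max_context_chars=1500):
    """Concatenate passages into one prompt, stopping before the budget is exceeded."""
    kept, used = [], 0
    for passage in passages:
        if used + len(passage) > max_context_chars:
            break  # adding this passage would overflow the context budget
        kept.append(passage)
        used += len(passage)
    context = "\n\n".join(kept)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )

# Hypothetical passages for illustration only
passages = ["a" * 1000, "b" * 1000, "c" * 100]
prompt = build_stuffed_prompt("What can bacteria do?", passages)
print(len(prompt))
```

Passages are assumed to arrive in relevance order, so stopping at the first overflow keeps the best matches and drops the rest.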


Device set to use cuda:0


**9. Save the Model**

In [10]:
save_path = "/content/drive/MyDrive/my_rag_model"
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)


vector_store.save_local("/content/drive/MyDrive/faiss_index")



**10. Test the Model**

In [11]:
query = "Tell me about bacteria."
result = llm_chain.invoke({"query": query})  # invoke() replaces the deprecated direct call

print("Answer:\n", result['result'])
print("\nSource Documents:")
for doc in result['source_documents']:
    print(doc.metadata.get("file", "unknown"))
    print(doc.page_content[:500], "...")  # show first 500 chars
    print("-"*80)


Token indices sequence length is longer than the specified maximum sequence length for this model (2711 > 512). Running this sequence through the model will result in indexing errors


Answer:
 They can grow inside tumors . They can grow inside tumors . They can grow inside tumors . They can grow inside tumors . They can grow inside tumors . They can grow inside tumors . They can grow inside tumors . They can grow inside tumors . They can grow inside tumors . They can grow inside tumors . They can grow inside tumors . They can grow inside tumors . They can grow inside tumors . They can grow inside tumors . They can grow inside tumors . They can grow inside tumors . They can grow inside tumors . They can grow inside tumors . They can grow inside tumors . They can grow inside tumors . They can grow inside tumors . They can grow inside tumors . They can grow inside tumors . They can grow inside tumors . They can grow inside tumors . They can grow inside tumors . They can grow inside tumors . They can grow inside tumors . They can grow inside tumors . They can grow inside tumors . They can grow inside tumors . They can grow inside tumors . They can grow inside tumors . T