# RAG-based Question Answering System with TED Talks

This assignment explores how retrieval-augmented generation (RAG) improves language model responses by grounding them in real data. We combine semantic search over TED Talk descriptions with a transformer model to generate context-aware answers.

## Objective
Build a simple question answering (QA) system using retrieval-augmented generation (RAG) with LangChain and HuggingFace tools: load a TED Talks dataset, embed the document chunks and store them in a vector database (FAISS), and query them with a pretrained transformer model.


## Step 1: Installing Required Dependencies

The RAG system relies on four groups of packages: datasets for loading the TED Talks data, LangChain for document processing and chaining, FAISS for vector storage, and transformers for the language model. With these installed, the cell below imports everything the later steps need.


In [34]:
# Importing required libraries for RAG system
import torch
from datasets import load_dataset
from langchain.schema import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.llms import HuggingFacePipeline
from langchain.chains import RetrievalQA
from transformers import pipeline
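
The import cell above fails on the first package that is not installed. A minimal sketch for checking the environment up front is shown here. The module list is an assumption: `faiss` and `sentence_transformers` are not imported directly in this notebook, but the LangChain FAISS and HuggingFaceEmbeddings wrappers load them internally (they are usually installed as `faiss-cpu` and `sentence-transformers`).

```python
import importlib.util

# Import names (not pip package names) that the notebook depends on.
# "faiss" and "sentence_transformers" are assumed indirect dependencies.
REQUIRED_MODULES = [
    "torch",
    "datasets",
    "langchain",
    "transformers",
    "faiss",
    "sentence_transformers",
]


def find_missing(modules):
    """Return the module names that cannot be located in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


missing = find_missing(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are importable")
```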


## Step 2: Loading TED Talks Dataset

We load a manageable subset of TED Talk descriptions from the gigant/ted_descriptions dataset, limited to the first 1000 samples to keep memory usage low.


In [None]:
# Loading the TED Talk descriptions dataset
dataset = load_dataset("gigant/ted_descriptions", split="train[:1000]")  # First 1000 rows only, to limit memory
print(f"Loaded {len(dataset)} TED Talk samples")

Loaded 1000 TED Talk samples


## Step 3: Converting to LangChain Documents

We convert each dataset item into a LangChain Document object. The talk description becomes the page content, and the talk URL and a source label are stored as metadata so that retrieved chunks can be traced back to their talk.


In [36]:
# Converting dataset items to LangChain Document objects
documents = []
for item in dataset:
    if item.get("descr"):  # Using 'descr' field which contains the description
        doc = Document(
            page_content=item["descr"],
            metadata={
                "url": item.get("url", ""),
                "source": "TED Talks"
            }
        )
        documents.append(doc)
print(f"Created {len(documents)} Document objects")

Created 1000 Document objects


## Step 4: Splitting Documents into Chunks

We split the documents into smaller chunks for retrieval. RecursiveCharacterTextSplitter is configured with a chunk size of 500 characters and an overlap of 50 characters, so that context carries over between adjacent chunks.


In [11]:
# Splitting documents into manageable chunks for better retrieval
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,  # Each chunk has at most 500 characters
    chunk_overlap=50  # Up to 50 characters shared between adjacent chunks
)

# Split the documents into chunks
docs = text_splitter.split_documents(documents)
print(f"Split {len(documents)} documents into {len(docs)} chunks")


Split 1000 documents into 1230 chunks
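
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of fixed-window splitting. It is not the algorithm RecursiveCharacterTextSplitter uses: the real splitter prefers to break on separators such as paragraph breaks, newlines, and spaces, so its chunk boundaries will differ. The function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text, chunk_size=500, chunk_overlap=50):
    """Cut text into windows of at most chunk_size characters.

    Each window starts chunk_size - chunk_overlap characters after the
    previous one, so adjacent windows share chunk_overlap characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if len(text) <= chunk_size:
        return [text]  # Short texts stay as a single chunk

    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # The last window already reached the end of the text
        start += step
    return chunks


sample = "".join(chr(97 + i % 26) for i in range(1200))  # 1200 characters
pieces = split_with_overlap(sample)
print([len(p) for p in pieces])  # prints [500, 500, 300]
```

Texts no longer than `chunk_size` pass through as a single chunk, which is consistent with 1000 documents producing only 1230 chunks here: most descriptions presumably fit within 500 characters.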


## Step 5: Creating Embeddings and Vector Database

We generate an embedding for each chunk with the all-MiniLM-L6-v2 sentence transformer from HuggingFace and store the vectors in a FAISS vector database. FAISS provides efficient similarity search over these vectors, which is what enables semantic retrieval of relevant document chunks.


In [37]:
# Creating embeddings using HuggingFace sentence transformer
embeddings_function = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

# Creating FAISS vector database from document chunks
db = FAISS.from_documents(docs, embeddings_function)
print("Vector database created successfully")


Vector database created successfully
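
Conceptually, similarity search compares the query embedding with every stored embedding and returns the closest ones. The brute-force sketch below shows this with Euclidean (L2) distance, assuming the LangChain FAISS wrapper is left on its default distance setting. The three-dimensional vectors and their labels are made up for illustration; real all-MiniLM-L6-v2 embeddings have 384 dimensions.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vec, indexed, k=3):
    """Return the k (text, distance) pairs closest to query_vec."""
    scored = [(text, l2_distance(query_vec, vec)) for text, vec in indexed]
    scored.sort(key=lambda pair: pair[1])  # Smallest distance first
    return scored[:k]


# Hypothetical toy "embeddings" standing in for real chunk vectors
toy_index = [
    ("climate talk", [0.9, 0.1, 0.0]),
    ("education talk", [0.1, 0.9, 0.0]),
    ("energy talk", [0.7, 0.2, 0.1]),
]

for text, dist in top_k([1.0, 0.0, 0.0], toy_index, k=2):
    print(f"{text}: {dist:.3f}")
```

FAISS returns the same kind of ranked result, but uses optimized index structures instead of a Python loop over every vector.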


## Step 6: Setting up Language Model Pipeline

We initialize a HuggingFace transformers pipeline for text generation. It uses the google/flan-t5-small model and runs on the CPU, which keeps the notebook usable on machines without a GPU.


In [38]:
# Setting up device for CPU usage
device = torch.device("cpu")

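# Creating HuggingFace pipeline for text generation
qa_pipeline = pipeline(
    "text2text-generation",
    model="google/flan-t5-small",
    max_length=256,
    device=device,
    do_sample=True,  # Sample tokens instead of always picking the most likely one
    temperature=0.5,  # Below 1.0, so sampling stays close to the most likely tokens
    top_p=0.9  # Nucleus sampling: choose only among the top 90% of probability mass
)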

# Wrapping pipeline into LangChain-compatible LLM
llm = HuggingFacePipeline(pipeline=qa_pipeline)
print("Language model pipeline created successfully")


Device set to use cpu


Language model pipeline created successfully
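
The `temperature` and `top_p` settings both reshape the probability distribution the model samples from. The sketch below illustrates the idea in plain Python on made-up scores for three candidate tokens. It is a conceptual illustration, not the transformers implementation, and the function names are hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # Subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nucleus_filter(probs, top_p=0.9):
    """Keep the most likely tokens until their mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return {i: probs[i] / mass for i in kept}


logits = [2.0, 1.0, 0.0]  # Hypothetical scores for three candidate tokens
probs = softmax_with_temperature(logits, temperature=0.5)
print([round(p, 3) for p in probs])
print(nucleus_filter(probs, top_p=0.9))
```

With a temperature of 0.5 the top token takes a larger share of the probability than at 1.0, and the nucleus filter then removes the least likely token entirely, so sampling stays varied but avoids unlikely continuations.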


## Step 7: Building the RAG Question-Answering Chain

We create a RetrievalQA chain that connects the vector database retriever to the language model. For each question, the chain retrieves the three most relevant TED Talk chunks and passes them to the model as context for the answer.


In [None]:
# Creating retriever from vector database (top 3 relevant chunks)
retriever = db.as_retriever(search_kwargs={"k": 3})

# Building the complete RetrievalQA chain
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",  # Put all retrieved chunks into a single prompt
    retriever=retriever,
    return_source_documents=True  # Also return the chunks used as context
)  # No custom prompt is passed, so the chain uses its default prompt

print("RAG Question-Answering chain created successfully")

RAG Question-Answering chain created successfully
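
The "stuff" chain type places all retrieved chunks into one prompt and makes a single call to the language model. A minimal sketch of that assembly step follows. The template wording and the function name `build_stuff_prompt` are hypothetical, and LangChain's built-in default prompt is worded differently.

```python
# Hypothetical template for illustration only
HYPOTHETICAL_TEMPLATE = (
    "Use the following context to answer the question.\n\n"
    "{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)


def build_stuff_prompt(question, chunks, template=HYPOTHETICAL_TEMPLATE):
    """Join every retrieved chunk into one context block and fill the template."""
    context = "\n\n".join(chunks)
    return template.format(context=context, question=question)


example_chunks = [
    "First retrieved chunk of text.",
    "Second retrieved chunk of text.",
]
print(build_stuff_prompt("What is this about?", example_chunks))
```

Because every chunk lands in the same prompt, the number of retrieved chunks (`k`) and the chunk size together determine how much text the model has to read, which matters for a small model with a limited input length.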


## Step 8: Testing the RAG System

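We test the RAG-based question answering system with the two sample questions provided in the instructions. The helper below calls the retriever and the FLAN-T5 pipeline directly with a short custom prompt, rather than going through qa_chain, and returns the generated answer together with the source documents it was based on.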


In [39]:
# A Q&A function that returns the answer together with its source documents
def simple_qa_with_sources(question, retriever, generator):
    # Get relevant documents
    source_docs = retriever.get_relevant_documents(question)

    # Combine all retrieved context into one string
    context = " ".join(doc.page_content for doc in source_docs)

    # Simple prompt format that works well with FLAN-T5
    prompt = f"Question: {question}\nContext: {context}\nAnswer:"

    # Generate answer; these arguments override the pipeline defaults from Step 6
    answer = generator(prompt, max_length=200, do_sample=True, temperature=0.7)[0]["generated_text"]

    return {"result": answer, "source_documents": source_docs}


# Printing the answer and a preview of each source document
def print_result(number, question, result, preview_chars=200):
    print(f"Question {number}: {question}")
    print("\nAnswer:")
    print(result["result"])
    print("\nSource Documents:")
    for i, doc in enumerate(result["source_documents"], 1):
        text = doc.page_content
        print(f"\n--- Source {i} ---")
        print(text[:preview_chars] + "..." if len(text) > preview_chars else text)


question1 = "What do TED speakers say about climate change?"
result1 = simple_qa_with_sources(question1, retriever, qa_pipeline)
print_result(1, question1, result1)

print("\n" + "=" * 70 + "\n")

question2 = "What is the general opinion on education?"
result2 = simple_qa_with_sources(question2, retriever, qa_pipeline)
print_result(2, question2, result2)

Question 1: What do TED speakers say about climate change?

Answer:
a solvable problem that we can tackle together

Source Documents:

--- Source 1 ---
"Dare mighty things." These words are at the entrance to NASA's Jet Propulsion Laboratory (JPL), and JPL science communicator Laura Tenenbaum says they embody the attitude we must take towards climate...

--- Source 2 ---
A brief answer to one of the key questions about climate change: Why act now? (Written by Myles Allen, David Biello and George Zaidan)

--- Source 3 ---
Lighting up the TED stage, Nobel laureate Al Gore takes stock of the current state of climate progress and calls attention to institutions that have failed to honor their promises by continuing to pou...


Question 2: What is the general opinion on education?

Answer:
Educator Nora Flanagan says we can reframe this moment as an opportunity to fix what's long been broken for teachers, students and families -- and shares four ways schools can reinvent themselves for a po

## Conclusion

We have implemented a RAG-based question answering system using TED Talks data. The system shows how retrieval-augmented generation grounds language model responses in real data: each answer is generated from TED Talk descriptions retrieved for the question.

### Key Components Implemented:
- **Data Loading**: First 1000 samples of the gigant/ted_descriptions dataset
- **Document Processing**: Conversion to LangChain documents with URL and source metadata
- **Text Chunking**: 500-character chunks with a 50-character overlap
- **Embeddings**: Semantic vector representations using the all-MiniLM-L6-v2 sentence transformer
- **Vector Database**: FAISS for efficient similarity search
- **Language Model**: FLAN-T5 (google/flan-t5-small) for context-aware answer generation
- **RAG Chain**: Integration of the retrieval and generation components

### System Capabilities:
The RAG system can answer questions about various topics covered in TED talks by retrieving relevant content and generating informed responses based on the retrieved context.
