# Assignment 4.1: Retrieval-Augmented Question Answering Using LangChain

Student: Mostafa Zamaniturk

# Instructions
In this assignment, you will explore how retrieval-augmented generation (RAG) improves language model responses by grounding them in real data. Using TED Talk transcripts, you'll combine semantic search with a transformer model to generate accurate, context-aware answers.

The purpose of this assignment is to build a simple question answering (QA) system using RAG techniques. You will use LangChain and Hugging Face tools to load a TED Talks dataset, embed document chunks, store them in a vector database (FAISS), and answer queries over them with a pretrained transformer model.

Through this assignment, you will gain hands-on experience in building real-world QA systems using open-domain documents.
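
To make the chunk, embed, retrieve, and prompt flow described above concrete, here is a minimal, dependency-free sketch. It is not the LangChain implementation: the word-count "embedding" stands in for a sentence-transformer vector, the fixed-width splitter is a simplification of a recursive text splitter, and the helper names (`chunk_text`, `embed`, `retrieve`, `build_prompt`) and the three-sentence corpus are hypothetical, invented purely for illustration.

```python
import math
import re
from collections import Counter


def chunk_text(text, chunk_size=500, chunk_overlap=100):
    """Fixed-width sliding window; assumes chunk_overlap < chunk_size."""
    step = chunk_size - chunk_overlap  # each chunk starts this far after the previous one
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):  # the last window reached the end of the text
            break
    return chunks


def embed(text):
    """Toy embedding: lowercase word counts (stand-in for a dense vector)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question, chunks, k=2):
    """Return the k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(question, context_chunks):
    """Place the retrieved chunks ahead of the question so the model answers from them."""
    context = "\n\n".join(context_chunks)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


# Hypothetical mini-corpus, invented for illustration only
corpus = [
    "Rising sea levels and climate change threaten coastal cities.",
    "Good teachers make education engaging for every student.",
    "Sleep is essential for memory and learning.",
]
question = "What do speakers say about climate change?"
top = retrieve(question, corpus, k=1)
prompt = build_prompt(question, top)
print(prompt)
```

In the real pipeline, the prompt would be passed to the pretrained transformer model, and the retrieval step would be handled by FAISS over dense embeddings rather than by word overlap.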

# Required Details
Hint 1:

Load a manageable subset of English translations from the TED Talks dataset, which is provided here for your convenience.

In [None]:
! pip install datasets

In [None]:
! pip install langchain sentence-transformers faiss-cpu transformers


Hint 2:

Some sample questions that you can ask:

"What do TED speakers say about climate change?"

"What is the general opinion on education?"

In [8]:
# Step 1: Import Required Libraries for LLM + Document Retrieval Workflow
import os
import torch
from datasets import load_dataset
from transformers import pipeline
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.llms import HuggingFacePipeline
from langchain.chains import RetrievalQA
from langchain.schema import Document

In [None]:
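# Step 2: Load the documents.
# Note: this loads the first 10 English Wikipedia articles, not the TED Talks dataset named in the instructions.
# streaming=True reads records lazily; without it, split="train[:10]" still downloads all 41 data files before slicing.
dataset = load_dataset("wikimedia/wikipedia", "20231101.en", split="train", streaming=True).take(10)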

documents = []
for item in dataset:
    text = item["text"]
    title = item.get("title", "Unknown")
    if text:
        documents.append(
            Document(
                page_content=text,
                metadata={"title": title, "source": "wikipedia_20231101"}
            )
        )

# Step 3: Split the documents. Each chunk holds at most 500 characters and overlaps the next chunk by up to 100 characters for context continuity.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
docs = text_splitter.split_documents(documents[:5])  # limit size to reduce memory

# Step 4: Embed using Hugging Face sentence transformer

embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
db = FAISS.from_documents(docs, embeddings)
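
The cell above stops once the FAISS index is built; the `pipeline`, `HuggingFacePipeline`, and `RetrievalQA` imports from Step 1 are not yet used. A possible next step is sketched below, under several assumptions: the installed LangChain version resolves the same import paths as Step 1 and supports `.invoke` on chains, `google/flan-t5-base` is used as the generator (the notebook does not name a model), and `k=3` chunks are retrieved per question. The helper names `build_qa_chain` and `ask_all` are hypothetical.

```python
def build_qa_chain(db, model_name="google/flan-t5-base", k=3):
    """Wrap a vector store and a local Hugging Face model in a RetrievalQA chain."""
    # Imported lazily so that defining this helper does not load the heavy libraries
    from transformers import pipeline
    from langchain.llms import HuggingFacePipeline
    from langchain.chains import RetrievalQA

    # Assumption: a small seq2seq model; another text2text-generation model could be substituted
    generator = pipeline("text2text-generation", model=model_name, max_new_tokens=128)
    llm = HuggingFacePipeline(pipeline=generator)

    return RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",  # concatenate the retrieved chunks into a single prompt
        retriever=db.as_retriever(search_kwargs={"k": k}),
        return_source_documents=True,
    )


# Sample questions taken from Hint 2
SAMPLE_QUESTIONS = [
    "What do TED speakers say about climate change?",
    "What is the general opinion on education?",
]


def ask_all(qa_chain, questions=SAMPLE_QUESTIONS):
    """Run each question through the chain and collect the answer and source titles."""
    results = []
    for question in questions:
        output = qa_chain.invoke({"query": question})
        titles = [doc.metadata.get("title", "Unknown") for doc in output["source_documents"]]
        results.append({"question": question, "answer": output["result"], "sources": titles})
    return results
```

After Step 4 has run, `qa = build_qa_chain(db)` followed by `ask_all(qa)` would return one dictionary per question. Because the index above holds Wikipedia articles, answers to the TED-specific questions will only be grounded once TED Talk transcripts are indexed instead.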


# Required Format
Convert your Jupyter Notebook or Python script into a single, clean PDF or HTML file. Label each section clearly and make sure the outputs are visible in the document.