# Assignment 4.1: Retrieval-Augmented Question Answering Using LangChain

Student: Mostafa Zamaniturk

# Instructions
In this assignment, you will explore how retrieval-augmented generation (RAG) improves language model responses by grounding them in real data. Using TED Talk transcripts, you'll combine semantic search with a transformer model to generate accurate, context-aware answers.

The purpose of this assignment is to build a simple question answering (QA) system using RAG techniques. You will use LangChain and HuggingFace tools to load a TED Talks dataset, embed and store document chunks in a vector database (FAISS), and query them using a pretrained transformer model.

Through this assignment, students will gain hands-on experience in building real-world QA systems using open-domain documents.
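
The retrieve-then-generate idea described above can be sketched without any external libraries. This is a toy illustration only: it ranks passages by simple word overlap rather than semantic embeddings, and the helper names (`tokenize`, `retrieve`, `build_prompt`) and the sample passages are hypothetical, not part of the assignment or of LangChain.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase a string and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def retrieve(query, passages, k=2):
    """Rank passages by how many query-word occurrences they contain."""
    query_words = set(tokenize(query))
    scored = []
    for passage in passages:
        counts = Counter(tokenize(passage))
        score = sum(counts[word] for word in query_words)
        scored.append((score, passage))
    # The sort is stable, so passages with equal scores keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [passage for score, passage in scored[:k] if score > 0]

def build_prompt(query, context_passages):
    """Place the retrieved passages ahead of the question."""
    context = "\n".join(context_passages)
    return f"Use the context to answer.\n\nContext:\n{context}\n\nQuestion: {query}"

# Hypothetical mini-corpus standing in for the real document chunks.
passages = [
    "Solar panels convert sunlight into electricity.",
    "Climate change is driven by greenhouse gas emissions.",
    "Good teachers adapt lessons to each student.",
]
query = "What drives climate change?"
top_passages = retrieve(query, passages, k=1)
prompt = build_prompt(query, top_passages)
print(prompt)
```

In a full RAG system the prompt built here would be passed to a language model, and the word-overlap scorer would be replaced by a vector-store lookup.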

# Required Details
Hint 1:

Load a manageable subset of English translations from the TED Talks dataset, which is provided here for your convenience.

Hint 2:

Some sample questions that you can ask:

"What do TED speakers say about climate change?"

"What is the general opinion on education?"

In [8]:
! pip install rank_bm25


Collecting rank_bm25
  Downloading rank_bm25-0.2.2-py3-none-any.whl.metadata (3.2 kB)
Downloading rank_bm25-0.2.2-py3-none-any.whl (8.6 kB)
Installing collected packages: rank_bm25
Successfully installed rank_bm25-0.2.2


In [9]:
# Step 1: Import Required Libraries for LLM + Document Retrieval Workflow
import os
import torch
from datasets import load_dataset
from transformers import pipeline
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.llms import HuggingFacePipeline
from langchain.chains import RetrievalQA
from langchain.schema import Document
from langchain.retrievers import BM25Retriever, EnsembleRetriever
from rank_bm25 import BM25Okapi



The TED dataset had a problem, so I decided to use the Wikipedia dataset as the reference corpus for the RAG system.

In [None]:
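# Step 2: Load the documents. The dataset is streamed, so stop after a small
# number of articles instead of iterating over the whole Wikipedia dump.
MAX_DOCS = 5  # only the first five articles are used in Step 3

dataset = load_dataset(
    "wikimedia/wikipedia",
    "20231101.en",
    split="train",
    streaming=True
)
documents = []
for item in dataset:
    text = item["text"]
    title = item.get("title", "Unknown")
    if text:
        documents.append(
            Document(
                page_content=text,
                metadata={"title": title, "source": "wikipedia_20231101"}
            )
        )
    if len(documents) >= MAX_DOCS:
        break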



In [None]:
# Step 3: Split the documents. Each chunk has up to 1000 characters and overlaps the next chunk by 200 characters for context continuity.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
docs = text_splitter.split_documents(documents[:5])  # limit size to reduce memory

# Step 4: Embed using Hugging Face sentence transformer

embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
db = FAISS.from_documents(docs, embeddings)
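
How `chunk_size` and `chunk_overlap` interact can be seen in a simplified sliding-window sketch. This is not how `RecursiveCharacterTextSplitter` works internally (it prefers to split on separators such as paragraph and line breaks, so real chunks will differ); the function `chunk_text` is a hypothetical stand-in that only illustrates the window and overlap arithmetic.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(sample, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

Each window advances by `chunk_size - chunk_overlap` characters, so raising the overlap (for a fixed chunk size) produces more chunks for the same text and therefore more vectors to embed and search.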


In [28]:
# Step 5: Load flan-t5-base on CPU (safest config). The Hugging Face pipeline is wrapped in a LangChain-compatible LLM.

device = torch.device("cpu")

qa_pipeline = pipeline(
    "text2text-generation",
    model="google/flan-t5-base",
    max_length=256,
    device=device,
    do_sample=False
)
llm = HuggingFacePipeline(pipeline=qa_pipeline)

# Step 6: Build the RetrievalQA chain. It first retrieves the most relevant text chunks (k = 3 per retriever), then passes them to the LLM to answer the query.
# Grounding the LLM in specific documents improves its ability to answer questions.

# retriever = db.as_retriever(search_kwargs={"k": 3})

# Use hybrid search in LangChain: combine BM25 (keyword) and FAISS (embedding) retrieval
# Create a BM25 retriever

# If you have plain text strings instead of Document objects, use these lines instead:
# bm25_retriever = BM25Retriever.from_texts([d.page_content for d in docs])
# bm25_retriever.k = 3

bm25_retriever = BM25Retriever.from_documents(docs)
bm25_retriever.k = 3  # how many results to return

# create a FAISS retriever
faiss_retriever = db.as_retriever(search_kwargs={"k": 3})

# Combine them with EnsembleRetriever
retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, faiss_retriever],
    weights=[0, 1]  # BM25 weight 0, FAISS weight 1: ranking currently relies on FAISS only
)

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever, # hybrid retriever
    return_source_documents=True
)

# Step 7: Ask a question
#query = "What is tokenization in LLMs and why is it important?"
query = "Explain the structure of DNA."
result = qa_chain(query)

# Step 8: Show results
print("\n Question: ", query)
print("\n Answer:")
print(result["result"])

print("\n Source Documents:")
for i, doc in enumerate(result["source_documents"], 1):
    print(f"\n--- Source {i} ---")
    print(doc.page_content)


Device set to use cpu
Token indices sequence length is longer than the specified maximum sequence length for this model (758 > 512). Running this sequence through the model will result in indexing errors



 Question:  Explain the structure of DNA.

 Answer:
DNA is a ribonucleotide (ribonucleotide), a ribonucleotide (ribonucleotide), a ribonucleotide (ribonucleotide), a ribonucleotide (ribonucleotide), a ribonucleotide (ribonucleotide), a ribonucleotide (ribonucleotide), a ribonucleotide (ribonucleotide), a ribonucleotide (ribonucleotide), a ribonucleotide (ribonucleotide), a ribonucleotide (ribonucleotide), a ribonucleotide (ribonucleotide), a ribonucleotide (ribonucleotide), a ribonucleotide (ribonucleotide), a ribonu

 Source Documents:

--- Source 1 ---
Evolutionary

--- Source 2 ---
In English grammar, "a", and its variant "an", are indefinite articles.

History 

The earliest known certain ancestor of "A" is aleph (also written 'aleph), the first letter of the Phoenician alphabet, which consisted entirely of consonants (for that reason, it is also called an abjad to distinguish it from a true alphabet). In turn, the ancestor of aleph may have been a pictogram of an ox head in proto

- At first I used only the FAISS retriever. I then tried to improve performance with hybrid search in LangChain by adding a BM25 retriever alongside it.
- For chunking, I first used a chunk size of 500 with an overlap of 100 and received no meaningful answers. I then changed to a chunk size of 1000 with an overlap of 200, which significantly increased computation time.
- At first I used "flan-t5-small", and the answers were not correct (hallucinated). I then switched to "flan-t5-base": it is slower, but the answers were correct.
- Other adjustments I need to try:
    - different chunk sizes and overlaps
    - other, more powerful models
    - a different balance between BM25 and FAISS; currently the weight is 0 for BM25 and 1 (100%) for FAISS
    - different embedding models, if possible; currently "all-MiniLM-L6-v2" is used
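
To see how the BM25/FAISS weights change the final ranking, here is a minimal stand-alone sketch of weighted reciprocal rank fusion, which is, as far as I understand, the kind of scheme `EnsembleRetriever` uses to merge ranked lists. It is a simplified illustration, not the library's code: the function `weighted_rrf`, the constant `c=60`, and the document ids are assumptions made for this example.

```python
from collections import defaultdict

def weighted_rrf(rankings, weights, c=60):
    """Fuse ranked lists: each list adds weight / (rank + c) to a document's score."""
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (rank + c)
    # Highest fused score first; ties fall back to the document id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

bm25_ranking = ["doc_a", "doc_b", "doc_c"]   # hypothetical keyword results
faiss_ranking = ["doc_c", "doc_a", "doc_d"]  # hypothetical embedding results

faiss_only = weighted_rrf([bm25_ranking, faiss_ranking], weights=[0.0, 1.0])
balanced = weighted_rrf([bm25_ranking, faiss_ranking], weights=[0.5, 0.5])
print("weights [0, 1]:    ", faiss_only)
print("weights [0.5, 0.5]:", balanced)
```

With weights `[0, 1]` the order follows the FAISS list, and a document found only by BM25 (`doc_b`) lands last with a score of zero. With equal weights, documents that both retrievers rank highly move to the top.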

# Required Format
Convert your Jupyter Notebook or Python script into a single, clean PDF or HTML document file. Be sure to label each section clearly and ensure that the outputs are properly visible in the document.