<a href="https://colab.research.google.com/github/Titilegend/RAG-Powered-Q-A-System-for-Duolingo-English-Test-Documentation/blob/main/RAG_Powered_Q%26A_System_for_Duolingo_English_Test_Documentation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**RAG-Powered Q&A System for Duolingo English Test Documentation**

This notebook implements a Retrieval-Augmented Generation (RAG) system designed to answer domain-specific questions using official Duolingo English Test (DET) documentation. The system ingests PDF documents, processes and chunks the text, generates semantic embeddings, and stores them in a FAISS vector database to enable efficient similarity-based retrieval.

When a user submits a question, the system retrieves the most relevant document chunks and generates a response grounded solely in the retrieved context. Grounding answers this way is intended to improve factual accuracy and reduce hallucination compared with a standalone language model.

The project demonstrates the following components:

*   Document loading and preprocessing
*   Text chunking for semantic search
*   Embedding generation using sentence-transformers
*   Vector storage and similarity search with FAISS
*   Context-grounded question answering

This implementation showcases how Retrieval-Augmented Generation can be applied to structured exam documentation to build an intelligent, domain-aware assistant.
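
To picture the retrieval step, the sketch below ranks a few stored vectors by their similarity to a query vector and keeps the top `k`. It is a toy illustration only: the 3-dimensional vectors, the chunk names, and the helper names `cosine_similarity` and `top_k` are all made up for this example, and cosine similarity is used for readability rather than as a description of how FAISS scores or indexes vectors internally.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=4):
    """Return the ids of the k stored vectors most similar to query_vec."""
    scored = [
        (cosine_similarity(query_vec, vec), chunk_id)
        for chunk_id, vec in chunk_vecs.items()
    ]
    # Highest similarity first
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk_id for _, chunk_id in scored[:k]]


# Hypothetical "embeddings": real ones from all-MiniLM-L6-v2 have far more dimensions.
toy_index = {
    "test_length": [0.9, 0.1, 0.0],
    "scoring": [0.1, 0.9, 0.1],
    "id_requirements": [0.0, 0.2, 0.9],
}
toy_query = [0.8, 0.2, 0.1]

print(top_k(toy_query, toy_index, k=2))
```

A brute-force scan like this compares the query against every stored vector; a vector store exists to make the same kind of lookup practical over many chunks.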

In [1]:
!pip install -q \
langchain \
langchain-community \
faiss-cpu \
sentence-transformers \
pypdf \
requests==2.32.4

In [2]:
from pathlib import Path
from langchain_community.document_loaders import PyPDFLoader, TextLoader
# Newer LangChain releases ship the splitters in the langchain-text-splitters package
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

In [3]:
from google.colab import files
uploaded = files.upload()

Saving det-guide-2025.pdf to det-guide-2025.pdf
Saving GPN The DET Handbook.pdf to GPN The DET Handbook.pdf


In [4]:
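DATA_DIR = Path("/content")  # Colab working directory, where the uploads above are saved

def load_documents(data_dir: Path):
  """Load every .pdf, .txt, and .md file found directly under data_dir."""
  docs = []
  # sorted() keeps the load order stable between runs
  for fp in sorted(data_dir.glob("*")):
    suffix = fp.suffix.lower()
    if suffix == ".pdf":
      # PyPDFLoader returns one Document per PDF page
      docs.extend(PyPDFLoader(str(fp)).load())
    elif suffix in {".txt", ".md"}:
      docs.extend(TextLoader(str(fp), encoding="utf-8").load())
  return docs

raw_docs = load_documents(DATA_DIR)
print("Loaded docs:", len(raw_docs))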

Loaded docs: 151


In [6]:
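# Split pages into overlapping chunks; both sizes are measured in characters
splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=150,
)
chunks = splitter.split_documents(raw_docs)

# Embed each chunk and build an in-memory FAISS index over the vectors
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectorstore = FAISS.from_documents(chunks, embeddings)
# Return the 4 most similar chunks for each query
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

print("Chunks created:", len(chunks))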

Chunks created: 377
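
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sliding-window chunker. It is an illustration under stated assumptions, not the splitter's actual algorithm: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraph breaks, line breaks, and spaces, so its chunk boundaries (and the chunk count above) will differ. The function name `chunk_text` is hypothetical.

```python
def chunk_text(text, chunk_size=800, chunk_overlap=150):
    """Cut text into fixed-size character windows that overlap their neighbours."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


# 2,000 characters of repeating digits, so overlapping regions can be compared
sample = "".join(str(i % 10) for i in range(2000))
pieces = chunk_text(sample, chunk_size=800, chunk_overlap=150)

print(len(pieces), [len(p) for p in pieces])
# The last 150 characters of one chunk reappear at the start of the next
print(pieces[0][-150:] == pieces[1][:150])
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of storing some text twice.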


In [7]:
q = "What is the test structure and how long does the Duolingo English Test take?"
# invoke() replaces the deprecated get_relevant_documents()
hits = retriever.invoke(q)

for i, d in enumerate(hits, 1):
  src = Path(d.metadata.get("source", "")).name
  page = d.metadata.get("page", None)  # zero-based page index from PyPDFLoader
  print(f"\n--- Result {i} (source={src},page={page})---\n")
  print(d.page_content[:600])  # show at most the first 600 characters


--- Result 1 (source=GPN The DET Handbook.pdf,page=5)---

Test Structure
Once you begin the test, you will
be guided through three stages:
Test Length
You will need about one hour of uninterrupted time to take the Duolingo English Test.
In this part, you will learn about the length and structure of the test. You will also learn about 
computer adaptive testing and how it aﬀects your test experience.
Section 1: Introduction and onboarding ~5 minutes
You will complete a tech check (to conﬁrm your computer's camera, speakers, and microphone are working 
properly), review the test rules and requirements, submit your ID, and set up your secondary camera.


--- Result 2 (source=GPN The DET Handbook.pdf,page=0)---

The Handbook
Everything you need to know to achieve
test readiness for the Duolingo English Test
0:35
RECORD NOW

--- Result 3 (source=GPN The DET Handbook.pdf,page=5)---

Section 2: Adaptive test ~45 minutes
Unlike other tests, the Duolingo English Test is not divided into distin
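
The retrieved passages are the input to the context-grounded answering step described in the introduction. One way to prepare them, sketched here as an assumption rather than as this project's implementation, is to assemble a prompt that labels each passage with its source and instructs the model to answer only from that context. The helper name `build_grounded_prompt`, the prompt wording, and the `(source, page, text)` tuple layout are all hypothetical; no language model is called.

```python
from pathlib import Path


def build_grounded_prompt(question, passages, max_chars_per_passage=600):
    """Format (source, page, text) tuples into a context-only prompt string."""
    blocks = []
    for i, (source, page, text) in enumerate(passages, 1):
        # Label each passage so an answer can point back to its source
        label = f"[{i}] {Path(source).name}, page {page}"
        blocks.append(f"{label}\n{text[:max_chars_per_passage].strip()}")
    context = "\n\n".join(blocks)
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Example passage taken from the first retrieval result shown above
example_passages = [
    (
        "/content/GPN The DET Handbook.pdf",
        5,
        "You will need about one hour of uninterrupted time "
        "to take the Duolingo English Test.",
    ),
]
prompt = build_grounded_prompt(
    "How long does the Duolingo English Test take?", example_passages
)
print(prompt)
```

With LangChain documents, the tuples could be built as `[(d.metadata.get("source", ""), d.metadata.get("page"), d.page_content) for d in hits]`, and the resulting string passed to whichever language model is chosen for generation.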



In [10]:
import json

def clean_notebook(filepath):
    """Remove the top-level "widgets" metadata from a notebook file, in place."""
    with open(filepath, "r", encoding="utf-8") as f:
        notebook = json.load(f)
    # pop() with a default does nothing when the key is absent
    notebook.get("metadata", {}).pop("widgets", None)
    with open(filepath, "w", encoding="utf-8") as f:
        json.dump(notebook, f, indent=2)

# Usage
# clean_notebook("your_notebook_name.ipynb")
