## **Title of Project: GenAI Powered Adaptive Question Paper Generator**

# **Group Members:**
Suryansh Ambekar (202201090042)

Kaustubh Mahajan (202201070128)

Ayush Fating (202201070127)

Anom Nanda (202201060049)

## Introduction

Question generation is an important task in natural language processing because it helps automate assessment, build study materials, and support learning systems. Instead of manually writing questions from long documents or chapters, a trained model can read the content and produce meaningful questions on its own. In this project we fine-tune a T5 model on the SQuAD dataset, then combine it with a retrieval system to generate questions from unseen documents such as PDF notes. The goal is to generate context-based questions that are relevant to the text rather than random or generic ones. To improve relevance, we use sentence embeddings and FAISS search so that the model receives related passages as additional context before generating each question.

---

### Dataset Description

We use the SQuAD 2.0 dataset. It is a widely used benchmark dataset for reading comprehension. The dataset contains passages taken from Wikipedia articles. Each passage has multiple questions related to the content. Some questions have valid answers, while others are marked as unanswerable. For this project we only use the answerable questions. Each training sample has two parts:

1. context – a paragraph taken from the article

2. question – a human written question about the context

The dataset is provided in JSON format. It has a nested structure:

* A list of articles

* Each article has multiple paragraphs

* Each paragraph has the context text

* Each paragraph contains a list of question–answer pairs

We ignore the answers and keep only the contexts and questions, because the goal is question generation rather than answer prediction. To reduce training time, we train on the first 8000 of the 86,821 answerable training samples.
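
The nested layout described above can be pictured with a tiny hand-written sample. The passage and questions below are invented for illustration and are not taken from SQuAD; only the key names (`data`, `paragraphs`, `context`, `qas`, `question`, `is_impossible`) follow the SQuAD 2.0 file layout, and the answer fields are left out for brevity. `answerable_pairs` is a hypothetical helper name.

```python
# A made-up miniature record in the SQuAD 2.0 layout (illustrative content only).
sample = {
    "data": [
        {
            "title": "Example article",
            "paragraphs": [
                {
                    "context": "FAISS is a library for similarity search.",
                    "qas": [
                        {"question": "What is FAISS used for?", "is_impossible": False},
                        {"question": "Who funds FAISS?", "is_impossible": True},
                    ],
                }
            ],
        }
    ]
}


def answerable_pairs(squad_dict):
    """Collect (context, question) pairs, skipping unanswerable questions."""
    pairs = []
    for article in squad_dict["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                if not qa["is_impossible"]:
                    pairs.append((paragraph["context"], qa["question"]))
    return pairs


pairs = answerable_pairs(sample)
print(pairs)
```

Only the first question survives the filter, because the second one is flagged as unanswerable.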

---

## Methodology

The project follows a step-by-step pipeline.

1. Fine tuning a question generation model

We use the model `google/flan-t5-small` because it is small and therefore fast to fine-tune. We prepend the prefix `generate question:` to each context so the model understands the task, and use the question text as the target label. We tokenize both input and output, pad or truncate them to a fixed length, and replace padding tokens in the labels with `-100` so they are ignored during loss calculation. The model is trained for 3 epochs using the AdamW optimizer.

2. Building retrieval using semantic embeddings

Raw text strings cannot be compared directly for similarity in meaning. To improve relevance, we use a sentence transformer model (`all-MiniLM-L6-v2`) to convert each context into a numerical vector called an embedding. Contexts with similar meaning have embeddings that lie close to each other. We store all embeddings in a FAISS index, which allows fast nearest-neighbor search. During retrieval, we convert a query into an embedding and search the index for the most closely related contexts.

3. Working with external PDF content

To apply the system to real documents, we read a PDF file and extract the text page by page. The full text is split into smaller chunks so the model can handle them. Each chunk is encoded into an embedding and stored in a separate FAISS index. When generating questions from a PDF, we first retrieve similar chunks to build a larger and more informative input context. This approach is known as Retrieval-Augmented Generation (RAG). It helps the T5 model produce better questions even for documents it has never seen before.

4. Question generation

For each chunk, we combine it with the similar chunks retrieved by FAISS and pass the merged text into the fine-tuned T5 model, which generates a question from the content. This loop continues until the required number of questions is reached. The output is a numbered list of questions for the user.
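
The control flow of step 4 can be sketched without any model. In the sketch below, `toy_retrieve` and `toy_generate` are hypothetical stand-ins (word overlap instead of FAISS, a fixed template instead of T5), used only to show how chunks are merged and when the loop stops. It is not the notebook's actual implementation.

```python
def toy_retrieve(chunk, chunks, top_k=1):
    """Stand-in for FAISS: rank the other chunks by shared lowercase words."""
    words = set(chunk.lower().split())
    others = [c for c in chunks if c != chunk]
    others.sort(key=lambda c: len(words & set(c.lower().split())), reverse=True)
    return others[:top_k]


def toy_generate(context):
    """Stand-in for the fine-tuned T5 model: returns a templated question."""
    first_word = context.split()[0]
    return f"What does the passage say about {first_word}?"


def generate_n_questions(chunks, n_questions, top_k=1):
    questions = []
    for chunk in chunks:
        # Merge the chunk with its retrieved neighbours into one context.
        merged = "\n".join([chunk] + toy_retrieve(chunk, chunks, top_k))
        questions.append(toy_generate(merged))
        if len(questions) >= n_questions:
            break  # stop as soon as the requested count is reached
    return questions


chunks = [
    "Trees split data on feature thresholds.",
    "Forests combine many trees.",
    "Cross-validation estimates model performance.",
]
result = generate_n_questions(chunks, n_questions=2)
print(result)
```

If more questions are requested than there are chunks, the loop simply ends after the last chunk and returns fewer questions.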

---

## Section 1: Install & Import Libraries
Install the required libraries:

- `kagglehub` - to download datasets from Kaggle
- `datasets`, `transformers`, `accelerate`, `sentencepiece` - for NLP and the T5 model
- `faiss-cpu` - for similarity search
- `sentence-transformers` - to create vector embeddings of text
- `pypdf2` - to read PDF files
- `tqdm` - to show progress bars


In [None]:

!pip install kagglehub datasets transformers accelerate sentencepiece faiss-cpu sentence-transformers pypdf2 tqdm



import os
import json
import torch
from torch.utils.data import Dataset, DataLoader
from tqdm.auto import tqdm

from transformers import T5Tokenizer, T5ForConditionalGeneration
from torch.optim import AdamW

from sentence_transformers import SentenceTransformer
import faiss
import numpy as np
from PyPDF2 import PdfReader




## Section 2 - Download the Dataset

In this part we:
- Use `kagglehub` to download the SQuAD 2.0 dataset.
- Load the training and development JSON files, which contain articles, paragraphs, questions, and answers.


In [None]:
import kagglehub

path = kagglehub.dataset_download("buildformacarov/squad-20")
print("Dataset Path:", path)

train_path = os.path.join(path, "train-v2.0.json")
dev_path   = os.path.join(path, "dev-v2.0.json")

with open(train_path, "r") as f:
    train_data = json.load(f)

with open(dev_path, "r") as f:
    dev_data = json.load(f)


Dataset Path: /root/.cache/kagglehub/datasets/buildformacarov/squad-20/versions/1


## Section 3 - Extract Context and Question Pairs

The SQuAD file has a nested structure:
- data → articles
- each article has many paragraphs
- each paragraph has:
  - `context` (the passage)
  - `qas` (list of question answer pairs)

We only want:
- context text
- question text
- and only for questions that are answerable (`is_impossible` is `False`).

We store them in two lists:
- `contexts`
- `questions`


In [None]:
def extract_pairs(data):
    contexts = []
    questions = []
    for article in data["data"]:
        for p in article["paragraphs"]:
            context = p["context"]
            for qa in p["qas"]:
                if not qa["is_impossible"]:
                    contexts.append(context)
                    questions.append(qa["question"])
    return contexts, questions

train_contexts, train_questions = extract_pairs(train_data)
dev_contexts, dev_questions     = extract_pairs(dev_data)

print("Train samples:", len(train_contexts))
subset = 8000
train_contexts = train_contexts[:subset]
train_questions = train_questions[:subset]


Train samples: 86821


## Section 4 - Load T5 Model

We use `google/flan-t5-small`:
- It is a small, instruction-tuned variant of T5 that is quick to fine-tune on a single GPU.
- We load the tokenizer and the model.
- Then we move the model to the GPU (device "cuda") to speed up training.


In [None]:
model_name = "google/flan-t5-small"

tokenizer = T5Tokenizer.from_pretrained(model_name)
model     = T5ForConditionalGeneration.from_pretrained(model_name)
model.to("cuda")




## Section 5 - Create PyTorch Dataset and DataLoader

We create a custom Dataset class:
- Input to the model is: `"generate question: " + context`
- Target (label) is: the real question text
- We tokenize both input and label
- We pad or truncate sequences to a fixed length so that they can be batched
- We replace pad tokens in labels with `-100` so that loss is not computed on pad positions

Then we create:
- `train_dl` DataLoader for training
- `dev_dl` DataLoader for validation (not used in this notebook; kept for future work)


In [None]:
class QGDataset(Dataset):
    def __init__(self, contexts, questions, tokenizer, max_len=256):
        self.contexts = contexts
        self.questions = questions
        self.tokenizer = tokenizer
        self.max_len  = max_len

    def __len__(self):
        return len(self.contexts)

    def __getitem__(self, idx):
        prompt = "generate question: " + self.contexts[idx]

        inputs = self.tokenizer(
            prompt,
            truncation=True,
            padding="max_length",
            max_length=self.max_len,
            return_tensors="pt"
        )

        labels = self.tokenizer(
            self.questions[idx],
            truncation=True,
            padding="max_length",
            max_length=64,
            return_tensors="pt"
        )["input_ids"]

        labels[labels == self.tokenizer.pad_token_id] = -100

        return {
            "input_ids": inputs["input_ids"].squeeze(),
            "attention_mask": inputs["attention_mask"].squeeze(),
            "labels": labels.squeeze()
        }

train_ds = QGDataset(train_contexts, train_questions, tokenizer)
dev_ds   = QGDataset(dev_contexts,   dev_questions,   tokenizer)

train_dl = DataLoader(train_ds, batch_size=4, shuffle=True)
dev_dl   = DataLoader(dev_ds, batch_size=4)
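
To see what the `-100` replacement in the Dataset class achieves, the following pure-Python sketch mimics a cross-entropy loss that skips an ignore index (PyTorch's `CrossEntropyLoss` uses `ignore_index=-100` by default). The helper names `mask_padding` and `masked_cross_entropy` and the toy logits are assumptions made for this illustration. The pad id `0` matches the T5 tokenizer.

```python
import math

IGNORE_INDEX = -100  # label value skipped by the loss


def mask_padding(label_ids, pad_token_id):
    """Replace padding ids with IGNORE_INDEX, as the Dataset class does."""
    return [IGNORE_INDEX if t == pad_token_id else t for t in label_ids]


def masked_cross_entropy(logits, labels):
    """Mean negative log-likelihood over positions whose label is not ignored."""
    total, count = 0.0, 0
    for row, label in zip(logits, labels):
        if label == IGNORE_INDEX:
            continue  # padded position: contributes nothing to the loss
        top = max(row)
        # Numerically stable log of the softmax denominator.
        log_z = top + math.log(sum(math.exp(x - top) for x in row))
        total += log_z - row[label]
        count += 1
    return total / count if count else 0.0


labels = mask_padding([2, 1, 0, 0], pad_token_id=0)

# Same logits at the two real positions, different logits at the padded ones.
logits_a = [[0.0, 0.0, 0.0]] * 4
logits_b = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [9.0, -3.0, 1.0], [-5.0, 2.0, 7.0]]

loss_a = masked_cross_entropy(logits_a, labels)
loss_b = masked_cross_entropy(logits_b, labels)
print(labels, loss_a, loss_b)
```

Both losses are identical because the predictions at padded positions never enter the average, so the model is not rewarded or penalised for what it outputs after the end of the question.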


## Section 6 - Train the Model

We now fine-tune the T5 model:

Steps in each epoch:
1. Put model in train mode.
2. Loop over training batches.
3. Move batch tensors to GPU.
4. Call model on the batch to get `loss`.
5. Backpropagate the loss and update the weights using AdamW optimizer.
6. Track and print average loss.

We train for only 3 epochs to save time.


In [None]:
optimizer = AdamW(model.parameters(), lr=2e-4)

epochs = 3

for epoch in range(epochs):
    model.train()
    total_loss = 0

    pbar = tqdm(train_dl, desc=f"Epoch {epoch+1}/{epochs}")

    for batch in pbar:
        batch = {k: v.to("cuda") for k, v in batch.items()}

        outputs = model(**batch)
        loss = outputs.loss

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        total_loss += loss.item()
        pbar.set_postfix({"loss": loss.item()})

    print(f"=== Epoch {epoch+1}/{epochs} | Avg Loss: {total_loss/len(train_dl):.4f} ===")


=== Epoch 1/3 | Avg Loss: 2.1144 ===
=== Epoch 2/3 | Avg Loss: 1.8024 ===
=== Epoch 3/3 | Avg Loss: 1.6006 ===


## Section 7 - Save the Fine Tuned Model

After training, we save:
- Model weights
- Tokenizer

This allows us to load and use the trained model later without retraining.


In [None]:
model.save_pretrained("qg_t5_model")
tokenizer.save_pretrained("qg_t5_model")


('qg_t5_model/tokenizer_config.json',
 'qg_t5_model/special_tokens_map.json',
 'qg_t5_model/spiece.model',
 'qg_t5_model/added_tokens.json')

In [None]:
import os

file_names = os.listdir('qg_t5_model')
print(file_names)

['spiece.model', 'special_tokens_map.json', 'model.safetensors', 'config.json', 'tokenizer_config.json', 'added_tokens.json', 'generation_config.json']


In [None]:
from google.colab import files

files.download('/content/qg_t5_model/model.safetensors')


## Section 8 - Question Generation Function

We now define a helper function `generate_question`:

- Takes a context string.
- Adds the prefix `"generate question:"`.
- Encodes it using the tokenizer.
- Uses `model.generate` to create a question.
- Decodes the output tokens to text.


In [None]:
def generate_question(context, max_len=50):
    input_text = "generate question: " + context

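    tokens = tokenizer.encode(input_text, return_tensors="pt").to("cuda")

    # The training loop leaves the model in train mode; switch to eval mode
    # so that dropout is disabled during generation.
    model.eval()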

    out = model.generate(
        tokens,
        max_length=max_len,
        num_beams=4,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
    )

    return tokenizer.decode(out[0], skip_special_tokens=True)


## Section 9 - RAG Embeddings using SentenceTransformer and FAISS

We now create a retrieval layer:

- Use `all-MiniLM-L6-v2` to convert each training context into a vector embedding.
- Store all embeddings in a FAISS index for fast similarity search.
- Later we can retrieve the most similar contexts for any query.


In [None]:
print("Building embeddings. This will take a few minutes...")

embedder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedder.encode(train_contexts, convert_to_numpy=True, show_progress_bar=True)

dimension = embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(embeddings)

print("FAISS index built:", index.ntotal)


Building embeddings. This will take a few minutes...


FAISS index built: 8000


## Section 10 - Retrieval Function for SQuAD Contexts

We define `retrieve_context`:

- Takes a text query.
- Converts it to an embedding.
- Searches in the FAISS index.
- Returns top `k` most similar contexts.


In [None]:
def retrieve_context(query, top_k=3):
    q_emb = embedder.encode([query], convert_to_numpy=True)
    D, I = index.search(q_emb, top_k)
    return [train_contexts[i] for i in I[0]]
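
`index.search` in the function above returns two arrays: distances `D` and indices `I`. `faiss.IndexFlatL2` obtains them by exact brute-force comparison and reports squared Euclidean distances. The pure-Python sketch below reproduces that idea on hand-made 2-D vectors. `flat_l2_search` is a hypothetical helper, and the vectors are invented rather than real MiniLM embeddings, which have 384 dimensions.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def flat_l2_search(vectors, query, top_k):
    """Return (distances, indices) of the top_k vectors closest to query."""
    scored = sorted((squared_l2(v, query), i) for i, v in enumerate(vectors))
    top = scored[:top_k]
    return [d for d, _ in top], [i for _, i in top]


vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [5.0, 5.0]]
distances, indices = flat_l2_search(vectors, query=[1.0, 1.0], top_k=2)
print(distances, indices)
```

The indices play the role of `I[0]` in `retrieve_context`: they are positions in the original list, which is why the function can look the texts up with `train_contexts[i]`.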


## Section 11 and 12 - PDF Text Extraction and Chunking

Now we want to use our system on a PDF file.

Steps:
1. Read a PDF file using `PyPDF2`.
2. Extract text from each page.
3. Split the full text into chunks of fixed size (words per chunk).
4. Each chunk will be treated as a small context.

We will build a separate FAISS index only for the PDF chunks.


In [None]:
from PyPDF2 import PdfReader

def extract_pdf_text(pdf_path):
    reader = PdfReader(pdf_path)
    text = ""
    for page in reader.pages:
        txt = page.extract_text()
        if txt:
            text += txt + "\n"
    return text


In [None]:
def chunk_text(text, chunk_size=300):
    """
    Break text into chunks of at most chunk_size words.
    The split is by word count only, so a chunk boundary may fall mid-sentence.
    """
    chunks = []
    words = text.split()
    for i in range(0, len(words), chunk_size):
        chunk = " ".join(words[i:i+chunk_size])
        chunks.append(chunk.strip())
    return chunks
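
Fixed-size splitting can cut a sentence in half at a chunk boundary. One common mitigation, which this notebook does not use, is to let consecutive chunks overlap by a few words so that text near a boundary appears in both. The function name `chunk_text_with_overlap` and its `overlap` parameter are hypothetical names for this sketch.

```python
def chunk_text_with_overlap(text, chunk_size=300, overlap=50):
    """Split text into chunks of up to chunk_size words that share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks


demo = " ".join(f"w{i}" for i in range(10))
demo_chunks = chunk_text_with_overlap(demo, chunk_size=4, overlap=1)
print(demo_chunks)
```

With `overlap=0` the function produces the same chunks as `chunk_text`. Larger overlaps produce more chunks, so the embedding step takes proportionally longer.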


## Section 13: Build Embeddings and FAISS Index from PDF Chunks

In this part:
- We read and chunk the PDF.
- We compute embeddings for each chunk.
- We build a new FAISS index `pdf_index` that only contains these PDF chunks.


In [None]:
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

# Change this path to the PDF you want to index
pdf_path = "/content/week1to10.pdf"  # or "/content/week51to54.pdf"

print("Reading and chunking PDF...")
pdf_text = extract_pdf_text(pdf_path)
pdf_chunks = chunk_text(pdf_text, chunk_size=300)

print(f"Total PDF chunks: {len(pdf_chunks)}")

print("Building embeddings from PDF chunks... This will take a bit.")
embedder = SentenceTransformer("all-MiniLM-L6-v2")

pdf_embeddings = embedder.encode(
    pdf_chunks,
    convert_to_numpy=True,
    show_progress_bar=True
)

dimension = pdf_embeddings.shape[1]
pdf_index = faiss.IndexFlatL2(dimension)
pdf_index.add(pdf_embeddings)

print("PDF FAISS index built with", pdf_index.ntotal, "entries.")


Reading and chunking PDF...
Total PDF chunks: 26
Building embeddings from PDF chunks... This will take a bit.


PDF FAISS index built with 26 entries.


## Section 14: Retrieval from PDF and Question Generation

We now connect everything:

1. `retrieve_similar_chunks`  
   - Given a chunk, find similar chunks from the same PDF.

2. `generate_questions_from_pdf`  
   - For each chunk in the PDF:
     - Retrieve similar chunks to add more context.
     - Merge them into one larger context.
     - Pass this merged context to our trained T5 model.
     - Generate questions.
   - Stop when we reach the required number of questions.


In [None]:
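def retrieve_similar_chunks(chunk, top_k=2):
    """
    Given one chunk of text, find up to top_k similar chunks from the same PDF,
    excluding the query chunk itself.
    """
    q_emb = embedder.encode([chunk], convert_to_numpy=True)
    # Ask for one extra neighbour: the nearest hit is normally the chunk itself.
    D, I = pdf_index.search(q_emb, top_k + 1)
    # FAISS pads missing results with -1 when the index has too few entries.
    neighbours = [pdf_chunks[i] for i in I[0] if i != -1 and pdf_chunks[i] != chunk]
    return neighbours[:top_k]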


def generate_questions_from_pdf(n_questions=10, top_k=2, questions_per_chunk=1):
    """
    Go through PDF chunks, use RAG to add similar context,
    and generate questions using the fine-tuned T5 model.
    """
    generated = []

    for ch in pdf_chunks:
        # retrieve similar chunks from the same PDF
        similar_chunks = retrieve_similar_chunks(ch, top_k=top_k)
        merged_context = ch + "\n" + "\n".join(similar_chunks)

        # generate multiple questions per chunk if needed
        for _ in range(questions_per_chunk):
            q = generate_question(merged_context)
            generated.append(q)

            if len(generated) >= n_questions:
                return generated

    return generated


## Section 15: Ask User for PDF and Number of Questions

Finally, we:
- Ask the user for the PDF path.
- Ask how many questions to generate.
- Rebuild the PDF embeddings for this PDF.
- Call `generate_questions_from_pdf`.
- Print the questions with serial numbers.


In [None]:
pdf_path = input("Enter the PDF file path: ")
num_q = int(input("How many questions do you want to generate? "))

# Step 1 — Read and chunk PDF
pdf_text = extract_pdf_text(pdf_path)
pdf_chunks = chunk_text(pdf_text, chunk_size=300)

# Step 2 — Build embeddings on PDF chunks
print("Creating PDF embeddings. Please wait...")
pdf_embeddings = embedder.encode(
    pdf_chunks,
    convert_to_numpy=True,
    show_progress_bar=True
)

dimension = pdf_embeddings.shape[1]
pdf_index = faiss.IndexFlatL2(dimension)
pdf_index.add(pdf_embeddings)

# Step 3 — Generate questions
questions = generate_questions_from_pdf(
    n_questions=num_q,
    top_k=2,
    questions_per_chunk=1
)

print("\n===== Generated Questions =====\n")
for i, q in enumerate(questions, start=1):
    print(f"{i}. {q}")


Enter the PDF file path: /content/week11to30.pdf
How many questions do you want to generate? 25
Creating PDF embeddings. Please wait...



===== Generated Questions =====

1. The CART Algorithm for Classification is an ensemble learning method that combines multiple decision trees to improve what?
2. What is a common method for building robust and reliable classification models?
3. The library provides the class for building decision tree models Key parameters to tune in the tree include what?s?
4. What is a common technique to estimate the model's performance?
5. The tree structure provides a visual representation of the decision - making process , with binary splits at each node The size of the nodes and the proportion of each class label indicate what?
6. What is a common method to estimate the model's performance?
7. What kind of model is used to predict credit risk prediction?
8. What type of model is used for credit risk prediction?
9. What is a method used to estimate the model's performance?
10. Classification and Regression Trees are a tree-like model that recursively partition the data based on what feature?
11