### 🧩 **Overview of pipeline**



1. Load PDF → docs (like ingestion).
2. Chunk docs → chunks (like DocumentChunking).
3. Combine chunk text into a single context string.
4. An LLM generates questions from the text.
5. A QA model extracts answers from the original context.
6. We assemble `Flashcard(question, answer)` objects.

# **Install dependencies**




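This section sets up the foundations of the pipeline.
We install the libraries required for:
- loading PDFs (`pypdf`, `langchain-community`),
- chunking the text (`langchain-text-splitters`),
- running LLMs and performing extractive QA (`transformers`),
- and defining the structured output (`pydantic`).

`torch` is imported later but not installed here; the notebook assumes it is already available, as it is on Colab.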




In [1]:
!pip install -q transformers langchain-community pydantic langchain-text-splitters pypdf

ERROR: pip's dependency resolver

# **Imports & basic setup**


Here we import the tools used throughout the notebook:
- typing utilities,
- regex for light text cleaning,
- PyTorch for device and dtype selection,
- Pydantic for structured flashcard models,
- Transformers for running the models,
- LangChain's PDF loader and text splitter, which together form the ingestion/chunking backbone.


In [2]:
from typing import List
import re
from pathlib import Path

import torch
from pydantic import BaseModel, Field

from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    pipeline as hf_pipeline,
)

from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter


# **Flashcard schema**

The flashcard schema defines the exact structure of the output we want.
We keep it clean and minimal: a question and its corresponding answer.
The `Flashcards` class is just a container for a list of `Flashcard` objects.

In [3]:
class Flashcard(BaseModel):
    question: str = Field(description="The question side of the flashcard")
    answer: str = Field(description="The answer side of the flashcard")

class Flashcards(BaseModel):
    cards: List[Flashcard] = Field(
        description="List of flashcards with question/answer pairs"
    )
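
To see what the schema provides in practice, the sketch below builds a small deck by hand and round-trips it through JSON. It assumes pydantic v2 (`model_dump_json` and `model_validate_json`); on v1 the equivalents would be `.json()` and `.parse_raw()`. The sample question and answer are made up for illustration.

```python
from typing import List

from pydantic import BaseModel, Field


class Flashcard(BaseModel):
    question: str = Field(description="The question side of the flashcard")
    answer: str = Field(description="The answer side of the flashcard")


class Flashcards(BaseModel):
    cards: List[Flashcard] = Field(
        description="List of flashcards with question/answer pairs"
    )


# Illustrative data only; not taken from any PDF.
deck = Flashcards(
    cards=[
        Flashcard(
            question="What does PDF stand for?",
            answer="Portable Document Format",
        )
    ]
)

payload = deck.model_dump_json()                    # serialise to a JSON string (pydantic v2)
restored = Flashcards.model_validate_json(payload)  # parse and validate it back
print(payload)
```

Validation also runs on construction, so a card with a missing `answer` raises a `ValidationError` instead of silently producing an incomplete flashcard.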


# **LLMs: Question generator + QA model**

Here we load two models:
1. A question-generation model (FLAN-T5)
   - This is responsible for synthesising exam-style questions.
2. An extractive QA model (DistilBERT fine-tuned on SQuAD)
   - This extracts answers from the PDF text itself.



In [4]:
# -------- Question generation LLM (via transformers) --------
QG_MODEL_NAME = "google/flan-t5-base"  # or "google/flan-t5-small"

device = 0 if torch.cuda.is_available() else -1

qg_tokenizer = AutoTokenizer.from_pretrained(QG_MODEL_NAME)
qg_model = AutoModelForSeq2SeqLM.from_pretrained(
    QG_MODEL_NAME,
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
)

qg_pipeline = hf_pipeline(
    "text2text-generation",
    model=qg_model,
    tokenizer=qg_tokenizer,
    device=device,
    max_new_tokens=256,
)


# -------- QA model to extract answers from context --------
QA_MODEL_NAME = "distilbert/distilbert-base-uncased-distilled-squad"  # lighter than BART

qa_pipeline = hf_pipeline(
    "question-answering",
    model=QA_MODEL_NAME,
    device=device,
)


`torch_dtype` is deprecated! Use `dtype` instead!

Device set to use cuda:0

Device set to use cuda:0


# **Load & chunk PDF**

**PDF ingestion**:
This step is the ingestion layer.
We rely on LangChain's `PyPDFLoader` to convert a PDF into `Document` objects (by default, one per page).

**Chunking**:
We use `RecursiveCharacterTextSplitter` to produce overlapping text chunks.
The chunking logic mimics a typical RAG preprocessing step, but here
the chunks are merged back together later for question generation.
Because consecutive chunks can overlap, the merged context may repeat the overlapping text.

In [5]:
def load_pdf(path: str):
    """Load a PDF into LangChain-style Document objects."""
    loader = PyPDFLoader(path)
    docs = loader.load()
    return docs


def chunk_documents(
    docs,
    chunk_size: int = 800,
    chunk_overlap: int = 150,
):
    """
    Use RecursiveCharacterTextSplitter to chunk documents.
    This roughly mimics the chunking in a Banking-RAG style pipeline.
    """
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        separators=["\n\n", "\n", ".", "!", "?", " "],
    )
    chunks = splitter.split_documents(docs)
    return chunks
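
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified fixed-window chunker in plain Python. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on the listed separators); `sliding_chunks` is a hypothetical helper that only illustrates how consecutive chunks share text.

```python
from typing import List


def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window starts after the previous one
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_chunks(sample, chunk_size=10, chunk_overlap=3)
for c in chunks:
    print(c)
```

Each chunk begins with the last three characters of the previous one. Joining the chunks back together therefore repeats those characters, so a merged string is longer than the source text.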


# **Question generation with LLM**

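Question generation:
Given a context string, FLAN-T5 attempts to produce several exam-style questions.
Only the first 1,500 characters of the context are passed to the model, so the questions cover the start of the document.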

We:
1. Build a prompt describing the expected behaviour.
2. Generate one sequence of candidate questions.
3. Clean and filter the lines so only valid questions remain.

This step is the "creative" part of the flashcard pipeline.
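
Step 3 can be tried on its own, without loading a model. The sketch below applies the same prefix-stripping regex and question filter used in `generate_questions_llm` to a few hand-written lines. `clean_question_lines` is a hypothetical helper name, and the sample lines are invented rather than real model output.

```python
import re
from typing import List

QUESTION_WORDS = ("what", "why", "how", "when", "who", "where", "which")


def clean_question_lines(raw: str) -> List[str]:
    """Strip list markers, keep question-like lines, drop exact duplicates."""
    questions: List[str] = []
    for line in raw.splitlines():
        # Remove leading bullets, digits, dots, closing parentheses and whitespace.
        cleaned = re.sub(r"^[-*\d\.\)\s]+", "", line).strip()
        if not cleaned:
            continue
        if cleaned.endswith("?") or cleaned.lower().startswith(QUESTION_WORDS):
            if cleaned not in questions:
                questions.append(cleaned)
    return questions


raw_output = """1. What is a merchant agreement?
- Why must the terms be read together?
Here are some questions
2) What is a merchant agreement?
"""
print(clean_question_lines(raw_output))
```

One side effect worth knowing: the character class also removes digits that belong to the question itself, so a line such as `2FA is what?` would be reduced to `FA is what?`.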

In [6]:
def generate_questions_llm(text: str, num_questions: int = 5) -> List[str]:
    """
    Use FLAN-T5 to generate exam-style questions from text.
    Returns a list of questions (strings).
    """
    # Truncate to stay within context limits
    max_chars = 1500
    if len(text) > max_chars:
        text = text[:max_chars]

    prompt = f"""
You are a helpful study assistant.

Given the following text, generate {num_questions} concise exam-style questions
that cover different key concepts from the text.

Rules:
- Return ONLY the questions, each on its own line.
- Do not number the questions.
- Do not include answers.
- Do not repeat the text.

Text:
\"\"\"{text}\"\"\"
"""

    outputs = qg_pipeline(
        prompt,
        num_return_sequences=1,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
    )

    raw = outputs[0]["generated_text"]
    lines = [ln.strip() for ln in raw.splitlines() if ln.strip()]

    # Clean numbering/bullets and filter to “question-like” lines
    questions: List[str] = []
    for ln in lines:
        cleaned = re.sub(r"^[-*\d\.\)\s]+", "", ln).strip()
        if not cleaned:
            continue
        if cleaned.endswith("?") or cleaned.lower().startswith(
            ("what", "why", "how", "when", "who", "where", "which")
        ):
            questions.append(cleaned)

    # Deduplicate & cap at num_questions
    seen = set()
    final_qs = []
    for q in questions:
        if q not in seen:
            seen.add(q)
            final_qs.append(q)
        if len(final_qs) >= num_questions:
            break

    return final_qs
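
`generate_questions_llm` cuts the input at a fixed 1500 characters, which can end mid-sentence. One possible refinement, shown here as a sketch, is to back up to the last sentence-ending punctuation mark inside the limit. `truncate_at_sentence` is a hypothetical helper, not part of the notebook.

```python
def truncate_at_sentence(text: str, max_chars: int = 1500) -> str:
    """Return at most max_chars characters, ending on '.', '!' or '?' when possible."""
    if len(text) <= max_chars:
        return text
    window = text[:max_chars]
    # Index of the last sentence terminator inside the window (-1 if there is none).
    cut = max(window.rfind("."), window.rfind("!"), window.rfind("?"))
    if cut == -1:
        return window  # no sentence boundary found: fall back to a hard cut
    return window[:cut + 1]


sample = "First sentence. Second sentence is longer. Third one"
print(truncate_at_sentence(sample, max_chars=30))
```

This treats every period as a sentence end, so abbreviations and decimal numbers can trigger an early cut; for a prototype that trade-off is probably acceptable.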


# **Answer questions using the QA model**

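Answer extraction:
Once questions are generated, we pass each question together with the full context
to the QA model. Because the model is extractive, every answer is a span copied
from the text. This keeps answers grounded in the source, though not necessarily correct.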

This mirrors a retrieval-augmented step,
but without needing a vector database for this simple notebook prototype.

In [7]:
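def answer_questions_with_qa(questions: List[str], context: str) -> List[Flashcard]:
    """
    Run the extractive QA model on each question against the same context.
    Returns one Flashcard per question that yields a non-empty answer.
    If the QA call raises, the error message is stored as the answer.
    """
    cards: List[Flashcard] = []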

    for q in questions:
        try:
            result = qa_pipeline(
                question=q,
                context=context,
            )
            answer = result.get("answer", "").strip()
        except Exception as e:
            answer = f"[ERROR answering question: {e}]"

        if answer:
            cards.append(Flashcard(question=q, answer=answer))

    return cards
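
The `question-answering` pipeline returns a dict with `score`, `start`, `end` and `answer` for each question. The function above uses only `answer`; a possible extension, sketched here, is to drop low-confidence extractions. `keep_confident` and the 0.3 threshold are assumptions for illustration, and the sample results are hand-written rather than real model output.

```python
from typing import Dict, List, Tuple


def keep_confident(
    qa_results: List[Tuple[str, Dict]],
    min_score: float = 0.3,
) -> List[Tuple[str, str]]:
    """Keep (question, answer) pairs whose QA score is at least min_score."""
    kept = []
    for question, result in qa_results:
        answer = result.get("answer", "").strip()
        if answer and result.get("score", 0.0) >= min_score:
            kept.append((question, answer))
    return kept


# Hand-written stand-ins shaped like the pipeline's output dicts.
fake_results = [
    ("What is X?", {"score": 0.82, "start": 10, "end": 27, "answer": "a payment service"}),
    ("Why is Y?", {"score": 0.04, "start": 40, "end": 51, "answer": "obligations"}),
]
print(keep_confident(fake_results))
```

A suitable threshold depends on the model and the documents, so it is best chosen by inspecting scores on a few real questions.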


# **Full “PDF → Flashcards” pipeline**

This ties together the entire workflow:
1. Load PDF.
2. Chunk the document.
3. Merge chunk text into a single context string.
4. Generate a set of exam-style questions.
5. Use extractive QA to obtain grounded answers.
6. Package everything into a Flashcards Pydantic model.

In [8]:
def generate_flashcards_from_pdf(
    pdf_path: str,
    num_cards: int = 5,
) -> Flashcards:
    """
    End-to-end:
    - Load PDF
    - Chunk documents
    - Combine chunk text into a single context
    - Use LLM to create questions
    - Use QA model to answer from the context
    """
    pdf_path = str(pdf_path)
    docs = load_pdf(pdf_path)
    chunks = chunk_documents(docs)

    # Combine chunk texts into one context string
    context = "\n\n".join(c.page_content for c in chunks)

    # 1) generate questions
    questions = generate_questions_llm(context, num_questions=num_cards)

    if not questions:
        # Fallback: build generic questions from the first few sentences
        fallback_sentences = context.split(".")[:num_cards]
        questions = [f"What is the key idea of: {s.strip()}?" for s in fallback_sentences if s.strip()]

    # 2) answer questions from the context
    cards = answer_questions_with_qa(questions, context=context)

    return Flashcards(cards=cards)
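
When the model returns no usable questions, the fallback wraps raw sentences in a template. Because PDF text keeps its line breaks, those questions can span several lines. A possible variant, sketched below, collapses whitespace first. `fallback_questions` is a hypothetical helper and the sample text is invented.

```python
from typing import List


def fallback_questions(context: str, num_cards: int = 5) -> List[str]:
    """Build generic questions from the first non-empty sentences, whitespace collapsed."""
    questions: List[str] = []
    for sentence in context.split("."):
        flat = " ".join(sentence.split())  # collapse newlines and runs of spaces
        if flat:
            questions.append(f"What is the key idea of: {flat}?")
        if len(questions) >= num_cards:
            break
    return questions


sample = "The Bank will provide \nAcquiring Services.  \n\nFees are listed\nin Schedule A. "
for q in fallback_questions(sample, num_cards=2):
    print(q)
```

Unlike the notebook's version, this sketch keeps going until it has `num_cards` non-empty sentences, rather than slicing the first `num_cards` pieces and then discarding the empty ones.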


File upload:
This lets you upload a PDF in Colab and immediately run the pipeline.
It plays the same role as receiving a file upload in a FastAPI endpoint.

In [9]:
from google.colab import files

uploaded = files.upload()
uploaded  # pick a PDF and note its filename


Saving FNB-MS-General-Terms-and-Conditions-23022022.pdf to FNB-MS-General-Terms-and-Conditions-23022022 (1).pdf



In [10]:
pdf_file = list(uploaded.keys())[0]  # just take the first uploaded file
flashcards_struct = generate_flashcards_from_pdf(pdf_file, num_cards=5)

for i, card in enumerate(flashcards_struct.cards, start=1):
    print(f"\nFlashcard {i}")
    print("Q:", card.question)
    print("A:", card.answer)



Flashcard 1
Q: What is the key idea of: FNB Merchant Services General Terms and Conditions 
Date last amended: 1 May 2025 
 
The Bank will provide You with Acquiring Services to enable You to accept Payment Instruments from Your 
Customers to pay for goods and/or services?
A: governing the relationship between the Parties in relation to the Acquiring 
Services and products

Flashcard 2
Q: What is the key idea of: The General Terms and Conditions form part of Your Merchant 
Agreement and must be read in conjunction with t he remaining Terms and Conditions of Your Merchant 
Agreement and the FNB General Terms and Conditions available on the FNB Website?
A: obligations

Flashcard 3
Q: What is the key idea of: It contains important 
information about the rights and obligations relating to You and the Bank in respect of the Acquiring Services 
and products delivered by the Bank?
A: MERCHANT AGREEMENT

Flashcard 4
Q: What is the key idea of: A copy of the Terms and Conditions is available o