# Part 1: The Data Factory

This notebook implements the data generation pipeline that transforms `2024-Annual-Report.pdf` into a fine-tuning dataset.

**Assessment Requirement:** Generate 10 Q/A pairs for **each** chunk.

In [1]:
import json
import os
import random
import re
import sys

from dotenv import load_dotenv
from tqdm import tqdm

# Load environment variables
load_dotenv("../.env")

# Add the project root to sys.path so the `src` package can be imported
sys.path.append(os.path.abspath("../"))

from src.services.llm_services import load_config, get_llm
from src.utils.data_processing import load_and_clean_pdf, chunk_text
from src.utils.json_helper import extract_json_from_llm

config = load_config("../src/config/config.yaml")

## 1. Ingestion & Cleaning

**Step 1:** Load the PDF and clean the extracted text (remove headers, footers, and extra whitespace).

In [2]:
pdf_path = os.path.join("..", config.get("pdf_path", "data/pdfs/2024-Annual-Report.pdf"))
raw_text = load_and_clean_pdf(pdf_path)
print(f"Loaded and cleaned {len(raw_text)} characters.")

Loaded and cleaned 632153 characters.
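
`load_and_clean_pdf` is imported from `src/utils/data_processing.py`, and its body is not shown in this notebook. As a rough illustration of the kind of cleaning Step 1 describes, the sketch below operates on page strings that have already been extracted. The function name `clean_pages`, the repeated-line heuristic, and the `min_repeat_ratio` threshold are all assumptions made for illustration, not the project's implementation.

```python
import re
from collections import Counter


def clean_pages(pages, min_repeat_ratio=0.6):
    """Hypothetical cleaner: drop lines repeated across most pages, then tidy whitespace."""
    # Count on how many pages each non-empty stripped line appears (once per page).
    line_pages = Counter()
    for page in pages:
        for line in {ln.strip() for ln in page.splitlines() if ln.strip()}:
            line_pages[line] += 1

    # A line seen on at least this many pages is treated as a running header/footer.
    threshold = max(2, int(min_repeat_ratio * len(pages)))
    boilerplate = {line for line, n in line_pages.items() if n >= threshold}

    cleaned = []
    for page in pages:
        kept = []
        for ln in page.splitlines():
            stripped = ln.strip()
            if not stripped or stripped in boilerplate:
                continue
            if re.fullmatch(r"(Page\s+)?\d+", stripped):  # bare page numbers
                continue
            kept.append(stripped)
        cleaned.append(" ".join(kept))

    # Collapse any remaining runs of whitespace into single spaces.
    return re.sub(r"\s+", " ", " ".join(cleaned)).strip()
```

One caveat with this heuristic: the page-number rule also drops any figure that happens to sit alone on a line, which matters in a financial report, so the pattern would need tuning against the real document.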


## 2. Chunking Strategy

**Step 2:** Split the cleaned text into chunks of up to 1,500 characters, with a 200-character overlap between consecutive chunks.

In [3]:
chunk_size = 1500
chunk_overlap = 200
chunks = chunk_text(raw_text, chunk_size=chunk_size, chunk_overlap=chunk_overlap)
print(f"Created {len(chunks)} chunks.")

Created 487 chunks.
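
The body of `chunk_text` is not shown here, so the following is only a minimal sketch of one common way to chunk by characters with overlap: a fixed sliding window that advances by `chunk_size - chunk_overlap` characters. The function name `sliding_window_chunks` is hypothetical, and the real helper may split on sentence or paragraph boundaries instead.

```python
def sliding_window_chunks(text, chunk_size=1500, chunk_overlap=200):
    """Hypothetical fixed-window chunker; consecutive chunks share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # 1300 new characters per chunk with the defaults
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reaches the end of the text
    return chunks


# A string as long as the cleaned report (632,153 characters) yields 487 windows.
print(len(sliding_window_chunks("x" * 632153)))  # 487
```

The window count matches the 487 chunks reported above, which is consistent with a plain sliding window but does not confirm that `chunk_text` works this way.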


## 3. The Generation Loop

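**Step 3:** For each chunk, generate 10 Q/A pairs using a two-step pipeline: first ask the model for 10 questions about the chunk, then ask it to answer those questions using the same chunk as context.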

In [4]:
llm = get_llm(config)

QUESTION_GEN_SYSTEM = """You are a Financial Data Architect. Generate 10 distinct questions based ONLY on the provided chunk.
Balance across: HARD FACTS, STRATEGIC SUMMARIES, and STYLISTIC/CREATIVE outputs.
Return ONLY a JSON list of strings ["Q1", "Q2", ...]."""

ANSWER_GEN_SYSTEM = """You are a Senior Financial Analyst. Answer the provided questions based strictly on the context.
Return ONLY a JSON list of objects with 'question' and 'answer' keys."""

dataset = []

# ASSESSMENT REQUIREMENT: GENERATE FOR ALL CHUNKS
target_chunks = chunks

for i, chunk in enumerate(tqdm(target_chunks, desc="Step 3: Processing all chunks")):
    try:
        # Step A: ask the model for questions grounded in this chunk
        q_res = llm.invoke([("system", QUESTION_GEN_SYSTEM), ("user", f"Chunk Content: {chunk}")])
        questions = extract_json_from_llm(q_res.content)
        if not questions or not isinstance(questions, list):
            continue  # no usable question list: skip this chunk

        # Step B: answer all questions in one call, using the same chunk as context
        a_res = llm.invoke([("system", ANSWER_GEN_SYSTEM), ("user", f"Context: {chunk}\n\nQuestions: {json.dumps(questions)}")])
        pairs = extract_json_from_llm(a_res.content)
        if pairs and isinstance(pairs, list):
            dataset.extend(pairs)

    except Exception as e:
        # Log the failure and move on so one bad chunk does not stop the run
        print(f"\nError in chunk {i}: {e}")
        continue

print(f"\nTotal Generated Q/A Pairs: {len(dataset)}")

Step 3: Processing all chunks: 100%|██████████| 487/487 [2:26:31<00:00, 18.05s/it]  


Total Generated Q/A Pairs: 4710
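
The loop appends whatever list the answer step returns, so items with a missing key or an empty answer can reach `dataset` unchecked. The sketch below shows one possible validation pass together with the yield arithmetic for this run. `filter_valid_pairs` is a hypothetical helper, and its strictness (string values only, non-empty after stripping) is an assumption rather than a rule taken from the project.

```python
def filter_valid_pairs(pairs):
    """Hypothetical validator: keep dicts with non-empty string 'question' and 'answer'."""
    valid = []
    for item in pairs:
        if not isinstance(item, dict):
            continue
        question, answer = item.get("question"), item.get("answer")
        if not (isinstance(question, str) and question.strip()):
            continue
        if not (isinstance(answer, str) and answer.strip()):
            continue
        valid.append({"question": question.strip(), "answer": answer.strip()})
    return valid


expected = 487 * 10   # 10 pairs requested for each of the 487 chunks
generated = 4710      # total reported by the run above
print(expected - generated)            # 160
print(round(generated / expected, 4))  # 0.9671
```

The run therefore falls 160 pairs short of the 4,870 requested. The notebook does not record whether the gap comes from skipped chunks or from answer lists with fewer than 10 items, so logging the per-chunk count would be needed to tell.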

## 4. Storage & Splitting

**Step 4:** Shuffle the generated pairs, split them 80/20 into training and test sets, and store each set in JSONL format.

In [5]:
if dataset:
    random.seed(42)  # fixed seed (arbitrary value) so the shuffle and split are reproducible
    random.shuffle(dataset)
    split_idx = int(0.8 * len(dataset))
    train_set, test_set = dataset[:split_idx], dataset[split_idx:]

    out_dir = os.path.join("..", config['train_data_path'])
    os.makedirs(out_dir, exist_ok=True)

    # JSONL: one JSON object per line
    with open(os.path.join(out_dir, 'train.jsonl'), 'w', encoding='utf-8') as f:
        for item in train_set:
            f.write(json.dumps(item) + "\n")
    with open(os.path.join(out_dir, 'golden_test_set.jsonl'), 'w', encoding='utf-8') as f:
        for item in test_set:
            f.write(json.dumps(item) + "\n")

    print(f"Saved {len(train_set)} to train.jsonl and {len(test_set)} to golden_test_set.jsonl")

Saved 3768 to train.jsonl and 942 to golden_test_set.jsonl
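
Because the split shuffles individual Q/A pairs, questions generated from the same chunk can end up in both `train.jsonl` and `golden_test_set.jsonl`. If that leakage is a concern, one alternative is to split at the chunk level, as in the sketch below. It assumes every record carries a `chunk_id` field, which the generation loop above does not currently add (the loop index `i` would be a natural choice); `split_by_chunk` is a hypothetical helper.

```python
import random


def split_by_chunk(records, train_fraction=0.8, seed=42):
    """Hypothetical group split: every pair from a given chunk lands on the same side."""
    chunk_ids = sorted({r["chunk_id"] for r in records})
    rng = random.Random(seed)  # local RNG, leaves the global random state untouched
    rng.shuffle(chunk_ids)

    cut = int(train_fraction * len(chunk_ids))
    train_ids = set(chunk_ids[:cut])

    train = [r for r in records if r["chunk_id"] in train_ids]
    test = [r for r in records if r["chunk_id"] not in train_ids]
    return train, test
```

Even with a chunk-level split, adjacent chunks share 200 characters of text, so a small amount of overlap between the two sets can remain.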
