# 01 — Test data pipeline (download → raw → processed)

This notebook validates your two scripts:
- `load_existing_dataset.py` → downloads GSM8K and writes **raw** JSON into `gsm8k-distillation/data/raw`
- `import_data.py` → reads raw JSON and writes **processed** JSON into `gsm8k-distillation/data/processed`

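Run it from a directory two levels below the folder that contains `gsm8k-distillation` (the outputs below were produced that way), or adjust `REPO_ROOT` in the first cell.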


In [2]:
from pathlib import Path

# parents[1] is the grandparent of the current working directory; adjust if your layout differs
REPO_ROOT = Path('.').resolve().parents[1]
RAW_DIR = REPO_ROOT / 'gsm8k-distillation' / 'data' / 'raw'
PROCESSED_DIR = REPO_ROOT / 'gsm8k-distillation' / 'data' / 'processed'
print('Repo root:', REPO_ROOT)
print('RAW_DIR:', RAW_DIR)
print('PROCESSED_DIR:', PROCESSED_DIR)
RAW_DIR.mkdir(parents=True, exist_ok=True)
PROCESSED_DIR.mkdir(parents=True, exist_ok=True)


Repo root: /Users/marcobonvissuto/Desktop/Università/Magistrale/Secondo Anno
RAW_DIR: /Users/marcobonvissuto/Desktop/Università/Magistrale/Secondo Anno/gsm8k-distillation/data/raw
PROCESSED_DIR: /Users/marcobonvissuto/Desktop/Università/Magistrale/Secondo Anno/gsm8k-distillation/data/processed
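
Because `REPO_ROOT` depends on where the notebook is launched, one option is to search upward for the project folder instead of counting parent levels. The sketch below assumes the folder name `gsm8k-distillation` seen in the paths above; `find_project_parent` is a hypothetical helper, not part of the repo's scripts.

```python
from pathlib import Path


def find_project_parent(start: Path, marker: str = "gsm8k-distillation") -> Path:
    """Return the nearest directory at or above `start` that contains `marker`.

    `marker` is an assumed folder name taken from the paths printed above.
    """
    start = start.resolve()
    # Check the starting directory first, then each ancestor in turn.
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No directory containing {marker!r} found at or above {start}")
```

If this fits your layout, `REPO_ROOT = find_project_parent(Path('.'))` could replace the `parents[1]` line, so the notebook no longer depends on how deep the working directory is.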


In [6]:
# If your scripts live in src/data, import them from there.
# If needed, extend sys.path to include the src directory.
import sys
src_path = (REPO_ROOT / 'src').as_posix()
if src_path not in sys.path:
    sys.path.insert(0, src_path)

from data.load_existing_dataset import GSM8KDatasetProcessor as RawProcessor
from data.gsm8k_loader import GSM8KDatasetLoader as ProcessedLoader

print('Imports OK')


Imports OK


In [7]:
processor = RawProcessor(base_path=RAW_DIR.as_posix())

# Download only if missing (avoid re-downloading)
raw_train = RAW_DIR / 'gsm8k_cot_train.json'
raw_test  = RAW_DIR / 'gsm8k_cot_test.json'

if raw_train.exists() and raw_test.exists():
    print('Raw files already exist — skipping download.')
else:
    print('Downloading GSM8K → raw JSON...')
    processor.download_and_prepare_gsm8k()
    print('Done')

print('Raw train exists:', raw_train.exists(), 'size:', raw_train.stat().st_size if raw_train.exists() else None)
print('Raw test  exists:', raw_test.exists(),  'size:', raw_test.stat().st_size  if raw_test.exists()  else None)


Downloading GSM8K → raw JSON...

Processing train split...
✓ Downloaded 7473 examples
 Saved 7473 examples to /Users/marcobonvissuto/Desktop/Università/Magistrale/Secondo Anno/gsm8k-distillation/data/raw/gsm8k_cot_train.json

--- Example from train split ---
Question: Natalia sold clips to 48 of her friends in April, and then she sold half as many...
Reasoning: Natalia sold 48/2 = <<48/2=24>>24 clips in May.
Natalia sold 48+24 = <<48+24=72>>72 clips altogether...
Answer: 72
----------------------------------------

Processing test split...
✓ Downloaded 1319 examples
 Saved 1319 examples to /Users/marcobonvissuto/Desktop/Università/Magistrale/Secondo Anno/gsm8k-distillation/data/raw/gsm8k_cot_test.json

--- Example from test split ---
Question: Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning an...
Reasoning: Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day.
She makes 9 * 2 = $<<9*2=18>>18 every day at...
Answer: 18
----------------------------------

In [8]:
# Quick sanity check: load a few raw records
import json

def peek_json(path, n=3):
    with open(path, 'r', encoding='utf-8') as f:
        data = json.load(f)
    print('Rows:', len(data))
    for i, ex in enumerate(data[:n]):
        print('\n--- Example', i, '---')
        print('question:', ex.get('question','')[:120])
        print('reasoning:', ex.get('reasoning','')[:120])
        print('answer:', ex.get('answer'))

if raw_train.exists():
    peek_json(raw_train, n=2)


Rows: 7473

--- Example 0 ---
question: Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natali
reasoning: Natalia sold 48/2 = <<48/2=24>>24 clips in May.
Natalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.
answer: 72

--- Example 1 ---
question: Weng earns $12 an hour for babysitting. Yesterday, she just did 50 minutes of babysitting. How much did she earn?
reasoning: Weng earns 12/60 = $<<12/60=0.2>>0.2 per minute.
Working 50 minutes, she earned 0.2 x 50 = $<<0.2*50=10>>10.
answer: 10
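
Peeking at two rows does not show whether every record has all three fields. A minimal sketch of a full-file check follows, assuming the `question`, `reasoning`, and `answer` keys printed above are the complete required schema; `find_incomplete_records` is a hypothetical helper.

```python
# Keys assumed from the peek output above; the real schema may include more fields.
REQUIRED_KEYS = ("question", "reasoning", "answer")


def find_incomplete_records(records, required=REQUIRED_KEYS):
    """Return (index, missing_keys) for each record with an absent or empty required field."""
    problems = []
    for i, rec in enumerate(records):
        # Treat a missing key, None, and the empty string as "missing".
        missing = [key for key in required if rec.get(key) in (None, "")]
        if missing:
            problems.append((i, missing))
    return problems
```

Passing the list returned by `json.load` for `raw_train` to this function and checking that the result is empty would cover every row rather than the first few.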


In [9]:
# Process: raw → processed

train_loader = ProcessedLoader(raw_train)
test_loader  = ProcessedLoader(raw_test)

train_loader.print_statistics()
test_loader.print_statistics()

processed_train = PROCESSED_DIR / 'gsm8k_train_processed.json'
processed_test  = PROCESSED_DIR / 'gsm8k_test_processed.json'

train_loader.save_processed_dataset(train_loader.examples, processed_train)
test_loader.save_processed_dataset(test_loader.examples, processed_test)

print('Processed train:', processed_train, processed_train.exists(), processed_train.stat().st_size)
print('Processed test :', processed_test,  processed_test.exists(),  processed_test.stat().st_size)


✓ Loaded 7473 examples from gsm8k_cot_train.json
✓ Loaded 1319 examples from gsm8k_cot_test.json

DATASET STATISTICS
File: gsm8k_cot_train.json
Total examples: 7473
Split: train

Question length (words):
  Mean: 45.1
  Range: [9, 183]

Reasoning length (words):
  Mean: 49.7
  Range: [2, 214]

Examples with reasoning: 100.0%


DATASET STATISTICS
File: gsm8k_cot_test.json
Total examples: 1319
Split: test

Question length (words):
  Mean: 46.3
  Range: [15, 164]

Reasoning length (words):
  Mean: 50.8
  Range: [3, 171]

Examples with reasoning: 100.0%

Saved to /Users/marcobonvissuto/Desktop/Università/Magistrale/Secondo Anno/gsm8k-distillation/data/processed/gsm8k_train_processed.json
Saved to /Users/marcobonvissuto/Desktop/Università/Magistrale/Secondo Anno/gsm8k-distillation/data/processed/gsm8k_test_processed.json
Processed train: /Users/marcobonvissuto/Desktop/Università/Magistrale/Secondo Anno/gsm8k-distillation/data/processed/gsm8k_train_processed.json True 4387982
Processed tes
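
The word-length figures printed by `print_statistics()` can be cross-checked independently. The sketch below assumes lengths are counted by splitting on whitespace, which may differ from what the loader actually does; `word_length_stats` is a hypothetical helper.

```python
from statistics import mean


def word_length_stats(texts):
    """Return mean, min, and max word counts, splitting each text on whitespace."""
    lengths = [len(text.split()) for text in texts]
    return {
        "mean": round(mean(lengths), 1),
        "min": min(lengths),
        "max": max(lengths),
    }
```

Applied to the `question` field of every raw record, the result can be compared with the mean and range reported above. A small mismatch would point to a different tokenization rather than to a data problem.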

In [10]:
# Final validation: schema and a couple of rows
peek_json(processed_train, n=2)


Rows: 7473

--- Example 0 ---
question: Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natali
reasoning: Natalia sold 48/2 = 24 clips in May.
Natalia sold 48+24 = 72 clips altogether in April and May.
answer: 72

--- Example 1 ---
question: Weng earns $12 an hour for babysitting. Yesterday, she just did 50 minutes of babysitting. How much did she earn?
reasoning: Weng earns 12/60 = $0.2 per minute.
Working 50 minutes, she earned 0.2 x 50 = $10.
answer: 10
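
Comparing the raw and processed examples suggests that processing removes the `<<...>>` calculator annotations from the reasoning text. The loader's implementation is not shown here, so the following is only a sketch of how that cleanup could be reproduced and verified; the regular expression and both helper names are assumptions.

```python
import re

# Matches GSM8K-style calculator annotations such as <<48/2=24>>.
CALC_ANNOTATION = re.compile(r"<<[^<>]*>>")


def strip_calc_annotations(text: str) -> str:
    """Remove every <<...>> annotation, leaving the surrounding text untouched."""
    return CALC_ANNOTATION.sub("", text)


def has_calc_annotations(text: str) -> bool:
    """Return True if the text still contains at least one <<...>> annotation."""
    return CALC_ANNOTATION.search(text) is not None
```

Running `has_calc_annotations` over the `reasoning` field of every processed record and confirming that no row returns `True` extends the two-row peek above to the whole file.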
