# Data Split: Train / Dev / Test

**IMPORTANT**: This notebook splits the dataset into Train/Dev/Test.

- **Train**: Used for training models
- **Dev**: Used for model selection, feature selection, hyperparameter tuning
- **Test**: Used **ONLY** in the final evaluation notebook (never for training or selection!)

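Keeping the test set out of every training and selection decision prevents data leakage, so the final score reflects performance on data the models have never influenced or seen.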
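As a rough illustration of a seeded three-way split with the test set carved off first, here is a minimal sketch in plain Python. It is not the project's `split_dataset` implementation: `split_indices` is a hypothetical helper, and it assumes that both ratios are fractions of the full dataset (the real function may instead take the dev share from what remains after the test split).

```python
import random


def split_indices(n, test_ratio=0.15, dev_ratio=0.15, seed=42):
    """Hypothetical stand-in for split_dataset: returns (train, dev, test) index lists."""
    indices = list(range(n))
    # A dedicated, seeded RNG makes the shuffle reproducible across runs.
    random.Random(seed).shuffle(indices)

    # Assumption: both ratios are fractions of the full dataset.
    n_test = round(n * test_ratio)
    n_dev = round(n * dev_ratio)

    test_idx = indices[:n_test]                # carved off first, then left alone
    dev_idx = indices[n_test:n_test + n_dev]   # model/feature selection
    train_idx = indices[n_test + n_dev:]       # everything that remains
    return train_idx, dev_idx, test_idx


train_idx, dev_idx, test_idx = split_indices(1000)
print(len(train_idx), len(dev_idx), len(test_idx))
```

Because the three slices come from one shuffled list, no sample can land in more than one split, and re-running with the same seed reproduces the same assignment.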


In [None]:
# Setup (run 00_setup.ipynb first)
import sys
from pathlib import Path
BASE_PATH = Path('/content/semeval-context-tree-modular')
DATA_PATH = Path('/content/drive/MyDrive/semeval_data')
sys.path.insert(0, str(BASE_PATH))

from src.data.loader import load_dataset
from src.data.splitter import split_dataset
from src.storage.manager import StorageManager

storage = StorageManager(
    base_path=str(BASE_PATH),
    data_path=str(DATA_PATH),
    github_path=str(BASE_PATH)
)


In [None]:
# Load dataset
# All three splits below are carved out of the dataset's 'train' split.
dataset = load_dataset(dataset_name="ailsntua/QEvasion")
train_raw = dataset['train']


In [None]:
# Split dataset
# IMPORTANT: Test set is separated FIRST and will ONLY be used in final evaluation!
train_ds, dev_ds, test_ds = split_dataset(
    dataset=train_raw,
    test_ratio=0.15,  # 15% for final test
    dev_ratio=0.15,   # 15% for dev (model/feature selection)
    seed=42
)


In [None]:
# Save splits
storage.save_splits(train_ds, dev_ds, test_ds)

print("✅ Splits saved!")
print(f"   Train: {len(train_ds)} samples")
print(f"   Dev: {len(dev_ds)} samples")
print(f"   Test: {len(test_ds)} samples")
print("\n⚠️  Remember: Test set is ONLY for final evaluation!")
