# Sanity Check

**NLP Multi-Type Classification Project**

This notebook performs sanity checks on a small subset of the data to verify:
- Data loading works correctly
- Label mappings are consistent
- No obvious data quality issues
- Paths referenced in the configs resolve to existing files

## Sanity Check Checklist

### Data Loading
- [ ] Load 1% subset of training data
- [ ] Verify schema matches expected format
- [ ] Print sample rows for manual inspection

### Label Validation
- [ ] Verify all labels are in expected range
- [ ] Test label mapping T1/T2/T3/T4 ↔ 0/1/2/3
- [ ] Ensure consistency across splits

### Config Validation
- [ ] Load all config files (data_config.yaml, models_baseline.yaml, models_transformer.yaml, project.yaml)
- [ ] Verify all required keys exist
- [ ] Verify paths resolve correctly

### Split Integrity
- [ ] Verify no family_id appears in multiple splits (smoke test)
- [ ] Check label distribution is reasonable

### Metrics Function Contracts
- [ ] Verify eval_utils schemas are loaded
- [ ] Verify result file schemas are defined

---

**TODO**: Implement code cells below to complete each checklist item.


In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import yaml
from pathlib import Path
import sys

# Add src to path
sys.path.append('../src')
from constants import LABELS, LABEL2ID, ID2LABEL, REQUIRED_COLUMNS
from schema import ProcessedRow

print("Libraries imported successfully.")
print(f"Labels: {LABELS}")
print(f"Label mapping: {LABEL2ID}")


## 1. Load Small Subset

Load a 1% random sample of the training data (fixed seed) for quick validation.
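
Once the subset is loaded, the "Verify schema matches expected format" item can be covered with a check like the one below. This is a minimal sketch, not the project's implementation: `missing_columns` is a hypothetical helper, and the column names in `EXAMPLE_REQUIRED_COLUMNS` are assumptions standing in for the real `REQUIRED_COLUMNS` from `constants`.

```python
from typing import Iterable, List

# Assumed column names for illustration only; the real list is REQUIRED_COLUMNS.
EXAMPLE_REQUIRED_COLUMNS = ["text", "label", "family_id"]


def missing_columns(columns: Iterable[str], required: Iterable[str]) -> List[str]:
    """Return the required columns that are absent, in the order they were required."""
    present = set(columns)
    return [name for name in required if name not in present]


# Stand-in for list(subset_df.columns); "family_id" is left out on purpose.
example_columns = ["text", "label", "source"]
missing = missing_columns(example_columns, EXAMPLE_REQUIRED_COLUMNS)
print(f"Missing columns: {missing}")
```

In the notebook, the call would look like `missing_columns(subset_df.columns, REQUIRED_COLUMNS)`, followed by an `assert not missing` so the cell fails loudly.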


In [None]:
# TODO: Load subset
# train_df = pd.read_csv('../data/processed/train_4class.csv')
# subset_df = train_df.sample(frac=0.01, random_state=42)
# print(f"Loaded {len(subset_df)} samples (1% of train)")
# print(subset_df.head())

print("TODO: Implement subset loading")


## 2. Verify Label Mapping

Check that `LABEL2ID` and `ID2LABEL` are inverses of each other and inspect the label distribution in the subset.
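
The two label checks from the checklist (mapping round trip and labels in the expected range) could be sketched as follows. The mapping values are assumed from the T1/T2/T3/T4 ↔ 0/1/2/3 item above, and `mapping_errors` and `unknown_labels` are hypothetical helpers, not functions from `constants`.

```python
from typing import Dict, Hashable, Iterable, List

# Assumed values, following the checklist's T1/T2/T3/T4 <-> 0/1/2/3 mapping.
EXAMPLE_LABEL2ID = {"T1": 0, "T2": 1, "T3": 2, "T4": 3}
EXAMPLE_ID2LABEL = {idx: label for label, idx in EXAMPLE_LABEL2ID.items()}


def mapping_errors(label2id: Dict[str, int], id2label: Dict[int, str]) -> List[str]:
    """Describe every way the two mappings fail to be exact inverses."""
    errors = []
    # Forward direction: label -> id -> label must return the original label.
    for label, idx in label2id.items():
        if id2label.get(idx) != label:
            errors.append(f"{label} -> {idx} -> {id2label.get(idx)}")
    # Reverse direction: id -> label -> id must return the original id.
    for idx, label in id2label.items():
        if label2id.get(label) != idx:
            errors.append(f"{idx} -> {label} -> {label2id.get(label)}")
    return errors


def unknown_labels(observed: Iterable[Hashable], known: Iterable[Hashable]) -> List[Hashable]:
    """Return observed values that are not known labels, sorted by their string form."""
    return sorted(set(observed) - set(known), key=str)


round_trip_errors = mapping_errors(EXAMPLE_LABEL2ID, EXAMPLE_ID2LABEL)
unexpected = unknown_labels(["T1", "T4", "T5"], EXAMPLE_LABEL2ID)
print(f"Round-trip errors: {round_trip_errors}")
print(f"Unexpected labels: {unexpected}")
```

Checking both directions catches a case the forward loop alone misses: an extra entry in `ID2LABEL` with no counterpart in `LABEL2ID`. Whether `subset_df['label']` should be compared against the keys of `LABEL2ID` or of `ID2LABEL` depends on whether the column stores strings or integer ids.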


In [None]:
# TODO: Verify label mapping
# print("Label distribution in subset:")
# print(subset_df['label'].value_counts().sort_index())

# # Test mapping
# for label_str, label_int in LABEL2ID.items():
#     reverse = ID2LABEL[label_int]
#     assert reverse == label_str, f"Mapping inconsistency: {label_str} -> {label_int} -> {reverse}"
#     print(f"✓ {label_str} ↔ {label_int}")

print("TODO: Implement label validation")


## 3. Load and Validate Configs

Load each YAML config and list its top-level keys to verify its structure.
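
Beyond listing keys, the checklist asks for required keys and resolvable paths. A minimal sketch of both checks is shown below, operating on the dict that `yaml.safe_load` returns. The key names (`train_path`, `val_path`, `test_path`) and both helper functions are hypothetical; the actual required keys depend on the project's configs.

```python
import tempfile
from pathlib import Path
from typing import Any, Dict, Iterable, List


def missing_keys(config: Dict[str, Any], required: Iterable[str]) -> List[str]:
    """Return required top-level keys that the config does not define."""
    return [key for key in required if key not in config]


def unresolved_paths(config: Dict[str, Any], path_keys: Iterable[str], base_dir: Path) -> List[str]:
    """Return the keys in path_keys whose value does not exist relative to base_dir."""
    bad = []
    for key in path_keys:
        value = config.get(key)
        if value is None or not (base_dir / str(value)).exists():
            bad.append(key)
    return bad


# Stand-in for a loaded config; a temporary directory plays the role of the repo.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "train.csv").write_text("text,label\n")
    example_config = {"train_path": "train.csv", "test_path": "test.csv"}

    absent_keys = missing_keys(example_config, ["train_path", "val_path"])
    broken_paths = unresolved_paths(example_config, ["train_path", "test_path"], base)

print(f"Missing keys: {absent_keys}")
print(f"Unresolved paths: {broken_paths}")
```

Resolving against an explicit `base_dir` avoids depending on the notebook's working directory, which is what relative paths such as `'../configs/...'` implicitly rely on.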


In [None]:
# TODO: Load configs
# config_files = [
#     '../configs/data_config.yaml',
#     '../configs/models_baseline.yaml',
#     '../configs/models_transformer.yaml',
#     '../configs/project.yaml'
# ]

# for config_file in config_files:
#     with open(config_file, 'r', encoding='utf-8') as f:
#         config = yaml.safe_load(f) or {}  # safe_load returns None for an empty file
#     print(f"✓ Loaded {Path(config_file).name}")
#     print(f"  Keys: {list(config.keys())}")

print("TODO: Implement config loading")


## 4. Verify No Cross-Split Families

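Quick check on the subset to confirm that `family_id` is populated and groups rows as expected. A sample from a single split cannot show the absence of leakage; that requires comparing the `family_id` sets of all splits.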
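
The cross-split comparison itself could be sketched as a pairwise set intersection. `family_overlaps` is a hypothetical helper, and the split names and ids below are made up for illustration; the source only names the training file, so how the other splits are loaded is left open.

```python
from itertools import combinations
from typing import Dict, Hashable, Iterable, Set, Tuple


def family_overlaps(
    split_families: Dict[str, Iterable[Hashable]],
) -> Dict[Tuple[str, str], Set[Hashable]]:
    """Map each pair of splits to the family ids they share; leak-free pairs are omitted."""
    unique = {name: set(ids) for name, ids in split_families.items()}
    overlaps = {}
    for left, right in combinations(unique, 2):
        shared = unique[left] & unique[right]
        if shared:
            overlaps[(left, right)] = shared
    return overlaps


# Made-up ids for illustration; "f3" is deliberately placed in both train and test.
example_splits = {
    "train": ["f1", "f1", "f2", "f3"],
    "val": ["f4"],
    "test": ["f3", "f5"],
}
leaks = family_overlaps(example_splits)
print(leaks)
```

The function accepts any iterable of ids, so a pandas column such as `train_df['family_id']` can be passed directly. An empty result means no family appears in more than one split.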


In [None]:
# TODO: Verify family grouping
# print(f"Unique families in subset: {subset_df['family_id'].nunique()}")
# print("Family_id value counts (top 10):")
# print(subset_df['family_id'].value_counts().head(10))

print("TODO: Implement family grouping check")
