# Data Pipeline: RAW → INTERIM → PROCESSED

This notebook demonstrates the complete data preprocessing pipeline:

1. **RAW**: Load original untouched data from `data/raw/`
2. **INTERIM**: Engineer features and save to `data/interim/`
3. **PROCESSED**: Apply scaling/encoding and save to `data/processed/`

All steps are validated and logged for reproducibility.

In [None]:
import sys
import os

# Add the parent directory to the import path so the `src` package resolves
sys.path.append(os.path.abspath('..'))

from src.utils.seed import set_seed
from src.utils.config import Config
from src.preprocessing import DataPipeline

set_seed(42)
print("✓ Imports and seed configured")

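`set_seed` is imported from this project's `src/utils/seed.py`, and its body is not shown in this notebook. As a rough illustration of what a helper like this commonly does, here is a minimal sketch; the name `set_seed_sketch` and the optional NumPy step are assumptions, not the project's actual code.

```python
import os
import random


def set_seed_sketch(seed: int = 42) -> None:
    """Hypothetical stand-in for src.utils.seed.set_seed (an assumption)."""
    # Only affects hash randomization in subprocesses started after this point.
    os.environ["PYTHONHASHSEED"] = str(seed)
    # Seed Python's built-in random number generator.
    random.seed(seed)
    # Seed NumPy's global generator too, but only if NumPy is installed.
    try:
        import numpy as np
    except ImportError:
        return
    np.random.seed(seed)


# Re-seeding with the same value replays the same random sequence.
set_seed_sketch(42)
first = random.random()
set_seed_sketch(42)
second = random.random()
print(first == second)  # prints True
```

Frameworks with their own generators (for example deep learning libraries) would need to be seeded separately.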
In [None]:
config = Config('../configs/config.yaml')

print("Configuration loaded:")
print(f"  RAW: {config.data['raw_data_path']}")
print(f"  INTERIM: {config.data['interim_data_path']}")
print(f"  PROCESSED: {config.data['processed_data_path']}")

In [None]:
# Initialize pipeline
pipeline = DataPipeline(config)

print("Pipeline initialized!")
print("Ready to execute: RAW → INTERIM → PROCESSED")

In [None]:
# Run complete pipeline
raw_df, interim_df, processed_df = pipeline.run()

print("\n" + "="*60)
print("PIPELINE EXECUTION SUMMARY")
print("="*60)
print(f"RAW data shape: {raw_df.shape}")
print(f"INTERIM data shape: {interim_df.shape}")
print(f"PROCESSED data shape: {processed_df.shape}")
print("="*60)

## Validation Results

The pipeline validates the data automatically:
- ✓ Required columns exist
- ✓ Target column is present
- ✓ Column names are unique
- ✓ Missing values are analyzed
- ✓ Data types are verified
- ✓ Output paths are safe (nothing is written to `raw/`)

Validation results appear in the log output of the pipeline run above.
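
The checks themselves live inside `DataPipeline` and are not shown in this notebook. To make three of them concrete (required columns, target column, duplicate names) along with the `raw/` safety check, here is a minimal standard-library sketch. The function names, the example column names (`age`, `income`, `label`), and the file name `features.csv` are all hypothetical, and the real pipeline may implement these checks differently.

```python
from collections import Counter
from pathlib import Path


def check_columns(columns, required, target):
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    missing = [name for name in required if name not in columns]
    if missing:
        problems.append(f"missing required columns: {missing}")
    if target not in columns:
        problems.append(f"target column {target!r} not found")
    dupes = sorted(name for name, count in Counter(columns).items() if count > 1)
    if dupes:
        problems.append(f"duplicate column names: {dupes}")
    return problems


def assert_not_in_raw(output_path, raw_dir):
    """Raise ValueError if output_path is raw_dir itself or lies beneath it."""
    out = Path(output_path).resolve()
    raw = Path(raw_dir).resolve()
    if out == raw or raw in out.parents:
        raise ValueError(f"refusing to write inside raw data directory: {out}")


# Hypothetical column names: "age" is duplicated and the target "label" is absent.
print(check_columns(["age", "income", "age"], required=["age", "income"], target="label"))

# Writing to interim/ is allowed, so this call returns without raising.
assert_not_in_raw("data/interim/features.csv", "data/raw")
```

Returning a list of problems rather than raising on the first failure lets a single run report every issue at once, which suits a logged validation step.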

## Next Steps

Now that the data is processed, you can:
1. Run `01_data_exploration.ipynb` to explore features
2. Run `02_baseline_training.ipynb` to train baseline models
3. Continue with advanced models in subsequent notebooks

All subsequent notebooks load their data from `data/processed/feature_matrix_processed.csv` automatically.
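
Before opening a downstream notebook, it can help to confirm that the processed file exists and has a header row. The sketch below uses only the standard library; the helper name `read_header` is hypothetical, and the relative path assumes the working directory is the project root (from a notebooks subdirectory it would likely need a `../` prefix).

```python
import csv
from pathlib import Path

# Path taken from the text above; adjust it to your working directory.
PROCESSED_CSV = Path("data/processed/feature_matrix_processed.csv")


def read_header(csv_path):
    """Return the column names from the first row of a CSV file."""
    path = Path(csv_path)
    if not path.is_file():
        raise FileNotFoundError(f"{path} not found; run the data pipeline first")
    with path.open(newline="") as handle:
        # An empty file has no first row, so fall back to an empty list.
        return next(csv.reader(handle), [])


if PROCESSED_CSV.is_file():
    print(read_header(PROCESSED_CSV))
else:
    print(f"{PROCESSED_CSV} is missing; run the pipeline first")
```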

In [None]:
print("✓ Data pipeline complete!")
print("Proceed to exploratory data analysis and model training.")