# Data Preparation Pipeline

## Objective

Prepare data for modeling:
1. **Handle remaining issues** - Missing values, outliers
2. **Feature selection** - Remove highly correlated features for linear models
3. **Scaling** - Standardize features
4. **Train/test splits** - Proper validation strategy

## Note

The data appears to have been preprocessed already in earlier notebooks.
This notebook documents the preparation strategy and creates clean train/test splits for modeling.

---

In [None]:
# Setup
import sys
sys.path.insert(0, '../..')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

from src.bankruptcy_prediction.data import DataLoader, MetadataParser

plt.style.use('seaborn-v0_8-darkgrid')
%matplotlib inline

print("✓ Setup complete")

In [None]:
# Load data
loader = DataLoader()
metadata = MetadataParser.from_default()

# Load both full and reduced datasets
df_full = loader.load_poland(horizon=1, dataset_type='full')
df_reduced = loader.load_poland(horizon=1, dataset_type='reduced')

print(f"Full dataset: {df_full.shape}")
print(f"Reduced dataset: {df_reduced.shape}")
print(f"\nFull dataset is for Random Forest (handles multicollinearity)")
print(f"Reduced dataset is for Logistic/GLM (correlation pruned)")

## 1. Data Quality Check

Verify preprocessing quality.

In [None]:
X_full, y = loader.get_features_target(df_full)
X_reduced, _ = loader.get_features_target(df_reduced)

print("=" * 70)
print("DATA QUALITY REPORT")
print("=" * 70)

print(f"\n📊 Full Dataset:")
print(f"  Features: {len(X_full.columns)}")
print(f"  Missing values: {X_full.isnull().sum().sum():,}")
print(f"  Infinite values: {np.isinf(X_full.select_dtypes(include=[np.number])).sum().sum()}")

print(f"\n📊 Reduced Dataset:")
print(f"  Features: {len(X_reduced.columns)}")
print(f"  Missing values: {X_reduced.isnull().sum().sum():,}")
print(f"  Infinite values: {np.isinf(X_reduced.select_dtypes(include=[np.number])).sum().sum()}")

print(f"\n📊 Target:")
print(f"  Samples: {len(y):,}")
print(f"  Bankruptcy rate: {y.mean():.2%}")
print(f"  Class balance: {y.value_counts().to_dict()}")

print("\n" + "=" * 70)

### Interpretation:

**Data appears preprocessed with:**
- ✅ Missing values handled (imputation or indicators)
- ✅ No infinite values
- ✅ Two feature sets prepared (full vs reduced)

**Reduced dataset:**
- Correlation pruning applied (removed features with r > 0.9)
- Suitable for Logistic Regression and GLM
- Prevents multicollinearity issues

**Full dataset:**
- All features retained
- Suitable for Random Forest (predictive performance is largely robust to correlated features, though importance scores get split across them)
- Maximum information preserved
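
The correlation pruning described above can be illustrated with a minimal sketch. This is not necessarily how the reduced dataset was actually built: the function name `prune_correlated`, the use of the absolute Pearson correlation, and the rule of dropping the later column of each pair are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def prune_correlated(X: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Drop the later feature of every pair whose |Pearson r| exceeds threshold."""
    corr = X.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return X.drop(columns=to_drop)


# Synthetic demo data (hypothetical): "a_copy" nearly duplicates "a".
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "a_copy": a + rng.normal(scale=0.01, size=200),
    "b": rng.normal(size=200),
})
pruned = prune_correlated(demo)
print(list(pruned.columns))
```

Which member of a correlated pair is kept depends on column order here; a domain-driven choice (for example, keeping the more interpretable ratio) may be preferable.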

## 2. Train/Test Split Strategy

Create stratified splits for reproducible evaluation.
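
One way to guarantee that the full and reduced feature sets end up with exactly the same train and test rows is to split the row positions once and reuse them for every frame. The sketch below uses synthetic stand-ins (`X_full_demo`, `X_reduced_demo`, `y_demo` are hypothetical names, not the notebook's variables) and assumes both feature sets contain the same rows in the same order.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the full/reduced feature sets and the target.
rng = np.random.default_rng(42)
n = 500
y_demo = pd.Series((rng.random(n) < 0.1).astype(int), name="y")
X_full_demo = pd.DataFrame(rng.normal(size=(n, 5)), columns=list("abcde"))
X_reduced_demo = X_full_demo[["a", "c", "e"]]

# Guard: both feature sets must describe the same rows in the same order.
assert X_full_demo.index.equals(X_reduced_demo.index)

# Split the row positions once, stratified on the target.
train_pos, test_pos = train_test_split(
    np.arange(n), test_size=0.2, random_state=42, stratify=y_demo
)

# Index every frame with the same positions.
X_train_full_demo = X_full_demo.iloc[train_pos]
X_train_reduced_demo = X_reduced_demo.iloc[train_pos]
y_train_demo, y_test_demo = y_demo.iloc[train_pos], y_demo.iloc[test_pos]

print(f"train rate: {y_train_demo.mean():.2%}, test rate: {y_test_demo.mean():.2%}")
```

Stratification keeps the positive rate in train and test close to the overall rate, which matters when the positive class is rare.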

In [None]:
# Split parameters
TEST_SIZE = 0.2
RANDOM_STATE = 42

print(f"Split strategy: {(1-TEST_SIZE)*100:.0f}% train, {TEST_SIZE*100:.0f}% test")
print(f"Random seed: {RANDOM_STATE}")
print(f"Stratification: YES (preserve class balance)\n")

# Split full dataset (for Random Forest)
X_train_full, X_test_full, y_train, y_test = train_test_split(
    X_full, y,
    test_size=TEST_SIZE,
    random_state=RANDOM_STATE,
    stratify=y
)

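# Split reduced dataset (for Logistic/GLM). Reusing the seed and stratification
# reproduces the same row partition only if X_reduced has the same rows, in the same order, as X_full.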
X_train_reduced, X_test_reduced, _, _ = train_test_split(
    X_reduced, y,
    test_size=TEST_SIZE,
    random_state=RANDOM_STATE,
    stratify=y
)

print("✓ Splits created:\n")
print(f"Full dataset:")
print(f"  Train: {X_train_full.shape} ({len(X_train_full):,} samples)")
print(f"  Test:  {X_test_full.shape} ({len(X_test_full):,} samples)")

print(f"\nReduced dataset:")
print(f"  Train: {X_train_reduced.shape} ({len(X_train_reduced):,} samples)")
print(f"  Test:  {X_test_reduced.shape} ({len(X_test_reduced):,} samples)")

print(f"\nClass balance preserved:")
print(f"  Train set: {y_train.mean():.2%} bankrupt")
print(f"  Test set:  {y_test.mean():.2%} bankrupt")

## 3. Feature Scaling

Standardize features for models that require it.
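
Standardization maps each feature to z = (x - mean) / std, with the mean and standard deviation estimated on the training rows only and then reused for the test rows. The NumPy sketch below reproduces that arithmetic on synthetic data (the arrays `train` and `test` are hypothetical). It also shows why a sample standard deviation, which is what pandas' `.std()` reports by default, comes out slightly above 1: `StandardScaler` divides by the population standard deviation (`ddof=0`).

```python
import numpy as np

rng = np.random.default_rng(0)
train = rng.normal(loc=50.0, scale=10.0, size=(400, 3))
test = rng.normal(loc=50.0, scale=10.0, size=(100, 3))

# Statistics come from the training rows only.
mu = train.mean(axis=0)
sigma = train.std(axis=0)  # ddof=0, the population std

train_z = (train - mu) / sigma
test_z = (test - mu) / sigma  # reuse training statistics; never refit on test

print(train_z.mean(axis=0).round(6), train_z.std(axis=0).round(6))
print(train_z.std(axis=0, ddof=1).round(6))  # sample std, slightly above 1
print(test_z.mean(axis=0).round(3), test_z.std(axis=0).round(3))
```

The test set's mean and standard deviation are close to, but not exactly, 0 and 1, which is expected and is not a sign of a bug.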

In [None]:
# Fit scaler on training data only (prevent data leakage)
scaler_full = StandardScaler()
scaler_reduced = StandardScaler()

X_train_full_scaled = scaler_full.fit_transform(X_train_full)
X_test_full_scaled = scaler_full.transform(X_test_full)

X_train_reduced_scaled = scaler_reduced.fit_transform(X_train_reduced)
X_test_reduced_scaled = scaler_reduced.transform(X_test_reduced)

# Convert back to DataFrames with column names
X_train_full_scaled = pd.DataFrame(X_train_full_scaled, 
                                    columns=X_train_full.columns,
                                    index=X_train_full.index)
X_test_full_scaled = pd.DataFrame(X_test_full_scaled, 
                                   columns=X_test_full.columns,
                                   index=X_test_full.index)

X_train_reduced_scaled = pd.DataFrame(X_train_reduced_scaled, 
                                       columns=X_train_reduced.columns,
                                       index=X_train_reduced.index)
X_test_reduced_scaled = pd.DataFrame(X_test_reduced_scaled, 
                                      columns=X_test_reduced.columns,
                                      index=X_test_reduced.index)

print("✓ Feature scaling complete\n")
print("Scaled training features have mean ≈ 0 and std ≈ 1")
print(f"\nExample - Train set (scaled):")
print(f"  Mean: {X_train_full_scaled.mean().mean():.6f}")
print(f"  Std:  {X_train_full_scaled.std().mean():.6f}")

### Which Models Need Scaling?

**Require scaling:**
- ✅ Logistic Regression (regularization penalties and gradient-based solvers are scale-sensitive)
- ✅ Neural Networks (gradient-based optimization)
- ✅ GLM (for numerical stability)

**Don't require scaling:**
- ❌ Random Forest (tree-based, scale-invariant)
- ❌ XGBoost (tree-based)
- ❌ LightGBM (tree-based)

**Strategy:**
- We keep both scaled and unscaled versions
- Use appropriate version for each model
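
An alternative to storing pre-scaled copies, shown here only as a sketch and not as what this notebook does, is to bundle the scaler with the model in a scikit-learn pipeline. The scaler is then fitted on whatever rows are passed to `fit`, which keeps scaling leakage-free inside cross-validation as well. The data below is synthetic and the variable names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Two synthetic features on very different scales.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=[0.0, 100.0], scale=[1.0, 50.0], size=(300, 2))
y_demo = (X_demo[:, 0] + (X_demo[:, 1] - 100.0) / 50.0 > 0).astype(int)

# The scaler is fitted inside fit(), on the rows given to the pipeline.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_demo[:240], y_demo[:240])

# make_pipeline names each step after its lowercased class name.
fitted_scaler = model.named_steps["standardscaler"]
print(fitted_scaler.mean_)
print(model.score(X_demo[240:], y_demo[240:]))
```

Tree-based models can simply be trained on the unscaled frames, so they need no such wrapper.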

## 4. Save Prepared Data

Save splits for use in modeling notebooks.

In [None]:
# Create directory
import os
os.makedirs('../../data/processed/splits', exist_ok=True)

# Save unscaled (for tree models)
X_train_full.to_parquet('../../data/processed/splits/X_train_full.parquet')
X_test_full.to_parquet('../../data/processed/splits/X_test_full.parquet')
X_train_reduced.to_parquet('../../data/processed/splits/X_train_reduced.parquet')
X_test_reduced.to_parquet('../../data/processed/splits/X_test_reduced.parquet')

# Save scaled (for linear models)
X_train_full_scaled.to_parquet('../../data/processed/splits/X_train_full_scaled.parquet')
X_test_full_scaled.to_parquet('../../data/processed/splits/X_test_full_scaled.parquet')
X_train_reduced_scaled.to_parquet('../../data/processed/splits/X_train_reduced_scaled.parquet')
X_test_reduced_scaled.to_parquet('../../data/processed/splits/X_test_reduced_scaled.parquet')

# Save targets
y_train.to_frame('y').to_parquet('../../data/processed/splits/y_train.parquet')
y_test.to_frame('y').to_parquet('../../data/processed/splits/y_test.parquet')

print("✓ Saved all splits to: data/processed/splits/\n")
print("Files created:")
print("  • X_train_full.parquet (unscaled, for RF)")
print("  • X_train_reduced.parquet (unscaled, correlation pruned)")
print("  • X_train_full_scaled.parquet (scaled, all features)")
print("  • X_train_reduced_scaled.parquet (scaled, for Logit/GLM)")
print("  • ... and corresponding test sets")
print("  • y_train.parquet, y_test.parquet")

## 5. Preparation Summary

In [None]:
# Create summary
summary = {
    'Total Samples': len(X_full),
    'Train Samples': len(X_train_full),
    'Test Samples': len(X_test_full),
    'Train/Test Split': f"{(1-TEST_SIZE)*100:.0f}/{TEST_SIZE*100:.0f}",
    'Random Seed': RANDOM_STATE,
    'Stratified': 'Yes',
    'Features (Full)': len(X_full.columns),
    'Features (Reduced)': len(X_reduced.columns),
    'Bankruptcy Rate (Train)': f"{y_train.mean():.2%}",
    'Bankruptcy Rate (Test)': f"{y_test.mean():.2%}",
    'Scaling Applied': 'Yes (StandardScaler)',
}

summary_df = pd.DataFrame.from_dict(summary, orient='index', columns=['Value'])

print("\n" + "=" * 70)
print("DATA PREPARATION SUMMARY")
print("=" * 70)
display(summary_df)
print("=" * 70)

## Summary & Next Steps

### What We Did:

1. ✅ **Quality Check** - Verified preprocessed data
2. ✅ **Train/Test Splits** - Stratified 80/20 split (seed=42)
3. ✅ **Feature Scaling** - StandardScaler (fit on train only)
4. ✅ **Saved Splits** - Ready for modeling

### Data Ready for Modeling:

**For Random Forest:**
- Use: `X_train_full.parquet` (unscaled)
- Features: All 66 features
- Handles multicollinearity

**For Logistic Regression:**
- Use: `X_train_reduced_scaled.parquet` (scaled)
- Features: Correlation-pruned
- Prevents multicollinearity issues

**For GLM:**
- Use: `X_train_reduced_scaled.parquet` (scaled)
- Same as Logistic (linear model)
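
The mapping from model family to saved files listed above could be captured in a small lookup so that modeling notebooks do not hard-code file names. This is a sketch: the helper `split_paths` and the dictionary keys are hypothetical, and only the file names and directory come from this notebook.

```python
from pathlib import Path

SPLIT_DIR = Path("../../data/processed/splits")

# Model family -> feature-set stem (keys are hypothetical labels).
FEATURE_SET = {
    "random_forest": "full",
    "logistic": "reduced_scaled",
    "glm": "reduced_scaled",
}


def split_paths(model: str, split_dir: Path = SPLIT_DIR) -> dict:
    """Return the parquet paths for a given model family."""
    stem = FEATURE_SET[model]
    return {
        "X_train": split_dir / f"X_train_{stem}.parquet",
        "X_test": split_dir / f"X_test_{stem}.parquet",
        "y_train": split_dir / "y_train.parquet",
        "y_test": split_dir / "y_test.parquet",
    }


paths = split_paths("logistic")
print(paths["X_train"].name)
```

Each path can then be passed to `pd.read_parquet`; the targets were saved as a single-column frame named `y`, so the series is recovered with `pd.read_parquet(paths["y_train"])["y"]`.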

### Next Steps:

1. **Baseline Models** (`04_baseline_models.ipynb`)
   - Train Logistic Regression
   - Train Random Forest
   - Train GLM
   - Compare performance

2. **Advanced Models** (`05_advanced_models.ipynb`)
   - XGBoost, LightGBM, Neural Networks
   - Push toward higher accuracy

3. **Calibration** (`06_model_calibration.ipynb`)
   - Probability reliability
   - Threshold selection

In [None]:
print("\n✓ Data preparation complete!")
print("\n📊 Ready for modeling:")
print(f"  • {len(X_train_full):,} training samples")
print(f"  • {len(X_test_full):,} test samples")
print(f"  • {len(X_full.columns)} features (full)")
print(f"  • {len(X_reduced.columns)} features (reduced)")
print(f"\nNext: 04_baseline_models.ipynb")