# Notebook 03: Empirical Feature Validation & Allocation

**Created:** October 31, 2025  
**Purpose:** Data-driven feature selection and allocation to CORE vs ML sets  
**Approach:** Random Forest + VIF → Allocate based on model requirements

---

## Why This Approach?

**Previous Failure:** Theory-based pre-selection led to P1A validation failure
- Example: `Grain_Trade_YoY` had -67.84 importance (actively harmful!)
- Root cause: Pre-selected features based on domain knowledge without empirical validation

**New Approach:** Let data guide feature selection
1. [OK] **Notebook 02:** Created ~184 comprehensive features (ALL transformations)
2. [OK] **Notebook 03 (THIS):** Empirical validation → Allocate to CORE/ML
3. **Notebook 04:** Data preparation with empirically validated features

---

## Allocation Strategy

### CORE Features (ARIMAX/SARIMAX Input)
**Requirements:**
- [OK] Stationary transformation (diff, pct, yoy, mom, ma30_dev, vol30)
- [OK] Positive permutation importance (> 0)
- [OK] VIF < 10 (low multicollinearity)
- [OK] Economically interpretable
- **Target:** 8-12 features per route

### ML Features (XGBoost Input)
**Requirements:**
- [OK] Positive or neutral importance (> -5)
- [OK] Can include levels, non-stationary, correlated features
- [OK] Complementary to CORE (prefer diversity)
- **Target:** 15-20 features per route
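
The two sets of requirements above can be read as simple eligibility rules. The sketch below is a minimal illustration, not the notebook's allocation code: the helper name `eligible_sets` is hypothetical, the suffix list follows the `stationary_suffixes` used by the allocation function later in this notebook, and the VIF check and the interpretability review are left out. The importance values are taken from the outputs further down.

```python
# Hypothetical helper: which sets a single feature is eligible for,
# based only on its name suffix and its permutation importance.
STATIONARY_SUFFIXES = ("_diff", "_pct", "_yoy", "_mom", "_ma30_dev", "_vol30")

def eligible_sets(feature, importance):
    sets = set()
    # CORE: stationary transformation and strictly positive importance
    if feature.endswith(STATIONARY_SUFFIXES) and importance > 0:
        sets.add("CORE")
    # ML: positive or neutral importance; levels are allowed
    if importance > -5:
        sets.add("ML")
    return sets

print(sorted(eligible_sets("BPI_yoy", 2.56)))              # CORE and ML
print(sorted(eligible_sets("BPI_level", 3088.02)))         # ML only (a level)
print(sorted(eligible_sets("P3EA_1MON_level", -111.47)))   # neither
```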

---

## Quality Gate Criteria

| Metric | Threshold | P1A Target | P3A Target |
|--------|-----------|------------|------------|
| **Max Importance** | > 20 | [OK] | [OK] |
| **Mean Importance** | > 5 | [OK] | [OK] |
| **% Positive Features** | > 75% | [OK] | [OK] |
| **Max VIF (CORE)** | < 10 | [OK] | [OK] |

**Outcome:**
- [OK] **PASS:** Proceed to Phase 3 (Data Preparation)
- [WARN] **PARTIAL:** Proceed with caution, monitor failed route
- [FAIL] **FAIL:** Return to Notebook 02 (revise features)
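
A minimal sketch of how the gate could be evaluated. The thresholds come from the table above. The aggregation rule (all routes pass → PASS, some → PARTIAL, none → FAIL) is an assumption inferred from the outcome wording, the function names are hypothetical, and the metric values are illustrative rather than results from this notebook.

```python
# Thresholds copied from the quality gate table; function names are hypothetical.
def route_passes(max_imp, mean_imp, pct_positive, max_vif_core):
    return (max_imp > 20 and mean_imp > 5
            and pct_positive > 75 and max_vif_core < 10)

def gate_outcome(route_results):
    """Assumed aggregation: all routes pass -> PASS, some -> PARTIAL, none -> FAIL."""
    passed = sum(route_results.values())
    if passed == len(route_results):
        return "PASS"
    return "PARTIAL" if passed > 0 else "FAIL"

# Illustrative numbers only, not results from this notebook
results = {
    "route_a": route_passes(max_imp=45.0, mean_imp=8.2,
                            pct_positive=81.0, max_vif_core=6.5),
    "route_b": route_passes(max_imp=30.0, mean_imp=6.1,
                            pct_positive=60.0, max_vif_core=7.9),
}
print(results, gate_outcome(results))
```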

---

## Section 1: Setup & Load Data

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Feature selection & validation
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.metrics import mean_squared_error, r2_score
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.width', None)

print("[OK] Libraries imported")
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")

[OK] Libraries imported
Pandas version: 2.3.3
NumPy version: 2.3.3


In [2]:
# Load comprehensive features from Notebook 02
features_all = pd.read_csv('data/processed/features/features_comprehensive.csv',
                            index_col='Date', parse_dates=True)

# Load labels
labels = pd.read_csv('data/processed/intermediate/labels.csv',
                     index_col='Date', parse_dates=True)

print("[OK] Data loaded")
print(f"\nComprehensive features: {features_all.shape}")
print(f"Date range: {features_all.index.min()} to {features_all.index.max()}")
print(f"\nLabels: {labels.shape}")
print(f"Routes: {labels.columns.tolist()}")

[OK] Data loaded

Comprehensive features: (1153, 128)
Date range: 2021-03-01 00:00:00 to 2025-10-10 00:00:00

Labels: (1153, 2)
Routes: ['P1A_82', 'P3A_82']


In [3]:
# Check for features with excessive missing values (>90%)
missing_pct = (features_all.isnull().sum() / len(features_all) * 100).sort_values(ascending=False)
excessive_missing = missing_pct[missing_pct > 90]

print(" DATA QUALITY VERIFICATION")
print("=" * 80)

if len(excessive_missing) > 0:
    print(f"\n[WARN]  Found {len(excessive_missing)} features with >90% missing values:")
    print(f"\n(These will be automatically excluded from validation)\n")
    for feat, pct in excessive_missing.items():
        print(f"  - {feat:50s}: {pct:5.2f}%")
    
    # Drop these features
    features_all = features_all.drop(columns=excessive_missing.index)
    print(f"\n[OK] Dropped {len(excessive_missing)} features")
    print(f"   Remaining features: {features_all.shape[1]}")
else:
    print("\n[OK] No features with >90% missing values")

# Verify no infinite values (should be cleaned in Notebook 02)
inf_counts = np.isinf(features_all).sum()
features_with_inf = inf_counts[inf_counts > 0]

if len(features_with_inf) > 0:
    print(f"\n[FAIL] ERROR: Found {len(features_with_inf)} features with infinite values!")
    print("   This should NOT happen - data cleaning must occur in Notebook 02.")
    for feat, count in features_with_inf.items():
        print(f"  - {feat}: {count} infinite values")
    raise ValueError("Infinite values detected - data not properly cleaned in Notebook 02")
else:
    print("\n[OK] No infinite values (data properly cleaned)")

print(f"\nFinal feature count for validation: {features_all.shape[1]}")

 DATA QUALITY VERIFICATION

[OK] No features with >90% missing values

[OK] No infinite values (data properly cleaned)

Final feature count for validation: 128
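
The two checks above can be written a little more defensively. This is a sketch on a toy frame with made-up column names, assuming the feature table might contain a non-numeric column; it is not the check that produced the output above.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical columns): one all-missing and one infinite column
df = pd.DataFrame({
    "ok": [1.0, 2.0, 3.0, 4.0],
    "mostly_missing": [np.nan, np.nan, np.nan, np.nan],
    "has_inf": [1.0, np.inf, 3.0, 4.0],
    "label": ["a", "b", "c", "d"],  # non-numeric: excluded from the inf check
})

# Share of missing values per column, in percent
missing_pct = df.isna().mean() * 100
too_sparse = missing_pct[missing_pct > 90].index.tolist()

# Restrict the infinity check to numeric columns
numeric = df.select_dtypes(include="number")
inf_counts = np.isinf(numeric).sum()
with_inf = inf_counts[inf_counts > 0].index.tolist()

print(too_sparse)
print(with_inf)
```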


## Section 2: Step 1 - Comprehensive RF Validation (ALL Features)

In [4]:
def comprehensive_rf_validation(X, y, route_name, n_trees=200, n_repeats=10, random_state=42):
    """
    Validate ALL features using Random Forest + Permutation Importance.
    
    Parameters:
    -----------
    X : pd.DataFrame
        Feature matrix (all features)
    y : pd.Series
        Target variable (route rates)
    route_name : str
        Route identifier (e.g., 'P1A', 'P3A')
    n_trees : int
        Number of trees in Random Forest
    n_repeats : int
        Number of permutation repeats
    random_state : int
        Random seed for reproducibility
        
    Returns:
    --------
    results : pd.DataFrame
        Feature rankings (Feature, Importance, Std)
    rf_model : RandomForestRegressor
        Trained model
    X_train, X_test, y_train, y_test : splits
        Training/test data (for later analysis)
    """
    print(f"\n{'=' * 80}")
    print(f"COMPREHENSIVE RF VALIDATION: {route_name}")
    print(f"{'=' * 80}")

    # Drop NaN rows (from transformations)
    valid_mask = ~(X.isnull().any(axis=1) | y.isnull())
    X_clean = X[valid_mask]
    y_clean = y[valid_mask]

    print(f"\nOriginal data: {len(X)} rows")
    print(f"After dropping NaN: {len(X_clean)} rows ({len(X_clean)/len(X)*100:.1f}%)")
    print(f"Features: {X_clean.shape[1]}")

    # Temporal 70/30 split (preserve time order)
    split_idx = int(0.7 * len(X_clean))
    X_train = X_clean.iloc[:split_idx]
    X_test = X_clean.iloc[split_idx:]
    y_train = y_clean.iloc[:split_idx]
    y_test = y_clean.iloc[split_idx:]

    print(f"\nTemporal Split:")
    print(f"  Train: {len(X_train)} rows ({X_train.index.min().date()} to {X_train.index.max().date()})")
    print(f"  Test:  {len(X_test)} rows ({X_test.index.min().date()} to {X_test.index.max().date()})")

    # Train Random Forest
    print(f"\nTraining Random Forest...")
    print(f"  Trees: {n_trees}")
    print(f"  Max depth: 10")
    print(f"  Random state: {random_state}")
    
    rf = RandomForestRegressor(
        n_estimators=n_trees,
        max_depth=10,
        min_samples_split=20,
        min_samples_leaf=10,
        random_state=random_state,
        n_jobs=-1,
        verbose=0
    )
    rf.fit(X_train, y_train)
    print("[OK] Training complete")

    # Evaluate on both the train and test sets
    y_pred_train = rf.predict(X_train)
    y_pred_test = rf.predict(X_test)
    
    rmse_train = np.sqrt(mean_squared_error(y_train, y_pred_train))
    rmse_test = np.sqrt(mean_squared_error(y_test, y_pred_test))
    r2_train = r2_score(y_train, y_pred_train)
    r2_test = r2_score(y_test, y_pred_test)

    print(f"\nModel Performance:")
    print(f"  Train RMSE: ${rmse_train:,.2f}")
    print(f"  Test RMSE:  ${rmse_test:,.2f}")
    print(f"  Train R²:   {r2_train:.4f}")
    print(f"  Test R²:    {r2_test:.4f}")

    # Permutation Importance (on test set)
    print(f"\nComputing Permutation Importance...")
    print(f"  Repeats: {n_repeats}")
    print(f"  Scoring: neg_root_mean_squared_error")
    print(f"  (This may take 2-5 minutes...)")
    
    perm_imp = permutation_importance(
        rf, X_test, y_test,
        n_repeats=n_repeats,
        scoring='neg_root_mean_squared_error',
        random_state=random_state,
        n_jobs=-1
    )
    print("[OK] Permutation importance complete")

    # Create results dataframe
    results = pd.DataFrame({
        'Feature': X_train.columns,
        'Importance': perm_imp.importances_mean,
        'Std': perm_imp.importances_std
    }).sort_values('Importance', ascending=False).reset_index(drop=True)

    # Summary statistics
    print(f"\n{'=' * 80}")
    print("FEATURE IMPORTANCE SUMMARY")
    print(f"{'=' * 80}")
    print(f"  Total features:         {len(results)}")
    print(f"  Max Importance:         {results['Importance'].max():10.2f}")
    print(f"  Mean Importance:        {results['Importance'].mean():10.2f}")
    print(f"  Median Importance:      {results['Importance'].median():10.2f}")
    print(f"  % Positive (>0):        {(results['Importance'] > 0).mean() * 100:10.1f}%")
    print(f"  % Strong (>10):         {(results['Importance'] > 10).mean() * 100:10.1f}%")
    print(f"  Harmful (<-10):         {(results['Importance'] < -10).sum():10d} features")
    print(f"  Catastrophic (<-50):    {(results['Importance'] < -50).sum():10d} features")

    # Show top 10 and bottom 10 features
    print(f"\n TOP 10 FEATURES:")
    print(results.head(10).to_string(index=False))

    print(f"\n BOTTOM 10 FEATURES:")
    print(results.tail(10).to_string(index=False))

    return results, rf, X_train, X_test, y_train, y_test

print("[OK] RF validation function defined")

[OK] RF validation function defined
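
With `scoring='neg_root_mean_squared_error'`, a feature's permutation importance is the baseline score minus the score after shuffling that column, i.e. the increase in test RMSE in the target's units. A negative value therefore means that shuffling the column lowered the error on the test window. The sketch below re-implements that calculation by hand with NumPy on synthetic data. The stand-in predictor and all names are illustrative; it is not how scikit-learn is called in this notebook.

```python
import numpy as np

def neg_rmse(y_true, y_pred):
    return -np.sqrt(np.mean((y_true - y_pred) ** 2))

def manual_permutation_importance(predict, X, y, n_repeats=10, seed=42):
    """Baseline score minus mean permuted score, one value per column."""
    rng = np.random.default_rng(seed)
    baseline = neg_rmse(y, predict(X))
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        scores = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])  # break the link to y
            scores.append(neg_rmse(y, predict(X_perm)))
        importances[j] = baseline - np.mean(scores)
    return importances

# Synthetic data: y depends on column 0 only; column 1 is pure noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

def predict(X):
    # Stand-in "model" that uses only column 0
    return 3.0 * X[:, 0]

imp = manual_permutation_importance(predict, X, y)
print(imp)
```

Shuffling the informative column raises the RMSE sharply, while shuffling the ignored column leaves the predictions unchanged, so its importance is zero up to floating-point rounding.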


In [5]:
# Run RF validation for P1A
print("\n" + "="*80)
print("STARTING P1A VALIDATION")
print("="*80)

feature_ranking_p1a, rf_p1a, X_train_p1a, X_test_p1a, y_train_p1a, y_test_p1a = comprehensive_rf_validation(
    X=features_all,
    y=labels['P1A_82'],
    route_name='P1A (Atlantic - Grain Route)',
    n_trees=200,
    n_repeats=10,
    random_state=42
)

# Save ranking
feature_ranking_p1a.to_csv('data/processed/features/p1a_feature_ranking_all.csv', index=False)
print("\n[OK] P1A feature ranking saved: data/processed/features/p1a_feature_ranking_all.csv")


STARTING P1A VALIDATION

COMPREHENSIVE RF VALIDATION: P1A (Atlantic - Grain Route)

Original data: 1153 rows
After dropping NaN: 1073 rows (93.1%)
Features: 128

Temporal Split:
  Train: 751 rows (2021-04-14 to 2024-04-18)
  Test:  322 rows (2024-04-19 to 2025-10-10)

Training Random Forest...
  Trees: 200
  Max depth: 10
  Random state: 42
[OK] Training complete

Model Performance:
  Train RMSE: $1,025.83
  Test RMSE:  $2,329.72
  Train R²:   0.9860
  Test R²:    0.6159

Computing Permutation Importance...
  Repeats: 10
  Scoring: neg_root_mean_squared_error
  (This may take 2-5 minutes...)
[OK] Permutation importance complete

FEATURE IMPORTANCE SUMMARY
  Total features:         128
  Max Importance:            3088.02
  Mean Importance:             23.97
  Median Importance:           -0.00
  % Positive (>0):              27.3%
  % Strong (>10):                3.9%
  Harmful (<-10):                  2 features
  Catastrophic (<-50):             1 features

 TOP 10 FEATURES:
       

In [6]:
# Run RF validation for P3A
print("\n" + "="*80)
print("STARTING P3A VALIDATION")
print("="*80)

feature_ranking_p3a, rf_p3a, X_train_p3a, X_test_p3a, y_train_p3a, y_test_p3a = comprehensive_rf_validation(
    X=features_all,
    y=labels['P3A_82'],
    route_name='P3A (Pacific - Coal + Grain Route)',
    n_trees=200,
    n_repeats=10,
    random_state=42
)

# Save ranking
feature_ranking_p3a.to_csv('data/processed/features/p3a_feature_ranking_all.csv', index=False)
print("\n[OK] P3A feature ranking saved: data/processed/features/p3a_feature_ranking_all.csv")


STARTING P3A VALIDATION

COMPREHENSIVE RF VALIDATION: P3A (Pacific - Coal + Grain Route)

Original data: 1153 rows
After dropping NaN: 1073 rows (93.1%)
Features: 128

Temporal Split:
  Train: 751 rows (2021-04-14 to 2024-04-18)
  Test:  322 rows (2024-04-19 to 2025-10-10)

Training Random Forest...
  Trees: 200
  Max depth: 10
  Random state: 42
[OK] Training complete

Model Performance:
  Train RMSE: $651.44
  Test RMSE:  $1,554.16
  Train R²:   0.9937
  Test R²:    0.6735

Computing Permutation Importance...
  Repeats: 10
  Scoring: neg_root_mean_squared_error
  (This may take 2-5 minutes...)
[OK] Permutation importance complete

FEATURE IMPORTANCE SUMMARY
  Total features:         128
  Max Importance:            1783.06
  Mean Importance:             15.01
  Median Importance:            0.00
  % Positive (>0):              74.2%
  % Strong (>10):                4.7%
  Harmful (<-10):                  2 features
  Catastrophic (<-50):             0 features

 TOP 10 FEATURES:
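
Both runs use the same chronological 70/30 split. A minimal sketch of that split on a synthetic daily index, with the row count taken from the output above; the frame and its column name are placeholders.

```python
import numpy as np
import pandas as pd

# 1,073 usable rows, as reported after dropping NaN; the dates are synthetic
idx = pd.date_range("2021-01-01", periods=1073, freq="D")
frame = pd.DataFrame({"x": np.arange(1073)}, index=idx)

split_idx = int(0.7 * len(frame))  # first 70% of rows by time
train, test = frame.iloc[:split_idx], frame.iloc[split_idx:]

# No shuffling: every training date precedes every test date
assert train.index.max() < test.index.min()
print(len(train), len(test))  # 751 322
```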
   

## Section 3: Step 2 - Initial Feature Screening

In [7]:
def screen_features(ranking_df, threshold=-10, route_name=''):
    """
    Screen features based on importance threshold.
    Drop harmful features (importance < threshold).
    
    Parameters:
    -----------
    ranking_df : pd.DataFrame
        Feature ranking from RF validation
    threshold : float
        Minimum importance to keep (default: -10)
        Features below this are considered actively harmful
    route_name : str
        Route identifier
        
    Returns:
    --------
    keep_features : list
        Features to keep (importance > threshold)
    drop_features : list
        Features to drop (importance <= threshold)
    """
    print(f"\n{'=' * 80}")
    print(f"FEATURE SCREENING: {route_name}")
    print(f"{'=' * 80}")
    print(f"\nThreshold: Importance > {threshold}")

    keep_features = ranking_df[ranking_df['Importance'] > threshold]['Feature'].tolist()
    drop_features = ranking_df[ranking_df['Importance'] <= threshold]['Feature'].tolist()

    print(f"\n SCREENING RESULTS:")
    print(f"  [OK] Keep: {len(keep_features)} features ({len(keep_features)/len(ranking_df)*100:.1f}%)")
    print(f"  [FAIL] Drop: {len(drop_features)} features ({len(drop_features)/len(ranking_df)*100:.1f}%)")

    if drop_features:
        print(f"\n[FAIL] DROPPED FEATURES (harmful):")
        print(f"\n{'Rank':<6} {'Feature':<55} {'Importance':<12}")
        print("-" * 75)
        for i, feat in enumerate(drop_features[:20], 1):  # Show up to 20, in ranking order (least harmful first)
            imp = ranking_df[ranking_df['Feature'] == feat]['Importance'].values[0]
            print(f"{i:<6} {feat:<55} {imp:>10.2f}")
        if len(drop_features) > 20:
            print(f"\n   ... and {len(drop_features) - 20} more")
    else:
        print(f"\n[OK] No harmful features detected (all importance > {threshold})")

    return keep_features, drop_features


print("[OK] Screening function defined")

[OK] Screening function defined
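
The screening rule is a single threshold on the ranking. An equivalent vectorised sketch on a toy ranking (feature names and values are made up) that also lists the dropped features worst-first:

```python
import pandas as pd

ranking = pd.DataFrame({
    "Feature": ["f_a", "f_b", "f_c", "f_d"],
    "Importance": [12.0, 0.5, -10.0, -40.0],
})
threshold = -10

keep_mask = ranking["Importance"] > threshold
keep = ranking.loc[keep_mask, "Feature"].tolist()
# Sort ascending so the most harmful feature comes first
dropped = ranking.loc[~keep_mask].sort_values("Importance")

print(keep)
print(dropped["Feature"].tolist())
```

A feature sitting exactly on the threshold (`f_c` here) is dropped, because the keep condition is a strict inequality.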


In [8]:
# Screen P1A features
p1a_keep, p1a_drop = screen_features(
    feature_ranking_p1a,
    threshold=-10,
    route_name='P1A'
)


FEATURE SCREENING: P1A

Threshold: Importance > -10

 SCREENING RESULTS:
  [OK] Keep: 126 features (98.4%)
  [FAIL] Drop: 2 features (1.6%)

[FAIL] DROPPED FEATURES (harmful):

Rank   Feature                                                 Importance  
---------------------------------------------------------------------------
1      Grain_Trade_YoY_level                                       -32.49
2      P3EA_1MON_level                                            -111.47


In [9]:
# Screen P3A features
p3a_keep, p3a_drop = screen_features(
    feature_ranking_p3a,
    threshold=-10,
    route_name='P3A'
)


FEATURE SCREENING: P3A

Threshold: Importance > -10

 SCREENING RESULTS:
  [OK] Keep: 126 features (98.4%)
  [FAIL] Drop: 2 features (1.6%)

[FAIL] DROPPED FEATURES (harmful):

Rank   Feature                                                 Importance  
---------------------------------------------------------------------------
1      Panamax_Deliveries_DWT_level                                -16.29
2      TC5yr_Atlantic_level                                        -16.50


## Section 4: Step 3 - Feature Allocation Logic

In [10]:
def allocate_features(ranking_df, features_all_df, keep_features,
                      core_target=10, ml_target=20, route_name=''):
    """
    Allocate features into CORE (ARIMAX) vs ML (XGBoost) sets.
    
    CORE Requirements:
    - Stationary transformation (diff, pct, yoy, mom, ma30_dev, vol30)
    - Positive importance (> 0)
    - Will check VIF < 10 in next step
    - Target: 8-12 features
    
    ML Requirements:
    - Positive or neutral importance (> -5)
    - Can include levels, composite indices, correlated features
    - Complementary to CORE (prefer diversity)
    - Target: 15-20 features
    
    Parameters:
    -----------
    ranking_df : pd.DataFrame
        Feature ranking
    features_all_df : pd.DataFrame
        All features data
    keep_features : list
        Features that passed screening
    core_target : int
        Target number of CORE features
    ml_target : int
        Target number of ML features
    route_name : str
        Route identifier
        
    Returns:
    --------
    core_candidates : pd.DataFrame
        CORE feature candidates (before VIF filtering)
    ml_candidates : pd.DataFrame
        ML feature candidates (before final selection)
    ranking_filtered : pd.DataFrame
        Full ranking with stationarity flag
    """
    print(f"\n{'=' * 80}")
    print(f"FEATURE ALLOCATION: {route_name}")
    print(f"{'=' * 80}")

    # Filter to keep_features only
    ranking_filtered = ranking_df[ranking_df['Feature'].isin(keep_features)].copy()

    print(f"\nFiltered features: {len(ranking_filtered)} (after screening)")

    # Identify stationary features (for CORE candidates)
    stationary_suffixes = ['_diff', '_pct', '_yoy', '_mom', '_ma30_dev', '_vol30']
    ranking_filtered['Stationary'] = ranking_filtered['Feature'].apply(
        lambda x: any(x.endswith(suffix) for suffix in stationary_suffixes)
    )

    stationary_count = ranking_filtered['Stationary'].sum()
    level_count = (~ranking_filtered['Stationary']).sum()

    print(f"\nFeature types:")
    print(f"  Stationary (transformations): {stationary_count}")
    print(f"  Levels/Other:                 {level_count}")

    # CORE candidates: stationary + positive importance
    print(f"\n{'=' * 80}")
    print("CORE CANDIDATES (ARIMAX Input)")
    print(f"{'=' * 80}")
    print(f"Requirements: Stationary + Positive Importance + VIF<10 (checked next)")

    core_candidates = ranking_filtered[
        ranking_filtered['Stationary'] &
        (ranking_filtered['Importance'] > 0)
    ].head(core_target * 2)  # Select 2x target for VIF filtering

    print(f"\nSelected: {len(core_candidates)} candidates (will filter to {core_target} via VIF)")
    print(f"\nTop {min(15, len(core_candidates))} CORE candidates:")
    print(core_candidates.head(15).to_string(index=False))

    # ML candidates: positive/neutral importance
    print(f"\n{'=' * 80}")
    print("ML CANDIDATES (XGBoost Input)")
    print(f"{'=' * 80}")
    print(f"Requirements: Importance > -5 (allows neutral features)")

    ml_candidates = ranking_filtered[
        (ranking_filtered['Importance'] > -5)
    ].head(ml_target * 2)  # Select 2x target for diversity

    print(f"\nSelected: {len(ml_candidates)} candidates (will select best {ml_target})")
    print(f"\nTop {min(20, len(ml_candidates))} ML candidates:")
    print(ml_candidates.head(20).to_string(index=False))

    return core_candidates, ml_candidates, ranking_filtered


print("[OK] Allocation function defined")

[OK] Allocation function defined
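
Note that the `Stationary` flag is inferred from the column-name suffix, not from a statistical test. A small sketch of how the suffix rule classifies a few names that appear in this notebook; `is_stationary_by_name` is a hypothetical helper.

```python
STATIONARY_SUFFIXES = ("_diff", "_pct", "_yoy", "_mom", "_ma30_dev", "_vol30")

def is_stationary_by_name(feature):
    # str.endswith accepts a tuple of suffixes and is case-sensitive
    return feature.endswith(STATIONARY_SUFFIXES)

names = ["BPI_yoy", "Panamax_Fleet_Growth_YoY_vol30",
         "Grain_Trade_YoY_level", "BPI_level"]
flags = {name: is_stationary_by_name(name) for name in names}

for name, flag in flags.items():
    print(f"{name:35s} {flag}")
```

Only the final suffix counts, so `Grain_Trade_YoY_level` is treated as a level even though its name suggests a year-over-year series. Where that distinction matters, a unit-root test such as ADF could be used to confirm the flag.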


In [11]:
# Allocate P1A features
p1a_core_candidates, p1a_ml_candidates, p1a_ranking_filtered = allocate_features(
    ranking_df=feature_ranking_p1a,
    features_all_df=features_all,
    keep_features=p1a_keep,
    core_target=10,
    ml_target=20,
    route_name='P1A'
)


FEATURE ALLOCATION: P1A

Filtered features: 126 (after screening)

Feature types:
  Stationary (transformations): 96
  Levels/Other:                 30

CORE CANDIDATES (ARIMAX Input)
Requirements: Stationary + Positive Importance + VIF<10 (checked next)

Selected: 20 candidates (will filter to 10 via VIF)

Top 15 CORE candidates:
                       Feature  Importance       Std  Stationary
               Atlantic_IP_yoy   13.300019  1.916568        True
 Coal_Trade_Volume_Index_vol30   13.154564 14.463436        True
   Panamax_Orderbook_Pct_vol30    7.705753  1.172678        True
                       MGO_yoy    6.988966  2.458136        True
                  PDOPEX_vol30    6.706575  3.641188        True
                     BPI_vol30    4.349367  7.067327        True
   Coal_Trade_Volume_Index_yoy    2.588564  0.537171        True
                       BPI_yoy    2.560062  1.213179        True
  Capesize_Orderbook_Pct_vol30    2.413733  0.938572        True
     Atlantic_Po

In [12]:
# Allocate P3A features
p3a_core_candidates, p3a_ml_candidates, p3a_ranking_filtered = allocate_features(
    ranking_df=feature_ranking_p3a,
    features_all_df=features_all,
    keep_features=p3a_keep,
    core_target=10,
    ml_target=20,
    route_name='P3A'
)


FEATURE ALLOCATION: P3A

Filtered features: 126 (after screening)

Feature types:
  Stationary (transformations): 96
  Levels/Other:                 30

CORE CANDIDATES (ARIMAX Input)
Requirements: Stationary + Positive Importance + VIF<10 (checked next)

Selected: 20 candidates (will filter to 10 via VIF)

Top 15 CORE candidates:
                       Feature  Importance      Std  Stationary
   Panamax_Orderbook_Pct_vol30   17.580039 7.954505        True
                       MGO_yoy   17.349434 3.160071        True
Panamax_Fleet_Growth_YoY_vol30   13.796821 3.106985        True
                     P4_82_yoy   10.107181 6.149886        True
               Atlantic_IP_yoy    6.490443 2.080036        True
        Panamax_Idle_Pct_vol30    4.651007 2.151705        True
  Capesize_Orderbook_Pct_vol30    4.115066 3.075942        True
                    P4_82_diff    2.857903 0.979944        True
                       BPI_yoy    2.733056 0.893819        True
                     P4_82

## Section 5: Step 4 - VIF Analysis on CORE Candidates

In [13]:
def calculate_vif_and_filter(features_df, candidate_features, target_count=10, route_name=''):
    """
    Calculate VIF for CORE candidates and iteratively remove high-VIF features.
    
    Goal: Ensure all CORE features have VIF < 10 (ARIMAX requirement)
    
    Algorithm:
    1. Calculate VIF for all remaining candidates
    2. If max VIF < 10: stop (the set may still be larger than target_count)
    3. Else: remove the feature with the highest VIF and repeat
    4. Never prune below target_count, even if some VIF >= 10 remains
    
    Parameters:
    -----------
    features_df : pd.DataFrame
        All features data
    candidate_features : pd.DataFrame
        CORE candidates with rankings
    target_count : int
        Target number of final CORE features
    route_name : str
        Route identifier
        
    Returns:
    --------
    final_features : list
        Final CORE features (VIF < 10 unless pruning stopped at target_count)
    final_vif_df : pd.DataFrame
        VIF values for final features
    """
    print(f"\n{'=' * 80}")
    print(f"VIF ANALYSIS: {route_name} CORE FEATURES")
    print(f"{'=' * 80}")

    # Get feature data
    feature_names = candidate_features['Feature'].tolist()
    X = features_df[feature_names].dropna()

    print(f"\nInitial candidates: {len(feature_names)}")
    print(f"Target count: {target_count}")
    print(f"Goal: VIF < 10 for all features\n")

    iteration = 0
    # Iteratively remove high-VIF features
    while len(feature_names) > target_count:
        iteration += 1
        
        # Calculate VIF
        vif_data = []
        for i, col in enumerate(feature_names):
            try:
                vif = variance_inflation_factor(X[feature_names].values, i)
                vif_data.append({'Feature': col, 'VIF': vif})
            except Exception:
                # Handle singular matrix (perfect collinearity)
                vif_data.append({'Feature': col, 'VIF': np.inf})

        vif_df = pd.DataFrame(vif_data).sort_values('VIF', ascending=False)

        # Check if all VIF < 10
        max_vif = vif_df['VIF'].max()
        if max_vif < 10:
            print(f"\n[OK] Iteration {iteration}: All features have VIF < 10 (max = {max_vif:.2f})")
            break

        # Remove worst feature
        worst_feature = vif_df.iloc[0]['Feature']
        worst_vif = vif_df.iloc[0]['VIF']
        print(f"Iteration {iteration}: Removing {worst_feature[:50]:<50} (VIF = {worst_vif:.2f})")

        feature_names.remove(worst_feature)
        X = X.drop(columns=[worst_feature])

        if len(feature_names) == target_count:
            print(f"\n[OK] Reached target count: {target_count} features")
            break

    # Final VIF calculation
    print(f"\n{'=' * 80}")
    print(f"FINAL VIF REPORT")
    print(f"{'=' * 80}")

    final_vif = []
    for i, col in enumerate(feature_names):
        try:
            vif = variance_inflation_factor(X.values, i)
            final_vif.append({'Feature': col, 'VIF': vif})
        except Exception:
            final_vif.append({'Feature': col, 'VIF': np.inf})

    final_vif_df = pd.DataFrame(final_vif).sort_values('VIF', ascending=False)

    print(f"\n FINAL {route_name} CORE FEATURES ({len(feature_names)}):")
    print(f"\n{'Rank':<6} {'Feature':<55} {'VIF':<10}")
    print("-" * 72)
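    # NOTE: `i` is the pre-sort index, i.e. the feature's importance rank among
    # the surviving candidates, not its VIF rank (rows are already sorted by VIF).
    for i, row in final_vif_df.iterrows():
        print(f"{i+1:<6} {row['Feature']:<55} {row['VIF']:>8.2f}")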

    print(f"\n{'=' * 80}")
    print(f"VIF Summary:")
    print(f"  Max VIF:  {final_vif_df['VIF'].max():.2f}")
    print(f"  Mean VIF: {final_vif_df['VIF'].mean():.2f}")
    print(f"  Min VIF:  {final_vif_df['VIF'].min():.2f}")

    if final_vif_df['VIF'].max() < 10:
        print(f"\n[OK] All features pass VIF < 10 threshold")
    else:
        print(f"\n[WARN]  WARNING: Some features still have VIF >= 10")
        print(f"   Consider: Return to feature engineering OR accept higher VIF")

    return feature_names, final_vif_df


print("[OK] VIF analysis function defined")

[OK] VIF analysis function defined
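
One point worth checking: as far as we can tell, statsmodels' `variance_inflation_factor` regresses each column on exactly the other columns it is given, so without an explicit constant column the auxiliary regressions have no intercept. For features with a non-zero mean, that can inflate the reported VIF. The sketch below is a hand-written NumPy version (not the statsmodels code) that computes $\mathrm{VIF}_j = 1 / (1 - R_j^2)$ with and without an intercept on synthetic data; the function name and data are illustrative.

```python
import numpy as np

def vif(X, add_intercept=True):
    """VIF per column from auxiliary least-squares regressions."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        if add_intercept:
            others = np.column_stack([np.ones(n), others])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        ss_res = resid @ resid
        # Centered total sum of squares with an intercept, uncentered without
        ss_tot = ((y - y.mean()) ** 2).sum() if add_intercept else y @ y
        r2 = 1.0 - ss_res / ss_tot
        out[j] = np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2)
    return out

# Two independent columns with a large common mean
rng = np.random.default_rng(0)
X = 100 + rng.normal(size=(500, 2))

centered = vif(X, add_intercept=True)
uncentered = vif(X, add_intercept=False)
print(centered)
print(uncentered)
```

The two columns are independent, so the intercept version stays close to 1, while the no-intercept version is far larger purely because of the shared mean. If the CORE candidates have means far from zero, adding a constant column before computing VIF may be worth testing.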


In [14]:
# VIF analysis for P1A CORE
p1a_core_final, p1a_core_vif = calculate_vif_and_filter(
    features_df=features_all,
    candidate_features=p1a_core_candidates,
    target_count=10,
    route_name='P1A'
)

# Save VIF report
p1a_core_vif.to_csv('data/processed/features/p1a_core_vif_report.csv', index=False)
print("\n[OK] P1A CORE VIF report saved: data/processed/features/p1a_core_vif_report.csv")


VIF ANALYSIS: P1A CORE FEATURES

Initial candidates: 20
Target count: 10
Goal: VIF < 10 for all features

Iteration 1: Removing Atlantic_Port_Calls_vol30                          (VIF = 10.02)

[OK] Iteration 2: All features have VIF < 10 (max = 8.42)

FINAL VIF REPORT

 FINAL P1A CORE FEATURES (19):

Rank   Feature                                                 VIF       
------------------------------------------------------------------------
16     BPI_pct                                                     8.42
18     BPI_diff                                                    7.01
6      BPI_vol30                                                   5.69
10     C5TC_vol30                                                  4.61
14     P4_82_yoy                                                   4.05
8      BPI_yoy                                                     3.67
13     P4_82_pct                                                   3.25
2      Coal_Trade_Volume_Index_vol30         

In [15]:
# VIF analysis for P3A CORE
p3a_core_final, p3a_core_vif = calculate_vif_and_filter(
    features_df=features_all,
    candidate_features=p3a_core_candidates,
    target_count=10,
    route_name='P3A'
)

# Save VIF report
p3a_core_vif.to_csv('data/processed/features/p3a_core_vif_report.csv', index=False)
print("\n[OK] P3A CORE VIF report saved: data/processed/features/p3a_core_vif_report.csv")


VIF ANALYSIS: P3A CORE FEATURES

Initial candidates: 20
Target count: 10
Goal: VIF < 10 for all features

Iteration 1: Removing BPI_pct                                            (VIF = 12.43)
Iteration 2: Removing Atlantic_Port_Calls_vol30                          (VIF = 12.18)

[OK] Iteration 3: All features have VIF < 10 (max = 6.89)

FINAL VIF REPORT

 FINAL P3A CORE FEATURES (18):

Rank   Feature                                                 VIF       
------------------------------------------------------------------------
16     P4_82_vol30                                                 6.89
6      Panamax_Idle_Pct_vol30                                      5.88
15     MGO_vol30                                                   5.82
8      P4_82_diff                                                  4.56
10     P4_82_pct                                                   4.46
4      P4_82_yoy                                                   4.03
3      Panamax_Fleet_Growth_Yo

## Section 6: Step 5 - Finalize ML Feature Sets

In [16]:
def finalize_ml_features(core_features, ml_candidates, features_all_df, target_count=20, route_name=''):
    """
    Select final ML features (complementary to CORE).
    
    Strategy:
    1. Prioritize features NOT in CORE (for diversity)
    2. Then select by importance ranking
    3. Can include levels and correlated features (XGBoost is robust to both)
    
    Parameters:
    -----------
    core_features : list
        Final CORE features
    ml_candidates : pd.DataFrame
        ML candidates with rankings
    features_all_df : pd.DataFrame
        All features data
    target_count : int
        Target number of ML features
    route_name : str
        Route identifier
        
    Returns:
    --------
    ml_final : list
        Final ML features
    """
    print(f"\n{'=' * 80}")
    print(f"FINALIZING ML FEATURES: {route_name}")
    print(f"{'=' * 80}")

    # Prioritize features NOT in CORE (for diversity)
    ml_candidates_sorted = ml_candidates.copy()
    ml_candidates_sorted['In_CORE'] = ml_candidates_sorted['Feature'].isin(core_features)
    ml_candidates_sorted = ml_candidates_sorted.sort_values(
        ['In_CORE', 'Importance'], ascending=[True, False]
    )

    # Select top features
    ml_final = ml_candidates_sorted.head(target_count)['Feature'].tolist()

    overlap_count = len(set(ml_final) & set(core_features))
    unique_count = len(ml_final) - overlap_count

    print(f"\nSelected: {len(ml_final)} features")
    print(f"  Overlap with CORE: {overlap_count} features ({overlap_count/len(ml_final)*100:.1f}%)")
    print(f"  Unique to ML:      {unique_count} features ({unique_count/len(ml_final)*100:.1f}%)")

    # Show final ML features
    ml_final_df = ml_candidates_sorted.head(target_count)[['Feature', 'Importance', 'In_CORE']]
    
    print(f"\n FINAL {route_name} ML FEATURES ({len(ml_final)}):")
    print(f"\n{'Rank':<6} {'Feature':<55} {'Importance':<12} {'In CORE':<10}")
    print("-" * 85)
    for i, (idx, row) in enumerate(ml_final_df.iterrows(), 1):
        in_core_str = '[OK]' if row['In_CORE'] else ''
        print(f"{i:<6} {row['Feature']:<55} {row['Importance']:>10.2f}  {in_core_str:<10}")

    return ml_final


print("[OK] ML finalization function defined")

[OK] ML finalization function defined
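
The two-key sort is what enforces diversity: `In_CORE` is sorted ascending (`False` first), then `Importance` descending within each group. A toy sketch with made-up features and a hypothetical CORE set:

```python
import pandas as pd

candidates = pd.DataFrame({
    "Feature": ["a_yoy", "b_level", "c_vol30", "d_level"],
    "Importance": [9.0, 7.0, 5.0, 1.0],
})
core = ["a_yoy", "c_vol30"]  # hypothetical final CORE set

candidates["In_CORE"] = candidates["Feature"].isin(core)
ordered = candidates.sort_values(["In_CORE", "Importance"],
                                 ascending=[True, False])
ml_final = ordered.head(3)["Feature"].tolist()

print(ml_final)  # ['b_level', 'd_level', 'a_yoy']
```

A CORE feature is only picked once the non-CORE candidates run out. That is why a weak level (`d_level`) outranks a stronger CORE feature here, and why the overlap is zero whenever at least `target_count` non-CORE candidates are available.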


In [17]:
# Finalize P1A ML features
p1a_ml_final = finalize_ml_features(
    core_features=p1a_core_final,
    ml_candidates=p1a_ml_candidates,
    features_all_df=features_all,
    target_count=20,
    route_name='P1A'
)


FINALIZING ML FEATURES: P1A

Selected: 20 features
  Overlap with CORE: 0 features (0.0%)
  Unique to ML:      20 features (100.0%)

 FINAL P1A ML FEATURES (20):

Rank   Feature                                                 Importance   In CORE   
-------------------------------------------------------------------------------------
1      BPI_level                                                  3088.02            
2      TC5yr_Atlantic_level                                         71.32            
3      P1EA_CURMON_level                                            14.53            
4      P1EA_1MON_level                                               8.87            
5      P3EA_1Q_level                                                 5.35            
6      Panamax_Deliveries_DWT_level                                  5.26            
7      Panamax_Idle_Pct_level                                        3.90            
8      Atlantic_Port_Calls_level                             

In [18]:
# Finalize P3A ML features
p3a_ml_final = finalize_ml_features(
    core_features=p3a_core_final,
    ml_candidates=p3a_ml_candidates,
    features_all_df=features_all,
    target_count=20,
    route_name='P3A'
)


FINALIZING ML FEATURES: P3A

Selected: 20 features
  Overlap with CORE: 0 features (0.0%)
  Unique to ML:      20 features (100.0%)

 FINAL P3A ML FEATURES (20):

Rank   Feature                                                 Importance   In CORE   
-------------------------------------------------------------------------------------
1      BPI_level                                                  1783.06            
2      P4_82_level                                                 113.02            
3      P3EA_1Q_level                                                 2.37            
4      Atlantic_Port_Calls_vol30                                     1.89            
5      MGO_level                                                     1.31            
6      P1EA_CURMON_level                                             1.20            
7      TC5yr_Pacific_level                                           1.10            
8      World_Coal_Trade_MT_level                             

## Section 7: Step 6 - Save Final Feature Sets & Allocation Decisions

In [19]:
print("\n" + "="*80)
print("SAVING FINAL FEATURE SETS & ALLOCATION DECISIONS")
print("="*80)

# Create allocation decision log
decisions = []

# P1A CORE
for feat in p1a_core_final:
    imp = feature_ranking_p1a[feature_ranking_p1a['Feature'] == feat]['Importance'].values[0]
    vif = p1a_core_vif[p1a_core_vif['Feature'] == feat]['VIF'].values[0]
    decisions.append({
        'Route': 'P1A',
        'Set': 'CORE',
        'Feature': feat,
        'Importance': imp,
        'VIF': vif,
        'Rationale': 'Stationary, positive importance, VIF < 10'
    })

# P1A ML
for feat in p1a_ml_final:
    imp = feature_ranking_p1a[feature_ranking_p1a['Feature'] == feat]['Importance'].values[0]
    in_core = feat in p1a_core_final
    decisions.append({
        'Route': 'P1A',
        'Set': 'ML',
        'Feature': feat,
        'Importance': imp,
        'VIF': np.nan,
        'Rationale': f"Positive/neutral importance{', also in CORE' if in_core else ', unique to ML'}"
    })

# P3A CORE
for feat in p3a_core_final:
    imp = feature_ranking_p3a[feature_ranking_p3a['Feature'] == feat]['Importance'].values[0]
    vif = p3a_core_vif[p3a_core_vif['Feature'] == feat]['VIF'].values[0]
    decisions.append({
        'Route': 'P3A',
        'Set': 'CORE',
        'Feature': feat,
        'Importance': imp,
        'VIF': vif,
        'Rationale': 'Stationary, positive importance, VIF < 10'
    })

# P3A ML
for feat in p3a_ml_final:
    imp = feature_ranking_p3a[feature_ranking_p3a['Feature'] == feat]['Importance'].values[0]
    in_core = feat in p3a_core_final
    decisions.append({
        'Route': 'P3A',
        'Set': 'ML',
        'Feature': feat,
        'Importance': imp,
        'VIF': np.nan,
        'Rationale': f"Positive/neutral importance{', also in CORE' if in_core else ', unique to ML'}"
    })

# Save decisions
decisions_df = pd.DataFrame(decisions)
decisions_df.to_csv('data/processed/features/feature_allocation_decisions.csv', index=False)
print("\n[OK] Allocation decisions saved: data/processed/features/feature_allocation_decisions.csv")
print(f"   Total decisions logged: {len(decisions_df)}")

# Save final feature sets (data)
features_all[p1a_core_final].to_csv('data/processed/features/p1a_core_features_final.csv')
features_all[p1a_ml_final].to_csv('data/processed/features/p1a_ml_features_final.csv')
features_all[p3a_core_final].to_csv('data/processed/features/p3a_core_features_final.csv')
features_all[p3a_ml_final].to_csv('data/processed/features/p3a_ml_features_final.csv')

print("\n[OK] Final feature data saved:")
print(f"   - P1A CORE: {len(p1a_core_final)} features × {len(features_all)} rows")
print(f"   - P1A ML:   {len(p1a_ml_final)} features × {len(features_all)} rows")
print(f"   - P3A CORE: {len(p3a_core_final)} features × {len(features_all)} rows")
print(f"   - P3A ML:   {len(p3a_ml_final)} features × {len(features_all)} rows")

print("\n" + "="*80)
print("[OK] ALL FINAL FEATURE SETS SAVED")
print("="*80)


SAVING FINAL FEATURE SETS & ALLOCATION DECISIONS

[OK] Allocation decisions saved: data/processed/features/feature_allocation_decisions.csv
   Total decisions logged: 77

[OK] Final feature data saved:
   - P1A CORE: 19 features × 1153 rows
   - P1A ML:   20 features × 1153 rows
   - P3A CORE: 18 features × 1153 rows
   - P3A ML:   20 features × 1153 rows

[OK] ALL FINAL FEATURE SETS SAVED
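
The four loops that build the decision log differ only in route, set name, and rationale. As a rough sketch of how they could be folded into one helper (the name `build_decision_rows` and the dictionary inputs are assumptions for illustration, not code from this notebook):

```python
def build_decision_rows(route, set_name, features, importance_by_feature,
                        vif_by_feature=None, rationale=""):
    """Return one decision-log dict per feature (hypothetical helper)."""
    rows = []
    for feat in features:
        vif = vif_by_feature.get(feat) if vif_by_feature else None
        rows.append({
            "Route": route,
            "Set": set_name,
            "Feature": feat,
            "Importance": importance_by_feature[feat],
            # ML rows carry no VIF, so record NaN as the notebook does
            "VIF": float("nan") if vif is None else vif,
            "Rationale": rationale,
        })
    return rows


# Made-up values for illustration only
importance = {"X_yoy": 7.5, "Y_level": 42.0}
vif = {"X_yoy": 2.1}

decisions = []
decisions += build_decision_rows("P1A", "CORE", ["X_yoy"], importance, vif,
                                 "Stationary, positive importance, VIF < 10")
decisions += build_decision_rows("P1A", "ML", ["Y_level"], importance,
                                 rationale="Positive/neutral importance, unique to ML")
print(len(decisions))
```

In the notebook itself, the lookup dictionaries could plausibly be built once per route, for example with `feature_ranking_p1a.set_index('Feature')['Importance'].to_dict()`, instead of filtering the ranking frame for every feature.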


## Section 8: Step 7 - Quality Gate Validation

In [20]:
def validate_quality_gate(core_vif_df, ml_importance_df, route_name=''):
    """
    Check if final feature sets pass quality gate.
    
    Quality Criteria:
    - CORE: Max VIF < 10 (required); Mean VIF < 5 (advisory, warning only)
    - ML: Max Importance > 20, Mean Importance > 5, % Positive > 75%
    
    Parameters:
    -----------
    core_vif_df : pd.DataFrame
        VIF report for CORE features
    ml_importance_df : pd.DataFrame
        Importance rankings for ML features
    route_name : str
        Route identifier
        
    Returns:
    --------
    overall_pass : bool
        True if passed quality gate
    """
    print(f"\n{'=' * 80}")
    print(f"QUALITY GATE VALIDATION: {route_name}")
    print(f"{'=' * 80}")

    # CORE checks
    max_vif = core_vif_df['VIF'].max()
    mean_vif = core_vif_df['VIF'].mean()
    core_pass_strict = max_vif < 10
    core_pass_ideal = mean_vif < 5

    print(f"\n CORE Features (ARIMAX):")
    print(f"  Count:    {len(core_vif_df)}")
    print(f"  Max VIF:  {max_vif:.2f} (threshold: < 10) {'[OK]' if core_pass_strict else '[FAIL]'}")
    print(f"  Mean VIF: {mean_vif:.2f} (ideal: < 5) {'[OK]' if core_pass_ideal else '[WARN]'}")

    # ML checks
    max_imp = ml_importance_df['Importance'].max()
    mean_imp = ml_importance_df['Importance'].mean()
    median_imp = ml_importance_df['Importance'].median()
    pct_positive = (ml_importance_df['Importance'] > 0).mean() * 100
    
    ml_pass_max = max_imp > 20
    ml_pass_mean = mean_imp > 5
    ml_pass_pct = pct_positive > 75

    print(f"\n ML Features (XGBoost):")
    print(f"  Count:               {len(ml_importance_df)}")
    print(f"  Max Importance:      {max_imp:.2f} (threshold: > 20) {'[OK]' if ml_pass_max else '[FAIL]'}")
    print(f"  Mean Importance:     {mean_imp:.2f} (threshold: > 5) {'[OK]' if ml_pass_mean else '[FAIL]'}")
    print(f"  Median Importance:   {median_imp:.2f}")
    print(f"  % Positive Features: {pct_positive:.1f}% (threshold: > 75%) {'[OK]' if ml_pass_pct else '[FAIL]'}")

    # Overall decision
    core_pass = core_pass_strict
    ml_pass = ml_pass_max and ml_pass_mean and ml_pass_pct
    overall_pass = core_pass and ml_pass

    print(f"\n{'=' * 80}")
    print(f"DECISION FOR {route_name}:")
    print(f"{'=' * 80}")
    
    if overall_pass:
        print(f"\n[OK] {route_name} PASSED QUALITY GATE")
        print(f"   - CORE features: VIF < 10 [OK]")
        print(f"   - ML features: Strong importance [OK]")
        print(f"   - Ready for modeling")
    elif core_pass and not ml_pass:
        print(f"\n[WARN]  {route_name} PARTIAL PASS (CORE OK, ML WEAK)")
        print(f"   - CORE features: VIF < 10 [OK]")
        print(f"   - ML features: Below thresholds [FAIL]")
        print(f"   - Consider: Proceed but expect ML underperformance")
    elif ml_pass and not core_pass:
        print(f"\n[WARN]  {route_name} PARTIAL PASS (ML OK, CORE WEAK)")
        print(f"   - CORE features: VIF >= 10 [FAIL]")
        print(f"   - ML features: Strong importance [OK]")
        print(f"   - Consider: Accept higher VIF OR revise CORE features")
    else:
        print(f"\n[FAIL] {route_name} FAILED QUALITY GATE")
        print(f"   - CORE features: VIF >= 10 [FAIL]")
        print(f"   - ML features: Below thresholds [FAIL]")
        print(f"   - Action: Return to Notebook 02 (feature engineering)")

    return overall_pass


print("[OK] Quality gate function defined")

[OK] Quality gate function defined
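
The gate logic can also be expressed as a pure function that returns each check as a boolean, which makes the thresholds easier to test in isolation. This is a minimal sketch, assuming plain Python lists as inputs; the function name `quality_gate` and its keyword parameters are hypothetical.

```python
from statistics import mean, median


def quality_gate(core_vifs, ml_importances,
                 max_vif_limit=10, max_imp_min=20, mean_imp_min=5, pct_pos_min=75):
    """Return each check as a boolean, plus the overall result."""
    pct_positive = 100 * sum(v > 0 for v in ml_importances) / len(ml_importances)
    checks = {
        "core_max_vif": max(core_vifs) < max_vif_limit,
        "ml_max_importance": max(ml_importances) > max_imp_min,
        "ml_mean_importance": mean(ml_importances) > mean_imp_min,
        "ml_pct_positive": pct_positive > pct_pos_min,
    }
    # Overall pass requires every individual check to pass
    checks["overall"] = all(checks.values())
    return checks


# Made-up inputs: one dominant feature and several near-zero ones
toy_importances = [3000.0, 2.0, 1.0, 0.5, -0.2]
result = quality_gate(core_vifs=[1.5, 3.2, 8.4], ml_importances=toy_importances)
print(result)
print("median importance:", median(toy_importances))
```

In this toy input the mean-importance check passes only because of the single dominant value, while the median is 1.0. The notebook reports the median alongside the mean, which is useful for spotting the same pattern in real rankings.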


In [21]:
# Validate P1A
p1a_ml_importance = feature_ranking_p1a[feature_ranking_p1a['Feature'].isin(p1a_ml_final)]

p1a_pass = validate_quality_gate(
    core_vif_df=p1a_core_vif,
    ml_importance_df=p1a_ml_importance,
    route_name='P1A'
)


QUALITY GATE VALIDATION: P1A

 CORE Features (ARIMAX):
  Count:    19
  Max VIF:  8.42 (threshold: < 10) [OK]
  Mean VIF: 3.14 (ideal: < 5) [OK]

 ML Features (XGBoost):
  Count:               20
  Max Importance:      3088.02 (threshold: > 20) [OK]
  Mean Importance:     160.21 (threshold: > 5) [OK]
  Median Importance:   0.99
  % Positive Features: 80.0% (threshold: > 75%) [OK]

DECISION FOR P1A:

[OK] P1A PASSED QUALITY GATE
   - CORE features: VIF < 10 [OK]
   - ML features: Strong importance [OK]
   - Ready for modeling


In [22]:
# Validate P3A
p3a_ml_importance = feature_ranking_p3a[feature_ranking_p3a['Feature'].isin(p3a_ml_final)]

p3a_pass = validate_quality_gate(
    core_vif_df=p3a_core_vif,
    ml_importance_df=p3a_ml_importance,
    route_name='P3A'
)


QUALITY GATE VALIDATION: P3A

 CORE Features (ARIMAX):
  Count:    18
  Max VIF:  6.89 (threshold: < 10) [OK]
  Mean VIF: 3.48 (ideal: < 5) [OK]

 ML Features (XGBoost):
  Count:               20
  Max Importance:      1783.06 (threshold: > 20) [OK]
  Mean Importance:     95.24 (threshold: > 5) [OK]
  Median Importance:   0.08
  % Positive Features: 100.0% (threshold: > 75%) [OK]

DECISION FOR P3A:

[OK] P3A PASSED QUALITY GATE
   - CORE features: VIF < 10 [OK]
   - ML features: Strong importance [OK]
   - Ready for modeling


In [23]:
# Overall Sprint Decision
print("\n" + "="*80)
print("OVERALL SPRINT DECISION")
print("="*80)

if p1a_pass and p3a_pass:
    decision = "[OK] PROCEED TO PHASE 3: Data Preparation"
    action = "Both routes passed quality gate. Ready for Notebook 04."
    status = "PASS"
elif p1a_pass or p3a_pass:
    decision = "[WARN]  PARTIAL PASS: Proceed with Caution"
    if p1a_pass:
        action = "P1A passed, P3A weak. Proceed to modeling, monitor P3A performance."
    else:
        action = "P3A passed, P1A weak. Proceed to modeling, monitor P1A performance."
    status = "PARTIAL"
else:
    decision = "[FAIL] FAIL: Return to Phase 1 (Feature Engineering)"
    action = "Both routes failed. Revise features in Notebook 02, iterate."
    status = "FAIL"

print(f"\n{decision}")
print(f"\n Action: {action}")

# Save decision
decision_summary = pd.DataFrame([{
    'Date': pd.Timestamp.now(),
    'P1A_Pass': p1a_pass,
    'P3A_Pass': p3a_pass,
    'Overall_Status': status,
    'Decision': decision,
    'Action': action,
    'P1A_CORE_Count': len(p1a_core_final),
    'P1A_ML_Count': len(p1a_ml_final),
    'P3A_CORE_Count': len(p3a_core_final),
    'P3A_ML_Count': len(p3a_ml_final),
    'P1A_Max_VIF': p1a_core_vif['VIF'].max(),
    'P3A_Max_VIF': p3a_core_vif['VIF'].max(),
    'P1A_Max_Importance': p1a_ml_importance['Importance'].max(),
    'P3A_Max_Importance': p3a_ml_importance['Importance'].max(),
}])

decision_summary.to_csv('data/processed/features/quality_gate_decision.csv', index=False)
print("\n[OK] Quality gate decision saved: data/processed/features/quality_gate_decision.csv")

print("\n" + "="*80)
print("NOTEBOOK 03 COMPLETE")
print("="*80)
print("\n Summary:")
print(f"   - P1A CORE: {len(p1a_core_final)} features (VIF < 10: {p1a_core_vif['VIF'].max() < 10})")
print(f"   - P1A ML:   {len(p1a_ml_final)} features")
print(f"   - P3A CORE: {len(p3a_core_final)} features (VIF < 10: {p3a_core_vif['VIF'].max() < 10})")
print(f"   - P3A ML:   {len(p3a_ml_final)} features")
print(f"\n Next Step: {action}")


OVERALL SPRINT DECISION

[OK] PROCEED TO PHASE 3: Data Preparation

 Action: Both routes passed quality gate. Ready for Notebook 04.

[OK] Quality gate decision saved: data/processed/features/quality_gate_decision.csv

NOTEBOOK 03 COMPLETE

 Summary:
   - P1A CORE: 19 features (VIF < 10: True)
   - P1A ML:   20 features
   - P3A CORE: 18 features (VIF < 10: True)
   - P3A ML:   20 features

 Next Step: Both routes passed quality gate. Ready for Notebook 04.
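
The three-way outcome used in the sprint decision reduces to a small mapping from the two route results. A minimal sketch (the function name `sprint_status` is hypothetical):

```python
def sprint_status(p1a_pass, p3a_pass):
    """Map the two route results to PASS / PARTIAL / FAIL."""
    if p1a_pass and p3a_pass:
        return "PASS"
    if p1a_pass or p3a_pass:
        return "PARTIAL"
    return "FAIL"


for pair in [(True, True), (True, False), (False, True), (False, False)]:
    print(pair, "->", sprint_status(*pair))
```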


## Section 9: Step 8 - REDLINE: Reduce CORE to 6-8 Features

**CRITICAL DECISION:** Reduce CORE features from 19/18 to 6-8 to address ARIMAX overfitting

**Rationale:**
- Previous ARIMAX tests showed severe overfitting (Val R² < -1.0)
- Too many exogenous features relative to the data: 600 rows for 25 parameters, or 24 observations per parameter
- Target: 6-8 features per route for better parameter-to-observation ratio
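
The ratio quoted in the rationale is simple arithmetic. The sketch below reproduces it and, under the assumption that each exogenous feature adds exactly one estimated coefficient (the notebook does not break down the 25 parameters), shows how the ratio would move if 11 features were dropped, as in a 19-to-8 reduction.

```python
def obs_per_param(n_obs, n_params):
    """Observations available per estimated parameter."""
    return n_obs / n_params


before = obs_per_param(600, 25)
# Assumption: one coefficient per exogenous feature, so dropping 11 features
# (e.g. 19 -> 8) removes 11 parameters from the model.
after = obs_per_param(600, 25 - 11)
print(f"before: {before:.1f} obs/param, after: {after:.1f} obs/param")
```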

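**Selection Algorithm:**
1. **Filter by importance:** Keep only features with Importance > 5 (strong predictors); if fewer than 6 remain, relax the threshold to Importance > 1
2. **Prioritize daily features:** Prefer vol30 (score=3) > yoy/diff (score=2) > pct/mom (score=1) > other (score=0)
3. **Data quality:** Favor features with fewer missing values (score = % non-missing)
4. **Rank and select:** Sort by composite score (0.5 × normalized importance + 0.3 × type score + 0.2 × data quality) and keep the top 6-8 features
5. **Re-check VIF:** Recompute VIF on the selected features and drop the worst offender if any VIF >= 10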

**All decisions made programmatically with full audit trail**
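
The scoring can be sketched in plain Python. The weights (0.5, 0.3, 0.2) and the suffix scores follow the reduction function defined in the next cell; the input tuples and the helper names `type_score` and `composite_scores` are made up for illustration.

```python
def type_score(name):
    """Suffix-based score: vol30=3, yoy/diff=2, pct/mom=1, other=0."""
    if name.endswith("_vol30"):
        return 3
    if name.endswith(("_yoy", "_diff")):
        return 2
    if name.endswith(("_pct", "_mom")):
        return 1
    return 0


def composite_scores(features):
    """features: list of (name, importance, pct_non_missing). Returns name -> score."""
    imps = [imp for _, imp, _ in features]
    lo, hi = min(imps), max(imps)
    scores = {}
    for name, imp, quality in features:
        imp_norm = (imp - lo) / (hi - lo + 1e-10)  # min-max scaling to 0-1
        scores[name] = 0.5 * imp_norm + 0.3 * type_score(name) / 3 + 0.2 * quality / 100
    return scores


# Made-up inputs for illustration
toy = [("A_vol30", 6.0, 97.0), ("B_yoy", 12.0, 100.0), ("C_pct", 9.0, 90.0)]
scores = composite_scores(toy)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

Because importance is min-max scaled within the filtered set, the weakest surviving feature always contributes 0 on the importance term, so its rank depends entirely on transformation type and data quality.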

In [24]:
def reduce_core_features_programmatic(core_features, core_vif_df, feature_ranking_df,
                                      features_all_df, target_range=(6, 8), route_name=''):
    """
    Reduce CORE features to 6-8 using programmatic criteria.
    
    Selection Algorithm:
    1. Filter by importance > 5 (strong predictors); relax to > 1 if too few remain
    2. Calculate feature type score: vol30=3, yoy/diff=2, pct/mom=1, other=0
    3. Calculate data quality score: % non-missing values
    4. Rank by composite score: min-max normalized importance (weight 0.5),
       type score / 3 (weight 0.3), quality / 100 (weight 0.2)
    5. Select top 6-8, re-check VIF
    
    Parameters:
    -----------
    core_features : list
        Current CORE features (19/18)
    core_vif_df : pd.DataFrame
        VIF values for CORE features
    feature_ranking_df : pd.DataFrame
        Feature importance rankings
    features_all_df : pd.DataFrame
        All feature data
    target_range : tuple
        (min, max) number of features to select
    route_name : str
        Route identifier
        
    Returns:
    --------
    final_features : list
        Reduced CORE features (6-8)
    reduction_audit : pd.DataFrame
        Audit trail of selection decisions
    """
    
    print(f"\n{'=' * 80}")
    print(f"PROGRAMMATIC CORE FEATURE REDUCTION: {route_name}")
    print(f"{'=' * 80}")
    
    print(f"\nInitial CORE features: {len(core_features)}")
    print(f"Target: {target_range[0]}-{target_range[1]} features")
    
    # Collect per-feature inputs: importance, VIF, data quality, type score
    feature_scores = []
    for feat in core_features:
        importance = feature_ranking_df[feature_ranking_df['Feature'] == feat]['Importance'].values[0]
        vif = core_vif_df[core_vif_df['Feature'] == feat]['VIF'].values[0]
        
        # Data quality: % non-missing
        data_quality = (1 - features_all_df[feat].isnull().sum() / len(features_all_df)) * 100
        
        # Feature type score (daily transformations preferred)
        if feat.endswith('_vol30'):
            type_score = 3
            type_name = 'vol30 (30-day volatility)'
        elif feat.endswith('_yoy') or feat.endswith('_diff'):
            type_score = 2
            type_name = 'yoy/diff (year-over-year or first diff)'
        elif feat.endswith('_pct') or feat.endswith('_mom'):
            type_score = 1
            type_name = 'pct/mom (percentage change)'
        else:
            type_score = 0
            type_name = 'other'
        
        feature_scores.append({
            'Feature': feat,
            'Importance': importance,
            'VIF': vif,
            'Data_Quality_Pct': data_quality,
            'Type_Score': type_score,
            'Type_Name': type_name
        })
    
    scores_df = pd.DataFrame(feature_scores)
    
    # Step 1: Filter by importance > 5
    print(f"\n Step 1: Filter by Importance > 5")
    strong_features = scores_df[scores_df['Importance'] > 5].copy()
    print(f"   Remaining: {len(strong_features)} features")
    
    if len(strong_features) < target_range[0]:
        # Relax threshold if too few features
        print(f"   [WARN]  Too few features with Importance > 5")
        print(f"   Relaxing to Importance > 1")
        strong_features = scores_df[scores_df['Importance'] > 1].copy()
        print(f"   Remaining: {len(strong_features)} features")
    
    # Step 2: Calculate composite score
    print(f"\n Step 2: Calculate Composite Score")
    print(f"   Formula: (Importance_rank * 0.5) + (Type_score * 0.3) + (Quality_score * 0.2)")
    
    # Normalize scores to 0-1 (small epsilon avoids division by zero)
    imp_min = strong_features['Importance'].min()
    imp_range = strong_features['Importance'].max() - imp_min
    strong_features['Importance_norm'] = (strong_features['Importance'] - imp_min) / (imp_range + 1e-10)
    strong_features['Type_score_norm'] = strong_features['Type_Score'] / 3.0  # Max score is 3
    strong_features['Quality_norm'] = strong_features['Data_Quality_Pct'] / 100.0
    
    # Composite score
    strong_features['Composite_Score'] = (
        strong_features['Importance_norm'] * 0.5 +
        strong_features['Type_score_norm'] * 0.3 +
        strong_features['Quality_norm'] * 0.2
    )
    
    # Sort by composite score
    strong_features = strong_features.sort_values('Composite_Score', ascending=False)
    
    # Step 3: Select the top features, up to the upper bound of the target range
    # (if fewer candidates remain than the upper bound, all of them are kept)
    target_count = min(target_range[1], len(strong_features))
    selected_features = strong_features.head(target_count)['Feature'].tolist()
    
    print(f"\n Step 3: Select Top {len(selected_features)} Features")
    
    # Step 4: VIF re-check (drop the highest-VIF feature until all VIF < 10)
    print(f"\n Step 4: VIF Re-check")

    def _compute_vif(features):
        """Return a DataFrame of VIF values for the given feature list."""
        X = features_all_df[features].dropna()
        rows = []
        for i, col in enumerate(features):
            try:
                vif = variance_inflation_factor(X.values, i)
            except Exception:
                # Treat numerical failures as infinite VIF
                vif = np.inf
            rows.append({'Feature': col, 'VIF': vif})
        return pd.DataFrame(rows)

    final_vif_df = _compute_vif(selected_features)
    max_vif = final_vif_df['VIF'].max()

    print(f"   Max VIF: {max_vif:.2f}")

    # Repeat until the threshold is met; a single removal may not be enough
    while max_vif >= 10 and len(selected_features) > 1:
        print(f"   [WARN]  Some features have VIF >= 10, removing highest VIF feature")
        worst_feat = final_vif_df.loc[final_vif_df['VIF'].idxmax(), 'Feature']
        selected_features.remove(worst_feat)
        print(f"   Removed: {worst_feat}")

        # Recalculate VIF on the remaining features
        final_vif_df = _compute_vif(selected_features)
        max_vif = final_vif_df['VIF'].max()
        print(f"   New Max VIF: {max_vif:.2f}")
    
    # Create audit trail
    audit_df = strong_features[strong_features['Feature'].isin(selected_features)].copy()
    audit_df = audit_df.merge(final_vif_df, on='Feature', suffixes=('_old', '_new'))
    audit_df = audit_df[['Feature', 'Importance', 'VIF_new', 'Data_Quality_Pct', 
                          'Type_Score', 'Type_Name', 'Composite_Score']]
    audit_df = audit_df.sort_values('Composite_Score', ascending=False)
    
    # Display results
    print(f"\n{'=' * 80}")
    print(f"FINAL REDUCED CORE FEATURES: {route_name} ({len(selected_features)} features)")
    print(f"{'=' * 80}")
    print(f"\n{'Rank':<6} {'Feature':<40} {'Imp':<8} {'VIF':<6} {'Quality':<8} {'Type':<6} {'Score':<8}")
    print("-" * 90)
    for rank, (_, row) in enumerate(audit_df.iterrows(), 1):
        print(f"{rank:<6} {row['Feature']:<40} {row['Importance']:>6.2f}  {row['VIF_new']:>5.2f}  {row['Data_Quality_Pct']:>6.1f}%  {int(row['Type_Score']):>4}     {row['Composite_Score']:>6.4f}")
    
    print(f"\n{'=' * 80}")
    print(f"REDUCTION SUMMARY:")
    print(f"{'=' * 80}")
    print(f"  Initial features: {len(core_features)}")
    print(f"  Final features:   {len(selected_features)}")
    print(f"  Reduction:        {len(core_features) - len(selected_features)} features ({(1 - len(selected_features)/len(core_features))*100:.1f}%)")
    print(f"  Max VIF:          {max_vif:.2f}")
    print(f"  Feature types:")
    type_counts = audit_df['Type_Name'].value_counts()
    for type_name, count in type_counts.items():
        print(f"    - {type_name}: {count} features")
    
    return selected_features, audit_df

print("[OK] Feature reduction function defined")

[OK] Feature reduction function defined
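
For intuition on the VIF threshold: with exactly two predictors and an intercept, the VIF of either one is 1 / (1 - r²), where r is their Pearson correlation. The sketch below uses that special case on made-up series; it is not the general computation done by `variance_inflation_factor`, which regresses each column on all the others.

```python
def pearson_r(x, y):
    """Pearson correlation computed from scratch."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def vif_two_features(x, y):
    """Centered VIF for a pair of predictors: 1 / (1 - r^2)."""
    r = pearson_r(x, y)
    return 1.0 / (1.0 - r * r)


# Made-up series for illustration
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
near_copy = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0]   # almost a duplicate of x
unrelated = [2.0, -1.0, 0.5, 1.5, -0.5, 1.0]

print(f"near copy: VIF = {vif_two_features(x, near_copy):.1f}")
print(f"unrelated: VIF = {vif_two_features(x, unrelated):.2f}")
```

A near-duplicate feature pushes VIF far above the threshold of 10, while a weakly correlated one stays close to 1, which is why the reduction step removes the highest-VIF feature first.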


In [25]:
# Apply reduction to P1A
p1a_core_reduced, p1a_reduction_audit = reduce_core_features_programmatic(
    core_features=p1a_core_final,
    core_vif_df=p1a_core_vif,
    feature_ranking_df=feature_ranking_p1a,
    features_all_df=features_all,
    target_range=(6, 8),
    route_name='P1A'
)


PROGRAMMATIC CORE FEATURE REDUCTION: P1A

Initial CORE features: 19
Target: 6-8 features

 Step 1: Filter by Importance > 5
   Remaining: 5 features
   [WARN]  Too few features with Importance > 5
   Relaxing to Importance > 1
   Remaining: 12 features

 Step 2: Calculate Composite Score
   Formula: (Importance_rank * 0.5) + (Type_score * 0.3) + (Quality_score * 0.2)

 Step 3: Select Top 8 Features

 Step 4: VIF Re-check
   Max VIF: 5.16

FINAL REDUCED CORE FEATURES: P1A (8 features)

Rank   Feature                                  Imp      VIF    Quality  Type   Score   
------------------------------------------------------------------------------------------
1      Coal_Trade_Volume_Index_vol30             13.15   3.01    97.4%     3     0.9886
2      Atlantic_IP_yoy                           13.30   1.54    99.9%     2     0.8998
3      Panamax_Orderbook_Pct_vol30                7.71   2.59    97.4%     3     0.7568
4      PDOPEX_vol30                               6.71   1.46    

In [26]:
# Apply reduction to P3A
p3a_core_reduced, p3a_reduction_audit = reduce_core_features_programmatic(
    core_features=p3a_core_final,
    core_vif_df=p3a_core_vif,
    feature_ranking_df=feature_ranking_p3a,
    features_all_df=features_all,
    target_range=(6, 8),
    route_name='P3A'
)


PROGRAMMATIC CORE FEATURE REDUCTION: P3A

Initial CORE features: 18
Target: 6-8 features

 Step 1: Filter by Importance > 5
   Remaining: 5 features
   [WARN]  Too few features with Importance > 5
   Relaxing to Importance > 1
   Remaining: 14 features

 Step 2: Calculate Composite Score
   Formula: (Importance_rank * 0.5) + (Type_score * 0.3) + (Quality_score * 0.2)

 Step 3: Select Top 8 Features

 Step 4: VIF Re-check
   Max VIF: 4.04

FINAL REDUCED CORE FEATURES: P3A (8 features)

Rank   Feature                                  Imp      VIF    Quality  Type   Score   
------------------------------------------------------------------------------------------
1      Panamax_Orderbook_Pct_vol30               17.58   3.14    97.4%     3     0.9948
2      MGO_yoy                                   17.35   1.10    98.9%     2     0.8908
3      Panamax_Fleet_Growth_YoY_vol30            13.80   3.28    97.4%     3     0.8803
4      P4_82_yoy                                 10.11   1.05    

In [27]:
# Update ML features: Add features dropped from CORE
print("\n" + "="*80)
print("UPDATING ML FEATURE SETS")
print("="*80)

# P1A: Add dropped CORE features to ML set
p1a_dropped_from_core = [f for f in p1a_core_final if f not in p1a_core_reduced]
print(f"\nP1A:")
print(f"  Dropped from CORE: {len(p1a_dropped_from_core)} features")
print(f"  Current ML:        {len(p1a_ml_final)} features")

# Add dropped features to ML (avoid duplicates)
p1a_ml_updated = list(p1a_ml_final) + [f for f in p1a_dropped_from_core if f not in p1a_ml_final]
print(f"  Updated ML:        {len(p1a_ml_updated)} features")

# P3A: Add dropped CORE features to ML set
p3a_dropped_from_core = [f for f in p3a_core_final if f not in p3a_core_reduced]
print(f"\nP3A:")
print(f"  Dropped from CORE: {len(p3a_dropped_from_core)} features")
print(f"  Current ML:        {len(p3a_ml_final)} features")

# Add dropped features to ML (avoid duplicates)
p3a_ml_updated = list(p3a_ml_final) + [f for f in p3a_dropped_from_core if f not in p3a_ml_final]
print(f"  Updated ML:        {len(p3a_ml_updated)} features")

# Verify no overlap
p1a_overlap = set(p1a_core_reduced) & set(p1a_ml_updated)
p3a_overlap = set(p3a_core_reduced) & set(p3a_ml_updated)

print(f"\n OVERLAP CHECK:")
print(f"  P1A: {len(p1a_overlap)} overlapping features {'[OK]' if len(p1a_overlap) == 0 else '[FAIL]'}")
print(f"  P3A: {len(p3a_overlap)} overlapping features {'[OK]' if len(p3a_overlap) == 0 else '[FAIL]'}")

if len(p1a_overlap) > 0:
    print(f"\n[FAIL] P1A overlap detected: {p1a_overlap}")
    
if len(p3a_overlap) > 0:
    print(f"\n[FAIL] P3A overlap detected: {p3a_overlap}")


UPDATING ML FEATURE SETS

P1A:
  Dropped from CORE: 11 features
  Current ML:        20 features
  Updated ML:        31 features

P3A:
  Dropped from CORE: 10 features
  Current ML:        20 features
  Updated ML:        30 features

 OVERLAP CHECK:
  P1A: 0 overlapping features [OK]
  P3A: 0 overlapping features [OK]
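
The update step is an order-preserving union followed by an intersection check. A minimal sketch with made-up feature names (the helper `move_dropped_to_ml` is hypothetical) shows what the check would report if a reduced CORE feature had already been in the ML list:

```python
def move_dropped_to_ml(core_before, core_after, ml_before):
    """Append features dropped from CORE to the ML list, without duplicates."""
    dropped = [f for f in core_before if f not in core_after]
    ml_after = list(ml_before) + [f for f in dropped if f not in ml_before]
    overlap = set(core_after) & set(ml_after)
    return ml_after, overlap


# Made-up feature names
core_before = ["a_yoy", "b_vol30", "c_diff"]
core_after = ["b_vol30"]
ml_before = ["x_level", "b_vol30"]   # b_vol30 sits in both sets

ml_after, overlap = move_dropped_to_ml(core_before, core_after, ml_before)
print(ml_after)
print(overlap)
```

The zero-overlap result in the notebook therefore depends on the original ML sets being disjoint from CORE (0 overlapping features for both routes at the ML finalization step); the update itself does not remove features that were already shared.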


In [28]:
# Save v2 feature sets
print("\n" + "="*80)
print("SAVING V2 FEATURE SETS (6-8 CORE + UPDATED ML)")
print("="*80)

# Save reduced CORE features
features_all[p1a_core_reduced].to_csv('data/processed/features/p1a_core_features_final_v2.csv')
features_all[p3a_core_reduced].to_csv('data/processed/features/p3a_core_features_final_v2.csv')

print("\n[OK] V2 CORE features saved:")
print(f"   - P1A: {len(p1a_core_reduced)} features × {len(features_all)} rows")
print(f"   - P3A: {len(p3a_core_reduced)} features × {len(features_all)} rows")

# Save updated ML features
features_all[p1a_ml_updated].to_csv('data/processed/features/p1a_ml_features_final_v2.csv')
features_all[p3a_ml_updated].to_csv('data/processed/features/p3a_ml_features_final_v2.csv')

print("\n[OK] V2 ML features saved:")
print(f"   - P1A: {len(p1a_ml_updated)} features × {len(features_all)} rows")
print(f"   - P3A: {len(p3a_ml_updated)} features × {len(features_all)} rows")

# Save reduction audit trails
p1a_reduction_audit.to_csv('data/processed/features/p1a_core_reduction_audit_v2.csv', index=False)
p3a_reduction_audit.to_csv('data/processed/features/p3a_core_reduction_audit_v2.csv', index=False)

print("\n[OK] Reduction audit trails saved:")
print(f"   - data/processed/features/p1a_core_reduction_audit_v2.csv")
print(f"   - data/processed/features/p3a_core_reduction_audit_v2.csv")

# Create comprehensive v2 allocation decisions
decisions_v2 = []

# P1A CORE v2
for feat in p1a_core_reduced:
    audit_row = p1a_reduction_audit[p1a_reduction_audit['Feature'] == feat].iloc[0]
    decisions_v2.append({
        'Route': 'P1A',
        'Set': 'CORE_v2',
        'Feature': feat,
        'Importance': audit_row['Importance'],
        'VIF': audit_row['VIF_new'],
        'Data_Quality_Pct': audit_row['Data_Quality_Pct'],
        'Type_Score': audit_row['Type_Score'],
        'Composite_Score': audit_row['Composite_Score'],
        'Rationale': f"Selected via programmatic reduction: {audit_row['Type_Name']}, Composite Score={audit_row['Composite_Score']:.4f}"
    })

# P1A ML v2 (moved from CORE)
for feat in p1a_dropped_from_core:
    imp = feature_ranking_p1a[feature_ranking_p1a['Feature'] == feat]['Importance'].values[0]
    decisions_v2.append({
        'Route': 'P1A',
        'Set': 'ML_v2',
        'Feature': feat,
        'Importance': imp,
        'VIF': np.nan,
        'Data_Quality_Pct': np.nan,
        'Type_Score': np.nan,
        'Composite_Score': np.nan,
        'Rationale': 'Moved from CORE to ML (reduction)'
    })

# P1A ML v2 (original)
for feat in p1a_ml_final:
    imp = feature_ranking_p1a[feature_ranking_p1a['Feature'] == feat]['Importance'].values[0]
    decisions_v2.append({
        'Route': 'P1A',
        'Set': 'ML_v2',
        'Feature': feat,
        'Importance': imp,
        'VIF': np.nan,
        'Data_Quality_Pct': np.nan,
        'Type_Score': np.nan,
        'Composite_Score': np.nan,
        'Rationale': 'Original ML feature'
    })

# P3A CORE v2
for feat in p3a_core_reduced:
    audit_row = p3a_reduction_audit[p3a_reduction_audit['Feature'] == feat].iloc[0]
    decisions_v2.append({
        'Route': 'P3A',
        'Set': 'CORE_v2',
        'Feature': feat,
        'Importance': audit_row['Importance'],
        'VIF': audit_row['VIF_new'],
        'Data_Quality_Pct': audit_row['Data_Quality_Pct'],
        'Type_Score': audit_row['Type_Score'],
        'Composite_Score': audit_row['Composite_Score'],
        'Rationale': f"Selected via programmatic reduction: {audit_row['Type_Name']}, Composite Score={audit_row['Composite_Score']:.4f}"
    })

# P3A ML v2 (moved from CORE)
for feat in p3a_dropped_from_core:
    imp = feature_ranking_p3a[feature_ranking_p3a['Feature'] == feat]['Importance'].values[0]
    decisions_v2.append({
        'Route': 'P3A',
        'Set': 'ML_v2',
        'Feature': feat,
        'Importance': imp,
        'VIF': np.nan,
        'Data_Quality_Pct': np.nan,
        'Type_Score': np.nan,
        'Composite_Score': np.nan,
        'Rationale': 'Moved from CORE to ML (reduction)'
    })

# P3A ML v2 (original)
for feat in p3a_ml_final:
    imp = feature_ranking_p3a[feature_ranking_p3a['Feature'] == feat]['Importance'].values[0]
    decisions_v2.append({
        'Route': 'P3A',
        'Set': 'ML_v2',
        'Feature': feat,
        'Importance': imp,
        'VIF': np.nan,
        'Data_Quality_Pct': np.nan,
        'Type_Score': np.nan,
        'Composite_Score': np.nan,
        'Rationale': 'Original ML feature'
    })

decisions_v2_df = pd.DataFrame(decisions_v2)
decisions_v2_df.to_csv('data/processed/features/feature_allocation_decisions_v2.csv', index=False)

print("\n[OK] V2 allocation decisions saved:")
print(f"   - data/processed/features/feature_allocation_decisions_v2.csv")
print(f"   - Total entries: {len(decisions_v2_df)}")

print("\n" + "="*80)
print("V2 FEATURE SETS COMPLETE")
print("="*80)
print("\n Final Summary:")
print(f"  P1A CORE v2: {len(p1a_core_reduced)} features (was {len(p1a_core_final)})")
print(f"  P1A ML v2:   {len(p1a_ml_updated)} features (was {len(p1a_ml_final)})")
print(f"  P3A CORE v2: {len(p3a_core_reduced)} features (was {len(p3a_core_final)})")
print(f"  P3A ML v2:   {len(p3a_ml_updated)} features (was {len(p3a_ml_final)})")
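# Report the overlap status from the check above instead of asserting it unconditionally
if not p1a_overlap and not p3a_overlap:
    print(f"\n[OK] NO OVERLAP between CORE and ML sets")
else:
    print(f"\n[FAIL] OVERLAP between CORE and ML sets: P1A={sorted(p1a_overlap)}, P3A={sorted(p3a_overlap)}")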
print(f"\n Ready for Notebook 04: ARIMAX Test Configurations")


SAVING V2 FEATURE SETS (6-8 CORE + UPDATED ML)

[OK] V2 CORE features saved:
   - P1A: 8 features × 1153 rows
   - P3A: 8 features × 1153 rows

[OK] V2 ML features saved:
   - P1A: 31 features × 1153 rows
   - P3A: 30 features × 1153 rows

[OK] Reduction audit trails saved:
   - data/processed/features/p1a_core_reduction_audit_v2.csv
   - data/processed/features/p3a_core_reduction_audit_v2.csv

[OK] V2 allocation decisions saved:
   - data/processed/features/feature_allocation_decisions_v2.csv
   - Total entries: 77

V2 FEATURE SETS COMPLETE

 Final Summary:
  P1A CORE v2: 8 features (was 19)
  P1A ML v2:   31 features (was 20)
  P3A CORE v2: 8 features (was 18)
  P3A ML v2:   30 features (was 20)

[OK] NO OVERLAP between CORE and ML sets

 Ready for Notebook 04: ARIMAX Test Configurations
