# Notebook 02: Comprehensive Feature Engineering (REVISED - Round 2)

**Created:** October 31, 2025  
**Revised:** October 31, 2025 (Post-validation fixes)  
**Purpose:** Create a comprehensive feature library (~115 features planned; 128 produced in the run recorded below) with empirically validated transformation variants  
**Approach:** Data-driven (create all features → empirical validation → allocate to CORE/ML)

---

## Why This Approach?

**Round 1 Problem:** Pre-selected features based on theory failed validation
- Example: `Grain_Trade_YoY` had -67.84 importance (harmful!)
- P1A validation failed due to poor feature quality

**Round 2 Solution:** Empirical feature selection
1. Engineer ALL potentially useful transformations (~115 features)
2. Use Random Forest + VIF to empirically validate
3. Allocate best performers to CORE (ARIMAX) vs ML (XGBoost)

---

## Transformation Strategy (REVISED)

**Decisions Applied:**
- **D1-1A**: ❌ Removed FFA spreads using labels as inputs (-4 features)
- **D2-2B**: ❌ Skip vol30 for 8 annual trade features (-8 features)
- **D3-3A**: ❌ Remove mom transformation entirely (-24 features)
- **D4-4B**: ✅ Use 5 transformations (level, diff, pct, yoy, vol30)
- **D5-5A**: ✅ Keep YoY transformations (valuable for freight markets)
- **D7-7A**: ✅ Uniform strategy across all features

For each raw feature, create up to 5 variants:

| Transformation | Formula | Stationarity | Use Case |
|----------------|---------|--------------|----------|
| **Level** | `x_t` | No | XGBoost only |
| **First Difference** | `x_t - x_{t-1}` | Yes | ARIMAX + XGBoost |
| **Percent Change** | `(x_t - x_{t-1}) / x_{t-1} * 100` | Yes | Both |
| **YoY Change** | `(x_t - x_{t-12}) / x_{t-12} * 100` | Yes | ARIMAX (seasonality) |
| **Rolling Vol 30d** | `σ_{30}(x)` | Partial | XGBoost (regimes) |
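
As a minimal sketch of the five variants in the table, the snippet below applies them to a toy series with the same pandas calls the notebook uses later (`diff`, `pct_change`, `rolling(...).std()`). The series `x` and the `x_*` column names are illustrative and not taken from the dataset. Note that `pct_change(12)` looks back 12 rows of whatever index it is given.

```python
import numpy as np
import pandas as pd

# Toy input: 40 rows of squares (1, 4, 9, ...), purely illustrative
x = pd.Series(np.arange(1.0, 41.0) ** 2, name="x")

variants = pd.DataFrame({
    "x_level": x,                                             # x_t
    "x_diff": x.diff(),                                       # x_t - x_{t-1}
    "x_pct": x.pct_change() * 100,                            # (x_t - x_{t-1}) / x_{t-1} * 100
    "x_yoy": x.pct_change(12) * 100,                          # (x_t - x_{t-12}) / x_{t-12} * 100
    "x_vol30": x.rolling(window=30, min_periods=30).std(),    # 30-row rolling std
})

# Leading NaN per variant: 0 for level, 1 for diff/pct, 12 for yoy, 29 for vol30
print(variants.isna().sum())
```

On real inputs that already carry the 1-day lag, each of these counts grows by one because the first row is NaN.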

**Removed Transformations:**
- ❌ **MoM** - Redundant (identical to pct)
- ❌ **MA30 Deviation** - Excessive NaN (430 avg per feature, 89.94% for annual data)
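
To illustrate why an exact duplicate such as MoM is a problem for a VIF screen, here is a small sketch that assumes the usual definition VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing column j on the remaining columns. The helper `vif_table`, its `tol` parameter, and the synthetic columns are hypothetical; the validation notebook may compute VIF differently.

```python
import numpy as np
import pandas as pd


def vif_table(X: pd.DataFrame, tol: float = 1e-12) -> pd.Series:
    """VIF_j = 1 / (1 - R^2_j), regressing column j on the others plus an intercept."""
    X = X.dropna()
    vifs = {}
    for col in X.columns:
        y = X[col].to_numpy(dtype=float)
        others = X.drop(columns=col).to_numpy(dtype=float)
        A = np.column_stack([np.ones(len(X)), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        # Share of variance left unexplained, i.e. 1 - R^2
        unexplained = float(resid @ resid) / float(((y - y.mean()) ** 2).sum())
        vifs[col] = np.inf if unexplained < tol else 1.0 / unexplained
    return pd.Series(vifs, name="VIF")


# Synthetic data: x_mom uses the same formula as x_pct, so the two are duplicates
rng = np.random.default_rng(0)
level = pd.Series(100 + rng.normal(size=200).cumsum())
demo = pd.DataFrame({
    "x_pct": level.pct_change() * 100,
    "x_mom": (level - level.shift(1)) / level.shift(1) * 100,
    "other": rng.normal(size=200),
})

vif = vif_table(demo)
print(vif)  # the duplicated pair is reported as inf; the independent column stays near 1
```

Dropping one column of a duplicated pair is enough to bring the remaining column's VIF back to a finite value.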

**Expected Results:**
- Feature count: ~115 expected (down from 184); the run recorded below produces 128
- Data retention: ~95% (1,100+ rows vs previous 70 rows)
- No infinite VIF issues
- No label leakage
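
The data-retention figure can be checked with a short helper like the one below. This is a sketch under the assumption that "usable rows" means rows with no NaN in any feature column; `retention_report` and the synthetic frame are hypothetical names for illustration only.

```python
import numpy as np
import pandas as pd


def retention_report(features: pd.DataFrame) -> dict:
    """Rows surviving dropna(how='any'), plus the column with the most NaN."""
    complete = features.dropna(how="any")
    return {
        "rows_total": len(features),
        "rows_complete": len(complete),
        "retention_pct": round(100 * len(complete) / len(features), 2),
        "worst_column": features.isna().sum().idxmax(),
    }


# Synthetic stand-in: one dense column, one rolling column, one very sparse column
idx = pd.date_range("2021-03-01", periods=200, freq="B")
rng = np.random.default_rng(0)
demo = pd.DataFrame({"a_level": rng.normal(size=200)}, index=idx)
demo["a_vol30"] = demo["a_level"].rolling(30, min_periods=30).std()
demo["b_sparse"] = np.where(np.arange(200) % 10 == 0, 1.0, np.nan)

with_sparse = retention_report(demo)
without_sparse = retention_report(demo.drop(columns="b_sparse"))
print(with_sparse)
print(without_sparse)
```

A single sparse column dominates the loss here, which is the same mechanism the removed transformations are described as triggering.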

---

## Critical Reminders

✅ **All input features already have 1-day lag** (from Notebook 01)  
❌ **Do NOT apply additional lag** (would create 2-day lag)  
✅ **Rolling windows:** Use `center=False` (default) - backward-looking only  
✅ **Expected NaN patterns:**
- Row 1: ALL features (from 1-day lag in Notebook 01)
- Vol30 features: ~30 rows (rolling window)
- YoY features: ~12 rows (12-period lookback; on the daily index this means 12 rows, not 12 months)
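
The expected NaN patterns can be reproduced on a toy series. The sketch below assumes the 1-day lag from Notebook 01 is equivalent to `shift(1)`; the series `raw` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2021-03-01", periods=60, freq="B")
raw = pd.Series(np.linspace(100.0, 159.0, 60), index=idx, name="toy")

# Assumed equivalent of the lag applied upstream: row 1 becomes NaN
lagged = raw.shift(1)

nan_counts = {
    "lagged": int(lagged.isna().sum()),
    "diff": int(lagged.diff().isna().sum()),
    "yoy": int((lagged.pct_change(12) * 100).isna().sum()),
    "vol30": int(lagged.rolling(window=30, min_periods=30, center=False).std().isna().sum()),
}
print(nan_counts)

# Applying another shift here would push every value two rows late
double_lagged = lagged.shift(1)
print(double_lagged.iloc[2] == raw.iloc[0])
```

With the lagged first row included, the 30-row window yields 30 leading NaN, which is consistent with the count of 30 reported for most `vol30` columns in the missing-values table.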

---

## Section 1: Setup & Load Data

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.width', None)

print("✅ Libraries imported")
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")

✅ Libraries imported
Pandas version: 2.3.3
NumPy version: 2.3.3


In [2]:
# Load raw features (already lagged by 1 day)
features_raw = pd.read_csv('data/processed/intermediate/features_raw_daily.csv', 
                            index_col='Date', parse_dates=True)

# Load labels (shape check only; the FFA spread features that used them were removed per D1-1A)
labels = pd.read_csv('data/processed/intermediate/labels.csv',
                     index_col='Date', parse_dates=True)

print(f"✅ Data loaded")
print(f"\nRaw features shape: {features_raw.shape}")
print(f"Date range: {features_raw.index.min()} to {features_raw.index.max()}")
print(f"Total days: {len(features_raw)}")
print(f"\nLabels shape: {labels.shape}")
print(f"Labels: {labels.columns.tolist()}")

✅ Data loaded

Raw features shape: (1153, 58)
Date range: 2021-03-01 00:00:00 to 2025-10-10 00:00:00
Total days: 1153

Labels shape: (1153, 2)
Labels: ['P1A_82', 'P3A_82']


In [3]:
# Verify first row is all NaN (due to 1-day lag from Notebook 01)
first_row_nulls = features_raw.iloc[0].isnull().sum()
total_cols = len(features_raw.columns)

print(f"First row NULL count: {first_row_nulls}/{total_cols}")
if first_row_nulls == total_cols:
    print("✅ VERIFIED: All features have 1-day lag (row 1 is all NaN)")
else:
    print("⚠️ WARNING: Not all features are lagged properly!")
    print(f"Non-null features in row 1: {features_raw.iloc[0].notna().sum()}")

First row NULL count: 58/58
✅ VERIFIED: All features have 1-day lag (row 1 is all NaN)


## Section 2: Transformation Functions

In [4]:
def create_transformations(df, feature_name, transformations='all', verbose=False):
    """
    Create multiple transformations of a feature.
    
    Parameters:
    -----------
    df : pd.DataFrame
        Input dataframe
    feature_name : str
        Name of feature to transform (must exist in df)
    transformations : list or 'all'
        Which transformations to apply. Options:
        - 'all': Apply 5 transformations (level, diff, pct, yoy, vol30)
        - list: e.g., ['level', 'diff', 'pct']
    verbose : bool
        Print transformation details
        
    Returns:
    --------
    pd.DataFrame with new transformation columns
    
    Transformations (Round 2 - Revised):
    -------------------------------------
    - level: Raw values (x_t)
    - diff: First difference (x_t - x_{t-1})
    - pct: Percent change ((x_t - x_{t-1}) / x_{t-1} * 100)
    - yoy: Year-over-year change (12 periods)
    - vol30: 30-day rolling standard deviation
    
    REMOVED Transformations (per decisions):
    - mom: REMOVED (D3-3A) - Redundant with pct (identical formula)
    - ma30_dev: REMOVED (D4-4B) - Caused excessive NaN accumulation
    
    Notes:
    ------
    ⚠️ Input features already have 1-day lag (from Notebook 01)
    ⚠️ Rolling windows use center=False (backward-looking only)
    """
    if feature_name not in df.columns:
        raise ValueError(f"Feature '{feature_name}' not found in dataframe")
    
    results = pd.DataFrame(index=df.index)
    feature_data = df[feature_name]
    
    if verbose:
        print(f"\nTransforming: {feature_name}")
        print(f"  Non-null values: {feature_data.notna().sum()}/{len(feature_data)}")
    
    # Level (raw)
    if transformations == 'all' or 'level' in transformations:
        results[f'{feature_name}_level'] = feature_data
        if verbose: print(f"  ✓ level")
    
    # First difference
    if transformations == 'all' or 'diff' in transformations:
        results[f'{feature_name}_diff'] = feature_data.diff()
        if verbose: print(f"  ✓ diff")
    
    # Percent change
    if transformations == 'all' or 'pct' in transformations:
        results[f'{feature_name}_pct'] = feature_data.pct_change() * 100
        if verbose: print(f"  ✓ pct")
    
    # Year-over-year
    if transformations == 'all' or 'yoy' in transformations:
        results[f'{feature_name}_yoy'] = feature_data.pct_change(12) * 100
        if verbose: print(f"  ✓ yoy")
    
    # Rolling volatility
    if transformations == 'all' or 'vol30' in transformations:
        results[f'{feature_name}_vol30'] = feature_data.rolling(window=30, min_periods=30, center=False).std()
        if verbose: print(f"  ✓ vol30")
    
    if verbose:
        print(f"  Created {len(results.columns)} features")
    
    return results

print("✅ Transformation functions defined (REVISED - Round 2)")
print("\nAvailable transformations (D4-4B):")
print("  - level: Raw values")
print("  - diff: First difference")
print("  - pct: Percent change")
print("  - yoy: Year-over-year (12 periods)")
print("  - vol30: 30-day rolling volatility")
print("\n❌ Removed transformations:")
print("  - mom: REMOVED (D3-3A) - Identical to pct")
print("  - ma30_dev: REMOVED (D4-4B) - Excessive NaN (430 avg per feature)")

✅ Transformation functions defined (REVISED - Round 2)

Available transformations (D4-4B):
  - level: Raw values
  - diff: First difference
  - pct: Percent change
  - yoy: Year-over-year (12 periods)
  - vol30: 30-day rolling volatility

❌ Removed transformations:
  - mom: REMOVED (D3-3A) - Identical to pct
  - ma30_dev: REMOVED (D4-4B) - Excessive NaN (430 avg per feature)


### 3.1 Baltic & Market Indices

In [5]:
# Initialize comprehensive feature set
features_comprehensive = pd.DataFrame(index=features_raw.index)

print("="*80)
print("CATEGORY 1: BALTIC & MARKET INDICES")
print("="*80)
print("Applying 5 transformations (level, diff, pct, yoy, vol30)\n")

# Features to transform
market_indices = ['BPI', 'C5TC', 'P4_82', 'PDOPEX']

for feature in market_indices:
    print(f"Transforming: {feature}")
    transformed = create_transformations(features_raw, feature, transformations='all', verbose=False)
    features_comprehensive = pd.concat([features_comprehensive, transformed], axis=1)
    print(f"  → Created {len(transformed.columns)} features")

print(f"\n✅ Category 1 complete: {len(features_comprehensive.columns)} features created")

CATEGORY 1: BALTIC & MARKET INDICES
Applying 5 transformations (level, diff, pct, yoy, vol30)

Transforming: BPI
  → Created 5 features
Transforming: C5TC
  → Created 5 features
Transforming: P4_82
  → Created 5 features
Transforming: PDOPEX
  → Created 5 features

✅ Category 1 complete: 20 features created


### 3.2 Bunker Prices

In [6]:
print("\n" + "="*80)
print("CATEGORY 2: BUNKER PRICES")
print("="*80)
print("Applying 5 transformations (pct_change and volatility are critical)\n")

bunker_features = ['VLSFO', 'MGO']

for feature in bunker_features:
    print(f"Transforming: {feature}")
    transformed = create_transformations(features_raw, feature, transformations='all', verbose=False)
    features_comprehensive = pd.concat([features_comprehensive, transformed], axis=1)
    print(f"  → Created {len(transformed.columns)} features")

print(f"\n✅ Category 2 complete: {len(features_comprehensive.columns)} total features so far")


CATEGORY 2: BUNKER PRICES
Applying 5 transformations (pct_change and volatility are critical)

Transforming: VLSFO
  → Created 5 features
Transforming: MGO
  → Created 5 features

✅ Category 2 complete: 30 total features so far


### 3.3 Fleet & Supply

In [7]:
print("\n" + "="*80)
print("CATEGORY 3: FLEET & SUPPLY")
print("="*80)
print("Applying 5 transformations (except TC5yr - levels only due to missing data)\n")

# Features with full transformations
supply_features = [
    'Panamax_Orderbook_Pct',
    'Panamax_Deliveries_DWT',
    'Panamax_Idle_Pct',
    'Capesize_Orderbook_Pct',
    'Atlantic_Port_Calls',
    'Panamax_Fleet_Growth_YoY'
]

for feature in supply_features:
    print(f"Transforming: {feature}")
    transformed = create_transformations(features_raw, feature, transformations='all', verbose=False)
    features_comprehensive = pd.concat([features_comprehensive, transformed], axis=1)
    print(f"  → Created {len(transformed.columns)} features")

# TC5yr features - levels only (sparse data)
print(f"\nTransforming: TC5yr_Atlantic (level only - sparse data)")
features_comprehensive['TC5yr_Atlantic_level'] = features_raw['TC5yr_Atlantic']
print(f"  → Created 1 feature")

print(f"\nTransforming: TC5yr_Pacific (level only - sparse data)")
features_comprehensive['TC5yr_Pacific_level'] = features_raw['TC5yr_Pacific']
print(f"  → Created 1 feature")

print(f"\n✅ Category 3 complete: {len(features_comprehensive.columns)} total features so far")


CATEGORY 3: FLEET & SUPPLY
Applying 5 transformations (except TC5yr - levels only due to missing data)

Transforming: Panamax_Orderbook_Pct
  → Created 5 features
Transforming: Panamax_Deliveries_DWT
  → Created 5 features
Transforming: Panamax_Idle_Pct
  → Created 5 features
Transforming: Capesize_Orderbook_Pct
  → Created 5 features
Transforming: Atlantic_Port_Calls
  → Created 5 features
Transforming: Panamax_Fleet_Growth_YoY
  → Created 5 features

Transforming: TC5yr_Atlantic (level only - sparse data)
  → Created 1 feature

Transforming: TC5yr_Pacific (level only - sparse data)
  → Created 1 feature

✅ Category 3 complete: 62 total features so far


### 3.4 Trade Volumes

In [8]:
print("\n" + "="*80)
print("CATEGORY 4: TRADE VOLUMES")
print("="*80)
print("Applying transformations (with special handling for annual data)\n")

# Annual trade volumes - SKIP vol30 per Decision D2-2B
# Reason: Annual data forward-filled to daily creates constant values → std30 = 0 → NaN
print("📋 ANNUAL TRADE VOLUMES (interpolated from yearly data):")
print("   Applying 4 transformations only (level, diff, pct, yoy)")
print("   ❌ SKIPPING vol30 per Decision D2-2B (constant values → std=0 → NaN)\n")

annual_trade_features = [
    'China_Coal_Imports_MT',
    'China_Grain_Imports_MT',
    'India_Coal_Imports_MT',
    'Japan_Coal_Imports_MT',
    'Indonesia_Coal_Exports_MT',
    'Australia_Coal_Exports_MT',
    'World_Grain_Trade_MT',
    'World_Coal_Trade_MT'
]

for feature in annual_trade_features:
    print(f"Transforming: {feature}")
    # Skip vol30 for annual features (D2-2B)
    transformed = create_transformations(features_raw, feature, 
                                         transformations=['level', 'diff', 'pct', 'yoy'], 
                                         verbose=False)
    features_comprehensive = pd.concat([features_comprehensive, transformed], axis=1)
    print(f"  → Created {len(transformed.columns)} features (skipped vol30)")

# Pre-transformed trade indicators (already YoY, but create additional variants)
print(f"\n📋 PRE-TRANSFORMED TRADE INDICATORS (monthly YoY data):")
print(f"   Applying 3 transformations (level, diff, pct)\n")

print(f"Transforming: Coal_Trade_YoY")
transformed = create_transformations(features_raw, 'Coal_Trade_YoY', transformations=['level', 'diff', 'pct'], verbose=False)
features_comprehensive = pd.concat([features_comprehensive, transformed], axis=1)
print(f"  → Created {len(transformed.columns)} features")

print(f"\nTransforming: Grain_Trade_YoY")
transformed = create_transformations(features_raw, 'Grain_Trade_YoY', transformations=['level', 'diff', 'pct'], verbose=False)
features_comprehensive = pd.concat([features_comprehensive, transformed], axis=1)
print(f"  → Created {len(transformed.columns)} features")

# Volume indices (composite indicators - full transformations)
print(f"\n📋 VOLUME INDICES (monthly composite data):")
print(f"   Applying ALL 5 transformations\n")

print(f"Transforming: Coal_Trade_Volume_Index")
transformed = create_transformations(features_raw, 'Coal_Trade_Volume_Index', transformations='all', verbose=False)
features_comprehensive = pd.concat([features_comprehensive, transformed], axis=1)
print(f"  → Created {len(transformed.columns)} features")

print(f"\nTransforming: Grain_Trade_Volume_Index")
transformed = create_transformations(features_raw, 'Grain_Trade_Volume_Index', transformations='all', verbose=False)
features_comprehensive = pd.concat([features_comprehensive, transformed], axis=1)
print(f"  → Created {len(transformed.columns)} features")

print(f"\n✅ Category 4 complete: {len(features_comprehensive.columns)} total features so far")
print(f"   ℹ️ Prevented ~8 features with 90% missing (annual vol30 transformations)")


CATEGORY 4: TRADE VOLUMES
Applying transformations (with special handling for annual data)

📋 ANNUAL TRADE VOLUMES (interpolated from yearly data):
   Applying 4 transformations only (level, diff, pct, yoy)
   ❌ SKIPPING vol30 per Decision D2-2B (constant values → std=0 → NaN)

Transforming: China_Coal_Imports_MT
  → Created 4 features (skipped vol30)
Transforming: China_Grain_Imports_MT
  → Created 4 features (skipped vol30)
Transforming: India_Coal_Imports_MT
  → Created 4 features (skipped vol30)
Transforming: Japan_Coal_Imports_MT
  → Created 4 features (skipped vol30)
Transforming: Indonesia_Coal_Exports_MT
  → Created 4 features (skipped vol30)
Transforming: Australia_Coal_Exports_MT
  → Created 4 features (skipped vol30)
Transforming: World_Grain_Trade_MT
  → Created 4 features (skipped vol30)
Transforming: World_Coal_Trade_MT
  → Created 4 features (skipped vol30)

📋 PRE-TRANSFORMED TRADE INDICATORS (monthly YoY data):
   Applying 3 transformations 
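
The reasoning behind skipping `vol30` for annual series can be seen on a toy example. The sketch below assumes the annual values arrive as a step function (each value repeated across daily rows); the three values and the 60-row block length are made up. In this toy case the rolling standard deviation inside a constant stretch is zero rather than NaN, but either way it carries no information.

```python
import numpy as np
import pandas as pd

# Three hypothetical annual values, each repeated for 60 daily rows
annual = pd.Series(np.repeat([100.0, 110.0, 125.0], 60), name="toy_annual_MT")

vol30 = annual.rolling(window=30, min_periods=30, center=False).std()
diff = annual.diff()

valid = vol30.dropna()
# Tolerance guards against floating-point residue in the rolling computation
n_zero = int((valid.abs() < 1e-6).sum())
n_steps = int((diff.fillna(0) != 0).sum())

print(f"vol30 windows that are numerically zero: {n_zero} of {len(valid)}")
print(f"rows where diff is non-zero: {n_steps} of {len(diff)}")
```

The only non-zero volatility readings come from windows that straddle a step, so the feature mostly encodes where the yearly value changed.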

### 3.5 FFA Spreads (REMOVED - Used Labels as Features)

**Decision D1-1A:** These features have been DELETED because they used target variables (P1A_82, P3A_82) as inputs, creating data leakage risk even with lagging.

Original features removed:
- P1A_FFA_Spread_level
- P1A_FFA_Spread_diff
- P3A_FFA_Spread_level
- P3A_FFA_Spread_diff

Impact: -4 features

In [9]:
print("\n" + "="*80)
print("CATEGORY 5: FFA SPREADS (REMOVED)")
print("="*80)
print("❌ FFA Spread features DELETED per Decision D1-1A")
print("   Reason: Used target variables (P1A_82, P3A_82) as feature inputs")
print("   Impact: -4 features removed")
print(f"\n✅ Category 5 skipped: {len(features_comprehensive.columns)} total features so far")


CATEGORY 5: FFA SPREADS (REMOVED)
❌ FFA Spread features DELETED per Decision D1-1A
   Reason: Used target variables (P1A_82, P3A_82) as feature inputs
   Impact: -4 features removed

✅ Category 5 skipped: 110 total features so far


### 3.6 Economic Indicators

In [10]:
print("\n" + "="*80)
print("CATEGORY 6: ECONOMIC INDICATORS")
print("="*80)
print("Creating additional variants of IP growth indicators\n")

# Atlantic IP (already YoY, but create additional variants)
# Note: mom transformation removed per D3-3A (redundant with pct)
print("Transforming: Atlantic_IP_YoY")
features_comprehensive['Atlantic_IP_yoy'] = features_raw['Atlantic_IP_YoY']
features_comprehensive['Atlantic_IP_pct'] = features_raw['Atlantic_IP_YoY'].pct_change(1) * 100
features_comprehensive['Atlantic_IP_diff'] = features_raw['Atlantic_IP_YoY'].diff()
print(f"  → Created 3 features (yoy, pct, diff)")

# Pacific IP (already YoY, but create additional variants)
print("\nTransforming: Pacific_IP_YoY")
features_comprehensive['Pacific_IP_yoy'] = features_raw['Pacific_IP_YoY']
features_comprehensive['Pacific_IP_pct'] = features_raw['Pacific_IP_YoY'].pct_change(1) * 100
features_comprehensive['Pacific_IP_diff'] = features_raw['Pacific_IP_YoY'].diff()
print(f"  → Created 3 features (yoy, pct, diff)")

print(f"\n✅ Category 6 complete: {len(features_comprehensive.columns)} total features so far")


CATEGORY 6: ECONOMIC INDICATORS
Creating additional variants of IP growth indicators

Transforming: Atlantic_IP_YoY
  → Created 3 features (yoy, pct, diff)

Transforming: Pacific_IP_YoY
  → Created 3 features (yoy, pct, diff)

✅ Category 6 complete: 116 total features so far


### 3.7 FFA Term Structure (Forward Curve Features)

In [11]:
print("\n" + "="*80)
print("CATEGORY 7: FFA TERM STRUCTURE (EXPERIMENTAL)")
print("="*80)
print("Creating forward curve features (note: Basis/Slope failed in Round 1)\n")

# P1A term structure
print("Creating: P1A term structure features")
# Current month FFA (level + diff)
features_comprehensive['P1EA_CURMON_level'] = features_raw['P1EA_82CURMON']
features_comprehensive['P1EA_CURMON_diff'] = features_raw['P1EA_82CURMON'].diff()

# 1-month forward (level + diff)
features_comprehensive['P1EA_1MON_level'] = features_raw['P1EA_82+1MON']
features_comprehensive['P1EA_1MON_diff'] = features_raw['P1EA_82+1MON'].diff()

# 1-quarter forward (level + diff)
features_comprehensive['P1EA_1Q_level'] = features_raw['P1EA_82+1Q']
features_comprehensive['P1EA_1Q_diff'] = features_raw['P1EA_82+1Q'].diff()

print(f"  → Created 6 P1A FFA features")

# P3A term structure
print("\nCreating: P3A term structure features")
# Current month FFA (level + diff)
features_comprehensive['P3EA_CURMON_level'] = features_raw['P3EA_82CURMON']
features_comprehensive['P3EA_CURMON_diff'] = features_raw['P3EA_82CURMON'].diff()

# 1-month forward (level + diff)
features_comprehensive['P3EA_1MON_level'] = features_raw['P3EA_82+1MON']
features_comprehensive['P3EA_1MON_diff'] = features_raw['P3EA_82+1MON'].diff()

# 1-quarter forward (level + diff)
features_comprehensive['P3EA_1Q_level'] = features_raw['P3EA_82+1Q']
features_comprehensive['P3EA_1Q_diff'] = features_raw['P3EA_82+1Q'].diff()

print(f"  → Created 6 P3A FFA features")

print(f"\n✅ Category 7 complete: {len(features_comprehensive.columns)} total features so far")
print("\n⚠️ Note: Round 1 showed FFA term structure had negative importance.")
print("   These features included for empirical validation - may be dropped in Notebook 03.")


CATEGORY 7: FFA TERM STRUCTURE (EXPERIMENTAL)
Creating forward curve features (note: Basis/Slope failed in Round 1)

Creating: P1A term structure features
  → Created 6 P1A FFA features

Creating: P3A term structure features
  → Created 6 P3A FFA features

✅ Category 7 complete: 128 total features so far

⚠️ Note: Round 1 showed FFA term structure had negative importance.
   These features included for empirical validation - may be dropped in Notebook 03.


## Section 4: Data Quality Checks

In [12]:
print("\n" + "="*80)
print("DATA QUALITY CHECKS")
print("="*80)

print(f"\n📊 COMPREHENSIVE FEATURE SET SUMMARY")
print(f"  Total features created: {features_comprehensive.shape[1]}")
print(f"  Total rows (days): {features_comprehensive.shape[0]}")
print(f"  Date range: {features_comprehensive.index.min()} to {features_comprehensive.index.max()}")


DATA QUALITY CHECKS

📊 COMPREHENSIVE FEATURE SET SUMMARY
  Total features created: 128
  Total rows (days): 1153
  Date range: 2021-03-01 00:00:00 to 2025-10-10 00:00:00


In [13]:
# Missing values analysis
print("\n📊 MISSING VALUES ANALYSIS")
print("=" * 80)

missing_summary = features_comprehensive.isnull().sum().sort_values(ascending=False)
missing_pct = (missing_summary / len(features_comprehensive) * 100)

# Show top 30 features with most missing values
print("\nTop 30 features with most missing values:")
print("\n{:<50} {:>10} {:>10}".format('Feature', 'Missing', 'Pct %'))
print("-" * 72)
for feat, count in missing_summary.head(30).items():
    pct = missing_pct[feat]
    print("{:<50} {:>10} {:>9.2f}%".format(feat, int(count), pct))

# Expected NaN patterns
print("\n\n📋 EXPECTED NaN PATTERNS:")
print("  1. Row 1: ALL features (from 1-day lag in Notebook 01) ✅")
print("  2. YoY features: ~12 rows (12-period lookback) ✅")
print("  3. Vol30 features: ~30 rows (rolling window) ✅")
print("  4. TC5yr features: Sparse data (weekly reporting) ✅")


📊 MISSING VALUES ANALYSIS

Top 30 features with most missing values:

Feature                                               Missing      Pct %
------------------------------------------------------------------------
VLSFO_vol30                                                63      5.46%
MGO_vol30                                                  63      5.46%
BPI_vol30                                                  30      2.60%
C5TC_vol30                                                 30      2.60%
P4_82_vol30                                                30      2.60%
PDOPEX_vol30                                               30      2.60%
Panamax_Deliveries_DWT_vol30                               30      2.60%
Panamax_Orderbook_Pct_vol30                                30      2.60%
Capesize_Orderbook_Pct_vol30                               30      2.60%
Panamax_Idle_Pct_vol30                                     30      2.60%
Panamax_Fleet_Growth_YoY_vol30                    

In [14]:
# Verify first row is all NaN (leakage check)
print("\n⚠️ CRITICAL LEAKAGE CHECK:")
print("=" * 80)

first_row_nulls = features_comprehensive.iloc[0].isnull().sum()
total_features = len(features_comprehensive.columns)

print(f"\nFirst row (2021-03-01) NULL count: {first_row_nulls}/{total_features}")

if first_row_nulls == total_features:
    print("✅ PASS: All features properly lagged (row 1 is all NaN)")
    print("   This confirms no data leakage from improper temporal alignment.")
else:
    print(f"⚠️ WARNING: Expected all {total_features} features to be NaN in row 1")
    print(f"   Found {total_features - first_row_nulls} non-null features!")
    print("\n   Non-null features in row 1:")
    non_null_features = features_comprehensive.iloc[0][features_comprehensive.iloc[0].notna()].index.tolist()
    for feat in non_null_features:
        print(f"     - {feat}: {features_comprehensive.iloc[0][feat]}")


⚠️ CRITICAL LEAKAGE CHECK:

First row (2021-03-01) NULL count: 128/128
✅ PASS: All features properly lagged (row 1 is all NaN)
   This confirms no data leakage from improper temporal alignment.


In [15]:
# Check for infinite values
print("\n📊 INFINITE VALUES CHECK:")
print("=" * 80)

inf_counts = np.isinf(features_comprehensive).sum()
features_with_inf = inf_counts[inf_counts > 0]

if len(features_with_inf) == 0:
    print("\n✅ PASS: No infinite values detected")
else:
    print(f"\n⚠️  Found {len(features_with_inf)} features with infinite values:\n")
    for feat, count in features_with_inf.items():
        print(f"  {feat}: {count} infinite values")
    
    print("\n🔧 FIXING: Replacing infinite values with NaN...")
    print("   Rationale: Infinite values (from division by zero in pct_change)")
    print("              must be removed to ensure clean data for validation.\n")
    
    features_comprehensive = features_comprehensive.replace([np.inf, -np.inf], np.nan)
    
    # Verify fix
    inf_after = np.isinf(features_comprehensive).sum().sum()
    if inf_after == 0:
        print("✅ FIXED: All infinite values replaced with NaN")
        
        # Report new missing counts
        for feat in features_with_inf.index:
            new_missing = features_comprehensive[feat].isnull().sum()
            new_pct = (new_missing / len(features_comprehensive)) * 100
            print(f"   - {feat}: now {new_missing} NaN ({new_pct:.2f}%)")
    else:
        print(f"❌ ERROR: Still {inf_after} infinite values remain!")



📊 INFINITE VALUES CHECK:

⚠️  Found 1 features with infinite values:

  Coal_Trade_YoY_pct: 1 infinite values

🔧 FIXING: Replacing infinite values with NaN...
   Rationale: Infinite values (from division by zero in pct_change)
              must be removed to ensure clean data for validation.

✅ FIXED: All infinite values replaced with NaN
   - Coal_Trade_YoY_pct: now 19 NaN (1.65%)
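
A series that passes through zero, as a YoY growth rate can, shows how `pct_change` ends up producing an infinite value. The toy values below are made up; only the mechanism (a non-zero value divided by a previous zero) mirrors the case reported above.

```python
import numpy as np
import pandas as pd

# Toy growth-rate series that touches zero twice
s = pd.Series([2.0, 0.0, 0.0, 5.0, 10.0], name="toy_yoy")

pct = s.pct_change() * 100
# row 1: (0 - 2) / 2      -> -100
# row 2: 0 / 0            -> NaN
# row 3: 5 / 0            -> inf
# row 4: (10 - 5) / 5     -> 100
n_inf = int(np.isinf(pct).sum())

# Same cleanup as in the cell above: treat +/-inf as missing
cleaned = pct.replace([np.inf, -np.inf], np.nan)
print(n_inf, int(cleaned.isna().sum()))
```

Because the replacement turns each infinite value into NaN, the affected column's missing count rises by the number of infinities found.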


In [16]:
# Summary statistics for select features
print("\n📊 SUMMARY STATISTICS (Sample Features):")
print("=" * 80)

# Select representative features
sample_features = [
    'BPI_level', 'BPI_pct',
    'VLSFO_pct',
    'C5TC_diff', 'C5TC_pct',
    'China_Coal_Imports_MT_yoy', 'China_Grain_Imports_MT_yoy'
]

# Filter to features that exist
sample_features = [f for f in sample_features if f in features_comprehensive.columns]

if len(sample_features) > 0:
    summary_stats = features_comprehensive[sample_features].describe()
    print("\n" + summary_stats.to_string())
else:
    print("\n⚠️ Sample features not found in dataset")


📊 SUMMARY STATISTICS (Sample Features):

         BPI_level      BPI_pct    VLSFO_pct    C5TC_diff     C5TC_pct  China_Coal_Imports_MT_yoy  China_Grain_Imports_MT_yoy
count  1152.000000  1151.000000  1151.000000  1151.000000  1151.000000                1140.000000                 1140.000000
mean   1967.139757     0.020324     0.004903     9.923545     0.346486                   0.455578                   -0.133485
std     818.960842     2.731426     1.104261  1490.048504     7.775443                   6.124854                    2.217483
min     748.000000    -8.396947    -5.514706 -7946.000000   -30.251952                 -16.752173                  -12.999742
25%    1405.750000    -1.700319    -0.518904  -736.000000    -4.042270                   0.000000                    0.000000
50%    1705.500000    -0.185357     0.000000   -37.000000    -0.234299                   0.000000                    0.000000
75%    2379.500000     1.369142     0.580837   739.500000     3.770552   

## Section 5: Feature Type Breakdown

In [17]:
print("\n" + "="*80)
print("FEATURE TYPE BREAKDOWN")
print("="*80)

# Count features by transformation type
# Note: these are substring matches, so the types can overlap. The 'YoY' test also
# matches source-named columns such as Coal_Trade_YoY_level, which is why the
# per-type counts can sum to more than the total number of columns.
feature_types = {
    'level': len([c for c in features_comprehensive.columns if '_level' in c]),
    'diff': len([c for c in features_comprehensive.columns if '_diff' in c]),
    'pct': len([c for c in features_comprehensive.columns if '_pct' in c]),
    'yoy': len([c for c in features_comprehensive.columns if '_yoy' in c or 'YoY' in c]),
    'mom': len([c for c in features_comprehensive.columns if '_mom' in c]),
    'ma30_dev': len([c for c in features_comprehensive.columns if '_ma30_dev' in c]),
    'vol30': len([c for c in features_comprehensive.columns if '_vol30' in c]),
}

print("\n📊 Transformation Type Counts:")
print("\n{:<15} {:>10}".format('Type', 'Count'))
print("-" * 27)
for ftype, count in feature_types.items():
    print("{:<15} {:>10}".format(ftype, count))

print("\n" + "-" * 27)
print("{:<15} {:>10}".format('TOTAL', features_comprehensive.shape[1]))

# Feature category breakdown
print("\n\n📊 Feature Category Breakdown:")
categories = {
    'Baltic/Market Indices': ['BPI', 'C5TC', 'P4_82', 'PDOPEX'],
    'Bunker Prices': ['VLSFO', 'MGO'],
    'Fleet/Supply': ['Panamax', 'Capesize', 'TC5yr', 'Atlantic_Port_Calls'],
    'Trade Volumes': ['Coal', 'Grain', 'Trade'],
    'FFA Spreads': ['FFA_Spread'],
    'Economic': ['IP_'],
    'FFA Term Structure': ['P1EA', 'P3EA']
}

print("\n{:<25} {:>10}".format('Category', 'Count'))
print("-" * 37)
for category, keywords in categories.items():
    count = sum(1 for col in features_comprehensive.columns 
                if any(kw in col for kw in keywords))
    print("{:<25} {:>10}".format(category, count))


FEATURE TYPE BREAKDOWN

📊 Transformation Type Counts:

Type                 Count
---------------------------
level                   32
diff                    32
pct                     26
yoy                     34
mom                      0
ma30_dev                 0
vol30                   14

---------------------------
TOTAL                  128


📊 Feature Category Breakdown:

Category                       Count
-------------------------------------
Baltic/Market Indices             20
Bunker Prices                     10
Fleet/Supply                      32
Trade Volumes                     48
FFA Spreads                        0
Economic                           6
FFA Term Structure                12


## Section 6: Save Comprehensive Feature Set

In [18]:
print("\n" + "="*80)
print("SAVING COMPREHENSIVE FEATURE SET")
print("="*80)

# Create output directory if it doesn't exist
output_dir = Path('data/processed/features')
output_dir.mkdir(parents=True, exist_ok=True)

# Save to CSV
output_path = output_dir / 'features_comprehensive.csv'
features_comprehensive.to_csv(output_path)

print(f"\n✅ Comprehensive features saved to:")
print(f"   {output_path}")
print(f"\n📊 File details:")
print(f"   Shape: {features_comprehensive.shape}")
print(f"   Features: {features_comprehensive.shape[1]} columns")
print(f"   Rows: {features_comprehensive.shape[0]} days")
print(f"   Date range: {features_comprehensive.index.min()} to {features_comprehensive.index.max()}")
print(f"   File size: {output_path.stat().st_size / 1024:.2f} KB")


SAVING COMPREHENSIVE FEATURE SET

✅ Comprehensive features saved to:
   data\processed\features\features_comprehensive.csv

📊 File details:
   Shape: (1153, 128)
   Features: 128 columns
   Rows: 1153 days
   Date range: 2021-03-01 00:00:00 to 2025-10-10 00:00:00
   File size: 1249.36 KB


## Section 7: Final Summary & Next Steps

In [19]:
print("\n" + "="*80)
print("NOTEBOOK 02 COMPLETE (REVISED) ✅")
print("="*80)

print("\n📋 Summary:")
print(f"   ✓ Created {features_comprehensive.shape[1]} comprehensive features")
print(f"   ✓ Applied 5 transformation types (level, diff, pct, yoy, vol30)")
print(f"   ✓ Processed {len(features_raw.columns)} raw input features")
print(f"   ✓ Covered 1,153 business days (2021-03-01 to 2025-10-10)")
print(f"   ✓ All leakage checks passed (row 1 all NaN)")
print(f"   ✓ Saved to: data/processed/features/features_comprehensive.csv")

print("\n🔧 Decisions Applied:")
print("   D1-1A: ❌ Removed FFA spreads using labels (-4 features)")
print("   D2-2B: ❌ Skipped vol30 for 8 annual features (-8 features)")
print("   D3-3A: ❌ Removed mom transformation (-24 features)")
print("   D4-4B: ✅ Reduced to 5 transformations")
print("   D5-5A: ✅ Kept YoY transformations")

print(f"\n📊 Expected Improvements:")
print(f"   Previous: 184 features → 70 usable rows (6.1% retention)")
print(f"   Revised: ~{features_comprehensive.shape[1]} features → ~1,100+ usable rows (95%+ retention)")
print(f"   Fixes: No infinite VIF, no label leakage, minimal NaN")

print("\n🎯 Next Steps:")
print("   1. Execute Notebook 03: Feature Validation & Allocation")
print("      - Random Forest permutation importance on ALL features")
print("      - Data-driven allocation into CORE vs ML sets")
print("      - VIF analysis for CORE features")
print("      - Decision gate: PASS/FAIL before modeling")
print("\n   2. If PASS: Proceed to Notebook 04 (Data Preparation)")
print("   3. If FAIL: Return here to revise features")

print("\n⚠️ Critical Reminders:")
print("   - All features already have 1-day lag (verified ✅)")
print("   - No additional lag needed in downstream notebooks")
print("   - Expected NaN in first ~30 rows for vol30 features")
print("   - Expected NaN in first ~12 rows for YoY features")

print("\n" + "="*80)
print("Ready for empirical feature validation! 🚀")
print("="*80)


NOTEBOOK 02 COMPLETE (REVISED) ✅

📋 Summary:
   ✓ Created 128 comprehensive features
   ✓ Applied 5 transformation types (level, diff, pct, yoy, vol30)
   ✓ Processed 58 raw input features
   ✓ Covered 1,153 business days (2021-03-01 to 2025-10-10)
   ✓ All leakage checks passed (row 1 all NaN)
   ✓ Saved to: data/processed/features/features_comprehensive.csv

🔧 Decisions Applied:
   D1-1A: ❌ Removed FFA spreads using labels (-4 features)
   D2-2B: ❌ Skipped vol30 for 8 annual features (-8 features)
   D3-3A: ❌ Removed mom transformation (-24 features)
   D4-4B: ✅ Reduced to 5 transformations
   D5-5A: ✅ Kept YoY transformations

📊 Expected Improvements:
   Previous: 184 features → 70 usable rows (6.1% retention)
   Revised: ~128 features → ~1,100+ usable rows (95%+ retention)
   Fixes: No infinite VIF, no label leakage, minimal NaN

🎯 Next Steps:
   1. Execute Notebook 03: Feature Validation & Allocation
      - Random Forest permutation import