# Round 2 - Data Gathering

**Purpose:** Load raw data files, create master calendar, pivot BFA data, and merge all features to daily frequency.

**Key Steps:**
1. Load all raw data files from `round_2/data/raw/`
2. Create master calendar based on label availability (P1A_82, P3A_82)
3. Pivot BFA data from tall to wide format
4. Align all features to master calendar (forward-fill from weekly/monthly/annual frequencies)
5. **Apply 1-day lag to ALL features (data leakage prevention)**
6. Save intermediate files for feature engineering

**Critical Notes:**
- Study period: 2021-03-01 onwards
- Labels (P1A_82, P3A_82) are in `baltic_exchange_010321_101025.csv`
- Master calendar ensures all features align to label availability
- Date format: yyyy-mm-dd (kept as column, NOT index)
- **All data stays within round_2/ directory structure**
- **Cross-platform compatible:** Works on Windows, Linux, macOS, and cloud environments
- **Data leakage prevention:** ALL features are lagged by 1 day using `.shift(1)` (a one-row shift, so each date receives the values of the previous master-calendar date), ensuring predictions use only information available BEFORE the prediction date (matching the Round 1 approach)
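
As a minimal sketch of the lag described above (toy values and the hypothetical name `toy`, not the notebook's actual feature table), `.shift(1)` moves every feature down one row while `Date` stays in place as a regular column:

```python
import pandas as pd

# Toy frame (hypothetical values): a Friday, then the following Monday and Tuesday.
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2021-03-05", "2021-03-08", "2021-03-09"]),
    "BPI": [100.0, 110.0, 120.0],
})

# Lag every column except Date by one row; Date is left untouched.
feature_cols = [c for c in toy.columns if c != "Date"]
lagged = toy.copy()
lagged[feature_cols] = lagged[feature_cols].shift(1)

print(lagged)
```

The first row has no predecessor and becomes NaN, and Monday 2021-03-08 receives Friday's value, so the lag is one calendar row rather than one calendar day.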

**Path Detection Strategy:**
- Automatically detects `round_2/` directory location
- Uses `pathlib.Path` for OS-agnostic path handling
- Validates directory structure before execution
- Works regardless of execution directory
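
The walk-up detection can be sketched as a small helper. This is an illustration only: `find_round_root` is a hypothetical name, and unlike the notebook cell it raises instead of falling back to the current directory when no `round_2` ancestor exists.

```python
from pathlib import Path
import tempfile


def find_round_root(start: Path, name: str = "round_2") -> Path:
    """Return `start` or its nearest ancestor whose directory name is `name`."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if candidate.name == name:
            return candidate
    raise FileNotFoundError(f"No '{name}' directory at or above {start}")


# Demonstration on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    nested = Path(tmp) / "round_2" / "notebooks" / "drafts"
    nested.mkdir(parents=True)
    found = find_round_root(nested)
    found_name = found.name
    matches_expected = found == (Path(tmp) / "round_2").resolve()
    print(found_name)
```

Failing loudly is a design choice of this sketch: a silent fallback can create output directories in an unintended location.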

**Outputs:**
- `round_2/data/processed/intermediate/master_calendar.csv`
- `round_2/data/processed/intermediate/bfa_wide.csv`
- `round_2/data/processed/intermediate/labels.csv`
- `round_2/data/processed/intermediate/features_raw_daily.csv` (with 1-day lag applied)

---
## Setup

In [1]:
import pandas as pd
import numpy as np
import os
import sys
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

print("="*80)
print("ROUND 2 - DATA GATHERING")
print("="*80)
print(f"Python version: {sys.version}")
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")

ROUND 2 - DATA GATHERING
Python version: 3.11.14 | packaged by conda-forge | (main, Oct 13 2025, 14:00:26) [MSC v.1944 64 bit (AMD64)]
Pandas version: 2.3.3
NumPy version: 2.3.3


### Define Paths

In [2]:
# Round 2 root (this round's working directory)
# This works regardless of where the notebook is executed from
ROUND2_ROOT = Path(__file__).parent if '__file__' in globals() else Path.cwd()

# If running as notebook, detect round_2 directory robustly
if 'ipykernel' in sys.modules or 'IPython' in sys.modules:
    # Running in Jupyter - use notebook's actual location
    try:
        from IPython import get_ipython
        notebook_path = Path(get_ipython().run_line_magic('pwd', ''))
        # Find round_2 directory by walking up
        current = notebook_path
        while current.name != 'round_2' and current != current.parent:
            current = current.parent
        if current.name == 'round_2':
            ROUND2_ROOT = current
        else:
            # Fallback: assume we're already in round_2
            ROUND2_ROOT = notebook_path
    except Exception:
        # Fallback for other notebook environments
        ROUND2_ROOT = Path.cwd()

print(f"Round 2 Root: {ROUND2_ROOT}")

# Input directories (within round_2/)
RAW_DATA_DIR = ROUND2_ROOT / 'data' / 'raw'
BALTIC_DIR = RAW_DATA_DIR / 'baltic_exchange'
BUNKER_DIR = RAW_DATA_DIR / 'bunker'
CLARKSONS_DIR = RAW_DATA_DIR / 'clarksons'

# Output directories (within round_2/)
PROCESSED_DIR = ROUND2_ROOT / 'data' / 'processed'
INTERMEDIATE_DIR = PROCESSED_DIR / 'intermediate'
MODELS_DIR = ROUND2_ROOT / 'data' / 'models'
DIAGNOSTICS_DIR = ROUND2_ROOT / 'data' / 'diagnostics'

# Create output directories if they don't exist
PROCESSED_DIR.mkdir(parents=True, exist_ok=True)
INTERMEDIATE_DIR.mkdir(parents=True, exist_ok=True)
MODELS_DIR.mkdir(parents=True, exist_ok=True)
DIAGNOSTICS_DIR.mkdir(parents=True, exist_ok=True)

print(f"\nInput Directories:")
print(f"  Raw Data: {RAW_DATA_DIR}")
print(f"  Baltic Exchange: {BALTIC_DIR}")
print(f"  Bunker: {BUNKER_DIR}")
print(f"  Clarksons: {CLARKSONS_DIR}")
print(f"\nOutput Directories:")
print(f"  Processed: {PROCESSED_DIR}")
print(f"  Intermediate: {INTERMEDIATE_DIR}")
print(f"  Models: {MODELS_DIR}")
print(f"  Diagnostics: {DIAGNOSTICS_DIR}")

# Study parameters
STUDY_START = '2021-03-01'
print(f"\nStudy Start Date: {STUDY_START}")

# Validate that raw data directory exists
if not RAW_DATA_DIR.exists():
    raise FileNotFoundError(
        f"Raw data directory not found: {RAW_DATA_DIR}\n"
        f"Please ensure you're running this notebook from the correct location.\n"
        f"Expected structure: round_2/data/raw/"
    )
else:
    print(f"\n✅ Raw data directory validated: {RAW_DATA_DIR}")

Round 2 Root: C:\Users\moame\Documents\GitHub\MScFECapstoneProject\round_2

Input Directories:
  Raw Data: C:\Users\moame\Documents\GitHub\MScFECapstoneProject\round_2\data\raw
  Baltic Exchange: C:\Users\moame\Documents\GitHub\MScFECapstoneProject\round_2\data\raw\baltic_exchange
  Bunker: C:\Users\moame\Documents\GitHub\MScFECapstoneProject\round_2\data\raw\bunker
  Clarksons: C:\Users\moame\Documents\GitHub\MScFECapstoneProject\round_2\data\raw\clarksons

Output Directories:
  Processed: C:\Users\moame\Documents\GitHub\MScFECapstoneProject\round_2\data\processed
  Intermediate: C:\Users\moame\Documents\GitHub\MScFECapstoneProject\round_2\data\processed\intermediate
  Models: C:\Users\moame\Documents\GitHub\MScFECapstoneProject\round_2\data\models
  Diagnostics: C:\Users\moame\Documents\GitHub\MScFECapstoneProject\round_2\data\diagnostics

Study Start Date: 2021-03-01

✅ Raw data directory validated: C:\Users\moame\Documents\GitHub\MScFECapstoneProject\round_2\data\raw


---
## Step 1: Load Labels (P1A_82, P3A_82) and Create Master Calendar

In [3]:
print("\n" + "="*80)
print("STEP 1: LOAD LABELS AND CREATE MASTER CALENDAR")
print("="*80)

# Load Baltic Exchange data (contains labels)
baltic_file = BALTIC_DIR / 'baltic_exchange_010321_101025.csv'
print(f"\nLoading Baltic Exchange data from: {baltic_file}")

baltic_df = pd.read_csv(baltic_file, parse_dates=['Date'])
print(f"  Loaded {len(baltic_df)} rows, {len(baltic_df.columns)} columns")
print(f"  Columns: {list(baltic_df.columns)}")
print(f"  Date range: {baltic_df['Date'].min()} to {baltic_df['Date'].max()}")

# Filter to study period
baltic_df = baltic_df[baltic_df['Date'] >= STUDY_START].copy()
print(f"\nFiltered to study period (>= {STUDY_START}): {len(baltic_df)} rows")


STEP 1: LOAD LABELS AND CREATE MASTER CALENDAR

Loading Baltic Exchange data from: C:\Users\moame\Documents\GitHub\MScFECapstoneProject\round_2\data\raw\baltic_exchange\baltic_exchange_010321_101025.csv
  Loaded 1153 rows, 7 columns
  Columns: ['Date', 'BPI', 'C5TC', 'P1A_82', 'P3A_82', 'P4_82', 'PDOPEX']
  Date range: 2021-03-01 00:00:00 to 2025-10-10 00:00:00

Filtered to study period (>= 2021-03-01): 1153 rows


In [4]:
# Extract labels (P1A_82, P3A_82)
print("\nExtracting labels (P1A_82, P3A_82)...")
labels_df = baltic_df[['Date', 'P1A_82', 'P3A_82']].copy()

print(f"\nLabel Statistics BEFORE filtering:")
print(f"  Total rows: {len(labels_df)}")
print(f"  P1A_82 non-null: {labels_df['P1A_82'].notna().sum()}")
print(f"  P3A_82 non-null: {labels_df['P3A_82'].notna().sum()}")
print(f"  Both non-null: {(labels_df['P1A_82'].notna() & labels_df['P3A_82'].notna()).sum()}")

# Filter to dates where BOTH labels are available
labels_df = labels_df.dropna(subset=['P1A_82', 'P3A_82']).copy()

print(f"\nLabel Statistics AFTER filtering (both non-null):")
print(f"  Total rows: {len(labels_df)}")
print(f"  Date range: {labels_df['Date'].min()} to {labels_df['Date'].max()}")
print(f"\nLabel Summary Statistics:")
print(labels_df[['P1A_82', 'P3A_82']].describe())


Extracting labels (P1A_82, P3A_82)...

Label Statistics BEFORE filtering:
  Total rows: 1153
  P1A_82 non-null: 1153
  P3A_82 non-null: 1153
  Both non-null: 1153

Label Statistics AFTER filtering (both non-null):
  Total rows: 1153
  Date range: 2021-03-01 00:00:00 to 2025-10-10 00:00:00

Label Summary Statistics:
             P1A_82        P3A_82
count   1153.000000   1153.000000
mean   17151.208153  16720.982654
std     8096.960712   7512.756643
min     4458.000000   5434.000000
25%    11165.000000  11669.000000
50%    15225.000000  14373.000000
75%    21275.000000  20273.000000
max    45050.000000  40687.000000


In [5]:
# Create Master Calendar
print("\n" + "-"*80)
print("CREATING MASTER CALENDAR")
print("-"*80)

master_calendar = labels_df[['Date']].copy()
print(f"\nMaster Calendar:")
print(f"  Total days: {len(master_calendar)}")
print(f"  Start date: {master_calendar['Date'].min()}")
print(f"  End date: {master_calendar['Date'].max()}")

# Validate gaps: consecutive business days are at most 3 days apart (Fri -> Mon), so flag anything longer
date_diffs = master_calendar['Date'].diff()[1:]
max_gap = date_diffs.max()
print(f"\nCalendar Validation:")
print(f"  Maximum gap between dates: {max_gap}")

if max_gap > pd.Timedelta(days=3):
    print("  ⚠️ WARNING: Large gaps detected (>3 days). This may indicate missing data.")
    large_gaps = date_diffs[date_diffs > pd.Timedelta(days=3)]
    print(f"  Number of large gaps: {len(large_gaps)}")
else:
    print("  ✅ Calendar validated: No unexpected gaps detected")

# Save master calendar
master_calendar_file = INTERMEDIATE_DIR / 'master_calendar.csv'
master_calendar.to_csv(master_calendar_file, index=False)
print(f"\n✅ Master calendar saved to: {master_calendar_file}")


--------------------------------------------------------------------------------
CREATING MASTER CALENDAR
--------------------------------------------------------------------------------

Master Calendar:
  Total days: 1153
  Start date: 2021-03-01 00:00:00
  End date: 2025-10-10 00:00:00

Calendar Validation:
  Maximum gap between dates: 11 days 00:00:00
  ⚠️ WARNING: Large gaps detected (>3 days). This may indicate missing data.
  Number of large gaps: 26

✅ Master calendar saved to: C:\Users\moame\Documents\GitHub\MScFECapstoneProject\round_2\data\processed\intermediate\master_calendar.csv
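
The validation above reports only how many gaps exceed 3 days. A minimal sketch of how the gaps themselves could be listed, using a toy calendar with made-up dates (the names `calendar` and `gaps` are illustrative, not from the notebook):

```python
import pandas as pd

# Toy calendar (hypothetical dates): one long break and one ordinary weekend.
calendar = pd.DataFrame({
    "Date": pd.to_datetime([
        "2021-12-22", "2021-12-23", "2021-12-24",  # Wed to Fri
        "2022-01-04", "2022-01-05",                # resumes after an 11-day break
        "2022-01-07", "2022-01-10",                # Fri, then Mon (3-day weekend gap)
    ])
})

# Pair each date with its predecessor and the distance between them.
gaps = calendar.assign(
    Prev_Date=calendar["Date"].shift(1),
    Gap=calendar["Date"].diff(),
)

# Keep only gaps longer than a normal weekend.
large_gaps = gaps[gaps["Gap"] > pd.Timedelta(days=3)]
print(large_gaps[["Prev_Date", "Date", "Gap"]])
```

Seeing the dates on either side of each gap makes it easier to tell holiday closures from genuinely missing data.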


In [6]:
# Save labels (aligned to master calendar)
labels_file = INTERMEDIATE_DIR / 'labels.csv'
labels_df.to_csv(labels_file, index=False)
print(f"✅ Labels saved to: {labels_file}")
print(f"\nLabel file contains {len(labels_df)} rows with columns: {list(labels_df.columns)}")

✅ Labels saved to: C:\Users\moame\Documents\GitHub\MScFECapstoneProject\round_2\data\processed\intermediate\labels.csv

Label file contains 1153 rows with columns: ['Date', 'P1A_82', 'P3A_82']


---
## Step 2: Load and Process BFA Data (Tall to Wide)

In [7]:
print("\n" + "="*80)
print("STEP 2: LOAD AND PIVOT BFA DATA")
print("="*80)

# Load BFA data (tall format)
bfa_file = BALTIC_DIR / 'bfa_panamax_010321_101025.csv'
print(f"\nLoading BFA data from: {bfa_file}")

bfa_df = pd.read_csv(bfa_file, parse_dates=['ArchiveDate'])
print(f"  Loaded {len(bfa_df)} rows, {len(bfa_df.columns)} columns")
print(f"  Columns: {list(bfa_df.columns)}")

# Rename ArchiveDate to Date for consistency
bfa_df = bfa_df.rename(columns={'ArchiveDate': 'Date'})

# Filter to study period
bfa_df = bfa_df[bfa_df['Date'] >= STUDY_START].copy()
print(f"\nFiltered to study period: {len(bfa_df)} rows")
print(f"  Date range: {bfa_df['Date'].min()} to {bfa_df['Date'].max()}")

# Check unique route identifiers
print(f"\nUnique RouteIdentifiers: {bfa_df['RouteIdentifier'].nunique()}")
print(f"  Sample routes: {list(bfa_df['RouteIdentifier'].unique()[:10])}")


STEP 2: LOAD AND PIVOT BFA DATA

Loading BFA data from: C:\Users\moame\Documents\GitHub\MScFECapstoneProject\round_2\data\raw\baltic_exchange\bfa_panamax_010321_101025.csv
  Loaded 25022 rows, 5 columns
  Columns: ['GroupDesc', 'ArchiveDate', 'RouteIdentifier', 'RouteAverage', 'FFADescription']

Filtered to study period: 25022 rows
  Date range: 2021-03-01 00:00:00 to 2025-10-10 00:00:00

Unique RouteIdentifiers: 22
  Sample routes: ['P1EA_82CURMON', 'P1EA_82+1MON', 'P1EA_82+2MON', 'P1EA_82+3MON', 'P1EA_82+4MON', 'P1EA_82+5MON', 'P1EA_82CURQ', 'P1EA_82+1Q', 'P1EA_82+2Q', 'P1EA_82+3Q']


In [8]:
# Pivot BFA data from tall to wide format
print("\n" + "-"*80)
print("PIVOTING BFA DATA (TALL → WIDE)")
print("-"*80)

print("\nPivoting on: Date (index) × RouteIdentifier (columns) → RouteAverage (values)")
bfa_wide = bfa_df.pivot(index='Date', columns='RouteIdentifier', values='RouteAverage').reset_index()

print(f"\nPivoted BFA data:")
print(f"  Rows: {len(bfa_wide)}")
print(f"  Columns: {len(bfa_wide.columns)} (including Date)")
print(f"  Date range: {bfa_wide['Date'].min()} to {bfa_wide['Date'].max()}")


--------------------------------------------------------------------------------
PIVOTING BFA DATA (TALL → WIDE)
--------------------------------------------------------------------------------

Pivoting on: Date (index) × RouteIdentifier (columns) → RouteAverage (values)

Pivoted BFA data:
  Rows: 1165
  Columns: 23 (including Date)
  Date range: 2021-03-01 00:00:00 to 2025-10-10 00:00:00
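
The pivoted table has 1165 dates against the 1153 in the master calendar, so a left merge onto the calendar keeps only the matching dates. The tall-to-wide step itself can be sketched on toy data (hypothetical prices, same column names as the BFA file). `DataFrame.pivot` raises a `ValueError` on duplicate (Date, RouteIdentifier) pairs, so the sketch checks for them first:

```python
import pandas as pd

# Toy tall table (hypothetical prices) with the BFA column names.
tall = pd.DataFrame({
    "Date": pd.to_datetime(["2021-03-01", "2021-03-01", "2021-03-02", "2021-03-02"]),
    "RouteIdentifier": ["P1EA_82CURMON", "P3EA_82CURMON", "P1EA_82CURMON", "P3EA_82CURMON"],
    "RouteAverage": [15000.0, 14000.0, 15250.0, 13900.0],
})

# Check for duplicate keys up front to get a clearer failure message than pivot's.
dupes = tall.duplicated(subset=["Date", "RouteIdentifier"]).sum()
assert dupes == 0, f"{dupes} duplicate (Date, RouteIdentifier) rows"

wide = (
    tall.pivot(index="Date", columns="RouteIdentifier", values="RouteAverage")
    .reset_index()
)
wide.columns.name = None  # drop the leftover 'RouteIdentifier' axis label

print(wide)
```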


In [9]:
# Identify P1EA and P3EA forward contracts
print("\n" + "-"*80)
print("FORWARD CURVE STRUCTURE ANALYSIS")
print("-"*80)

p1_cols = [col for col in bfa_wide.columns if 'P1EA_82' in str(col)]
p3_cols = [col for col in bfa_wide.columns if 'P3EA_82' in str(col)]

print(f"\nP1EA_82 (Atlantic) Forward Contracts: {len(p1_cols)}")
print(f"  Key contracts: {p1_cols[:6]}")

print(f"\nP3EA_82 (Pacific) Forward Contracts: {len(p3_cols)}")
print(f"  Key contracts: {p3_cols[:6]}")

print(f"\nTotal forward contracts: {len(p1_cols) + len(p3_cols)}")

# Check for CURMON (current month FFA) - needed for FFA spread
if 'P1EA_82CURMON' in bfa_wide.columns:
    print(f"\n✅ P1EA_82CURMON found (for P1A FFA spread calculation)")
else:
    print(f"\n⚠️ WARNING: P1EA_82CURMON not found!")

if 'P3EA_82CURMON' in bfa_wide.columns:
    print(f"✅ P3EA_82CURMON found (for P3A FFA spread calculation)")
else:
    print(f"⚠️ WARNING: P3EA_82CURMON not found!")


--------------------------------------------------------------------------------
FORWARD CURVE STRUCTURE ANALYSIS
--------------------------------------------------------------------------------

P1EA_82 (Atlantic) Forward Contracts: 11
  Key contracts: ['P1EA_82+1MON', 'P1EA_82+1Q', 'P1EA_82+2MON', 'P1EA_82+2Q', 'P1EA_82+3MON', 'P1EA_82+3Q']

P3EA_82 (Pacific) Forward Contracts: 11
  Key contracts: ['P3EA_82+1MON', 'P3EA_82+1Q', 'P3EA_82+2MON', 'P3EA_82+2Q', 'P3EA_82+3MON', 'P3EA_82+3Q']

Total forward contracts: 22

✅ P1EA_82CURMON found (for P1A FFA spread calculation)
✅ P3EA_82CURMON found (for P3A FFA spread calculation)


In [10]:
# Save BFA wide format
bfa_wide_file = INTERMEDIATE_DIR / 'bfa_wide.csv'
bfa_wide.to_csv(bfa_wide_file, index=False)
print(f"\n✅ BFA wide format saved to: {bfa_wide_file}")


✅ BFA wide format saved to: C:\Users\moame\Documents\GitHub\MScFECapstoneProject\round_2\data\processed\intermediate\bfa_wide.csv


---
## Step 3: Load Other Baltic Exchange Data (BPI, C5TC, P4_82, PDOPEX)

In [11]:
print("\n" + "="*80)
print("STEP 3: LOAD BALTIC EXCHANGE INDICES")
print("="*80)

# Extract non-label columns from Baltic Exchange data
print("\nExtracting Baltic indices (BPI, C5TC, P4_82, PDOPEX)...")
baltic_features = baltic_df[['Date', 'BPI', 'C5TC', 'P4_82', 'PDOPEX']].copy()

print(f"\nBaltic Features:")
print(f"  Rows: {len(baltic_features)}")
print(f"  Columns: {list(baltic_features.columns)}")
print(f"  Date range: {baltic_features['Date'].min()} to {baltic_features['Date'].max()}")

print(f"\nMissing Value Summary:")
print(baltic_features.isnull().sum())


STEP 3: LOAD BALTIC EXCHANGE INDICES

Extracting Baltic indices (BPI, C5TC, P4_82, PDOPEX)...

Baltic Features:
  Rows: 1153
  Columns: ['Date', 'BPI', 'C5TC', 'P4_82', 'PDOPEX']
  Date range: 2021-03-01 00:00:00 to 2025-10-10 00:00:00

Missing Value Summary:
Date         0
BPI          0
C5TC         0
P4_82        0
PDOPEX    1134
dtype: int64


---
## Step 4: Load Bunker Data (VLSFO, MGO)

In [12]:
print("\n" + "="*80)
print("STEP 4: LOAD BUNKER DATA")
print("="*80)

bunker_file = BUNKER_DIR / 'bunker_data_010321_101025.csv'
print(f"\nLoading bunker data from: {bunker_file}")

bunker_df = pd.read_csv(bunker_file, parse_dates=['Date'])
print(f"  Loaded {len(bunker_df)} rows, {len(bunker_df.columns)} columns")
print(f"  Columns: {list(bunker_df.columns)}")

# Filter to study period
bunker_df = bunker_df[bunker_df['Date'] >= STUDY_START].copy()
print(f"\nFiltered to study period: {len(bunker_df)} rows")
print(f"  Date range: {bunker_df['Date'].min()} to {bunker_df['Date'].max()}")

print(f"\nBunker Statistics:")
print(bunker_df[['VLSFO', 'MGO']].describe())

print(f"\nMissing Value Summary:")
print(bunker_df.isnull().sum())


STEP 4: LOAD BUNKER DATA

Loading bunker data from: C:\Users\moame\Documents\GitHub\MScFECapstoneProject\round_2\data\raw\bunker\bunker_data_010321_101025.csv
  Loaded 1201 rows, 3 columns
  Columns: ['Date', 'VLSFO', 'MGO']

Filtered to study period: 1201 rows
  Date range: 2021-03-01 00:00:00 to 2025-10-10 00:00:00

Bunker Statistics:
             VLSFO          MGO
count  1201.000000  1201.000000
mean    647.689425   858.333888
std     123.301596   189.088041
min     493.500000   556.000000
25%     571.000000   746.500000
50%     623.000000   804.000000
75%     671.000000   947.500000
max    1125.500000  1431.500000

Missing Value Summary:
Date     0
VLSFO    0
MGO      0
dtype: int64


---
## Step 5: Load Clarksons Daily Data

In [13]:
print("\n" + "="*80)
print("STEP 5: LOAD CLARKSONS DAILY DATA")
print("="*80)

clarksons_daily_file = CLARKSONS_DIR / 'clarksons_daily_data.csv'
print(f"\nLoading Clarksons daily data from: {clarksons_daily_file}")

clarksons_daily = pd.read_csv(clarksons_daily_file, parse_dates=['Date'])
print(f"  Loaded {len(clarksons_daily)} rows, {len(clarksons_daily.columns)} columns")
print(f"  Columns: {list(clarksons_daily.columns)}")

# Filter to study period
clarksons_daily = clarksons_daily[clarksons_daily['Date'] >= STUDY_START].copy()
print(f"\nFiltered to study period: {len(clarksons_daily)} rows")
print(f"  Date range: {clarksons_daily['Date'].min()} to {clarksons_daily['Date'].max()}")

# Rename columns for clarity
clarksons_daily = clarksons_daily.rename(columns={
    'Panamax Bulker - % Idle, DWT million': 'Panamax_Idle_Pct',
    'Atlantic Region Port Calls - Deep Sea Cargo Vessels, 7 day avg., Number': 'Atlantic_Port_Calls'
})

print(f"\nRenamed columns: {list(clarksons_daily.columns)}")

print(f"\nDaily Features Statistics:")
print(clarksons_daily[['Panamax_Idle_Pct', 'Atlantic_Port_Calls']].describe())

print(f"\nMissing Value Summary:")
print(clarksons_daily.isnull().sum())


STEP 5: LOAD CLARKSONS DAILY DATA

Loading Clarksons daily data from: C:\Users\moame\Documents\GitHub\MScFECapstoneProject\round_2\data\raw\clarksons\clarksons_daily_data.csv
  Loaded 1685 rows, 3 columns
  Columns: ['Date', 'Panamax Bulker - % Idle, DWT million', 'Atlantic Region Port Calls - Deep Sea Cargo Vessels, 7 day avg., Number']

Filtered to study period: 1685 rows
  Date range: 2021-03-01 00:00:00 to 2025-10-10 00:00:00

Renamed columns: ['Date', 'Panamax_Idle_Pct', 'Atlantic_Port_Calls']

Daily Features Statistics:
       Panamax_Idle_Pct  Atlantic_Port_Calls
count       1685.000000          1685.000000
mean           6.345519           446.032047
std            1.094566            13.994491
min            3.900000           400.000000
25%            5.500000           436.000000
50%            6.200000           446.000000
75%            7.100000           457.000000
max            8.900000           482.000000

Missing Value Summary:
Date                   0
Panamax_Idle_

---
## Step 6: Load Clarksons Weekly Data

In [14]:
print("\n" + "="*80)
print("STEP 6: LOAD CLARKSONS WEEKLY DATA")
print("="*80)

clarksons_weekly_file = CLARKSONS_DIR / 'clarksons_weekly_data.csv'
print(f"\nLoading Clarksons weekly data from: {clarksons_weekly_file}")

clarksons_weekly = pd.read_csv(clarksons_weekly_file, parse_dates=['Date'])
print(f"  Loaded {len(clarksons_weekly)} rows, {len(clarksons_weekly.columns)} columns")
print(f"  Columns: {list(clarksons_weekly.columns)}")

# NOTE: Weekly data includes pre-study period (2021-02-26) for forward-fill initialization
print(f"\nDate range (includes pre-study period): {clarksons_weekly['Date'].min()} to {clarksons_weekly['Date'].max()}")

# Remove thousand separators from numeric columns
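for col in clarksons_weekly.columns:
    # Only string columns carry separators; numeric columns have no .str accessor
    if col != 'Date' and clarksons_weekly[col].dtype == 'object':
        clarksons_weekly[col] = clarksons_weekly[col].str.replace(',', '').astype(float)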

# Rename columns for clarity
clarksons_weekly = clarksons_weekly.rename(columns={
    '5 Year Timecharter Rate 82,000 dwt Bulkcarrier (Atlantic Region), $/day': 'TC5yr_Atlantic',
    '5 Year Timecharter Rate 82,000 dwt Bulkcarrier (Pacific Region), $/day': 'TC5yr_Pacific'
})

print(f"\nRenamed columns: {list(clarksons_weekly.columns)}")

print(f"\nWeekly Features Statistics:")
print(clarksons_weekly[['TC5yr_Atlantic', 'TC5yr_Pacific']].describe())

print(f"\nMissing Value Summary:")
print(clarksons_weekly.isnull().sum())


STEP 6: LOAD CLARKSONS WEEKLY DATA

Loading Clarksons weekly data from: C:\Users\moame\Documents\GitHub\MScFECapstoneProject\round_2\data\raw\clarksons\clarksons_weekly_data.csv
  Loaded 244 rows, 3 columns
  Columns: ['Date', '5 Year Timecharter Rate 82,000 dwt Bulkcarrier (Atlantic Region), $/day', '5 Year Timecharter Rate 82,000 dwt Bulkcarrier (Pacific Region), $/day']

Date range (includes pre-study period): 2021-02-26 00:00:00 to 2025-10-24 00:00:00

Renamed columns: ['Date', 'TC5yr_Atlantic', 'TC5yr_Pacific']

Weekly Features Statistics:
       TC5yr_Atlantic  TC5yr_Pacific
count      244.000000     244.000000
mean     14148.971311   14016.372951
std       1512.014086    1634.788486
min      11000.000000   11750.000000
25%      13000.000000   12650.000000
50%      13750.000000   13700.000000
75%      15400.000000   14800.000000
max      17500.000000   18000.000000

Missing Value Summary:
Date              0
TC5yr_Atlantic    0
TC5yr_Pacific     0
dtype: int64
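
The separator clean-up could also happen at load time. A minimal sketch, assuming the raw file stores numbers as quoted strings such as `"13,500"` (the inline CSV and its short column names are made up for illustration): `pd.read_csv` accepts a `thousands` argument that strips the separator while parsing.

```python
import io
import pandas as pd

# Made-up two-row file; the quoted values carry thousands separators.
raw = io.StringIO(
    'Date,TC5yr_Atlantic,TC5yr_Pacific\n'
    '2021-02-26,"13,500","12,750"\n'
    '2021-03-05,"13,750","13,000"\n'
)

# thousands="," makes the parser read "13,500" as the number 13500.
weekly = pd.read_csv(raw, parse_dates=["Date"], thousands=",")
print(weekly.dtypes)
```

This keeps the numeric conversion in one place and avoids a per-column loop, at the cost of applying the separator rule to every column in the file.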


---
## Step 7: Load Clarksons Monthly Data

In [15]:
print("\n" + "="*80)
print("STEP 7: LOAD CLARKSONS MONTHLY DATA")
print("="*80)

clarksons_monthly_file = CLARKSONS_DIR / 'clarksons_monthly_data.csv'
print(f"\nLoading Clarksons monthly data from: {clarksons_monthly_file}")

clarksons_monthly = pd.read_csv(clarksons_monthly_file, parse_dates=['Date'])
print(f"  Loaded {len(clarksons_monthly)} rows, {len(clarksons_monthly.columns)} columns")
print(f"  Columns: {list(clarksons_monthly.columns)}")

# Filter to the study period, keeping 2021-02 so forward-fill has a starting value
clarksons_monthly = clarksons_monthly[clarksons_monthly['Date'] >= '2021-02-01'].copy()
print(f"\nFiltered to include pre-study period (>= 2021-02-01): {len(clarksons_monthly)} rows")
print(f"  Date range: {clarksons_monthly['Date'].min()} to {clarksons_monthly['Date'].max()}")

# Remove thousand separators from numeric columns
for col in clarksons_monthly.columns:
    if col != 'Date' and clarksons_monthly[col].dtype == 'object':
        clarksons_monthly[col] = clarksons_monthly[col].str.replace(',', '').astype(float)

# Rename columns for clarity
clarksons_monthly = clarksons_monthly.rename(columns={
    'Atlantic Region Industrial Production Growth, % Yr/Yr': 'Atlantic_IP_YoY',
    'Pacific Region Industrial Production Growth, % Yr/Yr': 'Pacific_IP_YoY',
    'Panamax Bulkcarrier Deliveries, DWT': 'Panamax_Deliveries_DWT',
    'Capesize Orderbook % Fleet': 'Capesize_Orderbook_Pct',
    'Panamax Fleet Growth, % Yr/Yr': 'Panamax_Fleet_Growth_YoY',
    'Panamax Orderbook % Fleet': 'Panamax_Orderbook_Pct',
    'Monthly Global Seaborne Coal Trade Indicator, % Yr/Yr': 'Coal_Trade_YoY',
    'Monthly Global Seaborne Coal Trade Indicator, % Yr/Yr 3mma': 'Coal_Trade_YoY_3mma',
    'Monthly Global Seaborne Coal Trade Indicator, Volume Index': 'Coal_Trade_Volume_Index',
    'Monthly Global Seaborne Grain Trade Indicator, % Yr/Yr': 'Grain_Trade_YoY',
    'Monthly Global Seaborne Grain Trade Indicator, % Yr/Yr 3mma': 'Grain_Trade_YoY_3mma',
    'Monthly Global Seaborne Grain Trade Indicator, Volume Index': 'Grain_Trade_Volume_Index'
})

print(f"\nRenamed columns: {list(clarksons_monthly.columns)}")

print(f"\nMonthly Features Statistics:")
print(clarksons_monthly.describe())

print(f"\nMissing Value Summary:")
print(clarksons_monthly.isnull().sum())


STEP 7: LOAD CLARKSONS MONTHLY DATA

Loading Clarksons monthly data from: C:\Users\moame\Documents\GitHub\MScFECapstoneProject\round_2\data\raw\clarksons\clarksons_monthly_data.csv
  Loaded 57 rows, 13 columns
  Columns: ['Date', 'Atlantic Region Industrial Production Growth, % Yr/Yr', 'Pacific Region Industrial Production Growth, % Yr/Yr', 'Panamax Bulkcarrier Deliveries, DWT', 'Capesize Orderbook % Fleet', 'Panamax Fleet Growth, % Yr/Yr', 'Panamax Orderbook % Fleet', 'Monthly Global Seaborne Coal Trade Indicator, % Yr/Yr', 'Monthly Global Seaborne Coal Trade Indicator, % Yr/Yr 3mma', 'Monthly Global Seaborne Coal Trade Indicator, Volume Index', 'Monthly Global Seaborne Grain Trade Indicator, % Yr/Yr', 'Monthly Global Seaborne Grain Trade Indicator, % Yr/Yr 3mma', 'Monthly Global Seaborne Grain Trade Indicator, Volume Index']

Filtered to include pre-study period (>= 2021-02-01): 57 rows
  Date range: 2021-02-01 00:00:00 to 2025-10-01 00:00:00

Renamed columns: ['Date', 'Atlantic_IP_

---
## Step 8: Load Clarksons Annual Data

In [16]:
print("\n" + "="*80)
print("STEP 8: LOAD CLARKSONS ANNUAL DATA")
print("="*80)

clarksons_annual_file = CLARKSONS_DIR / 'clarksons_annual_data.csv'
print(f"\nLoading Clarksons annual data from: {clarksons_annual_file}")

clarksons_annual = pd.read_csv(clarksons_annual_file)
print(f"  Loaded {len(clarksons_annual)} rows, {len(clarksons_annual.columns)} columns")
print(f"  Columns: {list(clarksons_annual.columns)}")

# Convert Date to datetime (year format)
clarksons_annual['Date'] = pd.to_datetime(clarksons_annual['Date'], format='%Y')

# Display all available years
print(f"\nAll available years in annual data: {sorted(clarksons_annual['Date'].dt.year.unique())}")

# Include 2020 and 2021 for forward-fill initialization (pre-study period)
# No filtering needed - use all available years
print(f"\nUsing all {len(clarksons_annual)} rows (no year filtering)")
print(f"  Date range: {clarksons_annual['Date'].min().year} to {clarksons_annual['Date'].max().year}")

# Remove thousand separators from numeric columns
for col in clarksons_annual.columns:
    if col != 'Date' and clarksons_annual[col].dtype == 'object':
        clarksons_annual[col] = clarksons_annual[col].str.replace(',', '').astype(float)

# Rename columns for clarity
clarksons_annual = clarksons_annual.rename(columns={
    'China Seaborne Coal Imports, Million Tonnes': 'China_Coal_Imports_MT',
    'China Seaborne Grain Imports, Million Tonnes': 'China_Grain_Imports_MT',
    'World Seaborne Grain Trade (including Soybeans), Million Tonnes': 'World_Grain_Trade_MT',
    'World Seaborne Grain Trade (including Soybeans), Billion Tonne-miles': 'World_Grain_Trade_BTM',
    'World Seaborne Grain Trade (including Soybeans), % Yr/Yr (tonnes)': 'World_Grain_Trade_YoY_MT',
    'World Seaborne Grain Trade (including Soybeans), % Yr/Yr (tonne-miles)': 'World_Grain_Trade_YoY_BTM',
    'India Seaborne Coal Imports, Million Tonnes': 'India_Coal_Imports_MT',
    'Japan Seaborne Coal Imports, Million Tonnes': 'Japan_Coal_Imports_MT',
    'Indonesia Seaborne Coal Exports, Million Tonnes': 'Indonesia_Coal_Exports_MT',
    'Australia Seaborne Coal Exports, Million Tonnes': 'Australia_Coal_Exports_MT',
    'World Seaborne Total Coal Trade, Billion Tonne-miles': 'World_Coal_Trade_BTM',
    'World Seaborne Total Coal Trade, Million Tonnes': 'World_Coal_Trade_MT',
    'World Seaborne Total Coal Trade, % Yr/Yr (tonnes)': 'World_Coal_Trade_YoY_MT',
    'World Seaborne Total Coal Trade, % Yr/Yr (tonne-miles)': 'World_Coal_Trade_YoY_BTM'
})

print(f"\nRenamed columns: {list(clarksons_annual.columns)}")

print(f"\nAnnual Features Statistics:")
print(clarksons_annual.describe())

print(f"\nMissing Value Summary:")
print(clarksons_annual.isnull().sum())


STEP 8: LOAD CLARKSONS ANNUAL DATA

Loading Clarksons annual data from: C:\Users\moame\Documents\GitHub\MScFECapstoneProject\round_2\data\raw\clarksons\clarksons_annual_data.csv
  Loaded 8 rows, 15 columns
  Columns: ['Date', 'China Seaborne Coal Imports, Million Tonnes', 'China Seaborne Grain Imports, Million Tonnes', 'World Seaborne Grain Trade (including Soybeans), Million Tonnes', 'World Seaborne Grain Trade (including Soybeans), Billion Tonne-miles', 'World Seaborne Grain Trade (including Soybeans), % Yr/Yr (tonnes)', 'World Seaborne Grain Trade (including Soybeans), % Yr/Yr (tonne-miles)', 'India Seaborne Coal Imports, Million Tonnes', 'Japan Seaborne Coal Imports, Million Tonnes', 'Indonesia Seaborne Coal Exports, Million Tonnes', 'Australia Seaborne Coal Exports, Million Tonnes', 'World Seaborne Total Coal Trade, Billion Tonne-miles', 'World Seaborne Total Coal Trade, Million Tonnes', 'World Seaborne Total Coal Trade, % Yr/Yr (tonnes)', 'World Seaborne Total Coal Trade, % Yr/Y

---
## Step 9: Align All Features to Master Calendar (Forward-Fill)

In [17]:
print("\n" + "="*80)
print("STEP 9: ALIGN ALL FEATURES TO MASTER CALENDAR")
print("="*80)

print("\nMaster Calendar Details:")
print(f"  Total days: {len(master_calendar)}")
print(f"  Start: {master_calendar['Date'].min()}")
print(f"  End: {master_calendar['Date'].max()}")


STEP 9: ALIGN ALL FEATURES TO MASTER CALENDAR

Master Calendar Details:
  Total days: 1153
  Start: 2021-03-01 00:00:00
  End: 2025-10-10 00:00:00


### 9.1: Merge Daily Features

In [18]:
print("\n" + "-"*80)
print("9.1: MERGING DAILY FEATURES")
print("-"*80)

# Start with master calendar
features_daily = master_calendar.copy()
print(f"\nStarting with master calendar: {len(features_daily)} rows")

# Merge Baltic features (BPI, C5TC, P4_82, PDOPEX)
print("\nMerging Baltic features (BPI, C5TC, P4_82, PDOPEX)...")
features_daily = features_daily.merge(baltic_features, on='Date', how='left')
print(f"  After merge: {len(features_daily)} rows, {len(features_daily.columns)} columns")

# Forward-fill PDOPEX (has value at 2021-03-01, needs forward-fill for subsequent dates)
print("\n  Forward-filling PDOPEX (sparse daily feature)...")
pdopex_before = features_daily['PDOPEX'].isnull().sum()
features_daily['PDOPEX'] = features_daily['PDOPEX'].ffill()
pdopex_after = features_daily['PDOPEX'].isnull().sum()
print(f"    PDOPEX missing: {pdopex_before} → {pdopex_after} (filled {pdopex_before - pdopex_after} values)")

# Merge BFA wide (FFA forward curves)
print("\nMerging BFA wide (FFA forward curves)...")
features_daily = features_daily.merge(bfa_wide, on='Date', how='left')
print(f"  After merge: {len(features_daily)} rows, {len(features_daily.columns)} columns")

# Merge Bunker data (VLSFO, MGO)
print("\nMerging Bunker data (VLSFO, MGO)...")
features_daily = features_daily.merge(bunker_df, on='Date', how='left')
print(f"  After merge: {len(features_daily)} rows, {len(features_daily.columns)} columns")

# Merge Clarksons daily (Panamax_Idle_Pct, Atlantic_Port_Calls)
print("\nMerging Clarksons daily (Panamax_Idle_Pct, Atlantic_Port_Calls)...")
features_daily = features_daily.merge(clarksons_daily, on='Date', how='left')
print(f"  After merge: {len(features_daily)} rows, {len(features_daily.columns)} columns")

print(f"\n✅ Daily features merged. Total columns: {len(features_daily.columns)}")


--------------------------------------------------------------------------------
9.1: MERGING DAILY FEATURES
--------------------------------------------------------------------------------

Starting with master calendar: 1153 rows

Merging Baltic features (BPI, C5TC, P4_82, PDOPEX)...
  After merge: 1153 rows, 5 columns

  Forward-filling PDOPEX (sparse daily feature)...
    PDOPEX missing: 1134 → 0 (filled 1134 values)

Merging BFA wide (FFA forward curves)...
  After merge: 1153 rows, 27 columns

Merging Bunker data (VLSFO, MGO)...
  After merge: 1153 rows, 29 columns

Merging Clarksons daily (Panamax_Idle_Pct, Atlantic_Port_Calls)...
  After merge: 1153 rows, 31 columns

✅ Daily features merged. Total columns: 31
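
The row count printed after each merge is the only guard against a source file carrying duplicate dates. As a minimal sketch (not part of the original pipeline), the same check can be enforced with the `validate` argument of `DataFrame.merge`. The helper name `safe_left_merge` and the toy values are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-ins for the master calendar and one daily source (illustrative values).
calendar = pd.DataFrame({"Date": pd.to_datetime(["2021-03-01", "2021-03-02", "2021-03-03"])})
source = pd.DataFrame({
    "Date": pd.to_datetime(["2021-03-01", "2021-03-03"]),
    "BPI": [2086.0, 2161.0],
})

def safe_left_merge(left: pd.DataFrame, right: pd.DataFrame, key: str = "Date") -> pd.DataFrame:
    """Hypothetical helper: left-merge that raises if `key` is duplicated on either side."""
    merged = left.merge(right, on=key, how="left", validate="one_to_one")
    # With unique keys on both sides, a left merge cannot change the row count.
    assert len(merged) == len(left), "row count changed during merge"
    return merged

merged = safe_left_merge(calendar, source)
print(merged)

# A duplicated date in the source is rejected instead of silently adding rows.
dup_source = pd.concat([source, source.iloc[[0]]], ignore_index=True)
try:
    safe_left_merge(calendar, dup_source)
    duplicate_rejected = False
except pd.errors.MergeError:
    duplicate_rejected = True
print("duplicate rejected:", duplicate_rejected)
```

Failing fast here would surface a bad source file at the merge that caused it, rather than at the alignment check in Step 10.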


### 9.2: Merge and Forward-Fill Weekly Features

In [19]:
print("\n" + "-"*80)
print("9.2: MERGING AND FORWARD-FILLING WEEKLY FEATURES")
print("-"*80)

print("\nWeekly features: TC5yr_Atlantic, TC5yr_Pacific")
print(f"  Weekly data rows: {len(clarksons_weekly)}")
print(f"  Date range: {clarksons_weekly['Date'].min()} to {clarksons_weekly['Date'].max()}")

# Merge weekly data
features_daily = features_daily.merge(clarksons_weekly, on='Date', how='left')
print(f"\nAfter merge: {len(features_daily)} rows, {len(features_daily.columns)} columns")

# Check missing values before forward-fill
weekly_cols = ['TC5yr_Atlantic', 'TC5yr_Pacific']
print(f"\nMissing values BEFORE forward-fill:")
for col in weekly_cols:
    missing_count = features_daily[col].isnull().sum()
    print(f"  {col}: {missing_count} ({missing_count/len(features_daily)*100:.1f}%)")

# Forward-fill weekly data to daily frequency
print(f"\nApplying forward-fill...")
features_daily[weekly_cols] = features_daily[weekly_cols].ffill()

# Check missing values after forward-fill
print(f"\nMissing values AFTER forward-fill:")
for col in weekly_cols:
    missing_count = features_daily[col].isnull().sum()
    print(f"  {col}: {missing_count}")
    if missing_count > 0:
        print(f"    ⚠️ WARNING: Still has missing values after forward-fill")

print(f"\n✅ Weekly features forward-filled")


--------------------------------------------------------------------------------
9.2: MERGING AND FORWARD-FILLING WEEKLY FEATURES
--------------------------------------------------------------------------------

Weekly features: TC5yr_Atlantic, TC5yr_Pacific
  Weekly data rows: 244
  Date range: 2021-02-26 00:00:00 to 2025-10-24 00:00:00

After merge: 1153 rows, 33 columns

Missing values BEFORE forward-fill:
  TC5yr_Atlantic: 922 (80.0%)
  TC5yr_Pacific: 922 (80.0%)

Applying forward-fill...

Missing values AFTER forward-fill:
  TC5yr_Atlantic: 4
  TC5yr_Pacific: 4

✅ Weekly features forward-filled
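
The 4 values still missing after the forward-fill are consistent with the date ranges printed above: the first weekly observation is dated 2021-02-26, before the master calendar starts on 2021-03-01, so an exact-date merge drops it and nothing is available to fill the first four calendar rows. The sketch below reproduces this on toy data and shows `pd.merge_asof` as one possible alternative. It is not what the notebook does, the values are illustrative, and it assumes each weekly value is usable from its stamped date onward.

```python
import pandas as pd

# Business-day calendar starting Monday 2021-03-01; weekly values stamped on Fridays.
calendar = pd.DataFrame({"Date": pd.bdate_range("2021-03-01", "2021-03-12")})
weekly = pd.DataFrame({
    "Date": pd.date_range("2021-02-26", periods=3, freq="W-FRI"),
    "TC5yr_Atlantic": [100.0, 101.0, 102.0],  # illustrative values
})

# Approach used in this notebook: exact-date left merge, then forward-fill.
exact = calendar.merge(weekly, on="Date", how="left")
exact["TC5yr_Atlantic"] = exact["TC5yr_Atlantic"].ffill()

# Alternative: take the latest weekly value dated on or before each calendar date.
asof = pd.merge_asof(calendar, weekly, on="Date", direction="backward")

n_nan_exact = int(exact["TC5yr_Atlantic"].isna().sum())
n_nan_asof = int(asof["TC5yr_Atlantic"].isna().sum())
print("NaN after exact merge + ffill:", n_nan_exact)
print("NaN after merge_asof:", n_nan_asof)
```

On this toy calendar the exact merge leaves the four rows before the first in-calendar Friday empty, while the as-of merge picks up the 2021-02-26 value for them.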


### 9.3: Merge and Forward-Fill Monthly Features

In [20]:
print("\n" + "-"*80)
print("9.3: MERGING AND FORWARD-FILLING MONTHLY FEATURES")
print("-"*80)

monthly_feature_cols = [
    'Atlantic_IP_YoY', 'Pacific_IP_YoY', 'Panamax_Deliveries_DWT',
    'Capesize_Orderbook_Pct', 'Panamax_Fleet_Growth_YoY', 'Panamax_Orderbook_Pct',
    'Coal_Trade_YoY', 'Coal_Trade_YoY_3mma', 'Coal_Trade_Volume_Index',
    'Grain_Trade_YoY', 'Grain_Trade_YoY_3mma', 'Grain_Trade_Volume_Index'
]

print(f"\nMonthly features: {len(monthly_feature_cols)} columns")
print(f"  {monthly_feature_cols}")
print(f"\n  Monthly data rows: {len(clarksons_monthly)}")
print(f"  Date range: {clarksons_monthly['Date'].min()} to {clarksons_monthly['Date'].max()}")

# Convert monthly dates to start of month for proper merging
clarksons_monthly['Date'] = pd.to_datetime(clarksons_monthly['Date'])
features_daily['YearMonth'] = features_daily['Date'].dt.to_period('M').dt.to_timestamp()
clarksons_monthly['YearMonth'] = clarksons_monthly['Date'].dt.to_period('M').dt.to_timestamp()

# Merge monthly data on YearMonth
print(f"\nMerging on YearMonth column...")
features_daily = features_daily.merge(
    clarksons_monthly[['YearMonth'] + monthly_feature_cols],
    on='YearMonth',
    how='left'
)
print(f"  After merge: {len(features_daily)} rows, {len(features_daily.columns)} columns")

# Check missing values before forward-fill
print(f"\nMissing values BEFORE forward-fill:")
for col in monthly_feature_cols:
    missing_count = features_daily[col].isnull().sum()
    print(f"  {col}: {missing_count} ({missing_count/len(features_daily)*100:.1f}%)")

# Forward-fill monthly data to daily frequency
print(f"\nApplying forward-fill...")
features_daily[monthly_feature_cols] = features_daily[monthly_feature_cols].ffill()

# Check missing values after forward-fill
print(f"\nMissing values AFTER forward-fill:")
for col in monthly_feature_cols:
    missing_count = features_daily[col].isnull().sum()
    print(f"  {col}: {missing_count}")
    if missing_count > 0:
        print(f"    ⚠️ WARNING: Still has missing values after forward-fill")

# Drop temporary YearMonth column
features_daily = features_daily.drop(columns=['YearMonth'])

print(f"\n✅ Monthly features forward-filled")


--------------------------------------------------------------------------------
9.3: MERGING AND FORWARD-FILLING MONTHLY FEATURES
--------------------------------------------------------------------------------

Monthly features: 12 columns
  ['Atlantic_IP_YoY', 'Pacific_IP_YoY', 'Panamax_Deliveries_DWT', 'Capesize_Orderbook_Pct', 'Panamax_Fleet_Growth_YoY', 'Panamax_Orderbook_Pct', 'Coal_Trade_YoY', 'Coal_Trade_YoY_3mma', 'Coal_Trade_Volume_Index', 'Grain_Trade_YoY', 'Grain_Trade_YoY_3mma', 'Grain_Trade_Volume_Index']

  Monthly data rows: 57
  Date range: 2021-02-01 00:00:00 to 2025-10-01 00:00:00

Merging on YearMonth column...
  After merge: 1153 rows, 46 columns

Missing values BEFORE forward-fill:
  Atlantic_IP_YoY: 73 (6.3%)
  Pacific_IP_YoY: 73 (6.3%)
  Panamax_Deliveries_DWT: 8 (0.7%)
  Capesize_Orderbook_Pct: 0 (0.0%)
  Panamax_Fleet_Growth_YoY: 0 (0.0%)
  Panamax_Orderbook_Pct: 0 (0.0%)
  Coal_Trade_YoY: 8 (0.7%)
  Coal_Trade_YoY_3mma: 8 (0.7%)
  Coal_Trade_Volume_Index: 8
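
To make the `YearMonth` key concrete, here is a small worked example on toy data (illustrative values, one feature only). Every daily row is mapped to the first day of its month, matched against the monthly table, and months with no monthly row stay NaN until the forward-fill carries the previous month's value.

```python
import pandas as pd

daily = pd.DataFrame({
    "Date": pd.to_datetime(["2021-03-01", "2021-03-15", "2021-04-01", "2021-05-03"]),
})
monthly = pd.DataFrame({
    "Date": pd.to_datetime(["2021-03-01", "2021-04-01"]),
    "Coal_Trade_YoY": [1.5, 2.5],  # illustrative values
})

# Collapse both sides to the first day of their month to build a shared key.
daily["YearMonth"] = daily["Date"].dt.to_period("M").dt.to_timestamp()
monthly["YearMonth"] = monthly["Date"].dt.to_period("M").dt.to_timestamp()

out = daily.merge(monthly[["YearMonth", "Coal_Trade_YoY"]], on="YearMonth", how="left")
n_missing_before = int(out["Coal_Trade_YoY"].isna().sum())  # May 2021 has no monthly row

# Forward-fill carries April's value into May, then the helper key is dropped.
out["Coal_Trade_YoY"] = out["Coal_Trade_YoY"].ffill()
out = out.drop(columns=["YearMonth"])
print(out)
```

Note that this key assigns a month's figure to every day of that month, including the first. If a monthly series is published after the month it describes (an assumption worth checking per source), the 1-day lag applied in 9.5 may not fully reflect when the value became available.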

### 9.4: Interpolate and Forward-Fill Annual Features

In [21]:
print("\n" + "-"*80)
print("9.4: INTERPOLATING AND FORWARD-FILLING ANNUAL FEATURES")
print("-"*80)

annual_feature_cols = [
    'China_Coal_Imports_MT', 'China_Grain_Imports_MT',
    'World_Grain_Trade_MT', 'World_Grain_Trade_BTM',
    'World_Grain_Trade_YoY_MT', 'World_Grain_Trade_YoY_BTM',
    'India_Coal_Imports_MT', 'Japan_Coal_Imports_MT',
    'Indonesia_Coal_Exports_MT', 'Australia_Coal_Exports_MT',
    'World_Coal_Trade_BTM', 'World_Coal_Trade_MT',
    'World_Coal_Trade_YoY_MT', 'World_Coal_Trade_YoY_BTM'
]

print(f"\nAnnual features: {len(annual_feature_cols)} columns")
print(f"  Annual data rows: {len(clarksons_annual)}")
print(f"  Date range: {clarksons_annual['Date'].min().year} to {clarksons_annual['Date'].max().year}")

# Extract the calendar year on both sides to use as the merge key
features_daily['Year'] = features_daily['Date'].dt.year
clarksons_annual['Year'] = clarksons_annual['Date'].dt.year

# Merge annual data on Year
print(f"\nMerging on Year column...")
features_daily = features_daily.merge(
    clarksons_annual[['Year'] + annual_feature_cols],
    on='Year',
    how='left'
)
print(f"  After merge: {len(features_daily)} rows, {len(features_daily.columns)} columns")

# Check missing values before interpolation
print(f"\nMissing values BEFORE interpolation/forward-fill:")
for col in annual_feature_cols:
    missing_count = features_daily[col].isnull().sum()
    print(f"  {col}: {missing_count} ({missing_count/len(features_daily)*100:.1f}%)")

# Apply linear interpolation then forward-fill for annual data.
# Both steps only touch NaN: rows whose Year matched in the merge keep their annual value.
print(f"\nApplying linear interpolation followed by forward-fill...")
for col in annual_feature_cols:
    features_daily[col] = features_daily[col].interpolate(method='linear').ffill()

# Check missing values after interpolation
print(f"\nMissing values AFTER interpolation/forward-fill:")
for col in annual_feature_cols:
    missing_count = features_daily[col].isnull().sum()
    print(f"  {col}: {missing_count}")
    if missing_count > 0:
        print(f"    ⚠️ WARNING: Still has missing values after interpolation")

# Drop temporary Year column
features_daily = features_daily.drop(columns=['Year'])

print(f"\n✅ Annual features interpolated and forward-filled")


--------------------------------------------------------------------------------
9.4: INTERPOLATING AND FORWARD-FILLING ANNUAL FEATURES
--------------------------------------------------------------------------------

Annual features: 14 columns
  Annual data rows: 8
  Date range: 2020 to 2027

Merging on Year column...
  After merge: 1153 rows, 60 columns

Missing values BEFORE interpolation/forward-fill:
  China_Coal_Imports_MT: 0 (0.0%)
  China_Grain_Imports_MT: 0 (0.0%)
  World_Grain_Trade_MT: 0 (0.0%)
  World_Grain_Trade_BTM: 0 (0.0%)
  World_Grain_Trade_YoY_MT: 0 (0.0%)
  World_Grain_Trade_YoY_BTM: 0 (0.0%)
  India_Coal_Imports_MT: 0 (0.0%)
  Japan_Coal_Imports_MT: 0 (0.0%)
  Indonesia_Coal_Exports_MT: 0 (0.0%)
  Australia_Coal_Exports_MT: 0 (0.0%)
  World_Coal_Trade_BTM: 0 (0.0%)
  World_Coal_Trade_MT: 0 (0.0%)
  World_Coal_Trade_YoY_MT: 0 (0.0%)
  World_Coal_Trade_YoY_BTM: 0 (0.0%)

Applying linear interpolation followed by forward-fill...

Missing values AFTER interpolation/f
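
The output above reports 0 missing values for all 14 annual columns before interpolation, which suggests the interpolation step has nothing to fill in this run and each annual feature ends up as a step function that changes on 1 January. The toy example below (illustrative values) shows that behaviour: `interpolate` only fills NaN, so a fully populated column comes back unchanged.

```python
import pandas as pd

daily = pd.DataFrame({
    "Date": pd.to_datetime(["2021-12-30", "2021-12-31", "2022-01-03", "2022-01-04"]),
})
annual = pd.DataFrame({
    "Year": [2021, 2022],
    "World_Coal_Trade_MT": [1200.0, 1250.0],  # illustrative values
})

# Same pattern as above: merge on the calendar year, then drop the helper key.
daily["Year"] = daily["Date"].dt.year
merged = daily.merge(annual, on="Year", how="left").drop(columns=["Year"])

before = merged["World_Coal_Trade_MT"].copy()
after = merged["World_Coal_Trade_MT"].interpolate(method="linear").ffill()

# No NaN to fill, so the series is unchanged and jumps at the year boundary.
unchanged = after.equals(before)
print("unchanged by interpolation:", unchanged)
print(after.tolist())
```

If a smooth daily path between annual figures is actually wanted, the annual values would need to be placed on single anchor dates first, leaving NaN in between for `interpolate` to fill. That is a different design from the one used here.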

### 9.5: Apply 1-Day Lag to ALL Features (Data Leakage Prevention)

In [22]:
print("\n" + "-"*80)
print("9.5: APPLYING 1-DAY LAG TO ALL FEATURES (DATA LEAKAGE PREVENTION)")
print("-"*80)

print("\nStrategy: Apply 1-day lag to ALL features (matching Round 1 approach)")
print("  Rationale: Prevents data leakage by ensuring features used for prediction")
print("             are only from information available BEFORE the prediction date")
print("  Method: Shift all non-Date columns by 1 business day using .shift(1)")

# Get all feature columns (exclude Date)
feature_cols = [col for col in features_daily.columns if col != 'Date']
print(f"\nTotal features to lag: {len(feature_cols)}")

# Count missing values BEFORE lagging
missing_before = features_daily[feature_cols].isnull().sum().sum()
print(f"\nMissing values BEFORE lagging: {missing_before}")

# Apply 1-day lag to ALL features
print(f"\nApplying 1-day shift to all {len(feature_cols)} features...")
features_daily[feature_cols] = features_daily[feature_cols].shift(1)

# Count missing values AFTER lagging
missing_after = features_daily[feature_cols].isnull().sum().sum()
print(f"Missing values AFTER lagging: {missing_after}")
print(f"  Additional missing values from lag: {missing_after - missing_before}")
print(f"  (This is expected: first row will have NaN for all features)")

# Verify first row has all NaN features (except Date)
first_row_nan_count = features_daily.iloc[0][feature_cols].isnull().sum()
print(f"\nFirst row validation:")
print(f"  Date: {features_daily.iloc[0]['Date']}")
print(f"  Features with NaN: {first_row_nan_count}/{len(feature_cols)}")
if first_row_nan_count == len(feature_cols):
    print(f"  ✅ First row correctly has all features as NaN (due to 1-day lag)")
else:
    print(f"  ⚠️ WARNING: Expected all features to be NaN in first row!")

# Show example of lag effect (second row should have first row's original values)
print(f"\nLag effect example (comparing dates):")
print(f"  Row 2 Date: {features_daily.iloc[1]['Date']}")
print(f"  Row 2 BPI: {features_daily.iloc[1]['BPI']}")
print(f"  (This value was originally from Row 1: {master_calendar.iloc[0]['Date']})")

print(f"\n✅ 1-day lag applied to all features")
print(f"\nFinal dataset shape: {features_daily.shape}")
print(f"  Rows: {len(features_daily)}")
print(f"  Columns: {len(features_daily.columns)} (1 Date + {len(feature_cols)} lagged features)")


--------------------------------------------------------------------------------
9.5: APPLYING 1-DAY LAG TO ALL FEATURES (DATA LEAKAGE PREVENTION)
--------------------------------------------------------------------------------

Strategy: Apply 1-day lag to ALL features (matching Round 1 approach)
  Rationale: Prevents data leakage by ensuring features used for prediction
             are only from information available BEFORE the prediction date
  Method: Shift all non-Date columns by 1 business day using .shift(1)

Total features to lag: 58

Missing values BEFORE lagging: 600

Applying 1-day shift to all 58 features...
Missing values AFTER lagging: 658
  Additional missing values from lag: 58
  (This is expected: first row will have NaN for all features)

First row validation:
  Date: 2021-03-01 00:00:00
  Features with NaN: 58/58
  ✅ First row correctly has all features as NaN (due to 1-day lag)

Lag effect example (comparing dates):
  Row 2 Date: 2021-03-02 00:00:00
  Row 2 BPI: 2

---
## Step 10: Final Data Quality Checks

In [23]:
print("\n" + "="*80)
print("STEP 10: FINAL DATA QUALITY CHECKS")
print("="*80)

print(f"\nFinal Dataset Dimensions:")
print(f"  Rows: {len(features_daily)}")
print(f"  Columns: {len(features_daily.columns)}")
print(f"  Date range: {features_daily['Date'].min()} to {features_daily['Date'].max()}")

print(f"\nColumn List ({len(features_daily.columns)} total):")
for i, col in enumerate(features_daily.columns, 1):
    print(f"  {i:2d}. {col}")


STEP 10: FINAL DATA QUALITY CHECKS

Final Dataset Dimensions:
  Rows: 1153
  Columns: 59
  Date range: 2021-03-01 00:00:00 to 2025-10-10 00:00:00

Column List (59 total):
   1. Date
   2. BPI
   3. C5TC
   4. P4_82
   5. PDOPEX
   6. P1EA_82+1MON
   7. P1EA_82+1Q
   8. P1EA_82+2MON
   9. P1EA_82+2Q
  10. P1EA_82+3MON
  11. P1EA_82+3Q
  12. P1EA_82+4MON
  13. P1EA_82+4Q
  14. P1EA_82+5MON
  15. P1EA_82CURMON
  16. P1EA_82CURQ
  17. P3EA_82+1MON
  18. P3EA_82+1Q
  19. P3EA_82+2MON
  20. P3EA_82+2Q
  21. P3EA_82+3MON
  22. P3EA_82+3Q
  23. P3EA_82+4MON
  24. P3EA_82+4Q
  25. P3EA_82+5MON
  26. P3EA_82CURMON
  27. P3EA_82CURQ
  28. VLSFO
  29. MGO
  30. Panamax_Idle_Pct
  31. Atlantic_Port_Calls
  32. TC5yr_Atlantic
  33. TC5yr_Pacific
  34. Atlantic_IP_YoY
  35. Pacific_IP_YoY
  36. Panamax_Deliveries_DWT
  37. Capesize_Orderbook_Pct
  38. Panamax_Fleet_Growth_YoY
  39. Panamax_Orderbook_Pct
  40. Coal_Trade_YoY
  41. Coal_Trade_YoY_3mma
  42. Coal_Trade_Volume_Index
  43. Grain_Trade_Yo

In [24]:
# Check for missing values across all features
print("\n" + "-"*80)
print("MISSING VALUE SUMMARY (ALL FEATURES)")
print("-"*80)

missing_summary = features_daily.isnull().sum()
missing_summary = missing_summary[missing_summary > 0].sort_values(ascending=False)

n_features = len([col for col in features_daily.columns if col != 'Date'])

print(f"\n**IMPORTANT**: All features have been lagged by 1 day to prevent data leakage.")
print(f"  This means Row 1 (date={features_daily.iloc[0]['Date'].date()}) will have ALL features as NaN.")
print(f"  Total features: {n_features}")
print(f"  Expected NaN count for Row 1: {n_features}")

if len(missing_summary) > 0:
    print(f"\n⚠️ Features with missing values: {len(missing_summary)}")
    print(f"\nBreakdown by category:")
    print(f"\n  1. Expected missing values (from 1-day lag):")
    print(f"     - Row 1 has ALL {n_features} features as NaN (due to lag)")
    print(f"\n  2. Additional missing values (from data availability):")
    
    for col, count in missing_summary.items():
        pct = count / len(features_daily) * 100
        if count == 1:
            print(f"     - {col}: {count} ({pct:.1f}%) - from 1-day lag only")
        else:
            print(f"     - {col}: {count} ({pct:.1f}%)")
else:
    print(f"\n✅ No missing values detected in any feature (except expected lag-induced NaN in Row 1)!")


--------------------------------------------------------------------------------
MISSING VALUE SUMMARY (ALL FEATURES)
--------------------------------------------------------------------------------

**IMPORTANT**: All features have been lagged by 1 day to prevent data leakage.
  This means Row 1 (date=2021-03-01) will have ALL features as NaN.
  Total features: 58
  Expected NaN count for Row 1: 58

⚠️ Features with missing values: 58

Breakdown by category:

  1. Expected missing values (from 1-day lag):
     - Row 1 has ALL 58 features as NaN (due to lag)

  2. Additional missing values (from data availability):
     - P3EA_82+4Q: 293 (25.4%)
     - P1EA_82+4Q: 293 (25.4%)
     - TC5yr_Atlantic: 5 (0.4%)
     - TC5yr_Pacific: 5 (0.4%)
     - MGO: 5 (0.4%)
     - VLSFO: 5 (0.4%)
     - P4_82: 1 (0.1%) - from 1-day lag only
     - PDOPEX: 1 (0.1%) - from 1-day lag only
     - P1EA_82+1MON: 1 (0.1%) - from 1-day lag only
     - P1EA_82+1Q: 1 (0.1%) - from 1-day lag only
     - P1EA_82
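
The summary above separates lag-induced NaN from availability gaps only by testing whether a column's count equals 1. A slightly finer breakdown, sketched here on toy data, splits each column's missing values into leading NaN (rows before the first valid observation, which covers both the lag row and late-starting history) and any other NaN. The helper name `nan_breakdown` and the column names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame after a 1-row lag: row 0 is NaN everywhere, "B" also lacks early
# history, and "C" has a gap in the middle.
df = pd.DataFrame({
    "A": [np.nan, 1.0, 2.0, 3.0, 4.0],
    "B": [np.nan, np.nan, np.nan, 3.0, 4.0],
    "C": [np.nan, 1.0, np.nan, 3.0, 4.0],
})

def nan_breakdown(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: per-column NaN count split into leading NaN and the rest."""
    total = frame.isna().sum()
    # The running count of valid values stays 0 until the first observation.
    leading = (frame.notna().cumsum() == 0).sum()
    return pd.DataFrame({"total": total, "leading": leading, "other": total - leading})

summary = nan_breakdown(df)
print(summary)
```

A non-zero `other` count points at gaps inside a series, which may deserve a closer look than a series that simply starts late.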

In [25]:
# Check for duplicate dates
print("\n" + "-"*80)
print("DUPLICATE DATE CHECK")
print("-"*80)

duplicate_dates = features_daily['Date'].duplicated().sum()
if duplicate_dates > 0:
    print(f"\n⚠️ WARNING: {duplicate_dates} duplicate dates found!")
    print(features_daily[features_daily['Date'].duplicated(keep=False)][['Date']].head(20))
else:
    print(f"\n✅ No duplicate dates detected")


--------------------------------------------------------------------------------
DUPLICATE DATE CHECK
--------------------------------------------------------------------------------

✅ No duplicate dates detected


In [26]:
# Verify date alignment with master calendar
print("\n" + "-"*80)
print("MASTER CALENDAR ALIGNMENT CHECK")
print("-"*80)

print(f"\nMaster calendar dates: {len(master_calendar)}")
print(f"Features daily dates: {len(features_daily)}")

if len(features_daily) == len(master_calendar):
    print(f"\n✅ Date counts match!")
    
    # Check if dates are identical
    dates_match = (features_daily['Date'].values == master_calendar['Date'].values).all()
    if dates_match:
        print(f"✅ All dates perfectly aligned with master calendar")
    else:
        print(f"⚠️ WARNING: Date counts match but dates are not identical!")
else:
    print(f"\n⚠️ WARNING: Date counts do not match!")
    print(f"  Difference: {len(features_daily) - len(master_calendar)} rows")


--------------------------------------------------------------------------------
MASTER CALENDAR ALIGNMENT CHECK
--------------------------------------------------------------------------------

Master calendar dates: 1153
Features daily dates: 1153

✅ Date counts match!
✅ All dates perfectly aligned with master calendar
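
The duplicate and alignment checks above print warnings but let execution continue. If the notebook is run unattended, one option is to turn them into hard failures. The following is a sketch under that assumption. `check_alignment` is a hypothetical helper, not part of the original code.

```python
import pandas as pd

def check_alignment(features: pd.DataFrame, calendar: pd.DataFrame, key: str = "Date") -> None:
    """Raise AssertionError if `features` is not row-for-row aligned with `calendar`."""
    assert features[key].is_unique, "duplicate dates in features"
    assert features[key].is_monotonic_increasing, "dates are not sorted"
    assert len(features) == len(calendar), "row counts differ"
    # Compare values positionally, ignoring index labels.
    assert (features[key].to_numpy() == calendar[key].to_numpy()).all(), "dates differ"

calendar = pd.DataFrame({"Date": pd.to_datetime(["2021-03-01", "2021-03-02", "2021-03-03"])})
features = calendar.assign(BPI=[float("nan"), 2086.0, 2100.0])  # illustrative values

check_alignment(features, calendar)  # passes silently

# Dropping a row makes the check fail instead of printing a warning.
try:
    check_alignment(features.iloc[:-1], calendar)
    mismatch_caught = False
except AssertionError:
    mismatch_caught = True
print("mismatch caught:", mismatch_caught)
```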


In [27]:
# Display sample of final dataset
print("\n" + "-"*80)
print("SAMPLE OF FINAL DATASET (First 5 rows)")
print("-"*80)
print(features_daily.head())

print("\n" + "-"*80)
print("SAMPLE OF FINAL DATASET (Last 5 rows)")
print("-"*80)
print(features_daily.tail())


--------------------------------------------------------------------------------
SAMPLE OF FINAL DATASET (First 5 rows)
--------------------------------------------------------------------------------
        Date     BPI     C5TC   P4_82  PDOPEX  P1EA_82+1MON  P1EA_82+1Q  \
0 2021-03-01     NaN      NaN     NaN     NaN           NaN         NaN   
1 2021-03-02  2086.0  11679.0  5836.0  4744.0       19451.0     18662.0   
2 2021-03-03  2100.0  12152.0  5814.0  4744.0       20401.0     19234.0   
3 2021-03-04  2161.0  13910.0  5871.0  4744.0       21117.0     19651.0   
4 2021-03-05  2212.0  13874.0  5918.0  4744.0       21367.0     19890.0   

   P1EA_82+2MON  P1EA_82+2Q  P1EA_82+3MON  ...  World_Grain_Trade_YoY_MT  \
0           NaN         NaN           NaN  ...                       NaN   
1       18551.0     16651.0       17984.0  ...                      2.26   
2       18884.0     17017.0       18417.0  ...                      2.26   
3       19417.0     17234.0       18417.0  

---
## Step 11: Save Processed Data

In [28]:
print("\n" + "="*80)
print("STEP 11: SAVE PROCESSED DATA")
print("="*80)

# Save features_raw_daily
features_raw_daily_file = INTERMEDIATE_DIR / 'features_raw_daily.csv'
features_daily.to_csv(features_raw_daily_file, index=False)
print(f"\n✅ Features (raw daily) saved to: {features_raw_daily_file}")
print(f"   Shape: {features_daily.shape}")
print(f"   Columns: {len(features_daily.columns)}")

print("\n" + "-"*80)
print("DATA GATHERING COMPLETE")
print("-"*80)

print("\nSummary of saved files:")
print(f"  1. Master calendar: {INTERMEDIATE_DIR / 'master_calendar.csv'}")
print(f"  2. Labels: {INTERMEDIATE_DIR / 'labels.csv'}")
print(f"  3. BFA wide: {INTERMEDIATE_DIR / 'bfa_wide.csv'}")
print(f"  4. Features (raw daily): {INTERMEDIATE_DIR / 'features_raw_daily.csv'}")

print("\n✅ Ready for next step: 02_feature_engineering.ipynb")


STEP 11: SAVE PROCESSED DATA

✅ Features (raw daily) saved to: C:\Users\moame\Documents\GitHub\MScFECapstoneProject\round_2\data\processed\intermediate\features_raw_daily.csv
   Shape: (1153, 59)
   Columns: 59

--------------------------------------------------------------------------------
DATA GATHERING COMPLETE
--------------------------------------------------------------------------------

Summary of saved files:
  1. Master calendar: C:\Users\moame\Documents\GitHub\MScFECapstoneProject\round_2\data\processed\intermediate\master_calendar.csv
  2. Labels: C:\Users\moame\Documents\GitHub\MScFECapstoneProject\round_2\data\processed\intermediate\labels.csv
  3. BFA wide: C:\Users\moame\Documents\GitHub\MScFECapstoneProject\round_2\data\processed\intermediate\bfa_wide.csv
  4. Features (raw daily): C:\Users\moame\Documents\GitHub\MScFECapstoneProject\round_2\data\processed\intermediate\features_raw_daily.csv

✅ Ready for next step: 02_feature_engineering.ipynb
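
Because `Date` is kept as a regular column and the file is written with `index=False`, the next notebook has to parse the dates again when loading. A minimal round-trip sketch, using a temporary directory as a stand-in for `INTERMEDIATE_DIR` and illustrative values:

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

features = pd.DataFrame({
    "Date": pd.to_datetime(["2021-03-01", "2021-03-02"]),
    "BPI": [np.nan, 2086.0],  # first row is NaN because of the 1-day lag
})

with tempfile.TemporaryDirectory() as tmp:
    out_file = Path(tmp) / "features_raw_daily.csv"  # temporary stand-in path
    features.to_csv(out_file, index=False)

    # Without parse_dates the Date column would be read back as plain strings.
    reloaded = pd.read_csv(out_file, parse_dates=["Date"])

print(reloaded.dtypes)
dates_match = reloaded["Date"].tolist() == features["Date"].tolist()
same_shape = reloaded.shape == features.shape
print("dates match:", dates_match, "| same shape:", same_shape)
```

The lag-induced NaN in the first row survives the round trip as an empty CSV field, so downstream code still needs to handle or drop that row.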


---
## End of Notebook