# Data Collection for Europe Base Port Container Price Prediction

This notebook gathers the raw data needed for our one-week-ahead container price forecasting project. We collect data from three sources and save it for later processing.

## What are we predicting?

**Target**: Europe Base Port container prices (1 week ahead)

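**Base Ports Definition**: Average spot freight rate for shipping a container from Shanghai/China to major European ports including Rotterdam (Netherlands), Hamburg (Germany), London (UK), and Antwerp (Belgium). Note that the SCFI quotes the Europe route in USD per TEU (a 20-foot equivalent unit), not per 40-foot container.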

**Why these ports**: These represent the main entry points for Asian goods into Europe and provide a standard benchmark for European route pricing.

## Our data sources:

1. Shanghai Containerized Freight Index (local CSV file) - Main price data
2. WTI crude oil prices (FRED series DCOILWTICO, sourced from the EIA) - Cost factor affecting shipping fuel costs
3. Geopolitical disruption data (from GDELT via BigQuery) - Black swan event indicators

**Note**: We use GDELT data exported from Google BigQuery for historical coverage (2018-2025). This provides weekly disruption metrics including conflict events, severe incidents, and sentiment analysis for shipping-critical regions.

In [None]:
# Import required libraries (only those this notebook actually uses)
import pandas as pd
import numpy as np
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

print("Libraries imported successfully!")

## Step 1: Load the Shanghai Containerized Freight Index

In [None]:
try:
    # Load the raw CSV file
    # header=1 means the actual column names are in the second row (row 1, counting from 0)
    raw_df = pd.read_csv('data/Shanghai_Containerized_Freight_Index.csv', header=1)
    print(f"Successfully loaded the CSV file with {raw_df.shape[0]} rows and {raw_df.shape[1]} columns")
    
    # Display the first few rows to see what the data looks like
    print("\nFirst 5 rows of the raw data:")
    print(raw_df.head())
    
except FileNotFoundError:
    print("Error: The file 'data/Shanghai_Containerized_Freight_Index.csv' was not found.")
    print("Please make sure it is in the data folder.")
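
To make the effect of `header=1` concrete, here is a minimal sketch using an in-memory CSV. The title row and all numeric values are made up for illustration; only the three column names come from this notebook.

```python
import io
import pandas as pd

# Hypothetical file layout: one title row, then the real column names.
# All values below are illustrative, not real SCFI readings.
sample_csv = io.StringIO(
    "Shanghai Containerized Freight Index,,\n"
    "the period (YYYY-MM-DD),Comprehensive Index,Europe (Base port)\n"
    "01/05/2018,800.0,900\n"
    "01/12/2018,810.0,905\n"
)

# header=1 skips the title row and takes column names from the second row
demo_df = pd.read_csv(sample_csv, header=1)
print(demo_df.columns.tolist())
print(demo_df.shape)
```

If the real file has a different number of preamble rows, the `header` argument would need to change accordingly.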

## Step 2: Select and rename the columns we need

The dataset has many columns for different routes, but we only need a few for our Europe Base Port prediction. We will select the date column, the overall freight index (SCFI), and the Europe Base Port price.

In [None]:
# Select only the columns we need
df_freight = raw_df[['the period (YYYY-MM-DD)', 'Comprehensive Index', 'Europe (Base port)']].copy()

# Rename columns to simpler names
df_freight.columns = ['Date', 'SCFI_Index', 'Europe_Base_Price']

print("Selected and renamed columns:")
print(df_freight.columns.tolist())
print(f"\nDataset now has {df_freight.shape[0]} rows and {df_freight.shape[1]} columns")
print("\nFirst 5 rows:")
print(df_freight.head())
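
The selection above raises a bare `KeyError` if a column name in the CSV ever changes. One possible safeguard, sketched here with a hypothetical helper `select_and_rename` and made-up values, is to check for the expected columns first and report all missing ones at once.

```python
import pandas as pd

def select_and_rename(df, column_map):
    """Return a copy of df holding only the mapped columns, renamed.

    Hypothetical helper (not part of the original notebook). Raises a
    KeyError that lists every expected source column that is missing.
    """
    missing = [col for col in column_map if col not in df.columns]
    if missing:
        raise KeyError(f"Missing expected columns: {missing}")
    return df[list(column_map)].rename(columns=column_map).copy()

# Illustrative frame with made-up values and one extra route column
demo = pd.DataFrame({
    'the period (YYYY-MM-DD)': ['01/05/2018'],
    'Comprehensive Index': [800.0],
    'Europe (Base port)': [900.0],
    'Some other route': [1.0],
})

column_map = {
    'the period (YYYY-MM-DD)': 'Date',
    'Comprehensive Index': 'SCFI_Index',
    'Europe (Base port)': 'Europe_Base_Price',
}

selected = select_and_rename(demo, column_map)
print(selected.columns.tolist())
```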

## Step 3: Convert date strings to proper date format

Right now, the Date column is just text. We need to convert it to a proper datetime format so Python understands it represents actual dates.

In [None]:
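# Note: the source column header says YYYY-MM-DD, but the values are parsed here as MM/DD/YYYY.
# With errors='coerce', any value that does not match this format silently becomes NaT.
df_freight['Date'] = pd.to_datetime(df_freight['Date'], format='%m/%d/%Y', errors='coerce')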

# Check how many dates were successfully converted
valid_dates = df_freight['Date'].notna().sum()
total_rows = len(df_freight)
print(f"Successfully converted {valid_dates} out of {total_rows} dates")

# Remove rows where date conversion failed
df_freight.dropna(subset=['Date'], inplace=True)
print(f"After removing invalid dates: {len(df_freight)} rows remaining")

# Set the Date column as the index (the row identifier)
df_freight.set_index('Date', inplace=True)

print("\nDate conversion complete!")
print(df_freight.head())
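
Because `errors='coerce'` hides parsing failures, it can help to look at the raw strings that failed before dropping them. The following is a small sketch on made-up input strings, showing how a value in a different format ends up as `NaT`.

```python
import pandas as pd

# Made-up inputs: one matching the expected format, one ISO-style, one junk
raw_dates = pd.Series(['01/05/2018', '2018-01-12', 'not a date'])

# Same call as in the notebook: non-matching values become NaT
parsed = pd.to_datetime(raw_dates, format='%m/%d/%Y', errors='coerce')

# Keep the original strings of the rows that failed, for inspection
failed = raw_dates[parsed.isna()]

print(f"Parsed {parsed.notna().sum()} of {len(raw_dates)} dates")
print("Unparseable values:", failed.tolist())
```

If many rows fail this way on the real file, the `format` string is the first thing to check.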

## Step 4: Convert price columns to numbers

Sometimes data is read as text even when it represents numbers. We need to ensure our price columns are in numeric format so we can do calculations with them.

In [None]:
# Convert price columns to numeric format
for col in ['SCFI_Index', 'Europe_Base_Price']:
    df_freight[col] = pd.to_numeric(df_freight[col], errors='coerce')
    print(f"Converted {col} to numeric type")

# Remove any rows with missing values
# This ensures we have complete data for all rows
before_drop = len(df_freight)
df_freight.dropna(inplace=True)
after_drop = len(df_freight)

print(f"\nRemoved {before_drop - after_drop} rows with missing values")
print(f"Final freight dataset: {after_drop} rows")
print(f"Date range: {df_freight.index.min().strftime('%Y-%m-%d')} to {df_freight.index.max().strftime('%Y-%m-%d')}")

print("\nFinal cleaned freight data:")
print(df_freight.head())
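
One common reason a numeric column is read as text is a thousands separator. Whether the SCFI file contains any is an assumption, not something confirmed here; the sketch below uses made-up strings to show how `pd.to_numeric` treats such values and how stripping commas first recovers them.

```python
import pandas as pd

# Made-up raw values: one with a thousands separator, one plain, two invalid
raw_prices = pd.Series(['1,234.5', '987', '', 'n/a'])

# Direct conversion: the comma-formatted value is coerced to NaN
naive = pd.to_numeric(raw_prices, errors='coerce')

# Removing commas first (assumes commas are never decimal marks) keeps it
cleaned = pd.to_numeric(
    raw_prices.str.replace(',', '', regex=False), errors='coerce'
)

print(f"Valid values without cleaning: {naive.notna().sum()}")
print(f"Valid values after removing commas: {cleaned.notna().sum()}")
```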

## Step 5: Load crude oil price data

Oil prices affect shipping costs since ships use fuel. We will load historical crude oil prices from the DCOILWTICO series, which contains EIA data distributed through FRED.

### Data Source: EIA DCOILWTICO

We use the **DCOILWTICO** series, which provides daily WTI crude oil prices. The data is published by the U.S. Energy Information Administration (EIA) and distributed under this series ID by FRED (Federal Reserve Economic Data). This is official government data covering 2018 to the present.

**File**: `data/DCOILWTICO.csv`
**Format**: CSV with columns `observation_date` and `DCOILWTICO`
**Frequency**: Daily prices

### Why WTI Crude Oil?

WTI (West Texas Intermediate) is a major oil benchmark used for global pricing. Ships burn refined marine fuel rather than crude, but fuel prices tend to follow crude oil prices, so WTI serves as a proxy for changes in shipping fuel costs, an important factor in container shipping economics.

### Processing Steps:
1. Load the CSV file
2. Convert dates to datetime format
3. Rename columns for consistency
4. Filter to match freight data date range
5. Handle missing values (marked as empty strings in EIA data)

In [None]:
print("=" * 70)
print("LOADING CRUDE OIL PRICE DATA")
print("=" * 70)

# Get the date range from our freight data
start_date = df_freight.index.min().strftime('%Y-%m-%d')
end_date = df_freight.index.max().strftime('%Y-%m-%d')

print(f"\nRequesting data from {start_date} to {end_date}")

df_oil = pd.DataFrame()

# Load crude oil prices from EIA DCOILWTICO dataset
print("\n--- Loading from EIA DCOILWTICO.csv ---")

try:
    # Load the EIA crude oil price data
    oil_file_path = 'data/DCOILWTICO.csv'
    df_oil_raw = pd.read_csv(oil_file_path, parse_dates=['observation_date'], index_col='observation_date')

    # Rename the price column for consistency
    df_oil_raw = df_oil_raw.rename(columns={'DCOILWTICO': 'Oil_Price'})

    # Filter to our date range
    df_oil = df_oil_raw[(df_oil_raw.index >= start_date) & (df_oil_raw.index <= end_date)].copy()

    # Handle missing values (EIA uses empty strings for missing data)
    df_oil['Oil_Price'] = pd.to_numeric(df_oil['Oil_Price'], errors='coerce')

    # Remove rows with missing prices
    before_clean = len(df_oil)
    df_oil.dropna(subset=['Oil_Price'], inplace=True)
    after_clean = len(df_oil)

    if not df_oil.empty:
        print(f"Success: Loaded {len(df_oil)} days of crude oil price data from {oil_file_path}")
        print(f"  Date range: {df_oil.index.min().strftime('%Y-%m-%d')} to {df_oil.index.max().strftime('%Y-%m-%d')}")
        print(f"  Price range: ${df_oil['Oil_Price'].min():.2f} to ${df_oil['Oil_Price'].max():.2f}")
        print(f"  Average price: ${df_oil['Oil_Price'].mean():.2f}")
        print(f"  Removed {before_clean - after_clean} rows with missing values")
        print("\nFirst 5 rows of oil data:")
        print(df_oil.head())
    else:
        print(f"Error: No valid oil price data found in date range")

except FileNotFoundError:
    print(f"Error: The file '{oil_file_path}' was not found.")
    print("Please ensure the DCOILWTICO.csv file is in the data folder.")
except Exception as e:
    print(f"Error: An error occurred while loading oil data: {str(e)[:150]}")

# Fallback: Create synthetic oil price data if loading failed (last resort)
if df_oil.empty:
    print("\n--- Fallback: Creating synthetic placeholder data ---")
    print("Warning: No real oil data available. Creating synthetic data for demonstration.")
    print("This should only be used for testing. For production, obtain the DCOILWTICO.csv file.")

    # Create date range matching freight data
    date_range = pd.date_range(start=start_date, end=end_date, freq='D')

    # Create synthetic oil prices with realistic values and volatility
    # Base price around $70-80 with random walk
    np.random.seed(42)  # For reproducibility
    base_price = 75.0
    random_walk = np.random.randn(len(date_range)).cumsum() * 2  # Random walk with std=2
    synthetic_prices = base_price + random_walk

    # Clip to reasonable range (50-120)
    synthetic_prices = np.clip(synthetic_prices, 50, 120)

    df_oil = pd.DataFrame({
        'Oil_Price': synthetic_prices
    }, index=date_range)

    print(f"Success: Created {len(df_oil)} days of synthetic oil price data")
    print(f"  Date range: {df_oil.index.min().strftime('%Y-%m-%d')} to {df_oil.index.max().strftime('%Y-%m-%d')}")
    print(f"  Price range: ${df_oil['Oil_Price'].min():.2f} to ${df_oil['Oil_Price'].max():.2f}")
    print(f"  Average: ${df_oil['Oil_Price'].mean():.2f}")
    print("\nWarning: This is SYNTHETIC data. Replace with real DCOILWTICO.csv data for actual predictions.")

print("\n" + "=" * 70)
print("OIL DATA LOADING COMPLETE")
print("=" * 70)
print(f"Total days: {len(df_oil)}")
# The real file sets the index name to 'observation_date'; the synthetic fallback leaves it unnamed
print(f"Data source: {'EIA DCOILWTICO' if df_oil.index.name == 'observation_date' else 'Synthetic'}")
print("=" * 70)
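
Missing observations can also be handled at read time. The sketch below assumes that some exports of this series mark a missing value with a period instead of an empty field; that is an assumption about the file, not something this notebook confirms. The sample rows are made up.

```python
import io
import pandas as pd

# Made-up sample in the same two-column layout as DCOILWTICO.csv.
# One row uses '.' and one uses an empty field to mark a missing price.
sample = io.StringIO(
    "observation_date,DCOILWTICO\n"
    "2018-01-01,.\n"
    "2018-01-02,60.0\n"
    "2018-01-03,\n"
    "2018-01-04,62.0\n"
)

# na_values adds '.' to the markers pandas already treats as missing,
# so the price column is read directly as floats
oil_demo = pd.read_csv(
    sample,
    parse_dates=['observation_date'],
    index_col='observation_date',
    na_values=['.'],
).rename(columns={'DCOILWTICO': 'Oil_Price'})

print(oil_demo['Oil_Price'].dtype)
print(f"Missing prices: {oil_demo['Oil_Price'].isna().sum()}")
```

With this approach the later `pd.to_numeric(..., errors='coerce')` call becomes a safety net rather than the main cleaning step.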

## Step 6: Load the "black swan" geopolitical disruption data

We now load the GDELT dataset exported from BigQuery, which contains weekly geopolitical and black swan event metrics for shipping-critical regions.

In [None]:
print("=" * 70)
print("LOADING NEW BLACK SWAN GEOPOLITICAL DISRUPTION DATA")
print("=" * 70)

# Helper function to convert ISO year/week to Friday date
def iso_to_friday_date(row):
    """Converts ISO year and week to the Friday of that week."""
    return datetime.fromisocalendar(int(row['iso_year']), int(row['week']), 5)

# Load the new BigQuery export
try:
    news_file_path = 'data/bq-results-20251021-090045-1761037274833.csv'
    df_news_raw = pd.read_csv(news_file_path)
    print(f"Success: Loaded {len(df_news_raw)} weekly records from {news_file_path}")
    print(f"  Columns: {df_news_raw.columns.tolist()}")

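    # Remove week 53 entries: fromisocalendar raises an "invalid week" error
    # when the given year has only 52 ISO weeks.
    # Note: this also drops valid week 53 rows (for example, ISO year 2020).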
    before_filter = len(df_news_raw)
    df_news_raw = df_news_raw[df_news_raw['week'] != 53]
    after_filter = len(df_news_raw)
    if before_filter != after_filter:
        print(f"Removed {before_filter - after_filter} week 53 entries (invalid weeks)")

    # Convert ISO week to Friday date to match freight data
    print("\nConverting ISO week numbers to Friday dates...")
    df_news_raw['date'] = df_news_raw.apply(iso_to_friday_date, axis=1)
    df_news_raw['date'] = pd.to_datetime(df_news_raw['date'])
    df_news_raw.set_index('date', inplace=True)
    print(f"Success: Converted dates. New range: {df_news_raw.index.min().date()} to {df_news_raw.index.max().date()}")

    # Drop the non-predictive global_worst_event_impact column
    if 'global_worst_event_impact' in df_news_raw.columns:
        df_news_raw = df_news_raw.drop(columns=['global_worst_event_impact'])
        print("Success: Dropped 'global_worst_event_impact' column.")

    # Save the cleaned data
    df_news_raw.to_csv('collected_news_data.csv')
    print(f"\nSuccess: New black swan data saved to 'collected_news_data.csv'")
    print(f"   Contains {len(df_news_raw.columns)} predictive features.")
    print("\nSample of new event data:")
    print(df_news_raw.head())

except FileNotFoundError:
    print(f"Error: The file '{news_file_path}' was not found.")
    print("Please ensure the BigQuery export CSV file is in the data folder.")
except Exception as e:
    print(f"An error occurred: {e}")
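
Dropping every week 53 row is the simplest fix, but it also discards weeks that do exist. As an alternative sketch, a hypothetical helper `iso_week_to_friday` can return `NaT` only for year and week combinations that are invalid. The year and week values below are illustrative.

```python
from datetime import datetime
import pandas as pd

def iso_week_to_friday(iso_year, week):
    """Return the Friday of an ISO week, or NaT if that week does not exist.

    Hypothetical helper, not part of the original notebook.
    """
    try:
        return pd.Timestamp(datetime.fromisocalendar(int(iso_year), int(week), 5))
    except ValueError:
        return pd.NaT

# 2020 has 53 ISO weeks; 2021 has only 52, so (2021, 53) is invalid
demo_weeks = pd.DataFrame({'iso_year': [2018, 2020, 2021], 'week': [1, 53, 53]})
demo_weeks['date'] = [
    iso_week_to_friday(year, week)
    for year, week in zip(demo_weeks['iso_year'], demo_weeks['week'])
]

# Only rows with a real date survive
valid_weeks = demo_weeks.dropna(subset=['date'])
print(valid_weeks)
```

Here the Friday of ISO week 53 of 2020 falls on 2021-01-01, which shows that such rows carry real calendar dates.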

## Step 7: Save the collected data

We will save the freight and oil datasets as CSV files (the disruption data was already saved to `collected_news_data.csv` in Step 6) so we can use all three in the next notebooks without having to fetch the data again.

In [None]:
# Save freight data
df_freight.to_csv('collected_freight_data.csv')
print("Saved freight data to 'collected_freight_data.csv'")
print(f"  {len(df_freight)} weekly records")

# Save oil data
if not df_oil.empty:
    df_oil.to_csv('collected_oil_data.csv')
    print("Saved crude oil price data to 'collected_oil_data.csv'")
    print(f"  {len(df_oil)} daily records")
else:
    print("No oil data to save")

print("\n" + "=" * 70)
print("DATA COLLECTION COMPLETE!")
print("=" * 70)
print("\nSummary of collected data:")
print(f"Available: Freight data: {len(df_freight)} weeks ({df_freight.index.min().strftime('%Y-%m-%d')} to {df_freight.index.max().strftime('%Y-%m-%d')})")
print(f"Available: Oil data: {len(df_oil) if not df_oil.empty else 0} days")

# Black swan disruption data may not exist if the file failed to load; guard against NameError
if 'df_news_raw' in globals():
    try:
        print(f"Available: Black swan disruption data: {len(df_news_raw)} weeks")
        print(f"  Features: {', '.join(df_news_raw.columns.tolist())}")
    except Exception:
        # If df_news_raw exists but isn't a dataframe or has issues, report its type
        print(f"df_news_raw exists but could not be summarized (type={type(df_news_raw)})")
else:
    print("Available: Black swan disruption data: 0 weeks (df_news_raw not found)")

print("\nReady for Step 2: Data Understanding and Feature Engineering")
print("=" * 70)
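
When the saved files are read back in a later notebook, the date index comes back as plain text unless it is parsed again. The sketch below round-trips a small made-up frame through a temporary directory to show one way of reloading it; the file name matches the one saved above, while the values are illustrative.

```python
import os
import tempfile
import pandas as pd

# Made-up frame with the same layout as df_freight
original = pd.DataFrame(
    {'SCFI_Index': [800.0, 810.0], 'Europe_Base_Price': [900.0, 905.0]},
    index=pd.DatetimeIndex(['2018-01-05', '2018-01-12'], name='Date'),
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'collected_freight_data.csv')
    original.to_csv(path)
    # index_col restores the index; parse_dates=True converts it back to datetimes
    reloaded = pd.read_csv(path, index_col='Date', parse_dates=True)

print(type(reloaded.index).__name__)
print(reloaded)
```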