# Data Collection for Europe Base Port Container Price Prediction

This notebook focuses on gathering all the raw data needed for our 1-week ahead container price forecasting project. We will collect data from three different sources and save it for later processing.

## What are we predicting?

**Target**: Europe Base Port container prices (1 week ahead)

**Base Ports Definition**: Average shipping cost for a 40-foot container from Shanghai/China to major European ports including Rotterdam (Netherlands), Hamburg (Germany), London (UK), and Antwerp (Belgium).

**Why these ports**: These represent the main entry points for Asian goods into Europe and provide a standard benchmark for European route pricing.

## Our data sources:

1. Shanghai Containerized Freight Index (local CSV file) - Main price data
2. Crude oil prices (EIA DCOILWTICO dataset) - Cost factor affecting shipping fuel costs
3. Geopolitical disruption data (from GDELT via BigQuery) - Black swan event indicators

**Note**: We use GDELT data exported from Google BigQuery for historical coverage (2018-2025). This provides weekly disruption metrics including conflict events, severe incidents, and sentiment analysis for shipping-critical regions.

In [1]:
# Import required libraries
import pandas as pd
import numpy as np
import yfinance as yf
import requests
import time
from datetime import datetime, timedelta
import warnings
import plotly.express as px
warnings.filterwarnings('ignore')

print("Libraries imported successfully!")

Libraries imported successfully!


## Step 1: Load the Shanghai Containerized Freight Index

In [2]:
try:
    # Load the raw CSV file
    # header=1 means the actual column names are in the second row (row 1, counting from 0)
    raw_df = pd.read_csv('data/Shanghai_Containerized_Freight_Index.csv', header=1)
    print(f"Successfully loaded the CSV file with {raw_df.shape[0]} rows and {raw_df.shape[1]} columns")
    
    # Display the first few rows to see what the data looks like
    print("\nFirst 5 rows of the raw data:")
    print(raw_df.head())
    
except FileNotFoundError:
    print("Error: The file 'data/Shanghai_Containerized_Freight_Index.csv' was not found.")
    print("Please make sure it is in the data folder.")

Successfully loaded the CSV file with 385 rows and 5 columns

First 5 rows of the raw data:
  the period (YYYY-MM-DD)  Comprehensive Index  Europe (Base port)  \
0                1/5/2018               816.58                 888   
1               1/12/2018               839.72                 897   
2               1/19/2018               840.36                 891   
3               1/26/2018               858.60                 907   
4                2/2/2018               883.59                 912   

   Mediterranean (Base port)  Persian Gulf and Red Sea (Dubai)  
0                        738                               433  
1                        759                               450  
2                        761                               572  
3                        772                               631  
4                        797                               611  


## Step 2: Select and rename the columns we need

The dataset has many columns for different routes, but we only need a few for our Europe Base Port prediction. We will select the date column, the overall freight index (SCFI), and the Europe Base Port price.

In [3]:
# Select only the columns we need
df_freight = raw_df[['the period (YYYY-MM-DD)', 'Comprehensive Index', 'Europe (Base port)']].copy()

# Rename columns to simpler names
df_freight.columns = ['Date', 'SCFI_Index', 'Europe_Base_Price']

print("Selected and renamed columns:")
print(df_freight.columns.tolist())
print(f"\nDataset now has {df_freight.shape[0]} rows and {df_freight.shape[1]} columns")
print("\nFirst 5 rows:")
print(df_freight.head())

Selected and renamed columns:
['Date', 'SCFI_Index', 'Europe_Base_Price']

Dataset now has 385 rows and 3 columns

First 5 rows:
        Date  SCFI_Index  Europe_Base_Price
0   1/5/2018      816.58                888
1  1/12/2018      839.72                897
2  1/19/2018      840.36                891
3  1/26/2018      858.60                907
4   2/2/2018      883.59                912


## Step 3: Convert date strings to proper date format

Right now, the Date column is just text. We need to convert it to a proper datetime format so Python understands it represents actual dates.

In [4]:
df_freight['Date'] = pd.to_datetime(df_freight['Date'], format='%m/%d/%Y', errors='coerce')

# Check how many dates were successfully converted
valid_dates = df_freight['Date'].notna().sum()
total_rows = len(df_freight)
print(f"Successfully converted {valid_dates} out of {total_rows} dates")

# Remove rows where date conversion failed
df_freight.dropna(subset=['Date'], inplace=True)
print(f"After removing invalid dates: {len(df_freight)} rows remaining")

# Set the Date column as the index (the row identifier)
df_freight.set_index('Date', inplace=True)

print("\nDate conversion complete!")
print(df_freight.head())

Successfully converted 385 out of 385 dates
After removing invalid dates: 385 rows remaining

Date conversion complete!
            SCFI_Index  Europe_Base_Price
Date                                     
2018-01-05      816.58                888
2018-01-12      839.72                897
2018-01-19      840.36                891
2018-01-26      858.60                907
2018-02-02      883.59                912


## Step 4: Convert price columns to numbers

Sometimes data is read as text even when it represents numbers. We need to ensure our price columns are in numeric format so we can do calculations with them.

In [5]:
# Convert price columns to numeric format
for col in ['SCFI_Index', 'Europe_Base_Price']:
    df_freight[col] = pd.to_numeric(df_freight[col], errors='coerce')
    print(f"Converted {col} to numeric type")

# Remove any rows with missing values
# This ensures we have complete data for all rows
before_drop = len(df_freight)
df_freight.dropna(inplace=True)
after_drop = len(df_freight)

print(f"\nRemoved {before_drop - after_drop} rows with missing values")
print(f"Final freight dataset: {after_drop} rows")
print(f"Date range: {df_freight.index.min().strftime('%Y-%m-%d')} to {df_freight.index.max().strftime('%Y-%m-%d')}")

print("\nFinal cleaned freight data:")
print(df_freight.head())

Converted SCFI_Index to numeric type
Converted Europe_Base_Price to numeric type

Removed 0 rows with missing values
Final freight dataset: 385 rows
Date range: 2018-01-05 to 2025-08-22

Final cleaned freight data:
            SCFI_Index  Europe_Base_Price
Date                                     
2018-01-05      816.58                888
2018-01-12      839.72                897
2018-01-19      840.36                891
2018-01-26      858.60                907
2018-02-02      883.59                912


## Step 5: Time-Series Visualization

In [None]:
# Main plot of Europe Base Port prices
fig = px.line(df_freight, x=df_freight.index, y='Europe_Base_Price', title='Europe Base Port Container Prices (2018-2025)')
fig.show()

# Zoom in on COVID-19 port disruptions
fig_covid = px.line(df_freight.loc['2020-01-01':'2022-12-31'], x=df_freight.loc['2020-01-01':'2022-12-31'].index, y='Europe_Base_Price', title='COVID-19 Port Disruptions Impact')
fig_covid.show()

# Zoom in on Suez Canal blockage
fig_suez = px.line(df_freight.loc['2021-03-01':'2021-04-30'], x=df_freight.loc['2021-03-01':'2021-04-30'].index, y='Europe_Base_Price', title='Suez Canal Blockage Impact')
fig_suez.show()

# Zoom in on Red Sea crisis
fig_red_sea = px.line(df_freight.loc['2023-10-01':], x=df_freight.loc['2023-10-01':].index, y='Europe_Base_Price', title='Red Sea Crisis Impact')
fig_red_sea.show()

## Step 6: Load crude oil price data

Oil prices affect shipping costs since ships use fuel. We will load historical crude oil prices from both EIA WTI and Brent EU datasets.

### Data Sources: EIA DCOILWTICO and DCOILBRENTEU

We use two major oil benchmarks from the U.S. Energy Information Administration (EIA):
- **DCOILWTICO**: WTI (West Texas Intermediate) crude oil prices
- **DCOILBRENTEU**: Brent EU crude oil prices

**Files**: `data/DCOILWTICO.csv` and `data/DCOILBRENTEU.csv`
**Format**: CSV with columns `observation_date` and price column
**Frequency**: Daily prices

### Why Both Oil Benchmarks?

WTI and Brent are the two major global oil benchmarks. WTI is more US-focused while Brent is more representative of global oil prices. Including both provides a more comprehensive view of fuel cost fluctuations affecting shipping.

### Processing Steps:
1. Load both CSV files
2. Convert dates to datetime format
3. Rename columns for consistency
4. Filter to match freight data date range
5. Handle missing values (marked as empty strings in EIA data)

In [6]:
print("=" * 70)
print("LOADING CRUDE OIL PRICE DATA")
print("=" * 70)

# Get the date range from our freight data
start_date = df_freight.index.min().strftime('%Y-%m-%d')
end_date = df_freight.index.max().strftime('%Y-%m-%d')

print(f"\nRequesting data from {start_date} to {end_date}")

df_oil_wti = pd.DataFrame()
df_oil_brent = pd.DataFrame()

# Load WTI crude oil prices from EIA DCOILWTICO dataset
print("\n--- Loading WTI from EIA DCOILWTICO.csv ---")

try:
    # Load the EIA WTI crude oil price data
    wti_file_path = 'data/DCOILWTICO.csv'
    df_oil_wti_raw = pd.read_csv(wti_file_path, parse_dates=['observation_date'], index_col='observation_date')

    # Rename the price column for consistency
    df_oil_wti_raw = df_oil_wti_raw.rename(columns={'DCOILWTICO': 'WTI_Price'})

    # Filter to our date range
    df_oil_wti = df_oil_wti_raw[(df_oil_wti_raw.index >= start_date) & (df_oil_wti_raw.index <= end_date)].copy()

    # Handle missing values (EIA uses empty strings for missing data)
    df_oil_wti['WTI_Price'] = pd.to_numeric(df_oil_wti['WTI_Price'], errors='coerce')

    # Remove rows with missing prices
    before_clean = len(df_oil_wti)
    df_oil_wti.dropna(subset=['WTI_Price'], inplace=True)
    after_clean = len(df_oil_wti)

    if not df_oil_wti.empty:
        print(f"Success: Loaded {len(df_oil_wti)} days of WTI crude oil price data from {wti_file_path}")
        print(f"  Date range: {df_oil_wti.index.min().strftime('%Y-%m-%d')} to {df_oil_wti.index.max().strftime('%Y-%m-%d')}")
        print(f"  Price range: ${df_oil_wti['WTI_Price'].min():.2f} to ${df_oil_wti['WTI_Price'].max():.2f}")
        print(f"  Average price: ${df_oil_wti['WTI_Price'].mean():.2f}")
        print(f"  Removed {before_clean - after_clean} rows with missing values")
        print("\nFirst 5 rows of WTI data:")
        print(df_oil_wti.head())
    else:
        print(f"Error: No valid WTI oil price data found in date range")

except FileNotFoundError:
    print(f"Error: The file '{wti_file_path}' was not found.")
    print("Please ensure the DCOILWTICO.csv file is in the data folder.")
except Exception as e:
    print(f"Error: An error occurred while loading WTI data: {str(e)[:150]}")

# Load Brent EU crude oil prices from EIA DCOILBRENTEU dataset
print("\n--- Loading Brent EU from EIA DCOILBRENTEU.csv ---")

try:
    # Load the EIA Brent EU crude oil price data
    brent_file_path = 'data/DCOILBRENTEU.csv'
    df_oil_brent_raw = pd.read_csv(brent_file_path, parse_dates=['observation_date'], index_col='observation_date')

    # Rename the price column for consistency
    df_oil_brent_raw = df_oil_brent_raw.rename(columns={'DCOILBRENTEU': 'Brent_Price'})

    # Filter to our date range
    df_oil_brent = df_oil_brent_raw[(df_oil_brent_raw.index >= start_date) & (df_oil_brent_raw.index <= end_date)].copy()

    # Handle missing values (EIA uses empty strings for missing data)
    df_oil_brent['Brent_Price'] = pd.to_numeric(df_oil_brent['Brent_Price'], errors='coerce')

    # Remove rows with missing prices
    before_clean = len(df_oil_brent)
    df_oil_brent.dropna(subset=['Brent_Price'], inplace=True)
    after_clean = len(df_oil_brent)

    if not df_oil_brent.empty:
        print(f"Success: Loaded {len(df_oil_brent)} days of Brent EU crude oil price data from {brent_file_path}")
        print(f"  Date range: {df_oil_brent.index.min().strftime('%Y-%m-%d')} to {df_oil_brent.index.max().strftime('%Y-%m-%d')}")
        print(f"  Price range: ${df_oil_brent['Brent_Price'].min():.2f} to ${df_oil_brent['Brent_Price'].max():.2f}")
        print(f"  Average price: ${df_oil_brent['Brent_Price'].mean():.2f}")
        print(f"  Removed {before_clean - after_clean} rows with missing values")
        print("\nFirst 5 rows of Brent EU data:")
        print(df_oil_brent.head())
    else:
        print(f"Error: No valid Brent EU oil price data found in date range")

except FileNotFoundError:
    print(f"Error: The file '{brent_file_path}' was not found.")
    print("Please ensure the DCOILBRENTEU.csv file is in the data folder.")
except Exception as e:
    print(f"Error: An error occurred while loading Brent EU data: {str(e)[:150]}")

# Fallback: Create synthetic oil price data if loading failed (last resort)
if df_oil_wti.empty and df_oil_brent.empty:
    print("\n--- Fallback: Creating synthetic placeholder data ---")
    print("Warning: No real oil data available. Creating synthetic data for demonstration.")
    print("This should only be used for testing. For production, obtain the DCOILWTICO.csv and DCOILBRENTEU.csv files.")

    # Create date range matching freight data
    date_range = pd.date_range(start=start_date, end=end_date, freq='D')

    # Create synthetic oil prices with realistic values and volatility
    # Base price around $70-80 with random walk
    np.random.seed(42)  # For reproducibility
    base_price = 75.0
    random_walk = np.random.randn(len(date_range)).cumsum() * 2  # Random walk with std=2
    synthetic_prices = base_price + random_walk

    # Clip to reasonable range (50-120)
    synthetic_prices = np.clip(synthetic_prices, 50, 120)

    df_oil_wti = pd.DataFrame({
        'WTI_Price': synthetic_prices
    }, index=date_range)
    
    df_oil_brent = pd.DataFrame({
        'Brent_Price': synthetic_prices + np.random.randn(len(date_range)) * 2  # Slight variation
    }, index=date_range)

    print(f"Success: Created {len(df_oil_wti)} days of synthetic WTI oil price data")
    print(f"Success: Created {len(df_oil_brent)} days of synthetic Brent EU oil price data")
    print(f"  Date range: {df_oil_wti.index.min().strftime('%Y-%m-%d')} to {df_oil_wti.index.max().strftime('%Y-%m-%d')}")
    print(f"  WTI Price range: ${df_oil_wti['WTI_Price'].min():.2f} to ${df_oil_wti['WTI_Price'].max():.2f}")
    print(f"  Brent Price range: ${df_oil_brent['Brent_Price'].min():.2f} to ${df_oil_brent['Brent_Price'].max():.2f}")
    print(f"  WTI Average: ${df_oil_wti['WTI_Price'].mean():.2f}")
    print(f"  Brent Average: ${df_oil_brent['Brent_Price'].mean():.2f}")
    print("\nWarning: This is SYNTHETIC data. Replace with real DCOILWTICO.csv and DCOILBRENTEU.csv data for actual predictions.")

print("\n" + "=" * 70)
print("OIL DATA LOADING COMPLETE")
print("=" * 70)
print(f"Total WTI days: {len(df_oil_wti)}")
print(f"Total Brent EU days: {len(df_oil_brent)}")
print(f"Data sources: {'EIA DCOILWTICO & DCOILBRENTEU' if not df_oil_wti.empty and not df_oil_brent.empty else 'Synthetic'}")
print("=" * 70)

LOADING CRUDE OIL PRICE DATA

Requesting data from 2018-01-05 to 2025-08-22

--- Loading WTI from EIA DCOILWTICO.csv ---
Error: The file 'data/DCOILWTICO.csv' was not found.
Please ensure the DCOILWTICO.csv file is in the data folder.

--- Loading Brent EU from EIA DCOILBRENTEU.csv ---
Success: Loaded 399 days of Brent EU crude oil price data from data/DCOILBRENTEU.csv
  Date range: 2018-01-05 to 2025-08-22
  Price range: $15.87 to $127.44
  Average price: $73.04
  Removed 0 rows with missing values

First 5 rows of Brent EU data:
                  Brent_Price
observation_date             
2018-01-05              68.01
2018-01-12              69.64
2018-01-19              68.56
2018-01-26              70.08
2018-02-02              67.45

OIL DATA LOADING COMPLETE
Total WTI days: 0
Total Brent EU days: 399
Data sources: Synthetic


## Step 7: Load the new "black swan" geopolitical disruption data

We now load the new, richer BigQuery dataset that contains specific geopolitical and black swan event metrics for shipping-critical regions.

In [7]:
print("=" * 70)
print("LOADING GEOPOLITICAL DISRUPTION DATA")
print("=" * 70)

# Load geopolitical disruption data from GDELT BigQuery export
print("\n--- Loading GDELT BigQuery data ---")

try:
    # Load the GDELT disruption data exported from BigQuery
    gdelt_file_path = 'data/gdelt_disruption_data.csv'
    df_gdelt_raw = pd.read_csv(gdelt_file_path, parse_dates=['week_start'])
    
    # Set the week_start as the index
    df_gdelt = df_gdelt_raw.set_index('week_start')
    
    # Filter to match freight data date range
    df_gdelt = df_gdelt[(df_gdelt.index >= start_date) & (df_gdelt.index <= end_date)].copy()
    
    if not df_gdelt.empty:
        print(f"Success: Loaded {len(df_gdelt)} weeks of geopolitical disruption data from {gdelt_file_path}")
        print(f"  Date range: {df_gdelt.index.min().strftime('%Y-%m-%d')} to {df_gdelt.index.max().strftime('%Y-%m-%d')}")
        print(f"  Columns: {df_gdelt.columns.tolist()}")
        print("\nFirst 5 rows of GDELT data:")
        print(df_gdelt.head())
        
        # Display summary statistics
        print("\nSummary statistics:")
        print(df_gdelt.describe())
    else:
        print(f"Error: No valid GDELT data found in date range")
        df_gdelt = pd.DataFrame()  # Empty dataframe

except FileNotFoundError:
    print(f"Error: The file '{gdelt_file_path}' was not found.")
    print("Please ensure the gdelt_disruption_data.csv file is in the data folder.")
    print("\nExpected file structure:")
    print("  - week_start: Date column (YYYY-MM-DD format)")
    print("  - Various disruption metrics (e.g., conflict_events, severe_incidents, avg_tone, etc.)")
    
    # Create an empty dataframe as fallback
    df_gdelt = pd.DataFrame()
    
except Exception as e:
    print(f"Error: An error occurred while loading GDELT data: {str(e)[:150]}")
    df_gdelt = pd.DataFrame()

# Fallback: Create synthetic geopolitical disruption data if loading failed
if df_gdelt.empty:
    print("\n--- Fallback: Creating synthetic placeholder data ---")
    print("Warning: No real GDELT data available. Creating synthetic data for demonstration.")
    print("This should only be used for testing. For production, obtain the gdelt_disruption_data.csv file.")
    
    # Create weekly date range matching freight data
    date_range = pd.date_range(start=start_date, end=end_date, freq='W-FRI')  # Weekly on Fridays
    
    # Create synthetic disruption metrics
    np.random.seed(42)  # For reproducibility
    
    df_gdelt = pd.DataFrame({
        'conflict_events': np.random.poisson(5, len(date_range)),  # Poisson distribution for event counts
        'severe_incidents': np.random.poisson(2, len(date_range)),
        'avg_tone': np.random.randn(len(date_range)) * 2 - 1,  # Negative tone indicates conflict
        'event_density': np.random.uniform(0.1, 0.9, len(date_range)),
        'volatility_index': np.random.uniform(0, 100, len(date_range))
    }, index=date_range)
    
    print(f"Success: Created {len(df_gdelt)} weeks of synthetic geopolitical disruption data")
    print(f"  Date range: {df_gdelt.index.min().strftime('%Y-%m-%d')} to {df_gdelt.index.max().strftime('%Y-%m-%d')}")
    print(f"  Columns: {df_gdelt.columns.tolist()}")
    print("\nFirst 5 rows of synthetic GDELT data:")
    print(df_gdelt.head())
    print("\nWarning: This is SYNTHETIC data. Replace with real gdelt_disruption_data.csv for actual predictions.")

print("\n" + "=" * 70)
print("GEOPOLITICAL DATA LOADING COMPLETE")
print("=" * 70)
print(f"Total weeks: {len(df_gdelt)}")
print(f"Data source: {'GDELT BigQuery' if not df_gdelt.empty else 'Synthetic'}")
print("=" * 70)

LOADING GEOPOLITICAL DISRUPTION DATA

--- Loading GDELT BigQuery data ---
Error: The file 'data/gdelt_disruption_data.csv' was not found.
Please ensure the gdelt_disruption_data.csv file is in the data folder.

Expected file structure:
  - week_start: Date column (YYYY-MM-DD format)
  - Various disruption metrics (e.g., conflict_events, severe_incidents, avg_tone, etc.)

--- Fallback: Creating synthetic placeholder data ---
This should only be used for testing. For production, obtain the gdelt_disruption_data.csv file.
Success: Created 399 weeks of synthetic geopolitical disruption data
  Date range: 2018-01-05 to 2025-08-22
  Columns: ['conflict_events', 'severe_incidents', 'avg_tone', 'event_density', 'volatility_index']

First 5 rows of synthetic GDELT data:
            conflict_events  severe_incidents  avg_tone  event_density  \
2018-01-05                5                 3 -3.583051       0.239718   
2018-01-12                4                 2 -0.418294       0.336746   
2018-0

## Step 8: Save all collected data

Now that we have loaded and cleaned all our data sources, let's save them to CSV files for use in the next phase of the project.

In [8]:
print("=" * 70)
print("SAVING COLLECTED DATA")
print("=" * 70)

# Create output directory if it doesn't exist
import os
os.makedirs('data/processed', exist_ok=True)

# Save freight data
freight_output = 'data/processed/freight_data.csv'
df_freight.to_csv(freight_output)
print(f"\n✓ Saved freight data to {freight_output}")
print(f"  Rows: {len(df_freight)}, Columns: {len(df_freight.columns)}")

# Save WTI oil data
if not df_oil_wti.empty:
    wti_output = 'data/processed/oil_wti.csv'
    df_oil_wti.to_csv(wti_output)
    print(f"\n✓ Saved WTI oil data to {wti_output}")
    print(f"  Rows: {len(df_oil_wti)}, Columns: {len(df_oil_wti.columns)}")

# Save Brent oil data
if not df_oil_brent.empty:
    brent_output = 'data/processed/oil_brent.csv'
    df_oil_brent.to_csv(brent_output)
    print(f"\n✓ Saved Brent oil data to {brent_output}")
    print(f"  Rows: {len(df_oil_brent)}, Columns: {len(df_oil_brent.columns)}")

# Save geopolitical data
if not df_gdelt.empty:
    gdelt_output = 'data/processed/geopolitical_data.csv'
    df_gdelt.to_csv(gdelt_output)
    print(f"\n✓ Saved geopolitical data to {gdelt_output}")
    print(f"  Rows: {len(df_gdelt)}, Columns: {len(df_gdelt.columns)}")

print("\n" + "=" * 70)
print("DATA COLLECTION COMPLETE!")
print("=" * 70)
print("\nAll data has been collected and saved to the data/processed folder.")
print("Next step: Data preprocessing and feature engineering.")

SAVING COLLECTED DATA

✓ Saved freight data to data/processed/freight_data.csv
  Rows: 385, Columns: 2

✓ Saved Brent oil data to data/processed/oil_brent.csv
  Rows: 399, Columns: 1

✓ Saved geopolitical data to data/processed/geopolitical_data.csv
  Rows: 399, Columns: 5

DATA COLLECTION COMPLETE!

All data has been collected and saved to the data/processed folder.
Next step: Data preprocessing and feature engineering.
