# Data Collection for Europe Base Port Container Price Prediction

This notebook focuses on gathering all the raw data needed for our 1-week ahead container price forecasting project. We will collect data from three different sources and save it for later processing.

## What is data collection?

Data collection is the first step in any data science project. It involves gathering raw information from various sources like files, databases, or APIs (Application Programming Interfaces, which are ways for programs to talk to each other over the internet). Think of it like gathering ingredients before cooking a meal.

## What are we predicting?

**Target**: Europe Base Port container prices (1 week ahead)

**Base Ports Definition**: Average shipping cost for a 40-foot container from Shanghai/China to major European ports including Rotterdam (Netherlands), Hamburg (Germany), London (UK), and Antwerp (Belgium).

**Why these ports**: These represent the main entry points for Asian goods into Europe and provide a standard benchmark for European route pricing.

## Our data sources:

1. Shanghai Containerized Freight Index (local CSV file) - Main price data
2. Oil prices (from Yahoo Finance API Hopefully) - Cost factor affecting shipping / (Not working looking for work around)
3. Geopolitical disruption data (from GDELT via BigQuery) - Black swan event indicators

**Note**: We use GDELT data exported from Google BigQuery for historical coverage (2018-2025). This provides weekly disruption metrics including conflict events, severe incidents, and sentiment analysis for shipping-critical regions.

In [2]:
# Import required libraries
import pandas as pd
import numpy as np
import yfinance as yf
import requests
import time
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings('ignore')

print("Libraries imported successfully!")

Libraries imported successfully!


## Step 1: Load the Shanghai Containerized Freight Index

This is our main dataset. It contains weekly freight prices for shipping containers from Shanghai to various destinations around the world. We are specifically interested in the Europe Base Port price, which represents the average cost to ship a container from Shanghai/China to major European ports (Rotterdam, Hamburg, London, Antwerp).

In [3]:
try:
    # Load the raw CSV file
    # header=1 means the actual column names are in the second row (row 1, counting from 0)
    raw_df = pd.read_csv('Shanghai_Containerized_Freight_Index.csv', header=1)
    print(f"Successfully loaded the CSV file with {raw_df.shape[0]} rows and {raw_df.shape[1]} columns")
    
    # Display the first few rows to see what the data looks like
    print("\nFirst 5 rows of the raw data:")
    print(raw_df.head())
    
except FileNotFoundError:
    print("Error: The file 'Shanghai_Containerized_Freight_Index.csv' was not found.")
    print("Please make sure it is in the same folder as this notebook.")

Successfully loaded the CSV file with 385 rows and 5 columns

First 5 rows of the raw data:
  the period (YYYY-MM-DD)  Comprehensive Index  Europe (Base port)  \
0                1/5/2018               816.58                 888   
1               1/12/2018               839.72                 897   
2               1/19/2018               840.36                 891   
3               1/26/2018               858.60                 907   
4                2/2/2018               883.59                 912   

   Mediterranean (Base port)  Persian Gulf and Red Sea (Dubai)  
0                        738                               433  
1                        759                               450  
2                        761                               572  
3                        772                               631  
4                        797                               611  


## Step 2: Select and rename the columns we need

The dataset has many columns for different routes, but we only need a few for our Europe Base Port prediction. We will select the date column, the overall freight index (SCFI), and the Europe Base Port price.

### Why select only these columns?

Europe Base Port: Our prediction target - average cost from Shanghai to Rotterdam/Hamburg/London/Antwerp

SCFI_Index: Overall freight market indicator (note: highly correlated with Europe prices, so we may exclude it from final models)

Date: Needed for time series analysis and chronological ordering

In [4]:
# Select only the columns we need
df_freight = raw_df[['the period (YYYY-MM-DD)', 'Comprehensive Index', 'Europe (Base port)']].copy()

# Rename columns to simpler names
df_freight.columns = ['Date', 'SCFI_Index', 'Europe_Base_Price']

print("Selected and renamed columns:")
print(df_freight.columns.tolist())
print(f"\nDataset now has {df_freight.shape[0]} rows and {df_freight.shape[1]} columns")
print("\nFirst 5 rows:")
print(df_freight.head())

Selected and renamed columns:
['Date', 'SCFI_Index', 'Europe_Base_Price']

Dataset now has 385 rows and 3 columns

First 5 rows:
        Date  SCFI_Index  Europe_Base_Price
0   1/5/2018      816.58                888
1  1/12/2018      839.72                897
2  1/19/2018      840.36                891
3  1/26/2018      858.60                907
4   2/2/2018      883.59                912


## Step 3: Convert date strings to proper date format

Right now, the Date column is just text. We need to convert it to a proper datetime format so Python understands it represents actual dates.

### What is datetime?

datetime is a special data type in Python that represents dates and times. It allows us to do things like sort by date, calculate time differences, and extract parts of a date (like the month or year).

In [None]:
df_freight['Date'] = pd.to_datetime(df_freight['Date'], format='%m/%d/%Y', errors='coerce')

# Check how many dates were successfully converted
valid_dates = df_freight['Date'].notna().sum()
total_rows = len(df_freight)
print(f"Successfully converted {valid_dates} out of {total_rows} dates")

# Remove rows where date conversion failed
df_freight.dropna(subset=['Date'], inplace=True)
print(f"After removing invalid dates: {len(df_freight)} rows remaining")

# Set the Date column as the index (the row identifier)
df_freight.set_index('Date', inplace=True)

print("\nDate conversion complete!")
print(df_freight.head())

Successfully converted 385 out of 385 dates
After removing invalid dates: 385 rows remaining

Date conversion complete!
            SCFI_Index  Europe_Base_Price
Date                                     
2018-01-05      816.58                888
2018-01-12      839.72                897
2018-01-19      840.36                891
2018-01-26      858.60                907
2018-02-02      883.59                912


## Step 4: Convert price columns to numbers

Sometimes data is read as text even when it represents numbers. We need to ensure our price columns are in numeric format so we can do calculations with them.

### What is numeric conversion?

This process takes text that looks like numbers (like "123.45") and converts it to actual numbers that Python can use for math operations.

In [6]:
# Convert price columns to numeric format
for col in ['SCFI_Index', 'Europe_Base_Price']:
    df_freight[col] = pd.to_numeric(df_freight[col], errors='coerce')
    print(f"Converted {col} to numeric type")

# Remove any rows with missing values
# This ensures we have complete data for all rows
before_drop = len(df_freight)
df_freight.dropna(inplace=True)
after_drop = len(df_freight)

print(f"\nRemoved {before_drop - after_drop} rows with missing values")
print(f"Final freight dataset: {after_drop} rows")
print(f"Date range: {df_freight.index.min().strftime('%Y-%m-%d')} to {df_freight.index.max().strftime('%Y-%m-%d')}")

print("\nFinal cleaned freight data:")
print(df_freight.head())

Converted SCFI_Index to numeric type
Converted Europe_Base_Price to numeric type

Removed 0 rows with missing values
Final freight dataset: 385 rows
Date range: 2018-01-05 to 2025-08-22

Final cleaned freight data:
            SCFI_Index  Europe_Base_Price
Date                                     
2018-01-05      816.58                888
2018-01-12      839.72                897
2018-01-19      840.36                891
2018-01-26      858.60                907
2018-02-02      883.59                912


## Step 5: Fetch oil price data

Oil prices affect shipping costs since ships use fuel. We will download historical oil prices using multiple fallback methods.

### Option 1: Yahoo Finance API (Primary)

We try to fetch data from Yahoo Finance using the `yfinance` library. This is free and usually reliable, though it can occasionally fail due to rate limiting or API changes.

**Tickers tried:**
- **USO**: United States Oil Fund ETF (most reliable)
- **CL=F**: WTI Crude Oil Futures
- **BZ=F**: Brent Crude Oil Futures  
- **XLE**: Energy Select Sector SPDR Fund (energy sector)

### Option 2: Manual CSV File (Backup)

If Yahoo Finance fails, you can provide your own oil price data:

1. **Download oil prices** from one of these sources:
   - **EIA (U.S. Energy Information Administration)**: https://www.eia.gov/dnav/pet/hist/RWTCD.htm
     - Free, official U.S. government data
     - Download as Excel/CSV and save as `oil_prices_manual.csv`
   - **FRED (Federal Reserve Economic Data)**: https://fred.stlouisfed.org/series/DCOILWTICO
     - Free historical WTI crude oil prices
     - Click "Download" → CSV format
   - **Quandl/Nasdaq Data Link**: https://data.nasdaq.com/
     - Free tier available with API key
     
2. **Format the CSV** with two columns:
   ```
   Date,Price
   2018-01-01,60.37
   2018-01-02,61.44
   ...
   ```
   
3. **Save as** `oil_prices_manual.csv` in the same directory as this notebook

### Option 3: Synthetic Data (Last Resort)

If no real data is available, we create synthetic oil prices for testing purposes only. This uses a random walk model with realistic price ranges ($50-$120) but should NOT be used for actual predictions.

### Why Brent/WTI Crude Oil?

Brent Crude and WTI (West Texas Intermediate) are major oil benchmarks used for global pricing. Their price movements indicate changes in shipping fuel costs.

In [7]:
print("=" * 70)
print("FETCHING OIL/ENERGY PRICE DATA")
print("=" * 70)

# Get the date range from our freight data
start_date = df_freight.index.min().strftime('%Y-%m-%d')
end_date = df_freight.index.max().strftime('%Y-%m-%d')

print(f"\nRequesting data from {start_date} to {end_date}")

df_oil = pd.DataFrame()

# Option 1: Try Yahoo Finance first (most reliable when it works)
print("\n--- Option 1: Trying Yahoo Finance API ---")

oil_tickers = [
    ('USO', 'United States Oil Fund ETF'),  # Oil ETF - most reliable
    ('CL=F', 'WTI Crude Oil Futures'),  # WTI Futures
    ('BZ=F', 'Brent Crude Oil Futures'),  # Brent Futures
    ('XLE', 'Energy Select Sector SPDR Fund'),  # Energy sector ETF
]

for ticker, name in oil_tickers:
    print(f"\nTrying {name} ({ticker})...")
    
    try:
        # Download the oil price data with a longer timeout
        temp_df = yf.download(ticker, start=start_date, end=end_date, progress=False, timeout=15)

        if not temp_df.empty and len(temp_df) > 10:  # Need at least some data
            # Sometimes yfinance returns data with multiple column levels, we need to flatten it
            if temp_df.columns.nlevels > 1:
                temp_df.columns = temp_df.columns.droplevel(1)
            
            # Keep only the closing price and rename it
            df_oil = temp_df[['Close']].rename(columns={'Close': 'Oil_Price'})
            
            print(f"✓ Successfully fetched {len(df_oil)} days of oil price data from {name}")
            print(f"  Date range: {df_oil.index.min().strftime('%Y-%m-%d')} to {df_oil.index.max().strftime('%Y-%m-%d')}")
            print(f"  Price range: ${df_oil['Oil_Price'].min():.2f} to ${df_oil['Oil_Price'].max():.2f}")
            print(f"  Average price: ${df_oil['Oil_Price'].mean():.2f}")
            print("\nFirst 5 rows of oil data:")
            print(df_oil.head())
            break  # Success! Exit the loop
        else:
            print(f"✗ No data or insufficient data returned for {ticker}")
            
    except Exception as e:
        print(f"✗ Error with {ticker}: {str(e)[:150]}")
        continue

# Option 2: Try loading from a manual CSV file if API failed
if df_oil.empty:
    print("\n--- Option 2: Trying manual CSV file ---")
    try:
        # Check if user has provided a manual oil price CSV
        oil_csv_path = 'oil_prices_manual.csv'
        df_oil = pd.read_csv(oil_csv_path, parse_dates=['Date'], index_col='Date')
        
        # Filter to our date range
        df_oil = df_oil[(df_oil.index >= start_date) & (df_oil.index <= end_date)]
        
        if not df_oil.empty:
            # Rename to standard column name
            if 'Price' in df_oil.columns:
                df_oil = df_oil[['Price']].rename(columns={'Price': 'Oil_Price'})
            elif 'Close' in df_oil.columns:
                df_oil = df_oil[['Close']].rename(columns={'Close': 'Oil_Price'})
            
            print(f"✓ Loaded {len(df_oil)} days from manual CSV: {oil_csv_path}")
            print(f"  Date range: {df_oil.index.min().strftime('%Y-%m-%d')} to {df_oil.index.max().strftime('%Y-%m-%d')}")
            print(f"  Price range: ${df_oil['Oil_Price'].min():.2f} to ${df_oil['Oil_Price'].max():.2f}")
        else:
            print(f"✗ CSV file found but no data in date range")
            
    except FileNotFoundError:
        print(f"✗ Manual CSV file not found: {oil_csv_path}")
    except Exception as e:
        print(f"✗ Error loading manual CSV: {str(e)[:100]}")

# Option 3: Create synthetic oil price data based on historical patterns (last resort)
if df_oil.empty:
    print("\n--- Option 3: Creating synthetic placeholder data ---")
    print("⚠ WARNING: No real oil data available. Creating synthetic data for demonstration.")
    print("This should only be used for testing. For production, obtain real oil price data.")
    
    # Create date range matching freight data
    date_range = pd.date_range(start=start_date, end=end_date, freq='D')
    
    # Create synthetic oil prices with realistic values and volatility
    # Base price around $70-80 with random walk
    np.random.seed(42)  # For reproducibility
    base_price = 75.0
    random_walk = np.random.randn(len(date_range)).cumsum() * 2  # Random walk with std=2
    synthetic_prices = base_price + random_walk
    
    # Clip to reasonable range (50-120)
    synthetic_prices = np.clip(synthetic_prices, 50, 120)
    
    df_oil = pd.DataFrame({
        'Oil_Price': synthetic_prices
    }, index=date_range)
    
    print(f"✓ Created {len(df_oil)} days of synthetic oil price data")
    print(f"  Date range: {df_oil.index.min().strftime('%Y-%m-%d')} to {df_oil.index.max().strftime('%Y-%m-%d')}")
    print(f"  Price range: ${df_oil['Oil_Price'].min():.2f} to ${df_oil['Oil_Price'].max():.2f}")
    print(f"  Average: ${df_oil['Oil_Price'].mean():.2f}")
    print("\n⚠ Remember: This is SYNTHETIC data. Replace with real data for actual predictions.")

print("\n" + "=" * 70)
print("OIL DATA COLLECTION COMPLETE")
print("=" * 70)
print(f"Total days: {len(df_oil)}")
print(f"Data source: {'Yahoo Finance' if not df_oil.empty and 'synthetic' not in str(df_oil.index.name) else 'Synthetic/Manual'}")
print("=" * 70)

FETCHING OIL/ENERGY PRICE DATA

Requesting data from 2018-01-05 to 2025-08-22

--- Option 1: Trying Yahoo Finance API ---

Trying United States Oil Fund ETF (USO)...


Failed to get ticker 'USO' reason: Expecting value: line 1 column 1 (char 0)

1 Failed download:
['USO']: YFTzMissingError('$%ticker%: possibly delisted; No timezone found')


✗ No data or insufficient data returned for USO

Trying WTI Crude Oil Futures (CL=F)...


Failed to get ticker 'CL=F' reason: Expecting value: line 1 column 1 (char 0)

1 Failed download:
['CL=F']: YFTzMissingError('$%ticker%: possibly delisted; No timezone found')


✗ No data or insufficient data returned for CL=F

Trying Brent Crude Oil Futures (BZ=F)...


Failed to get ticker 'BZ=F' reason: Expecting value: line 1 column 1 (char 0)

1 Failed download:
['BZ=F']: YFTzMissingError('$%ticker%: possibly delisted; No timezone found')


✗ No data or insufficient data returned for BZ=F

Trying Energy Select Sector SPDR Fund (XLE)...


Failed to get ticker 'XLE' reason: Expecting value: line 1 column 1 (char 0)

1 Failed download:
['XLE']: YFTzMissingError('$%ticker%: possibly delisted; No timezone found')


✗ No data or insufficient data returned for XLE

--- Option 2: Trying manual CSV file ---
✗ Manual CSV file not found: oil_prices_manual.csv

--- Option 3: Creating synthetic placeholder data ---
This should only be used for testing. For production, obtain real oil price data.
✓ Created 2787 days of synthetic oil price data
  Date range: 2018-01-05 to 2025-08-22
  Price range: $50.00 to $120.00
  Average: $103.58

⚠ Remember: This is SYNTHETIC data. Replace with real data for actual predictions.

OIL DATA COLLECTION COMPLETE
Total days: 2787
Data source: Yahoo Finance


## Step 6: Load geopolitical disruption data from GDELT BigQuery Export

Major geopolitical events like port blockages, conflicts in shipping routes, or regional instability can cause sudden price changes (sometimes called "black swan" events). We use GDELT (Global Database of Events, Language, and Tone) data to track these disruptions.

### Why BigQuery Instead of GDELT Library?

**Previous Approach Issues:**
- The `gdeltPyR` library downloads entire daily GKG files (millions of rows) before filtering
- Rate limiting issues with GDELT's free API when making many requests
- Very slow performance due to excessive data transfer

**Current Solution: BigQuery Export**

We pre-processed GDELT data using Google BigQuery SQL queries to:
1. **Server-side filtering**: Only extract events from shipping-critical regions (Egypt, Yemen, China, Singapore, Netherlands, Germany, Iran, Israel, Taiwan, Ukraine, Russia)
2. **Aggregate to weekly level**: Match the SCFI freight data frequency (Fridays)
3. **Focus on disruptions**: Filter for conflicts (QuadClass 3,4) and severe negative events (GoldsteinScale < -3)
4. **Manageable size**: 403 weekly records (2018-2025) instead of millions of daily events

### BigQuery Data Structure

The exported CSV contains these weekly metrics:
- `iso_year`, `week`: ISO 8601 week numbering
- `total_events`: All events in shipping-critical regions
- `conflict_events`: Military/violent events (QuadClass 3,4)
- `severe_events`: Events with strong negative impact (GoldsteinScale < -3)
- `avg_impact`: Average GoldsteinScale (impact severity)
- `avg_sentiment`: Average media tone
- `total_media_mentions`: Volume of news coverage

### How This Helps Our Black Swan Detection

Our model will learn: *"When conflict events spike and media mentions surge in shipping-critical regions, prices usually increase within 1-2 weeks"*

We create a composite `disruption_index` that weighs:
- Severe events (weight 3.0): Strong price impact
- Conflict events (weight 2.0): Moderate price impact  
- Media mentions (weight 1.0): Market attention signal

In [8]:
print("=" * 70)
print("LOADING GDELT DISRUPTION DATA FROM BIGQUERY EXPORT")
print("=" * 70)

# Load the BigQuery export CSV (updated filename)
gdelt_csv_file = 'world_events_dt.csv'

try:
    print(f"\nLoading BigQuery export: {gdelt_csv_file}")
    df_gdelt = pd.read_csv(gdelt_csv_file)
    
    print(f"✓ Loaded {len(df_gdelt)} weekly records")
    print(f"\nColumns: {df_gdelt.columns.tolist()}")
    print(f"\nFirst few rows:")
    print(df_gdelt.head())
    
    # Convert ISO year + week to actual Friday dates
    print("\n" + "=" * 70)
    print("Converting ISO week numbers to Friday dates...")
    print("=" * 70)
    
    # Handle ISO week edge cases (week 0 and week 53)
    def iso_week_to_friday(row):
        """Convert ISO year + week to the Friday of that week"""
        year = int(row['iso_year'])
        week = int(row['week'])
        
        # Handle week 0 (some years start in previous ISO year)
        if week == 0:
            # Week 0 is actually the last week of the previous year
            # Use the last Friday of the previous year
            last_day_prev_year = pd.Timestamp(year=year-1, month=12, day=31)
            days_from_friday = (last_day_prev_year.dayofweek - 4) % 7
            return last_day_prev_year - pd.Timedelta(days=days_from_friday)
        
        # Handle week 53 (some years have 53 ISO weeks)
        if week == 53:
            # Check if this year actually has 53 weeks
            last_day = pd.Timestamp(year=year, month=12, day=31)
            iso_calendar = last_day.isocalendar()
            if iso_calendar[1] == 53:
                # Year has 53 weeks, use last Friday
                days_from_friday = (last_day.dayofweek - 4) % 7
                return last_day - pd.Timedelta(days=days_from_friday)
            else:
                # This shouldn't happen, but if week 53 doesn't exist, use week 52
                return pd.to_datetime(f'{year}-W52-5', format='%Y-W%W-%w')
        
        # Normal weeks (1-52)
        try:
            # %Y-W%W-%w format: year-Week-weekday (5=Friday)
            date_str = f'{year}-W{week:02d}-5'
            return pd.to_datetime(date_str, format='%Y-W%W-%w')
        except:
            # Fallback for any parsing issues
            print(f"  Warning: Could not parse year={year}, week={week}. Using approximate date.")
            # Approximate: year start + (week * 7 days) to get to Friday
            jan_1 = pd.Timestamp(year=year, month=1, day=1)
            days_to_add = (week * 7) + (4 - jan_1.dayofweek) % 7
            return jan_1 + pd.Timedelta(days=days_to_add)
    
    # Apply the conversion
    df_gdelt['date'] = df_gdelt.apply(iso_week_to_friday, axis=1)
    
    # Check for any failed conversions
    null_dates = df_gdelt['date'].isna().sum()
    if null_dates > 0:
        print(f"⚠ Warning: {null_dates} dates could not be converted and will be dropped")
        df_gdelt = df_gdelt.dropna(subset=['date'])
    
    print(f"✓ Converted {len(df_gdelt)} weeks to Friday dates")
    print(f"  Date range: {df_gdelt['date'].min().strftime('%Y-%m-%d')} to {df_gdelt['date'].max().strftime('%Y-%m-%d')}")
    
    # Create composite disruption index
    print("\n" + "=" * 70)
    print("Creating disruption index...")
    print("=" * 70)
    
    # Normalize and weight different disruption signals
    # Severe events have highest weight (3.0), conflicts medium (2.0), media mentions lower (1.0)
    df_gdelt['disruption_index'] = (
        (df_gdelt['severe_events'] / 1000) * 3.0 +
        (df_gdelt['conflict_events'] / 1000) * 2.0 +
        (df_gdelt['total_media_mentions'] / 100000) * 1.0
    )
    
    print("Disruption index formula:")
    print("  = (severe_events/1000) * 3.0")
    print("  + (conflict_events/1000) * 2.0")  
    print("  + (media_mentions/100000) * 1.0")
    print(f"\nDisruption index statistics:")
    print(f"  Mean: {df_gdelt['disruption_index'].mean():.3f}")
    print(f"  Std:  {df_gdelt['disruption_index'].std():.3f}")
    print(f"  Min:  {df_gdelt['disruption_index'].min():.3f}")
    print(f"  Max:  {df_gdelt['disruption_index'].max():.3f}")
    
    # Prepare final dataframe with renamed columns
    df_news = df_gdelt[['date', 'disruption_index', 'avg_sentiment', 'conflict_events', 
                         'severe_events', 'total_media_mentions']].copy()
    
    # Rename columns to match expected format
    df_news.columns = ['date', 'disruption_index', 'tone', 'conflict_count', 
                       'severe_event_count', 'media_mentions']
    
    # Set date as index
    df_news.set_index('date', inplace=True)
    df_news.sort_index(inplace=True)
    
    print("\n" + "=" * 70)
    print("✅ GDELT DATA LOADED SUCCESSFULLY")
    print("=" * 70)
    print(f"Total weeks: {len(df_news)}")
    print(f"Date range: {df_news.index.min().strftime('%Y-%m-%d')} to {df_news.index.max().strftime('%Y-%m-%d')}")
    print(f"\nFeatures available:")
    print(f"  - disruption_index: Composite disruption score")
    print(f"  - tone: Average sentiment ({df_news['tone'].mean():.2f} avg)")
    print(f"  - conflict_count: Military/violent events ({df_news['conflict_count'].sum():.0f} total)")
    print(f"  - severe_event_count: High-impact events ({df_news['severe_event_count'].sum():.0f} total)")
    print(f"  - media_mentions: News coverage volume ({df_news['media_mentions'].sum():.0f} total)")
    
    print(f"\nSample data:")
    print(df_news.head(10))
    
    # Find major disruption events for validation
    print(f"\n" + "=" * 70)
    print("TOP 10 DISRUPTION WEEKS (for validation):")
    print("=" * 70)
    top_disruptions = df_news.nlargest(10, 'disruption_index')[['disruption_index', 'conflict_count', 'severe_event_count', 'media_mentions']]
    for date, row in top_disruptions.iterrows():
        print(f"{date.strftime('%Y-%m-%d')}: Disruption={row['disruption_index']:.2f}, "
              f"Conflicts={row['conflict_count']:.0f}, Severe={row['severe_event_count']:.0f}, "
              f"Media={row['media_mentions']:.0f}")
    
except FileNotFoundError:
    print(f"\n🚨 ERROR: Could not find {gdelt_csv_file}")
    print("Please ensure the BigQuery export CSV is in the same directory as this notebook.")
    print("Creating empty DataFrame...")
    df_news = pd.DataFrame()
    
except Exception as e:
    print(f"\n🚨 ERROR loading GDELT data: {type(e).__name__}: {e}")
    import traceback
    print(traceback.format_exc())
    print("Creating empty DataFrame...")
    df_news = pd.DataFrame()

print("\n" + "=" * 70)

LOADING GDELT DISRUPTION DATA FROM BIGQUERY EXPORT

Loading BigQuery export: world_events_dt.csv
✓ Loaded 412 weekly records

Columns: ['iso_year', 'week', 'total_events', 'conflict_events', 'severe_events', 'avg_impact', 'avg_sentiment', 'total_media_mentions', 'middle_east_events', 'middle_east_impact', 'middle_east_media', 'asia_events', 'asia_impact', 'europe_events', 'europe_impact', 'ukraine_russia_events', 'ukraine_russia_impact', 'military_conflict_events', 'protest_events', 'trade_restriction_events', 'cooperation_breakdown_events', 'extreme_crisis_events', 'worst_event_impact', 'peak_media_attention', 'media_attention_volatility', 'egypt_events', 'egypt_impact', 'yemen_events', 'yemen_impact']

First few rows:
   iso_year  week  total_events  conflict_events  severe_events  avg_impact  \
0      2018     0         66868            66868          28034   -5.699991   
1      2018     1         74501            74501          31761   -5.900558   
2      2018     2         70590  

## Step 7: Save the collected data

We will save all three datasets as CSV files so we can use them in the next notebooks without having to fetch the data again.

In [9]:
# Save freight data
df_freight.to_csv('collected_freight_data.csv')
print("Saved freight data to 'collected_freight_data.csv'")
print(f"  {len(df_freight)} weekly records")

# Save oil data if we have it
if not df_oil.empty:
    df_oil.to_csv('collected_oil_data.csv')
    print("Saved oil price data to 'collected_oil_data.csv'")
    print(f"  {len(df_oil)} daily records")
else:
    print("No oil data to save (Yahoo Finance may be temporarily unavailable)")

# Save news/disruption data if we have it
if not df_news.empty:
    df_news.to_csv('collected_news_data.csv')
    print("Saved GDELT disruption data to 'collected_news_data.csv'")
    print(f"  {len(df_news)} weekly records")
    print(f"  Features: {', '.join(df_news.columns.tolist())}")
else:
    print("No GDELT disruption data to save")

print("\n" + "=" * 70)
print("DATA COLLECTION COMPLETE!")
print("=" * 70)
print("\nSummary of collected data:")
print(f"✓ Freight data: {len(df_freight)} weeks ({df_freight.index.min().strftime('%Y-%m-%d')} to {df_freight.index.max().strftime('%Y-%m-%d')})")
print(f"✓ Oil data: {len(df_oil) if not df_oil.empty else 0} days")
if not df_news.empty:
    print(f"✓ Disruption data: {len(df_news)} weeks")
    print(f"  - Total conflict events: {df_news['conflict_count'].sum():.0f}")
    print(f"  - Total severe events: {df_news['severe_event_count'].sum():.0f}")
    print(f"  - Average disruption index: {df_news['disruption_index'].mean():.3f}")
else:
    print(f"✗ Disruption data: 0 records")

print("\nReady for Step 2: Data Understanding")
print("=" * 70)

Saved freight data to 'collected_freight_data.csv'
  385 weekly records
Saved oil price data to 'collected_oil_data.csv'
  2787 daily records
Saved GDELT disruption data to 'collected_news_data.csv'
  412 weekly records
  Features: disruption_index, tone, conflict_count, severe_event_count, media_mentions

DATA COLLECTION COMPLETE!

Summary of collected data:
✓ Freight data: 385 weeks (2018-01-05 to 2025-08-22)
✓ Oil data: 2787 days
✓ Disruption data: 412 weeks
  - Total conflict events: 28908617
  - Total severe events: 12912294
  - Average disruption index: 237.885

Ready for Step 2: Data Understanding
