# Port Event Identification Analysis

## Project Overview

This notebook focuses on **identifying and filtering GDELT events that occur near maritime ports** using geospatial distance calculations. The goal is to establish distance-based parameters to accurately identify port-related events for downstream predictive analysis.

### Key Objectives:
1. **Geospatial Analysis**: Calculate distances between GDELT events and port locations
2. **Distance Parameter Tuning**: Determine optimal radius to capture port-related events
3. **Event Filtering**: Focus on conflict/protest events (CAMEO codes 14-22) that may disrupt operations
4. **Validation**: Use Vancouver, Canada as a test case for methodology development

### Use Case:
Maritime operators need to monitor events happening **near** ports (strikes, protests, conflicts) that could impact operations. This analysis determines how close is "close enough" to be relevant.

---

## 1. Setup: Libraries and Dependencies

Import geospatial calculation libraries alongside standard PySpark functions.

In [None]:
# ============================================
# LIBRARY IMPORTS
# ============================================

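# PySpark SQL functions for data transformation
# (explicit imports; a wildcard import would shadow Python built-ins such as sum, min, max, and round)
from pyspark.sql.functions import col, expr, lit, regexp_replace, substring, udf, when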
from pyspark.sql.types import DoubleType
from pyspark.sql.window import Window

# Geopy for accurate geospatial distance calculations
from geopy.distance import geodesic  # Uses geodesic (ellipsoidal) distance, more accurate than Haversine

# Date/time operations
import datetime

## 2. User-Defined Functions: Geospatial Distance

### Why Use Geopy's Geodesic?

While the Haversine formula (used in Country Exploration) assumes a perfect sphere, **geopy's geodesic function** uses the WGS-84 ellipsoidal model of Earth, providing:
- **Higher accuracy**: the spherical approximation can be off by up to roughly 0.5%
- **Tighter results at port scale**: 0.5% of 25 km is about 125 m, which matters for events sitting close to the threshold boundary
- **Standard reference model**: WGS-84 is the datum used by GPS and by maritime and aviation navigation
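
For comparison, the spherical baseline can be written in a few lines of standard-library Python. This is a minimal sketch of the Haversine formula, not the code from the Country Exploration notebook; the function name `haversine_km` and the mean-radius constant are assumptions made for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; treating Earth as a sphere is the approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points on a sphere."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# On a sphere, one degree of latitude has the same length at every latitude
one_degree = haversine_km(49.0, -123.0, 50.0, -123.0)
print(f"{one_degree:.3f} km per degree of latitude on the sphere")
```

On the WGS-84 ellipsoid the length of a degree of latitude varies slightly with latitude, which is the effect the geodesic calculation accounts for and the sphere ignores.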

### Distance Calculation Strategy:
We create a UDF (User-Defined Function) to apply geopy's calculation at scale across millions of GDELT events.
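
Python UDFs run row by row, so it can help to discard obviously distant events with a cheap coordinate comparison before calling the UDF. The sketch below computes a padded bounding box around a point; `bounding_box`, the 111 km-per-degree figure, and the 5% padding are illustrative assumptions, and the helper ignores edge cases near the poles and the antimeridian.

```python
import math

KM_PER_DEG_LAT = 111.0  # rough figure; the padding factor absorbs the error


def bounding_box(lat, lon, radius_km, pad=1.05):
    """Return (min_lat, max_lat, min_lon, max_lon) enclosing a circle of radius_km.

    Not valid near the poles or across the antimeridian.
    """
    dlat = radius_km * pad / KM_PER_DEG_LAT
    # Degrees of longitude shrink with the cosine of the latitude
    dlon = radius_km * pad / (KM_PER_DEG_LAT * math.cos(math.radians(lat)))
    return lat - dlat, lat + dlat, lon - dlon, lon + dlon


min_lat, max_lat, min_lon, max_lon = bounding_box(49.28, -123.12, 25)
print(min_lat, max_lat, min_lon, max_lon)
```

In Spark, these bounds could feed a filter such as `col("ActionGeo_Lat").between(min_lat, max_lat) & col("ActionGeo_Long").between(min_lon, max_lon)` ahead of the UDF call, so the exact geodesic distance is only computed for candidate rows.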

In [None]:
# ============================================
# UDF: GEODESIC DISTANCE CALCULATION
# ============================================

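def calculate_distance(lat1, lon1, lat2, lon2):
    """
    Calculate geodesic distance between two points using the WGS-84 ellipsoidal model.
    
    Unlike Haversine, this does not assume a spherical Earth. Rows with a missing or
    out-of-range coordinate return None (NULL in Spark) instead of failing the job.
    
    Parameters:
    -----------
    lat1, lon1 : float
        Latitude and longitude of first point (event location)
    lat2, lon2 : float
        Latitude and longitude of second point (port location)
    
    Returns:
    --------
    float or None
        Distance in kilometers, or None if the inputs are unusable
    """
    # GDELT events without a resolved location carry NULL coordinates
    if None in (lat1, lon1, lat2, lon2):
        return None
    try:
        return geodesic((lat1, lon1), (lat2, lon2)).kilometers
    except ValueError:
        # geopy rejects latitudes outside [-90, 90]
        return None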

# Register as Spark UDF for distributed processing
calculate_distance_udf = udf(calculate_distance, DoubleType())

# This UDF can now be applied to DataFrame columns at scale

## 3. Data Loading: Focused Event Extraction

### Event Selection Criteria:

Unlike the Country Exploration notebook, which analyzes all news, this analysis focuses on **actionable events** from the `GDELT_EVENTS` table:

#### Filters Applied:
1. **Geographic**: Canada (CA) - Using Vancouver as test case
2. **Temporal**: June-July 2023 - Period of known labor disputes
3. **Event Type**: CAMEO Root Codes 14-22 (Protest, Conflict, Violence)
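
The temporal filter depends on how `DATEADDED` is encoded. GDELT 1.0 stores it as `YYYYMMDD`, while GDELT 2.0 stores `YYYYMMDDHHMMSS`; which one `BRONZE.GDELT_EVENTS` holds is not stated in this notebook. A minimal sketch that accepts either form by reading only the first eight digits (the helper names are hypothetical):

```python
from datetime import date, datetime


def parse_dateadded(value):
    """Convert a GDELT DATEADDED value to a date; return None if it cannot be parsed."""
    text = str(value).strip()
    if len(text) < 8 or not text[:8].isdigit():
        return None
    try:
        # Only the leading YYYYMMDD part is used; any HHMMSS suffix is ignored
        return datetime.strptime(text[:8], "%Y%m%d").date()
    except ValueError:
        return None


# Study window: June 1 through July 31, 2023, inclusive
START, END = date(2023, 6, 1), date(2023, 7, 31)


def in_study_window(value):
    parsed = parse_dateadded(value)
    return parsed is not None and START <= parsed <= END


print(in_study_window(20230731), in_study_window(20230731120000), in_study_window(20230801))
```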

#### CAMEO Event Codes Selected:
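| Code | Category | Examples |
|------|----------|----------|
| 14 | Protest | Demonstrations, strikes, obstruction |
| 15 | Exhibit Force Posture | Military or police mobilization, alert status increases |
| 16 | Reduce Relations | Sanctions, expulsions |
| 17 | Coerce | Seizures, arrests, administrative sanctions |
| 19 | Fight | Armed conflicts, clashes |
| 20 | Unconventional Mass Violence | Large-scale violence, including chemical, biological, or radiological weapons |

**Note**: CAMEO defines root codes 01 through 20 only, so the values 21 and 22 in the query's `IN` list match no rows in standard GDELT data. "Material Conflict" is GDELT's QuadClass 4 rather than a root code. Root code 18 (Assault) is not selected here.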

These codes represent events that could **directly impact port operations**.
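
A lookup table keeps the code selection explicit and makes it easy to check membership regardless of whether `EventRootCode` arrives as a string or an integer. The sketch below uses the root-code labels from the CAMEO codebook, which defines root codes 01 through 20; the names `SELECTED_ROOT_CODES`, `normalize_root_code`, and `is_selected` are illustrative and not part of the notebook.

```python
# Root-code labels from the CAMEO codebook (root codes run from 01 to 20)
CAMEO_ROOT_LABELS = {
    "14": "Protest",
    "15": "Exhibit force posture",
    "16": "Reduce relations",
    "17": "Coerce",
    "18": "Assault",
    "19": "Fight",
    "20": "Use unconventional mass violence",
}

# The codes this notebook selects that exist in CAMEO (18 is not selected)
SELECTED_ROOT_CODES = {"14", "15", "16", "17", "19", "20"}


def normalize_root_code(code):
    """Return a two-character root code string, or None for missing input."""
    if code is None:
        return None
    return str(code).strip().zfill(2)


def is_selected(code):
    """True if the code is one of the selected conflict/protest root codes."""
    return normalize_root_code(code) in SELECTED_ROOT_CODES


print(is_selected(14), is_selected("18"), is_selected(None))
```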

In [None]:
# ============================================
# LOAD FILTERED EVENT DATA
# ============================================

# Load GDELT Events with targeted filters
GDELT_EVENTS = spark.sql("""
    SELECT *
    FROM BRONZE.GDELT_EVENTS
    WHERE 
        -- Geographic filter: Canada only (for Vancouver case study)
        ActionGeo_CountryCode = 'CA'
        
        -- Temporal filter: June 1 through July 31, 2023 (labor dispute period)
        AND TO_DATE(CAST(DATEADDED AS STRING), 'yyyyMMdd') BETWEEN '2023-06-01' AND '2023-07-31'
        
        -- Event type filter: Conflict, protest, and violence events only
        -- These are the events most likely to disrupt port operations
        AND EventRootCode IN (14, 15, 16, 17, 19, 20, 21, 22)
    
    ORDER BY DATEADDED DESC
""")

# Load port location reference data
PORT_LOCATIONS_DIM = spark.sql("SELECT * FROM BRONZE.PORTS_DICTIONARY")

# Load CAMEO dictionary for event code interpretations
CAMEO_DICTIONARY = spark.sql("SELECT * FROM BRONZE.CAMEO_DICTIONARY")

print(f"✓ Loaded {GDELT_EVENTS.count():,} relevant events for analysis")

## 4. Port Location Data Cleaning

**Note**: This cleaning process is identical to the Country Exploration notebook. 
See that notebook for detailed explanation of each transformation step.
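
The core of the cleaning step is turning a coordinate string with a trailing hemisphere letter into a signed number. A plain-Python sketch of that rule is shown here for reference; `parse_coordinate` and the sample strings are hypothetical, and the numeric part is assumed to already be in decimal degrees.

```python
def parse_coordinate(raw, positive, negative):
    """Turn a string such as '49.17 N' into a signed float, or None if malformed.

    `positive` and `negative` are the hemisphere letters for this axis:
    ("N", "S") for latitude, ("E", "W") for longitude.
    """
    if raw is None:
        return None
    text = raw.replace(" ", "").upper()
    if len(text) < 2:
        return None
    hemisphere, number = text[-1], text[:-1]
    if hemisphere not in (positive, negative):
        return None
    try:
        value = float(number)
    except ValueError:
        return None
    return value if hemisphere == positive else -value


lat = parse_coordinate("49.17 N", "N", "S")
lon = parse_coordinate("123.07 W", "E", "W")
print(lat, lon)
```

Returning `None` for an unrecognized hemisphere letter avoids a sentinel value that could later be mistaken for a real coordinate.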

In [None]:
# ============================================
# PORT LOCATIONS DATA CLEANING
# ============================================
# (Same cleaning logic as Country Exploration notebook)

PORT_LOCATIONS_DIM_CLEANED = (
    PORT_LOCATIONS_DIM
    .filter("LATITUDE IS NOT NULL")
    .filter("LONGITUDE IS NOT NULL")
    # Strip embedded spaces so the hemisphere letter is always the last character
    .withColumn("LATITUDE", regexp_replace(col("LATITUDE"), " ", ""))
    .withColumn("LONGITUDE", regexp_replace(col("LONGITUDE"), " ", ""))
    .withColumn("Lat_Ori", substring(col("LATITUDE"), -1, 1))
    .withColumn("Long_Ori", substring(col("LONGITUDE"), -1, 1))
    # Latitude: N is positive, S is negative; anything else gets the 999.999 sentinel
    .withColumn("LATITUDE_CORRECTED",
        when(col("Lat_Ori") == 'S', expr("substring(LATITUDE, 1, length(LATITUDE) - 1)").cast("double") * -1)
        .when(col("Lat_Ori") == 'N', expr("substring(LATITUDE, 1, length(LATITUDE) - 1)").cast("double"))
        .otherwise(999.999)
    )
    # Longitude: E is positive, W is negative; anything else gets the 999.999 sentinel
    .withColumn("LONGITUDE_CORRECTED",
        when(col("Long_Ori") == 'E', expr("substring(LONGITUDE, 1, length(LONGITUDE) - 1)").cast("double"))
        .when(col("Long_Ori") == 'W', expr("substring(LONGITUDE, 1, length(LONGITUDE) - 1)").cast("double") * -1)
        .otherwise(999.999)
    )
    .select("COUNTRY", "PORT", "LATITUDE_CORRECTED", "LONGITUDE_CORRECTED")
)

display(PORT_LOCATIONS_DIM_CLEANED.limit(5))

## 5. Case Study: Vancouver Port Analysis

### Why Vancouver?

Vancouver serves as an ideal test case because:
- **Major Transpacific port**: One of North America's largest container ports
- **Known disruptions**: Experienced labor strikes in June-July 2023
- **Well-documented**: Extensive news coverage allows validation of our distance parameters

### Methodology:
1. Extract Vancouver's precise coordinates from cleaned port data
2. Calculate distance from each GDELT event to Vancouver
3. Test various distance thresholds to determine optimal radius
4. Validate results against known events
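
Step 1 depends on how the port dictionary encodes positions. Port gazetteers often list degrees and minutes, so it is worth confirming whether a value written as 49.17 means 49.17° or 49°17′; the two differ by roughly 12 km of latitude. A minimal sketch of the conversion, assuming a hypothetical degrees-and-minutes encoding (the function name is illustrative):

```python
def degrees_minutes_to_decimal(degrees, minutes, hemisphere):
    """Convert degrees and minutes plus a hemisphere letter to signed decimal degrees."""
    if not 0 <= minutes < 60:
        raise ValueError("minutes must be in the range [0, 60)")
    value = degrees + minutes / 60.0
    # South and west are negative by convention
    return -value if hemisphere.upper() in ("S", "W") else value


lat = degrees_minutes_to_decimal(49, 17, "N")
lon = degrees_minutes_to_decimal(123, 7, "W")
print(round(lat, 4), round(lon, 4))  # 49.2833 -123.1167
```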

In [None]:
# ============================================
# EXTRACT VANCOUVER PORT COORDINATES
# ============================================

# Filter for Vancouver port from cleaned data
# Note: the trailing space in the port name comes from the source data
canada_port = PORT_LOCATIONS_DIM_CLEANED.filter("PORT = 'Vancouver, B.C., Canada '")
display(canada_port)

In [None]:
# ============================================
# DEFINE VANCOUVER COORDINATES
# ============================================

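# Vancouver Port coordinates in decimal degrees.
# Assumption: the source figures 49.17 / 123.07 are degrees and minutes
# (49°17′N, 123°07′W, the Port of Vancouver) rather than decimal degrees,
# so they are converted here. Confirm against the PORTS_DICTIONARY encoding.
vancouver_lat = 49 + 17 / 60      # 49°17′N, about 49.2833
vancouver_long = -(123 + 7 / 60)  # 123°07′W, about -123.1167 (negative for western hemisphere)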

# Apply distance calculation to all events
events_with_distance = GDELT_EVENTS.withColumn(
    "distance_to_vancouver",
    calculate_distance_udf(
        col("ActionGeo_Lat"),    # Event latitude
        col("ActionGeo_Long"),   # Event longitude
        lit(vancouver_lat),      # Vancouver latitude (constant)
        lit(vancouver_long)      # Vancouver longitude (constant)
    )
)

# withColumn is lazy: distances are only computed when an action (count, display) runs
print("✓ Distance column defined for all events")

## 6. Distance Parameter Analysis

### Determining Optimal Radius

**Question**: How close must an event be to be considered "port-related"?

#### Distance Threshold Testing:
- **25 km radius**: Captures immediate port area + surrounding industrial zones
- Rationale:
  - Port facilities: 0-5 km
  - Industrial/logistics zones: 5-15 km
  - Transportation corridors: 15-25 km
  - Labor union halls/staging areas: Within 25 km
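
The rationale bands above can be expressed as a small lookup, which is handy when reviewing which band each nearby event falls into. This is an illustrative sketch; `ZONES` and `zone_for` are hypothetical names, and the band edges are simply the ones listed in the rationale.

```python
# Upper edge of each band in km, paired with its label from the rationale above
ZONES = [
    (5.0, "port facilities"),
    (15.0, "industrial/logistics zone"),
    (25.0, "transportation corridor"),
]


def zone_for(distance_km):
    """Return the rationale band for a distance, or None if missing or beyond 25 km."""
    if distance_km is None or distance_km < 0:
        return None
    for upper_km, label in ZONES:
        if distance_km <= upper_km:
            return label
    return None


print(zone_for(3.2), "|", zone_for(12.0), "|", zone_for(40.0))
```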

### Analysis Approach:
1. Filter events within 25 km
2. Review event types and descriptions
3. Validate against known port disruptions
4. Adjust threshold if needed based on results
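
Step 4 is easier to judge when several candidate radii are compared side by side. The sketch below counts how many distances fall within each threshold; the function name is hypothetical and the sample distances are made up purely to show the mechanics, not taken from GDELT.

```python
def counts_within(distances_km, thresholds_km):
    """Map each threshold to the number of non-null distances at or below it."""
    valid = [d for d in distances_km if d is not None]
    return {t: sum(1 for d in valid if d <= t) for t in thresholds_km}


# Made-up distances for illustration only
sample = [1.2, 4.8, 9.5, 22.0, 31.0, 48.0, None, 410.0]
sweep = counts_within(sample, [10, 25, 50, 100])
print(sweep)  # {10: 3, 25: 4, 50: 6, 100: 6}
```

A radius after which the count stops growing quickly is one possible signal that the threshold has moved past the cluster of port-area events.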

In [None]:
# ============================================
# FILTER EVENTS NEAR VANCOUVER PORT
# ============================================

# Apply 25 km distance threshold
events_near_vancouver = events_with_distance.filter(col("distance_to_vancouver") <= 25)

# Count once each: every count() call re-runs the query, including the Python UDF
near_count = events_near_vancouver.count()
total_count = events_with_distance.count()
pct_captured = near_count / total_count * 100 if total_count else 0.0

print(f"Events within 25 km of Vancouver: {near_count:,}")
print(f"Total events analyzed: {total_count:,}")
print(f"Percentage captured: {pct_captured:.1f}%")

# Display filtered results
display(events_near_vancouver.orderBy("distance_to_vancouver"))

## 7. Full Event Distribution Analysis

Visualize the complete distance distribution to understand event patterns across all of Canada.
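
If the distances are collected to the driver, a simple binning function gives the same picture as a histogram chart. This is a sketch under the assumption of fixed-width bins; `distance_histogram` is a hypothetical helper and the sample values are invented for illustration.

```python
from collections import Counter


def distance_histogram(distances_km, bin_width_km=25):
    """Count distances per bin; each key is the lower edge of a bin in km."""
    bins = Counter()
    for d in distances_km:
        if d is None or d < 0:
            continue  # skip events without a usable distance
        bins[int(d // bin_width_km) * bin_width_km] += 1
    return dict(sorted(bins.items()))


# Invented distances for illustration only
sample = [3.0, 12.5, 24.9, 25.0, 60.0, 3350.0, None]
hist = distance_histogram(sample)
print(hist)  # {0: 3, 25: 1, 50: 1, 3350: 1}
```

Note that the bins are closed on the left, so a distance of exactly 25.0 lands in the 25 to 50 bin even though the `<= 25` filter would keep it.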

In [None]:
# ============================================
# ANALYZE COMPLETE DISTANCE DISTRIBUTION
# ============================================

# Display all events with distances for distribution analysis
display(
    events_with_distance
    .select(
        "DATEADDED",
        "EventRootCode",
        "ActionGeo_Lat",
        "ActionGeo_Long",
        "distance_to_vancouver"
    )
    .orderBy("distance_to_vancouver")
)

# This allows us to:
# 1. Visualize distance distribution histogram
# 2. Identify clustering patterns
# 3. Validate 25 km threshold choice
# 4. Spot outliers or data quality issues

---

## 8. Key Findings and Next Steps

### What We Accomplished:

✅ **Geospatial Distance Calculation**: Implemented accurate geodesic distance UDF

✅ **Event Filtering**: Focused on conflict/protest events (CAMEO 14-22) most likely to disrupt operations

✅ **Distance Parameter**: Established 25 km as initial threshold for port-related events

✅ **Validation Framework**: Created methodology to test and refine distance parameters

### Applications:

These distance parameters are used in the **Data Engineering** and **Data Science** components of the project to:
1. Filter GDELT events for port relevance
2. Create training datasets for ML models
3. Define alert zones for real-time monitoring
4. Validate predictions against ground truth

### Potential Refinements:

- **Dynamic thresholds**: Adjust radius based on port size/importance
- **Event-type specific distances**: Different radii for strikes vs. protests vs. conflicts
- **Multi-port analysis**: Extend methodology to all Transpacific Route ports
- **Temporal patterns**: Analyze how event proximity changes over time
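
The first two refinements could share one lookup: a default radius, optional overrides per event type, and a scale factor per port. The sketch below shows the shape of such a rule only; all names are hypothetical and the override values are placeholders, not results from this analysis.

```python
DEFAULT_RADIUS_KM = 25.0

# Placeholder overrides keyed by CAMEO root code; the numbers are illustrative only
RADIUS_OVERRIDES_KM = {
    "14": 25.0,  # protests and strikes
    "19": 50.0,  # fighting
}


def radius_for(event_root_code, port_scale=1.0):
    """Radius in km for an event type, scaled by a per-port factor."""
    code = str(event_root_code).strip().zfill(2)
    return RADIUS_OVERRIDES_KM.get(code, DEFAULT_RADIUS_KM) * port_scale


def is_port_related(distance_km, event_root_code, port_scale=1.0):
    """True if the event lies within the radius that applies to its type."""
    if distance_km is None:
        return False
    return distance_km <= radius_for(event_root_code, port_scale)


print(is_port_related(30, 19), is_port_related(30, 14))
```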

---

**End of Notebook**