# ETL/ELT Pipeline - DB-7 Maritime Shipping Intelligence

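This notebook defines an ETL/ELT pipeline for the maritime shipping intelligence database db-7. The extract functions for NOAA, USCG, MARAD, and Virginia port data are placeholders that return `None` until they are implemented; only the Data.gov catalog search makes a live request.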

## Pipeline Overview
1. **Extract**: Load data from government sources (NOAA, USCG, MARAD, Data.gov)
2. **Transform**: Clean, validate, and transform maritime data
3. **Load**: Load transformed data into target database
4. **Validate**: Verify data quality and completeness
5. **Monitor**: Track pipeline performance and errors

## Government Data Sources

### NOAA (National Oceanic and Atmospheric Administration)
- **AccessAIS Tool**: Interactive vessel traffic data download
- **MarineCadastre.gov**: AIS vessel traffic data (2009-2024)
- **Vessel Traffic Data**: Available as CSV, GeoPackage, and GeoTIFF
- **Base URL**: https://coast.noaa.gov/digitalcoast/data/vesseltraffic.html
- **AccessAIS URL**: https://coast.noaa.gov/digitalcoast/tools/ais.html

### US Coast Guard (USCG)
- **National Vessel Movement Center (NVMC)**: Notice of Arrival and Departure (NOAD) data
- **Vessel Information Verification Service (VIVS)**: AIS static data (MMSI, call sign, vessel info)
- **AIS Data Sharing**: Level A (real-time), Level B (filtered), Level C (historical)
- **NVMC Base URL**: https://www.nvmc.uscg.gov/
- **VIVS Base URL**: https://navcen.uscg.gov/ais-vivs-home

### MARAD (Maritime Administration)
- **U.S.-Flag Fleet Data**: Current fleet lists, vessel characteristics, capacities
- **Port Statistics**: Cargo volumes, vessel calls, berth productivity
- **Waterborne Commerce Statistics**: Port performance metrics
- **Base URL**: https://www.maritime.dot.gov/data-reports
- **Port Data**: https://www.maritime.dot.gov/data-reports/ports
- **Contact**: data.marad@dot.gov

### Data.gov
- **Virginia International Gateway Vessel Schedules**: Port of Virginia vessel schedules
- **Port Region Grain Ocean Vessel Activity**: USDA weekly vessel activity data
- **AIS Vessel Tracks**: Commerce Data Hub AIS datasets
- **Base URL**: https://catalog.data.gov
- **Virginia Data**: https://data.virginia.gov/dataset/virginia-international-gateway-vig-vessel-schedules-the-port-of-virginia

## Section 1: Setup and Configuration

In [None]:
import sys
from pathlib import Path
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import json
import logging
from typing import Dict, List, Optional
import warnings
import requests
from urllib.parse import urljoin, urlencode
warnings.filterwarnings('ignore')

# Database connections
try:
    from sqlalchemy import create_engine, text
    SQLALCHEMY_AVAILABLE = True
except ImportError:
    SQLALCHEMY_AVAILABLE = False
    print("Warning: sqlalchemy not available")

# Geospatial libraries
try:
    import geopandas as gpd
    from shapely.geometry import Point
    GEOSPATIAL_AVAILABLE = True
except ImportError:
    GEOSPATIAL_AVAILABLE = False
    print("Warning: geopandas/shapely not available")

# Set up logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 50)
pd.set_option('display.width', None)

print("✓ Imports successful")

In [None]:
# Configuration
DB_NAME = "db-7"
DB_PATH = Path.cwd().parent

# Database connection strings (configure as needed)
# PostgreSQL
POSTGRES_CONNECTION_STRING = None  # "postgresql://user:password@localhost:5432/dbname"

# Databricks
DATABRICKS_CONNECTION_STRING = None  # Configure Databricks connection

# Snowflake
SNOWFLAKE_CONNECTION_STRING = None  # Configure Snowflake connection

# Source data paths
DATA_DIR = DB_PATH / "data"
SCHEMA_FILE = DATA_DIR / "schema.sql"
DATA_FILE = DATA_DIR / "data.sql"

# Government data source URLs (only the Data.gov catalog URL is an API endpoint; the rest are web pages)
NOAA_ACCESS_AIS_URL = "https://coast.noaa.gov/digitalcoast/tools/ais.html"
MARINE_CADASTRE_AIS_URL = "https://marinecadastre.gov/AIS/"
USCG_NVMC_URL = "https://www.nvmc.uscg.gov/"
USCG_VIVS_URL = "https://navcen.uscg.gov/ais-vivs-home"
MARAD_DATA_URL = "https://www.maritime.dot.gov/data-reports"
DATA_GOV_CATALOG_URL = "https://catalog.data.gov/api/3/action"
VIRGINIA_PORT_SCHEDULES_URL = "https://data.virginia.gov/dataset/virginia-international-gateway-vig-vessel-schedules-the-port-of-virginia"

print(f"Database: {DB_NAME}")
print(f"Data directory: {DATA_DIR}")
print(f"Schema file exists: {SCHEMA_FILE.exists()}")
print(f"Data file exists: {DATA_FILE.exists()}")
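
The connection strings above are left as `None` with credentials shown only in comments. One way to fill them without committing secrets is to read them from environment variables. The following is a minimal sketch; the helper `get_connection_string` and the variable name `MARITIME_POSTGRES_URL` are assumptions made for illustration, not part of the original configuration.

```python
import os
from typing import Optional


def get_connection_string(env_var: str, default: Optional[str] = None) -> Optional[str]:
    """Return a connection string from the environment, or the default if unset or blank."""
    value = os.environ.get(env_var, "").strip()
    return value or default


# MARITIME_POSTGRES_URL is a hypothetical variable name; pick one that fits your deployment.
POSTGRES_CONNECTION_STRING = get_connection_string("MARITIME_POSTGRES_URL")
print(f"PostgreSQL configured: {POSTGRES_CONNECTION_STRING is not None}")
```

Treating a blank value the same as an unset one keeps later `if POSTGRES_CONNECTION_STRING:` style checks simple.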

## Section 2: Extract - Government Data Sources

In [None]:
def search_data_gov_datasets(query: str, limit: int = 10) -> Optional[List[Dict]]:
    """Search Data.gov catalog for maritime datasets."""
    try:
        url = f"{DATA_GOV_CATALOG_URL}/package_search"
        params = {
            'q': query,
            'rows': limit
        }
        response = requests.get(url, params=params, timeout=30)
        response.raise_for_status()
        data = response.json()
        if 'result' in data and 'results' in data['result']:
            return data['result']['results']
        return []
    except Exception as e:
        logger.error(f"Error searching Data.gov: {e}")
        return None

# Search for maritime datasets
maritime_datasets = search_data_gov_datasets("maritime shipping vessel port", limit=20)
if maritime_datasets:
    print(f"✓ Found {len(maritime_datasets)} maritime datasets on Data.gov")
    for dataset in maritime_datasets[:5]:
        print(f"  - {dataset.get('title', 'N/A')}")
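
The search above prints only dataset titles. To download anything, the next step is usually to pick out resource links from each result. The sketch below assumes each result follows the usual CKAN package layout, with a `resources` list whose entries carry `name`, `format`, and `url` keys; the helper `list_resources` and the sample record are hypothetical and should be checked against a real response.

```python
from typing import Dict, List


def list_resources(dataset: Dict, formats: tuple = ("CSV",)) -> List[Dict[str, str]]:
    """Return name/format/url for each resource whose format matches (case-insensitive)."""
    wanted = {f.upper() for f in formats}
    matches = []
    for resource in dataset.get("resources", []):
        fmt = (resource.get("format") or "").upper()
        # Skip resources with a non-matching format or an empty URL.
        if fmt in wanted and resource.get("url"):
            matches.append({
                "name": resource.get("name") or "N/A",
                "format": fmt,
                "url": resource["url"],
            })
    return matches


# Illustrative record, not a real Data.gov entry.
sample_dataset = {
    "title": "Example vessel schedule",
    "resources": [
        {"name": "schedule", "format": "csv", "url": "https://example.org/schedule.csv"},
        {"name": "landing page", "format": "HTML", "url": "https://example.org/"},
        {"name": "no link", "format": "CSV", "url": ""},
    ],
}
for res in list_resources(sample_dataset):
    print(f"{res['name']} ({res['format']}): {res['url']}")
```

In the notebook, the same helper could be applied to each entry of `maritime_datasets`.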

In [None]:
def extract_noaa_ais_data(year: int, region: Optional[str] = None) -> Optional[pd.DataFrame]:
    """
    Extract NOAA AIS vessel traffic data.
    Note: This is a placeholder - actual implementation requires
    downloading from MarineCadastre.gov or using AccessAIS tool.
    """
    logger.info(f"Extracting NOAA AIS data for year {year}")
    # Implementation would download from MarineCadastre.gov
    # Data available in CSV, GeoPackage, GeoTIFF formats
    # URL: https://marinecadastre.gov/AIS/
    return None

def extract_uscg_noad_data(start_date: datetime, end_date: datetime) -> Optional[pd.DataFrame]:
    """
    Extract USCG Notice of Arrival and Departure (NOAD) data.
    Note: Requires NVMC access and proper authentication.
    """
    logger.info(f"Extracting USCG NOAD data from {start_date} to {end_date}")
    # Implementation would connect to NVMC system
    # URL: https://www.nvmc.uscg.gov/
    return None

def extract_marad_fleet_data() -> Optional[pd.DataFrame]:
    """
    Extract MARAD U.S.-Flag Fleet data.
    Note: Data available from maritime.dot.gov/data-reports
    """
    logger.info("Extracting MARAD fleet data")
    # Implementation would download from MARAD website
    # URL: https://www.maritime.dot.gov/data-reports/us-flag-fleet-dashboard
    return None

def extract_port_schedules_virginia() -> Optional[pd.DataFrame]:
    """
    Extract Virginia International Gateway vessel schedules.
    """
    logger.info("Extracting Virginia port vessel schedules")
    # Implementation would download from data.virginia.gov
    # URL: https://data.virginia.gov/dataset/virginia-international-gateway-vig-vessel-schedules-the-port-of-virginia
    return None
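
The extract functions above are placeholders. Once an AIS CSV has been downloaded manually, reading it could look roughly like the sketch below. The source headers in `AIS_COLUMN_MAP` (`MMSI`, `BaseDateTime`, `LAT`, `LON`, and so on) are an assumption about the MarineCadastre file layout and should be verified against the actual download; `read_ais_csv` is a hypothetical helper. The target names match the source columns that `transform_ais_tracking_data` looks for.

```python
import io

import pandas as pd

# Assumed MarineCadastre-style headers; verify against the file you download.
AIS_COLUMN_MAP = {
    "MMSI": "mmsi",
    "BaseDateTime": "timestamp",
    "LAT": "latitude",
    "LON": "longitude",
    "SOG": "speed",
    "COG": "course",
    "Heading": "heading",
    "Status": "nav_status",
    "Draft": "draught",
}


def read_ais_csv(source) -> pd.DataFrame:
    """Read an AIS CSV (path or file-like object) and rename known columns."""
    # Keep MMSI as text so it is treated as an identifier, not a number.
    df = pd.read_csv(source, dtype={"MMSI": str})
    df = df.rename(columns=AIS_COLUMN_MAP)
    if "timestamp" in df.columns:
        # Unparseable timestamps become NaT instead of raising.
        df["timestamp"] = pd.to_datetime(df["timestamp"], errors="coerce")
    return df


# Illustrative one-row file; the values are made up.
sample = io.StringIO(
    "MMSI,BaseDateTime,LAT,LON,SOG,COG,Heading\n"
    "367000001,2023-01-01T00:00:05,36.95,-76.33,10.2,185.0,184\n"
)
ais_df = read_ais_csv(sample)
print(ais_df.dtypes)
```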

## Section 3: Transform - Data Cleaning and Transformation

In [None]:
def transform_ais_tracking_data(df: pd.DataFrame) -> pd.DataFrame:
    """Transform AIS tracking data to match vessel_tracking table schema."""
    if df is None or df.empty:
        return pd.DataFrame()
    
    # Map columns to schema
    transformed = pd.DataFrame()
    
    # Generate tracking_id; pd.to_datetime lets this work when timestamps arrive as strings
    transformed['tracking_id'] = df.apply(
        lambda row: f"TRK_{row.get('mmsi', 'UNK')}_{pd.to_datetime(row.get('timestamp', datetime.now())).strftime('%Y%m%d%H%M%S')}",
        axis=1
    )
    
    # Map other fields
    column_mapping = {
        'mmsi': 'mmsi',
        'timestamp': 'timestamp',
        'latitude': 'latitude',
        'longitude': 'longitude',
        'speed': 'speed_knots',
        'course': 'course_degrees',
        'heading': 'heading_degrees',
        'nav_status': 'navigation_status',
        'destination': 'destination',
        'eta': 'eta',
        'draught': 'draught_meters'
    }
    
    for source_col, target_col in column_mapping.items():
        if source_col in df.columns:
            transformed[target_col] = df[source_col]
    
    # Set defaults
    transformed['data_source'] = 'AIS'
    transformed['data_quality'] = 'High'
    transformed['created_at'] = datetime.now()
    
    return transformed

def transform_port_data(df: pd.DataFrame) -> pd.DataFrame:
    """Transform port data to match ports table schema."""
    if df is None or df.empty:
        return pd.DataFrame()
    
    transformed = pd.DataFrame()
    
    # Generate port_id
    transformed['port_id'] = df.apply(
        lambda row: f"PORT_{row.get('locode', row.get('port_code', 'UNK'))}",
        axis=1
    )
    
    # Map fields
    column_mapping = {
        'port_name': 'port_name',
        'port_code': 'port_code',
        'locode': 'locode',
        'country': 'country',
        'country_code': 'country_code',
        'latitude': 'latitude',
        'longitude': 'longitude',
        'port_type': 'port_type',
        'timezone': 'timezone'
    }
    
    for source_col, target_col in column_mapping.items():
        if source_col in df.columns:
            transformed[target_col] = df[source_col]
    
    transformed['status'] = 'Active'
    transformed['data_source'] = 'MARAD'
    transformed['created_at'] = datetime.now()
    
    return transformed
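
The transforms above build identifiers row by row with `DataFrame.apply`, which can be slow on large AIS extracts. A vectorized alternative might look like the sketch below. It is not a drop-in replacement: `build_tracking_ids` is a hypothetical helper, it assumes `mmsi` is stored as text or integers, and it writes the placeholder `UNKNOWN` for unparseable timestamps rather than falling back to the current time.

```python
import pandas as pd


def build_tracking_ids(df: pd.DataFrame) -> pd.Series:
    """Build TRK_<mmsi>_<YYYYmmddHHMMSS> ids without a row-wise apply."""
    # Invalid or missing timestamps become NaT, then the placeholder "UNKNOWN".
    ts = pd.to_datetime(df["timestamp"], errors="coerce")
    stamp = ts.dt.strftime("%Y%m%d%H%M%S").astype("string").fillna("UNKNOWN")
    # Missing MMSI values become "UNK", matching the original default.
    mmsi = df["mmsi"].astype("string").fillna("UNK")
    return "TRK_" + mmsi + "_" + stamp


# Illustrative rows; the second one has a missing MMSI and a bad timestamp.
sample = pd.DataFrame({
    "mmsi": ["367000001", None],
    "timestamp": ["2023-01-01 00:00:05", "not a date"],
})
ids = build_tracking_ids(sample)
print(ids.tolist())
```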

## Section 4: Load - Database Loading

In [None]:
def load_to_database(df: pd.DataFrame, table_name: str, connection_string: str) -> bool:
    """Load DataFrame to database table."""
    if not SQLALCHEMY_AVAILABLE:
        logger.error("SQLAlchemy not available")
        return False
    
    if df is None or df.empty:
        logger.warning(f"No data to load to {table_name}")
        return False
    
    try:
        engine = create_engine(connection_string)
        df.to_sql(table_name, engine, if_exists='append', index=False)
        logger.info(f"Loaded {len(df)} rows to {table_name}")
        return True
    except Exception as e:
        logger.error(f"Error loading to {table_name}: {e}")
        return False
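
Because `load_to_database` uses `if_exists='append'`, running the load twice inserts the same rows twice. The sketch below shows that behavior with an in-memory SQLite database, which pandas accepts directly as a `sqlite3` connection. The table contents are illustrative values, and SQLite stands in for the PostgreSQL target only for demonstration.

```python
import sqlite3

import pandas as pd

# Illustrative row; not taken from a real source file.
ports = pd.DataFrame({
    "port_id": ["PORT_USORF"],
    "port_name": ["Norfolk"],
    "latitude": [36.85],
    "longitude": [-76.29],
})

conn = sqlite3.connect(":memory:")
try:
    ports.to_sql("ports", conn, if_exists="append", index=False)
    # A second run appends the same row again; nothing deduplicates it.
    ports.to_sql("ports", conn, if_exists="append", index=False)
    row_count = int(pd.read_sql("SELECT COUNT(*) AS n FROM ports", conn)["n"].iloc[0])
finally:
    conn.close()

print(f"Rows in ports after two loads: {row_count}")
```

If reruns are expected, a primary key on the target table or a staging-table merge would be needed to keep the load idempotent.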

## Section 5: Validate - Data Quality Checks

In [None]:
def validate_data_quality(df: pd.DataFrame, table_name: str) -> Dict:
    """Perform data quality validation checks."""
    if df is None or df.empty:
        return {"status": "empty", "issues": []}
    
    issues = []
    
    # Check for nulls in required fields
    required_fields = {
        'vessels': ['vessel_name', 'imo_number'],
        'ports': ['port_name', 'latitude', 'longitude'],
        'vessel_tracking': ['vessel_id', 'timestamp', 'latitude', 'longitude']
    }
    
    if table_name in required_fields:
        for field in required_fields[table_name]:
            # Report a missing column separately from null values
            if field not in df.columns:
                issues.append(f"required column {field} is missing")
                continue
            null_count = df[field].isna().sum()
            if null_count > 0:
                issues.append(f"{null_count} null values in {field}")
    
    # Check data ranges
    if 'latitude' in df.columns:
        invalid_lat = ((df['latitude'] < -90) | (df['latitude'] > 90)).sum()
        if invalid_lat > 0:
            issues.append(f"{invalid_lat} invalid latitude values")
    
    if 'longitude' in df.columns:
        invalid_lon = ((df['longitude'] < -180) | (df['longitude'] > 180)).sum()
        if invalid_lon > 0:
            issues.append(f"{invalid_lon} invalid longitude values")
    
    return {
        "status": "valid" if len(issues) == 0 else "issues_found",
        "row_count": len(df),
        "issues": issues
    }

print("✓ Validation functions defined")
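
The range checks above compare `latitude` and `longitude` directly, which assumes both columns are already numeric. If coordinates arrive as text, a coercion step can run first. The following is a minimal sketch under that assumption; `count_invalid_coordinates` is a hypothetical helper, not part of the original validation function.

```python
import pandas as pd


def count_invalid_coordinates(df: pd.DataFrame) -> dict:
    """Count non-numeric and out-of-range latitude/longitude values."""
    limits = {"latitude": 90, "longitude": 180}
    counts = {}
    for column, limit in limits.items():
        if column not in df.columns:
            continue
        # Text that cannot be parsed as a number becomes NaN.
        values = pd.to_numeric(df[column], errors="coerce")
        # Non-numeric: present in the source but lost during coercion.
        non_numeric = int((values.isna() & df[column].notna()).sum())
        out_of_range = int((values.abs() > limit).sum())
        counts[column] = {"non_numeric": non_numeric, "out_of_range": out_of_range}
    return counts


# Illustrative rows mixing text, invalid, and missing coordinates.
sample = pd.DataFrame({
    "latitude": ["36.95", "91.0", "n/a", None],
    "longitude": [-76.33, 181.0, 10.0, 20.0],
})
print(count_invalid_coordinates(sample))
```

Missing values are not counted here, since the required-field check already reports nulls.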

## Section 6: Monitor - Pipeline Execution Tracking

In [None]:
# Pipeline execution metadata
pipeline_metadata = {
    "pipeline_name": "db-7-maritime-intelligence",
    "execution_date": datetime.now().isoformat(),
    "data_sources": [
        "NOAA AccessAIS",
        "USCG NVMC",
        "MARAD Fleet Data",
        "Data.gov Maritime Datasets"
    ],
    "tables_loaded": [],
    "records_processed": 0,
    "errors": []
}

# Save metadata
metadata_file = DB_PATH / "metadata" / "pipeline_metadata.json"
metadata_file.parent.mkdir(parents=True, exist_ok=True)

with open(metadata_file, 'w') as f:
    json.dump(pipeline_metadata, f, indent=2)

print(f"✓ Pipeline metadata saved to {metadata_file}")
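
The metadata above is written once with empty `tables_loaded` and `errors` lists. To make it useful for monitoring, each load step would need to update it before it is saved. A minimal sketch follows; `record_load` is a hypothetical helper, the table names and counts are made up, and the file is written to a temporary directory rather than `metadata_file`.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def record_load(metadata: dict, table: str, rows: int, error: Optional[str] = None) -> dict:
    """Update pipeline metadata in place after one load attempt and return it."""
    if error is None:
        metadata["tables_loaded"].append(table)
        metadata["records_processed"] += rows
    else:
        metadata["errors"].append({"table": table, "message": error})
    return metadata


metadata = {"tables_loaded": [], "records_processed": 0, "errors": []}
record_load(metadata, "ports", 120)
record_load(metadata, "vessel_tracking", 0, error="connection refused")

# Written to a temporary directory here; the notebook would use metadata_file instead.
out_path = Path(tempfile.mkdtemp()) / "pipeline_metadata.json"
out_path.write_text(json.dumps(metadata, indent=2))
print(out_path.read_text())
```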