# NOAA AIS Data Extraction

### **Project Objective**
This notebook extracts AIS (Automatic Identification System) data for 2024 from the NOAA public database.

URL: https://coast.noaa.gov/htdata/CMSP/AISDataHandler/2024

### **Logic**
1. The code loops through every day of the year (366 days for leap year 2024).
2. It fetches the daily ZIP file directly from the NOAA Coastal URL.
3. To save disk space, it reads the CSV inside each ZIP directly from memory, without saving the raw daily files to Drive.
4. **Filtering Logic:**
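   * **Vessel Type:** Keeps only **Tankers** (`Cargo` codes 80-89, falling back to `VesselType` when `Cargo` is missing).
   * **Constraints:** Removes noise by requiring Length > 30 m, Width > 0, Draft > 0.5 m, and a plausible speed (SOG 0-40 knots, so stationary vessels are kept). Only positions with negative longitude (LON < 0) are retained.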
   * **Transceiver:** Keeps only Class A transceivers for high-quality commercial data.
5. **Data Reduction:** Reduces the data to hourly resolution by flooring each timestamp to the hour and keeping only the first record per vessel (MMSI) in each hour.
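
The steps above can be tried on a toy file before running the full-year download. The following is a minimal sketch, not the notebook's actual cell: the sample rows are invented, and the helper names (`zip_bytes_from_csv`, `read_first_csv`, `filter_tankers`, `one_row_per_vessel_hour`) are hypothetical. It also sorts by time before deduplicating, which is an assumption about what "first record per hour" should mean.

```python
import io
import zipfile

import pandas as pd

# Toy stand-in for one daily file (invented values, not real AIS records).
SAMPLE_CSV = """MMSI,BaseDateTime,LAT,LON,SOG,Length,Width,Draft,Cargo,TransceiverClass
111111111,2024-01-01T00:05:00,29.1,-94.2,11.8,183,32,10.5,80,A
111111111,2024-01-01T00:40:00,29.2,-94.1,12.0,183,32,10.5,80,A
111111111,2024-01-01T01:10:00,29.3,-94.0,12.1,183,32,10.5,80,A
222222222,2024-01-01T00:15:00,40.5,-73.9,8.0,25,6,2.0,80,A
333333333,2024-01-01T00:20:00,33.7,-118.2,9.5,200,32,11.0,70,A
444444444,2024-01-01T00:30:00,27.8,-97.1,10.0,180,30,9.0,84,B
"""


def zip_bytes_from_csv(csv_text, name="AIS_sample.csv"):
    """Build a ZIP archive in memory, mimicking a downloaded response body."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as z:
        z.writestr(name, csv_text)
    return buf.getvalue()


def read_first_csv(zip_payload):
    """Read the first member of an in-memory ZIP without touching disk."""
    with zipfile.ZipFile(io.BytesIO(zip_payload)) as z:
        with z.open(z.namelist()[0]) as f:
            return pd.read_csv(f)


def filter_tankers(df):
    """Apply the Class A, tanker-code, and plausibility filters."""
    is_class_a = df["TransceiverClass"].astype(str).str.strip().str.upper() == "A"
    is_tanker = df["Cargo"].between(80, 89)
    plausible = (
        (df["Length"] > 30)
        & (df["Width"] > 0)
        & (df["Draft"] > 0.5)
        & (df["LON"] < 0)
        & df["SOG"].between(0, 40)
    )
    return df[is_class_a & is_tanker & plausible].copy()


def one_row_per_vessel_hour(df):
    """Keep the earliest record per (MMSI, hour)."""
    df = df.copy()
    df["BaseDateTime"] = pd.to_datetime(df["BaseDateTime"])
    df = df.sort_values(["MMSI", "BaseDateTime"])  # assumption: earliest row wins
    hour = df["BaseDateTime"].dt.floor("h").rename("hour")
    keep = ~pd.concat([df["MMSI"], hour], axis=1).duplicated()
    return df[keep]


raw = read_first_csv(zip_bytes_from_csv(SAMPLE_CSV))
hourly = one_row_per_vessel_hour(filter_tankers(raw))
print(hourly[["MMSI", "BaseDateTime", "Cargo"]])
```

In this toy data, only vessel 111111111 passes every filter, and its two reports within the 00:00 hour collapse to one, leaving one row for 00:00 and one for 01:00.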

In [None]:
import pandas as pd
import requests
import zipfile
import io
import os
import warnings
from google.colab import drive

#  MOUNT GOOGLE DRIVE
print("Mounting Google Drive...")
drive.mount('/content/drive')

# CONFIGURATION
output_folder = "/content/drive/MyDrive/AIS_Project"
os.makedirs(output_folder, exist_ok=True)

master_file = os.path.join(output_folder, "AIS_2024_Tanker_Corridor_Master.csv")
base_url = "https://coast.noaa.gov/htdata/CMSP/AISDataHandler/2024/AIS_2024_{:02d}_{:02d}.zip"

warnings.simplefilter(action='ignore', category=pd.errors.SettingWithCopyWarning)

cols_to_load = [
    'MMSI', 'BaseDateTime','IMO', 'LAT', 'LON', 'SOG','COG', 'Heading',
    'VesselType', 'Status', 'Length', 'Width', 'Draft',
    'Cargo', 'TransceiverClass'
]

print(f"Starting Extraction... Saving to: {master_file}")

#  LOOP OVER EVERY DAY OF YEAR
for month in range(1, 13):
    for day in range(1, 32):
        try:
            pd.Timestamp(2024, month, day)
        except ValueError:
            continue

        url = base_url.format(month, day)
        date_str = f"2024-{month:02d}-{day:02d}"
        print(f"\n--- Processing: {date_str} ---")

        try:
            #  DOWNLOAD
            r = requests.get(url, timeout=30)
            if r.status_code != 200:
                print(f"Skipped (HTTP {r.status_code})")
                continue

            #  LOAD AND PROCESS
            with zipfile.ZipFile(io.BytesIO(r.content)) as z:
                with z.open(z.namelist()[0]) as f:
                    df = pd.read_csv(f, usecols=cols_to_load)
                    print(f" -> Data Loaded: {len(df)} raw rows")

                    # CLEANING
                    # Keep Class A only; .copy() avoids assigning into a slice of the raw frame
                    df = df[df['TransceiverClass'].astype(str).str.strip().str.upper() == 'A'].copy()
                    if 'VesselType' in df.columns:
                        df['Cargo'] = df['Cargo'].fillna(df['VesselType'])
                    mask_tanker = (df['Cargo'] >= 80) & (df['Cargo'] <= 89)
                    mask_moving = (df['SOG'] >= 0.0) & (df['SOG'] <= 40.0)
                    mask_physics = (
                        (df['Length'] > 30) &
                        (df['Width'] > 0) &
                        (df['Draft'] > 0.5) &
                        (df['LON'] < 0)
                    )
                    df = df[mask_tanker & mask_moving & mask_physics]

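                    # HOURLY REDUCTION: keep the first row encountered per (MMSI, hour);
                    # rows are not sorted here, so file order decides which row is kept
                    df['BaseDateTime'] = pd.to_datetime(df['BaseDateTime'])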
                    hour_bucket = df['BaseDateTime'].dt.floor('h')
                    df = df.loc[~pd.concat([df['MMSI'], hour_bucket], axis=1).duplicated()]

                    # DROP COLUMNS
                    cols_to_drop = ['VesselType', 'TransceiverClass']
                    df.drop(columns=[c for c in cols_to_drop if c in df.columns], inplace=True)

                    # SAVE
                    use_header = not os.path.exists(master_file)
                    df.to_csv(master_file, mode='a', index=False, header=use_header)

                    print(f" -> Saved to Master: +{len(df)} rows")

            # CLEAR FROM MEMORY
            del r
            print(" -> Parent data cleared from memory.")

        except Exception as e:
            print(f"Error: {e}")

print("\n--- MISSION COMPLETE ---")

Mounting Google Drive...
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Starting Extraction... Saving to: /content/drive/MyDrive/AIS_Project/AIS_2024_Tanker_Corridor_Master.csv

--- Processing: 2024-01-01 ---
 -> Data Loaded: 7296275 raw rows
 -> Saved to Master: +10715 rows
 -> Parent data cleared from memory.

--- Processing: 2024-01-02 ---
 -> Data Loaded: 7295616 raw rows
 -> Saved to Master: +10703 rows
 -> Parent data cleared from memory.

--- Processing: 2024-01-03 ---
 -> Data Loaded: 7290618 raw rows
 -> Saved to Master: +10446 rows
 -> Parent data cleared from memory.

--- Processing: 2024-01-04 ---
 -> Data Loaded: 7085305 raw rows
 -> Saved to Master: +10176 rows
 -> Parent data cleared from memory.

--- Processing: 2024-01-05 ---
 -> Data Loaded: 7397034 raw rows
 -> Saved to Master: +10331 rows
 -> Parent data cleared from memory.

--- Processing: 2024-01-06 ---
 -> Data Loaded: 7287838 raw 