# Morning exercises

The data we will be working with is the [Marine Cadastre ship traffic](https://hub.marinecadastre.gov/pages/vesseltraffic) dataset published by the US government.

Here's a short description of the data:

> Vessel traffic data, or Automatic Identification System (AIS) data, are collected by the U.S. Coast Guard through an onboard navigation safety device that transmits and monitors the location and characteristics of vessels in U.S. and international waters in real time. The Bureau of Ocean Energy Management, the National Oceanic and Atmospheric Administration, and the U.S. Coast Guard Navigation Center have worked together to repurpose some of the most important records and make these records available to the public. These records are sourced from the U.S. Coast Guard’s national network of AIS receivers called the Nationwide Automatic Identification System. Information such as location, time, vessel type, speed, length, beam, and draft have been extracted from the raw data and prepared for analyses in desktop geographic information system (GIS) software. Note that Marine Cadastre does not have access to live AIS data feeds or more recent data than what is provided on this webpage.

A data dictionary can be found [here](https://coast.noaa.gov/data/marinecadastre/ais/data-dictionary.pdf) but the table is duplicated below:

| # | Field Name | Description | Example | Unit | Valid Domain | Null Allowed | Arrow Type | Bytes | Query |
|---|---|---|---|---|---|---|---|---|---|
| 1 | mmsi | Maritime Mobile Service Identity value | 477220100 | integer | 2^7 + MMDx3 + 4 | N | int32 | 4 | Y |
| 2 | base_date_time | Full UTC date and time | 2017-02-01T05:02 | - | - | N | datetime64[ns] | 8 | Y |
| 3 | longitude | Longitude | -71.04182 | decimal degree | -179.99999 to 179.99999 | N | double | 8 | Y |
| 4 | latitude | Latitude | 42.35137 | decimal degree | -89.99999 to 89.99999 | N | double | 8 | Y |
| 5 | sog | Speed Over Ground | 5.9 | knot | 0 to 99.9 | Y | float | 4 | Y |
| 6 | cog | Course Over Ground | 47.5 | degree NAz | 0 to 359.9 | Y | float | 4 | Y |
| 7 | heading | True Heading | 45 | degree NAz | 0 to 359 | Y | int32 | 4 | - |
| 8 | vessel_name | Name as shown on the station radio license | OOCL Malaysia | alphanumeric | ASCII characters UTF-8 | Y | string | 24 | Y |
| 9 | imo | International Maritime Organization Vessel number | IMO9627980 | alphanumeric | alphanumeric | Y | string | 12 | Y |
| 10 | call_sign | Call sign as assigned by FCC | VRME7 | alphanumeric | alphanumeric | Y | string | 8 | Y |
| 11 | vessel_type | Vessel type as defined in NAIS specifications | 70 | scalar | 1 to 1024* | Y | int32 | 4 | Y |
| 12 | status | Navigation status as defined by the COLREGS | 3 | scalar | 1 to 14* | Y | int32 | 4 | Y |
| 13 | length | Length of vessel (see NAIS specifications) | 71 | meter | 1 to 509 | Y | int32 | 4 | Y |
| 14 | width | Width of vessel (see NAIS specifications) | 12 | meter | 1 to 61 | Y | int32 | 4 | Y |
| 15 | draft | Draft depth of vessel (see NAIS specifications) | 3.5 | meter | 1 to 24 | Y | float | 4 | Y |
| 16 | cargo | Cargo type (see NAIS specification and codes) | 70 | scalar | 1 to 1024* | Y | int32 | 4 | - |
| 17 | transceiver | Class of AIS transceiver | A | character | A \| B | Y | string | 2 | Y |


In [None]:
import glob
import os

import matplotlib.pyplot as plt
import pandas as pd

## Part 1: Exploratory stats

**First, load the January 1, 2025 AIS data by reading the data from the link below**.

If you are in Google Colab, then you can access the shared drive -- Email me at cc257@rice.edu for access.

In [None]:
%%time
pd.read_csv("data/2025_raw/ais-2025-01-01.csv.zst")

In [None]:
import polars as pl

In [None]:
%%time

df_pl = pl.read_csv("data/2025_raw/ais-2025-01-01.csv.zst")

In [None]:
url = "https://rice.box.com/shared/static/408bvz8janxz57vziii5vqac28irkdrj.zst"

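# The URL ends in .zst, so pandas infers zstd compression (this needs the `zstandard` package).
# Parse the timestamp column so that min/max and sorting operate on datetimes.
df = pd.read_csv(url, parse_dates=["base_date_time"])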

# # If in Google Colab
# 
# from google.colab import drive
# drive.mount("gdrive", force_remount=True)
# # After being granted access, see if you can find files at
# glob.glob("gdrive/Shareddrives/colab_data/raw_ais_data/*")

In [None]:
df.columns

**How many observations total are there?**

In [None]:
df.shape

**How many missing values are there in each column?**

In [None]:
df.isna().sum()

**How many unique ships are there?**

In [None]:
df["mmsi"].nunique()

**What column/columns do you think make the best index for this DataFrame? Why?**
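
One way to reason about this question is to check whether a candidate set of columns identifies each row uniquely. The sketch below uses a small made-up DataFrame (`toy`, not the real AIS file) with the same column names and compares `mmsi` alone against the pair `(mmsi, base_date_time)`.

```python
import pandas as pd

# Toy data (made up for illustration), using the AIS column names
toy = pd.DataFrame(
    {
        "mmsi": [111, 111, 222, 222],
        "base_date_time": pd.to_datetime(
            ["2025-01-01 00:00", "2025-01-01 00:01",
             "2025-01-01 00:00", "2025-01-01 00:01"]
        ),
        "latitude": [29.7, 29.8, 42.3, 42.3],
    }
)

# A good index should identify each row uniquely
mmsi_is_unique = not toy.duplicated(subset=["mmsi"]).any()
pair_is_unique = not toy.duplicated(subset=["mmsi", "base_date_time"]).any()

print(mmsi_is_unique, pair_is_unique)

# If the pair is unique, it can serve as a MultiIndex
toy_indexed = toy.set_index(["mmsi", "base_date_time"]).sort_index()
```

On the real data, the same `duplicated` check on `df` would show whether the pair is actually unique before you commit to it as an index.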

**How many of each type of ship are there?**
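
A possible starting point, shown on made-up data rather than the real file: counting rows per `vessel_type` counts observations, while counting distinct `mmsi` values per `vessel_type` counts ships. The question asks about ships, so the second is probably what you want.

```python
import pandas as pd

# Toy data (made up): one row per observation
toy = pd.DataFrame(
    {
        "mmsi": [111, 111, 222, 333, 333, 333],
        "vessel_type": [70, 70, 70, 31, 31, 31],
    }
)

# Number of observations for each vessel type
observations_per_type = toy["vessel_type"].value_counts()

# Number of distinct ships for each vessel type
ships_per_type = toy.groupby("vessel_type")["mmsi"].nunique()

print(ships_per_type)
```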

**Which ship/ships were observed the most times? How many times was it? Does that make sense?**
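
A minimal sketch on made-up data: `value_counts` gives the number of observations per ship, and comparing against the maximum keeps every ship that ties for first place instead of only one of them.

```python
import pandas as pd

# Toy data (made up): one row per observation
toy = pd.DataFrame({"mmsi": [111, 111, 222, 333, 333, 333]})

# Observations per ship
counts = toy["mmsi"].value_counts()

# Keep all ships that share the maximum count (handles ties)
most_observed = counts[counts == counts.max()]

print(most_observed)
```

As a sanity check on the real data, a day has 1,440 minutes, so if each ship reports at most once per minute (as the "ship-minutes" wording in these exercises suggests), counts far above that would be worth investigating.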

**Which observation was the furthest east? Does this make sense?**

In [None]:
# Show the full observation, not just its longitude value
df.loc[df["longitude"].idxmax()]

**Which observation was the furthest west? Does this make sense?**

In [None]:
# Show the full observation, not just its longitude value
df.loc[df["longitude"].idxmin()]

**What percentage of cargo ship observations (ship-minutes) were west of the middle of the US? How many were east of the middle?**

_Note: For the purposes of this question, we'll use the "geographic middle of the contiguous US" to define the middle of the US and this middle is defined by the point (39°50′N 98°35′W)._
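
One way to set this up, sketched on made-up data: convert 98°35′W to decimal degrees (west longitudes are negative), select cargo ships with the same `vessel_type // 10 == 7` rule used elsewhere in this notebook, and take the mean of a boolean mask. The name `MIDDLE_LON` and the toy values are illustrative only.

```python
import pandas as pd

# 98°35′W in decimal degrees (west is negative): about -98.583
MIDDLE_LON = -(98 + 35 / 60)

# Toy data (made up): one row per observation
toy = pd.DataFrame(
    {
        "vessel_type": [70, 71, 79, 70, 31],
        "longitude": [-122.4, -118.2, -74.0, -90.1, -80.0],
    }
)

# Cargo ships are the vessel types 70 through 79
cargo = toy.loc[toy["vessel_type"] // 10 == 7]

# The mean of a boolean mask is the share of True values
pct_west = 100 * (cargo["longitude"] < MIDDLE_LON).mean()
n_east = int((cargo["longitude"] >= MIDDLE_LON).sum())

print(pct_west, n_east)
```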

**What percentage of cargo ship observations (ship-minutes) had a speed of less than 0.5 knots?**

In [None]:
cargo_ship_observations = df.query("vessel_type // 10 == 7")

# Percentage (0 to 100) of cargo ship observations slower than 0.5 knots
100 * cargo_ship_observations.query("sog < 0.5").shape[0] / cargo_ship_observations.shape[0]

**How many times did each ship appear in the dataset? Plot a histogram or violin plot of these values**

In [None]:
# Observations per ship, then a histogram of those counts
df["mmsi"].value_counts().plot(kind="hist", bins=50)

## Part 2: Split-Apply-Combine

**Write a function that takes a single ship's data and calculates the distance that it moved that day then apply that function and create a dataframe that has the ship's mmsi and net distance traveled.**

_Hint_: I've included some sample code below for calculating the distance between two points. If you use this style of approach, please note that `.agg` won't work here because you will need multiple columns from your DataFrame.

_Hint_: Alternative would be to use `geopandas`. Since we didn't talk about it, you should see whether you can work with an LLM to get distance code using `geopandas`!

In [None]:
from pyproj import Geod

geod = Geod(ellps="WGS84")

austin_lat, austin_lon = 30.26, -97.74
dc_lat, dc_lon = 38.90, -77.03


forward_azimuth, back_azimuth, distance_meters = geod.inv(austin_lon, austin_lat, dc_lon, dc_lat)

print(f"The distance from Austin to Washington DC is {distance_meters/1000} km")

In [None]:
def daily_ship_distance(subdf):

    # Get the first and last observation of the day
    first_observation = subdf["base_date_time"].idxmin()
    last_observation = subdf["base_date_time"].idxmax()

    flat = subdf.at[first_observation, "latitude"]
    flon = subdf.at[first_observation, "longitude"]
    llat = subdf.at[last_observation, "latitude"]
    llon = subdf.at[last_observation, "longitude"]

    geod = Geod(ellps="WGS84")
    forward_azimuth, back_azimuth, distance_meters = geod.inv(flon, flat, llon, llat)

    return distance_meters / 1000

In [None]:
%%time

daily_distances = df.groupby("mmsi").apply(daily_ship_distance, include_groups=False)

In [None]:
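# Rough runtime estimate (assumed reading): about 8 seconds per daily file
# for 180 + 365 daily files, converted to hours
8 * (180 + 365) / 60 / 60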

In [None]:
# Step 1: Sort so that each ship's observations are in time order
df_pl = df_pl.sort(["mmsi", "base_date_time"])

# Step 2: Use shift() within each ship to get the previous point's coordinates
df_pl = df_pl.with_columns(
    pl.col("latitude").shift(1).over("mmsi").alias("prev_lat"),
    pl.col("longitude").shift(1).over("mmsi").alias("prev_lon"),
)

# Step 3: Haversine distance (km) between consecutive points, written with
# Polars expressions. This assumes a spherical Earth with radius 6371 km,
# so values differ slightly from the WGS84 geodesic computed by pyproj.
lat1 = pl.col("prev_lat").radians()
lat2 = pl.col("latitude").radians()
dlat = lat2 - lat1
dlon = (pl.col("longitude") - pl.col("prev_lon")).radians()
a = (dlat / 2).sin() ** 2 + lat1.cos() * lat2.cos() * (dlon / 2).sin() ** 2

df_pl = df_pl.with_columns(
    (2 * 6371.0 * a.sqrt().arcsin()).alias("step_distance_km")
)

In [None]:
# Sum the steps per ship. Note that this is the path length, which is at
# least as large as the net (first-to-last) distance in `daily_distances`.
df_pl.group_by("mmsi").agg(pl.col("step_distance_km").sum().alias("path_distance_km"))

In [None]:
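# Row labels of each ship's first and last observation of the day
first_index = df.groupby("mmsi")["base_date_time"].idxmin()
last_index = df.groupby("mmsi")["base_date_time"].idxmax()

In [None]:
first_index.shape

In [None]:
df.loc[first_index]

In [None]:
# Vectorized alternative to the groupby-apply above: Geod.inv accepts arrays
first_obs = df.loc[first_index]
last_obs = df.loc[last_index]
_, _, dist_m = geod.inv(
    first_obs["longitude"].to_numpy(), first_obs["latitude"].to_numpy(),
    last_obs["longitude"].to_numpy(), last_obs["latitude"].to_numpy(),
)
pd.Series(dist_m / 1000, index=first_index.index)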

In [None]:
daily_distances

**Create a DataFrame that only has the first daily locations for cargo ships that move less than 0.5 km**

In [None]:
cargo_ships = df.query("vessel_type // 10 == 7")["mmsi"].unique()

In [None]:
cargo_ships_daily_distance = daily_distances.loc[cargo_ships]
cargo_ships_stationary = cargo_ships_daily_distance.loc[cargo_ships_daily_distance < 0.5].index

In [None]:
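# Sort by time so that .first() picks each ship's earliest observation
cargo_ships_stationary_df = df.query("mmsi in @cargo_ships_stationary").sort_values("base_date_time").groupby("mmsi").first()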

In [None]:
cargo_ships_stationary_df

**Plot all of the `(latitude, longitude)` pairs for all of the cargo ships that have moved less than 0.5 km on a map. What do you notice?**

In [None]:
import folium

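# Center the map on the geographic center of the contiguous US (39°50′N, 98°35′W)
m = folium.Map(location=[39.83, -98.58], zoom_start=4)

# Add one small circle marker per stationary cargo ship
# (color, weight, and opacity are path options, so they need CircleMarker rather than Marker)
for row in cargo_ships_stationary_df.itertuples():
    lat = row.latitude
    lon = row.longitude

    folium.CircleMarker((lat, lon), radius=3, color="blue", weight=5, opacity=0.7).add_to(m)

# Save the map to an HTML file
# You can open this file in any web browser to view the interactive map
m.save("ship_track_map.html")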

print("Interactive map saved to ship_track_map.html")

In [None]:
import geopandas as gpd

from geodatasets import get_path

In [None]:
gdf = gpd.GeoDataFrame(
    cargo_ships_stationary_df,
    geometry=gpd.points_from_xy(cargo_ships_stationary_df.longitude, cargo_ships_stationary_df.latitude)
)

In [None]:
def daily_ship_distance(subdf):

    # Get the first and last observation of the day
    first_index = subdf["base_date_time"].idxmin()
    last_index = subdf["base_date_time"].idxmax()

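    first_observation = subdf.loc[first_index, :]
    last_observation = subdf.loc[last_index, :]

    # Geodesic distance between the two positions (pyproj expects lon, lat order)
    geod = Geod(ellps="WGS84")
    forward_azimuth, back_azimuth, distance_meters = geod.inv(
        first_observation["longitude"], first_observation["latitude"],
        last_observation["longitude"], last_observation["latitude"],
    )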

    return distance_meters / 1000


daily_distances = df.groupby("mmsi").apply(daily_ship_distance, include_groups=False)

In [None]:
world = gpd.read_file(get_path("naturalearth.land"))

# We restrict to USA
fig, ax = plt.subplots()
world.clip([-170, -15, 0, 75]).plot(color="white", edgecolor="black", ax=ax)

# We can now plot our ``GeoDataFrame``.
gdf.plot(ax=ax, color="red", alpha=0.25)

plt.show()

In [None]:
import pandas as pd

In [None]:
df_2024 = pd.read_csv("data/2024_raw/AIS_2024_01_01.zip")
df_2024.columns

In [None]:
df_2025 = pd.read_csv("data/2025_raw/ais-2025-01-01.csv.zst")
df_2025.columns

In [None]:
import glob

from io import StringIO

import polars as pl
import zipfile

In [None]:
ais_2024_files = glob.glob("data/2024_raw/*.zip")

for file in ais_2024_files:
    print(file)

    filename = os.path.basename(file)
    # Match the hyphenated, lowercase naming of the 2025 files (e.g. ais-2024-01-01.parquet)
    parquet_filename = filename.lower().replace("_", "-").replace(".zip", ".parquet")

    df = pd.read_csv(file)
    (
        df
        .rename(
            columns={
                "MMSI": "mmsi",
                "BaseDateTime": "base_date_time",
                "LAT": "latitude",
                "LON": "longitude",
                "SOG": "sog",
                "COG": "cog",
                "Heading": "heading",
                "VesselName": "vessel_name",
                "IMO": "imo",
                "CallSign": "call_sign",
                "VesselType": "vessel_type",
                "Status": "status",
                "Length": "length",
                "Width": "width",
                "Draft": "draft",
                "Cargo": "cargo",
                "TransceiverClass": "transceiver"
            }
        )
        .to_parquet(f"data/2024/{parquet_filename}")
    )


In [None]:
ais_2025_files = glob.glob("data/2025_raw/*.zst")

for file in ais_2025_files:
    print(file)

    filename = os.path.basename(file)
    # The 2025 files end in .csv.zst, so swap that suffix (not "zip") for .parquet
    parquet_filename = filename.replace(".csv.zst", ".parquet").lower()

    pl.read_csv(file).write_parquet(f"data/2025/{parquet_filename}")


In [None]:
# ais_clean_files_2024 = glob.glob("data/2024/*.parquet")
ais_clean_files_2024 = []
ais_clean_files_2025 = glob.glob("data/2025/*.parquet")

for file in (ais_clean_files_2024 + ais_clean_files_2025):
    _, _, filename = file.split("/")

    daily = (
        pl
        .scan_parquet(file)
        .filter(
            pl.col("vessel_type") // 10 == 7
        )
        .group_by(pl.col("mmsi"))
        .agg(
            pl.col("base_date_time").str.to_datetime().min().dt.date().alias("date"),
            pl.col("vessel_name").first(),
            pl.col("longitude").mean(),
            pl.col("latitude").mean(),
            pl.col("sog").max(),
        )
        .filter(pl.col("sog") < 0.5)
        .collect()
        .write_parquet(f"data/dailies/{filename}")
    )


**Load all daily files and run DBSCAN to find clusters. Plot the centroid of each cluster.**
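
A minimal sketch of the clustering step, assuming scikit-learn is installed and using made-up points instead of the daily files. The choices `eps_km = 5.0` and `min_samples = 3` are illustrative assumptions, not tuned values. With `metric="haversine"`, scikit-learn expects `[latitude, longitude]` in radians and measures `eps` as an angle, so a radius in kilometers is divided by the Earth's radius.

```python
import numpy as np
from sklearn.cluster import DBSCAN

EARTH_RADIUS_KM = 6371.0

# Made-up (latitude, longitude) points: two tight groups and one stray point
points_deg = np.array(
    [
        [29.73, -95.27], [29.74, -95.26], [29.73, -95.25],      # group A
        [33.74, -118.26], [33.75, -118.27], [33.74, -118.25],   # group B
        [45.00, -60.00],                                        # isolated
    ]
)

# Assumed parameters: 5 km neighborhood, at least 3 points per cluster
eps_km = 5.0
model = DBSCAN(eps=eps_km / EARTH_RADIUS_KM, min_samples=3, metric="haversine")
labels = model.fit_predict(np.radians(points_deg))

# Centroid of each cluster as the plain mean of its coordinates
# (label -1 marks noise points and is skipped)
centroids = {
    int(label): points_deg[labels == label].mean(axis=0)
    for label in np.unique(labels)
    if label != -1
}

print(labels)
print(centroids)
```

Averaging latitude and longitude directly is a reasonable approximation for clusters a few kilometers wide. For the real data, the `latitude` and `longitude` columns from the daily files would take the place of `points_deg`.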

## Part 3: Merging data sources

We had about 600 unique cargo ships listed in the data for January 1, 2025.

What if we wanted to see roughly how many stationary cargo ships there were on the first of each month from 2024 to 2025?

In [None]:
import polars as pl

In [None]:
%%time

daily_ships = (
    pl
    .scan_parquet("data/dailies/ais-*-01.parquet")
    .group_by("date")
    .agg(pl.col("mmsi").n_unique())
    .collect()
    .sort(pl.col("date"))
)

In [None]:
daily_ships

In [None]:
daily_ships.to_pandas().plot(x="date", y="mmsi", kind="line")

In [None]:
monthly_locations_df = (
    pl
    .scan_parquet("data/dailies/ais-*-01.parquet")
    .filter(pl.col("sog") < 0.5)
    .select(
        pl.col("mmsi"),
        pl.col("date"),
        pl.col("latitude"),
        pl.col("longitude"),
        pl.col("sog")
    )
    .collect()
    .sort(pl.col("date"))
)

In [None]:
monthly_locations_df.shape

In [None]:
mld = monthly_locations_df.to_pandas()

In [None]:
gdf = gpd.GeoDataFrame(
    mld,  # GeoDataFrame expects a pandas DataFrame, not a Polars one
    geometry=gpd.points_from_xy(mld.longitude, mld.latitude)
)

In [None]:
world = gpd.read_file(get_path("naturalearth.land"))

# We restrict to USA
fig, ax = plt.subplots()
world.clip([-250, -25, 0, 75]).plot(color="white", edgecolor="black", ax=ax)

# We can now plot our ``GeoDataFrame``.
gdf.plot(ax=ax, color="red", alpha=0.05)

ax.set_xlim(-175, -25)
ax.set_ylim(0, 60)

plt.show()

In [None]:
import folium

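# Center the map on the geographic center of the contiguous US (39°50′N, 98°35′W)
m = folium.Map(location=[39.83, -98.58], zoom_start=4)

# Add one small circle marker per stationary ship observation
# (color, weight, and opacity are path options, so they need CircleMarker rather than Marker)
for row in mld.itertuples():
    lat = row.latitude
    lon = row.longitude

    folium.CircleMarker((lat, lon), radius=3, color="blue", weight=5, opacity=0.7).add_to(m)

# Save the map to an HTML file
# You can open this file in any web browser to view the interactive map
m.save("ship_track_map.html")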

print("Interactive map saved to ship_track_map.html")

## Part 3: Merging data

In [None]:
# Data for regression
sp500 = pd.read_parquet("https://rice.box.com/shared/static/3jamp27br4oa0e99fo2wws9g1vwwuhv5.parquet")
prices = pd.read_parquet("https://rice.box.com/shared/static/fnkvb48ml32fsx4iu6ftspbf7yy9thtp.parquet")
tbills = pd.read_parquet("https://rice.box.com/shared/static/lotx7w5bs54if2xcizuwqluiv7n8tee9.parquet").set_index("dt")