# Morning exercises

We will be working with the [Marine Cadastre ship traffic](https://hub.marinecadastre.gov/pages/vesseltraffic) data published by the US Government.

Here's a short description of the data:

> Vessel traffic data, or Automatic Identification System (AIS) data, are collected by the U.S. Coast Guard through an onboard navigation safety device that transmits and monitors the location and characteristics of vessels in U.S. and international waters in real time. The Bureau of Ocean Energy Management, the National Oceanic and Atmospheric Administration, and the U.S. Coast Guard Navigation Center have worked together to repurpose some of the most important records and make these records available to the public. These records are sourced from the U.S. Coast Guard’s national network of AIS receivers called the Nationwide Automatic Identification System. Information such as location, time, vessel type, speed, length, beam, and draft have been extracted from the raw data and prepared for analyses in desktop geographic information system (GIS) software. Note that Marine Cadastre does not have access to live AIS data feeds or more recent data than what is provided on this webpage.

A data dictionary can be found [here](https://coast.noaa.gov/data/marinecadastre/ais/data-dictionary.pdf), but the table is reproduced below:

| # | Field Name | Description | Example | Unit | Valid Domain | Null Allowed | Arrow Type | Bytes | Query |
|---|---|---|---|---|---|---|---|---|---|
| 1 | mmsi | Maritime Mobile Service Identity value | 477220100 | integer | 2^7 + MMDx3 + 4 | N | int32 | 4 | Y |
| 2 | base_date_time | Full UTC date and time | 2017-02-01T05:02 | - | - | N | datetime64[ns] | 8 | Y |
| 3 | longitude | Longitude | -71.04182 | decimal degree | -179.99999 to 179.99999 | N | double | 8 | Y |
| 4 | latitude | Latitude | 42.35137 | decimal degree | -89.99999 to 89.99999 | N | double | 8 | Y |
| 5 | sog | Speed Over Ground | 5.9 | knot | 0 to 99.9 | Y | float | 4 | Y |
| 6 | cog | Course Over Ground | 47.5 | degree NAz | 0 to 359.9 | Y | float | 4 | Y |
| 7 | heading | True Heading | 45 | degree NAz | 0 to 359 | Y | int32 | 4 | - |
| 8 | vessel_name | Name as shown on the station radio license | OOCL Malaysia | alphanumeric | ASCII characters UTF-8 | Y | string | 24 | Y |
| 9 | imo | International Maritime Organization Vessel number | IMO9627980 | alphanumeric | alphanumeric | Y | string | 12 | Y |
| 10 | call_sign | Call sign as assigned by FCC | VRME7 | alphanumeric | alphanumeric | Y | string | 8 | Y |
| 11 | vessel_type | Vessel type as defined in NAIS specifications | 70 | scalar | 1 to 1024* | Y | int32 | 4 | Y |
| 12 | status | Navigation status as defined by the COLREGS | 3 | scalar | 1 to 14* | Y | int32 | 4 | Y |
| 13 | length | Length of vessel (see NAIS specifications) | 71 | meter | 1 to 509 | Y | int32 | 4 | Y |
| 14 | width | Width of vessel (see NAIS specifications) | 12 | meter | 1 to 61 | Y | int32 | 4 | Y |
| 15 | draft | Draft depth of vessel (see NAIS specifications) | 3.5 | meter | 1 to 24 | Y | float | 4 | Y |
| 16 | cargo | Cargo type (see NAIS specification and codes) | 70 | scalar | 1 to 1024* | Y | int32 | 4 | - |
| 17 | transceiver | Class of AIS transceiver | A | character | A \| B | Y | string | 2 | Y |


In [None]:
import glob
import os

import matplotlib.pyplot as plt
import pandas as pd

## Part 1: Exploratory stats

**First, load the January 1, 2025 AIS data by reading the data from the link below**.

If you are in Google Colab, you can read the data from the shared drive instead. Email me at cc257@rice.edu for access.

In [None]:
url = "https://rice.box.com/shared/static/408bvz8janxz57vziii5vqac28irkdrj.zst"

# # If in Google Colab
# 
# from google.colab import drive
# drive.mount("gdrive", force_remount=True)
# # After being granted access, see if you can find files at
# glob.glob("gdrive/Shareddrives/colab_data/raw_ais_data/*")

**How many observations total are there?**

**How many missing values are there in each column?**

**How many unique ships are there?**
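
As a warm-up for the three questions above, here is a minimal sketch on a made-up toy DataFrame (the values are invented for illustration and only reuse column names from the data dictionary). The same calls should carry over to the real data once it is loaded.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the AIS data; values are invented for illustration only
ais = pd.DataFrame({
    "mmsi": [111, 111, 222, 333],
    "sog": [0.0, np.nan, 5.9, 12.1],
    "vessel_type": [70, 70, np.nan, 31],
})

n_obs = len(ais)                        # total number of observations (rows)
missing_per_column = ais.isna().sum()   # count of missing values per column
n_ships = ais["mmsi"].nunique()         # number of distinct MMSI values

print(n_obs)
print(missing_per_column)
print(n_ships)
```

Note that counting unique `mmsi` values treats each MMSI as one ship, which is an assumption worth checking against the vessel names.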

**What column/columns do you think make the best index for this DataFrame? Why?**

**How many of each type of ship are there?**

**Which ship or ships were observed the most times? How many times were they observed? Does that make sense?**

**Which observation was the furthest east? Does this make sense?**

**Which observation was the furthest west? Does this make sense?**

**What percentage of cargo ship observations (ship-minutes) were west of the middle of the US? How many were east of the middle?**

_Note: For this question, define the middle of the US as the geographic center of the contiguous US, located at (39°50′N 98°35′W)._
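
One way to set this up is sketched below on a toy DataFrame. It converts 98°35′W to decimal degrees and takes the mean of a boolean mask. Treating `vessel_type` codes 70 to 79 as cargo is an assumption based on the standard AIS ship-type ranges, so check it against the vessel type codes you decide to use.

```python
import pandas as pd

# 98°35'W in decimal degrees (west longitudes are negative)
MIDDLE_LON = -(98 + 35 / 60)

# Toy observations; values are invented for illustration only
ais = pd.DataFrame({
    "mmsi": [1, 1, 2, 3, 4],
    "longitude": [-122.4, -122.5, -71.0, -90.1, -118.2],
    "vessel_type": [70, 70, 71, 79, 31],
})

# Assumption: AIS ship-type codes 70-79 denote cargo vessels
cargo = ais[ais["vessel_type"].between(70, 79)]

# The mean of a boolean Series is the share of True values
pct_west = (cargo["longitude"] < MIDDLE_LON).mean() * 100
pct_east = 100 - pct_west

print(f"{pct_west:.1f}% west, {pct_east:.1f}% east")
```

Only the longitude matters for an east/west split, so the latitude of the midpoint is not used here.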

**What percentage of cargo ship observations (ship-minutes) had a speed of less than 0.5 knots?**

**How many times did each ship appear in the dataset? Plot a histogram or violin plot of these values.**

## Part 2: Split-Apply-Combine

**Write a function that takes a single ship's data and calculates the distance that it moved that day. Then apply that function to create a DataFrame that has each ship's `mmsi` and net distance traveled.**

_Hint_: I've included some sample code below for calculating the distance between two points. If you use this style of approach, please note that `.agg` won't work here because you will need multiple columns from your DataFrame.

_Hint_: An alternative would be to use `geopandas`. Since we didn't talk about it, you should see whether you can work with an LLM to get distance code using `geopandas`!

In [None]:
from pyproj import Geod, Proj

# Geodesic calculator on the WGS84 ellipsoid (works directly on lon/lat)
geod = Geod(ellps="WGS84")
# Projection from lon/lat to planar (x, y) coordinates in meters
proj = Proj("ESRI:102005")

austin_lat, austin_lon = 30.26, -97.74
dc_lat, dc_lon = 38.90, -77.03

# Approach 1: geodesic distance. Note that `inv` takes longitude before latitude
forward_azimuth, back_azimuth, distance_meters = geod.inv(austin_lon, austin_lat, dc_lon, dc_lat)

print(f"The distance from Austin to Washington DC is {distance_meters/1000} km")

# Approach 2: project both points to (x, y), then use the Euclidean distance
austin_x, austin_y = proj(austin_lon, austin_lat)
dc_x, dc_y = proj(dc_lon, dc_lat)
print(f"Austin is located at {austin_x, austin_y}")
print(f"Washington DC is located at {dc_x, dc_y}")
print(f"Distance is {((austin_x - dc_x)**2 + (austin_y - dc_y)**2)**0.5 / 1000} km")
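
To show the split-apply-combine pattern that the first hint describes, here is a rough sketch on toy data. It makes two assumptions that you may want to change: "net distance" is taken to mean the distance between a ship's first and last observed positions, and a spherical haversine formula stands in for `geod.inv` so that the example runs with only `numpy` and `pandas`. The helper names `haversine_km` and `net_distance_km` are made up for this illustration.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius (spherical approximation)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


def net_distance_km(ship):
    """Distance between one ship's first and last observed positions."""
    ship = ship.sort_values("base_date_time")
    first, last = ship.iloc[0], ship.iloc[-1]
    return haversine_km(first["latitude"], first["longitude"],
                        last["latitude"], last["longitude"])


# Toy observations; values are invented for illustration only
ais = pd.DataFrame({
    "mmsi": [111, 111, 111, 222, 222],
    "base_date_time": pd.to_datetime([
        "2025-01-01 00:00", "2025-01-01 12:00", "2025-01-01 23:00",
        "2025-01-01 01:00", "2025-01-01 02:00",
    ]),
    "latitude": [29.0, 29.5, 30.0, 40.0, 40.0],
    "longitude": [-94.0, -94.0, -94.0, -74.0, -74.0],
})

# Split by ship, apply the function to each sub-DataFrame, combine the results
distances = (
    ais.groupby("mmsi")[["base_date_time", "latitude", "longitude"]]
    .apply(net_distance_km)
    .rename("net_distance_km")
    .reset_index()
)
print(distances)
```

`.apply` is used rather than `.agg` because the function needs the latitude, longitude, and timestamp columns together. Swapping `haversine_km` for a call to `geod.inv` should give ellipsoidal distances instead.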

**Create a DataFrame that has only the first location of the day for each cargo ship that moved less than 0.5 km.**

**Plot all of the `(latitude, longitude)` pairs for all of the cargo ships that have moved less than 0.5 km on a map. What do you notice?**

**Load all daily files, convert the longitude/latitude data to `(x, y)` data, and then perform DBSCAN to find clusters. Plot the centroids of each cluster.**
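
If you have not used DBSCAN before, the sketch below shows the clustering and centroid steps on a handful of made-up `(x, y)` points in meters. It assumes `scikit-learn` is installed. The `eps` and `min_samples` values are arbitrary choices for the toy data, not recommendations for the ship data.

```python
import pandas as pd
from sklearn.cluster import DBSCAN

# Toy projected coordinates in meters: two tight groups plus one isolated point
points = pd.DataFrame({
    "x": [0.0, 100.0, 50.0, 10_000.0, 10_100.0, 10_050.0, 50_000.0],
    "y": [0.0, 0.0, 80.0, 10_000.0, 10_000.0, 10_080.0, 50_000.0],
})

# eps is the neighborhood radius (same units as x and y);
# min_samples counts the point itself
model = DBSCAN(eps=500, min_samples=3)
points["cluster"] = model.fit_predict(points[["x", "y"]])

# DBSCAN labels noise points as -1, so drop them before averaging
clustered = points[points["cluster"] != -1]
centroids = clustered.groupby("cluster")[["x", "y"]].mean()
print(centroids)
```

Because `eps` is a distance, clustering on projected `(x, y)` coordinates keeps it interpretable in meters, which is one reason to convert from longitude/latitude first.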

**We had about 600 unique cargo ships listed in the data for January 1, 2025. What if we wanted to look at roughly how many static ships there were on the first of each month from 2024 to 2025?**

## Part 3: Merging data

Your task is to take `sp500`, `prices`, and `tbills` and merge them into a single dataset.

You should end up with a DataFrame that has the following columns:

* `ticker`
* `gics`
* `gics_subindustry`
* `dt`
* `price`
* `monthly_return`
* `risk_free_rate`


In [None]:
# Data for regression
sp500 = pd.read_parquet("https://rice.box.com/shared/static/3jamp27br4oa0e99fo2wws9g1vwwuhv5.parquet")
prices = pd.read_parquet("https://rice.box.com/shared/static/fnkvb48ml32fsx4iu6ftspbf7yy9thtp.parquet")
tbills = pd.read_parquet("https://rice.box.com/shared/static/lotx7w5bs54if2xcizuwqluiv7n8tee9.parquet").set_index("dt")

In [None]:
sp500.head()

In [None]:
prices.head()

In [None]:
tbills.head()
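
The general shape of the merge might look like the sketch below, which uses small made-up frames instead of the real files. The input column names are assumptions inferred from the target column list, and the sketch also assumes that `prices` holds one row per ticker per month and that `monthly_return` has to be computed from `price`. Check both against the `.head()` output above before reusing any of it.

```python
import pandas as pd

# Toy stand-ins for the real frames; all values and labels are placeholders
sp500 = pd.DataFrame({
    "ticker": ["AAA", "BBB"],
    "gics": ["Sector A", "Sector B"],
    "gics_subindustry": ["Subindustry A1", "Subindustry B1"],
})
prices = pd.DataFrame({
    "ticker": ["AAA", "AAA", "BBB", "BBB"],
    "dt": pd.to_datetime(["2024-01-31", "2024-02-29"] * 2),
    "price": [100.0, 110.0, 50.0, 45.0],
})
tbills = pd.DataFrame({
    "dt": pd.to_datetime(["2024-01-31", "2024-02-29"]),
    "risk_free_rate": [0.004, 0.0041],
}).set_index("dt")

# Month-over-month return within each ticker (first month is NaN)
prices = prices.sort_values(["ticker", "dt"])
prices["monthly_return"] = prices.groupby("ticker")["price"].pct_change()

# Attach company info by ticker, then the risk-free rate by date.
# `join(..., on="dt")` matches the `dt` column against the index of `tbills`.
merged = prices.merge(sp500, on="ticker", how="left").join(tbills, on="dt")

merged = merged[[
    "ticker", "gics", "gics_subindustry", "dt",
    "price", "monthly_return", "risk_free_rate",
]]
print(merged)
```

Left joins keep every price row even when a ticker or date has no match, which makes missing matches show up as `NaN` rather than silently dropping rows.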