# Hurricane Clustering and ENSO/NAO Index

## Background

This project is based on the paper by Nakamura, Jennifer, et al., "Classifying North Atlantic tropical cyclone tracks by mass moments," Journal of Climate 22.20 (2009): [5481-5494](https://journals.ametsoc.org/configurable/content/journals$002fclim$002f22$002f20$002f2009jcli2828.1.xml?t:ac=journals%24002fclim%24002f22%24002f20%24002f2009jcli2828.1.xml).

In this paper, Nakamura et al. group North Atlantic hurricanes by geographic variables (the mean and variance of storm tracks) using K-means clustering. They find that INCLUDE MAIN FINDINGS OF THE PAPER HERE.

Our group wanted to extend this analysis by investigating the possible connections between large-scale, interannual climate processes and North Atlantic hurricane tracks. More specifically, we wondered if the clusters presented by Nakamura et al. (2009) would contain any information about ENSO/NAO cycles. SENTENCE ABOUT ENSO/NAO HERE. CONNECT ENSO/NAO BACK TO HURRICANE FORMATION.

Our goal in this notebook is therefore to test the following hypothesis:

<p style="text-align: center;"><b>Including PDI (wind speed?) in our hurricane clusters will improve their correlation with (ENSO, NAO?) index.</b></p>

Testing this hypothesis led us to a number of related inquiries, which are listed below and correspond to the headings of this notebook.
1) **Which time span should we use for clustering?** The storm dataset we use (described below) dates back to 1851. Its reliability increases over time, however: the number of ground-based observations grows, and the advent of satellite observations around 1980 provides more robust remote-sensing data. We must therefore balance giving the algorithm high-quality data against giving it a high volume of data.
2) **Which clustering algorithm should we use?** There are many different algorithms that cluster data. We will test three different methods (K-means, DBSCAN, and GMM) and compare the results.
3) **What inputs should be given to the algorithm?** The paper clusters only on spatial information about a storm track (longitudinal mean, latitudinal mean, longitudinal variance, latitudinal variance, and the covariance of latitude and longitude). We will test whether adding another metric to the clustering algorithm (PDI? wind speed?) will change the results.
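
The inputs named in item 3 can be sketched as follows. This is a minimal illustration, not the paper's exact procedure: it assumes unweighted moments taken over the recorded track points, and `track_moments` and `power_dissipation_index` are hypothetical helper names, not functions from `func_tools`. The PDI helper assumes wind speeds sampled at a fixed time step `dt` and approximates the time integral of the cubed maximum wind speed with a simple sum; units are left to the caller.

```python
import numpy as np

def track_moments(lat, lon):
    """Return [mean lon, mean lat, var lon, var lat, cov(lon, lat)] for one track."""
    lat = np.asarray(lat, dtype=float)
    lon = np.asarray(lon, dtype=float)
    # drop missing points (assumption: tracks are NaN-padded to a common length)
    valid = ~(np.isnan(lat) | np.isnan(lon))
    lat, lon = lat[valid], lon[valid]
    lon_mean, lat_mean = lon.mean(), lat.mean()
    lon_var = np.mean((lon - lon_mean) ** 2)
    lat_var = np.mean((lat - lat_mean) ** 2)
    cov = np.mean((lon - lon_mean) * (lat - lat_mean))
    return np.array([lon_mean, lat_mean, lon_var, lat_var, cov])

def power_dissipation_index(wind, dt):
    """Approximate the integral of wind**3 over time as sum(wind**3) * dt."""
    wind = np.asarray(wind, dtype=float)
    # nansum skips missing wind observations
    return float(np.nansum(wind ** 3) * dt)

# toy track with made-up values (not taken from IBTrACS)
lat = [10.0, 12.0, 14.0, np.nan]
lon = [-50.0, -52.0, -54.0, np.nan]
wind = [30.0, 40.0, np.nan, np.nan]

features = track_moments(lat, lon)
pdi = power_dissipation_index(wind, dt=3.0)
print(features, pdi)
```

Stacking `features` (optionally with `pdi` appended) for every storm would give the per-storm feature matrix that a clustering algorithm expects.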

After addressing each of these points, we will conclude the notebook with a section that addresses our hypothesis directly by investigating the connections between hurricane clusters and the ENSO/NAO index.

## Load packages, functions, and data

In [6]:
# import all necessary packages

import cartopy.crs as ccrs # used for map projection
import matplotlib.pyplot as plt # matplotlib
import cartopy.feature as cfeature # used for map projection
import xarray as xr # x-array
import numpy as np # numpy
import urllib.request # download request
import warnings # to suppress warnings
import gender_guesser.detector as gender # for analyzing the names of hurricanes
from numpy import linalg as LA # to plot the moments (by calculating the eigenvalues)
from sklearn.cluster import k_means # to perform k-means
from collections import Counter # set operations
warnings.filterwarnings('ignore')

In [11]:
# import functions our group has written

from func_tools import time_functions as tf      #making time selections in the data
from func_tools import helper_functions as hf    #clustering algorithms
from func_tools import plotting_functions as pf  #making plots

In [8]:
#check if a data directory exists
#if not, create a new directory

import os
cwd = os.getcwd()

cwd_data = os.path.join(cwd, 'data')

# exist_ok=True makes this a no-op when the directory already exists
os.makedirs(cwd_data, exist_ok=True)

The hurricane track data we use for this project is [IBTrACS](https://www.ncei.noaa.gov/products/international-best-track-archive). The cell below downloads the file at the specified URL and saves it in the data folder created above.

In [9]:
# Download the needed track file

url = 'https://www.ncei.noaa.gov/data/international-best-track-archive-for-climate-stewardship-ibtracs/v04r00/access/netcdf/IBTrACS.NA.v04r00.nc'

# the context manager closes the connection once the file has been read
with urllib.request.urlopen(url) as filedata:
    datatowrite = filedata.read()

with open('data/NA_data.nc', 'wb') as f:
    f.write(datatowrite)

In [12]:
#load this file as an xarray dataset

tks = xr.open_dataset('data/NA_data.nc', engine="netcdf4", decode_times=False)
tks

## Time selection

- Describe which time horizon we've selected for our analysis and why. Hopefully this can include a quantitative metric, like the correlation score (?)

In [None]:
# converts the time coordinate to datetime64 format
tks = tf.to_datetime(tks)

# adds a new coordinate origin_year which is an integer of the year the storm formed
tks = tf.add_origin_year(tks)

# selects only the storms that originated in the years from 1950 to 2023
# (assumes select_years takes the dataset as its first argument, like the helpers above)
tks_1950_2023 = tf.select_years(tks, 1950, 2023)

## Clustering algorithm

- Describe which clustering algorithms we've compared. It would be good to include a brief description of how each algorithm computes clusters.
- Include some kind of quantitative metric, like correlation score (?) to justify final decision
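
As a starting point for this comparison, the sketch below runs the three candidate algorithms on the same feature matrix with scikit-learn. Everything specific here is an assumption for illustration: the data are synthetic blobs standing in for the per-storm features, the number of clusters and the DBSCAN `eps` value are chosen to suit that toy data, and the silhouette score is only one possible metric, not the one this notebook has settled on. In brief, K-means assigns each point to the nearest of `k` centroids, DBSCAN grows clusters from dense neighborhoods and labels sparse points as noise (`-1`), and a Gaussian mixture model fits `k` Gaussian components and assigns each point to the most probable one.

```python
import numpy as np
from sklearn.cluster import KMeans, DBSCAN
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# synthetic stand-in for the storm features: 3 well-separated blobs in 5 dimensions
centers = np.array([[0.0] * 5, [10.0] * 5, [-10.0] * 5])
X = np.vstack([c + rng.normal(scale=1.0, size=(50, 5)) for c in centers])

# standardize so that no single feature dominates the distance computations
X_scaled = StandardScaler().fit_transform(X)

labels = {
    "K-means": KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled),
    "DBSCAN": DBSCAN(eps=1.0, min_samples=5).fit_predict(X_scaled),
    "GMM": GaussianMixture(n_components=3, random_state=0).fit_predict(X_scaled),
}

scores = {}
for name, lab in labels.items():
    n_found = len(set(lab) - {-1})  # DBSCAN marks noise points with -1
    # the silhouette score is only defined when there are at least two labels
    scores[name] = silhouette_score(X_scaled, lab) if len(set(lab)) > 1 else float("nan")
    print(f"{name}: {n_found} clusters, silhouette = {scores[name]:.2f}")
```

Note that DBSCAN does not take the number of clusters as an input, so its result depends on `eps` and `min_samples`, which would need tuning on the real track features.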

## Clustering inputs

- Include the formulas for the 5 track-based moment values and PDI/wind speed
- Based on the chosen clustering algorithm, compute clusters with (1) 5 track-based moments, (2) 5 track-based moments AND PDI/wind speed
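
As a draft of the formulas, the five track-based inputs can be written in a discrete, unweighted form over the $n$ recorded points $(x_i, y_i)$ of a track, with $x$ as longitude and $y$ as latitude. This form is an assumption made for illustration; the paper defines these quantities as mass moments of the track, which may include a weighting. The last line gives the usual definition of the power dissipation index as the time integral of the cubed maximum wind speed $v_{\max}$ over the storm lifetime $\tau$, approximated by a sum over observations spaced $\Delta t$ apart.

```latex
\begin{align}
\bar{x} &= \frac{1}{n}\sum_{i=1}^{n} x_i, &
\bar{y} &= \frac{1}{n}\sum_{i=1}^{n} y_i, \\
\sigma_x^2 &= \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2, &
\sigma_y^2 &= \frac{1}{n}\sum_{i=1}^{n} (y_i - \bar{y})^2, \\
\sigma_{xy} &= \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}), \\
\mathrm{PDI} &= \int_{0}^{\tau} v_{\max}^{3}\,dt \;\approx\; \sum_{i=1}^{n} v_{\max,i}^{3}\,\Delta t.
\end{align}
```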

## ENSO/NAO Index

- Describe what ENSO/NAO index measures and what positive and negative values mean.
- Cite the dataset where these values are pulled from.
- Use this as an evaluation metric: is there an improvement in cluster correlation with ENSO/NAO index when PDI/wind speed is included in the clustering algorithm?
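
One way to turn this into a number is to count, for each year, the storms that fall in a given cluster and correlate that annual series with the annual index. The sketch below does this with a Pearson correlation. It is only an illustration under stated assumptions: the helper names `annual_cluster_counts` and `pearson_r` are hypothetical, and the years, cluster labels, and index values are invented toy data, not real ENSO or NAO values.

```python
import numpy as np

def annual_cluster_counts(years, labels, cluster, year_range):
    """Count the storms assigned to `cluster` in each year of `year_range`."""
    years = np.asarray(years)
    labels = np.asarray(labels)
    return np.array([np.sum((years == y) & (labels == cluster)) for y in year_range])

def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length series."""
    return float(np.corrcoef(a, b)[0, 1])

# invented toy data: origin year and cluster label for ten storms
year_range = np.arange(2000, 2006)
storm_years = np.array([2000, 2000, 2001, 2002, 2002, 2002, 2003, 2004, 2004, 2005])
storm_labels = np.array([0, 1, 0, 0, 0, 1, 1, 0, 0, 1])

# invented annual index values, one per year in year_range
index = np.array([-1.0, 0.2, 1.5, -0.8, 1.1, -1.2])

counts0 = annual_cluster_counts(storm_years, storm_labels, 0, year_range)
r = pearson_r(counts0, index)
print(counts0, round(r, 2))
```

Running the same calculation on clusters built with and without PDI/wind speed would give two sets of correlations to compare directly.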

## Summary

- quickly summarize the key contributions of this project: what did we do, what did we find?