# Data Analysis
In this Jupyter Notebook, we analyze marine heatwave (MHW) events from the last 40 years of the CESM-LE simulations. In the preceding notebook, savingensembleruns_last40years.ipynb, we ran Ocetrac on the 100 CESM-LE simulations with a radius of 3.

### Loading in packages

In [1]:
##### LOADING IN PACKAGES #--------------------------------------------------------------
import s3fs; import xarray as xr; import numpy as np
import pandas as pd; 
import dask.array as da
import ocetrac

import matplotlib.pyplot as plt; import cartopy.crs as ccrs

import warnings; import expectexception
warnings.filterwarnings('ignore')

import netCDF4 as nc; import datetime as dt
import scipy

import intake; import pprint
# Allow multiple lines per cell to be displayed without print (default is just last line)
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
# Enable more explicit control of DataFrame display (e.g., to omit annoying line numbers)
from IPython.display import HTML

### Loading marine heatwave event files and SST anomaly files

In [2]:
# Open original collection description file #----------------------------------------------
cat_url_orig = '/glade/collections/cmip/catalog/intake-esm-datastore/catalogs/glade-cesm2-le.json'
coll_orig = intake.open_esm_datastore(cat_url_orig)

In [3]:
subset = coll_orig.search(component='atm',variable='SST',frequency='month_1',experiment='historical')
filenamechange = list(subset.df.member_id.unique())

In [4]:
list_of_xarrays = []
list_of_xarrays_SSTA = []

for i in filenamechange:
    # MHW event labels produced by Ocetrac (radius 3) for this ensemble member
    blobs_path = f'/glade/work/cassiacai/{i}_rad3_blobs.nc'
    list_of_xarrays.append(xr.open_dataset(blobs_path))

    # Detrended SST anomalies for the same ensemble member
    ssta_path = f'/glade/work/cassiacai/{i}_detrended.nc'
    list_of_xarrays_SSTA.append(xr.open_dataset(ssta_path))

In [5]:
%%time
concated_xarray = xr.concat(list_of_xarrays, "new_dim")

CPU times: user 4.17 s, sys: 11.1 s, total: 15.3 s
Wall time: 52.6 s


In [6]:
%%time
concated_xarray_SSTA = xr.concat(list_of_xarrays_SSTA, "new_dim")

CPU times: user 4.12 s, sys: 11 s, total: 15.1 s
Wall time: 31.2 s


### Combining SSTA and MHW event files

In [18]:
combined_xarray = xr.combine_by_coords([concated_xarray, concated_xarray_SSTA])
combined_xarray['SSTA'] = combined_xarray['__xarray_dataarray_variable__']
combined_xarray = combined_xarray.drop_vars(['__xarray_dataarray_variable__'])  # drop_vars replaces the deprecated drop

In [19]:
combined_xarray

### Setting our area of interest

In [9]:
# North Pacific latitude and longitude limits (currently a small area)
lat_lim_less = 10. # southern limit; 30. is the alternative value
lat_lim_great = 60.

lon_lim_less = 200.
lon_lim_great = 250.

In [20]:
%%time
combined_xarray_limited = combined_xarray.where((combined_xarray.lat >= lat_lim_less) & (combined_xarray.lat <= lat_lim_great) 
                        &(combined_xarray.lon >= lon_lim_less) & (combined_xarray.lon <= lon_lim_great),drop=True)
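The mask-based selection above can also be written as a label-based slice. The following is a minimal sketch on a synthetic 1-degree grid (the grid and the variable `ds` are assumptions for illustration, not the CESM-LE data). It assumes `lat` and `lon` are 1-D, monotonically increasing coordinates, as the `combined_xarray.lat` and `combined_xarray.lon` comparisons suggest.

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for combined_xarray (hypothetical 1-degree global grid)
lat = np.arange(-89.5, 90.0, 1.0)
lon = np.arange(0.5, 360.0, 1.0)
rng = np.random.default_rng(0)
ds = xr.Dataset(
    {"SSTA": (("lat", "lon"), rng.normal(size=(lat.size, lon.size)))},
    coords={"lat": lat, "lon": lon},
)

lat_lim_less, lat_lim_great = 10.0, 60.0
lon_lim_less, lon_lim_great = 200.0, 250.0

# Mask-based selection, mirroring the cell above
by_where = ds.where(
    (ds.lat >= lat_lim_less) & (ds.lat <= lat_lim_great)
    & (ds.lon >= lon_lim_less) & (ds.lon <= lon_lim_great),
    drop=True,
)

# Label-based selection; slices are inclusive of both end points
by_sel = ds.sel(
    lat=slice(lat_lim_less, lat_lim_great),
    lon=slice(lon_lim_less, lon_lim_great),
)

# Both approaches should return the same box of values
same = bool(np.allclose(by_where["SSTA"].values, by_sel["SSTA"].values))
```

For a rectangular box, `sel` with slices avoids building a full boolean mask. `where(..., drop=True)` remains the more general choice when the region is not rectangular.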

### Pre-processing 
#### Understanding what we are working with. How many MHW events are we working with?

In [40]:
%%time 
no_mhw_counts = [] # how many MHW events globally?
for i in range(0,100):
    member_ = combined_xarray.isel(new_dim = i)
    event_member_ = member_.groupby(member_.labels)
    no_mhw_counts.append(len(event_member_))

CPU times: user 6min 51s, sys: 2min 39s, total: 9min 30s
Wall time: 12min 55s
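Since the loop above only needs the number of distinct labels, a cheaper route may be to count unique label values directly rather than building groupby objects. This is a minimal sketch on a tiny synthetic array. `count_events` is a hypothetical helper, and it assumes cells without an event are NaN (if the background is 0 instead, that value would need filtering too).

```python
import numpy as np

def count_events(labels):
    """Count distinct event labels, ignoring NaN background cells."""
    values = np.asarray(labels, dtype=float)
    return int(np.unique(values[~np.isnan(values)]).size)

# Tiny synthetic (time, lat, lon) label field; NaN means no event
labels = np.full((3, 4, 4), np.nan)
labels[0, 0, :2] = 1.0   # event 1 at the first time step
labels[1, 0, :2] = 1.0   # event 1 persists
labels[1, 3, 3] = 2.0    # event 2 appears
labels[2, 2, 2] = 5.0    # event 5 (labels need not be consecutive)

n_events = count_events(labels)
```

Applied to this notebook, the call would look something like `count_events(combined_xarray.labels.isel(new_dim=i).values)`, which loads one member's labels into memory at a time.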


In [50]:
# saving it here because the above code takes about 13 minutes to run
no_mhw_counts = [370, 349, 391, 360, 338, 330, 382, 354, 388, 323, 370, 173, 
                 422, 361, 359, 369, 372, 344, 358, 384, 428, 378, 337, 387, 
                 337, 380, 389, 396, 364, 319, 374, 343, 360, 398, 367, 372, 
                 381, 361, 397, 384, 387, 434, 359, 319, 377, 407, 336, 348, 
                 368, 340, 330, 423, 327, 403, 321, 380, 430, 374, 387, 360, 
                 377, 344, 390, 396, 381, 389, 363, 405, 388, 351, 369, 397, 
                 426, 325, 433, 351, 385, 406, 359, 383, 385, 387, 339, 366, 
                 430, 338, 366, 353, 442, 399, 389, 406, 374, 375, 453, 402, 
                 419, 415, 397, 401]

In [54]:
print('mean: ', np.nanmean(no_mhw_counts))
print('std:  ', np.std(no_mhw_counts))

mean:  374.13
std:   35.96877951779849
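One member stands out in the saved list: index 11 has 173 events, far below the ensemble mean. A quick z-score check, sketched below with the counts copied from the cell above, flags it. The threshold of 3 is an arbitrary choice for illustration. Note that `np.std` defaults to the population standard deviation (`ddof=0`).

```python
import numpy as np

no_mhw_counts = [370, 349, 391, 360, 338, 330, 382, 354, 388, 323, 370, 173,
                 422, 361, 359, 369, 372, 344, 358, 384, 428, 378, 337, 387,
                 337, 380, 389, 396, 364, 319, 374, 343, 360, 398, 367, 372,
                 381, 361, 397, 384, 387, 434, 359, 319, 377, 407, 336, 348,
                 368, 340, 330, 423, 327, 403, 321, 380, 430, 374, 387, 360,
                 377, 344, 390, 396, 381, 389, 363, 405, 388, 351, 369, 397,
                 426, 325, 433, 351, 385, 406, 359, 383, 385, 387, 339, 366,
                 430, 338, 366, 353, 442, 399, 389, 406, 374, 375, 453, 402,
                 419, 415, 397, 401]

counts = np.asarray(no_mhw_counts)

# Standardize the counts using the ensemble mean and population std
z = (counts - counts.mean()) / counts.std()

# Indices of members more than 3 standard deviations from the mean
flagged = np.flatnonzero(np.abs(z) > 3).tolist()
```

A flagged member is not necessarily wrong, but it may be worth checking whether its input file covers the full 40 years before including it in ensemble statistics.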


In [47]:
%%time 
# How many distinct MHW events appear at any point in our region of interest?
no_mhw_counts_lim_reg = []
for i in range(0,100):
    member_ = combined_xarray_limited.isel(new_dim = i)
    event_member_ = member_.groupby(member_.labels)
    no_mhw_counts_lim_reg.append(len(event_member_))

CPU times: user 17.7 s, sys: 675 ms, total: 18.4 s
Wall time: 27.3 s


In [None]:
# saving it here as well although the above cell takes less than 30 s to run
no_mhw_counts_lim_reg = [58, 47, 46, 41, 35, 40, 51, 54, 51, 42, 51, 19, 44, 
                         47, 48, 35, 44, 43, 47, 50, 54, 42, 42, 48, 33, 50, 
                         56, 50, 50, 26, 49, 42, 57, 50, 48, 44, 51, 42, 50, 
                         63, 48, 51, 44, 35, 46, 51, 40, 48, 46, 50, 42, 58, 
                         39, 47, 39, 44, 48, 45, 44, 44, 55, 45, 48, 47, 58, 
                         46, 49, 51, 48, 38, 56, 51, 51, 38, 51, 31, 39, 51, 
                         51, 53, 53, 45, 40, 42, 49, 40, 48, 46, 63, 46, 54, 
                         41, 53, 44, 66, 61, 55, 64, 55, 47]

In [55]:
print('mean: ', np.nanmean(no_mhw_counts_lim_reg))
print('std:  ', np.std(no_mhw_counts_lim_reg))

mean:  47.18
std:   7.559603164187919


In [65]:
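# Duration (number of time steps) of each event.
# Note: event_member_ is whatever the most recently run loop above left behind (its last member).
lens_of_events = []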
for l, ent in event_member_:
    groupedby_by_time = ent.groupby(ent.time)
    lens_of_events.append(len(groupedby_by_time))
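The duration calculation above can be sketched without nested groupby calls by walking through the time axis once. `event_durations` is a hypothetical helper, shown on a tiny synthetic label field, and it again assumes NaN marks cells without an event.

```python
import numpy as np

def event_durations(labels):
    """Map each event label to the number of time steps in which it appears.

    `labels` has shape (time, lat, lon); NaN marks cells without an event.
    """
    durations = {}
    for step in labels:  # one 2-D slice per time step
        present = np.unique(step[~np.isnan(step)])
        for lab in present:
            durations[int(lab)] = durations.get(int(lab), 0) + 1
    return durations

# Tiny synthetic (time, lat, lon) label field
labels = np.full((3, 4, 4), np.nan)
labels[0, 0, :2] = 1.0   # event 1 at the first time step
labels[1, 0, :2] = 1.0   # event 1 persists for a second step
labels[1, 3, 3] = 2.0    # event 2 lasts one step
labels[2, 2, 2] = 5.0    # event 5 lasts one step

durations = event_durations(labels)
```

With monthly data, each count corresponds to the number of months in which the event is present.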

In [130]:
%%time

events_full = []
for i in range(0,1): # only the first ensemble member for now
    member_ = combined_xarray.isel(new_dim = i)
    event_member_ = member_.groupby(member_.labels)
    
    ent_full = []  
    for l, ent in event_member_:
        groupedby_by_time = ent.groupby(ent.time)
        
        gro_full = []
        for n, gro in groupedby_by_time:
            gro_full.append(gro)
        
        ent_full.append(gro_full)
    events_full.append(ent_full)

CPU times: user 7.19 s, sys: 1.63 s, total: 8.82 s
Wall time: 8.99 s


In [146]:
%%time

concated_on_time_full = []

for i in range(len(events_full[0])):
    concated_on_time = xr.concat(events_full[0][i], "time_dim")
    concated_on_time_full.append(concated_on_time)

## ANALYSIS
---------------

### Trajectory clustering 
Goal: track how the center of mass of an MHW moves.
1. https://towardsdatascience.com/gps-trajectory-clustering-with-python-9b0d35660156 
2. Clustering Moving Object Trajectories: Integration in CROSS-CPP Analytic Toolbox
3. [Continuous Clustering of Moving Objects](https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.65.8587&rep=rep1&type=pdf)
4. Clustering gridded data shapes
5. [Comparing trajectory clustering methods (Github repo)](https://github.com/seljukgulcan/comparing-trajectory-clustering-methods)
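Before clustering trajectories, each event needs a track. One possible way to build it is an SSTA-weighted centroid per time step, sketched below on synthetic arrays. `centroid_track` is a hypothetical helper, not part of Ocetrac. The sketch assumes positive SSTA inside the event, ignores grid-cell area (no cos(lat) weighting), and does not handle events that cross the 0/360 longitude seam.

```python
import numpy as np

def centroid_track(labels, ssta, lat, lon, event_id):
    """Return [(time_index, lat_centroid, lon_centroid), ...] for one event.

    labels, ssta: arrays of shape (time, lat, lon); lat, lon: 1-D coordinates.
    """
    track = []
    for t in range(labels.shape[0]):
        mask = labels[t] == event_id      # NaN cells compare as False
        if not mask.any():
            continue                      # event absent at this time step
        ii, jj = np.nonzero(mask)         # row (lat) and column (lon) indices
        w = ssta[t][mask]                 # intensity weights, same cell order
        lat_c = np.sum(w * lat[ii]) / np.sum(w)
        lon_c = np.sum(w * lon[jj]) / np.sum(w)
        track.append((t, float(lat_c), float(lon_c)))
    return track

# Synthetic example: event 1 moves north-east between two time steps
lat = np.array([10.0, 20.0, 30.0, 40.0])
lon = np.array([200.0, 210.0, 220.0, 230.0])
labels = np.full((2, 4, 4), np.nan)
ssta = np.zeros((2, 4, 4))
labels[0, 0, 0:2] = 1.0
ssta[0, 0, 0:2] = [1.0, 1.0]
labels[1, 1, 1:3] = 1.0
ssta[1, 1, 1:3] = [1.0, 3.0]

track = centroid_track(labels, ssta, lat, lon, event_id=1)
```

The resulting list of (time, lat, lon) points is the kind of input the trajectory clustering methods listed above typically expect.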

### Self-organizing maps
To do: understand what self-organizing maps are.
1. [Beginners Guide to Anomaly Detection Using Self-Organizing Maps](https://www.analyticsvidhya.com/blog/2021/09/beginners-guide-to-anomaly-detection-using-self-organizing-maps/)
2. https://pypi.org/project/sklearn-som/

### Spatial clustering
1. [Spatial Clustering Methods in Data Mining: A Survey](https://www.comp.nus.edu.sg/~atung/publication/gkdbk01.pdf)
2. [Spatial clustering of summer temperature maxima from the CNRM-CM5 climate model ensembles & E-OBS over Europe](https://www.sciencedirect.com/science/article/pii/S2212094715300013)

### Convolutional Neural Networks
1. https://www.analyticsvidhya.com/blog/2021/05/convolutional-neural-networks-cnn/ 
2. [Predicting clustered weather patterns: A test case for applications of convolutional neural networks to spatio-temporal climate data](https://www.nature.com/articles/s41598-020-57897-9)

### Image clustering implementation
1. [Image Clustering Implementation with PyTorch](https://towardsdatascience.com/image-clustering-implementation-with-pytorch-587af1d14123)
2. [How to cluster images based on visual similarity](https://towardsdatascience.com/how-to-cluster-images-based-on-visual-similarity-cd6e7209fe34)

### Other clustering resources
1. [dpsom](https://github.com/ratschlab/dpsom): Code associated with ACM-CHIL 21 paper 'T-DPSOM - An Interpretable Clustering Method for Unsupervised Learning of Patient Health States'

### tslearn
tslearn is a Python package that provides machine learning tools for the analysis of time series. It builds on (and hence depends on) the scikit-learn, numpy and scipy libraries.
- https://tslearn.readthedocs.io/en/stable/gen_modules/clustering/tslearn.clustering.TimeSeriesKMeans.html
- https://tslearn.readthedocs.io/en/stable/index.html
- https://github.com/tslearn-team/tslearn/
- https://tslearn.readthedocs.io/en/stable/user_guide/clustering.html

#### Other time series clustering sources
- [Time Series Clustering and Dimensionality Reduction](https://towardsdatascience.com/time-series-clustering-and-dimensionality-reduction-5b3b4e84f6a3)
- [Deep Time-Series Clustering: A Review](https://doi.org/10.3390/electronics10233001) (Alqahtani, A.; Ali, M.; Xie, X.; Jones, M.W. Electronics 2021, 10, 3001)

### Other Papers / Resources
1. [Predicting climate types for the Continental United States using unsupervised clustering techniques](https://ds153.github.io/files/environmetrics_ds.pdf)
2. [An unsupervised learning approach to identifying blocking events: the case of European summer](https://wcd.copernicus.org/preprints/wcd-2021-1/wcd-2021-1-ATC1.pdf)