# ERA5 Data Preprocessing

---

## Overview
Here, we use the processed IBTrACS data to select the ERA5 environmental variables associated with each cyclone.

## Prerequisites
| Concepts | Importance | Notes |
| --- | --- | --- |
| [Intro to NumPy](https://foundations.projectpythia.org/core/numpy/) | Necessary | |
| [Intro to Pandas](https://foundations.projectpythia.org/core/pandas/) | Necessary | |
| [Intro to Xarray](https://foundations.projectpythia.org/core/xarray/) | Necessary | |
| Project management | Helpful | |

- **Time to learn**: ~15 minutes


---

## Imports

In [24]:
import xarray as xr 
from dask.distributed import Client
import matplotlib.pyplot as plt
import cartopy.crs as ccrs
import pandas as pd
import glob
from global_land_mask import globe
import cartopy.feature as cfeature
from matplotlib.path import Path
import matplotlib.patches as patches
from matplotlib import patheffects
import numpy as np
import dask

## Edit and pad ERA5 data

In this section, we select ERA5 data within a 5x5 latitude/longitude grid centered on the cyclone at each time step in our dataset. We then pad the data to account for grid cells that fall over land.

In [25]:
input_dsets = xr.open_dataset('~/Data/final_proc_5yr_6h.nc')

:::{hint}
The Coriolis parameter depends only on latitude. Cyclones tend to move in preferred directions depending on their latitude, and therefore on the magnitude of this parameter, which is why we include it as one of the predictor variables of our AI model. It is calculated below.
:::

In [26]:
# Coriolis parameter f = 2 * Omega * sin(latitude), with Omega ~ 7.29e-5 rad/s
cor_parms = 2 * 7.29 * 1e-5 * np.sin(np.radians(input_dsets['latitude']))

input_dsets['cor_params'] = xr.DataArray(cor_parms,
                                            name='cor_params'
                                            ).broadcast_like(input_dsets['r'])
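
As a quick sanity check on the formula used above, here is a minimal standalone sketch. The function name `coriolis_parameter` and the sample latitudes are hypothetical, and the sketch uses the commonly quoted rotation rate 7.2921e-5 rad/s rather than the rounded 7.29e-5 in the cell above. It shows that the parameter vanishes at the equator and changes sign between hemispheres.

```python
import numpy as np

OMEGA = 7.2921e-5  # Earth's rotation rate in rad/s (the cell above rounds to 7.29e-5)


def coriolis_parameter(lat_deg):
    """Return f = 2 * Omega * sin(latitude) in s^-1 for latitude in degrees."""
    return 2.0 * OMEGA * np.sin(np.radians(lat_deg))


# Hypothetical sample latitudes in degrees north
lats = np.array([-30.0, 0.0, 15.0, 30.0])
f = coriolis_parameter(lats)

for lat, val in zip(lats, f):
    print(f"lat={lat:6.1f}  f={val: .3e} s^-1")
```

Because the values are on the order of 1e-5 to 1e-4, this predictor is much smaller in magnitude than the others, which may matter if the inputs are not normalized before training.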

In [27]:
ib_data_processed_6h = pd.read_csv('../test_folder/ib_data_processed_6h.csv')

:::{hint}
When training our AI model, we want all cyclones to have the same number of time steps. Real cyclones have different lifespans, so we pad each cyclone track with "dummy" values until it is as long as the longest-lasting cyclone in our dataset.
:::

In [28]:
final_data = []
max_len = ib_data_processed_6h.groupby('id').size().max()  # number of 6-hourly time steps in the longest track
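
The padding idea from the hint can be sketched on a toy track. Everything here is made up for illustration (the `sst` values, the dates, and `toy_max_len`, which stands in for `max_len`); the only technique shown is `reindex` with a `fill_value`, the same call the main loop relies on.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical short track: 3 six-hourly steps of a single variable
times = pd.date_range("2020-01-01", periods=3, freq="6h")
track = xr.Dataset(
    {"sst": ("time", np.array([300.0, 301.0, 302.0]))},
    coords={"time": times},
)

toy_max_len = 5  # stands in for max_len, the length of the longest track

# Build the full 6-hourly time axis starting at the first observed step
full_times = pd.date_range(start=times[0], periods=toy_max_len, freq="6h")

# Time steps that are missing from the track are created and filled with 0.0
padded = track.reindex(time=full_times, fill_value=0.0)
print(padded["sst"].values)
```

Note that the zeros are indistinguishable from real values once written, so a model that needs to ignore the padded steps would have to track the original track length separately.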

## Edit predictors for each cyclone
Here we move on to our second objective: preparing the predictors that the machine learning model will use. We center each cyclone within a 5x5 grid at each time step, then select the data at each grid cell for each variable of interest: sea surface temperature, 500 hPa relative humidity, pressure, vertical wind shear, 850 hPa relative vorticity, and the Coriolis parameter.
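
Before running this on the real data, the box selection can be tried on a toy grid. This sketch assumes a 1-degree grid with latitude stored from north to south (which is what a `slice(latmax, latmin)` selection implies); the grid extent, the random `sst` field, and the cyclone center are all hypothetical.

```python
import numpy as np
import xarray as xr

# Hypothetical 1-degree grid; latitude runs north-to-south
lat = np.arange(40, 9, -1)   # 40, 39, ..., 10
lon = np.arange(250, 301)    # 250, 251, ..., 300
toy = xr.Dataset(
    {"sst": (("latitude", "longitude"), np.random.rand(lat.size, lon.size))},
    coords={"latitude": lat, "longitude": lon},
)

center_lat, center_lon = 25, 280  # hypothetical cyclone center in whole degrees

# Label-based slices include both end points. The latitude slice runs from
# high to low because the coordinate is descending.
box = toy.sel(
    latitude=slice(center_lat + 2, center_lat - 2),
    longitude=slice(center_lon - 2, center_lon + 2),
)
print(dict(box.sizes))
```

On a grid with a different spacing, the same plus or minus 2 degree window would return a different number of cells, and a box that runs off the edge of the domain would return fewer than 5 cells per side.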

In [30]:
for id_number,group in ib_data_processed_6h.groupby('id'):
    events_data = []
    for index,row in group.iterrows():
        lat = int(row['LAT'])
        lon = int(row['LON'])
        time = row['datetime']
        
        #We want data in a 5x5 latitude/longitude grid centered on the cyclone latitude/longitude
        latmin = lat - 2
        latmax = lat + 2
        lonmin = lon - 2
        lonmax = lon + 2
        sel_data = input_dsets.sel(latitude=slice(latmax, latmin), longitude=slice(lonmin, lonmax), time=time)
        
            
        final_xr = sel_data.rename({'latitude': 'y', 'longitude': 'x'})
        final_xr['x'] = np.arange(0,final_xr.sizes['x'])
        final_xr['y'] = np.arange(0,final_xr.sizes['y'])
        
        # fill NaN values with zeros; fillna returns a new object, so assign the result
        final_xr = final_xr.fillna(0)
        
        #Recall that we are trying to predict the wind speed.
        #Hence, our target is USA_WIND
        final_xr['target'] = row['USA_WIND']    
        events_data.append(final_xr)
    
    final_event = xr.concat(events_data,dim='time')
    
    #Pad data with zeros up to the maximum time
    if len(final_event.time) <= max_len:
        new_time = pd.date_range(start=final_event['time'].min().values, periods=max_len ,freq='6h')
        padded_data = final_event.reindex(time=new_time, fill_value=0.0)
    else:
        padded_data = final_event
    
    lead_time = np.arange(0,max_len*6 ,6)
    padded_data['lead'] = ('time', lead_time)
    padded_data = padded_data.assign_coords({'lead': padded_data['lead'].astype(int)})
    
    # swap time and lead dimensions
    padded_data = padded_data.swap_dims({'time': 'lead'})
    padded_data['id'] = id_number 
    padded_data = padded_data.set_coords('id')
    
    # collect the padded event for this cyclone
    final_data.append(padded_data)
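
The switch from `time` to `lead` at the end of the loop can be seen in isolation on a toy dataset. The values and dates below are hypothetical. The point of the sketch is that after `swap_dims`, every storm shares the same `lead` axis (hours since its first step), which is presumably what lets storms from different dates be concatenated along `id`.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical padded track with 4 six-hourly steps
times = pd.date_range("2020-01-01", periods=4, freq="6h")
toy = xr.Dataset(
    {"target": ("time", np.array([35.0, 40.0, 45.0, 0.0]))},
    coords={"time": times},
)

# Hours since the first time step: 0, 6, 12, ...
toy = toy.assign_coords(lead=("time", np.arange(0, toy.sizes["time"] * 6, 6)))

# Make lead the dimension; time is kept as a non-index coordinate along lead
toy = toy.swap_dims({"time": "lead"})
print(toy["target"].dims)
print(toy["lead"].values)
```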

In [31]:
final_input_padded = xr.concat(final_data, dim='id')
final_input_padded

:::{note}
Recall that our "target" variable, the variable we want to predict, is the wind speed. In the loop above, it is taken from the `USA_WIND` column of the processed IBTrACS table and is used to initially train our model.
:::

In [32]:
final_input_padded.to_netcdf('~/ml-hurricane-intensity/test_folder/input_predictands.nc')

---

## Summary
Here, we selected and edited the ERA5 data associated with the cyclones at each time step in our dataset. This involved gathering data for each variable of interest within a 5x5 grid, filling missing values with zeros, and padding every track to the length of the longest one. We also needed to account for grid cells over land, as our AI model will only take into account grid cells over water.

### What's next?
We have now officially preprocessed all of our data! Next, we will test each variable of interest to get a sense of how well it can act as a predictor for cyclone intensity. After this, we will begin setting up our AI model!