# ERA5 Data Preprocessing

---

## Overview
Here, we use the processed IBTrACS data to select the ERA5 environmental variables associated with each cyclone.

## Prerequisites
| Concepts | Importance | Notes |
| --- | --- | --- |
| [Intro to NumPy](https://foundations.projectpythia.org/core/numpy/) | Necessary | |
| [Intro to Pandas](https://foundations.projectpythia.org/core/pandas/) | Necessary | |
| [Intro to Xarray](https://foundations.projectpythia.org/core/xarray/) | Necessary | |
| Project management | Helpful | |


---

## Imports

In [8]:
import xarray as xr 
from dask.distributed import Client
import matplotlib.pyplot as plt
import cartopy.crs as ccrs
import pandas as pd
import glob
from global_land_mask import globe
import cartopy.feature as cfeature
from matplotlib.path import Path
import matplotlib.patches as patches
from matplotlib import patheffects
import numpy as np
import dask

## Edit and pad ERA5 data

In this section, we select ERA5 data within a 5x5 latitude/longitude grid centered on each cyclone at each time step in our dataset. We then fill missing values (for example, where grid cells fall over land) with zeros and pad each storm to a common number of time steps.
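As a minimal sketch of the box selection, the example below builds a small synthetic dataset. The variable name `sst`, the 1-degree grid spacing, and the storm center are assumptions for illustration and are not taken from the ERA5 files used here. It also assumes latitude is stored in descending order, which is what the `slice(latmax, latmin)` call in this notebook implies; with that ordering, the slice bounds must run from high to low.

```python
import numpy as np
import xarray as xr

# Synthetic global grid at 1-degree spacing (assumed), latitude descending
lat = np.arange(90, -91, -1.0)
lon = np.arange(0, 360, 1.0)
ds = xr.Dataset(
    {"sst": (("latitude", "longitude"), np.random.rand(lat.size, lon.size))},
    coords={"latitude": lat, "longitude": lon},
)

# Hypothetical cyclone center
center_lat, center_lon = 20, 130

# Latitude bounds go from high to low because the coordinate is descending
box = ds.sel(
    latitude=slice(center_lat + 2, center_lat - 2),
    longitude=slice(center_lon - 2, center_lon + 2),
)
box_shape = (box.sizes["latitude"], box.sizes["longitude"])

# Ascending bounds on a descending coordinate select nothing
wrong = ds.sel(latitude=slice(center_lat - 2, center_lat + 2))
```

At 1-degree spacing, a box of plus or minus 2 degrees contains 5x5 points; on a finer grid the same bounds return more points.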

In [12]:
input_dsets = xr.open_dataset('../test_folder/final_proc_5yr_6h.nc')


In [5]:
ib_data_processed_6h = pd.read_csv('../test_folder/ib_data_processed_6h.csv')

In [6]:
final_data = []
max_len = ib_data_processed_6h.groupby('id').size().max()  # largest number of 6-hourly time steps recorded for any storm
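To make the meaning of `max_len` concrete, here is a small sketch on a made-up track table (the values are invented for illustration). `groupby('id').size()` counts the rows, that is, the time steps, for each storm, and `.max()` picks the longest one.

```python
import pandas as pd

# Hypothetical track table: storm 0 has three time steps, storm 1 has two
tracks = pd.DataFrame(
    {"id": [0, 0, 0, 1, 1], "USA_WIND": [30, 35, 40, 25, 30]}
)

steps_per_storm = tracks.groupby("id").size()  # rows per storm
max_len = steps_per_storm.max()                # length of the longest storm
```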

In [9]:
for id_number,group in ib_data_processed_6h.groupby('id'):
    events_data = []
    for index,row in group.iterrows():
        lat = int(row['LAT'])
        lon = int(row['LON'])
        time = row['datetime']
        
        #We want data in a 5x5 latitude/longitude grid centered on the cyclone latitude/longitude
        latmin = lat - 2
        latmax = lat + 2
        lonmin = lon - 2
        lonmax = lon + 2
        sel_data = input_dsets.sel(latitude=slice(latmax, latmin), longitude=slice(lonmin, lonmax), time=time)
        
            
        final_xr = sel_data.rename({'latitude': 'y', 'longitude': 'x'})
        final_xr['x'] = np.arange(0,final_xr.sizes['x'])
        final_xr['y'] = np.arange(0,final_xr.sizes['y'])
        
        # Fill NaN values with zeros; fillna returns a new object, so reassign
        final_xr = final_xr.fillna(0)
        
        #Recall that we are trying to predict the wind speed.
        #Hence, our target is USA_WIND
        final_xr['target'] = row['USA_WIND']    
        events_data.append(final_xr)
    
    final_event = xr.concat(events_data,dim='time')

    #Pad data with zeros up to the maximum number of time steps
    if len(final_event.time) <= max_len:
        new_time = pd.date_range(start=final_event['time'].min().values, periods=max_len ,freq='6h')
        padded_data = final_event.reindex(time=new_time, fill_value=0.0)
    else:
        padded_data = final_event
    
    lead_time = np.arange(0,max_len*6 ,6)
    padded_data['lead'] = ('time', lead_time)
    padded_data = padded_data.assign_coords({'lead': padded_data['lead'].astype(int)})
    
    # swap time and lead dimensions; time is kept as a non-index coordinate
    padded_data = padded_data.swap_dims({'time': 'lead'})
    padded_data['id'] = id_number 
    
    final_data.append(padded_data)
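One detail worth checking when filling missing values: in xarray, `fillna` returns a new object and leaves the original unchanged, so the result has to be assigned to take effect. A small sketch on a synthetic array (the variable name `sst` and the values are assumptions):

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {"sst": (("y", "x"), np.array([[290.0, np.nan], [np.nan, 300.0]]))}
)

# Calling fillna without assigning the result leaves ds untouched
ds["sst"].fillna(0)
still_nan = int(ds["sst"].isnull().sum())

# Assigning the result gives a dataset with the NaNs replaced
filled = ds.fillna(0)
remaining = int(filled["sst"].isnull().sum())
```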

In [11]:
final_input_padded = xr.concat(final_data, dim='SID')
final_input_padded
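The padding step can be illustrated on synthetic data. In this sketch, the helper names `make_storm` and `pad_to_length`, the variable `sst`, and all values are assumptions for illustration. Two storms of different lengths are reindexed onto a common 6-hourly time axis, switched to a `lead` dimension, and then stacked along a new dimension, which works cleanly because they now share the same `lead` values.

```python
import numpy as np
import pandas as pd
import xarray as xr

def make_storm(start, n_steps):
    """Build a synthetic storm with n_steps 6-hourly fields on a 2x2 grid."""
    times = pd.date_range(start, periods=n_steps, freq="6h")
    data = np.ones((n_steps, 2, 2))
    return xr.Dataset({"sst": (("time", "y", "x"), data)}, coords={"time": times})

def pad_to_length(storm, max_len):
    """Zero-pad a storm to max_len time steps and index it by lead time in hours."""
    new_time = pd.date_range(
        start=storm["time"].min().values, periods=max_len, freq="6h"
    )
    padded = storm.reindex(time=new_time, fill_value=0.0)
    padded = padded.assign_coords(lead=("time", np.arange(0, max_len * 6, 6)))
    return padded.swap_dims({"time": "lead"})

storms = [make_storm("2020-01-01", 3), make_storm("2020-02-10", 5)]
max_len = max(s.sizes["time"] for s in storms)
padded = [pad_to_length(s, max_len) for s in storms]

# Stack the equal-length storms along a new storm dimension
stacked = xr.concat(padded, dim="SID")
```

Note that padded time steps are filled with zeros for every variable, so a downstream model cannot tell padding apart from a genuine zero unless it is also given the storm length or a mask.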

## Edit predictors for each cyclone
We now move on to our second objective: preparing the predictors that the machine learning model will use. The predictors are the environmental variables associated with a cyclone at each time step, such as sea surface temperature and lower-tropospheric humidity. The predictand (target) is the wind speed, `USA_WIND`.

In [13]:
def process_row(row, input_dsets):
    lat = int(row['LAT'])
    lon = int(row['LON'])
    time = row['datetime']
    latmin = lat - 5
    latmax = lat + 5
    lonmin = lon - 5
    lonmax = lon + 5

    try:
        # Select the data for the given lat/lon/time
        sel_data = input_dsets.sel(latitude=slice(latmax, latmin), longitude=slice(lonmin, lonmax), time=time)
    
    except KeyError:
        # If data is not found, return None (will be filtered out later)
        print(f"Data not found for id: {row['id']} at time {time} with lat {lat} and lon {lon}")
        return None

    # Add the storm id as a new variable and keep the wind speed for the target
    sel_data['id'] = row['id']
    wind_speed = row['USA_WIND']

    # Rename dimensions and set coordinate ranges
    final_xr = sel_data.rename({'latitude': 'y', 'longitude': 'x'})
    final_xr['x'] = np.arange(0, len(final_xr['x']), 1)
    final_xr['y'] = np.arange(0, len(final_xr['y']), 1)
    final_xr = final_xr.fillna(0)  # Fill NaN values with zeros
    final_xr['target'] = wind_speed

    return final_xr
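The `try`/`except KeyError` guard matters mainly for the `time` selection. A slice over latitude or longitude that matches nothing returns an empty selection, whereas an exact label lookup with a timestamp that is absent from the dataset raises `KeyError`. A minimal sketch on synthetic data (the helper `select_time` and all values are assumptions for illustration):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic 6-hourly series
times = pd.date_range("2020-01-01", periods=4, freq="6h")
ds = xr.Dataset({"sst": (("time",), np.arange(4.0))}, coords={"time": times})

def select_time(dataset, when):
    """Return the selection at an exact timestamp, or None if it is absent."""
    try:
        return dataset.sel(time=when)
    except KeyError:
        return None

found = select_time(ds, pd.Timestamp("2020-01-01 06:00"))    # on the 6-hourly axis
missing = select_time(ds, pd.Timestamp("2020-01-01 03:00"))  # between two steps
```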

In [14]:
# Wrap the row processing in dask.delayed
delayed_results = []
for index, row in ib_data_processed_6h.iterrows():
    delayed_result = dask.delayed(process_row)(row, input_dsets)
    delayed_results.append(delayed_result)

# Compute in parallel and filter out None results
final_data = dask.compute(*delayed_results)
final_data = [ds for ds in final_data if ds is not None]

# Concatenate along 'time' dimension
final_data_xr = xr.concat(final_data, dim='time')
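The delayed-then-compute pattern can be tried on a toy function before running it on the full track table. In this sketch, `square_if_even` is a hypothetical stand-in for the row processing: it returns `None` for some inputs, and those results are filtered out after `dask.compute`, which returns a tuple with one entry per task.

```python
import dask

def square_if_even(n):
    """Toy task: return n squared for even n, otherwise None."""
    return n * n if n % 2 == 0 else None

# Build the task list lazily; nothing runs yet
tasks = [dask.delayed(square_if_even)(n) for n in range(6)]

# Run all tasks, then drop the ones that returned None
results = dask.compute(*tasks)
kept = [r for r in results if r is not None]
```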


---

## Summary
In this notebook, we used the processed IBTrACS tracks to extract ERA5 environmental variables in a latitude/longitude box around each cyclone center at every 6-hourly time step. We filled missing values with zeros, padded each storm to a common number of time steps, and attached the wind speed (`USA_WIND`) as the target for the machine learning model.
