# Data Processing Examples
Example code for downloading and preparing data

## 0 - Misc Imports
Generally useful imports, collected up here so they don't clutter the rest of the notebook

### Important Imports
The imports that actually matter -- make sure these dependencies are installed (a `requirements.txt` will be added at some point). This code block doesn't actually *have* to be run; any required imports will be at the top of each code block

In [None]:
import xarray as xr
import numpy as np

#### Environment Settings
To be honest, I'm not sure exactly where this is supposed to go (or which import causes the errors without it), but it **does** need to be run at some point before the rest of the code 😊

In [None]:
import os
os.environ['SCIKIT_ARRAY_API'] = '1'  # environment values must be strings; an int raises TypeError

### Optional Import(s)
Intel has put out the `sklearnex` module, which speeds up training of sklearn models on Intel CPUs. If this is not relevant to you, ignore this section. If this section raises errors, just skip it as well; it only provides a speed-up and the functionality is otherwise unaffected.

In [None]:
from sklearnex import patch_sklearn
patch_sklearn()
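
Since this import is optional, one way to make the cell safe to run on any machine is to guard it. This is only a minimal sketch and not part of the project code; the `SKLEARNEX_ENABLED` flag is a hypothetical name used for illustration.

```python
# Optional acceleration: fall back to stock scikit-learn if sklearnex is unavailable.
try:
    from sklearnex import patch_sklearn
    patch_sklearn()  # call this before importing sklearn estimators so the patched versions are used
    SKLEARNEX_ENABLED = True  # hypothetical flag, for illustration only
except Exception as err:  # ImportError if not installed; other errors on unsupported setups
    SKLEARNEX_ENABLED = False
    print(f"sklearnex not enabled ({type(err).__name__}); using stock scikit-learn")
```

Either way the rest of the notebook runs unchanged; the flag is only there in case you want to log which path was taken.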

## 1 - Download + Pre-Process
Essentially example code for `forest_fire/prepare_data`

First, we need to download the data -- in my experience this tends to take roughly 10-12 minutes per month of data, so for 5 years (60 months) that adds up to around 10-12 hours. In other words, set this cell off and leave it running in the background.
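
A quick sanity check of the total, taking the 10-12 minutes per month figure at face value. Actual times will depend on the CDS queue and your connection, so treat this as an order-of-magnitude sketch only.

```python
# Back-of-the-envelope download time from the per-month figure quoted above.
YEARS = 5
MONTHS = YEARS * 12           # 60 months of data
MIN_PER_MONTH = (10, 12)      # observed range, in minutes per month

low_h, high_h = (MONTHS * m / 60 for m in MIN_PER_MONTH)
print(f"{MONTHS} months -> roughly {low_h:.0f}-{high_h:.0f} hours")
# prints: 60 months -> roughly 10-12 hours
```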

DO NOTE THAT this currently only downloads the training data (2010-2014); downloading test data (2015-2019) requires changes to the code.

Either:
1. Change `request_total_data` to request all 10 years of data at once, or
2. Make a second request for the test data

I have gone with 2. while doing the project, but 1. might be more efficient, with the small caveat that it is a bit of a pain if the download is interrupted partway. Note that in the second case, the training data can be used as the prior.

In [None]:
from forest_fire.prepare_data.CDS_requests import request_total_data

## USE CANADA RANGE AS EXAMPLE
from forest_fire.prepare_data.extents import CANADA_RICHARDSON_EXTENT

request_total_data(
    extent = CANADA_RICHARDSON_EXTENT,
    data_path = './data'
)

By default, the above code block will download our data to `data/canada/main/combined.grib` and `data/canada/prior/combined.grib`

Later steps will be much quicker with this data stored as Zarr groups rather than GRIB files, so we convert it next -- in fairness, this step isn't strictly necessary

In [None]:
from forest_fire.prepare_data.process_data import grib_to_zarr

grib_to_zarr(grib_path = 'data/canada/main/combined.grib', zarr_path = 'data/canada/main/', name = '_ZARR')
grib_to_zarr(grib_path = 'data/canada/prior/combined.grib', zarr_path = 'data/canada/prior/', name = '_ZARR')

We now need to **set up** the data -- for this we have the `setup_dataset()` function from `process_data.py`. We will also write the prepared dataset to storage

In [None]:
from forest_fire.prepare_data.process_data import setup_dataset

PROXIES = {
    'tp': (30, 90, 180),
    't2m': (30, 90, 180)
}

setup_dataset(
    main_path = 'data/canada/main/_ZARR',
    prior_path = 'data/canada/prior/_ZARR',
    proxy_config = PROXIES
).to_zarr(
    store = 'data/canada/ZARR_READY'
)

Any further manipulation of data is then handled in `forest_fire/train`. You can also restart the kernel at this point to clear stored variables if you so wish -- just remember to re-import modules from section 0!

## 2 - Model Training
Example code for `forest_fire/train` + examples of models used for this project

In [None]:
import xarray as xr  # needed for open_zarr below, e.g. after a kernel restart
from forest_fire.train.prepare_samples import prep_samples

train_ds = xr.open_zarr(
    'data/canada/ZARR_READY'
)

X, y = prep_samples(
    ds = train_ds
)

### Resampling

This generally works fine. However, the data for this project has tended to be wildly imbalanced (with around a 1:10000 ratio of positive to negative points being fairly common). Therefore, up/downsampling has been applied to the training data, using the `imblearn` module.
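
To see how imbalanced your own labels are, a quick count is enough. The sketch below uses a synthetic stand-in (`y_demo`, a hypothetical name) and assumes the labels are a binary 0/1 array; swap in the `y` returned by `prep_samples` if that assumption holds for your data.

```python
import numpy as np

# Synthetic stand-in for the labels (assumed binary 0/1), roughly 1 positive per 10,000 points.
rng = np.random.default_rng(0)
y_demo = (rng.random(1_000_000) < 1e-4).astype(int)

# Count each class; minlength=2 keeps both counts even if one class is absent.
n_neg, n_pos = np.bincount(y_demo, minlength=2)
print(f"positives: {n_pos}, negatives: {n_neg}")
print(f"roughly 1 positive per {n_neg / max(n_pos, 1):.0f} negatives")
```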

Downsampling was done with `RandomUnderSampler` (bootstrapped samples), while upsampling was done with a mix of `ADASYN` and `SMOTE`. Example code for this is provided below.

#### Downsampling

In [None]:
from imblearn.under_sampling import RandomUnderSampler

downsampler = RandomUnderSampler(sampling_strategy = 0.1)

X_down, y_down = downsampler.fit_resample(X, y)
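
With a float `sampling_strategy`, `imblearn` reads the value as the desired ratio of minority to majority samples after resampling, so `0.1` keeps every positive and about 10 negatives per positive. The NumPy sketch below mimics that behaviour on synthetic data to make the effect concrete. It is an illustration rather than `imblearn`'s implementation, and `undersample_majority` is a hypothetical helper.

```python
import numpy as np

def undersample_majority(X, y, ratio=0.1, seed=0):
    """Keep every positive and a random subset of negatives so that
    n_pos / n_neg_kept is approximately `ratio` (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)
    n_keep = min(len(neg_idx), int(round(len(pos_idx) / ratio)))
    kept_neg = rng.choice(neg_idx, size=n_keep, replace=False)  # sample without replacement
    idx = np.sort(np.concatenate([pos_idx, kept_neg]))
    return X[idx], y[idx]

# Synthetic data: 50 positives among 100,000 samples with 3 features.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(100_000, 3))
y_demo = np.zeros(100_000, dtype=int)
y_demo[rng.choice(100_000, size=50, replace=False)] = 1

X_small, y_small = undersample_majority(X_demo, y_demo, ratio=0.1)
print(np.bincount(y_small))  # 500 negatives, 50 positives
```

Note how much data is thrown away: 99,500 of the 100,000 synthetic rows are dropped, which is the main trade-off of downsampling.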

#### Upsampling
Examples are provided for both SMOTE and ADASYN; there is more work to be done on comparing which is better in various cases

In [None]:
# SMOTE
from imblearn.over_sampling import SMOTE

SMOTE_oversampler = SMOTE(sampling_strategy = 0.1)

X_smote, y_smote = SMOTE_oversampler.fit_resample(X, y)
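
For intuition, SMOTE creates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. The NumPy sketch below shows only that core step on synthetic data. It is a simplified illustration, not `imblearn`'s implementation (brute-force neighbour search, no edge-case handling), and `smote_like` is a hypothetical helper.

```python
import numpy as np

def smote_like(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic points from minority samples X_min by
    interpolating towards a random one of each point's k nearest neighbours."""
    rng = np.random.default_rng(seed)
    # Brute-force pairwise distances between minority points.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)                  # exclude each point itself
    neighbours = np.argsort(dists, axis=1)[:, :k]    # indices of the k nearest

    base = rng.integers(0, len(X_min), size=n_new)              # starting points
    picked = neighbours[base, rng.integers(0, k, size=n_new)]   # one neighbour each
    lam = rng.random((n_new, 1))                                # weight in [0, 1)
    return X_min[base] + lam * (X_min[picked] - X_min[base])

rng = np.random.default_rng(2)
X_min_demo = rng.normal(size=(50, 3))   # 50 synthetic minority samples

# sampling_strategy = 0.1 with 10,000 negatives targets 1,000 positives,
# so 950 synthetic points are needed on top of the 50 real ones.
n_neg = 10_000
n_new = int(round(0.1 * n_neg)) - len(X_min_demo)
X_synth = smote_like(X_min_demo, n_new)
print(X_synth.shape)  # (950, 3)
```

Because every synthetic point lies on a segment between two real minority points, the new samples never leave the region already covered by the positives. ADASYN follows the same interpolation idea but generates more points around minority samples that are harder to learn.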

In [None]:
# ADASYN
from imblearn.over_sampling import ADASYN

ADASYN_oversampler = ADASYN(sampling_strategy = 0.1)

X_ada, y_ada = ADASYN_oversampler.fit_resample(X, y)

### Training
