# ml_drought

This series of notebooks outlines how to use the pipeline created as part of the ECMWF Summer of Weather Code 2019. 

## The `environment` files
The two environment files `environment.mac.yml` and `environment.ubuntu.cpu.yml` specify working conda environments for different platforms. We advise creating a new `conda` environment from one of these files before running the pipeline.

## Pipeline Structure

The pipeline is structured as below. We have a number of different classes, all written in the `src` directory. These are covered by the tests in the `tests` directory, which mirrors the structure of the `src` directory. These tests can be a useful entry point for understanding how we use each part of the pipeline.

- Exporters: `src/exporters`
- Preprocessors: `src/preprocess`
- Engineers: `src/engineer`
- Models: `src/models`
- Analysis: `src/analysis`

<img src="img/pipeline_structure.png">

## Exporters

The exporters download data from external sources. These sources vary, and so do the methods for downloading data. The exporters all inherit behaviour from the `BaseExporter` defined in `src/exporters/base.py`. The `S5Exporter` and the `ERA5Exporter` both interact with the ECMWF / Copernicus [`cdsapi`](https://cds.climate.copernicus.eu/api-how-to). Other exporters work with FTP servers or websites.

<img src="img/exporter_diagram.png" style='background-color: #878787; border-radius: 25px; padding: 20px'>

### Sources:
- The `S5Exporter` and the `ERA5Exporter` work with the [`Climate Data Store` (CDS)](https://cds.climate.copernicus.eu/#!/home) to download data. 
- The `ERA5ExporterPOS` downloads data from the PlanetOS AWS data mirror which can be visualised [here](https://data.planetos.com/datasets/ecmwf_era5)
- The `GLEAMExporter` downloads data from the [GLEAM FTP Server](https://www.gleam.eu/)
- The `VHIExporter` downloads data from the [NOAA Vegetation Health FTP Server](https://www.star.nesdis.noaa.gov/smcd/emb/vci/VH/vh_ftp.php)
- The `SRTMExporter` uses the [`elevation` package](https://github.com/bopen/elevation)

### Exporters API

The exporters have a common `export` method which will download the data to the `data/raw` directory by default. If you wish to download the data elsewhere, provide a `pathlib.Path` path to the `Exporter`.

**Be aware that data volumes are significant (upwards of 1TB if you download all data)**

**NOTE: the area surrounding Kenya will be downloaded by default for the CDS Exporters. Otherwise data is global and is subset later**
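
To make the path handling concrete, here is a minimal sketch of the pattern described above. `ToyExporter` is a hypothetical stand-in, not one of the classes in `src/exporters`, and the `data_folder` argument name is an assumption borrowed from the attribute names listed for the preprocessors further down. It only writes a placeholder file, so nothing is downloaded.

```python
from pathlib import Path
import tempfile


class ToyExporter:
    """Hypothetical stand-in for an exporter; not the pipeline's own class."""

    dataset = "toy"  # name of the sub-folder created under `raw`

    def __init__(self, data_folder: Path = Path("data")) -> None:
        # `data_folder` is an assumed argument name, used here for illustration.
        self.raw_folder = data_folder / "raw"
        self.output_folder = self.raw_folder / self.dataset

    def export(self) -> Path:
        # A real exporter would download files here; we only write a placeholder.
        self.output_folder.mkdir(parents=True, exist_ok=True)
        outfile = self.output_folder / "placeholder.txt"
        outfile.write_text("downloaded data would be written here\n")
        return outfile


# Point the exporter at a custom location instead of the default `data` folder.
with tempfile.TemporaryDirectory() as tmp:
    exporter = ToyExporter(data_folder=Path(tmp))
    written = exporter.export()
    relative = written.relative_to(tmp).as_posix()
    print(relative)  # raw/toy/placeholder.txt
```

Passing the folder in at construction time keeps the default behaviour (`data/raw`) while letting you redirect large downloads to another disk.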


In [2]:
from pathlib import Path
import os

if Path('.').absolute().parents[1].name == 'ml_drought':
    os.chdir(Path('.').absolute().parents[1])

from src import exporters


In [3]:
dir(exporters)[:8]

['CHIRPSExporter',
 'ERA5Exporter',
 'ERA5ExporterPOS',
 'ESACCIExporter',
 'GLEAMExporter',
 'KenyaAdminExporter',
 'S5Exporter',
 'SRTMExporter']

## Preprocessors

The preprocessors work to convert these different datasets into a unified data format. This makes testing and developing different models much more straightforward.

There is a `Preprocessor` for each `Exporter`.

These `Preprocessors` perform a number of tasks:
- Put the data on a regular spatial grid
- Put the data on a consistent temporal frequency (e.g. all data is converted to `monthly` timesteps)
- Standardize the dimension names (`time, lat, lon`)
- Subset the data to the same areal extent (by default Kenya is subset from the data).
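
The sketch below illustrates two of these tasks, subsetting to a region of interest and aggregating to monthly timesteps, in plain Python. It is a conceptual illustration only: the pipeline itself works on netCDF files, and the records, the bounding box and the function names are all made up for this example.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

# Toy records: (day, lat, lon, value). All values are made up for illustration.
records = [
    (date(2015, 1, 1), 0.0, 37.0, 2.0),
    (date(2015, 1, 15), 0.0, 37.0, 4.0),
    (date(2015, 2, 1), 0.0, 37.0, 6.0),
    (date(2015, 1, 1), 50.0, 10.0, 9.0),  # outside the region of interest
]

# Hypothetical bounding box (not the pipeline's actual Kenya extent).
LAT_MIN, LAT_MAX = -5.0, 6.0
LON_MIN, LON_MAX = 33.0, 42.0


def in_roi(lat: float, lon: float) -> bool:
    """True if the point falls inside the bounding box."""
    return LAT_MIN <= lat <= LAT_MAX and LON_MIN <= lon <= LON_MAX


def monthly_means(rows):
    """Subset to the ROI, then average values per (year, month, lat, lon)."""
    groups = defaultdict(list)
    for day, lat, lon, value in rows:
        if in_roi(lat, lon):
            groups[(day.year, day.month, lat, lon)].append(value)
    return {key: mean(values) for key, values in groups.items()}


monthly = monthly_means(records)
for key, value in sorted(monthly.items()):
    print(key, value)
```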

<img src="img/preprocess_diagram.png" style='background-color: #878787; border-radius: 25px; padding: 20px'>

The preprocessors offer an opportunity to tailor the pipeline to your own needs. You can easily change the area to be subset (the Region of Interest - ROI) for example.

The preprocessors do a very useful task in making the data consistent. Working with a Unified Data Format is useful for many comparison tasks and is essential for training machine learning and statistical models.

In [4]:
from src import preprocess

dir(preprocess)[:6]

['CHIRPSPreprocessor',
 'ERA5MonthlyMeanPreprocessor',
 'ESACCIPreprocessor',
 'GLEAMPreprocessor',
 'KenyaAdminPreprocessor',
 'PlanetOSPreprocessor']

### Preprocessors API

The main entry point to the preprocessors is the `preprocessor.preprocess()` method.

Regridding the data requires you to have a reference `.nc` file that you want to use as the reference grid. This means that your data will be put onto the same `lat, lon` grid as the reference file.

```python
preprocessor.preprocess(
    subset_str='kenya', 
    regrid=Path('path/to/reference/netcdf.nc')
)
```
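
To show what "putting data onto the reference grid" means, here is a small nearest-neighbour sketch in plain Python. Nearest-neighbour is chosen only because it is the simplest method to read; this notebook does not say which interpolation method the pipeline's `regrid` uses, and the function names below are hypothetical.

```python
def nearest_index(coords, target):
    """Index of the coordinate closest to `target`."""
    return min(range(len(coords)), key=lambda i: abs(coords[i] - target))


def regrid_nearest(values, src_lats, src_lons, ref_lats, ref_lons):
    """Resample a 2D field (rows = lat, columns = lon) onto a reference grid."""
    return [
        [
            values[nearest_index(src_lats, lat)][nearest_index(src_lons, lon)]
            for lon in ref_lons
        ]
        for lat in ref_lats
    ]


# Source data on a coarse 2 x 2 grid (made-up values).
src_lats, src_lons = [0.0, 1.0], [36.0, 37.0]
field = [[1.0, 2.0],
         [3.0, 4.0]]

# The reference grid plays the role of the coordinates in the reference `.nc` file.
ref_lats, ref_lons = [0.1, 0.6, 0.9], [36.2, 36.9]
regridded = regrid_nearest(field, src_lats, src_lons, ref_lats, ref_lons)
print(regridded)  # [[1.0, 2.0], [3.0, 4.0], [3.0, 4.0]]
```

After regridding, the output has the shape of the reference grid (3 x 2 here) regardless of the shape of the source data.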

In [5]:
preprocessor = preprocess.ERA5MonthlyMeanPreprocessor()

[method for method in dir(preprocessor) if '__' not in method]

['_preprocess_single',
 'analysis',
 'chop_roi',
 'create_filename',
 'data_folder',
 'dataset',
 'filter_outfiles',
 'get_filepaths',
 'interim',
 'load_reference_grid',
 'merge_files',
 'out_dir',
 'preprocess',
 'preprocessed_folder',
 'raw_folder',
 'regrid',
 'resample_time',
 'static',
 'static_vars']

## Engineer

The Engineer is responsible for taking the `preprocessed` data from the `data/interim/*_preprocessed/` directories and writing to the `data/features` directory. 

In doing so the `Engineer` creates `train` and `test` data for different month-years. 

The label on the directory `data/features/{experiment}/{year}_{month}` (for example: `data/features/nowcast/2015_1`) refers to the `target` timestep. Therefore, our `y.nc` has the timestep `January 2015` in this example.
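
As a small illustration of this naming convention, the helpers below build and parse such a directory label. They are hypothetical functions written for this notebook, not part of the `Engineer` class.

```python
from pathlib import Path


def target_folder(data_folder: Path, experiment: str, year: int, month: int) -> Path:
    """Build `data/features/{experiment}/{year}_{month}` for a target timestep."""
    return data_folder / "features" / experiment / f"{year}_{month}"


def parse_target(folder: Path) -> tuple:
    """Recover (year, month) of the target timestep from a folder name."""
    year, month = folder.name.split("_")
    return int(year), int(month)


folder = target_folder(Path("data"), "nowcast", 2015, 1)
print(folder.as_posix())  # data/features/nowcast/2015_1
print(parse_target(folder))  # (2015, 1)
```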

<img src="img/engineer_diagram.png" style='background-color: #878787; border-radius: 25px; padding: 20px'>

### We currently have two `experiments` defined in the pipeline.

These two experiments are accessed through the `Engineer` class as an argument - `experiment: str`.

The **`OneMonthForecast`** experiment tries to predict the target variable one month ahead. For example, we might use `total_precipitation` as our regressor (stored in `x.nc`) and want to predict vegetation health `VHI` stored in `y.nc`.

We therefore use data for December 2014 (total_precipitation and VHI as an autoregressive component) to predict January 2015 VHI.

The **`Nowcast`** experiment assumes that we have information about variables other than the target variable for the target time. So we have `total_precipitation` information in January 2015 and we want to use that information to predict January 2015 `VHI`. This experiment is a good way of incorporating SEAS5 forecast data.

- `x.nc` includes December 2014 `VHI` and `total_precipitation`, as well as January 2015 `total_precipitation` (non-target variable at target timestep).
- `y.nc` contains January 2015 `VHI` - our target variable.
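
The difference between the two experiments can be sketched with a toy example. The dictionaries below stand in for `x.nc` and `y.nc`, the numbers are made up, and the experiment strings (`"one_month_forecast"` in particular) and the `build_sample` helper are assumptions for illustration rather than the `Engineer`'s actual interface.

```python
# Toy monthly values keyed by (year, month); the numbers are made up.
data = {
    (2014, 12): {"VHI": 40.0, "total_precipitation": 55.0},
    (2015, 1): {"VHI": 35.0, "total_precipitation": 20.0},
}


def previous_month(year, month):
    """Return the (year, month) immediately before the given one."""
    return (year - 1, 12) if month == 1 else (year, month - 1)


def build_sample(data, target, experiment, target_var="VHI"):
    """Return (x, y) dictionaries for one target timestep."""
    prev = previous_month(*target)
    # Both experiments use every variable from the previous month.
    x = {(prev, var): value for var, value in data[prev].items()}
    if experiment == "nowcast":
        # Nowcast also sees non-target variables at the target timestep.
        for var, value in data[target].items():
            if var != target_var:
                x[(target, var)] = value
    elif experiment != "one_month_forecast":
        raise ValueError(f"unknown experiment: {experiment}")
    y = {(target, target_var): data[target][target_var]}
    return x, y


x_forecast, y_forecast = build_sample(data, (2015, 1), "one_month_forecast")
x_nowcast, y_nowcast = build_sample(data, (2015, 1), "nowcast")
print(sorted(x_forecast))
print(sorted(x_nowcast))
```

In both cases the target `VHI` for January 2015 appears only in `y`, never in `x`.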

## Models

The models are the implementation of machine learning methods for making predictions about our `target_variable`. 

We have currently implemented 5 models with varying levels of complexity. These range from simple baseline models (`parsimonious` models) such as `Persistence` to complex Neural Networks with architectures specific to hydro-meteorology ([`EARecurrentNetwork` (paper here)](https://arxiv.org/pdf/1907.08456.pdf)).

These classes work with data from the `data/features` directory and write predictions to the `data/models/{model}` directory. Results are stored in `results.json`.
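
As a rough sketch of what a baseline looks like, the code below implements persistence in its usual sense: the prediction for the target month is simply the last observed value. The pipeline's own `Persistence` class may differ in detail, and the sample values, the RMSE metric and the layout of the JSON file are assumptions made for this example.

```python
import json
import math
import tempfile
from pathlib import Path

# Toy samples: previous-month VHI (input) and target-month VHI (truth). Made-up values.
samples = [
    {"previous_vhi": 40.0, "true_vhi": 35.0},
    {"previous_vhi": 50.0, "true_vhi": 53.0},
    {"previous_vhi": 30.0, "true_vhi": 34.0},
]


def persistence_predict(sample):
    """Predict that the target equals the most recent observed value."""
    return sample["previous_vhi"]


def rmse(true, preds):
    """Root mean squared error between two equally long sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(true, preds)) / len(true))


preds = [persistence_predict(s) for s in samples]
true = [s["true_vhi"] for s in samples]
score = rmse(true, preds)

# Write the score to `models/persistence/results.json` under a temporary folder.
# The {"rmse": ...} layout is an assumption, not the pipeline's actual format.
with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp) / "models" / "persistence"
    model_dir.mkdir(parents=True)
    results_file = model_dir / "results.json"
    results_file.write_text(json.dumps({"rmse": score}))
    saved = json.loads(results_file.read_text())

print(round(score, 3))
```

A baseline like this gives the more complex models a score to beat.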

In [6]:
from src import models

dir(models)[:3] + dir(models)[5:7]

### EALSTM - `EARecurrentNetwork`

Of particular interest is the Entity-Aware Long Short-Term Memory (EALSTM). This model was developed recently in a [paper looking at Regional Rainfall-Runoff modelling](https://arxiv.org/abs/1907.08456).

The authors adapt the classical Long Short-Term Memory (LSTM) network architecture to include the input of static and dynamic data separately. The **static** data is passed to each cell of the EALSTM through the input gate, modifying the information from the **dynamic** data that enters the long-term memory component ($C$).
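
For reference, the cell update can be written out as follows. This follows our reading of the linked paper with the notation adapted to this notebook ($x_s$ for the static inputs, $x_{d,t}$ for the dynamic inputs at timestep $t$, $C_t$ for the long-term memory), so please check the details against the original.

$$
\begin{aligned}
i &= \sigma(W_i x_s + b_i) \\
f_t &= \sigma(W_f x_{d,t} + U_f h_{t-1} + b_f) \\
g_t &= \tanh(W_g x_{d,t} + U_g h_{t-1} + b_g) \\
o_t &= \sigma(W_o x_{d,t} + U_o h_{t-1} + b_o) \\
C_t &= f_t \odot C_{t-1} + i \odot g_t \\
h_t &= o_t \odot \tanh(C_t)
\end{aligned}
$$

The input gate $i$ depends only on the static data, so it is computed once per sequence, whereas the forget gate $f_t$, candidate $g_t$ and output gate $o_t$ are updated at every timestep from the dynamic data.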

In the diagram below we compare the cell architecture in the LSTM (*top*) and the EALSTM (*bottom*).

<img src="img/ealstm_lstm.png" style='background-color: #878787; border-radius: 25px; padding: 20px'>

## Analysis

The analysis directory contains code for interpreting the output of the models and for interrogating the input datasets. This is a very general directory with 'helper' code. 

Some of the tasks that the `src/analysis` directory can help with are:
1. Subsetting your analysis by region.
2. Subsetting your analysis by landcover types.
3. Calculating various temporal aggregations (such as 3 Monthly moving averages).
4. Comparisons of `true` against predictions `preds`.
5. Calculating indices from results ([Vegetation Deficit Index used in other papers](https://www.mdpi.com/2072-4292/11/9/1099)).
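
Two of these tasks, a 3-month moving average and a comparison of `true` against `preds`, can be sketched in a few lines of plain Python. These are hypothetical helpers for illustration with made-up values; the functions in `src/analysis` operate on the pipeline's own data structures and may be organised differently.

```python
def moving_average(values, window=3):
    """Trailing moving average; the first `window - 1` entries are None."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)  # not enough history yet
        else:
            out.append(sum(values[i + 1 - window : i + 1]) / window)
    return out


def mean_absolute_error(true, preds):
    """Average absolute difference between observations and predictions."""
    return sum(abs(t - p) for t, p in zip(true, preds)) / len(true)


# Made-up monthly values for a single pixel.
true = [30.0, 36.0, 42.0, 48.0]
preds = [33.0, 33.0, 45.0, 45.0]

smoothed = moving_average(true)
mae = mean_absolute_error(true, preds)
print(smoothed)  # [None, None, 36.0, 42.0]
print(mae)  # 3.0
```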

To see the pipeline analysis and outputs in action, please check out the next notebook!