# ml_drought

This series of notebooks outlines how to use the pipeline created as part of the ECMWF Summer of Weather Code 2019. 

## The `environment` files
The two environment files, `environment.mac.yml` and `environment.ubuntu.cpu.yml`, specify working conda environments for different platforms. To run the pipeline, we advise creating a new `conda` environment from the file that matches your platform.

## Pipeline Structure

The pipeline is structured as shown below. It consists of a number of classes, all written in the `src` directory. These classes are tested by the code in the `tests` directory, which mirrors the structure of `src`. The tests can be a useful entry point for understanding how each part of the pipeline is used.

- Exporters: `src/exporters`
- Preprocessors: `src/preprocess`
- Engineers: `src/engineer`
- Models: `src/models`
- Analysis: `src/analysis`

<img src="img/pipeline_structure.png">

## The `data` directory

The pipeline interacts with the `data` directory. Every stage of the pipeline reads from or writes to it, so it is important that this directory has the correct structure. The pipeline is flexible in its applications and modelling decisions, but it is **very opinionated** about the structure of this directory. We therefore recommend that you don't manually move files within or out of it. Ideally, this repository should be located somewhere with sufficient storage (either an external hard drive or a remote server), because data volumes grow very quickly when working with data of three or more dimensions `(time, latitude, longitude)`.

<img src="img/data_dir_diagram.png" style='background-color: #878787; border-radius: 25px; padding: 20px'>
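
Because the pipeline depends on this layout, a quick check before running anything can save time. The snippet below is a minimal sketch, assuming the top-level subdirectories are the four named in the subsections that follow (`raw`, `interim`, `features`, `models`). The helper `missing_subdirs` is hypothetical and not part of the pipeline.

```python
import tempfile
from pathlib import Path
from typing import List

# Assumed top-level layout: the four subdirectories named in this document.
EXPECTED_SUBDIRS = ("raw", "interim", "features", "models")


def missing_subdirs(data_dir: Path) -> List[str]:
    """Return the expected subdirectories that do not exist under data_dir."""
    return [name for name in EXPECTED_SUBDIRS if not (data_dir / name).is_dir()]


# Demonstration on a throwaway directory that only contains raw/ and interim/.
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp) / "data"
    (data_dir / "raw").mkdir(parents=True)
    (data_dir / "interim").mkdir()
    demo_missing = missing_subdirs(data_dir)

print(demo_missing)
```

On a real checkout you would call `missing_subdirs(Path("data"))` from the repository root. An empty list means all four subdirectories are present.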

### `raw` data 
The raw data exported from external sources. The `Exporters` populate this directory.

### `interim` data
The data that has been preprocessed. Each raw file is first preprocessed individually (for example, cropped to the region of interest, the `subset`) and written to a temporary directory, `data/interim/{dataset}_interim/`. As a final step, these files are combined into a single `data/interim/{dataset}_preprocessed/{dataset}_{subset}.nc` file.
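
The naming pattern above can be sketched with `pathlib`. This is an illustration only: the helper `interim_paths` is hypothetical, the root `data/interim` is assumed from the path given in the Engineer section, and the values `chirps` and `kenya` are placeholder names for a dataset and a subset.

```python
from pathlib import Path
from typing import Tuple


def interim_paths(interim_root: Path, dataset: str, subset: str) -> Tuple[Path, Path]:
    """Return (temporary per-file directory, final combined NetCDF file)."""
    tmp_dir = interim_root / f"{dataset}_interim"
    final_file = interim_root / f"{dataset}_preprocessed" / f"{dataset}_{subset}.nc"
    return tmp_dir, final_file


# Placeholder dataset and subset names, chosen only for illustration.
tmp_dir, final_file = interim_paths(Path("data/interim"), "chirps", "kenya")
print(tmp_dir.as_posix())
print(final_file.as_posix())
```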

### `features` data
The features directory contains data that has been through the engineer and therefore split into `train` and `test` directories. This is the data that will be read by the `DataLoader` in the models. It is still stored as netcdf (`.nc`) files here so that it can be easily read and checked. 

Because we are currently working with time series models, each directory (e.g. `data/features/{experiment}/train/2015_1/`) has one target timestep and target variable (`y.nc`), and then the regressors stored in `x.nc`. `y.nc` will be the target variable for January 2015 in this example (`.../2015_1`).
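
To make the folder convention concrete, the sketch below parses a timestep folder name such as `2015_1` and builds the paths to its `x.nc` and `y.nc` files. Both helpers are hypothetical (they are not the pipeline's `DataLoader`), and the experiment name `my_experiment` is a placeholder.

```python
from pathlib import Path
from typing import Tuple


def parse_timestep(folder_name: str) -> Tuple[int, int]:
    """Split a folder name such as '2015_1' into (year, month)."""
    year, month = folder_name.split("_")
    return int(year), int(month)


def sample_files(
    features_dir: Path, experiment: str, split: str, year: int, month: int
) -> Tuple[Path, Path]:
    """Return the (x.nc, y.nc) paths for one target timestep."""
    folder = features_dir / experiment / split / f"{year}_{month}"
    return folder / "x.nc", folder / "y.nc"


year, month = parse_timestep("2015_1")
# "my_experiment" is a placeholder experiment name.
x_path, y_path = sample_files(Path("data/features"), "my_experiment", "train", year, month)
print(year, month)
print(x_path.as_posix())
print(y_path.as_posix())
```

Each file could then be opened with a NetCDF reader such as `xarray.open_dataset`, provided xarray is installed.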

### `models` data
The models directory stores the predictions of the models. Although the models work with numpy arrays, we write the predictions back to `.nc` files in order to utilise the power of xarray and the spatio-temporal structure of hydro-meteorological variables.

In this directory you will find predictions for the `x.nc`/`y.nc` data stored in the `data/features/{experiment}/test/{time}` directories. The data used for testing is created by the `Engineers` and is therefore stored in the `.../test` directory.

## Exporters

The exporters download data from external sources. These sources vary, and so do the methods for downloading from them. The exporters all inherit behaviour from the `BaseExporter` defined in `src/exporters/base.py`. The `S5Exporter` and the `ERA5Exporter` both interact with the ECMWF / Copernicus `cdsapi`. Other exporters work with FTP servers or websites.

The available exporters are listed below.

In [1]:
from pathlib import Path
import os

if Path('.').absolute().parents[1].name == 'ml_drought':
    os.chdir(Path('.').absolute().parents[1])

from src import exporters

In [8]:
dir(exporters)[:8]

['CHIRPSExporter',
 'ERA5Exporter',
 'ERA5ExporterPOS',
 'ESACCIExporter',
 'GLEAMExporter',
 'S5Exporter',
 'SRTMExporter',
 'VHIExporter']

## Preprocessors

The preprocessors work to convert these different datasets into a unified data format. This makes testing and developing different models much more straightforward.

There is a `Preprocessor` for each `Exporter`.

In [13]:
from src import preprocess

dir(preprocess)[:6]

['CHIRPSPreprocesser',
 'ERA5MonthlyMeanPreprocessor',
 'ESACCIPreprocessor',
 'GLEAMPreprocessor',
 'PlanetOSPreprocessor',
 'VHIPreprocessor']

## Engineers

The `Engineer` class works to create `train` and `test` data. This class reads data from the `data/interim/{dataset}_preprocessed` directories and writes to the `data/features` directory.

This class gives us a great deal of flexibility in choosing the input and output variables.
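
As an illustration of how timesteps might be routed into the `train` and `test` directories, the sketch below splits them by year. This is one plausible rule, stated as an assumption. It is not necessarily how the `Engineer` class makes the split, and `split_timesteps` is a hypothetical helper.

```python
from typing import Dict, Iterable, List, Set, Tuple


def split_timesteps(
    timesteps: Iterable[Tuple[int, int]], test_years: Set[int]
) -> Dict[str, List[str]]:
    """Assign each (year, month) timestep to a train/ or test/ folder by year."""
    folders: Dict[str, List[str]] = {"train": [], "test": []}
    for year, month in timesteps:
        # Assumed rule: any timestep in a test year goes to the test split.
        split = "test" if year in test_years else "train"
        folders[split].append(f"{split}/{year}_{month}")
    return folders


timesteps = [(2014, 12), (2015, 1), (2015, 2)]
folders = split_timesteps(timesteps, test_years={2015})
print(folders)
```

Splitting on whole years keeps the test period separate from the training period, which matters for time series models.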

## Models

The models are the implementation of machine learning methods for making predictions about our `target_variable`. 

We have currently implemented five models of varying complexity. They range from simple (parsimonious) baselines such as `Persistence` to neural networks with architectures designed for hydro-meteorology ([`EARecurrentNetwork` (paper here)](https://arxiv.org/pdf/1907.08456.pdf)).

These classes read data from the `data/features` directory and write predictions to the `data/models/{model}` directory. Results are stored in `results.json`.
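
For readers unfamiliar with persistence baselines, the sketch below shows the general idea: each timestep is predicted to equal the previous observation. It is a generic illustration in plain Python, not the pipeline's `Persistence` class, and the observed values are made up.

```python
from typing import List, Sequence


def persistence_forecast(series: Sequence[float]) -> List[float]:
    """Predict each timestep (from the second onwards) as the previous observation."""
    return list(series[:-1])


# Made-up observations of a target variable at four consecutive timesteps.
observed = [0.42, 0.40, 0.35, 0.38]

predictions = persistence_forecast(observed)  # forecasts for timesteps 1..3
targets = observed[1:]                        # what actually happened

# Mean absolute error of the baseline over the forecast timesteps.
mae = sum(abs(p - t) for p, t in zip(predictions, targets)) / len(targets)
print(predictions)
print(round(mae, 4))
```

A more complex model is only worth its cost if it achieves a lower error than a baseline like this on the same test timesteps.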

In [30]:
from src import models

dir(models)[:3] + dir(models)[5:7]

## Analysis

The analysis directory contains code for interpreting the output of the models and for interrogating the input datasets. It is a general-purpose directory of helper code.