# IceNet CLI Usage

## Context

### Purpose
The IceNet library provides the ability to download, process, train and predict from end to end via a set of command-line interfaces.

### Modelling approach
This modelling approach allows users to immediately utilise the library for producing sea ice concentration forecasts.

### Highlights
The key features of an end to end run are: 
* [Setup](#Setup)
* [Download](#Download) 
* [Process](#Process)
* [Train](#Train)
* [Predict](#Predict)

### Contributions
#### Notebook
James Byrne (author)

__Please raise issues [in this repository](https://github.com/antarctica/IceNet-Pipeline) to suggest updates to this notebook!__ 

Contact me at _jambyr \<at\> bas.ac.uk_ for anything else...

#### Modelling codebase
James Byrne (code author), Tom Andersson (science author)

#### Modelling publications
Andersson, T.R., Hosking, J.S., Pérez-Ortiz, M. et al. Seasonal Arctic sea ice forecasting with probabilistic deep learning. Nat Commun 12, 5124 (2021). https://doi.org/10.1038/s41467-021-25257-4

#### Involved organisations
The Alan Turing Institute and British Antarctic Survey

## Setup

### Prerequisites

In order to undertake the following, I'm assuming you have the following at your disposal:

* A host to run this on
* A working conda installation on that host
* Either a SLURM cluster to submit jobs to, or the ability to run locally
* Wherever you run, you want GPUs for training (predictions run fine without)
* Git, python and shell knowledge to a basic degree :-)
* There are numerous external facilities that we interface with, which it's assumed you're set up to use (otherwise check the options as they can be disabled/overlooked)
  * Data sources under [Climate and Sea Ice Data](#Climate-and-Sea-Ice-Data)
  * Wandb (Weights and Biases) - can be disabled when using `icenet_train`
  * Azure - we demonstrate native uploading which can be skipped if required

The most important step in following this notebook is to clone the [IceNet-Pipeline repository](https://github.com/antarctica/IceNet-Pipeline). The cloned directory __will become your working directory for the rest of your work in the notebooks unless otherwise specified__.

```bash
git clone git@github.com:antarctica/IceNet-Pipeline.git green
cd green
```

I've called my folder green as this was derived from the blue-green infrastructure used for operational forecasting at the moment in BAS.

__Generally I run these commands in a screen or tmux session, so that they can be picked up again later.__

### Environment Configuration

___TODO: update, as at time of writing the icenet package is not publicly available for installation and instead should be installed from source...___

```bash
./install_env.sh icenet-green
git clone git@github.com:JimCircadian/icenet2.git ../icenet2
pip install ../icenet2

# Do this every time you restart work
conda activate icenet-green
```

#### Commands

Once the icenet library is installed, you'll be able to access all commands made available by the library. Some are utilities that won't be covered, but using `icenet_<TAB>`-complete you should be able to see a list that includes (but _is not limited to_):

* icenet_data_cmip
* icenet_data_era5
* icenet_data_hres
* icenet_data_masks
* icenet_data_sic
* icenet_dataset_create
* icenet_output
* icenet_predict
* icenet_process_cmip
* icenet_process_era5
* icenet_process_hres
* icenet_process_metadata
* icenet_process_sic
* icenet_train
* icenet_upload_azure

All of these commands are either directly or indirectly (through pipeline shell scripts) used in this notebook...

All commands accept options such as `-v` for turning on verbose logging and `-h` for obtaining help about what options they offer. ___As is best practice for all commands in *nix land, use `-h` to obtain information about options___.
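As a quick sanity check after installation, something like the following sketch reports which commands are visible on your `PATH`. The helper name `check_icenet_commands` is my own invention for illustration and is not part of IceNet; the command names are a subset of the list above.

```bash
# Hypothetical helper (not part of IceNet): report which commands are on PATH.
check_icenet_commands() {
    local missing=0 cmd
    for cmd in "$@"; do
        if command -v "$cmd" >/dev/null 2>&1; then
            echo "found:   $cmd"
        else
            echo "missing: $cmd"
            missing=$((missing + 1))
        fi
    done
    # Succeed only if nothing was missing
    [ "$missing" -eq 0 ]
}

check_icenet_commands \
    icenet_data_era5 icenet_data_sic icenet_data_masks \
    icenet_process_era5 icenet_process_sic icenet_process_metadata \
    icenet_dataset_create icenet_train icenet_predict icenet_output \
    || echo "Some commands are missing: is the conda environment activated?"
```

If everything is reported as missing, the most likely cause is that the conda environment has not been activated in the current shell.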

### The idea behind end to end runs

The IceNet package is designed to support automated runs from end to end by exposing the above CLI operations. These are simple wrappers around the library itself, and __any__ step of this can be undertaken manually or programmatically by inspecting the relevant endpoints. 

___TL;DR: for those of you just wanting to skip straight to an end to end example in shell, [please look at the daily execution script](#Daily-execution)...___

The end to end execution methodology is illustrated by this diagram:

![Full IceNet operational workflow...](https://raw.githubusercontent.com/wiki/alan-turing-institute/IceNet-Project/Pipeline%20Layout.png)

The portion of this you're really interested in understanding is the green box, however, with the IceNet-Pipeline directory (e.g. `green`) corresponding to the green box and thus being, essentially, an ephemeral environment.

#### A tip about source data

You'll see that `Source Data Store` is located outside the green box. Because of the expense and time required to interface with external sources, _we recommend the following step so that source data can be shared between environments_...

```bash
# Make a source data store outside our ephemeral environment
mkdir ../data
ln -s ../data
```
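If you rebuild environments regularly, a slightly more defensive version of the same step can help. This is only a sketch: `link_source_data` is a hypothetical helper (not part of the pipeline repository) and it assumes you call it from inside the pipeline directory.

```bash
# Hypothetical helper: create the shared store and link it as ./data.
# Assumption: the current directory is the pipeline directory (e.g. green).
link_source_data() {
    local store="${1:-../data}"
    mkdir -p "$store"    # no error if the store already exists
    if [ -L data ]; then
        echo "data already links to $(readlink data)"
    elif [ -e data ]; then
        echo "data exists and is not a symlink; leaving it alone" >&2
        return 1
    else
        ln -s "$store" data
        echo "linked data -> $store"
    fi
}
```

Calling `link_source_data` with no arguments uses `../data`, matching the commands above; calling it a second time just reports the existing link.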

### Pipeline versus CLI versus Library usage

Though this notebook is tailored around use of the [IceNet-Pipeline repository](https://github.com/antarctica/IceNet-Pipeline) there is no dependency on this repository for using the `icenet_*` commands. The pipeline repository just offers helper scripts written in [`bash`](https://tldp.org/LDP/abs/html) for running an end to end pipeline out of the box.

You are welcome to use any arbitrary directory to run the CLI scripts below. However, when it comes to the sections on [training](#Train) and [prediction](#Predict), as well as [running daily predictions](#Daily-execution), you'll notice that we leverage scripts from the pipeline repository. This is because these scripts interact with the [model ensembling tool from BAS](https://github.com/JimCircadian/model-ensembler) to train and predict across multiple model instances.

The rule of thumb to follow: 

* Use the pipeline repository if you want to run the end to end IceNet processing out of the box.
* Adapt or customise this process using `icenet_*` commands described in this notebook and in the scripts contained in the pipeline repo.
* For ultimate customisation, you can interact with the IceNet repository programmatically (which is how the CLI commands operate.) For more information look at the [IceNet CLI implementations](https://github.com/JimCircadian/icenet2/blob/main/setup.py#L32) and the [library notebook](03.library_usage.ipynb), along with the [library documentation](#TODO). 

## Download

### Climate and Sea Ice Data

Obtaining and preparing data is simply achieved using `icenet_data_*` commands, which share common arguments `hemisphere`, `start_date` and `end_date`. There are also implementation-specific options worth reviewing under `--help`. For example, getting the last two days of data for 2020 from the ERA5 reanalysis dataset can be done thus:

```bash
icenet_data_era5 south 2020-12-30 2020-12-31
# TODO: run
```

By default, the IceNet commands regrid and rotate data as required to align with the OSISAF SIC data, which is used as the output for the dataset. Programmatic usage allows you to avoid this ([see notebook 03](03.library_usage.ipynb)).

At time of writing there are the following downloaders: 

* `icenet_data_era5` - downloads [ERA5 reanalysis](https://cds.climate.copernicus.eu/cdsapp#!/search?type=dataset&keywords=((%20%22Product%20type:%20Reanalysis%22%20) data using either the CDS Toolbox or direct API
* `icenet_data_cmip` - downloads the prescribed experiments from [CMIP6](https://esgf-node.llnl.gov/search/cmip6/) for the original IceNet paper runs
* `icenet_data_hres` - downloads up to date [forecast generated data from the ECMWF MARS API](https://www.ecmwf.int/en/forecasts/datasets/catalogue-ecmwf-real-time-products)
* `icenet_data_sic` - downloads [OSISAF sea-ice concentration (SIC) data](https://osisaf-hl.met.no/v2p1-sea-ice-index)

### Mask data

IceNet relies on some generated masks for training/prediction, which can be automatically generated very easily using `icenet_data_masks {north,south}`. Once performed, this does not need to be rerun under the pipeline directory...
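To fetch the same period for both hemispheres, the commands above can be wrapped in a loop. The following is a sketch rather than anything shipped with IceNet: the `run` helper and the `DRY_RUN` variable are my own additions so that the commands are only printed by default, and the dates are simply those from the ERA5 example.

```bash
# DRY_RUN and run() are illustrative helpers of this sketch, not IceNet options.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"    # print the command instead of executing it
    else
        "$@"
    fi
}

START="2020-12-30"
END="2020-12-31"

for HEMI in north south; do
    run icenet_data_masks "$HEMI"
    run icenet_data_era5 "$HEMI" "$START" "$END"
    run icenet_data_sic "$HEMI" "$START" "$END"
done
```

Setting `DRY_RUN=0` before running executes the commands instead of printing them.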

## Process

Processing takes the data made available through the source data store and undertakes the necessary normalisation for use as input channels to the UNet architecture. This intermediary step means that the original source data can be reused numerous times with varying training, validation and test date setups.

### Command example

```bash
DATE_STR="-ns 1998-1-1,2011-1-1 -ne 2002-12-31,2015-12-31 -vs 2010-1-1 -ve 2010-8-31 -ts 2010-9-1 -te 2010-12-31"

icenet_process_era5 green_north north $DATE_STR -l 3
icenet_process_era5 green_south south $DATE_STR -l 3

icenet_process_sic green_north north $DATE_STR -l 3
icenet_process_sic green_south south $DATE_STR -l 3

icenet_process_metadata green_north north
icenet_process_metadata green_south south
```

Consulting the command options will make the above more obvious (as well as further options) but a few things we can note that are helpful: 

* Options `-ns`, `-ne`, `-vs`, `-ve`, `-ts`, `-te`, which correspond to training, validation and test sets, allow ranges to be comma-delimited. The above example produces a split training set, for example, that spans two periods: 1998-2002 and 2011-2015.
* These date ranges can be randomised and subsampled using `-d`, __though this is still a bit experimental__
* The `-l` option (which is for `--lag`) specifies the number of days back we look at input data variables for the output in question.

There are plenty of other options available for preprocessing the data, but it should be noted that whilst this is not strongly coupled to dataset creation, options like the lag specified here might influence the creation of datasets in the next step. 

These commands, especially with decadal ranges, can take a long time (12+ hours) to complete depending on the hosts/storage in use.
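Because `$DATE_STR` in the example relies on the shell splitting an unquoted string, I find a bash array a little more robust when scripting this step. The sketch below issues the same commands as the example, in the same order; `run` and `DRY_RUN` are illustrative helpers of mine (not IceNet options) that print each command instead of executing it, which seems sensible given how long the real thing can take.

```bash
# DRY_RUN and run() are illustrative helpers of this sketch, not IceNet options.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Same date options as the example above, held as separate array elements
DATE_ARGS=(
    -ns 1998-1-1,2011-1-1 -ne 2002-12-31,2015-12-31   # training ranges
    -vs 2010-1-1 -ve 2010-8-31                        # validation range
    -ts 2010-9-1 -te 2010-12-31                       # test range
)
LAG=3

for SOURCE in era5 sic; do
    for HEMI in north south; do
        run "icenet_process_${SOURCE}" "green_${HEMI}" "$HEMI" "${DATE_ARGS[@]}" -l "$LAG"
    done
done

for HEMI in north south; do
    run icenet_process_metadata "green_${HEMI}" "$HEMI"
done
```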

### Dataset creation

Once the above preprocessing is taken care of, datasets can easily be created thus. This operation _creates a cached dataset_ in the filesystem that can be fed in for training runs.

```bash
icenet_dataset_create -l 3 -ob 2 -w 32 green_north north
icenet_dataset_create -l 3 -ob 2 -w 32 green_south south
```

The common options used here: 

* `-l` as in the preprocessing stage. If experimenting and using full date ranges, creating a dataset with a different lag can save having to reprocess everything.
* `-ob` is the output batch size for the tfrecords. It is advisable to keep this smaller except where there are seriously large numbers of sets, preferably near to the expected size being used for training.
* `-w` specifies the number of worker subprocesses to use for producing the output. Probably advisable to keep this below the number of cores on your host! :) 
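One way of picking a value for `-w` is to derive it from the number of cores available. This sketch assumes `getconf _NPROCESSORS_ONLN` is available (it is on most Linux and macOS hosts, though not guaranteed everywhere) and leaves one core free; the command is only printed, not executed.

```bash
# Count online cores, falling back to 1 if getconf is unavailable.
CORES="$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)"

# Leave one core free, but never go below one worker.
WORKERS=$(( CORES > 1 ? CORES - 1 : 1 ))

echo "Using $WORKERS worker(s) on a host reporting $CORES core(s)"
echo "icenet_dataset_create -l 3 -ob 2 -w $WORKERS green_north north"
```

On a shared cluster node the number of cores you've actually been allocated may be lower than this reports.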

#### Config-only operation / Prediction datasets

Datasets used to predict don't benefit from caching, so adding the `-c` option and dropping `-w` and `-ob` will create a configuration for the dataset without writing sets to disk. You can also use this option to create a dataset that is fed directly from the preprocessed data, though bear in mind, depending on your infrastructure, that this requires the batches to be created on the fly and can have a significant impact on performance.

```bash
icenet_dataset_create -l 3 -c -fn pred_south green_south south
```

## Train

Once the dataset is prepared, running a network is then as simple as using `icenet_train` with the appropriate parameters. Some key parameters are illustrated in the following commands:
 
```bash
icenet_train -v green_south south_testnet.42 42 -b 4 -e 5 -m -qs 4 -w 4 -r 0.2 -n 0.6  

icenet_train -v green_south south_testnet.42 42 -b 4 -e 5 -m -qs 4 -w 4 -r 0.2 -n 0.6 \
    -p results/networks/south_testnet.42/south_testnet.42.network_green_south.42.h5  
```

These runs demonstrate using the aforementioned dataset, in `-b` batches of 4 for a run of `-e` five epochs. Using `-m` for multiprocessing we enable up to `-w` four process workers to load data at a time into a data queue `-qs` of length four. By specifying a `-r` ratio we use only 0.2x of the files from the dataset (_useful when testing on a low power machine as I was in this case_) supplying a UNet built with 0.6x the `-n` number of filters as normal.

With the second command we `-p` pickup the output weights from the previous run to continue training.
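When scripting repeated runs, you may want `-p` to be added only if weights from a previous run actually exist. A minimal sketch of that idea follows; `build_train_cmd` is a hypothetical helper and the arguments are copied from the example above. It prints the command it would run rather than running it.

```bash
# Hypothetical helper: print the training command, adding -p only when the
# given weights file is present. Arguments are those from the example above.
build_train_cmd() {
    local weights="$1"
    local cmd=(icenet_train -v green_south south_testnet.42 42
               -b 4 -e 5 -m -qs 4 -w 4 -r 0.2 -n 0.6)
    if [ -f "$weights" ]; then
        cmd+=(-p "$weights")
    fi
    echo "${cmd[*]}"
}

build_train_cmd "results/networks/south_testnet.42/south_testnet.42.network_green_south.42.h5"
```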

There are a few things to note about the `icenet_train` and `icenet_predict` (see [the prediction section below](#Predict)) commands and the switches they provide: 

* Common switches such as `-n` should be applied consistently between training and prediction. 
* These commands work with __individual network runs__ (see the next section).

### Ensemble running

For producing forecasts in the described pipeline we actually run a set of models using the [model-ensembler](https://github.com/JimCircadian/model-ensembler) tool and as such there are convenience scripts for doing this as part of the end to end run. 

```bash
./run_train_ensemble.sh \
    -b 4 -e 500 -f 1.2 -n node022 -p bashpc.sh -q 4 -j 3 \
    green_south green_south22 south_run
```

Many of the arguments are equivalent to the above `icenet_train` command. However, the `-n` filters factor is actually `-f` in this example (note that because I'm running on a cluster I've doubled this) and we have additional arguments `-n` for the node to run on, `-p` for the pre_run script to use and `-j` for the number of simultaneous runs to execute on the SLURM cluster we use at BAS. However, these arguments are not necessarily required for other clusters, nor is the model-ensembler tied to running on SLURM.

The pipeline repository shell scripts that provide this functionality are easily adaptable, as well as the ensemble itself which is stored in the pipeline repository under `/ensemble/`.

_Please review the `-h` help option for the script to gain further insight into the options available._

## Predict

Once the network is trained it is possible to run any suitable sets through the network for prediction. __This is the purpose of configuration only datasets__ which are used by `run_predict_ensemble` to, similarly to the training process, run predictions through all of the ensemble members.

Running an individual set through the test network from the test dataset we produced earlier is easily achieved. The steps are to create a date file, which can be produced from the configuration created by the `icenet_process_*` commands in the [processing section](#Process). This date file can then be supplied to the `icenet_predict` command to produce files using either cached data (useful for test data prepared at the same time as the training and validation sets) or directly from the normalised data (as is the case for nearly all data that isn't part of the training run).

```bash
./loader_test_dates.sh green_south | head -n 1 > testdate
cat testdate
# 2010-09-01
icenet_predict -n 1.2 -t green_south south_run example_south_forecast 42 testdate
```

The example uses the cached test data from the training run, but the process is the same for any other processed data with only the need to _omit the `-t` option, which specifies to source from cached test data_.
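Since the date file is just plain text, it can be worth checking before launching a prediction. The sketch below assumes one `YYYY-MM-DD` date per line, which is what the single-date example above suggests but is an assumption on my part for longer files; `validate_date_file` is not an IceNet command.

```bash
# Hypothetical helper: succeed only if every line looks like YYYY-MM-DD.
validate_date_file() {
    local file="$1" line lineno=0 bad=0
    # The extra test handles a final line without a trailing newline
    while IFS= read -r line || [ -n "$line" ]; do
        lineno=$((lineno + 1))
        if [[ ! "$line" =~ ^[0-9]{4}-[0-9]{2}-[0-9]{2}$ ]]; then
            echo "line $lineno: unexpected content '$line'" >&2
            bad=1
        fi
    done < "$file"
    return "$bad"
}
```

For example, `validate_date_file testdate && icenet_predict -n 1.2 -t green_south south_run example_south_forecast 42 testdate` only starts the prediction when the file passes the check. Note that this checks the format only, not whether each date is a real calendar date.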

### Outputs

From the above example the following logs are produced.

```
INFO:root:Loading configuration ./dataset_config.green_south.json
INFO:root:Loading configuration loader.green_south.json
INFO:root:Datasets: 913 train, 61 val and 31 test filenames
...tensorflow...
INFO:root:Processing batch 1 - item 0
INFO:root:Loading model from ./results/networks/south_run.42/south_run.42.network_green_south.42.h5...
INFO:root:Running prediction 0 - 2010-09-01
INFO:root:Saving 2010-09-01 - forecast output (1, 432, 432, 93)
INFO:root:Saving outputs generated for these inputs as well...
INFO:root:Saving 2010-09-01 - generated outputs (432, 432, 93, 1)
INFO:root:Saving 2010-09-01 - generated weights (432, 432, 93, 1)
```

The outputs are initially stored as NumPy arrays under the `results` directory as follows:

```
results/predict/example_south_forecast/south_run.42/2010_09_01.npy
results/predict/example_south_forecast/south_run.42/loader/outputs/2010_09_01.npy
results/predict/example_south_forecast/south_run.42/loader/weights/2010_09_01.npy
```
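The file names follow the forecast date with dashes swapped for underscores, so the expected location of a given output can be derived in the shell. `prediction_path` here is a hypothetical helper based purely on the listing above, and I'm assuming the pattern holds for other names and dates.

```bash
# Hypothetical helper: build the expected forecast path for one date.
# Arguments: forecast name, network run directory name, date as YYYY-MM-DD.
prediction_path() {
    local forecast="$1" network="$2" date="$3"
    # ${date//-/_} replaces every dash in the date with an underscore
    echo "results/predict/${forecast}/${network}/${date//-/_}.npy"
}

prediction_path example_south_forecast south_run.42 2010-09-01
# prints results/predict/example_south_forecast/south_run.42/2010_09_01.npy
```

Combined with a test such as `[ -f "$(prediction_path example_south_forecast south_run.42 2010-09-01)" ]`, this gives a cheap way to confirm that a prediction actually wrote its output.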

### Ensemble running

When producing daily forecasts for IceNet we train an ensemble of models and also run predictions across them, producing a mean and error across that model ensemble. To do this the pipeline repository offers the `run_predict_ensemble` script, which operates similarly to the above training script. An example of running the ensemble:

```bash
./run_predict_ensemble.sh \
    -b 1 -f 1.2 -p bashpc.sh -i green_south22 \
    south_run green_south south_ens_forecast testdate
```

As with the previous example, the individual numpy outputs, samples and sample weights are deposited into `/results/predict` for each ensemble member. However, the ensemble also runs `icenet_output` to generate __a CF-compliant NetCDF containing the forecasts requested__ which can then be post-processed or [deposited to an external location](#Uploading-to-Azure) (which is the platform for the [wider IceNet forecasting infrastructure](https://github.com/alan-turing-institute/IceNet-Project)).

```bash
icenet_output -o results/predict south_ens_forecast green_south testdate

INFO:root:Loading configuration ./dataset_config.green_south.json
INFO:root:Post-processing 2010-09-01
INFO:root:Dataset arr shape: (1, 432, 432, 93, 2)
INFO:root:Saving to results/predict/south_ens_forecast.nc
```

_Please review the `-h` help option for the script to gain further insight into the options available._

### Uploading to Azure

```bash
icenet_upload_azure results/predict/south_ens_forecast.nc 2010-09-01
```

## Other Pipeline Considerations

### A bit more information on ensemble runs

#### Cleaning up runs

Ensemble runs take place under `/ensemble/` in the pipeline folder and ARE NOT deleted after they've happened, to allow for debugging. Commonly, the ensemble configurations will contain a delete task to remove the extraneous run folders. __In the meantime this should be done manually__ after running `run_train_ensemble` or `run_predict_ensemble`.

The only exception to this is the use of `run_daily.sh` (see below) which does clean up prior to rerunning. 
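A manual clean-up might look something like the sketch below. I'm deliberately not assuming how the run folders are named: you supply a glob pattern yourself, and nothing is removed unless `--delete` is passed. The function name and both arguments are my own, not part of the pipeline scripts.

```bash
# Hypothetical helper: list (or, with --delete, remove) run folders under
# ensemble/ that match a glob pattern supplied by the caller.
clean_ensemble_runs() {
    local pattern="$1" confirm="${2:-}" dir
    # $pattern is intentionally unquoted so that the shell expands the glob
    for dir in ensemble/$pattern/; do
        [ -d "$dir" ] || continue    # skip an unmatched pattern
        if [ "$confirm" = "--delete" ]; then
            rm -rf -- "$dir"
            echo "removed $dir"
        else
            echo "would remove $dir"
        fi
    done
}
```

Run it once without `--delete` to see what would go, and take care that the pattern doesn't match the ensemble configuration that is also stored under `ensemble/`.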

### Daily execution

Daily execution is facilitated in the pipeline by using [`run_daily.sh`](https://github.com/antarctica/IceNet-Pipeline/blob/main/run_daily.sh). This wraps all the necessary steps to perform the following sequence for producing forecasts from yesterday for the next 93 days, for both northern and southern hemispheres. 

* Removes any old ensemble runs
* Downloads [HRES forecast data from the ECMWF MARS API](https://www.ecmwf.int/en/forecasts/datasets/catalogue-ecmwf-real-time-products)
* Processes the HRES and necessary training metadata to produce a data loader
* Creates a dataset configuration for it
* Runs a [prediction ensemble](#Predict) to produce a NetCDF
* Uploads to the necessary endpoint

#### Automation

With the above shell script it's trivial to automate using cron. Of course this is simply for demonstration, with more complex workflow managers offering far greater flexibility, especially when considering analysis of the produced forecasts.

```bash
# We assume your environment is configured appropriately to run conda from cron files, for example by adding...
#
# SHELL=/bin/bash
# BASH_ENV=~/.bashrc_env
#
# With conda initialisation in bashrc_env at the top of your crontab
25 9 * * * conda activate icenet; cd $HOME/hpc/icenet/pipeline && bash run_daily.sh >$HOME/daily.log 2>&1; conda deactivate
```
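One thing cron won't do for you is stop a new run starting while the previous one is still going. A simple, portable approach is a lock directory, sketched below; `run_locked` is a hypothetical wrapper, and the lock path in the usage note is just an example.

```bash
# Hypothetical wrapper: run a command only if the lock directory can be
# created. mkdir is atomic, so only one caller can hold the lock at a time.
run_locked() {
    local lockdir="$1"
    shift
    if ! mkdir "$lockdir" 2>/dev/null; then
        echo "Lock $lockdir exists, a previous run may still be active" >&2
        return 1
    fi
    "$@"
    local status=$?
    # Release the lock; if the job is killed outright the lock is left
    # behind and needs removing by hand.
    rmdir "$lockdir"
    return "$status"
}
```

In the crontab entry above, `bash run_daily.sh` could then become `run_locked /tmp/icenet_daily.lock bash run_daily.sh`, provided the function is defined in the file referenced by `BASH_ENV`.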

## Summary

Within this notebook we've attempted to give a full crash course to running the CLI tools both __manually__ and using the __pipeline helper scripts__. This is the first of four (currently) notebooks contained within the pipeline repository, covering further information: 

* [Data structure and analysis](02.data_analysis.ipynb): understand the structure of the data stores and products created by these workflows and what tools currently exist in IceNet to look over them.
* [Library usage](03.library_usage.ipynb): understand how to programmatically perform an end to end run.
* [Library extension](04.library_extension.ipynb): understand why and how to extend the IceNet library.

## Version
- IceNet Codebase: v0.1.0