## A guided overview of `otbench`

This notebook provides an overview of the `otbench` package, including the motivation and key design decisions of this project.

An interactive example focused on forecasting using the USNA $C_n^2$ small dataset is available [here](/notebooks/forecasting/usna_cn2_sm.ipynb).
A similar interactive example focused on regression using the MLO $C_n^2$ dataset is provided [here](/notebooks/regression/mlo_cn2.ipynb).

In [1]:
import pprint

from otbench.tasks import TaskApi

In [2]:
%matplotlib inline

In [3]:
pprinter = pprint.PrettyPrinter(indent=4, width=120, compact=True)

The optical turbulence benchmark package provides a set of tasks to be used for the evaluation of optical turbulence prediction methods.

The package includes both regression tasks and forecasting tasks. Each regression task assesses a model's ability to predict the optical turbulence strength (as measured by $C_n^2$) at a given location from a set of meteorological parameters. Each forecasting task assesses a model's ability to predict the optical turbulence strength some number of time steps in the future, given a set of prior meteorological parameters and previous measurements of the optical turbulence strength at that location.

Each task contains a dataset and a set of metrics.

#### Datasets

Datasets contain some number of timestamped observations of the optical turbulence strength alongside a set of potentially relevant meteorological and oceanographic parameters. Datasets may or may not contain missing measurements, and strive to conform to the [NetCDF](https://docs.unidata.ucar.edu/netcdf-c/current/index.html) Climate and Forecast (CF) [Metadata Conventions](http://cfconventions.org/).

Under the hood, datasets are stored as [xarray](http://xarray.pydata.org/en/stable/) `Dataset` objects. They are serialized to disk as [NetCDF](https://docs.unidata.ucar.edu/netcdf-c/current/index.html) files. Datasets are shared between one or more tasks. For example, the `mlo_cn2` dataset is used by both the `mlo_cn2` regression task and the `mlo_cn2_forecast` forecasting task.

The train, test, and validation splits are defined by the task using fixed indices. Tasks also define data processing pipelines that are applied to the data before it is used for training, testing, or validation. These can include common techniques such as removing rows with missing measurements, or taking the log of the optical turbulence strength. Finally, tasks define the set of features which are unavailable for training. Again taking the `mlo_cn2` tasks as an example, the target feature is the optical turbulence strength at a height of 15 \[m\]; in the regression task, the optical turbulence strength measurements at other heights are assumed to be unavailable for training or inference.
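
As a rough illustration of the kind of processing described above, the sketch below applies the three steps (dropping rows with missing values, log-transforming the target, and removing unavailable features) to a few toy rows using only the standard library. The helper `prepare_rows`, the `T_air` column, the toy values, and the choice of a base-10 logarithm are all assumptions made for illustration; they are not taken from the package, which works with xarray and pandas objects.

```python
import math


def _is_missing(value):
    # Treat None and NaN as missing measurements.
    return value is None or (isinstance(value, float) and math.isnan(value))


def prepare_rows(rows, target, remove, dropna=True, log_transform=True):
    """Toy stand-in for a task's processing pipeline (hypothetical helper)."""
    X, y = [], []
    for row in rows:
        # Optionally skip rows that contain a missing measurement.
        if dropna and any(_is_missing(v) for v in row.values()):
            continue
        t = row[target]
        # Optionally model log10(Cn2) rather than Cn2 (base 10 is an assumption).
        y.append(math.log10(t) if log_transform else t)
        # Drop the columns that the task marks as unavailable to the model.
        X.append({k: v for k, v in row.items() if k not in remove})
    return X, y


rows = [
    {"T_air": 12.0, "Cn2_6m": 2e-14, "Cn2_15m": 1e-14},
    {"T_air": None, "Cn2_6m": 3e-14, "Cn2_15m": 2e-14},  # dropped: missing T_air
    {"T_air": 14.5, "Cn2_6m": 5e-15, "Cn2_15m": 1e-15},
]
X, y = prepare_rows(rows, target="Cn2_15m", remove=["Cn2_6m", "Cn2_15m"])
print(X)  # only the T_air column remains as a feature
print(y)  # log-transformed target values for the two complete rows
```

The order of the steps here (filter, then transform, then remove columns) is one plausible arrangement; the package may order or implement them differently.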

Tasks evaluate the performance of a model on the test and validation splits using the metrics defined by the task.

#### Metrics

Metrics are used to evaluate the performance of a model or prediction method on the data. In the context of `otbench` tasks, they allow for rigorous comparison of different models and prediction methods. Metrics are defined by the task, and are evaluated on the test and validation splits of the dataset.

Many tasks use standard error metrics including:
* mean absolute error (MAE)
* coefficient of determination ($R^2$)
* root mean squared error (RMSE)
* mean absolute percentage error (MAPE)
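
For reference, the standard definitions of these metrics can be sketched in a few lines of plain Python. The function names below are chosen to mirror the metric names that appear in the task metadata, and the $R^2$ entry is computed as the coefficient of determination; the implementations are illustrative and are not the package's own code. The toy values are made up.

```python
import math


def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_square_error(y_true, y_pred):
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


def mean_absolute_percentage_error(y_true, y_pred):
    # Returned as a fraction (0.1 means 10%); undefined if a true value is zero.
    return sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


def coefficient_of_determination(y_true, y_pred):
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    return 1.0 - ss_res / ss_tot


# Toy log10(Cn2) values, invented for illustration.
y_true = [-14.0, -15.0, -13.0, -14.0]
y_pred = [-14.5, -15.0, -13.5, -14.0]

mae = mean_absolute_error(y_true, y_pred)
rmse = root_mean_square_error(y_true, y_pred)
mape = mean_absolute_percentage_error(y_true, y_pred)
r2 = coefficient_of_determination(y_true, y_pred)
```

Note that MAPE is sensitive to the scale of the target: it behaves very differently on raw $C_n^2$ values than on log-transformed ones, so scores are only comparable between models evaluated under the same task.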

All regression and forecasting tasks include some baseline models which can be applied to the prediction problem. Each task's metrics are evaluated on the baseline models and the results are stored in a shared `experiments.json` file. This allows for easy comparison of different models and prediction methods. When developing a new model or prediction method, it is recommended to compare the performance of the new method to the baseline models. After the new method has been evaluated, it can be programmatically added to the `experiments.json` file for future comparison using the task's interface.

#### Example: (`mlo_cn2`) regression task, without missing values

```
{
    'description': 'Regression task for MLO Cn2 data, ...',
    'description_long': 'This dataset evaluates ...',
    'dropna': True,
    'ds_name': 'mlo_cn2',
    'eval_metrics': ['root_mean_square_error', 'coefficient_of_determination', 'mean_absolute_error', 'mean_absolute_percentage_error'],
    'log_transform': True,
    'obs_lat': 19.53,
    'obs_lon': -155.57,
    'obs_tz': 'US/Hawaii',
    'remove': ['base_time', 'Cn2_6m', 'Cn2_15m', 'Cn2_25m'],
    'target': 'Cn2_15m',
    'val_idx': ['8367:10367'],
    'train_idx': ['0:8367'],
    'test_idx': ['10367:13943']
}
```

### Load the tasks

The `TaskApi` is the main entry point for the `otbench` API.

In [4]:
task_api = TaskApi()

The `TaskApi` provides access to the tasks, which in turn enable access to training, test, and validation data, benchmarking metrics, and evaluation of new prediction models or methods.

The tasks currently supported by the `otbench` package are accessible via the `TaskApi`:

In [5]:
task_api.list_tasks()

['forecasting.usna_cn2_sm.full.Cn2_3m',
 'regression.usna_cn2_lg.full.Cn2_3m',
 'forecasting.mlo_cn2.dropna.Cn2_15m',
 'regression.usna_cn2_sm.full.Cn2_3m',
 'regression.mlo_cn2.dropna.Cn2_15m',
 'regression.mlo_cn2.full.Cn2_15m']
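
The names listed above appear to follow a `<task type>.<dataset>.<missing-value handling>.<target>` pattern. That reading is inferred from the listing rather than from documented behavior, so the sketch below should be treated as an assumption; `TaskName` and `parse_task_name` are hypothetical helpers, not part of the package.

```python
from typing import NamedTuple


class TaskName(NamedTuple):
    task_type: str  # "regression" or "forecasting"
    dataset: str    # e.g. "mlo_cn2"
    variant: str    # e.g. "dropna" or "full"
    target: str     # e.g. "Cn2_15m"


def parse_task_name(name: str) -> TaskName:
    """Split a dotted task name into its four assumed fields."""
    parts = name.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected 4 dot-separated fields, got {len(parts)}: {name!r}")
    return TaskName(*parts)


task_names = [
    "forecasting.usna_cn2_sm.full.Cn2_3m",
    "regression.usna_cn2_lg.full.Cn2_3m",
    "forecasting.mlo_cn2.dropna.Cn2_15m",
    "regression.usna_cn2_sm.full.Cn2_3m",
    "regression.mlo_cn2.dropna.Cn2_15m",
    "regression.mlo_cn2.full.Cn2_15m",
]

# Select only the regression tasks that use the mlo_cn2 dataset.
mlo_regression = [
    name for name in task_names
    if parse_task_name(name).task_type == "regression"
    and parse_task_name(name).dataset == "mlo_cn2"
]
```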

As an illustrative example, we can load the `mlo_cn2` regression task with missing values removed and develop a new model for predicting optical turbulence strength.

In [6]:
task = task_api.get_task("regression.mlo_cn2.dropna.Cn2_15m")

The `task` object gives access to the description and associated metadata surrounding the task.

In [7]:
task_info = task.get_info()
pprinter.pprint(task_info)

{   'description': 'Regression task for MLO Cn2 data, where the last 12 days are set aside for evaluation',
    'description_long': 'This dataset evaluates regression approaches for predicting the extent of optical turbulence, '
                        'as measured by Cn2 at an elevation of 15m. Optical turbulence on data collected at the Mauna '
                        'Loa Solar Observatory between 27 July 2006 and 8 August 2006, inclusive, are used to evaluate '
                        'prediction accuracy under the root-mean square error metric.',
    'dropna': True,
    'ds_name': 'mlo_cn2',
    'eval_metrics': ['root_mean_square_error', 'coefficient_of_determination', 'mean_absolute_error', 'mean_absolute_percentage_error'],
    'log_transform': True,
    'obs_lat': 19.53,
    'obs_lon': -155.57,
    'obs_tz': 'US/Hawaii',
    'remove': ['base_time', 'Cn2_6m', 'Cn2_15m', 'Cn2_25m'],
    'target': 'Cn2_15m',
    'test_idx': ['10367:13943'],
    'train_idx': ['0:8367'],
    'val_id

As seen above, the `regression.mlo_cn2.dropna.Cn2_15m` task is focused on predicting the optical turbulence strength at a height of 15 \[m\] at the Mauna Loa Observatory (MLO) in Hawaii. The task uses the `mlo_cn2` dataset, which is a dataset of optical turbulence strength measurements at the MLO. The `task` contains an `obs_tz` attribute which specifies the timezone of the observatory. The latitude and longitude of the observatory are also provided as `obs_lat` and `obs_lon` attributes.

The `task` also contains a `target` attribute which specifies the target feature for the task. The task is focused on predicting the optical turbulence strength at a height of 15 \[m\], and the optical turbulence strength measurements at heights of 6 and 25 \[m\] are assumed to be unavailable for training or inference.

To ensure consistency and robust comparison between modeling approaches, the `train_idx`, `test_idx`, and `val_idx` are fixed for the given task. The `train_idx` and `val_idx` attributes specify the indices of the dataset which are available for model development. The `test_idx` attribute specifies the indices of the dataset which are used to evaluate the model and compare it against existing benchmarks for the task.
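
To make the fixed splits concrete, the index strings reported for this task can be read as row ranges. The sketch below assumes that each `'start:stop'` string follows Python's half-open slice convention (start included, stop excluded); that interpretation, and the `parse_index` helper, are assumptions rather than documented behavior.

```python
def parse_index(spec: str) -> range:
    """Convert a 'start:stop' string into a half-open range of row positions."""
    start, stop = (int(part) for part in spec.split(":"))
    return range(start, stop)


# Index strings copied from the task metadata for regression.mlo_cn2.dropna.Cn2_15m.
splits = {
    "train": parse_index("0:8367"),
    "val": parse_index("8367:10367"),
    "test": parse_index("10367:13943"),
}

# Number of rows in each split.
sizes = {name: len(rows) for name, rows in splits.items()}

# Under the half-open reading, each split starts where the previous one ends,
# so no row is shared between splits.
ordered = sorted(splits.values(), key=lambda r: r.start)
contiguous = all(a.stop == b.start for a, b in zip(ordered, ordered[1:]))

print(sizes)
print(contiguous)
```

Because the splits are chronological blocks rather than random samples, the held-out rows come from the end of the observation period.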

The task is evaluated using the root mean squared error (RMSE), coefficient of determination ($R^2$), mean absolute error (MAE), and mean absolute percentage error (MAPE) metrics. The task is evaluated on the test and validation splits of the dataset, and the training split is used for training new models.

Get the training data

In [8]:
X_train, y_train = task.get_train_data(data_type="pd")

The `otbench` package attempts to make as few assumptions as possible about the API surface of a model or prediction method. The main constraint is that, during evaluation, each model is called on the validation set with input in the same form as that returned by the `get_train_data` method with the `data_type` argument set to `pd`.

Models can take many forms, from simple statistical models such as predicting the mean value seen during training, to complex deep learning models. The `otbench` package does not attempt to provide a unified API for developing all models; instead, it provides a set of tools for evaluating models against the tasks.

Existing statistical and parametric techniques are included under the `otb.benchmark.models` module. These models provide samples of best practices for developing new models for the tasks. An example statistical method which predicts the mean value seen during training is included below.

```python
from typing import Union

import numpy as np


class PersistanceRegressionModel:
    """Baseline that always predicts the mean target value seen during training."""

    def __init__(
        self,
        name: str,
        **kwargs
    ):
        self.name = name
        self.mean = np.nan  # set by train()

    def train(self, X: 'pd.DataFrame', y: Union['pd.DataFrame', 'pd.Series', np.ndarray]):
        # X is unused; it is accepted to maintain the same interface as the other models
        self.mean = np.mean(y)

    def predict(self, X: 'pd.DataFrame'):
        # predict the training mean for each entry in X
        return np.full(len(X), self.mean)
```

When evaluated, the `PersistanceRegressionModel`'s performance is measured by calling the `predict` method on the validation data and comparing the results to the ground truth values. The `PersistanceRegressionModel` has already been evaluated against the metrics defined by the task, and the results are stored in the `experiments.json` file.
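
The evaluation step described above amounts to calling the model once on held-out inputs and scoring its predictions with each metric. A minimal, standard-library-only sketch of that loop follows. The `MeanModel` class is a simplified analogue of the mean-predicting baseline, and `evaluate` is a hypothetical helper; neither is the package's actual evaluation interface, and the toy data is invented.

```python
import math
from statistics import fmean


class MeanModel:
    """Simplified analogue of a baseline that predicts the training mean."""

    def __init__(self):
        self.mean = math.nan

    def train(self, X, y):
        self.mean = fmean(y)

    def predict(self, X):
        return [self.mean] * len(X)


def rmse(y_true, y_pred):
    return math.sqrt(fmean((t - p) ** 2 for t, p in zip(y_true, y_pred)))


def evaluate(predict_call, X_val, y_val, metrics):
    """Hypothetical evaluation loop: predict once, then apply every metric."""
    y_pred = predict_call(X_val)
    return {name: fn(y_val, y_pred) for name, fn in metrics.items()}


# Toy data: one feature per row, targets on a log10(Cn2)-like scale.
X_train, y_train = [[0.0], [1.0], [2.0]], [-14.0, -15.0, -16.0]
X_val, y_val = [[3.0], [4.0]], [-14.0, -16.0]

model = MeanModel()
model.train(X_train, y_train)
scores = evaluate(model.predict, X_val, y_val, {"root_mean_square_error": rmse})
```

Passing the bound `predict` method, rather than the model object, keeps the evaluation code independent of how the model was built or trained.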

More information on using the `otbench` package to evaluate new models can be found in the [regression overview](regression/modeling.ipynb) notebook and the [forecasting overview](forecasting/usna_cn2_sm.ipynb).