# Template Experiment notebook
This notebook walks you through how to read, add and modify experiments.

# Table of contents

TODO: Expand

# 0. Set-up and necessary requirements

TODO: Explain the set-up and requirements

In [None]:
!pip uninstall NeuralNetworksTrainingPackage -y
#%pip uninstall TabularExperimentTrackerClient -y
!pip install git+https://github.com/DanielWarfield1/TabularExperimentTrackerClient
!pip install git+https://github.com/Bartosz-G/NeuralNetworksTrainingPackage

In [None]:
import numpy as np
import pandas as pd
import sklearn
import torch
import time

In [None]:
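# NOTE: ExperimentOrchstratorClient is not imported in the cells above;
# import it from the tracker client package installed earlier before running this cell.
ex = ExperimentOrchstratorClient()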
ex.get_credentials

# 1. Defining the experiment

## 1.1 Defining hyperparameter space

In this example we'll create 2 models:
- One sklearn model
- One Pytorch model

We begin by creating the possible space of hyperparameters and parameters which will later be used to build our models.

In [None]:
our_sklearn_model_space = {'param_1': {"distribution": "int_uniform", "min":1, "max":11},
                           'param_2': {"distribution": "float_uniform", "min":0.5, "max":1}}

our_pytorch_regression_model_space = {'param_1': {'distribution': 'constant', 'value': 0.1},
                                      'param_2': {'distribution': 'categorical', 'values':[10, 10, 15, 20]}}

our_pytorch_classification_model_space = {'param_1': {'distribution': 'constant', 'value': 0.1},
                                         'param_2': {'distribution': 'categorical', 'values':[10, 10, 15, 20]}}

Possible distributions:
- TODO: list all distributions

Remember that your later code should be able to re-create the models and their training from the hyperparameter space alone, so we also recommend including parameters like 'seed', 'cuda' and 'task'.

### Example:

In [None]:
XGBoost_space = {
    "max_depth": {"distribution": "int_uniform", "min":1, "max":11},
    "n_estimators": {"distribution": "int_uniform", "min":100, "max":200},
    "min_child_weight": {"distribution": "log_uniform", "min":1, "max":1e2},
    "subsample": {"distribution": "float_uniform", "min":0.5, "max":1},
    "colsample_bylevel": {"distribution": "float_uniform", "min":0.5, "max":1},
    "colsample_bytree": {"distribution": "float_uniform", "min":0.5, "max":1},
    "gamma": {"distribution": "log_uniform", "min":1e-8, "max":7},
    "reg_lambda": {"distribution": "log_uniform", "min":1, "max":4},
    "reg_alpha": {"distribution": "log_uniform", "min":1e-8, "max":1e2},
}

LCN_reg_SGD_space = {
    'depth': {'distribution': 'int_uniform', 'min':1, 'max':11},
    'seed': {'distribution': 'constant', 'value': 42},
    'drop_type': {'distribution': 'categorical', 'values':['node_dropconnect', 'none']},
    'p': {'distribution': 'float_uniform', 'min':0.25, 'max':0.75},
    'ensemble_n': {'distribution': 'constant', 'value': 1},
    'shrinkage': {'distribution': 'constant', 'value': 1},
    'back_n': {'distribution': 'categorical', 'values':[0, 0, 0, 1]},
    'net_type': {'distribution': 'constant', 'value': 'locally_constant'},
    'hidden_dim': {'distribution': 'constant', 'value': 1},
    'anneal': {'distribution': 'categorical', 'values':['interpolation', 'none', 'approx']},
    'optimizer': {'distribution': 'constant', 'value': 'SGD'},
    'batch_size': {'distribution': 'categorical', 'values':[16,32,64,64,64,128,256]},
    'epochs': {'distribution': 'constant', 'value': 30},
    'lr': {'distribution': 'log_uniform', 'min':0.05, 'max':0.2},
    'momentum': {'distribution': 'constant', 'value': 0.9},
    'no_cuda': {'distribution': 'constant', 'value': False},
    'lr_step_size': {'distribution': 'categorical', 'values':[10, 10, 15, 20]},
    'gamma': {'distribution': 'constant', 'value': 0.1},
    'task': {'distribution': 'constant', 'value': 'regression'}
}
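To make the distribution names concrete, here is a minimal sketch of how one configuration could be drawn from a space like the ones above. It is not the experiment tracker's own sampler: the function `sample_hyperparameters` is hypothetical, and it assumes that `int_uniform` bounds are inclusive and that `log_uniform` means uniform in log-space.

```python
import math
import random


def sample_hyperparameters(space, rng):
    """Draw one configuration from a space dict (illustrative only)."""
    sampled = {}
    for name, spec in space.items():
        kind = spec["distribution"]
        if kind == "constant":
            sampled[name] = spec["value"]
        elif kind == "categorical":
            # Repeated entries, e.g. [10, 10, 15, 20], make a value more likely.
            sampled[name] = rng.choice(spec["values"])
        elif kind == "int_uniform":
            # Assumption: both bounds are inclusive.
            sampled[name] = rng.randint(spec["min"], spec["max"])
        elif kind == "float_uniform":
            sampled[name] = rng.uniform(spec["min"], spec["max"])
        elif kind == "log_uniform":
            # Assumption: uniform in log-space between min and max.
            low, high = math.log(spec["min"]), math.log(spec["max"])
            sampled[name] = math.exp(rng.uniform(low, high))
        else:
            raise ValueError(f"Unknown distribution: {kind!r}")
    return sampled


example_space = {
    "max_depth": {"distribution": "int_uniform", "min": 1, "max": 11},
    "subsample": {"distribution": "float_uniform", "min": 0.5, "max": 1},
    "lr": {"distribution": "log_uniform", "min": 0.05, "max": 0.2},
    "lr_step_size": {"distribution": "categorical", "values": [10, 10, 15, 20]},
    "seed": {"distribution": "constant", "value": 42},
}

config = sample_hyperparameters(example_space, random.Random(0))
print(config)
```

Passing an explicit `random.Random` instance keeps the draw reproducible, which matches the recommendation to store a 'seed' in the space.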

## 1.2 Assigning distributions to their models:

In [None]:
model_groups = {
    'our_sklearn_model':{'model':'our_sklearn_model', 'hype':our_sklearn_model_space},
    'our_pytorch_regression_model':{'model':'our_pytorch_regression_model', 'hype':our_pytorch_regression_model_space},
    'our_pytorch_classification_model':{'model':'our_pytorch_classification_model', 'hype':our_pytorch_classification_model_space},
}

ex.def_model_groups(model_groups)

## 1.3 Assign experiments to appropriate tasks

TODO Explain:
- opml_reg_purnum_group:
- opml_reg_numcat_group:
- opml_class_purnum_group:
- opml_class_numcat_group:

In [None]:
regression_models = {'our_sklearn_model':{'model':'our_sklearn_model', 'hype':our_sklearn_model_space},
                     'our_pytorch_regression_model':{'model':'our_pytorch_regression_model', 'hype':our_pytorch_regression_model_space}}

classification_models = {
    'our_sklearn_model':{'model':'our_sklearn_model', 'hype':our_sklearn_model_space},
    'our_pytorch_classification_model':{'model':'our_pytorch_classification_model', 'hype':our_pytorch_classification_model_space}}


applications = {'opml_reg_purnum_group': regression_models,
                'opml_reg_numcat_group': regression_models,
                'opml_class_purnum_group': classification_models,
                'opml_class_numcat_group': classification_models}

ex.def_applications(applications)

## 1.4 Registering an Experiment

In [None]:
experiment_name = 'template_experiment'
ex.reg_experiment(experiment_name)
exp_info = ex.experiment_info()
successful_runs = exp_info['successful_runs']
required_runs = exp_info['required_runs']

# 2. Task pre-processing steps

In our experiments data pre-processing steps are applied right after downloading datasets in the main loop of the experiment.
Given that different models and tasks might require different pre-processing steps, we apply them in an event-driven fashion. The events are split into task-specific pre-processing events, applied based on the task, and model-specific pre-processing events, applied based on the model. The main object holding and executing all pre-processing steps is the `dataPreProcessingEventEmitter()` from `NeuralNetworksTrainingPackage.event_handler`.

### 2.1 Defining task specific pre-processing steps

We begin by initialising the task pre-processing steps stored in the `NeuralNetworksTrainingPackage.dataprocessing.basic_pre_processing` module. Task pre-processing steps come in the form of classes. When initialised, these objects store the parameters that control how the pre-processing step is applied.

In [None]:
from NeuralNetworksTrainingPackage.dataprocessing.basic_pre_processing import filterCardinality, quantileTransform, trunctuateData, oneHotEncodePredictors, oneHotEncodeTargets, toDataFrame, splitTrainValTest

n_sample = 20000
split = [0.5, 0.25, 0.25]
quantile_transform_distribution='uniform'

filter_cardinality = filterCardinality(transform = 'all')
trunctuate_data = trunctuateData(n = n_sample, transform = 'all')
one_hot_encode_predictors = oneHotEncodePredictors(transform = 'all')
one_hot_encode_targets = oneHotEncodeTargets(transform = 'all')
to_data_frame = toDataFrame(transform = 'all')
split_train_val_test = splitTrainValTest(split = split) # Special transformation
quantile_transform = quantileTransform(output_distribution = quantile_transform_distribution, transform = 'train')

Pre-processing steps fall into two kinds: ordinary and special.

Ordinary steps:
- `trunctuateData(n, seed = None, transform = 'all')`
- `filterCardinality(transform = 'all')`
- `quantileTransform(n_quantiles=1000, output_distribution='uniform', ignore_implicit_zeros=False, subsample=10000, random_state=None, copy=True, transform = 'all')`
- `toDataFrame(transform = 'all')`
- `oneHotEncodePredictors(transform = 'all')`
- `oneHotEncodeTargets(transform = 'all')`

Ordinary classes all come with a `transform = 'all'` parameter. This parameter dictates whether the transformation is applied to the entire dataset (`'all'`), only the training set (`'train'`), only the validation set (`'val'`) or only the test set (`'test'`). Bear in mind that `transform = 'train'/'val'/'test'` is only available after the `splitTrainValTest` special step; before that step, `transform` should be left at its default, `'all'`.

Special steps:
- `splitTrainValTest(split = [0.5, 0.25, 0.25])`
- `toPyTorchDatasets(wrapper = CustomDataset)`

`splitTrainValTest(split = [train, val, test])` splits the dataset into the train, validation and test sets. `toPyTorchDatasets(wrapper = CustomDataset)` wraps the datasets in a PyTorch-friendly dataset format. Bear in mind that after calling `toPyTorchDatasets`, no other pre-processing steps can be called. More on that in section 3.
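
For intuition about what a fractional split means in row counts, the sketch below converts fractions into partition sizes. It is only an illustration, not the package's implementation of `splitTrainValTest`: the helper `split_sizes` is hypothetical, and giving the rounding remainder to the test set is an assumption.

```python
def split_sizes(n_rows, split):
    """Turn fractional splits into row counts that sum to n_rows."""
    if abs(sum(split) - 1.0) > 1e-9:
        raise ValueError("split fractions must sum to 1")
    train = int(n_rows * split[0])
    val = int(n_rows * split[1])
    # Assumption: the remainder goes to the test set so that no row is lost.
    test = n_rows - train - val
    return train, val, test


sizes = split_sizes(20000, [0.5, 0.25, 0.25])
print(sizes)  # (10000, 5000, 5000)
```

With `n_sample = 20000` and `split = [0.5, 0.25, 0.25]`, a step restricted to `transform = 'train'` would therefore see roughly half of the truncated dataset.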


### 2.2 Adding pre-processing step objects to an event listener

Next, we need to register the objects we've initialised under their corresponding tasks. `dataPreProcessingEventEmitter` works like a standard event emitter: it fires an event when it receives the event's name, in our case `regression` or `classification`. Feel free to apply different pre-processing steps to different tasks.

In [None]:
from NeuralNetworksTrainingPackage.event_handler import dataPreProcessingEventEmitter

data_pre_processing = dataPreProcessingEventEmitter()

# Transformations will be applied in the order they're added to data_pre_processing
data_pre_processing.add_pre_processing_step('regression', filter_cardinality)
data_pre_processing.add_pre_processing_step('regression', trunctuate_data)
data_pre_processing.add_pre_processing_step('regression', one_hot_encode_predictors)
data_pre_processing.add_pre_processing_step('regression', to_data_frame)
data_pre_processing.add_pre_processing_step('regression', split_train_val_test)
data_pre_processing.add_pre_processing_step('regression', quantile_transform)


data_pre_processing.add_pre_processing_step('classification', filter_cardinality)
data_pre_processing.add_pre_processing_step('classification', trunctuate_data)
data_pre_processing.add_pre_processing_step('classification', one_hot_encode_predictors)
data_pre_processing.add_pre_processing_step('classification', one_hot_encode_targets) # different steps can be applied to different tasks
data_pre_processing.add_pre_processing_step('classification', to_data_frame)
data_pre_processing.add_pre_processing_step('classification', split_train_val_test)
data_pre_processing.add_pre_processing_step('classification', quantile_transform)

### 2.3 Adding your own pre-processing steps

If you'd like to add your own pre-processing steps, use the following code template. Every pre-processing step must define `self.seed = None`, `self.parent = None`, and `self.transform = transform`; the last one needs to be passed as an argument when initialising the object. Transformation steps receive X as a `pd.DataFrame`, y as either a `pd.DataFrame` or a `pd.Series`, categorical_indicator as a `List[bool]` and attribute_names as a `List[str]`, and they must return all four in the same format. You need to ensure proper type handling yourself! Transformation steps should be stateless. Although every step has a `transform` attribute, you don't need to code separate transformations for `train`, `val` and `test` yourself: the `dataPreProcessingEventEmitter` calls each step on the appropriate partitions of the data.

In [None]:
class exampleTransformationTemplate():
    def __init__(self, transform = 'all'): # You can add more arguments
        self.seed = None
        self.parent = None
        self.transform = transform
        # you can add more

    def apply(self, X, y, categorical_indicator, attribute_names):
        # Fall back to the emitter's seed; `is None` keeps a valid seed of 0 from being overwritten
        if self.seed is None:
            self.seed = self.parent.seed

        # ---------
        # Add your own code here
        # Remember to adjust categorical_indicator as well as attribute_names if your transformation adjusts them
        # ---------

        return X, y, categorical_indicator, attribute_names
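
As an illustration of the template, here is a minimal sketch of a concrete step. The class `dropConstantColumns` is hypothetical and not part of `NeuralNetworksTrainingPackage`; it removes predictors that hold a single value and keeps `categorical_indicator` and `attribute_names` aligned with the remaining columns. Because it involves no randomness, it never reads `self.seed`.

```python
import pandas as pd


class dropConstantColumns:
    """Hypothetical step: remove predictors that hold a single value."""

    def __init__(self, transform="all"):
        self.seed = None
        self.parent = None
        self.transform = transform

    def apply(self, X, y, categorical_indicator, attribute_names):
        # Keep only the columns with more than one distinct value.
        keep = [X[col].nunique(dropna=False) > 1 for col in X.columns]
        X = X.loc[:, keep]
        # Keep the metadata aligned with the remaining columns.
        categorical_indicator = [c for c, k in zip(categorical_indicator, keep) if k]
        attribute_names = [a for a, k in zip(attribute_names, keep) if k]
        return X, y, categorical_indicator, attribute_names


X = pd.DataFrame({"a": [1, 2, 3], "b": [7, 7, 7], "c": ["x", "y", "x"]})
y = pd.Series([0.1, 0.2, 0.3])

step = dropConstantColumns()
X_out, y_out, cat_out, names_out = step.apply(X, y, [False, False, True], ["a", "b", "c"])
print(list(X_out.columns), cat_out, names_out)
```

A step like this would be registered in the same way as the built-in ones, for example under the `'regression'` event.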

# 3. Model specific transformations

After the task-specific data transformations, we can add model-specific transformations. These are optional: if the event emitter doesn't find any steps registered under a model's name, it does nothing. Remember to add them to the same event emitter as the task-specific ones!

In [None]:
from NeuralNetworksTrainingPackage.dataprocessing.basic_pre_processing import CustomDataset, toPyTorchDatasets

to_pytorch_datasets = toPyTorchDatasets(wrapper = CustomDataset)

# Transformations will be called after general pre-processing steps, and in order they're added
data_pre_processing.add_pre_processing_step('our_pytorch_regression_model', to_pytorch_datasets)

data_pre_processing.add_pre_processing_step('our_pytorch_classification_model', to_pytorch_datasets)

# If a model doesn't require any additional pre-processing steps, don't register anything for it

`toPyTorchDatasets(wrapper = CustomDataset)` wraps the datasets in a PyTorch-friendly dataset format. Bear in mind that after calling `toPyTorchDatasets`, no other pre-processing steps can be called. By default, when indexed, `CustomDataset` returns a tuple of two tensors of sizes `torch.Size([1, number_of_predictor_columns])` and `torch.Size([1, number_of_outcome_columns])`, both of `torch.float` data type. `CustomDataset` can be passed directly to `torch.utils.data.DataLoader()`. If your model requires different dimensions, you can create your own PyTorch dataset object and pass it as the wrapper to `toPyTorchDatasets`; skeleton code is given in section 3.1.

### 3.1 Creating your own format of output

The Dataset object has the same requirements as a standard PyTorch dataset object, with 2 additional constraints:
- it has to take in specifically X, y, categorical_indicator, attribute_names
- it has to have a `get_dims` method that returns the number of predictor columns and output columns in the form of a dict

You can find more on creating your own dataset object in pytorch's official documentation:
https://pytorch.org/tutorials/beginner/data_loading_tutorial.html

In [None]:
class CustomDataset(torch.utils.data.Dataset):
    def __init__(self, X, Y, categorical_indicator, attribute_names, tensor_type=torch.float):
        assert isinstance(X, pd.DataFrame), "X must be a Pandas DataFrame"
        assert isinstance(Y, pd.DataFrame), "Y must be a Pandas DataFrame"

        # ---------
        # Add your own code here
        # ---------

    def get_dims(self):

        # ---------
        # Add your own code here
        # ---------

        return {'input_dim': num_columns_X, 'output_dim':num_columns_Y}

    def __len__(self):

        # ---------
        # Add your own code here
        # ---------

        return

    def __getitem__(self, idx):

        # ---------
        # Add your own code here
        # ---------

        return

### 3.2 How data transformations will be applied

The `data_pre_processing` object of the `dataPreProcessingEventEmitter` class has several methods:
- `.add_pre_processing_step('event_name', transformation_object)` adds a new transformation step object under that event name
- `.set_seed_for_all(seed)` sets the same seed for all transformations which require one
- `.set_dataset(X, y, categorical_indicator, attribute_names)` passes the dataset to the emitter
- `.apply('event_name')` applies all of the transformations defined under the event name
- `.get_train_val_test()` retrieves the train, val, and test datasets from the `data_pre_processing` object
- `.reset(hard = False)` resets the dataset; pass `hard = True` to reset everything, including all added transformation steps (called automatically when calling `.set_dataset`)
- `.get('name')` retrieves a specific split of the data; name can be `train`, `val` or `test`

When `.apply('event_name')` doesn't find any `'event_name'` it does nothing.
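
The dispatch behaviour described above (steps run in the order they were added, and an unknown event name is a no-op) can be sketched with a toy emitter. `ToyEventEmitter` is an illustrative stand-in written for this notebook, not the package's `dataPreProcessingEventEmitter`; it works on a plain list and uses plain functions as steps to stay self-contained.

```python
from collections import defaultdict


class ToyEventEmitter:
    """Illustrative stand-in; not the package's dataPreProcessingEventEmitter."""

    def __init__(self):
        self.steps = defaultdict(list)  # event name -> steps in insertion order
        self.data = None

    def add_pre_processing_step(self, event_name, step):
        self.steps[event_name].append(step)

    def set_dataset(self, data):
        self.data = data

    def apply(self, event_name):
        # .get avoids creating an entry, so unknown events stay a no-op.
        for step in self.steps.get(event_name, []):
            self.data = step(self.data)


emitter = ToyEventEmitter()
emitter.add_pre_processing_step("regression", lambda rows: [r for r in rows if r is not None])
emitter.add_pre_processing_step("regression", lambda rows: [r * 2 for r in rows])

emitter.set_dataset([1, None, 3])
emitter.apply("regression")     # runs both steps in the order they were added
emitter.apply("unknown_model")  # no steps registered, so the data is unchanged
print(emitter.data)  # [2, 6]
```

This is why task events and model events can share one emitter: calling `.apply` with a model name that has no registered steps leaves the data exactly as the task steps produced it.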

In [None]:
# Example
data_pre_processing.set_seed_for_all(seed)
data_pre_processing.set_dataset(X, y, categorical_indicator, attribute_names)
data_pre_processing.apply('regression')
data_pre_processing.apply('our_pytorch_regression_model')
train_data, val_data, test_data = data_pre_processing.get_train_val_test()

By default, if `toPyTorchDatasets(wrapper = CustomDataset)` is not applied, `train_data`, `val_data` and `test_data` each come as a tuple `(X, y, categorical_indicator, attribute_names)`. If `toDataFrame` is included in the steps, both X and y are guaranteed to be `pandas.DataFrame`. `one_hot_encode_targets` one-hot encodes y, so y has 2 columns for binary classification and n columns for n-class classification.

# 4. Model Metrics

Next we need to define how our metrics will be calculated. Because different models output different formats, you need to write your own class that is responsible for calculating metrics. You're also free to add more metrics. Even if you're using different models for different tasks, you still need to register them under both `model_name` and `task`; this ensures that metrics are assigned to the right model and task.

In [None]:
our_sklearn_metrics_for_both_regression_and_classification = ourSklearnMetricsForBothRegressionAndClassification()
our_pytorch_regression_metrics = ourPytorchRegressionMetrics()
our_pytorch_classification_metrics = ourPytorchClassificationMetrics()


sklearn_metrics = {'regression': our_sklearn_metrics_for_both_regression_and_classification,
                   'classification': our_sklearn_metrics_for_both_regression_and_classification}

pytorch_regression_metrics = {'regression': our_pytorch_regression_metrics}

pytorch_classification_metrics = {'classification': our_pytorch_classification_metrics}

# Each model is paired with the metrics dict that matches its task
metric_model_pairs = {
    'our_sklearn_model': sklearn_metrics,
    'our_pytorch_regression_model': pytorch_regression_metrics,
    'our_pytorch_classification_model': pytorch_classification_metrics
}
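
The metrics classes instantiated above are placeholders that you write yourself. As a rough sketch of what one might look like, the class below computes a few regression metrics in plain Python. The method name `calculate`, its `(y_true, y_pred)` signature and the returned dict keys are all assumptions made for illustration; check what interface the training package actually calls before adopting this shape.

```python
import math


class ourPytorchRegressionMetrics:
    """Hypothetical metrics class; the method name and signature are assumptions."""

    def calculate(self, y_true, y_pred):
        if len(y_true) != len(y_pred) or len(y_true) == 0:
            raise ValueError("y_true and y_pred must be non-empty and equally long")
        n = len(y_true)
        errors = [pred - true for true, pred in zip(y_true, y_pred)]
        mse = sum(e * e for e in errors) / n
        mae = sum(abs(e) for e in errors) / n
        return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae}


metrics = ourPytorchRegressionMetrics().calculate([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
print(metrics)
```

For a PyTorch model, the predictions would first need to be converted from tensors to plain numbers (or the arithmetic rewritten with tensor operations) before a class like this could consume them.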