# Delphi to ATM format conversion

## Overview

This notebook loads the old Delphi CSVs and converts them to the current ATM ModelHub format: it creates some new columns, renames some of the old ones, and reorders them.

### Warning

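Some columns are renamed to match the current ATM names, but their data is not converted, so its format still does not match the new one.

This affects all the Base64-encoded columns (the ones whose names end in _64), which keep the plain-text values from Delphi.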
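
As a rough illustration of what a later conversion step could look like, the sketch below parses a Delphi-style `key:value;` string and encodes the result as a Base64 string. It assumes that the ATM Base64 columns hold a Base64-encoded pickle of a Python object, which this notebook does not confirm. `parse_delphi_pairs`, `encode_object` and `decode_object` are hypothetical helper names, and all values are kept as strings rather than cast to their real types.

```python
import base64
import pickle


def parse_delphi_pairs(text):
    """Split a 'key:value;key:value;' string into a dict of strings."""
    pairs = {}
    for item in text.split(';'):
        if not item:
            continue  # skip the empty piece after the trailing ';'
        # Split on the first ':' only, because values such as
        # '(1, 2):INT:NOCAT' contain further colons.
        key, _, value = item.partition(':')
        pairs[key] = value
    return pairs


def encode_object(obj):
    """ASSUMPTION: the target format is a Base64-encoded pickle."""
    return base64.b64encode(pickle.dumps(obj)).decode('ascii')


def decode_object(text):
    """Inverse of encode_object."""
    return pickle.loads(base64.b64decode(text))


parsed = parse_delphi_pairs('n_estimators:1000;criterion:entropy;')
encoded = encode_object(parsed)
print(decode_object(encoded) == parsed)
```

Only unpickle data from a trusted source; the round trip here is safe because the encoded string is produced in the same snippet.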

### Configuration

The notebook has several options which are configured in the cell below.
These are:

* EMPTY_NEW_COLUMNS: If **True**, any column that exists in ATM but did not exist in Delphi will be created with empty data. If **False**, no new columns will be created.
* READ_BUCKET: If provided, the CSVs will be downloaded from the given S3 bucket. If empty, they will be read from a local folder.
* WRITE_BUCKET: If provided, the CSVs will be uploaded to the given S3 bucket. Otherwise they will be saved in a local folder.
* READ_PATH: Folder (or S3 key prefix) to read the CSVs from.
* WRITE_PATH: Folder to write the CSVs to.
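
The effect of EMPTY_NEW_COLUMNS on the output can be sketched as a filter over a single ordered column list. This is only an illustration of the option's semantics, not the code the notebook runs: `select_columns` is a hypothetical helper, and the column names are the ones used in the Dataruns section.

```python
def select_columns(ordered_columns, new_columns, include_new):
    """Return the output column order, keeping the ATM-only columns
    only when include_new is True."""
    new_columns = set(new_columns)
    return [
        column for column in ordered_columns
        if include_new or column not in new_columns
    ]


# Full ATM column order for dataruns, and the columns that Delphi lacks.
DATARUNS_ORDER = [
    'id', 'dataset_id', 'description', 'priority', 'selector',
    'k_window', 'tuner', 'gridding', 'r_minimum', 'budget_type',
    'budget', 'deadline', 'metric', 'score_target', 'start_time',
    'end_time', 'status',
]
DATARUNS_NEW = [
    'description', 'gridding', 'deadline', 'metric', 'score_target', 'status',
]

print(select_columns(DATARUNS_ORDER, DATARUNS_NEW, include_new=False))
```

Keeping one ordered list per table avoids maintaining two nearly identical lists that can drift apart.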

In [1]:
EMPTY_NEW_COLUMNS = False
READ_BUCKET = 'atm-data-store'
READ_PATH = 'gp+bandit-search/csvs/'
WRITE_BUCKET = 'atm-data-store'
WRITE_PATH = 'gp+bandit-search/csvs/new/'

In [2]:
import io
import os

import boto3
import pandas as pd

In [3]:
if READ_BUCKET:
    read_bucket = boto3.resource('s3').Bucket(READ_BUCKET)
    
if WRITE_BUCKET:
    write_bucket = boto3.resource('s3').Bucket(WRITE_BUCKET)

In [4]:
def read_csv(name, *args, **kwargs):
    """Read a CSV from READ_BUCKET if it is set, otherwise from the local READ_PATH."""
    path = os.path.join(READ_PATH, name)
    if READ_BUCKET:
        print('Downloading file {} from S3 bucket {}'.format(path, read_bucket))
        # Download the whole object into memory and parse it from there.
        body = read_bucket.Object(path).get()['Body'].read()
        with io.BytesIO(body) as buf:
            return pd.read_csv(buf, *args, **kwargs)

    else:
        return pd.read_csv(path, *args, **kwargs)

def to_csv(df, name, *args, **kwargs):
    """Write df as CSV to WRITE_BUCKET if it is set, otherwise to the local WRITE_PATH."""
    path = os.path.join(WRITE_PATH, name)
    if WRITE_BUCKET:
        print('Uploading file {} to S3 bucket {}'.format(path, write_bucket))
        # Serialize the CSV in memory and upload it as a single object.
        with io.StringIO() as buf:
            df.to_csv(buf, *args, **kwargs)
            write_bucket.Object(path).put(Body=buf.getvalue())

    else:
        df.to_csv(path, *args, **kwargs)

## Datasets

A Dataset represents a single set of data which can be used to train and test models by ATM. The table stores information about the location of the data as well as metadata to help with analysis.

* id (Int): Unique identifier for the dataset.
* name (String): Identifier string for the dataset.
* class_column (String): Name of the class label column.
* train_path (String): Location of the dataset train file.
* test_path (String): Location of the dataset test file.
* description (String): Human-readable description of the dataset.
   * not described in the paper

The metadata fields below are not described in the paper.

* n_examples (Int): Number of samples (rows) in the dataset.
* k_classes (Int): Number of classes in the dataset.
* d_features (Int): Number of features in the dataset.
* majority (Number): Ratio of the number of samples in the largest class to the number of samples in all other classes.
* size_kb (Int): Approximate size of the dataset in KB.
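
The metadata fields can be illustrated with a small, standard-library-only sketch that follows the definitions in the list above literally. `dataset_metadata` is a hypothetical helper, not part of ATM. It assumes the dataset is already loaded as a list of row dictionaries, and it leaves out size_kb, which depends on how the data is stored.

```python
from collections import Counter


def dataset_metadata(rows, class_column):
    """Compute n_examples, k_classes, d_features and majority.

    rows is a non-empty list of dicts that all share the same keys.
    """
    counts = Counter(row[class_column] for row in rows)
    n_examples = len(rows)
    largest = max(counts.values())
    others = n_examples - largest
    return {
        'n_examples': n_examples,
        'k_classes': len(counts),
        # Every column except the class label counts as a feature.
        'd_features': len(rows[0]) - 1,
        # Largest class versus all other classes, as defined above;
        # undefined (None) when there is only one class.
        'majority': largest / others if others else None,
    }


rows = [
    {'x': 1.0, 'y': 0.5, 'label': 'a'},
    {'x': 2.0, 'y': 0.1, 'label': 'a'},
    {'x': 3.0, 'y': 0.9, 'label': 'a'},
    {'x': 4.0, 'y': 0.3, 'label': 'b'},
]
print(dataset_metadata(rows, 'label'))
```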

In [5]:
datasets = read_csv('datasets.csv')

Downloading file gp+bandit-search/csvs/datasets.csv from S3 bucket s3.Bucket(name='atm-data-store')


In [6]:
datasets['id'] = datasets['dataset_id']

if EMPTY_NEW_COLUMNS:
    datasets['description'] = None
    datasets['n_examples'] = None
    datasets['k_classes'] = None
    datasets['d_features'] = None
    datasets['majority'] = None
    datasets['size_kb'] = None

In [7]:
datasets_columns = ['id', 'name', 'class_column', 'train_path', 'test_path']

if EMPTY_NEW_COLUMNS:
    datasets_columns += ['description', 'n_examples', 'k_classes', 'd_features', 'majority', 'size_kb']
    
datasets = datasets[datasets_columns]

In [8]:
datasets.head(3).T

| | 0 | 1 | 2 |
|---|---|---|---|
| id | 1 | 2 | 3 |
| name | 2dplanes_1 | AP_Endometrium_Prostate_1 | Amazon_employee_access_1 |
| class_column | 0 | 0 | 0 |
| train_path | data/processed/2dplanes_1_train.csv | data/processed/AP_Endometrium_Prostate_1_train... | data/processed/Amazon_employee_access_1_train.csv |
| test_path | data/processed/2dplanes_1_test.csv | data/processed/AP_Endometrium_Prostate_1_test.csv | data/processed/Amazon_employee_access_1_test.csv |


In [9]:
to_csv(datasets, 'datasets.csv', index=False)

Uploading file gp+bandit-search/csvs/new/datasets.csv to S3 bucket s3.Bucket(name='atm-data-store')


## Dataruns

A Datarun is a single logical job for ATM to complete. The Dataruns table contains a reference to a dataset, configuration for ATM and BTB, and state information.

* id (Int): Unique identifier for the datarun.
* dataset_id (Int): ID of the dataset associated with this datarun.
* description (String): Human-readable description of the datarun.
    * not in the paper

BTB configuration:

* selector (String): Selection technique for hyperpartitions.
  * called “hyperpartition_selection_scheme” in the paper
* k_window (Int): The number of previous classifiers the selector will consider, for selection techniques that set a limit on the number of historical runs to use.
  * called “ts” in the paper
* tuner (String): The technique that BTB will use to choose new continuous hyperparameters.
  * called “hyperparameters_tuning_scheme” in the paper
* r_minimum (Int): The number of random runs that must be performed in each hyperpartition before allowing Bayesian optimization to select parameters.
* gridding (Int): If this value is set to a positive integer, each numeric hyperparameter will be chosen from a set of that many discrete, evenly-spaced values. If set to 0 or NULL, values will be chosen from the full, continuous space of possibilities.
  * not in the paper
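
The gridding option can be pictured with the sketch below, which builds evenly spaced candidate values for one numeric hyperparameter. `grid_values` is a hypothetical helper, not a BTB function, and including both endpoints of the range in the grid is an assumption made here for illustration.

```python
def grid_values(low, high, gridding):
    """Return `gridding` evenly spaced values between low and high.

    Returns None when gridding is 0 or None, meaning that the caller
    should sample from the continuous range instead.
    ASSUMPTION: both endpoints are part of the grid.
    """
    if not gridding:
        return None
    if gridding == 1:
        return [low]
    step = (high - low) / (gridding - 1)
    return [low + i * step for i in range(gridding)]


print(grid_values(0.0, 1.0, 5))  # [0.0, 0.25, 0.5, 0.75, 1.0]
```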

ATM configuration:

* priority (Int): Run priority for the datarun. If multiple unfinished dataruns are in the ModelHub at once, workers will process higher-priority runs first.
* budget_type (Enum): One of [“learner”, “walltime”]. If this is “learner”, the budget column is the maximum number of classifiers to train; if “walltime”, it is the maximum total training time in minutes.
* budget (Int): The maximum number of classifiers to build, or the maximum amount of time to train classifiers (in minutes).
  * called “budget_amount” in the paper
* deadline (DateTime): If provided, and if budget_type is set to “walltime”, the datarun will run until this absolute time. This overrides the budget column.
  * not in the paper
* metric (String): The metric by which to score each classifier for comparison purposes. Can be one of [“accuracy”, “cohen_kappa”, “f1”, “roc_auc”, “ap”, “mcc”] for binary problems, or [“accuracy”, “rank_accuracy”, “cohen_kappa”, “f1_micro”, “f1_macro”, “roc_auc_micro”, “roc_auc_macro”] for multiclass problems.
  * not in the paper
* score_target (Enum): One of [“cv”, “test”, “mu_sigma”]. Determines how the final comparative metric (the judgment metric) is calculated.
   * “cv” (cross-validation): the judgment metric is the average of a 5-fold cross-validation test.
   * “test”: the judgment metric is computed on the test data.
   * “mu_sigma”: the judgment metric is the lower error bound on the mean CV score.
     * not in the paper
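
The three score_target options could be computed roughly as follows. This is a sketch, not ATM's code: `judgment_metric` is a hypothetical helper, and taking the lower error bound to be the mean minus two population standard deviations is an assumption, since the description above does not define the bound.

```python
import statistics


def judgment_metric(score_target, cv_scores, test_score, n_sigma=2):
    """Pick the comparative metric according to score_target.

    cv_scores is the list of per-fold cross-validation scores.
    ASSUMPTION: 'mu_sigma' means mean - n_sigma * standard deviation.
    """
    if score_target == 'cv':
        return statistics.mean(cv_scores)
    if score_target == 'test':
        return test_score
    if score_target == 'mu_sigma':
        mean = statistics.mean(cv_scores)
        return mean - n_sigma * statistics.pstdev(cv_scores)
    raise ValueError('Unknown score_target: {}'.format(score_target))


fold_scores = [0.61, 0.64, 0.66, 0.62, 0.63]
for target in ('cv', 'test', 'mu_sigma'):
    print(target, judgment_metric(target, fold_scores, test_score=0.645))
```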

State information:

* start_time (DateTime): Time the datarun began.
* end_time (DateTime): Time the datarun was completed.
* status (Enum): Indicates whether the run is pending, in progress, or has been finished. One of [“pending”, “running”, “complete”].
  * not in the paper

In [10]:
dataruns = read_csv('dataruns.csv')

Downloading file gp+bandit-search/csvs/dataruns.csv from S3 bucket s3.Bucket(name='atm-data-store')


In [11]:
dataruns['id'] = dataruns['datarun_id']
dataruns['selector'] = dataruns['hyperpartition_selection_scheme']
dataruns['k_window'] = dataruns['t_s']
dataruns['tuner'] = dataruns['hyperparameter_tuning_scheme']
dataruns['budget'] = dataruns['budget_amount']

if EMPTY_NEW_COLUMNS:
    dataruns['description'] = None
    dataruns['gridding'] = None
    dataruns['deadline'] = None
    dataruns['metric'] = None
    dataruns['score_target'] = None
    dataruns['status'] = None

In [12]:
if EMPTY_NEW_COLUMNS:
    dataruns_columns = [
        'id', 'dataset_id', 'description', 'priority', 'selector',
        'k_window', 'tuner', 'gridding', 'r_minimum', 'budget_type',
        'budget', 'deadline', 'metric', 'score_target', 'start_time',
        'end_time', 'status'
    ]
    
else:
    dataruns_columns = [
        'id', 'dataset_id', 'priority', 'selector',
        'k_window', 'tuner', 'r_minimum', 'budget_type',
        'budget', 'start_time', 'end_time'
    ]

In [13]:
dataruns = dataruns[dataruns_columns]

In [14]:
dataruns.head(3).T

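| | 0 | 1 | 2 |
|---|---|---|---|
| id | 1 | 2 | 3 |
| dataset_id | 1 | 2 | 3 |
| priority | 10 | 10 | 10 |
| selector | bestkvel | bestkvel | bestkvel |
| k_window | 2 | 2 | 2 |
| tuner | gp_ei | gp_ei | gp_ei |
| r_minimum | 2 | 2 | 2 |
| budget_type | learner | learner | learner |
| budget | 100 | 100 | 100 |
| start_time | | | |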


In [15]:
to_csv(dataruns, 'dataruns.csv', index=False)

Uploading file gp+bandit-search/csvs/new/dataruns.csv to S3 bucket s3.Bucket(name='atm-data-store')


## Hyperpartitions

A Hyperpartition is a fixed set of categorical hyperparameters which defines a space of numeric hyperparameters that can be explored by a tuner. ATM uses BTB selectors to choose among hyperpartitions during a run. Each hyperpartition instance must be associated with a single datarun; the performance of a hyperpartition in a previous datarun is assumed to have no bearing on its performance in the future.

* id (Int): Unique identifier for the hyperpartition.
* datarun_id (Int): ID of the datarun associated with this hyperpartition.
* method (String): Code for, or path to a JSON file describing, this hyperpartition’s classification method (e.g. “svm”, “knn”).
* categorical_hyperparameters_64 (Base64-encoded object): List of categorical hyperparameters whose values are fixed to define this hyperpartition.
  * called “partition_hyperparameter_values” in the paper
* tunable_hyperparameters_64 (Base64-encoded object): List of continuous hyperparameters which are free; their values must be selected by a Tuner.
  * called “conditional_hyperparameters” in the paper
* constant_hyperparameters_64 (Base64-encoded object): List of categorical or continuous parameters whose values are always fixed. These do not define the hyperpartition, but their values must be passed to the classification method to fully parameterize it.
  * not in the paper
* status (Enum): Indicates whether the hyperpartition has caused too many classifiers to error, or whether the grid for this partition has been fully explored. One of [“incomplete”, “gridding_done”, “errored”].
  * not in the paper
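
To make the idea of a hyperpartition concrete, the sketch below enumerates hyperpartitions as every combination of categorical hyperparameter values. It is an illustration only: `enumerate_hyperpartitions` is a hypothetical helper, not an ATM function, and the input values are hypothetical too, loosely based on the random forest rows in the sample output of this section.

```python
import itertools


def enumerate_hyperpartitions(categoricals):
    """Return one dict per combination of categorical values.

    categoricals maps each categorical hyperparameter name to the
    list of values it can take.
    """
    names = sorted(categoricals)
    value_lists = [categoricals[name] for name in names]
    return [
        dict(zip(names, combination))
        for combination in itertools.product(*value_lists)
    ]


partitions = enumerate_hyperpartitions({
    'n_estimators': [1000],
    'criterion': ['entropy', 'gini'],
})
for partition in partitions:
    print(partition)
```

Each resulting dictionary fixes the categorical choices; the tunable hyperparameters that remain free within it are what a tuner would then explore.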

In [16]:
hyperpartitions = read_csv('hyperpartitions.csv')

Downloading file gp+bandit-search/csvs/hyperpartitions.csv from S3 bucket s3.Bucket(name='atm-data-store')


In [17]:
hyperpartitions['id'] = hyperpartitions['hyperpartition_id']
hyperpartitions['categorical_hyperparameters_64'] = hyperpartitions['partition_hyperparameter_values']
hyperpartitions['tunable_hyperparameters_64'] = hyperpartitions['conditional_hyperparameters']

if EMPTY_NEW_COLUMNS:
    hyperpartitions['constant_hyperparameters_64'] = None
    hyperpartitions['status'] = None

In [18]:
hyperpartitions_columns = [
    'id', 'datarun_id', 'method', 'categorical_hyperparameters_64',
    'tunable_hyperparameters_64'
]

if EMPTY_NEW_COLUMNS:
    hyperpartitions_columns += ['constant_hyperparameters_64', 'status']

In [19]:
hyperpartitions = hyperpartitions[hyperpartitions_columns]

In [20]:
hyperpartitions.head(3).T

| | 0 | 1 | 2 |
|---|---|---|---|
| id | 1 | 2 | 3 |
| datarun_id | 1 | 1 | 1 |
| method | classify_rf | classify_rf | classify_dt |
| categorical_hyperparameters_64 | n_estimators:1000;criterion:entropy; | n_estimators:1000;criterion:gini; | criterion:entropy; |
| tunable_hyperparameters_64 | min_samples_leaf:(1, 2):INT:NOCAT;max_features... | min_samples_leaf:(1, 2):INT:NOCAT;max_features... | min_samples_split:(2, 4):INT:NOCAT;max_feature... |


In [21]:
to_csv(hyperpartitions, 'hyperpartitions.csv', index=False)

Uploading file gp+bandit-search/csvs/new/hyperpartitions.csv to S3 bucket s3.Bucket(name='atm-data-store')


## Classifiers

A Classifier represents a single train/test run using a method and a set of hyperparameters with a particular dataset.

* id (Int): Unique identifier for the classifier.
* datarun_id (Int): ID of the datarun associated with this classifier.
* hyperpartition_id (Int): ID of the hyperpartition associated with this classifier.
* host (String): IP address or name of the host machine where the classifier was tested.
    * not in the paper
* model_location (String): Path to the serialized model object for this classifier.
* metrics_location (String): Path to the full set of metrics computed during testing.
* cv_judgment_metric (Number): Mean of the judgment metrics from the cross-validated training data.
* cv_judgment_metric_stdev (Number): Standard deviation of the judgment metrics across the cross-validation folds.
* test_judgment_metric (Number): Judgment metric computed on the test data.
* hyperparameter_values_64 (Base64-encoded object): The full set of hyperparameter values used to create this classifier.
* start_time (DateTime): Time that a worker started working on the classifier.
* end_time (DateTime): Time that a worker finished working on the classifier.
* status (Enum): One of [“running”, “errored”, “complete”].
* error_message (String): If this classifier encountered an error, this is the Python stack trace from the caught exception.

In [22]:
classifiers = read_csv('classifiers.csv')

Downloading file gp+bandit-search/csvs/classifiers.csv from S3 bucket s3.Bucket(name='atm-data-store')


In [23]:
classifiers['id'] = classifiers['classifier_id']
classifiers['hyperparameter_values_64'] = classifiers['hyperparameter_values']

if EMPTY_NEW_COLUMNS:
    classifiers['host'] = None
    classifiers['cv_judgment_metric_stdev'] = None

In [24]:
if EMPTY_NEW_COLUMNS:
    classifiers_columns = [
        'id', 'datarun_id', 'hyperpartition_id', 'host', 'model_location',
        'metrics_location', 'hyperparameter_values_64', 'cv_judgment_metric',
        'cv_judgment_metric_stdev', 'test_judgment_metric', 'start_time',
        'end_time', 'status', 'error_message',
    ]

else:
    classifiers_columns = [
        'id', 'datarun_id', 'hyperpartition_id', 'model_location',
        'metrics_location', 'hyperparameter_values_64', 'cv_judgment_metric',
        'test_judgment_metric', 'start_time', 'end_time', 'status', 'error_message',
    ]   

In [25]:
classifiers = classifiers[classifiers_columns]

In [26]:
classifiers.head(3).T

| | 0 | 1 | 2 |
|---|---|---|---|
| id | 1 | 2 | 3 |
| datarun_id | 68 | 145 | 155 |
| hyperpartition_id | 11532 | 24896 | 26555 |
| model_location | | models/cf03ebc08c25d33ba76d5414345abfb6-bb5b76... | models/17307d3b00eb7718edf9d9617b9afe54-e3de60... |
| metrics_location | | metrics/cf03ebc08c25d33ba76d5414345abfb6-bb5b7... | metrics/17307d3b00eb7718edf9d9617b9afe54-e3de6... |
| hyperparameter_values_64 | | function:classify_mlp;_scale:True;solver:sgd;l... | function:classify_sgd;loss:modified_huber;eta0... |
| cv_judgment_metric | | 0.1 | 0.632857 |
| test_judgment_metric | | 0 | 0.645161 |
| start_time | | 2017-08-18 14:46:52 | 2017-08-18 14:46:50 |
| end_time | | 2017-08-18 14:46:55 | 2017-08-18 14:46:52 |


In [27]:
to_csv(classifiers, 'classifiers.csv', index=False)

Uploading file gp+bandit-search/csvs/new/classifiers.csv to S3 bucket s3.Bucket(name='atm-data-store')
