## AutoML 014: lagging-feature-transformation-test

Licensed under the MIT License.


In this example we demonstrate how AutoML transforms the input data by appending values from previous rows as lagging features.

Make sure you have executed the [setup](setup.ipynb) before running this notebook.

In this notebook you will see:
1. Extraction of lagging features by appending previous rows
2. Validation of generated features using a validation set
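
As a rough illustration of what "appending previous rows" means, the sketch below builds a lag column with plain pandas. The helper `add_lag_features` and the toy data are hypothetical and are not part of the AutoML SDK. Filling the first row with `0.0` is an assumption based on the expected output shown at the end of this notebook.

```python
import pandas as pd


def add_lag_features(df, column, lag_length=1, fill_value=0.0):
    """Return a copy of df with one extra column per lag step.

    Hypothetical helper for illustration only; not the AutoML implementation.
    """
    out = df.copy()
    for lag in range(1, lag_length + 1):
        # shift(lag) moves values down by `lag` rows, so row i sees row i - lag.
        # Rows with no predecessor receive fill_value.
        out[f"{column}_lag{lag}"] = out[column].shift(lag, fill_value=fill_value)
    return out


toy = pd.DataFrame({"value": [10.0, 20.0, 30.0, 40.0]})
lagged = add_lag_features(toy, "value", lag_length=1)
print(lagged)
```

Each row of `lagged` holds the current `value` followed by the value from the row before it.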

In [None]:
# AzureML imports
import azureml.core
from azureml.core import RunConfiguration
from azureml.core.workspace import Workspace
from azureml.core.experiment import Experiment
from azureml.train.automl import AutoMLConfig
from azureml.train.automl.run import AutoMLRun

# Logging, NumPy, and pandas imports
import logging
import numpy as np
import pandas as pd

In [None]:
# Get the workspace
ws = Workspace.from_config()

# choose a name for the run history container in the workspace
experiment_name = 'automl-local-lagging-features'
# project folder
project_folder = './sample_projects/automl-local-lagging-features'
# Create an experiment
experiment = Experiment(ws, experiment_name)

output = {}
output['SDK version'] = azureml.core.VERSION
output['Subscription ID'] = ws.subscription_id
output['Workspace Name'] = ws.name
output['Resource Group'] = ws.resource_group
output['Location'] = ws.location
output['Project Directory'] = project_folder
output['Experiment Name'] = experiment.name
pd.set_option('display.max_colwidth', -1)
pd.DataFrame(data = output, index = ['']).T

## Diagnostics
Opt in to diagnostics collection to help improve the experience, quality, and security of future releases.

In [None]:
from azureml.telemetry import set_diagnostics_collection
set_diagnostics_collection(send_diagnostics=True)

## Load dataset for lagging transformation
The dataset contains a numeric "value" column, which is the feature to be lagged, and a "label" column, which is the target.

In [None]:
# Columns to read from raw data
fields = ['label', 'value']

# Read the test data
dataset = pd.read_csv('featurizers_test_data.csv', usecols=fields)

# Number of samples in test data
number_of_samples_in_test_data = 10

# Output label
y = dataset['label']

# Training data
X = dataset.drop('label', axis=1)

# Dump first ten rows of X
print(X.head(10))

# Split the data into train and test
X_train = X.iloc[0:X.shape[0] - number_of_samples_in_test_data]
X_test = X.iloc[X.shape[0] - number_of_samples_in_test_data:X.shape[0]]
y_train = y.iloc[0:y.shape[0] - number_of_samples_in_test_data].values
y_test = y.iloc[y.shape[0] - number_of_samples_in_test_data:y.shape[0]].values

## Instantiate Auto ML Classifier

Instantiate an AutoML object. This creates an experiment in Azure ML. You can reuse this object to trigger multiple runs. Each run will be part of the same experiment.

|Property|Description|
|-|-|
|**primary_metric**|This is the metric that you want to optimize.<br> Auto ML Classifier supports the following primary metrics <br><i>AUC_macro</i><br><i>AUC_weighted</i><br><i>accuracy</i><br><i>weighted_accuracy</i><br><i>norm_macro_recall</i><br><i>balanced_accuracy</i><br><i>average_precision_score_weighted</i>|
|**iteration_timeout_minutes**|Time limit in minutes for each iteration|
|**iterations**|Number of iterations. In each iteration, Auto ML Classifier trains a specific pipeline on the data|
|**n_cross_validations**|Number of cross-validation splits|
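
To make the `n_cross_validations` setting concrete, here is a minimal sketch of how a dataset can be divided into 4 splits, with each split held out once for validation. This notebook does not describe how AutoML partitions the data internally, so the contiguous partitioning with `np.array_split` and the sample count of 10 are assumptions made purely for illustration.

```python
import numpy as np

n_samples = 10            # illustrative sample count (assumption)
n_cross_validations = 4   # same value used for AutoMLConfig in this notebook

indices = np.arange(n_samples)
# Partition the row indices into n_cross_validations nearly equal groups.
folds = np.array_split(indices, n_cross_validations)

for i, validation_idx in enumerate(folds):
    # Every row outside the held-out group is used for training in this split.
    train_idx = np.setdiff1d(indices, validation_idx)
    print(f"split {i}: train={train_idx.tolist()} validation={validation_idx.tolist()}")
```

Every row lands in exactly one validation group, so each sample is used for validation once across the splits.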

## Training the Model

You can call the submit method on the AutoML experiment instance and pass the run configuration. For local runs, the execution is synchronous. Depending on the data and the number of iterations, this can run for a while.
The currently running iterations are printed to the console.

The *submit* method triggers the training of the model. It can be called with the following parameters:

|**Parameter**|**Description**|
|-|-|
|**automl_config**|The AutoML configuration (an `AutoMLConfig` object)|
|**show_output**| True/False to turn on/off console output|

In [None]:
automl_config = AutoMLConfig(task = 'classification',
                             debug_log = 'automl_errors.log',
                             primary_metric = 'accuracy',
                             iteration_timeout_minutes = 60,
                             iterations = 5,
                             n_cross_validations = 4,
                             preprocess=True,
                             lag_length=1,
                             verbosity = logging.INFO,
                             X = X_train, 
                             y = y_train,
                             path=project_folder)

In [None]:
local_run = experiment.submit(automl_config, show_output=True)

## Get the best fitted model

In [None]:
# Get the parent run
parent_run = AutoMLRun(experiment=experiment, run_id=local_run.run_id)

# Find the best fitted model for the given run
best_run, fitted_model = parent_run.get_output(metric='accuracy')

## Transforming the test data using the data transformer
Given the best fitted model, transform the test data.
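
The check in this section compares the transformer output with a hand-written table. A tolerance-based comparison is one possible alternative to exact floating-point equality. The sketch below uses the first three rows of the expected table from this notebook and a pandas `shift` as a stand-in for the fitted transformer; the stand-in and the tolerance value are assumptions for illustration, not the AutoML implementation.

```python
import numpy as np
import pandas as pd

# First three rows of the expected output listed in this notebook:
# current value followed by the previous row's value.
expected = np.array([[394697.0, 0.0],
                     [514597.0, 394697.0],
                     [973105.0, 514597.0]])

# Stand-in for the transformer output (assumption): lag the raw values by one
# row with pandas and fill the first row with 0.0.
raw = pd.DataFrame({"value": [394697.0, 514597.0, 973105.0]})
transformed = raw.assign(lag1=raw["value"].shift(1, fill_value=0.0)).values

# Raises AssertionError with a readable diff if any element differs by more than atol.
np.testing.assert_allclose(transformed, expected, rtol=0, atol=1e-6)
```

Unlike a bare `assert` on `.all()`, `np.testing.assert_allclose` reports which elements differ when the comparison fails.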

In [None]:
# Transform the test data using the lagging transformer step of the fitted pipeline
x_test_transform = pd.DataFrame(fitted_model.named_steps.laggingtransformer.transform(X_test))

# Dump the transformed data
print(x_test_transform)

# Test the transformed data against the expected transformed data frame
expected_x_test_transform = [[394697.0,  0.0],
                             [514597.0,  394697.0],
                             [973105.0,  514597.0],
                             [534876.0,  973105.0],
                             [748394.0,  534876.0],
                             [544768.0,  748394.0],
                             [552064.0,  544768.0],
                             [472667.0,  552064.0],
                             [303836.0,  472667.0],
                             [475814.0,  303836.0]]
assert((x_test_transform.values == pd.DataFrame(data=expected_x_test_transform).values).all())