Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

# AutoML 13: Prepare Data using `azureml.dataprep`
This example shows how to use the `azureml.dataprep` SDK to load and prepare data for AutoML. `azureml.dataprep` can also be used standalone; full documentation is available [here](https://github.com/Microsoft/PendletonDocs).

Make sure you have executed the [setup](00.configuration.ipynb) before running this notebook.

In this notebook you will see:
1. Defining data loading and preparation steps in a `Dataflow` using `azureml.dataprep`
2. Passing the `Dataflow` to AutoML for a local run
3. Passing the `Dataflow` to AutoML for a remote run

## Install `azureml.dataprep` SDK

Please restart your kernel after running the installs below.

Tornado must be downgraded to a pre-5 version due to a known Tornado x Jupyter event loop bug.

In [None]:
!pip install azureml-dataprep
!pip install tornado==4.5.1
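
To confirm which Tornado release ended up in the environment after the install, a check along these lines may help. It is a minimal sketch using only the standard library (`importlib.metadata`, Python 3.8+); the helper names `major_version` and `installed_version` are hypothetical and not part of any Azure ML package.

```python
from importlib import metadata


def major_version(version: str) -> int:
    """Return the leading integer of a dotted version string such as '4.5.1'."""
    return int(version.split(".")[0])


def installed_version(package: str):
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


tornado_version = installed_version("tornado")
if tornado_version is None:
    print("tornado is not installed in this environment")
elif major_version(tornado_version) >= 5:
    print(f"tornado {tornado_version} found; this notebook expects a pre-5 release")
else:
    print(f"tornado {tornado_version} is a pre-5 release")
```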

## Diagnostics

Opt in to diagnostics collection to help improve the experience, quality, and security of future releases.

In [None]:
from azureml.telemetry import set_diagnostics_collection
set_diagnostics_collection(send_diagnostics = True)

## Create Experiment

As part of the setup you have already created a <b>Workspace</b>. For AutoML you also need to create an <b>Experiment</b>. An <b>Experiment</b> is a named object in a <b>Workspace</b> under which runs are submitted and tracked.

In [None]:
import logging
import os

import pandas as pd

import azureml.core
from azureml.core.compute import DsvmCompute
from azureml.core.experiment import Experiment
from azureml.core.runconfig import CondaDependencies
from azureml.core.runconfig import RunConfiguration
from azureml.core.workspace import Workspace
import azureml.dataprep as dprep
from azureml.train.automl import AutoMLConfig

In [None]:
ws = Workspace.from_config()
 
# choose a name for experiment
experiment_name = 'automl-dataprep-classification'
# project folder
project_folder = './sample_projects/automl-dataprep-classification'
 
experiment = Experiment(ws, experiment_name)
 
output = {}
output['SDK version'] = azureml.core.VERSION
output['Subscription ID'] = ws.subscription_id
output['Workspace Name'] = ws.name
output['Resource Group'] = ws.resource_group
output['Location'] = ws.location
output['Project Directory'] = project_folder
output['Experiment Name'] = experiment.name
pd.set_option('display.max_colwidth', -1)
pd.DataFrame(data = output, index = ['']).T

## Loading Data using DataPrep

In [None]:
# You can use `smart_read_file`, which automatically infers the delimiter and data types of a file.
# data pulled from sklearn.datasets.load_digits()
simple_example_data_root = 'https://dprepdata.blob.core.windows.net/automl-notebook-data/'
X = dprep.smart_read_file(simple_example_data_root + 'X.csv').skip(1)  # remove header

# You can also use `read_csv` (with an overridable delimiter) and the `to_*` transformations
# to read a file and convert column types manually.
# Here we read a comma-delimited file and convert all columns to integers.
y = dprep.read_csv(simple_example_data_root + 'y.csv').to_long(dprep.ColumnSelector(term='.*', use_regex = True))
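
To make the column conversion above more concrete, here is a rough standard-library analogy of selecting columns by regular expression and converting them to integers. It does not use or reproduce `azureml.dataprep`; the in-memory CSV, the helper `columns_to_int`, and the choice of full-match semantics are assumptions made for illustration only.

```python
import csv
import io
import re

# Hypothetical in-memory CSV standing in for a small file.
raw = "label,score\n3,10\n7,25\n"


def columns_to_int(rows, pattern):
    """Convert every column whose name fully matches `pattern` to int."""
    selector = re.compile(pattern)
    converted = []
    for row in rows:
        converted.append({
            name: int(value) if selector.fullmatch(name) else value
            for name, value in row.items()
        })
    return converted


rows = list(csv.DictReader(io.StringIO(raw)))
all_ints = columns_to_int(rows, ".*")        # pattern selects every column
only_label = columns_to_int(rows, "label")   # pattern selects a single column
print(all_ints)
print(only_label)
```

With the pattern `.*` every column is selected, which mirrors the intent of the `ColumnSelector(term='.*', use_regex = True)` call in the cell above.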

## Review the Data Preparation Result

You can peek at any range of a Dataflow's result using `skip(i)` and `head(j)`. Doing so evaluates only `j` records through all the steps in the Dataflow, which keeps it fast even on large datasets.

In [None]:
X.skip(1).head(5)
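
The benefit of peeking at a small range can be illustrated with plain Python generators. This is only an analogy for lazy evaluation, not a description of how `azureml.dataprep` is implemented; `source`, `peek`, and the `evaluated` list are hypothetical names. Note that this sketch still has to read the skipped records, so it touches `skip + head` rows rather than `head` rows.

```python
from itertools import islice

evaluated = []  # records which source rows were actually produced


def source(n):
    """Yield n hypothetical records lazily, noting each one that is produced."""
    for i in range(n):
        evaluated.append(i)
        yield {"row": i}


def peek(records, skip, head):
    """Return `head` records after skipping `skip`, without reading the rest."""
    return list(islice(records, skip, skip + head))


sample = peek(source(1_000_000), skip=1, head=5)
print(sample)
print(len(evaluated))  # far fewer than the one million available rows
```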

## Instantiate AutoML Settings

This creates a common set of AutoML settings that applies to both local and remote runs.

In [None]:
automl_settings = {
    "max_time_sec": 600,
    "iterations": 2,
    "primary_metric": 'AUC_weighted',
    "preprocess": False,
    "verbosity": logging.INFO,
    "n_cross_validations" : 3
}
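
The settings dictionary is reused later through keyword-argument unpacking (`**automl_settings`). The sketch below shows that mechanism in isolation; `make_config` is a hypothetical stand-in and not the `AutoMLConfig` constructor, and the `path` value is made up for the example.

```python
import logging

# Shared settings, loosely mirroring the dictionary used in this notebook.
shared_settings = {
    "iterations": 2,
    "primary_metric": "AUC_weighted",
    "verbosity": logging.INFO,
}


def make_config(task, **settings):
    """Hypothetical constructor: collects its keyword arguments into a dict."""
    return {"task": task, **settings}


# The same dictionary feeds two differently specialised configurations.
local_config = make_config("classification", **shared_settings)
remote_config = make_config("classification", path="./project", **shared_settings)
print(local_config)
print(remote_config)
```

Because unpacking copies the entries into the call, the shared dictionary itself is not modified by either call.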

## Local Run

### Pass data with Dataflows

The `Dataflow` objects captured above can be passed to `AutoMLConfig` as `X` and `y`, and the resulting configuration can then be submitted for a local run. AutoML will retrieve the results from the `Dataflow` for model training.

In [None]:
automl_config = AutoMLConfig(task = 'classification',
                             debug_log = 'automl_errors.log',
                             X = X,
                             y = y,
                             **automl_settings)

In [None]:
local_run = experiment.submit(automl_config, show_output=True)

## Remote Run
*Note: This feature might not work properly in your workspace region before the October update. You may jump to the "Exploring the results" section below to explore other features AutoML and DataPrep have to offer.*

### Create or Attach a Remote Linux DSVM

In [None]:
dsvm_name = 'mydsvm'
try:
    dsvm_compute = DsvmCompute(ws, dsvm_name)
    print('found existing dsvm.')
except Exception:
    print('creating new dsvm.')
    dsvm_config = DsvmCompute.provisioning_configuration(vm_size = "Standard_D2_v2")
    dsvm_compute = DsvmCompute.create(ws, name = dsvm_name, provisioning_configuration = dsvm_config)
    dsvm_compute.wait_for_completion(show_output = True)

### Update Conda Dependency file to have AutoML and DataPrep SDK

Currently the AutoML and DataPrep SDKs are not installed with the Azure ML SDK by default, so we update the conda dependency file to add them.

In [None]:
cd = CondaDependencies()
cd.add_pip_package(pip_package='azureml-dataprep')
cd.add_pip_package(pip_package='tornado==4.5.1')

### Create a RunConfiguration with DSVM name

In [None]:
run_config = RunConfiguration(conda_dependencies=cd)
run_config.target = dsvm_compute
run_config.auto_prepare_environment = True

### Pass data with Dataflows

The `Dataflow` objects captured above can also be passed to `AutoMLConfig`, together with a run configuration, and submitted for a remote run. AutoML will serialize the `Dataflow` and send it to the remote compute target. The `Dataflow` will not be evaluated locally.

In [None]:
automl_config = AutoMLConfig(task = 'classification',
                             debug_log = 'automl_errors.log',
                             path = project_folder,
                             run_configuration = run_config,
                             X = X,
                             y = y,
                             **automl_settings)
# Please uncomment the line below to try out remote run with dataprep. 
# This feature might not work properly in your workspace region before the October update.
# remote_run = experiment.submit(automl_config, show_output = True)
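
The idea of shipping a description of the steps rather than the evaluated data can be sketched with JSON. This is a conceptual analogy under stated assumptions, not the serialization format AutoML or `azureml.dataprep` actually uses; the step dictionaries and the `evaluate` helper are hypothetical.

```python
import json

evaluations = 0  # counts how many times the steps are actually executed


def evaluate(steps, rows):
    """Apply each declarative step to `rows`; only `skip` is implemented here."""
    global evaluations
    evaluations += 1
    for step in steps:
        if step["op"] == "skip":
            rows = rows[step["count"]:]
    return rows


steps = [{"op": "skip", "count": 1}]
payload = json.dumps(steps)   # text that would travel to the remote target
print(evaluations)            # still 0: serializing does not run the steps

# On the (hypothetical) remote side the steps are rebuilt and only then executed.
restored = json.loads(payload)
result = evaluate(restored, ["header", "a", "b"])
print(result)
```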

## Exploring the results

#### Widget for monitoring runs

The widget will sit on "loading" until the first iteration completes; then an auto-updating graph and table will appear. It refreshes once per minute, so you should see the graph update as child runs complete.

NOTE: The widget displays a link at the bottom. This links to a web-ui to explore the individual run details.

In [None]:
from azureml.train.widgets import RunDetails
RunDetails(local_run).show() 

#### Retrieve all child runs
You can also use SDK methods to fetch all the child runs and see individual metrics that we log.

In [None]:
children = list(local_run.get_children())
metricslist = {}
for run in children:
    properties = run.get_properties()
    metrics = {k: v for k, v in run.get_metrics().items() if isinstance(v, float)}
    metricslist[int(properties['iteration'])] = metrics
    
# `pd` was imported in an earlier cell; sort the columns (iterations) in ascending order
rundata = pd.DataFrame(metricslist).sort_index(axis=1)
rundata
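
The filtering and keying done in the cell above can be tried without a workspace by substituting plain dictionaries for the run objects. The records below are hypothetical; real child runs expose these values through `get_properties()` and `get_metrics()` as shown above.

```python
# Hypothetical child-run records, deliberately listed out of order.
child_runs = [
    {
        "properties": {"iteration": "1"},
        "metrics": {"AUC_weighted": 0.97, "log_loss": 0.21, "note": "ok"},
    },
    {
        "properties": {"iteration": "0"},
        "metrics": {"AUC_weighted": 0.95, "log_loss": 0.30, "n": 3},
    },
]

metrics_by_iteration = {}
for run in child_runs:
    # Keep only float-valued metrics, as the notebook cell does.
    scalar_metrics = {k: v for k, v in run["metrics"].items() if isinstance(v, float)}
    metrics_by_iteration[int(run["properties"]["iteration"])] = scalar_metrics

for iteration in sorted(metrics_by_iteration):
    print(iteration, metrics_by_iteration[iteration])
```

Non-float entries (the string `note` and the integer `n`) are dropped by the `isinstance(v, float)` test, which is why only scalar float metrics end up in the table.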

### Retrieve the Best Model

Below we select the best pipeline from our iterations. The *get_output* method on `local_run` returns the best run and the fitted model for the last invocation. There are overloads of *get_output* that allow you to retrieve the best run and fitted model for *any* logged metric or for a particular *iteration*.

In [None]:
best_run, fitted_model = local_run.get_output()
print(best_run)
print(fitted_model)

#### Best Model based on any other metric
Retrieve the run and the model that have the smallest `log_loss`:

In [None]:
lookup_metric = "log_loss"
best_run, fitted_model = local_run.get_output(metric = lookup_metric)
print(best_run)
print(fitted_model)
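
Different metrics can favour different iterations, because some metrics are better when larger and others when smaller. The sketch below illustrates that with made-up numbers; it is not how `get_output` is implemented, and `LOWER_IS_BETTER` and `best_iteration` are hypothetical names.

```python
# Hypothetical per-iteration metrics.
metrics_by_iteration = {
    0: {"AUC_weighted": 0.95, "log_loss": 0.30},
    1: {"AUC_weighted": 0.97, "log_loss": 0.35},
}

# Assumption for this sketch: metrics named here are minimized, all others maximized.
LOWER_IS_BETTER = {"log_loss"}


def best_iteration(metrics, name):
    """Return the iteration whose value for metric `name` is best."""
    pick = min if name in LOWER_IS_BETTER else max
    return pick(metrics, key=lambda iteration: metrics[iteration][name])


print(best_iteration(metrics_by_iteration, "AUC_weighted"))
print(best_iteration(metrics_by_iteration, "log_loss"))
```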

#### Best Model based on any iteration
Retrieve the run and the model from the first iteration (iterations are zero-indexed):

In [None]:
iteration = 0
best_run, fitted_model = local_run.get_output(iteration = iteration)
print(best_run)
print(fitted_model)

### Testing the Fitted Model 

#### Load Test Data

In [None]:
from sklearn import datasets

digits = datasets.load_digits()
X_digits = digits.data[:10, :]
y_digits = digits.target[:10]
images = digits.images[:10]

#### Testing our best pipeline
We will predict two randomly chosen digits and compare each prediction with its label.

In [None]:
# Randomly select two digits and compare the prediction with the true label
from matplotlib import pyplot as plt
import numpy as np

for index in np.random.choice(len(y_digits), 2):
    print(index)
    predicted = fitted_model.predict(X_digits[index:index + 1])[0]
    label = y_digits[index]
    title = "Label value = %d  Predicted value = %d " % (label, predicted)
    fig = plt.figure(1, figsize=(3,3))
    ax1 = fig.add_axes((0,0,.8,.8))
    ax1.set_title(title)
    plt.imshow(images[index], cmap=plt.cm.gray_r, interpolation='nearest')
    plt.show()
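
Inspecting two digits gives only an impression of model quality. A simple complement is to compute accuracy over all held-out samples. The helper below is a minimal sketch with hypothetical values; in this notebook the inputs would be `list(y_digits)` and `list(fitted_model.predict(X_digits))`.

```python
def accuracy(labels, predictions):
    """Fraction of positions where the prediction equals the label."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    if len(labels) == 0:
        raise ValueError("at least one sample is required")
    matches = sum(1 for y, p in zip(labels, predictions) if y == p)
    return matches / len(labels)


# Hypothetical labels and predictions: four of the five agree.
labels = [0, 1, 2, 3, 4]
predictions = [0, 1, 2, 8, 4]
print(accuracy(labels, predictions))
```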

## Appendix

### Capture the Dataflows to use for AutoML later

`Dataflow` objects are immutable. Each of them is composed of a list of data preparation steps. A `Dataflow` can be branched at any point for further usage.

In [None]:
# sklearn.digits.data + target
digits_complete = dprep.smart_read_file('https://dprepdata.blob.core.windows.net/automl-notebook-data/digits-complete.csv')

`digits_complete` (sourced from `sklearn.datasets.load_digits()`) is forked into `dflow_X` to capture all the feature columns and `dflow_y` to capture the label column.

In [None]:
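# Print the shape explicitly; a bare expression is only displayed when it is the last line of a cell
print(digits_complete.to_pandas_dataframe().shape)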
labels_column = 'Column64'
dflow_X = digits_complete.drop_columns(columns = [labels_column])
dflow_y = digits_complete.keep_columns(columns = [labels_column])
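
The immutability and branching described in this appendix can be illustrated with a small stand-alone class. This is a conceptual sketch only and does not reflect the internals of `azureml.dataprep`; `Pipeline`, `drop`, and `keep` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class Pipeline:
    """Hypothetical immutable pipeline: a tuple of row-list transformations."""
    steps: Tuple[Callable, ...] = ()

    def add(self, step):
        # Return a new object; the original pipeline is left untouched.
        return Pipeline(self.steps + (step,))

    def run(self, rows):
        for step in self.steps:
            rows = step(rows)
        return rows


def drop(column):
    """Build a step that removes `column` from every row."""
    return lambda rows: [{k: v for k, v in r.items() if k != column} for r in rows]


def keep(column):
    """Build a step that keeps only `column` in every row."""
    return lambda rows: [{column: r[column]} for r in rows]


base = Pipeline()
features = base.add(drop("label"))   # branch 1: feature columns
labels = base.add(keep("label"))     # branch 2: label column

rows = [{"pixel": 0, "label": 3}, {"pixel": 16, "label": 7}]
print(features.run(rows))
print(labels.run(rows))
```

Because `add` returns a new object, both branches share the same starting point without affecting each other, which is the property that makes forking a pipeline safe.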