<a href="https://colab.research.google.com/github/JayThibs/hyperdrive-vs-automl-plus-deployment/blob/main/automl.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Automated ML

Note: For data exploration, go to hyperparameter_tuning.ipynb

# Import Dependencies

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

import azureml.core
from azureml.core.experiment import Experiment
from azureml.core.workspace import Workspace
from azureml.train.automl import AutoMLConfig
from azureml.core.dataset import Dataset

from azureml.pipeline.steps import AutoMLStep

# Check core SDK version number
print("SDK version:", azureml.core.VERSION)

SDK version: 1.26.0


In [2]:
import numpy as np
import pandas as pd

def bools(df):
    """
    public_meeting: we will fill the nulls as 'False'
    permit: we will fill the nulls as 'False'
    """
    z = ['public_meeting', 'permit']
    for i in z:
        df[i].fillna(False, inplace = True)
        df[i] = df[i].apply(lambda x: float(x))
    return df

def locs(df, trans = ['longitude', 'latitude', 'gps_height', 'population']):
    """
    Fill in the nulls for ['longitude', 'latitude', 'gps_height', 'population'] using medians from
    ['district_code', 'basin'], and lastly the overall median. Values of 0 and 1 are treated as missing.
    """
    df.loc[df.longitude == 0, 'latitude'] = 0
    for z in trans:
        df[z].replace(0., np.NaN, inplace = True)
        df[z].replace(1., np.NaN, inplace = True)
        
        for j in ['district_code', 'basin']:
        
            df['median'] = df.groupby([j])[z].transform('median')
            df[z] = df[z].fillna(df['median'])
        
        df[z] = df[z].fillna(df[z].median())
        del df['median']
    return df

def construction(df):
    """
    Many construction_year values are missing and stored as a placeholder.
    For modeling purposes this is actually fine, but it distorts visualizations that
    compare results across years, so we move the placeholder closer to the real values.
    The lowest "good" value is 1960, so we set every year below 1950 to 1910,
    i.e. 50 years earlier than that.
    """
    df.loc[df['construction_year'] < 1950, 'construction_year'] = 1910
    return df

# Alright, now let's drop a few columns
# Needed to drop quite a few categorical columns so that the data would fit in memory in Azure
# Tested the model before and after (from 6388 columns to 278) in Colab and only had a ~0.03% reduction in performance

def removal(df):
  # id: we drop the id column because it is not a useful predictor.
  # amount_tsh: is mostly blank - delete
  # wpt_name: not useful, delete (too many values)
  # subvillage: too many values, delete
  # scheme_name: this is almost 50% nulls, so we will delete this column
  # num_private: we will delete this column because ~99% of the values are zeros.
  features_to_drop = ['id','amount_tsh',  'num_private', 
          'quantity', 'quality_group', 'source_type', 'payment', 
          'waterpoint_type_group', 'extraction_type_group', 'wpt_name', 
          'subvillage', 'scheme_name', 'funder', 'installer', 'recorded_by',
          'ward']
  df = df.drop(features_to_drop, axis=1)

  return df

def dummy(df):
    dummy_cols = ['basin', 'lga', 'public_meeting',
       'scheme_management', 'permit', 'extraction_type',
       'extraction_type_class', 'management', 'management_group',
       'payment_type', 'water_quality', 'quantity_group', 'source',
       'source_class', 'waterpoint_type', 'region']

    df = pd.get_dummies(df, columns=dummy_cols)

    return df

def dates(df):
    """
    date_recorded: this might be a useful variable for this analysis, although the year itself would be useless in a practical scenario moving into the future. We convert this column into a datetime and create 'year_recorded' and 'month_recorded' columns in case those levels prove to be useful. A visual inspection of both casts significant doubt on that possibility, but we'll proceed for now. Finally, we replace date_recorded itself with its ordinal day number, since random forest cannot accept datetime values.
    """
    df['date_recorded'] = pd.to_datetime(df['date_recorded'])
    df['year_recorded'] = df['date_recorded'].apply(lambda x: x.year)
    df['month_recorded'] = df['date_recorded'].apply(lambda x: x.month)
    df['date_recorded'] = (pd.to_datetime(df['date_recorded'])).apply(lambda x: x.toordinal())
    return df

def dates2(df):
    """
    Turn year_recorded and month_recorded into dummy variables
    """
    for z in ['month_recorded', 'year_recorded']:
        df[z] = df[z].apply(lambda x: str(x))
        good_cols = [z+'_'+i for i in df[z].unique()]
        df = pd.concat((df, pd.get_dummies(df[z], prefix = z)[good_cols]), axis = 1)
        del df[z]
    return df

def small_n(df):
    "Collapsing small categorical value counts into 'other'"
    cols = [i for i in df.columns if type(df[i].iloc[0]) == str]
    df[cols] = df[cols].where(df[cols].apply(lambda x: x.map(x.value_counts())) > 100, "other")
    return df
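
The group-median imputation used in `locs` can be sketched without pandas. This is a minimal illustration under stated assumptions, not the notebook's implementation: the helper name `fill_with_group_medians` and the toy rows are hypothetical, and missing values are represented as `None`.

```python
from statistics import median

def fill_with_group_medians(rows, column, group_keys):
    """Fill None in `column` with per-group medians, then the overall median."""
    for key in group_keys:
        # Collect the known values of `column` for each group under this key
        groups = {}
        for row in rows:
            if row[column] is not None:
                groups.setdefault(row[key], []).append(row[column])
        medians = {group: median(values) for group, values in groups.items()}
        # Fill only the rows whose group has at least one known value
        for row in rows:
            if row[column] is None and row[key] in medians:
                row[column] = medians[row[key]]
    # Last resort: the overall median of everything known so far
    known = [row[column] for row in rows if row[column] is not None]
    if known:
        overall = median(known)
        for row in rows:
            if row[column] is None:
                row[column] = overall
    return rows

# Hypothetical toy rows, not taken from the dataset
rows = [
    {"district_code": 1, "basin": "A", "gps_height": 100.0},
    {"district_code": 1, "basin": "B", "gps_height": 300.0},
    {"district_code": 1, "basin": "C", "gps_height": None},  # filled from district 1
    {"district_code": 2, "basin": "B", "gps_height": None},  # filled from basin B
    {"district_code": 3, "basin": "D", "gps_height": None},  # filled from overall median
]
filled = fill_with_group_medians(rows, "gps_height", ["district_code", "basin"])
heights = [row["gps_height"] for row in filled]
print(heights)
```

Going from the finer grouping to the coarser one means a row only falls back to the overall median when neither its district nor its basin has a usable value.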

## Dataset

### Overview

We'll be using the Pump it Up dataset from the DrivenData competition.

The description of the problem: 

> Using data from Taarifa and the Tanzanian Ministry of Water, can you predict which pumps are functional, which need some repairs, and which don't work at all? This is an intermediate-level practice competition. Predict one of these three classes based on a number of variables about what kind of pump is operating, when it was installed, and how it is managed. A smart understanding of which waterpoints will fail can improve maintenance operations and ensure that clean, potable water is available to communities across Tanzania.

In other words, our goal is to predict which water pumps are non-functioning or functioning, but in need of repair.

In this project, we will use AutoML to train multiple models and choose the best-performing one for deployment.

In [3]:
# We loaded the dataset into Azure and we are grabbing it here.

from azureml.core import Workspace, Experiment, Dataset

# download config file in azure and put it in the current Notebooks folder
ws = Workspace.from_config()
exp = Experiment(workspace=ws, name="Pump-it-Up-Data-Mining-the-Water-Table")

print('Workspace name: ' + ws.name, 
      'Azure region: ' + ws.location, 
      'Subscription id: ' + ws.subscription_id, 
      'Resource group: ' + ws.resource_group, sep = '\n')

run = exp.start_logging()

# Reuse the workspace attached to the run's experiment
ws = run.experiment.workspace

key = 'Pump-it-Up-dataset'

if key in ws.datasets.keys():
      dataset = ws.datasets[key]
      print('dataset found!')

else:
      url = 'https://raw.githubusercontent.com/JayThibs/hyperdrive-vs-automl-plus-deployment/main/Pump-it-Up-dataset.csv'
      dataset = Dataset.Tabular.from_delimited_files(url)
      dataset = dataset.register(ws, key)

dataset.to_pandas_dataframe()

Performing interactive authentication. Please follow the instructions on the terminal.
To sign in, use a web browser to open the page https://microsoft.com/devicelogin and enter the code FYGDE3GYU to authenticate.
You have logged in. Now let us find all the subscriptions to which you have access...
Interactive authentication successfully completed.
Workspace name: quick-starts-ws-143063
Azure region: southcentralus
Subscription id: a24a24d5-8d87-4c8a-99b6-91ed2d2df51f
Resource group: aml-quickstarts-143063


Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,...,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group,status_group
0,69572,6000.0,2011-03-14,Roman,1390,Roman,34.938093,-9.856322,none,0,...,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe,functional
1,8776,0.0,2013-03-06,Grumeti,1399,GRUMETI,34.698766,-2.147466,Zahanati,0,...,soft,good,insufficient,insufficient,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe,functional
2,34310,25.0,2013-02-25,Lottery Club,686,World vision,37.460664,-3.821329,Kwa Mahundi,0,...,soft,good,enough,enough,dam,dam,surface,communal standpipe multiple,communal standpipe,functional
3,67743,0.0,2013-01-28,Unicef,263,UNICEF,38.486161,-11.155298,Zahanati Ya Nanyumbu,0,...,soft,good,dry,dry,machine dbh,borehole,groundwater,communal standpipe multiple,communal standpipe,non functional
4,19728,0.0,2011-07-13,Action In A,0,Artisan,31.130847,-1.825359,Shuleni,0,...,soft,good,seasonal,seasonal,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe,functional
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
59395,60739,10.0,2013-05-03,Germany Republi,1210,CES,37.169807,-3.253847,Area Three Namba 27,0,...,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe,functional
59396,27263,4700.0,2011-05-07,Cefa-njombe,1212,Cefa,35.249991,-9.070629,Kwa Yahona Kuvala,0,...,soft,good,enough,enough,river,river/lake,surface,communal standpipe,communal standpipe,functional
59397,37057,0.0,2011-04-11,,0,,34.017087,-8.750434,Mashine,0,...,fluoride,fluoride,enough,enough,machine dbh,borehole,groundwater,hand pump,hand pump,functional
59398,31282,0.0,2011-03-08,Malec,0,Musa,35.861315,-6.378573,Mshoro,0,...,soft,good,insufficient,insufficient,shallow well,shallow well,groundwater,hand pump,hand pump,functional


In [4]:
X = dataset.to_pandas_dataframe()
y = X[['status_group']]
del X['status_group']

# Cleaning up the features of our dataset
X = bools(X)
X = locs(X)
X = construction(X)
X = removal(X)
X = dummy(X)
X = dates(X)
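# NOTE: the result is bound to lowercase `x` and never used, so `X` keeps year_recorded and
# month_recorded as plain columns (as the training-data preview shows). Use `X = dates2(X)` to apply the dummies.
x = dates2(X)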
X = small_n(X)

# Replacing "<", "[" and "]" in the headers with "_" to make the data compatible with different algorithms (namely, xgboost)
import re
regex = re.compile(r"\[|\]|<")
X.columns = [regex.sub("_", col) if any(x in str(col) for x in set(('[', ']', '<'))) else col for col in X.columns.values]

# Converting the population values to log
X['population'] = np.log(X['population'])

# Splitting the dataset into a training and test set
# Test set will be used later
# Same random seed (42) as the one used for the HyperDrive model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Concatenating the features and labels together to feed to our AutoML model
clean_train_df = pd.concat([X_train, y_train], axis=1)
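
The header clean-up step can be tried in isolation. The sketch below uses only the standard library; the helper name `sanitize_columns` and the sample column names are hypothetical and serve only to show which characters get replaced.

```python
import re

# Characters that XGBoost rejects in feature names: '[', ']' and '<'
BAD_CHARS = re.compile(r"[\[\]<]")

def sanitize_columns(columns):
    """Replace '[', ']' and '<' with '_' in every column name."""
    return [BAD_CHARS.sub("_", str(col)) for col in columns]

# Hypothetical column names for illustration
cols = ["population", "age_[0,10]", "height_<500", "basin_Lake Nyasa"]
clean = sanitize_columns(cols)
print(clean)
```

Names without any of the three characters pass through unchanged, so the function is safe to apply to every header.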

In [5]:
from azureml.data.dataset_factory import TabularDatasetFactory

# Registering the test dataset for future inference

# Get the default datastore to be entered as a parameter in tabular dataset creation
datastore = ws.get_default_datastore()

# Change pandas dataframe into a tabular dataset to be used in automl
testing_data = TabularDatasetFactory.register_pandas_dataframe(X_test, datastore, 'automl_data_test')

Method register_pandas_dataframe: This is an experimental method, and may change at any time. For more information, see https://aka.ms/azuremlexperimental.


Validating arguments.
Arguments validated.
Successfully obtained datastore reference and path.
Uploading file to managed-dataset/9deec5e1-0206-4b5a-b5a7-c86f1183841d/
Successfully uploaded file to datastore.
Creating and registering a new dataset.
Successfully created and registered a new dataset.


In [6]:
from azureml.data.dataset_factory import TabularDatasetFactory

# Get the default datastore to be entered as a parameter in tabular dataset creation
datastore = ws.get_default_datastore()

# Change pandas dataframe into a tabular dataset to be used in automl
training_data = TabularDatasetFactory.register_pandas_dataframe(clean_train_df, datastore, 'automl_data')

Method register_pandas_dataframe: This is an experimental method, and may change at any time. For more information, see https://aka.ms/azuremlexperimental.


Validating arguments.
Arguments validated.
Successfully obtained datastore reference and path.
Uploading file to managed-dataset/629c085b-cfe8-4675-92b8-60dd7e510bba/
Successfully uploaded file to datastore.
Creating and registering a new dataset.
Successfully created and registered a new dataset.


In [7]:
training_data.take(3).to_pandas_dataframe()

Unnamed: 0,date_recorded,gps_height,longitude,latitude,region_code,district_code,population,construction_year,basin_Internal,basin_Lake Nyasa,...,region_Pwani,region_Rukwa,region_Ruvuma,region_Shinyanga,region_Singida,region_Tabora,region_Tanga,year_recorded,month_recorded,status_group
0,734926,2092.0,35.42602,-4.227446,21,1,5.075174,1998,1,0,...,0,0,0,0,0,0,0,2013,2,functional
1,734213,550.0,35.510074,-5.724555,1,6,5.298317,1910,1,0,...,0,0,0,0,0,0,0,2011,3,functional
2,734328,550.0,32.499866,-9.081222,12,6,5.298317,1910,0,0,...,0,0,0,0,0,0,0,2011,7,non functional


# Setting up Experiment

We'll create a new experiment for our deployment of an AutoML model and create a project folder to hold the training scripts.

In [8]:
experiment_name = 'automl-pump-it-up-operationalize'
project_folder = './automl-pipeline-project'

automl_experiment = Experiment(ws, experiment_name)
automl_experiment

Name,Workspace,Report Page,Docs Page
automl-pump-it-up-operationalize,quick-starts-ws-143063,Link to Azure Machine Learning studio,Link to Documentation


In [9]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

# Creating a compute cluster if there isn't one that is already created.

cpu_cluster_name = 'hypr-auto-clustr'

try:
    cpu_cluster = ComputeTarget(workspace=ws, name=cpu_cluster_name)
    print('Found existing compute target.')
except ComputeTargetException:
    print('Creating a new computer target...')
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_v2',
                                                          max_nodes=4)
    cpu_cluster = ComputeTarget.create(ws, cpu_cluster_name, compute_config)
    
cpu_cluster.wait_for_completion(show_output=True)

Creating a new computer target...
Creating....
SucceededProvisioning operation finished, operation "Succeeded"
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned


# AutoML Configuration

Here we create the general AutoML settings object.

We use recall to measure how well we catch true positives. In a real scenario we want a model that does not miss non-functioning water pumps, and we care much less about functioning pumps that are incorrectly predicted as non-functional. Optimizing for recall keeps the number of missed true positives low.

In [10]:
from azureml.train.automl import AutoMLConfig

# Note: We are using `norm_macro_recall` for the primary metric here, but that is not the metric we actually want
# our model to perform the best on. As described in the readme, we want `recall_score_micro`. However,
# we cannot use `recall_score_micro` as our primary metric because AutoML currently only allows a few primary metrics.
# We decided to use `norm_macro_recall` because it was the closest metric to the one we actually wanted to evaluate.

automl_settings = {
    "experiment_timeout_minutes": 120, # to set a limit on the amount of time AutoML will be running
    "max_concurrent_iterations": 5, # applies to the compute target we are using
    "primary_metric" : 'norm_macro_recall' # recall for our primary metric
}

# Setting AutoML config for model training.

automl_config = AutoMLConfig(compute_target=cpu_cluster,
                             task = "classification", # classifying if water pumps are functional
                             training_data=training_data, 
                             label_column_name="status_group", # our target variable for water pump function  
                             path = project_folder,
                             enable_early_stopping= True, # prevents automl from spending too much time on models that stopped improving, saves time and compute costs
                             featurization= 'auto',
                             debug_log = "automl_errors.log",
                             **automl_settings
                            )
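
To make the difference between the recall variants mentioned above concrete, here is a minimal pure-Python sketch. The labels are made up, and the normalization (macro recall rescaled so that the chance level 1/C for C classes maps to 0) reflects our reading of the Azure AutoML documentation, not code taken from the SDK.

```python
def per_class_recall(y_true, y_pred):
    """Recall per class: correctly predicted instances / true instances."""
    recalls = {}
    for cls in sorted(set(y_true)):
        total = sum(1 for t in y_true if t == cls)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        recalls[cls] = hits / total
    return recalls

def recall_micro(y_true, y_pred):
    """Pools all instances; for single-label problems this equals accuracy."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

def recall_macro(y_true, y_pred):
    """Unweighted mean of the per-class recalls."""
    recalls = per_class_recall(y_true, y_pred)
    return sum(recalls.values()) / len(recalls)

def norm_macro_recall(y_true, y_pred):
    """Macro recall rescaled so chance level is 0 and a perfect score is 1."""
    chance = 1.0 / len(set(y_true))
    return (recall_macro(y_true, y_pred) - chance) / (1.0 - chance)

# Made-up, imbalanced labels: F = functional, N = non functional, R = needs repair
y_true = ["F"] * 6 + ["N"] * 3 + ["R"]
y_pred = ["F"] * 6 + ["N", "N", "F"] + ["F"]

micro = recall_micro(y_true, y_pred)
macro = recall_macro(y_true, y_pred)
norm_macro = norm_macro_recall(y_true, y_pred)
print(round(micro, 3), round(macro, 3), round(norm_macro, 3))
```

On imbalanced data the two averages can diverge sharply: micro recall is dominated by the majority class, while macro recall penalizes a minority class that is never predicted correctly.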

## Create Pipeline and AutoMLStep

Defining the outputs for the AutoMLStep using TrainingOutput.

In [11]:
from azureml.pipeline.core import PipelineData, TrainingOutput

ds = ws.get_default_datastore()
metrics_output_name = 'metrics_output'
best_model_output_name = 'best_model_output'

metrics_data = PipelineData(name='metrics_data',
                           datastore=ds,
                           pipeline_output_name=metrics_output_name,
                           training_output=TrainingOutput(type='Metrics'))
model_data = PipelineData(name='model_data',
                           datastore=ds,
                           pipeline_output_name=best_model_output_name,
                           training_output=TrainingOutput(type='Model'))

## Create the AutoMLStep

In [12]:
# Creating an AutoMLStep

automl_step = AutoMLStep(
    name='automl_module',
    automl_config=automl_config,
    outputs=[metrics_data, model_data],
    allow_reuse=True
    )

In [13]:
# Creating a Pipeline

from azureml.pipeline.core import Pipeline

pipeline = Pipeline(
    description="pipeline_with_automlstep",
    workspace=ws,    
    steps=[automl_step])

In [14]:
print('Submitting AutoML experiment...')

pipeline_run = automl_experiment.submit(pipeline)

Submitting AutoML experiment...
Created step automl_module [4bd1fdf9][e897a8e9-492f-42db-bc86-0dc8b6727503], (This step will run and generate new outputs)
Submitted PipelineRun 1b1a6950-2c78-4af9-8f82-51346807ef1a
Link to Azure Machine Learning Portal: https://ml.azure.com/runs/1b1a6950-2c78-4af9-8f82-51346807ef1a?wsid=/subscriptions/a24a24d5-8d87-4c8a-99b6-91ed2d2df51f/resourcegroups/aml-quickstarts-143063/workspaces/quick-starts-ws-143063&tid=660b3398-b80e-49d2-bc5b-ac1dc93b5254


# Run Details

Using the RunDetails widget to monitor the pipeline run and its child runs.

In [15]:
from azureml.widgets import RunDetails
RunDetails(pipeline_run).show()

_PipelineWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', …

In [16]:
pipeline_run.wait_for_completion()

PipelineRunId: 1b1a6950-2c78-4af9-8f82-51346807ef1a
Link to Azure Machine Learning Portal: https://ml.azure.com/runs/1b1a6950-2c78-4af9-8f82-51346807ef1a?wsid=/subscriptions/a24a24d5-8d87-4c8a-99b6-91ed2d2df51f/resourcegroups/aml-quickstarts-143063/workspaces/quick-starts-ws-143063&tid=660b3398-b80e-49d2-bc5b-ac1dc93b5254
PipelineRun Status: NotStarted
PipelineRun Status: Running


StepRunId: f3f5f564-3eb1-4c5b-bdba-049df1aa812b
Link to Azure Machine Learning Portal: https://ml.azure.com/runs/f3f5f564-3eb1-4c5b-bdba-049df1aa812b?wsid=/subscriptions/a24a24d5-8d87-4c8a-99b6-91ed2d2df51f/resourcegroups/aml-quickstarts-143063/workspaces/quick-starts-ws-143063&tid=660b3398-b80e-49d2-bc5b-ac1dc93b5254
StepRun( automl_module ) Status: NotStarted
StepRun( automl_module ) Status: Running

StepRun(automl_module) Execution Summary
StepRun( automl_module ) Status: Finished



PipelineRun Execution Summary
PipelineRun Status: Finished
{'runId': '1b1a6950-2c78-4af9-8f82-51346807ef1a', 'status': 'Com

'Finished'

# Examine Results

# Retrieve the metrics of all child runs

In [17]:
metrics_output = pipeline_run.get_pipeline_output(metrics_output_name)
num_file_downloaded = metrics_output.download('.', show_progress=True)

Downloading azureml/f3f5f564-3eb1-4c5b-bdba-049df1aa812b/metrics_data
Downloaded azureml/f3f5f564-3eb1-4c5b-bdba-049df1aa812b/metrics_data, 1 files out of an estimated total of 1


In [18]:
import json
with open(metrics_output._path_on_datastore) as f:
    metrics_output_result = f.read()
    
deserialized_metrics_output = json.loads(metrics_output_result)
df = pd.DataFrame(deserialized_metrics_output)
pd.set_option('display.max_rows', 100)
df_t = df.T
df_t['recall_score_micro'].sort_values()

f3f5f564-3eb1-4c5b-bdba-049df1aa812b_12    [0.4861111111111111]
f3f5f564-3eb1-4c5b-bdba-049df1aa812b_35    [0.5237794612794613]
f3f5f564-3eb1-4c5b-bdba-049df1aa812b_30    [0.5429292929292929]
f3f5f564-3eb1-4c5b-bdba-049df1aa812b_25    [0.5429292929292929]
f3f5f564-3eb1-4c5b-bdba-049df1aa812b_45    [0.5429292929292929]
f3f5f564-3eb1-4c5b-bdba-049df1aa812b_11     [0.564604377104377]
f3f5f564-3eb1-4c5b-bdba-049df1aa812b_26     [0.564604377104377]
f3f5f564-3eb1-4c5b-bdba-049df1aa812b_10    [0.5675505050505051]
f3f5f564-3eb1-4c5b-bdba-049df1aa812b_2     [0.5795454545454546]
f3f5f564-3eb1-4c5b-bdba-049df1aa812b_14    [0.5883838383838383]
f3f5f564-3eb1-4c5b-bdba-049df1aa812b_9     [0.5913299663299664]
f3f5f564-3eb1-4c5b-bdba-049df1aa812b_15    [0.5949074074074074]
f3f5f564-3eb1-4c5b-bdba-049df1aa812b_4     [0.5961700336700336]
f3f5f564-3eb1-4c5b-bdba-049df1aa812b_18    [0.6123737373737373]
f3f5f564-3eb1-4c5b-bdba-049df1aa812b_19    [0.6294191919191919]
f3f5f564-3eb1-4c5b-bdba-049df1aa812b_40 

As we can see, the last row in this list is the best result: Run ID f3f5f564-3eb1-4c5b-bdba-049df1aa812b_33, with a `recall_score_micro` of 0.8136. It is an XGBoost model; we will see its hyperparameters shortly.
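
Picking the best child run from the metrics file does not require pandas. The sketch below assumes the JSON has the shape implied by the cell above, `{run_id: {metric_name: [value]}}`; the run IDs and scores are invented for illustration, and `best_run` is a hypothetical helper.

```python
import json

# Invented excerpt in the assumed shape: {run_id: {metric_name: [value]}}
raw = json.dumps({
    "run_a": {"recall_score_micro": [0.58], "accuracy": [0.58]},
    "run_b": {"recall_score_micro": [0.81], "accuracy": [0.81]},
    "run_c": {"recall_score_micro": [0.63], "accuracy": [0.63]},
})
metrics = json.loads(raw)

def best_run(metrics, metric_name):
    """Return (run_id, score) for the run with the highest `metric_name`."""
    # Each metric is stored as a one-element list, so unwrap it with [0]
    scores = {run_id: values[metric_name][0]
              for run_id, values in metrics.items()
              if metric_name in values}
    best_id = max(scores, key=scores.get)
    return best_id, scores[best_id]

best_id, best_score = best_run(metrics, "recall_score_micro")
print(best_id, best_score)
```

Runs that did not log the requested metric are skipped instead of raising a `KeyError`.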

# Best Model

In [19]:
from azureml.train.automl.run import AutoMLRun
best_recall_run_id = df_t['recall_score_micro'].str.get(0).idxmax() # get string for best recall_score_micro run
automl_run = AutoMLRun(automl_experiment, run_id=best_recall_run_id)
automl_run.download_files()

In [22]:
import pickle

with open('outputs/model.pkl', "rb" ) as f:
    best_model = pickle.load(f)
best_model

PipelineWithYTransformations(Pipeline={'memory': None,
                                       'steps': [('datatransformer',
                                                  DataTransformer(enable_dnn=None,
                                                                  enable_feature_sweeping=None,
                                                                  feature_sweeping_config=None,
                                                                  feature_sweeping_timeout=None,
                                                                  featurization_config=None,
                                                                  force_text_dnn=None,
                                                                  is_cross_validation=None,
                                                                  is_onnx_compatible=None,
                                                                  logger=None,
                                                              

In [23]:
# As we can see, XGboost performed the best on recall_score_micro
best_model.steps

[('datatransformer',
  DataTransformer(enable_dnn=None, enable_feature_sweeping=None,
                  feature_sweeping_config=None, feature_sweeping_timeout=None,
                  featurization_config=None, force_text_dnn=None,
                  is_cross_validation=None, is_onnx_compatible=None, logger=None,
                  observer=None, task=None, working_dir=None)),
 ('StandardScalerWrapper',
  <azureml.automl.runtime.shared.model_wrappers.StandardScalerWrapper at 0x7fe007bd34e0>),
 ('XGBoostClassifier',
  XGBoostClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
                    colsample_bynode=1, colsample_bytree=1, eta=0.4, gamma=0.01,
                    learning_rate=0.1, max_delta_step=0, max_depth=7,
                    max_leaves=63, min_child_weight=1, missing=nan,
                    n_estimators=400, n_jobs=1, nthread=None,
                    objective='multi:softprob', random_state=0, reg_alpha=1.875,
                    reg_lambda=1.979166666666

In [43]:
print(automl_run) # best AutoML model (for recall_score_micro)

Run(Experiment: automl-pump-it-up-operationalize,
Id: f3f5f564-3eb1-4c5b-bdba-049df1aa812b_33,
Type: azureml.scriptrun,
Status: Completed)


Here are our XGBoost model's hyperparameters in a cleaner format:

    base_score=0.5, 
    booster='gbtree', 
    colsample_bylevel=1,                    
    colsample_bynode=1, 
    colsample_bytree=1, 
    eta=0.4, 
    gamma=0.01,
    learning_rate=0.1, 
    max_delta_step=0, 
    max_depth=7,
    max_leaves=63, 
    min_child_weight=1, 
    missing=nan,
    n_estimators=400, 
    n_jobs=1, 
    nthread=None,
    objective='multi:softprob', 
    random_state=0,
    reg_alpha=1.875, 
    reg_lambda=1.9791666666666667,
    scale_pos_weight=1, 
    seed=None, 
    silent=None, 
    subsample=0.7,
    tree_method='auto', 
    verbose=-10, 
    verbosity=0

# Test the model on the Test Set

In [24]:
# important because registering data with TabularDatasetFactory might change column names (it did in this case)
# If column names change and you only registered X_train, there will be a mismatch unless you do the same with X_test
X_testing = testing_data.to_pandas_dataframe() 

In [25]:
from sklearn.metrics import recall_score

# Predict on the Test Set
ypred = best_model.predict(X_testing)

# Calculate recall
recall = recall_score(y_test, ypred, average='micro')
print('Recall Micro: %.3f' % recall)

Recall Micro: 0.807


Our AutoML XGBoost model scores 0.807 on the test set, which is great. That is close to the 0.8136 reported for the same run during the AutoML experiment, which suggests the model is not overfitting much.

As you can see, the score on `recall_score_micro` is higher with the XGBoost model than it was with the Random Forest HyperDrive model (which was 0.765).

Therefore, we will be deploying the best model we got with AutoML.

# Model Deployment

We register the model, create an inference config, and deploy the model as a web service.

In other words, we expose the model behind a REST endpoint so that it can be called from any HTTP library on any platform.

In [29]:
from azureml.core.model import Model

# Register model (with the best recall_score_micro performance)
model = Model.register(model_path='outputs/model.pkl', 
                          model_name='automl_XGBoost',
                          tags={'Training context':'Auto ML', 'Run ID': best_recall_run_id, 'Experiment': 'automl-pump-it-up-operationalize', 'Type': 'azureml.scriptrun'},
                          properties={'Recall_Micro': recall},
                          workspace=ws)


Registering model automl_XGBoost


In [27]:
print(model)

Model(workspace=Workspace.create(name='quick-starts-ws-143063', subscription_id='a24a24d5-8d87-4c8a-99b6-91ed2d2df51f', resource_group='aml-quickstarts-143063'), name=automl_XGBoost, id=automl_XGBoost:1, version=1, tags={'Training context': 'Auto ML', 'Run ID': 'f3f5f564-3eb1-4c5b-bdba-049df1aa812b_33', 'Experiment': 'automl-pump-it-up-operationalize', 'Type': 'azureml.scriptrun'}, properties={'Recall_Micro': '0.8068181818181818'})


In [30]:
# moving the AutoML model to base folder.
!mv outputs/model.pkl ./

In [31]:
# Testing if prediction works in notebook before sending request to endpoint
import json
import joblib
import pandas as pd

x_new = pd.DataFrame(X_testing.loc[10]).T # grabbing a single example (row 10) for testing the webservice
# x_new.columns = x_new.columns.str.replace(r"[^a-zA-Z\d_]+", "")
x_new = x_new.T.rename(columns={10: "data"})
x_new = x_new.to_dict()
x_new = {"data": [x_new['data']]}
x_new

model_path = Model.get_model_path('model.pkl')
model_test = joblib.load(model_path)

data = json.dumps(x_new)

data_test = pd.DataFrame(json.loads(data)['data'])
predictions_test = model_test.predict(data_test)

# It works!
predictions_test.tolist()

['non functional']

In [32]:
%%writefile score.py

import json
import joblib
import pandas as pd
from azureml.core.model import Model

# Called when the service is loaded
def init():
    global model
    # Get the path to the registered model file and load it
    model_path = Model.get_model_path('automl_XGBoost')
    model = joblib.load(model_path)

# Called when a request is received
def run(data):
    # Parse the JSON payload into a pandas DataFrame
    data = pd.DataFrame(json.loads(data)['data'])
    # Get a prediction from the model
    predictions = model.predict(data)
    # Return the predictions as any JSON serializable format
    return predictions.tolist()


Writing score.py
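
The `init`/`run` contract of the entry script can be exercised locally before deploying. The sketch below is an assumption-laden stand-in: `StubModel` and its toy rule are hypothetical and replace the unpickled AutoML pipeline, and records are plain dictionaries instead of a DataFrame.

```python
import json

class StubModel:
    """Hypothetical stand-in for the unpickled AutoML pipeline."""
    def predict(self, records):
        # Toy rule for illustration only; the real model is far more involved
        return ["non functional" if r["gps_height"] == 0 else "functional"
                for r in records]

model = None

def init():
    # The real script loads the registered model from disk here
    global model
    model = StubModel()

def run(raw_data):
    # Parse the request body and pull out the list of records under "data"
    records = json.loads(raw_data)["data"]
    predictions = model.predict(records)
    # The return value must be JSON-serializable
    return list(predictions)

init()
payload = json.dumps({"data": [{"gps_height": 0, "population": 5.3},
                               {"gps_height": 1390, "population": 4.7}]})
result = run(payload)
print(result)
```

Calling `init()` once and then `run()` per request mirrors how the web service uses the script, so input-format mistakes surface before the image is built.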


In [33]:
from azureml.core.conda_dependencies import CondaDependencies

# Add the dependencies for your model
# We need to include all of these packages for deployment of the automl model
# Otherwise the deployment will not work
myenv = CondaDependencies()
myenv.add_conda_package("scikit-learn==0.22.1")
myenv.add_conda_package("pandas==0.25.1")
myenv.add_conda_package("numpy>=1.16.0,<1.19.0")
myenv.add_conda_package("py-xgboost<=0.90")
myenv.add_conda_package("fbprophet==0.5")
myenv.add_conda_package("holidays==0.9.11")
myenv.add_conda_package("psutil>=5.2.2,<6.0.0")
myenv.add_pip_package("azureml-interpret==1.20.0")
myenv.add_pip_package("azureml-train-automl-runtime==1.20.0")
myenv.add_pip_package("inference-schema")

# Save the environment config as a .yml file
env_file = './env.yml'
with open(env_file,"w") as f:
    f.write(myenv.serialize_to_string())
print("Saved dependency info in", env_file)

Saved dependency info in ./env.yml


In [34]:
# Create inference_config
from azureml.core.model import InferenceConfig

classifier_inference_config = InferenceConfig(runtime="python",
                                              source_directory = '.',
                                              entry_script="score.py",
                                              conda_file="env.yml")


In [35]:
from azureml.core.webservice import AciWebservice

classifier_deploy_config = AciWebservice.deploy_configuration(cpu_cores = 1,
                                                              memory_gb = 1,
                                                              enable_app_insights=True)

In [36]:
from azureml.core.model import Model

model = ws.models['automl_XGBoost']
service = Model.deploy(workspace=ws,
                       name = 'pump-it-up-deployed-service',
                       models = [model],
                       inference_config = classifier_inference_config,
                       deployment_config = classifier_deploy_config)

service.wait_for_deployment(show_output = True)

Tips: You can try get_logs(): https://aka.ms/debugimage#dockerlog or local deployment: https://aka.ms/debugimage#debug-locally to debug if deployment takes longer than 10 minutes.
Running
2021-04-18 18:29:16+00:00 Creating Container Registry if not exists.
2021-04-18 18:29:17+00:00 Registering the environment.
2021-04-18 18:29:18+00:00 Building image..
2021-04-18 18:41:08+00:00 Generating deployment configuration.
2021-04-18 18:41:09+00:00 Submitting deployment to compute.
2021-04-18 18:41:12+00:00 Checking the status of deployment pump-it-up-deployed-service..
2021-04-18 18:45:25+00:00 Checking the status of inference endpoint pump-it-up-deployed-service.
Succeeded
ACI service creation operation finished, operation "Succeeded"


In [37]:
print(service.get_logs())

2021-04-18T18:45:18,191273900+00:00 - gunicorn/run 
2021-04-18T18:45:18,199078000+00:00 - rsyslog/run 
2021-04-18T18:45:18,208198800+00:00 - iot-server/run 
2021-04-18T18:45:18,229488400+00:00 - nginx/run 
/usr/sbin/nginx: /azureml-envs/azureml_7084af439e955eb6815347fde2c0f0f6/lib/libcrypto.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml-envs/azureml_7084af439e955eb6815347fde2c0f0f6/lib/libcrypto.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml-envs/azureml_7084af439e955eb6815347fde2c0f0f6/lib/libssl.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml-envs/azureml_7084af439e955eb6815347fde2c0f0f6/lib/libssl.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml-envs/azureml_7084af439e955eb6815347fde2c0f0f6/lib/libssl.so.1.0.0: no version information available (required by /usr/sbin/nginx)
rsyslogd

# Test the Deployed Model

Here we will send a request to the deployed model to test it.



In [38]:
endpoint = service.scoring_uri

print(f'\nservice state: {service.state}\n')
print(f'scoring URI: \n{endpoint}\n')
print(f'swagger URI: \n{service.swagger_uri}\n')

print(endpoint)
print(service.swagger_uri)


service state: Healthy

scoring URI: 
http://1013fac7-9003-4912-a64c-3d5587ea2294.southcentralus.azurecontainer.io/score

swagger URI: 
http://1013fac7-9003-4912-a64c-3d5587ea2294.southcentralus.azurecontainer.io/swagger.json

http://1013fac7-9003-4912-a64c-3d5587ea2294.southcentralus.azurecontainer.io/score
http://1013fac7-9003-4912-a64c-3d5587ea2294.southcentralus.azurecontainer.io/swagger.json


In [39]:
import requests
import json

# Serialize the request payload (a dict with a "data" list) to a JSON document
input_data = json.dumps(x_new)

with open('data.json', 'w') as file:
    file.write(input_data)

# Set the content type in the request headers
request_headers = { "Content-Type":"application/json"}

# Call the service
response = requests.post(url = endpoint,
                         data = input_data,
                         headers = request_headers)

print(response)
print("Prediction Results:", response.json())

<Response [200]>
Prediction Results: ['non functional']


It works!
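
For clients that cannot install `requests`, the same call can be prepared with the standard library. This is a sketch under assumptions: the URI is a placeholder, `build_scoring_request` is a hypothetical helper, and the request is only constructed here, not sent.

```python
import json
import urllib.request

def build_scoring_request(scoring_uri, records):
    """Build a POST request whose body is {"data": [record, ...]}."""
    body = json.dumps({"data": records}).encode("utf-8")
    return urllib.request.Request(
        scoring_uri,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Placeholder URI; substitute the scoring URI printed by the service
req = build_scoring_request("http://example.invalid/score", [{"gps_height": 0.0}])
sent_body = json.loads(req.data)
print(req.get_method(), req.full_url)
print(sent_body)
```

Passing the request to `urllib.request.urlopen(req)` would send it; the response body is then the JSON list returned by `run` in `score.py`.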

In [40]:
response.status_code

200

# Deleting the Service and the Compute Target

In [None]:
# Delete the web service and the compute target to avoid incurring additional charges.

service.delete()
cpu_cluster.delete()