<a href="https://colab.research.google.com/github/JayThibs/hyperdrive-vs-automl-plus-deployment/blob/main/automl.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Automated ML

Note: For data exploration, go to hyperparameter_tuning.ipynb

# Import Dependencies

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

import azureml.core
from azureml.core.experiment import Experiment
from azureml.core.workspace import Workspace
from azureml.train.automl import AutoMLConfig
from azureml.core.dataset import Dataset

from azureml.pipeline.steps import AutoMLStep

# Check core SDK version number
print("SDK version:", azureml.core.VERSION)

SDK version: 1.26.0


In [2]:
# %%writefile feature_preprocessing.py

import numpy as np
import pandas as pd

def bools(df):
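    """
    public_meeting: fill the nulls with False
    permit: fill the nulls with False
    Both columns are then cast to float (False -> 0.0, True -> 1.0).
    """
    z = ['public_meeting', 'permit']
    for i in z:
        # Assign the result back rather than calling fillna(inplace=True) on a column selection
        df[i] = df[i].fillna(False)
        df[i] = df[i].apply(lambda x: float(x))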
    return df

def locs(df, trans = ['longitude', 'latitude', 'gps_height', 'population']):
    """
    Fill in the nulls for ['longitude', 'latitude', 'gps_height', 'population'] using the medians
    within ['district_code', 'basin'] and, lastly, the overall median.
    Values of 0 and 1 are treated as placeholders for missing data.
    """
    df.loc[df.longitude == 0, 'latitude'] = 0
    for z in trans:
        # 0 and 1 are placeholders; np.nan is used because the np.NaN alias was removed in NumPy 2.0
        df[z] = df[z].replace([0., 1.], np.nan)
        
        for j in ['district_code', 'basin']:
        
            df['median'] = df.groupby([j])[z].transform('median')
            df[z] = df[z].fillna(df['median'])
        
        df[z] = df[z].fillna(df[z].median())
        del df['median']
    return df

def construction(df):
    """
    Many construction_year values are a placeholder for missing data rather than a real year.
    For modeling purposes this is fine, but visualizations that compare results across years
    become hard to read, so we move the placeholder closer to the real values.
    The lowest "good" value is 1960, so every year below 1950 is set to 1910, 50 years earlier.
    """
    df.loc[df['construction_year'] < 1950, 'construction_year'] = 1910
    return df

# Alright, now let's drop a few columns
# Needed to drop quite a few categorical columns so that the data would fit in memory in Azure
# Tested the model before and after (from 6388 columns to 278) in Colab and only had a ~0.03% reduction in performance

def removal(df):
  # id: we drop the id column because it is not a useful predictor.
  # amount_tsh: is mostly blank - delete
  # wpt_name: not useful, delete (too many values)
  # subvillage: too many values, delete
  # scheme_name: this is almost 50% nulls, so we will delete this column
  # num_private: we will delete this column because ~99% of the values are zeros.
  features_to_drop = ['id','amount_tsh',  'num_private', 
          'quantity', 'quality_group', 'source_type', 'payment', 
          'waterpoint_type_group', 'extraction_type_group', 'wpt_name', 
          'subvillage', 'scheme_name', 'funder', 'installer', 'recorded_by',
          'ward']
  df = df.drop(features_to_drop, axis=1)

  return df

def dummy(df):
    dummy_cols = ['basin', 'lga', 'public_meeting',
       'scheme_management', 'permit', 'extraction_type',
       'extraction_type_class', 'management', 'management_group',
       'payment_type', 'water_quality', 'quantity_group', 'source',
       'source_class', 'waterpoint_type', 'region']

    df = pd.get_dummies(df, columns=dummy_cols)

    return df

def dates(df):
    """
    date_recorded: this might be a useful variable for this analysis, although the year itself
    would be useless in a practical scenario moving into the future. We convert the column to a
    datetime and create 'year_recorded' and 'month_recorded' columns in case those levels prove
    useful. A visual inspection of both casts significant doubt on that possibility, but we'll
    proceed for now. Because random forest cannot accept datetime values, date_recorded itself
    is replaced by its ordinal (integer day number).
    """
    df['date_recorded'] = pd.to_datetime(df['date_recorded'])
    df['year_recorded'] = df['date_recorded'].apply(lambda x: x.year)
    df['month_recorded'] = df['date_recorded'].apply(lambda x: x.month)
    df['date_recorded'] = (pd.to_datetime(df['date_recorded'])).apply(lambda x: x.toordinal())
    return df

def dates2(df):
    """
    Turn year_recorded and month_recorded into dummy variables
    """
    for z in ['month_recorded', 'year_recorded']:
        df[z] = df[z].apply(lambda x: str(x))
        good_cols = [z+'_'+i for i in df[z].unique()]
        df = pd.concat((df, pd.get_dummies(df[z], prefix = z)[good_cols]), axis = 1)
        del df[z]
    return df

def small_n(df):
    "Collapsing small categorical value counts into 'other'"
    cols = [i for i in df.columns if type(df[i].iloc[0]) == str]
    df[cols] = df[cols].where(df[cols].apply(lambda x: x.map(x.value_counts())) > 100, "other")
    return df
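
The group-median imputation used in `locs` can be sketched on a toy frame. The values below are invented for illustration; only the pattern (placeholder to NaN, then group median, then overall median) mirrors the function above.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values): 0 marks a missing gps_height, as in the raw dataset.
toy = pd.DataFrame({
    "basin": ["Pangani", "Pangani", "Pangani", "Rufiji", "Rufiji"],
    "gps_height": [100.0, 0.0, 300.0, 0.0, 0.0],
})

# Step 1: turn the placeholder into a real missing value.
toy["gps_height"] = toy["gps_height"].replace(0.0, np.nan)

# Step 2: fill each gap with the median of its own group.
group_median = toy.groupby("basin")["gps_height"].transform("median")
toy["gps_height"] = toy["gps_height"].fillna(group_median)

# Step 3: groups with no valid value at all (Rufiji here) fall back to the overall median.
toy["gps_height"] = toy["gps_height"].fillna(toy["gps_height"].median())

print(toy)
```

The final fallback matters: a group whose values are all missing has no median of its own, so without step 3 its rows would stay empty.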

## Dataset

### Overview

We'll be using the Pump it Up dataset from the DrivenData competition.

The description of the problem: 

> Using data from Taarifa and the Tanzanian Ministry of Water, can you predict which pumps are functional, which need some repairs, and which don't work at all? This is an intermediate-level practice competition. Predict one of these three classes based on a number of variables about what kind of pump is operating, when it was installed, and how it is managed. A smart understanding of which waterpoints will fail can improve maintenance operations and ensure that clean, potable water is available to communities across Tanzania.

In other words, our goal is to predict which water pumps are non-functioning or functioning, but in need of repair.

In this project, we will use AutoML to train multiple models and choose the best-performing one for deployment.

In [3]:
# We loaded the dataset into Azure and we are grabbing it here.

import re  # needed for the column-name clean-up further down in this cell

from azureml.core import Workspace, Experiment, Dataset
# from feature_preprocessing import *

# download config file in azure and put it in the current Notebooks folder
ws = Workspace.from_config()
exp = Experiment(workspace=ws, name="Pump-it-Up-Data-Mining-the-Water-Table")

print('Workspace name: ' + ws.name, 
      'Azure region: ' + ws.location, 
      'Subscription id: ' + ws.subscription_id, 
      'Resource group: ' + ws.resource_group, sep = '\n')

run = exp.start_logging()

# Reuse the workspace that the logging run belongs to
ws = run.experiment.workspace

dataset = Dataset.get_by_name(ws, name='Pump-it-Up-dataset')
X = dataset.to_pandas_dataframe()
y = X[['status_group']]
del X['status_group']

# Cleaning up the features of our dataset
X = bools(X)
X = locs(X)
X = construction(X)
X = removal(X)
X = dummy(X)
X = dates(X)
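# NOTE: the result is assigned to lowercase `x`, so `X` keeps 'year_recorded' and
# 'month_recorded' as single columns (they appear that way in the dataset preview further down)
x = dates2(X)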
X = small_n(X)

# Replacing "[", "]" and "<" in the headers with "_" to make the data compatible with different algorithms (namely, xgboost)
regex = re.compile(r"\[|\]|<", re.IGNORECASE)
X.columns = [regex.sub("_", col) if any(x in str(col) for x in set(('[', ']', '<'))) else col for col in X.columns.values]

# Converting the population values to log
X['population'] = np.log(X['population'])

# Splitting the dataset into a training and test set
# Test set will be used later
# The same random seed (42) for the Hyperdrive model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Concatenating the features and labels together to feed to our AutoML model
clean_train_df = pd.concat([X_train, y_train], axis=1)

Performing interactive authentication. Please follow the instructions on the terminal.
To sign in, use a web browser to open the page https://microsoft.com/devicelogin and enter the code FHKD5FNZL to authenticate.
You have logged in. Now let us find all the subscriptions to which you have access...
Interactive authentication successfully completed.
Workspace name: quick-starts-ws-142527
Azure region: southcentralus
Subscription id: 6971f5ac-8af1-446e-8034-05acea24681f
Resource group: aml-quickstarts-142527
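
As a small illustration of the header clean-up in the cell above, the sketch below applies an equivalent pattern to a few made-up column names. The names are hypothetical; the character class covers the same three characters (`[`, `]`, `<`) as the notebook's regex.

```python
import re

# Hypothetical column names; only the bracket and "<" characters matter here.
columns = ["gps_height", "age_[0, 10]", "pop_<100", "basin_Pangani"]

# Same three characters as the notebook's pattern, written as one character class.
pattern = re.compile(r"[\[\]<]")
clean = [pattern.sub("_", str(col)) for col in columns]

print(clean)
```

Names without any of the three characters pass through unchanged, so the substitution can be applied to every column without a separate membership check.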


In [63]:
from azureml.data.dataset_factory import TabularDatasetFactory

# Get the default datastore to be entered as a parameter in tabular dataset creation
datastore = ws.get_default_datastore()

# Change pandas dataframe into a tabular dataset to be used in automl
testing_data = TabularDatasetFactory.register_pandas_dataframe(X_test, datastore, 'automl_data_test')



Validating arguments.
Arguments validated.
Successfully obtained datastore reference and path.
Uploading file to managed-dataset/f236cea9-e14c-4caf-b55c-6f8db5c6a5c9/
Successfully uploaded file to datastore.
Creating and registering a new dataset.
Successfully created and registered a new dataset.


In [4]:
from azureml.data.dataset_factory import TabularDatasetFactory

# Get the default datastore to be entered as a parameter in tabular dataset creation
datastore = ws.get_default_datastore()

# Change pandas dataframe into a tabular dataset to be used in automl
training_data = TabularDatasetFactory.register_pandas_dataframe(clean_train_df, datastore, 'automl_data')

Method register_pandas_dataframe: This is an experimental method, and may change at any time. For more information, see https://aka.ms/azuremlexperimental.


Validating arguments.
Arguments validated.
Successfully obtained datastore reference and path.
Uploading file to managed-dataset/1f873d0b-f4e9-44ee-a1d6-38216b14ee83/
Successfully uploaded file to datastore.
Creating and registering a new dataset.
Successfully created and registered a new dataset.


In [58]:
training_data.take(3).to_pandas_dataframe()

Unnamed: 0,date_recorded,gps_height,longitude,latitude,region_code,district_code,population,construction_year,basin_Internal,basin_Lake Nyasa,basin_Lake Rukwa,basin_Lake Tanganyika,basin_Lake Victoria,basin_Pangani,basin_Rufiji,basin_Ruvuma / Southern Coast,basin_Wami / Ruvu,lga_Arusha Rural,lga_Arusha Urban,lga_Babati,lga_Bagamoyo,lga_Bahi,lga_Bariadi,lga_Biharamulo,lga_Bukoba Rural,lga_Bukoba Urban,lga_Bukombe,lga_Bunda,lga_Chamwino,lga_Chato,lga_Chunya,lga_Dodoma Urban,lga_Geita,lga_Hai,lga_Hanang,lga_Handeni,lga_Igunga,lga_Ilala,lga_Ileje,lga_Ilemela,lga_Iramba,lga_Iringa Rural,lga_Kahama,lga_Karagwe,lga_Karatu,lga_Kasulu,lga_Kibaha,lga_Kibondo,lga_Kigoma Rural,lga_Kigoma Urban,lga_Kilindi,lga_Kilolo,lga_Kilombero,lga_Kilosa,lga_Kilwa,lga_Kinondoni,lga_Kisarawe,lga_Kishapu,lga_Kiteto,lga_Kondoa,lga_Kongwa,lga_Korogwe,lga_Kwimba,lga_Kyela,lga_Lindi Rural,lga_Lindi Urban,lga_Liwale,lga_Longido,lga_Ludewa,lga_Lushoto,lga_Mafia,lga_Magu,lga_Makete,lga_Manyoni,lga_Masasi,lga_Maswa,lga_Mbarali,lga_Mbeya Rural,lga_Mbinga,lga_Mbozi,lga_Mbulu,lga_Meatu,lga_Meru,lga_Misenyi,lga_Missungwi,lga_Mkinga,lga_Mkuranga,lga_Monduli,lga_Morogoro Rural,lga_Morogoro Urban,lga_Moshi Rural,lga_Moshi Urban,lga_Mpanda,lga_Mpwapwa,lga_Mtwara Rural,lga_Mtwara Urban,lga_Mufindi,lga_Muheza,lga_Muleba,lga_Musoma Rural,lga_Mvomero,lga_Mwanga,lga_Nachingwea,lga_Namtumbo,lga_Nanyumbu,lga_Newala,lga_Ngara,lga_Ngorongoro,lga_Njombe,lga_Nkasi,lga_Nyamagana,lga_Nzega,lga_Pangani,lga_Rombo,lga_Rorya,lga_Ruangwa,lga_Rufiji,lga_Rungwe,lga_Same,lga_Sengerema,lga_Serengeti,lga_Shinyanga Rural,lga_Shinyanga Urban,lga_Siha,lga_Sikonge,lga_Simanjiro,lga_Singida Rural,lga_Singida Urban,lga_Songea Rural,lga_Songea Urban,lga_Sumbawanga Rural,lga_Sumbawanga Urban,lga_Tabora 
Urban,lga_Tandahimba,lga_Tanga,lga_Tarime,lga_Temeke,lga_Tunduru,lga_Ukerewe,lga_Ulanga,lga_Urambo,lga_Uyui,public_meeting_0_0,public_meeting_1_0,scheme_management_Company,scheme_management_None,scheme_management_Other,scheme_management_Parastatal,scheme_management_Private operator,scheme_management_SWC,scheme_management_Trust,scheme_management_VWC,scheme_management_WUA,scheme_management_WUG,scheme_management_Water Board,scheme_management_Water authority,permit_0_0,permit_1_0,extraction_type_afridev,extraction_type_cemo,extraction_type_climax,extraction_type_gravity,extraction_type_india mark ii,extraction_type_india mark iii,extraction_type_ksb,extraction_type_mono,extraction_type_nira/tanira,extraction_type_other,extraction_type_other - mkulima/shinyanga,extraction_type_other - play pump,extraction_type_other - rope pump,extraction_type_other - swn 81,extraction_type_submersible,extraction_type_swn 80,extraction_type_walimi,extraction_type_windmill,extraction_type_class_gravity,extraction_type_class_handpump,extraction_type_class_motorpump,extraction_type_class_other,extraction_type_class_rope pump,extraction_type_class_submersible,extraction_type_class_wind-powered,management_company,management_other,management_other - school,management_parastatal,management_private operator,management_trust,management_unknown,management_vwc,management_water authority,management_water board,management_wua,management_wug,management_group_commercial,management_group_other,management_group_parastatal,management_group_unknown,management_group_user-group,payment_type_annually,payment_type_monthly,payment_type_never pay,payment_type_on failure,payment_type_other,payment_type_per bucket,payment_type_unknown,water_quality_coloured,water_quality_fluoride,water_quality_fluoride abandoned,water_quality_milky,water_quality_salty,water_quality_salty 
abandoned,water_quality_soft,water_quality_unknown,quantity_group_dry,quantity_group_enough,quantity_group_insufficient,quantity_group_seasonal,quantity_group_unknown,source_dam,source_hand dtw,source_lake,source_machine dbh,source_other,source_rainwater harvesting,source_river,source_shallow well,source_spring,source_unknown,source_class_groundwater,source_class_surface,source_class_unknown,waterpoint_type_cattle trough,waterpoint_type_communal standpipe,waterpoint_type_communal standpipe multiple,waterpoint_type_dam,waterpoint_type_hand pump,waterpoint_type_improved spring,waterpoint_type_other,region_Arusha,region_Dar es Salaam,region_Dodoma,region_Iringa,region_Kagera,region_Kigoma,region_Kilimanjaro,region_Lindi,region_Manyara,region_Mara,region_Mbeya,region_Morogoro,region_Mtwara,region_Mwanza,region_Pwani,region_Rukwa,region_Ruvuma,region_Shinyanga,region_Singida,region_Tabora,region_Tanga,year_recorded,month_recorded,status_group
0,734926,2092.0,35.43,-4.23,21,1,5.08,1998,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,2013,2,functional
1,734213,550.0,35.51,-5.72,1,6,5.3,1910,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2011,3,functional
2,734328,550.0,32.5,-9.08,12,6,5.3,1910,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,2011,7,non functional


# Setting up Experiment

We'll create a new experiment for our deployment of an AutoML model and create a project folder to hold the training scripts.

In [6]:
experiment_name = 'automl-pump-it-up-operationalize'
project_folder = './automl-pipeline-project'

automl_experiment = Experiment(ws, experiment_name)
automl_experiment

Name,Workspace,Report Page,Docs Page
automl-pump-it-up-operationalize,quick-starts-ws-142527,Link to Azure Machine Learning studio,Link to Documentation


In [7]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

# Creating a compute cluster if there isn't one that is already created.

cpu_cluster_name = 'hypr-auto-clustr'

try:
    cpu_cluster = ComputeTarget(workspace=ws, name=cpu_cluster_name)
    print('Found existing compute target.')
except ComputeTargetException:
    print('Creating a new compute target...')
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_v2',
                                                          max_nodes=4)
    cpu_cluster = ComputeTarget.create(ws, cpu_cluster_name, compute_config)
    
cpu_cluster.wait_for_completion(show_output=True)

Creating a new compute target...
Creating....
SucceededProvisioning operation finished, operation "Succeeded"
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned


# AutoML Configuration

Here we create the general AutoML settings object.

We use recall as the primary metric to measure how well we do on true positives. We can imagine a real scenario where we want a model that does not miss non-functioning water pumps, and where we care much less about functioning water pumps that are incorrectly predicted as non-functional. Optimizing for recall helps make sure we miss fewer true positives.
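
To make the metric concrete, here is a minimal sketch of how a normalized macro recall can be computed by hand. It assumes the definition `(macro_recall - R) / (1 - R)` with `R = 1 / number_of_classes` for multiclass problems, which is my reading of how Azure AutoML defines `norm_macro_recall`. The labels and predictions are invented for illustration.

```python
from sklearn.metrics import recall_score

classes = ["functional", "functional needs repair", "non functional"]

# Hypothetical ground truth: 4 functional, 2 needing repair, 4 non functional.
y_true = (["functional"] * 4
          + ["functional needs repair"] * 2
          + ["non functional"] * 4)

# Hypothetical predictions: 3/4, 1/2 and 3/4 of each class are recovered.
y_pred = ["functional", "functional", "functional", "non functional",
          "functional needs repair", "functional",
          "non functional", "non functional", "non functional", "functional"]

# Macro recall: the unweighted mean of the per-class recalls.
macro_recall = recall_score(y_true, y_pred, average="macro")

# Assumed normalization: 0 corresponds to random guessing, 1 to a perfect model.
chance = 1 / len(classes)
norm_macro_recall = (macro_recall - chance) / (1 - chance)

print(f"macro recall:      {macro_recall:.3f}")
print(f"norm macro recall: {norm_macro_recall:.3f}")
```

Because every class contributes equally to the macro average, the small "functional needs repair" class weighs as much as the two large ones, which suits a setting where missed repairs are costly.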

In [19]:
from azureml.train.automl import AutoMLConfig

automl_settings = {
    "experiment_timeout_minutes": 120, # to set a limit on the amount of time AutoML will be running
    "max_concurrent_iterations": 5, # applies to the compute target we are using
    "primary_metric" : 'norm_macro_recall' # recall for our performance metric
}

# Setting AutoML config for model training.

automl_config = AutoMLConfig(compute_target=cpu_cluster,
                             task = "classification", # classifying if water pumps are functional
                             training_data=training_data, 
                             label_column_name="status_group", # our target variable for water pump function  
                             path = project_folder,
                             enable_early_stopping= True, # prevents automl from spending too much time on models that stopped improving, saves time and compute costs
                             featurization= 'auto',
                             debug_log = "automl_errors.log",
                             **automl_settings
                            )

## Create Pipeline and AutoMLStep

Defining the outputs for the AutoMLStep using TrainingOutput.

In [20]:
from azureml.pipeline.core import PipelineData, TrainingOutput

ds = ws.get_default_datastore()
metrics_output_name = 'metrics_output'
best_model_output_name = 'best_model_output'

metrics_data = PipelineData(name='metrics_data',
                           datastore=ds,
                           pipeline_output_name=metrics_output_name,
                           training_output=TrainingOutput(type='Metrics'))
model_data = PipelineData(name='model_data',
                           datastore=ds,
                           pipeline_output_name=best_model_output_name,
                           training_output=TrainingOutput(type='Model'))

## Create the AutoMLStep

In [21]:
# Creating an AutoMLStep

automl_step = AutoMLStep(
    name='automl_module',
    automl_config=automl_config,
    outputs=[metrics_data, model_data],
    allow_reuse=True
    )

In [22]:
# Creating a Pipeline

from azureml.pipeline.core import Pipeline

pipeline = Pipeline(
    description="pipeline_with_automlstep",
    workspace=ws,    
    steps=[automl_step])

In [23]:
print('Submitting AutoML experiment...')

pipeline_run = automl_experiment.submit(pipeline)

Submitting AutoML experiment...
Created step automl_module [167229c4][8fc39410-543e-4a68-9df5-fc7d7033d7f0], (This step will run and generate new outputs)
Submitted PipelineRun 6ed157a1-f7a3-495c-baa7-e4950e97af62
Link to Azure Machine Learning Portal: https://ml.azure.com/runs/6ed157a1-f7a3-495c-baa7-e4950e97af62?wsid=/subscriptions/6971f5ac-8af1-446e-8034-05acea24681f/resourcegroups/aml-quickstarts-142527/workspaces/quick-starts-ws-142527&tid=660b3398-b80e-49d2-bc5b-ac1dc93b5254


# Run Details

Using the RunDetails widget to show the progress of the pipeline run and its child runs.

In [24]:
from azureml.widgets import RunDetails
RunDetails(pipeline_run).show()

_PipelineWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', …

In [25]:
pipeline_run.wait_for_completion()

PipelineRunId: 6ed157a1-f7a3-495c-baa7-e4950e97af62
Link to Azure Machine Learning Portal: https://ml.azure.com/runs/6ed157a1-f7a3-495c-baa7-e4950e97af62?wsid=/subscriptions/6971f5ac-8af1-446e-8034-05acea24681f/resourcegroups/aml-quickstarts-142527/workspaces/quick-starts-ws-142527&tid=660b3398-b80e-49d2-bc5b-ac1dc93b5254
PipelineRun Status: Running


StepRunId: 6f16f702-1b81-499c-ba0f-59d6780c3e21
Link to Azure Machine Learning Portal: https://ml.azure.com/runs/6f16f702-1b81-499c-ba0f-59d6780c3e21?wsid=/subscriptions/6971f5ac-8af1-446e-8034-05acea24681f/resourcegroups/aml-quickstarts-142527/workspaces/quick-starts-ws-142527&tid=660b3398-b80e-49d2-bc5b-ac1dc93b5254
StepRun( automl_module ) Status: NotStarted
StepRun( automl_module ) Status: Running

StepRun(automl_module) Execution Summary
StepRun( automl_module ) Status: Finished



PipelineRun Execution Summary
PipelineRun Status: Finished
{'runId': '6ed157a1-f7a3-495c-baa7-e4950e97af62', 'status': 'Completed', 'startTimeUtc': '2021-

'Finished'

# Examine Results

# Retrieve the metrics of all child runs

In [26]:
metrics_output = pipeline_run.get_pipeline_output(metrics_output_name)
num_file_downloaded = metrics_output.download('.', show_progress=True)

Downloading azureml/6f16f702-1b81-499c-ba0f-59d6780c3e21/metrics_data
Downloaded azureml/6f16f702-1b81-499c-ba0f-59d6780c3e21/metrics_data, 1 files out of an estimated total of 1


In [37]:
import json
with open(metrics_output._path_on_datastore) as f:
    metrics_output_result = f.read()
    
deserialized_metrics_output = json.loads(metrics_output_result)
df = pd.DataFrame(deserialized_metrics_output)
pd.set_option('display.max_rows', 100)
df_t = df.T
df_t['recall_score_micro'].sort_values()

6f16f702-1b81-499c-ba0f-59d6780c3e21_12    [0.4524410774410774]
6f16f702-1b81-499c-ba0f-59d6780c3e21_30    [0.5429292929292929]
6f16f702-1b81-499c-ba0f-59d6780c3e21_35    [0.5448232323232324]
6f16f702-1b81-499c-ba0f-59d6780c3e21_10    [0.5515572390572391]
6f16f702-1b81-499c-ba0f-59d6780c3e21_2     [0.5620791245791246]
6f16f702-1b81-499c-ba0f-59d6780c3e21_9      [0.577020202020202]
6f16f702-1b81-499c-ba0f-59d6780c3e21_18    [0.5805976430976431]
6f16f702-1b81-499c-ba0f-59d6780c3e21_4     [0.5839646464646465]
6f16f702-1b81-499c-ba0f-59d6780c3e21_11    [0.5955387205387206]
6f16f702-1b81-499c-ba0f-59d6780c3e21_14    [0.5976430976430976]
6f16f702-1b81-499c-ba0f-59d6780c3e21_15    [0.5986952861952862]
6f16f702-1b81-499c-ba0f-59d6780c3e21_19    [0.6325757575757576]
6f16f702-1b81-499c-ba0f-59d6780c3e21_5     [0.6550925925925926]
6f16f702-1b81-499c-ba0f-59d6780c3e21_17    [0.6877104377104377]
6f16f702-1b81-499c-ba0f-59d6780c3e21_20    [0.6900252525252525]
6f16f702-1b81-499c-ba0f-59d6780c3e21_6  
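
The metrics file maps each child run ID to its metrics, with every value wrapped in a single-element list (as the output above shows). A minimal sketch of ranking runs from data in that shape follows; the run IDs and numbers are hypothetical stand-ins, not results from this experiment.

```python
import pandas as pd

# Hypothetical metrics in the same shape as the downloaded file:
# {child_run_id: {metric_name: [value]}}
metrics = {
    "run_0": {"recall_score_micro": [0.71], "norm_macro_recall": [0.42]},
    "run_1": {"recall_score_micro": [0.58], "norm_macro_recall": [0.31]},
    "run_2": {"recall_score_micro": [0.69], "norm_macro_recall": [0.47]},
}

# Unwrap the single-element lists, then put child runs on the rows and metrics on the columns.
table = pd.DataFrame(
    {run_id: {name: values[0] for name, values in run_metrics.items()}
     for run_id, run_metrics in metrics.items()}
).T

# Rank by the metric AutoML was asked to optimize.
ranked = table.sort_values("norm_macro_recall", ascending=False)
best_run_id = ranked.index[0]

print(ranked)
print("Best child run:", best_run_id)
```

Unwrapping the lists first makes the columns numeric, so sorting and comparisons behave as expected. In this toy data the run with the highest micro recall is not the one with the highest normalized macro recall, so the choice of ranking metric matters.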

# Best Model

In [77]:
best_model_output_name

'best_model_output'

In [99]:
from azureml.train.automl.run import AutoMLRun
automl_run = AutoMLRun(automl_experiment, run_id='6f16f702-1b81-499c-ba0f-59d6780c3e21_0')
# Download the files recorded by this child run into the current directory
automl_run.download_files(output_directory='.')

In [96]:
best_model_output._path_on_datastore

'azureml/6f16f702-1b81-499c-ba0f-59d6780c3e21/model_data'

In [91]:
# Retrieve best model from Pipeline Run
# The output is looked up by the pipeline output name defined earlier, not by a run ID
best_model_output = pipeline_run.get_pipeline_output(best_model_output_name)
num_file_downloaded = best_model_output.download('.', show_progress=True)

In [98]:
import pickle

# Load the model from the path the pipeline output was downloaded to
with open(best_model_output._path_on_datastore, "rb" ) as f:
    best_model = pickle.load(f)
best_model

In [48]:
best_model.steps

[('datatransformer',
  DataTransformer(enable_dnn=None, enable_feature_sweeping=None,
                  feature_sweeping_config=None, feature_sweeping_timeout=None,
                  featurization_config=None, force_text_dnn=None,
                  is_cross_validation=None, is_onnx_compatible=None, logger=None,
                  observer=None, task=None, working_dir=None)),
 ('stackensembleclassifier',
  StackEnsembleClassifier(base_learners=[('13',
                                          Pipeline(memory=None,
                                                   steps=[('maxabsscaler',
                                                           MaxAbsScaler(copy=True)),
                                                          ('sgdclassifierwrapper',
                                                           SGDClassifierWrapper(alpha=1.6327367346938775,
                                                                                class_weight='balanced',
                            

# Test the model on the Test Set

In [64]:
X_testing = testing_data.to_pandas_dataframe()

In [67]:
from sklearn.metrics import recall_score

# Predict on the Test Set
ypred = best_model.predict(X_testing)

# calculate recall
recall = recall_score(y_test, ypred, average='micro')
print('Recall: %.3f' % recall)

Recall: 0.726
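
One thing to keep in mind when reading the number above: for single-label multiclass problems, micro-averaged recall is the same quantity as accuracy, so it can hide a class that is rarely recovered. The sketch below shows this on invented labels; it does not use the notebook's model or test set.

```python
from sklearn.metrics import accuracy_score, recall_score

labels = ["functional", "functional needs repair", "non functional"]

# Hypothetical ground truth and predictions (10 pumps).
y_true = (["functional"] * 5
          + ["functional needs repair"] * 2
          + ["non functional"] * 3)
y_pred = (["functional"] * 5                       # all functional pumps found
          + ["functional"] * 2                     # both repair cases missed
          + ["non functional", "non functional", "functional"])

micro = recall_score(y_true, y_pred, average="micro")
accuracy = accuracy_score(y_true, y_pred)
per_class = recall_score(y_true, y_pred, average=None, labels=labels)

print(f"micro recall: {micro:.3f}")
print(f"accuracy:     {accuracy:.3f}")
for name, value in zip(labels, per_class):
    print(f"recall[{name}]: {value:.3f}")
```

Reporting the per-class recalls (or the macro average) alongside the micro score shows whether the smaller classes are actually being detected.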


# Model Deployment

In this notebook, deployment means publishing the pipeline. Publishing creates a REST endpoint, so the pipeline can be rerun from any HTTP library on any platform.

In [None]:
published_pipeline = pipeline_run.publish_pipeline(
    name="Pump it Up Train", description="Training Pump it Up pipeline", version="1.0")

published_pipeline

Now we authenticate to retrieve the auth_header so that the endpoint can be used.

In [None]:
from azureml.core.authentication import InteractiveLoginAuthentication

interactive_auth = InteractiveLoginAuthentication()
auth_header = interactive_auth.get_authentication_header()

# Test the Deployed Model

Here we will send a request to the deployed model to test it.



In [None]:
import requests

# Getting the REST url from the endpoint property of the published pipeline
rest_endpoint = published_pipeline.endpoint

# Building an HTTP POST request to the endpoint
# We also add a JSON payload object with the experiment name
response = requests.post(rest_endpoint, 
                         headers=auth_header, 
                         json={"ExperimentName": "pipeline-rest-endpoint"}
                        )

print(response.status_code)
print(response.elapsed)
print(response.json())

In [None]:
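# NOTE: `aci_service` is not defined anywhere in this notebook, and the payload below is a
# text example unrelated to the pump data. This cell only runs if a web service object
# exposing `scoring_uri` has been created elsewhere.
headers = {'Content-Type':'application/json'}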
data = {"text": ['the food was horrible', 
                 'wow, this movie was truely great, I totally enjoyed it!',
                 'why the heck was my package not delivered in time?']}

resp = requests.post(aci_service.scoring_uri, json=data, headers=headers)
print("Prediction Results:", resp.json())

The POST request to the pipeline endpoint triggers a new pipeline run. Below, we check the response for errors and read the `Id` key from the response dict to get the run ID.

In [None]:
try:
    response.raise_for_status()
except Exception:    
    raise Exception("Received bad response from the endpoint: {}\n"
                    "Response Code: {}\n"
                    "Headers: {}\n"
                    "Content: {}".format(rest_endpoint, response.status_code, response.headers, response.content))

run_id = response.json().get('Id')
print('Submitted pipeline run: ', run_id)

# Printing the logs of the web service

We can now use the run id to monitor the status of the new run. 

In [None]:
from azureml.pipeline.core.run import PipelineRun
from azureml.widgets import RunDetails

published_pipeline_run = PipelineRun(ws.experiments["pipeline-rest-endpoint"], run_id)
RunDetails(published_pipeline_run).show()

# Printing the logs and Deleting the Service

In [None]:
# Delete the compute target to avoid incurring additional charges.

cpu_cluster.delete()