# Automated ML

In [1]:
# Import necessary packages
from azureml.data.dataset_factory import TabularDatasetFactory
from azureml.core import Workspace, Experiment, Dataset, Model, Environment
from azureml.train.automl import AutoMLConfig
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException
from azureml.core.conda_dependencies import CondaDependencies
from azureml.core.model import InferenceConfig
from azureml.core.webservice import AciWebservice
from azureml.widgets import RunDetails

import numpy as np
import pandas as pd

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler,OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split

import os
import json
import joblib
import sklearn
import requests

## Dataset

### Overview

### Kaggle Competitions
We will be using the *[Ames Housing dataset](http://jse.amstat.org/v19n3/decock.pdf)* in this project. The dataset was first published by Dean De Cock in his paper *[Ames, Iowa: Alternative to the Boston Housing Data as an End of Semester Regression Project](https://www.researchgate.net/publication/267976209_Ames_Iowa_Alternative_to_the_Boston_Housing_Data_as_an_End_of_Semester_Regression_Project)*, which appeared in the *Journal of Statistics Education* (November 2011).

The original dataset is used in two different *[Kaggle](https://www.kaggle.com/)* competitions. The first competition is the *[Housing Prices Competition for Kaggle Learn Users](https://www.kaggle.com/c/home-data-for-ml-course/overview)*, and the second competition is the *[House Prices - Advanced Regression Techniques](https://www.kaggle.com/c/house-prices-advanced-regression-techniques)* competition.

These are regression competitions in which competitors try to predict the price of the houses in the **test dataset** using the **training dataset**.

For competition purposes, the data has been divided into two parts of almost equal size: a **training dataset** and a **test dataset**. We will use the **training dataset** for training and the **test dataset** for submission to the competition. We will also use the test dataset to send requests to our deployed web service.

The datasets can be downloaded from Kaggle. To download the necessary files just click on one of the competition links above and select the data tab. You may be required to get a free membership and accept the competition rules.

I have already downloaded the datasets to the data folder. Also, I have downloaded the *sample_submission.csv* and *data_description.txt* files. The *sample_submission.csv* file can be used to create submission files, while the *data_description.txt* file may be handy for data analysis.

The **training dataset** has 1460 rows and 81 columns (including the *Id* field). The **test dataset**, on the other hand, has 1459 rows and 80 columns (excluding the target column *SalePrice*).

I will not explain every feature of the dataset, since there are too many of them and they are beyond the scope of this project. However, the explanations can be found in the *data_description.txt* file.

I will not use the 'Id' and 'Utilities' features. The former is only an identifier for each house, and the latter has the same value for every house but one.
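
A quick way to confirm that a column such as *Utilities* carries almost no information is to look at the share of its most frequent value. The sketch below uses a small made-up DataFrame rather than the real dataset, and the helper name `dominant_share` is hypothetical.

```python
import pandas as pd


def dominant_share(series: pd.Series) -> float:
    """Return the fraction of non-null rows taken by the most frequent value."""
    counts = series.value_counts(normalize=True)
    return float(counts.iloc[0])


# Toy data (not the real Ames values): every row but one has the same value.
toy = pd.DataFrame({
    "Utilities": ["AllPub"] * 9 + ["NoSeWa"],
    "Street": ["Pave", "Grvl"] * 5,
})

shares = {col: dominant_share(toy[col]) for col in toy.columns}
print(shares)
```

A share close to 1.0 suggests that the column is nearly constant and a candidate for removal.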

The target column is *SalePrice*. As mentioned above, this is a regression task project to predict house prices for a given set of features.

There are some missing values in the training dataset, which will be handled by the [clean_data](#cleandata) function. I will use sklearn transformers and pipelines to preprocess the data.

I will register the datasets using ML Studio's *Create dataset from local files* feature. The **training dataset** will be registered as *Housing Prices Dataset* and the **test dataset** will be registered as *Housing Prices Test Dataset*. 

**Warning: The *GarageYrBlt* feature should be changed from *String* to *Integer* while creating both datasets. Otherwise the clean_data function will crash! Kindly refer to the README for details.**
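
If changing the type in ML Studio is not convenient, a similar conversion could be done in pandas after loading the data. This is only a sketch of an alternative, not what this notebook does. It uses made-up values and assumes that missing years arrive as empty or non-numeric strings.

```python
import pandas as pd

# Toy column mimicking GarageYrBlt read as strings (values are made up).
toy = pd.DataFrame({"GarageYrBlt": ["2003", "1976", None, "NA"]})

# errors="coerce" turns anything that cannot be parsed into NaN,
# so the column becomes numeric and can be imputed like other numerics.
toy["GarageYrBlt"] = pd.to_numeric(toy["GarageYrBlt"], errors="coerce")

print(toy["GarageYrBlt"].dtype)
print(toy["GarageYrBlt"].isna().sum())
```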

In [2]:
# Initialize Workspace
ws = Workspace.from_config()

# choose a name for experiment
experiment_name = 'automl_experiment'

experiment=Experiment(ws, experiment_name)

print('Workspace name: ' + ws.name, 
      'Azure region: ' + ws.location, 
      'Subscription id: ' + ws.subscription_id, 
      'Resource group: ' + ws.resource_group, sep = '\n')

Workspace name: my-workspace
Azure region: eastus2
Subscription id: 29e71f0e-90a3-43b1-ab69-4b27e1408264
Resource group: my-resource-group


In [3]:
# Create or Attach an AmlCompute cluster
# https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.compute.amlcompute(class)?view=azure-ml-py#provisioning-configuration-vm-size-----vm-priority--dedicated---min-nodes-0--max-nodes-none--idle-seconds-before-scaledown-none--admin-username-none--admin-user-password-none--admin-user-ssh-key-none--vnet-resourcegroup-name-none--vnet-name-none--subnet-name-none--tags-none--description-none--remote-login-port-public-access--notspecified--
# https://docs.microsoft.com/en-us/azure/machine-learning/how-to-tune-hyperparameters
# https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/ml-frameworks/scikit-learn/train-hyperparameter-tune-deploy-with-sklearn/train-hyperparameter-tune-deploy-with-sklearn.ipynb

# Choose a name for your CPU cluster
cpu_cluster_name = "cpu-cluster"

# Verify that cluster does not exist already
try:
    cpu_cluster = ComputeTarget(workspace=ws, name=cpu_cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2',
                                                              max_nodes=4)
    cpu_cluster = ComputeTarget.create(ws, cpu_cluster_name, compute_config)

cpu_cluster.wait_for_completion(show_output=True)

Found existing cluster, use it.
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned


## Clean data function <a id='cleandata'></a>

In [4]:
# Define a preprocessing function for the datasets
# Will mostly use sklearn library.

def clean_data(data, test):
    """
    Preprocess the training and test datasets.
    :param data: training dataset - TabularDataset
    :param test: test dataset - TabularDataset
    :return: train_data: transformed training data, target in the last column - numpy.ndarray
    :return: valid_data: transformed validation data, target in the last column - numpy.ndarray
    :return: X_test: transformed test data (no target column) - numpy.ndarray
    """
    # Convert dataset to pandas dataframe
    X = data.to_pandas_dataframe()
    X_test = test.to_pandas_dataframe()
    # Set Id to index
    X.set_index('Id',inplace=True)
    X_test.set_index('Id',inplace=True)
    # Remove rows with missing target, separate target from predictors
    X.dropna(axis=0, subset=['SalePrice'], inplace=True)
    y = X.SalePrice 
    # Remove target and 'Utilities' 
    X.drop(['SalePrice', 'Utilities'], axis=1, inplace=True)
    X_test.drop(['Utilities'], axis=1, inplace=True)
    # Split the data
    X_train, X_valid, y_train, y_valid = train_test_split(X,y)
    # Select object columns
    categorical_cols = [cname for cname in X_train.columns if X_train[cname].dtype == "object"]
    # Select numeric columns
    numerical_cols = [cname for cname in X_train.columns if X_train[cname].dtype in ['int64','float64']]

    # Imputation lists
    # imputation to null values of these numerical columns need to be 'constant'
    constant_num_cols = ['GarageYrBlt', 'MasVnrArea']
    # imputation to null values of these numerical columns need to be 'mean'
    mean_num_cols = list(set(numerical_cols).difference(set(constant_num_cols)))
    # imputation to null values of these categorical columns need to be 'constant'
    constant_categorical_cols = ['Alley', 'MasVnrType', 'BsmtQual', 'BsmtCond','BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond', 'PoolQC', 'Fence', 'MiscFeature']
    # imputation to null values of these categorical columns need to be 'most_frequent'
    mf_categorical_cols = list(set(categorical_cols).difference(set(constant_categorical_cols)))

    my_cols = constant_num_cols + mean_num_cols + constant_categorical_cols + mf_categorical_cols

    # Define transformers
    # Preprocessing for numerical data - mean
    numerical_transformer_m = Pipeline(steps=[('imputer', SimpleImputer(strategy='mean')),('scaler', StandardScaler())])
    # Preprocessing for numerical data - constant
    numerical_transformer_c = Pipeline(steps=[('imputer', SimpleImputer(strategy='constant', fill_value=0)),('scaler', StandardScaler())])

    # Preprocessing for categorical data for most frequent
    categorical_transformer_mf = Pipeline(steps=[('imputer', SimpleImputer(strategy='most_frequent')), ('onehot', OneHotEncoder(handle_unknown = 'ignore', sparse = False))])
    # Preprocessing for categorical data for constant
    categorical_transformer_c = Pipeline(steps=[('imputer', SimpleImputer(strategy='constant', fill_value='NA')), ('onehot', OneHotEncoder(handle_unknown = 'ignore', sparse = False))])

    # Bundle preprocessing for numerical and categorical data
    preprocessor = ColumnTransformer(transformers=[
        ('num_mean', numerical_transformer_m, mean_num_cols),
        ('num_constant', numerical_transformer_c, constant_num_cols),
        ('cat_mf', categorical_transformer_mf, mf_categorical_cols),
        ('cat_c', categorical_transformer_c, constant_categorical_cols)])

    # Transform data
    X_train = preprocessor.fit_transform(X_train)
    X_valid = preprocessor.transform(X_valid)
    X_test = preprocessor.transform(X_test)
    
    
    # Concat datasets
    # automl needs merged datasets including target column
    # https://stackoverflow.com/questions/41989950/numpy-array-concatenate-valueerror-all-the-input-arrays-must-have-same-number
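    # Convert the target Series to NumPy before adding an axis;
    # multi-dimensional indexing on a Series fails in recent pandas versions.
    train_data = np.concatenate([X_train, y_train.to_numpy()[:, None]], axis=1)
    valid_data = np.concatenate([X_valid, y_valid.to_numpy()[:, None]], axis=1)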
    
    
    # Return data
    return train_data, valid_data, X_test
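
To see the pattern used by `clean_data` in isolation, here is a reduced sketch on a made-up four-row frame: one numeric column imputed with the mean and scaled, and one categorical column imputed with a constant and one-hot encoded. The column values are illustrative only. Passing `sparse_threshold=0` is my own choice to force a dense array without relying on the encoder's version-dependent `sparse` argument.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up data with one missing numeric value and two missing categories.
toy = pd.DataFrame({
    "LotArea": [8450.0, 9600.0, np.nan, 11250.0],
    "Alley": pd.Series(["Grvl", np.nan, "Pave", np.nan], dtype=object),
})

numeric = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])
categorical = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value="NA")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric, ["LotArea"]),
        ("cat", categorical, ["Alley"]),
    ],
    sparse_threshold=0,  # always return a dense NumPy array
)

transformed = preprocessor.fit_transform(toy)
print(type(transformed), transformed.shape)
```

The result has one scaled numeric column plus one column per category seen during fitting, including the constant `NA` placeholder.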

In [5]:
# Get the registered datasets
ds_train = Dataset.get_by_name(ws, name='Housing Prices Dataset')
ds_test = Dataset.get_by_name(ws, name='Housing Prices Test Dataset')

# Use the clean_data function to clean the data.
# Warning: train_test_split inside clean_data is called without a
# fixed random_state, so the split is non-deterministic. The number
# of features can therefore differ every time the notebook is rerun
# after a kernel restart. If you reproduce the code you will probably
# see different column counts than this notebook, but this will not
# cause any error.
train_data, valid_data, test_data = clean_data(ds_train, ds_test)
print("Shape of training data: {}".format(train_data.shape))
print("Shape of validation data: {}".format(valid_data.shape))
print("Shape of test data: {}".format(test_data.shape))

Shape of training data: (1095, 400)
Shape of validation data: (365, 400)
Shape of test data: (1459, 399)
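
The feature count changes between runs because a one-hot encoder only creates columns for the categories present in the rows that happen to land in the training split. The toy example below illustrates the effect with `pandas.get_dummies` as a stand-in for the `OneHotEncoder` used in `clean_data`. The data is made up.

```python
import pandas as pd

# Made-up categorical column with one rare category.
roof = pd.Series(["Gable", "Hip", "Gable", "Gable", "Flat", "Hip"], name="RoofStyle")

# Two different "training splits" of the same data.
split_a = roof.iloc[[0, 1, 2, 3]]   # the rare 'Flat' row is left out
split_b = roof.iloc[[1, 2, 4, 5]]   # the rare 'Flat' row is included

cols_a = pd.get_dummies(split_a).shape[1]
cols_b = pd.get_dummies(split_b).shape[1]
print(cols_a, cols_b)
```

Passing a fixed `random_state` to `train_test_split` would make the split, and therefore the column count, repeatable.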


In [6]:
# Check the data. The last column should be "SalePrice".
# The number of features may change between runs because
# train_test_split is not given a fixed "random_state", so the
# last column is selected with -1 instead of a hard-coded index.
# For this run the shape of the validation data is (365, 400),
# which means "SalePrice" is column 399.
valid_data[0, -1]

361919.0

In [7]:
# display train_data
print(train_data)

[[ 1.37165517e+00  4.04720756e-01 -8.24862039e-01 ...  0.00000000e+00
   0.00000000e+00  2.26000000e+05]
 [ 5.66170659e-01 -5.10918144e-01  1.10510363e+00 ...  0.00000000e+00
   0.00000000e+00  4.26000000e+05]
 [-7.87777660e-01 -5.10918144e-01  1.10510363e+00 ...  0.00000000e+00
   0.00000000e+00  1.38500000e+05]
 ...
 [-7.87777660e-01 -5.10918144e-01  1.10510363e+00 ...  0.00000000e+00
   0.00000000e+00  1.29900000e+05]
 [-7.87777660e-01  1.32035966e+00  1.10510363e+00 ...  0.00000000e+00
   0.00000000e+00  1.32500000e+05]
 [ 1.55294655e+00  2.23599856e+00 -8.24862039e-01 ...  0.00000000e+00
   0.00000000e+00  1.37000000e+05]]
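
The last column of the array printed above is the target. `clean_data` appends it with `np.concatenate` after turning the one-dimensional target into a column with `[:, None]`. A minimal illustration with made-up numbers:

```python
import numpy as np

# 3 rows, 2 features (made up)
X = np.array([[0.5, 1.0],
              [-0.3, 0.0],
              [1.2, 1.0]])
# 1-D target with one value per row
y = np.array([226000.0, 426000.0, 138500.0])

# y[:, None] has shape (3, 1), so it can be stacked as an extra column.
merged = np.concatenate([X, y[:, None]], axis=1)
print(merged.shape)
print(merged[:, -1])
```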


In [8]:
# AutoMLConfig requires a TabularDataset, so we need to save the
# cleaned arrays as CSV files and create datasets from them.
# https://docs.microsoft.com/en-us/azure/machine-learning/how-to-create-register-datasets#create-a-filedataset
print(type(train_data))

# Create the data folder if it does not exist
os.makedirs("data", exist_ok=True)

# https://stackoverflow.com/questions/11106536/adding-row-column-headers-to-numpy-arrays

# The outputs of the clean_data function are numpy.ndarray objects,
# which have no column names. AutoML needs a label column name, so
# we name the columns 0 through (number of columns - 1).

# Calculate the number of columns
number_of_columns = train_data.shape[1]
# Column names for the training and validation data
names = list(range(number_of_columns))
# Column names for the test data. The test data has no "SalePrice"
# column, so it has one column fewer than the training data.
test_ds_names = list(range(number_of_columns - 1))

# convert train dataframe
train_path = 'data/train_cleaned.csv'
cleaned_train_data = pd.DataFrame(train_data, columns=names)
cleaned_train_data.to_csv(train_path, index=False, header=True, sep=',')

# convert valid dataframe
valid_path = 'data/valid_cleaned.csv'
cleaned_valid_data = pd.DataFrame(valid_data, columns=names)
cleaned_valid_data.to_csv(valid_path, index=False, header=True, sep=',')

# convert test dataframe
test_path = 'data/test_cleaned.csv'
cleaned_test_data = pd.DataFrame(test_data, columns=test_ds_names)
cleaned_test_data.to_csv(test_path, index=False, header=True, sep=',')

# get the datastore to upload prepared data
datastore = ws.get_default_datastore()

# upload the local file from src_dir to the target_path in datastore
datastore.upload(src_dir='data', target_path='data', overwrite=True)

# create a dataset referencing the cloud location
train_dataset = Dataset.Tabular.from_delimited_files(path = [(datastore, ('data/train_cleaned.csv'))])

# create a dataset referencing the cloud location
valid_dataset = Dataset.Tabular.from_delimited_files(path = [(datastore, ('data/valid_cleaned.csv'))])

# create a dataset referencing the cloud location
test_dataset = Dataset.Tabular.from_delimited_files(path = [(datastore, ('data/test_cleaned.csv'))])

<class 'numpy.ndarray'>
Uploading an estimated of 6 files
Uploading data/sample_submission.csv
Uploaded data/sample_submission.csv, 1 files out of an estimated total of 6
Uploading data/test.csv
Uploaded data/test.csv, 2 files out of an estimated total of 6
Uploading data/test_cleaned.csv
Uploaded data/test_cleaned.csv, 3 files out of an estimated total of 6
Uploading data/train.csv
Uploaded data/train.csv, 4 files out of an estimated total of 6
Uploading data/train_cleaned.csv
Uploaded data/train_cleaned.csv, 5 files out of an estimated total of 6
Uploading data/valid_cleaned.csv
Uploaded data/valid_cleaned.csv, 6 files out of an estimated total of 6
Uploaded 6 files


## AutoML Configuration


In [9]:
# Check the dataset
print(valid_dataset.to_pandas_dataframe().head())

          0         1         2         3         4         5         6  \
0  2.223036 -0.510918 -0.824862 -0.124791 -0.222317 -0.288201  1.632868   
1  0.754347 -0.510918 -0.824862 -0.124791 -0.222317 -0.288201 -1.051802   
2  1.194953 -0.510918 -0.824862 -0.124791 -0.222317 -0.288201  0.290533   
3  0.786474  0.404721  1.105104 -0.124791 -0.222317 -0.288201 -1.051802   
4  1.318874  0.404721 -0.824862 -0.124791 -0.222317 -0.288201  0.290533   

          7         8         9  ...  390  391  392  393  394  395  396  397  \
0 -0.073412  0.237175  2.117094  ...  0.0  0.0  0.0  1.0  0.0  1.0  0.0  0.0   
1 -0.073412 -0.823414 -0.326351  ...  0.0  0.0  0.0  1.0  0.0  1.0  0.0  0.0   
2 -0.073412 -0.782622  0.284511  ...  0.0  0.0  0.0  1.0  0.0  1.0  0.0  0.0   
3 -0.073412 -0.522574  0.284511  ...  0.0  0.0  0.0  1.0  0.0  0.0  0.0  1.0   
4 -0.073412 -0.685741  0.895372  ...  0.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0   

   398       399  
0  0.0  361919.0  
1  0.0  159434.0  
2  0.0  190

In [10]:
# label_column_name shall be the last feature
label_column_name = str(number_of_columns-1)
print(label_column_name)

399
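
The label column name is passed as a string because the integer column names become text once the frame is written to CSV and read back. The sketch below shows that round trip with an in-memory buffer and `pandas.read_csv` as a stand-in for the datastore and the TabularDataset. The values are made up.

```python
import io

import numpy as np
import pandas as pd

# Made-up values; the last column plays the role of the target.
arr = np.array([[0.1, 0.2, 100.0],
                [0.3, 0.4, 200.0]])

frame = pd.DataFrame(arr, columns=list(range(arr.shape[1])))

# Write to CSV with a header row, then read it back.
buffer = io.StringIO()
frame.to_csv(buffer, index=False, header=True)
buffer.seek(0)
restored = pd.read_csv(buffer)

label = str(arr.shape[1] - 1)
print(list(restored.columns), label)
```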


As described in [Configure automated ML experiments in Python](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-configure-auto-train), there are several options that we can use to configure an automated machine learning experiment. These parameters are set by instantiating an AutoMLConfig object. The descriptions of some important parameters, and the reasoning behind the chosen values, are given below:

"experiment_timeout_minutes": Maximum amount of time in minutes that all iterations combined can take before the experiment terminates. **I have set it to *30* because we do not want our experiment to cost too much.**

"max_concurrent_iterations": The maximum number of iterations that are executed in parallel. **Since we have *4* nodes in our AmlCompute cluster, I have selected 4.**

"max_cores_per_iteration": The maximum number of threads to use for a given training iteration. **I have selected *-1* to use all the possible cores per iteration per child-run.**

"training_data": The training data to be used within the experiment. It was preprocessed in the previous cells.

"validation_data": The validation data to be used within the experiment. It was preprocessed in the previous cells.

"label_column_name": The name of the label column. It was computed in the previous cells as the name of the last column.

"enable_early_stopping": Whether to enable early termination if the score is not improving in the short term. **We do not want our experiment to cost too much.**

"task": The type of task to run. In our case it is **regression**.

"primary_metric": The metric that Automated Machine Learning will optimize for model selection. We have chosen **normalized_root_mean_squared_error** as suggested in [this article](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-configure-auto-train#primary-metric).
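
For intuition about the primary metric, the sketch below computes a normalized RMSE by dividing the RMSE by the range of the true values, which is how I understand the Azure documentation to define it. It is not AutoML's internal code, the function name `nrmse` is my own, and the numbers are made up.

```python
import numpy as np


def nrmse(y_true, y_pred):
    """RMSE divided by the range (max - min) of the true values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    return float(rmse / (y_true.max() - y_true.min()))


y_true = [100000.0, 200000.0, 300000.0]
y_pred = [110000.0, 190000.0, 300000.0]
print(nrmse(y_true, y_pred))
```

Because of the normalization, the value does not depend on the currency scale of the prices, which makes runs easier to compare.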

In [11]:
project_folder = './'
# automl settings
automl_settings = {
    "experiment_timeout_minutes": 30,
    "max_concurrent_iterations": 4,
    "max_cores_per_iteration":-1,
}

# automl config
automl_config = AutoMLConfig(compute_target = ws.compute_targets['cpu-cluster'],
                             task = "regression",
                             primary_metric = 'normalized_root_mean_squared_error',
                             training_data = train_dataset,
                             validation_data = valid_dataset,
                             label_column_name = label_column_name,   
                             path = project_folder,
                             enable_early_stopping = True,
                             debug_log = "automl_errors.log",
                             **automl_settings
                            )

In [12]:
# Submit the experiment
remote_run = experiment.submit(automl_config)

Running on remote.


## Run Details

In [13]:
# Display Run Details
RunDetails(remote_run).show()
remote_run.wait_for_completion(show_output=True)
assert(remote_run.get_status()=="Completed")

_AutoMLWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', 's…


Current status: FeaturesGeneration. Generating features for the dataset.
Current status: DatasetFeaturization. Beginning to fit featurizers and featurize the dataset.
Current status: ModelSelection. Beginning model selection.

****************************************************************************************************
DATA GUARDRAILS: 

TYPE:         Missing feature values imputation
STATUS:       PASSED
DESCRIPTION:  No feature missing values were detected in the training data.
              Learn more about missing value imputation: https://aka.ms/AutomatedMLFeaturization

****************************************************************************************************

TYPE:         High cardinality feature detection
STATUS:       PASSED
DESCRIPTION:  Your inputs were analyzed, and no high cardinality features were detected.
              Learn more about high cardinality feature handling: https://aka.ms/AutomatedMLFeaturization

*****************************************

## Best Model




In [14]:
# Retrieve and save best automl model.

# https://github.com/MicrosoftLearning/DP100/blob/master/08B%20-%20Using%20Automated%20Machine%20Learning.ipynb
# Get the best run object
best_run, fitted_model = remote_run.get_output()
print("Summary:")
print(remote_run.summary())
print("********************\n")
print("Best run:")
print(best_run)
print("********************\n")
print("Estimator:")
print(fitted_model.steps[-1])
print("********************\n")
print("Model:")
print(fitted_model)
print("********************\n")
best_run_metrics = best_run.get_metrics()
print('NRMSE:', best_run_metrics['normalized_root_mean_squared_error'])
print('MAE:', best_run_metrics['mean_absolute_error'])
print('RMSLE:', best_run_metrics['root_mean_squared_log_error'])

print("********************\n")

for metric_name in best_run_metrics:
    metric = best_run_metrics[metric_name]
    print(metric_name, metric)

Package:azureml-automl-runtime, training version:1.21.0, current version:1.20.0
Package:azureml-core, training version:1.21.0.post1, current version:1.20.0
Package:azureml-dataprep, training version:2.8.2, current version:2.7.3
Package:azureml-dataprep-native, training version:28.0.0, current version:27.0.0
Package:azureml-dataprep-rslex, training version:1.6.0, current version:1.5.0
Package:azureml-dataset-runtime, training version:1.21.0, current version:1.20.0
Package:azureml-defaults, training version:1.21.0, current version:1.20.0
Package:azureml-interpret, training version:1.21.0, current version:1.20.0
Package:azureml-pipeline-core, training version:1.21.0, current version:1.20.0
Package:azureml-telemetry, training version:1.21.0, current version:1.20.0
Package:azureml-train-automl-client, training version:1.21.0, current version:1.20.0
Package:azureml-train-automl-runtime, training version:1.21.0, current version:1.20.0


Summary:
[['StackEnsemble', 1, 0.031389776149283476], ['VotingEnsemble', 1, 0.029610347347248284], ['Failed', 3, nan], ['GradientBoosting', 3, 0.033595891409745476], ['LassoLars', 3, 0.040266979696220534], ['XGBoostRegressor', 3, 0.03344678098180733], ['RandomForest', 1, 0.03681478532881403], ['ExtremeRandomTrees', 1, 0.038876223763616456], ['DecisionTree', 18, 0.04600685116521176], ['ElasticNet', 3, 0.03923908410285242], ['LightGBM', 1, 0.035753923637082796]]
********************

Best run:
Run(Experiment: automl_experiment,
Id: AutoML_ae9abae5-08fd-4e1f-a793-cf0f6f55d662_36,
Type: azureml.scriptrun,
Status: Completed)
********************

Estimator:
('prefittedsoftvotingregressor', PreFittedSoftVotingRegressor(estimators=[('1',
                                          Pipeline(memory=None,
                                                   steps=[('maxabsscaler',
                                                           MaxAbsScaler(copy=True)),
                                   

In [15]:
# Predict test set for submission.
pred = fitted_model.predict(test_dataset.to_pandas_dataframe())
pred

array([113886.11265751, 159661.56421882, 182821.11893251, ...,
       188191.78873541, 118042.876886  , 220254.76365208])

In [16]:
# Save submission file. We will submit to the competitions.
# Kindly refer to README.
sample_submission_file = pd.read_csv("data/sample_submission.csv")
output = pd.DataFrame({'Id': sample_submission_file.Id,
                       'SalePrice': pred})
output.to_csv('data/submission.csv', index=False)
print ("Submission file is saved")

Submission file is saved


In [17]:
# Save the best model
# https://knowledge.udacity.com/questions/357007
# os.makedirs('outputs', exist_ok=True)
joblib.dump(fitted_model, 'automl_model.pkl')

['automl_model.pkl']

## Model Deployment

### The metrics for the best AutoML run

NRMSE (Primary metric): 0.029610347347248284

MAE (metric used for *[Housing Prices Competition for Kaggle Learn Users](https://www.kaggle.com/c/home-data-for-ml-course/overview)*): 14028.038852411917

RMSLE (metric used for *[House Prices - Advanced Regression Techniques](https://www.kaggle.com/c/house-prices-advanced-regression-techniques)*): 0.11724443714762385

### The metrics for the best Hyperdrive run

NRMSE (Primary metric): 0.032666490716132174

MAE (metric used for *[Housing Prices Competition for Kaggle Learn Users](https://www.kaggle.com/c/home-data-for-ml-course/overview)*): 15671.028999405176

RMSLE (metric used for *[House Prices - Advanced Regression Techniques](https://www.kaggle.com/c/house-prices-advanced-regression-techniques)*): 0.1279803797267253

As the numbers show, the best AutoML run has lower (better) values than the best Hyperdrive run on all three metrics. Since we have to deploy only one of the two models we trained, we will deploy the best AutoML run.
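
The two competition metrics can be sketched in a few lines of NumPy. MAE is the mean absolute difference between predictions and true values. For RMSLE I use `log1p`, as scikit-learn's `mean_squared_log_error` does. Whether AutoML and Kaggle use exactly this variant is an assumption on my part. The values are made up.

```python
import numpy as np


def mae(y_true, y_pred):
    """Mean absolute error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))


def rmsle(y_true, y_pred):
    """Root mean squared error between log1p of predictions and log1p of targets."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))


y_true = [100000.0, 200000.0]
y_pred = [110000.0, 180000.0]
print(mae(y_true, y_pred))
print(rmsle(y_true, y_pred))
```

MAE is expressed in the unit of the target, while RMSLE penalizes relative errors, so an error of 10,000 matters more on a cheap house than on an expensive one.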

In [18]:
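# Register model
model = Model.register(workspace=ws, model_path='automl_model.pkl', model_name='best_automl_run')
# Check the registered models. A separate loop variable is used
# so that the `model` object registered above is not overwritten.
for registered_model in Model.list(ws):
    print("Model Name: {}\n".format(registered_model.name))
    print(registered_model)
    print("********************\n")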

Registering model best_automl_run
Model Name: best_automl_run

Model(workspace=Workspace.create(name='my-workspace', subscription_id='29e71f0e-90a3-43b1-ab69-4b27e1408264', resource_group='my-resource-group'), name=best_automl_run, id=best_automl_run:1, version=1, tags={}, properties={})
********************

Model Name: my_best_hyperdrive_run

Model(workspace=Workspace.create(name='my-workspace', subscription_id='29e71f0e-90a3-43b1-ab69-4b27e1408264', resource_group='my-resource-group'), name=my_best_hyperdrive_run, id=my_best_hyperdrive_run:3, version=3, tags={}, properties={})
********************



## Creating the Environment

We need to create an environment for the deployment. We can use an Azure curated environment or define a custom environment. In the following cell, I will retrieve an Azure curated environment that can be used for AutoML. I will then check the dependencies of this environment and use them in a *yml* file to create a custom environment, which I will use in my deployment.

In [19]:
# Curated environment
env = Environment.get(workspace=ws, name="AzureML-AutoML")
print("packages", env.python.conda_dependencies.serialize_to_string())

packages channels:
- anaconda
- conda-forge
- pytorch
dependencies:
- python=3.6.2
- pip=20.2.4
- pip:
  - azureml-core==1.21.0.post1
  - azureml-pipeline-core==1.21.0
  - azureml-telemetry==1.21.0
  - azureml-defaults==1.21.0
  - azureml-interpret==1.21.0
  - azureml-automl-core==1.21.0
  - azureml-automl-runtime==1.21.0
  - azureml-train-automl-client==1.21.0
  - azureml-train-automl-runtime==1.21.0.post1
  - azureml-dataset-runtime==1.21.0
  - inference-schema
  - py-cpuinfo==5.0.0
  - boto3==1.15.18
  - botocore==1.18.18
- numpy~=1.18.0
- scikit-learn==0.22.1
- pandas~=0.25.0
- py-xgboost<=0.90
- fbprophet==0.5
- holidays==0.9.11
- setuptools-git
- psutil>5.0.0,<6.0.0
name: azureml_7ade26eb614f97df8030bc480da59236



In [20]:
# My environment from my-env.yml
my_env = Environment.from_conda_specification(name = 'my-env', file_path = './my-env.yml')

In [1]:
# We need to use a score script in deployment. This script is as follows:
with open('score.py') as f:
    print(f.read())

import joblib
import numpy as np
import pandas as pd
import os
import json

# The init() method is called once, when the web service starts up.
#
# Typically you would deserialize the model file, as shown here using joblib,
# and store it in a global variable so your run() method can access it later.
def init():
    global model

    # The AZUREML_MODEL_DIR environment variable indicates
    # a directory containing the model file you registered.
    model_filename = 'automl_model.pkl'
    model_path = os.path.join(os.environ['AZUREML_MODEL_DIR'], model_filename)

    model = joblib.load(model_path)


# The run() method is called each time a request is made to the scoring API.
# https://knowledge.udacity.com/questions/442907
def run(data):
    #print("data before")
    #print(data)
    #print(type(data))
    try:
        data = json.loads(data)['data']
        data = pd.DataFrame.from_dict(data)
        #print("Dataframe: ")
        #print(data.head())
        # Use the model object lo
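
The printed script is cut off in this export. Independently of the missing part, the general shape of a `run()` function for this kind of payload can be sketched as follows. This is not the author's `score.py`: `StubModel` is a hypothetical stand-in for the deserialized model, and the error handling is only one possible choice.

```python
import json

import pandas as pd


class StubModel:
    """Hypothetical stand-in for the unpickled AutoML model."""

    def predict(self, frame: pd.DataFrame):
        # Pretend prediction: the sum of the features in each row.
        return frame.sum(axis=1).to_numpy()


model = StubModel()


def run(raw_data: str):
    """Parse {"data": [{column: value, ...}, ...]} and return a list of predictions."""
    try:
        records = json.loads(raw_data)["data"]
        frame = pd.DataFrame.from_records(records)
        return model.predict(frame).tolist()
    except Exception as exc:
        # Report the problem to the caller instead of crashing the service.
        return {"error": str(exc)}


payload = json.dumps({"data": [{"0": 1.0, "1": 2.0}, {"0": 3.0, "1": 4.0}]})
print(run(payload))
```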

In [68]:
# Deploy the model as a web service
service_name = 'my-automl-service-2'
#my_model = Model(ws, 'best_automl_run', version=1)
my_model = Model(ws, 'best_automl_run')
inference_config = InferenceConfig(entry_script='score.py', environment=my_env)
aci_config = AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=1)

service = Model.deploy(workspace=ws,
                       name=service_name,
                       models=[my_model],
                       inference_config=inference_config,
                       deployment_config=aci_config,
                       overwrite=True)
service.wait_for_deployment(show_output=True)

Tips: You can try get_logs(): https://aka.ms/debugimage#dockerlog or local deployment: https://aka.ms/debugimage#debug-locally to debug if deployment takes longer than 10 minutes.
Running........................................
Succeeded
ACI service creation operation finished, operation "Succeeded"


In [69]:
# Enable application insights
service.update(enable_app_insights=True)

Send a request to the web service.

In [72]:
# Request for 1 item
# Get the test data
my_test_values = test_dataset.to_pandas_dataframe()
test_list = my_test_values.values.tolist()

# Get first item
my_list=test_list[0]

# Build a dictionary that maps column names to values
my_data={}
print("Length: ")
print(len(my_list))

# Fill the dictionary
for count in range(len(my_list)):
    my_data[str(count)]=my_list[count]
print("My Data: ")
print(my_data)

# Wrap the dictionary in the request structure: {"data": [row, ...]}

data = {"data":
        [
          my_data,
      ]
    }

print("Data: ")
print(data)

# Convert to JSON string
input_data = json.dumps(data)
print("Input data")
print(input_data)

Length: 
399
My Data: 
{'0': -0.7877776601303935, '1': 0.40472075580786326, '2': -0.824862038820425, '3': -0.12479112703105394, '4': -0.22231727917043134, '5': 0.5900020843135478, '6': -1.0518023616782892, '7': -0.07341187239855226, '8': -0.7010384200582758, '9': -0.9372116952221238, '10': 1.177698492303729, '11': -0.12402884834943756, '12': -0.7249813707482863, '13': -0.8685999131096809, '14': -0.2597453608736428, '15': 1.8877413397414555, '16': -0.11750960306848886, '17': 0.034213835019685064, '18': 0.07970023247881027, '19': -1.177925864816224, '20': -1.1456169750686023, '21': -0.36519377147345083, '22': -0.34676496051017097, '23': -0.4089140915554155, '24': -0.7687342984741533, '25': 0.34566371578061267, '26': -1.0380498416918418, '27': -0.7881555347992615, '28': -0.07907797829369968, '29': -1.0708264505936433, '30': -0.9354464621967534, '31': 1.6770151821776618, '32': -0.6650718421533498, '33': 0.186221289050153, '34': -0.5834629514310187, '35': 0.0, '36': 0.0, '37': 0.0, '38': 0.

In [73]:
# Set the content type
headers = {'Content-Type': 'application/json'}

# Make the request and display the response
resp = requests.post(service.scoring_uri, input_data, headers=headers)

print(resp.text)

[113886.11265750558]


In [74]:
# Request for 3 items
my_list_2 = my_test_values.values.tolist()
my_list = my_list_2[:3]
test_list=[]
# Build one dictionary per row
for item in my_list:
    # Create dictionary
    my_data={}
    for count in range(len(item)):
        my_data[str(count)]=item[count]
    test_list.append(my_data)
data = {"data":
        test_list
    }
# Convert to JSON string
input_data = json.dumps(data)

In [75]:
input_data

'{"data": [{"0": -0.7877776601303935, "1": 0.40472075580786326, "2": -0.824862038820425, "3": -0.12479112703105394, "4": -0.22231727917043134, "5": 0.5900020843135478, "6": -1.0518023616782892, "7": -0.07341187239855226, "8": -0.7010384200582758, "9": -0.9372116952221238, "10": 1.177698492303729, "11": -0.12402884834943756, "12": -0.7249813707482863, "13": -0.8685999131096809, "14": -0.2597453608736428, "15": 1.8877413397414555, "16": -0.11750960306848886, "17": 0.034213835019685064, "18": 0.07970023247881027, "19": -1.177925864816224, "20": -1.1456169750686023, "21": -0.36519377147345083, "22": -0.34676496051017097, "23": -0.4089140915554155, "24": -0.7687342984741533, "25": 0.34566371578061267, "26": -1.0380498416918418, "27": -0.7881555347992615, "28": -0.07907797829369968, "29": -1.0708264505936433, "30": -0.9354464621967534, "31": 1.6770151821776618, "32": -0.6650718421533498, "33": 0.186221289050153, "34": -0.5834629514310187, "35": 0.0, "36": 0.0, "37": 0.0, "38": 0.0, "39": 1.0

In [76]:
# Send the request
output = service.run(input_data)

print(output)

[113886.11265750558, 159661.5642188239, 182821.1189325057]
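
The request body built in the loops above is a list of `{column name: value}` dictionaries under a `"data"` key. A more compact way to produce the same structure is `DataFrame.to_dict(orient="records")`, sketched here on a made-up frame with string column names that stands in for `test_dataset.to_pandas_dataframe()`:

```python
import json

import pandas as pd

# Made-up stand-in for the cleaned test data
frame = pd.DataFrame({"0": [0.5, -1.0, 2.0], "1": [0.0, 1.0, 0.0]})

# Take the first three rows and turn each one into a {column: value} dict.
records = frame.head(3).to_dict(orient="records")

# Wrap the rows under the "data" key and serialize to a JSON string.
input_data = json.dumps({"data": records})
print(input_data)
```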


Print the logs of the web service and delete the service.

In [77]:
print(service.get_logs())

2021-02-10T14:14:56,823796114+00:00 - gunicorn/run 
2021-02-10T14:14:56,824993819+00:00 - iot-server/run 
2021-02-10T14:14:56,826071723+00:00 - rsyslog/run 
2021-02-10T14:14:56,834736660+00:00 - nginx/run 
/usr/sbin/nginx: /azureml-envs/azureml_7ade26eb614f97df8030bc480da59236/lib/libcrypto.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml-envs/azureml_7ade26eb614f97df8030bc480da59236/lib/libcrypto.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml-envs/azureml_7ade26eb614f97df8030bc480da59236/lib/libssl.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml-envs/azureml_7ade26eb614f97df8030bc480da59236/lib/libssl.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml-envs/azureml_7ade26eb614f97df8030bc480da59236/lib/libssl.so.1.0.0: no version information available (required by /usr/sbin/nginx)
rsyslogd

In [78]:
# Delete the service
service.delete()

In [79]:
# Delete compute cluster
cpu_cluster.delete()

Current provisioning state of AmlCompute is "Deleting"



# References
- De Cock, Dean. (2011). Ames, Iowa: Alternative to the Boston Housing Data as an End of Semester Regression Project. Journal of Statistics Education, 19(3). doi:10.1080/10691898.2011.11889627.
- [Deployment to Cloud Example](https://github.com/ErkanHatipoglu/MachineLearningNotebooks/tree/master/how-to-use-azureml/deployment/deploy-to-cloud)