# Automated ML

## Required Packages

In [1]:
import pandas as pd 
import numpy as np 
import requests
import csv
import os
import json

import azureml.core
from azureml.core.experiment import Experiment
from azureml.core.workspace import Workspace
from azureml.train.automl import AutoMLConfig
from azureml.core.dataset import Dataset

from azureml.pipeline.steps import AutoMLStep


## Initialize Workspace

In [2]:
ws = Workspace.from_config()
print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep = '\n')

mlops-udacity-ws
mlops-udacity-rg
uksouth
36cd5b29-0f0b-4505-8ed0-2c1f142e8de3


## Dataset

### Overview
TODO: In this markdown cell, give an overview of the dataset you are using. Also mention the task you will be performing.

We use the salary dataset (the Census Income dataset), downloaded from Kaggle. See the README for more details on the dataset. The task is binary classification: predicting whether a person's salary is above or below 50K. Once cleaned, the dataset is passed to AutoML, which trains and compares a range of candidate models and selects the one that scores best on the primary metric.


In [3]:
# Load the Kaggle credentials; the context manager closes the file afterwards
with open('kaggle.json') as f:
    kaggle_cred = json.load(f)

# Expose the credentials through the environment variables the Kaggle client reads
os.environ['KAGGLE_USERNAME'] = kaggle_cred['username']
os.environ['KAGGLE_KEY'] = kaggle_cred['key']

# Do not move this import: it has to come after the credentials are set above
from kaggle.api.kaggle_api_extended import KaggleApi

api = KaggleApi()
api.authenticate()

api.dataset_download_file('ayessa/salary-prediction-classification', 'salary.csv')

# Unzip the downloaded file
import zipfile
with zipfile.ZipFile('salary.csv.zip', 'r') as zip_ref:
    zip_ref.extractall('.')


## Dataset Cleaning

In [113]:
df = pd.read_csv('salary.csv')
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,salary
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


The dataset was previously explored by Nikita Verma in this Medium article: https://medium.com/analytics-vidhya/machine-learning-application-census-income-prediction-868227debf12

As EDA is not the focus of this project, I have repeated most of the key steps from the article (with a few deviations) in a concise form to prepare the dataset for AutoML/HyperDrive, with spot checks along the way to confirm that each step behaves as expected.

The dataset is prepared in the following order:

1. For all categorical variables, replace missing values ('?') with the mode of that variable.
2. Remove 'education-num', as a different encoding is used for education.
3. For the 'workclass' variable, merge 'Never-worked' into 'Without-pay'; merge 'State-gov' and 'Local-gov' into a new descriptor 'Gov'; merge 'Self-emp-inc' into 'Private'.
4. For the 'education' variable, merge 'Prof-school' into the 'Doctorate' descriptor; merge 'Assoc-acdm' and 'Assoc-voc' into a new descriptor 'Assoc'.
5. For the 'marital-status' variable, merge 'Married-civ-spouse' and 'Married-AF-spouse' into a new descriptor 'Married-with-spouse'; merge 'Separated', 'Divorced', 'Widowed' and 'Married-spouse-absent' into a new descriptor 'No-spouse'.
6. For the 'relationship' variable, merge 'Not-in-family', 'Own-child', 'Unmarried' and 'Other-relative' into a new descriptor 'Other' (based on their similar salary distributions).
7. For the 'race' variable, merge 'Amer-Indian-Eskimo' into the existing 'Other' descriptor.
8. Label encode the categorical variables. One-hot encoding is typically preferred for non-ordinal variables (which all categorical variables in this dataset are), but it would add around 90 columns to the dataset. That makes the featurization harder to interpret and considerably increases the computational cost of hyperparameter tuning and AutoML.
9. Treat the continuous variables with skewed distributions: a square-root transform for 'fnlwgt' and a cube-root transform for 'capital-gain' and 'capital-loss'.

Normalization is not performed as AutoML handles this process automatically. 
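
To illustrate the trade-off described in step 8, the sketch below compares how many columns each encoding produces on a small made-up frame. The toy values are assumptions and are not rows from the salary dataset. `pd.get_dummies` adds one column per category, while integer codes keep one column per variable.

```python
import pandas as pd

# Toy frame with made-up values; not taken from the salary dataset
toy = pd.DataFrame({
    "workclass": ["Private", "Gov", "Private", "Self-emp-not-inc"],
    "race": ["White", "Black", "Other", "White"],
})

# One-hot encoding: one column per distinct category
one_hot = pd.get_dummies(toy)

# Integer (label) encoding: one column per variable, categories numbered in sorted order
label_encoded = toy.apply(lambda col: col.astype("category").cat.codes)

print(one_hot.shape[1])        # 3 workclass categories + 3 race categories
print(label_encoded.shape[1])  # still one column per variable
```

On the real dataset the gap is much larger, because columns such as 'native-country' and 'occupation' have many distinct values.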

### Step 1: Missing category fix

In [114]:
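from statistics import mode

# Most frequent value of each column that contains missing entries
workclass_mode = mode(df['workclass'])
occupation_mode = mode(df['occupation'])
country_mode = mode(df['native-country'])

# regex=False: '?' is a regex quantifier, so it has to be matched as a literal character.
# Note: the raw values appear to carry a leading space (' ?', ' Private'), so replacing only
# the '?' yields values with two leading spaces, e.g. '  Private' in the value counts further down.
df['workclass'] = df['workclass'].str.replace('?', str(workclass_mode), regex=False)
df['occupation'] = df['occupation'].str.replace('?', str(occupation_mode), regex=False)
df['native-country'] = df['native-country'].str.replace('?', str(country_mode), regex=False)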
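
The value counts later in this notebook show entries with a leading space, which suggests the raw strings look like ' ?' rather than '?'. Below is a minimal sketch of an alternative that strips whitespace first and computes the mode over the non-missing values only. The helper name `fill_missing_with_mode` and the toy data are assumptions, and this is not the code that produced the results in this notebook.

```python
import pandas as pd

def fill_missing_with_mode(series: pd.Series, missing: str = "?") -> pd.Series:
    """Strip surrounding whitespace, then replace `missing` with the most frequent other value."""
    cleaned = series.str.strip()
    known = cleaned[cleaned != missing]
    most_frequent = known.mode().iloc[0]
    # Keep known values, substitute the mode where the marker was found
    return cleaned.where(cleaned != missing, most_frequent)

# Made-up values that mimic the leading-space formatting
toy = pd.Series([" Private", " ?", " Gov", " Private", " ?"])
print(fill_missing_with_mode(toy).tolist())
```

Stripping first means the imputed rows end up in the same category as the genuine 'Private' rows instead of forming a separate one.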

### Step 2: Remove Education parameter


In [115]:
df = df.drop('education-num', axis = 'columns')
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,salary
0,39,State-gov,77516,Bachelors,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


### Steps 3-7: Categorical Variable Merging

In [116]:
# regex=True replaces the matching substring, so values with a leading space (' State-gov') still match
df['workclass'] = df['workclass'].replace('Never-worked', 'Without-pay', regex=True)
df['workclass'] = df['workclass'].replace(to_replace=['State-gov', 'Local-gov'], value='Gov', regex=True)
df['workclass'] = df['workclass'].replace('Self-emp-inc', 'Private', regex=True)

# Without regex=True this is an exact match, which does not catch a value with a leading space
# (' Prof-school'); 'Prof-school' therefore remains a separate category in the counts below
df['education'] = df['education'].replace('Prof-school', 'Doctorate')
df['education'] = df['education'].replace(to_replace=['Assoc-acdm','Assoc-voc'], value='Assoc', regex=True)

df['marital-status'] = df['marital-status'].replace(to_replace=['Divorced','Widowed','Separated','Married-spouse-absent'], value='No-spouse', regex=True)
df['marital-status'] = df['marital-status'].replace(to_replace=['Married-AF-spouse', 'Married-civ-spouse'], value='Married-with-spouse', regex=True)

df['relationship'] = df['relationship'].replace(to_replace=['Not-in-family','Own-child','Unmarried','Other-relative'], value='Other', regex=True)

df['race'] = df['race'].replace('Amer-Indian-Eskimo', 'Other', regex=True)

# Sanity check: display every expression result in a cell, not only the last one
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

print(df['workclass'].value_counts())
print('\n')
print(df['education'].value_counts())
print('\n')
print(df['marital-status'].value_counts())
print('\n')
print(df['relationship'].value_counts())
print('\n')
print(df['race'].value_counts())


 Private             23812
 Gov                  3391
 Self-emp-not-inc     2541
  Private             1836
 Federal-gov           960
 Without-pay            21
Name: workclass, dtype: int64


 HS-grad         10501
 Some-college     7291
 Bachelors        5355
 Assoc            2449
 Masters          1723
 11th             1175
 10th              933
 7th-8th           646
 Prof-school       576
 9th               514
 12th              433
 Doctorate         413
 5th-6th           333
 1st-4th           168
 Preschool          51
Name: education, dtype: int64


 Married-with-spouse    14999
 Never-married          10683
 No-spouse               6879
Name: marital-status, dtype: int64


 Other      17800
 Husband    13193
 Wife        1568
Name: relationship, dtype: int64


 White                 27816
 Black                  3124
 Asian-Pac-Islander     1039
 Other                   582
Name: race, dtype: int64
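
The merges in steps 3-7 can also be expressed as one lookup table per column. The sketch below assumes the values have already been stripped of surrounding whitespace; the constant `MARITAL_MERGE` and the toy data are illustrative only.

```python
import pandas as pd

# Hypothetical lookup table: original category -> merged category
MARITAL_MERGE = {
    "Divorced": "No-spouse",
    "Widowed": "No-spouse",
    "Separated": "No-spouse",
    "Married-spouse-absent": "No-spouse",
    "Married-AF-spouse": "Married-with-spouse",
    "Married-civ-spouse": "Married-with-spouse",
}

# Made-up sample values
toy = pd.Series(["Divorced", "Married-civ-spouse", "Never-married", "Widowed"])

# Series.replace with a dict swaps exact matches and leaves other values untouched
merged = toy.replace(MARITAL_MERGE)
print(merged.value_counts().to_dict())
```

Keeping the mapping in one dictionary makes it easy to review which categories were collapsed and to reuse the same mapping in a scoring script.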


### Pre-Step 8/9: Split Discrete and Continuous Variables

In [117]:
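# Split the dataset into continuous and discrete variables by column position
c_cols = [0, 2, 9, 10, 11]
d_cols = [1, 3, 4, 5, 6, 7, 8, 12, 13]
# .copy() gives independent frames, so later column assignments do not raise SettingWithCopyWarning
df_c = df[df.columns[c_cols]].copy()
df_d = df[df.columns[d_cols]].copy()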

df_c.head()
df_d.head()

Unnamed: 0,age,fnlwgt,capital-gain,capital-loss,hours-per-week
0,39,77516,2174,0,40
1,50,83311,0,0,13
2,38,215646,0,0,40
3,53,234721,0,0,40
4,28,338409,0,0,40


Unnamed: 0,workclass,education,marital-status,occupation,relationship,race,sex,native-country,salary
0,Gov,Bachelors,Never-married,Adm-clerical,Other,White,Male,United-States,<=50K
1,Self-emp-not-inc,Bachelors,Married-with-spouse,Exec-managerial,Husband,White,Male,United-States,<=50K
2,Private,HS-grad,No-spouse,Handlers-cleaners,Other,White,Male,United-States,<=50K
3,Private,11th,Married-with-spouse,Handlers-cleaners,Husband,Black,Male,United-States,<=50K
4,Private,Bachelors,Married-with-spouse,Prof-specialty,Wife,Black,Female,Cuba,<=50K
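
Positional indices break silently if a column is added or dropped upstream. As a hedged alternative, the split can be driven by dtype. This sketch uses a made-up three-column frame and assumes that numeric columns are continuous and string columns are discrete, which matches the frames shown above.

```python
import pandas as pd

# Made-up frame with two numeric columns and one string column
toy = pd.DataFrame({
    "age": [39, 50],
    "workclass": ["Gov", "Private"],
    "hours-per-week": [40, 13],
})

# Numeric columns are treated as continuous, everything else as discrete
continuous = toy.select_dtypes(include="number").copy()
discrete = toy.select_dtypes(exclude="number").copy()

print(list(continuous.columns))
print(list(discrete.columns))
```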


### Step 8: Label Encoding

In [118]:
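from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
# fit_transform refits the encoder on each column in turn, so after this call
# `le` only holds the classes of the last column ('salary')
df_d_le = df_d.apply(le.fit_transform)
df_d_le.head()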

Unnamed: 0,workclass,education,marital-status,occupation,relationship,race,sex,native-country,salary
0,2,8,1,1,1,3,1,39,0
1,4,8,0,4,0,3,1,39,0
2,3,10,2,6,1,3,1,39,0
3,3,1,0,6,0,1,1,39,0
4,3,8,0,10,2,1,0,5,0
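
Because a single encoder is refit on every column, the integer codes above cannot be mapped back to their labels afterwards. One way to keep that mapping is to store the categories per column. The sketch below uses pandas categoricals rather than `LabelEncoder`; for string columns both number the categories in sorted order, but treat that equivalence as an assumption to verify on your own data. The names `encode_with_mapping` and `toy` are hypothetical.

```python
import pandas as pd

def encode_with_mapping(frame: pd.DataFrame):
    """Return an integer-coded copy of `frame` plus a {column: [labels]} lookup."""
    encoded = pd.DataFrame(index=frame.index)
    mapping = {}
    for column in frame.columns:
        categorical = frame[column].astype("category")
        encoded[column] = categorical.cat.codes
        # Position i in this list is the label behind code i
        mapping[column] = list(categorical.cat.categories)
    return encoded, mapping

# Made-up sample values
toy = pd.DataFrame({"sex": ["Male", "Female", "Male"], "salary": ["<=50K", ">50K", "<=50K"]})
encoded, mapping = encode_with_mapping(toy)

print(encoded["sex"].tolist())
print(mapping["salary"][1])
```

The stored mapping is also what a client needs in order to build valid requests for a deployed model, since the endpoint expects the encoded integers rather than the original labels.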


### Step 9: Skewed Distribution Treatment

In [119]:
print('Skewness of skewed variables prior to transformation: \n')
print('fnlwgt: '+ str(df_c['fnlwgt'].skew()))
print('capital-gain: '+ str(df_c['capital-gain'].skew()))
print('capital-loss: '+ str(df_c['capital-loss'].skew()))
# Square-root transform for fnlwgt, cube-root transform for the capital columns
df_c['fnlwgt'] = np.sqrt(df_c['fnlwgt'])
df_c['capital-gain'] = np.cbrt(df_c['capital-gain'])
df_c['capital-loss'] = np.cbrt(df_c['capital-loss'])
print('\n')
print('Skewness of variables after treatment: \n')
print('fnlwgt: '+ str(df_c['fnlwgt'].skew()))
print('capital-gain: '+ str(df_c['capital-gain'].skew()))
print('capital-loss: '+ str(df_c['capital-loss'].skew()))

Skewness of skewed variables prior to transformation: 

fnlwgt: 1.4469800945789826
capital-gain: 11.953847687699799
capital-loss: 4.594629121679692


Skewness of variables after treatment: 

fnlwgt: 0.18911507102940592
capital-gain: 4.099578209557927
capital-loss: 4.337076103499937
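
The cube-root transform barely changes the skewness of 'capital-loss'. A likely reason (an assumption here, not something checked in this notebook) is that most values are zero: a monotone transform maps zero to zero, so the spike at zero and the long tail both remain. The sketch below reproduces the effect on synthetic data; the distributions and sample sizes are made up.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Strictly positive, right-skewed data: a power transform helps
positive = pd.Series(rng.exponential(scale=1.0, size=10_000))

# Zero-inflated data: roughly 95% zeros plus a few large values
is_zero = rng.random(10_000) < 0.95
tail = rng.exponential(scale=1000.0, size=10_000)
zero_inflated = pd.Series(np.where(is_zero, 0.0, tail))

# Skewness before and after each transform
print(positive.skew(), np.sqrt(positive).skew())
print(zero_inflated.skew(), np.cbrt(zero_inflated).skew())
```

For columns like this, a binary "is non-zero" indicator alongside the transformed value is a common alternative, though it was not used in this project.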


### Final step: Join together continuous and discrete variables

In [120]:
from azureml.core import Datastore
df_final = df_c.join(df_d_le)
df_final.head()

Unnamed: 0,age,fnlwgt,capital-gain,capital-loss,hours-per-week,workclass,education,marital-status,occupation,relationship,race,sex,native-country,salary
0,39,278.42,12.95,0.0,40,2,8,1,1,1,3,1,39,0
1,50,288.64,0.0,0.0,13,4,8,0,4,0,3,1,39,0
2,38,464.38,0.0,0.0,40,3,10,2,6,1,3,1,39,0
3,53,484.48,0.0,0.0,40,3,1,0,6,0,1,1,39,0
4,28,581.73,0.0,0.0,40,3,8,0,10,2,1,0,5,0


## Register Dataset as Data Asset

In [94]:
datastore = Datastore.get(ws,'workspaceblobstore')
dataset = Dataset.Tabular.register_pandas_dataframe(df_final, datastore, "Salary-Dataset-Cleaned", show_progress=True)

#Cleaned dataframe also saved as csv to reuse in the hyperparameter script:
df_final.to_csv('salary_cleaned.csv')

Validating arguments.
Arguments validated.
Successfully obtained datastore reference and path.
Uploading file to managed-dataset/df1d5f60-ad39-4521-8088-efb34230a49e/
Successfully uploaded file to datastore.
Creating and registering a new dataset.
Successfully created and registered a new dataset.


## AutoML Configuration

TODO: Explain why you chose the automl settings and configuration you used below.

### Attach Cluster Compute

In [121]:
from azureml.core.compute import AmlCompute
from azureml.core.compute import ComputeTarget
from azureml.core.compute_target import ComputeTargetException

# NOTE: update the cluster name to match the existing cluster
amlcompute_cluster_name = "auto-ml"

# Reuse the cluster if it already exists, otherwise create it
try:
    compute_target = ComputeTarget(workspace=ws, name=amlcompute_cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    # STANDARD_NC6 is a GPU VM size
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_NC6', max_nodes=4)
    compute_target = ComputeTarget.create(ws, amlcompute_cluster_name, compute_config)
    # Block until the new cluster has been provisioned
    compute_target.wait_for_completion(show_output=True)

Found existing cluster, use it.


In [122]:
experiment_name = 'salary-automl-experiment'
experiment=Experiment(ws, experiment_name)

In [130]:
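# AutoML settings
automl_settings = {
    "experiment_timeout_minutes": 30,  # stop the whole experiment after 30 minutes
    "max_concurrent_iterations": 5,    # parallel iterations; should not exceed the cluster's node count
    "primary_metric" : 'accuracy'      # metric used to rank the candidate models
}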

# TODO: Put your automl config here
automl_config = AutoMLConfig(compute_target=compute_target,
                             task = "classification",
                             training_data=dataset,
                             label_column_name="salary",   
                             path = 'Users/alex.jamieson/log_files',
                             enable_early_stopping= True,
                             featurization= 'auto',
                             debug_log = "automl_errors.log",
                             **automl_settings
                            )

In [131]:
# TODO: Submit your experiment
automl_run = experiment.submit(automl_config)

Submitting remote run.


Experiment,Id,Type,Status,Details Page,Docs Page
salary-automl-experiment,AutoML_6153b81f-cfb1-45cc-82ff-928ca40cb597,automl,NotStarted,Link to Azure Machine Learning studio,Link to Documentation


## Run Details

OPTIONAL: Write about the different models trained and their performance. Why do you think some models did better than others?

TODO: In the cell below, use the `RunDetails` widget to show the different experiments.

In [132]:
from azureml.widgets import RunDetails
RunDetails(automl_run).show()

_AutoMLWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', 's…

In [133]:
automl_run.wait_for_completion()



{'runId': 'AutoML_6153b81f-cfb1-45cc-82ff-928ca40cb597',
 'target': 'auto-ml',
 'status': 'Completed',
 'startTimeUtc': '2023-01-28T15:58:12.933123Z',
 'endTimeUtc': '2023-01-28T16:26:24.215896Z',
 'services': {},
   'message': 'No scores improved over last 10 iterations, so experiment stopped early. This early stopping behavior can be disabled by setting enable_early_stopping = False in AutoMLConfig for notebook/python SDK runs.'}],
 'properties': {'num_iterations': '1000',
  'training_type': 'TrainFull',
  'acquisition_function': 'EI',
  'primary_metric': 'accuracy',
  'train_split': '0',
  'acquisition_parameter': '0',
  'num_cross_validation': None,
  'target': 'auto-ml',
  'DataPrepJsonString': '{\\"training_data\\": {\\"datasetId\\": \\"3ddc3abb-5bc8-4364-8ff8-23db2715bd50\\"}, \\"datasets\\": 0}',
  'EnableSubsampling': None,
  'runTemplate': 'AutoML',
  'azureml.runsource': 'automl',
  'display_task_type': 'classification',
  'dependencies_versions': '{"azureml-widgets": "1.47.

In [18]:
# Retrieve the previously run experiment
# Imports required if the kernel state has been cleared; otherwise comment out this block:
from azureml.train.automl.run import AutoMLRun
from azureml.core import Experiment, Workspace
from azureml.widgets import RunDetails

ws = Workspace.from_config()

prev_exp = Experiment(ws, name="salary-automl-experiment")
automl_run = AutoMLRun(prev_exp, run_id="AutoML_6153b81f-cfb1-45cc-82ff-928ca40cb597")

RunDetails(automl_run).show()

_AutoMLWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', 's…



## Best Model

TODO: In the cell below, get the best model from the automl experiments and display all the properties of the model.



In [20]:
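# get_output() returns the best child run together with its fitted model
best_auto_run, fitted_model = automl_run.get_output()
# get_best_child() returns the best child Run object (not just its id)
best_child_run = automl_run.get_best_child()

best_auto_run.get_metrics()

print('\n')
print(best_child_run)
print('\n')
print(fitted_model)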

Package:azureml-automl-runtime, training version:1.48.0.post1, current version:1.47.0
Package:azureml-core, training version:1.48.0, current version:1.47.0
Package:azureml-dataprep, training version:4.8.3, current version:4.5.7
Package:azureml-dataprep-rslex, training version:2.15.1, current version:2.11.4
Package:azureml-dataset-runtime, training version:1.48.0, current version:1.47.0
Package:azureml-defaults, training version:1.48.0, current version:1.47.0
Package:azureml-interpret, training version:1.48.0, current version:1.47.0
Package:azureml-mlflow, training version:1.48.0, current version:1.47.0
Package:azureml-pipeline-core, training version:1.48.0, current version:1.47.0
Package:azureml-responsibleai, training version:1.48.0, current version:1.47.0
Package:azureml-telemetry, training version:1.48.0, current version:1.47.0
Package:azureml-train-automl-client, training version:1.48.0, current version:1.47.0
Package:azureml-train-automl-runtime, training version:1.48.0.post1, cur



Run(Experiment: salary-automl-experiment,
Id: AutoML_6153b81f-cfb1-45cc-82ff-928ca40cb597_41,
Type: azureml.scriptrun,
Status: Completed)


Pipeline(memory=None,
         steps=[('datatransformer',
                 DataTransformer(enable_dnn=False, enable_feature_sweeping=True, feature_sweeping_config={}, feature_sweeping_timeout=86400, featurization_config=None, force_text_dnn=False, is_cross_validation=False, is_onnx_compatible=False, observer=None, task='classification', working_dir='/mnt/batch/tasks/shared/LS_root/moun...
                 PreFittedSoftVotingClassifier(classification_labels=array([0, 1]), estimators=[('21', Pipeline(memory=None, steps=[('standardscalerwrapper', StandardScalerWrapper(copy=True, with_mean=False, with_std=False)), ('xgboostclassifier', XGBoostClassifier(booster='gbtree', colsample_bytree=0.5, eta=0.2, gamma=0, max_depth=7, max_leaves=7, n_estimators=25, n_jobs=1, objective='reg:logistic', problem_info=ProblemInfo(gpu_training_param_dict={'processing_

Best autorun model details:

```
Run(Experiment: salary-automl-experiment,
Id: AutoML_6153b81f-cfb1-45cc-82ff-928ca40cb597_41,
Type: azureml.scriptrun,
Status: Completed)
```

```
Pipeline(memory=None,
         steps=[('datatransformer',
                 DataTransformer(enable_dnn=False, enable_feature_sweeping=True, feature_sweeping_config={}, feature_sweeping_timeout=86400, featurization_config=None, force_text_dnn=False, is_cross_validation=False, is_onnx_compatible=False, observer=None, task='classification', working_dir='/mnt/batch/tasks/shared/LS_root/moun...
                 PreFittedSoftVotingClassifier(classification_labels=array([0, 1]), estimators=[('21', Pipeline(memory=None, steps=[('standardscalerwrapper', StandardScalerWrapper(copy=True, with_mean=False, with_std=False)), ('xgboostclassifier', XGBoostClassifier(booster='gbtree', colsample_bytree=0.5, eta=0.2, gamma=0, max_depth=7, max_leaves=7, n_estimators=25, n_jobs=1, objective='reg:logistic', problem_info=ProblemInfo(gpu_training_param_dict={'processing_unit_type': 'gpu'}), random_state=0, reg_alpha=0, reg_lambda=0.20833333333333334, subsample=1, tree_method='auto'))], verbose=False)), ('33', Pipeline(memory=None, steps=[('standardscalerwrapper', StandardScalerWrapper(copy=True, with_mean=False, with_std=False)), ('xgboostclassifier', XGBoostClassifier(booster='gbtree', colsample_bytree=1, eta=0.05, gamma=0, max_depth=6, max_leaves=0, n_estimators=200, n_jobs=1, objective='reg:logistic', problem_info=ProblemInfo(gpu_training_param_dict={'processing_unit_type': 'gpu'}), random_state=0, reg_alpha=0.625, reg_lambda=0.8333333333333334, subsample=0.8, tree_method='auto'))], verbose=False)), ('1', Pipeline(memory=None, steps=[('maxabsscaler', MaxAbsScaler(copy=True)), ('xgboostclassifier', XGBoostClassifier(n_jobs=1, problem_info=ProblemInfo(gpu_training_param_dict={'processing_unit_type': 'gpu'}), random_state=0, tree_method='auto'))], verbose=False)), ('14', Pipeline(memory=None, steps=[('standardscalerwrapper', StandardScalerWrapper(copy=True, with_mean=False, with_std=False)), ('xgboostclassifier', XGBoostClassifier(booster='gbtree', colsample_bytree=1, eta=0.3, gamma=0, max_depth=10, max_leaves=511, n_estimators=10, n_jobs=1, objective='reg:logistic', problem_info=ProblemInfo(gpu_training_param_dict={'processing_unit_type': 'gpu'}), random_state=0, reg_alpha=2.1875, reg_lambda=0.4166666666666667, subsample=0.5, tree_method='auto'))], verbose=False)), ('0', Pipeline(memory=None, steps=[('maxabsscaler', MaxAbsScaler(copy=True)), ('lightgbmclassifier', 
LightGBMClassifier(min_data_in_leaf=20, n_jobs=1, problem_info=ProblemInfo(gpu_training_param_dict={'processing_unit_type': 'gpu'}), random_state=None))], verbose=False)), ('15', Pipeline(memory=None, steps=[('sparsenormalizer', Normalizer(copy=True, norm='l2')), ('randomforestclassifier', RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight='balanced', criterion='gini', max_depth=None, max_features='sqrt', max_leaf_nodes=None, max_samples=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=0.01, min_samples_split=0.01, min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1, oob_score=True, random_state=None, verbose=0, warm_start=False))], verbose=False)), ('25', Pipeline(memory=None, steps=[('standardscalerwrapper', StandardScalerWrapper(copy=True, with_mean=False, with_std=False)), ('xgboostclassifier', XGBoostClassifier(booster='gbtree', colsample_bytree=0.9, eta=0.01, gamma=0, max_depth=9, max_leaves=63, n_estimators=400, n_jobs=1, objective='reg:logistic', problem_info=ProblemInfo(gpu_training_param_dict={'processing_unit_type': 'gpu'}), random_state=0, reg_alpha=0, reg_lambda=1.875, subsample=0.9, tree_method='auto'))], verbose=False)), ('28', Pipeline(memory=None, steps=[('sparsenormalizer', Normalizer(copy=True, norm='l1')), ('xgboostclassifier', XGBoostClassifier(booster='gbtree', colsample_bytree=1, eta=0.5, gamma=0, grow_policy='lossguide', max_bin=255, max_depth=6, max_leaves=0, n_estimators=50, n_jobs=1, objective='reg:logistic', problem_info=ProblemInfo(gpu_training_param_dict={'processing_unit_type': 'gpu'}), random_state=0, reg_alpha=2.291666666666667, reg_lambda=2.3958333333333335, subsample=0.5, tree_method='hist'))], verbose=False)), ('22', Pipeline(memory=None, steps=[('sparsenormalizer', Normalizer(copy=True, norm='l1')), ('xgboostclassifier', XGBoostClassifier(booster='gbtree', colsample_bylevel=1, colsample_bytree=1, eta=0.2, gamma=0, max_depth=10, max_leaves=0, n_estimators=10, n_jobs=1, 
objective='reg:logistic', problem_info=ProblemInfo(gpu_training_param_dict={'processing_unit_type': 'gpu'}), random_state=0, reg_alpha=1.3541666666666667, reg_lambda=1.4583333333333335, subsample=1, tree_method='auto'))], verbose=False)), ('3', Pipeline(memory=None, steps=[('sparsenormalizer', Normalizer(copy=True, norm='l2')), ('xgboostclassifier', XGBoostClassifier(booster='gbtree', colsample_bytree=0.7, eta=0.01, gamma=0.01, max_depth=7, max_leaves=31, n_estimators=10, n_jobs=1, objective='reg:logistic', problem_info=ProblemInfo(gpu_training_param_dict={'processing_unit_type': 'gpu'}), random_state=0, reg_alpha=2.1875, reg_lambda=1.0416666666666667, subsample=1, tree_method='auto'))], verbose=False)), ('10', Pipeline(memory=None, steps=[('sparsenormalizer', Normalizer(copy=True, norm='l1')), ('lightgbmclassifier', LightGBMClassifier(boosting_type='gbdt', colsample_bytree=0.8911111111111111, learning_rate=0.0842121052631579, max_bin=290, max_depth=3, min_child_weight=6, min_data_in_leaf=0.024145517241379314, min_split_gain=0.7368421052631579, n_estimators=25, n_jobs=1, num_leaves=137, problem_info=ProblemInfo(gpu_training_param_dict={'processing_unit_type': 'gpu'}), random_state=None, reg_alpha=0.15789473684210525, reg_lambda=0, subsample=0.29736842105263156))], verbose=False)), ('20', Pipeline(memory=None, steps=[('truncatedsvdwrapper', TruncatedSVDWrapper(n_components=0.7026315789473684, random_state=None)), ('randomforestclassifier', RandomForestClassifier(bootstrap=False, ccp_alpha=0.0, class_weight='balanced', criterion='gini', max_depth=None, max_features='log2', max_leaf_nodes=None, max_samples=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=0.01, min_samples_split=0.01, min_weight_fraction_leaf=0.0, n_estimators=200, n_jobs=1, oob_score=False, random_state=None, verbose=0, warm_start=False))], verbose=False))], flatten_transform=None, weights=[0.07142857142857142, 0.14285714285714285, 0.07142857142857142, 0.07142857142857142, 
0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.14285714285714285, 0.07142857142857142, 0.07142857142857142]))],
         verbose=False)
```

In [135]:
#TODO: Save the best model (local directory)
import pickle

with open('automl_salary_model.pkl', 'wb') as f:
    pickle.dump(fitted_model, f, pickle.HIGHEST_PROTOCOL)
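
Loading the model back is the mirror image of the save step. The sketch below round-trips a stand-in dictionary rather than the real fitted pipeline, and the file name `example_model.pkl` is hypothetical. For the actual AutoML model, unpickling generally needs the same package versions that were used for training, which is what the training/current version messages printed above are pointing at.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted model; any picklable object behaves the same way
stand_in_model = {"name": "Salary-AutoML-Model", "labels": [0, 1]}

# Write to a temporary directory so the example leaves the working folder untouched
path = os.path.join(tempfile.mkdtemp(), "example_model.pkl")

with open(path, "wb") as f:
    pickle.dump(stand_in_model, f, pickle.HIGHEST_PROTOCOL)

with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == stand_in_model)
```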


## Model Deployment

Remember you have to deploy only one of the two models you trained but you still need to register both the models. Perform the steps in the rest of this notebook only if you wish to deploy this model.

TODO: In the cell below, register the model, create an inference config and deploy the model as a web service.

In [1]:
# Required Imports in case of cleared variable workspace:
from azureml.core import Experiment
from azureml.core import Workspace
from azureml.train.automl.run import AutoMLRun

ws = Workspace.from_config()

prev_exp = Experiment(ws, name="salary-automl-experiment")
prev_run = AutoMLRun(prev_exp, run_id="AutoML_6153b81f-cfb1-45cc-82ff-928ca40cb597")

#Step 1: Register model to Azure Workspace

best_run, fitted_model = prev_run.get_output()

bestModel = prev_run.register_model(model_name="Salary-AutoML-Model", description="Salary dataset trained on AutoML Config")

Package:azureml-automl-runtime, training version:1.48.0.post1, current version:1.47.0
Package:azureml-core, training version:1.48.0, current version:1.47.0
Package:azureml-dataprep, training version:4.8.3, current version:4.5.7
Package:azureml-dataprep-rslex, training version:2.15.1, current version:2.11.4
Package:azureml-dataset-runtime, training version:1.48.0, current version:1.47.0
Package:azureml-defaults, training version:1.48.0, current version:1.47.0
Package:azureml-interpret, training version:1.48.0, current version:1.47.0
Package:azureml-mlflow, training version:1.48.0, current version:1.47.0
Package:azureml-pipeline-core, training version:1.48.0, current version:1.47.0
Package:azureml-responsibleai, training version:1.48.0, current version:1.47.0
Package:azureml-telemetry, training version:1.48.0, current version:1.47.0
Package:azureml-train-automl-client, training version:1.48.0, current version:1.47.0
Package:azureml-train-automl-runtime, training version:1.48.0.post1, cur



In [13]:
# Step 2: Create the model deployment web service (online deployment on ACI)

from azureml.core import Model
from azureml.core.model import InferenceConfig
from azureml.core.webservice import AciWebservice

# Download the scoring script that AutoML generated for the best run
best_run.download_file('outputs/scoring_file_v_1_0_0.py', 'scoringScript.py')
# Reuse the environment of the best run for inference
bestEnv = best_run.get_environment()
bestEnv.inferencing_stack_version = 'latest'

inference_config = InferenceConfig(entry_script='scoringScript.py', environment=bestEnv)

deployment_config = AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=1, enable_app_insights=True)
service = Model.deploy(ws, "salary-automl-endpoint", [bestModel], inference_config, deployment_config)
service.wait_for_deployment(show_output=True)
print(service.state)
print(service.scoring_uri)
print(service.swagger_uri)

Tips: You can try get_logs(): https://aka.ms/debugimage#dockerlog or local deployment: https://aka.ms/debugimage#debug-locally to debug if deployment takes longer than 10 minutes.
Running
2023-02-04 13:29:16+00:00 Creating Container Registry if not exists.
2023-02-04 13:29:16+00:00 Registering the environment.
2023-02-04 13:29:17+00:00 Use the existing image.
2023-02-04 13:29:17+00:00 Submitting deployment to compute.
2023-02-04 13:29:24+00:00 Checking the status of deployment salary-automl-endpoint..
2023-02-04 13:30:05+00:00 Checking the status of inference endpoint salary-automl-endpoint.
Succeeded
ACI service creation operation finished, operation "Succeeded"
Healthy
http://8272254d-d176-4a03-b869-e40a4a517234.uksouth.azurecontainer.io/score
http://8272254d-d176-4a03-b869-e40a4a517234.uksouth.azurecontainer.io/swagger.json


TODO: In the cell below, send a request to the web service you deployed to test it.

In [14]:
import requests

# Two sample records in the cleaned, label-encoded feature space (no 'salary' column)
test_data = (
    '{"data":[{"age":69,"fnlwgt":18.46523,"capital-gain":3.14,"capital-loss":1.23,"hours-per-week":45,"workclass":14,'
    '"education":10,"marital-status":1,"occupation":13,"relationship":0,"race":1,"sex":0,"native-country":35},'
    ' {"age":14,"fnlwgt":15.3247,"capital-gain":0,"capital-loss":0,"hours-per-week":23,"workclass":7,'
    '"education":3,"marital-status":2,"occupation":16,"relationship":1,"race":0,"sex":1,"native-country":13}],"method":"predict"}'
)
print(test_data)
headers = {'Content-type': 'application/json'}

response = requests.post(service.scoring_uri, data=test_data, headers=headers)
print("Predicted output (1: >50k Salary , 0: <=50k Salary)")
print(response.text)

{"data":[{"age":69,"fnlwgt":18.46523,"capital-gain":3.14,"capital-loss":1.23,"hours-per-week":45,"workclass":14,"education":10,"marital-status":1,"occupation":13,"relationship":0,"race":1,"sex":0,"native-country":35}, {"age":14,"fnlwgt":15.3247,"capital-gain":0,"capital-loss":0,"hours-per-week":23,"workclass":7,"education":3,"marital-status":2,"occupation":16,"relationship":1,"race":0,"sex":1,"native-country":13}],"method":"predict"}
Predicted output (1: >50k Salary , 0: <=50k Salary)
"{\"result\": [1, 0]}"

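The response body above is a JSON string that itself contains JSON, so it has to be decoded twice to reach the predictions. Below is a small sketch that uses the literal response text shown above. The label mapping follows the printed legend, and the variable names are illustrative.

```python
import json

# Response text exactly as printed by the web service above
response_text = '"{\\"result\\": [1, 0]}"'

# The first decode yields a str, the second decode yields the dict
inner = json.loads(response_text)
predictions = json.loads(inner)["result"]

# Map the encoded classes back to the salary bands from the legend
labels = {0: "<=50K", 1: ">50K"}
print([labels[p] for p in predictions])
```

In the notebook itself, `response.text` would take the place of the hard-coded `response_text`.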

TODO: In the cell below, print the logs of the web service and delete the service

In [21]:
print(service.get_logs())
service.delete()

2023-02-04T13:29:55,635123100+00:00 - iot-server/run 
2023-02-04T13:29:55,639535700+00:00 - rsyslog/run 
2023-02-04T13:29:55,643294300+00:00 - gunicorn/run 
2023-02-04T13:29:55,647771300+00:00 | gunicorn/run | 
2023-02-04T13:29:55,654027600+00:00 | gunicorn/run | ###############################################
2023-02-04T13:29:55,656246000+00:00 | gunicorn/run | AzureML Container Runtime Information
2023-02-04T13:29:55,663212900+00:00 | gunicorn/run | ###############################################
2023-02-04T13:29:55,665999400+00:00 | gunicorn/run | 
2023-02-04T13:29:55,681519400+00:00 | gunicorn/run | 
2023-02-04T13:29:55,701052500+00:00 | gunicorn/run | AzureML image information: openmpi4.1.0-cuda11.1-cudnn8-ubuntu18.04, Materializaton Build:20230103.v4
2023-02-04T13:29:55,707489600+00:00 | gunicorn/run | 
2023-02-04T13:29:55,708730100+00:00 - nginx/run 
2023-02-04T13:29:55,712018600+00:00 | gunicorn/run | 
2023-02-04T13:29:55,719196600+00:00 | gunicorn/run | PATH environment variab

**Submission Checklist**
- I have registered the model.
- I have deployed the model with the best accuracy as a webservice.
- I have tested the webservice by sending a request to the model endpoint.
- I have deleted the webservice and shutdown all the computes that I have used.
- I have taken a screenshot showing the model endpoint as active.
- The project includes a file containing the environment details.
