# Automated ML

Requisite Dependencies for the Project is imported as below

In [1]:
from azureml.core import Workspace
from azureml.core.experiment import Experiment
from azureml.data.dataset_factory import TabularDatasetFactory
from azureml.core import  Dataset
import shutil
import os
import zipfile
import pandas as pd
from azureml.train.automl import AutoMLConfig
import json
from azureml.widgets import RunDetails
import joblib

## Workspace

Gather Workspace details from the config file and create and Experiment to run the AutoML.

In [2]:
ws = Workspace.from_config()
# choose a name for experiment
experiment_name = 'automl_capstone_exp'
experiment=Experiment(ws, experiment_name)

## Dataset

### Overview
The Dataset that we will be using is called the MNIST dataset.It is a dataset of 60,000 small square 28×28 pixel grayscale images of handwritten single digits between 0 and 9.

Goal :To take an image of a handwritten single digit(0-9), and determine what that digit is.




### Loading the Data from Kaggle

In [3]:
#Load Data for the AutoML model
!pip install kaggle

Collecting kaggle
  Downloading kaggle-1.5.12.tar.gz (58 kB)
[K     |████████████████████████████████| 58 kB 3.9 MB/s eta 0:00:011
Collecting python-slugify
  Downloading python_slugify-5.0.2-py2.py3-none-any.whl (6.7 kB)
Collecting text-unidecode>=1.3
  Downloading text_unidecode-1.3-py2.py3-none-any.whl (78 kB)
[K     |████████████████████████████████| 78 kB 5.2 MB/s  eta 0:00:01
[?25hBuilding wheels for collected packages: kaggle
  Building wheel for kaggle (setup.py) ... [?25l- \ done
[?25h  Created wheel for kaggle: filename=kaggle-1.5.12-py3-none-any.whl size=73053 sha256=93f2ddadc9a49985d12e7bb37830ce3f7782dcc8d59528ca890e901a2c47cff3
  Stored in directory: /home/azureuser/.cache/pip/wheels/77/47/e4/44a4ba1b7dfd53faaa35f59f1175e123b213ff401a8a56876b
Successfully built kaggle
Installing collected packages: text-unidecode, python-slugify, kaggle
Successfully installed kaggle-1.5.12 python-slugify-5.0.2 text-unidecode-1.3


In [4]:
#Create Data Folder and Kaggle Folder (Ref:https://inclusive-ai.medium.com/how-to-use-kaggle-api-with-azure-machine-learning-service-da056708fc5a)
data_folder = os.path.join(os.getcwd(),'data')
os.makedirs(data_folder, exist_ok=True)
kaggle_folder = os.path.join(os.getcwd(), '.kaggle')
os.makedirs(kaggle_folder, exist_ok=True)
kaggle_key_folder = '/home/azureuser/.kaggle'
os.makedirs(kaggle_key_folder, exist_ok=True)

In [5]:
#Upload the kaggle.json(Generated from Kaggle account Page) generated from kaggle in .kaggle folder

kaggle_file = kaggle_folder + '/kaggle.json'
shutil.copy(kaggle_file, kaggle_key_folder)
os.remove(kaggle_file)
!chmod 600 /home/azureuser/.kaggle/kaggle.json

#Data Download
import kaggle
!kaggle --version
!kaggle competitions download -c digit-recognizer
with zipfile.ZipFile("digit-recognizer.zip","r") as zip_ref:
    zip_ref.extractall(data_folder)


#View the Unzipped Files
for root, directories, files in os.walk(data_folder, topdown=True):
    for name in files:
        print(os.path.join(root, name))

Kaggle API 1.5.12
Downloading digit-recognizer.zip to /mnt/batch/tasks/shared/LS_root/mounts/clusters/mnistcompute/code/Users/mashrajiv
 13%|████▉                                 | 2.00M/15.3M [00:00<00:00, 19.5MB/s]
100%|██████████████████████████████████████| 15.3M/15.3M [00:00<00:00, 45.8MB/s]
/mnt/batch/tasks/shared/LS_root/mounts/clusters/mnistcompute/code/Users/mashrajiv/data/sample_submission.csv
/mnt/batch/tasks/shared/LS_root/mounts/clusters/mnistcompute/code/Users/mashrajiv/data/test.csv
/mnt/batch/tasks/shared/LS_root/mounts/clusters/mnistcompute/code/Users/mashrajiv/data/train.csv


In [6]:
#Load the CSV into Data Frames

train_properties_file = '/mnt/batch/tasks/shared/LS_root/mounts/clusters/mnistcompute/code/Users/mashrajiv/data/train.csv'
test_properties_file = '/mnt/batch/tasks/shared/LS_root/mounts/clusters/mnistcompute/code/Users/mashrajiv/data/test.csv'
train = pd.read_csv('./data/train.csv',nrows=10000)
#test file does not have lables so we will be using a part of train for validation
test = pd.read_csv('./data/test.csv')

### Compute Creation

In [8]:
#Required incase Local instance is not used
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

# Choose a name for your CPU cluster
cpu_cluster_name = "cpucluster"

# Verify that cluster does not exist already
try:
    compute_target = ComputeTarget(workspace=ws, name=cpu_cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_DS12_V2',
                                                           max_nodes=6)
    compute_target = ComputeTarget.create(ws, cpu_cluster_name, compute_config)

compute_target.wait_for_completion(show_output=True)

Creating......
SucceededProvisioning operation finished, operation "Succeeded"
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned


## AutoML Configuration

Choice of AutoML Settings:
 
#### 1. n_cross_validation 
Indicates how many cross validations to perform and in our case splitting it into 5 portions will ensure that we have ~8000 records for training and ~2000 for validation.
 
#### 2. Primary Metric
Primary metric chosen here is accuracy to understand how much of the sample has been correctly classified.We could also use AUC as metric where we can see multiple one versus all Precision recall curves for each of the MNIST digits

#### 3. enable early stopping
Early stopping is enabled to prevent overfitting

#### 4. Experiment Stop time 
To handle costs and time

#### 5.Local compute
Going for the Local compute since we can pass dataframes and also to not create a separate compute (better cost)

In [7]:

# TODO: Put your automl settings here
automl_settings = automl_settings = {
                                    "n_cross_validations": 5,
                                    "primary_metric": 'accuracy',
                                    "enable_early_stopping": True,
                                    "experiment_timeout_minutes": 20
                                     }

# TODO: Put your automl config here
automl_config = AutoMLConfig(
   # compute_target = cpu_cluster,  - Local compute accepts Data Frames
    task='classification',
    training_data=train,
    label_column_name='label',
    **automl_settings)

In [8]:
# TODO: Submit your experiment
automl_run = experiment.submit(automl_config,show_output = False)



Experiment,Id,Type,Status,Details Page,Docs Page
automl_capstone_exp,AutoML_825541e0-d110-46eb-a4ed-70d9834d4886,automl,Preparing,Link to Azure Machine Learning studio,Link to Documentation


INFO:interpret_community.common.explanation_utils:Using default datastore for uploads


## Run Details

Below are the models which have been chosen by Azure ML for this experiment

1.Voting Ensemble

2.Stack Ensemble

3.Max ABS scaler/Light GBM

4.Max ABS scaler/XGBoost Classifier 


5.Random Forest


The Ensemble models perform better as opposed to the individual models since they combine bagging,bosting and stacking to provide the results.
They also combine the results and minimise the variance component of the error.




We can explore the results of automatic training with a Jupyter widget. 
Additionally, we can filter on different accuracy metrics than the  primary metric - Accuracy  with the dropdown selector

In [9]:

RunDetails(automl_run).show()


_AutoMLWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', 's…

In [11]:
automl_run.wait_for_completion(show_output=False)

{'runId': 'AutoML_825541e0-d110-46eb-a4ed-70d9834d4886',
 'target': 'local',
 'status': 'Completed',
 'startTimeUtc': '2021-07-27T05:35:36.707871Z',
 'endTimeUtc': '2021-07-27T05:57:59.274571Z',
 'properties': {'num_iterations': '1000',
  'training_type': 'TrainFull',
  'acquisition_function': 'EI',
  'primary_metric': 'accuracy',
  'train_split': '0',
  'acquisition_parameter': '0',
  'num_cross_validation': '5',
  'target': 'local',
  'DataPrepJsonString': None,
  'EnableSubsampling': None,
  'runTemplate': 'AutoML',
  'azureml.runsource': 'automl',
  'display_task_type': 'classification',
  'dependencies_versions': '{"azureml-widgets": "1.31.0", "azureml-train": "1.31.0", "azureml-train-restclients-hyperdrive": "1.31.0", "azureml-train-core": "1.31.0", "azureml-train-automl": "1.31.0", "azureml-train-automl-runtime": "1.31.0", "azureml-train-automl-client": "1.31.0", "azureml-tensorboard": "1.31.0", "azureml-telemetry": "1.31.0", "azureml-sdk": "1.31.0", "azureml-samples": "0+unknow

## Best Model

Getting  the best model from the automl experiments and display all the properties of the model.



In [13]:
best_automl_run = automl_run.get_best_child()


In [16]:
best_automl_run

Experiment,Id,Type,Status,Details Page,Docs Page
automl_capstone_exp,AutoML_2f705b9b-8879-4360-8ccf-2c0d63279de4_3,,Completed,Link to Azure Machine Learning studio,Link to Documentation


In [14]:
best_run_metrics = best_automl_run.get_metrics() # or other runs with runID
for metric_name in best_run_metrics:
     metric = best_run_metrics[metric_name]
     print(metric_name, metric)

f1_score_weighted 0.9466406316599226
weighted_accuracy 0.9467200868969089
precision_score_weighted 0.9469033319628212
average_precision_score_micro 0.9793000123851952
average_precision_score_weighted 0.9818169120157325
accuracy 0.9465999999999999
log_loss 0.211867206630954
average_precision_score_macro 0.9815919766787358
recall_score_micro 0.9465999999999999
recall_score_weighted 0.9465999999999999
f1_score_micro 0.9465999999999999
AUC_micro 0.9953126055555556
AUC_macro 0.9952907637532906
AUC_weighted 0.9953422693205942
precision_score_micro 0.9465999999999999
matthews_correlation 0.9406502555884899
precision_score_macro 0.9461385346212907
norm_macro_recall 0.9405541126033388
f1_score_macro 0.94620649433743
recall_score_macro 0.9464987013430048
balanced_accuracy 0.9464987013430048
confusion_matrix aml://artifactId/ExperimentRun/dcid.AutoML_2f705b9b-8879-4360-8ccf-2c0d63279de4_3/confusion_matrix
accuracy_table aml://artifactId/ExperimentRun/dcid.AutoML_2f705b9b-8879-4360-8ccf-2c0d63279d

In [14]:
best_automl_run.get_file_names()

In [15]:
#TODO: Save the best model in Outputs Folder

outputs_folder = os.path.join(os.getcwd(),'outputs')
os.makedirs(outputs_folder, exist_ok=True)
best_automl_run.download_file('outputs/model.pkl', output_file_path='./outputs/')
#Downloading the Scoring File
best_automl_run.download_file('outputs/scoring_file_v_1_0_0.py', output_file_path='./outputs/score1.py')
#downloading the Environment
best_automl_run.download_file('outputs/conda_env_v_1_0_0.yml', output_file_path='./outputs/env.yaml')


## Model Deployment

Remember you have to deploy only one of the two models you trained.. Perform the steps in the rest of this notebook only if you wish to deploy this model.

Registering the model, creating an inference config and deploying the model as a web service.

In [16]:
#Register the Best model
model_auto = best_automl_run.register_model(model_name='AUTOML_ATTEMPT',description ='MNIST using AutoML',
                           model_path='outputs/model.pkl')


In [54]:
from azureml.core.environment import Environment
from azureml.core.model import InferenceConfig
from azureml.core.model import Model
from azureml.core.webservice import AciWebservice
from azureml.core.webservice import webservice

infenv = Environment.from_conda_specification(name = "infenv", file_path = "outputs/env.yaml")

# Combine scoring script & environment in Inference configuration
inference_config = InferenceConfig(entry_script='outputs/score1.py', 
                                    environment=infenv
                                    )

# Set deployment configuration
deployment_config = AciWebservice.deploy_configuration(cpu_cores = 1, memory_gb = 1,tags={'type':'automl-classification'},
                                                        description='Sample Web Service for AutoML Classification')

aci_service_name = "automl-classification"
print (aci_service_name)
aci_service = Model.deploy(ws,aci_service_name,[model_auto],inference_config,deployment_config)
aci_service.wait_for_deployment(True)
print(aci_service.state)



automl-classification
Tips: You can try get_logs(): https://aka.ms/debugimage#dockerlog or local deployment: https://aka.ms/debugimage#debug-locally to debug if deployment takes longer than 10 minutes.
Running
2021-07-27 10:26:51+00:00 Creating Container Registry if not exists.
2021-07-27 10:26:51+00:00 Registering the environment.
2021-07-27 10:26:54+00:00 Use the existing image.
2021-07-27 10:26:55+00:00 Generating deployment configuration.
2021-07-27 10:26:56+00:00 Submitting deployment to compute.
2021-07-27 10:26:59+00:00 Checking the status of deployment automl-classification..
2021-07-27 10:31:19+00:00 Checking the status of inference endpoint automl-classification.
Succeeded
ACI service creation operation finished, operation "Succeeded"
Healthy


Sending a request to the web service  deployed to test it.

In [58]:
#Create the holdout set as a subset from the Train file.
#First 10000 used for training
#Remaining is being used for testing
validation =pd.read_csv('./data/train.csv',skiprows=[1,9999])
validation_labels = validation['label']
validation.drop(columns=['label'],axis=1,inplace=True)

test_sample = json.dumps({"data":validation.tail(100).to_dict(orient='records')})
response = aci_service.run(test_sample)
response

'{"result": [2, 7, 2, 7, 2, 7, 7, 7, 7, 9, 2, 2, 7, 2, 7, 5, 2, 5, 2, 2, 2, 2, 2, 7, 9, 2, 7, 2, 2, 9, 5, 5, 1, 2, 2, 2, 2, 2, 1, 2, 2, 7, 2, 6, 7, 9, 2, 2, 7, 7, 2, 5, 5, 2, 5, 2, 2, 7, 5, 2, 2, 9, 7, 2, 2, 2, 2, 2, 2, 7, 5, 7, 6, 2, 7, 2, 2, 2, 2, 2, 7, 5, 7, 7, 2, 7, 2, 7, 2, 7, 7, 2, 2, 2, 2, 5, 2, 7, 2, 2]}'

In [59]:
res_dict = json.loads(response)
Predicted_label=pd.Series(res_dict['result'])
from sklearn.metrics import accuracy_score,confusion_matrix
print(accuracy_score(validation_labels.head(100),Predicted_label))
print(confusion_matrix(validation_labels.head(100),Predicted_label,labels=[0,1,2,3,4,5,6,7,8,9]))

0.07
[[ 0  0  6  0  0  0  0  3  0  0]
 [ 0  0 10  0  0  3  0  2  0  0]
 [ 0  0  5  0  0  2  2  2  0  0]
 [ 0  0  4  0  0  0  0  5  0  2]
 [ 0  1  5  0  0  1  0  4  0  0]
 [ 0  0  4  0  0  0  0  1  0  1]
 [ 0  0  6  0  0  0  0  3  0  0]
 [ 0  0  4  0  0  1  0  2  0  0]
 [ 0  0  3  0  0  1  0  0  0  2]
 [ 0  1  6  0  0  3  0  5  0  0]]


TODO: In the cell below, print the logs of the web service and delete the service

In [57]:
aci_service.get_logs(num_lines=5000, init=False)



In [53]:
#Deleting the WebService
aci_service.delete()