### Automate machine learning model selection with Azure Machine Learning

### Introduction

Automated Machine Learning lets you try multiple models and preprocessing transformations without the time-consuming manual trial and error.
You can run it from the Azure Machine Learning studio UI or from the SDK.

By default, AzureML tries every available model, but you can restrict the selection. It can also apply preprocessing transformations such as scaling, normalization, missing-value imputation, categorical encoding, and feature engineering.
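As a rough sketch of how the candidate models can be narrowed down, the settings below restrict the search to a few algorithms. It assumes the SDK v1 `AutoMLConfig` keyword argument `allowed_models` and the model names shown; check both against the installed SDK version before relying on them. The dictionary name `automl_settings` is only illustrative.

```python
# Illustrative settings only: the model names are assumed to match the
# classification model names accepted by the installed azureml SDK (v1).
automl_settings = {
    "task": "classification",
    "primary_metric": "AUC_weighted",
    "featurization": "auto",
    # Restrict the search to these algorithms instead of trying all of them.
    "allowed_models": ["LogisticRegression", "LightGBM", "RandomForest"],
}

# With the SDK installed, the settings could be unpacked into the config
# together with the data and compute arguments, for example:
# automl_config = AutoMLConfig(training_data=data,
#                              label_column_name="Diabetic",
#                              compute_target=compute,
#                              **automl_settings)
print(sorted(automl_settings))
```

Keeping the settings in a plain dictionary makes it easy to reuse them across experiments and to inspect them without submitting a run.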

You must provide data as input: if you supply only a training set, AzureML applies cross-validation (or, for larger datasets, a hold-out split); otherwise it uses the validation set you provide.
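The default behaviour when no validation set is supplied can be sketched as a small function. The thresholds are taken from the data guardrail message that AutoML prints during the run later in this notebook (fewer than 20,000 rows: cross-validation, with 10 folds under 1,000 rows and 3 folds otherwise; larger datasets: a single hold-out split). The function name and return strings are hypothetical, and this is not the SDK's own code.

```python
def default_validation_strategy(n_rows: int, has_validation_data: bool = False) -> str:
    """Describe the validation approach for a training set of n_rows rows.

    Thresholds follow the guardrail message quoted in this notebook; they
    may differ in other SDK versions.
    """
    if has_validation_data:
        return "use provided validation set"
    if n_rows < 20_000:
        folds = 10 if n_rows < 1_000 else 3
        return f"{folds}-fold cross-validation"
    return "single hold-out split"


for n in (500, 10_000, 50_000):
    print(n, "->", default_validation_strategy(n))
```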

Moreover, the metric to optimize must be set as the _primary metric_ (the `primary_metric` argument).

In [3]:
# !pip install azureml-train-automl

In [1]:
from azureml.core import Workspace
from azureml.core import ComputeTarget
from azureml.core import Environment

ws = Workspace.from_config()
compute = ComputeTarget(workspace = ws, name = 'aml-cluster')
env = Environment.get(workspace = ws, name = 'experiment_env')
data = ws.datasets.get('diabetes dataset')

In [2]:
# List classification metrics.

import azureml.train.automl.utilities as automl_utils

for metric in automl_utils.get_primary_metrics('classification'):
    print(metric)

norm_macro_recall
accuracy
precision_score_weighted
average_precision_score_weighted
AUC_weighted
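
Since the metric name is passed as a plain string, a typo only surfaces when the experiment is configured or submitted. A minimal guard, assuming the five names printed above are the valid choices for classification in this SDK version, could look like the following. The helper name `check_primary_metric` is hypothetical.

```python
# Metric names copied from the output above (classification task).
CLASSIFICATION_PRIMARY_METRICS = {
    "norm_macro_recall",
    "accuracy",
    "precision_score_weighted",
    "average_precision_score_weighted",
    "AUC_weighted",
}


def check_primary_metric(name: str) -> str:
    """Return name unchanged if it is a known metric, else raise ValueError."""
    if name not in CLASSIFICATION_PRIMARY_METRICS:
        valid = ", ".join(sorted(CLASSIFICATION_PRIMARY_METRICS))
        raise ValueError(f"Unknown primary metric {name!r}; expected one of: {valid}")
    return name


print(check_primary_metric("AUC_weighted"))
```

The comparison is case-sensitive, so `auc_weighted` would be rejected.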


In [3]:
from azureml.train.automl import AutoMLConfig
from azureml.core import Experiment

automl_config = AutoMLConfig(name = 'Automated ML experiment',
                             task = 'classification',
                             compute_target=compute,
                             training_data=data,
                             label_column_name='Diabetic',
                             iterations=4,
                             primary_metric='AUC_weighted',
                             max_concurrent_iterations=2,
                             featurization='auto')

exp = Experiment(workspace=ws, name='AutoML')
run = exp.submit(automl_config)

run.wait_for_completion(show_output=True)

Submitting remote run.


Experiment,Id,Type,Status,Details Page,Docs Page
AutoML,AutoML_66753947-19d6-4022-bd25-44e408f6dd86,automl,NotStarted,Link to Azure Machine Learning studio,Link to Documentation

Current status: FeaturesGeneration. Generating features for the dataset.
Current status: DatasetCrossValidationSplit. Generating individually featurized CV splits.
Current status: ModelSelection. Beginning model selection.

********************************************************************************************
DATA GUARDRAILS: 

TYPE:         Cross validation
STATUS:       DONE
DESCRIPTION:  In order to accurately evaluate the model(s) trained by AutoML, we leverage a dataset that the model is not trained on. Hence, if the user doesn't provide an explicit validation dataset, a part of the training dataset is used to achieve this. For smaller datasets (fewer than 20,000 samples), cross-validation is leveraged, else a single hold-out set is split from the training data to serve as the validation dataset. Hence, for your input data we leverage cross-validation with 10 folds, if the number of training samples are fewer than 1000, and 3 folds in all other cases.
              Learn mo

{'runId': 'AutoML_66753947-19d6-4022-bd25-44e408f6dd86',
 'target': 'aml-cluster',
 'status': 'Completed',
 'startTimeUtc': '2023-01-06T15:07:32.317356Z',
 'endTimeUtc': '2023-01-06T15:15:23.201604Z',
 'services': {},
 'properties': {'num_iterations': '4',
  'training_type': 'TrainFull',
  'acquisition_function': 'EI',
  'primary_metric': 'AUC_weighted',
  'train_split': '0',
  'acquisition_parameter': '0',
  'num_cross_validation': None,
  'target': 'aml-cluster',
  'DataPrepJsonString': '{\\"training_data\\": {\\"datasetId\\": \\"c36b3658-d319-41aa-81d1-8cae5a45f92f\\"}, \\"datasets\\": 0}',
  'EnableSubsampling': 'False',
  'runTemplate': 'AutoML',
  'azureml.runsource': 'automl',
  'display_task_type': 'classification',
  'dependencies_versions': '{"azureml-dataprep-native": "38.0.0", "azureml-dataprep": "4.8.3", "azureml-dataprep-rslex": "2.15.1", "azureml-automl-core": "1.48.0", "azureml-core": "1.48.0", "azureml-dataset-runtime": "1.48.0", "azureml-mlflow": "1.48.0", "azureml-pi

In [8]:
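# View child runs. Use a separate loop variable so that `run` keeps
# pointing at the parent AutoML run.
for child_run in run.get_children():
    print('Run ID', child_run.id)
    # get_metrics(name) returns a dict containing only the requested metric.
    print('\t', child_run.get_metrics('AUC_weighted'))

In [7]:
# `run` must be the parent AutoML run here. If it has been rebound to a child
# run (for example by `for run in run.get_children()`), this call fails with
# AttributeError: 'Run' object has no attribute 'get_output'.
best_run, fitted_model = run.get_output()
print(best_run)
print('\nBest Model Definition:')
print(fitted_model)
print('\nBest Run Transformations:')
for step in fitted_model.named_steps:
    print(step)
print('\nBest Run Metrics:')
best_run_metrics = best_run.get_metrics()
for metric_name in best_run_metrics:
    metric = best_run_metrics[metric_name]
    print(metric_name, metric)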
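
Conceptually, selecting the best run amounts to choosing the child run with the highest value of the primary metric. The sketch below illustrates this with made-up run IDs and scores; they are placeholders, not results from this experiment. On the parent AutoML run, `get_output()` returns the best child run together with its fitted model.

```python
# Hypothetical AUC_weighted scores for four child runs (placeholder values).
child_scores = {
    "child_run_0": 0.962,
    "child_run_1": 0.948,
    "child_run_2": 0.987,
    "child_run_3": 0.971,
}

# For metrics where higher is better, the best run is the one with the
# maximum score.
best_run_id = max(child_scores, key=child_scores.get)
print(best_run_id, child_scores[best_run_id])
```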