# Automated ML

TODO: Import Dependencies. In the cell below, import all the dependencies that you will need to complete the project.

In [None]:
import urllib.request
import json
import os
import ssl
import pandas as pd

from azureml.core import Workspace, Experiment
from azureml.train.automl import AutoMLConfig
from azureml.widgets import RunDetails

## Dataset

### Overview
TODO: In this markdown cell, give an overview of the dataset you are using. Also mention the task you will be performing.
I chose the [Heart Failure records dataset from Kaggle](https://www.kaggle.com/andrewmvd/heart-failure-clinical-data) because it has a usability score of 10, meaning that it is easy to understand, machine readable, includes essential metadata, and is maintained. The topic is also important: according to Kaggle, cardiovascular diseases (CVDs) are the number one cause of death globally, accounting for 31% of all deaths worldwide. The task is binary classification: predicting `DEATH_EVENT`, that is, whether a patient died during the follow-up period.

Environmental and behavioural risk factors such as tobacco use, unhealthy diet and obesity, physical inactivity, and harmful use of alcohol could be used as features for estimation models. Being able to estimate the probability of developing a CVD could be of great help to people at high risk.

The Dataset is tabular with 13 columns (12 features and 1 target variable) and contains 299 rows.

The following features are going to be used:

|    | Variable name             | Type            | Description                                               | Example           |
|----|---------------------------|-----------------|-----------------------------------------------------------|-------------------|
| 1  | age                       | numerical       | Age of the patient                                        | 60                |
| 2  | anaemia                   | boolean         | Decrease of red blood cells or hemoglobin                 | 0 or 1            |
| 3  | creatinine_phosphokinase  | numerical       | Level of the CPK enzyme in the blood                      | 542               |
| 4  | diabetes                  | boolean         | If the patient has diabetes                               | 0 or 1            |
| 5  | ejection_fraction         | numerical       | Percentage of blood leaving the heart at each contraction | 45                |
| 6  | high_blood_pressure       | boolean         | If the patient has hypertension                           | 0 or 1            |
| 7  | platelets                 | numerical       | Platelets in the blood                                    | 149000            |
| 8  | serum_creatinine          | numerical       | Level of serum creatinine in the blood                    | 0.5               |
| 9  | serum_sodium              | numerical       | Level of serum sodium in the blood                        | 137               |
| 10 | sex                       | boolean         | Woman or man                                              | 0 or 1            |
| 11 | smoking                   | boolean         | If the patient smokes                                     | 0 or 1            |
| 12 | time                      | numerical       | Follow-up period (days)                                   | 4                 |
| 13 | DEATH_EVENT [Target]      | boolean         | If the patient died during the follow-up period           | 0 or 1            |
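
As a quick sanity check on the table above, the boolean columns can be verified to contain only 0 or 1 before training. The sketch below is illustrative: the helper name `non_binary_columns` and the two sample records are made up for this example and are not rows from the dataset.

```python
# Columns the table above lists as boolean, stored as 0 or 1.
BINARY_COLUMNS = (
    "anaemia", "diabetes", "high_blood_pressure", "sex", "smoking", "DEATH_EVENT",
)

def non_binary_columns(rows):
    """Return the binary columns that contain a value other than 0 or 1."""
    bad = []
    for name in BINARY_COLUMNS:
        if any(row[name] not in (0, 1) for row in rows if name in row):
            bad.append(name)
    return bad

# Two made-up records for illustration only; they are not rows from the dataset.
sample_rows = [
    {"age": 60, "anaemia": 0, "diabetes": 0, "high_blood_pressure": 1,
     "sex": 1, "smoking": 0, "DEATH_EVENT": 0},
    {"age": 72, "anaemia": 1, "diabetes": 1, "high_blood_pressure": 0,
     "sex": 0, "smoking": 2, "DEATH_EVENT": 1},  # smoking=2 is deliberately invalid
]
invalid = non_binary_columns(sample_rows)
print(invalid)
```

With a pandas DataFrame, the same helper could be fed `df.to_dict("records")`.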

TODO: Get data. In the cell below, write code to access the data you will be using in this project. Remember that the dataset needs to be external.

In [2]:
ws = Workspace.from_config()

# choose a name for experiment
experiment_name = 'aml-experiment'

experiment=Experiment(ws, experiment_name)

In [9]:
dataset = ws.datasets['heart-failure-records']
df = dataset.to_pandas_dataframe()
df.describe()

Unnamed: 0,age,creatinine_phosphokinase,ejection_fraction,platelets,serum_creatinine,serum_sodium,time
count,299.0,299.0,299.0,299.0,299.0,299.0,299.0
mean,60.833893,581.839465,38.083612,263358.029264,1.39388,136.625418,130.26087
std,11.894809,970.287881,11.834841,97804.236869,1.03451,4.412477,77.614208
min,40.0,23.0,14.0,25100.0,0.5,113.0,4.0
25%,51.0,116.5,30.0,212500.0,0.9,134.0,73.0
50%,60.0,250.0,38.0,262000.0,1.1,137.0,115.0
75%,70.0,582.0,45.0,303500.0,1.4,140.0,203.0
max,95.0,7861.0,80.0,850000.0,9.4,148.0,285.0


## AutoML Configuration

TODO: Explain why you chose the automl settings and configuration you used below.

Automl Settings:

| Parameter | Description | Value | Reason |
|-----------|-------------|-------| ------------|
| experiment_timeout_minutes | The maximum amount of time (in minutes) that the experiment is allowed to run before it is automatically stopped and results are made available. | 20 | I wanted the experiment to time out after 20 minutes to limit my spending, because I am using my personal Azure account. |
| max_concurrent_iterations | The maximum number of concurrent training iterations allowed for the experiment. | 3 | AmlCompute clusters run one iteration per node, so this value should be less than or equal to the maximum number of nodes in the cluster. |
| primary_metric | The metric that AutoML optimizes when selecting models. | AUC_weighted | I chose AUC_weighted because accuracy, average_precision_score_weighted, norm_macro_recall, and precision_score_weighted may not optimize as well on small datasets such as this one, which has only 299 rows. |
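
To make the `AUC_weighted` choice more concrete, here is a minimal sketch of a class-frequency-weighted one-vs-rest AUC, written in plain Python. It assumes the metric is the per-class AUC averaged with weights proportional to the number of true instances of each class. The function names, labels, and scores below are invented for illustration and are not taken from the experiment.

```python
from collections import Counter

def binary_auc(labels, scores):
    """AUC as P(score of a positive > score of a negative); ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def weighted_auc(labels, class_scores):
    """Average the one-vs-rest AUC of each class, weighted by class frequency."""
    counts = Counter(labels)
    total = len(labels)
    result = 0.0
    for cls, scores in class_scores.items():
        one_vs_rest = [1 if y == cls else 0 for y in labels]
        result += counts[cls] / total * binary_auc(one_vs_rest, scores)
    return result

# Made-up labels and predicted probabilities of DEATH_EVENT = 1.
labels = [0, 0, 0, 1, 1]
p_death = [0.1, 0.4, 0.35, 0.8, 0.3]
class_scores = {1: p_death, 0: [1 - p for p in p_death]}
auc = weighted_auc(labels, class_scores)
print(round(auc, 4))
```

In a binary problem with complementary class probabilities, both one-vs-rest AUCs are equal, so the weighted value matches the ordinary binary AUC.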

Automl Config:

| Parameter | Description | Value | Reason |
|-----------|-------------|-------|------------|
| compute_target | The compute target that will run the job. | compute_target | I used the AmlCompute cluster created specifically for this project. |
| task | The type of task to be solved. | classification | I wanted the model to predict whether a patient is likely to die during the follow-up period, so classification was better suited than regression or forecasting. |
| training_data | The dataset to be used for training. | dataset | NA |
| label_column_name | The name of the column containing the label. | DEATH_EVENT | NA |
| enable_early_stopping | Enable early stopping. | True | I set this to True to save compute time and therefore money. |
| featurization | The featurization method to be used. | auto | Automatic featurization also runs the data guardrail checks on the training data. |
| debug_log | The path to the log file. | automl_errors.log | This lets me inspect the logs when debugging. |
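
The idea behind `enable_early_stopping` can be illustrated with a generic patience-based rule: stop once several consecutive iterations fail to beat the best score so far. This is only a sketch of the general technique, not the exact rule AutoML applies. The function name, the `patience` value, and the score history are assumptions made for this example.

```python
def iterations_until_stop(scores, patience=3):
    """Return how many iterations run before a patience-based stop.

    The search stops once `patience` consecutive iterations fail to
    improve on the best score seen so far.
    """
    best = float("-inf")
    stale = 0
    for done, score in enumerate(scores, start=1):
        if score > best:
            best, stale = score, 0  # new best score: reset the counter
        else:
            stale += 1              # no improvement this iteration
        if stale >= patience:
            return done
    return len(scores)

# Made-up primary-metric values, one per iteration.
history = [0.80, 0.85, 0.84, 0.86, 0.86, 0.85, 0.83, 0.90]
stopped_at = iterations_until_stop(history, patience=3)
print(stopped_at)
```

Note the trade-off: the last score in `history` is never reached, which is the price of saving compute.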





In [10]:
amlcompute_cluster_name = "heart-f-cluster"
cts = ws.compute_targets
compute_target = cts[amlcompute_cluster_name]

In [14]:
project_folder = './aml-project'

# TODO: Put your automl settings here
automl_settings = {
    "experiment_timeout_minutes": 20,
    "max_concurrent_iterations": 3,
    "primary_metric" : 'AUC_weighted'
}

# TODO: Put your automl config here
automl_config = AutoMLConfig(compute_target=compute_target,
                             task="classification",
                             training_data=dataset,
                             label_column_name="DEATH_EVENT",
                             path=project_folder,
                             enable_early_stopping=True,
                             featurization='auto',
                             debug_log="automl_errors.log",
                             **automl_settings
                            )

In [15]:
# TODO: Submit your experiment
remote_run = experiment.submit(automl_config)

Submitting remote run.


Experiment,Id,Type,Status,Details Page,Docs Page
aml-experiment,AutoML_323a97c6-9e35-4d8f-b68f-1ee814cf808e,automl,NotStarted,Link to Azure Machine Learning studio,Link to Documentation


## Run Details

OPTIONAL: Write about the different models trained and their performance. Why do you think some models did better than others?
Several pipelines were trained, combining preprocessing steps such as StandardScalerWrapper and MinMaxScaler with algorithms such as ExtremeRandomTrees and RandomForest. The best-performing model is the VotingEnsemble. Its performance comes from combining the predictions of multiple other models. Here are the ensembled algorithms used: 'ExtremeRandomTrees', 'RandomForest', 'RandomForest', 'XGBoostClassifier', 'LightGBM', 'ExtremeRandomTrees', 'XGBoostClassifier', 'RandomForest', 'XGBoostClassifier', 'LightGBM', 'XGBoostClassifier', 'ExtremeRandomTrees'
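
To illustrate how a voting ensemble combines its members, here is a minimal soft-voting sketch: each model contributes its predicted probability for the positive class, and the ensemble takes a weighted average. The probabilities and weights are invented for illustration. The actual weights chosen by AutoML are not shown in this notebook.

```python
def soft_vote(probabilities, weights):
    """Weighted average of per-model positive-class probabilities."""
    if set(probabilities) != set(weights):
        raise ValueError("every model needs exactly one probability and one weight")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(probabilities[name] * weights[name] for name in probabilities)
    return weighted_sum / total_weight

# Hypothetical per-model probabilities that DEATH_EVENT = 1 for one patient.
model_probs = {"RandomForest": 0.30, "XGBoostClassifier": 0.70, "LightGBM": 0.60}
# Hypothetical ensemble weights.
model_weights = {"RandomForest": 0.2, "XGBoostClassifier": 0.5, "LightGBM": 0.3}

ensemble_prob = soft_vote(model_probs, model_weights)
predicted_death = ensemble_prob >= 0.5
print(round(ensemble_prob, 2), predicted_death)
```

Averaging probabilities lets a confident model outweigh several uncertain ones, which a simple majority vote would not allow.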


TODO: In the cell below, use the `RunDetails` widget to show the different experiments.

In [17]:
RunDetails(remote_run).show()

_AutoMLWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', 's…

_UserRunWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', '…

INFO:interpret_community.common.explanation_utils:Using default datastore for uploads

## Best Model

TODO: In the cell below, get the best model from the automl experiments and display all the properties of the model.



In [18]:
best_run, fitted_model = remote_run.get_output()
best_run

Experiment,Id,Type,Status,Details Page,Docs Page
aml-experiment,AutoML_323a97c6-9e35-4d8f-b68f-1ee814cf808e_37,azureml.scriptrun,Completed,Link to Azure Machine Learning studio,Link to Documentation


In [38]:
best_run.get_properties()

{'runTemplate': 'automl_child',
 'pipeline_id': '__AutoML_Ensemble__',
 'pipeline_spec': '{"pipeline_id":"__AutoML_Ensemble__","objects":[{"module":"azureml.train.automl.ensemble","class_name":"Ensemble","spec_class":"sklearn","param_args":[],"param_kwargs":{"automl_settings":"{\'task_type\':\'classification\',\'primary_metric\':\'AUC_weighted\',\'verbosity\':20,\'ensemble_iterations\':15,\'is_timeseries\':False,\'name\':\'aml-experiment\',\'compute_target\':\'heart-f-cluster\',\'subscription_id\':\'2a3b9c06-13fd-4499-8d62-0323ea7c8399\',\'region\':\'centralus\',\'spark_service\':None}","ensemble_run_id":"AutoML_323a97c6-9e35-4d8f-b68f-1ee814cf808e_37","experiment_name":"aml-experiment","workspace_name":"ml-workspace","subscription_id":"2a3b9c06-13fd-4499-8d62-0323ea7c8399","resource_group_name":"networkwatcherrg"}}]}',
 'training_percent': '100',
 'predicted_cost': None,
 'iteration': '37',
 '_aml_system_scenario_identification': 'Remote.Child',
 '_azureml.ComputeTargetType': 'amlcomp

In [19]:
model_name = best_run.properties['model_name']
model_name

'AutoML323a97c6937'

In [20]:
script_file_name = 'inference/score.py'
best_run.download_file('outputs/scoring_file_v_1_0_0.py', script_file_name)

In [31]:
#TODO: Save the best model

description = "aml heart failure project sdk"
model = best_run.register_model(model_name = model_name,
                                model_path = './outputs/',
                                description = description,
                                tags = None)

## Model Deployment

Remember you have to deploy only one of the two models you trained. Perform the steps in the rest of this notebook only if you wish to deploy this model.

TODO: In the cell below, register the model, create an inference config and deploy the model as a web service.

In [28]:
from azureml.core.model import InferenceConfig
from azureml.core.webservice import AciWebservice
from azureml.core.webservice import Webservice
from azureml.core.model import Model
from azureml.core.environment import Environment

inference_config = InferenceConfig(entry_script=script_file_name, environment=best_run.get_environment())

aciconfig = AciWebservice.deploy_configuration(cpu_cores = 1,
                                               memory_gb = 1,
                                               tags = {'type': "automl-classification"},
                                               description = 'AutoML classification service for heart failure records')

aci_service_name = 'automl-hf-sdk-2'
print(aci_service_name)
aci_service = Model.deploy(ws, aci_service_name, [model], inference_config, aciconfig)
aci_service.wait_for_deployment(True)
print(aci_service.state)

automl-hf-sdk-2
Tips: You can try get_logs(): https://aka.ms/debugimage#dockerlog or local deployment: https://aka.ms/debugimage#debug-locally to debug if deployment takes longer than 10 minutes.
Running
2021-08-22 18:22:10+00:00 Creating Container Registry if not exists.
2021-08-22 18:22:10+00:00 Registering the environment.
2021-08-22 18:22:11+00:00 Use the existing image.
2021-08-22 18:22:11+00:00 Generating deployment configuration.
2021-08-22 18:22:13+00:00 Submitting deployment to compute.
2021-08-22 18:22:16+00:00 Checking the status of deployment automl-hf-sdk-2..
2021-08-22 18:26:20+00:00 Checking the status of inference endpoint automl-hf-sdk-2.
Succeeded
ACI service creation operation finished, operation "Succeeded"
Healthy


TODO: In the cell below, send a request to the web service you deployed to test it.

In [34]:
data = {
    "data":
    [
        {
            'age': "60",
            'anaemia': "false",
            'creatinine_phosphokinase': "500",
            'diabetes': "false",
            'ejection_fraction': "38",
            'high_blood_pressure': "false",
            'platelets': "260000",
            'serum_creatinine': "1.40",
            'serum_sodium': "137",
            'sex': "false",
            'smoking': "false",
            'time': "130",
        },
    ],
}

test_sample = str.encode(json.dumps(data))
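
The same payload could also be sent over plain HTTP instead of through the SDK object. The sketch below only builds the request, so it runs offline. The URL is a placeholder, and `build_scoring_request` and `api_key` are hypothetical names introduced for this example. In practice the endpoint address would come from the deployed service (for example its `scoring_uri` attribute).

```python
import json
import urllib.request

def build_scoring_request(scoring_uri, payload, api_key=None):
    """Build a JSON POST request for a scoring endpoint without sending it."""
    body = json.dumps(payload).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    if api_key:
        # Only needed when key-based authentication is enabled on the service.
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(scoring_uri, data=body, headers=headers, method="POST")

# Placeholder URL and a shortened, made-up payload for illustration.
example_payload = {"data": [{"age": "60", "time": "130"}]}
request = build_scoring_request("http://example.invalid/score", example_payload)
print(request.get_method(), request.full_url)

# Sending it would look like this (left out so the sketch runs offline):
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```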

In [35]:
response = aci_service.run(input_data=test_sample)
response

INFO:interpret_community.common.explanation_utils:Using default datastore for uploads


'{"result": [false]}'
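
The service returns its answer as a JSON string, so it helps to decode it before using the prediction. A minimal sketch, assuming the response always has the `{"result": [...]}` shape shown above (`parse_predictions` is a hypothetical helper name):

```python
import json

def parse_predictions(raw_response):
    """Decode the service response and return the list stored under "result"."""
    parsed = json.loads(raw_response) if isinstance(raw_response, str) else raw_response
    if not isinstance(parsed, dict) or "result" not in parsed:
        raise ValueError(f"unexpected response shape: {parsed!r}")
    return parsed["result"]

# The response string shown in the cell output above.
predictions = parse_predictions('{"result": [false]}')
print(predictions)
```

Here the single `False` means the model predicts no death event for the submitted patient record.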

TODO: In the cell below, print the logs of the web service and delete the service

In [37]:
aci_service.get_logs()



In [None]:
aci_service.delete()