# Automated ML

TODO: Import Dependencies. In the cell below, import all the dependencies that you will need to complete the project.

In [1]:
from azureml.core import Workspace, Experiment
#Define a workspace
ws = Workspace.from_config()

#Create an experiment
exp = Experiment(workspace=ws, name="Hyperparameter_Tuning")
print('Workspace name: ' + ws.name, 
      'Azure region: ' + ws.location, 
      'Subscription id: ' + ws.subscription_id, 
      'Resource group: ' + ws.resource_group, sep = '\n')
run = exp.start_logging()


Workspace name: quick-starts-ws-243949
Azure region: westeurope
Subscription id: 81cefad3-d2c9-4f77-a466-99a7f541c7bb
Resource group: aml-quickstarts-243949


In [2]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

cpu_cluster_name = "AzureProject"

# TODO: Create compute cluster
# Use vm_size="Standard_D2_V2" in your provisioning configuration.
# max_nodes should be no greater than 4.

### YOUR CODE HERE ###
# Verify that the cluster does not already exist
try:
    cpu_cluster = ComputeTarget(workspace=ws, name=cpu_cluster_name)
    print("Found existing cluster, use it.")
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size="Standard_D2_V2", max_nodes=4)
    cpu_cluster = ComputeTarget.create(ws, cpu_cluster_name, compute_config)

cpu_cluster.wait_for_completion(show_output=True)
# print detailed status for the current cluster
print(cpu_cluster.get_status().serialize())

Found existing cluster, use it.
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned
{'currentNodeCount': 0, 'targetNodeCount': 0, 'nodeStateCounts': {'preparingNodeCount': 0, 'runningNodeCount': 0, 'idleNodeCount': 0, 'unusableNodeCount': 0, 'leavingNodeCount': 0, 'preemptedNodeCount': 0}, 'allocationState': 'Steady', 'allocationStateTransitionTime': '2023-11-11T16:00:42.492000+00:00', 'errors': None, 'creationTime': '2023-11-11T16:00:33.928274+00:00', 'modifiedTime': '2023-11-11T16:00:43.727927+00:00', 'provisioningState': 'Succeeded', 'provisioningStateTransitionTime': None, 'scaleSettings': {'minNodeCount': 0, 'maxNodeCount': 4, 'nodeIdleTimeBeforeScaleDown': 'PT1800S'}, 'vmPriority': 'Dedicated', 'vmSize': 'STANDARD_D2_V2'}
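
The status dictionary printed above can also be checked programmatically before any work is submitted. The following is a minimal, SDK-independent sketch: it copies a few fields from that output into a plain dict, and the helper names `is_ready` and `idle_seconds` are hypothetical (they are not part of `azureml`). The duration parser only handles the `PT<n>S` form seen in this output.

```python
import re

# A subset of the fields printed by cpu_cluster.get_status().serialize() above.
status = {
    "allocationState": "Steady",
    "provisioningState": "Succeeded",
    "errors": None,
    "scaleSettings": {
        "minNodeCount": 0,
        "maxNodeCount": 4,
        "nodeIdleTimeBeforeScaleDown": "PT1800S",
    },
}


def is_ready(status: dict) -> bool:
    """Return True when the cluster is provisioned, steady, and reports no errors."""
    return (
        status.get("provisioningState") == "Succeeded"
        and status.get("allocationState") == "Steady"
        and not status.get("errors")
    )


def idle_seconds(status: dict) -> int:
    """Parse a seconds-only duration such as 'PT1800S' into an integer."""
    value = status["scaleSettings"]["nodeIdleTimeBeforeScaleDown"]
    match = re.fullmatch(r"PT(\d+)S", value)
    if match is None:
        raise ValueError(f"unsupported duration format: {value!r}")
    return int(match.group(1))


print(is_ready(status))            # True
print(idle_seconds(status) // 60)  # 30 (minutes before idle nodes scale down)
```

With `minNodeCount` at 0, the cluster reports zero current nodes while idle, which is consistent with the `currentNodeCount` of 0 shown above.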


## Dataset

### Overview
TODO: In this markdown cell, give an overview of the dataset you are using. Also mention the task you will be performing.


TODO: Get data. In the cell below, write code to access the data you will be using in this project. Remember that the dataset needs to be external.

In [3]:
import pandas as pd
pd_path='https://raw.githubusercontent.com/AnnaDM87/Udacity_CAPSTONE/main/starter_file/Employee.csv'
df = pd.read_csv(pd_path)

df.head()

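| Unnamed: 0 | Education | JoiningYear | City | PaymentTier | Age | Gender | EverBenched | ExperienceInCurrentDomain | LeaveOrNot |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Bachelors | 2017 | Bangalore | 3 | 34 | Male | No | 0 | 0 |
| 1 | Bachelors | 2013 | Pune | 1 | 28 | Female | No | 3 | 1 |
| 2 | Bachelors | 2014 | New Delhi | 3 | 38 | Female | No | 2 | 0 |
| 3 | Masters | 2016 | Bangalore | 3 | 27 | Male | No | 5 | 1 |
| 4 | Masters | 2017 | Pune | 3 | 24 | Male | Yes | 2 | 1 |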
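
The preview above already hints at the task: `LeaveOrNot` is a binary label, and several columns are categorical. The sketch below re-enters those five rows as CSV text and inspects them with the standard library only. Five rows are far too few to say anything about the real class balance, so this only illustrates the kind of check one might run on the full DataFrame.

```python
import csv
import io
from collections import Counter

# The five rows displayed by df.head() above, reproduced as CSV text.
sample = """Unnamed: 0,Education,JoiningYear,City,PaymentTier,Age,Gender,EverBenched,ExperienceInCurrentDomain,LeaveOrNot
0,Bachelors,2017,Bangalore,3,34,Male,No,0,0
1,Bachelors,2013,Pune,1,28,Female,No,3,1
2,Bachelors,2014,New Delhi,3,38,Female,No,2,0
3,Masters,2016,Bangalore,3,27,Male,No,5,1
4,Masters,2017,Pune,3,24,Male,Yes,2,1
"""

rows = list(csv.DictReader(io.StringIO(sample)))

# Count how often each label value occurs in the preview.
label_counts = Counter(int(row["LeaveOrNot"]) for row in rows)

# Treat any column whose first value is not an integer as categorical.
categorical = [name for name, value in rows[0].items() if not value.isdigit()]

print(dict(label_counts))
print(categorical)
```

The `Unnamed: 0` column is a leftover row index from the source CSV rather than a feature of the employees.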


In [16]:
import os
os.listdir()

['.azureml',
 '.ipynb_checkpoints',
 'automl-proj',
 'automl.ipynb',
 'automl.log',
 'automl_errors.log',
 'azureml_automl.log',
 'conda_dependencies.yml',
 'data',
 'hyperparameter_tuning.ipynb',
 'Logs',
 'outputs',
 'train.py',
 'Users',
 '__pycache__']

In [5]:
from train import clean_data
from azureml.core import Dataset
from sklearn.model_selection import train_test_split
from azureml.data.dataset_factory import TabularDatasetFactory

#Use the clean_data function to clean your dataset.
x, y = clean_data(df)
x.head()

#Split the dataset into train and test dataset. Combine x_train and y_train. 
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.2)
df_train = pd.concat([x_train,y_train], axis=1)
df_test = pd.concat([x_test,y_test], axis=1)

# Write the train and test DataFrames to CSV so they can be loaded as TabularDatasets.
try:
    os.makedirs('./data', exist_ok=True)
except OSError as error:
    print('New directory cannot be created:', error)

path_train = 'data/train.csv'
path_test = 'data/test.csv'
# index=False keeps the pandas row index out of the files, so it is not picked up as a feature.
df_train.to_csv(path_train, index=False)
df_test.to_csv(path_test, index=False)

datastore = ws.get_default_datastore()
datastore.upload(src_dir='data', target_path='data')

train_data = TabularDatasetFactory.from_delimited_files(path=[(datastore, ('data/train.csv'))])
test_data = TabularDatasetFactory.from_delimited_files(path=[(datastore, ('data/test.csv'))])
print("Successfully converted the dataset to TabularDataset format.")

"Datastore.upload" is deprecated after version 1.0.69. Please use "Dataset.File.upload_directory" to upload your files from a local directory and create FileDataset in single method call. See Dataset API change notice at https://aka.ms/dataset-deprecation.


Uploading an estimated of 2 files
Uploading data/test.csv
Uploaded data/test.csv, 1 files out of an estimated total of 2
Uploading data/train.csv
Uploaded data/train.csv, 2 files out of an estimated total of 2
Uploaded 2 files
Successfully converted the dataset to TabularDataset format.
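
The split in the cell above is not seeded, so each execution produces a different train/test partition. The sketch below shows the idea of a reproducible 80/20 split using only the standard library. It is a simplified stand-in, not scikit-learn's implementation, and the helper name `split_indices` and the seed value are hypothetical.

```python
import math
import random


def split_indices(n_rows: int, test_fraction: float = 0.2, seed: int = 42):
    """Shuffle row indices with a fixed seed and split them into train and test."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(n_rows * test_fraction)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(10)
print(len(train_idx), len(test_idx))  # 8 2
```

With scikit-learn itself, passing a fixed `random_state` to `train_test_split` gives the same kind of repeatability.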


## AutoML Configuration

TODO: Explain why you chose the AutoML settings and configuration you used below.

In [10]:
project_folder = './automl-proj'

In [20]:

from azureml.train.automl import AutoMLConfig
automl_settings = {
    "experiment_timeout_minutes": 20,
    # AmlCompute runs one iteration per node; the cluster above has max_nodes=4.
    "max_concurrent_iterations": 4,
    "primary_metric": 'AUC_weighted'
}
automl_config = AutoMLConfig(compute_target=cpu_cluster,
                             task = "classification",
                             training_data=train_data,
                             label_column_name="LeaveOrNot",   
                             path = project_folder,
                             enable_early_stopping= True,
                             featurization= 'auto',
                             debug_log = "automl_errors.log",
                             **automl_settings
                            )
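
One setting worth checking against the compute cluster is `max_concurrent_iterations`. An AmlCompute cluster runs one AutoML iteration per node, so a value above the cluster's `max_nodes` means the extra iterations wait in the queue. The sketch below is a plain-Python sanity check; the helper name `check_concurrency` is hypothetical and not part of the SDK.

```python
def check_concurrency(settings: dict, max_nodes: int) -> list:
    """Return warnings for settings that exceed what the cluster can run at once."""
    warnings = []
    concurrent = settings.get("max_concurrent_iterations", 1)
    if concurrent > max_nodes:
        warnings.append(
            f"max_concurrent_iterations={concurrent} exceeds max_nodes={max_nodes}; "
            "extra iterations will be queued"
        )
    return warnings


# The cluster in this notebook was provisioned with max_nodes=4.
print(check_concurrency({"max_concurrent_iterations": 5}, max_nodes=4))
print(check_concurrency({"max_concurrent_iterations": 4}, max_nodes=4))
```

The first call reports one warning and the second returns an empty list.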

In [22]:
# TODO: Submit your experiment
remote_run = exp.submit(automl_config,show_output=True)

ConfigException: ConfigException:
	Message: Conflicting or duplicate values are provided for arguments: [{
    "script": null,
    "arguments": [],
    "target": "AzureProject",
    "framework": "Python",
    "communicator": "None",
    "maxRunDurationSeconds": null,
    "nodeCount": 1,
    "priority": null,
    "environment": {
        "name": "default-environment",
        "version": null,
        "environmentVariables": {
            "EXAMPLE_ENV_VAR": "EXAMPLE_VALUE"
        },
        "python": {
            "userManagedDependencies": false,
            "interpreterPath": "python",
            "condaDependenciesFile": null,
            "baseCondaEnvironment": null,
            "condaDependencies": {
                "name": "project_environment",
                "dependencies": [
                    "python=3.8.13",
                    {
                        "pip": [
                            "azureml-defaults"
                        ]
                    }
                ],
                "channels": [
                    "anaconda",
                    "conda-forge"
                ]
            }
        },
        "docker": {
            "enabled": false,
            "baseImage": "mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:20230509.v1",
            "baseDockerfile": null,
            "buildContext": null,
            "sharedVolumes": true,
            "shmSize": "2g",
            "arguments": [],
            "baseImageRegistry": {
                "address": null,
                "username": null,
                "password": null,
                "registryIdentity": null
            },
            "platform": {
                "os": "Linux",
                "architecture": "amd64"
            }
        },
        "spark": {
            "repositories": [],
            "packages": [],
            "precachePackages": true
        },
        "databricks": {
            "mavenLibraries": [],
            "pypiLibraries": [],
            "rcranLibraries": [],
            "jarLibraries": [],
            "eggLibraries": []
        },
        "r": null,
        "inferencingStackVersion": null,
        "assetId": null
    },
    "history": {
        "outputCollection": true,
        "snapshotProject": true,
        "directoriesToWatch": [
            "logs"
        ]
    },
    "spark": {
        "configuration": {
            "spark.app.name": "Azure ML Experiment",
            "spark.yarn.maxAppAttempts": 1
        }
    },
    "docker": {
        "useDocker": true,
        "sharedVolumes": true,
        "arguments": [],
        "shmSize": "2g"
    },
    "hdi": {
        "yarnDeployMode": "cluster"
    },
    "tensorflow": {
        "workerCount": 1,
        "parameterServerCount": 1
    },
    "mpi": {
        "processCountPerNode": 1,
        "nodeCount": 1
    },
    "pytorch": {
        "communicationBackend": "nccl",
        "processCount": null,
        "nodeCount": 1
    },
    "paralleltask": {
        "maxRetriesPerWorker": 0,
        "workerCountPerNode": 1,
        "terminalExitCodes": null
    },
    "dataReferences": {},
    "data": {},
    "datacaches": [],
    "outputData": {},
    "sourceDirectoryDataStore": null,
    "amlcompute": {
        "vmSize": null,
        "vmPriority": null,
        "retainCluster": false,
        "name": null,
        "clusterMaxNodeCount": null
    },
    "autoClusterComputeSpecification": null,
    "kubernetescompute": {
        "instanceType": null
    },
    "credentialPassthrough": false,
    "command": "",
    "environmentVariables": {},
    "applicationEndpoints": {}
}]
	InnerException: None
	ErrorResponse 
{
    "error": {
        "code": "UserError",
        "message": "Conflicting or duplicate values are provided for arguments: [{\n    \"script\": null,\n    \"arguments\": [],\n    \"target\": \"AzureProject\",\n    \"framework\": \"Python\",\n    \"communicator\": \"None\",\n    \"maxRunDurationSeconds\": null,\n    \"nodeCount\": 1,\n    \"priority\": null,\n    \"environment\": {\n        \"name\": \"default-environment\",\n        \"version\": null,\n        \"environmentVariables\": {\n            \"EXAMPLE_ENV_VAR\": \"EXAMPLE_VALUE\"\n        },\n        \"python\": {\n            \"userManagedDependencies\": false,\n            \"interpreterPath\": \"python\",\n            \"condaDependenciesFile\": null,\n            \"baseCondaEnvironment\": null,\n            \"condaDependencies\": {\n                \"name\": \"project_environment\",\n                \"dependencies\": [\n                    \"python=3.8.13\",\n                    {\n                        \"pip\": [\n                            \"azureml-defaults\"\n                        ]\n                    }\n                ],\n                \"channels\": [\n                    \"anaconda\",\n                    \"conda-forge\"\n                ]\n            }\n        },\n        \"docker\": {\n            \"enabled\": false,\n            \"baseImage\": \"mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:20230509.v1\",\n            \"baseDockerfile\": null,\n            \"buildContext\": null,\n            \"sharedVolumes\": true,\n            \"shmSize\": \"2g\",\n            \"arguments\": [],\n            \"baseImageRegistry\": {\n                \"address\": null,\n                \"username\": null,\n                \"password\": null,\n                \"registryIdentity\": null\n            },\n            \"platform\": {\n                \"os\": \"Linux\",\n                \"architecture\": \"amd64\"\n            }\n        },\n        \"spark\": {\n            \"repositories\": [],\n            \"packages\": [],\n            
\"precachePackages\": true\n        },\n        \"databricks\": {\n            \"mavenLibraries\": [],\n            \"pypiLibraries\": [],\n            \"rcranLibraries\": [],\n            \"jarLibraries\": [],\n            \"eggLibraries\": []\n        },\n        \"r\": null,\n        \"inferencingStackVersion\": null,\n        \"assetId\": null\n    },\n    \"history\": {\n        \"outputCollection\": true,\n        \"snapshotProject\": true,\n        \"directoriesToWatch\": [\n            \"logs\"\n        ]\n    },\n    \"spark\": {\n        \"configuration\": {\n            \"spark.app.name\": \"Azure ML Experiment\",\n            \"spark.yarn.maxAppAttempts\": 1\n        }\n    },\n    \"docker\": {\n        \"useDocker\": true,\n        \"sharedVolumes\": true,\n        \"arguments\": [],\n        \"shmSize\": \"2g\"\n    },\n    \"hdi\": {\n        \"yarnDeployMode\": \"cluster\"\n    },\n    \"tensorflow\": {\n        \"workerCount\": 1,\n        \"parameterServerCount\": 1\n    },\n    \"mpi\": {\n        \"processCountPerNode\": 1,\n        \"nodeCount\": 1\n    },\n    \"pytorch\": {\n        \"communicationBackend\": \"nccl\",\n        \"processCount\": null,\n        \"nodeCount\": 1\n    },\n    \"paralleltask\": {\n        \"maxRetriesPerWorker\": 0,\n        \"workerCountPerNode\": 1,\n        \"terminalExitCodes\": null\n    },\n    \"dataReferences\": {},\n    \"data\": {},\n    \"datacaches\": [],\n    \"outputData\": {},\n    \"sourceDirectoryDataStore\": null,\n    \"amlcompute\": {\n        \"vmSize\": null,\n        \"vmPriority\": null,\n        \"retainCluster\": false,\n        \"name\": null,\n        \"clusterMaxNodeCount\": null\n    },\n    \"autoClusterComputeSpecification\": null,\n    \"kubernetescompute\": {\n        \"instanceType\": null\n    },\n    \"credentialPassthrough\": false,\n    \"command\": \"\",\n    \"environmentVariables\": {},\n    \"applicationEndpoints\": {}\n}]",
        "details_uri": "https://aka.ms/AutoMLConfig",
        "inner_error": {
            "code": "BadArgument",
            "inner_error": {
                "code": "ArgumentMismatch",
                "inner_error": {
                    "code": "ConflictingValueForArguments"
                }
            }
        }
    }
}
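
The useful part of this error is the chain of nested codes at the end, not the full run-configuration dump. The sketch below walks that chain using the structure shown in the `ErrorResponse` above, with the long `message` field left out. The helper name `error_code_chain` is hypothetical. The output does not by itself say which arguments conflict; the `details_uri` points to the `AutoMLConfig` documentation for that.

```python
import json

# The ErrorResponse printed above, without the long "message" field.
error_response = json.loads("""
{
    "error": {
        "code": "UserError",
        "details_uri": "https://aka.ms/AutoMLConfig",
        "inner_error": {
            "code": "BadArgument",
            "inner_error": {
                "code": "ArgumentMismatch",
                "inner_error": {
                    "code": "ConflictingValueForArguments"
                }
            }
        }
    }
}
""")


def error_code_chain(response: dict) -> list:
    """Collect the code of the top-level error and of every nested inner_error."""
    codes = []
    node = response.get("error")
    while isinstance(node, dict):
        if "code" in node:
            codes.append(node["code"])
        node = node.get("inner_error")
    return codes


print(" > ".join(error_code_chain(error_response)))
# UserError > BadArgument > ArgumentMismatch > ConflictingValueForArguments
```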

## Run Details

OPTIONAL: Write about the different models trained and their performance. Why do you think some models did better than others?

TODO: In the cell below, use the `RunDetails` widget to show the different experiments.

In [27]:
from azureml.widgets import RunDetails

#Launch the widget to view the progress and results
RunDetails(remote_run).show()

_AutoMLWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', 's…

## Best Model

TODO: In the cell below, get the best model from the automl experiments and display all the properties of the model.



In [30]:
# Retrieve and save your best automl model.

best_run, best_model=remote_run.get_output()
print(best_run)
print(best_model)



Package:azureml-automl-runtime, training version:1.52.0.post1, current version:1.51.0.post1
Package:azureml-core, training version:1.52.0, current version:1.51.0
Package:azureml-dataprep, training version:4.11.4, current version:4.10.8
Package:azureml-dataprep-rslex, training version:2.18.4, current version:2.17.12
Package:azureml-dataset-runtime, training version:1.52.0, current version:1.51.0
Package:azureml-defaults, training version:1.52.0, current version:1.51.0
Package:azureml-interpret, training version:1.52.0, current version:1.51.0
Package:azureml-mlflow, training version:1.52.0, current version:1.51.0
Package:azureml-pipeline-core, training version:1.52.0, current version:1.51.0
Package:azureml-responsibleai, training version:1.52.0, current version:1.51.0
Package:azureml-telemetry, training version:1.52.0, current version:1.51.0
Package:azureml-train-automl-client, training version:1.52.0, current version:1.51.0.post1
Package:azureml-train-automl-runtime, training version:1.

Run(Experiment: Hyperparameter_Tuning,
Id: AutoML_25eb3f61-90cb-4cf4-af0c-4906dee69a17_52,
Type: azureml.scriptrun,
Status: Completed)
Pipeline(memory=None,
         steps=[('datatransformer',
                 DataTransformer(enable_dnn=False, enable_feature_sweeping=True, feature_sweeping_config={}, feature_sweeping_timeout=86400, featurization_config=None, force_text_dnn=False, is_cross_validation=True, is_onnx_compatible=False, observer=None, task='classification', working_dir='/mnt/batch/tasks/shared/LS_root/mount...
                 PreFittedSoftVotingClassifier(classification_labels=array([0, 1]), estimators=[('33', Pipeline(memory=None, steps=[('standardscalerwrapper', StandardScalerWrapper(copy=True, with_mean=False, with_std=False)), ('xgboostclassifier', XGBoostClassifier(booster='gbtree', colsample_bytree=0.9, eta=0.01, gamma=0, max_depth=7, max_leaves=7, n_estimators=800, n_jobs=1, objective='reg:logistic', problem_info=ProblemInfo(gpu_training_param_dict={'processing_unit_

In [31]:
print("Best run metrics :", best_run.get_metrics())
print("Best accuracy run metrics ",best_run.get_metrics(name='accuracy'))
print("Best run details :", best_run.get_details())

Best run metrics : {'average_precision_score_macro': 0.8831375108888908, 'AUC_weighted': 0.8761453479470295, 'recall_score_weighted': 0.8525005848561253, 'AUC_micro': 0.9137181843490415, 'weighted_accuracy': 0.8890910063766579, 'matthews_correlation': 0.666157157429459, 'norm_macro_recall': 0.6159719281928998, 'f1_score_weighted': 0.8467201197832311, 'recall_score_macro': 0.8079859640964498, 'f1_score_macro': 0.825046743567515, 'accuracy': 0.8525005848561253, 'AUC_macro': 0.8761453479470296, 'log_loss': 0.40253915666317003, 'precision_score_weighted': 0.8555107590657421, 'precision_score_micro': 0.8525005848561253, 'f1_score_micro': 0.8525005848561253, 'average_precision_score_micro': 0.9124892793780601, 'balanced_accuracy': 0.8079859640964498, 'average_precision_score_weighted': 0.8915173618215736, 'precision_score_macro': 0.8603228383005931, 'recall_score_micro': 0.8525005848561253, 'accuracy_table': 'aml://artifactId/ExperimentRun/dcid.AutoML_25eb3f61-90cb-4cf4-af0c-4906dee69a17_52/
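
The metrics dictionary mixes scalar scores with artifact references such as `accuracy_table`. The sketch below copies a few entries from the output above and keeps only the numeric ones, rounded for display. The helper name `scalar_metrics` is hypothetical, and the `accuracy_table` value is a placeholder because the original output is truncated.

```python
# A few entries copied from the best_run.get_metrics() output above.
metrics = {
    "AUC_weighted": 0.8761453479470295,
    "accuracy": 0.8525005848561253,
    "f1_score_weighted": 0.8467201197832311,
    "log_loss": 0.40253915666317003,
    "accuracy_table": "aml://artifactId/...",  # placeholder: truncated in the output above
}


def scalar_metrics(metrics: dict, ndigits: int = 4) -> dict:
    """Keep numeric metrics only, rounded to ndigits decimal places."""
    return {
        name: round(value, ndigits)
        for name, value in metrics.items()
        if isinstance(value, (int, float)) and not isinstance(value, bool)
    }


for name, value in sorted(scalar_metrics(metrics).items()):
    print(f"{name:<20} {value}")
```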

In [33]:
#TODO: Save the best model
best_run.get_file_names()

['accuracy_table',
 'automl_driver.py',
 'confusion_matrix',
 'explanation/12d42c59/classes.interpret.json',
 'explanation/12d42c59/eval_data_viz.interpret.json',
 'explanation/12d42c59/expected_values.interpret.json',
 'explanation/12d42c59/features.interpret.json',
 'explanation/12d42c59/global_names/0.interpret.json',
 'explanation/12d42c59/global_rank/0.interpret.json',
 'explanation/12d42c59/global_values/0.interpret.json',
 'explanation/12d42c59/local_importance_values.interpret.json',
 'explanation/12d42c59/per_class_names/0.interpret.json',
 'explanation/12d42c59/per_class_rank/0.interpret.json',
 'explanation/12d42c59/per_class_values/0.interpret.json',
 'explanation/12d42c59/rich_metadata.interpret.json',
 'explanation/12d42c59/true_ys_viz.interpret.json',
 'explanation/12d42c59/visualization_dict.interpret.json',
 'explanation/12d42c59/ys_pred_proba_viz.interpret.json',
 'explanation/12d42c59/ys_pred_viz.interpret.json',
 'explanation/2f1cefba/classes.interpret.json',
 'expl

In [34]:
import joblib
from azureml.core.model import Model

#Save the best model
os.makedirs('results', exist_ok=True)
joblib.dump(best_model, filename="results/automl_model.pkl")
model = remote_run.register_model(model_name=best_run.properties['model_name'], description='Best AutoML model')
print("Model saved successfully")

Model saved successfully


## Model Deployment

Remember that you only have to deploy one of the two models you trained, but you still need to register both. Perform the steps in the rest of this notebook only if you wish to deploy this model.

TODO: In the cell below, register the model, create an inference config and deploy the model as a web service.

TODO: In the cell below, send a request to the web service you deployed to test it.

TODO: In the cell below, print the logs of the web service and delete the service.
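
For the request step, a test call usually amounts to posting a JSON body to the service's scoring URI. The sketch below uses only the standard library and rests on several assumptions: the top-level `"data"` key is a guess at the request schema, which depends on the scoring script; the field names come from the raw dataset preview, while the deployed model expects whatever columns `clean_data` in `train.py` produces; and `build_payload`, `send_request`, `scoring_uri`, and `key` are hypothetical names. `send_request` is defined but not called, because no service has been deployed in this notebook.

```python
import json
import urllib.request


def build_payload(records: list) -> bytes:
    """Serialize records under a top-level "data" key (assumed request schema)."""
    return json.dumps({"data": records}).encode("utf-8")


def send_request(scoring_uri: str, body: bytes, key: str = "") -> str:
    """POST the payload to the endpoint and return the response text."""
    headers = {"Content-Type": "application/json"}
    if key:
        # Assumption: key-based authentication passes the key as a bearer token.
        headers["Authorization"] = f"Bearer {key}"
    request = urllib.request.Request(scoring_uri, data=body, headers=headers, method="POST")
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


# One raw row from the dataset preview, without the index and label columns.
record = {
    "Education": "Bachelors",
    "JoiningYear": 2017,
    "City": "Bangalore",
    "PaymentTier": 3,
    "Age": 34,
    "Gender": "Male",
    "EverBenched": "No",
    "ExperienceInCurrentDomain": 0,
}

body = build_payload([record])
print(body.decode("utf-8"))
```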

**Submission Checklist**
- I have registered the model.
- I have deployed the model with the best accuracy as a web service.
- I have tested the web service by sending a request to the model endpoint.
- I have deleted the web service and shut down all the computes that I have used.
- I have taken a screenshot showing the model endpoint as active.
- The project includes a file containing the environment details.
