# Automated ML

TODO: Import Dependencies. In the cell below, import all the dependencies that you will need to complete the project.

In [1]:
from azureml.core import Workspace, Dataset
from azureml.core.experiment import Experiment
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException
from azureml.train.automl import AutoMLConfig

from sklearn.model_selection import StratifiedKFold
from azureml.core.environment import Environment
from azureml.widgets import RunDetails

## Dataset

### Overview

For this project we use the **Mobile Price Classification** dataset ([data source](https://www.kaggle.com/iabhishekofficial/mobile-price-classification?select=train.csv))
from Kaggle. Kaggle describes it as follows:

```
Bob has started his own mobile company. He wants to give tough fight to big companies like Apple,Samsung etc.

He does not know how to estimate price of mobiles his company creates. In this competitive mobile phone market you cannot simply assume things. To solve this problem he collects sales data of mobile phones of various companies.

Bob wants to find out some relation between features of a mobile phone(eg:- RAM,Internal Memory etc) and its selling price. But he is not so good at Machine Learning. So he needs your help to solve this problem.

In this problem you do not have to predict actual price but a price range indicating how high the price is.
```

We are using the *train.csv* file.

### Task
*TODO*: Explain the task you are going to be solving with this dataset and the features you will be using for it.

As described above, we use technical characteristics of mobile phones
to classify their price into one of four ranges, labeled 0 to 3. Each phone belongs
to exactly one range, so this is a multiclass classification problem.

The features available are the following:

* **battery_power**: Total energy the battery can store on a single charge, measured in mAh.

* **blue**: Has bluetooth or not.

* **clock_speed**: Speed at which the microprocessor executes instructions.

* **dual_sim**: Has dual sim support or not.

* **fc**: Front Camera mega pixels.

* **four_g**: Has 4G or not.

* **int_memory**: Internal Memory in Gigabytes.

* **m_dep**: Mobile Depth in cm.

* **mobile_wt**: Weight of mobile phone.

* **n_cores**: Number of cores of processor.

* **pc**: Primary Camera mega pixels.

* **px_height**: Pixel Resolution Height.

* **px_width**: Pixel Resolution Width.

* **ram**: Random Access Memory in Mega Bytes.

* **sc_h**: Screen Height of mobile in cm.

* **sc_w**: Screen Width of mobile in cm.

* **talk_time**: Longest time that a single battery charge will last when you are talking.

* **three_g**: Has 3G or not.

* **touch_screen**: Has touch screen or not.

* **wifi**: Has wifi or not.

* **price_range**: This is the target variable with value of 0 (low cost), 1 (medium cost), 2 (high cost) and 3 (very high cost).


The training set has a balanced target, i.e., each class has almost the same representation. This matters because a balanced target makes it easier to train a general model and to judge it with a standard metric such as accuracy.
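
One quick way to check this balance is to count how often each `price_range` value appears. The sketch below is only illustrative: `class_shares` is a hypothetical helper name, and the short label list stands in for the real `price_range` column (the values are made up, not taken from *train.csv*).

```python
from collections import Counter


def class_shares(labels):
    """Return the fraction of rows that belongs to each class."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: counts[cls] / total for cls in sorted(counts)}


# Hypothetical labels standing in for the price_range column (not real data).
example_labels = [0, 1, 2, 3, 0, 1, 2, 3]
shares = class_shares(example_labels)
print(shares)  # {0: 0.25, 1: 0.25, 2: 0.25, 3: 0.25}
```

Once the dataset has been loaded into a pandas DataFrame `df`, calling `class_shares(df['price_range'].tolist())` should give the same kind of summary for the real data.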

TODO: Get data. In the cell below, write code to access the data you will be using in this project. Remember that the dataset needs to be external.

In [2]:
ws = Workspace.from_config()

# choose a name for experiment
experiment_name = 'automl-mobile'
project_folder = './automl-mobile-udacity'

experiment=Experiment(ws, experiment_name)

In the following cell, the registered dataset is loaded with the code snippet provided in the *Consume* tab of the Datasets section in ML Studio.

In [3]:
# azureml-core of version 1.0.72 or higher is required
# azureml-dataprep[pandas] of version 1.1.34 or higher is required

dataset = Dataset.get_by_name(ws, name='mobile_prices')
df = dataset.to_pandas_dataframe()
df.head(5)

Unnamed: 0,battery_power,blue,clock_speed,dual_sim,fc,four_g,int_memory,m_dep,mobile_wt,n_cores,...,px_height,px_width,ram,sc_h,sc_w,talk_time,three_g,touch_screen,wifi,price_range
0,842,0,2.2,0,1,0,7,0.6,188,2,...,20,756,2549,9,7,19,0,0,1,1
1,1021,1,0.5,1,0,1,53,0.7,136,3,...,905,1988,2631,17,3,7,1,1,0,2
2,563,1,0.5,1,2,1,41,0.9,145,5,...,1263,1716,2603,11,2,9,1,1,0,2
3,615,1,2.5,0,0,0,10,0.8,131,6,...,1216,1786,2769,16,8,11,1,0,0,2
4,1821,1,1.2,0,13,1,44,0.6,141,2,...,1208,1212,1411,8,2,15,1,1,0,1


In [4]:
df.shape

(2000, 21)

## AutoML Configuration
* **experiment_timeout_minutes**: Maximum amount of time, in minutes, that all iterations combined can take before the experiment terminates. Here it is set to 20 so that enough time is left for the remaining steps of the project.
* **max_concurrent_iterations**: Maximum number of iterations executed in parallel. The default value is 1. This value is bounded by the maximum number of nodes chosen for the compute target. Our cluster has at most 4 nodes, so we set max_concurrent_iterations to 4 as well.
* **primary_metric**: The metric that Automated Machine Learning optimizes for model selection. Automated Machine Learning collects more metrics than it can optimize. Since we have a balanced target in a multiclass classification problem, 'accuracy' is a reasonable choice.
* **compute_target**: The Azure Machine Learning compute target to run the Automated Machine Learning experiment on (see Figure 6).
* **task**: The type of task to run. Values can be 'classification', 'regression', or 'forecasting', depending on the type of automated ML problem to solve. In this case it is 'classification'.
* **training_data**: The training data to be used within the experiment. It should contain both training features and a label column (optionally a sample weights column). If training_data is specified, then the label_column_name parameter must also be specified.
* **label_column_name**: The name of the label column. If the input data is from a pandas.DataFrame which doesn't have column names, column indices can be used instead, expressed as integers. In this case, the column in *training_data* that we want to predict is 'price_range'.
* **path**: The full path to the Azure Machine Learning project folder. If not specified, the default is to use the current directory or ".". Our project folder is called './automl-mobile-udacity'.
* **n_cross_validations**: How many cross validations to perform when user validation data is not specified. We use 5 folds, so with 2000 rows each fold holds out 20% of the data (400 rows) for validation.
* **enable_early_stopping**: Whether to enable early termination if the score is not improving in the short term. The default is False; we set it to True.

    * Default behavior for stopping criteria:

        1. If iteration and experiment timeout are not specified, then early stopping is turned on and experiment_timeout = 6 days, num_iterations = 1000.

        2. If experiment timeout is specified, then early_stopping = off, num_iterations = 1000.

    * Early stopping logic:

        1. No early stopping for first 20 iterations (landmarks).

        2. Early stopping window starts on the 21st iteration and looks for early_stopping_n_iters iterations (currently set to 10). This means that the first iteration where stopping can occur is the 31st.

        3. AutoML still schedules 2 ensemble iterations AFTER early stopping, which might result in higher scores.

        4. Early stopping is triggered if the absolute value of best score calculated is the same for past early_stopping_n_iters iterations, that is, if there is no improvement in score for early_stopping_n_iters iterations.

* **featurization**: 'auto' / 'off' / FeaturizationConfig. Indicates whether the featurization step should be done automatically, skipped, or replaced by a customized featurization. Note: If the input data is sparse, featurization cannot be turned on.

    Column type is automatically detected. Based on the detected column type preprocessing/featurization is done as follows:

    * Categorical: Target encoding, one hot encoding, drop high cardinality categories, impute missing values.

    * Numeric: Impute missing values, cluster distance, weight of evidence.

    * DateTime: Several features such as day, seconds, minutes, hours etc.

    * Text: Bag of words, pre-trained Word embedding, text target encoding.

    More details can be found in the article [Configure automated ML experiments in Python](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-configure-auto-train#data-featurization).

    No custom featurization is needed for this dataset, so we keep it as 'auto'.
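
The early stopping logic described above can be sketched in a few lines. This is a simplified reading of those rules and not the AutoML implementation: `early_stop_iteration` is a hypothetical helper, it assumes that a higher score is better, and it ignores the two ensemble iterations that AutoML schedules after stopping.

```python
def early_stop_iteration(scores, landmarks=20, n_iters=10):
    """Return the 1-based iteration at which stopping would trigger, or None.

    scores: primary-metric value of each iteration, in order (higher is better).
    """
    best = float("-inf")
    stale = 0  # iterations without a new best score since the window opened
    for i, score in enumerate(scores, start=1):
        improved = score > best
        if improved:
            best = score
        if i <= landmarks + 1:
            # Landmark iterations, plus the iteration that opens the window,
            # never count toward stopping.
            continue
        stale = 0 if improved else stale + 1
        if stale >= n_iters:
            return i
    return None


# A score that never improves stops at the earliest possible iteration.
print(early_stop_iteration([0.90] * 40))  # 31

# An improvement at iteration 25 restarts the count of stale iterations.
print(early_stop_iteration([0.90] * 24 + [0.95] * 30))  # 35
```

With the defaults used here (20 landmarks and a window of 10), the earliest possible stop is the 31st iteration, which matches the description above.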

In [5]:
cpu_cluster_name='automl-mobiles'

# Reuse the cluster if it already exists; otherwise create it
try:
    cpu_cluster = ComputeTarget(workspace=ws, name=cpu_cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2',
                                                            max_nodes=4)
    cpu_cluster = ComputeTarget.create(ws, cpu_cluster_name, compute_config)

cpu_cluster.wait_for_completion(show_output=True, min_node_count = 1, timeout_in_minutes = 10)

Found existing cluster, use it.
Succeeded....................................................................................................................
AmlCompute wait for completion finished

Wait timeout has been reached
Current provisioning state of AmlCompute is "Succeeded" and current node count is "0"


In [36]:
!pip install xgboost==0.90



In [7]:
automl_settings = {
    "experiment_timeout_minutes": 20,
    "max_concurrent_iterations": 4,
    "primary_metric" : 'accuracy'
}
project_folder = './automl-mobile-udacity'
automl_config = AutoMLConfig(compute_target=cpu_cluster,
                             task = "classification",
                             training_data=dataset,
                             label_column_name="price_range",   
                             path = project_folder,
                             n_cross_validations=5,
                             enable_early_stopping= True,
                             featurization= 'auto',
                             debug_log = "automl_errors.log",
                             **automl_settings
                            )

In [8]:
# TODO: Submit your experiment
remote_run = experiment.submit(automl_config)

RunDetails(remote_run).show()

Running on remote.


_AutoMLWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', 's…

## Run Details


In [None]:
RunDetails(remote_run).show()

In [9]:
remote_run.wait_for_completion(show_output=True)


Current status: FeaturesGeneration. Generating features for the dataset.
Current status: DatasetFeaturization. Beginning to fit featurizers and featurize the dataset.
Current status: DatasetCrossValidationSplit. Generating individually featurized CV splits.
Current status: ModelSelection. Beginning model selection.

****************************************************************************************************
DATA GUARDRAILS: 

TYPE:         Class balancing detection
STATUS:       PASSED
DESCRIPTION:  Your inputs were analyzed, and all classes are balanced in your training data.
              Learn more about imbalanced data: https://aka.ms/AutomatedMLImbalancedData

****************************************************************************************************

TYPE:         Missing feature values imputation
STATUS:       PASSED
DESCRIPTION:  No feature missing values were detected in the training data.
              Learn more about missing value imputation: https://aka.m

{'runId': 'AutoML_2e3ecec9-8b98-46e6-8caf-ebe839cb45fc',
 'target': 'automl-mobiles',
 'status': 'Completed',
 'startTimeUtc': '2021-03-21T18:21:04.971522Z',
 'endTimeUtc': '2021-03-21T18:46:39.788851Z',
 'properties': {'num_iterations': '1000',
  'training_type': 'TrainFull',
  'acquisition_function': 'EI',
  'primary_metric': 'accuracy',
  'train_split': '0',
  'acquisition_parameter': '0',
  'num_cross_validation': '5',
  'target': 'automl-mobiles',
  'DataPrepJsonString': '{\\"training_data\\": \\"{\\\\\\"blocks\\\\\\": [{\\\\\\"id\\\\\\": \\\\\\"49700a5d-9250-4716-b833-76c1566a6551\\\\\\", \\\\\\"type\\\\\\": \\\\\\"Microsoft.DPrep.GetDatastoreFilesBlock\\\\\\", \\\\\\"arguments\\\\\\": {\\\\\\"datastores\\\\\\": [{\\\\\\"datastoreName\\\\\\": \\\\\\"workspaceblobstore\\\\\\", \\\\\\"path\\\\\\": \\\\\\"UI/03-21-2021_050026_UTC/train.csv\\\\\\", \\\\\\"resourceGroup\\\\\\": \\\\\\"aml-quickstarts-141054\\\\\\", \\\\\\"subscription\\\\\\": \\\\\\"f5091c60-1c3c-430f-8d81-d802f6bf241

## Best Model



In [55]:
remote_run.experiment

Name,Workspace,Report Page,Docs Page
automl-mobile,quick-starts-ws-141054,Link to Azure Machine Learning studio,Link to Documentation


In [7]:
# from azureml.core.run import get_run
# run = get_run(experiment, run_id='AutoML_2e3ecec9-8b98-46e6-8caf-ebe839cb45fc')
best_run, fitted_model = remote_run.get_output()
print(fitted_model.steps)

[('datatransformer', DataTransformer(enable_dnn=None, enable_feature_sweeping=None,
                feature_sweeping_config=None, feature_sweeping_timeout=None,
                featurization_config=None, force_text_dnn=None,
                is_cross_validation=None, is_onnx_compatible=None, logger=None,
                observer=None, task=None, working_dir=None)), ('prefittedsoftvotingclassifier', PreFittedSoftVotingClassifier(classification_labels=None,
                              estimators=[('6',
                                           Pipeline(memory=None,
                                                    steps=[('sparsenormalizer',
                                                            <azureml.automl.runtime.shared.model_wrappers.SparseNormalizer object at 0x7f50b2616a20>),
                                                           ('xgboostclassifier',
                                                            XGBoostClassifier(base_score=0.5,
                      

In [40]:
from pprint import pprint

def print_model(model, prefix=""):
    for step in model.steps:
        print(prefix + step[0])
        if hasattr(step[1], 'estimators') and hasattr(step[1], 'weights'):
            pprint({'estimators': list(
                e[0] for e in step[1].estimators), 'weights': step[1].weights})
            print()
            for estimator in step[1].estimators:
                print_model(estimator[1], estimator[0] + ' - ')
        else:
            pprint(step[1].get_params())
            print()

print_model(fitted_model)

datatransformer
{'enable_dnn': None,
 'enable_feature_sweeping': None,
 'feature_sweeping_config': None,
 'feature_sweeping_timeout': None,
 'featurization_config': None,
 'force_text_dnn': None,
 'is_cross_validation': None,
 'is_onnx_compatible': None,
 'logger': None,
 'observer': None,
 'task': None,
 'working_dir': None}

prefittedsoftvotingclassifier
{'estimators': ['6', '16', '23', '0', '30', '20', '1', '25', '26', '4'],
 'weights': [0.06666666666666667,
             0.06666666666666667,
             0.06666666666666667,
             0.13333333333333333,
             0.06666666666666667,
             0.06666666666666667,
             0.13333333333333333,
             0.13333333333333333,
             0.13333333333333333,
             0.13333333333333333]}

6 - sparsenormalizer
{'copy': True, 'norm': 'l2'}

6 - xgboostclassifier
{'base_score': 0.5,
 'booster': 'gbtree',
 'colsample_bylevel': 1,
 'colsample_bynode': 1,
 'colsample_bytree': 0.5,
 'eta': 0.1,
 'gamma': 0,
 'learning
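
The best model is a soft voting ensemble: the printed weights (five estimators at about 0.067 and five at about 0.133) sum to 1. As a rough illustration of how weighted soft voting works in general, the sketch below averages class probabilities with weights and picks the most likely class. It is not the AutoML code: `soft_vote` is a hypothetical helper, and the three probability rows and their weights are made-up values.

```python
def soft_vote(probabilities, weights):
    """Weighted average of per-estimator class probabilities, then argmax."""
    total = sum(weights)
    n_classes = len(probabilities[0])
    combined = [
        sum(w * p[c] for w, p in zip(weights, probabilities)) / total
        for c in range(n_classes)
    ]
    predicted = max(range(n_classes), key=lambda c: combined[c])
    return predicted, combined


# Hypothetical probabilities from three estimators for one phone,
# over the price ranges 0..3 (made-up values).
probabilities = [
    [0.10, 0.60, 0.20, 0.10],
    [0.05, 0.30, 0.55, 0.10],
    [0.10, 0.50, 0.30, 0.10],
]
weights = [1, 2, 2]  # the second and third estimators count twice as much

label, combined = soft_vote(probabilities, weights)
print(label)  # 1
```

In this example the second estimator prefers class 2, but the weighted average still favors class 1, because the other two estimators assign it a high probability.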

In [8]:
# Retrieve the metrics of the best AutoML run and register (save) its model.

best_automl_run_metrics = best_run.get_metrics()

print('Best Run Id: ', best_run.id)
print('\n Accuracy: ', best_automl_run_metrics['accuracy'])

# Save model
print('\n SAVE MODEL...')
final_automl_model = best_run.register_model(model_name = 'automl-mobile', model_path = '/outputs/model.pkl', description='Best Model AutoML for mobile classification dataset')
print('\n SAVE MODEL...')

Best Run Id:  AutoML_2e3ecec9-8b98-46e6-8caf-ebe839cb45fc_36

 Accuracy:  0.9365

 SAVE MODEL...

 SAVE MODEL...


## Model Deployment

Remember that you have to deploy only one of the two models you trained. Perform the steps in the rest of this notebook only if you wish to deploy this model.

TODO: In the cell below, register the model, create an inference config and deploy the model as a web service.

In [34]:
env = best_run.get_environment()
env.save_to_directory('dependencies')
env.load_from_directory('dependencies')

{
    "databricks": {
        "eggLibraries": [],
        "jarLibraries": [],
        "mavenLibraries": [],
        "pypiLibraries": [],
        "rcranLibraries": []
    },
    "docker": {
        "arguments": [],
        "baseDockerfile": null,
        "baseImage": "mcr.microsoft.com/azureml/intelmpi2018.3-ubuntu16.04:20210301.v1",
        "baseImageRegistry": {
            "address": null,
            "password": null,
            "registryIdentity": null,
            "username": null
        },
        "enabled": true,
        "platform": {
            "architecture": "amd64",
            "os": "Linux"
        },
        "sharedVolumes": true,
        "shmSize": null
    },
    "environmentVariables": {
        "EXAMPLE_ENV_VAR": "EXAMPLE_VALUE"
    },
    "inferencingStackVersion": null,
    "name": "AutoML-AzureML-AutoML",
    "python": {
        "baseCondaEnvironment": null,
        "condaDependencies": {
            "channels": [
                "anaconda",
                "cond

In [30]:
%%time
from azureml.core.webservice import Webservice
from azureml.core.model import InferenceConfig
from azureml.core.model import Model

model = Model(ws, 'automl-mobile')

inference_config = InferenceConfig(entry_script="scoring_file_v_1_0_0.py", environment=env)
from azureml.core.webservice import AciWebservice

aciconfig = AciWebservice.deploy_configuration(cpu_cores=1, 
                                               memory_gb=1, 
                                               description='Predict mobile prices')
service = Model.deploy(workspace=ws, 
                       name='automl-mobile-sdk-4', 
                       models=[model], 
                       inference_config=inference_config, 
                       deployment_config=aciconfig)

service.wait_for_deployment(show_output=True)

Tips: You can try get_logs(): https://aka.ms/debugimage#dockerlog or local deployment: https://aka.ms/debugimage#debug-locally to debug if deployment takes longer than 10 minutes.
Running
2021-03-21 20:01:41+00:00 Creating Container Registry if not exists.
2021-03-21 20:01:41+00:00 Registering the environment.
2021-03-21 20:01:42+00:00 Use the existing image.
2021-03-21 20:01:42+00:00 Generating deployment configuration.
2021-03-21 20:01:43+00:00 Submitting deployment to compute..
2021-03-21 20:01:49+00:00 Checking the status of deployment automl-mobile-sdk-4..
2021-03-21 20:05:07+00:00 Checking the status of inference endpoint automl-mobile-sdk-4.
Succeeded
ACI service creation operation finished, operation "Succeeded"
CPU times: user 521 ms, sys: 75.8 ms, total: 597 ms
Wall time: 3min 49s


In [25]:
import pandas as pd
df_test = pd.read_csv('test.csv')
df_test.drop(columns='id', inplace=True)
df_test.head()

Unnamed: 0,battery_power,blue,clock_speed,dual_sim,fc,four_g,int_memory,m_dep,mobile_wt,n_cores,pc,px_height,px_width,ram,sc_h,sc_w,talk_time,three_g,touch_screen,wifi
0,1043,1,1.8,1,14,0,5,0.1,193,3,16,226,1412,3476,12,7,2,0,1,0
1,841,1,0.5,1,4,1,61,0.8,191,5,12,746,857,3895,6,0,7,1,0,0
2,1807,1,2.8,0,1,0,27,0.9,186,3,4,1270,1366,2396,17,10,10,0,1,1
3,1546,0,0.5,1,18,1,25,0.5,96,8,20,295,1752,3893,10,0,7,1,1,0
4,1434,0,1.4,0,11,1,49,0.5,108,6,18,749,810,1773,15,8,7,1,0,1


In [26]:
sample = list(df_test.iloc[0:10, :].to_dict('index').values())
sample

[{'battery_power': 1043,
  'blue': 1,
  'clock_speed': 1.8,
  'dual_sim': 1,
  'fc': 14,
  'four_g': 0,
  'int_memory': 5,
  'm_dep': 0.1,
  'mobile_wt': 193,
  'n_cores': 3,
  'pc': 16,
  'px_height': 226,
  'px_width': 1412,
  'ram': 3476,
  'sc_h': 12,
  'sc_w': 7,
  'talk_time': 2,
  'three_g': 0,
  'touch_screen': 1,
  'wifi': 0},
 {'battery_power': 841,
  'blue': 1,
  'clock_speed': 0.5,
  'dual_sim': 1,
  'fc': 4,
  'four_g': 1,
  'int_memory': 61,
  'm_dep': 0.8,
  'mobile_wt': 191,
  'n_cores': 5,
  'pc': 12,
  'px_height': 746,
  'px_width': 857,
  'ram': 3895,
  'sc_h': 6,
  'sc_w': 0,
  'talk_time': 7,
  'three_g': 1,
  'touch_screen': 0,
  'wifi': 0},
 {'battery_power': 1807,
  'blue': 1,
  'clock_speed': 2.8,
  'dual_sim': 0,
  'fc': 1,
  'four_g': 0,
  'int_memory': 27,
  'm_dep': 0.9,
  'mobile_wt': 186,
  'n_cores': 3,
  'pc': 4,
  'px_height': 1270,
  'px_width': 1366,
  'ram': 2396,
  'sc_h': 17,
  'sc_w': 10,
  'talk_time': 10,
  'three_g': 0,
  'touch_screen': 1,
 

In [31]:
import requests
import json

# URL for the web service, should be similar to:
# 'http://8530a665-66f3-49c8-a953-b82a2d312917.eastus.azurecontainer.io/score'
scoring_uri = 'http://7ad2a469-bb4e-4dbd-bc92-76496d925381.southcentralus.azurecontainer.io/score'
# If the service is authenticated, set the key or token
# key = 'zcdL9IVlIn5Gb6yCEAZ0NrBapBkOQvbw'

# Ten rows of data to score, so we get ten results back
data = {"data": sample}

# Convert to JSON string
input_data = json.dumps(data)
with open("data.json", "w") as _f:
    _f.write(input_data)

# Set the content type
headers = {'Content-Type': 'application/json'}
# If authentication is enabled, set the authorization header
# headers['Authorization'] = f'Bearer {key}'

# Make the request and display the response
resp = requests.post(scoring_uri, input_data, headers=headers)
print(resp.json())

{"result": [3, 3, 2, 3, 1, 3, 3, 1, 3, 0]}
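
The service returns one class index per row. To make the response easier to read, the indices can be mapped back to the price range names from the feature list. The sketch below is an assumption-laden helper: `describe_predictions` and `PRICE_RANGE_NAMES` are hypothetical names. The double quotes in the printed output above suggest that the response may arrive as a JSON-encoded string, so the helper accepts both a plain JSON object and a string that needs one more decoding step.

```python
import json

# Names follow the description of the price_range target variable.
PRICE_RANGE_NAMES = {
    0: "low cost",
    1: "medium cost",
    2: "high cost",
    3: "very high cost",
}


def describe_predictions(response_text):
    """Parse the service response and map each class index to its name."""
    payload = json.loads(response_text)
    # Some scoring scripts return a JSON-encoded string; decode it once more.
    if isinstance(payload, str):
        payload = json.loads(payload)
    return [PRICE_RANGE_NAMES[int(label)] for label in payload["result"]]


# The response body shown above, used here as a plain string.
response_text = '{"result": [3, 3, 2, 3, 1, 3, 3, 1, 3, 0]}'
names = describe_predictions(response_text)
print(names[:3])  # ['very high cost', 'very high cost', 'high cost']
```

With the request from the cell above, `describe_predictions(resp.text)` would be the equivalent call.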


TODO: In the cell below, print the logs of the web service and delete the service

In [48]:
print(service.get_logs())

2021-03-20T19:13:50,926065000+00:00 - iot-server/run 
2021-03-20T19:13:50,933904600+00:00 - gunicorn/run 
2021-03-20T19:13:50,936763000+00:00 - rsyslog/run 
2021-03-20T19:13:50,970808000+00:00 - nginx/run 
/usr/sbin/nginx: /azureml-envs/azureml_661474bbe74e96b5d8added5888dfc85/lib/libcrypto.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml-envs/azureml_661474bbe74e96b5d8added5888dfc85/lib/libcrypto.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml-envs/azureml_661474bbe74e96b5d8added5888dfc85/lib/libssl.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml-envs/azureml_661474bbe74e96b5d8added5888dfc85/lib/libssl.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml-envs/azureml_661474bbe74e96b5d8added5888dfc85/lib/libssl.so.1.0.0: no version information available (required by /usr/sbin/nginx)
rsyslogd

In [49]:
cpu_cluster.delete()
service.delete()

Current provisioning state of AmlCompute is "Deleting"

