## Azure AML Experiment

### Use case:
Predict car price using regression.

Dataset: [Car details v3](https://www.kaggle.com/nehalbirla/vehicle-dataset-from-cardekho?select=Car+details+v3.csv)


Based on: [azureml-workshop-2020 Sample](https://github.com/csiebler/azureml-workshop-2020/blob/master/2-training-inference/2.3-automl-training/remote-compute/binayclassification-employee-attrition-autoaml-remote-amlcompute.ipynb)

### How to:
1. Create a new Azure Resource Group.
2. Add a Machine Learning service.
3. Log in at [ml.azure.com](https://ml.azure.com/).
4. Connect to the created ML service.
5. Go to the "Compute" settings under the Manage tab.
6. Create a new "Compute Instance".
7. Go to the "Compute Cluster" tab and create a new cluster.
8. Go to Datasets under the Assets tab.
9. Import the dataset and select all the settings (import without path).
10. Go to "New" and create a new experiment Notebook.
11. Follow the code below:



### 1. Connect with Azure Workspace

In [1]:
from azureml.core import Workspace, Dataset
ws = Workspace.from_config()

### 2. Load data using Pandas

In [2]:
aml_dataset = ws.datasets['car-ds'] # load ds

full_df = aml_dataset.to_pandas_dataframe() # convert to dataframe
full_df.head() # preview first 5 rows

| Unnamed: 0 | name | year | selling_price | km_driven | fuel | seller_type | transmission | owner | mileage | engine | max_power | torque | seats |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Maruti Swift Dzire VDI | 2014 | 450000 | 145500 | Diesel | Individual | Manual | First Owner | 23.4 kmpl | 1248 CC | 74 bhp | 190Nm@ 2000rpm | 5.0 |
| 1 | Skoda Rapid 1.5 TDI Ambition | 2014 | 370000 | 120000 | Diesel | Individual | Manual | Second Owner | 21.14 kmpl | 1498 CC | 103.52 bhp | 250Nm@ 1500-2500rpm | 5.0 |
| 2 | Honda City 2017-2020 EXi | 2006 | 158000 | 140000 | Petrol | Individual | Manual | Third Owner | 17.7 kmpl | 1497 CC | 78 bhp | 12.7@ 2,700(kgm@ rpm) | 5.0 |
| 3 | Hyundai i20 Sportz Diesel | 2010 | 225000 | 127000 | Diesel | Individual | Manual | First Owner | 23.0 kmpl | 1396 CC | 90 bhp | 22.4 kgm at 1750-2750rpm | 5.0 |
| 4 | Maruti Swift VXI BSIII | 2007 | 130000 | 120000 | Petrol | Individual | Manual | First Owner | 16.1 kmpl | 1298 CC | 88.2 bhp | 11.5@ 4,500(kgm@ rpm) | 5.0 |
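
The preview shows that `mileage`, `engine`, and `max_power` are stored as strings with units attached (for example `23.4 kmpl`). For manual exploration outside AutoML, one way to pull out the numeric part is sketched below. The helper name `leading_number` is hypothetical, and the sketch runs on a few values copied from the preview rather than on the full dataframe.

```python
import re

# Optional sign, digits, and an optional decimal part at the start of the text.
_NUMBER = re.compile(r"^\s*([-+]?\d+(?:\.\d+)?)")

def leading_number(text):
    """Return the number at the start of a string such as '23.4 kmpl', else None."""
    if not isinstance(text, str):  # covers None and float NaN from missing cells
        return None
    match = _NUMBER.match(text)
    return float(match.group(1)) if match else None

# Values taken from the preview, plus two cases without a usable number.
samples = ["23.4 kmpl", "1248 CC", "103.52 bhp", " bhp", None]
parsed = [leading_number(s) for s in samples]
print(parsed)
```

With pandas, the same helper could be applied to a whole column, for example `full_df['mileage'].map(leading_number)`.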


### 3. Split data into training and test set

In [3]:
train_dataset, test_dataset = aml_dataset.random_split(0.85, seed=1000003) # 85% train | 15% test

train_dataset_df = train_dataset.to_pandas_dataframe() # convert set to df
test_dataset_df = test_dataset.to_pandas_dataframe()

print(train_dataset_df.describe()) # preview train data

              year  selling_price     km_driven        seats
count  6909.000000   6.909000e+03  6.909000e+03  6718.000000
mean   2013.788681   6.370007e+05  6.981722e+04     5.417982
std       4.041359   8.058132e+05  5.746771e+04     0.959528
min    1983.000000   2.999900e+04  1.000000e+00     2.000000
25%    2011.000000   2.549990e+05  3.500000e+04     5.000000
50%    2015.000000   4.500000e+05  6.000000e+04     5.000000
75%    2017.000000   6.750000e+05  9.900000e+04     5.000000
max    2020.000000   1.000000e+07  2.360457e+06    14.000000
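
`random_split` divides the records randomly and only approximately by the requested percentage, and the `seed` makes the split repeatable. The sketch below illustrates that idea with the standard library. It is not the Azure ML implementation, and `approximate_split` and the stand-in row ids are hypothetical.

```python
import random

def approximate_split(rows, percentage, seed):
    """Send each row to the first part with probability `percentage`."""
    rng = random.Random(seed)  # private generator, so the same seed repeats the split
    first, second = [], []
    for row in rows:
        (first if rng.random() < percentage else second).append(row)
    return first, second

rows = list(range(10000))  # stand-in row ids, not the real dataset
train, test = approximate_split(rows, 0.85, seed=1000003)

share = len(train) / len(rows)
print(f"train share: {share:.3f}")  # close to 0.85, but not exactly 0.85
```

Because every row is assigned independently, the resulting sizes vary slightly around the requested ratio. Re-running with the same seed returns the same split.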


### 4. Connect with Compute Node

In [5]:
from azureml.core.compute import AmlCompute
from azureml.core.compute import ComputeTarget

amlcompute_cluster_name = "wut-ai-aml-ci"

found = False
cts = ws.compute_targets

if amlcompute_cluster_name in cts and cts[amlcompute_cluster_name].type == 'ComputeInstance':
    found = True
    print('Found existing training cluster.')
    # Get the existing compute target (a Compute Instance in this setup)
    aml_remote_compute = cts[amlcompute_cluster_name]
else:
    # Fail early with a clear message instead of a NameError further down
    raise RuntimeError('Compute target "{0}" was not found in the workspace.'.format(amlcompute_cluster_name))

print('Checking cluster status...')
# Wait for any pending provisioning to finish before submitting the run.
aml_remote_compute.wait_for_completion(show_output=True)

Found existing training cluster.
Checking cluster status...

Running


In [None]:
# additional details of current AmlCompute status:
aml_remote_compute.get_status().serialize()

### 5. Get primary metrics to evaluate model 

In [7]:
from azureml.train import automl

automl.utilities.get_primary_metrics('regression')

['spearman_correlation',
 'r2_score',
 'normalized_root_mean_squared_error',
 'normalized_mean_absolute_error']
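
Two of these metrics are normalized versions of familiar error measures. The sketch below assumes, following the description in the Azure ML documentation, that the error is divided by the range of the true target values so that scores are comparable across targets of different scale. The function names are hypothetical and not part of the SDK, and the predictions are made up.

```python
import math

def normalized_rmse(y_true, y_pred):
    """Root mean squared error divided by the range of the true values."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    rmse = math.sqrt(sum(squared) / len(squared))
    return rmse / (max(y_true) - min(y_true))

def normalized_mae(y_true, y_pred):
    """Mean absolute error divided by the range of the true values."""
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
    return mae / (max(y_true) - min(y_true))

y_true = [450000, 370000, 158000, 225000, 130000]  # prices from the data preview
y_pred = [430000, 390000, 168000, 215000, 150000]  # made-up predictions

nrmse = normalized_rmse(y_true, y_pred)
nmae = normalized_mae(y_true, y_pred)
print(f"normalized RMSE: {nrmse:.4f}")
print(f"normalized MAE:  {nmae:.4f}")
```

For both metrics lower is better, whereas for `r2_score` and `spearman_correlation` higher is better.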

### 6. Set experiment settings
* task - 'regression'
* primary_metric - 'r2_score'
* label_column_name - 'selling_price'

In [8]:
import logging
import os

from azureml.train.automl import AutoMLConfig

project_folder = './'
os.makedirs(project_folder, exist_ok=True)

automl_config = AutoMLConfig(compute_target=aml_remote_compute,
                             task='regression',
                             primary_metric='r2_score',
                             experiment_timeout_minutes=15,
                             training_data=train_dataset,
                             label_column_name="selling_price",
                             n_cross_validations=5,
                             enable_early_stopping=True,
                             featurization='auto',
                             debug_log='automated_ml_errors.log',
                             verbosity=logging.INFO,
                             path=project_folder
                             )
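
`n_cross_validations=5` requests 5-fold cross-validation: the training data is cut into five folds, and each fold serves once as the validation set while the other four are used for fitting. The index bookkeeping can be sketched as follows. This is a simplified illustration with a hypothetical `k_fold_indices` helper; AutoML's own splitting logic may differ, for example in whether it shuffles rows first.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_rows))
        yield train, validation
        start += size

# 6909 is the training row count reported by describe() above.
folds = list(k_fold_indices(6909, 5))
for i, (train_idx, validation_idx) in enumerate(folds):
    print(f"fold {i}: {len(train_idx)} train rows, {len(validation_idx)} validation rows")
```

Every row lands in exactly one validation fold, so the averaged score uses all of the training data without scoring a model on rows it was fitted on.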

### 7. Set experiment name and run it

In [9]:
from azureml.core import Experiment
from datetime import datetime

now = datetime.now()
time_string = now.strftime("%m-%d-%Y-%H")
experiment_name = "regr-automl-cars-remote-{0}".format(time_string)
print(experiment_name)

experiment = Experiment(workspace=ws, name=experiment_name)

import time
start_time = time.time()
            
run = experiment.submit(automl_config, show_output=True)

print('Manual run timing: --- %s seconds needed for running the whole Remote AutoML Experiment ---' % (time.time() - start_time))

regr-automl-cars-remote-01-27-2021-14
Running on remote.
No run_configuration provided, running on wut-ai-aml-ci with default configuration
Running on remote compute: wut-ai-aml-ci
Parent Run ID: AutoML_f0ae5551-cb5d-472c-b485-e3b5d00d4749

Current status: FeaturesGeneration. Generating features for the dataset.
Current status: DatasetCrossValidationSplit. Generating individually featurized CV splits.
Current status: ModelSelection. Beginning model selection.

****************************************************************************************************
DATA GUARDRAILS: 

TYPE:         Missing feature values imputation
STATUS:       DONE
DESCRIPTION:  If the missing values are expected, let the run complete. Otherwise cancel the current run and use a script to customize the handling of missing feature values that may be more appropriate based on the data type and business requirement.
              Learn more about missing value imputation: https://aka.ms/AutomatedMLFeaturization
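
The guardrail reports that missing feature values were imputed, which matches the `describe()` output where `seats` has fewer non-null values than the other columns. As a rough illustration of what imputation means, the sketch below fills gaps in a numeric column with the column mean. The log does not show which strategy AutoML chose per column, so mean imputation is an assumption here, and `impute_mean` is a hypothetical helper working on made-up values.

```python
def impute_mean(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

seats = [5.0, 5.0, None, 7.0, None, 5.0]  # made-up sample with two gaps
filled = impute_mean(seats)
print(filled)
```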

### 8. Retrieve the 'Best Model'

In [14]:
best_run, fitted_model = run.get_output()
print(best_run)
print(fitted_model)

Run(Experiment: regr-automl-cars-remote-01-27-2021-14,
Id: AutoML_f0ae5551-cb5d-472c-b485-e3b5d00d4749_7,
Type: azureml.scriptrun,
Status: Completed)
RegressionPipeline(pipeline=Pipeline(memory=None,
                                     steps=[('datatransformer',
                                             DataTransformer(enable_dnn=None,
                                                             enable_feature_sweeping=None,
                                                             feature_sweeping_config=None,
                                                             feature_sweeping_timeout=None,
                                                             featurization_config=None,
                                                             force_text_dnn=None,
                                                             is_cross_validation=None,
                                                             is_onnx_compatible=None,
                                         

### 9. Predict values

In [16]:
import pandas as pd

# split off the target column; the remaining columns are the model inputs
if 'selling_price' in test_dataset_df.columns:
    y_test_df = test_dataset_df.pop('selling_price')
x_test_df = test_dataset_df

# predict values
y_predictions = fitted_model.predict(x_test_df)

### 10. Calculate and show r2 score:

In [17]:
from sklearn.metrics import r2_score

print('R2 Score:')
r2_score(y_test_df, y_predictions)

R2 Score:


0.9566988104580337

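The best model reached an R2 score of about 0.96 on the held-out test set, meaning it explains roughly 96% of the variance in selling price. R2 measures goodness of fit rather than accuracy, but this is a good result for an experiment limited to 15 minutes.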
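
For reference, the R2 score returned by `r2_score` is one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the true values. A small standard-library sketch of that formula, using toy numbers rather than the experiment's predictions (the function name `r2` is hypothetical):

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # unexplained variation
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total variation
    return 1.0 - ss_res / ss_tot

print(r2([1, 2, 3], [1, 2, 3]))  # perfect predictions
print(r2([1, 2, 3], [2, 2, 2]))  # always predicting the mean
print(r2([1, 2, 3], [1, 2, 4]))  # one prediction off by one
```

A score of 1 means perfect predictions, 0 means the model does no better than always predicting the mean, and negative values are possible for models that do worse than that.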