# AutoML - Predict Car Prices

## Use case
The data contains information about cars sold in the US. The goal is to predict a car's price from its other attributes. <br>
A sample of the data is shown below: <br><br>
<img src="https://github.com/R3J3NT/AI-on-Microsoft-Azure/blob/main/Introduction-to-AI-Machine-Learning/AML/images/dataTable.PNG?raw=true">

## Create resources on Azure Portal
<ol>
<li>Go to the Azure portal and create a Resource Group</li>
<li>Create an Azure Machine Learning workspace</li>
</ol>

## Load data to Azure Machine Learning Portal
<ol>
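<li>Go to the Machine Learning portal: https://ml.azure.com/</li>
<li>Go to the Datasets tab</li>
<li>Import a new file from your local drive and name the dataset <code>USCarsData</code>, the name used in the notebook below</li>
<li>Configure the column and row settings</li></ol><br>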
<img src="https://github.com/R3J3NT/AI-on-Microsoft-Azure/blob/main/Introduction-to-AI-Machine-Learning/AML/images/newDataset.png?raw=true">

## Create new Notebook
<ol>
<li>Go to the Notebooks tab</li>
<li>Create a new notebook</li>
<li>Create a new compute instance, with parameters depending on your needs</li></ol><br>
<img src="https://github.com/R3J3NT/AI-on-Microsoft-Azure/blob/main/Introduction-to-AI-Machine-Learning/AML/images/clusterData.PNG?raw=true">

## Authenticate to Azure

In [1]:
from azureml.core import Workspace, Dataset

ws = Workspace.from_config()

## Load the previously created dataset by name

In [2]:
aml_dataset = Dataset.get_by_name(ws, name='USCarsData')

full_df = aml_dataset.to_pandas_dataframe()
full_df.head()

| Unnamed: 0 | price | brand | model | year | title_status | mileage | color | state | country | condition |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6300 | toyota | cruiser | 2008 | clean vehicle | 274117.0 | black | new jersey | usa | 10 days left |
| 1 | 2899 | ford | se | 2011 | clean vehicle | 190552.0 | silver | tennessee | usa | 6 days left |
| 2 | 5350 | dodge | mpv | 2018 | clean vehicle | 39590.0 | silver | georgia | usa | 2 days left |
| 3 | 25000 | ford | door | 2014 | clean vehicle | 64146.0 | blue | virginia | usa | 22 hours left |
| 4 | 27700 | chevrolet | 1500 | 2018 | clean vehicle | 6654.0 | red | florida | usa | 22 hours left |


## Split the dataset into training and testing sets

In [3]:
train_dataset, test_dataset = aml_dataset.random_split(0.8, seed=23423)

train_dataset_df = train_dataset.to_pandas_dataframe()
test_dataset_df = test_dataset.to_pandas_dataframe()

print(train_dataset_df.describe())

              price         year       mileage
count   1999.000000  1999.000000  1.999000e+03
mean   18583.241621  2016.687344  5.191408e+04
std    12139.191183     3.538547  5.555603e+04
min        0.000000  1973.000000  0.000000e+00
25%     9752.500000  2016.000000  2.127550e+04
50%    16800.000000  2018.000000  3.527400e+04
75%    25350.000000  2019.000000  6.261250e+04
max    84900.000000  2020.000000  1.017936e+06
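
To illustrate why the `seed` argument matters, here is a minimal pure-Python sketch of a seeded, approximate 80/20 split. It is not the Azure ML implementation: the helper `seeded_split` and the stand-in rows are hypothetical. Each row is assigned independently, so the training share is only close to 80%, which is also how the Azure ML documentation describes `random_split` (an approximate split by percentage).

```python
import random

def seeded_split(rows, percentage, seed):
    """Assign each row to the first part with probability `percentage`."""
    rng = random.Random(seed)  # private generator, so the global random state is untouched
    first, second = [], []
    for row in rows:
        (first if rng.random() < percentage else second).append(row)
    return first, second

rows = list(range(1000))  # stand-in for dataset records (hypothetical data)
train_a, test_a = seeded_split(rows, 0.8, seed=23423)
train_b, test_b = seeded_split(rows, 0.8, seed=23423)

print(len(train_a), len(test_a))  # roughly 800 and 200
print(train_a == train_b)         # the same seed reproduces the same split
```

Fixing the seed means the training and testing sets stay the same across notebook runs, so metric changes can be attributed to the model rather than to a different split.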


## Connect to the previously created compute target

In [4]:
from azureml.core.compute import AmlCompute
from azureml.core.compute import ComputeTarget

amlcompute_cluster_name = "CarsCluster"

cts = ws.compute_targets

if amlcompute_cluster_name in cts and cts[amlcompute_cluster_name].type == 'ComputeInstance':
    print('Found existing training cluster.')
    # Get the existing compute target
    # Method 1:
    aml_remote_compute = cts[amlcompute_cluster_name]
    # Method 2:
    # aml_remote_compute = ComputeTarget(ws, amlcompute_cluster_name)
else:
    # Stop here: without a compute target, the status check below would fail with a NameError
    raise RuntimeError('Compute target "{}" does not exist.'.format(amlcompute_cluster_name))

print('Checking cluster status...')
# Wait until any pending provisioning operation on the compute target has finished.
aml_remote_compute.wait_for_completion(show_output = True)

Found existing training cluster.
Checking cluster status...

Running


## Check which metrics can be used for regression

In [5]:
from azureml.train import automl

automl.utilities.get_primary_metrics('regression')

['spearman_correlation',
 'r2_score',
 'normalized_root_mean_squared_error',
 'normalized_mean_absolute_error']
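
As a rough guide to what the two "normalized" metrics measure, the sketch below computes them by hand. It assumes the error is normalized by dividing by the range of the true values, which is how the Azure ML documentation describes it; the helper `normalized_errors` and the sample prices are hypothetical and are not part of the SDK.

```python
import math

def normalized_errors(y_true, y_pred):
    """Return (normalized RMSE, normalized MAE), each divided by the range of y_true."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mae = sum(abs(e) for e in errors) / n
    value_range = max(y_true) - min(y_true)  # assumed normalization: range of the true values
    return rmse / value_range, mae / value_range

# Hypothetical prices, not taken from the dataset
y_true = [10000, 20000, 30000]
y_pred = [12000, 19000, 27000]
nrmse, nmae = normalized_errors(y_true, y_pred)
print(round(nrmse, 4), round(nmae, 4))
```

For both normalized metrics lower is better, whereas for `r2_score` and `spearman_correlation` higher is better.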

## Configure experiment parameters
For regression, change only `label_column_name` to the column you would like to predict.

In [6]:
import logging
import os

from azureml.train.automl import AutoMLConfig

project_folder = './'
os.makedirs(project_folder, exist_ok=True)

automl_config = AutoMLConfig(compute_target=aml_remote_compute,
                             task='regression',
                             primary_metric='r2_score',
                             experiment_timeout_minutes=15,                            
                             training_data=train_dataset,
                             label_column_name="price",
                             n_cross_validations=5,
                             # blacklist_models='XGBoostClassifier', 
                             # iteration_timeout_minutes=5,                                                    
                             enable_early_stopping=True,
                             featurization='auto',
                             debug_log='automated_ml_errors.log',
                             verbosity=logging.INFO,
                             path=project_folder
                             )
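
With `n_cross_validations=5`, the training data is divided into five folds, and each fold serves once as the validation set while the other four are used for training. The sketch below shows only the index bookkeeping behind that idea. The helper `k_fold_indices` is hypothetical, and AutoML's own splitting may differ (for example, it may shuffle rows first).

```python
def k_fold_indices(n_rows, n_folds):
    """Yield (train_indices, validation_indices) for each of n_folds contiguous folds."""
    base, extra = divmod(n_rows, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)  # spread the remainder over the first folds
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_rows))
        yield train, validation
        start += size

# 1999 is the training row count reported by describe() above
folds = list(k_fold_indices(1999, 5))
for train, validation in folds:
    print(len(train), len(validation))
```

Every row is used for validation exactly once, so the averaged score depends less on one lucky or unlucky split than a single hold-out set would.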

## Start experiment

In [7]:
from azureml.core import Experiment
from datetime import datetime

now = datetime.now()
time_string = now.strftime("%m-%d-%Y-%H")
experiment_name = "regress-automl-remote-{0}".format(time_string)
print(experiment_name)

experiment = Experiment(workspace=ws, name=experiment_name)

import time
start_time = time.time()
            
run = experiment.submit(automl_config, show_output=True)

print('Manual run timing: --- %s seconds needed for running the whole Remote AutoML Experiment ---' % (time.time() - start_time))

regress-automl-remote-12-31-2020-15
Running on remote.
No run_configuration provided, running on CarsCluster with default configuration
Running on remote compute: CarsCluster
Parent Run ID: AutoML_41db094e-bb8c-48d7-92ce-79293d95e23f

Current status: FeaturesGeneration. Generating features for the dataset.
Current status: DatasetCrossValidationSplit. Generating individually featurized CV splits.
Current status: ModelSelection. Beginning model selection.

****************************************************************************************************
DATA GUARDRAILS: 

TYPE:         Missing feature values imputation
STATUS:       PASSED
DESCRIPTION:  No feature missing values were detected in the training data.
              Learn more about missing value imputation: https://aka.ms/AutomatedMLFeaturization

****************************************************************************************************

TYPE:         High cardinality feature detection
STATUS:       DONE
DESCRIPT

## Get results of best model from experiment

In [8]:
best_run, fitted_model = run.get_output()
print(best_run)
print(fitted_model)

Run(Experiment: regress-automl-remote-12-31-2020-15,
Id: AutoML_41db094e-bb8c-48d7-92ce-79293d95e23f_14,
Type: azureml.scriptrun,
Status: Completed)
RegressionPipeline(pipeline=Pipeline(memory=None,
                                     steps=[('datatransformer',
                                             DataTransformer(enable_dnn=None,
                                                             enable_feature_sweeping=None,
                                                             feature_sweeping_config=None,
                                                             feature_sweeping_timeout=None,
                                                             featurization_config=None,
                                                             force_text_dnn=None,
                                                             is_cross_validation=None,
                                                             is_onnx_compatible=None,
                                          

## Prepare the testing set by removing the values to predict

In [9]:
import pandas as pd

# Remove the label (y) column
if 'price' in test_dataset_df.columns:
    y_test_df = test_dataset_df.pop('price')

x_test_df = test_dataset_df

## Test predictions using the trained model

In [10]:
y_predictions = fitted_model.predict(x_test_df)

print('10 predictions: ')
print(y_predictions[:10])

10 predictions: 
[ 4976.5247239  16135.67080713 26837.87049122 18831.07828488
 21427.58059085  7097.02730724 12830.82325822 30708.952202
 17309.78479339 17092.58532277]


## Get value of r2_score

In [11]:
from sklearn.metrics import r2_score

print('R2 Score:')
r2_score(y_test_df, y_predictions)

R2 Score:


0.6826730967720989

The R2 score is about 0.68, meaning the model explains roughly 68% of the variance in price on the testing set. This is a reasonable result, and it could likely be improved by adding more data about the cars to the dataset.
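
For reference, the coefficient of determination is defined as 1 - SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the true values from their mean. The sketch below computes it by hand. The helper `r2` and the sample prices are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical prices, not taken from the dataset
y_true = [10000, 20000, 30000]
y_pred = [12000, 19000, 27000]
score = r2(y_true, y_pred)
baseline = r2(y_true, [20000] * 3)  # always predicting the mean price

print(round(score, 4))
print(baseline)  # a model that always predicts the mean scores 0.0
```

A score of 1.0 would mean perfect predictions, and 0.0 means the model does no better than always predicting the mean price.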