## Auto AML - car price prediction
### Use case

The goal of this exercise is to predict the price of a car from independent variables (features) such as:
* enginetype
* enginesize
* fuelsystem
* horsepower
* car model

Dataset link: https://www.kaggle.com/hellbuoy/car-price-prediction
### Steps to run the experiment

1. Sign in to the Azure Portal and create an Azure Machine Learning resource
2. Sign in to https://ml.azure.com/
3. Download the dataset from the internet
4. Go to the Datasets tab and upload the dataset (the notebook below expects it to be registered as `car`)
5. Create a Compute Instance and a Compute Cluster (the notebook below expects a cluster named `cpus-cluster`)
6. Create a new notebook in Azure ML


### Checking the SDK version

In [1]:
import azureml.core

print("SDK version:", azureml.core.VERSION)

SDK version: 1.20.0


In [2]:
from azureml.core import Workspace, Dataset

# Get the Workspace defined in the default config.json file
ws = Workspace.from_config()
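
`Workspace.from_config()` reads the workspace coordinates from a `config.json` file; on an Azure ML Compute Instance this file is typically already in place. The sketch below shows the expected shape of that file using only the standard library. The values are placeholders, and `write_workspace_config` and `missing_keys` are hypothetical helpers written for this illustration, not part of the SDK.

```python
import json
import os
import tempfile

# The three keys that identify a workspace in config.json.
REQUIRED_KEYS = ("subscription_id", "resource_group", "workspace_name")


def write_workspace_config(directory, subscription_id, resource_group, workspace_name):
    """Hypothetical helper: write a config.json with the workspace coordinates."""
    config = {
        "subscription_id": subscription_id,
        "resource_group": resource_group,
        "workspace_name": workspace_name,
    }
    path = os.path.join(directory, "config.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(config, f, indent=2)
    return path


def missing_keys(path):
    """Return the required keys that are absent from the given config file."""
    with open(path, encoding="utf-8") as f:
        config = json.load(f)
    return [key for key in REQUIRED_KEYS if key not in config]


# Placeholder values (assumptions): replace with your own subscription,
# resource group and workspace name.
config_path = write_workspace_config(
    tempfile.mkdtemp(),
    subscription_id="<subscription-id>",
    resource_group="<resource-group>",
    workspace_name="<workspace-name>",
)
print(missing_keys(config_path))
```

An empty list means the file has all three keys; a missing key is a common reason for `from_config()` failing outside of a Compute Instance.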

### Loading the data

In [11]:

aml_dataset = ws.datasets['car']

# Load into a pandas DataFrame just to take a sneak peek at the data and schema
full_df = aml_dataset.to_pandas_dataframe()
# .to_pandas_dataframe().dropna()
full_df.head()

| | car_ID | symboling | CarName | fueltype | aspiration | doornumber | carbody | drivewheel | enginelocation | wheelbase | ... | enginesize | fuelsystem | boreratio | stroke | compressionratio | horsepower | peakrpm | citympg | highwaympg | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 3 | alfa-romero giulia | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 13495.0 |
| 1 | 2 | 3 | alfa-romero stelvio | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 16500.0 |
| 2 | 3 | 1 | alfa-romero Quadrifoglio | gas | std | two | hatchback | rwd | front | 94.5 | ... | 152 | mpfi | 2.68 | 3.47 | 9.0 | 154 | 5000 | 19 | 26 | 16500.0 |
| 3 | 4 | 2 | audi 100 ls | gas | std | four | sedan | fwd | front | 99.8 | ... | 109 | mpfi | 3.19 | 3.4 | 10.0 | 102 | 5500 | 24 | 30 | 13950.0 |
| 4 | 5 | 2 | audi 100ls | gas | std | four | sedan | 4wd | front | 99.4 | ... | 136 | mpfi | 3.19 | 3.4 | 8.0 | 115 | 5500 | 18 | 22 | 17450.0 |


In [12]:
full_df.describe()

| | car_ID | symboling | wheelbase | carlength | carwidth | carheight | curbweight | enginesize | boreratio | stroke | compressionratio | horsepower | peakrpm | citympg | highwaympg | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 205.0 | 205.0 | 205.0 | 205.0 | 205.0 | 205.0 | 205.0 | 205.0 | 205.0 | 205.0 | 205.0 | 205.0 | 205.0 | 205.0 | 205.0 | 205.0 |
| mean | 103.0 | 0.834146 | 98.756585 | 174.049268 | 65.907805 | 53.724878 | 2555.565854 | 126.907317 | 3.329756 | 3.255415 | 10.142537 | 104.117073 | 5125.121951 | 25.219512 | 30.75122 | 13276.710571 |
| std | 59.322565 | 1.245307 | 6.021776 | 12.337289 | 2.145204 | 2.443522 | 520.680204 | 41.642693 | 0.270844 | 0.313597 | 3.97204 | 39.544167 | 476.985643 | 6.542142 | 6.886443 | 7988.852332 |
| min | 1.0 | -2.0 | 86.6 | 141.1 | 60.3 | 47.8 | 1488.0 | 61.0 | 2.54 | 2.07 | 7.0 | 48.0 | 4150.0 | 13.0 | 16.0 | 5118.0 |
| 25% | 52.0 | 0.0 | 94.5 | 166.3 | 64.1 | 52.0 | 2145.0 | 97.0 | 3.15 | 3.11 | 8.6 | 70.0 | 4800.0 | 19.0 | 25.0 | 7788.0 |
| 50% | 103.0 | 1.0 | 97.0 | 173.2 | 65.5 | 54.1 | 2414.0 | 120.0 | 3.31 | 3.29 | 9.0 | 95.0 | 5200.0 | 24.0 | 30.0 | 10295.0 |
| 75% | 154.0 | 2.0 | 102.4 | 183.1 | 66.9 | 55.5 | 2935.0 | 141.0 | 3.58 | 3.41 | 9.4 | 116.0 | 5500.0 | 30.0 | 34.0 | 16503.0 |
| max | 205.0 | 3.0 | 120.9 | 208.1 | 72.3 | 59.8 | 4066.0 | 326.0 | 3.94 | 4.17 | 23.0 | 288.0 | 6600.0 | 49.0 | 54.0 | 45400.0 |


### Splitting the data into training and test sets

In [14]:
train_dataset, test_dataset = aml_dataset.random_split(0.8, seed=1)

train_dataset_df = train_dataset.to_pandas_dataframe()
test_dataset_df = test_dataset.to_pandas_dataframe()

print(train_dataset_df.describe())

           car_ID   symboling   wheelbase   carlength    carwidth   carheight  \
count  175.000000  175.000000  175.000000  175.000000  175.000000  175.000000   
mean   101.034286    0.851429   98.545143  173.511429   65.860000   53.704571   
std     59.461752    1.218009    5.958011   12.431753    2.195973    2.395010   
min      1.000000   -2.000000   86.600000  141.100000   60.300000   47.800000   
25%     49.500000    0.000000   94.500000  166.300000   64.000000   52.000000   
50%     99.000000    1.000000   96.500000  172.400000   65.400000   54.100000   
75%    151.500000    2.000000  101.200000  180.250000   66.500000   55.500000   
max    205.000000    3.000000  120.900000  208.100000   72.300000   59.800000   

        curbweight  enginesize   boreratio      stroke  compressionratio  \
count   175.000000  175.000000  175.000000  175.000000        175.000000   
mean   2543.280000  126.634286    3.318514    3.242029         10.051543   
std     520.781846   42.521434    0.272726
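
Note that the training set has 175 rows rather than exactly 80% of 205 (164): `random_split` splits the records approximately by the requested percentage. One way such behaviour can arise is a per-row random assignment, which the plain-Python sketch below illustrates. This is an assumption made for illustration only, not the SDK's actual implementation, and `bernoulli_split` is a hypothetical name.

```python
import random


def bernoulli_split(rows, fraction, seed):
    """Assign each row to the first part with probability `fraction`."""
    rng = random.Random(seed)
    first, second = [], []
    for row in rows:
        (first if rng.random() < fraction else second).append(row)
    return first, second


rows = list(range(205))  # same row count as the car dataset
train, test = bernoulli_split(rows, fraction=0.8, seed=1)

# The two sizes always sum to 205, but the first part is only
# close to 164 rows, not guaranteed to be exactly 164.
print(len(train), len(test))
```

With a fixed seed the split is reproducible, but changing the seed changes both which rows land in each part and how many.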

In [15]:
from azureml.core.compute import AmlCompute, ComputeTarget

amlcompute_cluster_name = "cpus-cluster"

# Check whether this compute target already exists in the workspace.
cts = ws.compute_targets

if amlcompute_cluster_name in cts and cts[amlcompute_cluster_name].type == 'AmlCompute':
    print('Found existing training cluster.')
    # Reuse the existing cluster. Equivalent alternative:
    # aml_remote_compute = ComputeTarget(ws, amlcompute_cluster_name)
    aml_remote_compute = cts[amlcompute_cluster_name]
else:
    print('Training cluster not found, creating it...')
    # Assumption: the VM size and node count are illustrative; adjust them to your quota.
    provisioning_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_DS3_V2',
                                                                max_nodes=4)
    aml_remote_compute = ComputeTarget.create(ws, amlcompute_cluster_name, provisioning_config)

print('Checking cluster status...')
# Can poll for a minimum number of nodes and for a specific timeout.
# If no min_node_count is provided, it will use the scale settings for the cluster.
aml_remote_compute.wait_for_completion(show_output=True)

Found existing training cluster.
Checking cluster status...
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned


In [16]:
from azureml.train import automl

automl.utilities.get_primary_metrics('regression')

['normalized_root_mean_squared_error',
 'normalized_mean_absolute_error',
 'r2_score',
 'spearman_correlation']
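
Two of these metrics are "normalized" variants: as documented for Azure AutoML, the error is divided by the range of the target values, which makes the score comparable across datasets with different scales. A minimal sketch of that calculation follows; the function names and the toy prices are made up for illustration and are not taken from the experiment.

```python
import math


def normalized_rmse(y_true, y_pred):
    """Root mean squared error divided by the range of the true values."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse) / (max(y_true) - min(y_true))


def normalized_mae(y_true, y_pred):
    """Mean absolute error divided by the range of the true values."""
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
    return mae / (max(y_true) - min(y_true))


# Toy prices (illustrative only): every prediction is off by 1000,
# and the true values span a range of 20000.
y_true = [10000.0, 20000.0, 30000.0]
y_pred = [11000.0, 19000.0, 31000.0]

nrmse = normalized_rmse(y_true, y_pred)
nmae = normalized_mae(y_true, y_pred)
print(nrmse, nmae)  # both equal 1000 / 20000 = 0.05
```

For the normalized error metrics lower is better, while for `r2_score` and `spearman_correlation` higher is better.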

In [17]:
import logging
import os

from azureml.train.automl import AutoMLConfig

project_folder = './'
os.makedirs(project_folder, exist_ok=True)

automl_config = AutoMLConfig(compute_target=aml_remote_compute,
                             task='regression',
                             primary_metric='r2_score',
                             experiment_timeout_minutes=15,                            
                             training_data=train_dataset,
                             label_column_name="price",
                             n_cross_validations=5,                                                 
                             enable_early_stopping=True,
                             featurization='auto',
                             debug_log='automated_ml_errors.log',
                             verbosity=logging.INFO,
                             path=project_folder
                             )
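
With `n_cross_validations=5`, each candidate model is scored by 5-fold cross-validation on the training data: the rows are divided into five parts, and each part serves once as the validation set while the other four are used for training. The sketch below illustrates the idea on row indices only. It uses contiguous folds for simplicity, so the exact assignment of rows (for example, shuffling) in AutoML may differ, and `k_fold_indices` is a hypothetical helper.

```python
def k_fold_indices(n_rows, n_folds):
    """Yield (train_idx, valid_idx) pairs for contiguous k-fold cross-validation."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_rows // n_folds + (1 if i < n_rows % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        valid_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_rows))
        yield train_idx, valid_idx
        start += size


# 175 is the number of rows in the training set above.
folds = list(k_fold_indices(n_rows=175, n_folds=5))
for i, (train_idx, valid_idx) in enumerate(folds):
    print(f"fold {i}: {len(train_idx)} training rows, {len(valid_idx)} validation rows")
```

Each row is used for validation exactly once, which gives a more stable estimate of `r2_score` on a small dataset than a single hold-out split would.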

In [None]:
import time
from datetime import datetime

from azureml.core import Experiment

# Name the experiment after the current date and hour
now = datetime.now()
time_string = now.strftime("%m-%d-%Y-%H")
experiment_name = "regress-automl-remote-{0}".format(time_string)
print(experiment_name)

experiment = Experiment(workspace=ws, name=experiment_name)

# Time the whole remote run manually
start_time = time.time()

run = experiment.submit(automl_config, show_output=True)

print('Manual run timing: --- %s seconds needed for running the whole Remote AutoML Experiment ---' % (time.time() - start_time))

regress-automl-remote-01-26-2021-18
Running on remote.
No run_configuration provided, running on cpus-cluster with default configuration
Running on remote compute: cpus-cluster
Parent Run ID: AutoML_96ab8256-bde0-459f-9c85-c610046fd3e9

Current status: FeaturesGeneration. Generating features for the dataset.
Current status: ModelSelection. Beginning model selection.

****************************************************************************************************
DATA GUARDRAILS: 

TYPE:         Missing feature values imputation
STATUS:       PASSED
DESCRIPTION:  No feature missing values were detected in the training data.
              Learn more about missing value imputation: https://aka.ms/AutomatedMLFeaturization

****************************************************************************************************

TYPE:         High cardinality feature detection
STATUS:       DONE
DESCRIPTION:  High cardinality features were detected in your inputs and handled.
              L

In [21]:
best_run, fitted_model = run.get_output()
print(best_run)
print(fitted_model)

Run(Experiment: regress-automl-remote-01-26-2021-18,
Id: AutoML_96ab8256-bde0-459f-9c85-c610046fd3e9_16,
Type: azureml.scriptrun,
Status: Completed)
RegressionPipeline(pipeline=Pipeline(memory=None,
                                     steps=[('datatransformer',
                                             DataTransformer(enable_dnn=None,
                                                             enable_feature_sweeping=None,
                                                             feature_sweeping_config=None,
                                                             feature_sweeping_timeout=None,
                                                             featurization_config=None,
                                                             force_text_dnn=None,
                                                             is_cross_validation=None,
                                                             is_onnx_compatible=None,
                                          

In [24]:
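# Separate the label column ('price') from the test features.
# Note: pop() removes the column from test_dataset_df in place.
if 'price' in test_dataset_df.columns:
    y_test_df = test_dataset_df.pop('price')

x_test_df = test_dataset_df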

In [25]:
y_predictions = fitted_model.predict(x_test_df)

print('10 predictions: ')
print(y_predictions[:10])

10 predictions: 
[14047.89781494  6377.06065034  7797.15107187  9125.37602026
 35435.47835752 12334.25745746 10651.11683417 10402.39697368
 29632.88858731 13754.29995059]


In [26]:
y_predictions.shape


(30,)

In [28]:
from sklearn.metrics import r2_score

print('Result: r2 score')
r2_score(y_test_df, y_predictions)

Result: r2 score


0.9055293351826881
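
An R² of about 0.906 means the model explains roughly 90.6% of the variance of the prices in the 30-row test set. The coefficient of determination is defined as R² = 1 - SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the true values from their mean. A minimal plain-Python sketch of that formula follows; the toy prices are made up for illustration and are not taken from the experiment.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy prices (illustrative only)
y_true = [10000.0, 20000.0, 30000.0]

perfect = r2(y_true, y_true)               # predictions match exactly
baseline = r2(y_true, [20000.0] * 3)       # always predict the mean price
close = r2(y_true, [11000.0, 19000.0, 31000.0])

print(perfect, baseline, close)
```

A perfect model scores 1, a model that always predicts the mean scores 0, and a model worse than that baseline gets a negative score.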