Dataset: https://archive.ics.uci.edu/ml/datasets/wine+quality \
This project predicts the quality of wine based on its parameters. The Azure Automated Machine Learning service was used to select the best model. \
The service was first tested through the portal at https://ml.azure.com/ with the following steps:
- create a workspace
- create a compute cluster named "AiOnAzureCluster"
- create a dataset by uploading it from a local computer
- run an Automated ML experiment

The best model was selected, and the results were quite good.

The previous steps already created the compute cluster and the dataset, so this notebook only loads them.

In [1]:
import os  # needed for os.makedirs below

print("setting of names variables")
dataset_name = 'Wine'
amlcompute_cluster_name = "AiOnAzureCluster"
predicting_column_name = 'quality'

project_folder = './wine-automl'
os.makedirs(project_folder, exist_ok=True)

setting of names variables


In [2]:
from azureml.core import Workspace, Dataset

print("loading azure ml workspace and dataset")
ws = Workspace.from_config()
aml_dataset = ws.datasets[dataset_name]
full_df = aml_dataset.to_pandas_dataframe()

loading azure ml workspace and dataset


In [3]:
print("dataset characteristics")
full_df.describe()

dataset characteristics


statistic,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
count,1599.0,1599.0,1599.0,1599.0,1599.0,1595.0,1597.0,1599.0,1599.0,1599.0,1599.0,1599.0
mean,8.319637,0.527821,0.270976,2.538806,0.087467,15.838871,46.428929,0.996747,3.311113,0.658149,10.422983,5.636023
std,1.741096,0.17906,0.194801,1.409928,0.047065,10.423696,32.89757,0.001887,0.154386,0.169507,1.065668,0.807569
min,4.6,0.12,0.0,0.9,0.012,1.0,6.0,0.99007,2.74,0.33,8.4,3.0
25%,7.1,0.39,0.09,1.9,0.07,7.0,22.0,0.9956,3.21,0.55,9.5,5.0
50%,7.9,0.52,0.26,2.2,0.079,14.0,38.0,0.99675,3.31,0.62,10.2,6.0
75%,9.2,0.64,0.42,2.6,0.09,21.0,62.0,0.997835,3.4,0.73,11.1,6.0
max,15.9,1.58,1.0,15.5,0.611,72.0,289.0,1.00369,4.01,2.0,14.9,8.0
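
The `count` row above is lower for two columns than for the rest, which points to missing values in those columns. A minimal sketch of reading that off the counts follows. The numbers are copied from the table, the helper name `missing_from_counts` is hypothetical, and the sketch assumes the largest count equals the total number of rows.

```python
# Non-null counts copied from the `count` row of the describe() output.
counts = {
    "fixed acidity": 1599,
    "volatile acidity": 1599,
    "citric acid": 1599,
    "residual sugar": 1599,
    "chlorides": 1599,
    "free sulfur dioxide": 1595,
    "total sulfur dioxide": 1597,
    "density": 1599,
    "pH": 1599,
    "sulphates": 1599,
    "alcohol": 1599,
    "quality": 1599,
}


def missing_from_counts(counts):
    """Return {column: number of missing values}.

    Assumption: the largest non-null count equals the total row count.
    """
    total = max(counts.values())
    return {col: total - n for col, n in counts.items() if n < total}


missing = missing_from_counts(counts)
print(missing)  # {'free sulfur dioxide': 4, 'total sulfur dioxide': 2}
```

On the loaded frame itself, `full_df.isna().sum()` should report the same information directly.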


In [4]:
print("splitting dataset into train and test")
train_dataset, test_dataset = aml_dataset.random_split(0.9, seed=5)

train_dataset_df = train_dataset.to_pandas_dataframe()
test_dataset_df = test_dataset.to_pandas_dataframe()

train_dataset_df.describe()

splitting dataset into train and test


statistic,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
count,1402.0,1402.0,1402.0,1402.0,1402.0,1399.0,1401.0,1402.0,1402.0,1402.0,1402.0,1402.0
mean,8.334736,0.528181,0.271605,2.537518,0.088106,15.989278,47.049251,0.99679,3.310899,0.661084,10.404743,5.631241
std,1.738866,0.180044,0.194177,1.397057,0.048025,10.476107,32.762714,0.001859,0.154512,0.172484,1.05373,0.798085
min,4.6,0.12,0.0,0.9,0.034,1.0,6.0,0.99007,2.74,0.33,8.4,3.0
25%,7.1,0.39,0.1,1.9,0.071,7.5,23.0,0.99565,3.21,0.55,9.5,5.0
50%,7.9,0.52,0.26,2.2,0.08,14.0,38.0,0.9968,3.31,0.62,10.2,6.0
75%,9.2,0.635,0.43,2.6,0.091,22.0,63.0,0.997877,3.4,0.73,11.0,6.0
max,15.9,1.58,1.0,15.5,0.611,72.0,289.0,1.00369,4.01,2.0,14.9,8.0
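
The training frame holds 1402 of the 1599 rows (about 87.7%), not exactly 90%, so the split is evidently approximate rather than exact. One way such a split can arise is by assigning each row to a part at random. The sketch below illustrates that idea only. It is not Azure ML's implementation, `approximate_split` is a hypothetical helper, and the integer list is a stand-in for the wine records.

```python
import random


def approximate_split(rows, percentage, seed=None):
    """Send each row to the first part with probability `percentage`.

    The part sizes are random, so the first part is only approximately
    `percentage` of the input.
    """
    rng = random.Random(seed)
    first, second = [], []
    for row in rows:
        if rng.random() < percentage:
            first.append(row)
        else:
            second.append(row)
    return first, second


rows = list(range(1599))  # stand-in for the 1599 wine records
train, test = approximate_split(rows, 0.9, seed=5)

# The two parts cover every row exactly once; the ratio is close to 0.9.
print(len(train), len(test), round(len(train) / len(rows), 3))
```

Fixing the seed makes the split reproducible, which is why the notebook passes `seed=5`.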


In [5]:
from azureml.core.compute import AmlCompute
from azureml.core.compute import ComputeTarget

print("loading compute cluster. Cluster should be type 'AmlCompute'.")
targets = ws.compute_targets
if amlcompute_cluster_name in targets and targets[amlcompute_cluster_name].type == 'AmlCompute':
    aml_remote_compute = targets[amlcompute_cluster_name]
    aml_remote_compute.wait_for_completion(show_output=True, min_node_count=0, timeout_in_minutes=20)
else:
    print(f'ERROR - no cluster with name "{amlcompute_cluster_name}" found')
    

loading compute cluster. Cluster should be type 'AmlCompute'.
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned


In [6]:
from azureml.core import Experiment
from azureml.train.automl import AutoMLConfig
from datetime import datetime
import logging
import os

print("setting azure ML experiment - using regression and spearman_correlation as a metric.")
automl_config = AutoMLConfig(compute_target=aml_remote_compute,
                             task='regression',
                             primary_metric='spearman_correlation',
                             experiment_timeout_minutes=15,
                             training_data=train_dataset,
                             label_column_name=predicting_column_name,
                             n_cross_validations=5,
                             enable_early_stopping=True,
                             featurization='auto',
                             debug_log='automated_ml_errors.log',
                             verbosity=logging.INFO,
                             path=project_folder
                             )

print("Creating experiment with unique name and starting it")
experiment_name = f'regr-aml-wine-{datetime.now().strftime("%m-%d-%Y-%H")}'
experiment = Experiment(workspace=ws, name=experiment_name)
run = experiment.submit(automl_config, show_output=True)

setting azure ML experiment - using regression and spearman_correlation as a metric.
Creating experiment with unique name and starting it
Running on remote.
No run_configuration provided, running on AiOnAzureCluster with default configuration
Running on remote compute: AiOnAzureCluster
Parent Run ID: AutoML_bb64e09a-366e-4a7d-9a60-6dfcf2bc16e9

Current status: FeaturesGeneration. Generating features for the dataset.
Current status: DatasetCrossValidationSplit. Generating individually featurized CV splits.
Current status: ModelSelection. Beginning model selection.

****************************************************************************************************
DATA GUARDRAILS: 

TYPE:         Missing feature values imputation
STATUS:       DONE
DESCRIPTION:  If the missing values are expected, let the run complete. Otherwise cancel the current run and use a script to customize the handling of missing feature values that may be more appropriate based on the data type and business req
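
The primary metric chosen for this experiment, `spearman_correlation`, rewards a model for ordering wines correctly rather than for hitting exact quality values: it is the Pearson correlation computed on ranks. A minimal pure-Python sketch of that definition follows. The helper names are hypothetical and the data is a toy example, not taken from the wine dataset. In practice `scipy.stats.spearmanr` would normally be used instead.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data: integer quality labels and continuous model outputs.
y_true = [5, 6, 5, 7, 4]
y_pred = [5.2, 5.9, 5.4, 6.8, 4.9]
print(spearman(y_true, y_pred))
```

Because only the ordering matters, a model can score well on this metric while its predictions are still offset from the true values, which is why a separate error measure on held-out data is useful.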

In [7]:
from azureml.widgets import RunDetails

print("Displaying details of the run. Every model has its own result for many metrics")
RunDetails(run).show()

Displaying details of the run. Every model has its own result for many metrics


_AutoMLWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', 's…

In [8]:
print("Preparing test dataset to measure performance")
y_test_df = test_dataset_df.pop(predicting_column_name)
x_test_df = test_dataset_df

print("getting best model and getting predictions for test dataset")
_, best_model = run.get_output()
y_hat = best_model.predict(x_test_df)

Preparing test dataset to measure performance
getting best model and getting predictions for test dataset


In [9]:
from sklearn.metrics import mean_squared_error, r2_score

print("calculating and displaying best model performance:")
print('MSE:')
print(mean_squared_error(y_test_df, y_hat))
print('R2:')
print(r2_score(y_test_df, y_hat))

calculating and displaying best model performance:
MSE:
0.4430247264380246
R2:
0.4164229648926313
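
For reference, the two reported numbers follow the usual definitions: MSE is the mean of the squared errors, and R2 is one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the true values. The sketch below spells these out on toy data. The function names `mse` and `r2` are hypothetical and the toy values are not from the wine dataset. It is meant to mirror what `mean_squared_error` and `r2_score` compute, not to replace them.

```python
def mse(y_true, y_pred):
    """Mean of squared differences between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Toy data for illustration only.
y_true = [5, 6, 5, 7, 4]
y_pred = [5.5, 5.5, 5.0, 6.0, 5.0]
print(mse(y_true, y_pred), r2(y_true, y_pred))

# The square root of the MSE reported above gives the error in quality points.
reported_mse = 0.4430247264380246
print(reported_mse ** 0.5)
```

Read this way, the reported MSE corresponds to a root-mean-square error of about 0.67 quality points, and the R2 of about 0.42 means the model explains roughly 42% of the variance in quality on the test split.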
