# Azure Machine Learning Service
## Regression on Blackfriday dataset with Azure Automated ML
Adapted notebook from: https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/automated-machine-learning/classification/auto-ml-classification.ipynb

Dataset from: https://www.kaggle.com/mehdidag/black-friday (The dataset comes from a competition hosted by Analytics Vidhya.)

Author: Korkrid Akepanidtaworn (Microsoft) x Pongsakorn

## 1. Import important library
- matplotlib for visualization
- numpy and pandas for preprocessing data in an appropiate format
- sklearn for splitting dataset into training and test
- azureml.core for Azure machine learning services

In [1]:
# In case of importing error from "from azureml.train.automl import AutoMLConfig"
# Install new version of tensorflow and restart kernel to solve this
#! pip install tensorflow==1.5.0

In [2]:
import logging

from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, r2_score

import azureml.core
from azureml.core.experiment import Experiment
from azureml.core.workspace import Workspace
from azureml.train.automl import AutoMLConfig
from azureml.train.automl.automlexplainer import explain_model
from azureml.widgets import RunDetails

## 2. Preprocess dataset
The dataset here is a sample of the transactions made in a retail store. The store wants to know better the customer purchase behaviour against different products. Specifically, here the problem is a regression problem where we are trying to predict the dependent variable (the amount of purchase) with the help of the information contained in the other variables. In this case, we will:
- Download data and explore data a bit
- Drop some columns that is not necessary
- Split into training and test data

In [3]:
# If you already downloaded the dataset, please ignore this part
# ! wget https://www.dropbox.com/s/4i0ty34dzgrkd4e/BlackFriday.csv

In [4]:
input_df = pd.read_csv("./BlackFriday.csv")
input_df.head()

Unnamed: 0,User_ID,Product_ID,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3,Purchase
0,1000001,P00069042,F,0-17,10,A,2,0,3,,,8370
1,1000001,P00248942,F,0-17,10,A,2,0,1,6.0,14.0,15200
2,1000001,P00087842,F,0-17,10,A,2,0,12,,,1422
3,1000001,P00085442,F,0-17,10,A,2,0,12,14.0,,1057
4,1000002,P00285442,M,55+,16,C,4+,0,8,,,7969


In [5]:
input_df.dtypes

User_ID                         int64
Product_ID                     object
Gender                         object
Age                            object
Occupation                      int64
City_Category                  object
Stay_In_Current_City_Years     object
Marital_Status                  int64
Product_Category_1              int64
Product_Category_2            float64
Product_Category_3            float64
Purchase                        int64
dtype: object

In [6]:
input_df.describe()

Unnamed: 0,User_ID,Occupation,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3,Purchase
count,537577.0,537577.0,537577.0,537577.0,370591.0,164278.0,537577.0
mean,1002991.85,8.08,0.41,5.3,9.84,12.67,9333.86
std,1714.39,6.52,0.49,3.75,5.09,4.12,4981.02
min,1000001.0,0.0,0.0,1.0,2.0,3.0,185.0
25%,1001495.0,2.0,0.0,1.0,5.0,9.0,5866.0
50%,1003031.0,7.0,0.0,5.0,9.0,14.0,8062.0
75%,1004417.0,14.0,1.0,8.0,15.0,16.0,12073.0
max,1006040.0,20.0,1.0,18.0,18.0,18.0,23961.0


First we will ignore *user_id* and *product_id*. Moreover, We see that some product_category 2 and 3 are missing. We will let Automated ML handle this.

In [7]:
print("Missing ratio of product_category 2: "+str((input_df.count()['Product_Category_2']/input_df.shape[0])*100))
print("Missing ratio of product_category 3: "+str((input_df.count()['Product_Category_3']/input_df.shape[0])*100))

Missing ratio of product_category 2: 68.93728712351906
Missing ratio of product_category 3: 30.55897108693266


Now, we got preprocessed input dataframe which is droped User_ID and Product_ID. Next, let's split into training and test data into 70:30 portion.

In [8]:
prep_inputdf = input_df.drop(['User_ID', 'Product_ID'],axis=1)
prep_inputdf.head()

Unnamed: 0,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3,Purchase
0,F,0-17,10,A,2,0,3,,,8370
1,F,0-17,10,A,2,0,1,6.0,14.0,15200
2,F,0-17,10,A,2,0,12,,,1422
3,F,0-17,10,A,2,0,12,14.0,,1057
4,M,55+,16,C,4+,0,8,,,7969


In [9]:
X_train, X_test, y_train, y_test = train_test_split(prep_inputdf.drop(['Purchase'], axis=1), 
                                                    prep_inputdf['Purchase'], 
                                                    test_size=0.7)

In [10]:
X_train.head()

Unnamed: 0,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3
183912,M,36-45,15,C,0,1,1,2.0,5.0
171879,M,36-45,14,C,1,0,2,15.0,
493421,F,26-35,2,C,2,1,6,8.0,13.0
268223,F,26-35,7,A,4+,0,5,,
457733,F,26-35,0,B,1,0,5,14.0,16.0


In [11]:
y_train.head()

183912    15884
171879     9564
493421    12605
268223     6926
457733     3764
Name: Purchase, dtype: int64

## 3. Automated Machine Learning with Azure

### 3.1 Setup

- As part of the setup you have already created an Azure ML `Workspace` object. It indicates path of config.json file. This file contains subscription_id, resource_group, and workspace_name for authentication in the later. 
- For Automated ML you will need to create an `Experiment` object, which is a named object in a `Workspace` used to run experiments.
- After that, it will appear a link, authenticate in this link with provided code. This should be completed for connecting Azure Machine Learning Service
- If you want to see details, try to print outputDf (comment section)
- **Read more:** https://docs.microsoft.com/bs-latn-ba/azure/machine-learning/service/setup-create-workspace?view=sql-server-2016

In [12]:
ws = Workspace.from_config()

# Choose a name for the experiment and specify the project folder.
experiment_name = 'automl-local-regression'
project_folder = './automl-local-regression'

experiment = Experiment(ws, experiment_name)

output = {}
output['SDK version'] = azureml.core.VERSION
output['Subscription ID'] = ws.subscription_id
output['Workspace Name'] = ws.name
output['Resource Group'] = ws.resource_group
output['Location'] = ws.location
output['Project Directory'] = project_folder
output['Experiment Name'] = experiment.name
pd.set_option('display.max_colwidth', -1)
#outputDf = pd.DataFrame(data = output, index = [''])
#outputDf.T

Opt-in diagnostics for better experience, quality, and security of future releases.

In [13]:
from azureml.telemetry import set_diagnostics_collection
set_diagnostics_collection(send_diagnostics = True)

Turning diagnostics collection on. 


### 3.2 Train the model

Instantiate an `AutoMLConfig` object to specify the settings and data used to run the experiment.

|Property|Description|
|-|-|
|**task**|classification or regression|
|**primary_metric**|This is the metric that you want to optimize. Regression supports the following primary metrics: <br><i>spearman_correlation</i><br><i>normalized_root_mean_squared_error</i><br><i>r2_score</i><br><i>normalized_mean_absolute_error</i>|
|**iteration_timeout_minutes**|Time limit in minutes for each iteration.|
|**iterations**|Number of iterations. In each iteration AutoML trains a specific pipeline with the data.|
|**n_cross_validations**|Number of cross validation splits.|
|**X**|(sparse) array-like, shape = [n_samples, n_features]|
|**y**|(sparse) array-like, shape = [n_samples, ], [n_samples, n_classes]<br>Multi-class targets. An indicator matrix turns on multilabel classification. This should be an array of integers.|
|**path**|Relative path to the project folder. AutoML stores configuration files for the experiment under this folder. You can specify a new empty folder.|

#### Remarks:
- When running the model with dataframe *y*, it shows error that **"y should be numpy array"**. So, we will convert *y* into numpy array before running.
- If we set *iteration_timeout_minutes* too low, it will cause to *"Fit operation exceeded provided timeout"*. It will terminate current iteration and move onto the next iteration. So, make sure that you set appropiate timeout.
- As we said early, we have not appropriate preprocessing and some product_category 2 and 3 are missing. So, we must set *preprocess* parameter as *"True"*. It helps about:
    -  Drop high cardinality or no variance features.
    -  Missing value imputation
    -  Generate additional features
    -  Transformations and encodings

**Read more** for preprocessing of the Automated ML: https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-configure-auto-train#data-pre-processing-and-featurization

In [14]:
automl_config = AutoMLConfig(task = 'regression',
                             debug_log = 'automl_errors.log',
                             primary_metric = 'normalized_mean_absolute_error',
                             iteration_timeout_minutes = 5,
                             iterations = 10,
                             n_cross_validations = 5,
                             verbosity = logging.INFO,
                             X = X_train, 
                             y = np.array(y_train),
                             path = project_folder,
                             preprocess=True)

Call the `submit` method on the experiment object and pass the run configuration. Execution of local runs is synchronous. Depending on the data and the number of iterations this can run for a while.
In this example, we specify `show_output = True` to print currently running iterations to the console.

In [15]:
local_run = experiment.submit(automl_config, show_output = True)

Running on local machine
Parent Run ID: AutoML_96ab294d-c1fc-44e2-8d45-fd455a5ddf1f
Current status: DatasetEvaluation. Gathering dataset statistics.
Current status: FeaturesGeneration. Generating features for the dataset.
Current status: DatasetFeaturization. Featurizing the dataset.
Current status: DatasetCrossValidationSplit. Generating individually featurized CV splits.
Current status: DatasetFeaturization. Featurizing the dataset.
Current status: DatasetFeaturization. Featurizing the dataset.
Current status: DatasetFeaturization. Featurizing the dataset.
Current status: DatasetFeaturization. Featurizing the dataset.
Current status: DatasetFeaturization. Featurizing the dataset.
Current status: DatasetFeaturization. Featurizing the dataset.
Current status: DatasetFeaturization. Featurizing the dataset.
Current status: DatasetFeaturization. Featurizing the dataset.
Current status: DatasetFeaturization. Featurizing the dataset.
Current status: DatasetFeaturization. Featurizing the dat

In [16]:
local_run

Experiment,Id,Type,Status,Details Page,Docs Page
automl-local-regression,AutoML_96ab294d-c1fc-44e2-8d45-fd455a5ddf1f,automl,NotStarted,Link to Azure Portal,Link to Documentation


Optionally, you can continue an interrupted local run by calling `continue_experiment` without the `iterations` parameter, or run more iterations for a completed run by specifying the `iterations` parameter:

In [17]:
#local_run = local_run.continue_experiment(X = X_train, 
#                                          y = y_train, 
#                                          show_output = True,
#                                          iterations = 5)

### 3.3 Training Results

#### Widget for Monitoring Runs

The widget will first report a "loading" status while running the first iteration. After completing the first iteration, an auto-updating graph and table will be shown. The widget will refresh once per minute, so you should see the graph update as child runs complete.

**Note:** The widget displays a link at the bottom. Use this link to open a web interface to explore the individual run details.

In [18]:
from azureml.widgets import RunDetails
RunDetails(local_run).show() 

_AutoMLWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': True, 'log_level': 'INFO', 'sdâ€¦


#### Retrieve All Child Runs
You can also use SDK methods to fetch all the child runs and see individual metrics that we log.

In [19]:
children = list(local_run.get_children())
metricslist = {}
for run in children:
    properties = run.get_properties()
    metrics = {k: v for k, v in run.get_metrics().items() if isinstance(v, float)}
    metricslist[int(properties['iteration'])] = metrics

rundata = pd.DataFrame(metricslist).sort_index(1)
rundata

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
explained_variance,0.62,0.38,0.63,0.59,0.63,0.59,0.5,0.63,0.63,0.63
mean_absolute_error,2370.02,2956.63,2306.42,2434.64,2301.69,2411.6,2616.54,2321.58,2311.73,2292.24
median_absolute_error,1879.8,2088.94,1812.73,1917.53,1768.32,1835.0,1957.35,1821.51,1800.7,1780.7
normalized_mean_absolute_error,0.1,0.12,0.1,0.1,0.1,0.1,0.11,0.1,0.1,0.1
normalized_median_absolute_error,0.08,0.09,0.08,0.08,0.07,0.08,0.08,0.08,0.08,0.07
normalized_root_mean_squared_error,0.13,0.17,0.13,0.13,0.13,0.13,0.15,0.13,0.13,0.13
normalized_root_mean_squared_log_error,0.09,0.12,0.08,0.09,0.08,0.09,0.11,0.08,0.08,
r2_score,0.62,0.38,0.63,0.59,0.63,0.59,0.5,0.63,0.63,0.63
root_mean_squared_error,3071.37,3939.07,3034.18,3178.11,3054.71,3209.49,3533.9,3047.17,3038.74,3019.18
root_mean_squared_log_error,0.42,0.58,0.39,0.45,0.39,0.43,0.52,0.4,0.39,


### 3.4 Retrieve the Best Model

Below we select the best pipeline from our iterations. The `get_output` method returns the best run and the fitted model. The Model includes the pipeline and any pre-processing.  Overloads on `get_output` allow you to retrieve the best run and fitted model for *any* logged metric or for a particular *iteration*.

In [20]:
best_run, fitted_model = local_run.get_output()
print(best_run)
print(fitted_model)

Run(Experiment: automl-local-regression,
Id: AutoML_96ab294d-c1fc-44e2-8d45-fd455a5ddf1f_9,
Type: None,
Status: Completed)
RegressionPipeline(pipeline=Pipeline(memory=None,
     steps=[('datatransformer', DataTransformer(enable_feature_sweeping=None, feature_sweeping_timeout=None,
        is_onnx_compatible=None, logger=None, observer=None, task=None)), ('stackensembleregressor', StackEnsembleRegressor(base_learners=[('4', Pipeline(memory=None,
     steps=[('standardscalerw...om_state=None, selection='cyclic', tol=0.0001, warm_start=False),
            training_cv_folds=5))]),
          stddev=None)


#### Best Model Based on Any Other Metric
Show the run and the model that has the smallest `normalized_root_mean_squared_error` value:

In [21]:
lookup_metric = "r2_score"
best_run, fitted_model = local_run.get_output(metric = lookup_metric)
print(best_run)
print(fitted_model)

Run(Experiment: automl-local-regression,
Id: AutoML_96ab294d-c1fc-44e2-8d45-fd455a5ddf1f_9,
Type: None,
Status: Completed)
RegressionPipeline(pipeline=Pipeline(memory=None,
     steps=[('datatransformer', DataTransformer(enable_feature_sweeping=None, feature_sweeping_timeout=None,
        is_onnx_compatible=None, logger=None, observer=None, task=None)), ('stackensembleregressor', StackEnsembleRegressor(base_learners=[('4', Pipeline(memory=None,
     steps=[('standardscalerw...om_state=None, selection='cyclic', tol=0.0001, warm_start=False),
            training_cv_folds=5))]),
          stddev=None)


#### Model from a Specific Iteration
Show the run and the model from the third iteration:

In [22]:
#iteration = 3
#third_run, third_model = local_run.get_output(iteration = iteration)
#print(third_run)
#print(third_model)

## 4. Test The Model
In this section, we will fit the best model with test data to see model performance.

We measure in four metrics including Mean Absolute Error (MAE) and its percentage, R2-Score, and accuracy.

In [23]:
Y_pred = fitted_model.predict(X_test)

In [24]:
print("R2-Score: "+str(r2_score(y_test, Y_pred)))

R2-Score: 0.6334573893861751


In [25]:
sum_actuals = sum_errors = 0

for actual_val, predict_val in zip(y_test, Y_pred):
    abs_error = actual_val - predict_val
    if abs_error < 0:
        abs_error = abs_error * -1

    sum_errors = sum_errors + abs_error
    sum_actuals = sum_actuals + actual_val

mean_abs_percent_error = sum_errors / sum_actuals
print("Model MAPE:")
print(mean_abs_percent_error)
print()
print("Model Accuracy:")
print(1 - mean_abs_percent_error)

Model MAPE:
0.24509007969267277

Model Accuracy:
0.7549099203073273
