
Optimizing an ML Pipeline in Azure

Overview

This project is part of the Udacity Azure ML Nanodegree. In this project, we build and optimize an Azure ML pipeline using the Python SDK and a provided Scikit-learn model. This model is then compared to an Azure AutoML run.

Summary

The dataset used in this project is Bankmarketing_train.csv, which contains client data collected through marketing campaigns run by a banking institution. We seek to predict whether a client subscribes to a term deposit or not, i.e., the target variable (y) with a value of yes/no.

  • Number of clients : 32951
  • Input variables : 20
  • Output/target variable : y

Best performing model

The best performing model was the Voting Ensemble produced by AutoML, with an accuracy of 0.9154738177206015.

Scikit-learn Pipeline

Pipeline architecture of HyperDrive Experiment

Create workspace and experiment objects to start building the HyperDrive pipeline in a Jupyter notebook.

1. Create a compute cluster : Training a model requires a virtual machine; the experiment runs on a compute cluster created with the required settings - VM priority low-priority, VM type CPU, VM size Standard_D2_v2, and max_nodes 4.
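A minimal sketch of this step, assuming the Azure ML SDK v1 and an existing Workspace object `ws`; the cluster name "cpu-cluster" is a placeholder:

```python
from azureml.core.compute import AmlCompute, ComputeTarget

# provisioning configuration with the settings described above
compute_config = AmlCompute.provisioning_configuration(
    vm_size="Standard_D2_V2",   # CPU VM size
    vm_priority="lowpriority",  # low-priority VM
    max_nodes=4,
)
cluster = ComputeTarget.create(ws, "cpu-cluster", compute_config)
cluster.wait_for_completion(show_output=True)
```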

2. Set up the train.py script : This Python script is used to run the HyperDrive experiment and includes a custom-coded Logistic Regression model using scikit-learn.

  • Data import : Create a tabular dataset from the bankmarketing CSV file using the azureml TabularDatasetFactory class.

  • Clean and encode data : A clean_data function cleans or replaces missing values in the dataset, and categorical values are one-hot encoded into numbers.

  • Split data : The data is split into train and test subsets using the train_test_split function with a fixed random state (random_state=42) and test set size (test_size=0.33).

  • Script arguments : The LogisticRegression class controls regularization through script arguments such as the regularization strength (C) and the maximum number of iterations (max_iter).
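The core of train.py can be sketched as below, assuming scikit-learn; the azureml dataset import and the clean_data step are omitted, and `train` is a hypothetical helper name used for illustration:

```python
import argparse

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def train(x, y, C=1.0, max_iter=100):
    # split with the values described above
    x_train, x_test, y_train, y_test = train_test_split(
        x, y, test_size=0.33, random_state=42
    )
    model = LogisticRegression(C=C, max_iter=max_iter)
    model.fit(x_train, y_train)
    return model, model.score(x_test, y_test)


# script arguments as HyperDrive would pass them; the explicit list here
# stands in for the real command line
parser = argparse.ArgumentParser()
parser.add_argument("--C", type=float, default=1.0)       # regularization strength
parser.add_argument("--max_iter", type=int, default=100)  # maximum iterations
args = parser.parse_args(["--C", "0.5", "--max_iter", "200"])
```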

3. Create a HyperDrive configuration : Configure the HyperDrive run with parameters such as the maximum total number of runs, the maximum number of concurrent runs, the name of the primary metric, and the primary metric goal, along with the following:

  • Parameter sampler : The RandomParameterSampling class enables random sampling over a hyperparameter search space of discrete or continuous values (C and max_iter).

  • Policy : Specifies an early termination policy for early stopping, with the required evaluation interval, slack factor, and delay evaluation.

  • Estimator : The SKLearn class creates an estimator for use with the train.py script, specifying the source directory, compute target, VM size, VM priority, and entry script path.

4. Submit the HyperDrive run : Submit the HyperDriveConfig to run the experiment, and view the run's progress through the RunDetails widget.
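Steps 3-4 can be sketched as below, assuming the Azure ML SDK v1, an Experiment object `exp`, and the compute cluster from step 1; the search-space bounds and run counts are placeholders, not the project's exact values:

```python
from azureml.train.hyperdrive import (
    BanditPolicy, HyperDriveConfig, PrimaryMetricGoal,
    RandomParameterSampling, choice, uniform,
)
from azureml.train.sklearn import SKLearn
from azureml.widgets import RunDetails

# random sampling over a continuous (C) and a discrete (max_iter) space
ps = RandomParameterSampling({
    "--C": uniform(0.01, 1.0),
    "--max_iter": choice(50, 100, 150),
})
policy = BanditPolicy(evaluation_interval=2, slack_factor=0.1, delay_evaluation=5)
est = SKLearn(source_directory=".", compute_target=cluster, entry_script="train.py")

hd_config = HyperDriveConfig(
    estimator=est,
    hyperparameter_sampling=ps,
    policy=policy,
    primary_metric_name="Accuracy",
    primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
    max_total_runs=20,
    max_concurrent_runs=4,
)
run = exp.submit(hd_config)
RunDetails(run).show()  # view progress in the notebook
```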

(screenshot: RunDetails widget output)

5. Retrieve the best run : Use the get_best_run_by_primary_metric method on the HyperDrive run to choose the model's best hyperparameters and retrieve the run's metrics.

6. Save the model : Use joblib to save the trained model, which creates a new file with the specified name in the outputs directory.
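Steps 5-6 can be sketched as below, assuming the HyperDrive run object `run` from above and a fitted `model` inside train.py; the output filename is a placeholder:

```python
import joblib

best_run = run.get_best_run_by_primary_metric()
print(best_run.get_metrics())  # best hyperparameters and accuracy

# inside train.py, the fitted model is saved to the outputs directory
joblib.dump(model, "outputs/model.joblib")
```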

Benefits of Random Parameter Sampler

  • Random parameter sampling supports both discrete and continuous hyperparameter values.
  • It helps identify low-performing runs, supporting early termination.
  • Random sampling has low bias, as hyperparameter values are selected at random from the defined search space with equal probability.
  • The choice function samples from a specified set of discrete values.
  • The uniform function draws samples uniformly over a continuous range.
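A plain-Python illustration of what random sampling draws on each run; the bounds and choices here are placeholders, not the project's actual search space:

```python
import random

def sample_hyperparameters():
    # each run gets one random draw from the search space:
    # C is continuous (uniform over a range), max_iter is discrete (choice)
    return {
        "C": random.uniform(0.01, 1.0),
        "max_iter": random.choice([50, 100, 150]),
    }
```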

Benefits of Bandit Policy

  • It is an early termination policy that stops low-performing runs, freeing compute for the remaining runs.
  • It terminates a run if its primary metric is not within the specified slack factor of the best performing run.
  • The policy is defined by a slack factor and an evaluation interval.
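The slack-factor rule can be sketched as below for a maximized metric (per the Azure ML documentation, a run is stopped when its metric falls below best / (1 + slack_factor)); `should_terminate` is a hypothetical helper name:

```python
def should_terminate(run_metric, best_metric, slack_factor=0.1):
    # terminate when the run's metric drops below the slack-adjusted best
    return run_metric < best_metric / (1 + slack_factor)
```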

AutoML

AutoML generated the Voting Ensemble algorithm as the best model. The ensembled algorithms of this model include XGBoostClassifier and LightGBM. The best ensemble weight for XGBoostClassifier is 0.14285714285714285, and the best individual pipeline score is 0.9142054720057982.

A prefitted soft voting classifier is applied: every individual classifier provides a probability value, the predictions are weighted according to each classifier's importance, the weighted probabilities are summed, and the class with the greatest sum wins the vote.
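Soft voting can be illustrated with scikit-learn stand-ins (not the AutoML ensemble itself); the estimators and weights below are arbitrary placeholders:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=42)
vote = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=500)),
        ("dt", DecisionTreeClassifier(random_state=42)),
    ],
    voting="soft",   # average the predicted class probabilities
    weights=[2, 1],  # hypothetical per-classifier importances
)
vote.fit(X, y)
probs = vote.predict_proba(X)  # weighted, summed class probabilities
```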

Parameters of XGBoostClassifier:

  • booster : gbtree uses tree-based models.
  • colsample_bytree : The subsample ratio of columns when constructing each tree.
  • objective : (binary:logistic) Logistic regression for binary classification.
  • max_depth : Maximum depth of a tree; increasing it makes the model more complex.
  • min_child_weight : Minimum sum of instance weight needed in a child; the larger min_child_weight is, the more conservative the algorithm.
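The listed parameters as a plain dict; the numeric values below are illustrative assumptions, not the values AutoML actually selected:

```python
xgb_params = {
    "booster": "gbtree",             # tree-based models
    "colsample_bytree": 0.8,         # assumed column subsample ratio per tree
    "objective": "binary:logistic",  # logistic regression for binary classification
    "max_depth": 6,                  # assumed maximum tree depth
    "min_child_weight": 1,           # assumed minimum child weight
}
```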

Pipeline comparison

The HyperDrive-optimized logistic regression model produced an accuracy of 0.91024462, while the AutoML Voting Ensemble reached an accuracy of 0.91547381. The difference in accuracy is 0.00522919.

The architectures differ in that a HyperDrive run requires manual tuning of hyperparameters, running through different hyperparameter values and choosing the one that outperforms the others, whereas an AutoML run tunes hyperparameters automatically, producing the best hyperparameters for that run, and also automates model selection.

Yes, there was a slight difference in accuracy. AutoML evaluates many machine learning algorithms, and the Voting Ensemble combines the predictions of multiple models, which likely explains the difference in accuracy and the better model performance.

Future work: Some areas of improvement for future experiments.

  • Use a dedicated virtual machine instead of a low-priority one, as low-priority VMs do not guarantee availability of compute nodes.
  • Enable deep learning when specifying the classification task type for AutoML; AutoML applies default techniques depending on the number of rows in the provided training dataset and performs a train/validation split with the required number of cross validations without these being specified explicitly.
  • Use the iterations parameter of the AutoMLConfig class to test more algorithm and parameter combinations during the automated ML experiment, and increase experiment_timeout_minutes.
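The future-work configuration can be sketched as below, assuming the Azure ML SDK v1 and a registered tabular dataset `ds` with target column `y`; the numeric values are placeholders:

```python
from azureml.train.automl import AutoMLConfig

automl_config = AutoMLConfig(
    task="classification",
    primary_metric="accuracy",
    training_data=ds,
    label_column_name="y",
    n_cross_validations=5,
    iterations=50,                  # more algorithm/parameter combinations
    experiment_timeout_minutes=60,  # longer experiment timeout
    enable_dnn=True,                # enable deep learning models
)
```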

Proof of cluster clean up

(screenshot: compute cluster deleted)
