
Optimizing an ML Pipeline in Azure

Overview

This project is part of the Udacity Azure ML Nanodegree. In this project, we build and optimize an Azure ML pipeline using the Python SDK and a provided Scikit-learn model. This model is then compared to an Azure AutoML run.


Summary

Problem statement:

The dataset contains information from a bank marketing campaign. The problem is binary classification: predict whether the client subscribed to a term deposit (y) or not (n). The target column is 'y' in the given dataset. Source: UCI Machine Learning Repository.

Solution:

The best performing model was a VotingEnsemble trained by the AutoML feature of Azure ML. It achieved an accuracy of 91.7%.

Scikit-learn Pipeline

Explain the pipeline architecture, including data, hyperparameter tuning, and classification algorithm.

The scikit-learn pipeline consists of the following stages:

  1. Fetching the data from the remote URL
  2. Cleaning the data
  3. Splitting the data into train and test sets
  4. Hyperparameter tuning of a Logistic Regression classifier using Azure ML's HyperDrive package (a minimal sketch of this flow follows the list)
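
A minimal sketch of how these stages map onto the training script, assuming the project's clean_data() preprocessing helper and a placeholder URL for the dataset (exact names and defaults are illustrative):

```python
import argparse

from azureml.core.run import Run
from azureml.data.dataset_factory import TabularDatasetFactory
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

parser = argparse.ArgumentParser()
parser.add_argument("--C", type=float, default=1.0, help="Inverse of regularization strength")
parser.add_argument("--max_iter", type=int, default=100, help="Maximum number of iterations")
args = parser.parse_args()

# 1. Fetch the data from the remote URL (placeholder URL)
ds = TabularDatasetFactory.from_delimited_files(path="https://<remote-url>/bankmarketing_train.csv")

# 2. Clean the data (clean_data is the project's preprocessing helper)
x, y = clean_data(ds)

# 3. Split the data into train and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# 4. Train Logistic Regression with the sampled hyperparameters and log accuracy for HyperDrive
model = LogisticRegression(C=args.C, max_iter=args.max_iter).fit(x_train, y_train)
Run.get_context().log("Accuracy", float(model.score(x_test, y_test)))
```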

What are the benefits of the parameter sampler you chose?

  • I chose RandomParameterSampling.

  • Hyperparameter values are randomly selected from the defined search space.

  • It allows the search space to include both discrete and continuous hyperparameters (illustrated in the example after this list).

  • It greatly reduces computation costs and speeds up the parameter space exploration.
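
For reference, a RandomParameterSampling definition along these lines; the ranges below are illustrative, not the exact values used in the project:

```python
from azureml.train.hyperdrive import RandomParameterSampling, choice, uniform

param_sampling = RandomParameterSampling({
    "--C": uniform(0.01, 10.0),          # continuous hyperparameter (inverse regularization strength)
    "--max_iter": choice(50, 100, 200),  # discrete hyperparameter (maximum iterations)
})
```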

What are the benefits of the early stopping policy you chose?

  • I chose BanditPolicy.

  • The policy terminates early any run whose primary metric is not within the specified slack factor/slack amount of the best-performing run so far (see the example after this list).

  • This prevents unnecessary runs from consuming compute resources.

  • It allows us to define a minimum required improvement to continue with the parameter search.
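
An illustrative BanditPolicy definition, wired into the HyperDriveConfig; the slack factor, intervals and run counts here are placeholders rather than the project's exact settings:

```python
from azureml.train.hyperdrive import BanditPolicy, HyperDriveConfig, PrimaryMetricGoal

# Terminate runs whose accuracy falls more than 10% below the current best,
# evaluated every interval after an initial delay of 5 intervals.
early_termination_policy = BanditPolicy(slack_factor=0.1, evaluation_interval=1, delay_evaluation=5)

hyperdrive_config = HyperDriveConfig(
    run_config=src,                          # the ScriptRunConfig/estimator wrapping train.py
    hyperparameter_sampling=param_sampling,  # the RandomParameterSampling defined above
    policy=early_termination_policy,
    primary_metric_name="Accuracy",
    primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
    max_total_runs=20,
    max_concurrent_runs=4,
)
```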

AutoML

In 1-2 sentences, describe the model and hyperparameters generated by AutoML.

The best model generated by AutoML is a VotingEnsemble with an accuracy of 91.7%. The hyperparameters of the ensemble (its constituent estimators and their voting weights) can be inspected from the fitted model returned by the run, as sketched below.
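
A sketch of how the AutoML run can be configured and the ensemble inspected; variable names such as train_data, compute_cluster and experiment are assumed to be defined earlier in the notebook, and the timeout is illustrative:

```python
from azureml.train.automl import AutoMLConfig

automl_config = AutoMLConfig(
    task="classification",
    primary_metric="accuracy",
    training_data=train_data,        # cleaned TabularDataset
    label_column_name="y",
    n_cross_validations=5,
    compute_target=compute_cluster,
    experiment_timeout_minutes=30,
)
automl_run = experiment.submit(automl_config, show_output=True)

# Retrieve the best run and fitted model, then inspect the ensemble's steps
best_run, fitted_model = automl_run.get_output()
print(fitted_model.steps)
```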

Pipeline comparison

Compare the two models and their performance. What are the differences in accuracy? In architecture? If there was a difference, why do you think there was one?

The best model generated with HyperDrive optimization is a Logistic Regression model, with 91% accuracy. The best model generated by AutoML is a VotingEnsemble, with 91.7% accuracy. VotingEnsemble is a soft-voting ensemble built from models trained in earlier AutoML iterations. The 0.7% difference in accuracy is minor; the small edge for AutoML is expected because the ensemble combines predictions from several model families, whereas HyperDrive tunes only a single Logistic Regression.

If ensembling were excluded, the best individual model would be StandardScalerWrapper XGBoostClassifier, with 91.34% accuracy.

Future work

What are some areas of improvement for future experiments? Why might these improvements help the model?

  • Add more data to the model

  • Engineer additional features, on their own or in combination with existing columns, applying domain knowledge

  • Add additional parameters to AutoMLConfig

  • Give more choices for the hyperparameters inside RandomParameterSampling, e.g. for C and max_iter

  • Try other Parameter Sampling techniques in Hyperdrive

  • Try other Early Stopping policies in Hyperdrive

  • Train deep learning models instead of Logistic Regression, since they may improve accuracy further; also enable deep learning models in the AutoML run

  • We noticed that there is class imbalance in the dataset. Over-sampling the minority class (e.g. with SMOTE, sketched at the end of this list) would balance the classes and could improve model performance.

  • Perform additional data processing such as normalisation, standardisation etc. as required

  • It might help to store the data permanently in our Datastore and register it as a dataset (sketched below). Currently we pull the data directly from the remote URL into the notebook, which causes two problems:

          => if the data at the remote URL changes, results will vary, and we need to be able to associate model performance with the exact data that was used;

          => if the URL stops working in the future, the pipeline breaks.
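
A sketch of persisting a snapshot of the data in the workspace datastore and registering it as a versioned dataset, so runs are tied to a fixed copy rather than the remote URL; ws is the Workspace object and file paths/names are placeholders:

```python
from azureml.core import Dataset

datastore = ws.get_default_datastore()

# Upload a local snapshot of the CSV so future runs no longer depend on the remote URL
datastore.upload_files(files=["./bankmarketing_train.csv"], target_path="data/", overwrite=True)

# Register it as a versioned TabularDataset in the workspace
dataset = Dataset.Tabular.from_delimited_files(path=(datastore, "data/bankmarketing_train.csv"))
dataset = dataset.register(workspace=ws, name="bankmarketing", create_new_version=True)
```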
    

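As an example of the over-sampling idea above, a hypothetical sketch using imbalanced-learn's SMOTE, assuming x_train/y_train are the cleaned, split features and labels:

```python
from imblearn.over_sampling import SMOTE

# Synthesize new minority-class samples until the classes are balanced
smote = SMOTE(random_state=42)
x_resampled, y_resampled = smote.fit_resample(x_train, y_train)
# The balanced data would then be fed to the Logistic Regression / AutoML runs
```
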
Proof of cluster clean up

If you did not delete your compute cluster in the code, please complete this section. Otherwise, delete this section.

Cluster clean up is handled in the code.
