This project is part of the Udacity Azure ML Nanodegree. In this project, we build and optimize an Azure ML pipeline using the Python SDK and a provided Scikit-learn model. This model is then compared to an Azure AutoML run.
- ScriptRunConfig Class
- Configure and submit training runs
- HyperDriveConfig Class
- How to tune hyperparameters (a minimal run-submission sketch follows this list)
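As a reference, here is a minimal sketch of how a training run can be configured and submitted with `ScriptRunConfig`. The conda file name, compute cluster name, and script arguments are placeholders, not the exact values used in this project.

```python
from azureml.core import Workspace, Experiment, Environment, ScriptRunConfig

ws = Workspace.from_config()  # reads the workspace config.json

# Environment for the training script; the conda file name is a placeholder.
sklearn_env = Environment.from_conda_specification(
    name="sklearn-env", file_path="conda_dependencies.yml")

# Point the run at train.py; the compute cluster name and arguments are assumed values.
src = ScriptRunConfig(
    source_directory=".",
    script="train.py",
    arguments=["--C", 1.0, "--max_iter", 100],
    compute_target="cpu-cluster",
    environment=sklearn_env,
)

run = Experiment(ws, "udacity-project").submit(src)
run.wait_for_completion(show_output=True)
```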
The data contains information about bank customers, such as job, age, marital status, education, and whether or not they have a housing loan. We try to predict whether a customer will subscribe to a term deposit. The data was first fitted to a Logistic Regression model, which reached an accuracy of 91.5%. After this, I ran AutoML, where the data was fitted to 40 models. The best performing model was a VotingEnsemble with an accuracy of 91.44%, followed by MaxAbsScaler XGBoostClassifier with an accuracy of 91.2%.
- Create a tabular dataset from the provided bank marketing data using TabularDatasetFactory
- The train.py script contains code that cleans the data and encodes categorical values
- Data is then split into train and test sets
- A Logistic Regression model is then fitted to the data
- I then used HyperDrive to tune the hyperparameters C and max_iter (a simplified sketch of these steps in train.py follows this list)
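The core of train.py follows the pattern below. This is a simplified sketch: the dataset URL is a placeholder, the label column is assumed to be `y`, and the cleaning step is reduced to a generic one-hot encoding rather than the project's exact cleaning logic.

```python
import argparse
import pandas as pd
from azureml.core import Run
from azureml.data.dataset_factory import TabularDatasetFactory
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def clean_data(dataset):
    """Simplified cleaning/encoding step: drop missing rows, one-hot encode
    categoricals, and map the assumed target column "y" to 0/1."""
    df = dataset.to_pandas_dataframe().dropna()
    y = df.pop("y").apply(lambda v: 1 if v == "yes" else 0)
    x = pd.get_dummies(df)
    return x, y


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--C", type=float, default=1.0, help="Inverse of regularization strength")
    parser.add_argument("--max_iter", type=int, default=100, help="Maximum iterations to converge")
    args = parser.parse_args()

    # Load the bank marketing data as a tabular dataset (URL is a placeholder).
    ds = TabularDatasetFactory.from_delimited_files(path="<dataset_url>")

    x, y = clean_data(ds)
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

    model = LogisticRegression(C=args.C, max_iter=args.max_iter).fit(x_train, y_train)

    # Log accuracy so HyperDrive can use it as the primary metric.
    run = Run.get_context()
    run.log("Accuracy", float(model.score(x_test, y_test)))


if __name__ == "__main__":
    main()
```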
The results of the HyperDrive run are shown below:
Random sampling was chosen, meaning hyperparameter values are selected at random from the defined search space rather than exhaustively. The discrete values for the inverse regularization strength C were 0.01, 5, 20, 100, and 500; lower values indicate stronger regularization. The values for max_iter were 10, 50, 100, 15, and 200.
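A minimal sketch of that search space, using the SDK's `RandomParameterSampling` and `choice`; the `--C` and `--max_iter` argument names match the train.py sketch above and are an assumption about the script's interface.

```python
from azureml.train.hyperdrive import RandomParameterSampling, choice

# Discrete search space matching the values described above.
param_sampling = RandomParameterSampling({
    "--C": choice(0.01, 5, 20, 100, 500),        # inverse regularization strength
    "--max_iter": choice(10, 50, 100, 15, 200),  # maximum number of iterations
})
```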
The early stopping policy automatically terminates poorly performing runs. I defined a Bandit policy with a slack factor of 0.1, which terminates any run whose primary metric (accuracy, in this case) is not within the specified slack factor of the best performing run.
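This policy corresponds to the SDK's `BanditPolicy`. Combined with the sampler and the `ScriptRunConfig` from the earlier sketches, the HyperDrive run can be configured roughly as follows; the evaluation interval and run counts are assumed values, not confirmed settings from this project.

```python
from azureml.core import Experiment
from azureml.train.hyperdrive import BanditPolicy, HyperDriveConfig, PrimaryMetricGoal

# Terminate runs whose accuracy falls outside a 0.1 slack factor of the best run so far.
early_termination_policy = BanditPolicy(evaluation_interval=1, slack_factor=0.1)

# ws, src, and param_sampling come from the earlier sketches; run counts are assumed.
hyperdrive_config = HyperDriveConfig(
    run_config=src,
    hyperparameter_sampling=param_sampling,
    policy=early_termination_policy,
    primary_metric_name="Accuracy",
    primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
    max_total_runs=20,
    max_concurrent_runs=4,
)

hyperdrive_run = Experiment(ws, "udacity-project").submit(hyperdrive_config)
hyperdrive_run.wait_for_completion(show_output=True)
```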
We configured AutoML with the following parameters (a minimal configuration sketch follows the list):
- task - whether it is a classification or regression problem. In this case, we chose classification.
- primary_metric - This is the metric that we want AutoML to optimize. In this case, we prioritize accuracy.
- training_data - specify the data to be used during training
- label_column_name - label of the column that will be predicted
- n_cross_validations - number of cross validations that were performed. In this case, I chose 3
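Under those settings, the AutoML configuration looks roughly like the sketch below; the timeout, compute target, and label column name "y" are assumptions, not confirmed values from the project.

```python
from azureml.core import Experiment
from azureml.train.automl import AutoMLConfig

# ws and ds (the TabularDataset) come from the earlier sketches.
automl_config = AutoMLConfig(
    task="classification",            # classification problem
    primary_metric="accuracy",        # metric AutoML optimizes for
    training_data=ds,                 # the tabular dataset loaded earlier
    label_column_name="y",            # assumed name of the target column
    n_cross_validations=3,            # 3-fold cross-validation
    experiment_timeout_minutes=30,    # assumed timeout
    compute_target="cpu-cluster",     # placeholder cluster name
)

automl_run = Experiment(ws, "udacity-project-automl").submit(automl_config, show_output=True)
```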
These are the results of the AutoML:
The best model is the VotingEnsemble:
The Logistic Regression model tuned with HyperDrive had an accuracy of 91.5%, while the VotingEnsemble from AutoML had an accuracy of 91.44%. This means there was no significant difference between the two models.
The data was highly imbalanced. I would like to apply class-balancing techniques and see whether they improve the model.
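One straightforward way to try this, assuming the same scikit-learn pipeline as in the train.py sketch above, would be to weight classes inversely to their frequency. This is only an illustrative sketch, not something that was run as part of this project.

```python
from sklearn.linear_model import LogisticRegression

# x_train/x_test/y_train/y_test as in the earlier train.py sketch.
# class_weight="balanced" reweights samples inversely to class frequency, which can
# help when the positive class (customers who subscribe) is rare.
balanced_model = LogisticRegression(C=1.0, max_iter=100, class_weight="balanced")
balanced_model.fit(x_train, y_train)
print("Accuracy with balanced class weights:", balanced_model.score(x_test, y_test))
```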