#### TODO Recording

Click Data in the left navigation pane, then select "Browse DBFS".

Go to Data -> DBFS -> FileStore -> datasets and upload Bank_customer.csv.

In [0]:
import numpy as np
import pandas as pd
 
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

import mlflow
import mlflow.sklearn
from hyperopt import fmin, tpe, hp, SparkTrials, STATUS_OK

Loading the data and displaying the first 5 rows of the dataset.
Dataset link: https://www.kaggle.com/santoshd3/bank-customers?select=Churn+Modeling.csv

In [0]:
cust_attrition_data = pd.read_csv('/dbfs/FileStore/datasets/Bank_customer.csv')

cust_attrition_data.head()

Unnamed: 0,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,15634602,Hargrave,619,France,Female,42,2,0.0,1,Yes,Yes,101348.88,Yes
1,15647311,Hill,608,Spain,Female,41,1,83807.86,1,No,Yes,112542.58,No
2,15619304,Onio,502,France,Female,42,8,159660.8,3,Yes,No,113931.57,Yes
3,15701354,Boni,699,France,Female,39,1,0.0,2,No,No,93826.63,No
4,15737888,Mitchell,850,Spain,Female,43,2,125510.82,1,Yes,Yes,79084.1,No


In [0]:
cust_attrition_data.shape

Out[4]: (10000, 13)

In [0]:
cust_attrition_data.columns

Out[5]: Index(['CustomerId', 'Surname', 'CreditScore', 'Geography', 'Gender', 'Age',
       'Tenure', 'Balance', 'NumOfProducts', 'HasCrCard', 'IsActiveMember',
       'EstimatedSalary', 'Exited'],
      dtype='object')

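Checking the number of unique values in each column. CustomerId is unique for every row and Surname has 2932 distinct values, so neither is useful as a feature and both can be dropped.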

In [0]:
cust_attrition_data.nunique()

Out[6]: CustomerId         10000
Surname             2932
CreditScore          460
Geography              3
Gender                 3
Age                   70
Tenure                11
Balance             6382
NumOfProducts          4
HasCrCard              2
IsActiveMember         2
EstimatedSalary     9999
Exited                 2
dtype: int64

The features and the target are split into separate objects: X holds the feature columns and y holds the Exited column.

In [0]:
X = cust_attrition_data.drop(['CustomerId', 'Surname', 'Exited'], axis = 1)
y = cust_attrition_data['Exited']

The target variable is converted to numeric form.

In [0]:
class_dict = {'No': 0, 'Yes': 1}

y = y.replace(class_dict)
y

Out[8]: 0       1
1       0
2       1
3       0
4       0
       ..
9995    0
9996    0
9997    1
9998    1
9999    0
Name: Exited, Length: 10000, dtype: int64

The data is imbalanced: only 2037 of the 10000 customers exited.

In [0]:
y.value_counts()

Out[9]: 0    7963
1    2037
Name: Exited, dtype: int64
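To make the imbalance concrete, here is a small sketch that works only from the two counts printed above. The variable names are ours and are not part of the notebook.

```python
# Class counts copied from the value_counts() output above.
counts = {0: 7963, 1: 2037}
total = sum(counts.values())

churn_rate = counts[1] / total         # share of customers who exited
baseline_accuracy = counts[0] / total  # accuracy of always predicting 0

print(f"churn rate: {churn_rate:.2%}")
print(f"accuracy of always predicting 0: {baseline_accuracy:.2%}")
```

A model that never predicts an exit would already be right for roughly four customers out of five, so accuracy alone says little about how well the minority class is detected.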

We one-hot encode the categorical columns, dropping the first category of each.

In [0]:
categoricalCols = ['Geography', 'Gender', 'HasCrCard', 'IsActiveMember']
numericCols = ['CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'EstimatedSalary']

categorical_transformer = OneHotEncoder(drop = 'first')
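As an illustration of what dropping the first category means, here is a plain-Python sketch. It is not scikit-learn's implementation; the helper one_hot_drop_first and the sample labels are made up for this example.

```python
def one_hot_drop_first(values):
    """One-hot encode a list of labels, dropping the first (sorted) category."""
    categories = sorted(set(values))
    kept = categories[1:]  # the dropped category becomes the all-zeros row
    rows = [[1 if value == category else 0 for category in kept]
            for value in values]
    return kept, rows

# Illustrative labels only, not read from the dataset.
kept, rows = one_hot_drop_first(["France", "Spain", "France", "Germany"])
print(kept)  # columns that remain after dropping the first category
print(rows)  # one encoded row per input label
```

With three categories only two columns remain, and the dropped category is represented by a row of zeros. This avoids carrying a column that is fully determined by the others.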

The preprocessing steps are defined with a ColumnTransformer, which one-hot encodes only the categorical columns and scales the remaining numeric columns with StandardScaler.

In [0]:
preprocessor = ColumnTransformer(
    transformers = [('cat', categorical_transformer, categoricalCols)], 
    remainder = StandardScaler() 
)

print(preprocessor)

ColumnTransformer(remainder=StandardScaler(),
                  transformers=[('cat', OneHotEncoder(drop='first'),
                                 ['Geography', 'Gender', 'HasCrCard',
                                  'IsActiveMember'])])


In [0]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, stratify = y)

X_train.shape, X_test.shape

Out[12]: ((7000, 10), (3000, 10))
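The stratify argument keeps the class ratio the same in both splits. The sketch below shows the idea in plain Python on toy labels; stratified_split is a hypothetical helper written for this illustration and is not how train_test_split is implemented.

```python
import random

def stratified_split(labels, test_size, seed=0):
    """Split indices so each class contributes the same share to the test set."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels with a 20% positive class, roughly like the Exited column.
labels = [0] * 80 + [1] * 20
train_idx, test_idx = stratified_split(labels, test_size=0.3)

test_positive_share = sum(labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), test_positive_share)
```

Because each class is sampled separately, the test set keeps the 20% positive share of the full toy data.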

We create and train a decision tree classifier, and use MLflow to log the parameters, the metrics, and the model.

In [0]:
with mlflow.start_run():
 
    criterion= 'entropy'
    max_depth = 5
    max_features = 5
    
    dtc = DecisionTreeClassifier(criterion = criterion, max_depth = max_depth,
                                 max_features = max_features)

    pipeline = Pipeline( steps = [('preprocessor', preprocessor), ('classifier', dtc)])
    pipeline.fit(X_train, y_train)

    predictions =  pipeline.predict(X_test) 
    accuracy = accuracy_score(y_test, predictions)
    precisionscore = precision_score(y_test, predictions)
    recallscore = recall_score(y_test, predictions)
    f1score = f1_score(y_test, predictions)
  
    mlflow.log_param('criterion', criterion)
    mlflow.log_param('max_depth', max_depth)
    mlflow.log_param('max_features', max_features)
  
    mlflow.log_metric('F1score',  f1score)
    mlflow.log_metric('Recall_score',  recallscore)
    mlflow.log_metric('Precision_score',  precisionscore)
    mlflow.log_metric('Accuracy_score',  accuracy)

    mlflow.sklearn.log_model(dtc, 'dtc_model')

#### TODO Recording:

- Open up the Runs on the right hand side (using the flask icon)
- Expand the metrics logged as well as the parameters we logged

Hyperopt is a Python library for hyperparameter tuning. Databricks Runtime for Machine Learning includes an optimized and enhanced version of Hyperopt, including automated MLflow tracking.

The objective is to find the best values for max_features, max_depth, and criterion. Most of the code for a Hyperopt workflow is in the objective function.

Because the dataset is imbalanced, we use the F1 score rather than accuracy to compare the models' performance. Hyperopt minimizes the objective function, and a higher F1 score means a better model, so the objective function must return the negative F1 score.
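The sign flip can be seen in a small plain-Python sketch. The function f1_from_labels and the toy labels are ours; the notebook itself uses sklearn.metrics.f1_score.

```python
def f1_from_labels(y_true, y_pred):
    """Compute the F1 score for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives, so F1 is reported as zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy labels: 3 positives out of 10, one missed and one false alarm.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

f1 = f1_from_labels(y_true, y_pred)
loss = -f1  # lowering this loss is the same as raising the F1 score

all_zero_f1 = f1_from_labels(y_true, [0] * len(y_true))
print(f1, loss, all_zero_f1)
```

A predictor that never flags a positive gets an F1 score of zero however high its accuracy is, which is why F1 is the more useful target on imbalanced data.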

When calling fmin(), Databricks recommends active MLflow run management; that is, wrap the call to fmin() inside a with mlflow.start_run(): statement. This ensures that each fmin() call is logged to a separate MLflow main run, and makes it easier to log extra tags, parameters, or metrics to that run.

#### TODO Recording:
- Directly record with MLflow enabled

In [0]:
from hyperopt.pyll import scope

def objective(params):
    
    with mlflow.start_run(nested = True):

        clf = DecisionTreeClassifier(max_features = params['max_features'], 
                                     max_depth = params['max_depth'],
                                     criterion = params['criterion'])
        pipeline = Pipeline(steps = [('preprocessor', preprocessor), ('model', clf)])
        pipeline.fit(X_train, y_train)
        
        predictions =  pipeline.predict(X_test) 
        accuracy = accuracy_score(y_test, predictions)
        precisionscore = precision_score(y_test, predictions)
        recallscore = recall_score(y_test, predictions)
        f1score = f1_score(y_test, predictions)

        mlflow.log_metric('F1score',  f1score)
        mlflow.log_metric('Recall_score',  recallscore)
        mlflow.log_metric('Precision_score',  precisionscore)
        mlflow.log_metric('Accuracy_score',  accuracy)

        # Log the classifier trained in this trial (clf), not the earlier dtc model
        mlflow.sklearn.log_model(clf, 'dtc_hpo')
  
        return {'loss': -f1score, 'status': STATUS_OK}

The search space for the hyperparameters is defined.

In [0]:
search_space = {'max_features': scope.int(hp.quniform('max_features', 1, 10, 1)),
                'max_depth': scope.int(hp.quniform('max_depth', 1, 15, 1)),
                'criterion': hp.choice('criterion', ['gini', 'entropy'])} 
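The Hyperopt documentation describes hp.quniform(label, low, high, q) as returning a value like round(uniform(low, high) / q) * q. The sketch below mimics that in plain Python to show why the sampled values are whole numbers stored as floats, and what the integer cast does. The quniform helper is our stand-in and is not Hyperopt code.

```python
import random

def quniform(rng, low, high, q):
    """Mimic hp.quniform: round(uniform(low, high) / q) * q, kept as a float."""
    return float(round(rng.uniform(low, high) / q) * q)

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Same bounds as the max_depth entry in the search space above.
raw_depths = [quniform(rng, 1, 15, 1) for _ in range(5)]
int_depths = [int(depth) for depth in raw_depths]  # the cast scope.int applies

print(raw_depths)
print(int_depths)
```

Without the cast, DecisionTreeClassifier would receive values such as 9.0 where it expects an integer.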

The search algorithm is defined. The two main choices are:

- hyperopt.tpe.suggest: Tree of Parzen Estimators, a Bayesian approach that iteratively and adaptively selects new hyperparameter settings to explore based on past results.
- hyperopt.rand.suggest: Random search, a non-adaptive approach that samples over the search space.
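For intuition, the non-adaptive option can be sketched in a few lines of plain Python. This is a toy random search over one parameter, not Hyperopt's implementation; random_search, toy_objective, and sample_x are names invented for this example.

```python
import random

def random_search(objective, sample, max_evals, seed=0):
    """Draw max_evals independent samples and keep the one with the lowest loss."""
    rng = random.Random(seed)
    best_params, best_loss = None, float("inf")
    for _ in range(max_evals):
        params = sample(rng)      # each draw ignores all earlier results
        loss = objective(params)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss

def toy_objective(params):
    """Toy loss with its minimum at x = 3."""
    return (params["x"] - 3.0) ** 2

def sample_x(rng):
    """Sample x uniformly from a fixed range."""
    return {"x": rng.uniform(-10.0, 10.0)}

best_params, best_loss = random_search(toy_objective, sample_x, max_evals=16)
print(best_params, best_loss)
```

An adaptive method such as TPE differs in the sampling step: it uses the losses seen so far to decide where to sample next.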

In [0]:
algo = tpe.suggest

Now we run the tuning algorithm with Hyperopt's fmin().

max_evals is the maximum number of points in hyperparameter space to test, that is, the maximum number of models to fit and evaluate. Here it is set to 16.

The best result found has an F1 score of around 57%, with parameters {'criterion': 1, 'max_depth': 9.0, 'max_features': 9.0}. fmin() reports an hp.choice parameter as the index of the chosen option, so criterion 1 means 'entropy'. max_depth and max_features are reported as floats because hp.quniform samples floats; they are cast to integers before being passed to the model.

In [0]:
argmin = fmin(
  fn = objective,
  space = search_space,
  algo = algo,
  max_evals = 16)

print('Best value found: ', argmin)

  0%|          | 0/16 [00:00<?, ?trial/s, best loss=?]  6%|▋         | 1/16 [00:03<00:54,  3.62s/trial, best loss: -0.5337781484570475] 12%|█▎        | 2/16 [00:06<00:48,  3.45s/trial, best loss: -0.5337781484570475] 19%|█▉        | 3/16 [00:10<00:44,  3.40s/trial, best loss: -0.5337781484570475] 25%|██▌       | 4/16 [00:13<00:39,  3.32s/trial, best loss: -0.5337781484570475] 31%|███▏      | 5/16 [00:16<00:36,  3.35s/trial, best loss: -0.5337781484570475] 38%|███▊      | 6/16 [00:20<00:33,  3.31s/trial, best loss: -0.5337781484570475] 44%|████▍     | 7/16 [00:23<00:30,  3.34s/trial, best loss: -0.5337781484570475] 50%|█████     | 8/16 [00:27<00:27,  3.41s/trial, best loss: -0.5337781484570475] 56%|█████▋    | 9/16 [00:30<00:24,  3.48s/trial, best loss: -0.5337781484570475] 62%|██████▎   | 10/16 [00:34<00:21,  3.55s/trial, best loss: -0.5337781484570475] 69%|██████▉   | 11/16 [00:38<00:17,  3.58s/trial, best loss: -0.5337781484570475] 75%|███████▌  | 12/16 [00:41<00:14,  3.
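The raw dictionary returned by fmin() can be decoded by hand. The sketch below uses the values quoted in the text for this run and the option list from the search space; the variable names are ours.

```python
# Options in the same order as hp.choice('criterion', ...) in the search space.
criterion_options = ['gini', 'entropy']

# Raw result as quoted in the text: an index for criterion, floats for the rest.
raw_best = {'criterion': 1, 'max_depth': 9.0, 'max_features': 9.0}

decoded = {
    'criterion': criterion_options[raw_best['criterion']],
    'max_depth': int(raw_best['max_depth']),
    'max_features': int(raw_best['max_features']),
}
print(decoded)
```

hyperopt.space_eval(search_space, argmin) does the same translation without manual lookups.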

#### TODO Recording

- After running the code above, stay on this page and watch as more runs are added to the same experiment
- Click on "experiment" and that will open up the Experiment on a new tab
- There should be 17 runs in the Experiment
- Sort by F1 score and find the run with the highest F1 score
- Click on that and expand the "Parameters" and "Metrics" tab and show
- IMPORTANT: Go to the experiments tab and delete all runs (so it's easier to see our next set of runs)

We are now using distributed tuning, adding one more argument to fmin(): a Trials class called SparkTrials.

SparkTrials takes 2 optional arguments:

- parallelism: Number of models to fit and evaluate concurrently. The default is the number of available Spark task slots.
- timeout: Maximum time (in seconds) that fmin() can run. The default is no maximum time limit.

This example uses the same simple objective function defined earlier. In this case, the function runs quickly and the overhead of starting the Spark jobs dominates the calculation time, so the calculations for the distributed case take more time. For typical real-world problems, the objective function is more complex, and using SparkTrials to distribute the calculations will be faster than single-machine tuning.

Automated MLflow tracking is enabled by default.

The best run found has an accuracy of around 85%, with parameters {'criterion': 0, 'max_depth': 8.033104023258073, 'max_features': 6.610009057572852}.

In [0]:
from hyperopt import SparkTrials
 
spark_trials = SparkTrials()
 
with mlflow.start_run():
    argmin = fmin(
        fn = objective,
        space = search_space,
        algo = algo,
        max_evals = 16,
        trials = spark_trials)

print('Best value found: ', argmin)

Because the requested parallelism was None or a non-positive value, parallelism will be set to (4), which is Spark's default parallelism (4), or 1, whichever is greater. We recommend setting parallelism explicitly to a positive value because the total of Spark task slots is subject to cluster sizing.
Hyperopt with SparkTrials will automatically track trials in MLflow. To view the MLflow experiment associated with the notebook, click the 'Runs' icon in the notebook context bar on the upper right. There, you can view all runs.
To view logs from trials, please check the Spark executor logs. To view executor logs, expand 'Spark Jobs' above until you see the (i) icon next to the stage from the trial job. Click it and find the list of tasks. Click the 'stderr' link for a task to view trial logs.
  0%|          | 0/16 [00:00<?, ?trial/s, best loss=?]  6%|▋         | 1/16 [00:08<02:02,  8.14s/trial, best loss: -0.39491916859122406] 12%|█▎        | 2/16 [00:09<00:55,  4.00s/trial, best loss:

#### TODO Recording

- After running the code above, stay on this page and watch as more runs are added to the same experiment
- Click on "experiment" and that will open up the Experiment on a new tab
- There should be 17 runs in the Experiment
- Sort by F1 score and find the run with the highest F1 score
- Click on that and expand the "Parameters" and "Metrics" tab and show
- To examine the effect of tuning a specific hyperparameter:
  - Select the resulting runs and click Compare.
  - In the Scatter Plot, select the hyperparameter from the X-axis drop-down menu and the metric from the Y-axis drop-down menu (e.g. max_depth and F1 score).
- IMPORTANT: Go to the experiments tab and delete all runs (so it's easier to see our next set of runs)

In [0]:
spark_trials.results

Out[21]: [{'loss': -0.39491916859122406, 'status': 'ok'},
 {'loss': -0.5118644067796609, 'status': 'ok'},
 {'loss': -0.5155642023346303, 'status': 'ok'},
 {'loss': -0.4788441692466461, 'status': 'ok'},
 {'loss': -0.4968496849684968, 'status': 'ok'},
 {'loss': -0.5417406749555951, 'status': 'ok'},
 {'loss': -0.501039501039501, 'status': 'ok'},
 {'loss': -0.5014409221902018, 'status': 'ok'},
 {'loss': -0.5420393559928444, 'status': 'ok'},
 {'loss': -0.5456171735241503, 'status': 'ok'},
 {'loss': -0.4485981308411215, 'status': 'ok'},
 {'loss': -0.5495750708215298, 'status': 'ok'},
 {'loss': -0.5743494423791822, 'status': 'ok'},
 {'loss': -0.5014409221902018, 'status': 'ok'},
 {'loss': -0.5212669683257919, 'status': 'ok'},
 {'loss': -0.0, 'status': 'ok'}]
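The best trial can also be read off this list directly. The sketch below copies the output above into a plain list and finds the lowest loss; it does not need a Spark cluster.

```python
# Copied from the spark_trials.results output above.
results = [
    {'loss': -0.39491916859122406, 'status': 'ok'},
    {'loss': -0.5118644067796609, 'status': 'ok'},
    {'loss': -0.5155642023346303, 'status': 'ok'},
    {'loss': -0.4788441692466461, 'status': 'ok'},
    {'loss': -0.4968496849684968, 'status': 'ok'},
    {'loss': -0.5417406749555951, 'status': 'ok'},
    {'loss': -0.501039501039501, 'status': 'ok'},
    {'loss': -0.5014409221902018, 'status': 'ok'},
    {'loss': -0.5420393559928444, 'status': 'ok'},
    {'loss': -0.5456171735241503, 'status': 'ok'},
    {'loss': -0.4485981308411215, 'status': 'ok'},
    {'loss': -0.5495750708215298, 'status': 'ok'},
    {'loss': -0.5743494423791822, 'status': 'ok'},
    {'loss': -0.5014409221902018, 'status': 'ok'},
    {'loss': -0.5212669683257919, 'status': 'ok'},
    {'loss': -0.0, 'status': 'ok'},
]

# The loss is the negative F1 score, so the lowest loss is the best trial.
best_index = min(range(len(results)), key=lambda i: results[i]['loss'])
best_f1 = -results[best_index]['loss']
print(best_index, best_f1)
```

The last entry has a loss of -0.0, meaning an F1 score of zero: that trial produced no correct positive predictions on the test set.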

Now we compare four algorithms available in scikit-learn: support vector machines (SVM), random forest, logistic regression, and decision tree.

In the following cell, params['type'] holds the model name. The objective function picks the classifier from it, trains the pipeline, computes the metrics on the test set, and returns the negative F1 score as the loss.

In [0]:
def objective(params):

    with mlflow.start_run(nested = True):
        classifier_type = params['type']
        del params['type']
        
        if classifier_type == 'svm':
            clf = SVC(**params)
        elif classifier_type == 'rf':
            clf = RandomForestClassifier(**params)
        elif classifier_type == 'logreg':
            clf = LogisticRegression(**params)
        elif classifier_type == 'dtc':
            clf = DecisionTreeClassifier(**params)
        else:
            return 0
        
        pipeline = Pipeline(steps = [('preprocessor', preprocessor), ('model', clf)])
        pipeline.fit(X_train, y_train)
        
        predictions =  pipeline.predict(X_test) 
        accuracy = accuracy_score(y_test, predictions)
        precisionscore = precision_score(y_test, predictions)
        recallscore = recall_score(y_test, predictions)
        f1score = f1_score(y_test, predictions)
        
        mlflow.log_metric('F1score',  f1score)
        mlflow.log_metric('Recall_score',  recallscore)
        mlflow.log_metric('Precision_score',  precisionscore)
        mlflow.log_metric('Accuracy_score',  accuracy)
        
        mlflow.sklearn.log_model(clf, 'clf_hpo')

        return {'loss': -f1score, 'status': STATUS_OK}
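The objective above removes 'type' from params with del, which changes the dictionary it was given. A possible alternative, sketched here in plain Python, separates the model name from the keyword arguments without touching the input. split_type is a hypothetical helper, and the sample values are taken from the final space_eval output of this notebook.

```python
def split_type(params):
    """Return (classifier_type, kwargs) without mutating the input dict."""
    kwargs = {key: value for key, value in params.items() if key != 'type'}
    return params['type'], kwargs

sampled = {'type': 'rf', 'n_estimators': 300, 'max_depth': 20, 'criterion': 'gini'}
classifier_type, kwargs = split_type(sampled)

print(classifier_type)  # model name used to choose the classifier class
print(kwargs)           # remaining entries, ready to pass as **kwargs
print(sampled)          # the original dict still contains 'type'
```

Keeping the sampled dictionary intact can help if the parameters are logged or inspected after the model is built.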

The search space is defined for multiple models, using hp.choice to select among them.

In [0]:
search_space = hp.choice('classifier_type', [
    {
        'type': 'svm',
        'C': hp.lognormal('SVM_C', 0, 1.0),
        'kernel': hp.choice('kernel', ['linear', 'rbf'])
    },
    {
        'type': 'rf',
        'n_estimators':scope.int(hp.quniform('n_estimators', 100, 500, 50)),
        'max_depth': scope.int(hp.quniform('max_depth_rf', 2, 20 , 1)),
        'criterion': hp.choice('criterion_rf', ['gini', 'entropy'])
    },
    {
        'type': 'logreg',
        'C': hp.lognormal('LR_C', 0, 1.0),
        'solver': hp.choice('solver', ['liblinear', 'lbfgs'])
    },
  
    {
        'type': 'dtc',
        'max_features':scope.int(hp.quniform('max_features', 1,10,1)),
        'max_depth': scope.int(hp.quniform('max_depth_dtc', 2, 20, 1)),
        'criterion': hp.choice('criterion_dtc', ['gini', 'entropy'])
    }
    
    
])

The same steps are repeated as in the single-model case. This time the best accuracy obtained is around 86%, with a random forest model and hyperparameters {'criterion': 'gini', 'max_depth': 11, 'n_estimators': 100, 'type': 'rf'}.

In [0]:
algo = tpe.suggest

spark_trials = SparkTrials()

with mlflow.start_run():
    best_result = fmin(
        fn = objective, 
        space = search_space,
        algo = algo,
        max_evals = 32,
        trials = spark_trials)

Because the requested parallelism was None or a non-positive value, parallelism will be set to (4), which is Spark's default parallelism (4), or 1, whichever is greater. We recommend setting parallelism explicitly to a positive value because the total of Spark task slots is subject to cluster sizing.
Hyperopt with SparkTrials will automatically track trials in MLflow. To view the MLflow experiment associated with the notebook, click the 'Runs' icon in the notebook context bar on the upper right. There, you can view all runs.
To view logs from trials, please check the Spark executor logs. To view executor logs, expand 'Spark Jobs' above until you see the (i) icon next to the stage from the trial job. Click it and find the list of tasks. Click the 'stderr' link for a task to view trial logs.
  0%|          | 0/32 [00:00<?, ?trial/s, best loss=?]  6%|▋         | 2/32 [00:09<02:17,  4.59s/trial, best loss: -0.5202821869488536]  9%|▉         | 3/32 [00:11<01:43,  3.58s/trial, best loss: 

#### TODO Recording

- After running the code above, stay on this page and watch as more runs are added to the same experiment
- Click on "experiment" and that will open up the Experiment on a new tab
- There should be 33 runs in the Experiment
- Sort by F1 score and find the run with the highest F1 score
- Click on that and expand the "Parameters" and "Metrics" tab and show

fmin() returns hp.choice parameters as indices, so we use hyperopt.space_eval to translate the result of the hyperparameter search back into actual values.

In [0]:
import hyperopt

print(hyperopt.space_eval(search_space, best_result))

{'criterion': 'gini', 'max_depth': 20, 'n_estimators': 300, 'type': 'rf'}
