# Assignment WE04-Universal Bank

Universal Bank recently trialed a marketing campaign to sell its new Securities Account product to existing customers. The bank contacted 5000 customers who did not yet hold a Securities Account and made them an offer. The data provided in universal.csv is the result of this market test. 

Use the techniques covered in this class to load and clean the data. Then identify the best predictive model, using only the models covered thus far: Logistic Regression, SVM (with various kernels), and Decision Trees. Your target variable is Securities Account. Your scoring measure is precision. Use RandomizedSearchCV combined with GridSearchCV to identify the best parameters for each model tested.

Be sure to document your thought process using markdown. Think of this as a report that your manager will read. This assignment requires you to decide how best to process the provided data (e.g., encoding). Be sure to provide your arguments and observations in markdown as you progress through data preparation, fitting, and performance evaluation.


    Id: Customer ID
    Age: Customer's age in completed years  
    Experience: Number of years of professional experience  
    Income: Annual income of the customer ($000s)  
    Family Size: Family size of the customer  
    CCAvg: Average spending on credit cards per month ($000s)  
    Education: Education Level. 1: Undergrad; 2: Graduate; 3: Advanced/Professional  
    Mortgage: Value of house mortgage if any ($000s)  
    Personal Loan: (1 if customer has personal loan with bank, 0 otherwise)  
    Securities Account: (1 if customer has securities account with bank, 0 otherwise)  
    CD Account: (1 if customer has certificate of deposit (CD) account with bank, 0 otherwise)  
    Online Banking: (1 if customer uses Internet banking facilities, 0 otherwise)  
    Credit Card: (1 if customer uses credit card issued by Universal Bank, 0 otherwise) 

## 1. Setup

In [1]:
import pandas as pd
from matplotlib import pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier 
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

np.random.seed(1)

## 2. Load data

In [2]:
X_train = pd.read_csv('./data/ubank_train_X.csv') 
y_train = pd.read_csv('./data/ubank_train_y.csv') 
X_test = pd.read_csv('./data/ubank_test_X.csv') 
y_test = pd.read_csv('./data/ubank_test_y.csv') 
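
The `y = column_or_1d(y, warn=True)` lines that appear in the outputs further down are what scikit-learn prints when the target is passed as a one-column DataFrame rather than a 1-D array. A minimal sketch of one way to flatten the target after loading is shown here. The CSV text and its column name are hypothetical stand-ins, since the real files are not reproduced in this report.

```python
import io

import pandas as pd

# Hypothetical stand-in for ubank_train_y.csv: one header row, one label column.
csv_text = "Securities Account\n0\n1\n0\n0\n"

y_frame = pd.read_csv(io.StringIO(csv_text))   # DataFrame with a single column
y_flat = y_frame.squeeze(axis="columns")       # Series, i.e. one-dimensional

print(y_frame.shape, y_flat.shape)
```

Applying the same `squeeze` call to `y_train` and `y_test` should leave the fitted models unchanged and only silence the warning.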

## 3. Models

Set up the performance dataframe to store the scores of each model.

In [3]:
performance = pd.DataFrame({"model": [], "Accuracy": [], "Precision": [], "Recall": [], "F1": []})

In [4]:
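def performance_metrics(test_y, model_preds, performance, model_name):
    """Append one row of test-set metrics for `model_name` to `performance`."""
    # confusion_matrix puts true labels on rows and predicted labels on columns,
    # so for a binary 0/1 target the layout is [[TN, FP], [FN, TP]].
    c_matrix = confusion_matrix(test_y, model_preds)
    TP = c_matrix[1][1]
    TN = c_matrix[0][0]
    FP = c_matrix[0][1]
    FN = c_matrix[1][0]
    # Every appended row keeps index 0, which is why the tables below show 0 in each row.
    performance = pd.concat([performance, pd.DataFrame({'model': str(model_name), 
                                                        'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                        'Precision': [TP/(TP+FP)], 
                                                        'Recall': [TP/(TP+FN)], 
                                                        'F1': [2*TP/(2*TP+FP+FN)]
                                                        }, index=[0])])
    return performance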
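
To make the formulas in `performance_metrics` concrete, here is a small worked example in plain Python. The counts are hypothetical and chosen only for easy arithmetic; they are not taken from the bank data. The helper name `metrics_from_counts` is also an assumption for illustration, and unlike the function above it guards against a zero denominator, which occurs when a model predicts no positives at all.

```python
def metrics_from_counts(tp, tn, fp, fn):
    """Return accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + tn + fp + fn
    predicted_pos = tp + fp
    actual_pos = tp + fn
    f1_denominator = 2 * tp + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        # Precision is undefined when nothing is predicted positive; report 0.0.
        "precision": tp / predicted_pos if predicted_pos else 0.0,
        "recall": tp / actual_pos if actual_pos else 0.0,
        "f1": 2 * tp / f1_denominator if f1_denominator else 0.0,
    }


# Hypothetical counts for 1000 customers.
example = metrics_from_counts(tp=30, tn=890, fp=10, fn=70)

# A model that never predicts the positive class.
no_positives = metrics_from_counts(tp=0, tn=900, fp=0, fn=100)

print(example)
print(no_positives)
```

With these counts, precision is 30 / 40 = 0.75 while recall is only 30 / 100 = 0.30, which mirrors the high-precision, low-recall pattern seen in the results later in this report.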

### 3.1 Logistic regression

Random search and grid search for parameters `penalty`, `C`, and `max_iter`.


In [5]:
score_measure = "precision"
kfolds = 5

param_grid = {
    'max_iter': np.arange(400,1000),
    'C': np.arange(1,20),
    'penalty': [None, 'l1', 'l2', 'elasticnet']
}

logiR = LogisticRegression(solver='saga')
rand_search = RandomizedSearchCV(estimator = logiR, param_distributions=param_grid, cv=kfolds, n_iter=500,
                           scoring=score_measure, verbose=1, n_jobs=-1,  # n_jobs=-1 will utilize all available CPUs 
                           return_train_score=True)

_ = rand_search.fit(X_train, y_train)

print(f"The best {score_measure} score is {rand_search.best_score_}")
print(f"... with parameters: {rand_search.best_params_}")

bestprecision = rand_search.best_estimator_

Fitting 5 folds for each of 500 candidates, totalling 2500 fits


  y = column_or_1d(y, warn=True)

The best precision score is 0.6352055906048646
... with parameters: {'penalty': 'l2', 'max_iter': 807, 'C': 1}


660 fits failed out of a total of 2500.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
660 fits failed with the following error:
Traceback (most recent call last):
  File "/Users/80025078/opt/anaconda3/envs/DataScience/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 686, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/80025078/opt/anaconda3/envs/DataScience/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py", line 1291, in fit
    fold_coefs_ = Parallel(n_jobs=self.n_jobs, verbose=self.verbose, prefer=prefer)(
  File "/Users/80025078/opt/anaconda3/envs/DataScience/lib/python3.9/site-packages/sklearn/utils/parallel.py", line 63, in __
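
The traceback above is cut off before the final error message, so the cause of the 660 failed fits is not confirmed by the output. One plausible explanation, stated here as an assumption, is that the `'elasticnet'` penalty was sampled without an `l1_ratio`, which `LogisticRegression` requires for that penalty. About a quarter of the candidates would draw `'elasticnet'` from a four-item list, which is roughly consistent with 660 of 2500 fits failing. The sketch below uses synthetic data and illustrative `l1_ratio` values, and passes a list of dictionaries so that `l1_ratio` is only paired with `'elasticnet'`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the bank data), scaled so saga converges quickly.
X, y = make_classification(n_samples=300, n_features=8, random_state=1)
X = StandardScaler().fit_transform(X)

# A list of dicts keeps l1_ratio paired with the elasticnet penalty only.
param_distributions = [
    {"penalty": ["l1", "l2"], "C": np.arange(1, 20)},
    {"penalty": ["elasticnet"], "C": np.arange(1, 20), "l1_ratio": [0.25, 0.5, 0.75]},
]

search = RandomizedSearchCV(
    LogisticRegression(solver="saga", max_iter=2000),
    param_distributions=param_distributions,
    n_iter=10, cv=3, scoring="precision", random_state=1,
)
search.fit(X, y)

# Failed fits are scored as NaN, so counting NaN means counts failed candidates.
failed = int(np.isnan(search.cv_results_["mean_test_score"]).sum())
print(f"candidates with failed fits: {failed}")
```

Setting `error_score='raise'`, as the scikit-learn message suggests, would surface the actual exception and confirm or rule out this explanation.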

In [6]:
#chosen parameters after random search
param_grid = {
    'max_iter': np.arange(790,810),
    'C': np.arange(1,3),
    'penalty': ['l2']
}

grid_search = GridSearchCV(estimator = logiR, param_grid=param_grid, cv=kfolds, 
                           scoring=score_measure, verbose=1, n_jobs=-1,  # n_jobs=-1 will utilize all available CPUs 
                           return_train_score=True)

_ = grid_search.fit(X_train, y_train)

print(f"The best {score_measure} score is {grid_search.best_score_}")
print(f"... with parameters: {grid_search.best_params_}")

bestprecision = grid_search.best_estimator_

Fitting 5 folds for each of 40 candidates, totalling 200 fits


  y = column_or_1d(y, warn=True)

The best precision score is 0.6352055906048646
... with parameters: {'C': 1, 'max_iter': 790, 'penalty': 'l2'}




In [7]:
performance = performance_metrics(y_test, grid_search.predict(X_test), performance, "logistic l2")
performance

| | model | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| 0 | logistic l2 | 0.911111 | 0.6 | 0.268966 | 0.371429 |


### 3.2 SVM with kernels

#### 3.2.1 SVM with linear kernel

Random search and grid search for parameters `C` and `max_iter`.

In [8]:
score_measure = "precision"
kfolds = 5

param_grid = {
    'max_iter': np.arange(400,1000),
    'C': np.arange(1,20)}

linSVM = SVC(kernel="linear")
rand_search = RandomizedSearchCV(estimator = linSVM, param_distributions=param_grid, cv=kfolds, n_iter=500,
                           scoring=score_measure, verbose=1, n_jobs=-1,  # n_jobs=-1 will utilize all available CPUs 
                           return_train_score=True)

_ = rand_search.fit(X_train, y_train)

print(f"The best {score_measure} score is {rand_search.best_score_}")
print(f"... with parameters: {rand_search.best_params_}")

bestprecision = rand_search.best_estimator_

Fitting 5 folds for each of 500 candidates, totalling 2500 fits


  y = column_or_1d(y, warn=True)

The best precision score is 0.6446397152301043
... with parameters: {'max_iter': 783, 'C': 1}




In [10]:
#chosen parameters after random search
param_grid = {
    'max_iter': np.arange(700,800),
    'C': np.arange(1,3)
}

grid_search = GridSearchCV(estimator = linSVM, param_grid=param_grid, cv=kfolds, 
                           scoring=score_measure, verbose=1, n_jobs=-1,  # n_jobs=-1 will utilize all available CPUs 
                           return_train_score=True)

_ = grid_search.fit(X_train, y_train)

print(f"The best {score_measure} score is {grid_search.best_score_}")
print(f"... with parameters: {grid_search.best_params_}")

bestprecision = grid_search.best_estimator_

Fitting 5 folds for each of 200 candidates, totalling 1000 fits


  y = column_or_1d(y, warn=True)

The best precision score is 0.6581704260651631
... with parameters: {'C': 1, 'max_iter': 744}




In [11]:
performance = performance_metrics(y_test, grid_search.predict(X_test), performance, "linear SVM")
performance

| | model | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| 0 | logistic l2 | 0.911111 | 0.6 | 0.268966 | 0.371429 |
| 0 | linear SVM | 0.884848 | 0.358696 | 0.227586 | 0.278481 |


#### 3.2.2 SVM with rbf kernel

Random search and grid search for parameters `C`, `gamma` and `max_iter`.

In [12]:
score_measure = "precision"
kfolds = 5

param_grid = { 
            'max_iter': np.arange(400,800),
            'C': np.arange(1,20),
            'gamma': np.arange(0, 5, 0.1)
}

rbfSVM = SVC(kernel="rbf")
rand_search = RandomizedSearchCV(estimator = rbfSVM, param_distributions=param_grid, cv=kfolds, n_iter=500,
                           scoring=score_measure, verbose=1, n_jobs=-1,  # n_jobs=-1 will utilize all available CPUs 
                           return_train_score=True)

_ = rand_search.fit(X_train, y_train)

print(f"The best {score_measure} score is {rand_search.best_score_}")
print(f"... with parameters: {rand_search.best_params_}")

bestprecision = rand_search.best_estimator_

Fitting 5 folds for each of 500 candidates, totalling 2500 fits


  y = column_or_1d(y, warn=True)
  _warn_prf(average, modifier, msg_start, len(result))

The best precision score is 0.735
... with parameters: {'max_iter': 775, 'gamma': 0.30000000000000004, 'C': 1}
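
The `_warn_prf` line in the output above comes from scikit-learn's precision calculation. It is emitted when precision is undefined, which happens if a model predicts no positive cases in a validation fold, so that TP + FP = 0. The sketch below reproduces the situation on a tiny hand-made example (not the bank data) and shows the `zero_division` argument of `precision_score`, which makes the fallback value explicit.

```python
import warnings

from sklearn.metrics import precision_score

y_true = [0, 0, 1, 1]
y_pred = [0, 0, 0, 0]   # the model never predicts the positive class

# Default behaviour: a fallback value is returned and a warning is emitted.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    default_precision = precision_score(y_true, y_pred)

# Explicit choice of the fallback value, which suppresses the warning.
quiet_precision = precision_score(y_true, y_pred, zero_division=0)

print(default_precision, len(caught), quiet_precision)
```

Candidates that trigger this warning are scored as zero precision in those folds, so they tend to rank low in the search rather than distort the best result.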




In [13]:
#chosen parameters after random search
param_grid = {
            'max_iter': np.arange(775,780),
            'C': np.arange(1,3),
            'gamma': np.arange(0.2, 0.4, 0.1)
}

grid_search = GridSearchCV(estimator = rbfSVM, param_grid=param_grid, cv=kfolds, 
                           scoring=score_measure, verbose=1, n_jobs=-1,  # n_jobs=-1 will utilize all available CPUs 
                           return_train_score=True)

_ = grid_search.fit(X_train, y_train)

print(f"The best {score_measure} score is {grid_search.best_score_}")
print(f"... with parameters: {grid_search.best_params_}")

bestprecision = grid_search.best_estimator_

Fitting 5 folds for each of 20 candidates, totalling 100 fits


  y = column_or_1d(y, warn=True)

The best precision score is 0.7403174603174603
... with parameters: {'C': 1, 'gamma': 0.2, 'max_iter': 776}




In [14]:
performance = performance_metrics(y_test, grid_search.predict(X_test), performance, "rbf SVM")
performance

| | model | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| 0 | logistic l2 | 0.911111 | 0.6 | 0.268966 | 0.371429 |
| 0 | linear SVM | 0.884848 | 0.358696 | 0.227586 | 0.278481 |
| 0 | rbf SVM | 0.909764 | 0.73913 | 0.117241 | 0.202381 |


#### 3.2.3 SVM with polynomial kernel

Random search and grid search for parameters `C`, `gamma`, `coef0`, `degree` and `max_iter`.

In [15]:
score_measure = "precision"
kfolds = 5

param_grid = { 
            'max_iter': np.arange(400,800),
            'C': np.arange(1,10),
            'gamma': np.arange(0, 5, 0.1),
            'coef0': np.arange(1, 100),
            'degree': np.arange(3,10)
}

polySVM = SVC(kernel="poly")
rand_search = RandomizedSearchCV(estimator = polySVM, param_distributions=param_grid, cv=kfolds, n_iter=500,
                           scoring=score_measure, verbose=1, n_jobs=-1,  # n_jobs=-1 will utilize all available CPUs 
                           return_train_score=True)

_ = rand_search.fit(X_train, y_train)

print(f"The best {score_measure} score is {rand_search.best_score_}")
print(f"... with parameters: {rand_search.best_params_}")

bestprecision = rand_search.best_estimator_

Fitting 5 folds for each of 500 candidates, totalling 2500 fits


  y = column_or_1d(y, warn=True)

The best precision score is 0.13572594771391067
... with parameters: {'max_iter': 799, 'gamma': 1.8, 'degree': 3, 'coef0': 61, 'C': 5}




In [16]:
#chosen parameters after random search
param_grid = { 
            'max_iter': np.arange(790,800),
            'C': np.arange(3,5),
            'gamma': np.arange(1.7, 2.3, 0.1),
            'coef0': np.arange(60, 65),
            'degree': np.arange(3,4)
}

grid_search = GridSearchCV(estimator = polySVM, param_grid=param_grid, cv=kfolds, 
                           scoring=score_measure, verbose=1, n_jobs=-1,  # n_jobs=-1 will utilize all available CPUs 
                           return_train_score=True)

_ = grid_search.fit(X_train, y_train)

print(f"The best {score_measure} score is {grid_search.best_score_}")
print(f"... with parameters: {grid_search.best_params_}")

bestprecision = grid_search.best_estimator_

Fitting 5 folds for each of 600 candidates, totalling 3000 fits


  y = column_or_1d(y, warn=True)

The best precision score is 0.13572594771391067
... with parameters: {'C': 3, 'coef0': 61, 'degree': 3, 'gamma': 1.8, 'max_iter': 799}




In [17]:
performance = performance_metrics(y_test, grid_search.predict(X_test), performance, "poly SVM")
performance

| | model | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| 0 | logistic l2 | 0.911111 | 0.6 | 0.268966 | 0.371429 |
| 0 | linear SVM | 0.884848 | 0.358696 | 0.227586 | 0.278481 |
| 0 | rbf SVM | 0.909764 | 0.73913 | 0.117241 | 0.202381 |
| 0 | poly SVM | 0.548148 | 0.118841 | 0.565517 | 0.196407 |


### 3.3 Decision trees

In [18]:
score_measure = "precision"
kfolds = 5

param_grid = {
    'min_samples_split': np.arange(2,100),  # integer values must be at least 2 per the scikit-learn docs
    'min_samples_leaf': np.arange(1,100),
    'min_impurity_decrease': np.arange(0.0001, 0.01, 0.0005),
    'max_leaf_nodes': np.arange(5, 200), 
    'max_depth': np.arange(1,50), 
    'criterion': ['entropy', 'gini'],
}

dtree = DecisionTreeClassifier()
rand_search = RandomizedSearchCV(estimator = dtree, param_distributions=param_grid, cv=kfolds, n_iter=500,
                           scoring=score_measure, verbose=1, n_jobs=-1,  # n_jobs=-1 will utilize all available CPUs 
                           return_train_score=True)

_ = rand_search.fit(X_train, y_train)

print(f"The best {score_measure} score is {rand_search.best_score_}")
print(f"... with parameters: {rand_search.best_params_}")

bestprecisionTree = rand_search.best_estimator_

Fitting 5 folds for each of 500 candidates, totalling 2500 fits


  _warn_prf(average, modifier, msg_start, len(result))

The best precision score is 0.6355921855921856
... with parameters: {'min_samples_split': 33, 'min_samples_leaf': 20, 'min_impurity_decrease': 0.0001, 'max_leaf_nodes': 156, 'max_depth': 8, 'criterion': 'gini'}
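
When narrowing the search for the grid step, it helps to keep the random-search winners inside the new grid. Because `np.arange` excludes its stop value, ranges such as `np.arange(153,156)` and `np.arange(6,8)` leave out 156 and 8, the values the random search actually selected. The sketch below builds a small window around each best value in plain Python. The helper name `window` and the radius of 1 are assumptions for illustration, not part of the original analysis.

```python
def window(best, radius, step=1, lower=None):
    """Return values centred on `best`, clipped at `lower`, always containing `best`."""
    values = [best + k * step for k in range(-radius, radius + 1)]
    if lower is not None:
        values = [v for v in values if v >= lower]
    return values


# Best integer-valued parameters reported by the random search above.
best = {"min_samples_split": 33, "min_samples_leaf": 20,
        "max_leaf_nodes": 156, "max_depth": 8}

# Smallest integer value worth trying for each parameter.
lower_bounds = {"min_samples_split": 2, "min_samples_leaf": 1,
                "max_leaf_nodes": 2, "max_depth": 1}

param_grid = {name: window(value, radius=1, lower=lower_bounds[name])
              for name, value in best.items()}
param_grid["criterion"] = ["gini"]

print(param_grid)
```

A grid built this way can never score worse in cross-validation than the random-search winner, because that exact combination is one of the candidates (up to the randomness of the tree itself).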


In [19]:

param_grid = {
    'min_samples_split': np.arange(33,35),  
    'min_samples_leaf': np.arange(20,22),
    'min_impurity_decrease': np.arange(0.0001, 0.0006, 0.0001),
    'max_leaf_nodes': np.arange(153,156), 
    'max_depth': np.arange(6,8), 
    'criterion': ['gini'],
}

dtree = DecisionTreeClassifier()
grid_search = GridSearchCV(estimator = dtree, param_grid=param_grid, cv=kfolds, 
                           scoring=score_measure, verbose=1, n_jobs=-1,  # n_jobs=-1 will utilize all available CPUs 
                           return_train_score=True)

_ = grid_search.fit(X_train, y_train)

print(f"The best {score_measure} score is {grid_search.best_score_}")
print(f"... with parameters: {grid_search.best_params_}")

bestprecisionTree = grid_search.best_estimator_

Fitting 5 folds for each of 120 candidates, totalling 600 fits
The best precision score is 0.6404014939309057
... with parameters: {'criterion': 'gini', 'max_depth': 6, 'max_leaf_nodes': 153, 'min_impurity_decrease': 0.0005, 'min_samples_leaf': 20, 'min_samples_split': 33}


In [20]:
performance = performance_metrics(y_test, grid_search.predict(X_test), performance, "decision tree")
performance

| | model | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| 0 | logistic l2 | 0.911111 | 0.6 | 0.268966 | 0.371429 |
| 0 | linear SVM | 0.884848 | 0.358696 | 0.227586 | 0.278481 |
| 0 | rbf SVM | 0.909764 | 0.73913 | 0.117241 | 0.202381 |
| 0 | poly SVM | 0.548148 | 0.118841 | 0.565517 | 0.196407 |
| 0 | decision tree | 0.913131 | 0.735294 | 0.172414 | 0.27933 |


## 4. Summary

In [21]:
performance.sort_values(by=['Precision'])

| | model | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| 0 | poly SVM | 0.548148 | 0.118841 | 0.565517 | 0.196407 |
| 0 | linear SVM | 0.884848 | 0.358696 | 0.227586 | 0.278481 |
| 0 | logistic l2 | 0.911111 | 0.6 | 0.268966 | 0.371429 |
| 0 | decision tree | 0.913131 | 0.735294 | 0.172414 | 0.27933 |
| 0 | rbf SVM | 0.909764 | 0.73913 | 0.117241 | 0.202381 |


Each model was tuned with a random search followed by a grid search around the best random-search parameters. Ranked by test-set precision alone, the two best classifiers were:
1) SVM with rbf kernel with precision score 0.739130 and parameters: 
{'C': 1, 'gamma': 0.2, 'max_iter': 776}
2) Decision tree with precision score 0.735294 and parameters: 
{'criterion': 'gini', 'max_depth': 6, 'max_leaf_nodes': 153, 'min_impurity_decrease': 0.0005, 'min_samples_leaf': 20, 'min_samples_split': 33}

Both models reach this precision at low recall (0.117241 and 0.172414 respectively), so they flag only a small share of the customers who actually hold the product.
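
For a report, the ranking is easier to read with the best model first and a clean row index. The sketch below rebuilds the performance table from the test-set scores reported above and sorts it in descending order of precision. Using `ignore_index=True` is a presentation choice suggested here, not something the original cells do.

```python
import pandas as pd

# Test-set scores copied from the performance table in this report.
performance = pd.DataFrame({
    "model": ["logistic l2", "linear SVM", "rbf SVM", "poly SVM", "decision tree"],
    "Accuracy": [0.911111, 0.884848, 0.909764, 0.548148, 0.913131],
    "Precision": [0.6, 0.358696, 0.73913, 0.118841, 0.735294],
    "Recall": [0.268966, 0.227586, 0.117241, 0.565517, 0.172414],
    "F1": [0.371429, 0.278481, 0.202381, 0.196407, 0.27933],
})

# Best precision first; ignore_index renumbers the rows 0..4 after sorting.
ranked = performance.sort_values(by="Precision", ascending=False, ignore_index=True)

print(ranked[["model", "Precision", "Recall"]])
```

Showing recall next to precision makes the trade-off visible: the two most precise models are also the two with the lowest recall.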