## Agenda

- Decision Tree 
- Pruning/Tuning a Decision Tree model
- GridSearchCV and RandomizedSearchCV
- Random Forest
- Tuning Random Forest
- Gradient Boosting Methods 
- Tuning GBMs

### Margin Separation
- Support Vector Machines
- Tuning SVMs

## Problem statement:

ABC Bank has provided a dataset of customer details in the file `BankAttrition - Details.csv`. Transaction information and the type of credit card each customer holds are provided in a second file, `Transaction and Card Details.csv`. The bank is currently facing customer attrition and has asked us to identify the patterns behind it and to find early signals that would help it stop losing customers.

So far: merged the data, performed exploratory data analysis, built KNN and Logistic Regression models, and covered validation strategies and model improvement strategies.

In [None]:
import pandas as pd
import numpy as np

# read input files
details = pd.read_csv("Datasets/BankAttrition - Details.csv")
transaction = pd.read_csv("Datasets/Transaction and Card Details.csv")

details.shape, transaction.shape

((10127, 8), (10127, 14))

In [None]:
# merge to create ADS
ads = pd.merge(details, transaction, how = 'outer', on = ['CLIENTNUM'])
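
An outer merge keeps clients that appear in only one file, and those rows would carry missing values into the models. The sketch below shows one way to check this. It uses two small made-up frames (`details_demo` and `transaction_demo` are hypothetical stand-ins, not the bank files) together with the `validate` and `indicator` options of `pd.merge`.

```python
import pandas as pd

# Toy stand-ins for the two input files (hypothetical values)
details_demo = pd.DataFrame({"CLIENTNUM": [1, 2, 3], "Customer_Age": [45, 52, 38]})
transaction_demo = pd.DataFrame({"CLIENTNUM": [1, 2, 4], "Credit_Limit": [12691.0, 8256.0, 3418.0]})

# validate raises MergeError if CLIENTNUM is duplicated on either side;
# indicator adds a _merge column recording where each row came from
merged_demo = pd.merge(
    details_demo, transaction_demo,
    how="outer", on="CLIENTNUM",
    validate="one_to_one", indicator=True,
)

# Rows found in only one file have NaN in the other file's columns
unmatched = merged_demo[merged_demo["_merge"] != "both"]
print(merged_demo["_merge"].value_counts())
print(unmatched["CLIENTNUM"].tolist())
```

If `unmatched` is empty on the real data, the outer merge behaves like an inner merge and no missing values are introduced by the join.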

In [None]:
## consider Unknown as a separate category

# typecasting variables
ads['Gender'] = ads['Gender'].astype('category')
ads['Education_Level'] = ads['Education_Level'].astype('category')
ads['Marital_Status'] = ads['Marital_Status'].astype('category')
ads['Income_Category'] = ads['Income_Category'].astype('category')
ads['Card_Category'] = ads['Card_Category'].astype('category')



# encoding target to - 0, 1
ads['Attrition_Flag'] = ads['Attrition_Flag'].map({'Existing Customer':0,'Attrited Customer':1})

In [None]:
# drop ClientNum as it is just the identifier
ads.drop(["CLIENTNUM"], axis = 1, inplace = True)

In [None]:
# One hot encoding the categories
categorical_vars = ads.select_dtypes(exclude = ['int64', 'Int64', 'float64']).columns
ads = pd.get_dummies(ads, columns = categorical_vars)

ads['Attrition_Flag'] = ads['Attrition_Flag'].astype('category')

In [None]:
## Feature engineering - log transformation (Credit Limit, Total Revolving Balance, Total Transaction Amount)
ads['Credit_Limit'] = np.log(ads['Credit_Limit'])
ads['Total_Revolving_Bal'] = np.log(ads['Total_Revolving_Bal'] + 0.01)
ads['Total_Trans_Amt'] = np.log(ads['Total_Trans_Amt'] + 0.01)

## Feature engineering - Customer Age bins
bins = [0, 18, 30, 50, 70, 110]
ads['binned_age'] = pd.cut(ads['Customer_Age'], bins)
ads = pd.get_dummies(ads, columns = ['binned_age'])
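
The `+ 0.01` offset in the log transforms above exists because `log(0)` is undefined. A possible alternative, shown here only as a sketch on made-up balances, is `np.log1p`, which computes `log(1 + x)` and maps zero to zero instead of to a large negative value.

```python
import numpy as np
import pandas as pd

# Hypothetical balances, including a zero that a plain log cannot handle
balance = pd.Series([0.0, 777.0, 2517.0])

shifted = np.log(balance + 0.01)   # the offset approach used above
log1p = np.log1p(balance)          # log(1 + x), no constant to choose

print(pd.DataFrame({"balance": balance, "log(x + 0.01)": shifted, "log1p(x)": log1p}))
```

With the 0.01 offset, a zero balance lands near -4.6 while the other values sit above 6, so zeros form a separate cluster. Whether that separation helps or hurts depends on the model, so it is worth trying both.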

In [None]:
# separating independent and dependent variables
x = ads.drop(['Attrition_Flag'], axis=1)
y = ads['Attrition_Flag']
x.shape, y.shape

((10127, 42), (10127,))

In [None]:
# importing the train test split function
from sklearn.model_selection import train_test_split
train_x, test_x, train_y, test_y = train_test_split(x, y, random_state = 111, stratify = y, test_size = 0.25)

In [None]:
# check proportions of target variables
train_y.value_counts(normalize = True), test_y.value_counts(normalize = True)

(0    0.839368
 1    0.160632
 Name: Attrition_Flag, dtype: float64,
 0    0.839258
 1    0.160742
 Name: Attrition_Flag, dtype: float64)

In [None]:
# import scalers
from sklearn.preprocessing import StandardScaler, MinMaxScaler
scaler = StandardScaler()

train_x = scaler.fit_transform(train_x)
test_x = scaler.transform(test_x)
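
The scaler above is fit once on the full training set, so any later cross-validation on `train_x` uses scaling statistics that already saw the validation folds. A minimal sketch of one way to avoid this is to put the scaler inside a pipeline. The data here is synthetic (`make_classification`), not the bank dataset, and `X_demo`, `y_demo` and `pipe` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the bank data (not the real dataset)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.84, 0.16], random_state=111
)

# The scaler is re-fit on the training part of every fold, so the
# validation part never influences the scaling statistics
pipe = make_pipeline(StandardScaler(), DecisionTreeClassifier(random_state=10))
scores = cross_val_score(pipe, X_demo, y_demo, cv=3, scoring="recall")
print(scores)
```

The same pipeline object can be passed to `GridSearchCV` or `RandomizedSearchCV`; parameter names then take the step prefix, for example `decisiontreeclassifier__max_depth`.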

In [None]:
## Basic Decision Tree Model
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import recall_score

# model instance
dt_model = DecisionTreeClassifier(random_state=10)

# fitting the model
dt_model.fit(train_x, train_y)

# make training prediction
train_yhat = dt_model.predict(train_x)
train_score = recall_score(train_y, train_yhat)

# make test prediction
test_yhat = dt_model.predict(test_x)
test_score = recall_score(test_y, test_yhat)

train_score, test_score

(1.0, 0.8157248157248157)

In [None]:
## Tuning depth of the tree
train_score = []
test_score = []

for depth in range(1,20):
    dt_model = DecisionTreeClassifier(max_depth=depth, random_state=10)
    dt_model.fit(train_x, train_y)
    train_yhat = dt_model.predict(train_x)
    train_score.append(recall_score(train_y, train_yhat))
    test_yhat = dt_model.predict(test_x)
    test_score.append(recall_score(test_y, test_yhat))

In [None]:
frame = pd.DataFrame({'max_depth':range(1,20), 'train_score':train_score, 'test_score':test_score})
frame

In [None]:
import matplotlib.pyplot as plt

plt.figure(figsize=(12,6))
plt.plot(frame['max_depth'], frame['train_score'], marker='o', label='train')
plt.plot(frame['max_depth'], frame['test_score'], marker='o', label='test')
plt.xlabel('Depth of tree')
plt.ylabel('Recall')
plt.legend()
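
The depth loop above scores each candidate on the test set, which means the chosen depth has already seen the data used for the final evaluation. A hedged alternative is to pick the depth by cross-validation on the training data only. The sketch below does this with `validation_curve` on synthetic data; `X_demo`, `y_demo` and `best_depth` are illustrative names.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the training data
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.84, 0.16], random_state=111
)

depths = np.arange(1, 20)

# One row per depth, one column per cross-validation fold
train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=10), X_demo, y_demo,
    param_name="max_depth", param_range=depths, cv=3, scoring="recall",
)

# Pick the depth with the best mean validation recall
mean_val = val_scores.mean(axis=1)
best_depth = int(depths[mean_val.argmax()])
print(best_depth, mean_val.max())
```

The test set is then used once, after the depth is fixed, to report the final score.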

In [None]:
dt_model = DecisionTreeClassifier(max_depth=11, random_state=10, class_weight = 'balanced')
# fitting the model
dt_model.fit(train_x, train_y)

# make training prediction
train_yhat = dt_model.predict(train_x)
train_score = recall_score(train_y, train_yhat)

# make test prediction
test_yhat = dt_model.predict(test_x)
test_score = recall_score(test_y, test_yhat)

train_score, test_score

In [None]:
## Hyperparameter search (RandomizedSearchCV samples n_iter combinations from the grid below)

from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
params = {'max_depth': [5, 10, 11, 15], 'max_leaf_nodes': list(range(2, 100, 5)), 'min_samples_split': list(range(2, 100, 5)), 'criterion': ['gini', 'entropy'], 'ccp_alpha': [0, 0.001, 0.01, 0.1, 1]}


grid_search_cv = RandomizedSearchCV(DecisionTreeClassifier(random_state=10, class_weight = 'balanced'), params, n_iter = 500, verbose=1, cv=3, scoring = 'f1')
grid_search_cv.fit(train_x, train_y)

Fitting 3 folds for each of 500 candidates, totalling 1500 fits


RandomizedSearchCV(cv=3,
                   estimator=DecisionTreeClassifier(class_weight='balanced',
                                                    random_state=10),
                   n_iter=500,
                   param_distributions={'ccp_alpha': [0, 0.001, 0.01, 0.1, 1],
                                        'criterion': ['gini', 'entropy'],
                                        'max_depth': [5, 10, 11, 15],
                                        'max_leaf_nodes': [2, 7, 12, 17, 22, 27,
                                                           32, 37, 42, 47, 52,
                                                           57, 62, 67, 72, 77,
                                                           82, 87, 92, 97],
                                        'min_samples_split': [2, 7, 12, 17, 22,
                                                              27, 32, 37, 42,
                                                              47, 52, 57, 62,
                       

In [None]:
grid_search_cv.best_estimator_

DecisionTreeClassifier(ccp_alpha=0.001, class_weight='balanced',
                       criterion='entropy', max_depth=10, max_leaf_nodes=77,
                       min_samples_split=7, random_state=10)
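
The search above evaluates 500 sampled combinations rather than every combination in `params`. To see how much work that saves, the sketch below counts the full grid with `ParameterGrid` and compares the number of model fits. The grid is copied from the cell above.

```python
from sklearn.model_selection import ParameterGrid

params = {
    "max_depth": [5, 10, 11, 15],
    "max_leaf_nodes": list(range(2, 100, 5)),
    "min_samples_split": list(range(2, 100, 5)),
    "criterion": ["gini", "entropy"],
    "ccp_alpha": [0, 0.001, 0.01, 0.1, 1],
}

# Size of the exhaustive grid: 4 * 20 * 20 * 2 * 5 combinations
n_candidates = len(ParameterGrid(params))
n_iter, cv = 500, 3

print("exhaustive grid:", n_candidates, "candidates,", n_candidates * cv, "fits")
print("randomized search:", n_iter, "candidates,", n_iter * cv, "fits")
```

An exhaustive `GridSearchCV` over this grid would need 48,000 fits with 3 folds, against the 1,500 reported in the log above.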

In [None]:
dt_model = DecisionTreeClassifier(ccp_alpha=0.001, class_weight='balanced',
                       criterion='entropy', max_depth=10, max_leaf_nodes=77,
                       min_samples_split=7, random_state=10)

from sklearn.metrics import f1_score
# fitting the model
dt_model.fit(train_x, train_y)

# make training prediction
train_yhat = dt_model.predict(train_x)
train_score = f1_score(train_y, train_yhat)

# make test prediction
test_yhat = dt_model.predict(test_x)
test_score = f1_score(test_y, test_yhat)

train_score, test_score

(0.8587937883712532, 0.7904967602591793)
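
The cells above switch from recall to F1 as the metric. F1 is the harmonic mean of precision and recall, so it penalizes a model that reaches high recall by flagging too many existing customers as attrited. A small worked example on made-up labels:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical labels: 1 = attrited, 0 = existing
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

# TP = 3, FN = 1, FP = 2
precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 3 / 5
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 3 / 4
f1 = f1_score(y_true, y_pred)                # 2 * P * R / (P + R) = 2 / 3

print(precision, recall, f1)
```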

In [None]:
from sklearn.model_selection import cross_val_score

dt_model = DecisionTreeClassifier(ccp_alpha=1, class_weight='balanced',
                       criterion='entropy', max_depth=10, random_state=10)

cross_val_score(dt_model, train_x, train_y, cv=3, scoring = 'recall')
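
The cross-validation cell above fixes `ccp_alpha=1`. Because the entropy of a binary target is at most 1, an alpha that large is likely to prune the tree down to its root. Rather than guessing values, scikit-learn can list the alphas at which the tree actually changes. The sketch below uses `cost_complexity_pruning_path` on synthetic data; `X_demo`, `y_demo`, `full` and `pruned` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the training data
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.84, 0.16], random_state=111
)

settings = dict(random_state=10, class_weight="balanced", criterion="entropy")

# Effective alphas at which successive subtrees get pruned away
path = DecisionTreeClassifier(**settings).cost_complexity_pruning_path(X_demo, y_demo)
alphas = path.ccp_alphas
print(len(alphas), alphas.max())

# Compare the unpruned tree with one pruned at the largest alpha on the path
full = DecisionTreeClassifier(**settings).fit(X_demo, y_demo)
pruned = DecisionTreeClassifier(ccp_alpha=alphas[-1], **settings).fit(X_demo, y_demo)
print(full.get_n_leaves(), pruned.get_n_leaves())
```

Useful candidate values for a search lie between 0 and the largest alpha on the path, so a grid could be drawn from `alphas` instead of from fixed powers of ten.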

### Try the whole procedure of tuning with `class_weight = 'balanced'`. Do you observe any improvement in the recall score?

### Try the whole procedure of tuning without scaling the variables or with a MinMaxScaler(). Do you observe any improvement in the recall score?

### Try the whole procedure of tuning with `RandomizedSearchCV()`. Does it reduce the computation time?

In [None]:
from sklearn.ensemble import RandomForestClassifier as RF

rf_model = RF(random_state=10, class_weight = 'balanced')

from sklearn.metrics import f1_score
# fitting the model
rf_model.fit(train_x, train_y)

# make training prediction
train_yhat = rf_model.predict(train_x)
train_score = recall_score(train_y, train_yhat)

# make test prediction
test_yhat = rf_model.predict(test_x)
test_score = recall_score(test_y, test_yhat)

train_score, test_score

In [None]:
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
# 'auto' was removed for max_features in scikit-learn 1.3; 'sqrt' is the equivalent setting for classifiers
params = {'max_depth': [5, 10, 11, 15], 'max_leaf_nodes': list(range(2, 100, 5)), 'min_samples_split': list(range(2, 100, 5)), 'criterion': ['gini', 'entropy'], 'ccp_alpha': [0, 0.001, 0.01, 0.1, 1], 'n_estimators': [100, 200, 300, 500], 'max_features': ['sqrt', 'log2']}


grid_search_cv = RandomizedSearchCV(RF(random_state=10, class_weight = 'balanced'), params, n_iter = 600, verbose=1, cv=3, scoring = 'f1')
grid_search_cv.fit(train_x, train_y)

In [None]:
grid_search_cv.best_estimator_

In [None]:
from sklearn.ensemble import RandomForestClassifier as RF

rf_model = RF(ccp_alpha=0, class_weight='balanced',
                       criterion='entropy', max_depth=10, max_leaf_nodes=97,
                       min_samples_split=22, n_estimators=500, random_state=10)

from sklearn.metrics import f1_score
# fitting the model
rf_model.fit(train_x, train_y)

# make training prediction
train_yhat = rf_model.predict(train_x)
train_score = f1_score(train_y, train_yhat)

# make test prediction
test_yhat = rf_model.predict(test_x)
test_score = f1_score(test_y, test_yhat)

train_score, test_score

(0.8962655601659751, 0.8398169336384439)
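
A random forest trains each tree on a bootstrap sample, so every tree leaves some rows unseen. Those out-of-bag rows give a validation estimate without a separate split. The sketch below shows the `oob_score` option and the fitted feature importances on synthetic data; `X_demo`, `y_demo` and `rf_demo` are illustrative names, and the settings are not the tuned ones from above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic, imbalanced stand-in for the training data
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.84, 0.16], random_state=111
)

rf_demo = RandomForestClassifier(
    n_estimators=200, oob_score=True, class_weight="balanced", random_state=10
)
rf_demo.fit(X_demo, y_demo)

# Accuracy measured only on rows each tree did not see during training
print(rf_demo.oob_score_)

# Impurity-based importances, normalized to sum to 1 across features
print(rf_demo.feature_importances_.round(3))
```

Note that `oob_score_` defaults to accuracy, which is a weak metric on an imbalanced target; it is shown here only as a quick sanity check next to F1.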

In [None]:
import xgboost as xgb
from sklearn.metrics import f1_score

model = xgb.XGBClassifier()

model.fit(train_x, train_y)
# make training prediction
train_yhat = model.predict(train_x)
train_score = f1_score(train_y, train_yhat)

# make test prediction
test_yhat = model.predict(test_x)
test_score = f1_score(test_y, test_yhat)

train_score, test_score



(1.0, 0.9168765743073047)
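
A training F1 of 1.0 against a lower test F1 suggests the boosted model keeps fitting the training set after validation performance has levelled off. One way to inspect this is to score the model after each added tree. The sketch below uses scikit-learn's `GradientBoostingClassifier` and its `staged_predict` method as a stand-in, not XGBoost, and runs on synthetic data; all names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the bank data
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.84, 0.16], random_state=111
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, stratify=y_demo, test_size=0.25, random_state=111
)

gbm = GradientBoostingClassifier(
    n_estimators=200, learning_rate=0.1, max_depth=3, random_state=0
)
gbm.fit(X_tr, y_tr)

# staged_predict yields predictions after 1, 2, ..., n_estimators trees
val_f1 = [f1_score(y_val, pred, zero_division=0) for pred in gbm.staged_predict(X_val)]

# Number of trees at which validation F1 peaks
best_n_trees = val_f1.index(max(val_f1)) + 1
print(best_n_trees, max(val_f1))
```

If the peak comes well before the last tree, a smaller `n_estimators` or a lower `learning_rate` is worth including in the search.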

In [None]:
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

param_dist = {"max_depth": [5, 10, 15, 30, 50],
              "n_estimators": [100, 200, 300],
              "learning_rate": [0.05, 0.1, 0.15, 0.2, 0.3],
             "reg_alpha": [0, 0.01, 0.1, 1, 10],
             "reg_lambda": [0, 0.01, 0.1, 1, 10]}

grid_search = RandomizedSearchCV(model, param_dist, cv = 3, n_iter = 50,  
                                   verbose=10, n_jobs=-1, scoring = "f1")

grid_search.fit(train_x, train_y)

grid_search.best_estimator_

Fitting 3 folds for each of 50 candidates, totalling 150 fits






XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.3, max_delta_step=0, max_depth=5,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=200, n_jobs=8, num_parallel_tree=1, random_state=0,
              reg_alpha=0.01, reg_lambda=10, scale_pos_weight=1, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)

In [None]:
import xgboost as xgb
from sklearn.metrics import f1_score, precision_score

model = xgb.XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.3, max_delta_step=0, max_depth=5,
              min_child_weight=1, monotone_constraints='()',
              n_estimators=200, n_jobs=8, num_parallel_tree=1, random_state=0,
              reg_alpha=0.01, reg_lambda=10, scale_pos_weight=1, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)

model.fit(train_x, train_y)
# make training prediction
train_yhat = model.predict(train_x)
train_score = f1_score(train_y, train_yhat)

# make test prediction
test_yhat = model.predict(test_x)
test_score = f1_score(test_y, test_yhat)

train_score, test_score





(0.9995903318312167, 0.9209535759096613)

In [None]:
%pip install LogitBoost

Collecting LogitBoost
  Downloading logitboost-0.7-py3-none-any.whl (9.1 kB)
Installing collected packages: LogitBoost
Successfully installed LogitBoost-0.7
Note: you may need to restart the kernel to use updated packages.


In [None]:
from logitboost import LogitBoost

lboost = LogitBoost(n_estimators=200, random_state=0)
lboost.fit(train_x, train_y)

# make training prediction
train_yhat = lboost.predict(train_x)
train_score = f1_score(train_y, train_yhat)

# make test prediction
test_yhat = lboost.predict(test_x)
test_score = f1_score(test_y, test_yhat)

train_score, test_score

(0.9170159262363788, 0.8972431077694234)

In [None]:
from sklearn.ensemble import StackingClassifier
from sklearn.ensemble import RandomForestClassifier

# base learners whose cross-validated predictions feed the final estimator
estimators = [
    ('rf', RandomForestClassifier(n_estimators=500, random_state=10)),
    ('lb', LogitBoost(n_estimators=200, random_state=0))
]

clf = StackingClassifier(
    estimators=estimators, final_estimator=xgb.XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.3, max_delta_step=0, max_depth=5,
              min_child_weight=1, monotone_constraints='()',
              n_estimators=200, n_jobs=8, num_parallel_tree=1, random_state=0,
              reg_alpha=0.01, reg_lambda=10, scale_pos_weight=1, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)
 )

clf.fit(train_x, train_y)

# make training prediction
train_yhat = clf.predict(train_x)
train_score = f1_score(train_y, train_yhat)

# make test prediction
test_yhat = clf.predict(test_x)
test_score = f1_score(test_y, test_yhat)

train_score, test_score





(0.9869706840390878, 0.8986568986568987)

In [None]:
from sklearn.svm import SVC

svc_model = SVC()
svc_model.fit(train_x, train_y)

# make training prediction
train_yhat = svc_model.predict(train_x)
train_score = f1_score(train_y, train_yhat)

# make test prediction
test_yhat = svc_model.predict(test_x)
test_score = f1_score(test_y, test_yhat)

train_score, test_score

(0.7778819119025304, 0.7033285094066569)

In [None]:
param_dist = {"C":[0.01, 0.1, 1, 10, 100, 1000, 10000],
             "kernel": ['linear', 'poly', 'rbf'],
             "degree": [2, 3, 4],
             "gamma": ['auto', 'scale']}

grid_search = RandomizedSearchCV(svc_model, param_dist, cv = 3, n_iter = 30,  
                                   verbose=10, n_jobs=-1, scoring = "f1")

grid_search.fit(train_x, train_y)

grid_search.best_estimator_

Fitting 3 folds for each of 30 candidates, totalling 90 fits
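
In the SVC grid above, `degree` only affects the `poly` kernel and `gamma` has no effect on the `linear` kernel, so many sampled combinations are duplicates of each other. A possible refinement, sketched here, is to pass a list of dictionaries so that each kernel is paired only with the parameters it uses. `flat_grid` and `conditional_grid` are illustrative names.

```python
from sklearn.model_selection import ParameterGrid

C_values = [0.01, 0.1, 1, 10, 100, 1000, 10000]

# The grid as written above: every kernel crossed with degree and gamma
flat_grid = {
    "C": C_values,
    "kernel": ["linear", "poly", "rbf"],
    "degree": [2, 3, 4],
    "gamma": ["auto", "scale"],
}

# One dictionary per kernel, listing only the parameters that kernel uses
conditional_grid = [
    {"kernel": ["linear"], "C": C_values},
    {"kernel": ["poly"], "C": C_values, "degree": [2, 3, 4], "gamma": ["auto", "scale"]},
    {"kernel": ["rbf"], "C": C_values, "gamma": ["auto", "scale"]},
]

n_flat = len(ParameterGrid(flat_grid))
n_conditional = len(ParameterGrid(conditional_grid))
print(n_flat, n_conditional)
```

`RandomizedSearchCV` also accepts a list of dictionaries for `param_distributions`, so `conditional_grid` could replace `param_dist` in the search above.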
