Building on some prior knowledge of building and testing machine learning models, this section looks at model selection and testing techniques, mainly those in the sklearn package. The aim is a basic understanding, at a programming/operational level, of how to implement the code that supports testing of machine learning and statistical models.

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn import datasets
from sklearn import svm
from sklearn import metrics

## Reading the data

For this exercise we will read a credit scoring dataset. I previously uploaded the data to Google, and it is available at https://docs.google.com/spreadsheets/d/e/2PACX-1vQfU3NMvbp_FFneuYQQCadXxs5RnZ7Po3-4E94nruZ0vjbsOWT1lWPXF-ybJ5Lzd6Uy5LNiuyCVWR2-/pub?gid=1314911186&single=true&output=csv

The dataset (called **Bankloan**, from IBM) has a set of loans with default information. It includes the following variables:

- Customer: ID, or unique label, of the borrower (NOT predictive).
- Age: Age of the borrower in years.
- Employ: Years at current job.
- Address: Years at current address.
- Leverage: Debt/Income Ratio.
- CredDebt: Credit card standing debt.
- MonthlyLoad: Monthly percentage from salary used to repay debts.
- Default: 1 If default has occurred, 0 if not (Target variable).

The goal is to construct a model to predict whether the loan is going to default or not.

In [None]:
!wget --no-check-certificate --output-document=Bankloan.csv 'https://docs.google.com/spreadsheets/d/e/2PACX-1vQfU3NMvbp_FFneuYQQCadXxs5RnZ7Po3-4E94nruZ0vjbsOWT1lWPXF-ybJ5Lzd6Uy5LNiuyCVWR2-/pub?gid=1314911186&single=true&output=csv'

# Import the files as Pandas datasets
bankloan_data = pd.read_csv('Bankloan.csv')

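# Separate the target from the features.
# Note: only 'Default' is dropped here, so an ID column such as Customer
# (described above as not predictive), if present in the file, stays in X.
y = bankloan_data['Default']
X = bankloan_data.drop(columns=['Default'])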

--2023-02-08 13:17:08--  https://docs.google.com/spreadsheets/d/e/2PACX-1vQfU3NMvbp_FFneuYQQCadXxs5RnZ7Po3-4E94nruZ0vjbsOWT1lWPXF-ybJ5Lzd6Uy5LNiuyCVWR2-/pub?gid=1314911186&single=true&output=csv
Resolving docs.google.com (docs.google.com)... 173.194.69.101, 173.194.69.138, 173.194.69.113, ...
Connecting to docs.google.com (docs.google.com)|173.194.69.101|:443... connected.
HTTP request sent, awaiting response... 307 Temporary Redirect
Location: https://doc-0c-9o-sheets.googleusercontent.com/pub/mq6he3r7ig44qobar1fsg51390/7elr9261t6gpkbl2sr8e82u7sc/1675862225000/108328119934179437001/*/e@2PACX-1vQfU3NMvbp_FFneuYQQCadXxs5RnZ7Po3-4E94nruZ0vjbsOWT1lWPXF-ybJ5Lzd6Uy5LNiuyCVWR2-?gid=1314911186&single=true&output=csv [following]
--2023-02-08 13:17:09--  https://doc-0c-9o-sheets.googleusercontent.com/pub/mq6he3r7ig44qobar1fsg51390/7elr9261t6gpkbl2sr8e82u7sc/1675862225000/108328119934179437001/*/e@2PACX-1vQfU3NMvbp_FFneuYQQCadXxs5RnZ7Po3-4E94nruZ0vjbsOWT1lWPXF-ybJ5Lzd6Uy5LNiuyCVWR2-?gid=13149111

In its simplest form, we split the data into a training set and a testing set. The ratio we use is 70% training versus 30% testing. Because the split happens randomly, `random_state` provides a “random seed” so that the same split, and hence the same results, can be reproduced when switching devices or testing environments.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
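
The role of the seed can be checked directly. The sketch below uses hypothetical toy arrays (`X_demo`, `y_demo`, not the Bankloan data) to show that the same `random_state` reproduces the same split. It also shows the optional `stratify` argument, which is not used in the cell above and is included here only as an assumption about what may be useful when classes are imbalanced.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 70 of class 0 and 30 of class 1.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0] * 70 + [1] * 30)

# The same seed gives the same test set on every call.
_, X_test_a, _, _ = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)
_, X_test_b, _, _ = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)
same_split = np.array_equal(X_test_a, X_test_b)
print("same seed, same split:", same_split)

# stratify=y_demo keeps the class proportions equal in both parts.
_, _, _, y_test_s = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)
print("class counts in stratified test set:", np.bincount(y_test_s))
```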

To test the predictive performance, in this specific instance we choose a <a href="https://scikit-learn.org/stable/modules/classes.html#module-sklearn.svm">support vector machine</a> as the classifier and <a href="https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics">accuracy</a> as the evaluation metric. You should feel free to test other models and metrics later. For the SVM, we choose a linear kernel function and set the regularization parameter C to 1.

In [None]:
clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
clf.score(X_test, y_test)

0.7333333333333333
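
An accuracy figure is easier to interpret next to a trivial baseline. The fold counts printed later in this section (108 + 28 and 52 + 12) imply 136 non-defaults and 64 defaults, so always predicting "no default" would already score about 0.68 on the full data. The sketch below illustrates such a baseline with `DummyClassifier`. The features are random placeholders and only the class counts mirror the dataset, so the number it prints is an illustration, not a result for Bankloan.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: random features, class counts 136 / 64.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = np.array([0] * 136 + [1] * 64)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

# A classifier that ignores the features and always predicts the majority class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
baseline_acc = baseline.score(X_te, y_te)
print("majority-class baseline accuracy:", round(baseline_acc, 3))
```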

We could also repeat the test. In the example below, we run 10 repetitions, each with a different random split, to obtain a series of predictions and evaluations.

In [None]:
total_itr = 10
acc_values = np.zeros(total_itr)
for itr in range(total_itr):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=itr)
    clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
    acc_values[itr] = clf.score(X_test, y_test)

In [None]:
acc_values

array([0.73333333, 0.76666667, 0.71666667, 0.73333333, 0.81666667,
       0.81666667, 0.76666667, 0.76666667, 0.76666667, 0.7       ])
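
A series like this is usually reported as a mean with a measure of spread. The short sketch below summarises the ten values printed above. The numbers are copied from that output, and the choice of the sample standard deviation (`ddof=1`) is an assumption about how one might report the spread.

```python
import numpy as np

# Accuracy values copied from the output above.
acc_values = np.array([0.73333333, 0.76666667, 0.71666667, 0.73333333, 0.81666667,
                       0.81666667, 0.76666667, 0.76666667, 0.76666667, 0.7])

mean_acc = acc_values.mean()
std_acc = acc_values.std(ddof=1)  # sample standard deviation across the repetitions
print(f"accuracy: {mean_acc:.3f} +/- {std_acc:.3f} "
      f"(min {acc_values.min():.3f}, max {acc_values.max():.3f})")
```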

Alternatively, cross-validation (CV) can be used to test the performance of our models.
CV can also be combined with repeated testing.


In [None]:
scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')
scores

array([0.775, 0.7  , 0.8  , 0.675, 0.675])
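
One way to combine CV with repeated testing is scikit-learn's `RepeatedStratifiedKFold`, which reruns a stratified k-fold split several times with different shuffles. The sketch below is a minimal illustration on synthetic data from `make_classification` (an assumption, used so the block runs without the Bankloan file). The three repeats are an arbitrary choice.

```python
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic stand-in for the Bankloan data.
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

# 5 folds repeated 3 times gives 15 accuracy values in total.
rkf = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)
clf_demo = svm.SVC(kernel='linear', C=1)
rep_scores = cross_val_score(clf_demo, X_demo, y_demo, cv=rkf, scoring='accuracy')

# Splits are generated repeat by repeat, so each row below is one full 5-fold CV.
per_repeat = rep_scores.reshape(3, 5).mean(axis=1)
print("mean accuracy per repeat:", per_repeat)
```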

Stratified cross-validation is a cross-validation mechanism that provides train/test indices to split the data into train/test sets.
The `StratifiedKFold` cross-validation object is a variation of `KFold` that returns stratified folds. The folds are made by preserving the percentage of samples for each class.
This is useful if we want to maintain the class distribution in every fold.


In [None]:
from sklearn.model_selection import StratifiedKFold, KFold

skf = StratifiedKFold(n_splits=5)
print("StratifiedKFold")
for train, test in skf.split(X, y):
    print('train -  {}   |   test -  {}'.format(
        np.bincount(y[train]), np.bincount(y[test])))
print("KFold")
kf = KFold(n_splits=5)
for train, test in kf.split(X, y):
    print('train -  {}   |   test -  {}'.format(
        np.bincount(y[train]), np.bincount(y[test])))


StratifiedKFold
train -  [108  52]   |   test -  [28 12]
train -  [109  51]   |   test -  [27 13]
train -  [109  51]   |   test -  [27 13]
train -  [109  51]   |   test -  [27 13]
train -  [109  51]   |   test -  [27 13]
KFold
train -  [112  48]   |   test -  [24 16]
train -  [112  48]   |   test -  [24 16]
train -  [108  52]   |   test -  [28 12]
train -  [105  55]   |   test -  [31  9]
train -  [107  53]   |   test -  [29 11]


We now apply these splitters to divide the data, then train and test the model on each fold.

In [None]:
skf = StratifiedKFold(n_splits=5)
acc_values_Stratified = np.zeros(5)

for i, (train, test) in enumerate(skf.split(X, y)):
    clf = svm.SVC(kernel='linear', C=1).fit(X.loc[train,:], y[train])
    acc_values_Stratified[i] = clf.score(X.loc[test,:], y[test])
acc_values_Stratified

array([0.775, 0.7  , 0.8  , 0.675, 0.675])

In [None]:
kf = KFold(n_splits=5)
acc_values_noStratified = np.zeros(5)

for i, (train, test) in enumerate(kf.split(X, y)):
    clf = svm.SVC(kernel='linear', C=1).fit(X.loc[train,:], y[train])
    acc_values_noStratified[i] = clf.score(X.loc[test,:], y[test])
acc_values_noStratified

array([0.675, 0.75 , 0.675, 0.775, 0.775])
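
The stratified results above match the earlier `cross_val_score(..., cv=5)` output. This is consistent with scikit-learn's documented behaviour: for a classifier, an integer `cv` uses `StratifiedKFold` without shuffling. The sketch below checks this on synthetic data (an assumption, so the block is self-contained) and shows how to pass an explicit splitter instead of writing the loop by hand.

```python
import numpy as np
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score

# Synthetic, mildly imbalanced stand-in for the Bankloan data.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=6, weights=[0.68, 0.32], random_state=0
)
clf_demo = svm.SVC(kernel='linear', C=1)

scores_int = cross_val_score(clf_demo, X_demo, y_demo, cv=5)
scores_skf = cross_val_score(clf_demo, X_demo, y_demo, cv=StratifiedKFold(n_splits=5))
scores_kf = cross_val_score(clf_demo, X_demo, y_demo, cv=KFold(n_splits=5))

print("cv=5 equals StratifiedKFold(5):", np.allclose(scores_int, scores_skf))
print("plain KFold scores:", scores_kf)
```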

Note that so far we have used a fixed parameter value, C=1, for the support vector machine classifier. We can “tune” the classifier to adjust its performance. With random search, this is done by specifying a distribution from which candidate values of C are drawn; each candidate is evaluated with cross-validation and the best one is kept.

In [None]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform

distributions = dict(C=uniform(loc=0, scale=4))
svc = svm.SVC(kernel='linear')
clf = RandomizedSearchCV(svc, distributions,cv=5, random_state=0,n_iter=15)
search = clf.fit(X_train, y_train)
search.best_params_

{'C': 2.4110535042865755}

In [None]:
search.cv_results_
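
The `cv_results_` attribute is a dictionary of arrays, which is hard to read when printed raw. A common way to inspect it is to load it into a pandas DataFrame and sort by rank. The sketch below does this on synthetic data with only five candidates, both of which are assumptions made to keep the example quick. The column names used are the ones `RandomizedSearchCV` produces for a parameter called `C`.

```python
import pandas as pd
from scipy.stats import uniform
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the Bankloan data.
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

search_demo = RandomizedSearchCV(
    svm.SVC(kernel='linear'),
    dict(C=uniform(loc=0, scale=4)),  # same search range as in the cell above
    cv=5, n_iter=5, random_state=0,
).fit(X_demo, y_demo)

# One row per sampled candidate, best candidate first.
results = pd.DataFrame(search_demo.cv_results_)
summary = (results[['param_C', 'mean_test_score', 'std_test_score', 'rank_test_score']]
           .sort_values('rank_test_score'))
print(summary)
```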

In [None]:
skf = StratifiedKFold(n_splits=5)
acc_values_Stratified = np.zeros(5)
svc = svm.SVC(kernel='linear')
for i, (train, test) in enumerate(skf.split(X, y)):
    clf = RandomizedSearchCV(svc, distributions,cv=5, random_state=0)
    search = clf.fit(X.loc[train,:], y[train])
    acc_values_Stratified[i] = search.score(X.loc[test,:], y[test])
acc_values_Stratified

array([0.725, 0.7  , 0.8  , 0.675, 0.8  ])

However, the tuning part of the code below will not be run during the lecture, simply because it takes too long to produce results. The results shown were obtained outside lecture hours when I ran the code myself. You are encouraged to run the code and check the results yourself outside of the lecture time as well.

In [None]:
acc_values_notune = np.zeros(10)
for itr in range(0,10):
    print(itr)
    skf = StratifiedKFold(n_splits=5,shuffle=True,random_state=itr)
    for i, (train, test) in enumerate(skf.split(X, y)):
        clf = svm.SVC(kernel='linear', C=1).fit(X.loc[train,:], y[train])
        acc_values_Stratified[i] = clf.score(X.loc[test,:], y[test])
    acc_values_notune[itr] = np.mean(acc_values_Stratified)

0
1
2
3
4
5
6
7
8
9


In [None]:
acc_values_notune

array([0.74 , 0.725, 0.71 , 0.75 , 0.725, 0.745, 0.725, 0.735, 0.735,
       0.75 ])

In [None]:
acc_values_tune = np.zeros(10)
for itr in range(0,10):
    print(itr)
    skf = StratifiedKFold(n_splits=5,shuffle=True,random_state=itr)
    for i, (train, test) in enumerate(skf.split(X, y)):
        clf = RandomizedSearchCV(svc, distributions,cv=5, random_state=0)
        search = clf.fit(X.loc[train,:], y[train])
        acc_values_Stratified[i] = search.score(X.loc[test,:], y[test])
    acc_values_tune[itr] = np.mean(acc_values_Stratified)

0
1
2
3
4
5
6
7
8
9


In [None]:
acc_values_tune

array([0.715, 0.73 , 0.71 , 0.76 , 0.735, 0.725, 0.725, 0.735, 0.73 ,
       0.75 ])

In [None]:
acc_values_tune = [0.715, 0.73 , 0.71 , 0.76 , 0.735, 0.725, 0.725, 0.735, 0.73 ,
       0.75 ]

We now have ten averaged accuracy results from an SVM with a fixed parameter and ten averaged accuracy results from a “tuned” SVM classifier. What can we conclude about their relative performance?

In [None]:
np.mean(acc_values_notune)

0.734

In [None]:
np.mean(acc_values_tune)

0.7314999999999999

The two means differ numerically, but is the difference statistically significant?

In [None]:
from scipy import stats
stats.ttest_ind(acc_values_notune, acc_values_tune)

Ttest_indResult(statistic=0.4013220814108344, pvalue=0.6929073910119481)

In [None]:
from scipy.stats import ranksums
ranksums(acc_values_notune, acc_values_tune)

RanksumsResult(statistic=0.680336051416609, pvalue=0.49629170223109287)

The p-values of the t-test and the rank-sum test are both greater than 0.05, hence we fail to reject the null hypothesis: we cannot conclude that the numerical difference in accuracy is statistically significant.
Some of you might then ask what the benefit of using random search for parameter tuning is. Please be aware that we only tuned one parameter of one model. If you test other models, e.g. logistic regression and random forests, do you get a different result? Another reminder is that this dataset is rather small; it is a subset of the original dataset. We might get different results if we test on other datasets.
This sounds like a good exercise for you after the lecture. 😀
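
One further point to consider: both loops above build their folds with `StratifiedKFold(..., random_state=itr)`, so repetition `itr` of the fixed-parameter run and repetition `itr` of the tuned run are evaluated on the same splits. Under that reading the two series are paired, and a paired test such as `scipy.stats.ttest_rel` is a natural alternative to the independent-samples test. The sketch below applies it to the values printed above. Treating the runs as paired is an assumption based on the shared seeds, not something the original analysis states.

```python
import numpy as np
from scipy import stats

# Values copied from the outputs above.
acc_values_notune = np.array([0.74, 0.725, 0.71, 0.75, 0.725,
                              0.745, 0.725, 0.735, 0.735, 0.75])
acc_values_tune = np.array([0.715, 0.73, 0.71, 0.76, 0.735,
                            0.725, 0.725, 0.735, 0.73, 0.75])

# Per-repetition difference between the two classifiers on the same splits.
diff = acc_values_notune - acc_values_tune
t_stat, p_value = stats.ttest_rel(acc_values_notune, acc_values_tune)

print("mean difference:", round(diff.mean(), 4))
print("paired t statistic:", round(t_stat, 3), "p-value:", round(p_value, 3))
```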


Last but not least, bootstrapping (sampling with replacement) can generate repeated resampled copies of the data, allowing us to perform cross-validation testing on each copy.

In [None]:
from sklearn.utils import resample
svc = svm.SVC(kernel='linear')
acc_values_tune = np.zeros(3)
for itr in range(0,3):
    print(itr)
    bankloan_data_s = resample(bankloan_data, n_samples=len(X), stratify=y,random_state=itr)
    y_s = bankloan_data_s['Default']
    X_s = bankloan_data_s.drop(columns=['Default'])
    skf = StratifiedKFold(n_splits=5,shuffle=True,random_state=itr)
    for i, (train, test) in enumerate(skf.split(X_s, y_s)):
        clf = RandomizedSearchCV(svc, distributions,cv=5, random_state=0)
        search = clf.fit(X_s.iloc[train,:], y_s.iloc[train])
        acc_values_Stratified[i] = search.score(X_s.iloc[test,:], y_s.iloc[test])
    acc_values_tune[itr] = np.mean(acc_values_Stratified)

0
1
2


In [None]:
acc_values_tune

array([0.765, 0.735, 0.775])
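
A caveat when cross-validating a bootstrap sample: because rows are drawn with replacement, copies of the same row can end up in both the training and the testing fold, which may make the accuracy look better than it would on unseen data. A commonly used alternative is out-of-bag (OOB) evaluation, where the model is trained on the bootstrap sample and tested only on the rows that were never drawn. The sketch below illustrates the idea on synthetic data with a fixed C=1. It is a different technique from the cell above, shown under those assumptions for comparison only.

```python
import numpy as np
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.utils import resample

# Synthetic stand-in for the Bankloan data.
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)
indices = np.arange(len(X_demo))

oob_acc = []
for itr in range(3):
    # Draw row indices with replacement, keeping the class proportions.
    boot_idx = resample(indices, n_samples=len(indices), stratify=y_demo, random_state=itr)
    # Rows that were never drawn form the out-of-bag test set.
    oob_idx = np.setdiff1d(indices, boot_idx)
    clf_demo = svm.SVC(kernel='linear', C=1).fit(X_demo[boot_idx], y_demo[boot_idx])
    oob_acc.append(clf_demo.score(X_demo[oob_idx], y_demo[oob_idx]))

oob_acc = np.array(oob_acc)
print("out-of-bag accuracy per bootstrap sample:", oob_acc)
```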