# Classification problem

## Instructions

-  We consider the dataset file <code>**dataset.csv**</code>, which is contained in the <code>**loan-prediction**</code> directory.

-  A description of the dataset is available in the <code>**README.txt**</code> file in the same directory.

-  **GOAL:** Use information from past loan applicants contained in <code>**dataset.csv**</code> to predict whether a _new_ applicant should be granted a loan or not.

## Dataset preparation

In [20]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import math

### Data collection

In [21]:
path = './exercises/sklearn/loan-prediction/dataset.csv'

dataset = pd.read_csv(path)

dataset.head()

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y


### Handling missing values

A simple way to handle missing (NA) values in a numerical column is to replace them with the column mean. In the presence of _outliers_, however, the mean can be heavily distorted; the __median__ is robust to outliers and is used here instead. Missing values in categorical columns are replaced with the most frequent value (the mode).
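
As a minimal sketch of why the median is preferred, the toy column below (hypothetical values, not taken from `dataset.csv`) contains one missing entry and one extreme outlier. The names `incomes`, `mean_value` and `median_value` are illustrative only.

```python
import pandas as pd

# Hypothetical toy column: one missing value and one extreme outlier.
incomes = pd.Series([2500.0, 3000.0, None, 3500.0, 81000.0])

# Both statistics skip the missing entry by default.
mean_value = incomes.mean()      # pulled far upwards by the outlier
median_value = incomes.median()  # stays close to the typical values

# Fill the gap with each statistic to compare the imputed value.
filled_with_mean = incomes.fillna(mean_value)
filled_with_median = incomes.fillna(median_value)

print("mean:", mean_value, "median:", median_value)
```

With the mean, the imputed applicant would look far richer than any typical row; with the median, the imputed value stays within the bulk of the data.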

In [22]:
from pandas.api.types import is_numeric_dtype

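# Numeric columns: fill with the median; other columns: fill with the most frequent value.
data = dataset.apply(lambda x: x.fillna(x.median()) if is_numeric_dtype(x) else x.fillna(x.mode()[0]))
# Summary of the original (unfilled) dataset: counts below 614 reveal the missing values.
dataset.describe()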


Unnamed: 0,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History
count,614.0,614.0,592.0,600.0,564.0
mean,5403.459283,1621.245798,146.412162,342.0,0.842199
std,6109.041673,2926.248369,85.587325,65.12041,0.364878
min,150.0,0.0,9.0,12.0,0.0
25%,2877.5,0.0,100.0,360.0,1.0
50%,3812.5,1188.5,128.0,360.0,1.0
75%,5795.0,2297.25,168.0,360.0,1.0
max,81000.0,41667.0,700.0,480.0,1.0


### Encoding categorical features - _One-hot Encoding_

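Categorical values must be converted to numerical values before they can be used in the machine-learning pipeline, because not all ML models support categorical inputs.

In pandas this is done with the <tt>get_dummies</tt> function. Note that <tt>Loan_ID</tt> is also non-numeric, so the cell below expands it into one indicator column per loan, as the output shows.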


In [23]:
categorical_features = [col for col in data.columns if not is_numeric_dtype(data[col]) and col != 'Loan_Status']
data_with_dummies = pd.get_dummies(data, columns=categorical_features)
data_with_dummies.head()

Unnamed: 0,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Loan_Status,Loan_ID_LP001002,Loan_ID_LP001003,Loan_ID_LP001005,Loan_ID_LP001006,...,Dependents_1,Dependents_2,Dependents_3+,Education_Graduate,Education_Not Graduate,Self_Employed_No,Self_Employed_Yes,Property_Area_Rural,Property_Area_Semiurban,Property_Area_Urban
0,5849,0.0,128.0,360.0,1.0,Y,True,False,False,False,...,False,False,False,True,False,True,False,False,False,True
1,4583,1508.0,128.0,360.0,1.0,N,False,True,False,False,...,True,False,False,True,False,True,False,True,False,False
2,3000,0.0,66.0,360.0,1.0,Y,False,False,True,False,...,False,False,False,True,False,False,True,False,False,True
3,2583,2358.0,120.0,360.0,1.0,Y,False,False,False,True,...,False,False,False,False,True,True,False,False,False,True
4,6000,0.0,141.0,360.0,1.0,Y,False,False,False,False,...,False,False,False,True,False,True,False,False,False,True
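
The output above contains one `Loan_ID_...` column per row, because the identifier is a non-numeric column with a unique value for every applicant. One possible variation, sketched here on a hypothetical miniature table (the frame `toy` and its values are assumptions, not the real data), is to drop the identifier before encoding. This is not what the notebook does; the results shown in the rest of the document were obtained with the identifier columns included.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical miniature of the loan table.
toy = pd.DataFrame({
    'Loan_ID': ['LP001', 'LP002', 'LP003'],
    'Gender': ['Male', 'Female', 'Male'],
    'ApplicantIncome': [5849, 4583, 3000],
    'Loan_Status': ['Y', 'N', 'Y'],
})

# Drop the identifier so it is not expanded into one column per row.
features = toy.drop(columns=['Loan_ID'])

# Same selection rule as in the notebook: non-numeric columns except the target.
categorical = [
    col for col in features.columns
    if not is_numeric_dtype(features[col]) and col != 'Loan_Status'
]
encoded = pd.get_dummies(features, columns=categorical)

print(encoded.columns.tolist())
```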


Move the target column (<tt>Loan_Status</tt>) to the last position.

In [24]:
columns = data_with_dummies.columns.tolist()
columns.append(columns.pop(columns.index('Loan_Status')))
data_with_dummies = data_with_dummies.loc[:, columns]
data_with_dummies.head()

Unnamed: 0,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Loan_ID_LP001002,Loan_ID_LP001003,Loan_ID_LP001005,Loan_ID_LP001006,Loan_ID_LP001008,...,Dependents_2,Dependents_3+,Education_Graduate,Education_Not Graduate,Self_Employed_No,Self_Employed_Yes,Property_Area_Rural,Property_Area_Semiurban,Property_Area_Urban,Loan_Status
0,5849,0.0,128.0,360.0,1.0,True,False,False,False,False,...,False,False,True,False,True,False,False,False,True,Y
1,4583,1508.0,128.0,360.0,1.0,False,True,False,False,False,...,False,False,True,False,True,False,True,False,False,N
2,3000,0.0,66.0,360.0,1.0,False,False,True,False,False,...,False,False,True,False,False,True,False,False,True,Y
3,2583,2358.0,120.0,360.0,1.0,False,False,False,True,False,...,False,False,False,True,True,False,False,False,True,Y
4,6000,0.0,141.0,360.0,1.0,False,False,False,False,True,...,False,False,True,False,True,False,False,False,True,Y


### Encoding binary class label

To turn the binary class label into a numerical value, first identify the label column and its two possible values, then replace them with 1 and -1.

In [27]:
data = data_with_dummies
data.Loan_Status = data.Loan_Status.map(lambda x: 1 if x == 'Y' else -1)
data.head()

Unnamed: 0,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Loan_ID_LP001002,Loan_ID_LP001003,Loan_ID_LP001005,Loan_ID_LP001006,Loan_ID_LP001008,...,Dependents_2,Dependents_3+,Education_Graduate,Education_Not Graduate,Self_Employed_No,Self_Employed_Yes,Property_Area_Rural,Property_Area_Semiurban,Property_Area_Urban,Loan_Status
0,5849,0.0,128.0,360.0,1.0,True,False,False,False,False,...,False,False,True,False,True,False,False,False,True,1
1,4583,1508.0,128.0,360.0,1.0,False,True,False,False,False,...,False,False,True,False,True,False,True,False,False,-1
2,3000,0.0,66.0,360.0,1.0,False,False,True,False,False,...,False,False,True,False,False,True,False,False,True,1
3,2583,2358.0,120.0,360.0,1.0,False,False,False,True,False,...,False,False,False,True,True,False,False,False,True,1
4,6000,0.0,141.0,360.0,1.0,False,False,False,False,True,...,False,False,True,False,True,False,False,False,True,1
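
The lambda used above maps every value other than `'Y'` to -1, including any unexpected label. As a hedged alternative, the sketch below maps through an explicit dictionary, so that unexpected values become `NaN` and can be inspected. The series `labels` is hypothetical and deliberately contains a malformed entry.

```python
import pandas as pd

# Hypothetical labels; the last entry is a deliberately malformed value.
labels = pd.Series(['Y', 'N', 'Y', 'y'])

# Values missing from the dictionary are mapped to NaN instead of -1.
encoded = labels.map({'Y': 1, 'N': -1})

# Collect the original values that could not be encoded.
unexpected = labels[encoded.isna()]

print(unexpected.tolist())
```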


## Build the model

In [34]:
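from sklearn.model_selection import (
    train_test_split, cross_validate, cross_val_score,
    KFold, StratifiedKFold, GridSearchCV,
)
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier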

### Split the dataset

In [32]:
x = data.iloc[:, :-1]
x.head()
y = data.iloc[:, -1]
y.head()

0    1
1   -1
2    1
3    1
4    1
Name: Loan_Status, dtype: int64

Let's split our dataset with the __scikit-learn__ <tt>train_test_split</tt> function, which divides the input dataset into a training set and a test set.

We want the training set to account for 80% of the original dataset and the test set for the remaining 20%.

Additionally, _stratified_ sampling would give the same target distribution in both the training and the test sets. With <tt>train_test_split</tt> this requires passing <tt>stratify=y</tt>; the call below does not pass it, so the split used here is purely random.


In [35]:
seed = 1855

X_train, X_test, y_train, y_test = train_test_split(x,y,test_size=0.2, random_state=seed)
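
For reference, a stratified split can be sketched as follows. The arrays `X_toy` and `y_toy` are hypothetical stand-ins for `x` and `y`; `stratify` is a documented parameter of `train_test_split`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 70 positives and 30 negatives.
y_toy = np.array([1] * 70 + [-1] * 30)
X_toy = np.arange(100).reshape(-1, 1)

# stratify=y_toy keeps the class proportions equal in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=1855, stratify=y_toy
)

print("positives in train:", int((y_tr == 1).sum()), "of", len(y_tr))
print("positives in test:", int((y_te == 1).sum()), "of", len(y_te))
```

Both subsets keep the 70/30 class ratio of the full toy dataset.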

### Evaluate function

We define a helper function that prints the evaluation metrics for a set of predictions.

In [36]:
def evaluate(expected_values, predicted_values):
    print('Accuracy:', accuracy_score(expected_values, predicted_values))
    print('Precision:', precision_score(expected_values, predicted_values))
    print('Recall:', recall_score(expected_values, predicted_values))
    print('F1:', f1_score(expected_values, predicted_values))
    print('ROC AUC:', roc_auc_score(expected_values, predicted_values))
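
One caveat about the last metric: `evaluate` passes hard class predictions to `roc_auc_score`, whereas the `'roc_auc'` scorer used with `cross_validate` in the following sections works on continuous scores, so the two numbers are not directly comparable. The sketch below, on hypothetical scores, shows how much the two can differ.

```python
from sklearn.metrics import roc_auc_score

y_true = [-1, -1, 1, 1]

# Hypothetical predicted probabilities for the positive class.
scores = [0.10, 0.30, 0.40, 0.45]

# Thresholding at 0.5 turns every score into the negative class here.
hard_labels = [1 if s >= 0.5 else -1 for s in scores]

# The ranking of the scores is perfect, so the AUC on scores is 1.0.
auc_from_scores = roc_auc_score(y_true, scores)

# Constant hard labels carry no ranking information, so the AUC is 0.5.
auc_from_labels = roc_auc_score(y_true, hard_labels)

print(auc_from_scores, auc_from_labels)
```

To compute the AUC on scores for a fitted classifier, one would typically pass `model.predict_proba(X_test)[:, 1]` instead of `model.predict(X_test)`.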

### Cross-validation

In [38]:

model = KNeighborsClassifier()

cross_validation = cross_validate(model, x, y, cv = 10, scoring=['accuracy', 'precision', 'recall', 'f1', 'roc_auc'], return_train_score=True)

pd.DataFrame(cross_validation)

Unnamed: 0,fit_time,score_time,test_accuracy,train_accuracy,test_precision,train_precision,test_recall,train_recall,test_f1,train_f1,test_roc_auc,train_roc_auc
0,0.027954,0.049537,0.677419,0.737319,0.709091,0.757709,0.906977,0.907652,0.795918,0.82593,0.596695,0.753901
1,0.02213,0.040588,0.66129,0.744565,0.703704,0.759825,0.883721,0.918206,0.783505,0.831541,0.54896,0.759056
2,0.022839,0.045508,0.532258,0.724638,0.644444,0.75,0.690476,0.9,0.666667,0.818182,0.48869,0.764152
3,0.030335,0.058446,0.580645,0.733696,0.66,0.757174,0.785714,0.902632,0.717391,0.823529,0.497619,0.754735
4,0.025025,0.045182,0.606557,0.730561,0.673077,0.746269,0.833333,0.921053,0.744681,0.824499,0.538221,0.750928
5,0.027254,0.046367,0.639344,0.734177,0.685185,0.754923,0.880952,0.907895,0.770833,0.824373,0.583333,0.745148
6,0.024714,0.044369,0.606557,0.735986,0.673077,0.76,0.833333,0.9,0.744681,0.824096,0.425439,0.75502
7,0.014205,0.044022,0.606557,0.734177,0.666667,0.749465,0.857143,0.921053,0.75,0.826446,0.420426,0.765037
8,0.023504,0.041,0.655738,0.723327,0.714286,0.742004,0.833333,0.915789,0.769231,0.819788,0.56203,0.750555
9,0.021016,0.038632,0.590164,0.734177,0.688889,0.756044,0.738095,0.905263,0.712644,0.823952,0.446115,0.769128


In [39]:
print("Mean of test set scores:")
print(f"Accuracy: {cross_validation['test_accuracy'].mean()}")
print(f"Precision: {cross_validation['test_precision'].mean()}")
print(f"Recall: {cross_validation['test_recall'].mean()}")
print(f"F1: {cross_validation['test_f1'].mean()}")
print(f"ROC AUC: {cross_validation['test_roc_auc'].mean()}")

Mean of test set scores:
Accuracy: 0.6156530936012691
Precision: 0.6818419358419359
Recall: 0.8243078626799557
F1: 0.7455550975853288
ROC AUC: 0.5107528268345282


### K-fold cross-validation

K-fold cross-validation is a more robust validation scheme in which the dataset is divided into $K$ parts (folds); at every iteration one fold is used as the test set and the other $K - 1$ folds as the training set.
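
The mechanics can be sketched on ten hypothetical samples with $K = 5$: every sample ends up in the test fold exactly once, and each iteration trains on the remaining four folds.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten hypothetical samples, identified by their index.
samples = np.arange(10)
k_fold = KFold(n_splits=5, shuffle=True, random_state=1855)

test_counts = np.zeros(len(samples), dtype=int)
fold_sizes = []
for train_idx, test_idx in k_fold.split(samples):
    # Count how many times each sample is used for testing.
    test_counts[test_idx] += 1
    fold_sizes.append((len(train_idx), len(test_idx)))

print(test_counts)
print(fold_sizes)
```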

In [41]:
model = KNeighborsClassifier()

k_fold = KFold(n_splits=10, random_state=seed, shuffle=True)

cross_validation_result = cross_validate(model, x, y, cv = k_fold, scoring=['accuracy', 'precision', 'recall', 'f1', 'roc_auc'], return_train_score=True)
pd.DataFrame(cross_validation_result)

Unnamed: 0,fit_time,score_time,test_accuracy,train_accuracy,test_precision,train_precision,test_recall,train_recall,test_f1,train_f1,test_roc_auc,train_roc_auc
0,0.025849,0.049849,0.66129,0.724638,0.76087,0.747253,0.777778,0.901857,0.769231,0.817308,0.551634,0.741925
1,0.025085,0.054349,0.629032,0.744565,0.679245,0.763797,0.857143,0.910526,0.757895,0.830732,0.513095,0.775765
2,0.022136,0.042533,0.612903,0.717391,0.679245,0.742919,0.837209,0.899736,0.75,0.813842,0.531212,0.744803
3,0.026118,0.043777,0.548387,0.75,0.607843,0.762313,0.794872,0.929504,0.688889,0.837647,0.482163,0.757142
4,0.02821,0.044094,0.622951,0.746835,0.66,0.76129,0.846154,0.924282,0.741573,0.834906,0.503497,0.77154
5,0.02437,0.042037,0.704918,0.719711,0.72,0.742004,0.9,0.910995,0.8,0.817861,0.660714,0.741144
6,0.022618,0.041681,0.721311,0.725136,0.767857,0.74833,0.914894,0.896,0.834951,0.815534,0.575228,0.745596
7,0.021926,0.043068,0.622951,0.735986,0.686275,0.753247,0.833333,0.915789,0.752688,0.826603,0.52193,0.761827
8,0.023008,0.042925,0.639344,0.734177,0.711538,0.756098,0.840909,0.902116,0.770833,0.822678,0.549465,0.753719
9,0.024423,0.043177,0.557377,0.74141,0.659091,0.759825,0.707317,0.913386,0.682353,0.829559,0.491463,0.765885


In [44]:
print("Mean of test set scores:")
print(f"Accuracy: {cross_validation_result['test_accuracy'].mean()}")
print(f"Precision: {cross_validation_result['test_precision'].mean()}")
print(f"Recall: {cross_validation_result['test_recall'].mean()}")
print(f"F1: {cross_validation_result['test_f1'].mean()}")
print(f"ROC AUC: {cross_validation_result['test_roc_auc'].mean()}")

Mean of test set scores:
Accuracy: 0.6385774722369117
Precision: 0.6964817807464867
Recall: 0.8414728682170545
F1: 0.7618310155497572
ROC AUC: 0.5161642478288745


### Stratified k-fold cross-validation

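An even better option is to use stratified k-fold validation. This variant splits the dataset so that every fold contains approximately the same proportion of class labels as the whole dataset.

Note that the means printed at the end of the previous section are identical to the ones printed at the end of this one: both cells read the same <tt>cross_validation_result</tt> variable, and the cell numbering (44 after 43) indicates that the earlier print was executed after the stratified run, so both report the stratified results.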
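
As a small illustration of what stratification guarantees, the sketch below counts the positives in each test fold for hypothetical labels with a 70/30 class ratio (`y_toy` and `X_toy` are assumptions, not the loan data).

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced labels: 70 positives and 30 negatives.
y_toy = np.array([1] * 70 + [-1] * 30)
X_toy = np.zeros((len(y_toy), 1))  # the features are irrelevant for the split

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=1855)

# Number of positive samples in each of the ten test folds.
positives_per_fold = [
    int((y_toy[test_idx] == 1).sum())
    for _, test_idx in skf.split(X_toy, y_toy)
]

print(positives_per_fold)
```

Every test fold of ten samples contains seven positives and three negatives, matching the overall class ratio.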

In [43]:
model = KNeighborsClassifier()

stratified_k_fold = StratifiedKFold(n_splits=10, random_state=seed, shuffle=True)

cross_validation_result = cross_validate(model, x, y, cv = stratified_k_fold, scoring=['accuracy', 'precision', 'recall', 'f1', 'roc_auc'], return_train_score=True)

pd.DataFrame(cross_validation_result)

Unnamed: 0,fit_time,score_time,test_accuracy,train_accuracy,test_precision,train_precision,test_recall,train_recall,test_f1,train_f1,test_roc_auc,train_roc_auc
0,0.032698,0.065267,0.580645,0.728261,0.666667,0.749455,0.790698,0.907652,0.723404,0.821002,0.505508,0.761732
1,0.028888,0.05289,0.629032,0.733696,0.708333,0.754386,0.790698,0.907652,0.747253,0.823952,0.452876,0.753939
2,0.02121,0.041575,0.612903,0.744565,0.666667,0.762637,0.857143,0.913158,0.75,0.831138,0.491667,0.766371
3,0.022225,0.041667,0.645161,0.728261,0.685185,0.746781,0.880952,0.915789,0.770833,0.822695,0.467857,0.753825
4,0.022478,0.042564,0.655738,0.723327,0.714286,0.743041,0.833333,0.913158,0.769231,0.819362,0.600251,0.751605
5,0.021504,0.038377,0.721311,0.725136,0.745098,0.747826,0.904762,0.905263,0.817204,0.819048,0.673559,0.74841
6,0.02129,0.039951,0.639344,0.725136,0.692308,0.747826,0.857143,0.905263,0.765957,0.819048,0.356516,0.770414
7,0.022117,0.042345,0.622951,0.748644,0.686275,0.764835,0.833333,0.915789,0.752688,0.833533,0.559524,0.761553
8,0.0279,0.043029,0.704918,0.719711,0.74,0.739872,0.880952,0.913158,0.804348,0.817432,0.58396,0.746235
9,0.020656,0.041828,0.57377,0.743219,0.66,0.762115,0.785714,0.910526,0.717391,0.829736,0.469925,0.759545


In [45]:
print("Mean of test set scores:")
print(f"Accuracy: {cross_validation_result['test_accuracy'].mean()}")
print(f"Precision: {cross_validation_result['test_precision'].mean()}")
print(f"Recall: {cross_validation_result['test_recall'].mean()}")
print(f"F1: {cross_validation_result['test_f1'].mean()}")
print(f"ROC AUC: {cross_validation_result['test_roc_auc'].mean()}")

Mean of test set scores:
Accuracy: 0.6385774722369117
Precision: 0.6964817807464867
Recall: 0.8414728682170545
F1: 0.7618310155497572
ROC AUC: 0.5161642478288745


## Comparing different models

When several models are candidates for the same classification problem, we can compare them to see which one fits the problem best.

### Select the best hyper-params for a fixed model family

In this first case, we study the influence that different hyper-params have within a single model family (k-nearest neighbours here; the other families are commented out) and choose the best value.

In [46]:
X_train, X_test, y_train, y_test = train_test_split(x,y,test_size=0.2, random_state=seed)


models_and_parameters = {
    'KNeighborsClassifier': (KNeighborsClassifier(), {'n_neighbors': [3, 5, 7, 9, 11]}),
    # 'DecisionTreeClassifier': (DecisionTreeClassifier(), {'max_depth': [3, 5, 7, 9, 11]}),
    # 'RandomForestClassifier': (RandomForestClassifier(), {'n_estimators': [10, 50, 100, 200, 300]}),
    # 'LogisticRegression': (LogisticRegression(), {'C': [0.1, 0.5, 1, 5, 10]}),
}

k_fold = StratifiedKFold(n_splits=10, random_state=seed, shuffle=True)

model  = models_and_parameters['KNeighborsClassifier'][0]
parameters = models_and_parameters['KNeighborsClassifier'][1]

grid_search = GridSearchCV(model, parameters, cv=k_fold, scoring='f1', verbose = True, return_train_score=True)

grid_search.fit(X_train, y_train)

pd.DataFrame(grid_search.cv_results_)

Fitting 10 folds for each of 5 candidates, totalling 50 fits


Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_n_neighbors,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,...,split2_train_score,split3_train_score,split4_train_score,split5_train_score,split6_train_score,split7_train_score,split8_train_score,split9_train_score,mean_train_score,std_train_score
0,0.020634,0.001829,0.018454,0.000702,3,{'n_neighbors': 3},0.773333,0.694444,0.657143,0.783784,...,0.850467,0.8482,0.849145,0.84858,0.84953,0.853543,0.846034,0.8482,0.848081,0.003969
1,0.018151,0.00165,0.017285,0.001049,5,{'n_neighbors': 5},0.753247,0.736842,0.704225,0.8,...,0.817496,0.810811,0.817109,0.823881,0.810241,0.822823,0.811377,0.809595,0.815341,0.004849
2,0.017741,0.00045,0.017799,0.000963,7,{'n_neighbors': 7},0.78481,0.769231,0.685714,0.814815,...,0.81296,0.815029,0.811765,0.811852,0.814706,0.81351,0.801749,0.810967,0.812071,0.003977
3,0.018337,0.000675,0.018452,0.002083,9,{'n_neighbors': 9},0.753247,0.74359,0.695652,0.871795,...,0.821173,0.809798,0.816739,0.812589,0.811511,0.818444,0.804064,0.808571,0.813637,0.004964
4,0.017946,0.000701,0.017452,0.000395,11,{'n_neighbors': 11},0.769231,0.74359,0.712329,0.835443,...,0.816208,0.808023,0.812772,0.804665,0.810496,0.823529,0.810967,0.814815,0.813681,0.005451


In [49]:
print("Best parameters found:")
print(grid_search.best_params_)
print("Best F1-score found:")
print(f"{grid_search.best_score_:.3f}")

Best parameters found:
{'n_neighbors': 11}
Best F1-score found:
0.781


In [50]:
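# Build the model with the best n_neighbors found by the grid search and fit it on the whole training set.
# Caution: the scores printed below were obtained by fitting `model` (default n_neighbors=5), so they may differ.
model_with_best_parameters = KNeighborsClassifier(n_neighbors=grid_search.best_params_['n_neighbors'])

model_with_best_parameters.fit(X_train, y_train)

evaluate(y_test, model_with_best_parameters.predict(X_test))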

Accuracy: 0.6829268292682927
Precision: 0.7373737373737373
Recall: 0.8488372093023255
F1: 0.7891891891891892
ROC AUC: 0.5730672532998113
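
As a side note, `GridSearchCV` refits the best configuration on the whole training set by default (`refit=True`) and exposes it as `best_estimator_`, so the tuned model does not have to be rebuilt by hand. A minimal sketch on synthetic data follows; `make_classification` and all `*_toy` names are assumptions used only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the loan data.
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=1855)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=1855, stratify=y_toy
)

grid = GridSearchCV(
    KNeighborsClassifier(),
    {'n_neighbors': [3, 5, 7]},
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=1855),
    scoring='f1',
)
grid.fit(X_tr, y_tr)

# best_estimator_ has already been refitted on (X_tr, y_tr) with the best parameters.
tuned_model = grid.best_estimator_
test_predictions = tuned_model.predict(X_te)

print(grid.best_params_, tuned_model.n_neighbors)
```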


### Best model from fixed hyper-params

Here we keep the hyper-params of each model fixed (we use the default values) and compare the different models with one another.

In [52]:
import warnings

warnings.filterwarnings('ignore')
X_train, X_test, y_train, y_test = train_test_split(x,y,test_size=0.2, random_state=seed)


models= {
    'KNeighborsClassifier': KNeighborsClassifier(), 
    'DecisionTreeClassifier': DecisionTreeClassifier(), 
    'RandomForestClassifier': RandomForestClassifier(), 
    'LogisticRegression': LogisticRegression()
}

k_fold = StratifiedKFold(n_splits=10, random_state=seed, shuffle=True)

cross_validation_scores = {}
for model_name, model in models.items():
  cross_validation_scores[model_name] = cross_val_score(model, X_train, y_train, cv=k_fold, scoring='precision')
  
cross_validation_scores = pd.DataFrame(cross_validation_scores).transpose()

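# Use only the fold columns for both statistics; the 'std' values shown below also included the 'mean' column.
fold_columns = cross_validation_scores.columns.tolist()
cross_validation_scores['mean'] = cross_validation_scores[fold_columns].mean(axis=1)
cross_validation_scores['std'] = cross_validation_scores[fold_columns].std(axis=1)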
cross_validation_scores = cross_validation_scores.sort_values(['mean', 'std'], ascending=False)


cross_validation_scores

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,mean,std
LogisticRegression,0.804878,0.842105,0.780488,0.790698,0.769231,0.772727,0.815789,0.820513,0.780488,0.761905,0.793882,0.024662
DecisionTreeClassifier,0.815789,0.8,0.810811,0.777778,0.815789,0.744186,0.787879,0.818182,0.794872,0.764706,0.792999,0.023418
RandomForestClassifier,0.785714,0.825,0.820513,0.772727,0.785714,0.733333,0.810811,0.820513,0.8,0.744186,0.789851,0.030439
KNeighborsClassifier,0.674419,0.666667,0.675676,0.695652,0.674419,0.7,0.675,0.642857,0.717949,0.714286,0.683692,0.021881


Logistic regression has the highest mean precision, although its margin over the decision tree and the random forest is much smaller than the standard deviation across folds, so the ranking among these three is not clear-cut. We now train the selected model on the whole training set (so far it was trained on the cross-validation folds only), predict the values on the test set and evaluate the result. This test-set evaluation is the final step: tuning any further against the test set would bias the estimate.

In [53]:
best_model = models[cross_validation_scores.index[0]]
best_model.fit(X_train, y_train)


evaluate(y_test, best_model.predict(X_test))

Accuracy: 0.8211382113821138
Precision: 0.8076923076923077
Recall: 0.9767441860465116
F1: 0.8842105263157894
ROC AUC: 0.7181018227529855
