# Classification problem

## Instructions

-  We consider the dataset file <code>**dataset.csv**</code>, which is contained in the <code>**loan-prediction**</code> directory

-  A description of the dataset is available in the <code>**README.txt**</code> file on the same directory.

-  **GOAL:** Use information from past loan applicants contained in <code>**dataset.csv**</code> to predict whether a _new_ applicant should be granted a loan or not.

## Dataset preparation

In [40]:
import math
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as stats
import warnings

### Data collection

In [41]:
path = './exercises/sklearn/loan-prediction/dataset.csv'
data = pd.read_csv(path, sep=',', index_col='Loan_ID')
data.head()

Unnamed: 0_level_0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
Loan_ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y


### Handling missing values

The first thing we might do is to replace the NA values with the mean of all the values (in the case of numerical values). The reality is that with the presence of _outliers_, the mean might not be the best choice. The __median__ is a better solution, being indeed robust to the outliers in the dataset.

In [42]:
from pandas.api.types import is_numeric_dtype

data = data.apply(lambda x:
									x.fillna(x.median()) if is_numeric_dtype(x) else x.fillna(x.mode().iloc(0))
									)

data.describe()

Unnamed: 0,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History
count,614.0,614.0,614.0,614.0,614.0
mean,5403.459283,1621.245798,145.752443,342.410423,0.855049
std,6109.041673,2926.248369,84.107233,64.428629,0.352339
min,150.0,0.0,9.0,12.0,0.0
25%,2877.5,0.0,100.25,360.0,1.0
50%,3812.5,1188.5,128.0,360.0,1.0
75%,5795.0,2297.25,164.75,360.0,1.0
max,81000.0,41667.0,700.0,480.0,1.0


### Encoding categorical features - _One-hot Encoding_

Categorical values should be transformed into numerical values to be used in the machine-learning pipeline. Not all the ML models can support categorical values.

This procedure is achieved by the <tt>get_dummies</tt> function.


In [43]:
categorical_features = [col for col in data.columns if not is_numeric_dtype(data[col]) and col != 'Loan_Status']
data_with_dummy = pd.get_dummies(data=data, columns=categorical_features)
data_with_dummy.head()

Unnamed: 0_level_0,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Loan_Status,Gender_<pandas.core.indexing._iLocIndexer object at 0x7fe41216b0c0>,Gender_Female,Gender_Male,Married_<pandas.core.indexing._iLocIndexer object at 0x7fe412168af0>,...,Dependents_2,Dependents_3+,Education_Graduate,Education_Not Graduate,Self_Employed_<pandas.core.indexing._iLocIndexer object at 0x7fe411bb32a0>,Self_Employed_No,Self_Employed_Yes,Property_Area_Rural,Property_Area_Semiurban,Property_Area_Urban
Loan_ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
LP001002,5849,0.0,128.0,360.0,1.0,Y,False,False,True,False,...,False,False,True,False,False,True,False,False,False,True
LP001003,4583,1508.0,128.0,360.0,1.0,N,False,False,True,False,...,False,False,True,False,False,True,False,True,False,False
LP001005,3000,0.0,66.0,360.0,1.0,Y,False,False,True,False,...,False,False,True,False,False,False,True,False,False,True
LP001006,2583,2358.0,120.0,360.0,1.0,Y,False,False,True,False,...,False,False,False,True,False,True,False,False,False,True
LP001008,6000,0.0,141.0,360.0,1.0,Y,False,False,True,False,...,False,False,True,False,False,True,False,False,False,True


Move the predicted column to the last

In [44]:
columns = data_with_dummy.columns.tolist()
columns.insert(len(columns), columns.pop(columns.index('Loan_Status')))
data_with_dummy = data_with_dummy.loc[:, columns]
data_with_dummy.head()

Unnamed: 0_level_0,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Gender_<pandas.core.indexing._iLocIndexer object at 0x7fe41216b0c0>,Gender_Female,Gender_Male,Married_<pandas.core.indexing._iLocIndexer object at 0x7fe412168af0>,Married_No,...,Dependents_3+,Education_Graduate,Education_Not Graduate,Self_Employed_<pandas.core.indexing._iLocIndexer object at 0x7fe411bb32a0>,Self_Employed_No,Self_Employed_Yes,Property_Area_Rural,Property_Area_Semiurban,Property_Area_Urban,Loan_Status
Loan_ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
LP001002,5849,0.0,128.0,360.0,1.0,False,False,True,False,True,...,False,True,False,False,True,False,False,False,True,Y
LP001003,4583,1508.0,128.0,360.0,1.0,False,False,True,False,False,...,False,True,False,False,True,False,True,False,False,N
LP001005,3000,0.0,66.0,360.0,1.0,False,False,True,False,False,...,False,True,False,False,False,True,False,False,True,Y
LP001006,2583,2358.0,120.0,360.0,1.0,False,False,True,False,False,...,False,False,True,False,True,False,False,False,True,Y
LP001008,6000,0.0,141.0,360.0,1.0,False,False,True,False,True,...,False,True,False,False,True,False,False,False,True,Y


### Encoding binary class label

To make the binary class labels in a numerical value, first identify the col and the two possible values. Then replace the with 1 and -1.

In [45]:
data = data_with_dummy

data.Loan_Status = data.Loan_Status.map(lambda x: 1 if x == 'Y' else - 1)

data.head()

Unnamed: 0_level_0,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Gender_<pandas.core.indexing._iLocIndexer object at 0x7fe41216b0c0>,Gender_Female,Gender_Male,Married_<pandas.core.indexing._iLocIndexer object at 0x7fe412168af0>,Married_No,...,Dependents_3+,Education_Graduate,Education_Not Graduate,Self_Employed_<pandas.core.indexing._iLocIndexer object at 0x7fe411bb32a0>,Self_Employed_No,Self_Employed_Yes,Property_Area_Rural,Property_Area_Semiurban,Property_Area_Urban,Loan_Status
Loan_ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
LP001002,5849,0.0,128.0,360.0,1.0,False,False,True,False,True,...,False,True,False,False,True,False,False,False,True,1
LP001003,4583,1508.0,128.0,360.0,1.0,False,False,True,False,False,...,False,True,False,False,True,False,True,False,False,-1
LP001005,3000,0.0,66.0,360.0,1.0,False,False,True,False,False,...,False,True,False,False,False,True,False,False,True,1
LP001006,2583,2358.0,120.0,360.0,1.0,False,False,True,False,False,...,False,False,True,False,True,False,False,False,True,1
LP001008,6000,0.0,141.0,360.0,1.0,False,False,True,False,True,...,False,True,False,False,True,False,False,False,True,1


## Build the model

In [46]:
from sklearn.metrics import get_scorer
from sklearn.feature_extraction import DictVectorizer as DV
from sklearn import tree

# Cross Validation

from sklearn.model_selection import KFold
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_validate
from sklearn.model_selection import train_test_split

# Hyperparams optimization

from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import classification_report
from sklearn.metrics import explained_variance_score

# Models 

from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import AdaBoostClassifier


### Split the dataset

In [47]:
x = data.iloc[:, :  -1]
x.head()

y = data.iloc[:, -1]
y.head()

Loan_ID
LP001002    1
LP001003   -1
LP001005    1
LP001006    1
LP001008    1
Name: Loan_Status, dtype: int64

Let's split our dataset with __scikit-learn__ <tt>train_test_split</tt> function, which splits the input dataset into a training set and a test set, respectively.

We want the training set to account for 80% of the original dataset, whilst 
the test set to account for the remaining 20%.

Additionally, we would like to take advantage of _stratified_ sampling to obtain the same target distribution in both the training and the test sets.


In [48]:
seed = 314

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=314, stratify=y)

### Evaluate function

We can create a function such that it will print the evaluation of the prediction.

In [49]:
def evaluate(true_value, predicted_value):
	print(f"Accuracy: {accuracy_score(true_value, predicted_value):.3f}")
	print(f"Area under the ROC Curve = {roc_auc_score(true_value, predicted_value):.3f}")

### Cross-validation

In [50]:
warnings.filterwarnings('ignore')

model = LogisticRegression()

cross_validation = cross_validate(model, x, y, scoring=('roc_auc', 'accuracy'), return_train_score=True)
pd.DataFrame(cross_validation)

print("Mean of the test set score")
print(f"Accuracy: {np.mean(cross_validation['test_accuracy']):.3f}")
print(f"AUROC: {np.mean(cross_validation['test_roc_auc']):.3f}")

Mean of the test set score
Accuracy: 0.809
AUROC: 0.769


### K-fold cross-validation

The k-fold cross-validation is an improved validation test where the dataset is divided into $K$ parts and at every iteration a part is used as a test set and the others $K - 1$ as a train set.

In [51]:
model = LogisticRegression()

k_fold = KFold(n_splits=10, shuffle=True, random_state=seed)

cross_validation = cross_validate(model, x, y, scoring=('roc_auc', 'accuracy'), return_train_score=True)

pd.DataFrame(cross_validation)

print("Mean of the test set score")
print(f"Accuracy: {np.mean(cross_validation['test_accuracy']):.3f}")
print(f"AUROC: {np.mean(cross_validation['test_roc_auc']):.3f}")


Mean of the test set score
Accuracy: 0.809
AUROC: 0.769


### Stratified k-fold cross-validation

An even better option is to use a stratified k-fold validation. This variant splits the dataset in a way such that every fold contains the same proportion of features.

In [52]:
model  = LogisticRegression()

k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)

cross_validation = cross_validate(model, x, y, scoring=('roc_auc', 'accuracy'), return_train_score=True)

pd.DataFrame(cross_validation)

print("Mean of the test set score")
print(f"Accuracy: {np.mean(cross_validation['test_accuracy']):.3f}")
print(f"AUROC: {np.mean(cross_validation['test_roc_auc']):.3f}")

Mean of the test set score
Accuracy: 0.809
AUROC: 0.769


## Comparing different models

There might be a situation where different models can be compared to see which one fits better to the classification problem we need to solve.

### Select the best hyper-params of a fixed family of model

In this first case, we study the influence different hyper-params have on the same family model (logistic regression) and choose the best

In [53]:
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, shuffle=True, random_state=seed)


models_and_hyperparams = {
	'LogisticRegression': (LogisticRegression(solver='liblinear'), {
		# 'C': [0.01, 0.05, 0.1, 0.2, 0.5],
		'C': [0.01, 0.5],
		# 'n_jobs': [5, 10, 25]
		'n_jobs': [5, 10]
	})
}

k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)

model = models_and_hyperparams['LogisticRegression'][0]
hyperparams = models_and_hyperparams['LogisticRegression'][1]

grid_search = GridSearchCV(model, hyperparams, cv = k_fold, scoring='accuracy', verbose=True, return_train_score=True)

grid_search.fit(X_train, y_train)

pd.DataFrame(grid_search.cv_results_)

Fitting 10 folds for each of 4 candidates, totalling 40 fits


Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_C,param_n_jobs,params,split0_test_score,split1_test_score,split2_test_score,...,split2_train_score,split3_train_score,split4_train_score,split5_train_score,split6_train_score,split7_train_score,split8_train_score,split9_train_score,mean_train_score,std_train_score
0,0.009575,0.00138,0.004759,0.000751,0.01,5,"{'C': 0.01, 'n_jobs': 5}",0.68,0.693878,0.693878,...,0.699095,0.69457,0.699095,0.701357,0.699095,0.696833,0.699095,0.701357,0.699707,0.002596
1,0.00901,0.000674,0.004406,0.000423,0.01,10,"{'C': 0.01, 'n_jobs': 10}",0.68,0.693878,0.693878,...,0.699095,0.69457,0.699095,0.701357,0.699095,0.696833,0.699095,0.701357,0.699707,0.002596
2,0.010154,0.000454,0.004193,0.00052,0.5,5,"{'C': 0.5, 'n_jobs': 5}",0.76,0.816327,0.755102,...,0.80543,0.798643,0.798643,0.79638,0.800905,0.809955,0.80543,0.800905,0.803577,0.004967
3,0.010238,0.000688,0.004465,0.000568,0.5,10,"{'C': 0.5, 'n_jobs': 10}",0.76,0.816327,0.755102,...,0.80543,0.798643,0.798643,0.79638,0.800905,0.809955,0.80543,0.800905,0.803577,0.004967


In [54]:
print(f"Best hyperparameter:")
print(grid_search.best_params_)
print(f"Best accuracy score: {grid_search.best_score_}:.3f")

Best hyperparameter:
{'C': 0.5, 'n_jobs': 5}
Best accuracy score: 0.7984489795918368:.3f


In [55]:
model = LogisticRegression(n_jobs=grid_search.best_params_['n_jobs'], C=grid_search.best_params_['C'], solver='liblinear')

model.fit(X_train, y_train)
evaluate(y_test, model.predict((X_test)))

Accuracy: 0.837
Area under the ROC Curve = 0.750


### Best model from fixed hyper-params

Here we fix the hyper-params for each model (we use the default params) and compare the different models

In [62]:
warnings.filterwarnings('ignore')

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=seed, stratify=y)


models = {
	'LogisticRegression': LogisticRegression(),
	'DecisionTreeClassifier': DecisionTreeClassifier(),
	'RandomForestClassifier': RandomForestClassifier(),
	'GradientBoostingClassifier': GradientBoostingClassifier(),
}

k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)

cross_validation_scores = {}

for model in models:
	cross_validation_scores[model] = cross_val_score(models[model], X_train, y_train, cv=k_fold, scoring='accuracy')
 
cross_validation_scores = pd.DataFrame(cross_validation_scores).transpose()

cross_validation_scores['mean'] = np.mean(cross_validation_scores, axis=1)
cross_validation_scores['std'] = np.std(cross_validation_scores, axis=1)
cross_validation_scores = cross_validation_scores.sort_values(['mean', 'std'], ascending=False)


cross_validation_scores

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,mean,std
LogisticRegression,0.86,0.714286,0.836735,0.795918,0.816327,0.816327,0.693878,0.795918,0.795918,0.673469,0.779878,0.057289
RandomForestClassifier,0.84,0.653061,0.795918,0.795918,0.795918,0.816327,0.77551,0.795918,0.734694,0.734694,0.773796,0.048418
GradientBoostingClassifier,0.84,0.591837,0.836735,0.816327,0.755102,0.77551,0.755102,0.755102,0.816327,0.734694,0.767673,0.065558
DecisionTreeClassifier,0.76,0.653061,0.714286,0.693878,0.612245,0.693878,0.714286,0.755102,0.612245,0.673469,0.688245,0.046755


By comparing the mean and the standard deviation we can deduce that the best classifier is the logistic regression. We now need to train the model on the whole train set (so far we trained in the cross-validation folds only). After training in the whole train set, we predict the values on the test set and evaluate the result. There is nothing more we can do.

In [63]:

model = models[cross_validation_scores.index[0]]
model.fit(X_train, y_train)

evaluate(y_test, model.predict(X_test))

Accuracy: 0.829
Area under the ROC Curve = 0.738
