### Basic Package Imports

In [63]:
import pandas as pd
import numpy as np

### Data Import

In [64]:
dataset = pd.read_csv('dataset.csv')
dataset.head(10)

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y
5,LP001011,Male,Yes,2,Graduate,Yes,5417,4196.0,267.0,360.0,1.0,Urban,Y
6,LP001013,Male,Yes,0,Not Graduate,No,2333,1516.0,95.0,360.0,1.0,Urban,Y
7,LP001014,Male,Yes,3+,Graduate,No,3036,2504.0,158.0,360.0,0.0,Semiurban,N
8,LP001018,Male,Yes,2,Graduate,No,4006,1526.0,168.0,360.0,1.0,Urban,Y
9,LP001020,Male,Yes,1,Graduate,No,12841,10968.0,349.0,360.0,1.0,Semiurban,N


That's what the data looks like, and I think the attributes are self-explanatory. It is a fairly simple binary classification problem: we have to predict whether the bank should or shouldn't give a loan to a customer (Loan_Status).

### Pre-processing

The first thing we will do is check whether the dataset is complete.

In [65]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
Loan_ID              614 non-null object
Gender               601 non-null object
Married              611 non-null object
Dependents           599 non-null object
Education            614 non-null object
Self_Employed        582 non-null object
ApplicantIncome      614 non-null int64
CoapplicantIncome    614 non-null float64
LoanAmount           592 non-null float64
Loan_Amount_Term     600 non-null float64
Credit_History       564 non-null float64
Property_Area        614 non-null object
Loan_Status          614 non-null object
dtypes: float64(4), int64(1), object(8)
memory usage: 62.5+ KB


As we can see, 7 columns have missing values. Let's check what share of each of these columns is missing.

In [66]:
print("Gender missing data:", round(dataset.isnull().sum()['Gender'] / len(dataset) * 100, 2), '%')
print("Married missing data:", round(dataset.isnull().sum()['Married'] / len(dataset) * 100, 2), '%')
print("Dependents missing data:", round(dataset.isnull().sum()['Dependents'] / len(dataset) * 100, 2), '%')
print("Self employment missing data:", round(dataset.isnull().sum()['Self_Employed'] / len(dataset) * 100, 2), '%')
print("Loan amount missing data:", round(dataset.isnull().sum()['LoanAmount'] / len(dataset) * 100, 2), '%')
print("Loan term missing data:", round(dataset.isnull().sum()['Loan_Amount_Term'] / len(dataset) * 100, 2), '%')
print("Credit history missing data:", round(dataset.isnull().sum()['Credit_History'] / len(dataset) * 100, 2), '%')

Gender missing data: 2.12 %
Married missing data: 0.49 %
Dependents missing data: 2.44 %
Self employment missing data: 5.21 %
Loan amount missing data: 3.58 %
Loan term missing data: 2.28 %
Credit history missing data: 8.14 %


Well, it seems like there's not much of an issue here. It's quite a small dataset, so each data point is very valuable. For most columns less than 3% of the data is missing; the exceptions are the Loan Amount (3.58%), Self-Employment (5.21%) and Credit History (8.14%) attributes.
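The per-column percentages can also be computed in a single expression, because the mean of a boolean mask is the fraction of True values. A minimal sketch on a small made-up frame (the `toy` data is hypothetical and not taken from dataset.csv):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are made up for illustration
toy = pd.DataFrame({
    "Gender": ["Male", np.nan, "Female", "Male"],
    "LoanAmount": [128.0, 66.0, np.nan, np.nan],
    "Education": ["Graduate", "Graduate", "Not Graduate", "Graduate"],
})

# isnull() gives a boolean mask; its column-wise mean is the share of missing values
missing_pct = (toy.isnull().mean() * 100).round(2)

# Keep only the columns that actually have gaps
missing_pct = missing_pct[missing_pct > 0]
print(missing_pct)
```

Applied to the real `dataset`, the same expression should list the 7 incomplete columns at once.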

First, we will split the data into categorical and numerical values.

In [67]:
# skip the Loan ID
cat_data = dataset.iloc[:, 1:6].values
cat_data = np.append(cat_data, dataset.iloc[:, 11:13].values, axis = 1)

num_data = dataset.iloc[:, 6:11].values

cat_data = pd.DataFrame(cat_data)
num_data = pd.DataFrame(num_data)

num_data.head()

Unnamed: 0,0,1,2,3,4
0,5849.0,0.0,,360.0,1.0
1,4583.0,1508.0,128.0,360.0,1.0
2,3000.0,0.0,66.0,360.0,1.0
3,2583.0,2358.0,120.0,360.0,1.0
4,6000.0,0.0,141.0,360.0,1.0
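The positional slices used for the split depend on the column order of the CSV. A sketch of the same split done by column name, which also keeps readable column labels; it assumes the column names listed by dataset.info(), and the two-row `toy` frame is made up:

```python
import pandas as pd

# Column names as listed by dataset.info(); Loan_ID is skipped on purpose
cat_cols = ["Gender", "Married", "Dependents", "Education",
            "Self_Employed", "Property_Area", "Loan_Status"]
num_cols = ["ApplicantIncome", "CoapplicantIncome", "LoanAmount",
            "Loan_Amount_Term", "Credit_History"]

# Hypothetical two-row frame with the same columns (values made up)
toy = pd.DataFrame(
    [["LP1", "Male", "No", "0", "Graduate", "No", 5849, 0.0, None, 360.0, 1.0, "Urban", "Y"],
     ["LP2", "Male", "Yes", "1", "Graduate", "No", 4583, 1508.0, 128.0, 360.0, 1.0, "Rural", "N"]],
    columns=["Loan_ID", "Gender", "Married", "Dependents", "Education", "Self_Employed",
             "ApplicantIncome", "CoapplicantIncome", "LoanAmount", "Loan_Amount_Term",
             "Credit_History", "Property_Area", "Loan_Status"],
)

# Select by name instead of by position
cat_toy = toy[cat_cols].copy()
num_toy = toy[num_cols].copy()
print(cat_toy.shape, num_toy.shape)
```

The rest of this notebook refers to columns by integer label, so switching to named columns would mean adjusting the later cells as well.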


### Missing values
#### Numerical data

In [68]:
print('Loan Amount Term equal to 360 ratio:', len(dataset[dataset['Loan_Amount_Term'] == 360]) / len(dataset) * 100, '%')

Loan Amount Term equal to 360 ratio: 83.38762214983714 %


In [69]:
dataset['Credit_History'].value_counts()

1.0    475
0.0     89
Name: Credit_History, dtype: int64

##### Now it is time to handle the missing data points. We have briefly analysed 2 features above so that we can choose sensible replacements.
- Credit History - replace the 'NaN' values with the most common value (1.0)
- Loan Amount Term - replace the 'NaN' values with the most common value (360.0), which accounts for about 83% of all rows
- Loan Amount - replace the 'NaN' values with the average loan amount

In [95]:
from sklearn.impute import SimpleImputer

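# Loan Amount: fill missing values with the column mean
num_data[2] = num_data[2].fillna(num_data[2].mean())

# Loan Amount Term and Credit History: fill missing values with the most frequent value
# (assign back to the columns; writing into .values is not guaranteed to modify the frame)
imputer = SimpleImputer(missing_values = np.nan, strategy = 'most_frequent')
num_data[[3, 4]] = imputer.fit_transform(num_data[[3, 4]])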

num_data.head(20)

Unnamed: 0,0,1,2,3,4
0,5849.0,0.0,146.412162,360.0,1.0
1,4583.0,1508.0,128.0,360.0,1.0
2,3000.0,0.0,66.0,360.0,1.0
3,2583.0,2358.0,120.0,360.0,1.0
4,6000.0,0.0,141.0,360.0,1.0
5,5417.0,4196.0,267.0,360.0,1.0
6,2333.0,1516.0,95.0,360.0,1.0
7,3036.0,2504.0,158.0,360.0,0.0
8,4006.0,1526.0,168.0,360.0,1.0
9,12841.0,10968.0,349.0,360.0,1.0


#### Categorical Data

In [71]:
cat_data.head()

Unnamed: 0,0,1,2,3,4,5,6
0,Male,No,0,Graduate,No,Urban,Y
1,Male,Yes,1,Graduate,No,Rural,N
2,Male,Yes,0,Graduate,Yes,Urban,Y
3,Male,Yes,0,Not Graduate,No,Urban,Y
4,Male,No,0,Graduate,No,Urban,Y


In [72]:
print('Ratio self-employed / all applicants:', round(len(dataset[dataset['Self_Employed'] == 'Yes']) / len(dataset) * 100, 2), '%')

Ratio self-employed / all applicants: 13.36 %


In [73]:
dataset['Dependents'].value_counts()

0     345
1     102
2     101
3+     51
Name: Dependents, dtype: int64

##### Replacing missing values in categorical data:
- Gender - assuming that gender does not bias the outcome, we can fill the gaps with either option
- Married - only 3 data points are missing, so we can fill them with either option
- Dependents - 345 applicants have no dependents and 254 have at least one, so filling the missing values with a single dependent should not be a problem
- Self Employed - only 13.36% of applicants are self-employed, so it is safe to fill the 'NaN's with 'No'

In [74]:
# assign the result back instead of using inplace=True on a selected column
cat_data[0] = cat_data[0].fillna('Male') # gender
cat_data[1] = cat_data[1].fillna('No')   # married
cat_data[2] = cat_data[2].fillna('1')    # dependents
cat_data[4] = cat_data[4].fillna('No')   # self-employed
cat_data.head(20)

Unnamed: 0,0,1,2,3,4,5,6
0,Male,No,0,Graduate,No,Urban,Y
1,Male,Yes,1,Graduate,No,Rural,N
2,Male,Yes,0,Graduate,Yes,Urban,Y
3,Male,Yes,0,Not Graduate,No,Urban,Y
4,Male,No,0,Graduate,No,Urban,Y
5,Male,Yes,2,Graduate,Yes,Urban,Y
6,Male,Yes,0,Not Graduate,No,Urban,Y
7,Male,Yes,3+,Graduate,No,Semiurban,N
8,Male,Yes,2,Graduate,No,Urban,Y
9,Male,Yes,1,Graduate,No,Semiurban,N


In [75]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer

label_encoder = LabelEncoder()
cat_data_enc = cat_data.copy()

# skipping Property Area attribute for now
# fit_transform refits the encoder for every column; classes are encoded in sorted order
cat_data_enc[0] = label_encoder.fit_transform(cat_data[0]) # gender
cat_data_enc[1] = label_encoder.fit_transform(cat_data[1]) # married
cat_data_enc[3] = label_encoder.fit_transform(cat_data[3]) # education
cat_data_enc[4] = label_encoder.fit_transform(cat_data[4]) # self-employed
cat_data_enc[6] = label_encoder.fit_transform(cat_data[6]) # eligible for a loan
cat_data_enc[2] = cat_data[2].replace({"3+": "3"})         # dependents

In [96]:
cat_data_enc.head(40)

Unnamed: 0,0,1,2,3,4,5,6
0,1,0,0,0,0,Urban,1
1,1,1,1,0,0,Rural,0
2,1,1,0,0,1,Urban,1
3,1,1,0,1,0,Urban,1
4,1,0,0,0,0,Urban,1
5,1,1,2,0,1,Urban,1
6,1,1,0,1,0,Urban,1
7,1,1,3,0,0,Semiurban,0
8,1,1,2,0,0,Urban,1
9,1,1,1,0,0,Semiurban,0


Everything looks fine. As stated above, we have skipped Property Area for now and will deal with it later on. It is also worth mentioning that, even though the Dependents column ended up among the categorical data, we won't apply One Hot Encoder to it; from now on we will treat it as numerical data.
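To read the encoded table it helps to know which integer stands for which label. LabelEncoder sorts the classes and uses each class's position as its code. A small sketch, assuming the Gender column holds 'Male' and 'Female' and Loan_Status holds 'Y' and 'N' (the example lists are made up):

```python
from sklearn.preprocessing import LabelEncoder

# Made-up example values in the style of the Gender and Loan_Status columns
gender = ["Male", "Female", "Male", "Male"]
loan_status = ["Y", "N", "Y", "Y"]

gender_enc = LabelEncoder().fit(gender)
status_enc = LabelEncoder().fit(loan_status)

# classes_ is sorted; the position of each class is its integer code
gender_map = {label: code for code, label in enumerate(gender_enc.classes_)}
status_map = {label: code for code, label in enumerate(status_enc.classes_)}
print(gender_map)
print(status_map)
```

Under these assumptions 'Male' and 'Y' both map to 1, which agrees with the encoded rows shown above.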

In [77]:
dataset_fin = pd.concat([num_data, cat_data_enc], axis = 1)
dataset_fin.columns = range(dataset_fin.shape[1]) # Reset indices of columns
dataset_fin.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11
0,5849.0,0.0,146.412162,360.0,1.0,1,0,0,0,0,Urban,1
1,4583.0,1508.0,128.0,360.0,1.0,1,1,1,0,0,Rural,0
2,3000.0,0.0,66.0,360.0,1.0,1,1,0,0,1,Urban,1
3,2583.0,2358.0,120.0,360.0,1.0,1,1,0,1,0,Urban,1
4,6000.0,0.0,141.0,360.0,1.0,1,0,0,0,0,Urban,1


In [78]:
y = cat_data_enc.iloc[:, 6]
y = y.astype(float)
print(y)

0      1.0
1      0.0
2      1.0
3      1.0
4      1.0
      ... 
609    1.0
610    1.0
611    1.0
612    1.0
613    0.0
Name: 6, Length: 614, dtype: float64


Using OneHotEncoder to encode Property Area attribute

In [91]:
transformer = ColumnTransformer([('Property Area', OneHotEncoder(), [10])], remainder = 'passthrough')
X = np.array(transformer.fit_transform(dataset_fin), dtype = np.float64)
# Drop one dummy variable
X = X[:, 1:]
print(X.shape)
print(pd.DataFrame(X).head())

(614, 13)
     0    1       2       3           4      5    6    7    8    9   10   11  \
0  0.0  1.0  5849.0     0.0  146.412162  360.0  1.0  1.0  0.0  0.0  0.0  0.0   
1  0.0  0.0  4583.0  1508.0  128.000000  360.0  1.0  1.0  1.0  1.0  0.0  0.0   
2  0.0  1.0  3000.0     0.0   66.000000  360.0  1.0  1.0  1.0  0.0  0.0  1.0   
3  0.0  1.0  2583.0  2358.0  120.000000  360.0  1.0  1.0  1.0  0.0  1.0  0.0   
4  0.0  1.0  6000.0     0.0  141.000000  360.0  1.0  1.0  0.0  0.0  0.0  0.0   

    12  
0  1.0  
1  0.0  
2  1.0  
3  1.0  
4  1.0  


Okay, so dataset_fin has 12 attributes: 5 numerical and 7 categorical, including the encoded Loan_Status. After we applied One Hot Encoder, Property Area was automatically dropped and replaced with its 3 dummy variables, which gives us 12 - 1 + 3 = 14 attributes. We also drop one dummy variable to avoid the dummy variable trap, which leaves 13 attributes and matches X.shape (614, 13). Note that the last column of X is the encoded Loan_Status, so the target itself is among the features.
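The column arithmetic can be checked on a toy frame. This sketch uses pandas.get_dummies rather than the OneHotEncoder and ColumnTransformer from the cell above, purely to keep the example short; the `toy` frame and its column names are hypothetical:

```python
import pandas as pd

# Hypothetical frame: 2 ordinary columns plus one 3-level categorical column
toy = pd.DataFrame({
    "income": [5849, 4583, 3000, 2583],
    "credit": [1.0, 1.0, 1.0, 0.0],
    "area": ["Urban", "Rural", "Urban", "Semiurban"],
})

# Full one-hot encoding: 3 - 1 + 3 = 5 columns
full = pd.get_dummies(toy, columns=["area"])

# Dropping the first level avoids the dummy variable trap: 5 - 1 = 4 columns
reduced = pd.get_dummies(toy, columns=["area"], drop_first=True)

print(full.shape, reduced.shape)
print(list(reduced.columns))
```

The dropped level becomes the baseline: a row with zeros in both remaining dummy columns belongs to it.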

##### Split train test

In [119]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 12)

print('X_train.shape', X_train.shape, 'y_train.shape', y_train.shape)
print('X_test.shape', X_test.shape, 'y_test.shape', y_test.shape)

X_train.shape (429, 13) y_train.shape (429,)
X_test.shape (185, 13) y_test.shape (185,)


##### Scaling
Since some of the values are around 1 and others are in the thousands, we will scale the features. The scaler is fitted on the training set only and then applied to both the training and the test set.

In [98]:
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)

In [99]:
print(X_train)

[[-0.77409417  1.42984071 -0.11323362 ... -0.52196343 -0.39717311
   0.6687175 ]
 [ 1.29183248 -0.69937861 -0.50992898 ... -0.52196343 -0.39717311
   0.6687175 ]
 [-0.77409417  1.42984071 -0.41288167 ...  1.91584304 -0.39717311
  -1.49539977]
 ...
 [-0.77409417  1.42984071 -0.54627339 ...  1.91584304 -0.39717311
  -1.49539977]
 [ 1.29183248 -0.69937861 -0.51708187 ...  1.91584304 -0.39717311
   0.6687175 ]
 [-0.77409417 -0.69937861  0.74066668 ... -0.52196343 -0.39717311
   0.6687175 ]]
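A minimal sketch of what the scaling step does, on made-up numbers: StandardScaler learns the mean and standard deviation from the training rows and reuses them for the test rows, so the test set is expressed in training-set units. The `_toy` arrays are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales (values made up)
X_train_toy = np.array([[5849.0, 1.0], [4583.0, 0.0], [3000.0, 1.0], [2583.0, 1.0]])
X_test_toy = np.array([[6000.0, 0.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_toy)  # learns mean and std from the training rows
X_test_scaled = scaler.transform(X_test_toy)        # reuses the training mean and std

print(X_train_scaled.mean(axis=0))  # close to 0 for each column
print(X_train_scaled.std(axis=0))   # close to 1 for each column
print(X_test_scaled)                # not centred: expressed in training-set units
```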


### Classification
##### Now it's time to check if we can solve the classification problem. To do that we will use various classifiers included in sklearn:
- Logistic Regression
- K-Nearest Neighbours
- SVM with a linear kernel
- SVM with an RBF kernel
- Naive Bayes
- Decision Tree
- Random Forest

##### Logistic Regression

In [129]:
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression(solver='lbfgs', max_iter=1000)
log_reg.fit(X_train, y_train)
y_pred = log_reg.predict(X_test)

from sklearn.metrics import precision_score, recall_score
log_reg_precision = precision_score(y_test, y_pred)
log_reg_recall = recall_score(y_test, y_pred)

from sklearn.metrics import f1_score
log_reg_score = f1_score(y_test, y_pred)

print(f'Precision: {log_reg_precision}')
print(f'Recall: {log_reg_recall}')
print(f'Score: {log_reg_score}')

Precision: 0.9769230769230769
Recall: 1.0
Score: 0.9883268482490272


##### K-Nearest Neighbours

In [102]:
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', n_jobs = -1)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

knn_precision = precision_score(y_test, y_pred)
knn_recall = recall_score(y_test, y_pred)
knn_score = f1_score(y_test, y_pred)

print(f'Precision: {knn_precision}')
print(f'Recall: {knn_recall}')
print(f'Score: {knn_score}')

Precision: 0.9115646258503401
Recall: 0.9710144927536232
Score: 0.9403508771929824


##### SVM with a Linear Kernel

In [103]:
from sklearn.svm import SVC
classifier = SVC(kernel = 'linear', random_state = 42)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

svm_precision = precision_score(y_test, y_pred)
svm_recall = recall_score(y_test, y_pred)
svm_score = f1_score(y_test, y_pred)

print(f'Precision: {svm_precision}')
print(f'Recall: {svm_recall}')
print(f'Score: {svm_score}')

Precision: 1.0
Recall: 1.0
Score: 1.0


##### SVM with RBF Kernel

In [104]:
from sklearn.svm import SVC
classifier = SVC(kernel = 'rbf', random_state = 42)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

svm_RBF_precision = precision_score(y_test, y_pred)
svm_RBF_recall = recall_score(y_test, y_pred)
svm_RBF_score = f1_score(y_test, y_pred)

print(f'Precision: {svm_RBF_precision}')
print(f'Recall: {svm_RBF_recall}')
print(f'Score: {svm_RBF_score}')

Precision: 0.9517241379310345
Recall: 1.0
Score: 0.9752650176678445


##### Naive Bayes

In [105]:
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

naive_bayes_precision = precision_score(y_test, y_pred)
naive_bayes_recall = recall_score(y_test, y_pred)
naive_bayes_score = f1_score(y_test, y_pred)

print(f'Precision: {naive_bayes_precision}')
print(f'Recall: {naive_bayes_recall}')
print(f'Score: {naive_bayes_score}')

Precision: 1.0
Recall: 1.0
Score: 1.0


##### Decision Tree

In [106]:
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion = 'entropy', random_state = 42)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

decision_tree_precision = precision_score(y_test, y_pred)
decision_tree_recall = recall_score(y_test, y_pred)
decision_tree_score = f1_score(y_test, y_pred)

print(f'Precision: {decision_tree_precision}')
print(f'Recall: {decision_tree_recall}')
print(f'Score: {decision_tree_score}')

Precision: 1.0
Recall: 1.0
Score: 1.0


##### Random Forest

In [107]:
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 42)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

forest_precision = precision_score(y_test, y_pred)
forest_recall = recall_score(y_test, y_pred)
forest_score = f1_score(y_test, y_pred)

print(f'Precision: {forest_precision}')
print(f'Recall: {forest_recall}')
print(f'Score: {forest_score}')

Precision: 1.0
Recall: 1.0
Score: 1.0


##### Summary of the classifiers

In [108]:
scores = [['Logistic Regression', log_reg_precision, log_reg_score],
          ['K-NN', knn_precision, knn_score],
          ['SVM Linear Kernel', svm_precision, svm_score],
          ['SVM RBF Kernel', svm_RBF_precision, svm_RBF_score],
          ['Naive Bayes', naive_bayes_precision, naive_bayes_score],
          ['Decision Tree', decision_tree_precision, decision_tree_score],
          ['Random Forest', forest_precision, forest_score]]

# pick the entry with the highest F1 score (on ties, max() returns the first one in the list)
best_classifier_name, best_classifier_precision, best_score = max(scores, key = lambda row: row[2])

print('The best classifier is', best_classifier_name, 'with the score:', best_score, 'and precision:', best_classifier_precision)


The best classifier is  Logistic Regression  with the score:  1.0  and precision:  1.0


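Logistic Regression is reported as the best classifier, but only because it comes first among several classifiers that share the top score of 1.0 (SVM with a linear kernel, Naive Bayes, Decision Tree and Random Forest reach 1.0 as well). Note also that the Logistic Regression cell above (In [129]) reports an F1 score of about 0.988; the execution counters suggest it was re-run after this summary cell (In [108]), so the summary reflects an earlier run. Several perfect scores are somewhat worrying, so we will use Cross Validation to check whether the result holds across different splits of the training set.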

In [130]:
from sklearn.model_selection import cross_val_score
cross_val_scores = cross_val_score(log_reg, X_train, y_train, cv = 10)

print("Scores: ", cross_val_scores)
print("Mean: ", cross_val_scores.mean())
print("STD: ", cross_val_scores.std())

Scores:  [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
Mean:  1.0
STD:  0.0


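Well, this is worrying: Logistic Regression has just achieved 10 perfect scores in the Cross Validation, even though the same classifier reached a precision of about 97.7% on the test set. Scores this high point to target leakage rather than to overfitting: X was built from dataset_fin, which still contains the encoded Loan_Status column, so the classifiers can read the answer directly from the features.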
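When scores look too good, one quick check is whether any feature column simply reproduces the target. The helper below is a hypothetical sketch (`columns_matching_target` is not part of any library), shown on made-up data:

```python
import numpy as np

def columns_matching_target(X, y):
    """Return indices of feature columns that equal y or its 0/1 complement."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    return [j for j in range(X.shape[1])
            if np.array_equal(X[:, j], y) or np.array_equal(X[:, j], 1.0 - y)]

# Hypothetical data: the last feature column is a copy of the target
y_toy = np.array([1.0, 0.0, 1.0, 1.0, 0.0])
X_toy = np.column_stack([
    [5849.0, 4583.0, 3000.0, 2583.0, 6000.0],  # an ordinary feature
    [1.0, 1.0, 0.0, 1.0, 1.0],                 # a binary feature unrelated to y
    y_toy,                                     # leaked target
])

print(columns_matching_target(X_toy, y_toy))
```

The check compares raw values, so it has to run on the unscaled feature matrix, before StandardScaler is applied.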

##### Hyperparameter tuning
After the script was changed and the results of Logistic Regression greatly improved, tweaking the parameters no longer seems to make much sense, so the next part is just a formality.

In [131]:
from sklearn.model_selection import GridSearchCV
param_grid = [
        {'solver': ['lbfgs'], 'max_iter': [1000], 
         'penalty': ['l2'], 'C': [0.25, 0.5, 1, 2]}
    ]
log_reg_fin = LogisticRegression()
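# note: neg_mean_squared_error is a regression metric; for 0/1 labels it equals the negative
# misclassification rate, so a best score of 0.0 means no errors in any fold
grid_search = GridSearchCV(log_reg_fin, param_grid, cv = 5, scoring = 'neg_mean_squared_error')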
grid_search.fit(X_train, y_train)

print(grid_search.best_estimator_)
print('Score:', grid_search.best_score_)

LogisticRegression(C=0.25, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=1000,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
Score: 0.0
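For a classifier it is usually easier to read a grid search that is scored with a classification metric. A sketch with scoring='f1' on synthetic data; `make_classification` and the `_toy` names are only there to make the example runnable and have nothing to do with the loan dataset:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data, used only to make the sketch runnable
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)

param_grid = {"C": [0.25, 0.5, 1, 2]}
grid_search_f1 = GridSearchCV(
    LogisticRegression(solver="lbfgs", max_iter=1000),
    param_grid,
    cv=5,
    scoring="f1",  # classification metric; higher is better, 1.0 is perfect
)
grid_search_f1.fit(X_toy, y_toy)

print(grid_search_f1.best_params_)
print("Best mean F1:", grid_search_f1.best_score_)
```

With this metric the best score lies between 0 and 1 and can be compared directly with the F1 scores printed for the individual classifiers.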


##### Evaluation

In [132]:
from sklearn.metrics import confusion_matrix

# same settings as the initial log_reg (C=1 and the l2 penalty are the defaults)
log_reg_fin = LogisticRegression(C=1, solver='lbfgs', max_iter=1000)
log_reg_fin.fit(X_train, y_train)
y_pred = log_reg_fin.predict(X_test)  # predict with the refitted model, not the earlier log_reg

cm = confusion_matrix(y_test, y_pred)
print(cm)

[[ 55   3]
 [  0 127]]


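As we can see, the final classifier still made 3 mistakes on the test set (3 false positives: applicants with Loan_Status N predicted as Y), so maybe it is not over-fitting after all. Note, however, that it was fitted with C=1, the same settings as the initial Logistic Regression, rather than with the C=0.25 selected by the grid search.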
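The confusion matrix can be tied back to the precision, recall and F1 figures with a few lines of arithmetic. In the layout used by sklearn, rows are true classes and columns are predicted classes:

```python
import numpy as np

# Confusion matrix printed above: rows are true classes, columns are predicted classes
cm = np.array([[55, 3],
               [0, 127]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)   # 127 / 130
recall = tp / (tp + fn)      # 127 / 127
f1 = 2 * precision * recall / (precision + recall)

print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1: {f1:.4f}")
```

These values (about 0.9769, 1.0 and 0.9883) are the same as those printed in the Logistic Regression cell.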