# Loan prediction modelling

In [2]:
#import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, roc_auc_score, f1_score, precision_score, recall_score, confusion_matrix, ConfusionMatrixDisplay
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split, cross_val_score, StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.feature_selection import RFE

### Table of Contents

* [Objective](#objective)
* [Data Gathering](#data_gather)
* [Data Cleaning](#data_prep)
    * [Missing value imputation](#imputation)
    * [Scaling and Encoding](#scale_encode)
* [Data Modelling](#modelling)
    * [Logistic regression](#logit)
    * [Random Feature Elimination (RFE)](#RFE)
    * [Prediction on test set](#prediction)

# Objective <a class='anchor' id='objective'>

A bank's filtering process to reject or approve a loan can be a time consuming and tedious process. Therefore, the idea behind this project is to build an machine learning model that the bank can use to classify if a user can be granted a loan or not.

# Data Gathering <a class='anchor' id='data_gather'>

Both the train and test datasets for this project are extracted from the [Dphi official github](https://github.com/dphi-official/Datasets/tree/master/Loan_Data), which corresponds to a machine learning competition hosted in the [Dphi's official page](https://dphi.tech/challenges/loan-or-no-loan/54/overview/about)

In [3]:
loan_data  = pd.read_csv("https://raw.githubusercontent.com/dphi-official/Datasets/master/Loan_Data/loan_train.csv" )

In [4]:
loan_data = loan_data.drop('Unnamed: 0', axis=1)

So that there are no problems in the deployment with the order of the columns in the final dataset, let's make sure every single column list is sorted. This way, in the Flask file and HTML we can get the data in whatever order possible.

In [5]:
loan_data = loan_data[loan_data.columns.sort_values()]

**Make a copy of original dataset**

In [9]:
df = loan_data.copy(deep=True)
df

Unnamed: 0,ApplicantIncome,CoapplicantIncome,Credit_History,Dependents,Education,Gender,LoanAmount,Loan_Amount_Term,Loan_ID,Loan_Status,Married,Property_Area,Self_Employed
0,4547,0.0,1.0,0,Graduate,Female,115.0,360.0,LP002305,1,No,Semiurban,No
1,5703,0.0,1.0,3+,Not Graduate,Male,130.0,360.0,LP001715,1,Yes,Rural,Yes
2,4333,2451.0,1.0,0,Graduate,Female,110.0,360.0,LP002086,0,Yes,Urban,No
3,4695,0.0,1.0,0,Not Graduate,Male,96.0,,LP001136,1,Yes,Urban,Yes
4,6700,1750.0,1.0,2,Graduate,Male,230.0,300.0,LP002529,1,Yes,Semiurban,No
...,...,...,...,...,...,...,...,...,...,...,...,...,...
486,9833,1833.0,1.0,1,Graduate,,182.0,180.0,LP002103,1,Yes,Urban,Yes
487,3812,0.0,1.0,1,Graduate,Female,112.0,360.0,LP001790,1,No,Rural,No
488,14583,0.0,1.0,1,Graduate,Male,185.0,180.0,LP001401,1,Yes,Rural,No
489,1836,33837.0,1.0,0,Graduate,Male,90.0,360.0,LP002893,0,No,Urban,No


# Data Cleaning <a class='anchor' id='data_prep'>

The Loan_ID feature is not useful for prediction, let's drop it:

In [8]:
df.drop('Loan_ID', axis=1, inplace=True)

In [10]:
df.Loan_Status.value_counts()

1    343
0    148
Name: Loan_Status, dtype: int64

We have an imbalanced dataset. Therefore, we should be wary of using accuracy as our metric. 

In [12]:
df.isnull().any()

ApplicantIncome      False
CoapplicantIncome    False
Credit_History        True
Dependents            True
Education            False
Gender                True
LoanAmount            True
Loan_Amount_Term      True
Loan_Status          False
Married               True
Property_Area        False
Self_Employed         True
dtype: bool

In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 491 entries, 0 to 490
Data columns (total 12 columns):
ApplicantIncome      491 non-null int64
CoapplicantIncome    491 non-null float64
Credit_History       448 non-null float64
Dependents           482 non-null object
Education            491 non-null object
Gender               481 non-null object
LoanAmount           475 non-null float64
Loan_Amount_Term     478 non-null float64
Loan_Status          491 non-null int64
Married              490 non-null object
Property_Area        491 non-null object
Self_Employed        462 non-null object
dtypes: float64(4), int64(2), object(6)
memory usage: 46.2+ KB


We have null values in most of the feature columns. Missing value imputation in the training set will be done according to the most frequent features when grouped by the target label. This imputation will also be done in relative terms by creating first an initial baseline in case a particular feature is also imabalanced.

## Missing value imputation: <a class='anchor' id='imputation'>

### Gender column


In [14]:
df.Gender.value_counts()

Male      393
Female     88
Name: Gender, dtype: int64

The dataset is highly biased towards male clients:

In [15]:
403/88

4.579545454545454

This is the baseline ratio for the categorical values to be imputed for 0 and 1 loan status

In [16]:
df.groupby('Loan_Status')['Gender'].value_counts()

Loan_Status  Gender
0            Male      112
             Female     33
1            Male      281
             Female     55
Name: Gender, dtype: int64

In [17]:
print('Loan status 0 ratio: {}. Loan status 1 ratio: {}'.format(112/33, 281/55))

Loan status 0 ratio: 3.393939393939394. Loan status 1 ratio: 5.109090909090909


Replace gender values by more frequent. In the case of loan status 0 is more frequent to have FEMALES (ratio 3.39 vs 4.58), while for loan status 1 is MALES

In [18]:
mask_loan_0 = (df.Gender.isnull()) & (df.Loan_Status == 0)
mask_loan_1 = (df.Gender.isnull()) & (df.Loan_Status == 1)

df.loc[mask_loan_1, 'Gender'] = 'Male'
df.loc[mask_loan_0, 'Gender'] = 'Female'

In [19]:
df.Gender.isnull().any()

False

### Married column

In [20]:
df.Married.value_counts()

Yes    324
No     166
Name: Married, dtype: int64

In [21]:
324/166

1.9518072289156627

In [22]:
df.groupby('Married').Loan_Status.value_counts()

Married  Loan_Status
No       1              102
         0               64
Yes      1              240
         0               84
Name: Loan_Status, dtype: int64

In [23]:
df.groupby('Loan_Status').Married.value_counts(dropna=False)

Loan_Status  Married
0            Yes         84
             No          64
1            Yes        240
             No         102
             NaN          1
Name: Married, dtype: int64

In [24]:
print('Loan status 0 ratio: {}. Loan status 1 ratio: {}'.format(84/64, 240/102))

Loan status 0 ratio: 1.3125. Loan status 1 ratio: 2.3529411764705883


Loan status 0 to be imputed with 'No' and loan status 1 with 'Yes'

In [25]:
mask_loan_0_married = (df.Married.isnull()) & (df.Loan_Status == 0)
mask_loan_1_married = (df.Married.isnull()) & (df.Loan_Status == 1)

df.loc[mask_loan_0_married, 'Married'] = 'No'
df.loc[mask_loan_1_married, 'Married'] = 'Yes'

In [26]:
df.isnull().any()

ApplicantIncome      False
CoapplicantIncome    False
Credit_History        True
Dependents            True
Education            False
Gender               False
LoanAmount            True
Loan_Amount_Term      True
Loan_Status          False
Married              False
Property_Area        False
Self_Employed         True
dtype: bool

### Dependents column

In [27]:
df.Dependents.value_counts()

0     276
1      85
2      78
3+     43
Name: Dependents, dtype: int64

In [28]:
print('{}, {}, {}'.format(276/85, 276/78, 276/43))  

3.2470588235294118, 3.5384615384615383, 6.4186046511627906


In [29]:
df.groupby('Loan_Status').Dependents.value_counts(dropna=False)

Loan_Status  Dependents
0            0              81
             1              31
             2              16
             3+             16
             NaN             4
1            0             195
             2              62
             1              54
             3+             27
             NaN             5
Name: Dependents, dtype: int64

Loan status 0 with 1 and loan status 1 with 0.

In [30]:
mask_loan_0_dependents = (df.Dependents.isnull()) & (df.Loan_Status == 0)
mask_loan_1_dependents = (df.Dependents.isnull()) & (df.Loan_Status == 1)

df.loc[mask_loan_0_dependents, 'Dependents'] = 1
df.loc[mask_loan_1_dependents, 'Dependents'] = 0

In [31]:
df.isnull().any()

ApplicantIncome      False
CoapplicantIncome    False
Credit_History        True
Dependents           False
Education            False
Gender               False
LoanAmount            True
Loan_Amount_Term      True
Loan_Status          False
Married              False
Property_Area        False
Self_Employed         True
dtype: bool

### Self employed column

In [32]:
df.Self_Employed.value_counts()

No     398
Yes     64
Name: Self_Employed, dtype: int64

In [33]:
398/64

6.21875

In [34]:
df.Self_Employed.isnull().sum()

29

In [35]:
df.groupby('Loan_Status').Self_Employed.value_counts(dropna=False)

Loan_Status  Self_Employed
0            No               119
             Yes               20
             NaN                9
1            No               279
             Yes               44
             NaN               20
Name: Self_Employed, dtype: int64

In [36]:
print('{}, {}'.format(119/20, 279/44))

5.95, 6.340909090909091


Loan status 0 with YES and loan status 1 with NO

In [37]:
mask_loan_0_self_employed = (df.Self_Employed.isnull()) & (df.Loan_Status == 0)
mask_loan_1_self_employed = (df.Self_Employed.isnull()) & (df.Loan_Status == 1)

df.loc[mask_loan_0_self_employed, 'Self_Employed'] = 'Yes'
df.loc[mask_loan_1_self_employed, 'Self_Employed'] = 'No'

In [38]:
df.isnull().any()

ApplicantIncome      False
CoapplicantIncome    False
Credit_History        True
Dependents           False
Education            False
Gender               False
LoanAmount            True
Loan_Amount_Term      True
Loan_Status          False
Married              False
Property_Area        False
Self_Employed        False
dtype: bool

### Loan amount column

In [39]:
df.LoanAmount.isnull().sum()

16

In [40]:
df.LoanAmount.fillna(value=df.LoanAmount.mean(), inplace=True)

In [41]:
df.isnull().any()

ApplicantIncome      False
CoapplicantIncome    False
Credit_History        True
Dependents           False
Education            False
Gender               False
LoanAmount           False
Loan_Amount_Term      True
Loan_Status          False
Married              False
Property_Area        False
Self_Employed        False
dtype: bool

### Loan Amount Term column

In [42]:
df.Loan_Amount_Term.value_counts(dropna=False)

360.0    404
180.0     35
480.0     13
NaN       13
300.0     12
84.0       4
120.0      3
240.0      3
36.0       2
60.0       1
12.0       1
Name: Loan_Amount_Term, dtype: int64

In [43]:
df.groupby('Loan_Status').Loan_Amount_Term.value_counts(dropna=False)

Loan_Status  Loan_Amount_Term
0            360.0               116
             180.0                12
             480.0                 7
             NaN                   5
             300.0                 4
             36.0                  2
             84.0                  1
             240.0                 1
1            360.0               288
             180.0                23
             NaN                   8
             300.0                 8
             480.0                 6
             84.0                  3
             120.0                 3
             240.0                 2
             12.0                  1
             60.0                  1
Name: Loan_Amount_Term, dtype: int64

Lets impute it with random variables from the feature column.

In [44]:
df[(df.Loan_Amount_Term.isnull()) & (df.Loan_Status == 1)].Loan_Amount_Term.shape[0]

8

In [45]:
mask_loan_0_Loan_Amount_Term = (df.Loan_Amount_Term.isnull()) & (df.Loan_Status == 0)
mask_loan_1_Loan_Amount_Term = (df.Loan_Amount_Term.isnull()) & (df.Loan_Status == 1)

df.loc[mask_loan_0_Loan_Amount_Term, 'Loan_Amount_Term'] = np.random.choice(list(df.Loan_Amount_Term.value_counts()), 
                                                                           size=df[(df.Loan_Amount_Term.isnull()) & (df.Loan_Status == 0)].Loan_Amount_Term.shape[0])
df.loc[mask_loan_1_Loan_Amount_Term, 'Loan_Amount_Term'] = np.random.choice(list(df.Loan_Amount_Term.value_counts()), 
                                                                           size=df[(df.Loan_Amount_Term.isnull()) & (df.Loan_Status == 1)].Loan_Amount_Term.shape[0])

In [46]:
df.isnull().any()

ApplicantIncome      False
CoapplicantIncome    False
Credit_History        True
Dependents           False
Education            False
Gender               False
LoanAmount           False
Loan_Amount_Term     False
Loan_Status          False
Married              False
Property_Area        False
Self_Employed        False
dtype: bool

### Credit History column

In [47]:
df.Credit_History.value_counts()

1.0    380
0.0     68
Name: Credit_History, dtype: int64

In [48]:
df.groupby('Loan_Status').Credit_History.value_counts(dropna=False)

Loan_Status  Credit_History
0            1.0                74
             0.0                62
             NaN                12
1            1.0               306
             NaN                31
             0.0                 6
Name: Credit_History, dtype: int64

Loan status 0 with 0 and loan status 1 with 1

In [49]:
mask_loan_0_Credit_History = (df.Credit_History.isnull()) & (df.Loan_Status == 0)
mask_loan_1_Credit_History = (df.Credit_History.isnull()) & (df.Loan_Status == 1)

df.loc[mask_loan_0_Credit_History, 'Credit_History'] = 0
df.loc[mask_loan_1_Credit_History, 'Credit_History'] = 1

In [50]:
df.isnull().any()

ApplicantIncome      False
CoapplicantIncome    False
Credit_History       False
Dependents           False
Education            False
Gender               False
LoanAmount           False
Loan_Amount_Term     False
Loan_Status          False
Married              False
Property_Area        False
Self_Employed        False
dtype: bool

## Scaling and Encoding <a class='anchor' id='scale_encode'>

In [51]:
df.head()

Unnamed: 0,ApplicantIncome,CoapplicantIncome,Credit_History,Dependents,Education,Gender,LoanAmount,Loan_Amount_Term,Loan_Status,Married,Property_Area,Self_Employed
0,4547,0.0,1.0,0,Graduate,Female,115.0,360.0,1,No,Semiurban,No
1,5703,0.0,1.0,3+,Not Graduate,Male,130.0,360.0,1,Yes,Rural,Yes
2,4333,2451.0,1.0,0,Graduate,Female,110.0,360.0,0,Yes,Urban,No
3,4695,0.0,1.0,0,Not Graduate,Male,96.0,13.0,1,Yes,Urban,Yes
4,6700,1750.0,1.0,2,Graduate,Male,230.0,300.0,1,Yes,Semiurban,No


In [52]:
numerical_cols = sorted(['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount', 'Loan_Amount_Term'])
categorical_cols = sorted(list(set(df.columns)-set(numerical_cols)))


In [10]:
def scale_and_encode(X, df_train_original=loan_data):
    '''Function to scale and one-hot encode the treated dataframe. Returns a scaled and encoded table for implementation in the ML model'''
    cols_scale = numerical_cols
    cols_encode = sorted(list(set(categorical_cols)-set(['Loan_Status'])))
    
    scaler = StandardScaler()
    scaler.fit(df_train_original[cols_scale])
    scaled_X = pd.DataFrame(scaler.transform(X[cols_scale]), columns=cols_scale)
    
    oh_encoder = OneHotEncoder(handle_unknown = "ignore", sparse=False)
    oh_encoder.fit(df_train_original[cols_encode].dropna(axis=0))# fit to train data wihtout NaN values
    
    encoded_X = pd.DataFrame(oh_encoder.transform(X[cols_encode]), columns=oh_encoder.get_feature_names(cols_encode))
    
    #I re-sort columns after the concatenation because the order is lost
    final_X = pd.concat([encoded_X, scaled_X], axis=1)
    final_X = final_X[final_X.columns.sort_values()]
    return final_X
    

In [54]:
X = scale_and_encode(df.drop('Loan_Status', axis=1))
y = df.Loan_Status

In [55]:
X

Unnamed: 0,ApplicantIncome,CoapplicantIncome,Credit_History_0.0,Credit_History_1.0,Dependents_0,Dependents_1,Dependents_2,Dependents_3+,Education_Graduate,Education_Not Graduate,...,Gender_Male,LoanAmount,Loan_Amount_Term,Married_No,Married_Yes,Property_Area_Rural,Property_Area_Semiurban,Property_Area_Urban,Self_Employed_No,Self_Employed_Yes
0,-0.133199,-0.545111,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,...,0.0,-0.348120,0.279591,1.0,0.0,0.0,1.0,0.0,1.0,0.0
1,0.047063,-0.545111,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,...,1.0,-0.174145,0.279591,0.0,1.0,1.0,0.0,0.0,0.0,1.0
2,-0.166569,0.295325,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,...,0.0,-0.406111,0.279591,0.0,1.0,0.0,0.0,1.0,1.0,0.0
3,-0.110120,-0.545111,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,...,1.0,-0.568487,-4.907723,0.0,1.0,0.0,0.0,1.0,0.0,1.0
4,0.202531,0.054955,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,...,1.0,0.985683,-0.617351,0.0,1.0,0.0,1.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
486,0.691079,0.083416,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,...,1.0,0.428966,-2.411235,0.0,1.0,0.0,0.0,1.0,0.0,1.0
487,-0.247812,-0.545111,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,...,0.0,-0.382914,0.279591,1.0,0.0,1.0,0.0,0.0,1.0,0.0
488,1.431775,-0.545111,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,...,1.0,0.463761,-2.411235,0.0,1.0,1.0,0.0,0.0,1.0,0.0
489,-0.555941,11.057421,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,...,1.0,-0.638077,0.279591,1.0,0.0,0.0,0.0,1.0,1.0,0.0


Create train and test sets keeping the ratio of target values constant across sets:

In [56]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=17, stratify=y)

In [57]:
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=17)

In [58]:
X.columns

Index(['ApplicantIncome', 'CoapplicantIncome', 'Credit_History_0.0',
       'Credit_History_1.0', 'Dependents_0', 'Dependents_1', 'Dependents_2',
       'Dependents_3+', 'Education_Graduate', 'Education_Not Graduate',
       'Gender_Female', 'Gender_Male', 'LoanAmount', 'Loan_Amount_Term',
       'Married_No', 'Married_Yes', 'Property_Area_Rural',
       'Property_Area_Semiurban', 'Property_Area_Urban', 'Self_Employed_No',
       'Self_Employed_Yes'],
      dtype='object')

# Data Modelling <a class='anchor' id='modelling'>

## Logistic regression <a class='anchor' id='logit'>


Set parameters for optimization with GridSearch

In [59]:
params_logit = {'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
                'C':[0.0001, 0.001, 0.01, 0.1, 1, 10]}

In [60]:
logit = LogisticRegression(class_weight='balanced', random_state=17, max_iter=10000)
logit_grid = GridSearchCV(logit, cv=skf, param_grid=params_logit, scoring='f1', n_jobs=4)

In [61]:
logit_grid.fit(X_train, y_train)

GridSearchCV(cv=StratifiedKFold(n_splits=5, random_state=17, shuffle=True),
             estimator=LogisticRegression(class_weight='balanced',
                                          max_iter=10000, random_state=17),
             n_jobs=4,
             param_grid={'C': [0.0001, 0.001, 0.01, 0.1, 1, 10],
                         'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag',
                                    'saga']},
             scoring='f1')

In [62]:
logit_grid.best_estimator_

LogisticRegression(C=0.01, class_weight='balanced', max_iter=10000,
                   random_state=17, solver='liblinear')

In [63]:
logit_grid.best_score_

0.876199921130457

In [64]:
y_logit_pred = logit_grid.predict(X_val)

In [65]:
f1_score(y_val, y_logit_pred)

0.8843537414965987

## Random Feature Elimination (RFE) <a class='anchor' id='RFE'>

In [66]:
selector = RFE(logit_grid.best_estimator_, verbose=2)

In [67]:
params_RFE = {'n_features_to_select':np.arange(1, len(X.columns), 1)}

In [68]:
selector_grid = GridSearchCV(selector, param_grid=params_RFE, cv=skf, verbose=2, n_jobs=-1, scoring='f1')

In [69]:
selector_grid.fit(X_train, y_train)

Fitting 5 folds for each of 20 candidates, totalling 100 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:    6.3s


Fitting estimator with 21 features.
Fitting estimator with 20 features.
Fitting estimator with 19 features.
Fitting estimator with 18 features.
Fitting estimator with 17 features.
Fitting estimator with 16 features.
Fitting estimator with 15 features.
Fitting estimator with 14 features.
Fitting estimator with 13 features.
Fitting estimator with 12 features.
Fitting estimator with 11 features.
Fitting estimator with 10 features.
Fitting estimator with 9 features.
Fitting estimator with 8 features.
Fitting estimator with 7 features.
Fitting estimator with 6 features.


[Parallel(n_jobs=-1)]: Done  85 out of 100 | elapsed:    7.0s remaining:    1.1s
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed:    7.0s finished


GridSearchCV(cv=StratifiedKFold(n_splits=5, random_state=17, shuffle=True),
             estimator=RFE(estimator=LogisticRegression(C=0.01,
                                                        class_weight='balanced',
                                                        max_iter=10000,
                                                        random_state=17,
                                                        solver='liblinear'),
                           verbose=2),
             n_jobs=-1,
             param_grid={'n_features_to_select': array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20])},
             scoring='f1', verbose=2)

In [70]:
selector_grid.best_estimator_

RFE(estimator=LogisticRegression(C=0.01, class_weight='balanced',
                                 max_iter=10000, random_state=17,
                                 solver='liblinear'),
    n_features_to_select=5, verbose=2)

In [71]:
selector_grid.best_score_

0.8918333356312648

In [72]:
selector_grid.best_params_

{'n_features_to_select': 5}

In [73]:
selector_grid.best_estimator_.ranking_

array([16,  8,  1,  1, 17,  6, 10, 14, 12,  3,  4, 15, 11, 13,  1,  5,  1,
        1,  7,  9,  2])

In [74]:
y_pred_RFE = selector_grid.best_estimator_.predict(X_val)

In [75]:
f1_score(y_val, y_pred_RFE)

0.9078947368421052

## Prediction on test set <a class='anchor' id='prediction'>

In [76]:
def get_X_dataframe(X, df_train_original=loan_data):
    
    '''Note: this function takes X as input and
    returns the prepared data WITHOUT the target label i.e only X.'''
    
    X = X[X.columns.sort_values()]
    
    imputer = SimpleImputer(strategy='most_frequent')
    imputer.fit(df_train_original.drop('Loan_ID', axis=1)) #fit to train data
    X_imputed = pd.DataFrame(imputer.transform(X), columns=X.columns)
    
    cols_scale = numerical_cols
    cols_encode = sorted(list(set(categorical_cols)-set(['Loan_Status'])))
    
    scaler = StandardScaler()
    scaler.fit(df_train_original[cols_scale])
    scaled_X = pd.DataFrame(scaler.transform(X_imputed[cols_scale]), columns=cols_scale)
    
    oh_encoder = OneHotEncoder(handle_unknown = "ignore", sparse=False)
    oh_encoder.fit(df_train_original[cols_encode].dropna(axis=0))# fit to train data wihtout NaN values
    
    encoded_X = pd.DataFrame(oh_encoder.transform(X_imputed[cols_encode]), columns=oh_encoder.get_feature_names(cols_encode))
    
    final_X = pd.concat([encoded_X, scaled_X], axis=1)
    final_X = final_X[final_X.columns.sort_values()]
    
    return final_X

In [77]:
test_data = pd.read_csv('https://raw.githubusercontent.com/dphi-official/Datasets/master/Loan_Data/loan_test.csv')

In [78]:
test_data.isnull().sum()

Loan_ID              0
Gender               3
Married              2
Dependents           6
Education            0
Self_Employed        3
ApplicantIncome      0
CoapplicantIncome    0
LoanAmount           6
Loan_Amount_Term     1
Credit_History       7
Property_Area        0
dtype: int64

In [79]:
X_test = get_X_dataframe(test_data)

In [80]:
X_test.columns == X.columns

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True])

In [81]:
X_test.columns

Index(['ApplicantIncome', 'CoapplicantIncome', 'Credit_History_0.0',
       'Credit_History_1.0', 'Dependents_0', 'Dependents_1', 'Dependents_2',
       'Dependents_3+', 'Education_Graduate', 'Education_Not Graduate',
       'Gender_Female', 'Gender_Male', 'LoanAmount', 'Loan_Amount_Term',
       'Married_No', 'Married_Yes', 'Property_Area_Rural',
       'Property_Area_Semiurban', 'Property_Area_Urban', 'Self_Employed_No',
       'Self_Employed_Yes'],
      dtype='object')

In [82]:
y_rdf_pred_test = selector_grid.predict(X_test)

In [83]:
res = pd.DataFrame(y_rdf_pred_test) #final predictions of the model 
res.index = X_test.index #important for comparison. "test_data_enc" is the encoded test data
res.columns = ["prediction"]
res.to_csv("prediction_loan_data_logit+RFE_02.csv", index = False) 

### Saving dtypes, Encoder and Scaler

Saving dtypes:

In [84]:
dtype_df = df.drop('Loan_Status', axis=1).dtypes
dtype_df

ApplicantIncome        int64
CoapplicantIncome    float64
Credit_History       float64
Dependents            object
Education             object
Gender                object
LoanAmount           float64
Loan_Amount_Term     float64
Married               object
Property_Area         object
Self_Employed         object
dtype: object

In [86]:
import pickle

In [87]:
with open('dtypes.pkl', 'wb') as f:
    pickle.dump(dtype_df, f)

Saving StandardScaler:

In [88]:
cols_scale = sorted(numerical_cols)
cols_encode = sorted(list(set(categorical_cols)-set(['Loan_Status'])))
    
scaler = StandardScaler()
scaler.fit(loan_data[cols_scale].dropna(axis=0))

with open('StandardScaler.pkl', 'wb') as f:
    pickle.dump(scaler, f)

Saving OneHotEncoder

In [89]:
oh_encoder = OneHotEncoder(handle_unknown = "ignore", sparse=False)
oh_encoder.fit(loan_data[cols_encode].dropna(axis=0)) # fit to train data wihtout NaN value

with open('OneHotEncoder.pkl', 'wb') as f:
    pickle.dump(oh_encoder, f)