Predict Loan Eligibility for Dream Housing Finance company

Dream Housing Finance company deals in all kinds of home loans and has a presence across urban, semi-urban, and rural areas. A customer first applies for a home loan, after which the company validates the customer's eligibility for it.

The company wants to automate the loan eligibility process in real time, based on the details customers provide when filling in the online application form. These details include Gender, Marital Status, Education, Number of Dependents, Income, Loan Amount, Credit History, and others. To automate this process, the company has provided a dataset for identifying the customer segments that are eligible for a loan, so that it can specifically target these customers.



Data Dictionary

Train file: CSV containing the customers whose loan eligibility is known, recorded in the 'Loan_Status' column

Variable	Description
Loan_ID	Unique Loan ID
Gender	Male/ Female
Married	Applicant married (Y/N)
Dependents	Number of dependents
Education	Applicant Education (Graduate/ Not Graduate)
Self_Employed	Self employed (Y/N)
ApplicantIncome	Applicant income
CoapplicantIncome	Coapplicant income
LoanAmount	Loan amount in thousands
Loan_Amount_Term	Term of loan in months
Credit_History	credit history meets guidelines
Property_Area	Urban/ Semi Urban/ Rural
Loan_Status	(Target) Loan approved (Y/N)

Test file: CSV containing information on the customers whose loan eligibility is to be predicted

Variable	Description
Loan_ID	Unique Loan ID
Gender	Male/ Female
Married	Applicant married (Y/N)
Dependents	Number of dependents
Education	Applicant Education (Graduate/ Not Graduate)
Self_Employed	Self employed (Y/N)
ApplicantIncome	Applicant income
CoapplicantIncome	Coapplicant income
LoanAmount	Loan amount in thousands
Loan_Amount_Term	Term of loan in months
Credit_History	credit history meets guidelines
Property_Area	Urban/ Semi Urban/ Rural


Submission file format

Variable	Description
Loan_ID	Unique Loan ID
Loan_Status	(Target) Loan approved (Y/N)
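
The submission layout above can be checked mechanically before writing the file. The following is a minimal sketch, not part of the original notebook: `check_submission` is a hypothetical helper, the two Loan_ID values are borrowed from the training sample purely for illustration, and the requirement that IDs appear in the same order as the test file is an assumption.

```python
import pandas as pd

def check_submission(sub: pd.DataFrame, test_ids: pd.Series) -> None:
    """Raise ValueError if `sub` does not match the expected submission layout."""
    # Exactly two columns, in the documented order
    if list(sub.columns) != ["Loan_ID", "Loan_Status"]:
        raise ValueError(f"unexpected columns: {list(sub.columns)}")
    # The target must use the original Y/N labels, not encoded integers
    if not sub["Loan_Status"].isin(["Y", "N"]).all():
        raise ValueError("Loan_Status must contain only 'Y' or 'N'")
    # Assumption: one row per test Loan_ID, in the same order as the test file
    if sub["Loan_ID"].tolist() != test_ids.tolist():
        raise ValueError("Loan_ID values must match the test file, in order")

# Toy frame for illustration only
example = pd.DataFrame({
    "Loan_ID": ["LP001002", "LP001003"],
    "Loan_Status": ["Y", "N"],
})
check_submission(example, example["Loan_ID"])  # passes silently
```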



In [1]:
import gc

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from lightgbm import LGBMClassifier
from sklearn.model_selection import (
    train_test_split, StratifiedShuffleSplit, cross_val_score, KFold,
    GridSearchCV, StratifiedKFold,
)
from sklearn.metrics import (
    balanced_accuracy_score, auc, mean_squared_error, roc_curve,
    confusion_matrix, precision_score, recall_score, f1_score,
    log_loss, roc_auc_score,
)
from sklearn.preprocessing import LabelEncoder

# Hyperparameter tuning
from bayes_opt import BayesianOptimization
from skopt import BayesSearchCV



In [2]:
train = pd.read_csv("train.csv")


In [3]:
test = pd.read_csv('test.csv')

In [4]:
train.head()

Unnamed: 0.1,Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y


In [5]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 14 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Unnamed: 0         614 non-null    int64  
 1   Loan_ID            614 non-null    object 
 2   Gender             601 non-null    object 
 3   Married            611 non-null    object 
 4   Dependents         599 non-null    object 
 5   Education          614 non-null    object 
 6   Self_Employed      582 non-null    object 
 7   ApplicantIncome    614 non-null    int64  
 8   CoapplicantIncome  614 non-null    float64
 9   LoanAmount         592 non-null    float64
 10  Loan_Amount_Term   600 non-null    float64
 11  Credit_History     564 non-null    float64
 12  Property_Area      614 non-null    object 
 13  Loan_Status        614 non-null    object 
dtypes: float64(4), int64(2), object(8)
memory usage: 67.3+ KB


In [6]:
# Print the value counts for every column
for c in train.columns:
    print("---- %s ---" % c)
    print(train[c].value_counts())

---- Unnamed: 0 ---
613    1
201    1
208    1
207    1
206    1
      ..
408    1
407    1
406    1
405    1
0      1
Name: Unnamed: 0, Length: 614, dtype: int64
---- Loan_ID ---
LP002065    1
LP002959    1
LP001091    1
LP001864    1
LP002626    1
           ..
LP001993    1
LP002772    1
LP002408    1
LP002180    1
LP001349    1
Name: Loan_ID, Length: 614, dtype: int64
---- Gender ---
Male      489
Female    112
Name: Gender, dtype: int64
---- Married ---
Yes    398
No     213
Name: Married, dtype: int64
---- Dependents ---
0     345
1     102
2     101
3+     51
Name: Dependents, dtype: int64
---- Education ---
Graduate        480
Not Graduate    134
Name: Education, dtype: int64
---- Self_Employed ---
No     500
Yes     82
Name: Self_Employed, dtype: int64
---- ApplicantIncome ---
2500    9
4583    6
2600    6
6000    6
5000    5
       ..
5818    1
5819    1
5821    1
2750    1
3691    1
Name: ApplicantIncome, Length: 505, dtype: int64
---- CoapplicantIncome ---
0.0       273
250

In [7]:
train.isnull().sum()

Unnamed: 0            0
Loan_ID               0
Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64
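
The raw counts above are easier to compare as a share of rows. A minimal sketch follows, using a small made-up frame rather than the loan data; the column names are reused only for familiarity and the values are illustrative.

```python
import numpy as np
import pandas as pd

# Toy frame for illustration only (NOT the loan dataset)
toy = pd.DataFrame({
    "Gender": ["Male", np.nan, "Female", "Male"],
    "LoanAmount": [np.nan, 128.0, 66.0, 120.0],
    "Credit_History": [1.0, 1.0, np.nan, np.nan],
    "Education": ["Graduate"] * 4,
})

# Count and percentage of missing values per column
missing = pd.DataFrame({
    "n_missing": toy.isnull().sum(),
    "pct_missing": (toy.isnull().mean() * 100).round(1),
})

# Keep only the columns that actually have gaps, largest first
missing = missing[missing["n_missing"] > 0].sort_values("n_missing", ascending=False)
print(missing)
```

Applied to `train`, the same two expressions would show which columns lose the largest share of rows, which is the relevant number when deciding between a sentinel fill and another strategy.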

In [8]:
# Label encode the categorical columns. astype(str) turns missing values into
# the string "nan", so they are encoded as a category of their own.
labenc = LabelEncoder()
for col in ['Gender', 'Married', 'Education', 'Self_Employed', 'Property_Area', 'Loan_Status']:
    train[col] = labenc.fit_transform(train[col].astype(str))

In [9]:
train["Dependents"] = labenc.fit_transform(train["Dependents"].astype(str))

In [10]:
# Label encode the same columns in the test data (the encoder is refit on each test column)
for col in ['Gender', 'Married', 'Education', 'Self_Employed', 'Property_Area']:
    test[col] = labenc.fit_transform(test[col].astype(str))


In [11]:
test["Dependents"] = labenc.fit_transform(test["Dependents"].astype(str))
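
Refitting the encoder separately on train and test gives matching codes only when both files contain the same set of values in a column. One way to guard against a mismatch is to learn the mapping from the training column and reuse it for the test column. The sketch below is an illustration under that assumption, not the notebook's method: `encode_like_train` is a hypothetical helper, it uses plain pandas instead of `LabelEncoder`, and mapping unseen test values to -1 is an arbitrary choice.

```python
import numpy as np
import pandas as pd

def encode_like_train(train_col: pd.Series, test_col: pd.Series):
    """Encode both columns with one mapping learned from the training column only."""
    # Treat missing values as their own category, named "nan"
    train_str = train_col.fillna("nan").astype(str)
    test_str = test_col.fillna("nan").astype(str)
    # Sorted unique values -> consecutive integer codes
    mapping = {cat: code for code, cat in enumerate(sorted(train_str.unique()))}
    train_codes = train_str.map(mapping).astype(int)
    # Values never seen in training get -1 (an arbitrary, hypothetical choice)
    test_codes = test_str.map(mapping).fillna(-1).astype(int)
    return train_codes, test_codes, mapping

# Toy columns for illustration only
toy_train = pd.Series(["Yes", "No", np.nan, "Yes"])
toy_test = pd.Series(["No", "Maybe"])
train_codes, test_codes, mapping = encode_like_train(toy_train, toy_test)
print(mapping)
print(train_codes.tolist(), test_codes.tolist())
```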

In [12]:
#train["Dependents"] = pd.to_numeric(train["Dependents"])

In [13]:
train.describe()

Unnamed: 0.1,Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
count,614.0,614.0,614.0,614.0,614.0,614.0,614.0,614.0,592.0,600.0,564.0,614.0,614.0
mean,306.5,0.838762,0.65798,0.84202,0.218241,0.237785,5403.459283,1621.245798,146.412162,342.0,0.842199,1.037459,0.687296
std,177.390811,0.421752,0.484971,1.120531,0.413389,0.534737,6109.041673,2926.248369,85.587325,65.12041,0.364878,0.787482,0.463973
min,0.0,0.0,0.0,0.0,0.0,0.0,150.0,0.0,9.0,12.0,0.0,0.0,0.0
25%,153.25,1.0,0.0,0.0,0.0,0.0,2877.5,0.0,100.0,360.0,1.0,0.0,0.0
50%,306.5,1.0,1.0,0.0,0.0,0.0,3812.5,1188.5,128.0,360.0,1.0,1.0,1.0
75%,459.75,1.0,1.0,2.0,0.0,0.0,5795.0,2297.25,168.0,360.0,1.0,2.0,1.0
max,613.0,2.0,2.0,4.0,1.0,2.0,81000.0,41667.0,700.0,480.0,1.0,2.0,1.0


In [14]:
train.columns

Index(['Unnamed: 0', 'Loan_ID', 'Gender', 'Married', 'Dependents', 'Education',
       'Self_Employed', 'ApplicantIncome', 'CoapplicantIncome', 'LoanAmount',
       'Loan_Amount_Term', 'Credit_History', 'Property_Area', 'Loan_Status'],
      dtype='object')

In [15]:
test.columns

Index(['Unnamed: 0', 'Loan_ID', 'Gender', 'Married', 'Dependents', 'Education',
       'Self_Employed', 'ApplicantIncome', 'CoapplicantIncome', 'LoanAmount',
       'Loan_Amount_Term', 'Credit_History', 'Property_Area'],
      dtype='object')

In [16]:
train_drop = train.drop(['Unnamed: 0', 'Loan_ID'], axis=1)
test_drop = test.drop(['Unnamed: 0', 'Loan_ID'], axis=1)

In [17]:
# Fill missing values in both train and test data with the same sentinel
train_fill = train_drop.fillna(-999)
test_drop = test_drop.fillna(-999)

In [18]:
train_fill.columns

Index(['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed',
       'ApplicantIncome', 'CoapplicantIncome', 'LoanAmount',
       'Loan_Amount_Term', 'Credit_History', 'Property_Area', 'Loan_Status'],
      dtype='object')

In [19]:
target = train_fill["Loan_Status"]
train_d = train_fill.drop(["Loan_Status"], axis=1)

In [20]:
train_d.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Gender             614 non-null    int64  
 1   Married            614 non-null    int64  
 2   Dependents         614 non-null    int64  
 3   Education          614 non-null    int64  
 4   Self_Employed      614 non-null    int64  
 5   ApplicantIncome    614 non-null    int64  
 6   CoapplicantIncome  614 non-null    float64
 7   LoanAmount         614 non-null    float64
 8   Loan_Amount_Term   614 non-null    float64
 9   Credit_History     614 non-null    float64
 10  Property_Area      614 non-null    int64  
dtypes: float64(4), int64(7)
memory usage: 52.9 KB


In [21]:
print(train_d.shape, test_drop.shape)

(614, 11) (367, 11)


In [22]:
X = train_d.values
y = target.values

In [23]:
from sklearn.model_selection import RepeatedStratifiedKFold
from numpy import mean
from numpy import std

model = LGBMClassifier(
    n_estimators=1100,
    learning_rate=0.01,
    reg_lambda=30,
    feature_fraction=0.4,
    num_leaves=50,
    max_depth=50,
    split='gain',
    boosting='gbdt',
)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=132)
n_scores = cross_val_score(model, X, y.ravel(), scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

# fit the model on the whole dataset
model.fit(X, y)

Accuracy: 0.811 (0.039)


LGBMClassifier(boosting='gbdt', feature_fraction=0.4, learning_rate=0.01,
               max_depth=50, n_estimators=1100, num_leaves=50, reg_lambda=30,
               split='gain')
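
The mean of the encoded `Loan_Status` in the `describe()` output is 0.687, so a model that approves every training applicant would already be right about 69% of the time on this data. The cross-validated accuracy is easier to judge next to such a baseline. The sketch below shows the comparison with the same `RepeatedStratifiedKFold` setup, but on synthetic data and with `LogisticRegression` as a stand-in classifier; both are assumptions made so the example runs without the loan files or LightGBM.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic stand-in data (NOT the loan dataset): 11 features, roughly 70% positives
X_toy, y_toy = make_classification(
    n_samples=300, n_features=11, weights=[0.3, 0.7], random_state=0
)

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=132)

def cv_accuracy(estimator):
    """Return the per-fold accuracies across all repeats (10 x 3 = 30 values)."""
    return cross_val_score(estimator, X_toy, y_toy, scoring="accuracy", cv=cv)

# Baseline: always predict the majority class
baseline_scores = cv_accuracy(DummyClassifier(strategy="most_frequent"))
# Stand-in model; the notebook's LGBMClassifier would be passed here instead
model_scores = cv_accuracy(LogisticRegression(max_iter=1000))

print("Baseline: %.3f (%.3f)" % (baseline_scores.mean(), baseline_scores.std()))
print("Model:    %.3f (%.3f)" % (model_scores.mean(), model_scores.std()))
```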

In [52]:
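# Probability of class 1 ('Y'); approve only when it is at least 0.8.
# The model was fit on a NumPy array, so pass the test features as one too.
pred = model.predict_proba(test_drop.values)[:, 1]
test["Loan_Status"] = (pred >= 0.8).astype('int')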
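
The 0.8 cutoff is stricter than the 0.5 cutoff that `predict` effectively applies for a binary classifier, and the cross-validated accuracy reported earlier was measured with `predict`. Whether a stricter cutoff helps can be checked on labelled held-out predictions. The following is a minimal sketch with made-up labels and probabilities; `accuracy_at_thresholds` is a hypothetical helper, not something the notebook defines.

```python
import numpy as np

def accuracy_at_thresholds(y_true, proba, thresholds):
    """Return {threshold: accuracy} for hard labels obtained as proba >= threshold."""
    y_true = np.asarray(y_true)
    proba = np.asarray(proba, dtype=float)
    return {
        float(t): float(((proba >= t).astype(int) == y_true).mean())
        for t in thresholds
    }

# Toy held-out labels and predicted probabilities, for illustration only
toy_y = np.array([1, 1, 1, 1, 0, 0])
toy_proba = np.array([0.95, 0.85, 0.75, 0.65, 0.55, 0.30])

result = accuracy_at_thresholds(toy_y, toy_proba, [0.5, 0.8])
print(result)
```

In this toy case the 0.5 cutoff gets 5 of 6 rows right and the 0.8 cutoff gets 4 of 6, because the stricter cutoff rejects two applicants who were in fact approved. On the real data the comparison would need out-of-fold probabilities rather than predictions on the training rows.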



In [33]:
#y_pred = y_pred.astype(str)

In [36]:
sub = pd.read_csv("submissions.csv")

In [49]:
#pred = labenc.inverse_transform(y_pred)


In [53]:
submissions = pd.DataFrame()
submissions['Loan_ID'] = test['Loan_ID']
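# Map the 0/1 predictions back to 'N'/'Y'. labenc was refit on other columns after
# Loan_Status, so its inverse_transform would not return the original labels.
submissions["Loan_Status"] = np.where(test["Loan_Status"] == 1, "Y", "N")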
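
A `LabelEncoder` only remembers the classes from its most recent `fit`, so decoding predictions with an encoder that was later refit on another column returns that column's labels. The sketch below illustrates this with toy values and shows a dedicated target encoder as one possible alternative; the names `shared` and `target_enc` are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# One encoder reused for several columns: each fit overwrites the previous classes
shared = LabelEncoder()
shared.fit(["Y", "N", "Y"])            # classes_ is now ['N', 'Y']
shared.fit(["0", "1", "3+", "nan"])    # classes_ is now ['0', '1', '3+', 'nan']
wrong = shared.inverse_transform(np.array([0, 1]))
print(wrong)    # ['0' '1'], not the loan labels

# A separate encoder kept only for the target preserves the N/Y mapping
target_enc = LabelEncoder().fit(["Y", "N", "Y"])
decoded = target_enc.inverse_transform(np.array([1, 0, 1]))
print(decoded)  # ['Y' 'N' 'Y']
```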

In [54]:

submissions.to_csv('submission_1.csv', index=False)