## Problem
Predict loan eligibility for the Dream Housing Finance company.
Dream Housing Finance deals in all kinds of home loans and has a presence across urban, semi-urban and rural areas. A customer first applies for a home loan, and the company then validates the customer's eligibility for the loan.

The company wants to automate the loan eligibility process in real time, based on the details customers provide when filling in the online application form. These details include Gender, Marital Status, Education, Number of Dependents, Income, Loan Amount, Credit History and others. To automate the process, the company has provided a dataset for identifying the customer segments that are eligible for a loan, so that it can specifically target these customers.



In [109]:
# importing the necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [110]:
# loading the train dataset
train_data = pd.read_csv("/content/train_ctrUa4K.csv")
train_data.head()

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y


In [111]:
# getting info of train data
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Loan_ID            614 non-null    object 
 1   Gender             601 non-null    object 
 2   Married            611 non-null    object 
 3   Dependents         599 non-null    object 
 4   Education          614 non-null    object 
 5   Self_Employed      582 non-null    object 
 6   ApplicantIncome    614 non-null    int64  
 7   CoapplicantIncome  614 non-null    float64
 8   LoanAmount         592 non-null    float64
 9   Loan_Amount_Term   600 non-null    float64
 10  Credit_History     564 non-null    float64
 11  Property_Area      614 non-null    object 
 12  Loan_Status        614 non-null    object 
dtypes: float64(4), int64(1), object(8)
memory usage: 62.5+ KB


In [112]:
# checking for null values
train_data.isna().sum()

Loan_ID               0
Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64

In [113]:
# creating a function to fill na values of columns
def fill_missing(column_list, strategy, df):
  '''
  Takes column_list, strategy ("mode" or "median") and dataframe as input,
  fills na values with the specified strategy and returns the dataframe
  '''
  if strategy == "mode":
    for column in column_list:
      # assigning the result back instead of calling fillna(inplace=True) on a
      # column selection, which newer pandas versions warn about
      df[column] = df[column].fillna(df[column].mode()[0])

  if strategy == "median":
    for column in column_list:
      df[column] = df[column].fillna(df[column].median())

  #returning the modified df
  return df

The dataset contains null values, so we fill them in before modelling.

In [114]:
#filling the na values of categorical columns with mode
train_data = fill_missing(column_list=['Gender','Married','Dependents','Self_Employed',
                                       'Credit_History','Loan_Amount_Term'],
                          strategy="mode",
                          df=train_data)

#filling the numerical column with median
train_data = fill_missing(column_list=['LoanAmount'], strategy="median", df=train_data)
train_data.isna().sum()

Loan_ID              0
Gender               0
Married              0
Dependents           0
Education            0
Self_Employed        0
ApplicantIncome      0
CoapplicantIncome    0
LoanAmount           0
Loan_Amount_Term     0
Credit_History       0
Property_Area        0
Loan_Status          0
dtype: int64
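
The cells above fill each dataset with its own mode or median. A common alternative is to learn the fill values from the training data once and reuse them for the test data, so both sets are treated identically. The following is a minimal sketch of that idea on made-up toy frames; `toy_train`, `toy_test` and `fill_values` are hypothetical names, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy frames standing in for the train and test sets (values are made up).
toy_train = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", np.nan],
    "LoanAmount": [100.0, np.nan, 140.0, 120.0],
})
toy_test = pd.DataFrame({
    "Gender": [np.nan, "Female"],
    "LoanAmount": [np.nan, 90.0],
})

# Learn one fill value per column from the training data only.
fill_values = {
    "Gender": toy_train["Gender"].mode()[0],         # most frequent category
    "LoanAmount": toy_train["LoanAmount"].median(),  # median of observed values
}

# fillna with a dict fills each column with its own value and returns a new frame.
toy_train_filled = toy_train.fillna(value=fill_values)
toy_test_filled = toy_test.fillna(value=fill_values)

print(fill_values)
print(toy_test_filled)
```

Whether train-derived fill values actually improve the score on this dataset is not something the notebook measures; the sketch only shows the mechanics.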

In [115]:
#dropping the Loan_ID column
train_data.drop('Loan_ID', axis=1, inplace=True)
train_data.head()

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,Male,No,0,Graduate,No,5849,0.0,128.0,360.0,1.0,Urban,Y
1,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y


Since we will be using tree-based models, which are largely insensitive to outliers in the features, we will not handle outliers.
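
The reasoning behind skipping outlier handling can be illustrated with a small experiment: a decision tree splits on thresholds between sorted feature values, so stretching the largest value does not change how the training rows are partitioned. The sketch below uses synthetic data (the `income` values and `labels` are made up) and covers only a single feature and a single tree, so it is an illustration rather than a guarantee for the real dataset.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic single feature with 200 unique values, in shuffled order.
income = rng.permutation(np.arange(1000, 1200)).astype(float).reshape(-1, 1)

# Labels follow a threshold rule, with 20 labels flipped to add noise.
labels = (income.ravel() > 1100).astype(int)
flip = rng.choice(200, size=20, replace=False)
labels[flip] = 1 - labels[flip]

# Copy of the feature where the largest value becomes an extreme outlier.
income_outlier = income.copy()
income_outlier[np.argmax(income), 0] *= 1000

# Same tree settings on both versions of the feature.
tree_plain = DecisionTreeClassifier(max_depth=3, random_state=0).fit(income, labels)
tree_outlier = DecisionTreeClassifier(max_depth=3, random_state=0).fit(income_outlier, labels)

# The ordering of the rows is unchanged, so the trees make the same training predictions.
same_predictions = np.array_equal(
    tree_plain.predict(income), tree_outlier.predict(income_outlier)
)
print(same_predictions)
```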

In [116]:
# Label Encoding the categorical columns
from sklearn.preprocessing import LabelEncoder
def label_encode(column_list, df):
  '''
  Takes the column list and the dataframe and perform label encoding.
  Returns the encoded df
  '''
  #iterating over the column list
  for column in column_list:
    #instantiating a new instance of LabelEncoder
    label_encoder = LabelEncoder()
    #applying the fit_transform() function on the column of df and assigning it
    # back to the original column
    df[column] = label_encoder.fit_transform(df[column])
  #return the modified df
  return df

In [117]:
# Label encoding the categorical columns
train_data = label_encode(column_list=[
    'Gender','Married','Education','Self_Employed','Property_Area','Loan_Status','Dependents'
], df=train_data
)
train_data.head()

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,1,0,0,0,0,5849,0.0,128.0,360.0,1.0,2,1
1,1,1,1,0,0,4583,1508.0,128.0,360.0,1.0,0,0
2,1,1,0,0,1,3000,0.0,66.0,360.0,1.0,2,1
3,1,1,0,1,0,2583,2358.0,120.0,360.0,1.0,2,1
4,1,0,0,0,0,6000,0.0,141.0,360.0,1.0,2,1
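
To see where codes such as `2` for Urban and `0` for Rural come from, note that `LabelEncoder` sorts the distinct values and numbers them in that order. The sketch below shows the mapping on a short list, and how a fitted encoder can be reused or inverted. The spelling `Semiurban` for the middle category is an assumption, since that value does not appear in the rows printed above.

```python
from sklearn.preprocessing import LabelEncoder

# Example Property_Area values; "Semiurban" is an assumed spelling.
areas = ["Urban", "Rural", "Urban", "Semiurban"]

encoder = LabelEncoder()
codes = encoder.fit_transform(areas)

# classes_ holds the sorted categories; the position in it is the integer code.
print(list(encoder.classes_))
print(list(codes))

# A fitted encoder can encode new data with the same mapping ...
new_codes = encoder.transform(["Rural", "Urban"])
# ... and map integer codes back to the original strings.
decoded = encoder.inverse_transform(codes)
print(list(new_codes), list(decoded))
```

Keeping the fitted encoder around, instead of fitting a fresh one per dataset as `label_encode` does, guarantees that train and test share the same mapping even if a category is missing from one of them.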


In [118]:
# scaling the values
from sklearn.preprocessing import MinMaxScaler
def scaler(df):
  '''
  Takes the dataframe as input
  returns the min max scaled df as output
  '''
  #instantiating scaler
  min_max_scaler = MinMaxScaler()
  #fitting the scaler on the data and transforming it in one step
  scaled_df = pd.DataFrame(min_max_scaler.fit_transform(df), columns=df.columns)

  #returning the scaled_df
  return scaled_df

In [119]:
# splitting the data into features and target
X_train = train_data.drop('Loan_Status', axis=1)
y_train = train_data['Loan_Status']



In [120]:
# scaling the dataset with minmax scaler
X_train = scaler(X_train)

In [121]:
X_train.head()

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area
0,1.0,0.0,0.0,0.0,0.0,0.070489,0.0,0.172214,0.74359,1.0,1.0
1,1.0,1.0,0.333333,0.0,0.0,0.05483,0.036192,0.172214,0.74359,1.0,0.0
2,1.0,1.0,0.0,0.0,1.0,0.03525,0.0,0.082489,0.74359,1.0,1.0
3,1.0,1.0,0.0,1.0,0.0,0.030093,0.056592,0.160637,0.74359,1.0,1.0
4,1.0,0.0,0.0,0.0,0.0,0.072356,0.0,0.191027,0.74359,1.0,1.0
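
Min-max scaling maps each column to the range 0 to 1 with `(x - min) / (max - min)`. The value `0.333333` for a `Dependents` code of 1 in the output above is consistent with codes running from 0 to 3. The sketch below checks the formula against `MinMaxScaler` on such a toy column; the array `codes` is illustrative and not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy column with the codes 0..3, shaped (n_samples, n_features).
codes = np.array([[0.0], [1.0], [2.0], [3.0]])

# Manual min-max scaling, computed per column.
col_min = codes.min(axis=0)
col_max = codes.max(axis=0)
manual = (codes - col_min) / (col_max - col_min)

# The same transformation done by scikit-learn.
scaled = MinMaxScaler().fit_transform(codes)

print(np.allclose(manual, scaled))
print(scaled.ravel())
```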


In [122]:
# loading the test_data
test_data = pd.read_csv("/content/test_lAUu6dG.csv")
test_data.head()

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area
0,LP001015,Male,Yes,0,Graduate,No,5720,0,110.0,360.0,1.0,Urban
1,LP001022,Male,Yes,1,Graduate,No,3076,1500,126.0,360.0,1.0,Urban
2,LP001031,Male,Yes,2,Graduate,No,5000,1800,208.0,360.0,1.0,Urban
3,LP001035,Male,Yes,2,Graduate,No,2340,2546,100.0,360.0,,Urban
4,LP001051,Male,No,0,Not Graduate,No,3276,0,78.0,360.0,1.0,Urban


In [123]:
# checking for null values
test_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 367 entries, 0 to 366
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Loan_ID            367 non-null    object 
 1   Gender             356 non-null    object 
 2   Married            367 non-null    object 
 3   Dependents         357 non-null    object 
 4   Education          367 non-null    object 
 5   Self_Employed      344 non-null    object 
 6   ApplicantIncome    367 non-null    int64  
 7   CoapplicantIncome  367 non-null    int64  
 8   LoanAmount         362 non-null    float64
 9   Loan_Amount_Term   361 non-null    float64
 10  Credit_History     338 non-null    float64
 11  Property_Area      367 non-null    object 
dtypes: float64(3), int64(2), object(7)
memory usage: 34.5+ KB


In [124]:
test_data.isna().sum()

Loan_ID               0
Gender               11
Married               0
Dependents           10
Education             0
Self_Employed        23
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount            5
Loan_Amount_Term      6
Credit_History       29
Property_Area         0
dtype: int64

In [125]:
#filling the na values of categorical columns with mode
test_data = fill_missing(column_list=['Gender','Dependents','Self_Employed',
                                       'Credit_History','Loan_Amount_Term'],
                          strategy="mode",
                          df=test_data)
test_data.isna().sum()

Loan_ID              0
Gender               0
Married              0
Dependents           0
Education            0
Self_Employed        0
ApplicantIncome      0
CoapplicantIncome    0
LoanAmount           5
Loan_Amount_Term     0
Credit_History       0
Property_Area        0
dtype: int64

In [126]:
#filling the numerical column with median
test_data = fill_missing(column_list=['LoanAmount'], strategy="median", df=test_data)
test_data.isna().sum()

Loan_ID              0
Gender               0
Married              0
Dependents           0
Education            0
Self_Employed        0
ApplicantIncome      0
CoapplicantIncome    0
LoanAmount           0
Loan_Amount_Term     0
Credit_History       0
Property_Area        0
dtype: int64

In [127]:
# Label encoding the categorical columns
test_data = label_encode(column_list=[
    'Gender','Married','Education','Self_Employed','Property_Area','Dependents'
], df=test_data
)
test_data.head()

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area
0,LP001015,1,1,0,0,0,5720,0,110.0,360.0,1.0,2
1,LP001022,1,1,1,0,0,3076,1500,126.0,360.0,1.0,2
2,LP001031,1,1,2,0,0,5000,1800,208.0,360.0,1.0,2
3,LP001035,1,1,2,0,0,2340,2546,100.0,360.0,1.0,2
4,LP001051,1,0,0,1,0,3276,0,78.0,360.0,1.0,2


In [128]:
# saving the Loan_ID column of test data for the submission file
loan_id = test_data['Loan_ID']

#dropping the Loan_ID column from test_data
test_data.drop('Loan_ID', axis=1, inplace=True)

In [129]:
# scaling the dataset with minmax scaler
test_data = scaler(test_data)
test_data.head()

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area
0,1.0,1.0,0.0,0.0,0.0,0.078865,0.0,0.157088,0.746835,1.0,1.0
1,1.0,1.0,0.333333,0.0,0.0,0.042411,0.0625,0.187739,0.746835,1.0,1.0
2,1.0,1.0,0.666667,0.0,0.0,0.068938,0.075,0.344828,0.746835,1.0,1.0
3,1.0,1.0,0.666667,0.0,0.0,0.032263,0.106083,0.137931,0.746835,1.0,1.0
4,1.0,0.0,0.0,1.0,0.0,0.045168,0.0,0.095785,0.746835,1.0,1.0
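
Note that a `Loan_Amount_Term` of 360 appears as `0.74359` in the scaled training data but as `0.746835` here, because `scaler` fits a new `MinMaxScaler` on each dataset. One way to keep the two consistent is to fit the scaler on the training features only and reuse it for the test features. The sketch below shows this on toy values chosen for illustration; `shared_scaler` and the toy frames are hypothetical names.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy loan terms (made up); the test set has a smaller minimum than the train set.
toy_train = pd.DataFrame({"Loan_Amount_Term": [12.0, 180.0, 360.0, 480.0]})
toy_test = pd.DataFrame({"Loan_Amount_Term": [6.0, 360.0, 480.0]})

# Fit once on the training data, then only transform afterwards.
shared_scaler = MinMaxScaler().fit(toy_train)
train_scaled = pd.DataFrame(shared_scaler.transform(toy_train), columns=toy_train.columns)
test_scaled = pd.DataFrame(shared_scaler.transform(toy_test), columns=toy_test.columns)

# For comparison: a scaler fit separately on the test data.
test_refit = pd.DataFrame(MinMaxScaler().fit_transform(toy_test), columns=toy_test.columns)

# With the shared scaler, 360 maps to the same value in both sets.
print(train_scaled.iloc[2, 0], test_scaled.iloc[1, 0], test_refit.iloc[1, 0])
```

With a shared scaler, test values outside the training range fall outside 0 to 1 (the term of 6 above becomes slightly negative). Tree-based models apply thresholds learned on the training scale, so a consistent mapping matters more than staying inside the unit interval.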


In [145]:
# creating a submission file from the predictions
def make_submission_file(y_pred, submission_id, loan_ids=loan_id):
  '''
  Make a csv file named Submission_<submission_id>.csv from the predictions
  and return the submission dataframe
  '''
  submission_df = pd.DataFrame({'Loan_ID':loan_ids,'Loan_Status':y_pred})
  # mapping 1 to Y and 0 to N on y_pred
  submission_df.Loan_Status=submission_df.Loan_Status.map({0:'N',1:'Y'})
  submission_df.to_csv(f"Submission_{submission_id}.csv", index=False)
  return submission_df

In [147]:
# trying out the RandomForest classifier
from sklearn.ensemble import RandomForestClassifier
rf_clf = RandomForestClassifier()
rf_clf.fit(X_train, y_train)
y_preds = rf_clf.predict(test_data)

sub_df = make_submission_file(y_preds,"001")
sub_df.head()

Unnamed: 0,Loan_ID,Loan_Status
0,LP001015,Y
1,LP001022,Y
2,LP001031,Y
3,LP001035,Y
4,LP001051,Y


The random forest submission got 77% accuracy.

In [148]:
# trying out decision tree classifier
from sklearn.tree import DecisionTreeClassifier
dt_clf = DecisionTreeClassifier()
dt_clf.fit(X_train, y_train)
y_preds = dt_clf.predict(test_data)
sub_df = make_submission_file(y_preds,"002")
sub_df.head()

Unnamed: 0,Loan_ID,Loan_Status
0,LP001015,Y
1,LP001022,N
2,LP001031,Y
3,LP001035,Y
4,LP001051,N


The decision tree submission got 68% accuracy.
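
Because the test file has no `Loan_Status` column, the accuracies above can only be obtained by submitting predictions. A local estimate is possible with cross-validation on the training data. The sketch below compares the two model types with `cross_val_score` on synthetic data from `make_classification`, used as a stand-in so the example runs on its own; in the notebook, `X_train` and `y_train` would take the place of `X_demo` and `y_demo`, and the scores printed here say nothing about the loan dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with 11 features, the same count as the scaled training features.
X_demo, y_demo = make_classification(n_samples=300, n_features=11, random_state=0)

# random_state is fixed so repeated runs give the same scores.
models = {
    "random_forest": RandomForestClassifier(random_state=0),
    "decision_tree": DecisionTreeClassifier(random_state=0),
}

cv_scores = {}
for name, model in models.items():
    # 5-fold cross-validation: each fold is held out once for scoring.
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_scores[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```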