
**Import all the necessary packages**

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn import metrics

**Read data**

In [3]:
train = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Udacity/Linear Regression/train/Train_Loan_Home.csv')
test = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Udacity/Linear Regression/train/Test_Loan_Home.csv')

In [4]:
train.head()

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y


**Data preprocessing**

**Remove unwanted features**


*   **Loan_ID** is only an identifier and carries no useful information for our model, so we should remove it.
*   I believe most banks verify the applicant's source of income and salary range, so we don't need the Education details.



In [7]:
train.drop(['Loan_ID','Education'],axis=1,inplace=True)

In [8]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Gender             601 non-null    object 
 1   Married            611 non-null    object 
 2   Dependents         599 non-null    object 
 3   Self_Employed      582 non-null    object 
 4   ApplicantIncome    614 non-null    int64  
 5   CoapplicantIncome  614 non-null    float64
 6   LoanAmount         592 non-null    float64
 7   Loan_Amount_Term   600 non-null    float64
 8   Credit_History     564 non-null    float64
 9   Property_Area      614 non-null    object 
 10  Loan_Status        614 non-null    object 
dtypes: float64(4), int64(1), object(6)
memory usage: 52.9+ KB


As we can see above, some features contain NaN values. We can either remove those rows or impute the missing values with the mean or the most frequent value.

In [9]:
# I used SimpleImputer: most frequent value for the categorical-like columns, mean for LoanAmount
imp1 = SimpleImputer(strategy="most_frequent")
imp2 = SimpleImputer(strategy="mean")
x = imp1.fit_transform(train[['Gender','Married','Dependents','Self_Employed','Credit_History','Loan_Amount_Term']])
y = imp2.fit_transform(train[['LoanAmount']])
xx = pd.DataFrame(x,columns=['Gender','Married','Dependents','Self_Employed','Credit_History','Loan_Amount_Term'])
yy = pd.DataFrame(y, columns=['LoanAmount'])

# concatenate the imputed values into one single dataframe
data = pd.concat([xx, yy,train[['ApplicantIncome','CoapplicantIncome','Property_Area','Loan_Status']]], axis=1)

In [10]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Gender             614 non-null    object 
 1   Married            614 non-null    object 
 2   Dependents         614 non-null    object 
 3   Self_Employed      614 non-null    object 
 4   Credit_History     614 non-null    object 
 5   Loan_Amount_Term   614 non-null    object 
 6   LoanAmount         614 non-null    float64
 7   ApplicantIncome    614 non-null    int64  
 8   CoapplicantIncome  614 non-null    float64
 9   Property_Area      614 non-null    object 
 10  Loan_Status        614 non-null    object 
dtypes: float64(2), int64(1), object(8)
memory usage: 52.9+ KB
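
In the output above, Credit_History and Loan_Amount_Term come back as `object` because they were imputed together with string columns, so the imputer returned one object array. Here is a minimal sketch of one way to keep numeric dtypes, assuming we impute each homogeneous group of columns separately. The toy values and the helper name `impute_group` are made up for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with made-up values, mirroring three columns of the loan data.
toy = pd.DataFrame({
    "Gender": pd.Series(["Male", np.nan, "Female", "Male"], dtype=object),
    "Credit_History": [1.0, np.nan, 0.0, 1.0],
    "LoanAmount": [100.0, 200.0, np.nan, 300.0],
})

# Count the missing values per column before imputing.
print(toy.isna().sum())


def impute_group(frame, columns, strategy):
    """Impute one group of same-typed columns and return it as a DataFrame."""
    imputer = SimpleImputer(strategy=strategy)
    values = imputer.fit_transform(frame[columns])
    return pd.DataFrame(values, columns=columns, index=frame.index)


# Strings and numbers are imputed in separate calls, so the numeric
# columns are never mixed into an object array.
imputed = pd.concat(
    [
        impute_group(toy, ["Gender"], "most_frequent"),
        impute_group(toy, ["Credit_History"], "most_frequent"),
        impute_group(toy, ["LoanAmount"], "mean"),
    ],
    axis=1,
)

print(imputed.dtypes)
```

With this layout the numeric columns stay `float64`, which would let them be scaled or used directly instead of being label-encoded later.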


We have imputed the NaN values. Now we have to scale the numerical values to get better performance.

In [11]:
scaler = StandardScaler()
# Standardize each numerical column to zero mean and unit variance
for i in ['LoanAmount','ApplicantIncome','CoapplicantIncome']:
  data[i] = scaler.fit_transform(data[[i]])
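
To make the scaling step concrete, here is a small sketch of what StandardScaler computes, using made-up income values. It also shows a common variant, which is an assumption on my side and not what the cell above does: fitting the scaler on the training rows only and reusing those statistics for the test rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up income values for illustration only.
train_income = np.array([[2000.0], [4000.0], [6000.0]])
test_income = np.array([[5000.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_income)  # learns mean and std here
test_scaled = scaler.transform(test_income)        # reuses the training mean and std

# StandardScaler is (x - mean) / std, with the population standard deviation.
manual = (train_income - train_income.mean(axis=0)) / train_income.std(axis=0)

print(np.allclose(train_scaled, manual))
print(test_scaled)
```

Fitting on the training rows only keeps information about the test rows out of the preprocessing step.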

Next, we need to convert the categorical data into numerical data, since our models cannot interpret categorical values directly.

In [12]:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
# Label-encode each categorical column and the target; the encoder is refit for every column
for i in ['Gender','Married','Dependents','Self_Employed','Credit_History','Loan_Amount_Term','Property_Area','Loan_Status']:
  data[i] = le.fit_transform(data[i])
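
As a small illustration of what the encoder does, the sketch below label-encodes a toy Property_Area column (the values are made up) and prints the resulting mapping. It also shows one-hot encoding with `pd.get_dummies` as a possible alternative. That alternative is my assumption, not something the notebook uses.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column with made-up values.
area = pd.Series(["Urban", "Rural", "Semiurban", "Urban"], name="Property_Area")

le = LabelEncoder()
codes = le.fit_transform(area)

# classes_ is sorted, and each class is replaced by its position in that order.
mapping = {str(label): index for index, label in enumerate(le.classes_)}
print(mapping)
print(list(codes))

# One-hot encoding creates one indicator column per category instead,
# so no artificial ordering is implied between the categories.
one_hot = pd.get_dummies(area, prefix="Property_Area")
print(one_hot.columns.tolist())
```

Label encoding imposes an order (here Rural < Semiurban < Urban) that distance-based models such as KNN or SVC may pick up on, which is why one-hot encoding is often considered for nominal features.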


All the basic preprocessing has been done. Next, we need to split the data into training and testing sets.

In [14]:
x = data.drop(['Loan_Status'],axis=1)
y = data.Loan_Status
train_x, test_x, train_y, test_y = train_test_split(x,y,random_state = 0)
print(train_x.shape)
print(train_y.shape)
print(test_x.shape)
print(test_y.shape)

(460, 10)
(460,)
(154, 10)
(154,)
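
The 460/154 shapes follow from the default `test_size` of 0.25 in `train_test_split`. The sketch below reproduces those sizes on synthetic data with a made-up class balance. It also passes `stratify=y`, which is an optional extra I am assuming here (the cell above does not use it), to keep the class ratio similar in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 614 rows; the class balance is made up.
X = np.arange(614 * 2).reshape(614, 2)
y = np.array([1] * 400 + [0] * 214)

train_x, test_x, train_y, test_y = train_test_split(
    X, y, random_state=0, stratify=y
)

print(train_x.shape, test_x.shape)
print(round(y.mean(), 3), round(train_y.mean(), 3), round(test_y.mean(), 3))
```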


The next step is to create our model. Here I tested several models, since no single model fits every problem.

**Logistic Regression**

In [15]:
lgstRgr = LogisticRegression()
lgstRgr.fit(train_x,train_y)
y_pred = lgstRgr.predict(test_x)
accuracy = metrics.accuracy_score(test_y, y_pred)
accuracy_percentage = 100 * accuracy
accuracy_percentage

83.76623376623377
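
Accuracy is simply the fraction of correct predictions. As a rough sketch with made-up labels, the snippet below checks that by hand and also prints a confusion matrix, which shows where the errors fall.

```python
import numpy as np
from sklearn import metrics

# Made-up labels for illustration only.
y_true = np.array([1, 1, 1, 0, 0, 1])
y_pred = np.array([1, 1, 1, 1, 0, 1])

accuracy = metrics.accuracy_score(y_true, y_pred)
manual_accuracy = (y_true == y_pred).mean()  # 5 correct out of 6

# Rows are the true classes (0, 1); columns are the predicted classes (0, 1).
cm = metrics.confusion_matrix(y_true, y_pred)

print(accuracy, manual_accuracy)
print(cm)
```

A confusion matrix like this can be useful when the two classes are not equally frequent, because a model can reach high accuracy while still misclassifying most of the smaller class.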

**K-Nearest Neighbors**

In [16]:
knn = KNeighborsClassifier(n_neighbors=12)
knn.fit(train_x,train_y)
y_pred = knn.predict(test_x)
accuracy = metrics.accuracy_score(test_y, y_pred)
accuracy_percentage = 100 * accuracy
accuracy_percentage

80.51948051948052
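
The value `n_neighbors=12` above is a fixed choice. One possible way to pick it from the data is a cross-validated grid search. The sketch below runs on synthetic data from `make_classification`, so the search range (1 to 20) and the printed numbers are illustrative assumptions, not results for the loan data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data with the same number of rows and features.
X, y = make_classification(n_samples=614, n_features=10, random_state=0)
train_x, test_x, train_y, test_y = train_test_split(X, y, random_state=0)

# Try every k from 1 to 20 with 5-fold cross-validation on the training part.
param_grid = {"n_neighbors": list(range(1, 21))}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(train_x, train_y)

print("best n_neighbors:", search.best_params_["n_neighbors"])
print("cross-validated accuracy:", search.best_score_)
print("held-out accuracy:", search.score(test_x, test_y))
```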

**Decision Tree**

In [17]:
dtree = DecisionTreeClassifier(max_depth=1)
dtree.fit(train_x,train_y)
y_pred = dtree.predict(test_x)
accuracy = metrics.accuracy_score(test_y, y_pred)
accuracy_percentage = 100 * accuracy
accuracy_percentage

83.11688311688312

**Support Vector Classifier**

In [18]:
svc = SVC()
svc.fit(train_x,train_y)
y_pred = svc.predict(test_x)
accuracy = metrics.accuracy_score(test_y, y_pred)
accuracy_percentage = 100 * accuracy
accuracy_percentage

83.76623376623377

**Gaussian Naive Bayes**

In [19]:
nb = GaussianNB()
nb.fit(train_x,train_y)
y_pred = nb.predict(test_x)
accuracy = metrics.accuracy_score(test_y, y_pred)
accuracy_percentage = 100 * accuracy
accuracy_percentage

82.46753246753246

As we can see above, Logistic Regression and SVC give the best accuracy (83.77% each), ahead of the decision tree (83.12%), Gaussian Naive Bayes (82.47%), and KNN (80.52%).
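
The differences between the models are small, and a single train/test split can rank them differently from run to run. Here is a hedged sketch of comparing the same five models with 5-fold cross-validation in one loop. It uses synthetic data from `make_classification`, so the scores it prints say nothing about the loan data, and the dictionary name `models` is my own.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with the same number of rows and features.
X, y = make_classification(n_samples=614, n_features=10, random_state=0)

# Same model settings as in the cells above.
models = {
    "Logistic Regression": LogisticRegression(),
    "KNN": KNeighborsClassifier(n_neighbors=12),
    "Decision Tree": DecisionTreeClassifier(max_depth=1),
    "SVC": SVC(),
    "Gaussian NB": GaussianNB(),
}

scores = {}
for name, model in models.items():
    # The scaler sits inside the pipeline, so it is refit on each training fold.
    pipeline = make_pipeline(StandardScaler(), model)
    fold_scores = cross_val_score(pipeline, X, y, cv=5, scoring="accuracy")
    scores[name] = fold_scores.mean()

# Print the models from best to worst mean accuracy.
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {100 * score:.2f}%")
```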