# 1. Introduction 

### 1.1 Problem Statement
Dream Housing Finance company deals in all home loans. They have a presence across all urban, semi-urban, and rural areas. A customer first applies for a home loan, after which the company validates the customer's eligibility for the loan.

The company wants to automate the loan eligibility process in real time, based on the details customers provide when filling in the online application form. These details include Gender, Marital Status, Education, Number of Dependents, Income, Loan Amount, Credit History, and others. To automate this process, the company has posed the problem of identifying the customer segments that are eligible for a loan, so that it can target those customers specifically. A partial dataset has been provided.

### 1.2 Objective
1. Analyze the dataset and find hidden patterns through EDA (Exploratory Data Analysis).
2. Preprocess the data to get a clean dataset.
3. Train the best possible model to predict the outcome and tune it for the best result.
4. Deploy the Model.

#### Machine learning algorithms used in this project:
1. Logistic Regression
2. Decision Tree
3. Random Forest
4. Naive Bayes 
5. k-Nearest Neighbours(kNN)

#### Dataset key information:
Loan_ID ----> Unique Loan ID

Gender ----> Male/ Female

Married ----> Applicant married (Y/N)

Dependents ----> Number of dependents

Education ----> Applicant Education (Graduate/ Not Graduate)

Self_Employed ----> Self-employed (Y/N)

ApplicantIncome ----> Applicant income

CoapplicantIncome ----> Coapplicant income

LoanAmount ----> Loan amount in thousands

Loan_Amount_Term ----> Term of a loan in months

Credit_History ----> credit history meets guidelines

Property_Area ----> Urban/ Semi-Urban/ Rural

Loan_Status ----> Loan approved (Y/N)

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# 2. Importing Libraries
Libraries are collections of related modules containing code that can be reused in different programs. The libraries used in this project are imported below.

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import CategoricalNB
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# 3. Exploratory Data Analysis
* Exploratory Data Analysis refers to the critical process of performing initial investigations on data to discover patterns, spot anomalies, test hypotheses, and check assumptions with the help of summary statistics and graphical representations.

* Your goal during EDA is to develop an understanding of your data. The easiest way to do this is to use questions as tools to guide your investigation. When you ask a question, the question focuses your attention on a specific part of your dataset and helps you decide which graphs, models, or transformations to make.
 
* Search for answers by visualising, transforming, and modelling your data.
 
* Let's start exploring our data

#### 3.1 Importing dataset

In [3]:
data = pd.read_csv("/kaggle/input/loan-eligible-dataset/loan-train.csv")
databackup = data.copy()
#creating backup for future 

#### 3.2 Understanding the "raw" data

In [4]:
data.shape

In [5]:
data.head()

In [6]:
data.columns

In [7]:
data.info()

In [8]:
data.describe()
#This function returns the count, mean, standard deviation, minimum and maximum values and the quantiles of the data.

In [9]:
# Inference
# 1. The raw data has missing values in Gender, Married, Dependents, Self_Employed, LoanAmount, Loan_Amount_Term and Credit_History.
# 2. The means of ApplicantIncome and CoapplicantIncome are much lower than their maximum values, which suggests outliers.
# 3. The raw data is right-skewed, as the values are much more spread out in the higher range.

In [10]:
data["Loan_Status"].value_counts()
# The raw data is imbalanced

In [11]:
categorical_columns = [column for column in data.columns if data[column].dtypes == 'O']
categorical_columns

In [12]:
numerical_columns = [column for column in data.columns if data[column].dtypes != 'O']
numerical_columns

In [13]:
data.isnull().sum()
# Check the null values in the dataset.
# Since the dataset is small, dropping the rows with null values is not a good option.

#### 3.3 Categorical Columns
Columns that hold discrete entries rather than continuous values are called categorical columns.

In [14]:
for col in categorical_columns:
    if(col != "Loan_Status" and col!="Loan_ID"):
        plt.figure(figsize=(10, 6))
        sns.countplot(x = col,data = data,palette = "muted")
        plt.title("Plot representing {} distribution".format(col),fontsize = 15)
        plt.show()

In [15]:
for index,col_name in enumerate(categorical_columns):
    if(col_name!="Loan_ID" and col_name!="Loan_Status"):
        plt.figure(figsize = (10,6))
        sns.countplot(x = col_name,data = data,hue = "Loan_Status", palette = "hls")
        plt.title("Plot to represent distribution of {} with respect to Loan_Status".format(col_name),fontsize = 15)
        plt.show()

In [16]:
for col in categorical_columns:
    if(col!="Loan_Status" and col!="Loan_ID"):
        print("{} column value distribution".format(col))
        print(data[col].value_counts(dropna = False))
        print("***********************")

In [17]:
# The data has more male applicants than female ones and more married applicants than unmarried ones.
# Most applicants are graduates, and few are self-employed.
# Applicants are more likely to get the loan if they are graduates and live in semi-urban areas.

#### 3.3.1 Gender

In [18]:
for col in categorical_columns:
    if(col!="Loan_ID" and col!="Gender"):
        print(data.groupby(col)["Gender"].value_counts(dropna = False))
        
        sns.countplot(x = col,hue="Gender",data = data,palette = "Paired")
        plt.show()
        
        print('********************')        

In [19]:
# Most of the married applicants are male.
# Most of the Applicants that are not self employed are males.
# Most applicants whose loan was approved are male.

#### 3.3.2 Married

In [20]:
for col in categorical_columns:
    if(col!="Loan_ID" and col!="Married"):
        print(data.groupby(col)["Married"].value_counts(dropna = False))
        
        sns.countplot(x = col,hue="Married",data = data,palette = "crest")
        plt.show()
        
        print('********************')

In [21]:
# Most Graduates are married.
# Most non self employed applicants are married.

#### 3.3.3 Dependents

In [22]:
for col in categorical_columns:
    if(col!="Loan_ID" and col!="Dependents"):
        print(data.groupby(col)["Dependents"].value_counts(dropna = False))
        
        sns.countplot(x = col,data= data,hue = "Dependents",palette = "rocket")
        plt.show()
        print('********************')

In [23]:
# Semi urban and urban have higher number of dependents.

#### 3.3.4 Education

In [24]:
for col in categorical_columns:
    if(col!="Loan_ID" and col!="Education"):
        print(data.groupby(col)["Education"].value_counts(dropna = False))
        
        sns.countplot(x = col,data = data,hue = "Education",palette = "Paired")
        plt.show()
        
        print('********************')

In [25]:
for col in categorical_columns:
    if(col!="Loan_ID" and col!="Self_Employed"):
        print(data.groupby(col)["Self_Employed"].value_counts(dropna = False))
        
        sns.countplot(x =col,data = data,hue = "Self_Employed",palette = "crest")
        plt.show()
        
        print('********************')

In [26]:
# Males are much less self employed

In [27]:
for col in categorical_columns:
    if(col!="Loan_ID" and col!="Property_Area"):
        print(data.groupby(col)["Property_Area"].value_counts(dropna = False))
        
        sns.countplot(x = col, data = data,hue = "Property_Area",palette = "rocket")
        plt.show()
        
        print('********************')

#### 3.4 Numerical Columns
Numerical columns have a numeric data type (int or float) and usually hold continuous values.

In [28]:
for col in numerical_columns:
    sns.histplot(data[col],kde = True,color = "green")
    plt.show()

In [29]:
# Credit_History and Loan_Amount_Term take only a few distinct values, so they behave like categories.
# ApplicantIncome, CoapplicantIncome and LoanAmount have wide distributions.
# They are right-skewed, so they need to be normalised.

In [30]:
sns.pairplot(data)

In [31]:
for col in numerical_columns:
    # violinplot has no `kde` argument (that belongs to histplot), so it is omitted here.
    sns.violinplot(y = col,data = data,color = "purple")
    plt.show()

#### 3.4.1 Numerical and Categorical

In [32]:
for cat in categorical_columns:
    if(cat!="Loan_ID"):
        for num in numerical_columns:
            sns.boxplot(data = data, x = cat  ,y = num, palette = "YlOrBr")
            plt.title("Distribution of {} with respect to {}".format(num,cat),fontsize = 15)
            plt.show()
        print("*************************************************************************")
        print("                                                                         ")

In [33]:
# Male, graduate and married applicants have higher ApplicantIncome, CoapplicantIncome and LoanAmount.
# 3+ dependents have much more income.

#### 3.5 Correlation
Correlation measures how strongly one feature is linearly related to another.
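
As a rough illustration of what each cell of a correlation heatmap contains, the sketch below computes the Pearson coefficient by hand and compares it with `np.corrcoef`. The arrays `income` and `loan` are made-up values, not rows from the dataset, and `pearson` is a helper introduced only for this example.

```python
import numpy as np

# Hypothetical toy values, not rows from the loan dataset.
income = np.array([2500.0, 4000.0, 5200.0, 6100.0, 9000.0])
loan = np.array([90.0, 120.0, 128.0, 170.0, 260.0])

def pearson(a, b):
    """Covariance of a and b divided by the product of their standard deviations."""
    a_c = a - a.mean()
    b_c = b - b.mean()
    return (a_c * b_c).sum() / np.sqrt((a_c ** 2).sum() * (b_c ** 2).sum())

r_manual = pearson(income, loan)
r_numpy = np.corrcoef(income, loan)[0, 1]
print(round(r_manual, 4), round(r_numpy, 4))
```

A value near +1 means the two features rise together, a value near -1 means one falls as the other rises, and a value near 0 means there is little linear relationship.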

In [34]:
plt.figure(figsize = (15,10))
# Correlate the numeric columns only; object columns cannot be correlated.
sns.heatmap(data[numerical_columns].corr(),annot = True,linewidths = 0.5,cmap = "Purples")

In [35]:
# LoanAmount and ApplicantIncome have a positive correlation of about 0.6.

#### 3.6 Summary of data
* Male applicants dominate the dataset.

* More than half of the applicants are married.

* Min income = 150

* Max income = 81k

* Mean income = 5350

* The data is right-skewed for ApplicantIncome, CoapplicantIncome and LoanAmount.

* There are outliers in the dataset.

* 85 percent of accepted applications have a positive credit history.

* About 70 percent of applications were accepted, which means the dataset is imbalanced.

# 4. Data Preprocessing

### 4.1 Handling Missing values

Missing values are entries for which no data was recorded for a particular variable.

In [36]:
data.isnull().sum()

#### 4.1.1 Categorical Variables

In [37]:
# Fill categorical (and category-like) columns with their most frequent value.
# Assigning the result back avoids the chained in-place fillna that newer pandas versions warn about.
for col in ['Gender', 'Married', 'Dependents', 'Self_Employed', 'Credit_History', 'Loan_Amount_Term']:
    data[col] = data[col].fillna(data[col].mode()[0])

#### 4.1.2 Numerical Variables

In [38]:
# Fill the continuous column with its mean (assigned back rather than filled in place).
data["LoanAmount"] = data["LoanAmount"].fillna(data["LoanAmount"].mean())

In [39]:
data.isnull().sum()
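
The two imputation strategies used above can be seen on a tiny example. This is a minimal sketch on a made-up frame (`toy` is hypothetical; only the column names mirror the dataset): the mode fills the categorical gap and the mean fills the numerical one.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps; column names mirror the dataset.
toy = pd.DataFrame({
    "Gender": ["Male", "Male", np.nan, "Female"],
    "LoanAmount": [100.0, np.nan, 140.0, 120.0],
})

# Categorical column: fill with the most frequent value (the mode).
toy["Gender"] = toy["Gender"].fillna(toy["Gender"].mode()[0])

# Numerical column: fill with the mean of the observed values.
toy["LoanAmount"] = toy["LoanAmount"].fillna(toy["LoanAmount"].mean())

print(toy)
```

Mean imputation is sensitive to outliers; the median is a common alternative for skewed columns, although the notebook itself uses the mean.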

### 4.2 Dropping Unnecessary Features

Features from which no information can be gained for the machine learning model are dropped to reduce training time.

In [40]:
data = data.drop(["Loan_ID"],axis = "columns")

### 4.3 Detecting and Removing Outliers

An outlier is an observation that lies an abnormal distance from other values in a random sample from a population.
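
One common way to make "abnormal distance" concrete is the 1.5 × IQR rule, which is also what the cells below rely on. The following sketch applies it to a small made-up array (`values` is hypothetical) using a boolean mask.

```python
import numpy as np

# Hypothetical sample with one obviously extreme value.
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Boolean mask instead of an explicit loop.
outliers = values[(values < lower) | (values > upper)]
print(lower, upper, outliers)
```

Anything outside the interval from `lower` to `upper` is flagged; here only the value 100 falls outside it.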

In [41]:
data.size

In [42]:

def detect_outliers_iqr(data,outliers):
    data = sorted(data)
    q1 = np.percentile(data, 25)
    q3 = np.percentile(data, 75)
    # print(q1, q3)
    IQR = q3-q1
    lwr_bound = q1-(1.5*IQR)
    upr_bound = q3+(1.5*IQR)
    # print(lwr_bound, upr_bound)
    for i in data: 
        if (i<lwr_bound or i>upr_bound):
            outliers.append(i)

ap_outliers = []
co_outliers = []
la_outliers = []

detect_outliers_iqr(data["ApplicantIncome"],ap_outliers)
detect_outliers_iqr(data["CoapplicantIncome"],co_outliers)
detect_outliers_iqr(data["LoanAmount"],la_outliers)

In [43]:
ap_outliers

In [44]:
co_outliers

In [45]:
la_outliers

In [46]:
sns.histplot(x = "LoanAmount",data = data,color = "cyan")

In [47]:
sns.histplot(x = "ApplicantIncome",data = data, color = "violet")

In [48]:
sns.histplot(x = "CoapplicantIncome",data =data,color = "yellow")

In [49]:
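# Compute the IQR on numeric columns only (object columns have no quantiles).
num = data.select_dtypes(include="number")
Q1 = num.quantile(0.25)
Q3 = num.quantile(0.75)
IQR = Q3 - Q1

# Note: the filter covers every numeric column, including Credit_History and Loan_Amount_Term.
data = data[~((num < (Q1 - 1.5 * IQR)) |(num > (Q3 + 1.5 * IQR))).any(axis=1)].copy()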

In [50]:
data.size

In [51]:
sns.histplot(data = data,x = "LoanAmount",color = "cyan")

In [52]:
sns.histplot(data = data,x = "ApplicantIncome",color = "violet")

In [53]:
sns.histplot(data = data,x = "CoapplicantIncome",color = "yellow")

In [54]:
data.ApplicantIncome = np.sqrt(data.ApplicantIncome)
data.CoapplicantIncome = np.sqrt(data.CoapplicantIncome)
data.LoanAmount = np.sqrt(data.LoanAmount)
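
The square-root transform above is meant to pull in the long right tail. As a hedged illustration, the sketch below measures skewness before and after the transform on synthetic log-normal values; `income` is generated data, not the real ApplicantIncome column.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical right-skewed "income" values drawn from a log-normal distribution.
income = pd.Series(rng.lognormal(mean=8.0, sigma=1.0, size=1000))

skew_before = income.skew()
skew_after = np.sqrt(income).skew()

print(f"skew before: {skew_before:.2f}, after sqrt: {skew_after:.2f}")
```

A skewness closer to 0 indicates a more symmetric distribution. A log transform (for example `np.log1p`) reduces skew more aggressively and is a frequent alternative.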

### 4.4 One-Hot Encoding

One-hot encoding converts a categorical variable into one new binary column per category, which lets models use categorical information and can improve prediction accuracy.
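
A minimal sketch of the idea on a made-up column (`toy` is hypothetical; the category labels are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({"Property_Area": ["Urban", "Rural", "Semiurban", "Urban"]})

# One column per category; dtype=int gives 0/1 instead of booleans.
encoded = pd.get_dummies(toy, dtype=int)
print(encoded)

# drop_first=True removes one redundant column per feature.
reduced = pd.get_dummies(toy, drop_first=True, dtype=int)
print(reduced.columns.tolist())
```

Dropping one column per feature loses no information, because the dropped category is implied when all remaining columns are 0. The notebook achieves the same effect for the binary features by dropping columns by name.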

In [55]:
data = pd.get_dummies(data)

data = data.drop(['Gender_Female', 'Married_No', 'Education_Not Graduate', 
              'Self_Employed_No', 'Loan_Status_N'], axis = 1)

latcolumns = {"Gender_Male":"Gender","Married_Yes":"Married","Education_Graduare":"Education","Self_Employed_Yes":"Self_Employed","Loan_Status_Y":"Loan_Status"}

data.rename(columns = latcolumns,inplace=True)

In [56]:
data


In [57]:
X = data.drop(["Loan_Status"],axis = 1)
y = data["Loan_Status"]

### 4.5 SMOTE 

SMOTE (Synthetic Minority Oversampling Technique) is used when a dataset is imbalanced. It balances the class distribution by adding new minority-class samples. Rather than replicating existing rows, it synthesizes new ones by interpolating between a minority sample and its nearest minority-class neighbours.
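
The interpolation step can be sketched in a few lines of NumPy. This is a simplified illustration of the idea only, not the imbalanced-learn implementation; `minority`, `synthesize` and its parameters are names invented for this example.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical minority-class points in a 2-D feature space.
minority = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.2], [3.0, 3.0]])

def synthesize(points, n_new, k=2):
    """Create n_new points by interpolating between a sample and one of its k nearest neighbours."""
    new_points = []
    for _ in range(n_new):
        i = rng.integers(len(points))
        dists = np.linalg.norm(points - points[i], axis=1)
        neighbours = np.argsort(dists)[1:k + 1]  # skip index 0, the point itself
        j = rng.choice(neighbours)
        lam = rng.random()  # position along the segment, in [0, 1)
        new_points.append(points[i] + lam * (points[j] - points[i]))
    return np.array(new_points)

synthetic = synthesize(minority, n_new=5)
print(synthetic)
```

Every synthetic point lies on a line segment between two existing minority points, so the new rows stay inside the region already occupied by the minority class.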

In [58]:
X,y = SMOTE().fit_resample(X,y)

In [59]:
# new distribution after resampling
# y now holds the resampled labels, so it is plotted directly rather than through the original frame.
sns.countplot(y = y, palette = "inferno")
plt.ylabel("Loan_Status")
plt.xlabel("Total")
plt.show()

### 4.6 Scaling the data

Scaling brings all features onto a comparable range (MinMaxScaler maps each feature to [0, 1]), so that features with large values do not dominate and the model can learn more easily and efficiently.
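
Min-max scaling is the formula `(x - min) / (max - min)` applied per column. The sketch below shows it on made-up rows (`train` and `new` are hypothetical), including reuse of the training range for unseen data, which is what calling `transform` on a fitted scaler does.

```python
import numpy as np

# Hypothetical rows: two features on very different scales.
train = np.array([[150.0, 9.0], [5000.0, 120.0], [81000.0, 700.0]])
new = np.array([[2500.0, 100.0]])

# Learn the range from the training data only ...
col_min = train.min(axis=0)
col_max = train.max(axis=0)

def min_max(x):
    return (x - col_min) / (col_max - col_min)

# ... then reuse the same range for unseen rows.
train_scaled = min_max(train)
new_scaled = min_max(new)
print(train_scaled)
print(new_scaled)
```

Unseen rows can fall outside [0, 1] if their values lie beyond the range observed during fitting.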

In [60]:
scaling = MinMaxScaler()
X = scaling.fit_transform(X)

### 4.7 Data Splitting

Splitting data to test the models.

In [61]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size = 0.2,random_state = 42)

In [62]:
print('X_train :',X_train.shape)
print('X_test :',X_test.shape)
print('y_train :',y_train.shape)
print('y_test :',y_test.shape)

In [63]:
data.info()

# 5. Models

### 5.1 Logistic Regression

Logistic regression is a simple and efficient method for binary classification. It is a linear model that is easy to implement and performs very well when the classes are linearly separable.
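
To show how a fitted logistic regression turns features into a class, here is a minimal sketch of the prediction step. The weights `w`, bias `b` and rows in `X_toy` are invented for illustration and are not the values learned in this notebook.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights and bias for two scaled features.
w = np.array([2.0, -1.0])
b = -0.5

X_toy = np.array([[0.9, 0.1], [0.1, 0.8]])
proba = sigmoid(X_toy @ w + b)       # estimated probability of the positive class
labels = (proba >= 0.5).astype(int)  # default 0.5 decision threshold
print(proba, labels)
```

The linear score `X @ w + b` is what makes the decision boundary a straight line (or hyperplane), which is why the method works best on linearly separable classes.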

In [64]:
LogisticClassifier = LogisticRegression(solver="liblinear",max_iter = 500,random_state = 42)
LogisticClassifier.fit(X_train,y_train)
LogisticClassifier

In [65]:
y_pred_lr = LogisticClassifier.predict(X_test)

In [66]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test,y_pred_lr)

In [67]:
cm_lr = confusion_matrix(y_test,y_pred_lr)
plt.rcParams['figure.figsize'] = (5, 5)
sns.heatmap(cm_lr, annot = True, annot_kws = {'size':15}, cmap = 'PuBu')
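
A confusion matrix carries more than accuracy. The sketch below derives accuracy, precision and recall from a made-up 2x2 matrix (`cm` is hypothetical) laid out the way scikit-learn's `confusion_matrix` returns it: rows are true labels, columns are predicted labels.

```python
import numpy as np

# Hypothetical 2x2 matrix: rows are true labels, columns are predictions.
cm = np.array([[40, 10],
               [ 5, 45]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)   # of the predicted approvals, how many were correct
recall = tp / (tp + fn)      # of the true approvals, how many were found
print(accuracy, precision, recall)
```

For a loan decision, false positives (approving an applicant who should be rejected) and false negatives usually carry different costs, so precision and recall are worth reading alongside accuracy.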

In [68]:
print("Training Accuracy :", LogisticClassifier.score(X_train, y_train))

print("Testing Accuracy :", LogisticClassifier.score(X_test, y_test))

In [69]:
# Cross-validate on the training split so the test set stays held out.
cross_val_score(LogisticClassifier,X_train,y_train,cv = 20).mean()

### 5.2 K-Nearest Neighbour (KNN)

The K-NN algorithm stores all the available data and classifies a new data point based on its similarity to the stored points. When new data appears, it is assigned to the category that is most common among its K nearest neighbours.
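
The whole algorithm fits in a few lines. This is a minimal sketch with Euclidean distance and a majority vote; `X_known`, `y_known` and `knn_predict` are illustrative names and the points are made up.

```python
import numpy as np
from collections import Counter

# Hypothetical labelled points.
X_known = np.array([[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9], [0.85, 0.7]])
y_known = np.array([0, 0, 1, 1, 1])

def knn_predict(x_new, k=3):
    """Return the majority label among the k stored points closest to x_new."""
    dists = np.linalg.norm(X_known - x_new, axis=1)
    nearest = np.argsort(dists)[:k]
    return Counter(y_known[nearest].tolist()).most_common(1)[0][0]

print(knn_predict(np.array([0.15, 0.15])), knn_predict(np.array([0.8, 0.8])))
```

Because the prediction depends on raw distances, features should be on comparable scales before K-NN is applied.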

In [70]:
knn = KNeighborsClassifier()
knn_model = knn.fit(X_train, y_train)
knn_model

In [71]:
y_pred_knn = knn_model.predict(X_test)

In [72]:
accuracy_score(y_test,y_pred_knn)

In [73]:
cm_knn = confusion_matrix(y_test, y_pred_knn)
plt.rcParams['figure.figsize'] = (5, 5)
sns.heatmap(cm_knn, annot = True, annot_kws = {'size':15}, cmap = 'PuBu')

In [74]:
# checking the best model for knn
knn_score = []
for i in range(1,21):
    knnclassifier = KNeighborsClassifier(n_neighbors = i)
    knnclassifier.fit(X_train,y_train)
    knn_score.append(knnclassifier.score(X_test,y_test))
plt.figure(figsize = (10,10))
plt.plot(range(1,21), knn_score)
plt.xticks(np.arange(1,21,1))
plt.xlabel("K value")
plt.ylabel("Score")
plt.show()
KNAcc = max(knn_score)
print("KNN best accuracy: {:.2f}%".format(KNAcc*100))


### 5.3 Naive Bayes

Naive Bayes algorithms are widely used in sentiment analysis, spam filtering, recommendation systems, etc. They are fast and easy to implement, but their biggest disadvantage is the requirement that the predictors be independent.
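
The independence assumption shows up directly in the arithmetic: per-feature likelihoods are multiplied, or equivalently their logs are summed. Below is a simplified Gaussian Naive Bayes sketch on made-up data; `fit`, `predict`, `X_toy` and `y_toy` are names invented here, and this is not scikit-learn's implementation.

```python
import numpy as np

# Hypothetical training data: two features, two classes.
X_toy = np.array([[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [3.0, 0.5], [3.2, 0.7], [2.8, 0.3]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

def fit(X, y):
    """Per class: prior, feature means, feature variances (with a small floor)."""
    return {c: (np.mean(y == c), X[y == c].mean(axis=0), X[y == c].var(axis=0) + 1e-9)
            for c in np.unique(y)}

def predict(params, x):
    scores = {}
    for c, (prior, mu, var) in params.items():
        # Independence assumption: per-feature log-likelihoods are simply summed.
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
        scores[c] = np.log(prior) + log_lik
    return max(scores, key=scores.get)

params = fit(X_toy, y_toy)
print(predict(params, np.array([1.1, 2.1])), predict(params, np.array([3.1, 0.4])))
```

When features are strongly correlated, the summed log-likelihoods count the same evidence more than once, which is the practical cost of the independence assumption.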

In [75]:
# Use a separate variable name so the GaussianNB class is not overwritten.
gnb = GaussianNB()
gnb.fit(X_train,y_train)

In [76]:
y_pred_NB = gnb.predict(X_test)

In [77]:
accuracy_score(y_test,y_pred_NB)

In [78]:
cm_NB = confusion_matrix(y_test, y_pred_NB)
plt.rcParams['figure.figsize'] = (5, 5)
sns.heatmap(cm_NB, annot = True, annot_kws = {'size':15}, cmap = 'PuBu')

### 5.4 Decision Tree

In decision analysis, a decision tree can be used to visually and explicitly represent decisions and decision making. As the name goes, it uses a tree-like model of decisions.
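
A tree chooses each split by how much it reduces label impurity. scikit-learn's `DecisionTreeClassifier` uses the Gini criterion by default; the sketch below computes it for two candidate splits of a made-up label array (`parent`, `gini` and `split_impurity` are illustrative names).

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_impurity(left, right):
    """Size-weighted impurity of the two child nodes."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

parent = np.array([1, 1, 1, 0, 0, 0])               # hypothetical labels
good = split_impurity(parent[:3], parent[3:])       # pure children
poor = split_impurity(parent[::2], parent[1::2])    # mixed children
print(gini(parent), good, poor)
```

The split with the lower weighted impurity is preferred; a split that produces pure children scores 0.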

In [79]:
dt = DecisionTreeClassifier()
dt.fit(X_train,y_train)
dt

In [80]:
y_pred_dt = dt.predict(X_test)

In [81]:
accuracy_score(y_test,y_pred_dt)

In [82]:
cm_dt = confusion_matrix(y_test, y_pred_dt)
plt.rcParams['figure.figsize'] = (5, 5)
sns.heatmap(cm_dt, annot = True, annot_kws = {'size':15}, cmap = 'PuBu')

In [83]:
dt_score = []
for i in range(2,21):
    DtClassifier = DecisionTreeClassifier(max_leaf_nodes = i)
    DtClassifier.fit(X_train,y_train)
    dt_score.append(DtClassifier.score(X_test,y_test))
    
plt.figure(figsize = (10,10))
plt.plot(range(2,21), dt_score)
plt.xticks(np.arange(2,21,1))
plt.xlabel("Leaf")
plt.ylabel("Score")
plt.show()
DTAcc = max(dt_score)
print("Decision Tree Accuracy: {:.2f}%".format(DTAcc*100))

### 5.5 Random Forest

Random forest, like its name implies, consists of a large number of individual decision trees that operate as an ensemble. Each individual tree in the random forest spits out a class prediction and the class with the most votes becomes our model’s prediction.
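
The voting step described above can be sketched as follows. The matrix `tree_votes` is made up for illustration. Note that scikit-learn's `RandomForestClassifier` averages the per-tree class probabilities instead of counting hard votes, so this shows the idea rather than the library's exact procedure.

```python
import numpy as np

# Hypothetical predictions: one row per tree, one column per applicant.
tree_votes = np.array([
    [1, 0, 1],
    [1, 0, 0],
    [0, 0, 1],
    [1, 1, 1],
    [1, 0, 1],
])

# Majority vote per column; with 0/1 labels this is "mean above 0.5".
forest_prediction = (tree_votes.mean(axis=0) > 0.5).astype(int)
print(forest_prediction)
```

Individual trees disagree on some applicants, but the ensemble decision is more stable than any single tree.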

In [84]:
rf = RandomForestClassifier()
rf.fit(X_train,y_train)

In [85]:
y_pred_rf = rf.predict(X_test)

In [86]:
accuracy_score(y_test,y_pred_rf)

In [87]:
cm_rf = confusion_matrix(y_test, y_pred_rf)
plt.rcParams['figure.figsize'] = (5, 5)
sns.heatmap(cm_rf, annot = True, annot_kws = {'size':15}, cmap = 'PuBu')

In [88]:
rf_score = []
for i in range(2,35):
    RFclassifier = RandomForestClassifier(n_estimators = 1000, random_state = 1, max_leaf_nodes=i)
    RFclassifier.fit(X_train, y_train)
    rf_score.append(RFclassifier.score(X_test, y_test))

plt.figure(figsize = (10,10))
plt.plot(range(2,35), rf_score)
plt.xticks(np.arange(2,35,1))
plt.xlabel("RF Value")
plt.ylabel("Score")
plt.show()
RFAcc = max(rf_score)
print("Random Forest Accuracy:  {:.2f}%".format(RFAcc*100))

In [89]:
finmodel = GridSearchCV(RandomForestClassifier(criterion = "gini"),{
            "max_depth": [8,10],
            "max_features": [5,8],
            "n_estimators": [500,1000,1500],
            "min_samples_split": [2,5],
            "max_leaf_nodes" : [15,20,25,30]
},cv = 5,n_jobs = -1)

finmodel.fit(X_train,y_train)
finmodel.cv_results_
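
Grid search is exhaustive, so its cost grows multiplicatively with every parameter added. The sketch below counts the work implied by the grid above using only the standard library; `param_grid` and `cv_folds` simply restate the values passed to `GridSearchCV`.

```python
from itertools import product

param_grid = {
    "max_depth": [8, 10],
    "max_features": [5, 8],
    "n_estimators": [500, 1000, 1500],
    "min_samples_split": [2, 5],
    "max_leaf_nodes": [15, 20, 25, 30],
}
cv_folds = 5

# Every combination of parameter values is one candidate model.
combinations = list(product(*param_grid.values()))
total_fits = len(combinations) * cv_folds
print(len(combinations), total_fits)
```

With forests of 500 to 1500 trees per fit, this is why the search is slow; `RandomizedSearchCV`, already imported above, samples only a subset of the combinations.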

In [90]:
res = pd.DataFrame(finmodel.cv_results_)
print(res)

In [91]:
print("Best parameters - {}".format(finmodel.best_params_))

In [108]:
tuned_rf = RandomForestClassifier(max_depth =  10, max_features = 8, min_samples_split = 2,max_leaf_nodes = 30, n_estimators = 500)
tuned_rf.fit(X_train,y_train)
y_pred_trf = tuned_rf.predict(X_test)

print(accuracy_score(y_pred_trf,y_test))
print(cross_val_score(RandomForestClassifier(max_depth =  10, max_features = 8, min_samples_split = 2,max_leaf_nodes = 30, n_estimators = 500),X_train,y_train,cv = 5).mean())

In [109]:
cm_trf = confusion_matrix(y_test, y_pred_trf)
plt.rcParams['figure.figsize'] = (5, 5)
sns.heatmap(cm_trf, annot = True, annot_kws = {'size':15}, cmap = 'PuBu')

In [110]:
data

# 6. Test Dataset

In [111]:
test_data = pd.read_csv("/kaggle/input/loan-eligible-dataset/loan-test.csv")

In [112]:
test_data

# 7. Preparing test data

In [113]:
test_data.isnull().sum()

In [114]:
test_data.shape

In [115]:
test_data = test_data.dropna()

In [116]:
test_data.shape

In [117]:
test_data = test_data.drop(["Loan_ID"],axis = "columns")

In [118]:
test_data.ApplicantIncome = np.sqrt(test_data.ApplicantIncome)
test_data.CoapplicantIncome = np.sqrt(test_data.CoapplicantIncome)
test_data.LoanAmount = np.sqrt(test_data.LoanAmount)

In [119]:
test_data = pd.get_dummies(test_data)

test_data = test_data.drop(['Gender_Female', 'Married_No', 'Education_Not Graduate', 
              'Self_Employed_No'], axis = 1)

latcolumnstest = {"Gender_Male":"Gender","Married_Yes":"Married","Education_Graduare":"Education","Self_Employed_Yes":"Self_Employed"}

test_data.rename(columns = latcolumnstest,inplace=True)

In [120]:
test_data

In [121]:
test_data = scaling.transform(test_data)

In [122]:
test_pred = tuned_rf.predict(test_data)

In [123]:
test_pred

# 8. Conclusion

* The dataset is small, so the models cannot learn robust patterns.
* The target y is imbalanced (there are fewer N values), which makes the minority class harder to learn.
* For these reasons, the models are prone to overfitting.
* Feature engineering could improve the results.
* The final model can be used to accept or reject new applications automatically.