# LOAN ELIGIBILITY PREDICTION #

![image.png](attachment:image.png)

**MODEL PERFORMANCE AND EVALUATION REPORT**

***Contents*** :
1. Introduction 
2. Exploratory Data Analysis (EDA) and Preprocessing
3. Modeling
4. Model Comparison
5. Conclusion
6. References

**INTRODUCTION**

Loan approval always involves risk. Even after the loan application data has been analyzed numerous times, decisions are not always correct. Dream Housing Finance Company deals in home loans and has a presence across urban, semi-urban and rural areas. Ascertaining whether a borrower will pay back a loan can be tedious, hence the need to automate the procedure. The company wants to automate this process so that loan approval is less risky and does not incur losses. We need to identify the factors that make a customer eligible for a home loan. The more accurately we predict eligible customers, the more beneficial it is for Dream Housing Finance Company.

This is a supervised binary classification problem, in which several machine learning algorithms were trained and compared to find the best-performing model.

The dataset, collected from Kaggle, has 13 columns (features) and 614 rows (samples). It was found to be imbalanced and had missing values in some of the columns. The missing values were imputed using the mean for the numerical features and the mode for the categorical features.


**EDA AND PREPROCESSING**

**Steps:**
 1. Data Wrangling and Exploration: cleaning, transforming and visualizing the data
 2. Preprocessing and Feature Engineering: preparing the data for modeling and creating new features


**EDA Summary**

The dataset was analyzed to summarize its main characteristics, using data visualization for univariate and bivariate analysis.

The dataset was found to be imbalanced based on the value counts of approval (Yes, Y) and rejection (No, N). Out of a total of 614 applications, 422 loans were approved and 192 were rejected.
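To make the imbalance concrete, the short calculation below uses only the counts stated above. It shows the class shares and the accuracy that a trivial "always approve" rule would reach, which is the baseline any accuracy figure should be read against.

```python
# Class counts reported in the EDA summary.
approved, rejected = 422, 192
total = approved + rejected  # 614 applications

# Share of each class in the dataset.
approved_share = approved / total
rejected_share = rejected / total

# A rule that always predicts "Y" is right exactly as often as "Y" occurs,
# so its accuracy equals the majority-class share.
majority_baseline_accuracy = max(approved_share, rejected_share)

print(f"Approved: {approved_share:.1%}, rejected: {rejected_share:.1%}")
print(f"Always-approve baseline accuracy: {majority_baseline_accuracy:.1%}")
```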

Our dataset also had missing values in some of the columns. The missing values were imputed using mean values for the numerical features and mode value for the categorical features. 
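The imputation described above can be sketched with pandas as follows. This is a minimal illustration, not the notebook's actual code: the toy values are hypothetical, and the column names `LoanAmount` and `Gender` are assumed to match the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values; column names are assumed.
df = pd.DataFrame({
    "LoanAmount": [120.0, np.nan, 100.0, 140.0],
    "Gender": ["Male", None, "Male", "Female"],
})

# Numerical columns: replace missing values with the column mean.
for col in df.select_dtypes(include="number").columns:
    df[col] = df[col].fillna(df[col].mean())

# Categorical columns: replace missing values with the most frequent value.
for col in df.select_dtypes(exclude="number").columns:
    df[col] = df[col].fillna(df[col].mode()[0])

print(df)
```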

Loan Status is the target variable, and the other 12 are the predictor variables. A brief description of all the features in the dataset:

![image.png](attachment:image.png)

**Feature Engineering**

Four new features were created for a more thorough data analysis:

1. Total_Income = ApplicantIncome + CoapplicantIncome, combining both income columns into a single predictor.
2. EMI (Equated Monthly Installment) = Loan Amount / Loan Amount Term
3. Balance Income = Total Income - EMI
4. DTI (Debt To Income) ratio = (EMI / Total Income) * 100
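The four formulas above translate directly into pandas column arithmetic. The sketch below uses hypothetical rows, and the column names `LoanAmount` and `Loan_Amount_Term` are assumptions about the dataset's naming.

```python
import pandas as pd

# Hypothetical rows; values are illustrative only.
df = pd.DataFrame({
    "ApplicantIncome": [5000, 3000],
    "CoapplicantIncome": [0, 1500],
    "LoanAmount": [120, 90],          # assumed column name
    "Loan_Amount_Term": [360, 180],   # assumed column name
})

# The four engineered features, following the formulas as written above.
df["Total_Income"] = df["ApplicantIncome"] + df["CoapplicantIncome"]
df["EMI"] = df["LoanAmount"] / df["Loan_Amount_Term"]
df["Balance_Income"] = df["Total_Income"] - df["EMI"]
df["DTI"] = df["EMI"] / df["Total_Income"] * 100

print(df[["Total_Income", "EMI", "Balance_Income", "DTI"]])
```

The formulas treat loan amount and income as being in the same unit. If the dataset records them in different units, one of the columns would need to be rescaled before computing Balance Income and DTI.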

**Bivariate Analysis**

This step analyzes the relationship between the target variable, Loan Status, and the predictor variables.
    
1. Credit History: 
    Applicants with a good credit history are far more likely to be accepted.
2. Education: 
    About 5/6th of the applicants are graduates, and graduates have a higher proportion of loan approvals.
3. Total Income:
    Applicants with a higher total income are more likely to have loans approved.
4. Property Area:
    More applicants come from semi-urban areas, and they are also more likely to be granted loans.


|![image.png](attachment:image.png) |![image-2.png](attachment:image-2.png)| 
|-|-|
|![image-7.png](attachment:image-7.png) |![image-6.png](attachment:image-6.png)|
|![image-8.png](attachment:image-8.png) |![image-9.png](attachment:image-9.png)|

5. Gender: There are more male (81%) than female (19%) applicants. Males have an approval rate of around 69%, whereas females have around 67%.
6. Marital Status: Two-thirds of the applicants in the dataset are married; married applicants are more likely to be granted home loans.

Some other inferences drawn from the above-carried analysis are:

1. Applicant Income is not normally distributed; it contains a fair number of anomalies and outliers.
2. Loan Amount is roughly normally distributed, but it still carries outliers.
   In such cases, a log transformation was used to reduce the skewness.
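The effect of a log transformation on a right-skewed variable can be illustrated with a small example. The income values are hypothetical, and `skewness` is a helper written for this sketch, not a function from the notebook.

```python
import numpy as np

# Hypothetical right-skewed incomes (each value doubles the previous one).
income = np.array([1000.0, 2000.0, 4000.0, 8000.0, 16000.0, 32000.0, 64000.0])

def skewness(x):
    """Sample skewness: mean of cubed z-scores (population standard deviation)."""
    z = (x - x.mean()) / x.std()
    return float((z ** 3).mean())

# Natural log compresses large values; use np.log1p instead if zeros can occur.
log_income = np.log(income)

print(f"Skewness before: {skewness(income):.3f}")
print(f"Skewness after:  {skewness(log_income):.3f}")
```

Because these toy values are evenly spaced on a log scale, the transformed series is symmetric. Real data will usually retain some skew, so the transformation reduces skewness rather than removing it entirely.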

**MODELING**

We implement the model using three machine learning classification algorithms.

1.  Learning type: supervised machine learning
2.  Problem type: binary classification
3.  Tools used: Python's scikit-learn (sklearn), pandas, numpy, matplotlib and seaborn in a Jupyter notebook

**Model Fitting:**

Before modeling, all categorical variables were converted to numbers through label encoding.
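Label encoding can be sketched with scikit-learn's `LabelEncoder`, applied column by column. The rows below are hypothetical, and the column names are assumed to follow the dataset.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical categorical columns; names are assumed.
df = pd.DataFrame({
    "Education": ["Graduate", "Not Graduate", "Graduate"],
    "Property_Area": ["Urban", "Semiurban", "Rural"],
    "Loan_Status": ["Y", "N", "Y"],
})

encoders = {}
for col in df.columns:
    enc = LabelEncoder()
    df[col] = enc.fit_transform(df[col])  # classes are numbered in sorted order
    encoders[col] = enc                   # kept so codes can be mapped back later

print(df)
print(list(encoders["Property_Area"].classes_))
```

Label encoding assigns integers in sorted order of the class names, so the resulting numbers imply an ordering that a nominal column such as property area does not really have. Tree-based models are largely indifferent to this, while linear models can be affected.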

**Modeling Steps**

1. Feature Engineering
2. Train-Test Split (70/30): the dataset was divided into training and test sets in a 70:30 ratio
3. Classifier training using optimal parameters
4. Fit the data
5. Hyperparameter Tuning (5-fold Cross Validation)
6. Metrics: ROC-AUC scores and Accuracy
7. Model Evaluation

**Classification Algorithms**

The following Machine Learning algorithms were used for the given binary classification problem:

1. Logistic Regression
2. Random Forest
3. Decision Tree
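The modeling steps and the three classifiers listed above can be sketched together with scikit-learn. This is an illustration under stated assumptions, not the notebook's code: the data is synthetic (generated to resemble the dataset's size and class ratio), the parameter grids are hypothetical because the report does not list the tuned parameters, and the stratified split is an added choice.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded loan data (614 rows, about 69% positives).
X, y = make_classification(
    n_samples=614, n_features=10, weights=[0.31, 0.69], random_state=42
)

# 70/30 split; stratify (an assumption) keeps the class ratio in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)

# Hypothetical search grids for each classifier.
models = {
    "Logistic Regression": (LogisticRegression(max_iter=1000), {"C": [0.01, 0.1, 1, 10]}),
    "Random Forest": (
        RandomForestClassifier(n_estimators=50, random_state=42),
        {"max_depth": [3, 5, None]},
    ),
    "Decision Tree": (DecisionTreeClassifier(random_state=42), {"max_depth": [3, 5, None]}),
}

results = {}
for name, (estimator, grid) in models.items():
    # 5-fold cross-validated grid search, scored by ROC-AUC.
    search = GridSearchCV(estimator, grid, cv=5, scoring="roc_auc")
    search.fit(X_train, y_train)
    # Evaluate the refitted best estimator on the held-out test set.
    proba = search.predict_proba(X_test)[:, 1]
    results[name] = {
        "roc_auc": roc_auc_score(y_test, proba),
        "accuracy": accuracy_score(y_test, search.predict(X_test)),
    }

for name, scores in results.items():
    print(f"{name}: ROC-AUC={scores['roc_auc']:.3f}, accuracy={scores['accuracy']:.3f}")
```

The scores printed here describe the synthetic data only and say nothing about how the models rank on the real loan dataset.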

**MODEL COMPARISON**

***ROC (Receiver Operating Characteristic)-AUC (Area Under Curve) scores for the dataset:***

ROC-AUC scores were used because of the imbalanced nature of the data. The score summarizes the ROC curve in a single number and is used to compare classifiers.

Although the ROC curve is a helpful diagnostic tool, it can be challenging to compare two or more classifiers based on their curves alone. Hence, the area under the curve (AUC) is calculated to give a single score for a classifier across all threshold values. This is called the ROC area under curve, ROC AUC or sometimes ROCAUC.
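To show how a single score arises from "all threshold values", the sketch below builds the ROC curve by hand for six hypothetical predictions and integrates it with the trapezoidal rule. The helper names `roc_points` and `auc` are made up for this illustration.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) points obtained by lowering the threshold over all distinct scores."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the piecewise-linear ROC curve (trapezoidal rule)."""
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical labels (1 = approved) and predicted probabilities.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

roc_auc = auc(roc_points(y_true, scores))
print(f"ROC-AUC: {roc_auc:.4f}")
```

Of the nine positive-negative pairs in this example, eight are ranked correctly, which matches the area of 8/9. In practice, `sklearn.metrics.roc_auc_score` computes the same quantity directly.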

|![image.png](attachment:image.png) | ![image-2.png](attachment:image-2.png) | ![image-3.png](attachment:image-3.png)
|-|-|-|

![image.png](attachment:image.png)

![image.png](attachment:image.png)

**Evaluation:** 

For the given dataset, the Logistic Regression model gave the best results compared to Random Forest and Decision Tree.

**Best Features:**

From our best-performing model, Logistic Regression, we found that the features 'Credit History', 'Total Income', 'DTI' and 'Education' are the most important in predicting the target variable, Loan Status.
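The report does not state how feature importance was derived. One common approach for logistic regression, shown here as an assumption, is to rank features by the absolute size of their coefficients after standardizing the inputs. The data below is synthetic and only borrows the feature names; the target is constructed so that the first column dominates.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
feature_names = ["Credit_History", "Total_Income", "DTI", "Education"]

# Synthetic features; the target is driven mostly by the first column by construction.
X = rng.normal(size=(500, 4))
y = (2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

# Standardize so that coefficient magnitudes are comparable across features.
X_scaled = StandardScaler().fit_transform(X)
model = LogisticRegression().fit(X_scaled, y)

# Rank features by absolute coefficient, largest first.
ranking = sorted(zip(feature_names, model.coef_[0]), key=lambda pair: abs(pair[1]), reverse=True)
for name, coef in ranking:
    print(f"{name}: {coef:+.3f}")
```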

**Correlation Heatmap**

From the heatmap, we can infer that:

1. The target variable, Loan Status, shows a positive correlation with the applicant's credit history, marital status, total income and property area.
2. Loan Status shows a negative correlation with DTI (Debt To Income) and Loan Amount.


![image.png](attachment:image.png)
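The correlations read from a heatmap like the one above come from a correlation matrix, which pandas computes with `DataFrame.corr()`. The sketch below uses a few hypothetical, already-encoded rows and extracts each feature's correlation with the target.

```python
import pandas as pd

# Hypothetical encoded rows (Loan_Status: 1 = approved, 0 = rejected).
df = pd.DataFrame({
    "Credit_History": [1, 1, 0, 1, 1, 1],
    "DTI": [0.5, 0.8, 2.5, 0.6, 3.0, 1.0],
    "Loan_Status": [1, 1, 0, 1, 0, 1],
})

# Pairwise Pearson correlations between all columns.
corr = df.corr()

# Correlation of each predictor with the target, strongest positive first.
target_corr = corr["Loan_Status"].drop("Loan_Status").sort_values(ascending=False)
print(target_corr)
```

Passing `corr` to `seaborn.heatmap(corr, annot=True)` would draw the matrix as a heatmap.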

**CONCLUSION**

Out of the three supervised classification models, Logistic Regression provided the best results compared to Random Forest and Decision Tree. Four new features (predictor variables) were added to the dataset through feature engineering for a more detailed data analysis.

Because the data is imbalanced and limited, accuracy alone may not be the most appropriate measure. With more data and further ideas, the model can be improved in the future.


**REFERENCES**

Kaggle, Google, GitHub