<h1 style="text-align:center;">[ Loan Approval Prediction ]</h1>

### Problem Statement:

Automate the loan eligibility process in real time, based on the customer details provided when filling out the online application form. These details include Gender, Marital Status, Education, Number of Dependents, Income, Loan Amount, Credit History, and others.


#### The main aim of this notebook is to predict which customers will have their loan approved.

### Features of our data:

- Loan_ID = Unique loan ID<br>
- Gender = Male/Female<br>
- Married = Applicant married (Yes/No)<br>
- Dependents = Number of dependents (0, 1, 2, 3+)<br>
- Education = Applicant education (Graduate/Not Graduate)<br>
- Self_Employed = Self-employed (Yes/No)<br>
- ApplicantIncome = Applicant income<br>
- CoapplicantIncome = Coapplicant income<br>
- LoanAmount = Loan amount in thousands<br>
- Loan_Amount_Term = Term of the loan in months<br>
- Credit_History = Credit history<br>
- Property_Area = Urban/Semiurban/Rural<br>
- Loan_Status = (Target) Loan approved (Y/N)<br>

### Importing Modules

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')


### Loading the Dataset

In [3]:
data = pd.read_csv("LoanData.csv")
data.head()


In [None]:
data.info()

In [None]:
data.describe()

In [None]:
data.describe(include=object)

In [None]:
data['Loan_Status'].value_counts()

### Handling null values

In [None]:
data.isnull().sum()

In [None]:
# Two types of data

# Numerical
# Categorical

### Replacing null values with mode for object data type

In [None]:
data['Gender'] = data['Gender'].fillna(data['Gender'].mode()[0])
data['Married'] = data['Married'].fillna(data['Married'].mode()[0])
data['Dependents'] = data['Dependents'].fillna(data['Dependents'].mode()[0])
data['Self_Employed'] = data['Self_Employed'].fillna(data['Self_Employed'].mode()[0])

In [None]:
data.isnull().sum()

### Replacing null values with median for numeric data type

In [None]:
data['LoanAmount'] = data['LoanAmount'].fillna(data['LoanAmount'].median())
data['Loan_Amount_Term'] = data['Loan_Amount_Term'].fillna(data['Loan_Amount_Term'].median())
data['Credit_History'] = data['Credit_History'].fillna(data['Credit_History'].median())

In [None]:
data.isnull().sum()
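The two imputation steps above (mode for categorical columns, median for numeric columns) can also be written as a single loop over the columns. The following is a minimal sketch on a made-up toy frame, not the loan data itself; the helper name `impute` and the toy values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the loan data (values are invented for illustration).
toy = pd.DataFrame({
    "Gender": ["Male", None, "Male", "Female"],
    "LoanAmount": [100.0, np.nan, 120.0, 500.0],
})

def impute(df):
    """Hypothetical helper: median for numeric columns, mode for everything else."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            out[col] = out[col].fillna(out[col].mode()[0])
    return out

filled = impute(toy)
print(filled)
```

The median is less sensitive than the mean to extreme values such as the 500.0 above, which is one common reason to prefer it for skewed columns like loan amounts.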

### Outlier Detection & Handling

In [None]:
plt.style.use('fivethirtyeight')
plt.rcParams['figure.figsize'] = (15,5)

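# Pass the column as x= explicitly; recent seaborn versions treat the
# first positional argument as `data`, not as the x variable.
plt.subplot(1,3,1)
sns.boxplot(x=data['ApplicantIncome'])

plt.subplot(1,3,2)
sns.boxplot(x=data['CoapplicantIncome'])

plt.subplot(1,3,3)
sns.boxplot(x=data['LoanAmount'])

plt.suptitle('Outlier Detection')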
plt.show()

In [None]:
print("Before Removing the outliers", data.shape)

data = data[data['ApplicantIncome']<25000]

print("After Removing the outliers", data.shape)

In [None]:
print("Before Removing the outliers", data.shape)

data = data[data['CoapplicantIncome']<12000]

print("After Removing the outliers", data.shape)

In [None]:
print("Before Removing the outliers", data.shape)

data = data[data['LoanAmount']<400]

print("After Removing the outliers", data.shape)
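The cut-offs used above (25000, 12000, 400) are fixed thresholds read off the box plots. One alternative is to derive the bounds from the data with the interquartile-range (IQR) rule, which is the same rule a box plot uses to draw its whiskers. The sketch below uses an invented income series; `iqr_bounds` is a hypothetical helper and `k=1.5` is the conventional default, not a value taken from this notebook.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (low, high) bounds: Q1 - k*IQR and Q3 + k*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Invented incomes with one extreme value.
incomes = pd.Series([2500, 3000, 3500, 4000, 4500, 5000, 81000])

low, high = iqr_bounds(incomes)
kept = incomes[(incomes >= low) & (incomes <= high)]

print("bounds:", low, high)
print("rows kept:", len(kept), "of", len(incomes))
```

Whether a data-driven rule or a hand-picked threshold is more appropriate depends on how much of the tail should be treated as legitimate high-income applicants.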

### Analysis

In [None]:
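# sns.distplot is deprecated; histplot with a KDE overlay gives a similar view.
plt.subplot(1,3,1)
sns.histplot(x=data['ApplicantIncome'], color='green', kde=True, stat='density')

plt.subplot(1,3,2)
sns.histplot(x=data['CoapplicantIncome'], color='green', kde=True, stat='density')

plt.subplot(1,3,3)
sns.histplot(x=data['LoanAmount'], color='green', kde=True, stat='density')

plt.show()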

In [None]:
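# Log-transform the two right-skewed income columns.
# log1p (log(1 + x)) is used for CoapplicantIncome so that zero values stay finite.
data['ApplicantIncome'] = np.log(data['ApplicantIncome'])
data['CoapplicantIncome'] = np.log1p(data['CoapplicantIncome'])

# sns.distplot is deprecated; histplot with a KDE overlay gives a similar view.
plt.subplot(1,3,1)
sns.histplot(x=data['ApplicantIncome'], color='green', kde=True, stat='density')

plt.subplot(1,3,2)
sns.histplot(x=data['CoapplicantIncome'], color='green', kde=True, stat='density')

# LoanAmount is left on its original scale.
plt.subplot(1,3,3)
sns.histplot(x=data['LoanAmount'], color='green', kde=True, stat='density')

plt.suptitle('After Log Transformation Data')
plt.show()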
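The cell above uses `np.log` for one column and `np.log1p` for the other. A likely reason, assuming the coapplicant column contains zeros (applicants with no coapplicant), is that the plain logarithm of zero is negative infinity, while `log1p` maps zero to zero. A small sketch with invented values:

```python
import numpy as np
import pandas as pd

# Invented incomes; the first entry mimics an applicant with no coapplicant.
income = pd.Series([0.0, 1500.0, 6000.0])

# Silence the divide-by-zero warning so the -inf result can be inspected.
with np.errstate(divide="ignore"):
    plain_log = np.log(income)

shifted_log = np.log1p(income)  # log(1 + x), defined at x = 0

print(plain_log.tolist())
print(shifted_log.tolist())
```

An infinite value in a feature column would typically break the later modelling steps, so `log1p` is the safer choice whenever zeros are possible.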

### Analysis of Categorical Features Against the Target

In [None]:
categorical_col = data.select_dtypes(include='object').columns
# Skip the first object column (Loan_ID) and the last one (Loan_Status, the target).
cat = categorical_col[1:-1]

In [None]:
cat

In [None]:
fig, axes = plt.subplots(figsize=(20,15), nrows=2, ncols=3)        # 2x3 grid, one panel per categorical feature

for ax, column in zip(axes.flatten(), cat):                        # one count plot per feature, split by target
    sns.countplot(data=data, x=column, hue='Loan_Status', ax=ax)

### Cross-Tabulation of Categorical Features with the Target

In [None]:
data.columns

In [None]:
print(pd.crosstab(data['Loan_Status'],data['Married']))

In [None]:
print(pd.crosstab(data['Loan_Status'],data['Education']))

In [None]:
print(pd.crosstab(data['Loan_Status'],data['Property_Area']))

In [None]:
print(pd.crosstab(data['Loan_Status'],data['Self_Employed']))
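The cross-tabulations above print raw counts, which are hard to compare when the groups differ in size. Passing `normalize="columns"` to `pd.crosstab` turns each column into proportions, so the `Y` row reads directly as an approval rate per group. The sketch below uses an invented eight-row frame, not the loan data.

```python
import pandas as pd

# Invented rows: 4 married applicants (3 approved), 4 unmarried (2 approved).
toy = pd.DataFrame({
    "Married": ["Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No"],
    "Loan_Status": ["Y", "Y", "Y", "N", "Y", "Y", "N", "N"],
})

# Each column sums to 1, so the "Y" row is the approval rate for that group.
rates = pd.crosstab(toy["Loan_Status"], toy["Married"], normalize="columns")
print(rates)
```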

### Data Preparation

In [None]:
data.select_dtypes('object').head()

### Drop the Loan_ID column, which has no impact on the target

In [None]:
data = data.drop(['Loan_ID'], axis = 1)

In [None]:
data.select_dtypes('object').head()

In [None]:
# Encode the binary categorical columns as 1/0.
data['Gender'] = data['Gender'].replace({'Male': 1, 'Female': 0})
data['Married'] = data['Married'].replace({'Yes': 1, 'No': 0})
data['Education'] = data['Education'].replace({'Graduate': 1, 'Not Graduate': 0})

In [None]:
data.head()

In [None]:
data['Dependents'].value_counts()

In [None]:
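data['Self_Employed'] = data['Self_Employed'].replace({'Yes': 1, 'No': 0})
data['Loan_Status'] = data['Loan_Status'].replace({'Y': 1, 'N': 0})

# Property_Area is collapsed to two groups: Urban and Semiurban -> 1, Rural -> 0.
data['Property_Area'] = data['Property_Area'].replace({'Urban': 1, 'Semiurban': 1, 'Rural': 0})
# Dependents is collapsed to a flag: no dependents -> 0, one or more -> 1.
data['Dependents'] = data['Dependents'].replace({'0': 0, '1': 1, '2': 1, '3+': 1})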

In [None]:
data.head()

In [None]:
data.info()

### Checking unique values in our dataset for better understanding

In [None]:
data.nunique()

### Correlation of data

In [None]:
sns.heatmap(data.corr(),annot=True)
plt.show()

 * Credit History is highly correlated with our target.
 * Self Employed, Applicant Income, Coapplicant Income, Loan Amount, and Loan Amount Term are negatively correlated with the target.
 * Gender, Married, and Dependents are correlated with one another.
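Reading target correlations off a full heatmap can be error-prone. A compact alternative is to pull out only the target's column from the correlation matrix and sort it. The sketch below uses an invented six-row frame in which `Credit_History` matches the target exactly, so the numbers are illustrative only.

```python
import pandas as pd

# Invented data: Credit_History equals the target, LoanAmount moves against it.
toy = pd.DataFrame({
    "Credit_History": [1, 1, 0, 0, 1, 0],
    "LoanAmount": [100, 120, 300, 280, 110, 310],
    "Loan_Status": [1, 1, 0, 0, 1, 0],
})

# Correlation of every feature with the target, most negative first.
target_corr = toy.corr()["Loan_Status"].drop("Loan_Status").sort_values()
print(target_corr)
```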

### Splitting the Dataset

In [None]:
y = data['Loan_Status']
x = data.drop(['Loan_Status'], axis = 1)

In [None]:
x.shape

In [None]:
y.shape

### Handling Imbalanced Data

In [None]:
from imblearn.over_sampling import SMOTE

In [None]:
# Fix random_state so the synthetic samples are reproducible between runs.
x_resample, y_resample = SMOTE(random_state=10).fit_resample(x, y)

In [None]:
print(x_resample.shape)
print(y_resample.shape)

### Train Test Split

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x_resample, y_resample, test_size = 0.20, random_state = 10)

In [None]:
x_train.shape, x_test.shape

In [None]:
y_train.shape, y_test.shape
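In the cells above, the data is oversampled first and split afterwards. One possible concern with that order is that synthetic rows derived from what later becomes test data can end up in the training set, which may make the test accuracy look better than it is. A common alternative is to split first and resample only the training fold. The sketch below illustrates the pattern on invented data; it uses plain random oversampling from `sklearn.utils.resample` as a stand-in, not SMOTE. With imbalanced-learn installed, the same idea would be to call `SMOTE().fit_resample` on the training split only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Invented, imbalanced data: 70 approved (1) and 30 rejected (0).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([1] * 70 + [0] * 30)

# 1) Split first; stratify keeps the class ratio equal in both folds.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.20, random_state=10, stratify=y
)

# 2) Oversample the minority class in the training fold only.
majority = X_tr[y_tr == 1]
minority = X_tr[y_tr == 0]
minority_up = resample(
    minority, replace=True, n_samples=len(majority), random_state=10
)

X_bal = np.vstack([majority, minority_up])
y_bal = np.array([1] * len(majority) + [0] * len(minority_up))

print("balanced training set:", X_bal.shape)
print("untouched test set:", X_te.shape)
```

The test fold keeps the original class ratio, so the reported accuracy reflects the distribution the model would see on real applications.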

### Creating a model function to test multiple models and choose the best one

In [None]:
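def mymodel(model):
    """Fit `model` on the training split, print its accuracies, and return it."""
    model.fit(x_train, y_train)
    y_pred = model.predict(x_test)
    train_accuracy = model.score(x_train, y_train)
    test_accuracy = model.score(x_test, y_test)

    print("Model :-", str(model))

    # accuracy_score is imported in the next cell, before mymodel is first called.
    # For classifiers it matches model.score on the same split.
    print('\nModel Accuracy: ', accuracy_score(y_test, y_pred))
    print(f'\nTraining Accuracy: {train_accuracy} \nTesting Accuracy :{test_accuracy}')
    print('--------------------------------------')
    print()

    return model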

In [None]:
from sklearn.metrics import accuracy_score,confusion_matrix,classification_report

from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier

knn = mymodel(KNeighborsClassifier())
svc = mymodel(SVC())
dt = mymodel(DecisionTreeClassifier())
lr = mymodel(LogisticRegression())
gnb = mymodel(GaussianNB())
rfc = mymodel(RandomForestClassifier())
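The comparison above scores each model on a single train/test split, so the ranking can depend on which rows happened to land in the test set. A hedged alternative is k-fold cross-validation, which averages accuracy over several splits. The sketch below uses synthetic data from `make_classification` rather than the loan data, and wraps two of the classifiers in a pipeline with `StandardScaler`, since distance-based and gradient-based models are usually sensitive to feature scale. The dataset size, `cv=5`, and `max_iter=1000` are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the prepared feature matrix and target.
X, y = make_classification(n_samples=300, n_features=8, random_state=10)

# Scaling lives inside the pipeline, so it is refit on each training fold.
candidates = {
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "logreg": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
}

# Mean accuracy over 5 folds for each candidate.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in candidates.items()
}

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```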

### Model Building
#### Logistic regression is used for our final model because it gives good accuracy in the comparison above.

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
LR = LogisticRegression()

In [None]:
LR.fit(x_train,y_train)

In [None]:
y_pred = LR.predict(x_test)

In [None]:
print("Training Accuracy",LR.score(x_train,y_train))

In [None]:
print("Testing Accuracy",LR.score(x_test,y_test))

In [None]:
print(classification_report(y_test,y_pred))

In [None]:
print("Our Model Accuracy is", accuracy_score(y_test, y_pred))
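Accuracy alone does not show which kind of mistake the model makes. `confusion_matrix` is imported earlier in the notebook but not used; the sketch below shows how it separates the two error types on a small invented set of labels (1 = approved, 0 = rejected). In a lending setting, a false positive would be a predicted approval for an application that was actually rejected.

```python
from sklearn.metrics import confusion_matrix

# Invented labels for illustration only.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_hat = [1, 1, 0, 0, 1, 1, 0, 1]

# With labels=[0, 1], ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat, labels=[0, 1]).ravel()

precision = tp / (tp + fp)  # share of predicted approvals that were correct
recall = tp / (tp + fn)     # share of actual approvals that were found

print("tn, fp, fn, tp:", tn, fp, fn, tp)
print("precision:", precision, "recall:", recall)
```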

- **Categorical Analysis:**

  - **Loan Approval Rates:** Married individuals tend to have higher loan approval rates compared to unmarried applicants.

  - **Education Impact:** Graduates have higher chances of loan approval compared to non-graduates.

  - **Property Area Influence:** Applicants from semi-urban areas have higher chances of loan approval compared to urban and rural areas.

  - **Self-Employment Factor:** Self-employed applicants seem to have slightly lower loan approval rates than non-self-employed individuals.

- **Model Development:**

  - **Imbalanced Data:** SMOTE was used to handle the imbalance in the target variable ('Loan_Status') by oversampling the minority class.

  - **Model Comparison:** K-Nearest Neighbors, Support Vector Machine, Decision Tree, Logistic Regression, Naive Bayes, and Random Forest classifiers were compared. Logistic Regression emerged as the most effective model, with good accuracy on this dataset.

### Recommendations:

- **Further Feature Engineering:** Explore additional features or create new features from existing ones that might enhance the predictive power of the model.

- **Exploration of Other Algorithms:** Though Logistic Regression performed well, testing more complex algorithms or ensemble methods could potentially yield better performance.

- **Feature Importance:** Conduct feature importance analysis to identify key features driving loan approvals, which might help in focusing on crucial factors during applicant evaluations.

- **Data Collection & Quality:** Ensure ongoing data quality checks and consider expanding the dataset to improve model robustness and generalizability.

- **Deployment and Monitoring:** Once the model is ready, deploy it in a production environment and continuously monitor its performance for any drift or degradation in accuracy.

- **Regulatory Compliance:** Ensure the model complies with legal and regulatory frameworks governing the financial domain, especially in loan approval scenarios.

_By implementing these recommendations, the loan approval prediction model can be enhanced for more accurate and reliable real-time predictions._


*Thanks!*