# Loan Approval Classification
A machine learning project to predict whether a loan application will be approved or not, based on applicant information.

## Project Overview
This project explores how classification works, with a focus on Logistic Regression. It trains a classification model on historical data to predict the likelihood of loan approval, and covers preprocessing, model selection, performance evaluation, and preparing the model for deployment.
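
As a rough illustration of what Logistic Regression does under the hood, the sketch below passes a weighted sum of features through the sigmoid function to get a probability. The weights, bias, and feature values are made up for the example and are not fitted to the loan data.

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights and one applicant's (already encoded) features;
# these numbers are illustrative, not learned from the dataset
weights = np.array([0.8, -0.5])
bias = 0.1
features = np.array([1.0, 2.0])

score = weights @ features + bias      # linear combination of the features
probability = sigmoid(score)           # estimated probability of class 1
prediction = int(probability >= 0.5)   # 0.5 is the usual default threshold

print(f"score={score:.2f}, probability={probability:.3f}, prediction={prediction}")
```

A fitted `LogisticRegression` learns the weights and bias from the training data; the thresholding step is what turns the probability into an approve/reject label.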


## Data Understanding
### Load Libraries and Dataset
We import key Python libraries and load the dataset into a Pandas DataFrame. We then inspect the structure, shape, and missing values.

In [None]:
# Import necessary libraries.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

In [19]:
# Fetch dataset directly from github.
url = "https://raw.githubusercontent.com/prasertcbs/basic-dataset/master/Loan-Approval-Prediction.csv"
df = pd.read_csv(url)

# Check the first few rows
df.head()


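|   | Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status |
| - | ------- | ------ | ------- | ---------- | --------- | ------------- | --------------- | ----------------- | ---------- | ---------------- | -------------- | ------------- | ----------- |
| 0 | LP001002 | Male | No | 0 | Graduate | No | 5849 | 0.0 | NaN | 360.0 | 1.0 | Urban | Y |
| 1 | LP001003 | Male | Yes | 1 | Graduate | No | 4583 | 1508.0 | 128.0 | 360.0 | 1.0 | Rural | N |
| 2 | LP001005 | Male | Yes | 0 | Graduate | Yes | 3000 | 0.0 | 66.0 | 360.0 | 1.0 | Urban | Y |
| 3 | LP001006 | Male | Yes | 0 | Not Graduate | No | 2583 | 2358.0 | 120.0 | 360.0 | 1.0 | Urban | Y |
| 4 | LP001008 | Male | No | 0 | Graduate | No | 6000 | 0.0 | 141.0 | 360.0 | 1.0 | Urban | Y |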


### Initial exploration

In [20]:
# Shape of the dataset
print("Rows, Columns:", df.shape)

Rows, Columns: (614, 13)


In [None]:
# Data types and non-null values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Loan_ID            614 non-null    object 
 1   Gender             601 non-null    object 
 2   Married            611 non-null    object 
 3   Dependents         599 non-null    object 
 4   Education          614 non-null    object 
 5   Self_Employed      582 non-null    object 
 6   ApplicantIncome    614 non-null    int64  
 7   CoapplicantIncome  614 non-null    float64
 8   LoanAmount         592 non-null    float64
 9   Loan_Amount_Term   600 non-null    float64
 10  Credit_History     564 non-null    float64
 11  Property_Area      614 non-null    object 
 12  Loan_Status        614 non-null    object 
dtypes: float64(4), int64(1), object(8)
memory usage: 62.5+ KB


In [None]:
# Summary stats for numerical columns
df.describe()

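|       | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History |
| ----- | --------------- | ----------------- | ---------- | ---------------- | -------------- |
| count | 614.0 | 614.0 | 592.0 | 600.0 | 564.0 |
| mean  | 5403.459283 | 1621.245798 | 146.412162 | 342.0 | 0.842199 |
| std   | 6109.041673 | 2926.248369 | 85.587325 | 65.12041 | 0.364878 |
| min   | 150.0 | 0.0 | 9.0 | 12.0 | 0.0 |
| 25%   | 2877.5 | 0.0 | 100.0 | 360.0 | 1.0 |
| 50%   | 3812.5 | 1188.5 | 128.0 | 360.0 | 1.0 |
| 75%   | 5795.0 | 2297.25 | 168.0 | 360.0 | 1.0 |
| max   | 81000.0 | 41667.0 | 700.0 | 480.0 | 1.0 |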


In [23]:
# Total and percent missing
missing = df.isnull().sum()
missing_percent = (missing / len(df)) * 100
missing_df = pd.DataFrame({'Missing Values': missing, 'Percent': missing_percent})
missing_df[missing_df['Missing Values'] > 0]


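|                  | Missing Values | Percent  |
| ---------------- | -------------- | -------- |
| Gender           | 13 | 2.117264 |
| Married          | 3  | 0.488599 |
| Dependents       | 15 | 2.442997 |
| Self_Employed    | 32 | 5.211726 |
| LoanAmount       | 22 | 3.583062 |
| Loan_Amount_Term | 14 | 2.28013  |
| Credit_History   | 50 | 8.143322 |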


In [None]:
# Check for duplicates
df.duplicated().sum()

0

### Data Understanding summary
Our initial exploration shows that seven columns contain missing values and that there are no duplicate rows. The next step handles the missing values as follows:

| Column                                               | Fix Strategy                                        |
| :--------------------------------------------------: | :-------------------------------------------------: |
| `Gender`, `Married`, `Dependents`, `Self_Employed`   | Fill with **mode** (most frequent)                  |
| `LoanAmount`                                         | Fill with **median** (less sensitive to outliers)   |
| `Loan_Amount_Term`                                   | Fill with **mode**                                  |
| `Credit_History`                                     | Fill with **mode** (it's binary: 1.0 or 0.0)        |
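
To see why the median is the safer fill value for a skewed column such as `LoanAmount`, here is a small sketch on toy numbers (illustrative values, not rows from the dataset). A single large loan pulls the mean far above the typical amount, while the median stays put.

```python
import pandas as pd

# Toy loan amounts with one outlier and one missing value (illustrative only)
amounts = pd.Series([100.0, 120.0, 128.0, 150.0, 700.0, None])

mean_value = amounts.mean()      # pulled upward by the 700 outlier
median_value = amounts.median()  # stays at the middle observation

# Fill the gap with the median, as planned for LoanAmount
filled = amounts.fillna(median_value)

print("mean:", mean_value)
print("median:", median_value)
print(filled.tolist())
```

Both `mean()` and `median()` skip missing values by default, so the statistics are computed from the five observed amounts only.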

## Data Cleaning and Preparation
### Dealing with missing values

In [25]:
# Fill categorical missing values with mode
# (assign back instead of chained inplace=True, which newer pandas versions warn about)
for col in ['Gender', 'Married', 'Dependents', 'Self_Employed']:
    df[col] = df[col].fillna(df[col].mode()[0])

# Fill numerical missing values
df['LoanAmount'] = df['LoanAmount'].fillna(df['LoanAmount'].median())
df['Loan_Amount_Term'] = df['Loan_Amount_Term'].fillna(df['Loan_Amount_Term'].mode()[0])
df['Credit_History'] = df['Credit_History'].fillna(df['Credit_History'].mode()[0])

# Check again to confirm no missing values
df.isnull().sum()


Loan_ID              0
Gender               0
Married              0
Dependents           0
Education            0
Self_Employed        0
ApplicantIncome      0
CoapplicantIncome    0
LoanAmount           0
Loan_Amount_Term     0
Credit_History       0
Property_Area        0
Loan_Status          0
dtype: int64

### Encoding categorical data
| Column          | Example Mapping       |
| --------------- | --------------------- |
| `Gender`        | Male = 1, Female = 0  |
| `Married`       | Yes = 1, No = 0       |
| `Education`     | Graduate = 1, Not Graduate = 0 |
| `Self_Employed` | Yes = 1, No = 0       |
| `Loan_Status`   | Y = 1, N = 0          |


We’ll also:

- Map `Dependents`: "0", "1", "2", "3+" → 0, 1, 2, 3

- Drop `Loan_ID` and `Property_Area`, which are not used as features

In [None]:
# Drop Loan_ID and Property_Area
df.drop(columns=['Loan_ID', 'Property_Area'], inplace=True)

# Encode binary categorical features
binary_map = {'Male': 1, 'Female': 0,
              'Yes': 1, 'No': 0,
              'Graduate': 1, 'Not Graduate': 0,
              'Y': 1, 'N': 0}

df.replace(binary_map, inplace=True)

# Convert 'Dependents' column ('3+' becomes 3); assign back rather than using chained inplace
df['Dependents'] = df['Dependents'].replace({'3+': '3'}).astype(int)
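
A frame-wide `replace` works here, but it applies every mapping to every column. As an alternative sketch, the encoding can be written with one explicit mapping per column using `Series.map`. The `sample` frame and the `column_maps` name below are stand-ins introduced for this example, not part of the notebook.

```python
import pandas as pd

# Small stand-in frame using the dataset's column names and categories
sample = pd.DataFrame({
    "Gender": ["Male", "Female"],
    "Married": ["Yes", "No"],
    "Education": ["Graduate", "Not Graduate"],
    "Self_Employed": ["No", "Yes"],
    "Loan_Status": ["Y", "N"],
})

# One explicit mapping per column (hypothetical helper dictionary)
column_maps = {
    "Gender": {"Male": 1, "Female": 0},
    "Married": {"Yes": 1, "No": 0},
    "Education": {"Graduate": 1, "Not Graduate": 0},
    "Self_Employed": {"Yes": 1, "No": 0},
    "Loan_Status": {"Y": 1, "N": 0},
}

for col, mapping in column_maps.items():
    # map() turns any value missing from the mapping into NaN,
    # so unexpected categories show up instead of passing through silently
    sample[col] = sample[col].map(mapping)

print(sample)
```

The trade-off is a little more code in exchange for mappings that cannot leak into unrelated columns.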


## Feature Selection & Model Building
### Feature selection and splitting data

In [27]:
# Target variable
y = df['Loan_Status']

# Drop target from features
X = df.drop('Loan_Status', axis=1)


In [None]:
# 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Confirm shape
print("Train shape:", X_train.shape)
print("Test shape:", X_test.shape)


Train shape: (491, 10)
Test shape: (123, 10)
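
Because the two classes are not equally common, a plain random split can give the train and test sets slightly different approval rates. One option, not used in this notebook, is the `stratify` argument of `train_test_split`. The sketch below uses toy labels (`X_toy`, `y_toy` are made-up names and data) to show the effect.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels: 70 approvals and 30 rejections (illustrative only)
y_toy = np.array([1] * 70 + [0] * 30)
X_toy = np.arange(100).reshape(-1, 1)

# stratify=y_toy keeps the class proportions the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print("Train approval rate:", y_tr.mean())
print("Test approval rate:", y_te.mean())
```

Applying this to the notebook would change the split and therefore the reported metrics, so the results below would need to be regenerated.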


### Model Building

In [None]:
# Initialize and train the model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)


LogisticRegression(max_iter=1000)

In [31]:
# Predict on the test set
y_pred = model.predict(X_test)

# Confusion matrix and classification report
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))


Confusion Matrix:
 [[18 25]
 [ 1 79]]

Classification Report:
               precision    recall  f1-score   support

           0       0.95      0.42      0.58        43
           1       0.76      0.99      0.86        80

    accuracy                           0.79       123
   macro avg       0.85      0.70      0.72       123
weighted avg       0.83      0.79      0.76       123



The model is biased toward approving loans. It rarely misses an actual approval (1 of 80), but it classifies many rejected applications as approved: 25 of the 43 rejections are false approvals, which is a potential credit risk.
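
The figures in the classification report follow directly from the confusion matrix. The short sketch below recomputes a few of them by hand, assuming scikit-learn's layout of rows as actual classes and columns as predicted classes.

```python
import numpy as np

# Confusion matrix reported above: rows = actual, columns = predicted
cm = np.array([[18, 25],
               [1, 79]])
tn, fp, fn, tp = cm.ravel()

recall_rejected = tn / (tn + fp)      # 18 / 43
recall_approved = tp / (tp + fn)      # 79 / 80
precision_approved = tp / (tp + fp)   # 79 / 104
accuracy = (tn + tp) / cm.sum()       # 97 / 123

print("Recall (rejected):   ", round(recall_rejected, 2))
print("Recall (approved):   ", round(recall_approved, 2))
print("Precision (approved):", round(precision_approved, 2))
print("Accuracy:            ", round(accuracy, 2))
```

The rounded values match the report: 0.42, 0.99, 0.76, and 0.79.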

Part of this bias comes from approved ("Y") loans being more common in the data. We’ll use `class_weight='balanced'`, which weights each class inversely to its frequency so that mistakes on the rarer class count for more during training.
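
For intuition about what `class_weight='balanced'` does, the sketch below computes the weights on toy labels with scikit-learn's `compute_class_weight` and checks them against the formula `n_samples / (n_classes * count_per_class)`. The 30/70 label counts are invented for the example and are not the training set's actual class counts.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels with a similar kind of imbalance (illustrative counts)
y_toy = np.array([0] * 30 + [1] * 70)

classes = np.unique(y_toy)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)

# Same result by hand: n_samples / (n_classes * count_per_class)
manual = len(y_toy) / (len(classes) * np.bincount(y_toy))

for cls, w in zip(classes, weights):
    print(f"class {cls}: weight {w:.3f}")
```

The rarer class gets the larger weight, so each misclassified rejection contributes more to the training loss than a misclassified approval.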

In [None]:
# Re-train model with class weights
model_balanced = LogisticRegression(max_iter=1000, class_weight='balanced')
model_balanced.fit(X_train, y_train)

# Predict again
y_pred_bal = model_balanced.predict(X_test)

# Evaluate again (classification_report and confusion_matrix are already imported above)

print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_bal))
print("\nClassification Report:\n", classification_report(y_test, y_pred_bal))


Confusion Matrix:
 [[21 22]
 [ 5 75]]

Classification Report:
               precision    recall  f1-score   support

           0       0.81      0.49      0.61        43
           1       0.77      0.94      0.85        80

    accuracy                           0.78       123
   macro avg       0.79      0.71      0.73       123
weighted avg       0.79      0.78      0.76       123



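The balanced model is less biased toward approvals: it now catches almost half of the bad applications (recall on class 0 rose from 42% to 49%). The trade-off is that it wrongly rejects more good applications (5 instead of 1), and overall accuracy dips slightly from 0.79 to 0.78.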

In [35]:
# Define models
models = {
    "Logistic Regression (Balanced)": LogisticRegression(max_iter=1000, class_weight='balanced'),
    "Random Forest": RandomForestClassifier(random_state=42),
    "SVM": SVC()
}

# Fit and evaluate (loop names chosen so they don't overwrite `model` and `y_pred` from earlier cells)
for name, clf in models.items():
    clf.fit(X_train, y_train)
    preds = clf.predict(X_test)
    print(f"\n{name} Results:")
    print("Confusion Matrix:\n", confusion_matrix(y_test, preds))
    print("Classification Report:\n", classification_report(y_test, preds))



Logistic Regression (Balanced) Results:
Confusion Matrix:
 [[21 22]
 [ 5 75]]
Classification Report:
               precision    recall  f1-score   support

           0       0.81      0.49      0.61        43
           1       0.77      0.94      0.85        80

    accuracy                           0.78       123
   macro avg       0.79      0.71      0.73       123
weighted avg       0.79      0.78      0.76       123


Random Forest Results:
Confusion Matrix:
 [[19 24]
 [ 3 77]]
Classification Report:
               precision    recall  f1-score   support

           0       0.86      0.44      0.58        43
           1       0.76      0.96      0.85        80

    accuracy                           0.78       123
   macro avg       0.81      0.70      0.72       123
weighted avg       0.80      0.78      0.76       123


SVM Results:
Confusion Matrix:
 [[ 0 43]
 [ 0 80]]
Classification Report:
               precision    recall  f1-score   support

           0       0.00   

  _warn_prf(average, modifier, msg_start, len(result))


In [38]:
model_scores = pd.DataFrame({
    "Model": ["Logistic Regression", "Random Forest", "SVM"],
    "Accuracy": [0.78, 0.78, 0.65],
    "F1 Score (Approved)": [0.85, 0.85, 0.79],
    "Recall (Approved)": [0.94, 0.96, 1.00],
    "Recall (Rejected)": [0.49, 0.44, 0.00]
})
model_scores


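|   | Model | Accuracy | F1 Score (Approved) | Recall (Approved) | Recall (Rejected) |
| - | ----- | -------- | ------------------- | ----------------- | ----------------- |
| 0 | Logistic Regression | 0.78 | 0.85 | 0.94 | 0.49 |
| 1 | Random Forest | 0.78 | 0.85 | 0.96 | 0.44 |
| 2 | SVM | 0.65 | 0.79 | 1.0 | 0.0 |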
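
The summary table above is typed in by hand. A less error-prone option is to compute each row from predictions, as in the sketch below. `score_row` is a hypothetical helper, and the demo labels are rebuilt from the balanced Logistic Regression confusion matrix (`[[21 22], [5 75]]`) purely so the example runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, recall_score

def score_row(name, y_true, y_pred):
    """Build one summary row from true and predicted labels (hypothetical helper)."""
    return {
        "Model": name,
        "Accuracy": round(accuracy_score(y_true, y_pred), 2),
        "F1 Score (Approved)": round(f1_score(y_true, y_pred, pos_label=1), 2),
        "Recall (Approved)": round(recall_score(y_true, y_pred, pos_label=1), 2),
        "Recall (Rejected)": round(recall_score(y_true, y_pred, pos_label=0), 2),
    }

# Demo labels: 43 actual rejections (21 caught), 80 actual approvals (75 caught)
y_true_demo = np.array([0] * 43 + [1] * 80)
y_pred_demo = np.array([0] * 21 + [1] * 22 + [0] * 5 + [1] * 75)

scores = pd.DataFrame([score_row("Logistic Regression", y_true_demo, y_pred_demo)])
print(scores)
```

In the notebook, the same helper could be called inside the model loop with `y_test` and each model's predictions, collecting the rows into a list before building the DataFrame.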


Observations:
- `SVM` failed: it predicted every application as class 1 (loan approved), so its recall on rejections is 0

- `Logistic Regression` was more balanced than `Random Forest` on class 0 (recall 0.49 vs. 0.44)

- `Random Forest` had the highest class 1 recall among the models that predicted both classes (0.96), with class 1 precision of 0.76
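
One plausible reason for the SVM result, which this notebook does not investigate, is that `SVC` with its default RBF kernel is sensitive to feature scale, and the income columns are orders of magnitude larger than the binary flags. The sketch below shows the usual remedy of standardizing inside a pipeline. It runs on synthetic data from `make_classification`, so it demonstrates the pattern only and says nothing about how the loan data would score.

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data; in the notebook this would be X_train / y_train
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=42)
X_demo[:, 0] *= 10_000  # exaggerate one feature's scale, like income vs. binary flags

# Standardize the features, then fit an RBF-kernel SVM with balanced class weights
svm_scaled = make_pipeline(StandardScaler(), SVC(class_weight="balanced"))
svm_scaled.fit(X_demo, y_demo)
preds_demo = svm_scaled.predict(X_demo)

print("Predicted classes:", sorted(set(preds_demo)))
```

Keeping the scaler inside the pipeline means it is fitted on the training data only and then reused unchanged at prediction time.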

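The final model we will be working with is `Random Forest` because:
- Among the models that predicted both classes, it was the best at recognizing eligible loan applications (class 1 recall of 0.96), which is the bank’s revenue driver

- Its rejections were more precise than Logistic Regression's (class 0 precision 0.86 vs. 0.81), although it caught fewer bad applications (class 0 recall 0.44 vs. 0.49)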

In [36]:
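import os
import joblib

# Save the best model (create the output folder first if it does not exist)
os.makedirs("Models", exist_ok=True)
joblib.dump(models["Random Forest"], "Models/random_forest_model.pkl")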


['Models/random_forest_model.pkl']
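
To use the saved model later, it can be read back with `joblib.load`. The sketch below round-trips a tiny stand-in model through a temporary folder so that it runs on its own. The random data and the `rf_demo` name are assumptions for the example; in practice the path would be `Models/random_forest_model.pkl` and the input would need the same 10 feature columns, in the same order, as `X_train`.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Tiny stand-in model trained on random data (illustrative only)
rng = np.random.default_rng(42)
X_demo = rng.random((50, 10))
y_demo = (X_demo[:, 0] > 0.5).astype(int)
rf_demo = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "random_forest_model.pkl")
    joblib.dump(rf_demo, path)    # save
    loaded = joblib.load(path)    # load it back

# The reloaded model should reproduce the original predictions
same_predictions = np.array_equal(loaded.predict(X_demo), rf_demo.predict(X_demo))
print("Predictions match after reload:", same_predictions)
```

Pickled scikit-learn models are best loaded with the same library version that saved them, so it helps to record the version alongside the file.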