I have a number of clients who want to know whether their loan applications will be approved, based on applicant and loan characteristics. I have been provided with a mix of categorical and numerical features that are relevant to loan approval decisions. This data can be used to train a classification model that predicts loan approval status. Using features such as Gender, Married, Dependents, Self_Employed, LoanAmount, Loan_Amount_Term, and Credit_History, I will predict whether each client's loan application will be approved.


In [None]:

import pandas as pd

# Load the dataset
file_path = './loan_data_set.csv'
data = pd.read_csv(file_path)

# Fill missing values: mode for categorical-like columns, median for LoanAmount.
# Assigning back avoids the chained-assignment warning that recent pandas
# versions emit for column-level fillna(..., inplace=True).
for col in ['Gender', 'Married', 'Dependents', 'Self_Employed',
            'Loan_Amount_Term', 'Credit_History']:
    data[col] = data[col].fillna(data[col].mode()[0])
data['LoanAmount'] = data['LoanAmount'].fillna(data['LoanAmount'].median())

# Encode categorical variables
data = pd.get_dummies(data, columns=['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed', 'Property_Area'], drop_first=True)

# Map target variable 'Loan_Status' to binary values
data['Loan_Status'] = data['Loan_Status'].map({'Y': 1, 'N': 0})

data.head()

Unnamed: 0,Loan_ID,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Loan_Status,Gender_Male,Married_Yes,Dependents_1,Dependents_2,Dependents_3+,Education_Not Graduate,Self_Employed_Yes,Property_Area_Semiurban,Property_Area_Urban
0,LP001002,5849,0.0,128.0,360.0,1.0,1,1,0,0,0,0,0,0,0,1
1,LP001003,4583,1508.0,128.0,360.0,1.0,0,1,1,1,0,0,0,0,0,0
2,LP001005,3000,0.0,66.0,360.0,1.0,1,1,1,0,0,0,0,1,0,1
3,LP001006,2583,2358.0,120.0,360.0,1.0,1,1,1,0,0,0,1,0,0,1
4,LP001008,6000,0.0,141.0,360.0,1.0,1,1,0,0,0,0,0,0,0,1


Next, let's split the dataset into train and test sets so that we can train a model and evaluate it on data it has not seen.



In [None]:
from sklearn.model_selection import train_test_split

# X contains the features used for training the model, and y contains the target variable
X = data.drop(columns=['Loan_ID', 'Loan_Status'])
y = data['Loan_Status']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
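
The split above is purely random. Because approvals and rejections are not equally frequent, an optional variation is to pass `stratify=y` so that both splits keep the same class proportions. The sketch below uses hypothetical stand-in labels (`y_demo`, `X_demo`), not the loan data, only to show the effect of the parameter.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in labels (assumption): 150 rejections (0), 350 approvals (1)
y_demo = np.array([0] * 150 + [1] * 350)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature column

# stratify=y_demo keeps the class proportions the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

train_rate = y_tr.mean()  # share of approvals in the training split
test_rate = y_te.mean()   # share of approvals in the test split
print(f"approval rate in train: {train_rate:.3f}, in test: {test_rate:.3f}")
```

In the notebook itself, the equivalent change would be adding `stratify=y` to the existing `train_test_split` call; the reported metrics would then change, since the test rows would differ.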

Now that we have a train and test set, we can use the train set to train a model. To help my clients, we will use a Logistic Regression model for classification, because the target variable is binary and the model is simple.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)

accuracy, conf_matrix, class_report

(0.7886178861788617,
 array([[18, 25],
        [ 1, 79]]),
 '              precision    recall  f1-score   support\n\n           0       0.95      0.42      0.58        43\n           1       0.76      0.99      0.86        80\n\n    accuracy                           0.79       123\n   macro avg       0.85      0.70      0.72       123\nweighted avg       0.83      0.79      0.76       123\n')

- **Precision**: 95% for class 0 (not approved), 76% for class 1 (approved)
- **Recall**: 42% for class 0, 99% for class 1
- **F1-score**: 58% for class 0, 86% for class 1

The model performs well in predicting loan approvals (class 1) but struggles with loan rejections (class 0): it identifies only 18 of the 43 actual rejections in the test set. This gap between the two classes suggests a need for further improvement.
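
As a check on the figures above, the per-class precision and recall can be recomputed by hand from the confusion matrix in the output. This is a small sketch; the variable names are illustrative, and it relies on scikit-learn's convention that rows are true classes and columns are predicted classes.

```python
import numpy as np

# Confusion matrix from the output above: rows = true class, columns = predicted class
conf_matrix = np.array([[18, 25],
                        [1, 79]])

# With class 1 (approved) treated as the positive class
tn, fp, fn, tp = conf_matrix.ravel()

precision_1 = tp / (tp + fp)  # of loans predicted approved, share truly approved
recall_1 = tp / (tp + fn)     # of truly approved loans, share predicted approved
precision_0 = tn / (tn + fn)  # of loans predicted rejected, share truly rejected
recall_0 = tn / (tn + fp)     # of truly rejected loans, share predicted rejected
accuracy = (tp + tn) / conf_matrix.sum()

print(f"class 0: precision={precision_0:.2f}, recall={recall_0:.2f}")
print(f"class 1: precision={precision_1:.2f}, recall={recall_1:.2f}")
print(f"accuracy={accuracy:.4f}")
```

The 25 in the top-right cell drives both weak numbers: those are rejected loans predicted as approved, which lowers recall for class 0 and precision for class 1 at the same time.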



Interpretation: The model identifies loan approvals (class 1) with high recall (0.99), catching 79 of the 80 approved loans in the test set. However, its precision for class 1 is lower (0.76), because 25 loans that were actually not approved were predicted as approved (false positives).
For non-approved loans (class 0), the model has high precision (0.95) but low recall (0.42), meaning it misses 25 of the 43 actual non-approved loans.
The overall accuracy of 78.86% suggests that the model is reasonably effective, but the gap between precision and recall for class 0 leaves room for improvement, possibly through techniques such as balancing the dataset, tuning the model, or exploring different algorithms.
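
One of the options mentioned above, balancing the classes, can be tried without resampling by passing `class_weight="balanced"` to `LogisticRegression`. The sketch below is not run on the loan data: it uses a synthetic dataset from `make_classification` (an assumption, with roughly 30% class 0), and the helper `class0_recall` is a hypothetical name introduced for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the loan data (assumption): about 30% class 0, 70% class 1
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=5,
    weights=[0.3, 0.7], random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

def class0_recall(class_weight):
    """Fit a logistic regression and return its test-set recall on class 0."""
    clf = LogisticRegression(max_iter=1000, class_weight=class_weight)
    clf.fit(X_tr, y_tr)
    return recall_score(y_te, clf.predict(X_te), pos_label=0)

recall_default = class0_recall(None)         # same weighting as the model above
recall_balanced = class0_recall("balanced")  # minority class errors weigh more
print(f"class 0 recall, default weights:  {recall_default:.2f}")
print(f"class 0 recall, balanced weights: {recall_balanced:.2f}")
```

Balanced weights usually trade some class 1 recall for better class 0 recall, so whether this helps on the loan data depends on which kind of error matters more to the clients.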