**Objective**

The objective of this project is to build a machine learning model that uses customer data to predict whether a customer is eligible for a loan.

**Dataset Description**

This dataset contains the customer information described below. It can be used for data analysis and machine learning tasks such as predicting loan approval, which is the task in this project.

Loan_ID - A unique identifier for each loan application. Used for tracking purposes only.

Gender - Gender of the applicant: Male or Female

Married - Marital status: Yes or No

Dependents - Number of people who depend on the applicant financially (0, 1, 2, or 3+)

Education - Education level: Graduate or Not Graduate

Self_Employed - Whether the applicant is self-employed: Yes or No

ApplicantIncome - Total monthly income of the main applicant 

CoapplicantIncome - Monthly income of a co-applicant (e.g., spouse or partner)

LoanAmount - Loan amount applied for

Loan_Amount_Term - The duration (in months) over which the loan will be repaid

Credit_History - Whether the applicant has a good credit history (1.0 = good, 0.0 = bad/no history)

Property_Area - Area where the property is located: Urban, Semiurban, or Rural

Loan_Status - Target variable (Y: Approved, N: Not Approved)


**Importing Libraries**

In [6]:
import pandas as pd                  # For data manipulation and analysis using DataFrames (e.g., loading CSVs, filtering, grouping)
import numpy as np                   # For numerical operations, and mathematical functions used in data preprocessing
import seaborn as sns                # For advanced, beautiful statistical visualizations (like heatmaps, boxplots, etc.)
import matplotlib.pyplot as plt      # For creating plots and graphs (line charts, bar charts, confusion matrix visualization)
import joblib                        # For saving and loading trained models
from sklearn.model_selection import train_test_split   # For splitting the dataset into training and testing sets
from sklearn.preprocessing import LabelEncoder         # For converting categorical text labels into numerical values
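from sklearn.ensemble import RandomForestClassifier      # Ensemble of decision trees trained on bootstrap samples
from sklearn.ensemble import GradientBoostingClassifier  # Ensemble of trees built sequentially, each correcting the previous ones
from sklearn.linear_model import LogisticRegression      # Linear model for binary classification
from sklearn.preprocessing import StandardScaler         # For standardizing features to mean 0 and standard deviation 1
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix  # For evaluating model performance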

# accuracy_score - Measures the percentage of correct predictions. 
# classification_report - Provides precision, recall, f1-score, and support for each class
# confusion_matrix - Shows the actual vs predicted classifications to understand model performance

The confusion matrix for a binary problem is laid out as follows (rows are actual classes, columns are predicted classes):

TN	FP

FN	TP

TP (True Positive):
The values that were **Positive** and that the model correctly predicted as **Positive**.

TN (True Negative):
The values that were **Negative** and that the model correctly predicted as **Negative**.

FP (False Positive):
The values that were **Negative** but that the model incorrectly predicted as **Positive**.

FN (False Negative):
The values that were **Positive** but that the model incorrectly predicted as **Negative**.


**Precision** - Of all the cases predicted as positive, how many were actually positive?

Formula:
Precision = TP / (TP + FP)


**Recall** - Of all the actual positive cases, how many did the model correctly predict?

Formula:
Recall = TP / (TP + FN)

**F1-Score** - The harmonic mean of precision and recall, which gives a single balanced score.

Formula:
F1 = 2 * (Precision * Recall) / (Precision + Recall)
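
The three formulas above can be checked by hand with a few lines of Python. This is a minimal sketch: the function name `precision_recall_f1` and the counts are hypothetical and chosen only for illustration, they are not results from this project.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts, chosen only for illustration
tp, fp, fn, tn = 40, 10, 20, 30

precision, recall, f1 = precision_recall_f1(tp, fp, fn)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"Accuracy:  {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall:    {recall:.2f}")
print(f"F1-score:  {f1:.2f}")
```

The guards against division by zero cover the case where the model never predicts the positive class.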


**Loading dataset**

In [9]:
df = pd.read_csv('customer_loan.csv')  # Change to your actual path
df

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y
...,...,...,...,...,...,...,...,...,...,...,...,...,...
495,LP002586,Female,Yes,1,Graduate,No,3326,913.0,105.0,84.0,1.0,Semiurban,Y
496,LP002587,Male,Yes,0,Not Graduate,No,2600,1700.0,107.0,360.0,1.0,Rural,Y
497,LP002588,Male,Yes,0,Graduate,No,4625,2857.0,111.0,12.0,,Urban,Y
498,LP002600,Male,Yes,1,Graduate,Yes,2895,0.0,95.0,360.0,1.0,Semiurban,Y


**Exploring dataset**

In [13]:
df.shape

(500, 13)

This shows that our data has 500 rows and 13 columns

In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Loan_ID            500 non-null    object 
 1   Gender             491 non-null    object 
 2   Married            497 non-null    object 
 3   Dependents         488 non-null    object 
 4   Education          500 non-null    object 
 5   Self_Employed      473 non-null    object 
 6   ApplicantIncome    500 non-null    int64  
 7   CoapplicantIncome  500 non-null    float64
 8   LoanAmount         482 non-null    float64
 9   Loan_Amount_Term   486 non-null    float64
 10  Credit_History     459 non-null    float64
 11  Property_Area      500 non-null    object 
 12  Loan_Status        500 non-null    object 
dtypes: float64(4), int64(1), object(8)
memory usage: 50.9+ KB


In [18]:
df.select_dtypes(include='number').columns

Index(['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount',
       'Loan_Amount_Term', 'Credit_History'],
      dtype='object')

In [20]:
df.select_dtypes(include='object').columns

Index(['Loan_ID', 'Gender', 'Married', 'Dependents', 'Education',
       'Self_Employed', 'Property_Area', 'Loan_Status'],
      dtype='object')

In [22]:
df.isnull().sum()

Loan_ID               0
Gender                9
Married               3
Dependents           12
Education             0
Self_Employed        27
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           18
Loan_Amount_Term     14
Credit_History       41
Property_Area         0
Loan_Status           0
dtype: int64

In [24]:
df['Loan_Status'].value_counts()

Loan_Status
Y    345
N    155
Name: count, dtype: int64
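
The counts above show that the target is imbalanced. A small sketch, using only the two counts printed above, turns them into class shares and into the accuracy that a trivial "always predict the majority class" rule would reach on the full dataset. That number is a useful reference point when reading model accuracies later.

```python
# Class counts copied from the value_counts() output above
counts = {"Y": 345, "N": 155}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Accuracy of a rule that always predicts the most frequent class
majority_label = max(counts, key=counts.get)
baseline_accuracy = shares[majority_label]

print("Class shares:", shares)
print(f"Always predicting {majority_label} gives accuracy {baseline_accuracy:.2f}")
```

On the real DataFrame the same shares can be obtained with `df['Loan_Status'].value_counts(normalize=True)`.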

**Observations**

1. There are missing values: 9 in Gender, 3 in Married, 12 in Dependents, 27 in Self_Employed, 18 in LoanAmount, 14 in Loan_Amount_Term, and 41 in Credit_History. We will check the value counts of each of these columns to decide how to fill them.
2. The Dependents column has dtype object although it holds counts, so it should be converted to a numeric type.
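
To judge how serious each gap is, the raw counts can be expressed as a percentage of the 500 rows. The sketch below rebuilds the counts from the `df.isnull().sum()` output by hand so that it runs on its own; the variable names are illustrative.

```python
import pandas as pd

# Missing-value counts copied from the df.isnull().sum() output above
missing = pd.Series({
    "Gender": 9,
    "Married": 3,
    "Dependents": 12,
    "Self_Employed": 27,
    "LoanAmount": 18,
    "Loan_Amount_Term": 14,
    "Credit_History": 41,
})
n_rows = 500  # from df.shape

# Share of rows with a missing value, largest first
missing_pct = (missing / n_rows * 100).sort_values(ascending=False)
print(missing_pct.round(1))
```

On the loaded DataFrame, `df.isnull().mean() * 100` gives the same percentages directly.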


**Data cleaning**

In [28]:
selected_columns = ['Gender', 'Married', 'Dependents', 'Self_Employed', 'LoanAmount', 'Loan_Amount_Term', 'Credit_History']

for col in selected_columns:
    print(f"Value counts for column '{col}':")
    print(df[col].value_counts(dropna=False))
    print("-" * 30)

Value counts for column 'Gender':
Gender
Male      400
Female     91
NaN         9
Name: count, dtype: int64
------------------------------
Value counts for column 'Married':
Married
Yes    322
No     175
NaN      3
Name: count, dtype: int64
------------------------------
Value counts for column 'Dependents':
Dependents
0      288
1       81
2       79
3+      40
NaN     12
Name: count, dtype: int64
------------------------------
Value counts for column 'Self_Employed':
Self_Employed
No     407
Yes     66
NaN     27
Name: count, dtype: int64
------------------------------
Value counts for column 'LoanAmount':
LoanAmount
120.0    19
NaN      18
160.0    12
110.0    12
100.0    11
         ..
58.0      1
260.0     1
73.0      1
495.0     1
209.0     1
Name: count, Length: 180, dtype: int64
------------------------------
Value counts for column 'Loan_Amount_Term':
Loan_Amount_Term
360.0    415
180.0     35
NaN       14
300.0     12
480.0     12
120.0      3
240.0      3
60.0       2
84.0 

Based on these value counts, we will fill the missing values in each column with a low-frequency value from that column, as follows:

Gender with Female

Married with No

Dependents with 3.0

Self_Employed with Yes

LoanAmount with 209

Loan_Amount_Term with 12.0

Credit_History with 0.0

In [31]:
df['Gender'] = df['Gender'].fillna('Female')
df['Married'] = df['Married'].fillna('No')
df['Dependents'] = df['Dependents'].fillna('3.0')
df['Self_Employed'] = df['Self_Employed'].fillna('Yes')
df['LoanAmount'] = df['LoanAmount'].fillna('209')
df['Loan_Amount_Term'] = df['Loan_Amount_Term'].fillna('12.0')
df['Credit_History'] = df['Credit_History'].fillna('0.0')
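
Note that the cell above passes strings such as '209' to `fillna` on numeric columns, which is why those columns show up as `object` in the dtype listing further down and have to be converted back. As a point of comparison, the sketch below shows a different, commonly used strategy (mode for categorical columns, median for numeric ones) that keeps numeric columns numeric. It is not the method used in this notebook, and the toy data is made up.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame standing in for the loan data
toy = pd.DataFrame({
    "Gender": ["Male", "Male", None, "Female"],
    "LoanAmount": [128.0, np.nan, 66.0, 120.0],
})

# Categorical column: fill with the most frequent value (the mode)
toy["Gender"] = toy["Gender"].fillna(toy["Gender"].mode()[0])

# Numeric column: fill with a number (here the median), not a string
toy["LoanAmount"] = toy["LoanAmount"].fillna(toy["LoanAmount"].median())

print(toy)
print(toy.dtypes)
```

Because the fill value is a float, `LoanAmount` stays `float64` and no later `pd.to_numeric` step is needed for it.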

In [33]:
df.isnull().any()

Loan_ID              False
Gender               False
Married              False
Dependents           False
Education            False
Self_Employed        False
ApplicantIncome      False
CoapplicantIncome    False
LoanAmount           False
Loan_Amount_Term     False
Credit_History       False
Property_Area        False
Loan_Status          False
dtype: bool

In [35]:
df.dtypes

Loan_ID               object
Gender                object
Married               object
Dependents            object
Education             object
Self_Employed         object
ApplicantIncome        int64
CoapplicantIncome    float64
LoanAmount            object
Loan_Amount_Term      object
Credit_History        object
Property_Area         object
Loan_Status           object
dtype: object

We need to convert four columns to numeric: Dependents, LoanAmount, Loan_Amount_Term, and Credit_History. The Dependents column contains the non-numeric value 3+, so we have to replace it before converting.

In [38]:
df['Dependents'].value_counts()

Dependents
0      288
1       81
2       79
3+      40
3.0     12
Name: count, dtype: int64

In [40]:
df['Dependents'] = df['Dependents'].replace({
    '3+': '3',
    '3.0': '3'
})

In [42]:
cols_to_convert = ['Dependents', 'LoanAmount', 'Loan_Amount_Term', 'Credit_History']

for col in cols_to_convert:
    df[col] = pd.to_numeric(df[col])


In [44]:
df.dtypes

Loan_ID               object
Gender                object
Married               object
Dependents             int64
Education             object
Self_Employed         object
ApplicantIncome        int64
CoapplicantIncome    float64
LoanAmount           float64
Loan_Amount_Term     float64
Credit_History       float64
Property_Area         object
Loan_Status           object
dtype: object

So we have now successfully converted the 4 columns to Numeric.
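
When it is not obvious which entries block a numeric conversion, `pd.to_numeric` with `errors="coerce"` can be used to find them first. The sketch below uses a made-up Series that mirrors the kinds of entries seen in Dependents.

```python
import pandas as pd

# Made-up values mirroring the kinds of entries seen in Dependents
dependents = pd.Series(["0", "1", "2", "3+", "3.0"])

# errors="coerce" turns unparseable entries into NaN instead of raising
parsed = pd.to_numeric(dependents, errors="coerce")
unparseable = dependents[parsed.isna()].unique().tolist()
print("Entries that do not parse:", unparseable)

# After the replacement used in this notebook, every value parses
cleaned = pd.to_numeric(dependents.replace({"3+": "3", "3.0": "3"}))
print(cleaned.tolist(), cleaned.dtype)
```

Mapping '3+' to 3 treats "three or more" as exactly three, which is a simplification worth keeping in mind when interpreting the feature.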

**Feature Engineering**

In [48]:
# Encode target variable
df['Loan_Status'] = df['Loan_Status'].map({'Y': 1, 'N': 0})

# Drop Loan_ID because it's not useful in modeling
df.drop('Loan_ID', axis=1, inplace=True)

# Convert categorical features to numeric
categorical_cols = ['Gender', 'Married', 'Education', 'Self_Employed', 'Property_Area', 'Dependents']

# Label Encoding
le = LabelEncoder()
for col in categorical_cols:
    df[col] = le.fit_transform(df[col])

df.head()

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,1,0,0,0,0,5849,0.0,209.0,360.0,1.0,2,1
1,1,1,1,0,0,4583,1508.0,128.0,360.0,1.0,0,0
2,1,1,0,0,1,3000,0.0,66.0,360.0,1.0,2,1
3,1,1,0,1,0,2583,2358.0,120.0,360.0,1.0,2,1
4,1,0,0,0,0,6000,0.0,141.0,360.0,1.0,2,1
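
The loop above reuses a single `LabelEncoder`, so after it finishes `le` only remembers the classes of the last column it was fitted on. If new data has to be encoded the same way later, one option is to keep a separate fitted encoder per column. The sketch below illustrates this on made-up rows; the `encoders` dictionary is a hypothetical name, not part of the notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up rows using category values that appear in the dataset
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Property_Area": ["Urban", "Rural", "Semiurban"],
})

encoders = {}  # hypothetical name: one fitted encoder per column
for col in toy.columns:
    enc = LabelEncoder()
    toy[col] = enc.fit_transform(toy[col])
    encoders[col] = enc

# classes_ lists the original labels in the order of their integer codes
for col, enc in encoders.items():
    mapping = {str(label): code for code, label in enumerate(enc.classes_)}
    print(col, mapping)
```

`LabelEncoder` assigns codes in sorted order of the labels, which matches the encoded table above (for example Male becomes 1 and Urban becomes 2).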


In [54]:
# Features and Target
X = df.drop('Loan_Status', axis=1)
y = df['Loan_Status']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 
# random_state=42 fixes the seed of the random shuffle, so the same split is produced on every run
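
Since the target classes are not equally frequent, it can help to keep the class ratio identical in the training and test sets. `train_test_split` supports this through its `stratify` parameter. The sketch below uses synthetic data and is an optional variation, not the split used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 70 positives and 30 negatives
X = np.arange(200).reshape(100, 2)
y = np.array([1] * 70 + [0] * 30)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Both shares should equal the overall positive share of 0.7
print("Positive share in train:", y_train.mean())
print("Positive share in test: ", y_test.mean())
```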

**Model training**


We are training three models: Random Forest, Gradient Boosting, and Logistic Regression.

In [58]:
# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
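# StandardScaler standardizes each feature to mean = 0 and standard deviation = 1 (it does not map values into the range 0 to 1).
# The scaler is fitted on the training set only and then applied to the test set. Scaling mainly helps models such as
# Logistic Regression; tree-based models do not need it.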
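
A quick way to confirm what the scaler does is to inspect the mean and standard deviation of the scaled training data. The sketch below uses synthetic features on very different scales as a stand-in for the loan columns.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (income-like and term-like)
rng = np.random.default_rng(0)
X_train = np.column_stack([rng.normal(5000, 2000, 200), rng.normal(360, 60, 200)])
X_test = np.column_stack([rng.normal(5000, 2000, 50), rng.normal(360, 60, 50)])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training data
X_test_scaled = scaler.transform(X_test)        # reuse the training mean/std

print("Train means:", X_train_scaled.mean(axis=0).round(6))  # approximately 0
print("Train stds: ", X_train_scaled.std(axis=0).round(6))   # approximately 1
print("Test means: ", X_test_scaled.mean(axis=0).round(2))   # close to 0, not exact
```

The test set is only approximately centered because it is transformed with the statistics of the training set, which is the intended behavior.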

In [64]:

rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
gb_model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
logreg_model = LogisticRegression(max_iter=100)

# Train models
rf_model.fit(X_train, y_train)
gb_model.fit(X_train, y_train)
logreg_model.fit(X_train_scaled, y_train)
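
Keeping track of which model expects scaled input and which expects raw input is easy to get wrong. One option is to wrap the scaler and the Logistic Regression model in a scikit-learn pipeline, so that scaling happens automatically in both `fit` and `predict`. This is a sketch on synthetic data (generated with `make_classification`); the name `logreg_pipeline` and the choice of `max_iter=1000` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the loan features and target
X, y = make_classification(n_samples=500, n_features=11, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The pipeline scales inside fit() and predict(), so raw features go in
logreg_pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
logreg_pipeline.fit(X_train, y_train)

y_pred = logreg_pipeline.predict(X_test)  # no manual scaling needed
print("Accuracy on synthetic data:", (y_pred == y_test).mean())
```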


**Model evaluation**


In [67]:
models = {
    'Random Forest': rf_model,
    'Gradient Boosting': gb_model,
    'Logistic Regression': logreg_model
}

for name, model in models.items():
    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    print(f"\n{name} Results")
    print("Accuracy:", acc)
    print("Classification Report:\n", classification_report(y_test, y_pred))


Random Forest Results
Accuracy: 0.74
Classification Report:
               precision    recall  f1-score   support

           0       0.59      0.55      0.57        31
           1       0.80      0.83      0.81        69

    accuracy                           0.74       100
   macro avg       0.69      0.69      0.69       100
weighted avg       0.74      0.74      0.74       100


Gradient Boosting Results
Accuracy: 0.72
Classification Report:
               precision    recall  f1-score   support

           0       0.56      0.45      0.50        31
           1       0.77      0.84      0.81        69

    accuracy                           0.72       100
   macro avg       0.67      0.65      0.65       100
weighted avg       0.71      0.72      0.71       100


Logistic Regression Results
Accuracy: 0.29
Classification Report:
               precision    recall  f1-score   support

           0       0.28      0.81      0.41        31
           1       0.40      0.06      0.



Random Forest achieved 74% accuracy. It predicts approved loans (1) with precision 0.80 and recall 0.83, but is noticeably weaker on rejected loans (0), with precision 0.59 and recall 0.55. Overall, it is the most balanced of the three models.

Gradient Boosting achieved 72% accuracy. It predicts approved loans (1) well (recall 0.84) but underperforms on rejected loans (0), where recall drops to 0.45. It is slightly less balanced than Random Forest.

Logistic Regression performed poorly, with only 29% accuracy. It misclassified most approved loans: recall for class 1 was only 0.06, while recall for class 0 was 0.81. Note that this model was trained on the scaled features (X_train_scaled), but the evaluation loop calls predict on the unscaled X_test, so this score most likely reflects that mismatch rather than the model itself. It should be re-evaluated on X_test_scaled before drawing conclusions.
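
Because logreg_model was fitted on X_train_scaled, it needs X_test_scaled at prediction time, whereas the tree models use X_test. A minimal sketch of an evaluation loop that pairs each model with its matching test matrix is shown below. It runs on synthetic data, so its numbers say nothing about the loan dataset, and the `eval_sets` dictionary is a hypothetical name.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, shifted and stretched so that scaling matters
X, y = make_classification(n_samples=500, n_features=11, random_state=42)
X = X * 1000 + 5000
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

rf_model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)
logreg_model = LogisticRegression(max_iter=1000).fit(X_train_scaled, y_train)

# Pair each model with the test matrix that matches its training input
eval_sets = {
    "Random Forest": (rf_model, X_test),
    "Logistic Regression": (logreg_model, X_test_scaled),
}

results = {}
for name, (model, X_eval) in eval_sets.items():
    results[name] = accuracy_score(y_test, model.predict(X_eval))
    print(f"{name}: {results[name]:.2f}")

# For comparison: scaled training combined with unscaled prediction
mismatch_acc = accuracy_score(y_test, logreg_model.predict(X_test))
print(f"Logistic Regression on unscaled X_test: {mismatch_acc:.2f}")
```

The last two lines reproduce the mismatch on purpose, so the effect of feeding unscaled data to a model trained on scaled data can be compared directly.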

**Saving model**

In [256]:
#joblib.dump(rf_model, 'rf_model.pkl')
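
The commented-out line above saves the model with `joblib.dump`. A short sketch of the full round trip, saving and then loading with `joblib.load`, is given below. It trains on synthetic data and writes to a temporary directory so that it does not touch any real files.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the training data
X, y = make_classification(n_samples=200, n_features=11, random_state=42)
rf_model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "rf_model.pkl")
    joblib.dump(rf_model, model_path)      # save the fitted model
    loaded_model = joblib.load(model_path)  # load it back

# The loaded model should give the same predictions as the original
same_predictions = bool((loaded_model.predict(X) == rf_model.predict(X)).all())
print("Loaded model reproduces predictions:", same_predictions)
```

If a scaler or encoders are part of the workflow, they need to be saved alongside the model so that new data can be preprocessed the same way.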


In [71]:
X_test

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area
361,1,1,2,0,0,5000,3667.0,236.0,360.0,1.0,1
73,1,1,3,1,0,4755,0.0,95.0,12.0,0.0,1
374,0,0,0,0,1,2764,1459.0,110.0,360.0,1.0,2
155,1,1,3,0,0,39999,0.0,600.0,180.0,0.0,1
104,1,0,3,0,0,3816,754.0,160.0,360.0,1.0,2
...,...,...,...,...,...,...,...,...,...,...,...
347,1,1,2,1,0,3083,2168.0,126.0,360.0,1.0,2
86,1,1,2,1,0,3333,2000.0,99.0,360.0,0.0,1
75,1,0,0,0,0,3750,0.0,113.0,480.0,1.0,2
438,1,0,0,0,1,10416,0.0,187.0,360.0,0.0,2


In [77]:
y_test.head(10)

361    1
73     0
374    1
155    1
104    1
394    1
377    1
124    1
68     1
450    0
Name: Loan_Status, dtype: int64


**Making predictions**

In [79]:
predict = X_test.iloc[:10]
pred = rf_model.predict(predict)
print("Predictions:", pred)

Predictions: [1 0 1 0 1 1 0 1 1 0]
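
These ten predictions can be compared with the first ten actual labels shown in the `y_test.head(10)` output. The sketch below copies both lists by hand from the outputs above.

```python
# Values copied from the y_test.head(10) and rf_model.predict outputs above
actual = [1, 0, 1, 1, 1, 1, 1, 1, 1, 0]
predicted = [1, 0, 1, 0, 1, 1, 0, 1, 1, 0]

matches = [a == p for a, p in zip(actual, predicted)]
n_correct = sum(matches)
wrong_positions = [i for i, ok in enumerate(matches) if not ok]

print(f"{n_correct} of {len(actual)} predictions match")
print("Mismatched positions:", wrong_positions)
```

Both mismatches are approved loans (1) that the model predicted as rejected (0), that is, false negatives.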


In [252]:
#df1 = pd.read_csv('new data without target column.csv')
#df1

In [173]:
#predict = df1
#pred = rf_model.predict(predict)
#print("Predictions:", pred)

Predictions: [1 0 1 1 1 1 1 0 1]


In [317]:
#logreg_model.predict(X_test_scaled)

**Optional**

To train a model individually:

In [322]:
#model = RandomForestClassifier(n_estimators=100, random_state=42)
#model.fit(X_train, y_train)

To evaluate a model individually:

In [325]:
# Predictions
#y_pred = model.predict(X_test)

# Accuracy and report
#print("Accuracy:", accuracy_score(y_test, y_pred))
#print("\nClassification Report:\n", classification_report(y_test, y_pred))

# Confusion Matrix
#sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, fmt='d')
#plt.title('Confusion Matrix')
#plt.xlabel('Predicted')
#plt.ylabel('Actual')
#plt.show()

To fine-tune Logistic Regression:

In [328]:
#from sklearn.model_selection import GridSearchCV

#scaler = StandardScaler()
#X_train_scaled = scaler.fit_transform(X_train)
#X_test_scaled = scaler.transform(X_test)

#param_grid = {'C': [0.01, 0.1, 1, 10, 100]}
#grid = GridSearchCV(LogisticRegression(max_iter=2000, solver='saga', class_weight='balanced'), param_grid, cv=5)
#grid.fit(X_train_scaled, y_train)

#best_model = grid.best_estimator_
#y_pred = best_model.predict(X_test_scaled)
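
After a grid search has been fitted, the selected hyperparameter and its cross-validated score can be read from `best_params_` and `best_score_`. The sketch below follows the commented-out cell above but runs on synthetic data, uses the default solver instead of `saga`, and scores by F1. These choices are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the loan features and target
X, y = make_classification(n_samples=300, n_features=11, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Search over the regularization strength C with 5-fold cross-validation
param_grid = {"C": [0.01, 0.1, 1, 10, 100]}
grid = GridSearchCV(
    LogisticRegression(max_iter=2000, class_weight="balanced"),
    param_grid,
    cv=5,
    scoring="f1",
)
grid.fit(X_train_scaled, y_train)

print("Best C:", grid.best_params_["C"])
print("Best cross-validated F1:", round(grid.best_score_, 3))
print("Test accuracy:", grid.best_estimator_.score(X_test_scaled, y_test))
```

Note that the best estimator is evaluated on X_test_scaled, matching the scaled data it was trained on.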