# Final Project: Loan Acceptance Prediction

**Name:** Ashwin Santhanakrishnan <br>
**Course:** Business Data Analytics

---



## Step 1: Load Libraries

In this step, I import the Python libraries needed for data loading, preprocessing, model building, and evaluation.


In [37]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, accuracy_score



## Step 2: Load and Clean the Dataset

I loaded the dataset bank.csv into a pandas DataFrame and replaced any infinite values with NaN so that they are treated as missing data.


In [38]:
data = pd.read_csv('bank.csv')
data = data.replace([np.inf, -np.inf], np.nan)
data.head()


Unnamed: 0,ID,Age,Experience,Income,ZIP Code,Family,CCAvg,Education,Mortgage,Personal Loan,Securities Account,CD Account,Online,CreditCard
0,1,25,1,49,91107,4,1.6,1,0,0,1,0,0,0
1,2,45,19,34,90089,3,1.5,1,0,0,1,0,0,0
2,3,39,15,11,94720,1,1.0,1,0,0,0,0,0,0
3,4,35,9,100,94112,1,2.7,2,0,0,0,0,0,0
4,5,35,8,45,91330,4,1.0,2,0,0,0,0,0,1
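
As a small illustration of the cleaning step, the sketch below applies the same replace call to a made-up DataFrame and then counts the missing values per column. The DataFrame `toy` and its values are hypothetical and are not taken from bank.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not from bank.csv) with two infinite entries
toy = pd.DataFrame({
    "Income": [49.0, np.inf, 11.0],
    "CCAvg": [1.6, 1.5, -np.inf],
})

# Same call as in the cell above: positive and negative infinity become NaN
toy_clean = toy.replace([np.inf, -np.inf], np.nan)

# Count the missing values per column after the replacement
missing_counts = toy_clean.isna().sum()
print(missing_counts)
```

If any count is nonzero on the real data, those rows would still need to be dropped or imputed before fitting, since estimators such as GaussianNB do not accept NaN inputs.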



## Step 3: Preprocessing

- In this step, I prepared the dataset for machine learning by converting all categorical columns into numerical form.
- Since the sklearn models used here require numeric input, I used Label Encoding to convert text-based categories into integers.
- I wrote a loop to check each column's data type and applied encoding only to those that were of type 'object'.
- The encoded values replaced the original values in the dataset so that the models can train properly.
- I also stored each encoder in a dictionary in case I needed to decode or reuse them later.
- After encoding, I printed the data types to confirm that all columns are now numeric.



In [39]:
lencoders = {}
for column in data.columns:
    if data[column].dtype == 'object':
        le = LabelEncoder()
        data[column] = le.fit_transform(data[column])
        lencoders[column] = le
data.dtypes


ID                      int64
Age                     int64
Experience              int64
Income                  int64
ZIP Code                int64
Family                  int64
CCAvg                 float64
Education               int64
Mortgage                int64
Personal Loan           int64
Securities Account      int64
CD Account              int64
Online                  int64
CreditCard              int64
dtype: object
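
To show what the encoding loop does when a text column is present, here is a minimal sketch on a made-up DataFrame. The `Region` column and its values are hypothetical; the preview in Step 2 shows only numeric values, so the loop may well have left bank.csv unchanged. The sketch also uses the stored encoder to decode the integers back to the original labels.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data: one text column and one numeric column
toy = pd.DataFrame({
    "Region": pd.Series(["north", "south", "north", "east"], dtype="object"),
    "Age": [25, 45, 39, 35],
})

encoders = {}
for column in toy.columns:
    # Only text (object) columns are encoded; numeric columns are left alone
    if toy[column].dtype == "object":
        le = LabelEncoder()
        toy[column] = le.fit_transform(toy[column])
        encoders[column] = le

# LabelEncoder assigns integers in sorted order of the labels
print(toy["Region"].tolist())

# The stored encoder can map the integers back to the original labels
decoded = encoders["Region"].inverse_transform(toy["Region"])
print(list(decoded))
```

Label Encoding imposes an arbitrary order on the categories, which tree-based models tolerate well but which can mislead models that treat the codes as magnitudes.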


## Step 4: Partition Data into Training and Validation Sets

- I partitioned the dataset into a training set (70%) and a validation set (30%).
- The training set is used to train the machine learning models.
- The validation set is used to evaluate the performance of the models on unseen data.
- I performed the split with the train_test_split function from sklearn.
- I set random_state=1 so that the results are reproducible every time I run the code.
- I printed the number of rows in the two sets to confirm that the split took place.
- Holding out a validation set makes overfitting visible and gives an unbiased estimate of model performance.



In [40]:
A = data.drop('Personal Loan', axis=1)
B = data['Personal Loan']
A_train, A_val, B_train, B_val = train_test_split(A, B, test_size=0.3, random_state=1)
print(f"Training set size (A_train): {A_train.shape[0]} rows")
print(f"Validation set size (B_val): {B_val.shape[0]} rows")

Training set size (A_train): 3500 rows
Validation set size (B_val): 1500 rows
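
The effect of random_state can be checked on a small synthetic example. The sketch below uses hypothetical data (100 rows, 10% positives) and confirms that two splits with the same seed pick the same rows. It also shows the `stratify` option of train_test_split as an optional variant that keeps the class ratio equal in both sets; this option was not used in the project.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 10% positive class
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 90 + [1] * 10)

# Two splits with the same seed select exactly the same rows
X_tr1, X_va1, y_tr1, y_va1 = train_test_split(X, y, test_size=0.3, random_state=1)
X_tr2, X_va2, y_tr2, y_va2 = train_test_split(X, y, test_size=0.3, random_state=1)
same_split = np.array_equal(X_va1, X_va2)

# Optional variant (not used in this project): stratify keeps the class ratio
X_tr_s, X_va_s, y_tr_s, y_va_s = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

print(f"Train rows: {X_tr1.shape[0]}, validation rows: {X_va1.shape[0]}")
print(f"Same split on repeat: {same_split}")
print(f"Positives in stratified validation set: {y_va_s.sum()}")
```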



## Step 5: Build Models

I chose these three classification models:
- Naive Bayes Classifier
- Decision Tree Classifier
- Random Forest Classifier

Each model is trained on the training set.


In [41]:
# Naive Bayes
modelnb = GaussianNB()
modelnb.fit(A_train, B_train)
prednb = modelnb.predict(A_val)

# Decision Tree
modeltree = DecisionTreeClassifier(random_state=1)
modeltree.fit(A_train, B_train)
preddt = modeltree.predict(A_val)

# Random Forest
modelforest = RandomForestClassifier(random_state=1)
modelforest.fit(A_train, B_train)
predforest = modelforest.predict(A_val)



## Step 6: Confusion Matrices for Each Model

- Here, I compared the predictions of the three models using confusion matrices.
- A confusion matrix shows the correct and incorrect classifications made by each model.
- Each matrix is a 2x2 table that shows:
  - True positives: actual 1s correctly predicted as 1
  - True negatives: actual 0s correctly predicted as 0
  - False positives: actual 0s incorrectly predicted as 1
  - False negatives: actual 1s incorrectly predicted as 0
- In sklearn's output, rows are the actual classes and columns are the predicted classes, so the layout is [[TN, FP], [FN, TP]].
- I computed the matrices with the confusion_matrix function from sklearn.
- By comparing the matrices, I can see which models are more accurate and where each one goes wrong.
- This step is worthwhile because it tells me more than accuracy alone: it shows which types of errors each model makes.


In [42]:
print("Naive Bayes Confusion Matrix:\n", confusion_matrix(B_val, prednb))
print("Decision Tree Confusion Matrix:\n", confusion_matrix(B_val, preddt))
print("Random Forest Confusion Matrix:\n", confusion_matrix(B_val, predforest))


Naive Bayes Confusion Matrix:
 [[1239  112]
 [  64   85]]
Decision Tree Confusion Matrix:
 [[1336   15]
 [  16  133]]
Random Forest Confusion Matrix:
 [[1348    3]
 [  23  126]]
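
To make the four cells concrete, the sketch below unpacks the Random Forest matrix printed above and derives recall and precision for the loan-acceptance class. The matrix values are copied from the output; the variable names are my own, and recall and precision are additional metrics not computed elsewhere in this notebook.

```python
import numpy as np

# Random Forest confusion matrix as printed above
# (rows: actual class, columns: predicted class)
cm_rf = np.array([[1348, 3],
                  [23, 126]])

# ravel() flattens the 2x2 matrix row by row: TN, FP, FN, TP
tn, fp, fn, tp = cm_rf.ravel()

# Recall: share of actual loan acceptors that the model found
recall = tp / (tp + fn)
# Precision: share of predicted acceptors that really accepted
precision = tp / (tp + fp)

print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
print(f"Recall: {recall:.4f}")
print(f"Precision: {precision:.4f}")
```

Here the 23 false negatives matter more than the 3 false positives if the goal is to find as many likely acceptors as possible.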



## Step 7: Create DataFrame of Predictions

I created a DataFrame that contains:
- The actual loan outcome
- The predicted outcomes from each of the three models

This helps organize predictions for later ensemble analysis.


In [43]:
outcome = pd.DataFrame({
    'Actual': B_val.values,
    'Naive_Bayes_Pred': prednb,
    'Decision_Tree_Pred': preddt,
    'Random_Forest_Pred': predforest
})
outcome.head(100)


Unnamed: 0,Actual,Naive_Bayes_Pred,Decision_Tree_Pred,Random_Forest_Pred
0,0,0,0,0
1,0,0,0,0
2,0,0,0,0
3,0,0,0,0
4,0,0,0,0
...,...,...,...,...
95,0,0,0,0
96,0,0,0,0
97,0,0,0,0
98,0,0,1,0



## Step 8: Create Ensemble Predictions

- Here, I combined the predictions of all three models into two ensemble outputs.
- The first is Majority Vote, where I took the most common prediction among Naive Bayes, Decision Tree, and Random Forest for every observation.
- The second is Average Probability, where I averaged the predicted probabilities of class 1 from all three models.
- I then applied a cutoff of 0.5: if the mean probability was 0.5 or greater, I predicted 1; otherwise, I predicted 0.
- After generating both ensemble columns, I printed the first 100 predictions to see how the ensemble methods compared to the actual values.
- This lets me inspect where the ensembles agree or disagree with the actual outcomes and gives a first impression of how they will perform on the validation set.



In [44]:
outcome['Majority_Vote'] = outcome[['Naive_Bayes_Pred', 'Decision_Tree_Pred', 'Random_Forest_Pred']].mode(axis=1)[0]
probabilitynb = modelnb.predict_proba(A_val)[:, 1]
probabilitydt = modeltree.predict_proba(A_val)[:, 1]
probabilityrf = modelforest.predict_proba(A_val)[:, 1]
avg_probs = (probabilitynb + probabilitydt + probabilityrf) / 3
outcome['Average_Probability_Pred'] = np.where(avg_probs >= 0.5, 1, 0)
print("Sample of Ensemble Predictions:")
print(outcome[['Actual', 'Majority_Vote', 'Average_Probability_Pred']].head(100))



Sample of Ensemble Predictions:
    Actual  Majority_Vote  Average_Probability_Pred
0        0              0                         0
1        0              0                         0
2        0              0                         0
3        0              0                         0
4        0              0                         0
..     ...            ...                       ...
95       0              0                         0
96       0              0                         0
97       0              0                         0
98       0              0                         0
99       0              0                         0

[100 rows x 3 columns]
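
The two ensemble rules do not have to agree, which a tiny made-up example can show. In the sketch below, all votes and probabilities are hypothetical. The majority vote is written as "at least two of three models predict 1", which is equivalent to taking the row mode when there are three binary voters.

```python
import numpy as np
import pandas as pd

# Hypothetical class-1 probabilities for four observations (columns: nb, dt, rf)
probs = np.array([
    [0.10, 0.00, 0.05],
    [0.90, 1.00, 0.40],
    [0.55, 0.00, 0.60],
    [0.20, 0.00, 0.70],
])

# Hard predictions of each model: 1 when its own probability is at least 0.5
votes = pd.DataFrame((probs >= 0.5).astype(int), columns=["nb", "dt", "rf"])

# Majority vote: with three binary voters, predict 1 when at least two vote 1
majority = (votes.sum(axis=1) >= 2).astype(int)

# Average probability: mean of the three probabilities, then a 0.5 cutoff
avg = probs.mean(axis=1)
avg_pred = np.where(avg >= 0.5, 1, 0)

print("Majority vote:      ", majority.tolist())
print("Average probability:", avg_pred.tolist())
```

In the third row, two models vote 1, so the majority vote is 1, but one confident probability of 0 pulls the mean below 0.5. An unpruned decision tree often returns probabilities of exactly 0 or 1, so it can weigh heavily in the average.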



## Step 9: Confusion Matrices for Ensemble Methods

Confusion matrices are generated for the Majority Vote and Average Probability ensemble methods. 
This helps evaluate how well the ensemble strategies perform compared to individual models.


In [45]:
print("Majority Vote Confusion Matrix:\n", confusion_matrix(B_val, outcome['Majority_Vote']))
print("Average Probability Confusion Matrix:\n", confusion_matrix(B_val, outcome['Average_Probability_Pred']))


Majority Vote Confusion Matrix:
 [[1346    5]
 [  23  126]]
Average Probability Confusion Matrix:
 [[1346    5]
 [  23  126]]



## Step 10: Compare Error Rates

- I calculated each model's error rate using the formula: Error Rate = 1 - Accuracy.
- The error rate shows how often each model makes an incorrect prediction on the validation set.
- Here I compare a total of five models:
  - Naive Bayes
  - Decision Tree
  - Random Forest
  - Ensemble method via Majority Vote
  - Ensemble method via Average Probability
- I compared each model's predictions with the actual validation set outcomes.
- I computed accuracy with the accuracy_score function from sklearn and obtained the error rate as 1 minus that value.
- From this comparison, we can see which model has the lowest error rate.
- It also lets us check whether combining several models in an ensemble improves performance over the individual models.



In [46]:
print("\nError Rates:")
print(f"Naive Bayes Error Rate: {1 - accuracy_score(B_val, prednb):.4f}")
print(f"Decision Tree Error Rate: {1 - accuracy_score(B_val, preddt):.4f}")
print(f"Random Forest Error Rate: {1 - accuracy_score(B_val, predforest):.4f}")
print(f"Majority Vote Error Rate: {1 - accuracy_score(B_val, outcome['Majority_Vote']):.4f}")
print(f"Average Probability Error Rate: {1 - accuracy_score(B_val, outcome['Average_Probability_Pred']):.4f}")



Error Rates:
Naive Bayes Error Rate: 0.1173
Decision Tree Error Rate: 0.0207
Random Forest Error Rate: 0.0173
Majority Vote Error Rate: 0.0187
Average Probability Error Rate: 0.0187
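
As a cross-check, the same error rates can be recovered from the confusion matrices in Steps 6 and 9, since the error rate is the number of off-diagonal entries divided by the total. The sketch below copies the matrices from the outputs above; the dictionary and variable names are my own.

```python
import numpy as np

# Confusion matrices copied from the outputs in Steps 6 and 9
matrices = {
    "Naive Bayes": [[1239, 112], [64, 85]],
    "Decision Tree": [[1336, 15], [16, 133]],
    "Random Forest": [[1348, 3], [23, 126]],
    "Majority Vote": [[1346, 5], [23, 126]],
    "Average Probability": [[1346, 5], [23, 126]],
}

error_rates = {}
for name, cm in matrices.items():
    cm = np.array(cm)
    # The diagonal holds correct predictions; everything else is an error
    errors = cm.sum() - np.trace(cm)
    error_rates[name] = errors / cm.sum()
    print(f"{name}: {errors} / {cm.sum()} = {error_rates[name]:.4f}")

# Model with the lowest validation error rate
best_model = min(error_rates, key=error_rates.get)
print(f"Lowest error rate: {best_model}")
```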



# Conclusion

- In this project, three classification models were developed to predict loan acceptance: Naive Bayes, Decision Tree, and Random Forest.  
- The dataset was preprocessed by replacing infinite values with NaN and label-encoding any categorical columns.
- The models were trained and evaluated using a 70-30 train-validation split.  
- Two ensemble strategies (Majority Voting and Average Probability) were also implemented to combine model predictions.
- Confusion matrices and error rates were used to assess model performance.  
- The comparison shows that both ensemble methods (error rate 0.0187) outperformed Naive Bayes (0.1173) and the Decision Tree (0.0207), but not the Random Forest on its own, which had the lowest error rate (0.0173).
- Thus, the combination of simple preprocessing, basic classifiers, and ensemble techniques provided a solid framework for predicting loan acceptance outcomes.
