## 1  Introduction ##
This document presents a machine learning analysis for loan approval prediction, covering data loading, data preprocessing, model selection, model training, and model evaluation using multiple algorithms. The goal is to predict whether a loan will be approved based on various features. The code is implemented in Python using libraries such as pandas and scikit-learn.

##  2  Code and Analysis  ##

##  2.1  Importing Libraries  ##

In [2]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

##  2.2  Loading Data  ##
Loading the data from a CSV file.

In [3]:
df = pd.read_csv('loan_approval_dataset.csv')

##  2.3 Data Overview ##
Displaying the first five rows of the dataset.

In [4]:
df.head(5)

|   | loan_id | no_of_dependents | education | self_employed | income_annum | loan_amount | loan_term | cibil_score | residential_assets_value | commercial_assets_value | luxury_assets_value | bank_asset_value | loan_status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2 | Graduate | No | 9600000 | 29900000 | 12 | 778 | 2400000 | 17600000 | 22700000 | 8000000 | Approved |
| 1 | 2 | 0 | Not Graduate | Yes | 4100000 | 12200000 | 8 | 417 | 2700000 | 2200000 | 8800000 | 3300000 | Rejected |
| 2 | 3 | 3 | Graduate | No | 9100000 | 29700000 | 20 | 506 | 7100000 | 4500000 | 33300000 | 12800000 | Rejected |
| 3 | 4 | 3 | Graduate | No | 8200000 | 30700000 | 8 | 467 | 18200000 | 3300000 | 23300000 | 7900000 | Rejected |
| 4 | 5 | 5 | Not Graduate | Yes | 9800000 | 24200000 | 20 | 382 | 12400000 | 8200000 | 29400000 | 5000000 | Rejected |


## 2.4   Data Shape  ##


In [6]:
df.shape

(4269, 13)

##  2.5  Data Types ##

In [7]:
df.dtypes

loan_id                       int64
 no_of_dependents             int64
 education                   object
 self_employed               object
 income_annum                 int64
 loan_amount                  int64
 loan_term                    int64
 cibil_score                  int64
 residential_assets_value     int64
 commercial_assets_value      int64
 luxury_assets_value          int64
 bank_asset_value             int64
 loan_status                 object
dtype: object

##  2.6  Data Information ##

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4269 entries, 0 to 4268
Data columns (total 13 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   loan_id                    4269 non-null   int64 
 1    no_of_dependents          4269 non-null   int64 
 2    education                 4269 non-null   object
 3    self_employed             4269 non-null   object
 4    income_annum              4269 non-null   int64 
 5    loan_amount               4269 non-null   int64 
 6    loan_term                 4269 non-null   int64 
 7    cibil_score               4269 non-null   int64 
 8    residential_assets_value  4269 non-null   int64 
 9    commercial_assets_value   4269 non-null   int64 
 10   luxury_assets_value       4269 non-null   int64 
 11   bank_asset_value          4269 non-null   int64 
 12   loan_status               4269 non-null   object
dtypes: int64(10), object(3)
memory usage: 433.7+ KB
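
In the `df.dtypes` and `df.info()` listings above, every column name except `loan_id` appears to start with a space. If that whitespace is really part of the labels, `df['education']` would raise a `KeyError`. The following is a minimal sketch of one way to normalize the labels; the small `sample` frame is hypothetical and only mimics the padded headers, it is not the loan dataset.

```python
import pandas as pd

# Hypothetical two-row frame that mimics the padded column labels seen above.
sample = pd.DataFrame({
    "loan_id": [1, 2],
    " education": ["Graduate", "Not Graduate"],
    " loan_status": ["Approved", "Rejected"],
})

# Strip leading and trailing whitespace from every column label.
sample.columns = sample.columns.str.strip()

print(list(sample.columns))
```

Applied to the real data, the same one-liner would be `df.columns = df.columns.str.strip()`.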


##  2.7  Missing Values ##

In [9]:
print(df.isnull().sum())

loan_id                      0
 no_of_dependents            0
 education                   0
 self_employed               0
 income_annum                0
 loan_amount                 0
 loan_term                   0
 cibil_score                 0
 residential_assets_value    0
 commercial_assets_value     0
 luxury_assets_value         0
 bank_asset_value            0
 loan_status                 0
dtype: int64


## 2.8  Model Training ##
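Training multiple machine learning models. The cells in this section load scikit-learn's Iris dataset with `load_iris()` instead of the loan DataFrame `df`, so the accuracy scores and confusion matrices reported below describe Iris classification rather than loan approval.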

##  2.8.1 Support Vector Machine (SVM) ##

In [11]:
# 1. Import libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# 2. Load dataset
data = load_iris()
X = data.data
y = data.target

# 3. Split data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# 4. Define the SVM model
SVM = SVC(kernel='rbf', random_state=0)

# 5. Train the model
SVM.fit(x_train, y_train)

# 6. Make predictions
y_pred = SVM.predict(x_test)

# 7. Evaluate the model
print("Accuracy:", accuracy_score(y_test, y_pred))


Accuracy: 0.9777777777777777
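
The cell above takes `X` and `y` from `load_iris()` rather than from `df`. As a rough sketch of how features and labels could instead be derived from the loan data, the block below rebuilds the five rows shown by `df.head(5)` (with a subset of the columns and with whitespace-free column names, both assumptions made for brevity) and encodes the text columns as 0/1. The names `loans`, `X_loan`, `y_loan` and the particular 0/1 mapping are illustrative choices, not part of the original notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# The five rows from df.head(5), rebuilt by hand so the sketch runs without
# the CSV file. Asset columns are omitted to keep the example short.
loans = pd.DataFrame({
    "loan_id": [1, 2, 3, 4, 5],
    "no_of_dependents": [2, 0, 3, 3, 5],
    "education": ["Graduate", "Not Graduate", "Graduate", "Graduate", "Not Graduate"],
    "self_employed": ["No", "Yes", "No", "No", "Yes"],
    "income_annum": [9600000, 4100000, 9100000, 8200000, 9800000],
    "loan_amount": [29900000, 12200000, 29700000, 30700000, 24200000],
    "loan_term": [12, 8, 20, 8, 20],
    "cibil_score": [778, 417, 506, 467, 382],
    "loan_status": ["Approved", "Rejected", "Rejected", "Rejected", "Rejected"],
})

# Map the two-valued text columns to 0/1 (assumed encoding).
loans["education"] = loans["education"].map({"Not Graduate": 0, "Graduate": 1})
loans["self_employed"] = loans["self_employed"].map({"No": 0, "Yes": 1})
y_loan = loans["loan_status"].map({"Rejected": 0, "Approved": 1})

# loan_id is an identifier, so it is dropped together with the target column.
X_loan = loans.drop(columns=["loan_id", "loan_status"])

# Hold out part of the rows for testing; 0.4 is used only because the toy
# frame has five rows.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_loan, y_loan, test_size=0.4, random_state=0
)
print(X_tr.shape, X_te.shape)
```

On the full dataset, the same steps would run on `df` itself, and the resulting arrays could be passed to the models in place of the Iris data.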




##  2.8.2  Decision Tree Classifier ##

In [12]:
from sklearn.tree import DecisionTreeClassifier
DCT = DecisionTreeClassifier(criterion='entropy', random_state=0)
DCT.fit(x_train, y_train)

DecisionTreeClassifier(criterion='entropy', random_state=0)

##  2.8.3 Logistic Regression ##

In [13]:
from sklearn.linear_model import LogisticRegression
LR = LogisticRegression(random_state=0)
LR.fit(x_train, y_train)


LogisticRegression(random_state=0)

##  2.8.4  K-Nearest Neighbours (KNN) ##


In [16]:
from sklearn.neighbors import KNeighborsClassifier
KNN = KNeighborsClassifier(n_neighbors=3)
KNN.fit(x_train, y_train)

KNeighborsClassifier(n_neighbors=3)

##  2.9 Predictions ##
Making predictions on the test set.

In [18]:
y_pred_SVM = SVM.predict(x_test)
y_pred_LR = LR.predict(x_test)
y_pred_DCT = DCT.predict(x_test)
y_pred_KNN = KNN.predict(x_test)

##  2.10  Evaluation ##
Evaluating model performance using accuracy scores.

In [24]:
from sklearn.metrics import accuracy_score
SVM_accuracy = accuracy_score(y_test, y_pred_SVM)
LR_accuracy = accuracy_score(y_test, y_pred_LR)
DCT_accuracy = accuracy_score(y_test, y_pred_DCT)
KNN_accuracy = accuracy_score(y_test, y_pred_KNN)
# Print the accuracy scores
print("Accuracy Score for SVM:", SVM_accuracy)
print("Accuracy Score for Logistic Regression:", LR_accuracy)
print("Accuracy Score for Decision Tree:", DCT_accuracy)
print("Accuracy Score for K-Nearest Neighbors:", KNN_accuracy)

Accuracy Score for SVM: 0.9777777777777777
Accuracy Score for Logistic Regression: 0.9777777777777777
Accuracy Score for Decision Tree: 0.9777777777777777
Accuracy Score for K-Nearest Neighbors: 0.9777777777777777
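
The fit, predict, and score steps above repeat the same pattern for each model, so they can also be written as a single loop. The following is a compact sketch that reuses the same data split and hyperparameters as the cells above; the `models` and `scores` dictionaries are names introduced here for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Same data and split as the training cells.
X, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Same estimators and hyperparameters as the individual model cells.
models = {
    "SVM": SVC(kernel="rbf", random_state=0),
    "Logistic Regression": LogisticRegression(random_state=0),
    "Decision Tree": DecisionTreeClassifier(criterion="entropy", random_state=0),
    "K-Nearest Neighbors": KNeighborsClassifier(n_neighbors=3),
}

# Fit each model on the training split and score it on the test split.
scores = {}
for name, model in models.items():
    model.fit(x_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(x_test))
    print(f"Accuracy Score for {name}: {scores[name]}")
```

Keeping the models in a dictionary makes it easy to add or remove an algorithm without touching the prediction and evaluation code.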


##  2.11 Confusion Matrices ##


In [25]:
from sklearn.metrics import confusion_matrix
SVM_CM = confusion_matrix(y_test, y_pred_SVM)
LR_CM = confusion_matrix(y_test, y_pred_LR)
DCT_CM = confusion_matrix(y_test, y_pred_DCT)
KNN_CM = confusion_matrix(y_test, y_pred_KNN)
# Print the confusion matrices
print("Confusion Matrix for SVM:")
print(SVM_CM)
print("\nConfusion Matrix for Logistic Regression:")
print(LR_CM)
print("\nConfusion Matrix for Decision Tree:")
print(DCT_CM)
print("\nConfusion Matrix for K-Nearest Neighbors:")
print(KNN_CM)

Confusion Matrix for SVM:
[[16  0  0]
 [ 0 17  1]
 [ 0  0 11]]

Confusion Matrix for Logistic Regression:
[[16  0  0]
 [ 0 17  1]
 [ 0  0 11]]

Confusion Matrix for Decision Tree:
[[16  0  0]
 [ 0 17  1]
 [ 0  0 11]]

Confusion Matrix for K-Nearest Neighbors:
[[16  0  0]
 [ 0 17  1]
 [ 0  0 11]]
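
The confusion matrices tie directly back to the accuracy scores: in scikit-learn's layout, rows are true classes and columns are predicted classes, so the diagonal counts the correct predictions. The short sketch below recomputes accuracy, per-class recall, and per-class precision from the matrix printed above, using plain Python lists; the variable names are illustrative.

```python
# Confusion matrix printed above (identical for all four models).
# Rows are true classes, columns are predicted classes.
cm = [
    [16, 0, 0],
    [0, 17, 1],
    [0, 0, 11],
]
n_classes = len(cm)

# Accuracy = correct predictions (diagonal) / all test samples.
correct = sum(cm[i][i] for i in range(n_classes))
total = sum(sum(row) for row in cm)
accuracy = correct / total

# Recall for class i = correct predictions / true samples of class i (row sum).
recall = [cm[i][i] / sum(cm[i]) for i in range(n_classes)]

# Precision for class i = correct predictions / predictions of class i (column sum).
precision = [cm[i][i] / sum(row[i] for row in cm) for i in range(n_classes)]

print(correct, total, accuracy)
print(recall)
print(precision)
```

Here 44 of the 45 test samples lie on the diagonal, which gives 44/45 ≈ 0.978, matching the reported accuracy. The single error is a sample of the second class predicted as the third class.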
