# Problem Introduction
# what is the problem
Loan approval processes often involve manual evaluations, which can be time-consuming, prone to inconsistencies, and susceptible to human biases. By creating a predictive model, banks can automate and streamline this process, ensuring faster and more consistent decisions. However, using potentially sensitive data like ethnicity, age, or gender raises ethical concerns and risks of discrimination, even if unintended.

# why is it important
Automating loan approvals is crucial for improving operational efficiency, reducing costs, and enhancing customer satisfaction through faster processing. Moreover, consistent decision-making ensures applicants are treated fairly. However, it is equally important to ensure the model complies with ethical standards and regulatory requirements, avoiding bias and unfair discrimination.

# who are key stakeholders
- Applicants: The individuals or businesses applying for loans. They are directly affected by approval decisions and expect fairness and transparency.
- Bank Management: Responsible for operational efficiency, customer satisfaction, and compliance with laws and regulations.
- Regulators: Ensure the bank’s operations, including machine learning models, comply with anti-discrimination laws and ethical standards.
- Data Scientists: Tasked with building models that are accurate, ethical, and transparent.
- Society at Large: Has a vested interest in ensuring financial institutions operate fairly and do not propagate systemic inequalities.

In [1]:
# importing the required libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, precision_recall_fscore_support
import numpy as np

In [3]:
path = "C:\\Users\\Gahan\\Downloads\\loan_approval.csv"

# Data Collection
we are loading the data and then displaying it.

In [4]:
# loading the dataset
df = pd.read_csv(path)

In [5]:
df.head(5)

Unnamed: 0,approved,gender,age,debt,married,bank_customer,emp_industrial,emp_materials,emp_consumer_services,emp_healthcare,...,ethnicity_other,years_employed,prior_default,employed,credit_score,drivers_license,citizen_bybirth,citizen_other,citizen_temporary,Income
0,1,1,30.83,0.0,1,1,1,0,0,0,...,0,1.25,1,1,1,0,1,0,0,0
1,1,0,58.67,4.46,1,1,0,1,0,0,...,0,3.04,1,1,6,0,1,0,0,560
2,1,0,24.5,0.5,1,1,0,1,0,0,...,0,1.5,1,0,0,0,1,0,0,824
3,1,1,27.83,1.54,1,1,1,0,0,0,...,0,3.75,1,1,5,1,1,0,0,3
4,1,1,20.17,5.625,1,1,1,0,0,0,...,0,1.71,1,0,0,0,0,1,0,0


# Data Exploration and Understanding
we are displaying the statistical summary of the data.

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 27 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   approved               690 non-null    int64  
 1   gender                 690 non-null    int64  
 2   age                    690 non-null    float64
 3   debt                   690 non-null    float64
 4   married                690 non-null    int64  
 5   bank_customer          690 non-null    int64  
 6   emp_industrial         690 non-null    int64  
 7   emp_materials          690 non-null    int64  
 8   emp_consumer_services  690 non-null    int64  
 9   emp_healthcare         690 non-null    int64  
 10  emp_financials         690 non-null    int64  
 11  emp_utilities          690 non-null    int64  
 12  emp_education          690 non-null    int64  
 13  ethnicity_white        690 non-null    int64  
 14  ethnicity_black        690 non-null    int64  
 15  ethnic

In [7]:
df.describe()

Unnamed: 0,approved,gender,age,debt,married,bank_customer,emp_industrial,emp_materials,emp_consumer_services,emp_healthcare,...,ethnicity_other,years_employed,prior_default,employed,credit_score,drivers_license,citizen_bybirth,citizen_other,citizen_temporary,Income
count,690.0,690.0,690.0,690.0,690.0,690.0,690.0,690.0,690.0,690.0,...,690.0,690.0,690.0,690.0,690.0,690.0,690.0,690.0,690.0,690.0
mean,0.444928,0.695652,31.514116,4.758725,0.76087,0.763768,0.35942,0.117391,0.207246,0.076812,...,0.04058,2.223406,0.523188,0.427536,2.4,0.457971,0.905797,0.082609,0.011594,1017.385507
std,0.497318,0.460464,11.860245,4.978163,0.426862,0.425074,0.480179,0.322119,0.405628,0.266485,...,0.197458,3.346513,0.499824,0.49508,4.86294,0.498592,0.292323,0.27549,0.107128,5210.102598
min,0.0,0.0,13.75,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,22.67,1.0,1.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.165,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
50%,0.0,1.0,28.46,2.75,1.0,1.0,0.0,0.0,0.0,0.0,...,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,5.0
75%,1.0,1.0,37.7075,7.2075,1.0,1.0,1.0,0.0,0.0,0.0,...,0.0,2.625,1.0,1.0,3.0,1.0,1.0,0.0,0.0,395.5
max,1.0,1.0,80.25,28.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,28.5,1.0,1.0,67.0,1.0,1.0,1.0,1.0,100000.0


# Data Cleaning or Preprocessing
This is not required since the data is already preprocessed data.

# Feature Selection
since it is a small dataset with less features, we don't need to perform this step.

In [8]:
# checking for the null values
df.isnull().sum()

Unnamed: 0,0
approved,0
gender,0
age,0
debt,0
married,0
bank_customer,0
emp_industrial,0
emp_materials,0
emp_consumer_services,0
emp_healthcare,0


# Modelling
# Data splitting
Here we are performing the train test split and also dividing the target variable from the dataset in order to train the model.

In [9]:
X = df.drop('approved' , axis = 1)
y= df['approved']

In [10]:
X.head(5)

Unnamed: 0,gender,age,debt,married,bank_customer,emp_industrial,emp_materials,emp_consumer_services,emp_healthcare,emp_financials,...,ethnicity_other,years_employed,prior_default,employed,credit_score,drivers_license,citizen_bybirth,citizen_other,citizen_temporary,Income
0,1,30.83,0.0,1,1,1,0,0,0,0,...,0,1.25,1,1,1,0,1,0,0,0
1,0,58.67,4.46,1,1,0,1,0,0,0,...,0,3.04,1,1,6,0,1,0,0,560
2,0,24.5,0.5,1,1,0,1,0,0,0,...,0,1.5,1,0,0,0,1,0,0,824
3,1,27.83,1.54,1,1,1,0,0,0,0,...,0,3.75,1,1,5,1,1,0,0,3
4,1,20.17,5.625,1,1,1,0,0,0,0,...,0,1.71,1,0,0,0,0,1,0,0


In [11]:
y.head(5)

Unnamed: 0,approved
0,1
1,1
2,1
3,1
4,1


Here we are splitting the dataset into train and test. 80 percent of the data is given for training while 20 percent of the data is used for testing.

In [12]:
# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Classification Modelling
here we are training two different classification models and checking the performance measures and then declaring the winning model based on threshold.

# SVM Model
here we are looping c value for various 10 powers

In [13]:
from sklearn.preprocessing import StandardScaler
# Scale the numerical features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Verify scaling
X_train_scaled[:5], X_test_scaled[:5]

(array([[ 0.65295987, -0.58195732,  1.8520771 , -1.74045887, -1.74894926,
         -0.78059557, -0.35893503, -0.50734918,  3.53035616, -0.32249031,
         -0.32590248, -0.19886685, -1.21921552, -0.49603393,  3.39786029,
         -0.30151134, -0.20851441, -0.70123581, -1.02569226, -0.86739896,
         -0.49503255, -0.90984316,  0.31211457, -0.28325754, -0.12126781,
         -0.19236401],
        [ 0.65295987, -0.19132987, -0.22996579,  0.57456112,  0.57177187,
          1.28107311, -0.35893503, -0.50734918, -0.28325754, -0.32249031,
         -0.32590248, -0.19886685,  0.82019953, -0.49603393, -0.29430286,
         -0.30151134, -0.20851441,  0.49708679,  0.9749513 ,  1.15287203,
          0.12366473,  1.09909053,  0.31211457, -0.28325754, -0.12126781,
         -0.19236401],
        [ 0.65295987,  0.71587897, -0.85457865,  0.57456112,  0.57177187,
          1.28107311, -0.35893503, -0.50734918, -0.28325754, -0.32249031,
         -0.32590248, -0.19886685,  0.82019953, -0.49603393, -0.29

In [14]:
# Defining a range of C values for SVM
C_values = [10**i for i in range(-4, 5)]

In [15]:
# Training and evaluating SVM models for each C value
best_C = None
best_accuracy = 0


for C in C_values:
    model = SVC(C=C, kernel='linear', probability=True, random_state=42)
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    accuracy = accuracy_score(y_test, y_pred)

    if accuracy > best_accuracy:
        best_accuracy = accuracy
        best_C = C

In [16]:
# Training the final SVM model using the best C value
final_svm_model = SVC(C=best_C, kernel='linear', probability=True, random_state=42)
final_svm_model.fit(X_train, y_train)
y_proba = final_svm_model.predict_proba(X_test)[:, 1]


creating the performance measure table over a range of possible threshold values.




In [17]:
# Defining thresholds and evaluate SVM model performance
thresholds = np.arange(0.0, 1.1, 0.1)
performance_measures = []

for threshold in thresholds:
    y_pred_threshold = (y_proba >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred_threshold).ravel()
    precision, recall, f1, _ = precision_recall_fscore_support(y_test, y_pred_threshold, average='binary', zero_division=0)
    accuracy = accuracy_score(y_test, y_pred_threshold)
    performance_measures.append((threshold, tn, tp, fn, fp, precision, recall, f1, accuracy))

performance_df = pd.DataFrame(performance_measures, columns=['Threshold', 'TN', 'TP', 'FN', 'FP', 'Precision', 'Recall', 'F1', 'Accuracy'])

# Logistic Model
# Training a logistic model.

In [21]:
# Training a Logistic Regression model
logistic_model = LogisticRegression(solver='saga',max_iter = 10000, random_state=42)
logistic_model.fit(X_train, y_train)
y_proba_logistic = logistic_model.predict_proba(X_test)[:, 1]

creating the performance measure table over a range of possible threshold values.

In [22]:
# Evaluating the Logistic Regression model performance over various thresholds
logistic_performance_measures = []

for threshold in thresholds:
    y_pred_threshold_logistic = (y_proba_logistic >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred_threshold_logistic).ravel()
    precision, recall, f1, _ = precision_recall_fscore_support(y_test, y_pred_threshold_logistic, average='binary', zero_division=0)
    accuracy = accuracy_score(y_test, y_pred_threshold_logistic)
    logistic_performance_measures.append((threshold, tn, tp, fn, fp, precision, recall, f1, accuracy))

logistic_performance_df = pd.DataFrame(logistic_performance_measures, columns=['Threshold', 'TN', 'TP', 'FN', 'FP', 'Precision', 'Recall', 'F1', 'Accuracy'])

In [23]:
# Printing the results
print("Best C for SVM:", best_C)
print("\nSVM Performance Table:\n", performance_df)
print("\nLogistic Regression Performance Table:\n", logistic_performance_df)

Best C for SVM: 0.001

SVM Performance Table:
     Threshold  TN  TP  FN  FP  Precision    Recall        F1  Accuracy
0         0.0   0  70   0  68   0.507246  1.000000  0.673077  0.507246
1         0.1   0  70   0  68   0.507246  1.000000  0.673077  0.507246
2         0.2  30  61   9  38   0.616162  0.871429  0.721893  0.659420
3         0.3  51  56  14  17   0.767123  0.800000  0.783217  0.775362
4         0.4  57  47  23  11   0.810345  0.671429  0.734375  0.753623
5         0.5  58  39  31  10   0.795918  0.557143  0.655462  0.702899
6         0.6  61  34  36   7   0.829268  0.485714  0.612613  0.688406
7         0.7  63  29  41   5   0.852941  0.414286  0.557692  0.666667
8         0.8  66  25  45   2   0.925926  0.357143  0.515464  0.659420
9         0.9  66  19  51   2   0.904762  0.271429  0.417582  0.615942
10        1.0  68   0  70   0   0.000000  0.000000  0.000000  0.492754

Logistic Regression Performance Table:
     Threshold  TN  TP  FN  FP  Precision    Recall        F1

# Pick a "winning" model
# Based on the various performance measures, decide which of the two modeling frameworks (SVM or Logistic) to move forward with.
Based on the performance metrics, SVM is the winning model. It consistently delivers better precision, recall, and F1 scores across various thresholds. While Logistic Regression shows stable performance at very low thresholds (0.0–0.4), its effectiveness declines noticeably beyond a threshold of 0.5. On the other hand, SVM maintains robust and reliable performance, especially around thresholds of 0.3–0.4, where it achieves a good balance between precision and recall.

# Evaluation at a Low Probability Threshold
I am picking up a lower threshold of 0.3.

# How many False Positives?
17 false positives.

# What do these numbers represent?
It implies that these represent loan applications that are incorrectly approved despite the applicant being unqualified.

# What are the potential costs to the business if we were to make these mistakes in practice?
Financial losses due to defaults, reputational damage, and regulatory scrutiny for approving risky loans.

# How many False Negatives?
14 False Negatives.

# What do these numbers represent?
It implies that these are qualified applicants who were wrongly denied a loan.

# What are the potential costs to the business if we were to make these mistakes in practice?
Missed revenue opportunities, customer dissatisfaction, and potential negative impact on brand reputation.

# Which prediction mistakes do you consider to be more costly?
 False positives may be more costly here due to financial risks and compliance concerns. However, false negatives also carry reputational risks and could deter future customers.

# Evaluation at two Other Probability Thresholds
# At threshold 0.4

# How many False Positives?
11 false positives

# What do these numbers represent?
Fewer loans are incorrectly approved, reducing financial and reputational risk.

# What are the potential costs to the business if we were to make these mistakes in practice?
Still some financial losses due to defaults, but reduced compared to the threshold of 0.3.

# How many False Negatives?
23 false negatives.

# What do these numbers represent?
More qualified applicants are denied, increasing missed opportunities for revenue.

# What are the potential costs to the business if we were to make these mistakes in practice
Customer dissatisfaction and loss of potential market share.

# Which prediction mistakes do you consider to be more costly?
At this threshold, false negatives become a more significant issue as the potential loss of revenue increases significantly.

# At threshold 0.2

# How many False Positives?
38 false positives

# What do these numbers represent?
A large number of unqualified applicants are approved, which poses high financial risks.

# What are the potential costs to the business if we were to make these mistakes in practice?
Increased risk of defaults, potentially damaging the bank's bottom line and reputation.

# How many False Negatives?
9 false negatives.

# What do these numbers represent?
Few qualified applicants are denied loans, limiting revenue loss.

# What are the potential costs to the business if we were to make these mistakes in practice?
it is minimal compared to false positives.but it could potentially leading to Customer dissatisfaction and loss of potential market share.

# Which prediction mistakes do you consider to be more costly?
False positives dominate at this threshold, making it too risky for business.


# Based on your careful consideration of probability threshold options and the corresponding speculated risks/costs - which probability threshold do you recommend going forward with?
The analysis highlights that a threshold of 0.3 is the ideal choice for this model. At this level, the model achieves a well-balanced trade-off between precision and recall, reducing the likelihood of false positives (which represent financial risks) while keeping false negatives (missed opportunities) at a manageable level.

The F1 score of 0.783 and an accuracy of 0.775, both the highest across thresholds, provide additional evidence to support this decision. This approach ensures the model aligns effectively with the organization's priorities, safeguarding financial stability while maximizing revenue opportunities and maintaining compliance with regulatory requirements.