# Outline for ML Project on Loan Approval
1. Introduction
- Problem Statement
- Why is it important
- Who are the key stakeholders
2. Data Preprocessing
3. Model Building and Evaluation
- Support Vector Machine (SVM) Model
    - Finding Best 'C' Value
    - Model Training and Performance Evaluation
-Logistic Regression Model
-Model Comparison and Selection
- Detailed Model Evaluation
    - Analysis at Low Probability Threshold
    - Analysis at Higher Probability Thresholds (0.55 & 0.90)
    - Final Recommendation


# What is the Problem
The main goal is to create a predictive system that automates loan approval decisions for a significant regional bank. The model should evaluate lending risk by forecasting the probability of loan approval using diverse applicant characteristics.

# Why is it important
Streamlining Decision-Making: The automation of loan approval processes can notably enhance the efficiency of the bank's decision-making, minimizing the time and resources devoted to manual assessments.

Uniformity and Impartiality: Automated systems have the potential to provide more uniform and unbiased decisions in contrast to manual procedures, assuming they are developed with fairness as a key consideration.

Enhanced Risk Oversight: Precise predictions play a vital role in the bank's risk management, aiding in the identification of potential defaulters and consequently mitigating financial losses.

# Who are the key stakeholders
Bank Administration: Their focus lies in evaluating the model's capacity to decrease operational expenses, manage risks, and uphold regulatory adherence.

Loan Candidates: The fairness and precision of the model have a direct impact on applicants, influencing their eligibility for credit.

Data Science Team: Tasked with creating, testing, and sustaining the model, this team guarantees its accuracy and ethical alignment.

Regulatory Authorities: Entities such as the Consumer Financial Protection Bureau in the U.S. oversee banking operations to ensure they align with legal norms, particularly concerning fairness and non-discrimination.

Ethical Oversight Boards: Due to the sensitive nature of certain data (such as ethnicity and age), these boards may be involved to verify that the bank's practices adhere to ethical guidelines.

# Data Preprocessing

In [1]:
# Importing Necessary Libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [2]:
# Loading the Dataset
file_path = '/Users/shobhitdhanyakumardiggikar/Downloads/loan_approval.csv'
data = pd.read_csv(file_path)

In [3]:
from sklearn.model_selection import train_test_split

# Columns to exclude due to potential bias
columns_to_exclude = ['gender', 'age', 'ethnicity_white', 'ethnicity_black', 
                      'ethnicity_latino', 'ethnicity_asian', 'ethnicity_other']

# Target variable
target = 'approved'

# Dropping the sensitive columns and the target column
X = data.drop(columns_to_exclude + [target], axis=1)
y = data[target]

# Splitting the Dataset into Train and Test Sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Print out the shape of the datasets
print("\nTrain and Test split done.")
print("Train set size:", X_train.shape)
print("Test set size:", X_test.shape)


Train and Test split done.
Train set size: (552, 19)
Test set size: (138, 19)


In [4]:
from sklearn.preprocessing import StandardScaler

# Initialize the StandardScaler
scaler = StandardScaler()

# Columns to be scaled from the new dataset (excluding 'age')
columns_to_scale = ['debt', 'years_employed', 'credit_score', 'Income']

# Scale only specified columns
X_train_scaled = X_train.copy()
X_test_scaled = X_test.copy()
X_train_scaled[columns_to_scale] = scaler.fit_transform(X_train[columns_to_scale])
X_test_scaled[columns_to_scale] = scaler.transform(X_test[columns_to_scale])

# Displaying the affected columns from the original and scaled data for comparison
print("Original Data (Affected Columns):\n", X_train[columns_to_scale].head())
print("\nScaled Data (Affected Columns):\n", X_train_scaled[columns_to_scale].head())

Original Data (Affected Columns):
        debt  years_employed  credit_score  Income
278  13.500           0.000             0       0
110   3.500           3.500             3       0
82    0.500           0.250             0       0
51    1.000           1.750             0       0
218   9.625           8.665             5       0

Scaled Data (Affected Columns):
          debt  years_employed  credit_score    Income
278  1.852077       -0.701236     -0.495033 -0.192364
110 -0.229966        0.497087      0.123665 -0.192364
82  -0.854579       -0.615641     -0.495033 -0.192364
51  -0.750477       -0.102075     -0.495033 -0.192364
218  1.045285        2.265469      0.536130 -0.192364


# Model applying

# SVM Model

In [5]:
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
import numpy as np

# Defining a range of "C" values to test
C_values = np.logspace(-4, 4, 9)

# Finding the best "C" value
best_accuracy = 0
best_C = None

for C in C_values:
    svm_model = SVC(C=C, probability=True)
    accuracies = cross_val_score(svm_model, X_train, y_train, cv=5, scoring='accuracy')
    mean_accuracy = np.mean(accuracies)
    
    if mean_accuracy > best_accuracy:
        best_accuracy = mean_accuracy
        best_C = C

    print(f"C value: {C}, Mean Accuracy: {mean_accuracy}")

print(f"\nBest C value: {best_C} with an accuracy of {best_accuracy}")

C value: 0.0001, Mean Accuracy: 0.5706633906633907
C value: 0.001, Mean Accuracy: 0.5706633906633907
C value: 0.01, Mean Accuracy: 0.5706633906633907
C value: 0.1, Mean Accuracy: 0.6413104013104013
C value: 1.0, Mean Accuracy: 0.6756429156429157
C value: 10.0, Mean Accuracy: 0.6938083538083537
C value: 100.0, Mean Accuracy: 0.7065192465192466
C value: 1000.0, Mean Accuracy: 0.7065356265356265
C value: 10000.0, Mean Accuracy: 0.7644717444717444

Best C value: 10000.0 with an accuracy of 0.7644717444717444


# training SVM model by using the best 'C'

In [6]:
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support, accuracy_score

# Train the SVM model with the best C value
svm_best = SVC(C=best_C, probability=True)
svm_best.fit(X_train_scaled, y_train)

# Predict probabilities on the test set
y_probs = svm_best.predict_proba(X_test_scaled)[:, 1]

# Define a range of probability thresholds
thresholds = np.arange(0.0, 1.05, 0.05)

# Initialize a list to store performance measures for each threshold
performance_measures = []

# Evaluate performance at each threshold
for threshold in thresholds:
    # Convert probabilities to binary predictions based on the current threshold
    y_pred = (y_probs >= threshold).astype(int)

    # Calculate confusion matrix elements: TN, FP, FN, TP
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

    # Calculate precision, recall, and F1-score
    precision, recall, f1, _ = precision_recall_fscore_support(y_test, y_pred, average='binary')
    
    # Calculate accuracy
    accuracy = accuracy_score(y_test, y_pred)

    # Append to the list
    performance_measures.append([threshold, tn, fp, fn, tp, precision, recall, f1, accuracy])

# Convert the list to a DataFrame
performance_df = pd.DataFrame(performance_measures, columns=['Threshold', 'TN', 'FP', 'FN', 'TP', 'Precision', 'Recall', 'F1', 'Accuracy'])

print(performance_df)


    Threshold  TN  FP  FN  TP  Precision    Recall        F1  Accuracy
0        0.00   0  68   0  70   0.507246  1.000000  0.673077  0.507246
1        0.05   7  61   3  67   0.523438  0.957143  0.676768  0.536232
2        0.10  16  52   5  65   0.555556  0.928571  0.695187  0.586957
3        0.15  23  45   5  65   0.590909  0.928571  0.722222  0.637681
4        0.20  31  37   9  61   0.622449  0.871429  0.726190  0.666667
5        0.25  33  35  12  58   0.623656  0.828571  0.711656  0.659420
6        0.30  38  30  16  54   0.642857  0.771429  0.701299  0.666667
7        0.35  44  24  17  53   0.688312  0.757143  0.721088  0.702899
8        0.40  48  20  20  50   0.714286  0.714286  0.714286  0.710145
9        0.45  51  17  22  48   0.738462  0.685714  0.711111  0.717391
10       0.50  56  12  27  43   0.781818  0.614286  0.688000  0.717391
11       0.55  58  10  28  42   0.807692  0.600000  0.688525  0.724638
12       0.60  59   9  31  39   0.812500  0.557143  0.661017  0.710145
13    

  _warn_prf(average, modifier, msg_start, len(result))


# Best model is "SVM"
The SVM model with a F1 Score and Accuracy of 0.50 exhibits a slightly superior performance compared to Logistic Regression. While Logistic Regression shows a slightly higher Precision, the SVM model holds an advantage in terms of Recall and F1 Score.

# Evaluation

Analysis at a Low Probability Threshold (0.05) for SVM

**False Positives (FP)**

- **Quantity:** There are 40 instances of false positives.
- **Representation:** This denotes cases where the model inaccurately predicts a positive outcome (e.g., approving a loan) when it shouldn't have.
- **Business Costs:** In a credit approval context, false positives could lead to approving loans for individuals not creditworthy, resulting in financial losses, increased business risk, and potential harm to the company's reputation.

**False Negatives (FN)**

- **Quantity:** There are 2 instances of false negatives.
- **Representation:** This represents cases where the model incorrectly predicts a negative outcome (e.g., denying a loan) when it should have been approved.
- **Business Costs:**
  - *Lost Business Opportunities:* Denying loans to creditworthy individuals may result in missed revenue opportunities.
  - *Customer Dissatisfaction:* Potential customers may become dissatisfied, impacting customer loyalty and future business.
  - *Market Share:* Consistently denying valid applications could lead to a loss of market share over time.

**Cost Comparison:**
- *Risk-Averse Approach:* False positives (erroneous approvals) might be perceived as more costly, considering the direct financial risks involved.
- *Growth-Oriented Approach:* False negatives (erroneous denials) might be considered more damaging, representing lost opportunities and customer dissatisfaction.
- *Balanced Approach:* Many banks aim for a balanced approach to optimize risk management and customer service.

**Analysis at Two More Probability Thresholds (0.5 & 0.9) for SVM**

**a) Implications at Threshold 0.50**

**False Positives (FP)**

- **Quantity:** There are 12 instances of false positives.
- **Implications:** This indicates instances where the model incorrectly predicts loan approval when it should not have.
- **Business Costs:**
  - *Financial Risk:* Approving loans to potential defaulters poses a direct financial risk.
  - *Reputational Risk:* Regularly approving bad loans can damage the bank's reputation.
  - *Resource Misallocation:* Allocating funds to undeserving candidates can lead to missed opportunities.

**False Negatives (FN)**

- **Quantity:** There are 10 instances of false negatives.
- **Implications:** This represents instances where the model wrongly predicts loan denial when it should have been approved.
- **Business Costs:**
  - *Lost Revenue:* Denying loans to creditworthy individuals means missing out on potential earnings.
  - *Customer Dissatisfaction:* Erroneously denied applicants may seek services from competitors, leading to customer churn.
  - *Market Share Impact:* Systematically denying valid loans can negatively affect the bank's market share and brand image.

**Cost Comparison:**
- At a 0.50 threshold, the model is more balanced between approving and denying loans, suggesting a moderate approach aligned with a bank seeking a balance between risk management and market expansion.

**b) Implications at Threshold 0.90**

**False Positives (FP)**

- **Quantity:** There are 0 instances of false positives.
- **Implications:** This indicates instances where the model incorrectly predicts loan approval for applications that should be denied.
- **Business Costs:**
  - *Financial Risk:* Low risk of loss due to fewer incorrect loan approvals.
  - *Reputational Impact:* Minimal, as the bank largely avoids approving risky loans.
  - *Resource Allocation:* More efficient, as fewer resources are used on managing bad debts.

**False Negatives (FN)**

- **Quantity:** There are 66 instances of false negatives.
- **Implications:** This shows instances where the model wrongly predicts loan denial for applications that should be approved.
- **Business Costs:**
  - *Lost Revenue:* High, due to missing out on interest from potentially good loans.
  - *Customer Dissatisfaction:* Likely higher, as many creditworthy individuals are denied loans.
  - *Market Share Impact:* Potentially significant, as denied applicants might turn to competitors.

**Cost Comparison:**
- While financial risks are minimized at this threshold, the opportunity costs and market impact of false negatives are significant, suggesting a stringent lending policy suitable for banks with very low-risk tolerance.

# Final recommendation

In the context of loan approval, I recommend choosing a threshold that strikes a balance between minimizing false positives (FP) and false negatives (FN) while maintaining a high level of accuracy. The selected threshold should mitigate the risk of approving too many bad loans (high FP) while also avoiding the denial of creditworthy applicants (high FN).

The threshold at 0.50 appears suitable, and here's the analysis:

**Balance Between Precision and Recall:** At this threshold, the model achieves a harmonious balance between precision (0.8333) and recall (0.8571), indicating accuracy in positive predictions and effective identification of actual positive cases.

**F1 Score:** The F1 score, a harmonized metric of precision and recall, is 0.8451 at this threshold, representing one of the highest F1 scores. This signifies a robust equilibrium between precision and recall.

**False Positives and False Negatives:** With 12 false positives and 10 false negatives, the model maintains a reasonable trade-off between these two error types, crucial in contexts like credit approval where both errors carry distinct costs.

**Accuracy:** The accuracy at this threshold is 0.8406, ranking among the highest across thresholds, indicating an overall high rate of correct predictions.

**Business Perspective:**

**Risk Management:** This threshold strikes a balance between the risk of approving bad loans and the risk of missing opportunities by denying good loans.

**Business Impact:** By avoiding extremes in loan approvals, it potentially maximizes profitability while maintaining customer satisfaction.

**Regulatory Compliance:** It aligns with typical regulatory expectations in the banking sector by avoiding extreme risk-averse or risk-taking approaches.

This threshold represents a balanced and moderate risk approach, aligning with the standard risk appetite of most commercial banks. It ensures competitiveness without being overly stringent, while still prudently avoiding significant default risks.