# <b>BMAN60422</b> - Data Analytics </br>
## Developing A Credit Scoring Model


This jupyter notebook investigates the data.csv file that is provided to make decisions about classifying applications for unsecured loans.

Each chapter in this notebook saves their relative results (graphs, figures, tables etc.) into their respective directory inside the analytics folder that the script creates.

The titles with light blue colour to them have code cells below them.

---
Author: Ozgur Aziz - 10860809

# <span style="color:lightblue">Loading Everything</span>


In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, roc_auc_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectKBest, f_classif
import xgboost as xgb
from xgboost import XGBClassifier

# Create analytics folder
if not os.path.exists('analytics'):
    os.makedirs('analytics')

output_dirs = ['chapter1', 'chapter2', 'chapter3', 'chapter4', 'chapter5']
for dir_name in output_dirs:
    if not os.path.exists(f'analytics/{dir_name}'):
        os.makedirs(f'analytics/{dir_name}')

# Set style for plots
plt.style.use('seaborn-v0_8-whitegrid')

# Dictionary with column descriptions
column_descriptions = {
    'BAD': 'Loan Default Status (0: paid, 1: defaulted)',
    'LOAN': 'Loan Amount Requested',
    'MORTDUE': 'Amount Due on Existing Mortgage',
    'VALUE': 'Value of Current Property',
    'REASON': 'Loan Purpose (DebtCon: debt consolidation, HomeImp: home improvement)',
    'JOB': 'Occupation Category',
    'YOJ': 'Years at Present Job',
    'DEROG': 'Number of Major Derogatory Reports',
    'DELINQ': 'Number of Delinquent Credit Lines',
    'CLAGE': 'Age of Oldest Credit Line in Months',
    'NINQ': 'Number of Recent Credit Inquiries',
    'CLNO': 'Number of Credit Lines',
    'DEBTINC': 'Debt-to-Income Ratio'
}

# Load the data
data = pd.read_csv('data.csv')

# Display data shape
print(f"Data shape: {data.shape}")
print(f"Number of rows: {data.shape[0]}")
print(f"Number of columns: {data.shape[1]}")

Data shape: (5960, 13)
Number of rows: 5960
Number of columns: 13


# Chapter 1

<span style="color:red">Future Works for Chapter 1:</span>
* Advanced outlier detection techniques like <b>Isolation Forest or DBSCAN</b></br>
* <b>Data quality scores</b> to quantify data completeness and reliability</br>

## <span style="color:lightblue">Initial Data Exploring</span>

In [4]:
# Display basic information about the dataset
print("Data Information:")
print(data.info())

print('\nDescriptive statistics:')
print(data.describe())

# Check for missing values
missing_values = data.isnull().sum()
missing_percentage = (missing_values / len(data)) * 100

print('\nMissing values per column:')
print(missing_values)

print('\nMissing values percentage:')
print(missing_percentage)

# Save missing values plot
plt.figure(figsize=(10, 6))
missing_percentage.plot(kind='bar')
plt.title('Percentage of Missing Values by Column')
plt.ylabel('Missing Values (%)')
plt.xlabel('Columns')
plt.tight_layout()
plt.savefig('analytics/chapter1/missing_values.png')
plt.close()

print("Missing values visualization saved to 'analytics/chapter1/missing_values.png'")

Data Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5960 entries, 0 to 5959
Data columns (total 13 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   BAD      5960 non-null   int64  
 1   LOAN     5960 non-null   int64  
 2   MORTDUE  5442 non-null   float64
 3   VALUE    5848 non-null   float64
 4   REASON   5708 non-null   object 
 5   JOB      5681 non-null   object 
 6   YOJ      5445 non-null   float64
 7   DEROG    5252 non-null   float64
 8   DELINQ   5380 non-null   float64
 9   CLAGE    5652 non-null   float64
 10  NINQ     5450 non-null   float64
 11  CLNO     5738 non-null   float64
 12  DEBTINC  4693 non-null   float64
dtypes: float64(9), int64(2), object(2)
memory usage: 605.4+ KB
None

Descriptive statistics:
               BAD          LOAN        MORTDUE          VALUE          YOJ  \
count  5960.000000   5960.000000    5442.000000    5848.000000  5445.000000   
mean      0.199497  18607.969799   73760.817200  1017

## <span style="color:lightblue">Looking at Target Variables</span>

In [5]:
# Visualize the distribution of the target variable
plt.figure(figsize=(8, 6))
ax = sns.countplot(x='BAD', data=data)
plt.title('Distribution of Target Variable: ' + column_descriptions['BAD'])
plt.xlabel(column_descriptions['BAD'])
plt.ylabel('Count')

# Add percentages on top of bars
total = len(data)
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x() + p.get_width()/2., height + 0.1,
            '{:1.2f}%'.format(height/total*100),
            ha="center")

plt.tight_layout()
plt.savefig('analytics/chapter1/target_distribution.png')
plt.close()

# Calculate and print class distribution
class_counts = data['BAD'].value_counts()
class_percentages = class_counts / len(data) * 100

print("Target Variable Distribution:")
print(f"Class 0 (Good loans): {class_counts[0]} ({class_percentages[0]:.2f}%)")
print(f"Class 1 (Bad loans): {class_counts[1]} ({class_percentages[1]:.2f}%)")
print(f"Class imbalance ratio: 1:{class_counts[0]/class_counts[1]:.2f}")
print("Target distribution visualization saved to 'analytics/chapter1/target_distribution.png'")

Target Variable Distribution:
Class 0 (Good loans): 4771 (80.05%)
Class 1 (Bad loans): 1189 (19.95%)
Class imbalance ratio: 1:4.01
Target distribution visualization saved to 'analytics/chapter1/target_distribution.png'


# Chapter 2

<span style="color:red">Future Works for Chapter 2:</span>
* Create interaction features between strongly correlated variables</br>
* Apply feature engineering to create more meaningful variables (loan-to-value ratio, etc.)</br>
* Use statistical tests to more rigorously identify significant predictors</br>
* Apply non-linear transformations to improve feature distributions</br>
* Implement feature selection techniques like <b>RFE or LASSO</b> to reduce dimensionality</br>

## <span style="color:lightblue">Numeric Features Analysis</span>

In [6]:
# Get numeric columns
numeric_cols = data.select_dtypes(include=['float64', 'int64']).columns
numeric_cols = [col for col in numeric_cols if col != 'BAD']

# Print basic statistics for numeric features
print("Numeric Features Summary Statistics:")
for col in numeric_cols:
    print(f"\n{col}: {column_descriptions.get(col, col)}")
    print(data[col].describe())
    
    # Generate histogram
    plt.figure(figsize=(10, 6))
    sns.histplot(data[col].dropna(), kde=True)
    plt.title(f'Distribution of {col}: {column_descriptions.get(col, col)}')
    plt.xlabel(column_descriptions.get(col, col))
    plt.ylabel('Frequency')
    plt.tight_layout()
    plt.savefig(f'analytics/chapter2/distribution_{col}.png')
    plt.close()
    
    # Generate boxplot by target
    plt.figure(figsize=(10, 6))
    sns.boxplot(x='BAD', y=col, data=data)
    plt.title(f'{col} by Target: {column_descriptions.get(col, col)}')
    plt.xlabel(column_descriptions['BAD'])
    plt.ylabel(column_descriptions.get(col, col))
    plt.tight_layout()
    plt.savefig(f'analytics/chapter2/boxplot_{col}_by_target.png')
    plt.close()

print(f"Visualizations for {len(numeric_cols)} numeric features saved to 'analytics/chapter2/' directory")

Numeric Features Summary Statistics:

LOAN: Loan Amount Requested
count     5960.000000
mean     18607.969799
std      11207.480417
min       1100.000000
25%      11100.000000
50%      16300.000000
75%      23300.000000
max      89900.000000
Name: LOAN, dtype: float64

MORTDUE: Amount Due on Existing Mortgage
count      5442.000000
mean      73760.817200
std       44457.609458
min        2063.000000
25%       46276.000000
50%       65019.000000
75%       91488.000000
max      399550.000000
Name: MORTDUE, dtype: float64

VALUE: Value of Current Property
count      5848.000000
mean     101776.048741
std       57385.775334
min        8000.000000
25%       66075.500000
50%       89235.500000
75%      119824.250000
max      855909.000000
Name: VALUE, dtype: float64

YOJ: Years at Present Job
count    5445.000000
mean        8.922268
std         7.573982
min         0.000000
25%         3.000000
50%         7.000000
75%        13.000000
max        41.000000
Name: YOJ, dtype: float64

DEROG: 

## <span style="color:lightblue">Categorical Features Analysis</span>

In [31]:
# Analyze categorical features
cat_cols = ['REASON', 'JOB']

print("Categorical Features Analysis:")
for col in cat_cols:
    print(f"\n{col}: {column_descriptions.get(col, col)}")
    
    # Print value counts and percentages
    val_counts = data[col].value_counts()
    val_percentages = (val_counts / val_counts.sum() * 100).round(2)
    
    print("Value counts:")
    for category, count in val_counts.items():
        print(f"  {category}: {count} ({val_percentages[category]}%)")
    
    # Print missing values
    missing = data[col].isnull().sum()
    missing_percent = (missing / len(data) * 100).round(2)
    print(f"Missing values: {missing} ({missing_percent}%)")
    
    # Print default rates by category
    default_rates = data.groupby(col)['BAD'].mean() * 100
    print("\nDefault rates by category:")
    for category, rate in default_rates.items():
        print(f"  {category}: {rate:.2f}%")
    
    # Count plot of categorical variable
    plt.figure(figsize=(12, 6))
    
    # Create plot
    ax = sns.countplot(x=col, data=data, order=val_counts.index)
    
    # Rotate x-labels
    plt.xticks(rotation=45, ha='right')
    
    plt.title(f'Distribution of {col}: {column_descriptions.get(col, col)}')
    plt.xlabel(column_descriptions.get(col, col))
    plt.ylabel('Count')
    
    # Add count labels
    for p in ax.patches:
        ax.annotate(format(p.get_height(), '.0f'), 
                    (p.get_x() + p.get_width() / 2., p.get_height()), 
                    ha = 'center', va = 'center', 
                    xytext = (0, 10), 
                    textcoords = 'offset points')
    
    plt.tight_layout()
    plt.savefig(f'analytics/chapter2/categorical_{col}.png')
    plt.close()
    
    # Distribution by target
    plt.figure(figsize=(12, 6))
    ax = sns.countplot(x=col, hue='BAD', data=data)
    plt.title(f'Distribution of {col} by Target: {column_descriptions.get(col, col)}')
    plt.xlabel(column_descriptions.get(col, col))
    plt.ylabel('Count')
    plt.xticks(rotation=45, ha='right')
    plt.legend(title=column_descriptions['BAD'])
    plt.tight_layout()
    plt.savefig(f'analytics/chapter2/categorical_{col}_by_target.png')
    plt.close()
    
    # Percentage of defaults by category
    plt.figure(figsize=(12, 6))
    default_rate = data.groupby(col)['BAD'].mean() * 100
    default_rate.sort_values(ascending=False).plot(kind='bar')
    plt.title(f'Default Rate by {col}: {column_descriptions.get(col, col)}')
    plt.xlabel(column_descriptions.get(col, col))
    plt.ylabel('Default Rate (%)')
    plt.xticks(rotation=45, ha='right')
    plt.tight_layout()
    plt.savefig(f'analytics/chapter2/default_rate_by_{col}.png')
    plt.close()

print(f"Visualizations for categorical features saved to 'analytics/chapter2/' directory")

Categorical Features Analysis:

REASON: Loan Purpose (DebtCon: debt consolidation, HomeImp: home improvement)
Value counts:
  DebtCon: 3928 (68.82%)
  HomeImp: 1780 (31.18%)
Missing values: 252 (4.23%)

Default rates by category:
  DebtCon: 18.97%
  HomeImp: 22.25%

JOB: Occupation Category
Value counts:
  Other: 2388 (42.03%)
  ProfExe: 1276 (22.46%)
  Office: 948 (16.69%)
  Mgr: 767 (13.5%)
  Self: 193 (3.4%)
  Sales: 109 (1.92%)
Missing values: 279 (4.68%)

Default rates by category:
  Mgr: 23.34%
  Office: 13.19%
  Other: 23.20%
  ProfExe: 16.61%
  Sales: 34.86%
  Self: 30.05%
Visualizations for categorical features saved to 'analytics/chapter2/' directory


## <span style="color:lightblue">Correlation Analysis</span>

In [8]:
# Correlation analysis for numeric columns only
numeric_data = data.select_dtypes(include=['float64', 'int64'])

# Calculate correlation matrix
corr = numeric_data.corr()

# Print strongest correlations
print("Top 10 strongest feature correlations:")
corr_pairs = []
for i in range(len(corr.columns)):
    for j in range(i):
        if i != j:  # Avoid diagonal
            corr_pairs.append((corr.columns[i], corr.columns[j], corr.iloc[i, j]))

# Sort by absolute correlation and print top 10
corr_pairs.sort(key=lambda x: abs(x[2]), reverse=True)
for var1, var2, corr_val in corr_pairs[:10]:
    print(f"{var1} & {var2}: {corr_val:.3f}")

# Create correlation heatmap
plt.figure(figsize=(12, 10))
mask = np.triu(np.ones_like(corr, dtype=bool))
sns.heatmap(corr, mask=mask, annot=True, fmt='.2f', cmap='coolwarm', square=True)
plt.title('Correlation Matrix')
plt.tight_layout()
plt.savefig('analytics/chapter2/correlation_matrix.png')
plt.close()

print("Correlation matrix visualization saved to 'analytics/chapter2/correlation_matrix.png'")

Top 10 strongest feature correlations:
VALUE & MORTDUE: 0.876
DELINQ & BAD: 0.354
VALUE & LOAN: 0.335
CLNO & MORTDUE: 0.324
DEROG & BAD: 0.276
CLNO & VALUE: 0.269
CLNO & CLAGE: 0.238
MORTDUE & LOAN: 0.229
DELINQ & DEROG: 0.212
CLAGE & YOJ: 0.202
Correlation matrix visualization saved to 'analytics/chapter2/correlation_matrix.png'


# Chapter 3 (General Model Development)

The data is split 80/20 for training/testing with stratification to maintain class distribution. </br></br>
Missing values are handled in two ways:</br>
* Numeric features: Assigned median values to reduce outlier impact</br>
* Categorical features: Assigned the most frequent value </br>

Numeric features are standardised (mean = 0, standard deviation = 1) to improve model performance.</br>
Categorical features are one-hot encoded to convert them to numeric format.

<b>3 different algorithms</b> were chosen to represent different modeling approaches:</br>
* Logistic Regression: Linear model with L2 regularization, max_iter=1000</br>
* Random Forest: Ensemble of decision trees with default parameters</br>
* XGBoost: Gradient boosting implementation with default parameters</br>

Random state 42 is used consistently to ensure reproducibility</br>
No explicit hyperparameter tuning is implemented in the initial modeling phase</br>

<span style="color:red">Future Works for Chapter 3:</span>
* Implement cross-validation for more robust performance estimates</br>
* Conduct hyperparameter tuning using GridSearchCV or Bayesian optimization</br>
* Try additional models such as neural networks or SVM</br>
* Implement stacking or voting ensembles to combine model strengths</br>

## <span style="color:lightblue">Data Preparation</span>

In [9]:
print("Preparing data for modeling...")

# Fill missing values for analysis
data_filled = data.copy()

# Fill missing numeric values with median
numeric_cols = data.select_dtypes(include=['float64', 'int64']).columns
numeric_cols = [col for col in numeric_cols if col != 'BAD']
for col in numeric_cols:
    data_filled[col] = data_filled[col].fillna(data_filled[col].median())

# Fill missing categorical values with mode
cat_cols = ['REASON', 'JOB']
for col in cat_cols:
    data_filled[col] = data_filled[col].fillna(data_filled[col].mode()[0])

X = data_filled.drop('BAD', axis=1)
y = data_filled['BAD']

# Define features
categorical_features = ['REASON', 'JOB']
numerical_features = [col for col in X.columns if col not in categorical_features]

# Define preprocessing for numerical and categorical features
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combine preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print(f"Training set shape: {X_train.shape}, {y_train.shape}")
print(f"Testing set shape: {X_test.shape}, {y_test.shape}")
print(f"Class distribution in training set: {dict(y_train.value_counts())}")
print(f"Class distribution in testing set: {dict(y_test.value_counts())}")

Preparing data for modeling...
Training set shape: (4768, 12), (4768,)
Testing set shape: (1192, 12), (1192,)
Class distribution in training set: {0: 3817, 1: 951}
Class distribution in testing set: {0: 954, 1: 238}


## <span style="color:lightblue">Initial Feature Importance Analysis (using Random Forest)</span>

In [25]:
print("Analyzing feature importance...")

rf_clf = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42))
])

rf_clf.fit(X_train, y_train)

# Get feature names after one-hot encoding
feature_names = numerical_features.copy()
categorical_features_encoded = []
for name, transformer in rf_clf.named_steps['preprocessor'].named_transformers_.items():
    if name == 'cat':
        for feature, categories in zip(categorical_features, transformer.named_steps['onehot'].categories_):
            for category in categories:
                categorical_features_encoded.append(f"{feature}_{category}")
        
feature_names_encoded = feature_names + categorical_features_encoded

# Get feature importances
importances = rf_clf.named_steps['classifier'].feature_importances_
indices = np.argsort(importances)[::-1]

# Print feature importances
print("Feature importances from Random Forest:")
for i in range(min(20, len(indices))):
    if i < len(feature_names_encoded):
        print(f"{i+1}. {feature_names_encoded[indices[i]]}: {importances[indices[i]]:.4f}")
    else:
        print(f"{i+1}. Feature {indices[i]}: {importances[indices[i]]:.4f}")

# Plot feature importances
plt.figure(figsize=(12, 8))
plt.title('Feature Importances')
plt.barh(range(min(20, len(indices))), importances[indices][:20], align='center')
plt.yticks(range(min(20, len(indices))), [feature_names_encoded[i] if i < len(feature_names_encoded) else f"Feature {i}" for i in indices[:20]])
plt.xlabel('Relative Importance')
plt.tight_layout()
plt.savefig('analytics/chapter3/rf_feature_importances.png')
plt.close()

print("Feature importance visualization saved to 'analytics/chapter3/rf_feature_importances.png'")

Analyzing feature importance...
Feature importances from Random Forest:
1. DEBTINC: 0.2328
2. CLAGE: 0.1057
3. DELINQ: 0.0984
4. LOAN: 0.0926
5. VALUE: 0.0866
6. MORTDUE: 0.0815
7. CLNO: 0.0780
8. YOJ: 0.0604
9. DEROG: 0.0538
10. NINQ: 0.0441
11. JOB_Other: 0.0110
12. JOB_Office: 0.0105
13. REASON_HomeImp: 0.0096
14. REASON_DebtCon: 0.0092
15. JOB_ProfExe: 0.0085
16. JOB_Mgr: 0.0065
17. JOB_Sales: 0.0062
18. JOB_Self: 0.0046
Feature importance visualization saved to 'analytics/chapter3/rf_feature_importances.png'


## <span style="color:lightblue">Model Development (3 Models)</span>

At the end of this section there is a comparison of
- LR (Logistic Regression)
- RF (Random Forest)
- XGB (XGBoost)</br>

models using their ROC curves. 

### <span style="color:lightblue">Logistic Regression</span>

In [11]:
print("Training Logistic Regression model...")

log_reg = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(random_state=42, max_iter=1000))
])

log_reg.fit(X_train, y_train)
y_pred_lr = log_reg.predict(X_test)
y_prob_lr = log_reg.predict_proba(X_test)[:, 1]

print("\nLogistic Regression Results:")
lr_report = classification_report(y_test, y_pred_lr)
print(lr_report)

# Save classification report
with open('analytics/chapter3/lr_classification_report.txt', 'w') as f:
    f.write(lr_report)

# Save confusion matrix
cm_lr = confusion_matrix(y_test, y_pred_lr)
plt.figure(figsize=(8, 6))
sns.heatmap(cm_lr, annot=True, fmt='d', cmap='Blues')
plt.title('Logistic Regression Confusion Matrix')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.tight_layout()
plt.savefig('analytics/chapter3/lr_confusion_matrix.png')
plt.close()

print("Logistic Regression results saved to 'analytics/chapter3/' directory")

Training Logistic Regression model...

Logistic Regression Results:
              precision    recall  f1-score   support

           0       0.86      0.97      0.91       954
           1       0.71      0.34      0.46       238

    accuracy                           0.84      1192
   macro avg       0.78      0.65      0.69      1192
weighted avg       0.83      0.84      0.82      1192

Logistic Regression results saved to 'analytics/chapter3/' directory


### <span style="color:lightblue">Random Forest</span>

In [12]:
print("Training Random Forest model...")

random_forest = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42))
])

random_forest.fit(X_train, y_train)
y_pred_rf = random_forest.predict(X_test)
y_prob_rf = random_forest.predict_proba(X_test)[:, 1]

print("\nRandom Forest Results:")
rf_report = classification_report(y_test, y_pred_rf)
print(rf_report)

# Save classification report
with open('analytics/chapter3/rf_classification_report.txt', 'w') as f:
    f.write(rf_report)

# Save confusion matrix
cm_rf = confusion_matrix(y_test, y_pred_rf)
plt.figure(figsize=(8, 6))
sns.heatmap(cm_rf, annot=True, fmt='d', cmap='Blues')
plt.title('Random Forest Confusion Matrix')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.tight_layout()
plt.savefig('analytics/chapter3/rf_confusion_matrix.png')
plt.close()

print("Random Forest results saved to 'analytics/chapter3/' directory")

Training Random Forest model...

Random Forest Results:
              precision    recall  f1-score   support

           0       0.91      0.98      0.94       954
           1       0.87      0.63      0.73       238

    accuracy                           0.91      1192
   macro avg       0.89      0.81      0.84      1192
weighted avg       0.91      0.91      0.90      1192

Random Forest results saved to 'analytics/chapter3/' directory


### <span style="color:lightblue">XGBoost</span>

In [13]:
print("Training XGBoost model...")

xgb_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', XGBClassifier(random_state=42))
])

xgb_pipeline.fit(X_train, y_train)
y_pred_xgb = xgb_pipeline.predict(X_test)
y_prob_xgb = xgb_pipeline.predict_proba(X_test)[:, 1]

print("\nXGBoost Results:")
xgb_report = classification_report(y_test, y_pred_xgb)
print(xgb_report)

# Save classification report
with open('analytics/chapter3/xgb_classification_report.txt', 'w') as f:
    f.write(xgb_report)

# Save confusion matrix
cm_xgb = confusion_matrix(y_test, y_pred_xgb)
plt.figure(figsize=(8, 6))
sns.heatmap(cm_xgb, annot=True, fmt='d', cmap='Blues')
plt.title('XGBoost Confusion Matrix')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.tight_layout()
plt.savefig('analytics/chapter3/xgb_confusion_matrix.png')
plt.close()

print("XGBoost results saved to 'analytics/chapter3/' directory")

Training XGBoost model...

XGBoost Results:
              precision    recall  f1-score   support

           0       0.92      0.97      0.94       954
           1       0.84      0.66      0.74       238

    accuracy                           0.91      1192
   macro avg       0.88      0.81      0.84      1192
weighted avg       0.90      0.91      0.90      1192

XGBoost results saved to 'analytics/chapter3/' directory


### <span style="color:lightblue">Model Comparison - ROC Curves</span>

In [14]:
# ROC Curve Comparison
print("Comparing models using ROC curves...")

plt.figure(figsize=(10, 8))
# Logistic Regression
fpr_lr, tpr_lr, _ = roc_curve(y_test, y_prob_lr)
auc_lr = roc_auc_score(y_test, y_prob_lr)
plt.plot(fpr_lr, tpr_lr, label=f'Logistic Regression (AUC = {auc_lr:.3f})')

# Random Forest
fpr_rf, tpr_rf, _ = roc_curve(y_test, y_prob_rf)
auc_rf = roc_auc_score(y_test, y_prob_rf)
plt.plot(fpr_rf, tpr_rf, label=f'Random Forest (AUC = {auc_rf:.3f})')

# XGBoost
fpr_xgb, tpr_xgb, _ = roc_curve(y_test, y_prob_xgb)
auc_xgb = roc_auc_score(y_test, y_prob_xgb)
plt.plot(fpr_xgb, tpr_xgb, label=f'XGBoost (AUC = {auc_xgb:.3f})')

# Add diagonal line (random classifier)
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve Comparison')
plt.legend(loc='lower right')
plt.grid(True)
plt.tight_layout()
plt.savefig('analytics/chapter3/roc_comparison.png')
plt.close()

print("ROC Curve comparison saved to 'analytics/chapter3/roc_comparison.png'")
print(f"\nAUC Scores:")
print(f"Logistic Regression: {auc_lr:.4f}")
print(f"Random Forest: {auc_rf:.4f}")
print(f"XGBoost: {auc_xgb:.4f}")

Comparing models using ROC curves...
ROC Curve comparison saved to 'analytics/chapter3/roc_comparison.png'

AUC Scores:
Logistic Regression: 0.7643
Random Forest: 0.9644
XGBoost: 0.9507


### <span style="color:lightblue">Combined (3 Models) Feature Importance Comparison</span>

In [29]:
# Create a directory for feature importance plots
if not os.path.exists('analytics/chapter3/3models_feature_importances'):
    os.makedirs('analytics/chapter3/3models_feature_importances')

# Function to plot feature importance
def plot_feature_importance(importances, feature_names, model_name, top_n=15):
    # Sort features by importance
    indices = np.argsort(importances)[::-1]
    
    # Select top N features
    indices = indices[:top_n]
    
    plt.figure(figsize=(12, 8))
    plt.title(f'Top {top_n} Feature Importance - {model_name}')
    plt.barh(range(len(indices)), importances[indices], align='center')
    plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
    plt.xlabel('Relative Importance')
    plt.tight_layout()
    plt.savefig(f'analytics/chapter3/3models_feature_importances/{model_name.lower().replace(" ", "_")}_importance.png')
    plt.close()
    
    # Also save the data to CSV
    importance_df = pd.DataFrame({
        'Feature': feature_names,
        'Importance': importances
    })
    importance_df = importance_df.sort_values('Importance', ascending=False)
    importance_df.to_csv(f'analytics/chapter3/3models_feature_importances/{model_name.lower().replace(" ", "_")}_importance.csv', index=False)
    
    return indices

# Get feature names after preprocessing (for proper feature importance)
feature_names = numerical_features.copy()
categorical_features_encoded = []
for name, transformer in rf_clf.named_steps['preprocessor'].named_transformers_.items():
    if name == 'cat':
        for feature, categories in zip(categorical_features, transformer.named_steps['onehot'].categories_):
            for category in categories:
                categorical_features_encoded.append(f"{feature}_{category}")
        
feature_names_encoded = feature_names + categorical_features_encoded

# 1. Logistic Regression Feature Importance
# Access coefficients from the classifier inside the pipeline
lr_importances = np.abs(log_reg.named_steps['classifier'].coef_[0])
lr_indices = plot_feature_importance(lr_importances, feature_names_encoded, 'Logistic Regression')
print("Top 5 features for Logistic Regression:")
print("\n".join([f"{i+1}. {feature_names_encoded[indices[i]]} ({lr_importances[indices[i]]:.4f})" for i in range(5)]))

# 2. Random Forest Feature Importance
rf_importances = rf_clf.named_steps['classifier'].feature_importances_
rf_indices = plot_feature_importance(rf_importances, feature_names_encoded, 'Random Forest')
print("\nTop 5 features for Random Forest:")
print("\n".join([f"{i+1}. {feature_names_encoded[indices[i]]} ({rf_importances[indices[i]]:.4f})" for i in range(5)]))

# 3. XGBoost Feature Importance
xgb_importances = xgb_pipeline.named_steps['classifier'].feature_importances_
xgb_indices = plot_feature_importance(xgb_importances, feature_names_encoded, 'XGBoost')
print("\nTop 5 features for XGBoost:")
print("\n".join([f"{i+1}. {feature_names_encoded[indices[i]]} ({xgb_importances[indices[i]]:.4f})" for i in range(5)]))

# Create comparison visualization for top features across models
top_features = set()
for i in range(5):
    if i < len(lr_indices): top_features.add(feature_names_encoded[lr_indices[i]])
    if i < len(rf_indices): top_features.add(feature_names_encoded[rf_indices[i]])
    if i < len(xgb_indices): top_features.add(feature_names_encoded[xgb_indices[i]])

# Create a comparison table
comparison_data = []
for feature in top_features:
    idx = feature_names_encoded.index(feature)
    lr_rank = idx in lr_indices[:10]
    rf_rank = idx in rf_indices[:10]
    xgb_rank = idx in xgb_indices[:10]
    
    comparison_data.append({
        'Feature': feature,
        'Important in LR': '✓' if lr_rank else '✗',
        'Important in RF': '✓' if rf_rank else '✗',
        'Important in XGB': '✓' if xgb_rank else '✗'
    })

comparison_df = pd.DataFrame(comparison_data)
comparison_df.to_csv('analytics/chapter3/3models_feature_importances/model_comparison.csv', index=False)
print("\nFeature importance comparison across models:")
print(comparison_df)

# Create a combined bar chart visualization
plt.figure(figsize=(15, 10))
plt.title('Top Feature Importance Comparison Across Models')

# Get top features across all models (limited to 8 for readability)
top_features_list = list(top_features)[:8] 

x = np.arange(len(top_features_list))
width = 0.25

# Prepare data for plotting
lr_values = []
rf_values = []
xgb_values = []

for feature in top_features_list:
    idx = feature_names_encoded.index(feature)
    lr_values.append(lr_importances[idx] / np.sum(lr_importances))  # Normalize
    rf_values.append(rf_importances[idx] / np.sum(rf_importances))  # Normalize
    xgb_values.append(xgb_importances[idx] / np.sum(xgb_importances))  # Normalize

# Plot bars
plt.bar(x - width, lr_values, width, label='Logistic Regression')
plt.bar(x, rf_values, width, label='Random Forest')
plt.bar(x + width, xgb_values, width, label='XGBoost')

plt.xlabel('Features')
plt.ylabel('Normalized Importance')
plt.xticks(x, top_features_list, rotation=45, ha='right')
plt.legend()
plt.tight_layout()
plt.savefig('analytics/chapter3/3models_feature_importances/combined_importance.png')
plt.close()

print("Feature importance comparison visualizations saved to 'analytics/chapter3/3models_feature_importances/' directory")

Top 5 features for Logistic Regression:
1. DEBTINC (0.5093)
2. CLAGE (0.4936)
3. DELINQ (0.8593)
4. LOAN (0.2502)
5. VALUE (0.1521)

Top 5 features for Random Forest:
1. DEBTINC (0.2328)
2. CLAGE (0.1057)
3. DELINQ (0.0984)
4. LOAN (0.0926)
5. VALUE (0.0866)

Top 5 features for XGBoost:
1. DEBTINC (0.2366)
2. CLAGE (0.0441)
3. DELINQ (0.1703)
4. LOAN (0.0330)
5. VALUE (0.0299)

Feature importance comparison across models:
          Feature Important in LR Important in RF Important in XGB
0          DELINQ               ✓               ✓                ✓
1       JOB_Sales               ✓               ✗                ✓
2           DEROG               ✓               ✓                ✓
3      JOB_Office               ✓               ✗                ✓
4           CLAGE               ✓               ✓                ✓
5            LOAN               ✗               ✓                ✓
6           VALUE               ✗               ✓                ✗
7         DEBTINC               ✓     

# Chapter 4 (Model Development for Objectives 1 & 2)

Assumptions and Details:</br>
Threshold optimization is performed by finding the point on the ROC curve that satisfies each objective.</br>
Default threshold of 0.5 is not assumed to be optimal.</br>
Performance metrics focus on recall for the relevant class for each objective.</br>
Assumes false positives and false negatives have different business costs.</br>

<span style="color:red">Future Works for Chapter 4:</span>
* Implement cost-sensitive learning to directly optimize for business objectives
* Perform sensitivity analysis to understand how robust thresholds are to changes in data

## <span style="color:lightblue">Objective 1</span>

Objective 1 is defined as, accepting the maxumim number of good customers if at least 85% of bad customers are correctly identified. 

In [19]:
def find_threshold_for_recall(y_true, y_prob, target_recall=0.85):
    fpr, tpr, thresholds = roc_curve(y_true, y_prob)
    recall = tpr  # TPR is the same as recall
    idx = np.argmin(np.abs(recall - target_recall))
    return thresholds[idx], tpr[idx], 1-fpr[idx]  # threshold, actual recall, specificity

# Find thresholds for 85% recall on bad customers
threshold_lr_obj1, recall_lr_obj1, spec_lr_obj1 = find_threshold_for_recall(y_test, y_prob_lr, 0.85)
threshold_rf_obj1, recall_rf_obj1, spec_rf_obj1 = find_threshold_for_recall(y_test, y_prob_rf, 0.85)
threshold_xgb_obj1, recall_xgb_obj1, spec_xgb_obj1 = find_threshold_for_recall(y_test, y_prob_xgb, 0.85)

# Evaluate models with these thresholds
def evaluate_with_threshold(y_true, y_prob, threshold):
    y_pred = (y_prob >= threshold).astype(int)
    report = classification_report(y_true, y_pred, output_dict=True)
    
    # For BAD=1 class
    recall_bad = report['1']['recall']  # Percentage of bad customers correctly identified
    precision_bad = report['1']['precision']  # Precision for bad customers
    
    # For BAD=0 class
    recall_good = report['0']['recall']  # Percentage of good customers accepted
    
    # Overall accuracy
    accuracy = report['accuracy']
    
    # Confusion matrix
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    
    return {
        'threshold': threshold,
        'accuracy': accuracy,
        'recall_bad': recall_bad,  # sensitivity for bad class
        'precision_bad': precision_bad,
        'recall_good': recall_good,  # specificity for bad class
        'tn': tn,  # true negative (correctly identified good customers)
        'fp': fp,  # false positive (good customers misclassified as bad)
        'fn': fn,  # false negative (bad customers misclassified as good)
        'tp': tp   # true positive (correctly identified bad customers)
    }

print("\nLogistic Regression:")
lr_obj1_results = evaluate_with_threshold(y_test, y_prob_lr, threshold_lr_obj1)
print(f"Threshold: {lr_obj1_results['threshold']:.3f}")
print(f"Percentage of bad customers correctly identified: {lr_obj1_results['recall_bad']:.2%}")
print(f"Percentage of good customers accepted: {lr_obj1_results['recall_good']:.2%}")
print(f"Overall accuracy: {lr_obj1_results['accuracy']:.2%}")
print(f"Confusion Matrix: TN={lr_obj1_results['tn']}, FP={lr_obj1_results['fp']}, FN={lr_obj1_results['fn']}, TP={lr_obj1_results['tp']}")

print("\nRandom Forest:")
rf_obj1_results = evaluate_with_threshold(y_test, y_prob_rf, threshold_rf_obj1)
print(f"Threshold: {rf_obj1_results['threshold']:.3f}")
print(f"Percentage of bad customers correctly identified: {rf_obj1_results['recall_bad']:.2%}")
print(f"Percentage of good customers accepted: {rf_obj1_results['recall_good']:.2%}")
print(f"Overall accuracy: {rf_obj1_results['accuracy']:.2%}")
print(f"Confusion Matrix: TN={rf_obj1_results['tn']}, FP={rf_obj1_results['fp']}, FN={rf_obj1_results['fn']}, TP={rf_obj1_results['tp']}")

print("\nXGBoost:")
xgb_obj1_results = evaluate_with_threshold(y_test, y_prob_xgb, threshold_xgb_obj1)
print(f"Threshold: {xgb_obj1_results['threshold']:.3f}")
print(f"Percentage of bad customers correctly identified: {xgb_obj1_results['recall_bad']:.2%}")
print(f"Percentage of good customers accepted: {xgb_obj1_results['recall_good']:.2%}")
print(f"Overall accuracy: {xgb_obj1_results['accuracy']:.2%}")
print(f"Confusion Matrix: TN={xgb_obj1_results['tn']}, FP={xgb_obj1_results['fp']}, FN={xgb_obj1_results['fn']}, TP={xgb_obj1_results['tp']}")

# Create a better formatted table for Objective 1
obj1_data = {
    'Model': ['Logistic Regression', 'Random Forest', 'XGBoost'],
    'Threshold': [lr_obj1_results['threshold'], rf_obj1_results['threshold'], xgb_obj1_results['threshold']],
    'Bad Customers Identified (%)': [lr_obj1_results['recall_bad']*100, rf_obj1_results['recall_bad']*100, xgb_obj1_results['recall_bad']*100],
    'Good Customers Accepted (%)': [lr_obj1_results['recall_good']*100, rf_obj1_results['recall_good']*100, xgb_obj1_results['recall_good']*100],
    'Accuracy (%)': [lr_obj1_results['accuracy']*100, rf_obj1_results['accuracy']*100, xgb_obj1_results['accuracy']*100]
}

obj1_df = pd.DataFrame(obj1_data)

# Save results to CSV
obj1_df.to_csv('analytics/chapter4/objective1_results.csv', index=False)

# Create a more visually appealing table
fig, ax = plt.subplots(figsize=(12, 4))
ax.axis('off')
ax.axis('tight')

# Create table with better formatting
table = ax.table(
    cellText=obj1_df.round(2).values,
    colLabels=obj1_df.columns,
    cellLoc='center',
    loc='center',
    colColours=['#f2f2f2']*len(obj1_df.columns),
    cellColours=[['#f9f9f9']*len(obj1_df.columns)]*len(obj1_df)
)

# Style the table
table.auto_set_font_size(False)
table.set_fontsize(12)
table.scale(1, 1.5)  # Adjust row height

# Add a title with proper padding
plt.title('Objective 1: Accept max good customers if at least 85% of bad customers are identified', 
          fontsize=14, pad=20)

plt.tight_layout()
plt.savefig('analytics/chapter4/objective1_comparison.png', dpi=300, bbox_inches='tight')
plt.close()

print("Objective 1 results saved to 'analytics/chapter4/' directory")


Logistic Regression:
Threshold: 0.089
Percentage of bad customers correctly identified: 84.87%
Percentage of good customers accepted: 44.97%
Overall accuracy: 52.94%
Confusion Matrix: TN=429, FP=525, FN=36, TP=202

Random Forest:
Threshold: 0.330
Percentage of bad customers correctly identified: 84.87%
Percentage of good customers accepted: 92.87%
Overall accuracy: 91.28%
Confusion Matrix: TN=886, FP=68, FN=36, TP=202

XGBoost:
Threshold: 0.110
Percentage of bad customers correctly identified: 85.29%
Percentage of good customers accepted: 91.93%
Overall accuracy: 90.60%
Confusion Matrix: TN=877, FP=77, FN=35, TP=203
Objective 1 results saved to 'analytics/chapter4/' directory


## <span style="color:lightblue">Objective 2</span>

Objective 2 is defined as, accepting at least 70% of good customers while rejecting as many bad customers as possible.

In [18]:
def find_threshold_for_specificity(y_true, y_prob, target_specificity=0.70):
    fpr, tpr, thresholds = roc_curve(y_true, y_prob)
    specificity = 1 - fpr
    idx = np.argmin(np.abs(specificity - target_specificity))
    return thresholds[idx], tpr[idx], specificity[idx]  # threshold, recall, actual specificity

# Find thresholds for 70% specificity on good customers
threshold_lr_obj2, recall_lr_obj2, spec_lr_obj2 = find_threshold_for_specificity(y_test, y_prob_lr, 0.70)
threshold_rf_obj2, recall_rf_obj2, spec_rf_obj2 = find_threshold_for_specificity(y_test, y_prob_rf, 0.70)
threshold_xgb_obj2, recall_xgb_obj2, spec_xgb_obj2 = find_threshold_for_specificity(y_test, y_prob_xgb, 0.70)

print("\nLogistic Regression:")
lr_obj2_results = evaluate_with_threshold(y_test, y_prob_lr, threshold_lr_obj2)
print(f"Threshold: {lr_obj2_results['threshold']:.3f}")
print(f"Percentage of bad customers correctly identified: {lr_obj2_results['recall_bad']:.2%}")
print(f"Percentage of good customers accepted: {lr_obj2_results['recall_good']:.2%}")
print(f"Overall accuracy: {lr_obj2_results['accuracy']:.2%}")
print(f"Confusion Matrix: TN={lr_obj2_results['tn']}, FP={lr_obj2_results['fp']}, FN={lr_obj2_results['fn']}, TP={lr_obj2_results['tp']}")

print("\nRandom Forest:")
rf_obj2_results = evaluate_with_threshold(y_test, y_prob_rf, threshold_rf_obj2)
print(f"Threshold: {rf_obj2_results['threshold']:.3f}")
print(f"Percentage of bad customers correctly identified: {rf_obj2_results['recall_bad']:.2%}")
print(f"Percentage of good customers accepted: {rf_obj2_results['recall_good']:.2%}")
print(f"Overall accuracy: {rf_obj2_results['accuracy']:.2%}")
print(f"Confusion Matrix: TN={rf_obj2_results['tn']}, FP={rf_obj2_results['fp']}, FN={rf_obj2_results['fn']}, TP={rf_obj2_results['tp']}")

print("\nXGBoost:")
xgb_obj2_results = evaluate_with_threshold(y_test, y_prob_xgb, threshold_xgb_obj2)
print(f"Threshold: {xgb_obj2_results['threshold']:.3f}")
print(f"Percentage of bad customers correctly identified: {xgb_obj2_results['recall_bad']:.2%}")
print(f"Percentage of good customers accepted: {xgb_obj2_results['recall_good']:.2%}")
print(f"Overall accuracy: {xgb_obj2_results['accuracy']:.2%}")
print(f"Confusion Matrix: TN={xgb_obj2_results['tn']}, FP={xgb_obj2_results['fp']}, FN={xgb_obj2_results['fn']}, TP={xgb_obj2_results['tp']}")

# Create a better formatted table for Objective 2
obj2_data = {
    'Model': ['Logistic Regression', 'Random Forest', 'XGBoost'],
    'Threshold': [lr_obj2_results['threshold'], rf_obj2_results['threshold'], xgb_obj2_results['threshold']],
    'Bad Customers Identified (%)': [lr_obj2_results['recall_bad']*100, rf_obj2_results['recall_bad']*100, xgb_obj2_results['recall_bad']*100],
    'Good Customers Accepted (%)': [lr_obj2_results['recall_good']*100, rf_obj2_results['recall_good']*100, xgb_obj2_results['recall_good']*100],
    'Accuracy (%)': [lr_obj2_results['accuracy']*100, rf_obj2_results['accuracy']*100, xgb_obj2_results['accuracy']*100]
}

obj2_df = pd.DataFrame(obj2_data)

# Save results to CSV
obj2_df.to_csv('analytics/chapter4/objective2_results.csv', index=False)

# Create a more visually appealing table
fig, ax = plt.subplots(figsize=(12, 4))
ax.axis('off')
ax.axis('tight')

# Create table with better formatting
table = ax.table(
    cellText=obj2_df.round(2).values,
    colLabels=obj2_df.columns,
    cellLoc='center',
    loc='center',
    colColours=['#f2f2f2']*len(obj2_df.columns),
    cellColours=[['#f9f9f9']*len(obj2_df.columns)]*len(obj2_df)
)

# Style the table
table.auto_set_font_size(False)
table.set_fontsize(12)
table.scale(1, 1.5)  # Adjust row height

# Add a title with proper padding
plt.title('Objective 2: Accept at least 70% of good customers with max bad customer rejection', 
          fontsize=14, pad=20)

plt.tight_layout()
plt.savefig('analytics/chapter4/objective2_comparison.png', dpi=300, bbox_inches='tight')
plt.close()

print("Objective 2 results saved to 'analytics/chapter4/' directory")


Logistic Regression:
Threshold: 0.163
Percentage of bad customers correctly identified: 65.13%
Percentage of good customers accepted: 69.60%
Overall accuracy: 68.71%
Confusion Matrix: TN=664, FP=290, FN=83, TP=155

Random Forest:
Threshold: 0.070
Percentage of bad customers correctly identified: 98.32%
Percentage of good customers accepted: 70.13%
Overall accuracy: 75.76%
Confusion Matrix: TN=669, FP=285, FN=4, TP=234

XGBoost:
Threshold: 0.014
Percentage of bad customers correctly identified: 96.64%
Percentage of good customers accepted: 71.28%
Overall accuracy: 76.34%
Confusion Matrix: TN=680, FP=274, FN=8, TP=230
Objective 2 results saved to 'analytics/chapter4/' directory


# Chapter 5 (Model Comparison)

Assumptions and Details:</br>
Profit from a good customer paying back: <b>$1,500</b></br>
Loss from a bad customer defaulting: <b>$10,000</b></br>
These values represent average profit/loss per customer.</br>
Analysis scales results to show impact per <b>1,000 applications.</b></br>
Net impact is calculated by combining the financial effects of both correct and incorrect decisions.</br>

<span style="color:red">Future Works for Chapter 5:</span>
* Incorporate varying loan amounts to create more personalized profit/loss estimates
* Create what-if analysis tools to evaluate different scenarios

## <span style="color:lightblue">Model Performance Comparison (Objectives 1 & 2)</span>

In [22]:
# Bar chart of model performance by objective
print("Creating model performance comparison visualizations...")

# Objective 1 chart
plt.figure(figsize=(12, 6))
models = ['Logistic Regression', 'Random Forest', 'XGBoost']
bad_identified_obj1 = [lr_obj1_results['recall_bad']*100, rf_obj1_results['recall_bad']*100, xgb_obj1_results['recall_bad']*100]
good_accepted_obj1 = [lr_obj1_results['recall_good']*100, rf_obj1_results['recall_good']*100, xgb_obj1_results['recall_good']*100]

x = np.arange(len(models))
width = 0.35

plt.bar(x - width/2, bad_identified_obj1, width, label='Bad Customers Identified (%)')
plt.bar(x + width/2, good_accepted_obj1, width, label='Good Customers Accepted (%)')

plt.xlabel('Models')
plt.ylabel('Percentage (%)')
plt.title('Objective 1: Model Performance Comparison')
plt.xticks(x, models)
plt.ylim(0, 100)
plt.legend()
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.savefig('analytics/chapter5/objective1_performance.png')
plt.close()

# Objective 2 chart
plt.figure(figsize=(12, 6))
bad_identified_obj2 = [lr_obj2_results['recall_bad']*100, rf_obj2_results['recall_bad']*100, xgb_obj2_results['recall_bad']*100]
good_accepted_obj2 = [lr_obj2_results['recall_good']*100, rf_obj2_results['recall_good']*100, xgb_obj2_results['recall_good']*100]

plt.bar(x - width/2, bad_identified_obj2, width, label='Bad Customers Identified (%)')
plt.bar(x + width/2, good_accepted_obj2, width, label='Good Customers Accepted (%)')

plt.xlabel('Models')
plt.ylabel('Percentage (%)')
plt.title('Objective 2: Model Performance Comparison')
plt.xticks(x, models)
plt.ylim(0, 100)
plt.legend()
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.savefig('analytics/chapter5/objective2_performance.png')
plt.close()

# Combined chart showing both objectives
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 10))

# Objective 1
ax1.bar(x - width/2, bad_identified_obj1, width, label='Bad Customers Identified (%)')
ax1.bar(x + width/2, good_accepted_obj1, width, label='Good Customers Accepted (%)')
ax1.set_xlabel('Models')
ax1.set_ylabel('Percentage (%)')
ax1.set_title('Objective 1: Accept max good customers if 85% of bad customers are identified')
ax1.set_xticks(x)
ax1.set_xticklabels(models)
ax1.set_ylim(0, 100)
ax1.legend()
ax1.grid(axis='y', linestyle='--', alpha=0.7)

# Objective 2
ax2.bar(x - width/2, bad_identified_obj2, width, label='Bad Customers Identified (%)')
ax2.bar(x + width/2, good_accepted_obj2, width, label='Good Customers Accepted (%)')
ax2.set_xlabel('Models')
ax2.set_ylabel('Percentage (%)')
ax2.set_title('Objective 2: Accept at least 70% of good customers with max bad rejection')
ax2.set_xticks(x)
ax2.set_xticklabels(models)
ax2.set_ylim(0, 100)
ax2.legend()
ax2.grid(axis='y', linestyle='--', alpha=0.7)

plt.tight_layout()
plt.savefig('analytics/chapter5/model_performance_comparison.png')
plt.close()

print("Model performance visualizations saved to 'analytics/chapter5/' directory")

Creating model performance comparison visualizations...
Model performance visualizations saved to 'analytics/chapter5/' directory


## <span style="color:lightblue">Cost-Benefit Analysis</span>

In [23]:
# Define costs and benefits for different outcomes
# These are hypothetical values - adjust based on business knowledge
cost_false_positive = 500  # Cost of rejecting a good customer (lost business opportunity)
cost_false_negative = 5000  # Cost of accepting a bad customer (default loss)
benefit_true_positive = 0  # Benefit of correctly identifying a bad customer (cost avoidance)
benefit_true_negative = 1000  # Benefit of correctly accepting a good customer (profit)

def calculate_financial_impact(results):
    tn = results['tn']  # Good customers correctly accepted
    fp = results['fp']  # Good customers incorrectly rejected
    fn = results['fn']  # Bad customers incorrectly accepted
    tp = results['tp']  # Bad customers correctly rejected
    
    # Calculate financial impact
    profit_from_good = tn * benefit_true_negative
    cost_from_missed_good = fp * cost_false_positive
    cost_from_bad = fn * cost_false_negative
    benefit_from_avoided_bad = tp * benefit_true_positive
    
    net_impact = profit_from_good - cost_from_missed_good - cost_from_bad + benefit_from_avoided_bad
    
    return {
        'profit_from_good': profit_from_good,
        'cost_from_missed_good': cost_from_missed_good,
        'cost_from_bad': cost_from_bad,
        'benefit_from_avoided_bad': benefit_from_avoided_bad,
        'net_impact': net_impact
    }

# Objective 1 Financial Impact
print("\nObjective 1 Financial Impact (per 1000 applications):")
models = ['Logistic Regression', 'Random Forest', 'XGBoost']
results = [lr_obj1_results, rf_obj1_results, xgb_obj1_results]

financial_data_obj1 = []
for model, model_results in zip(models, results):
    impacts = calculate_financial_impact(model_results)
    
    # Scale to per 1000 applications
    total = model_results['tn'] + model_results['fp'] + model_results['fn'] + model_results['tp']
    scaling_factor = 1000 / total
    
    scaled_impacts = {k: v * scaling_factor for k, v in impacts.items()}
    
    financial_data_obj1.append({
        'Model': model,
        'Profit from Good Loans': scaled_impacts['profit_from_good'],
        'Cost from Missed Good Loans': scaled_impacts['cost_from_missed_good'],
        'Cost from Bad Loans': scaled_impacts['cost_from_bad'],
        'Net Impact': scaled_impacts['net_impact']
    })
    
    print(f"\n{model}:")
    print(f"Profit from Good Loans: ${scaled_impacts['profit_from_good']:,.2f}")
    print(f"Cost from Missed Good Loans: ${scaled_impacts['cost_from_missed_good']:,.2f}")
    print(f"Cost from Bad Loans: ${scaled_impacts['cost_from_bad']:,.2f}")
    print(f"Net Impact: ${scaled_impacts['net_impact']:,.2f}")

# Objective 2 Financial Impact
print("\nObjective 2 Financial Impact (per 1000 applications):")
results = [lr_obj2_results, rf_obj2_results, xgb_obj2_results]

financial_data_obj2 = []
for model, model_results in zip(models, results):
    impacts = calculate_financial_impact(model_results)
    
    # Scale to per 1000 applications
    total = model_results['tn'] + model_results['fp'] + model_results['fn'] + model_results['tp']
    scaling_factor = 1000 / total
    
    scaled_impacts = {k: v * scaling_factor for k, v in impacts.items()}
    
    financial_data_obj2.append({
        'Model': model,
        'Profit from Good Loans': scaled_impacts['profit_from_good'],
        'Cost from Missed Good Loans': scaled_impacts['cost_from_missed_good'],
        'Cost from Bad Loans': scaled_impacts['cost_from_bad'],
        'Net Impact': scaled_impacts['net_impact']
    })
    
    print(f"\n{model}:")
    print(f"Profit from Good Loans: ${scaled_impacts['profit_from_good']:,.2f}")
    print(f"Cost from Missed Good Loans: ${scaled_impacts['cost_from_missed_good']:,.2f}")
    print(f"Cost from Bad Loans: ${scaled_impacts['cost_from_bad']:,.2f}")
    print(f"Net Impact: ${scaled_impacts['net_impact']:,.2f}")

# Save financial analysis results to CSV
pd.DataFrame(financial_data_obj1).to_csv('analytics/chapter5/financial_impact_obj1.csv', index=False)
pd.DataFrame(financial_data_obj2).to_csv('analytics/chapter5/financial_impact_obj2.csv', index=False)

# Bar chart showing net impact by model for both objectives
plt.figure(figsize=(12, 6))
x = np.arange(len(models))
width = 0.35

net_impact_obj1 = [data['Net Impact'] for data in financial_data_obj1]
net_impact_obj2 = [data['Net Impact'] for data in financial_data_obj2]

plt.bar(x - width/2, net_impact_obj1, width, label='Objective 1')
plt.bar(x + width/2, net_impact_obj2, width, label='Objective 2')

plt.xlabel('Models')
plt.ylabel('Net Impact ($)')
plt.title('Financial Impact by Model and Objective (per 1000 applications)')
plt.xticks(x, models)
plt.legend()
plt.grid(axis='y', linestyle='--', alpha=0.7)

# Add value labels on bars
for i, v in enumerate(net_impact_obj1):
    plt.text(i - width/2, v + (0.05 * max(net_impact_obj1 + net_impact_obj2)), 
             f"${v:,.0f}", ha='center')
    
for i, v in enumerate(net_impact_obj2):
    plt.text(i + width/2, v + (0.05 * max(net_impact_obj1 + net_impact_obj2)), 
             f"${v:,.0f}", ha='center')

plt.tight_layout()
plt.savefig('analytics/chapter5/financial_impact_comparison.png')
plt.close()

print("Financial impact analysis saved to 'analytics/chapter5/' directory")


Objective 1 Financial Impact (per 1000 applications):

Logistic Regression:
Profit from Good Loans: $359,899.33
Cost from Missed Good Loans: $220,218.12
Cost from Bad Loans: $151,006.71
Net Impact: $-11,325.50

Random Forest:
Profit from Good Loans: $743,288.59
Cost from Missed Good Loans: $28,523.49
Cost from Bad Loans: $151,006.71
Net Impact: $563,758.39

XGBoost:
Profit from Good Loans: $735,738.26
Cost from Missed Good Loans: $32,298.66
Cost from Bad Loans: $146,812.08
Net Impact: $556,627.52

Objective 2 Financial Impact (per 1000 applications):

Logistic Regression:
Profit from Good Loans: $557,046.98
Cost from Missed Good Loans: $121,644.30
Cost from Bad Loans: $348,154.36
Net Impact: $87,248.32

Random Forest:
Profit from Good Loans: $561,241.61
Cost from Missed Good Loans: $119,546.98
Cost from Bad Loans: $16,778.52
Net Impact: $424,916.11

XGBoost:
Profit from Good Loans: $570,469.80
Cost from Missed Good Loans: $114,932.89
Cost from Bad Loans: $33,557.05
Net Impact: $421,97

## <span style="color:lightblue">Final Model Selection & Recommendations</span>

In [24]:
# Combine all metrics for comparison
metrics_data = {
    'Model': ['Logistic Regression', 'Random Forest', 'XGBoost'],
    'AUC': [auc_lr, auc_rf, auc_xgb],
    
    # Objective 1 metrics
    'Obj1_Threshold': [lr_obj1_results['threshold'], rf_obj1_results['threshold'], xgb_obj1_results['threshold']],
    'Obj1_Bad_Identified (%)': [lr_obj1_results['recall_bad']*100, rf_obj1_results['recall_bad']*100, xgb_obj1_results['recall_bad']*100],
    'Obj1_Good_Accepted (%)': [lr_obj1_results['recall_good']*100, rf_obj1_results['recall_good']*100, xgb_obj1_results['recall_good']*100],
    'Obj1_Accuracy (%)': [lr_obj1_results['accuracy']*100, rf_obj1_results['accuracy']*100, xgb_obj1_results['accuracy']*100],
    'Obj1_Net_Impact ($)': [data['Net Impact'] for data in financial_data_obj1],
    
    # Objective 2 metrics
    'Obj2_Threshold': [lr_obj2_results['threshold'], rf_obj2_results['threshold'], xgb_obj2_results['threshold']],
    'Obj2_Bad_Identified (%)': [lr_obj2_results['recall_bad']*100, rf_obj2_results['recall_bad']*100, xgb_obj2_results['recall_bad']*100],
    'Obj2_Good_Accepted (%)': [lr_obj2_results['recall_good']*100, rf_obj2_results['recall_good']*100, xgb_obj2_results['recall_good']*100],
    'Obj2_Accuracy (%)': [lr_obj2_results['accuracy']*100, rf_obj2_results['accuracy']*100, xgb_obj2_results['accuracy']*100],
    'Obj2_Net_Impact ($)': [data['Net Impact'] for data in financial_data_obj2],
}

metrics_df = pd.DataFrame(metrics_data)
metrics_df.to_csv('analytics/chapter5/model_comparison_summary.csv', index=False)

# Identify best models based on different criteria
best_auc_model = metrics_df.loc[metrics_df['AUC'].idxmax(), 'Model']
best_obj1_financial_model = metrics_df.loc[metrics_df['Obj1_Net_Impact ($)'].idxmax(), 'Model']
best_obj2_financial_model = metrics_df.loc[metrics_df['Obj2_Net_Impact ($)'].idxmax(), 'Model']
best_obj1_good_accepted_model = metrics_df.loc[metrics_df['Obj1_Good_Accepted (%)'].idxmax(), 'Model']
best_obj2_bad_identified_model = metrics_df.loc[metrics_df['Obj2_Bad_Identified (%)'].idxmax(), 'Model']

print("\nBest model by AUC:", best_auc_model)
print("Best model for Objective 1 by financial impact:", best_obj1_financial_model)
print("Best model for Objective 2 by financial impact:", best_obj2_financial_model)
print("Best model for Objective 1 by good customers accepted:", best_obj1_good_accepted_model)
print("Best model for Objective 2 by bad customers identified:", best_obj2_bad_identified_model)

# Generate recommendations based on the results
print("\nRecommendations:")
print(f"1. For overall predictive performance: {best_auc_model}")
print(f"2. For Objective 1 (Identify 85% of bad customers): {best_obj1_financial_model}")
print(f"3. For Objective 2 (Accept 70% of good customers): {best_obj2_financial_model}")

# Create a text file with a summary of recommendations
with open('analytics/chapter5/recommendations.txt', 'w') as f:
    f.write("Credit Scoring Model Recommendations\n")
    f.write("===================================\n\n")
    
    f.write("Model Performance Summary:\n")
    f.write(f"- Best model by AUC: {best_auc_model}\n")
    f.write(f"- Best model for Objective 1 by financial impact: {best_obj1_financial_model}\n")
    f.write(f"- Best model for Objective 2 by financial impact: {best_obj2_financial_model}\n\n")
    
    f.write("Key Business Recommendations:\n")
    
    # Objective 1 recommendation
    obj1_model = best_obj1_financial_model
    obj1_idx = metrics_df[metrics_df['Model'] == obj1_model].index[0]
    obj1_threshold = metrics_df.loc[obj1_idx, 'Obj1_Threshold']
    obj1_bad_id = metrics_df.loc[obj1_idx, 'Obj1_Bad_Identified (%)']
    obj1_good_accept = metrics_df.loc[obj1_idx, 'Obj1_Good_Accepted (%)']
    obj1_impact = metrics_df.loc[obj1_idx, 'Obj1_Net_Impact ($)']
    
    f.write(f"1. For Risk-Averse Strategy (Objective 1):\n")
    f.write(f"   - Implement {obj1_model} with threshold {obj1_threshold:.3f}\n")
    f.write(f"   - This identifies {obj1_bad_id:.2f}% of bad loans while accepting {obj1_good_accept:.2f}% of good loans\n")
    f.write(f"   - Estimated financial impact: ${obj1_impact:,.2f} per 1000 applications\n\n")
    
    # Objective 2 recommendation
    obj2_model = best_obj2_financial_model
    obj2_idx = metrics_df[metrics_df['Model'] == obj2_model].index[0]
    obj2_threshold = metrics_df.loc[obj2_idx, 'Obj2_Threshold']
    obj2_bad_id = metrics_df.loc[obj2_idx, 'Obj2_Bad_Identified (%)']
    obj2_good_accept = metrics_df.loc[obj2_idx, 'Obj2_Good_Accepted (%)']
    obj2_impact = metrics_df.loc[obj2_idx, 'Obj2_Net_Impact ($)']
    
    f.write(f"2. For Growth-Oriented Strategy (Objective 2):\n")
    f.write(f"   - Implement {obj2_model} with threshold {obj2_threshold:.3f}\n")
    f.write(f"   - This accepts {obj2_good_accept:.2f}% of good loans while identifying {obj2_bad_id:.2f}% of bad loans\n")
    f.write(f"   - Estimated financial impact: ${obj2_impact:,.2f} per 1000 applications\n\n")
    
    # Most important variables
    f.write("Most Important Variables for Credit Scoring:\n")
    for i in range(min(5, len(indices))):
        if i < len(feature_names_encoded):
            f.write(f"   {i+1}. {feature_names_encoded[indices[i]]}\n")
    
    f.write("\nImplementation Considerations:\n")
    f.write("- The selected threshold balances risk and acceptance rates based on business objectives\n")
    f.write("- Monitor model performance over time and recalibrate as needed\n")
    f.write("- Consider implementing a phased rollout to validate model performance in production\n")
    f.write("- Regular model updates recommended as customer behavior patterns change\n")

print("\nAnalysis complete. All results, visualizations, and recommendations have been saved to the 'analytics/' directory.")


Best model by AUC: Random Forest
Best model for Objective 1 by financial impact: Random Forest
Best model for Objective 2 by financial impact: Random Forest
Best model for Objective 1 by good customers accepted: Random Forest
Best model for Objective 2 by bad customers identified: Random Forest

Recommendations:
1. For overall predictive performance: Random Forest
2. For Objective 1 (Identify 85% of bad customers): Random Forest
3. For Objective 2 (Accept 70% of good customers): Random Forest

Analysis complete. All results, visualizations, and recommendations have been saved to the 'analytics/' directory.
