# 🎓 Credit Scoring: Master Class Analysis

**Author**: Senior Data Scientist  
**Date**: January 2026  
**Objective**: Build and compare ML models to predict credit default risk

---

## Table of Contents
1. [Introduction & Data Loading](#section1)
2. [Exploratory Data Analysis](#section2)
3. [Data Preprocessing Pipeline](#section3)
4. [Model Training & Comparison](#section4)
5. [Advanced Feature Analysis](#section5)
6. [Threshold Optimization](#section6)
7. [Conclusions & Recommendations](#section7)

<a id='section1'></a>
## 1. Introduction & Data Loading

### Why This Matters
Credit scoring is critical for financial institutions to assess the risk of lending. A good model can:
- Minimize loan defaults (reduce financial loss)
- Approve creditworthy applicants (maximize revenue)
- Ensure fair lending practices

### Dataset Overview
We'll work with a comprehensive credit risk dataset containing:
- **Demographic features**: age, income, employment length
- **Loan characteristics**: amount, interest rate, intent, grade
- **Credit history**: default history, credit history length
- **Target variable**: `loan_status` (0 = paid, 1 = defaulted)

In [None]:
# Import essential libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# ML libraries
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_curve, auc, confusion_matrix, classification_report
)
from imblearn.over_sampling import SMOTE

# Visualization libraries
try:
    import plotly.express as px
    import plotly.graph_objects as go
    PLOTLY_AVAILABLE = True
except ImportError:
    PLOTLY_AVAILABLE = False
    print("⚠️ Plotly not available, using matplotlib for all visualizations")

# Set visualization style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')

print("✅ All libraries imported successfully!")

In [None]:
# Load the dataset
df = pd.read_csv('credit_risk_dataset.csv')

print("📊 Dataset loaded successfully!")
print(f"Shape: {df.shape}")
print(f"\nFirst few rows:")
df.head()

In [None]:
# ✅ ASSERTION: Verify data was loaded correctly
assert df.shape[0] > 0, "Dataset is empty!"
assert 'loan_status' in df.columns, "Target variable 'loan_status' not found!"

print("✅ Data integrity checks passed!")
print(f"\nDataset Info:")
df.info()

<a id='section2'></a>
## 2. Exploratory Data Analysis (EDA)

### Why EDA Matters
Before building models, we must understand our data:
- **Missing values**: Can bias our model if not handled properly
- **Class imbalance**: May require special techniques (SMOTE)
- **Outliers**: Can skew model performance
- **Feature distributions**: Help us choose appropriate preprocessing

In [None]:
# Check for missing values
print("🔍 Missing Values Analysis:")
missing_data = df.isnull().sum()
missing_pct = (missing_data / len(df)) * 100
missing_df = pd.DataFrame({
    'Missing Count': missing_data,
    'Percentage': missing_pct
})
missing_df = missing_df[missing_df['Missing Count'] > 0].sort_values('Missing Count', ascending=False)

if len(missing_df) > 0:
    print(missing_df)
else:
    print("✅ No missing values found!")

In [None]:
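# Target variable distribution
print("🎯 Target Variable Distribution:")
# sort_index() keeps the counts in class order (0, then 1) so they line up with the plot labels
target_counts = df['loan_status'].value_counts().sort_index()
print(target_counts)
print(f"\nClass Imbalance Ratio (defaults / non-defaults): {target_counts[1] / target_counts[0]:.2f}")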

# Visualize
fig, ax = plt.subplots(1, 2, figsize=(12, 4))

# Count plot
ax[0].bar(['No Default (0)', 'Default (1)'], target_counts.values, color=['#2ecc71', '#e74c3c'])
ax[0].set_ylabel('Count', fontsize=12, fontweight='bold')
ax[0].set_title('Loan Status Distribution', fontsize=14, fontweight='bold')
ax[0].grid(alpha=0.3)

# Pie chart
ax[1].pie(target_counts.values, labels=['No Default', 'Default'], autopct='%1.1f%%',
          colors=['#2ecc71', '#e74c3c'], startangle=90)
ax[1].set_title('Default Rate', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()

print("\n💡 Insight: The dataset shows class imbalance. We'll use SMOTE to balance the training set.")
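
Class imbalance also changes how metrics should be read. As a minimal sketch with made-up labels (an assumed 80/20 split, not the actual dataset), a model that never predicts default still scores high accuracy while catching no defaulters:

```python
# Hypothetical labels: 80 non-defaults and 20 defaults (illustrative only)
y_true = [0] * 80 + [1] * 20
y_pred = [0] * 100  # a trivial model that always predicts "no default"

# Accuracy: share of predictions that match the true label
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall: share of actual defaults that the model flagged
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(y_true)

print(f"Accuracy: {accuracy:.2f}, Recall: {recall:.2f}")
```

This is why the later sections report precision, recall, and F1-score alongside accuracy.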

In [None]:
# Statistical summary of numerical features
print("📈 Statistical Summary of Numerical Features:")
df.describe().T

### 🔍 Correlation Analysis

**Why correlations matter:**
- Help identify which features are most related to default risk
- Reveal multicollinearity (features that are too similar)
- Guide feature selection and engineering

In [None]:
# Correlation heatmap
numerical_cols = df.select_dtypes(include=[np.number]).columns
correlation_matrix = df[numerical_cols].corr()

plt.figure(figsize=(12, 8))
sns.heatmap(correlation_matrix, annot=True, fmt='.2f', cmap='coolwarm', 
            center=0, square=True, linewidths=1, cbar_kws={"shrink": 0.8})
plt.title('Feature Correlation Heatmap', fontsize=16, fontweight='bold', pad=20)
plt.tight_layout()
plt.show()

print("\n💡 Key Insights:")
print("- Look for features with high correlation to 'loan_status' (our target)")
print("- Features with very high correlation to each other may cause multicollinearity")

<a id='section3'></a>
## 3. Data Preprocessing Pipeline

### The Importance of Proper Preprocessing

**Why each step matters:**

1. **Missing Value Imputation**: Prevents data loss and model errors
2. **Outlier Detection (IQR)**: Removes extreme values that can skew predictions
3. **Feature Encoding**: Converts categorical variables to numerical format
4. **Normalization**: Puts numerical features on a common scale (helps Logistic Regression converge and matters for SMOTE, which relies on nearest-neighbor distances)
5. **Class Balancing (SMOTE)**: Prevents model bias toward the majority class
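
One caveat on ordering: statistics used for imputation or scaling are usually learned from the training split only, so that no information from the test rows leaks into preprocessing. The cells in this section impute before splitting for simplicity. A minimal sketch of the train-only variant, using a hypothetical income column with toy values (not the notebook's data):

```python
import numpy as np

# Hypothetical toy column with missing values, already split into train and test
train_income = np.array([30_000.0, np.nan, 50_000.0, 70_000.0])
test_income = np.array([np.nan, 90_000.0])

# Learn the imputation statistic from the training split only
train_median = np.nanmedian(train_income)

# Apply the same training-derived value to both splits
train_filled = np.where(np.isnan(train_income), train_median, train_income)
test_filled = np.where(np.isnan(test_income), train_median, test_income)

print(train_median, test_filled)
```

The same fit-on-train, apply-to-both pattern is what `preprocessor.fit_transform(X_train)` followed by `preprocessor.transform(X_test)` does in Step 5.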

In [None]:
# Step 1: Handle missing values
print("🔧 Step 1: Handling Missing Values")
print("Strategy: Median imputation for numerical, mode for categorical\n")

df_clean = df.copy()

# Identify column types
numerical_features = df_clean.select_dtypes(include=[np.number]).columns.tolist()
categorical_features = df_clean.select_dtypes(include=['object']).columns.tolist()

# Remove target from numerical features
if 'loan_status' in numerical_features:
    numerical_features.remove('loan_status')

# Impute numerical columns
if numerical_features:
    num_imputer = SimpleImputer(strategy='median')
    df_clean[numerical_features] = num_imputer.fit_transform(df_clean[numerical_features])

# Impute categorical columns
if categorical_features:
    cat_imputer = SimpleImputer(strategy='most_frequent')
    df_clean[categorical_features] = cat_imputer.fit_transform(df_clean[categorical_features])

print(f"✅ Missing values handled: {df_clean.isnull().sum().sum()} remaining")

In [None]:
# Step 2: Outlier Detection using IQR (Interquartile Range)
print("🔧 Step 2: Outlier Detection & Removal (IQR Method)")
print("Why IQR? Quartiles are robust to extreme values, so the bounds are not distorted by the outliers themselves\n")

df_no_outliers = df_clean.copy()
outliers_removed = 0

for col in numerical_features:
    Q1 = df_no_outliers[col].quantile(0.25)
    Q3 = df_no_outliers[col].quantile(0.75)
    IQR = Q3 - Q1
    
    # Define outlier bounds
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    
    # Count outliers before removal
    outliers_count = ((df_no_outliers[col] < lower_bound) | (df_no_outliers[col] > upper_bound)).sum()
    outliers_removed += outliers_count
    
    # Remove outliers
    df_no_outliers = df_no_outliers[
        (df_no_outliers[col] >= lower_bound) & (df_no_outliers[col] <= upper_bound)
    ]

print(f"📉 Original dataset size: {len(df_clean)}")
print(f"📉 After outlier removal: {len(df_no_outliers)}")
print(f"📊 Total outliers removed: {len(df_clean) - len(df_no_outliers)} rows")
print("✅ Outlier removal complete!")
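
To make the IQR rule concrete, here is a small worked example on a made-up list of values (an assumption for illustration, not a column from the dataset). The bounds are computed the same way as in the loop above:

```python
import numpy as np

# Hypothetical values with one obvious extreme point
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 100], dtype=float)

q1 = np.percentile(values, 25)   # first quartile
q3 = np.percentile(values, 75)   # third quartile
iqr = q3 - q1

# Points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are treated as outliers
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
kept = values[(values >= lower) & (values <= upper)]

print(f"Q1={q1}, Q3={q3}, bounds=({lower}, {upper})")
print(f"Kept: {kept.tolist()}")
```

Here Q1 is 3 and Q3 is 7, so the bounds are -3 and 13, and only the value 100 is dropped.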

In [None]:
# Step 3: Split features and target
print("🔧 Step 3: Separating Features and Target")

X = df_no_outliers.drop('loan_status', axis=1)
y = df_no_outliers['loan_status']

print(f"✅ Features shape: {X.shape}")
print(f"✅ Target shape: {y.shape}")

# ✅ ASSERTION: Verify shapes match
assert X.shape[0] == y.shape[0], "Feature and target sizes don't match!"
print("✅ Shape verification passed!")

In [None]:
# Step 4: Train-Test Split
print("🔧 Step 4: Train-Test Split (80-20 with stratification)")
print("Why stratify? Ensures both sets have similar class distributions\n")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"📊 Training set: {X_train.shape[0]} samples")
print(f"📊 Test set: {X_test.shape[0]} samples")

# ✅ ASSERTION: Verify split sizes
assert X_train.shape[0] == y_train.shape[0], "Train features and labels don't match!"
assert X_test.shape[0] == y_test.shape[0], "Test features and labels don't match!"
assert X_train.shape[0] + X_test.shape[0] == X.shape[0], "Data lost during split!"
print("✅ Split verification passed!")

In [None]:
# Step 5: Feature Engineering - Encoding & Scaling
print("🔧 Step 5: Feature Encoding & Normalization")
print("Why normalize? Models like Logistic Regression are sensitive to feature scales\n")

# Update feature lists after outlier removal
numerical_features = X.select_dtypes(include=[np.number]).columns.tolist()
categorical_features = X.select_dtypes(include=['object']).columns.tolist()

print(f"Numerical features ({len(numerical_features)}): {numerical_features}")
print(f"Categorical features ({len(categorical_features)}): {categorical_features}\n")

# Create preprocessing pipeline
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_features),
        ('cat', OneHotEncoder(drop='first', sparse_output=False, handle_unknown='ignore'), 
         categorical_features)
    ])

# Fit and transform
X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test)

print(f"✅ Processed training features shape: {X_train_processed.shape}")
print(f"✅ Processed test features shape: {X_test_processed.shape}")

In [None]:
# Step 6: Handle Class Imbalance with SMOTE
print("🔧 Step 6: Balancing Classes with SMOTE")
print("Why SMOTE? Creates synthetic samples of the minority class instead of just duplicating\n")

print("Before SMOTE:")
print(pd.Series(y_train).value_counts())

smote = SMOTE(random_state=42)
X_train_balanced, y_train_balanced = smote.fit_resample(X_train_processed, y_train)

print("\nAfter SMOTE:")
print(pd.Series(y_train_balanced).value_counts())
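print("\n✅ Classes perfectly balanced for training!")

# ✅ ASSERTION: Verify balancing worked
assert X_train_balanced.shape[0] == y_train_balanced.shape[0], "Balanced data shape mismatch!"
# Every class should now have the same number of samples
assert pd.Series(y_train_balanced).value_counts().nunique() == 1, "Classes are not balanced!"
print("✅ Balance verification passed!")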
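
For intuition about what "synthetic samples" means, the sketch below interpolates between a minority point and its nearest minority neighbor. It is a simplified illustration with made-up 2-D points and a hypothetical helper `synthesize`; the real `SMOTE` class picks randomly among several nearest neighbors rather than always using the closest one.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical minority-class points in a 2-D feature space
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.5]])

def synthesize(points, i, rng):
    """Create one synthetic point between points[i] and its nearest neighbor."""
    distances = np.linalg.norm(points - points[i], axis=1)
    distances[i] = np.inf                      # exclude the point itself
    neighbor = points[np.argmin(distances)]    # closest other minority point
    lam = rng.random()                         # interpolation weight in [0, 1)
    return points[i] + lam * (neighbor - points[i])

new_point = synthesize(minority, 0, rng)
print(new_point)
```

The new point always lies on the line segment between two existing minority samples, which is why SMOTE is sensitive to feature scaling and is applied to the training set only.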

<a id='section4'></a>
## 4. Model Training & Comparison

### Why These Two Models?

**Naive Bayes**:
- Fast and simple
- Works well with smaller datasets
- Assumes feature independence (which may not always be true)

**Logistic Regression**:
- Interpretable coefficients (feature importance)
- No independence assumption
- Industry standard for binary classification

In [None]:
# Train Naive Bayes
print("🤖 Training Naive Bayes Model...")
nb_model = GaussianNB()
nb_model.fit(X_train_balanced, y_train_balanced)
print("✅ Naive Bayes trained!\n")

# Train Logistic Regression
print("🤖 Training Logistic Regression Model...")
lr_model = LogisticRegression(max_iter=1000, random_state=42)
lr_model.fit(X_train_balanced, y_train_balanced)
print("✅ Logistic Regression trained!")

In [None]:
# Make predictions
nb_pred = nb_model.predict(X_test_processed)
lr_pred = lr_model.predict(X_test_processed)

# Calculate metrics
nb_metrics = {
    'Model': 'Naive Bayes',
    'Accuracy': accuracy_score(y_test, nb_pred),
    'Precision': precision_score(y_test, nb_pred, zero_division=0),
    'Recall': recall_score(y_test, nb_pred, zero_division=0),
    'F1-Score': f1_score(y_test, nb_pred, zero_division=0)
}

lr_metrics = {
    'Model': 'Logistic Regression',
    'Accuracy': accuracy_score(y_test, lr_pred),
    'Precision': precision_score(y_test, lr_pred, zero_division=0),
    'Recall': recall_score(y_test, lr_pred, zero_division=0),
    'F1-Score': f1_score(y_test, lr_pred, zero_division=0)
}

# Create comparison DataFrame
comparison_df = pd.DataFrame([nb_metrics, lr_metrics])
comparison_df = comparison_df.set_index('Model')

print("\n📊 MODEL COMPARISON RESULTS")
print("=" * 70)
print(comparison_df.round(4))
print("=" * 70)

# Determine winner
winner = comparison_df['F1-Score'].idxmax()
print(f"\n🏆 Winner (by F1-Score): {winner}")

In [None]:
# Visualize comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Bar chart comparison
comparison_df.T.plot(kind='bar', ax=axes[0], color=['#3498db', '#e74c3c'])
axes[0].set_title('Model Performance Comparison', fontsize=14, fontweight='bold')
axes[0].set_ylabel('Score', fontsize=12)
axes[0].set_xlabel('Metric', fontsize=12)
axes[0].legend(title='Model', loc='lower right')
axes[0].grid(alpha=0.3)
axes[0].set_ylim([0, 1])

# Radar chart
categories = list(comparison_df.columns)
nb_values = comparison_df.loc['Naive Bayes'].values.tolist()
lr_values = comparison_df.loc['Logistic Regression'].values.tolist()

angles = np.linspace(0, 2 * np.pi, len(categories), endpoint=False).tolist()
nb_values += nb_values[:1]
lr_values += lr_values[:1]
angles += angles[:1]

# Swap the second Cartesian axes for a polar one so the two don't overlap
axes[1].remove()
ax = fig.add_subplot(1, 2, 2, projection='polar')
ax.plot(angles, nb_values, 'o-', linewidth=2, label='Naive Bayes', color='#3498db')
ax.fill(angles, nb_values, alpha=0.25, color='#3498db')
ax.plot(angles, lr_values, 'o-', linewidth=2, label='Logistic Regression', color='#e74c3c')
ax.fill(angles, lr_values, alpha=0.25, color='#e74c3c')
ax.set_xticks(angles[:-1])
ax.set_xticklabels(categories)
ax.set_ylim(0, 1)
ax.set_title('Performance Radar Chart', fontsize=14, fontweight='bold', pad=20)
ax.legend(loc='upper right')
ax.grid(True)

plt.tight_layout()
plt.show()

<a id='section5'></a>
## 5. Advanced Feature Analysis

### Feature Importance: Which Factors Drive Default Risk?

Understanding feature importance helps us:
- **Interpret model decisions** (regulatory compliance)
- **Focus on critical risk factors** (business strategy)
- **Simplify models** (remove irrelevant features)

In [None]:
# Extract feature names after preprocessing
feature_names = numerical_features.copy()

# Add one-hot encoded categorical feature names
if categorical_features:
    encoder = preprocessor.named_transformers_['cat']
    cat_feature_names = encoder.get_feature_names_out(categorical_features)
    feature_names.extend(cat_feature_names)

# Get Logistic Regression coefficients
coefficients = lr_model.coef_[0]

# Create feature importance DataFrame
feature_importance = pd.DataFrame({
    'Feature': feature_names,
    'Coefficient': coefficients,
    'Abs_Coefficient': np.abs(coefficients)
}).sort_values('Abs_Coefficient', ascending=False).head(15)

print("🔍 Top 15 Most Important Features (Logistic Regression):")
print(feature_importance[['Feature', 'Coefficient']].to_string(index=False))

In [None]:
# Visualize feature importance
plt.figure(figsize=(12, 8))
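# Red for positive coefficients (raise default risk), green for negative (lower it), matching the title
colors = ['#e74c3c' if x > 0 else '#2ecc71' for x in feature_importance['Coefficient']]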
plt.barh(feature_importance['Feature'], feature_importance['Coefficient'], color=colors)
plt.xlabel('Coefficient Value', fontsize=12, fontweight='bold')
plt.ylabel('Feature', fontsize=12, fontweight='bold')
plt.title('Feature Importance: Logistic Regression Coefficients\n(Red = Increases Default Risk, Green = Decreases Default Risk)',
          fontsize=14, fontweight='bold', pad=20)
plt.axvline(x=0, color='black', linestyle='--', linewidth=1)
plt.grid(alpha=0.3, axis='x')
plt.tight_layout()
plt.show()

print("\n💡 Interpretation:")
print("- Positive coefficients → Higher values INCREASE default probability")
print("- Negative coefficients → Higher values DECREASE default probability")
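
A coefficient is easier to read once converted to an odds ratio. Because the numerical features were standardized, exponentiating a coefficient gives the factor by which the odds of default change for a one-standard-deviation increase in that feature, holding the others fixed. The values below are hypothetical, chosen only to illustrate the arithmetic:

```python
import math

# Hypothetical coefficients for two standardized features (illustrative values only)
positive_coef = 0.7
negative_coef = -0.7

# exp(coefficient) = multiplicative change in the odds of default per +1 standard deviation
odds_ratio_up = math.exp(positive_coef)
odds_ratio_down = math.exp(negative_coef)

print(f"Coefficient {positive_coef:+.1f} -> odds multiplied by {odds_ratio_up:.2f}")
print(f"Coefficient {negative_coef:+.1f} -> odds multiplied by {odds_ratio_down:.2f}")
```

For one-hot encoded features, which are not scaled, the odds ratio instead compares that category against the dropped reference category.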

### 📈 Interactive ROC Curve (Plotly)

In [None]:
# Calculate ROC curves
nb_fpr, nb_tpr, _ = roc_curve(y_test, nb_model.predict_proba(X_test_processed)[:, 1])
lr_fpr, lr_tpr, _ = roc_curve(y_test, lr_model.predict_proba(X_test_processed)[:, 1])

nb_auc = auc(nb_fpr, nb_tpr)
lr_auc = auc(lr_fpr, lr_tpr)

if PLOTLY_AVAILABLE:
    # Interactive Plotly ROC curve
    fig = go.Figure()
    
    fig.add_trace(go.Scatter(
        x=nb_fpr, y=nb_tpr,
        mode='lines',
        name=f'Naive Bayes (AUC = {nb_auc:.4f})',
        line=dict(color='#3498db', width=3)
    ))
    
    fig.add_trace(go.Scatter(
        x=lr_fpr, y=lr_tpr,
        mode='lines',
        name=f'Logistic Regression (AUC = {lr_auc:.4f})',
        line=dict(color='#e74c3c', width=3)
    ))
    
    fig.add_trace(go.Scatter(
        x=[0, 1], y=[0, 1],
        mode='lines',
        name='Random Classifier',
        line=dict(color='black', width=2, dash='dash')
    ))
    
    fig.update_layout(
        title='Interactive ROC Curve Comparison',
        xaxis_title='False Positive Rate',
        yaxis_title='True Positive Rate',
        width=800,
        height=600,
        hovermode='closest'
    )
    
    fig.show()
else:
    # Fallback to matplotlib
    plt.figure(figsize=(10, 8))
    plt.plot(nb_fpr, nb_tpr, label=f'Naive Bayes (AUC = {nb_auc:.4f})', color='#3498db', lw=3)
    plt.plot(lr_fpr, lr_tpr, label=f'Logistic Regression (AUC = {lr_auc:.4f})', color='#e74c3c', lw=3)
    plt.plot([0, 1], [0, 1], 'k--', lw=2, label='Random Classifier')
    plt.xlabel('False Positive Rate', fontsize=12, fontweight='bold')
    plt.ylabel('True Positive Rate', fontsize=12, fontweight='bold')
    plt.title('ROC Curve Comparison', fontsize=14, fontweight='bold')
    plt.legend(loc='lower right', fontsize=10)
    plt.grid(alpha=0.3)
    plt.tight_layout()
    plt.show()
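
One way to read the AUC values in the legend: AUC equals the probability that a randomly chosen defaulter receives a higher score than a randomly chosen non-defaulter. The sketch below checks this on four made-up scores using a hypothetical helper `pairwise_auc`; it is meant for intuition and is far slower than `roc_curve` plus `auc` on real data.

```python
from itertools import product

def pairwise_auc(y_true, scores):
    """Fraction of (positive, negative) pairs where the positive is ranked higher."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    # A tie between a positive and a negative counts as half a win
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p, n in product(positives, negatives)
    )
    return wins / (len(positives) * len(negatives))

# Hypothetical labels and predicted default probabilities
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc_value = pairwise_auc(y_true, scores)
print(auc_value)
```

Three of the four positive-negative pairs are ordered correctly here, giving an AUC of 0.75.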

<a id='section6'></a>
## 6. Threshold Optimization

### Why Not Always Use 0.5 as the Threshold?

The default threshold of 0.5 may not be optimal for your business case:

- **Lower threshold (e.g., 0.3)**: More defaults caught (higher recall), but more false alarms
- **Higher threshold (e.g., 0.7)**: Fewer false alarms, but might miss actual defaults

**Business Impact**:
- False Positive (predict default, but pays) → Lost revenue opportunity
- False Negative (predict no default, but defaults) → Financial loss

Let's analyze how different thresholds affect performance!

In [None]:
# Get prediction probabilities for Logistic Regression
lr_proba = lr_model.predict_proba(X_test_processed)[:, 1]

# Test different thresholds
thresholds = np.arange(0.1, 1.0, 0.05)
threshold_results = []

for threshold in thresholds:
    y_pred_threshold = (lr_proba >= threshold).astype(int)
    
    threshold_results.append({
        'Threshold': threshold,
        'Accuracy': accuracy_score(y_test, y_pred_threshold),
        'Precision': precision_score(y_test, y_pred_threshold, zero_division=0),
        'Recall': recall_score(y_test, y_pred_threshold, zero_division=0),
        'F1-Score': f1_score(y_test, y_pred_threshold, zero_division=0)
    })

threshold_df = pd.DataFrame(threshold_results)

print("🎯 Threshold Analysis Results:")
print(threshold_df.head(10).to_string(index=False))

In [None]:
# Visualize threshold impact
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

metrics = ['Accuracy', 'Precision', 'Recall', 'F1-Score']
colors_map = ['#3498db', '#2ecc71', '#e74c3c', '#f39c12']

for idx, (metric, color) in enumerate(zip(metrics, colors_map)):
    row, col = idx // 2, idx % 2
    axes[row, col].plot(threshold_df['Threshold'], threshold_df[metric], 
                        marker='o', linewidth=2, color=color, markersize=4)
    axes[row, col].axvline(x=0.5, color='red', linestyle='--', linewidth=2, label='Default (0.5)')
    axes[row, col].set_xlabel('Threshold', fontsize=12, fontweight='bold')
    axes[row, col].set_ylabel(metric, fontsize=12, fontweight='bold')
    axes[row, col].set_title(f'{metric} vs Threshold', fontsize=13, fontweight='bold')
    axes[row, col].grid(alpha=0.3)
    axes[row, col].legend()
    
    # Find optimal threshold for this metric
    optimal_idx = threshold_df[metric].idxmax()
    optimal_threshold = threshold_df.loc[optimal_idx, 'Threshold']
    optimal_value = threshold_df.loc[optimal_idx, metric]
    axes[row, col].axvline(x=optimal_threshold, color='green', linestyle=':', linewidth=2, 
                           label=f'Optimal ({optimal_threshold:.2f})')
    axes[row, col].legend()

plt.tight_layout()
plt.show()

print("\n💡 Key Insights:")
print(f"- Optimal threshold for F1-Score: {threshold_df.loc[threshold_df['F1-Score'].idxmax(), 'Threshold']:.2f}")
print(f"- Optimal threshold for Recall: {threshold_df.loc[threshold_df['Recall'].idxmax(), 'Threshold']:.2f}")
print(f"- Optimal threshold for Precision: {threshold_df.loc[threshold_df['Precision'].idxmax(), 'Threshold']:.2f}")
print("\n⚠️ Choose threshold based on business priorities!")
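
One way to turn "business priorities" into a number is to attach a cost to each error type and pick the threshold with the lowest total cost. The sketch below uses made-up labels, probabilities, and costs (a missed default is assumed to cost five times a wrongly rejected applicant); all of these are assumptions to replace with real figures.

```python
import numpy as np

# Hypothetical true labels and predicted default probabilities
y_true = np.array([0, 0, 0, 0, 1, 1, 0, 1])
proba = np.array([0.05, 0.20, 0.35, 0.45, 0.40, 0.60, 0.70, 0.90])

COST_FP = 1.0  # assumed cost of rejecting an applicant who would have paid
COST_FN = 5.0  # assumed cost of approving an applicant who defaults

def total_cost(threshold):
    """Total misclassification cost when flagging proba >= threshold as default."""
    pred = (proba >= threshold).astype(int)
    false_pos = np.sum((pred == 1) & (y_true == 0))
    false_neg = np.sum((pred == 0) & (y_true == 1))
    return COST_FP * false_pos + COST_FN * false_neg

candidates = [0.3, 0.5, 0.7]
costs = {t: total_cost(t) for t in candidates}
best = min(costs, key=costs.get)

print(costs)
print(f"Lowest-cost threshold: {best}")
```

With these assumed costs the lowest threshold wins, because missed defaults are penalized heavily. In practice the threshold is usually chosen on a validation split rather than the test set, so that the final test metrics stay unbiased.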

<a id='section7'></a>
## 7. Conclusions & Recommendations

### 📊 Summary of Findings

**Model Performance**:
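- Compare the two models using the metrics table in Section 4; the recommendations below assume Logistic Regression scores higher than Naive Bayes on F1-Score
- F1-Score is the selection criterion because it balances precision and recall
- ROC curves above the diagonal (AUC > 0.5) indicate performance better than random chance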

**Key Risk Factors** (based on feature importance):
- Review the top features from the coefficient analysis
- These should inform lending policies and risk assessment

**Threshold Selection**:
- Default 0.5 may not be optimal
- Consider business costs when choosing threshold
- Higher threshold → fewer applicants flagged as likely defaulters, so more loans approved but riskier
- Lower threshold → more applicants flagged as likely defaulters, so fewer loans approved but safer

### 🎯 Recommendations

1. **Deploy Logistic Regression** as the primary model
2. **Monitor feature importance** regularly to detect changing patterns
3. **Adjust threshold** based on risk appetite and market conditions
4. **Implement A/B testing** to validate model performance in production
5. **Regular retraining** as new data becomes available

### 🚀 Next Steps

- Cross-validation for more robust evaluation
- Try ensemble methods (Random Forest, XGBoost)
- Feature engineering based on domain expertise
- Cost-sensitive learning (assign different costs to errors)
- Explainability tools (SHAP values) for regulatory compliance

---

**Thank you for following this Master Class analysis! 🎓**