# Credit Card Default Prediction

## Introduction
This notebook explores credit card default prediction using machine learning. We'll analyze a dataset containing credit card client data to predict whether a client will default on their payment.

### Dataset Overview
- 30,000 credit card clients
- Features include credit limit, gender, education, marital status, age, and payment history
- Target variable: default payment next month (1 = default, 0 = no default)

## 1. Data Loading and Initial Exploration
First, we'll load the data and take a look at its structure.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
import shap
import numpy as np


In [32]:
df = pd.read_csv("./data/defaultofcreditcardclients.csv")

In [None]:
print("Number of rows:", df.shape[0])
print("Number of columns:", df.shape[1])

In [None]:
print("Data types")
print(df.dtypes)

In [None]:
print("Missing Values")
missing = df.isnull().sum()  # count missing values once, then show only affected columns
print(missing[missing > 0])

In [None]:
df.describe()

## 2. Exploratory Data Analysis 

### Label distribution

In [None]:
df['default payment next month'].hist()
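
The histogram shows the class balance visually; the exact proportions can also be read off numerically. The sketch below uses a small hypothetical series (`labels`, with made-up counts) as a stand-in for `df['default payment next month']`, so the numbers are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for df['default payment next month'] (counts are made up)
labels = pd.Series([0] * 78 + [1] * 22, name="default payment next month")

# Share of each class; for a 0/1 label, the mean equals the share of 1s
proportions = labels.value_counts(normalize=True)
print(proportions)
print("Share of defaults:", labels.mean())
```

On the real data, the same two calls on the target column would give the actual class shares.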

### Correlation matrix

In [None]:
# Restrict to numeric columns so the matrix and the tick labels always match
numeric_df = df.select_dtypes(['number'])

f = plt.figure(figsize=(19, 15))
plt.matshow(numeric_df.corr(), fignum=f.number)
plt.xticks(range(numeric_df.shape[1]), numeric_df.columns, fontsize=14, rotation=45)
plt.yticks(range(numeric_df.shape[1]), numeric_df.columns, fontsize=14)
cb = plt.colorbar()
cb.ax.tick_params(labelsize=14)
plt.title('Correlation Matrix', fontsize=16)
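
A large correlation matrix can be hard to read by eye, so it may help to list the most strongly correlated feature pairs explicitly. The following is a minimal sketch: the helper `top_correlated_pairs`, the 0.8 threshold, and the toy columns `feat_a`, `feat_b`, `feat_c` are all hypothetical and not part of the dataset.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(frame, threshold=0.8):
    """Return feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = frame.select_dtypes("number").corr().abs()
    # Indices of the upper triangle (k=1 skips the diagonal), so each pair appears once
    rows, cols = np.triu_indices(len(corr), k=1)
    pairs = pd.Series(
        corr.to_numpy()[rows, cols],
        index=pd.MultiIndex.from_arrays([corr.index[rows], corr.columns[cols]]),
    )
    return pairs[pairs > threshold].sort_values(ascending=False)

# Hypothetical toy data: feat_b is a linear function of feat_a, feat_c is unrelated
a = np.arange(10.0)
toy = pd.DataFrame({
    "feat_a": a,
    "feat_b": 2 * a + 1,
    "feat_c": [1, -1] * 5,
})
pairs = top_correlated_pairs(toy)
print(pairs)
```

Calling `top_correlated_pairs(df.drop(columns=['ID']))` on the real data would be one way to shortlist candidate features for grouping.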

## 3. Model Pipeline Development

We'll test three different models:
1. Support Vector Machine (SVM)
2. Random Forest
3. Logistic Regression

Each model will be tested using a standardized pipeline including:
- Feature scaling
- Cross-validation
- Hyperparameter tuning

In [39]:
y = df['default payment next month']
X = df.drop(['ID', 'default payment next month'], axis=1)

In [40]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

In [41]:
svm_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', SVC())
])

rf_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('rf', RandomForestClassifier())
])

lr_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('lr', LogisticRegression())
])

In [42]:
param_grid = {
    # SVM parameters
    'svm__C': [0.1, 1, 10],
    'svm__kernel': ['linear', 'poly', 'rbf'],
    
    # Random Forest parameters
    'rf__n_estimators': [100, 200, 300],
    'rf__max_depth': [10, 20, None],
    'rf__min_samples_split': [2, 5, 10],
    'rf__min_samples_leaf': [1, 2, 4],
    
    # Logistic Regression parameters
    'lr__C': [0.001, 0.01, 0.1, 1, 10],
    'lr__penalty': ['l1', 'l2'],
    'lr__solver': ['liblinear', 'saga'],
    'lr__max_iter': [1000]
}

In [43]:
models = {
    'Support Vector Machine': svm_pipeline,
    'Random Forest': rf_pipeline,
    'Logistic Regression': lr_pipeline
}

In [None]:
best_models = {}
all_results = {}


for name, model in models.items():
    print(f"\nPerforming grid search for {name}...")
    
    if name == 'Support Vector Machine':
        current_params = {key: value for key, value in param_grid.items() if key.startswith('svm')}
    elif name == 'Random Forest':
        current_params = {key: value for key, value in param_grid.items() if key.startswith('rf')}
    else:  # Logistic Regression
        current_params = {key: value for key, value in param_grid.items() if key.startswith('lr')}
    
    grid_search = GridSearchCV(
        estimator=model,
        param_grid=current_params,
        cv=5,
        n_jobs=-1,
        verbose=1,
        scoring=('accuracy', 'f1', 'f1_micro', 'f1_macro'),
        refit='f1'
    )
    
    grid_search.fit(X_train, y_train)
    
    best_models[name] = grid_search.best_estimator_
    y_pred = grid_search.predict(X_test)
    

    all_scores = {
        'accuracy': grid_search.cv_results_['mean_test_accuracy'][grid_search.best_index_],
        'f1': grid_search.cv_results_['mean_test_f1'][grid_search.best_index_],
        'f1_micro': grid_search.cv_results_['mean_test_f1_micro'][grid_search.best_index_],
        'f1_macro': grid_search.cv_results_['mean_test_f1_macro'][grid_search.best_index_]
    }

    all_results[name] = {
        'best_params': grid_search.best_params_,
        'best_score': grid_search.best_score_,
        'all_scores': all_scores,
        'test_predictions': y_pred,
        'classification_report': classification_report(y_test, y_pred),
        'confusion_matrix': confusion_matrix(y_test, y_pred)
    }


## 4. Model Evaluation

In [None]:
# all_results is the dictionary of per-model results built during the grid search
metrics = ['accuracy', 'f1', 'f1_micro', 'f1_macro']

# Extract scores for each model
scores = {}
for model_name, model_data in all_results.items():
    scores[model_name] = [
        model_data['all_scores']['accuracy'],
        model_data['all_scores']['f1'],
        model_data['all_scores']['f1_micro'],
        model_data['all_scores']['f1_macro']
    ]

# Create figure and subplots
num_models = len(all_results)
fig = plt.figure(figsize=(20, 8))
gs = fig.add_gridspec(2, num_models + 1)

# Metrics comparison plot
ax1 = fig.add_subplot(gs[0, :])

# Bar plot setup
x = np.arange(len(metrics))
width = 0.8 / num_models  # Adjust bar width based on number of models
colors = plt.cm.Set3(np.linspace(0, 1, num_models))

# Plot bars for each model
for i, (model_name, model_scores) in enumerate(scores.items()):
    positions = x + width * (i - num_models/2 + 0.5)
    bars = ax1.bar(positions, model_scores, width, label=model_name, color=colors[i])
    
    # Add value labels on bars
    for bar in bars:
        height = bar.get_height()
        ax1.annotate(f'{height:.2f}',
                    xy=(bar.get_x() + bar.get_width() / 2, height),
                    xytext=(0, 3),
                    textcoords="offset points",
                    ha='center', va='bottom',
                    rotation=90)

ax1.set_ylabel('Score')
ax1.set_title('Model Performance Metrics Comparison')
ax1.set_xticks(x)
ax1.set_xticklabels(metrics, rotation=45)
ax1.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
ax1.grid(True, alpha=0.3)

# Confusion matrix plots
for i, (model_name, model_data) in enumerate(all_results.items()):
    ax = fig.add_subplot(gs[1, i])
    # Use a local name so the imported confusion_matrix function is not overwritten
    cm = model_data['confusion_matrix']
    
    im = ax.imshow(cm, interpolation='nearest', cmap=plt.cm.Blues)
    ax.set_title(f'Confusion Matrix\n{model_name}')
    
    # Add labels to each cell (row/col avoid reusing the outer loop variable i)
    thresh = cm.max() / 2.
    for row in range(cm.shape[0]):
        for col in range(cm.shape[1]):
            ax.text(col, row, format(cm[row, col], 'd'),
                    ha="center", va="center",
                    color="white" if cm[row, col] > thresh else "black")
    
    ax.set_ylabel('True Label')
    ax.set_xlabel('Predicted Label')
    ax.set_xticks([0, 1])
    ax.set_yticks([0, 1])

# Add colorbar
cbar_ax = fig.add_subplot(gs[1, -1])
plt.colorbar(im, cax=cbar_ax)

plt.tight_layout()
plt.show()

All three models reach good accuracy, but their F1 scores on the default class are noticeably lower, with Random Forest performing best. The micro F1 scores mirror the accuracy, as expected for single-label classification, and the macro F1 scores range between 0.63 and 0.69.
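
The match between micro F1 and accuracy is not a coincidence. When each sample has exactly one label, every misclassification counts once as a false positive (for the predicted class) and once as a false negative (for the true class), so micro-averaged F1 reduces to accuracy. The sketch below checks this on a small made-up label vector; `micro_f1` and `accuracy` are hypothetical helpers written in plain Python for illustration, not scikit-learn functions.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP, FP and FN over all classes, then compute F1."""
    tp = fp = fn = 0
    for c in set(y_true) | set(y_pred):
        for t, p in zip(y_true, y_pred):
            if p == c and t == c:
                tp += 1
            elif p == c and t != c:
                fp += 1
            elif p != c and t == c:
                fn += 1
    return 2 * tp / (2 * tp + fp + fn)

def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up labels: 10 samples, 3 of them misclassified
y_true = [0, 0, 0, 0, 1, 1, 0, 1, 0, 0]
y_pred = [0, 0, 1, 0, 1, 0, 0, 0, 0, 0]

print("accuracy:", accuracy(y_true, y_pred))
print("micro F1:", micro_f1(y_true, y_pred))
```

For this reason, `f1_micro` adds no information beyond `accuracy` here; the binary `f1` and `f1_macro` scores are the ones that reflect performance on the minority class.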

All three models detect true negatives well and reach acceptable true positive rates, though they all exhibit high false positive rates. This can be attributed to class imbalance: during exploration the label mean was 0.22, so about 78% of clients are labeled 0 (no default). All models struggle to predict the positive class (default), so it may be necessary to increase the representation of default cases in the training data.
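
One way to give the minority class more influence without changing the data is class weighting. This is a sketch of an option, not something the notebook currently does: `y_toy` is a hypothetical label vector chosen to mimic a 0.22 label mean, and `weighted_rf` is an illustrative, unfitted estimator.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels with the same 78/22 split as a 0.22 label mean
y_toy = np.array([0] * 78 + [1] * 22)

# "balanced" weights follow n_samples / (n_classes * count_of_class)
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_toy
)
print(dict(zip([0, 1], weights)))

# The same heuristic can be requested directly on the estimator
weighted_rf = RandomForestClassifier(class_weight="balanced", random_state=42)
```

With these weights, an error on a default case costs roughly 3.5 times as much as an error on a non-default case during training. Whether this actually improves the F1 score on this dataset would need to be checked with the same grid search.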

The Random Forest model achieves the best balance among the three tested models.

## 5. Model Interpretability 
Let's analyze which features have the strongest influence on the predictions of the best model, Random Forest. We first look at the model's built-in feature importances, then at SHAP (SHapley Additive exPlanations) values.
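
SHAP values build on Shapley values from cooperative game theory: a feature's attribution is its marginal contribution to the output, averaged over all orders in which features could be added. The sketch below computes exact Shapley values for a tiny made-up scoring function. The function `toy_value` and the feature names `late_payment`, `high_limit` and `age` are hypothetical, and this brute-force enumeration is only meant to illustrate the idea, not how the `shap` library computes values for tree models.

```python
from itertools import permutations

def exact_shapley(value_fn, features):
    """Average each feature's marginal contribution over all feature orderings."""
    totals = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for order in orderings:
        present = set()
        for f in order:
            before = value_fn(frozenset(present))
            present.add(f)
            after = value_fn(frozenset(present))
            totals[f] += after - before
    return {f: total / len(orderings) for f, total in totals.items()}

def toy_value(present):
    """Hypothetical model output given which features are 'known'."""
    score = 0.0
    if "late_payment" in present:
        score += 0.3
    if "high_limit" in present:
        score -= 0.1
    if "late_payment" in present and "high_limit" in present:
        score += 0.06  # interaction term, split evenly between the two features
    return score

phi = exact_shapley(toy_value, ["late_payment", "high_limit", "age"])
print(phi)
```

The attributions sum to the difference between the full output and the empty baseline (0.26 here), and a feature that never changes the output (`age`) receives zero. These two properties are what make the per-feature bars in a SHAP plot interpretable as additive contributions.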

### First, without SHAP

In [None]:
best_model_name = 'Random Forest'

best_model = best_models[best_model_name].named_steps['rf']

importances = pd.Series(best_model.feature_importances_, index=X.columns)
importances_sorted = importances.sort_values(ascending=True)  # Ascending=True for horizontal bars to show highest at top

plt.figure(figsize=(10, 6))
plt.barh(importances_sorted.index, importances_sorted.values)

plt.title('Feature Importance')
plt.xlabel('Importance')

plt.tight_layout()

### Using SHAP

In [17]:
explainer = shap.TreeExplainer(best_model)

In [18]:
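# The forest was trained on scaled features, so scale the row before explaining it
shap_values = explainer.shap_values(best_models[best_model_name].named_steps['scaler'].transform(X_test[:1]))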

In [None]:
def plot_shap_values(shap_values, feature_names):
    # Assumes shap_values has shape (n_samples, n_features, n_classes): sample 0, class 0
    values = shap_values[0][:, 0]
    
    plt.figure(figsize=(12, 8))
    
    sorted_idx = np.argsort(np.abs(values))
    
    y_pos = np.arange(len(values))
    plt.barh(y_pos, values[sorted_idx], 
            color=['red' if x < 0 else 'blue' for x in values[sorted_idx]])
    
    plt.yticks(y_pos, feature_names[sorted_idx], ha='right')
    plt.xlabel('SHAP Value (Impact on Model Output)')
    plt.title('Feature Importance (SHAP Values)')
    
    plt.axvline(x=0, color='black', linestyle='-', linewidth=0.5)
    
    plt.tight_layout()
    
    return plt

plot_shap_values(shap_values, X_test.columns)

## 6. Conclusion

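The dataset is imbalanced, and the models still need improvement. Resampling techniques such as SMOTE are one option, though which approach to use remains an open question. Grouping correlated features is another direction to explore.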
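
As a starting point for the rebalancing idea, here is a minimal sketch of plain random oversampling with `sklearn.utils.resample`. This is a simpler technique than SMOTE: it duplicates existing minority rows instead of synthesizing new ones. The helper `oversample_minority`, the temporary `_target` column, and the toy data are hypothetical.

```python
import pandas as pd
from sklearn.utils import resample

def oversample_minority(X_train, y_train, label=1, random_state=42):
    """Duplicate minority-class rows (with replacement) until classes are balanced."""
    train = X_train.copy()
    train["_target"] = y_train.values
    majority = train[train["_target"] != label]
    minority = train[train["_target"] == label]
    minority_up = resample(
        minority, replace=True, n_samples=len(majority), random_state=random_state
    )
    # Shuffle so the duplicated rows are not grouped at the end
    balanced = pd.concat([majority, minority_up]).sample(frac=1, random_state=random_state)
    return balanced.drop(columns="_target"), balanced["_target"]

# Hypothetical toy training set: 8 non-defaults, 2 defaults
X_toy = pd.DataFrame({"feat_a": range(10)})
y_toy = pd.Series([0] * 8 + [1] * 2)

X_bal, y_bal = oversample_minority(X_toy, y_toy)
print(y_bal.value_counts())
```

Any resampling should be applied to the training split only (here, `X_train` and `y_train`), so that the test set keeps the original class distribution.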

## 7. Improvement