# Credit Risk Modeling — Notebook

**Goal:** Assess creditworthiness of loan applicants (binary classification: good / bad loan).

This notebook covers environment setup, EDA examples, preprocessing, SMOTE oversampling, model training, evaluation (ROC AUC, PR AUC, confusion matrix), and SHAP explainability. The training cell fits a Random Forest; Logistic Regression and XGBoost are imported so they can be swapped in. Update `DATA_PATH` to point to your CSV file and run the cells in order.



## 1) Install requirements (run once)

If any packages are missing, uncomment and run the install line below:

In [None]:

# !pip install pandas numpy scikit-learn imbalanced-learn xgboost shap matplotlib seaborn joblib nbformat
print('Skip pip install in notebook if already installed.')


## 2) Imports and settings

In [None]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.metrics import roc_auc_score, precision_recall_curve, auc, classification_report, confusion_matrix
import shap
import joblib

RANDOM_STATE = 42
TEST_SIZE = 0.2
print('Libraries imported.')


## 3) Configuration: dataset path and quick target mapping

In [None]:

# === EDIT THIS: point to your dataset CSV ===
DATA_PATH = 'lending_club_loans.csv'  # change to your CSV file path

# Example load (no internet - ensure file exists)
try:
    df = pd.read_csv(DATA_PATH, low_memory=False)
    print('Loaded:', DATA_PATH, 'shape=', df.shape)
except FileNotFoundError:
    print('File not found at', DATA_PATH)
    df = pd.DataFrame()  # placeholder

# Quick target creation helper
if not df.empty:
    if 'target' not in df.columns:
        if 'loan_status' in df.columns:
            # Note: every other status (including loans still in progress) is labelled 0
            df['target'] = df['loan_status'].isin(['Charged Off', 'Default']).astype(int)
            print('Derived target from loan_status. Value counts:\n', df['target'].value_counts(dropna=False))
        else:
            print('No `target` or `loan_status` column found. Please create a binary target column named `target`.')
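
The mapping above labels every status other than `Charged Off` or `Default` as a good loan, including loans that are still being repaid. A minimal sketch of a stricter alternative keeps only loans with a final outcome. The status labels `Fully Paid`, `Current`, and `Late (31-120 days)` and the helper name `derive_target` are assumptions for illustration; check the values actually present in your data.

```python
import pandas as pd

# Assumed status labels; adjust to the values present in your dataset.
BAD_STATUSES = ['Charged Off', 'Default']
GOOD_STATUSES = ['Fully Paid']

def derive_target(frame, status_col='loan_status'):
    """Keep only loans with a final outcome and add a binary `target` column."""
    resolved = frame[frame[status_col].isin(BAD_STATUSES + GOOD_STATUSES)].copy()
    resolved['target'] = resolved[status_col].isin(BAD_STATUSES).astype(int)
    return resolved

# Toy data for illustration only
toy = pd.DataFrame({'loan_status': ['Fully Paid', 'Charged Off', 'Current',
                                    'Default', 'Late (31-120 days)']})
labeled = derive_target(toy)
print(labeled['target'].value_counts())
```

Dropping unresolved loans shrinks the dataset, but it avoids counting a loan as good merely because it has not defaulted yet.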


## 4) Quick EDA examples

These are example exploratory analyses — run them when your dataset is loaded.

In [None]:

if not df.empty:
    display(df.head())
    print('\nNumeric summary:')
    display(df.describe().T)
    print('\nTarget distribution:')
    display(df['target'].value_counts(normalize=True))
    
    # Example: top features missingness
    missing = df.isnull().mean().sort_values(ascending=False).head(20)
    print('\nTop 20 columns by missing rate:')
    display(missing)
    
    # Example: plot target vs a numeric feature if exists
    numeric_cols = [c for c in df.select_dtypes(include='number').columns if c not in ('target', 'id', 'member_id')]
    if numeric_cols:
        col = numeric_cols[0]
        plt.figure(figsize=(8,4))
        sns.histplot(data=df, x=col, hue='target', bins=50, stat='density', element='step', common_norm=False)
        plt.title(f'Distribution of {col} by target')
        plt.show()
else:
    print('Load a dataset to run EDA.')
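
Alongside the histogram, the bad-loan rate per level of a categorical column can be a quick sanity check. A small sketch on toy data follows; the `grade` column and the helper name `bad_rate_by_category` are assumptions for illustration.

```python
import pandas as pd

def bad_rate_by_category(frame, col, target_col='target'):
    """Return row count and bad-loan rate for each level of `col`, highest rate first."""
    summary = (frame.groupby(col)[target_col]
                    .agg(n='count', bad_rate='mean')
                    .sort_values('bad_rate', ascending=False))
    return summary

# Toy data for illustration only
toy = pd.DataFrame({
    'grade':  ['A', 'A', 'B', 'B', 'C', 'C', 'C', 'C'],
    'target': [0,   0,   0,   1,   1,   1,   0,   1],
})
rates = bad_rate_by_category(toy, 'grade')
print(rates)
```

Reading the `n` column next to `bad_rate` helps avoid over-interpreting levels that contain only a handful of loans.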


## 5) Preprocessing, SMOTE, and model training (example)

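This cell builds a preprocessing pipeline, applies SMOTE inside the pipeline (so oversampling happens only when fitting, never on validation folds or the test set), and runs a grid search over a Random Forest. Adjust the dropped columns and feature lists as needed.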

In [None]:

if df.empty:
    print('Dataset not loaded. Please update DATA_PATH and run the earlier cell.')
else:
    # Simple feature selection: drop ids, text-heavy columns, and the raw loan_status
    # column (the target is derived from it, so keeping it would leak the label).
    # Review other post-origination columns (e.g. payment or recovery fields) for the same issue.
    drop_cols = [c for c in ['id','member_id','url','desc','title','loan_status'] if c in df.columns]
    X = df.drop(columns=drop_cols + ['target'], errors='ignore')
    y = df['target']
    
    # Identify feature types
    num_features = X.select_dtypes(include='number').columns.tolist()
    cat_features = X.select_dtypes(include=['object','category']).columns.tolist()
    
    # Build preprocessors (adapt strategies as needed)
    numeric_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())
    ])
    categorical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
        # `sparse_output` replaced `sparse` in scikit-learn 1.2; use sparse=False on older versions
        ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
    ])
    preprocessor = ColumnTransformer(transformers=[
        ('num', numeric_transformer, num_features),
        ('cat', categorical_transformer, cat_features)
    ], remainder='drop')
    
    smote = SMOTE(random_state=RANDOM_STATE)
    rf = RandomForestClassifier(random_state=RANDOM_STATE, n_jobs=-1)
    
    pipe = ImbPipeline(steps=[
        ('preprocessor', preprocessor),
        ('smote', smote),
        ('clf', rf)
    ])
    
    # Train-test split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=TEST_SIZE, stratify=y, random_state=RANDOM_STATE)
    print('Train/test shapes:', X_train.shape, X_test.shape)
    
    # Grid search (small example grid)
    param_grid = {
        'clf__n_estimators': [100, 200],
        'clf__max_depth': [6, 12]
    }
    cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=RANDOM_STATE)
    search = GridSearchCV(pipe, param_grid, scoring='roc_auc', n_jobs=-1, cv=cv, verbose=1)
    search.fit(X_train, y_train)
    print('Best params:', search.best_params_)
    
    # Evaluation
    best_model = search.best_estimator_
    probs = best_model.predict_proba(X_test)[:,1]
    preds = best_model.predict(X_test)
    roc = roc_auc_score(y_test, probs)
    precision, recall, _ = precision_recall_curve(y_test, probs)
    pr_auc = auc(recall, precision)
    print(f'Test ROC AUC: {roc:.4f} | PR AUC: {pr_auc:.4f}')
    print('\nClassification report:')
    print(classification_report(y_test, preds))
    cm = confusion_matrix(y_test, preds)
    print('Confusion matrix:\n', cm)
    
    # Save model
    joblib.dump(best_model, 'credit_model_rf.joblib')
    print('Saved model to credit_model_rf.joblib')
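
To make the SMOTE step in the pipeline less opaque, here is a simplified sketch of its core idea: each synthetic minority sample is an interpolation between a minority point and one of its nearest minority neighbours. This is not imbalanced-learn's implementation; the function name `smote_like_oversample` and the toy data are assumptions for illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like_oversample(X_min, n_new, k=5, seed=42):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    k = min(k, len(X_min) - 1)                        # cannot use more neighbours than points
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)                     # column 0 is the point itself (barring duplicates)
    base = rng.integers(0, len(X_min), size=n_new)    # random minority rows to start from
    neigh = idx[base, rng.integers(1, k + 1, size=n_new)]  # one random neighbour per base row
    gap = rng.random((n_new, 1))                      # interpolation weight in [0, 1)
    return X_min[base] + gap * (X_min[neigh] - X_min[base])

# Toy minority class for illustration only
rng = np.random.default_rng(0)
X_minority = rng.normal(loc=2.0, size=(20, 3))
X_synth = smote_like_oversample(X_minority, n_new=80)
print(X_synth.shape)
```

Because every synthetic row lies on a segment between two existing minority rows, this kind of oversampling densifies the minority region rather than extending it.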


## 6) SHAP explainability (summary plot)

Note: for large feature spaces, KernelExplainer can be slow. Because the classifier here is a tree model, this cell uses TreeExplainer on the preprocessed feature matrix.

In [None]:

if df.empty:
    print('Dataset not loaded.')
else:
    # Reuse the preprocessor already fitted inside the best pipeline
    preproc_only = best_model.named_steps['preprocessor']
    X_train_trans = preproc_only.transform(X_train)
    X_test_trans = preproc_only.transform(X_test)
    
    # Helper to extract feature names from ColumnTransformer
    def get_feature_names_from_ct(ct):
        feature_names = []
        for name, trans, cols in ct.transformers_:
            if name == 'remainder': 
                continue
            if hasattr(trans, 'named_steps') and 'onehot' in trans.named_steps:
                ohe = trans.named_steps['onehot']
                cats = ohe.categories_
                for i, col in enumerate(cols):
                    for cat in cats[i]:
                        feature_names.append(f"{col}__{cat}")
            else:
                feature_names.extend(cols)
        return feature_names
    
    feature_names = get_feature_names_from_ct(preproc_only)
    clf = best_model.named_steps['clf']
    try:
        explainer = shap.TreeExplainer(clf)
        shap_vals = explainer.shap_values(X_test_trans)
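        # Older SHAP versions return a list (one array per class); newer ones
        # may return a 3D array of shape (n_samples, n_features, n_classes)
        if isinstance(shap_vals, list):
            shap_vals_pos = shap_vals[1]
        elif getattr(shap_vals, 'ndim', 2) == 3:
            shap_vals_pos = shap_vals[:, :, 1]
        else:
            shap_vals_pos = shap_vals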
        print('Displaying SHAP summary plot (this will render inline in Jupyter):')
        shap.summary_plot(shap_vals_pos, features=X_test_trans, feature_names=feature_names, show=True)
    except Exception as e:
        print('SHAP explanation failed:', e)


## 7) Next steps

- Tune preprocessing and feature engineering (binning, debt-to-income ratio, credit history features).
- Try XGBoost or LightGBM and compare models on both discrimination (ROC AUC, PR AUC) and probability calibration.
- Add fairness checks for protected attributes.
- Export model + preprocessors for production use (pickle / joblib).
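
For the calibration comparison mentioned in the list above, one possible starting point is to report the Brier score alongside ROC AUC. The sketch below uses synthetic data from `make_classification` and two scikit-learn models as stand-ins; the dataset, model choices, and settings are assumptions for illustration, not results from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (roughly 15% positives)
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.85, 0.15],
                           random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y,
                                          random_state=42)

models = {
    'logreg': LogisticRegression(max_iter=1000),
    'rf': RandomForestClassifier(n_estimators=100, random_state=42),
}
results = {}
for name, model in models.items():
    p = model.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    # Brier score: mean squared error of the predicted probabilities (lower is better)
    results[name] = {'roc_auc': roc_auc_score(y_te, p),
                     'brier': brier_score_loss(y_te, p)}
    print(name, results[name])
```

A model can rank applicants well (high ROC AUC) and still produce poorly calibrated probabilities, which matters when scores feed pricing or approval thresholds.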