# Credit Default Risk Analysis & Probability-Based Decisioning

## Objective
This project analyzes credit default risk using Lending Club loan data and compares
rule-based credit policies with probability-based decisioning using logistic regression.

## Dataset
Lending Club loan data (2007–2015)

## Approach
- Define default and analyze key risk drivers
- Design and evaluate rule-based rejection policies
- Build an interpretable logistic regression model
- Use predicted default probabilities for flexible decisioning
- Discuss monitoring and population stability for production deployment

In [1]:
import pandas as pd
import numpy as np

## Data Loading & Setup
The raw CSV is converted to Parquet format to improve read performance for large-scale analysis.


In [2]:
df = pd.read_parquet("loan.parquet", engine="fastparquet")

In [3]:
df.shape

(2260668, 145)

## Defining Default

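A binary default flag is created from `loan_status`.
A loan is flagged as defaulted if its status is Charged Off, Default, or Late (31-120 days),
a common convention for treating severe delinquency as default.
All other statuses are treated as non-default, including Current loans whose final outcome
is not yet known and the "Does not meet the credit policy" records.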

In [4]:
df['loan_status'].value_counts()

loan_status
Fully Paid                                             1041952
Current                                                 919695
Charged Off                                             261655
Late (31-120 days)                                       21897
In Grace Period                                           8952
Late (16-30 days)                                         3737
Does not meet the credit policy. Status:Fully Paid        1988
Does not meet the credit policy. Status:Charged Off        761
Default                                                     31
Name: count, dtype: int64

In [5]:
default_statuses = ['Charged Off', 'Default', 'Late (31-120 days)']
df['default'] = df['loan_status'].isin(default_statuses).astype(int)

In [6]:
default_rate= df['default'].mean()
print(f"Default rate: {default_rate:.2%}")

Default rate: 12.54%


### Interest Rate Bucketing

Interest rates are bucketed to analyze how default risk varies across pricing levels.
Since interest rate reflects both borrower risk and lender pricing decisions,
grouping loans into ordered buckets helps identify whether higher-priced loans
exhibit higher realized default rates.


In [7]:
int_rate_stats=df['int_rate'].agg({
    'min_rate':'min',
    'max_rate':'max',
    'mean':'mean',
    'q25':lambda x: x.quantile(0.25), 
    'q50':lambda x: x.quantile(0.5), 
    'q75':lambda x: x.quantile(0.75)})
print(int_rate_stats)

min_rate     5.310000
max_rate    30.990000
mean        13.092913
q25          9.490000
q50         12.620000
q75         15.990000
Name: int_rate, dtype: float64


In [8]:
q25 = int_rate_stats['q25']
q50 = int_rate_stats['q50']
q75 = int_rate_stats['q75']

def bucket_int_rate(x):
    if x <= q25:
        return 'Low'
    elif x <= q50:
        return 'Medium-Low'
    elif x <= q75:
        return 'Medium-High'
    else:
        return 'High'

df['int_rate_bucket'] = df['int_rate'].apply(bucket_int_rate)
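
The same quartile bucketing can be written without a row-by-row function call. The sketch below is an illustrative alternative, not the notebook's code: it uses `pd.cut` with the quartiles as right-inclusive edges, which reproduces the `<=` comparisons above, and it leaves a missing rate as `NaN` instead of letting it fall through to `'High'`. The toy `rates` series and the helper name `bucket_by_quartiles` are assumptions made for the example.

```python
import numpy as np
import pandas as pd

def bucket_by_quartiles(values: pd.Series) -> pd.Series:
    """Assign quartile labels using right-inclusive edges (x <= q25 -> 'Low', ...)."""
    q25, q50, q75 = values.quantile([0.25, 0.50, 0.75])
    edges = [-np.inf, q25, q50, q75, np.inf]
    labels = ['Low', 'Medium-Low', 'Medium-High', 'High']
    # right=True makes each bin (lower, upper], matching the <= comparisons
    return pd.cut(values, bins=edges, labels=labels, right=True)

# Toy data standing in for df['int_rate']; the last value is missing
rates = pd.Series([5.3, 9.5, 12.6, 16.0, 31.0, np.nan])
buckets = bucket_by_quartiles(rates)
print(buckets.tolist())
```

The result is an ordered categorical, so grouped outputs sort from `Low` to `High` without a manual reindex. `pd.cut` raises an error if two quartiles coincide, which would need handling on heavily tied data.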




In [9]:
df['default'].groupby(df['int_rate_bucket']).mean().sort_values().round(2)*100


int_rate_bucket
Low             5.0
Medium-Low      9.0
Medium-High    15.0
High           22.0
Name: default, dtype: float64

### Interest Rate vs Default Risk

Default rates increase monotonically across interest rate buckets.
High-interest loans default at more than four times the rate of low-interest loans (~22% vs ~5%).

This indicates that credit pricing is risk-aligned:
borrowers charged higher interest rates are, on average, riskier
and do in fact default more frequently.



### Debt-to-Income (DTI) Bucketing

Debt-to-income (DTI) measures a borrower’s repayment burden relative to income.
Loans are grouped into DTI buckets to assess how increasing leverage and affordability
stress impact default risk across the portfolio.


In [10]:
df['dti_clean'] = df['dti'].where(
    (df['dti'] != -1) & (df['dti'] < 100),
    np.nan
)

# If dti == -1 → missing
# If dti >= 100 → missing

In [11]:
dti_stats = df['dti_clean'].agg({'min_dti': 'min',
                           'max_dti':'max',
                           'mean_dti':'mean',
                           'q25':lambda x: x.quantile(0.25),
                           'q50':lambda x: x.quantile(0.5),
                           'q75':lambda x: x.quantile(0.75)})

print(dti_stats)



min_dti      0.000000
max_dti     99.920000
mean_dti    18.569813
q25         11.890000
q50         17.820000
q75         24.460000
Name: dti_clean, dtype: float64


In [12]:
q25 = dti_stats['q25']
q50 = dti_stats['q50']
q75 = dti_stats['q75']

def bucket_dti(x):
    if pd.isna(x):
        return np.nan
    elif x <= q25:
        return 'Low'
    elif x <= q50:
        return 'Medium-Low'
    elif x <= q75:
        return 'Medium-High'
    else:
        return 'High'

df['dti_bucket'] = df['dti_clean'].apply(bucket_dti)


In [13]:
df['default'].groupby(df['dti_bucket']).mean().sort_values().round(2)*100

dti_bucket
Low            10.0
Medium-Low     11.0
Medium-High    13.0
High           16.0
Name: default, dtype: float64

### Debt-to-Income (DTI) vs Default Risk

Default rates increase steadily across DTI buckets, from ~10% to ~16%, a flatter gradient than the ~5% to ~22% seen across interest rate buckets.
Borrowers with higher debt-to-income ratios face greater repayment obligations
relative to their income, making them more vulnerable to financial stress and
more likely to default.

This confirms DTI as a key affordability-based risk driver in credit underwriting.



## Rule-Based Credit Policies

Rule-based policies are commonly used in credit risk for their simplicity and interpretability.
Here, rejection rules are designed using combinations of interest rate and DTI buckets.


In [14]:
risk_matrix = (
    df.groupby(['int_rate_bucket', 'dti_bucket'])['default']
      .mean()
      .unstack()  # Creates matrix format
      .round(2)*100 
)

risk_matrix = risk_matrix.loc[
    ['Low', 'Medium-Low', 'Medium-High', 'High'],
    ['Low', 'Medium-Low', 'Medium-High', 'High']
]

risk_matrix

dti_bucket        Low  Medium-Low  Medium-High  High
int_rate_bucket
Low               4.0         4.0          5.0   6.0
Medium-Low        8.0         9.0         10.0  10.0
Medium-High      13.0        14.0         15.0  16.0
High             19.0        21.0         22.0  24.0


### Risk Matrix Observations

- For a fixed DTI bucket, default risk increases as interest rate rises.
- For a fixed interest rate bucket, default risk is flat or increasing as DTI rises.
- Loans with both high interest rates and high DTI have the highest default rate in the matrix (~24%), about six times that of the low-rate, low-DTI cell (~4%).

This concentration of risk motivates rule-based rejection policies that target
borrowers with simultaneously elevated pricing and affordability risk.
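
Default rates alone do not show how many loans sit in each cell, which matters once a cell is turned into a rejection rule. A minimal sketch, assuming a frame with the same `int_rate_bucket`, `dti_bucket`, and `default` columns; the six toy rows and the two-level `order` list are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    'int_rate_bucket': ['High', 'High', 'High', 'Low', 'Low', 'Low'],
    'dti_bucket':      ['High', 'High', 'Low',  'High', 'Low', 'Low'],
    'default':         [1,      0,      0,      0,      0,     1],
})

order = ['Low', 'High']  # the notebook uses four ordered buckets

# One pass gives both the default rate and the loan count per cell
cells = toy.groupby(['int_rate_bucket', 'dti_bucket'])['default'].agg(['mean', 'count'])

rate_matrix = cells['mean'].unstack().loc[order, order]
count_matrix = cells['count'].unstack().loc[order, order]

print(rate_matrix)
print(count_matrix)
```

Reading the two matrices side by side shows both how risky a cell is and how much volume a rule on that cell would remove.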


### Policy V1: High Interest Rate AND High DTI

This policy rejects borrowers who simultaneously fall into the highest
interest rate bucket and the highest DTI bucket, targeting borrowers with
both pricing-based and affordability-based risk.


In [15]:
df['reject_flag'] = (
    (df['int_rate_bucket'] == 'High') &
    (df['dti_bucket'] == 'High')
)

In [16]:
df[df['reject_flag']]['default'].mean() 

np.float64(0.23779024910956806)

In [17]:
df[~df['reject_flag']]['default'].mean()

np.float64(0.115178663669709)

In [18]:
df['default'].mean()

np.float64(0.125442125955691)

In [19]:
rejection_rate =df['reject_flag'].mean()*100
rejection_rate


np.float64(8.370711665755431)

#### Policy V1 Results

- The rejected segment exhibits a default rate of approximately 24%,
  indicating that the policy successfully isolates a high-risk group.
- The approved portfolio default rate drops to ~11.5% from a baseline of ~12.5%.
- This demonstrates meaningful risk reduction with a relatively simple rule.
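
The four figures used to judge a policy (rejection rate, approved and rejected default rates, baseline) can be collected in one helper so that every policy is measured the same way. This is a sketch with a hypothetical function name, `evaluate_policy`, and toy data; it is not part of the original notebook.

```python
import numpy as np

def evaluate_policy(reject_flag, default):
    """Summarise a rejection rule: volume rejected and default rates on each side."""
    reject_flag = np.asarray(reject_flag, dtype=bool)
    default = np.asarray(default, dtype=float)
    return {
        'rejection_rate': reject_flag.mean(),
        'approved_default_rate': default[~reject_flag].mean(),
        'rejected_default_rate': default[reject_flag].mean(),
        'baseline_default_rate': default.mean(),
    }

# Toy example: 10 loans, the rule rejects the last two
default = np.array([0, 0, 0, 1, 0, 0, 0, 0, 1, 1])
reject = np.array([False] * 8 + [True] * 2)
summary = evaluate_policy(reject, default)
print(summary)
```

In the notebook's terms this would be called as `evaluate_policy(df['reject_flag'], df['default'])`.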


### Policy V2: Expanded High Interest Rate Segment

This policy extends Policy V1 by additionally rejecting borrowers with
high interest rates and moderately high DTI.

The objective is to capture additional high-risk borrowers while
evaluating the trade-off between further risk reduction and increased
rejection volume.

In [20]:
df['reject_flag_v2'] = (
    ((df['int_rate_bucket'] == 'High') & (df['dti_bucket'] == 'High')) |
    ((df['int_rate_bucket'] == 'High') & (df['dti_bucket'] == 'Medium-High'))
)
rejection_rate =df['reject_flag_v2'].mean()*100
rejection_rate

np.float64(14.600640164765458)

In [21]:
df[df['reject_flag_v2']]['default'].mean()

np.float64(0.23213117138078965)

In [22]:
df[~df['reject_flag_v2']]['default'].mean()

np.float64(0.10720161027993427)

In [23]:
df['default'].mean()

np.float64(0.125442125955691)

#### Policy V2 Results

- Policy V2 rejects approximately 14–15% of loans.
- The rejected segment exhibits a default rate of ~23%, confirming it remains
  substantially riskier than the overall population.
- The approved portfolio default rate declines further to ~10.7% from ~11.5%
  under Policy V1.

Compared to Policy V1, this policy achieves additional risk reduction at the
cost of higher rejection volume, illustrating diminishing returns from
incrementally tightening rule-based policies.

## Policy Comparison and Trade-off Analysis

### Policy V1 Summary (Single Segment)
- Rejection rate: ~8.4%
- Approved-loan default rate: ~11.5%
- Portfolio risk reduction: ~1.03 percentage points (12.54% → 11.52%)

Policy V1 removes the single highest-risk segment while preserving most loan volume,
delivering strong risk reduction efficiency.

---

### Policy V2 Summary (Two Segments)
- Rejection rate: ~14.6%
- Approved-loan default rate: ~10.7%
- Portfolio risk reduction: ~1.82 percentage points (12.5% → 10.7%)

Policy V2 further tightens credit standards by rejecting an additional high-risk segment,
resulting in a cleaner portfolio but higher volume loss.

---

### Incremental Impact of Tightening the Policy

Comparing Policy V2 to Policy V1:

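- Additional loans rejected: ~6.2 percentage points (14.6% − 8.4%)
- Additional portfolio risk reduction: ~0.80 percentage points (11.52% → 10.72%)

Measured per point of rejection, the two steps are similar: the first ~8.4 points of
rejection lower the approved default rate by ~1.03 points, and the next ~6.2 points
lower it by a further ~0.80.

Returns diminish in a narrower sense: the V1 segment defaults at ~23.8%, while the
combined V2 segment defaults at ~23.2%, so the loans added by V2 are less risky than
those V1 already removes.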
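
The trade-off can be checked directly from the figures reported above. The sketch below is plain Python with values copied from the notebook outputs; the variable names are illustrative. It computes the risk reduction per point of rejection for each step and backs out the default rate of the segment that V2 adds on top of V1.

```python
# Figures copied from the notebook outputs above
baseline = 0.125442
v1 = {'rejection_rate': 0.083707, 'approved_default': 0.115179, 'rejected_default': 0.237790}
v2 = {'rejection_rate': 0.146006, 'approved_default': 0.107202, 'rejected_default': 0.232131}

# Reduction in approved default rate per unit of rejection rate
v1_efficiency = (baseline - v1['approved_default']) / v1['rejection_rate']
step_efficiency = (v1['approved_default'] - v2['approved_default']) / (
    v2['rejection_rate'] - v1['rejection_rate'])

# Default rate of the loans rejected by V2 but not by V1
extra_share = v2['rejection_rate'] - v1['rejection_rate']
extra_defaults = (v2['rejection_rate'] * v2['rejected_default']
                  - v1['rejection_rate'] * v1['rejected_default'])
extra_default_rate = extra_defaults / extra_share

print(f"V1 efficiency:       {v1_efficiency:.3f}")
print(f"V1 -> V2 efficiency: {step_efficiency:.3f}")
print(f"Default rate of the added segment: {extra_default_rate:.1%}")
```

On these figures the two steps have similar efficiency per point of rejection (about 0.12 and 0.13), while the added segment defaults at roughly 22.5%, below the ~23.8% of the V1 segment.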

---

### Key Insight: Diminishing Marginal Returns

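- Rejected loans are riskier on average under Policy V1 (~23.8% default rate) than under Policy V2 (~23.2%)
- Marginal benefit per rejected loan decreases as rejection rules are expanded
- Further tightening would reach progressively lower-risk cells of the risk matrix, so efficiency keeps falling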

---

### Business Implications

- **If the objective is risk reduction with minimal volume impact**:
  Policy V1 is preferable — it removes the worst risk pocket with strong efficiency.

- **If the objective is aggressive portfolio de-risking**:
  Policy V2 may be justified — resulting in a much cleaner book (~10.7% default rate),
  but only if the business can tolerate ~15% rejection.

These results motivate the transition from hard rule-based policies to
probability-based models that allow finer risk discrimination with better
volume–risk trade-offs.


## Preparing Data for Modeling

To move beyond hard rule-based policies, we train a probabilistic model
that estimates the likelihood of default for each borrower.

For modeling, we retain only observations with complete information
for interest rate, DTI, and default outcome.


In [24]:
model_df = df[['int_rate', 'dti_clean', 'default']].dropna()

X = model_df[['int_rate', 'dti_clean']]
y = model_df['default']



In [25]:
model_df.shape

(2256388, 3)

## Logistic Regression Model

A logistic regression model is used to estimate the probability of default
for each loan application.

This approach is chosen because:
- It typically produces reasonably well-calibrated default probabilities
- Model coefficients are interpretable and align with credit intuition
- It enables flexible risk-based decisioning beyond rigid rule cutoffs

The model uses interest rate and debt-to-income ratio as predictors,
capturing both pricing-based and affordability-based risk.



## Train–Test Split

To evaluate model performance fairly, the dataset is split into
training and test samples.

The training set is used to learn model parameters, while the test set
simulates unseen future data to assess generalization.


In [26]:
from sklearn.model_selection import train_test_split


In [27]:
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.3,
    random_state=42, 
    stratify=y  
)


The dataset is split using a 70/30 train–test ratio.

Stratifying on the target keeps the default rate the same in both samples,
so evaluation on the test set is not distorted by a chance difference in class mix.

A fixed random seed is used to make results reproducible.


In [28]:
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression(
    max_iter=1000,
    solver='lbfgs'
)


In [29]:
log_reg.fit(X_train, y_train)


LogisticRegression(max_iter=1000)


In [30]:
coefficients = pd.Series(
    log_reg.coef_[0],
    index=X.columns
)

intercept = log_reg.intercept_[0]

coefficients, intercept

(int_rate     0.113375
 dti_clean    0.007853
 dtype: float64,
 np.float64(-3.691839280949963))

### Model Interpretation

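- Both interest rate and DTI are positively associated with default risk.
- Interest rate has the larger effect: each additional point of interest rate multiplies the odds of default by about 1.12, versus about 1.008 per point of DTI. The gap remains after allowing for DTI's wider spread (interquartile range of ~12.6 points versus ~6.5 for interest rate).
- This suggests pricing embeds significant information about borrower risk.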

The model confirms insights observed in the earlier rule-based analysis,
while providing a continuous risk score rather than coarse buckets.
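
To make the coefficients concrete, the sketch below applies the logistic formula by hand, using the fitted values reported above (rounded to six decimals). The function name `predicted_pd` and the example inputs are illustrative; in the notebook the same number comes from `log_reg.predict_proba`.

```python
import math

# Fitted values reported above
INTERCEPT = -3.691839
COEF_INT_RATE = 0.113375
COEF_DTI = 0.007853

def predicted_pd(int_rate: float, dti: float) -> float:
    """Logistic model: PD = 1 / (1 + exp(-(b0 + b1*int_rate + b2*dti)))."""
    z = INTERCEPT + COEF_INT_RATE * int_rate + COEF_DTI * dti
    return 1.0 / (1.0 + math.exp(-z))

# Odds ratios: multiplicative change in default odds per one-unit increase
odds_ratio_int_rate = math.exp(COEF_INT_RATE)
odds_ratio_dti = math.exp(COEF_DTI)

print(f"PD at median rate and DTI (12.62, 17.82): {predicted_pd(12.62, 17.82):.3f}")
print(f"Odds ratio per +1 point of interest rate: {odds_ratio_int_rate:.3f}")
print(f"Odds ratio per +1 point of DTI: {odds_ratio_dti:.3f}")
```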


## Probability-Based Credit Decisions
Predicted default probabilities allow credit decisions to be made using
flexible cutoffs, separating risk estimation from risk appetite.

This enables scenario testing across approval thresholds without
retraining the model.


In [31]:
y_test_proba = log_reg.predict_proba(X_test)[:, 1]
y_test_proba.mean()



np.float64(0.12550684202641013)

### Threshold Analysis: 20% PD Cutoff
Borrowers with predicted default probability above 20% are rejected.
This represents a relatively conservative credit policy aimed at
early risk containment.
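
Because the model is linear in the log-odds, a PD cutoff corresponds to a straight line in the (interest rate, DTI) plane. The sketch below inverts the fitted equation to show the interest rate above which a loan would be rejected at a given DTI. It uses the coefficients reported earlier, rounded to six decimals; `int_rate_at_cutoff` is a hypothetical helper name.

```python
import math

# Fitted values reported earlier in the notebook
INTERCEPT = -3.691839
COEF_INT_RATE = 0.113375
COEF_DTI = 0.007853

def int_rate_at_cutoff(pd_cutoff: float, dti: float) -> float:
    """Interest rate at which the fitted model's PD equals pd_cutoff for a given DTI."""
    logit = math.log(pd_cutoff / (1.0 - pd_cutoff))
    return (logit - INTERCEPT - COEF_DTI * dti) / COEF_INT_RATE

for dti in (11.89, 17.82, 24.46):  # DTI quartiles reported earlier
    print(f"DTI {dti:5.2f}: reject above ~{int_rate_at_cutoff(0.20, dti):.1f}% interest")
```

At the median DTI the 20% cutoff sits at an interest rate of roughly 19%, and it moves by less than one point across the DTI quartiles. Under this two-feature model the cutoff therefore behaves much like an interest rate rule with a small DTI adjustment.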


In [32]:
reject_flag_model_1 = y_test_proba > 0.20

In [33]:
reject_flag_model_1.mean()

np.float64(0.11877379353746471)

In [34]:
y_test[~reject_flag_model_1].mean()

np.float64(0.10876303609117595)

In [35]:
y_test[reject_flag_model_1].mean()

np.float64(0.2501990049751244)

**Results:**
- Rejection rate: ~11.9%
- Approved-loan default rate: ~10.9%
- Rejected-loan default rate: ~25.0%

**Interpretation:**
- A larger share of borrowers is rejected compared to higher thresholds.
- The approved portfolio exhibits lower default risk.
- The rejected segment is materially riskier, indicating effective risk separation.


### Threshold Analysis: 25% PD Cutoff

Borrowers with predicted default probability above 25% are rejected.
This threshold balances risk reduction with approval volume.


In [36]:
reject_flag_model = y_test_proba > 0.25


In [37]:
reject_flag_model.mean()


np.float64(0.06111975323414835)

In [38]:
y_test[~reject_flag_model].mean()



np.float64(0.11607064184383772)

In [39]:
y_test[reject_flag_model].mean()


np.float64(0.2713605491504121)

**Results:**
- Rejection rate: ~6%
- Approved-loan default rate: ~11.6%
- Rejected-loan default rate: ~27%

**Interpretation:**
- Fewer borrowers are rejected compared to the 20% cutoff.
- The rejected segment concentrates high-risk borrowers.
- The approved portfolio risk remains well below the rejected segment.


### Threshold Comparison: 20% vs 25%

- The 20% cutoff achieves stronger risk reduction but at the cost of
  higher rejection volume.
- The 25% cutoff preserves more volume while still isolating a highly
  risky borrower segment.
- Marginal risk reduction diminishes as the threshold is lowered,
  reflecting a classic risk–volume trade-off.
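
Rather than evaluating one cutoff per cell, the same three metrics can be computed for a whole grid of cutoffs. The sketch below is illustrative: `threshold_sweep` is a hypothetical helper, and the scores and outcomes are synthetic rather than the notebook's `y_test_proba` and `y_test`.

```python
import numpy as np

def threshold_sweep(pd_scores, outcomes, cutoffs):
    """For each cutoff, reject scores above it and report volume and default rates."""
    pd_scores = np.asarray(pd_scores, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    rows = []
    for cutoff in cutoffs:
        reject = pd_scores > cutoff
        rows.append({
            'cutoff': cutoff,
            'rejection_rate': reject.mean(),
            # NaN when one side of the cutoff is empty
            'approved_default_rate': outcomes[~reject].mean() if (~reject).any() else np.nan,
            'rejected_default_rate': outcomes[reject].mean() if reject.any() else np.nan,
        })
    return rows

# Synthetic scores and outcomes, for illustration only
rng = np.random.default_rng(0)
scores = rng.uniform(0.02, 0.40, size=10_000)
defaults = rng.random(10_000) < scores  # outcomes drawn from the scores themselves

for row in threshold_sweep(scores, defaults, cutoffs=[0.15, 0.20, 0.25, 0.30]):
    print(row)
```

With the notebook's objects the call would be `threshold_sweep(y_test_proba, y_test, [0.20, 0.25])`.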


### Model vs Rule-Based Policies

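Compared to rule-based policies, probability thresholds allow smoother
control over the risk–volume trade-off without relying on rigid bucket
definitions. The 20% cutoff, for example, rejects ~11.9% of test loans and
leaves an approved default rate of ~10.9%, which falls between Policy V1
(~8.4% rejected, ~11.5%) and Policy V2 (~14.6% rejected, ~10.7%). The
rule-based figures were measured on the full dataset and the cutoff figures
on the 30% test sample, so the comparison is indicative rather than exact.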
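
A like-for-like comparison with a rule needs the model to reject the same share of loans as the rule does. One way to get there, sketched below under the assumption that scores are close to continuous, is to take the cutoff from a quantile of the score distribution. `cutoff_for_rejection_rate` is a hypothetical helper and the scores are synthetic.

```python
import numpy as np

def cutoff_for_rejection_rate(pd_scores, target_rejection_rate):
    """Score cutoff that rejects (approximately) the riskiest target share of loans."""
    return np.quantile(np.asarray(pd_scores, dtype=float), 1.0 - target_rejection_rate)

# Synthetic scores, for illustration only
rng = np.random.default_rng(1)
scores = rng.beta(2, 12, size=50_000)

cutoff = cutoff_for_rejection_rate(scores, 0.084)  # Policy V1 rejects ~8.4%
achieved = (scores > cutoff).mean()
print(f"cutoff={cutoff:.3f}, achieved rejection rate={achieved:.3%}")
```

Where many loans share the same score, the achieved rejection rate can differ from the target, so it is worth checking `achieved` before comparing default rates.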



## Monitoring & Population Stability

Model-based decisioning assumes population stability over time.

Key monitoring signals include:
- Base default rate tracking
- Score distribution drift
- Approval rate stability
- Performance by risk band

Significant deviations may indicate population drift and trigger
model recalibration or policy review.
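
Score distribution drift is often summarised with the Population Stability Index (PSI). The notebook does not name a specific metric, so the sketch below is one possible choice, not the project's method. It bins a reference score sample into deciles and compares the share of recent scores in each bin. The function name, the synthetic beta-distributed scores, and the small floor `eps` are assumptions for illustration.

```python
import numpy as np

def population_stability_index(expected, actual, n_bins=10):
    """PSI between a reference score sample and a recent one, using reference quantile bins."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)
    # Interior bin edges from the reference distribution; outer bins are open-ended
    edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))[1:-1]
    exp_share = np.bincount(np.searchsorted(edges, expected), minlength=n_bins) / expected.size
    act_share = np.bincount(np.searchsorted(edges, actual), minlength=n_bins) / actual.size
    # Small floor avoids log(0) when a bin is empty
    eps = 1e-6
    exp_share = np.clip(exp_share, eps, None)
    act_share = np.clip(act_share, eps, None)
    return float(np.sum((act_share - exp_share) * np.log(act_share / exp_share)))

rng = np.random.default_rng(42)
reference = rng.beta(2, 12, size=20_000)
same_population = rng.beta(2, 12, size=20_000)
shifted_population = rng.beta(3, 10, size=20_000)

print(f"PSI, same population:    {population_stability_index(reference, same_population):.4f}")
print(f"PSI, shifted population: {population_stability_index(reference, shifted_population):.4f}")
```

A commonly quoted rule of thumb treats PSI below 0.1 as stable and above 0.25 as a material shift. These are conventions rather than statistical tests and should be tuned to the portfolio.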
