# Step-by-Step Guide to Building a Probability of Default (PD) Model Using Logistic Regression

## Step 1: Define the Dependent Variable

1.  **Default Definition**: We previously created a good/bad indicator variable (`good_bad`) based on loan status. In this variable:
    *   `1` represents good loans (non-default)
    *   `0` represents bad loans (default)

---


## Step 2: Understand Logistic Regression

1.  **Logistic Regression Basics**:

* Logistic regression models the probability of a binary outcome (default or non-default).
* The logistic function (S-curve) bounds the probabilities between 0 and 1.
    
2.  **Log-Odds Interpretation**:

* The logistic regression model can be expressed as:
    
$$
\log\left(\frac{P(Y=1)}{P(Y=0)}\right) = \beta_0 + \beta_1X_1 + \beta_2X_2 + \ldots + \beta_nX_n
$$
  where:
* $P(Y=1)$ is the probability of a good loan (non-default)
* $P(Y=0)$ is the probability of a bad loan (default).
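
As a minimal sketch of the relationship above, the snippet below converts between log-odds and probability using only the standard library. The helper names `sigmoid` and `log_odds` are illustrative and not part of the original notebook. Under this guide's coding (`1` = good), the probability of default is one minus the modelled probability.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function: map log-odds z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def log_odds(p: float) -> float:
    """Inverse of the logistic function: log(p / (1 - p))."""
    return math.log(p / (1.0 - p))

# Log-odds of 0 correspond to even odds, i.e. P(Y=1) = 0.5.
p_good = sigmoid(0.0)

# With Y=1 coded as "good", the probability of default is the complement.
p_default = 1.0 - p_good

# Round trip: converting a probability to log-odds and back recovers it.
p_round_trip = sigmoid(log_odds(0.8))
```

Because the model is linear in the log-odds, any value of the right-hand side of the equation maps to a valid probability through `sigmoid`.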


---

## Step 3: Transform Independent Variables

**Dummy Variables**:

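Transform all independent variables into dummy variables. Categorical variables get one dummy per category; continuous variables are first binned into categories and then get one dummy per bin. One category per variable is left out as the reference category, against which the remaining coefficients are measured.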




---

## Step 4: Fit the Logistic Regression Model

**Model Fitting**:

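Fit a logistic regression of `good_bad` on the dummy variables. Because `1` denotes a good loan, the model estimates $P(Y=1)$, the probability of non-default; the probability of default is $1 - P(Y=1)$.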


---

## Step 5: Interpret Coefficients

**Coefficient Interpretation**:

Each coefficient in a logistic regression is the change in log-odds for a one-unit increase in its predictor, holding the other predictors fixed. For a dummy variable, this is the difference in log-odds between the dummy's category and the reference category.
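
To make the dummy-variable case concrete, here is a short derivation in the notation of Step 2. The symbol $D_j$ (a dummy variable with coefficient $\beta_j$) is introduced only for illustration, and all other predictors are assumed to be held fixed:

$$
\beta_j = \log\left(\frac{P(Y=1 \mid D_j=1)}{P(Y=0 \mid D_j=1)}\right) - \log\left(\frac{P(Y=1 \mid D_j=0)}{P(Y=0 \mid D_j=0)}\right)
$$

Exponentiating both sides gives the odds ratio between the two groups:

$$
e^{\beta_j} = \frac{P(Y=1 \mid D_j=1) \,/\, P(Y=0 \mid D_j=1)}{P(Y=1 \mid D_j=0) \,/\, P(Y=0 \mid D_j=0)}
$$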

## Example Implementation in Python

In [None]:
# @title Step 1: Prepare the Data

import pandas as pd
import numpy as np

# Sample data
data = {
    'loan_id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'loan_status': ['Fully Paid', 'Charged Off', 'Current', 'Default', 'Late (31-120 days)', 'In Grace Period', 'Late (16-30 days)', 'Does not meet the credit policy. Status:Fully Paid', 'Does not meet the credit policy. Status:Charged Off', 'Issued'],
    'annual_income': [25000, 45000, 50000, 80000, 120000, 30000, 70000, 60000, 40000, 90000],
    'home_ownership': ['Rent', 'Own', 'Rent', 'Mortgage', 'Rent', 'Own', 'Rent', 'Mortgage', 'Own', 'Rent']
}
df = pd.DataFrame(data)

# Define the good and bad statuses
good_statuses = ['Fully Paid', 'Current', 'In Grace Period', 'Does not meet the credit policy. Status:Fully Paid', 'Issued']
bad_statuses = ['Charged Off', 'Default', 'Late (31-120 days)', 'Does not meet the credit policy. Status:Charged Off']

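# Create the good_bad indicator: 0 for statuses in bad_statuses, 1 otherwise.
# Note: good_statuses is listed for reference only; any status not in bad_statuses
# (including 'Late (16-30 days)', which is in neither list) is coded as good.
df['good_bad'] = np.where(df['loan_status'].isin(bad_statuses), 0, 1)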

# Display the DataFrame with the new good_bad column
print(df[['loan_id', 'loan_status', 'good_bad']])


   loan_id                                        loan_status  good_bad
0        1                                         Fully Paid         1
1        2                                        Charged Off         0
2        3                                            Current         1
3        4                                            Default         0
4        5                                 Late (31-120 days)         0
5        6                                    In Grace Period         1
6        7                                  Late (16-30 days)         1
7        8  Does not meet the credit policy. Status:Fully ...         1
8        9  Does not meet the credit policy. Status:Charge...         0
9       10                                             Issued         1
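
In the output above, `Late (16-30 days)` is coded as good even though it appears in neither status list. The sketch below is one way to surface such unmapped statuses before modelling. It uses plain Python sets, and the names `observed_statuses`, `mapped`, and `unmapped` are illustrative, not part of the original notebook.

```python
# Status lists as defined in the Step 1 cell.
good_statuses = ['Fully Paid', 'Current', 'In Grace Period',
                 'Does not meet the credit policy. Status:Fully Paid', 'Issued']
bad_statuses = ['Charged Off', 'Default', 'Late (31-120 days)',
                'Does not meet the credit policy. Status:Charged Off']

# The loan_status values from the sample data.
observed_statuses = [
    'Fully Paid', 'Charged Off', 'Current', 'Default', 'Late (31-120 days)',
    'In Grace Period', 'Late (16-30 days)',
    'Does not meet the credit policy. Status:Fully Paid',
    'Does not meet the credit policy. Status:Charged Off', 'Issued',
]

# A status should never be listed as both good and bad.
overlap = set(good_statuses) & set(bad_statuses)

# Statuses present in the data but absent from both lists.
mapped = set(good_statuses) | set(bad_statuses)
unmapped = sorted(set(observed_statuses) - mapped)
print(unmapped)
```

Whether an unmapped status should count as good or bad is a modelling decision. The check only makes the default behaviour of `np.where` visible.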


In [None]:
# @title Step 2: Transform Independent Variables
# Bin the continuous variable into right-closed intervals:
# (0, 30000], (30000, 60000], (60000, 90000], (90000, 120000]
df['income_bin'] = pd.cut(
    df['annual_income'],
    bins=[0, 30000, 60000, 90000, 120000],
    labels=['Low', 'Medium', 'High', 'Very High'],
)

# Create one dummy column per category, named like 'income_bin:Low'
dummy_df = pd.get_dummies(
    df[['income_bin', 'home_ownership']],
    columns=['income_bin', 'home_ownership'],
    prefix=['income_bin', 'home_ownership'],
    prefix_sep=':',
).astype(int)

# Keep only the target and the dummies; the raw columns are left behind here
df = pd.concat([df[['good_bad']], dummy_df], axis=1)

# Raw columns to remove if still present, and one reference category per variable
drop_col_l = ['loan_id', 'loan_status', 'income_bin', 'annual_income']
ref_col_l = ['income_bin:Low', 'home_ownership:Rent']

# Only drop columns that actually exist in the DataFrame
col_l = df.columns.tolist()
drop_col_l = [col for col in drop_col_l if col in col_l]
ref_col_l = [col for col in ref_col_l if ((col in col_l) and (col not in drop_col_l))]
df = df.drop(columns=drop_col_l)
df = df.drop(columns=ref_col_l)

df.head(10)


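|   | good_bad | income_bin:Medium | income_bin:High | income_bin:Very High | home_ownership:Mortgage | home_ownership:Own |
|---|----------|-------------------|-----------------|----------------------|-------------------------|--------------------|
| 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 1 | 0 | 0 | 0 | 1 |
| 2 | 1 | 1 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 1 | 0 | 1 | 0 |
| 4 | 0 | 0 | 0 | 1 | 0 | 0 |
| 5 | 1 | 0 | 0 | 0 | 0 | 1 |
| 6 | 1 | 0 | 1 | 0 | 0 | 0 |
| 7 | 1 | 1 | 0 | 0 | 1 | 0 |
| 8 | 0 | 1 | 0 | 0 | 0 | 1 |
| 9 | 1 | 0 | 1 | 0 | 0 | 0 |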


In [None]:
# @title Step 3: Fit the Logistic Regression Model

import statsmodels.api as sm

# Define independent variables (X) and dependent variable (y)
X = df.drop(columns=['good_bad'])
y = df['good_bad']

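# The intercept is left out in this run (the line below is commented out), so the
# reference group (income_bin:Low, home_ownership:Rent) is fixed at log-odds 0.
# Uncomment to estimate an intercept instead; the output below would then change.
# X = sm.add_constant(X)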

# Fit the logistic regression model
logit_model = sm.Logit(y, X).fit()
print(logit_model.summary())


Optimization terminated successfully.
         Current function value: 0.566357
         Iterations 21
                           Logit Regression Results                           
Dep. Variable:               good_bad   No. Observations:                   10
Model:                          Logit   Df Residuals:                        5
Method:                           MLE   Df Model:                            4
Date:                Fri, 21 Jun 2024   Pseudo R-squ.:                  0.1585
Time:                        08:14:29   Log-Likelihood:                -5.6636
converged:                       True   LL-Null:                       -6.7301
Covariance Type:            nonrobust   LLR p-value:                    0.7113
                              coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------------------
income_bin:Medium           0.9418      1.548      0.609      0.543      -2.092  

In [None]:
# @title Step 4: Interpret the Coefficients

# Exponentiate the coefficients to get the odds ratios
odds_ratios = np.exp(logit_model.params)
print(odds_ratios)


income_bin:Medium          2.564491e+00
income_bin:High            2.885479e+00
income_bin:Very High       6.168332e-20
home_ownership:Mortgage    3.676127e-01
home_ownership:Own         2.574155e-01
dtype: float64
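
As a rough sketch of how these odds ratios turn into a probability of default, the snippet below scores one hypothetical borrower by hand using only the standard library. The odds ratios are copied from the output above; the helper `prob_good` and the borrower profile are illustrative assumptions. Because no constant was added to `X`, a borrower in the reference group (`Low` income, `Rent`) has log-odds 0 in this sketch.

```python
import math

# Odds ratios as printed by the cell above (no intercept was fitted).
odds_ratios = {
    "income_bin:Medium": 2.564491,
    "income_bin:High": 2.885479,
    "income_bin:Very High": 6.168332e-20,
    "home_ownership:Mortgage": 0.3676127,
    "home_ownership:Own": 0.2574155,
}

def prob_good(active_dummies):
    """P(good) for a borrower whose dummies in `active_dummies` equal 1."""
    # Log-odds are the sum of the coefficients (log odds ratios) that apply.
    z = sum(math.log(odds_ratios[name]) for name in active_dummies)
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical borrower: Medium income bin, owns their home.
p_good = prob_good(["income_bin:Medium", "home_ownership:Own"])

# good_bad = 1 means non-default, so PD is the complement.
pd_estimate = 1.0 - p_good
print(f"P(good) = {p_good:.3f}, PD = {pd_estimate:.3f}")
```

With only ten sample loans, these numbers illustrate the mechanics rather than a usable estimate. The near-zero odds ratio for `income_bin:Very High` comes from a bin that contains a single loan.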


## Example Interpretation

Assume you get the following coefficient for the `income_bin:Medium` variable (an illustrative value, not the one fitted above):

*   Coefficient: 0.24

To interpret this:

*   The odds of a loan being good (non-default) for borrowers in the `Medium` income bin are $e^{0.24} \approx 1.27$ times the odds for borrowers in the reference income bin (`Low`).

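This means the odds of non-default are about 27% higher for borrowers in the `Medium` income bin than for those in the `Low` income bin. The corresponding change in probability depends on the baseline probability, so it is generally not 27%.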
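
To see how an odds ratio of about 1.27 translates into probabilities, the sketch below applies it to two hypothetical baseline probabilities for the `Low` bin. The helper `apply_odds_ratio` and the baseline values are assumptions chosen for illustration.

```python
import math

def apply_odds_ratio(p_baseline: float, odds_ratio: float) -> float:
    """Probability after multiplying the baseline odds by `odds_ratio`."""
    odds = p_baseline / (1.0 - p_baseline) * odds_ratio
    return odds / (1.0 + odds)

odds_ratio = math.exp(0.24)  # about 1.27, as in the example above

# Hypothetical baseline probabilities of a good loan in the reference bin.
for p_low in (0.50, 0.90):
    p_medium = apply_odds_ratio(p_low, odds_ratio)
    print(f"baseline {p_low:.2f} -> Medium bin {p_medium:.3f}")
```

The same odds ratio moves a baseline of 0.50 by about six percentage points but a baseline of 0.90 by only about two, which is why odds ratios should not be read as relative changes in probability.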



---

## Summary

*   **Dependent Variable**: Created a binary indicator of default based on loan status.
*   **Logistic Regression**: Used to model the probability of default, providing interpretable coefficients.
*   **Dummy Variables**: Independent variables are transformed into dummy variables.
*   **Model Fitting**: Fit the logistic regression model and interpreted the coefficients in terms of odds ratios.

---
