# STA2546 Small Project: Exposure at Default (EAD) and Loss Given Default (LGD)


## Definitions and notation

We work only with **defaulted** accounts (those with a non-missing `default_dt`).  
Index defaulted accounts by $i$, and let $d_i$ denote the default date for account $i$ (the column `default_dt`).

### Exposure at Default (EAD)

EAD is the amount owed **at the moment of default**. In this project, it is exactly the balance recorded on the default date:

$$
\mathrm{EAD}_i = \mathtt{balance\_at\_default}_i.
$$

### Recovery cash from transactions (post-default payments)

The transaction table uses the sign convention:

- charges (purchases / fees / interest): `trx_amount > 0`
- payments: `trx_amount < 0`

For LGD, we need “cash collected after default” as a **positive** number.  
Let $\mathcal{P}_i$ be the set of payment transactions strictly after default for account $i$. We compute recovery cash as:

$$
\mathrm{RecoveryCash}_i
= \sum_{j \in \mathcal{P}_i} \max\bigl(0,\ -\mathtt{trx\_amount}_j\bigr).
$$

The $\max(0,\cdot)$ is a defensive guardrail: if a payment transaction ever appears with the wrong sign, it will not create a negative recovery.
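
A minimal sketch of this guardrail, using made-up amounts rather than values from the dataset. The third amount is given the wrong sign on purpose to show that it contributes nothing:

```python
import pandas as pd

# Hypothetical post-default PAYMENT amounts for one account.
# The last entry is mis-signed (positive) on purpose.
trx_amount = pd.Series([-120.0, -80.5, 25.0])

# Flip the sign so payments become positive cash, then floor at zero
# so the mis-signed row adds 0 instead of subtracting 25.
recovery_components = (-trx_amount).clip(lower=0)
recovery_cash = recovery_components.sum()

print(recovery_components.tolist())  # [120.0, 80.5, 0.0]
print(recovery_cash)                 # 200.5
```

Without the floor, the same sum would be 175.5, understating the cash actually collected.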

### Loss Given Default (LGD)

LGD is the fraction of EAD that is not recovered. From the project description:

$$
\mathrm{LGD}_i
= 1 - \frac{\min\bigl(\mathrm{RecoveryCash}_i,\ \mathrm{EAD}_i\bigr)}{\mathrm{EAD}_i},
\qquad \mathrm{EAD}_i > 0.
$$
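
To make the cap concrete, here is a small scalar sketch of the formula. The helper name `lgd` and the input amounts are illustrative assumptions. The notebook's own vectorised code returns `NaN` for non-positive EAD, whereas this sketch raises an error:

```python
def lgd(recovery_cash: float, ead: float) -> float:
    """LGD = 1 - min(recovery, EAD) / EAD, defined only for EAD > 0."""
    if ead <= 0:
        raise ValueError("LGD is undefined for EAD <= 0")
    return 1.0 - min(recovery_cash, ead) / ead


print(lgd(0.0, 1000.0))     # 1.0  (nothing recovered)
print(lgd(250.0, 1000.0))   # 0.75
print(lgd(1500.0, 1000.0))  # 0.0  (recovery capped at EAD)
```

The last call shows why the $\min$ matters: collecting more than the EAD cannot push LGD below zero.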


In [1]:
import numpy as np
import pandas as pd
from pathlib import Path
from IPython.display import display

pd.set_option("display.max_columns", None)
pd.set_option("display.width", 120)
pd.set_option("display.float_format", lambda x: f"{x:,.4f}")


## 1. Load data

The goal in this section is simple: load the two parquet files so that the notebook runs “out of the box”. Note that `pd.read_parquet` needs a parquet engine, so `pyarrow` must be installed in the kernel's environment first.

The code below reads `lgd_accounts.parquet` and `lgd_transactions.parquet` by bare filename, so both files are expected in the current working directory (normally the notebook directory).


In [2]:
import pandas as pd

accounts = pd.read_parquet("lgd_accounts.parquet")
transactions = pd.read_parquet("lgd_transactions.parquet")

display(accounts.head())
display(transactions.head())

print("accounts shape:", accounts.shape)
print("transactions shape:", transactions.shape)

| | customer_id | account_id | product_type | credit_limit | status_final | default_dt | balance_at_default |
|---|---|---|---|---|---|---|---|
| 0 | C01310 | A00001 | CREDIT_CARD | 1000 | CHARGED_OFF | 2023-07-18 | 837.86 |
| 1 | C00341 | A00002 | CREDIT_CARD | 8000 | ACTIVE | NaT | 0.0 |
| 2 | C00474 | A00003 | CREDIT_CARD | 5000 | ACTIVE | NaT | 0.0 |
| 3 | C01699 | A00004 | CREDIT_CARD | 10000 | CLOSED | NaT | 0.0 |
| 4 | C00895 | A00005 | CREDIT_CARD | 2000 | CLOSED | NaT | 0.0 |


| | customer_id | account_id | trx_id | trx_dt | trx_amount | trx_type | trx_message | post_balance |
|---|---|---|---|---|---|---|---|---|
| 0 | C01310 | A00001 | T0000001 | 2022-01-01 | 127.37 | PURCHASE | PURCHASE_AT_MERCHANT_MISC | 127.37 |
| 1 | C01310 | A00001 | T0000002 | 2022-01-02 | 10.16 | PURCHASE | PURCHASE_AT_MERCHANT_RESTAURANT | 137.53 |
| 2 | C01310 | A00001 | T0000003 | 2022-01-03 | 316.45 | PURCHASE | PURCHASE_AT_MERCHANT_MISC | 453.98 |
| 3 | C01310 | A00001 | T0000004 | 2022-01-04 | -144.85 | PAYMENT | ONLINE_CARD_PAYMENT | 309.13 |
| 4 | C01310 | A00001 | T0000005 | 2022-01-04 | -158.24 | PAYMENT | CARD_PAYMENT | 150.89 |


accounts shape: (11227, 7)
transactions shape: (6727311, 8)
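
The loading cell above reads both files by bare filename, i.e. relative to the working directory. If the files may live elsewhere, a small lookup helper along the following lines is one option. This is only a sketch: the function name `find_data_file` and the default directory list are assumptions, not part of the project.

```python
from pathlib import Path


def find_data_file(filename, search_dirs=(".", "data", "..")):
    """Return the first existing path to `filename` among `search_dirs`.

    The function name and the default directory list are hypothetical;
    adjust them to the actual project layout.
    """
    for directory in search_dirs:
        candidate = Path(directory) / filename
        if candidate.is_file():
            return candidate
    raise FileNotFoundError(f"{filename} not found in any of: {list(search_dirs)}")
```

It could then be used as `pd.read_parquet(find_data_file("lgd_accounts.parquet"))`, which fails with a clear message if the file is missing from every listed directory.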


## 2. Basic data preparation & sanity checks

Before computing EAD/LGD, we run a few quick checks:

- parse date columns as datetimes,
- verify the required columns are present,
- isolate defaulted accounts (`default_dt` not null),
- confirm `accounts` is truly one row per `account_id`,
- sanity-check obvious ranges (e.g., EAD should not be negative).

These checks are intentionally lightweight: they are meant to catch structural issues early, not to “clean” the dataset beyond what the project asks.


In [3]:
# Verify the required columns
required_accounts_cols = {"account_id", "default_dt", "balance_at_default"}
required_trx_cols = {"account_id", "trx_dt", "trx_amount", "trx_type"}

missing_accounts = required_accounts_cols - set(accounts.columns)
missing_trx = required_trx_cols - set(transactions.columns)

if missing_accounts:
    raise ValueError(f"accounts is missing required columns: {missing_accounts}")
if missing_trx:
    raise ValueError(f"transactions is missing required columns: {missing_trx}")

# Parse dates
accounts["default_dt"] = pd.to_datetime(accounts["default_dt"], errors="coerce")
transactions["trx_dt"] = pd.to_datetime(transactions["trx_dt"], errors="coerce")

# Identify defaulted accounts
defaulted_accounts = accounts.loc[accounts["default_dt"].notna()].copy()

print(f"Total accounts: {len(accounts):,}")
print(
    f"Defaulted accounts: {len(defaulted_accounts):,} "
    f"({len(defaulted_accounts)/len(accounts):.2%} of all accounts)"
)

# Enforce one-row-per-account rule
n_unique_accounts = accounts["account_id"].nunique()
if n_unique_accounts != len(accounts):
    dup_n = len(accounts) - n_unique_accounts
    raise ValueError(f"accounts has {dup_n} duplicate account_id rows; expected 1 row per account.")

# EAD sanity check
if (defaulted_accounts["balance_at_default"] < 0).any():
    raise ValueError("Found negative balance_at_default values for defaulted accounts (unexpected for EAD).")

# Inspect transaction types
print("Transaction types (counts):")
display(transactions["trx_type"].value_counts())

Total accounts: 11,227
Defaulted accounts: 3,417 (30.44% of all accounts)
Transaction types (counts):


trx_type
PURCHASE    3366502
PAYMENT     1929679
FEE          816539
INTEREST     614591
Name: count, dtype: int64
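
The checks above do not test the sign convention itself (payments negative, all other types positive). A possible extra check is sketched below on a toy frame. The toy rows are made up, and in the notebook the same mask would be built from `transactions`:

```python
import pandas as pd

# Toy transactions; the second PAYMENT is mis-signed on purpose.
toy = pd.DataFrame({
    "trx_type":   ["PURCHASE", "PAYMENT", "FEE", "PAYMENT", "INTEREST"],
    "trx_amount": [50.0, -20.0, 5.0, 10.0, 1.25],
})

is_payment = toy["trx_type"].eq("PAYMENT")

# Flag positive payments and negative non-payments.
wrong_sign = (is_payment & (toy["trx_amount"] > 0)) | (
    ~is_payment & (toy["trx_amount"] < 0)
)

print(int(wrong_sign.sum()))  # 1
print(toy.loc[wrong_sign, "trx_amount"].tolist())  # [10.0]
```

A non-zero count would indicate rows that the $\max(0,\cdot)$ guardrail silently zeroes out during recovery computation.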

## 3. Compute Exposure at Default (EAD)

Per the project definition, for defaulted accounts:

$$
\mathrm{EAD}_i = \mathtt{balance\_at\_default}_i.
$$

We store EAD as a new column on `defaulted_accounts` and show:

- a small preview of the resulting columns, and
- a quick descriptive summary (including the required percentiles).


In [4]:
defaulted_accounts["EAD"] = defaulted_accounts["balance_at_default"].astype(float)

print("EAD preview (first 5 defaulted accounts):")
display(defaulted_accounts[["account_id", "default_dt", "EAD"]].head())

ead_quick = defaulted_accounts["EAD"].describe(percentiles=[0.25, 0.5, 0.75])
display(ead_quick.to_frame(name="EAD").T)


EAD preview (first 5 defaulted accounts):


| | account_id | default_dt | EAD |
|---|---|---|---|
| 0 | A00001 | 2023-07-18 | 837.86 |
| 6 | A00007 | 2025-05-27 | 1750.32 |
| 7 | A00008 | 2023-07-28 | 1707.51 |
| 9 | A00010 | 2025-09-23 | 9901.44 |
| 11 | A00012 | 2023-09-02 | 9904.95 |


| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| EAD | 3417.0 | 4200.562 | 3253.6379 | 7.57 | 1444.34 | 2900.63 | 7781.96 | 10229.08 |


## 4. Compute recovery cash after default

Recoveries come **only** from payment transactions that occur **after** the account’s default date.

Operationally, we:

1. filter transactions down to `trx_type == "PAYMENT"` (recoveries are payments by definition, and filtering first cuts the 6,727,311 transaction rows to 1,929,679 before the default dates are attached),
2. attach each payment transaction to the account’s default date via `account_id`,
3. keep only transactions with `trx_dt > default_dt`,
4. turn payment amounts into positive cash recovered via `-trx_amount`,
5. sum recovered cash by `account_id` to produce `recovery_amount`.

This produces a compact account-level table `recovery_by_account` with:

- `account_id`
- `recovery_amount`


In [5]:
default_dt_map = defaulted_accounts.set_index("account_id")["default_dt"]

payment_trx = transactions.loc[
    transactions["trx_type"].eq("PAYMENT"),
    ["account_id", "trx_dt", "trx_amount"],
].copy()

payment_trx["default_dt"] = payment_trx["account_id"].map(default_dt_map)

recovery_trx = payment_trx.loc[
    payment_trx["default_dt"].notna() & (payment_trx["trx_dt"] > payment_trx["default_dt"])
].copy()

recovery_trx["recovery_component"] = (-recovery_trx["trx_amount"]).clip(lower=0)

recovery_by_account = (
    recovery_trx
    .groupby("account_id", as_index=False)["recovery_component"]
    .sum()
    .rename(columns={"recovery_component": "recovery_amount"})
)

print("Recovery table preview (first 5 rows):")
display(recovery_by_account.head())

print(f"Defaulted accounts: {len(defaulted_accounts):,}")
print(f"Accounts with >=1 post-default payment transaction: {recovery_by_account['account_id'].nunique():,}")

assert (recovery_by_account["recovery_amount"] >= 0).all()


Recovery table preview (first 5 rows):


| | account_id | recovery_amount |
|---|---|---|
| 0 | A00001 | 833.32 |
| 1 | A00007 | 1476.1 |
| 2 | A00008 | 1653.43 |
| 3 | A00010 | 9186.73 |
| 4 | A00012 | 9904.49 |


Defaulted accounts: 3,417
Accounts with >=1 post-default payment transaction: 3,386
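
The effect of the strict inequality `trx_dt > default_dt` is easy to miss, so here is a toy illustration. The account IDs, dates, and amounts are made up, and the sketch uses a merge rather than the `map` call in the cell above. A payment dated on the default day itself is excluded, and an account with no later payments does not appear in the result at all:

```python
import pandas as pd

defaults = pd.DataFrame({
    "account_id": ["X1", "X2"],
    "default_dt": pd.to_datetime(["2024-03-10", "2024-05-01"]),
})
payments = pd.DataFrame({
    "account_id": ["X1", "X1", "X1", "X2", "X9"],
    "trx_dt": pd.to_datetime(
        ["2024-03-01", "2024-03-10", "2024-03-15", "2024-04-20", "2024-06-01"]
    ),
    "trx_amount": [-40.0, -30.0, -25.0, -60.0, -15.0],
})

# Inner merge keeps only payments on defaulted accounts (X9 drops out).
joined = payments.merge(defaults, on="account_id", how="inner")

# Strictly after default: the X1 payment dated 2024-03-10 is excluded.
post_default = joined.loc[joined["trx_dt"] > joined["default_dt"]]

recovery = (
    (-post_default["trx_amount"])
    .clip(lower=0)
    .groupby(post_default["account_id"])
    .sum()
)
print(recovery.to_dict())  # {'X1': 25.0}
```

Account `X2` is missing from the result, just as 31 of the 3,417 defaulted accounts (3,417 minus 3,386) are missing from `recovery_by_account`.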


## 5. Compute LGD and assemble the account-level deliverable

We merge `recovery_amount` onto the defaulted accounts (missing recoveries become 0), and compute LGD using the project formula:

$$
\mathrm{LGD}_i
= 1 - \frac{\min\bigl(\mathrm{RecoveryCash}_i,\ \mathrm{EAD}_i\bigr)}{\mathrm{EAD}_i}.
$$

Finally, we construct the deliverable account-level table with (at least):

- `account_id`
- `EAD`
- `recovery_amount`
- `LGD`

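As a final check, we confirm that LGD falls in $[0,1]$ (up to a tiny numerical tolerance). LGD is left as `NaN` for any account with $\mathrm{EAD}_i \le 0$, where the formula is undefined; none occur here, since the minimum EAD is 7.57.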


In [6]:
final_df = defaulted_accounts.merge(
    recovery_by_account,
    on="account_id",
    how="left"
)

final_df["recovery_amount"] = final_df["recovery_amount"].fillna(0.0)
final_df["recovery_capped"] = np.minimum(final_df["recovery_amount"], final_df["EAD"])

final_df["LGD"] = np.where(
    final_df["EAD"] > 0,
    1 - final_df["recovery_capped"] / final_df["EAD"],
    np.nan
)

final_output = final_df[["account_id", "EAD", "recovery_amount", "LGD"]].copy()

print("Deliverable table shape:", final_output.shape)
display(final_output.head(10))

assert final_output["account_id"].is_unique
assert final_output["EAD"].notna().all()
assert final_output["recovery_amount"].notna().all()

tol = 1e-12
# LGD is NaN where EAD <= 0; exclude those rows so NaN is not reported as out of range
lgd_defined = final_output["LGD"].notna()
in_range = final_output["LGD"].between(-tol, 1 + tol)
if not in_range[lgd_defined].all():
    bad = final_output.loc[lgd_defined & ~in_range]
    raise ValueError(f"Found LGD values outside [0,1] for {len(bad)} accounts.")

print("LGD boundary counts:")
print("  LGD == 0 :", int((final_output["LGD"] == 0).sum()))
print("  LGD == 1 :", int((final_output["LGD"] == 1).sum()))


Deliverable table shape: (3417, 4)


| | account_id | EAD | recovery_amount | LGD |
|---|---|---|---|---|
| 0 | A00001 | 837.86 | 833.32 | 0.0054 |
| 1 | A00007 | 1750.32 | 1476.1 | 0.1567 |
| 2 | A00008 | 1707.51 | 1653.43 | 0.0317 |
| 3 | A00010 | 9901.44 | 9186.73 | 0.0722 |
| 4 | A00012 | 9904.95 | 9904.49 | 0.0 |
| 5 | A00015 | 969.02 | 962.92 | 0.0063 |
| 6 | A00019 | 624.65 | 486.44 | 0.2213 |
| 7 | A00026 | 7466.87 | 7458.85 | 0.0011 |
| 8 | A00028 | 1929.42 | 1927.79 | 0.0008 |
| 9 | A00030 | 3062.06 | 3057.55 | 0.0015 |


LGD boundary counts:
  LGD == 0 : 7
  LGD == 1 : 31
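
The two boundary cases can be reproduced on a toy example (all IDs and amounts below are made up). An account with no match in the recovery table gets a recovery of 0 and hence LGD 1, while an account whose recoveries exceed its EAD is capped at LGD 0:

```python
import numpy as np
import pandas as pd

accounts_toy = pd.DataFrame(
    {"account_id": ["X1", "X2", "X3"], "EAD": [1000.0, 500.0, 200.0]}
)
recovery_toy = pd.DataFrame(
    {"account_id": ["X1", "X3"], "recovery_amount": [250.0, 260.0]}
)

# Left merge keeps every defaulted account; X2 has no recovery row.
out = accounts_toy.merge(recovery_toy, on="account_id", how="left")
out["recovery_amount"] = out["recovery_amount"].fillna(0.0)

# Cap recovery at EAD before dividing (all toy EADs are positive).
out["LGD"] = 1 - np.minimum(out["recovery_amount"], out["EAD"]) / out["EAD"]

print(out["LGD"].tolist())  # [0.75, 1.0, 0.0]
```

The 31 accounts with LGD equal to 1 reported above are consistent with the 31 defaulted accounts that have no post-default payment.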


## 6. Worked example (one account)

To make the computation fully transparent, we pick one defaulted account that has at least one post-default payment.

For that account, we:

- list its post-default PAYMENT transactions,
- sum the recovered cash components, and
- recompute LGD directly from the definition.

This gives a concrete “audit trail” that the account-level numbers match the underlying transactions.


In [14]:
example_account_id = final_output.loc[final_output["recovery_amount"] > 0, "account_id"].iloc[0]

example_row = final_output.loc[final_output["account_id"] == example_account_id].iloc[0]
example_default_dt = defaulted_accounts.loc[
    defaulted_accounts["account_id"] == example_account_id, "default_dt"
].iloc[0]

print("Example account_id:", example_account_id)
print("Default date:", example_default_dt)
display(example_row.to_frame().T)

example_payments = recovery_trx.loc[recovery_trx["account_id"] == example_account_id].copy()
example_payments = example_payments.sort_values("trx_dt")

print("\nPost-default PAYMENT transactions (showing first 15 rows):")
display(example_payments.head(15))

recovery_check = example_payments["recovery_component"].sum()
ead = float(example_row["EAD"])
lgd_check = 1 - min(recovery_check, ead) / ead

print(f"Recovery sum check: {recovery_check:,.4f}")
print(f"EAD: {ead:,.4f}")
print(f"LGD (recomputed): {lgd_check:,.6f}")
print(f"LGD (table):      {float(example_row['LGD']):,.6f}")


Example account_id: A00001
Default date: 2023-07-18 00:00:00


| | account_id | EAD | recovery_amount | LGD |
|---|---|---|---|---|
| 0 | A00001 | 837.86 | 833.32 | 0.0054 |



Post-default PAYMENT transactions (showing first 15 rows):


| | account_id | trx_dt | trx_amount | default_dt | recovery_component |
|---|---|---|---|---|---|
| 420 | A00001 | 2023-07-22 | -123.96 | 2023-07-18 | 123.96 |
| 421 | A00001 | 2023-07-29 | -157.12 | 2023-07-18 | 157.12 |
| 423 | A00001 | 2023-08-02 | -205.4 | 2023-07-18 | 205.4 |
| 424 | A00001 | 2023-08-06 | -241.06 | 2023-07-18 | 241.06 |
| 425 | A00001 | 2023-08-07 | -36.64 | 2023-07-18 | 36.64 |
| 427 | A00001 | 2023-08-10 | -49.46 | 2023-07-18 | 49.46 |
| 428 | A00001 | 2023-08-10 | -19.68 | 2023-07-18 | 19.68 |


Recovery sum check: 833.3200
EAD: 837.8600
LGD (recomputed): 0.005419
LGD (table):      0.005419
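
The same audit can be repeated by hand without any DataFrame, using only the seven recovery components and the EAD shown above for account A00001. The snippet below is a standalone re-check of that arithmetic:

```python
# Recovery components for account A00001, copied from the table above.
components = [123.96, 157.12, 205.40, 241.06, 36.64, 49.46, 19.68]
ead = 837.86

# Round to cents to absorb binary floating-point noise in the sum.
recovery = round(sum(components), 2)
lgd = 1 - min(recovery, ead) / ead

print(recovery)       # 833.32
print(round(lgd, 6))  # 0.005419
```

Both numbers agree with the account-level table, so the unrecovered amount for this account is 837.86 minus 833.32, i.e. 4.54.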


## 7. Statistical summaries (deliverable)

We report the required summary statistics for **EAD** and **LGD** across defaulted accounts:

- number of observations ($n$),
- min / max,
- mean / standard deviation,
- 25th / 50th / 75th percentiles.

The output tables below are formatted to match the deliverable requirements directly.


In [8]:
def summarize_metric(s):
    s = s.dropna()
    return pd.DataFrame({
        "n": [int(s.count())],
        "min": [float(s.min())],
        "p25": [float(s.quantile(0.25))],
        "p50": [float(s.quantile(0.50))],
        "p75": [float(s.quantile(0.75))],
        "mean": [float(s.mean())],
        "std": [float(s.std(ddof=1))],
        "max": [float(s.max())],
    })

ead_summary_table = summarize_metric(final_output["EAD"])
lgd_summary_table = summarize_metric(final_output["LGD"])

print("EAD summary:")
display(ead_summary_table)

print("LGD summary:")
display(lgd_summary_table)


EAD summary:


| | n | min | p25 | p50 | p75 | mean | std | max |
|---|---|---|---|---|---|---|---|---|
| 0 | 3417 | 7.57 | 1444.34 | 2900.63 | 7781.96 | 4200.562 | 3253.6379 | 10229.08 |


LGD summary:


| | n | min | p25 | p50 | p75 | mean | std | max |
|---|---|---|---|---|---|---|---|---|
| 0 | 3417 | 0.0 | 0.0006 | 0.0027 | 0.0305 | 0.05 | 0.1342 | 1.0 |
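
As a cross-check on `summarize_metric`, the same statistics can be compared against `Series.describe()`, which also reports the sample standard deviation (`ddof=1`) and linearly interpolated quartiles. The sketch below does this on a small made-up series:

```python
import pandas as pd

# Made-up values standing in for an LGD column.
s = pd.Series([0.0, 0.1, 0.2, 0.4, 1.0])

manual = {
    "n": int(s.count()),
    "p25": s.quantile(0.25),
    "p50": s.quantile(0.50),
    "p75": s.quantile(0.75),
    "std": s.std(ddof=1),
}
desc = s.describe()  # count, mean, std, min, 25%, 50%, 75%, max

# The manual statistics should match describe() entry by entry.
assert manual["n"] == int(desc["count"])
assert manual["p25"] == desc["25%"]
assert manual["p50"] == desc["50%"]
assert manual["p75"] == desc["75%"]
assert abs(manual["std"] - desc["std"]) < 1e-12
```

This agreement is also visible in the notebook itself: the EAD row produced by `describe()` in Section 3 matches the EAD summary table above.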
