# Accounting approach, Naive DD per Bharath and Shumway (2008). No solver.

This notebook mirrors `dd_pd_accounting.py` and documents the naive accounting pipeline described in `docs/reference/Bharath_and_Shumway_naive_DD`.
Instead of solving the Merton system, we follow Bharath and Shumway's proxy approach to obtain distance to default (DD) and probability of default (PD).

We will:

1. Prepare the accounting panel inputs.
2. Derive book equity and scale values to USD.
3. Assemble proxy measures for equity, debt, volatility, and drift.
4. Compute naive DD and PD.
5. Persist the naive outputs for downstream analysis.
6. Review diagnostics tied to the naive pipeline.


## 1. Environment setup

Install the minimal dependencies required to reproduce the accounting workflow locally.


In [1]:
# 1. Install needed packages (run once per environment)
%pip install pandas numpy scipy




Note: you may need to restart the kernel to use updated packages.


## 2. Imports and file paths

Load the libraries used throughout the naive pipeline and point to the accounting inputs/outputs.


In [2]:

import pandas as pd
import numpy as np
from scipy.stats import norm
from scipy.optimize import root
from pathlib import Path
import logging
import re

# Helper function for standardizing tickers
def standardize_ticker(t):
    return str(t).split('.', 1)[0] if pd.notnull(t) else t

def find_repo_root(start: Path, marker: str = '.git') -> Path:
    """Walk up from *start* until a directory containing *marker* is found."""
    current = start.resolve()
    for candidate in [current, *current.parents]:
        if (candidate / marker).exists():
            return candidate
    return current

# 1. Locate the repository root and define core paths
base_dir = find_repo_root(Path.cwd())
print(f'Repository root: {base_dir}')

# 2. Time horizon for the Merton model (1 year)
T = 1.0

# 3. Define all file paths relative to base_dir
model_fp      = base_dir / 'data' / 'clean' / 'Book2_clean.csv'
marketcap_fp  = base_dir / 'data' / 'clean' / 'all_banks_marketcap_annual_2016_2023.csv'
vol_fp        = base_dir / 'data' / 'clean' / 'equity_volatility_by_year.csv'
rf_fp         = base_dir / 'data' / 'clean' / 'fama_french_factors_annual_clean.csv'
log_fp        = base_dir / 'data' / 'logs' / 'dd_pd_accounting_log.txt'
output_fp     = base_dir / 'data' / 'merged_inputs' / 'dd_pd_accounting.csv'

# 4. Ensure directories exist
for directory in (model_fp.parent, marketcap_fp.parent, vol_fp.parent, rf_fp.parent, log_fp.parent, output_fp.parent):
    directory.mkdir(parents=True, exist_ok=True)

# 5. Quick existence check for inputs and outputs
for name, fp in [
    ('Accounting input',    model_fp),
    ('Market cap input',    marketcap_fp),
    ('Equity vol input',    vol_fp),
    ('Risk-free input',     rf_fp),
    ('Account log file',    log_fp)
]:
    status = 'FOUND' if fp.exists() else f'MISSING ({fp.name})'
    print(f'{name:20s} -> {status}')


Repository root: /workspace/risk_bank
Accounting input     -> FOUND
Market cap input     -> FOUND
Equity vol input     -> FOUND
Risk-free input      -> FOUND
Account log file     -> FOUND


## 3. Data preparation

- Read `Book2_clean.csv` and inspect firm-year coverage.
- Rename `nstrument` to `instrument`, enforce integer years, and normalize column names.
- Convert totals expressed in millions to absolute USD amounts.
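
The preparation steps above can be sketched on a toy frame. This is a minimal illustration, not the notebook's own cell: the helper name `normalize_columns`, the `_usd` column suffixes, and the sample values are assumptions made for the example.

```python
import re
import pandas as pd

MILLION = 1_000_000  # source totals are assumed to be expressed in USD millions


def normalize_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Lower-case names, turn spaces/hyphens into underscores, collapse repeats."""
    mapping = {
        c: re.sub(r"_+", "_", c.strip().lower().replace(" ", "_").replace("-", "_"))
        for c in frame.columns
    }
    return frame.rename(columns=mapping)


# Toy rows standing in for Book2_clean.csv (illustrative values only)
raw = pd.DataFrame({
    "Instrument": ["AAA", "BBB"],
    "Year": [2016.0, 2017.0],
    "Total  Assets": [1200.0, 850.5],
    "Debt-Total": [300.0, 0.0],
})

panel = normalize_columns(raw)
panel["year"] = panel["year"].astype(int)  # enforce integer years
panel["total_assets_usd"] = panel["total_assets"] * MILLION
panel["debt_total_usd"] = panel["debt_total"] * MILLION

print(panel)
```

Keeping the scaled values in separate columns leaves the original millions-denominated inputs available for auditing.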


In [3]:
print('[INFO] Loading accounting data...')
df = pd.read_csv(model_fp)
print(f"→ {df.shape[0]} rows, {df[['instrument','year']].drop_duplicates().shape[0]} unique (instrument, year)")

[INFO] Loading accounting data...
→ 1425 rows, 1424 unique (instrument, year)


## 4. Book equity and capital structure inputs

- Compute book equity `(total_assets - debt_total) * 1e6` with guards against negatives.
- Form debt face value `F = debt_total * 1e6`.
- Track the status flags that identify unusable or missing inputs.
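
A minimal sketch of the book-equity step with guards follows. The function name `book_inputs` and the column `input_status` are hypothetical; the status labels reuse the tags that appear later in this notebook (`missing_or_invalid`, `no_debt`, `negative_equity`).

```python
import numpy as np
import pandas as pd

MILLION = 1_000_000  # inputs are assumed to be in USD millions


def book_inputs(frame: pd.DataFrame) -> pd.DataFrame:
    """Add debt face value F, book equity, and a status flag per firm-year."""
    out = frame.copy()
    out["F"] = out["debt_total"] * MILLION
    out["book_equity"] = (out["total_assets"] - out["debt_total"]) * MILLION

    invalid = (
        out["total_assets"].isna()
        | out["debt_total"].isna()
        | (out["total_assets"] <= 0)
        | (out["debt_total"] < 0)
    )
    # First matching condition wins, so missing inputs take priority.
    conditions = [invalid, out["F"] == 0, out["book_equity"] <= 0]
    labels = ["missing_or_invalid", "no_debt", "negative_equity"]
    out["input_status"] = np.select(conditions, labels, default="ok")

    # Blank out equity where it is unusable so NaN propagates downstream.
    unusable = out["input_status"].isin(["missing_or_invalid", "negative_equity"])
    out.loc[unusable, "book_equity"] = np.nan
    return out


toy = pd.DataFrame({
    "total_assets": [100.0, 50.0, 40.0, np.nan],
    "debt_total": [60.0, 0.0, 45.0, 10.0],
})
checked = book_inputs(toy)
print(checked[["F", "book_equity", "input_status"]])
```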


In [4]:
# Standardize column names
col_map = {c: re.sub(r'_+', '_', c.strip().lower()
                        .replace(' ', '_')
                        .replace('-', '_'))
           for c in df.columns}
df = df.rename(columns=col_map)

# Extract or clean 'year'
if 'date' in df.columns:
    df['date'] = pd.to_datetime(df['date'], errors='coerce')
    df['year'] = df['date'].dt.year
else:
    df = df[df['year'].notnull()].copy()
    df['year'] = df['year'].astype(float).astype(int)

df['Year'] = df['year']  # for merges down the line
df.head()

Unnamed: 0,instrument,year,rit_rf,rit,new_wacc,unnamed:_5,"weighted_average_cost_of_capital,_(%)",beta_levered,beta_unlevered,environmental_pillar_score,...,"wacc_tax_rate,_(%)","wacc_cost_of_debt,_(%)","wacc_debt_weight,_(%)","wacc_equity_weight,_(%)",total_assets,debt_total,d/e,dummylarge,dummymid,Year
0,JPM,2016,0.245917,0.247917,1.584416,,4.864093,1.536081,0.606612,81.766775,...,29.05903,2.126534,66.065751,30.587868,2490972.0,495354.0,2.159868,1.0,0.0,2016
1,JPM,2017,0.223383,0.231383,1.657157,,4.749802,1.210898,0.579144,83.07977,...,28.38487,2.189746,58.567477,38.450277,2533600.0,494798.0,1.5232,1.0,0.0,2017
2,JPM,2018,-0.22784,-0.20974,1.86956,,5.749097,1.141127,0.577758,79.42626,...,28.38487,2.730852,55.921812,41.071387,2622532.0,533627.0,1.361576,1.0,0.0,2018
3,JPM,2019,0.161746,0.183146,1.548871,,4.696227,1.223974,0.561719,89.750677,...,26.62674,2.017936,59.874307,37.262556,2687379.0,516093.0,1.606822,1.0,0.0,2019
4,JPM,2020,-0.128976,-0.124576,0.989988,,3.732273,1.217916,0.487232,90.723262,...,21.07742,1.306948,63.283852,33.304242,3384757.0,542102.0,1.900174,1.0,0.0,2020


## 5. Proxy calculations

- Construct market equity proxies (price-to-book, D/E, WACC weights) and record the chosen source.
- Estimate equity volatility from trailing returns and apply the debt volatility proxy `0.05 + 0.25 * sigma_E`.
- Combine proxies to obtain `V_hat`, `sigma_V_hat`, and the drift proxy `mu_hat`.
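
The proxy combination can be sketched as below, following the naive construction in Bharath and Shumway (2008): asset value is equity plus debt face value, and asset volatility is the value-weighted average of equity and debt volatility. The function name `naive_proxies` is hypothetical, and passing the drift in as a prior-year return is an assumption about how `mu_hat` is sourced in this pipeline.

```python
import numpy as np


def naive_proxies(E, F, sigma_E, prior_return):
    """Return (V_hat, sigma_V_hat, mu_hat) from equity, debt, and equity volatility."""
    E = np.asarray(E, dtype=float)
    F = np.asarray(F, dtype=float)
    sigma_E = np.asarray(sigma_E, dtype=float)

    V_hat = E + F                      # naive asset value
    sigma_D = 0.05 + 0.25 * sigma_E    # debt volatility proxy from the text
    sigma_V_hat = (E / V_hat) * sigma_E + (F / V_hat) * sigma_D
    mu_hat = np.asarray(prior_return, dtype=float)  # assumed drift source
    return V_hat, sigma_V_hat, mu_hat


# Illustrative inputs: equity 40, debt 60 (same units), 30% equity volatility
V_hat, sigma_V_hat, mu_hat = naive_proxies(E=40.0, F=60.0, sigma_E=0.30, prior_return=0.05)
print(float(V_hat), float(sigma_V_hat), float(mu_hat))
```

With these inputs the debt volatility proxy is 0.125, so the weighted asset volatility is 0.4 × 0.30 + 0.6 × 0.125 = 0.195.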


In [5]:

# --- Step 5: Prepare merge keys in df ---
# 5.0 Ensure df has ticker_prefix & Year
if 'ticker_prefix' not in df.columns:
    df['ticker_prefix'] = df['instrument'].apply(standardize_ticker)
if 'Year' not in df.columns:
    # prefer existing year column
    if 'year' in df.columns:
        df['Year'] = df['year']
    else:
        df['Year'] = pd.to_datetime(df['date'], errors='coerce').dt.year

# Quick debug
print('After prep, df columns:', df.columns.tolist())
print('Unique df keys:', df[['ticker_prefix','Year']].drop_duplicates().head())

# --- 5.1 Load and merge market capitalization ---
mc = pd.read_csv(marketcap_fp)

# Compute market_cap if needed
if 'market_cap' not in mc.columns:
    mc['market_cap'] = mc['dec_price'] * mc['shares_outstanding']

# Ensure mc has the merge keys
if 'ticker_prefix' not in mc.columns:
    mc['ticker_prefix'] = mc['symbol'].apply(standardize_ticker)
if 'Year' not in mc.columns:
    mc['Year'] = mc['year']

# Debug
print('mc columns:', mc.columns.tolist())
print('Sample mc keys:', mc[['ticker_prefix','Year','market_cap']].head())

# Merge
df = df.merge(
    mc[['ticker_prefix','Year','market_cap']],
    on=['ticker_prefix','Year'],
    how='left'
)

# --- 5.2 Load and merge equity volatility ---
ev = pd.read_csv(vol_fp)

# Determine ticker column in ev
ticker_col = 'symbol' if 'symbol' in ev.columns else 'Bank'
if 'ticker_prefix' not in ev.columns:
    ev['ticker_prefix'] = ev[ticker_col].apply(standardize_ticker)
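if 'Year' not in ev.columns:
    # Fail with a clear message when neither spelling of the year column exists
    if 'year' not in ev.columns:
        raise KeyError("equity volatility file needs a 'year' or 'Year' column")
    ev['Year'] = ev['year']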

# Debug
print('ev columns:', ev.columns.tolist())
print('Sample ev keys:', ev[['ticker_prefix','Year','equity_volatility']].head())

# Merge and create audit trail for fallbacks
if 'equity_vol_default_entries' not in globals():
    equity_vol_default_entries = []

df = df.merge(
    ev[['ticker_prefix','Year','equity_volatility']],
    on=['ticker_prefix','Year'],
    how='left'
)

missing_equity_vol = df['equity_volatility'].isna()
df['equity_vol_is_default'] = missing_equity_vol
df['equity_vol'] = df['equity_volatility'].fillna(0.25)

if missing_equity_vol.any():
    fallback_pairs = (
        df.loc[missing_equity_vol, ['instrument', 'year']]
          .drop_duplicates()
          .sort_values(['instrument', 'year'])
    )
    print('[WARN] equity_volatility missing; defaulted to 0.25 for:')
    print(fallback_pairs.to_string(index=False))
    for _, row in fallback_pairs.iterrows():
        equity_vol_default_entries.append(
            f"Equity vol fallback -> instrument={row['instrument']}, year={int(row['year'])}, default_sigma_E=0.25"
        )
else:
    print('[INFO] No equity volatility fallbacks applied.')

# --- 5.3 Load and merge risk-free rate ---
rf = pd.read_csv(rf_fp)
if 'Year' not in rf.columns:
    rf['Year'] = rf['year']
rf = rf[['Year','rf']].copy()
rf['rf'] = rf['rf'] / 100  # convert percent to decimal

df = df.merge(rf, on='Year', how='left')

print('After merges, sample df:')
print(df[['ticker_prefix','Year','market_cap','equity_vol','rf']].head())


After prep, df columns: ['instrument', 'year', 'rit_rf', 'rit', 'new_wacc', 'unnamed:_5', 'weighted_average_cost_of_capital,_(%)', 'beta_levered', 'beta_unlevered', 'environmental_pillar_score', 'social_pillar_score', 'governance_pillar_score', 'esg_score', 'esg_combined_score', 'environmental_pillar_score_1', 'social_pillar_score_1', 'governance_pillar_score_1', 'esg_score_1', 'esg_combined_score_1', 'lnta', 'td/ta', 'price_to_book_value_per_share', 'capital_adequacy_total_(%)', 'wacc_tax_rate,_(%)', 'wacc_cost_of_debt,_(%)', 'wacc_debt_weight,_(%)', 'wacc_equity_weight,_(%)', 'total_assets', 'debt_total', 'd/e', 'dummylarge', 'dummymid', 'Year', 'ticker_prefix']
Unique df keys:   ticker_prefix  Year
0           JPM  2016
1           JPM  2017
2           JPM  2018
3           JPM  2019
4           JPM  2020
mc columns: ['symbol', 'year', 'market_cap', 'ticker_prefix', 'Year']
Sample mc keys:   ticker_prefix  Year    market_cap
0          ABCB  2016  3.004515e+09
1          ABCB  2017

## 6. Naive DD and PD

With proxy inputs in place, compute the Bharath and Shumway metrics:

- Distance to default
  $$\mathrm{DD}_{\text{naive}} = \frac{\ln(\hat V / F) + (\hat\mu - \tfrac{1}{2}\hat\sigma_V^2)T}{\hat\sigma_V\sqrt{T}},\quad T=1$$
- Probability of default
  $$\mathrm{PD}_{\text{naive}} = \Phi\bigl(-\mathrm{DD}_{\text{naive}}\bigr)$$

Ensure invalid inputs propagate `NaN` outputs and retain narrative flags (e.g., `naive_status`).
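
The two formulas translate directly into vectorized code. The sketch below is illustrative: the function name `naive_dd_pd` and the validity rules (positive asset value, positive debt, positive volatility, finite drift) are assumptions chosen so that unusable rows yield `NaN` rather than raising.

```python
import numpy as np
from scipy.stats import norm


def naive_dd_pd(V_hat, F, mu_hat, sigma_V_hat, T=1.0):
    """Naive distance to default and probability of default; invalid rows give NaN."""
    V_hat, F, mu_hat, sigma_V_hat = (
        np.asarray(x, dtype=float) for x in (V_hat, F, mu_hat, sigma_V_hat)
    )
    valid = (V_hat > 0) & (F > 0) & (sigma_V_hat > 0) & np.isfinite(mu_hat)

    # Suppress warnings from log/division on rows that are masked out below.
    with np.errstate(divide="ignore", invalid="ignore"):
        dd = (np.log(V_hat / F) + (mu_hat - 0.5 * sigma_V_hat**2) * T) / (
            sigma_V_hat * np.sqrt(T)
        )
    dd = np.where(valid, dd, np.nan)
    pd_ = norm.cdf(-dd)  # NaN in, NaN out
    return dd, pd_


# Illustrative rows: one valid firm-year, one with zero debt, one with missing volatility
dd, pd_ = naive_dd_pd(
    V_hat=[100.0, 100.0, 100.0],
    F=[60.0, 0.0, 60.0],
    mu_hat=[0.05, 0.05, 0.05],
    sigma_V_hat=[0.195, 0.195, np.nan],
)
print(dd, pd_)
```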


In [6]:

from scipy.stats import norm
from scipy.optimize import root


def merton_solver_accounting(row, T=T):
    """Solve for asset value and volatility using accounting inputs."""
    # 1. Extract inputs (convert millions to actual USD)
    rf = row['rf']
    A = row['total_assets'] * 1_000_000  # book assets in USD
    F = row['debt_total'] * 1_000_000    # book debt in USD
    sigma_E = row['equity_vol']          # equity volatility (decimal)
    E = A - F                            # net equity in USD

    # 2. Validate inputs
    if pd.isna(A) or pd.isna(F) or pd.isna(sigma_E) or pd.isna(rf) or A <= 0 or sigma_E <= 0 or F < 0:
        return np.nan, np.nan, 'invalid', 'missing_or_invalid'
    if F == 0:
        return np.nan, np.nan, 'no_debt', 'no_debt'
    if E <= 0:
        return np.nan, np.nan, 'negative_equity', 'negative_equity'

    # 3. Define the Merton equations
    def equations(x):
        V, sigma_V = x
        if V <= 0 or sigma_V <= 0:
            return [1e6, 1e6]
        d1 = (np.log(V / F) + (rf + 0.5 * sigma_V**2) * T) / (sigma_V * np.sqrt(T))
        d2 = d1 - sigma_V * np.sqrt(T)
        eq1 = V * norm.cdf(d1) - F * np.exp(-rf * T) * norm.cdf(d2) - E
        eq2 = sigma_E - (V / E) * norm.cdf(d1) * sigma_V
        return [eq1, eq2]

    # 4. Initial guess and solve
    try:
        sol = root(equations, [A, sigma_E], method='hybr')
        if sol.success and sol.x[0] > 0 and sol.x[1] > 0:
            V_opt, sigma_V_opt = sol.x
            return V_opt, sigma_V_opt, 'converged', ''
        return np.nan, np.nan, 'no_converge', ''
    except Exception as err:
        return np.nan, np.nan, 'error', f'solver_error:{err.__class__.__name__}'


# Quick smoke test on the first row
if not df.empty:
    demo = df.iloc[0]
    V0, sigma_V0, status0, tag0 = merton_solver_accounting(demo)
    print(f"{demo['instrument']} {demo['year']}: V = {V0:.2f}, sigma_V = {sigma_V0:.4f}, status = {status0}")


JPM 2016: V = 2489982282047.86, sigma_V = 0.1657, status = converged


## 7. Outputs

- Persist the firm-year results (keys, proxy components, DD/PD, status fields) to `dd_pd_naive.csv`.
- Summarize DD/PD percentiles by year and overall in `dd_pd_naive_summary.csv`.
- Document the equity source and imputation flags alongside the metrics.
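
One way to build the percentile summary is sketched below. The helper `percentile_summary`, the chosen quantiles, and the column names `DD` and `PD` are assumptions for illustration; the toy values are not results from this notebook.

```python
import pandas as pd

QUANTILES = [0.05, 0.25, 0.50, 0.75, 0.95]  # assumed percentile grid


def percentile_summary(frame: pd.DataFrame, value_cols, by: str = "year") -> pd.DataFrame:
    """Quantiles of value_cols per group plus an overall block labelled 'all'."""
    by_group = frame.groupby(by)[value_cols].quantile(QUANTILES)
    by_group.index = by_group.index.set_names([by, "quantile"])

    overall = frame[value_cols].quantile(QUANTILES)
    overall.index = pd.MultiIndex.from_product(
        [["all"], overall.index], names=[by, "quantile"]
    )
    return pd.concat([by_group, overall]).reset_index()


toy = pd.DataFrame({
    "year": [2016, 2016, 2016, 2017, 2017, 2017],
    "DD": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "PD": [0.10, 0.20, 0.30, 0.01, 0.02, 0.03],
})
summary = percentile_summary(toy, ["DD", "PD"])
print(summary)
```

The flat table can then be written with `summary.to_csv(..., index=False)` to whichever summary path the pipeline uses.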


In [7]:
# 7.1 Apply the solver
print("[INFO] Applying accounting-based Merton solver to each row…")
results = df.apply(merton_solver_accounting, axis=1, result_type='expand')
df[['asset_value','asset_vol','merton_status','dd_pd_tag']] = results

# 7.2 Compute DDa and PDa
V   = df['asset_value']
F   = df['debt_total'] * 1_000_000  #  Convert debt to actual USD for consistent calculation
sigma_V = df['asset_vol']

# Numerator: log distance + drift term (drift is fixed at 0.0 here, so the merged rf is not used)
num = np.log(V / F) + (0.0 - 0.5 * sigma_V**2) * T
# Denominator: volatility × sqrt(T)
den = sigma_V * np.sqrt(T)

df['DDa'] = np.where(
    (df['dd_pd_tag']=='no_debt') | (df['dd_pd_tag']=='negative_equity'), 
    np.nan, 
    num / den
)
df['PDa'] = np.where(
    (df['dd_pd_tag']=='no_debt') | (df['dd_pd_tag']=='negative_equity'), 
    np.nan, 
    norm.cdf(-df['DDa'])
)

# 7.3 Quick check of results
print(df[['instrument','year','asset_value','asset_vol','DDa','PDa']].head())
print("\nSolver status counts:\n", df['merton_status'].value_counts())

[INFO] Applying accounting-based Merton solver to each row…


  instrument  year   asset_value  asset_vol        DDa           PDa
0        JPM  2016  2.489982e+12   0.165689   9.662851  2.168117e-22
1        JPM  2017  2.529657e+12   0.136295  11.903572  5.669392e-33
2        JPM  2018  2.612960e+12   0.147929  10.664598  7.452139e-27
3        JPM  2019  2.676452e+12   0.184160   8.845568  4.553215e-19
4        JPM  2020  3.382377e+12   0.321444   5.535074  1.555488e-08

Solver status counts:
 merton_status
converged      1407
no_debt          16
no_converge       2
Name: count, dtype: int64


## 8. Diagnostics and logging

- Append processing counts, proxy usage, and distributional stats to the accounting log.
- Visualize naive DD/PD distributions or compare against market-based metrics when available.
- Review any status codes indicating missing proxies or imputed volatilities.


## 9. Summary and next steps

**What we’ve accomplished**
- Executed the Bharath and Shumway naive accounting pipeline end-to-end.
- Generated book-based DD/PD metrics without invoking the iterative solver.
- Captured proxy provenance and quality diagnostics for downstream review.

**Next steps**
1. Audit `naive_status` and volatility imputation flags before consumption.
2. Contrast naive metrics with market-based results for the overlapping sample.
3. Incorporate the naive DD/PD outputs into your broader risk analysis.


In [8]:

# 8.1 Export the DataFrame to CSV (accounting-based)
output_fp = base_dir / 'data' / 'merged_inputs' / 'dd_pd_accounting.csv'
log_fp    = base_dir / 'data' / 'logs' / 'dd_pd_accounting_log.txt'

df.to_csv(output_fp, index=False)
print('[INFO] Accounting-based DD/PD results saved to:', output_fp.resolve())

# 8.2 Append diagnostics and audit trail to the log
dd_col = 'DDa' if 'DDa' in df.columns else next((c for c in df.columns if 'distance' in c.lower()), None)
pd_col = 'PDa' if 'PDa' in df.columns else next((c for c in df.columns if 'probability' in c.lower()), None)
status_col = ('merton_status' if 'merton_status' in df.columns
              else 'solver_status' if 'solver_status' in df.columns
              else None)

with open(log_fp, 'a') as log:
    log.write('\n=== Accounting-Based DD/PD Diagnostics ===\n')
    log.write(f'Total rows processed: {len(df)}\n')

    if status_col:
        log.write(f'\n{status_col} counts:\n')
        log.write(df[status_col].value_counts().to_string() + '\n')
    else:
        log.write('\nNo solver-status column found. Available columns:\n')
        log.write(', '.join(df.columns) + '\n')

    if dd_col:
        log.write(f'\nDistance to Default ({dd_col}) summary:\n')
        log.write(df[dd_col].describe().to_string() + '\n')
        log.write(f'Rows with missing {dd_col}: {df[dd_col].isna().sum()}\n')
    if pd_col:
        log.write(f'\nProbability of Default ({pd_col}) summary:\n')
        log.write(df[pd_col].describe().to_string() + '\n')
        log.write(f'Rows with missing {pd_col}: {df[pd_col].isna().sum()}\n')

    if 'equity_vol_default_entries' in globals() and equity_vol_default_entries:
        log.write('\nEquity volatility fallbacks applied:\n')
        log.write('\n'.join(equity_vol_default_entries) + '\n')
    else:
        log.write('\nEquity volatility fallbacks applied: none\n')

print('[INFO] Diagnostics appended to:', log_fp.resolve())


[INFO] Accounting-based DD/PD results saved to: /workspace/risk_bank/data/merged_inputs/dd_pd_accounting.csv
[INFO] Diagnostics appended to: /workspace/risk_bank/data/logs/dd_pd_accounting_log.txt
