### Imports and Loading JSON

In [1]:
import json
import pandas as pd
import numpy as np

import os

with open('../data/raw/Credit_bureau_sample_data.json', 'r') as f:
    bureau_json = json.load(f)

print(f"Loaded {len(bureau_json)} bureau records.")

Loaded 3 bureau records.


### Extraction Logic

In [2]:
def extract_features(record):
    app_id = record.get('application_id')
    inner = record.get('data', {}).get('consumerfullcredit', {})
    
    # 1. Summary counts from accountrating: sum every key containing 'good' / 'bad'
    rating = inner.get('accountrating') or {}
    good_acc = sum(int(v or 0) for k, v in rating.items() if 'good' in k)
    bad_acc = sum(int(v or 0) for k, v in rating.items() if 'bad' in k)
    
    # 2. Financial totals from accountdetails
    details = inner.get('accountdetails') or []
    if isinstance(details, dict):
        details = [details]  # a single account arrives as a dict, not a list
    
    total_bal = 0.0
    total_inst = 0.0
    for acc in details:
        # Amounts may arrive as strings with thousands separators; `or 0` covers None/empty
        total_bal += float(str(acc.get('current_balance') or 0).replace(',', ''))
        total_inst += float(str(acc.get('monthlyinstalmentamt') or 0).replace(',', ''))
        
    return {
        'application_id': app_id,
        'bureau_total_balance': total_bal,
        'bureau_monthly_instalment': total_inst,
        'bureau_count_good': good_acc,
        'bureau_count_bad': bad_acc,
        'bureau_bad_ratio': bad_acc / (good_acc + bad_acc) if (good_acc + bad_acc) > 0 else 0
    }

bureau_features = pd.DataFrame([extract_features(r) for r in bureau_json])
bureau_features.head()

|   | application_id | bureau_total_balance | bureau_monthly_instalment | bureau_count_good | bureau_count_bad | bureau_bad_ratio |
|---|---|---|---|---|---|---|
| 0 | 97 | 0.0 | 0.0 | 7 | 0 | 0.0 |
| 1 | 9714953 | 0.0 | 0.0 | 17 | 0 | 0.0 |
| 2 | 9714978 | 0.0 | 0.0 | 2 | 1 | 0.333333 |


### Save Bureau Features to processed data

In [3]:
os.makedirs('../data/processed', exist_ok=True)

bureau_features.to_csv('../data/processed/bureau_features.csv', index=False)
print("Bureau features saved to data/processed/bureau_features.csv")

Bureau features saved to data/processed/bureau_features.csv


# **Strategic Interpretation of External Credit Bureau Data**

## **1. Macro-Level Objective: Behavioral vs. Static Risk**

While the internal scoring model (Part 1) assesses **"Ability to Pay"** from current liquidity, this bureau extraction layer assesses **"Willingness to Pay"** and **"Total Debt Capacity."** Risk assessment needs both views. Integrating external behavioral signals mitigates **Adverse Selection**: applicants who look low-risk on the internal application but are dangerously over-leveraged across other financial institutions.

## **2. Feature Logic & Risk Signaling**

The extraction logic focuses on three high-signal dimensions:

* **Character Proxy (Bureau Bad Ratio):** We prioritize the ratio of bad-rated accounts to all rated accounts, `bad / (good + bad)`. A single default might be an outlier; a high *ratio* indicates a systemic behavioral pattern, which we expect to carry more predictive signal than static demographic data.
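* **Capacity Analysis (Total Monthly Instalments):** By summing `monthlyinstalmentamt` across the nested JSON `accountdetails` entries, we capture the applicant’s existing monthly "burn rate." Combined with income, this allows for the calculation of the **Debt Service Coverage Ratio (DSCR)**, income divided by debt service, or its consumer-lending counterpart, the Debt-to-Income (DTI) ratio used in Section 4. A high checking balance is secondary if 80% of income is already committed to existing bureau-reported debt.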
* **Credit Hunger (Total Bureau Balance):** Aggregating `current_balance` across accounts allows us to detect "credit stacking." Applicants seeking new credit while carrying large balances elsewhere are plausibly at higher risk of a debt spiral.
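
The capacity check described in the second bullet can be sketched as follows. This is a minimal illustration and not part of the notebook: `monthly_income` is assumed to come from the internal application data, and the function name `debt_to_income` is hypothetical.

```python
import math

def debt_to_income(monthly_instalment, monthly_income):
    """Share of monthly income already committed to bureau-reported instalments.

    Returns NaN when income is missing or non-positive, so the caller has to
    decide how to treat unknown capacity instead of silently reading it as 0.
    """
    if monthly_income is None or monthly_income <= 0:
        return math.nan
    return monthly_instalment / monthly_income

# Hypothetical applicant: 800 in existing instalments against 1,000 of income
dti = debt_to_income(800.0, 1000.0)
print(f"DTI: {dti:.0%}")  # prints "DTI: 80%"
```

Returning NaN rather than 0 for missing income is a design choice in this sketch: a zero would make an applicant with unknown income look like one with no debt burden.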

## **3. Technical Governance & Data Integrity**

To ensure the model receives high-fidelity inputs, the extraction pipeline implements robust **Defensive Programming**:

* **Handling Data Sparsity:** The code manages "Missing Nodes" (common for thin-file customers with no credit history) by defaulting to zero (zero balances, zero account counts, and a bad ratio of 0) rather than allowing the pipeline to fail.
* **Normalization & Sanitization:** We addressed the "Dirty Data" reality of financial JSONs by standardizing inconsistent data types (a single account delivered as a dict is wrapped in a list, and amounts are cast to strings before parsing) and by stripping thousands-separator commas from currency strings so they convert cleanly to floats.
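
The two cleaning steps above could be factored into small helpers. The sketch below is an assumption about how that might look, not the notebook's code: `to_amount` and `as_list` are hypothetical names, and the variant shown also tolerates `None` and empty strings.

```python
def to_amount(value):
    """Parse a bureau amount that may be None, numeric, or a comma-formatted string."""
    if value is None:
        return 0.0
    text = str(value).replace(',', '').strip()
    return float(text) if text else 0.0

def as_list(node):
    """Normalize an accountdetails node: None -> [], dict -> [dict], list -> list."""
    if node is None:
        return []
    return [node] if isinstance(node, dict) else list(node)

# Hypothetical single-account payload delivered as a dict instead of a list
accounts = as_list({'current_balance': '12,500.00', 'monthlyinstalmentamt': None})
total_balance = sum(to_amount(a.get('current_balance')) for a in accounts)
total_instalment = sum(to_amount(a.get('monthlyinstalmentamt')) for a in accounts)
```

Currency symbols or other non-numeric characters would still raise a `ValueError` here, which is arguably preferable to guessing at an unknown format.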

## **4. Strategic Business Recommendation**

I recommend utilizing these extracted features to implement a **"Reject Overlay"** framework.

**The Policy:** Even if the primary ML Model (Part 1) generates an "Approve" score, certain bureau triggers—such as a **Bureau Bad Ratio > 30%** or a breach of the **Debt-to-Income (DTI)** ceiling—should trigger an **Automatic Hard Reject**.

This layered defense is intended to balance growth against credit risk, protecting the bank’s capital while supporting a high-quality loan book.
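
A minimal sketch of the overlay rule, under stated assumptions: the 30% bad-ratio trigger comes from the policy above, while `DTI_LIMIT` is a placeholder because no ceiling is specified here, and the function name and decision labels are hypothetical.

```python
BAD_RATIO_LIMIT = 0.30  # trigger stated in the policy
DTI_LIMIT = 0.50        # ASSUMED placeholder; the actual ceiling is not specified

def apply_reject_overlay(model_decision, bad_ratio, dti=None):
    """Return the final decision after applying the bureau hard-reject triggers.

    model_decision: label from the primary model, e.g. "Approve" or "Reject".
    bad_ratio:      bureau_bad_ratio from the extracted features.
    dti:            debt-to-income ratio, or None when income is unavailable.
    """
    if bad_ratio > BAD_RATIO_LIMIT:
        return "Reject"
    if dti is not None and dti > DTI_LIMIT:
        return "Reject"
    # No trigger fired: the model's own decision stands (the overlay never upgrades)
    return model_decision

# Third sample record above: bad ratio 0.333333, income unknown
decision = apply_reject_overlay("Approve", 0.333333)
```

Under this sketch the third sample record (bad ratio 0.333333) is rejected even with an "Approve" score, while the first two records keep whatever the model decided.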