# Taiwan Bankruptcy Prediction - Predictive Analytics Assignment

**Student:** Robert Banda  
**Institution:** MUBAS (Malawi University of Business and Applied Sciences)  
**Course:** Predictive Analytics  
**Date:** February 1, 2026  

---

## Table of Contents

1. [Section I: Dataset Description](#section-i)
2. [Section II: Business Problem Definition (Issue Trees)](#section-ii)
3. [Section III: Predictive Task Formulation](#section-iii)
4. [Section IV: Methodology & Planning](#section-iv)
5. [Section V: Analysis & Implementation](#section-v)
6. [Section VI: Documentation & Conclusions](#section-vi)

---

<a id='section-i'></a>
# I) Dataset Description

This section provides comprehensive descriptive statistics and visualizations of the Taiwan Bankruptcy dataset.

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import warnings
warnings.filterwarnings('ignore')

# Set visualization style
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette("husl")
%matplotlib inline

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.precision', 4)

print("✓ All libraries imported successfully!")
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")

## 1.1 Load and Inspect Dataset

In [None]:
# Load the dataset
df = pd.read_csv('data/data.csv')

print("="*80)
print("DATASET OVERVIEW")
print("="*80)
print(f"Dataset Shape: {df.shape}")
print(f"Total Records: {df.shape[0]:,}")
print(f"Total Features: {df.shape[1]}")
print(f"Memory Usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
print("="*80)

In [None]:
# Display first 10 rows
print("First 10 rows of the dataset:")
df.head(10)

In [None]:
# Data types and information
print("Dataset Information:")
df.info()

In [None]:
# Display all column names
print("All Feature Names:")
print("="*80)
for i, col in enumerate(df.columns, 1):
    print(f"{i:2d}. {col}")

## 1.2 Descriptive Statistics

In [None]:
# Summary statistics for all features
print("Summary Statistics:")
df.describe().T

In [None]:
# Target variable distribution
print("="*80)
print("TARGET VARIABLE DISTRIBUTION")
print("="*80)
print("\nAbsolute Counts:")
print(df['Bankrupt?'].value_counts().sort_index())
print("\nPercentage Distribution:")
print(df['Bankrupt?'].value_counts(normalize=True).sort_index() * 100)

# Calculate imbalance ratio
n_bankrupt = df['Bankrupt?'].sum()
n_not_bankrupt = len(df) - n_bankrupt
imbalance_ratio = n_not_bankrupt / n_bankrupt

print(f"\nClass Imbalance Ratio: 1 : {imbalance_ratio:.1f}")
print(f"\n⚠️  WARNING: Severe class imbalance detected!")
print(f"   Only {n_bankrupt/len(df)*100:.2f}% of companies are bankrupt.")
print("="*80)

In [None]:
# Check for missing values
print("Missing Values Analysis:")
print("="*80)
missing = df.isnull().sum()
if missing.sum() == 0:
    print("✓ No missing values detected in the dataset!")
    print("  This is excellent data quality.")
else:
    print("Missing values found:")
    print(missing[missing > 0])
print("="*80)

## 1.3 Key Financial Ratios Analysis

Let's examine some of the most important financial indicators for bankruptcy prediction.

In [None]:
# Select key financial ratios for detailed analysis
key_features = [
    ' ROA(C) before interest and depreciation before interest',  # Profitability
    ' Operating Gross Margin',  # Profitability
    ' Current Ratio',  # Liquidity
    ' Quick Ratio',  # Liquidity
    ' Debt ratio %',  # Solvency
    ' Net worth/Assets',  # Solvency
    ' Total Asset Turnover',  # Efficiency
    ' Cash Flow to Sales'  # Cash Flow
]

print("Selected Key Financial Ratios for Analysis:")
print("="*80)
for i, feature in enumerate(key_features, 1):
    print(f"{i}. {feature.strip()}")
print("="*80)

In [None]:
# Statistics for key features
print("\nDescriptive Statistics for Key Financial Ratios:")
df[key_features].describe()

## 1.4 Data Visualizations

In [None]:
# Visualization 1: Class Distribution
import os
os.makedirs('figures', exist_ok=True)

fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Count plot
counts = df['Bankrupt?'].value_counts().sort_index()
axes[0].bar(['Not Bankrupt', 'Bankrupt'], counts.values, color=['#2ecc71', '#e74c3c'], edgecolor='black', linewidth=1.5)
axes[0].set_title('Class Distribution (Count)', fontsize=14, fontweight='bold')
axes[0].set_ylabel('Number of Companies', fontsize=12)
axes[0].set_xlabel('Bankruptcy Status', fontsize=12)
for i, v in enumerate(counts.values):
    axes[0].text(i, v + 100, str(v), ha='center', va='bottom', fontsize=11, fontweight='bold')

# Percentage plot
percentages = df['Bankrupt?'].value_counts(normalize=True).sort_index() * 100
axes[1].bar(['Not Bankrupt', 'Bankrupt'], percentages.values, color=['#2ecc71', '#e74c3c'], edgecolor='black', linewidth=1.5)
axes[1].set_title('Class Distribution (Percentage)', fontsize=14, fontweight='bold')
axes[1].set_ylabel('Percentage (%)', fontsize=12)
axes[1].set_xlabel('Bankruptcy Status', fontsize=12)
for i, v in enumerate(percentages.values):
    axes[1].text(i, v + 1, f'{v:.2f}%', ha='center', va='bottom', fontsize=11, fontweight='bold')

plt.tight_layout()
plt.savefig('figures/01_class_distribution.png', dpi=300, bbox_inches='tight')
plt.show()

print("✓ Figure saved: figures/01_class_distribution.png")

In [None]:
# Visualization 2: Distribution of Key Features
fig, axes = plt.subplots(4, 2, figsize=(16, 16))
axes = axes.flatten()

for idx, feature in enumerate(key_features):
    # Remove extreme outliers for better visualization
    data = df[feature]
    Q1 = data.quantile(0.25)
    Q3 = data.quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 3 * IQR
    upper_bound = Q3 + 3 * IQR
    data_filtered = data[(data >= lower_bound) & (data <= upper_bound)]
    
    axes[idx].hist(data_filtered, bins=50, edgecolor='black', alpha=0.7, color='steelblue')
    axes[idx].set_title(f'{feature.strip()}', fontsize=10, fontweight='bold')
    axes[idx].set_xlabel('Value', fontsize=9)
    axes[idx].set_ylabel('Frequency', fontsize=9)
    axes[idx].grid(alpha=0.3)

plt.suptitle('Distribution of Key Financial Ratios', fontsize=16, fontweight='bold', y=1.00)
plt.tight_layout()
plt.savefig('figures/02_feature_distributions.png', dpi=300, bbox_inches='tight')
plt.show()

print("✓ Figure saved: figures/02_feature_distributions.png")

In [None]:
# Visualization 3: Box Plots by Bankruptcy Status
fig, axes = plt.subplots(4, 2, figsize=(16, 16))
axes = axes.flatten()

for idx, feature in enumerate(key_features):
    # Prepare data
    data_to_plot = [df[df['Bankrupt?'] == 0][feature], df[df['Bankrupt?'] == 1][feature]]
    
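    bp = axes[idx].boxplot(data_to_plot, patch_artist=True)
    # Label the two groups on the axis; the boxplot `labels` keyword is deprecated in newer Matplotlib
    axes[idx].set_xticklabels(['Not Bankrupt', 'Bankrupt'])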
    
    # Color the boxes
    colors = ['#2ecc71', '#e74c3c']
    for patch, color in zip(bp['boxes'], colors):
        patch.set_facecolor(color)
        patch.set_alpha(0.7)
    
    axes[idx].set_title(f'{feature.strip()}', fontsize=10, fontweight='bold')
    axes[idx].set_ylabel('Value', fontsize=9)
    axes[idx].grid(alpha=0.3, axis='y')

plt.suptitle('Box Plots: Features vs Bankruptcy Status', fontsize=16, fontweight='bold', y=1.00)
plt.tight_layout()
plt.savefig('figures/03_boxplots_by_status.png', dpi=300, bbox_inches='tight')
plt.show()

print("✓ Figure saved: figures/03_boxplots_by_status.png")

In [None]:
# Visualization 4: Correlation Heatmap (Top 20 features)
# Find top 20 features most correlated with bankruptcy
correlations = df.corr()['Bankrupt?'].abs().sort_values(ascending=False)
top_20_features = correlations[1:21].index.tolist()

# Create correlation matrix for top features
correlation_matrix = df[top_20_features + ['Bankrupt?']].corr()

plt.figure(figsize=(16, 14))
sns.heatmap(correlation_matrix, annot=True, fmt='.2f', cmap='coolwarm', 
            center=0, square=True, linewidths=1, cbar_kws={"shrink": 0.8})
plt.title('Correlation Heatmap: Top 20 Features vs Bankruptcy', fontsize=16, fontweight='bold', pad=20)
plt.tight_layout()
plt.savefig('figures/04_correlation_heatmap.png', dpi=300, bbox_inches='tight')
plt.show()

print("✓ Figure saved: figures/04_correlation_heatmap.png")
print("\nTop 10 Features Correlated with Bankruptcy:")
for i, (feature, corr) in enumerate(correlations[1:11].items(), 1):
    print(f"{i:2d}. {feature[:60]:<60} | r = {corr:.4f}")

In [None]:
# Visualization 5: Feature Comparison by Bankruptcy Status
fig, axes = plt.subplots(4, 2, figsize=(16, 16))
axes = axes.flatten()

for idx, feature in enumerate(key_features):
    bankrupt = df[df['Bankrupt?'] == 1][feature]
    not_bankrupt = df[df['Bankrupt?'] == 0][feature]
    
    # Remove extreme outliers for better visualization
    Q1, Q3 = df[feature].quantile([0.25, 0.75])
    IQR = Q3 - Q1
    lower_bound = Q1 - 3 * IQR
    upper_bound = Q3 + 3 * IQR
    
    not_bankrupt_filtered = not_bankrupt[(not_bankrupt >= lower_bound) & (not_bankrupt <= upper_bound)]
    bankrupt_filtered = bankrupt[(bankrupt >= lower_bound) & (bankrupt <= upper_bound)]
    
    axes[idx].hist([not_bankrupt_filtered, bankrupt_filtered], bins=30, 
                   label=['Not Bankrupt', 'Bankrupt'], 
                   color=['#2ecc71', '#e74c3c'], alpha=0.7, edgecolor='black')
    axes[idx].set_title(f'{feature.strip()}', fontsize=10, fontweight='bold')
    axes[idx].set_xlabel('Value', fontsize=9)
    axes[idx].set_ylabel('Frequency', fontsize=9)
    axes[idx].legend(fontsize=8)
    axes[idx].grid(alpha=0.3)

plt.suptitle('Feature Distributions by Bankruptcy Status', fontsize=16, fontweight='bold', y=1.00)
plt.tight_layout()
plt.savefig('figures/05_feature_comparison.png', dpi=300, bbox_inches='tight')
plt.show()

print("✓ Figure saved: figures/05_feature_comparison.png")

## 1.5 Key Findings from Exploratory Data Analysis

### Dataset Characteristics:
- **Size**: 6,819 companies, 96 columns (95 financial-ratio features plus the `Bankrupt?` target)
- **Data Quality**: ✓ No missing values (completeness only; outliers are discussed below)
- **Target Variable**: Binary (Bankrupt: 0 or 1)

### Class Imbalance Issue:
- **Severe imbalance detected**: Only ~3% of companies are bankrupt
- **Imbalance Ratio**: Approximately 1:30 (1 bankrupt for every 30 non-bankrupt)
- **Implications**: 
  - Cannot use accuracy as primary metric
  - Need to use SMOTE or class weights
  - Focus on Recall and F2-score for evaluation
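
To make the accuracy point concrete, here is a minimal sketch on made-up labels with a 3% positive rate. The counts (30 and 970) are illustrative assumptions, not figures taken from the dataset, and the "model" is a hypothetical one that always predicts "not bankrupt".

```python
# Illustrative labels only: 30 bankrupt (1) and 970 non-bankrupt (0) companies.
y_true = [1] * 30 + [0] * 970
# A trivial "model" that always predicts the majority class.
y_pred = [0] * len(y_true)

correct = sum(t == p for t, p in zip(y_true, y_pred))
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
actual_positives = sum(y_true)

baseline_accuracy = correct / len(y_true)            # share of all predictions that are right
baseline_recall = true_positives / actual_positives  # share of bankruptcies that are caught

print(f"Accuracy: {baseline_accuracy:.1%}")
print(f"Recall:   {baseline_recall:.1%}")
```

The baseline scores 97% accuracy while catching none of the bankruptcies, which is why recall and F2 are preferred here.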

### Feature Categories:
The 95 financial ratios can be grouped into:
1. **Profitability Ratios**: ROA, ROE, profit margins
2. **Liquidity Ratios**: Current ratio, quick ratio, cash ratios
3. **Solvency Ratios**: Debt ratio, equity ratios
4. **Efficiency Ratios**: Asset turnover, inventory turnover
5. **Cash Flow Indicators**: Operating cash flow ratios

### Discriminative Features:
- Several features show strong correlation with bankruptcy
- Clear separation visible in box plots between bankrupt and non-bankrupt companies
- These features will be valuable for predictive modeling

### Data Preparation Needs:
1. **Feature Scaling**: Required due to different value ranges
2. **Class Balancing**: SMOTE or class weights needed
3. **Outlier Handling**: Some features contain extreme outliers
4. **Feature Selection**: May benefit from dimensionality reduction (95 features)

---

<a id='section-ii'></a>
# II) Business Problem Definition (Issue Trees)

This section defines the business problem and uses Issue Tree analysis to decompose it into manageable components.

## 2.1 Business Problem Statement

### The Challenge:
Financial institutions, investors, regulatory bodies, and corporate stakeholders face a critical challenge: **identifying companies at high risk of bankruptcy early enough to take preventive action or minimize financial losses**.

### Why This Matters:

**For Banks and Lending Institutions:**
- Need to assess credit risk before approving loans
- Must monitor existing loan portfolios for deteriorating borrowers
- Late detection of a borrower's bankruptcy can mean losing most or all of the outstanding principal

**For Investors and Portfolio Managers:**
- Require early warning signals to divest from distressed companies
- Need to balance risk-return in portfolio allocation
- Bankruptcy of portfolio companies = significant capital loss

**For Credit Rating Agencies:**
- Responsible for accurate credit ratings
- Must update ratings promptly as financial health changes
- Rating accuracy impacts market efficiency

**For Corporate Management:**
- Need early warning system to identify financial distress
- Opportunity for corrective action before crisis
- Can implement turnaround strategies if warned early

**For Regulators:**
- Monitor systemic risk in financial system
- Identify sectors or regions under stress
- Intervene before widespread economic impact

### Current Limitations:

1. **Manual Analysis is Slow**: Traditional financial analysis takes time
2. **Information Overload**: 95+ financial indicators are difficult to interpret manually
3. **Subjective Judgment**: Human analysts may miss subtle warning signs
4. **Inconsistent Application**: Different analysts may reach different conclusions
5. **Late Detection**: Often identify distress too late for effective action

### Proposed Solution:

Develop a **machine learning-based early warning system** that:
- Analyzes 95 financial indicators automatically
- Predicts bankruptcy probability, tuned for high recall on the bankrupt class
- Provides consistent, objective risk assessment
- Enables proactive decision-making
- Updates predictions as new financial data becomes available

---

## 2.2 Issue Tree Analysis

### Main Question: What factors lead to corporate bankruptcy?

Issue Tree analysis helps us decompose this complex problem into manageable components:

```
CORPORATE BANKRUPTCY PREDICTION - ISSUE TREE
═══════════════════════════════════════════════════════════════════

Main Problem: Why do companies go bankrupt?
│
├── 1. FINANCIAL HEALTH DETERIORATION
│   │
│   ├── 1.1 Profitability Issues
│   │   ├── ROA declining → Company not generating adequate returns on assets
│   │   ├── Negative profit margins → Costs exceed revenues
│   │   ├── Operating losses → Core business operations unprofitable
│   │   └── Net income losses → Overall business unsustainable
│   │
│   ├── 1.2 Liquidity Problems
│   │   ├── Low current ratio → Cannot pay short-term debts
│   │   ├── Poor cash flow → Cash burn rate exceeds generation
│   │   ├── Working capital deficit → Day-to-day operations threatened
│   │   └── Quick ratio deterioration → Immediate payment difficulties
│   │
│   └── 1.3 Solvency Concerns
│       ├── High debt ratio → Over-leveraged balance sheet
│       ├── Poor equity position → Minimal financial buffer
│       ├── Debt service issues → Cannot meet debt obligations
│       └── Negative net worth → Liabilities exceed assets
│
├── 2. OPERATIONAL INEFFICIENCY
│   │
│   ├── 2.1 Asset Utilization
│   │   ├── Low asset turnover → Assets not generating revenue
│   │   ├── Inventory management issues → Cash tied up in inventory
│   │   ├── Poor receivables collection → Cash flow constraints
│   │   └── Fixed asset underutilization → Idle capacity
│   │
│   └── 2.2 Revenue Generation
│       ├── Declining sales → Market losing interest in products
│       ├── Market share loss → Competitors gaining ground
│       ├── Pricing pressure → Margin compression
│       └── Customer concentration risk → Dependence on few customers
│
└── 3. MARKET & INDUSTRY FACTORS
    │
    ├── 3.1 Economic Environment
    │   ├── Economic recession → Demand drops across the board
    │   ├── Industry downturn → Sector-wide difficulties
    │   ├── Regulatory changes → Increased compliance costs
    │   └── Interest rate changes → Debt servicing becomes expensive
    │
    └── 3.2 Competitive Position
        ├── New market entrants → Market disruption
        ├── Technology obsolescence → Products become outdated
        ├── Supplier power increase → Input costs rise
        └── Loss of competitive advantage → Differentiation eroded

═══════════════════════════════════════════════════════════════════
KEY INSIGHT: Bankruptcy is rarely due to a single factor - it's typically
a combination of financial weakness + operational issues + adverse market conditions
```

---

In [None]:
# Create a visual representation of the Issue Tree
print("═"*80)
print("CORPORATE BANKRUPTCY - ISSUE TREE BREAKDOWN")
print("═"*80)
print("\n🎯 MAIN PROBLEM: Corporate Bankruptcy Prediction")
print("\n" + "─"*80)
print("\n📊 LEVEL 1: Major Categories")
print("   1. Financial Health Deterioration")
print("   2. Operational Inefficiency")
print("   3. Market & Industry Factors")
print("\n" + "─"*80)
print("\n📈 LEVEL 2: Sub-Categories")
print("\n   1. FINANCIAL HEALTH DETERIORATION:")
print("      • Profitability Issues")
print("      • Liquidity Problems")
print("      • Solvency Concerns")
print("\n   2. OPERATIONAL INEFFICIENCY:")
print("      • Asset Utilization")
print("      • Revenue Generation")
print("\n   3. MARKET & INDUSTRY FACTORS:")
print("      • Economic Environment")
print("      • Competitive Position")
print("\n" + "─"*80)
print("\n💡 LEVEL 3: Specific Indicators (Sample)")
print("\n   Profitability Issues:")
print("      ↳ ROA < 0 (Negative return on assets)")
print("      ↳ Operating Margin < 5% (Low profitability)")
print("      ↳ Persistent net losses")
print("\n   Liquidity Problems:")
print("      ↳ Current Ratio < 1.0 (Cannot pay short-term debts)")
print("      ↳ Negative operating cash flow")
print("      ↳ Working capital deficit")
print("\n   Solvency Concerns:")
print("      ↳ Debt Ratio > 70% (Over-leveraged)")
print("      ↳ Equity-to-Assets < 20% (Thin equity cushion)")
print("      ↳ Interest coverage < 1.0 (Cannot service debt)")
print("\n" + "═"*80)

## 2.3 Mapping Issue Tree to Available Features

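Our dataset contains features that directly measure the financial-health and operational branches of the issue tree. The market and industry branch describes conditions outside the firm and is not captured by these accounting ratios: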

In [None]:
# Map issue tree categories to actual features in the dataset
feature_mapping = {
    'Profitability': [
        ' ROA(C) before interest and depreciation before interest',
        ' ROA(A) before interest and % after tax',
        ' Operating Gross Margin',
        ' Operating Profit Rate',
        ' Net Income to Total Assets'
    ],
    'Liquidity': [
        ' Current Ratio',
        ' Quick Ratio',
        ' Cash Flow to Sales',
        ' Cash/Total Assets',
        ' Working Capital to Total Assets'
    ],
    'Solvency': [
        ' Debt ratio %',
        ' Net worth/Assets',
        ' Total debt/Total net worth',
        ' Liability to Equity',
        ' Interest Coverage Ratio (Interest expense to EBIT)'
    ],
    'Efficiency': [
        ' Total Asset Turnover',
        ' Accounts Receivable Turnover',
        ' Inventory Turnover Rate (times)',
        ' Fixed Assets Turnover Frequency',
        ' Revenue per person'
    ]
}

print("ISSUE TREE → DATASET FEATURE MAPPING")
print("="*80)
for category, features in feature_mapping.items():
    print(f"\n📌 {category.upper()} Indicators:")
    for i, feature in enumerate(features, 1):
        print(f"   {i}. {feature.strip()}")
print("\n" + "="*80)
print("\n✓ The dataset covers the financial-health and operational branches of the issue tree.")
print("  Market & industry factors (branch 3) are not directly measured by these ratios.")

## 2.4 Stakeholder Impact Analysis

### How Bankruptcy Prediction Benefits Each Stakeholder:

| Stakeholder | Use Case | Benefit |
|-------------|----------|----------|
| **Banks** | Loan approval decisions | Fewer loans to borrowers that later default |
| **Investors** | Portfolio risk management | Exit distressed positions early |
| **Credit Agencies** | Rating adjustments | Improve rating accuracy |
| **Companies** | Self-assessment & early warning | Time for turnaround strategies |
| **Regulators** | Systemic risk monitoring | Prevent financial contagion |
| **Suppliers** | Credit terms decisions | Protect receivables |
| **Employees** | Job security assessment | Plan career moves proactively |

---

<a id='section-iii'></a>
# III) Predictive Task Formulation

This section formally defines the machine learning task.

## 3.1 Task Type: Binary Classification

### Formal Definition:

**Task Type:** Supervised Learning - Binary Classification

**Objective:** Predict whether a company will go bankrupt based on its financial ratios

**Input (Features - X):**
- 95 financial ratio features
- Categories: Profitability, Liquidity, Solvency, Efficiency, Cash Flow

**Output (Target - Y):**
- Binary class label: `Bankrupt?`
  - **0**: Company will NOT go bankrupt (Negative class)
  - **1**: Company WILL go bankrupt (Positive class)

**Prediction:**
- Model outputs: P(Bankrupt = 1 | Financial Ratios)
- Probability range: [0, 1]
- Decision threshold: Optimized based on business costs

---

## 3.2 Success Metrics

### Primary Metric: **Recall (Sensitivity)**

**Formula:** Recall = TP / (TP + FN)

**Why Recall is Critical:**
- **False Negative Cost is VERY HIGH**: A missed bankruptcy can mean losing the full exposure
- **False Positive Cost is MODERATE**: False alarm means extra scrutiny
- **Business Priority**: Catch as many actual bankruptcies as possible

**Target:** Recall ≥ 80% (catch at least 80% of bankruptcies)

### Secondary Metrics:

1. **Precision:** TP / (TP + FP)
   - Target: ≥ 20% (acceptable given 30:1 class imbalance)
   - Interpretation: Of all predicted bankruptcies, how many are correct?

2. **F2-Score:** Weighted harmonic mean favoring recall
   - Formula: F2 = 5 × (Precision × Recall) / (4 × Precision + Recall)
   - Weights recall 2x more than precision
   - Target: Maximize F2-Score

3. **AUC-ROC:** Area Under ROC Curve
   - Measures overall discrimination ability
   - Target: ≥ 0.85
   - Interpretation: Probability that the model ranks a randomly chosen bankrupt company above a randomly chosen non-bankrupt one
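
As a quick check of the F2 formula above, the sketch below computes precision, recall, F1 and F2 from a set of hypothetical confusion counts. The counts are chosen only so that precision and recall land exactly on the stated targets (20% and 80%); the helper name `f_beta` is introduced here for illustration.

```python
def f_beta(precision, recall, beta=2.0):
    """F-beta score; beta > 1 weights recall more heavily than precision."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical confusion counts (not model results).
tp, fp, fn = 24, 96, 6

precision = tp / (tp + fp)  # 24 / 120 = 0.20
recall = tp / (tp + fn)     # 24 / 30  = 0.80

f1 = f_beta(precision, recall, beta=1.0)
f2 = f_beta(precision, recall, beta=2.0)

print(f"Precision: {precision:.2f}  Recall: {recall:.2f}")
print(f"F1: {f1:.2f}  F2: {f2:.2f}")
```

With these counts F1 is 0.32 and F2 is 0.50: the same model looks noticeably better under F2 because its strength is recall.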

### Business Cost Metric:

**Cost Function:**
```
Total Cost = (FN × $100) + (FP × $5)
```

Where:
- **False Negative (FN)**: Missing a bankruptcy = $100 cost (total loss)
- **False Positive (FP)**: False alarm = $5 cost (extra due diligence)

**Objective:** Minimize total business cost

---

In [None]:
# Define business cost calculation function
def calculate_business_cost(y_true, y_pred, cost_fn=100, cost_fp=5):
    """
    Calculate business cost of predictions.
    
    Parameters:
    -----------
    y_true : array-like
        True labels
    y_pred : array-like
        Predicted labels
    cost_fn : float
        Cost of False Negative (missing a bankruptcy)
    cost_fp : float
        Cost of False Positive (false alarm)
    
    Returns:
    --------
    total_cost : float
        Total business cost
    """
    from sklearn.metrics import confusion_matrix
    
    # Fix the label order so a 2x2 matrix is returned even if one class is absent
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    
    total_cost = (fn * cost_fn) + (fp * cost_fp)
    
    print("Business Cost Analysis:")
    print("="*60)
    print(f"True Negatives (TN):  {tn:4d} (Correctly identified non-bankrupt)")
    print(f"False Positives (FP): {fp:4d} (False alarms)")
    print(f"False Negatives (FN): {fn:4d} (Missed bankruptcies) ⚠️")
    print(f"True Positives (TP):  {tp:4d} (Correctly identified bankrupt)")
    print("\nCost Calculation:")
    print(f"  FN Cost: {fn} × ${cost_fn} = ${fn * cost_fn:,}")
    print(f"  FP Cost: {fp} × ${cost_fp} = ${fp * cost_fp:,}")
    print(f"  TOTAL COST: ${total_cost:,}")
    print("="*60)
    
    return total_cost

print("✓ Business cost calculation function defined")
print("  This function will be used to evaluate models based on business impact.")

## 3.3 Class Imbalance Challenge

### The Problem:

In [None]:
# Quantify and visualize the class imbalance problem
print("="*80)
print("CLASS IMBALANCE CHALLENGE")
print("="*80)

n_total = len(df)
n_bankrupt = df['Bankrupt?'].sum()
n_not_bankrupt = n_total - n_bankrupt
pct_bankrupt = (n_bankrupt / n_total) * 100
pct_not_bankrupt = (n_not_bankrupt / n_total) * 100

print(f"\n📊 Dataset Composition:")
print(f"   Total Companies:        {n_total:,}")
print(f"   Bankrupt Companies:     {n_bankrupt:,} ({pct_bankrupt:.2f}%)")
print(f"   Non-Bankrupt Companies: {n_not_bankrupt:,} ({pct_not_bankrupt:.2f}%)")
print(f"\n⚖️  Imbalance Ratio: 1 : {n_not_bankrupt/n_bankrupt:.1f}")
print(f"   (For every 1 bankrupt company, there are about {n_not_bankrupt / n_bankrupt:.0f} non-bankrupt companies)")

print("\n" + "─"*80)
print("\n⚠️  IMPLICATIONS FOR MODELING:")
print("="*80)
print("\n1. Accuracy is MISLEADING:")
print(f"   → A model predicting all companies as 'not bankrupt' achieves about {pct_not_bankrupt:.0f}% accuracy!")
print("   → But it's completely useless (0% recall)")
print("\n2. Need Specialized Evaluation Metrics:")
print("   ✓ Use Recall, Precision, F2-Score instead of Accuracy")
print("   ✓ Focus on minority class (bankrupt) performance")
print("\n3. Must Use Class Balancing Techniques:")
print("   ✓ SMOTE (Synthetic Minority Over-sampling)")
print("   ✓ Class weights in models")
print("   ✓ Stratified sampling for train-test split")
print("\n4. Risk of Model Bias:")
print("   → Without intervention, models will bias toward majority class")
print("   → Will have poor recall (miss many bankruptcies)")
print("\n" + "="*80)

### Our Strategy to Handle Imbalance:

1. **Data-Level:**
   - Apply SMOTE to training set only (not test set)
   - Creates synthetic minority class samples
   - Balances class distribution for training

2. **Algorithm-Level:**
   - Use `class_weight='balanced'` parameter in models
   - Penalizes misclassification of minority class more heavily

3. **Evaluation-Level:**
   - Stratified train-test split (maintains class ratio)
   - Stratified K-fold cross-validation
   - Focus on Recall and F2-Score

4. **Threshold Optimization:**
   - Don't use default 0.5 threshold
   - Optimize threshold based on business costs
   - Likely use lower threshold to catch more bankruptcies
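
The threshold step can be sketched as a simple sweep: score each candidate threshold with the FN × $100 + FP × $5 cost function and keep the cheapest. The labels and predicted probabilities below are invented for illustration, and `expected_cost` is a hypothetical helper, not part of the notebook's code.

```python
def expected_cost(y_true, scores, threshold, cost_fn=100, cost_fp=5):
    """Business cost when every score >= threshold is flagged as bankrupt."""
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    return fn * cost_fn + fp * cost_fp

# Invented example: two bankrupt companies, one of which gets a modest score.
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1]
scores = [0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.45, 0.40, 0.60, 0.80]

# Candidate thresholds 0.05, 0.10, ..., 0.95.
thresholds = [round(0.05 * k, 2) for k in range(1, 20)]
costs = {t: expected_cost(y_true, scores, t) for t in thresholds}

best_threshold = min(costs, key=costs.get)
default_cost = expected_cost(y_true, scores, 0.5)

print(f"Cost at default 0.50 threshold: ${default_cost}")
print(f"Best threshold: {best_threshold:.2f} (cost ${costs[best_threshold]})")
```

In this toy case the default threshold misses one bankruptcy, and lowering the threshold trades that miss for one extra false alarm. On real data the sweep should be run on validation predictions, not on the test set.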

---

<a id='section-iv'></a>
# IV) Methodology & Planning (CRISP-DM)

This section documents our systematic approach to the data mining task using the CRISP-DM framework.

## 4.1 CRISP-DM Framework

We follow the **Cross-Industry Standard Process for Data Mining (CRISP-DM)**, which consists of six phases:

### Phase 1: Business Understanding ✓

**Objective:**
- Predict corporate bankruptcy to minimize financial losses for stakeholders
- Enable early warning and proactive decision-making

**Success Criteria:**
- Recall ≥ 80% (catch most bankruptcies)
- Minimize business cost (FN × $100 + FP × $5)
- Interpretable model for stakeholder confidence

**Stakeholders Identified:**
- Banks, investors, credit agencies, regulators, corporate management

---

### Phase 2: Data Understanding ✓

**Dataset Characteristics:**
- Source: Taiwan Economic Journal (1999-2009)
- Size: 6,819 companies
- Features: 95 financial ratios + 1 target variable
- Data Quality: No missing values

**Key Findings:**
- Severe class imbalance (3% bankruptcy rate)
- Clear discriminative features identified
- Features cover the financial-health and operational branches of the issue tree; market and industry factors are not measured

---

### Phase 3: Data Preparation (Current Phase)

**Steps to be executed:**

1. **Data Cleaning:**
   - Verify no missing values ✓
   - Handle outliers if necessary
   - Check for data consistency

2. **Feature Engineering:**
   - Select most important features (optional dimensionality reduction)
   - Feature scaling using StandardScaler
   - No new features needed (comprehensive existing set)

3. **Train-Test Split:**
   - 80% training, 20% testing
   - Stratified sampling to maintain class ratio
   - Random state = 42 for reproducibility

4. **Handle Class Imbalance:**
   - Apply SMOTE to training set only
   - Alternative: Use class weights in models
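   - Never apply SMOTE before the split or to the test set: resampling before splitting leaks information, and a resampled test set no longer reflects the real class ratio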
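
The split and scaling steps above might look like the following sketch. It uses synthetic data as a stand-in for the real feature matrix (1,000 rows, 5 features, 3% positives are assumptions for illustration) and fits the scaler on the training portion only, so no test-set statistics leak into training.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic stand-in for the real data: 1,000 rows, 5 features, 3% positives.
X = rng.normal(size=(1000, 5))
y = np.array([1] * 30 + [0] * 970)

# Stratified 80/20 split keeps the class ratio the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)

# Fit the scaler on the training set only, then apply it to both sets.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

train_rate = y_train.mean()
test_rate = y_test.mean()
print(f"Positive rate, train: {train_rate:.3f}  test: {test_rate:.3f}")
```

Any resampling step such as SMOTE would then be applied to `X_train_scaled` and `y_train` only, leaving the test set at its original class ratio.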

---

### Phase 4: Modeling

**Model Selection Strategy:**

Train multiple models and compare:

1. **Baseline Model: Logistic Regression**
   - Fast, interpretable
   - Good for understanding feature importance
   - Class weights for imbalance

2. **Tree-Based: Random Forest**
   - Handles non-linear relationships
   - Feature importance built-in
   - Robust to outliers
   - Class weights for imbalance

3. **Boosting: XGBoost**
   - Often among the strongest performers on tabular data
   - Captures non-linear effects and feature interactions
   - scale_pos_weight for imbalance
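
For reference, the two imbalance settings named above reduce to simple arithmetic on the class counts. The sketch below uses illustrative counts (970 negatives, 30 positives) and reproduces the formula scikit-learn documents for `class_weight='balanced'`, `n_samples / (n_classes * count)`, alongside the negative-to-positive ratio that the XGBoost documentation suggests as a typical `scale_pos_weight`.

```python
# Illustrative class counts, not the dataset's actual counts.
n_negative, n_positive = 970, 30
n_samples = n_negative + n_positive
n_classes = 2

# scikit-learn 'balanced' heuristic: n_samples / (n_classes * class_count).
balanced_weights = {
    0: n_samples / (n_classes * n_negative),
    1: n_samples / (n_classes * n_positive),
}

# Typical XGBoost starting point: ratio of negative to positive instances.
scale_pos_weight = n_negative / n_positive

print(f"Balanced weights: {balanced_weights}")
print(f"scale_pos_weight: {scale_pos_weight:.2f}")
```

Both settings express the same idea: the ratio of the two balanced weights equals `scale_pos_weight`, so an error on a bankrupt company counts roughly 32 times as much as an error on a healthy one in this example.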

**Cross-Validation:**
- 5-fold stratified CV
- Ensures robust performance estimates
- Helps detect overfitting (it does not prevent it); scaling and any resampling must be fitted inside each training fold
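
A minimal sketch of the cross-validation setup, assuming synthetic data in place of the real features. Wrapping the scaler and the classifier in one pipeline means the scaler is refitted on each training fold, so the held-out fold never influences preprocessing. Class weights are used here rather than SMOTE to keep the example within scikit-learn. If SMOTE is used, imbalanced-learn's `imblearn.pipeline.Pipeline` accepts a sampler step in the same position.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: 60 positives shifted away from 940 negatives on 4 features.
n_pos, n_neg = 60, 940
X = np.vstack([
    rng.normal(1.5, 1.0, size=(n_pos, 4)),
    rng.normal(0.0, 1.0, size=(n_neg, 4)),
])
y = np.array([1] * n_pos + [0] * n_neg)

# Scaler and model travel together, so scaling is fitted per training fold.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
recall_per_fold = cross_val_score(model, X, y, cv=cv, scoring="recall")

print("Recall per fold:", np.round(recall_per_fold, 3))
print(f"Mean recall: {recall_per_fold.mean():.3f}")
```

A large spread between folds, or a gap between training and cross-validated recall, is the signal to look for when checking for overfitting.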

---

### Phase 5: Evaluation

**Evaluation Metrics:**
- Primary: Recall (≥ 80% target)
- Secondary: Precision, F2-Score, AUC-ROC
- Business: Total cost minimization

**Model Comparison:**
- Compare all models on test set
- Select best based on recall and cost
- Analyze feature importance
- Examine misclassifications

---

### Phase 6: Deployment (Recommendations)

**Implementation Plan:**
1. Integrate selected model into decision workflow
2. Set probability thresholds based on risk appetite
3. Monitor model performance over time
4. Retrain quarterly with new data
5. Provide stakeholder training

**Monitoring:**
- Track recall on new data
- Monitor actual vs predicted bankruptcy rates
- Update model when performance degrades

---

## 4.2 Experimental Plan

Detailed step-by-step plan for implementation:

In [None]:
# Create experimental plan table
experimental_plan_data = {
    'Step': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Task': [
        'Data Preparation',
        'Train-Test Split (Stratified)',
        'Feature Scaling',
        'Apply SMOTE (Training Only)',
        'Train Logistic Regression',
        'Train Random Forest',
        'Train XGBoost',
        'Evaluate All Models',
        'Feature Importance Analysis',
        'Model Selection & Recommendations'
    ],
    'Input': [
        'Raw dataset (df)',
        'Cleaned data (X, y)',
        'X_train, X_test',
        'X_train_scaled',
        'X_train_resampled',
        'X_train_resampled',
        'X_train_resampled',
        'Trained models',
        'Best model',
        'All results'
    ],
    'Output': [
        'X (features), y (target)',
        'Train and test sets',
        'Scaled features',
        'Balanced training data',
        'LR predictions & metrics',
        'RF predictions & metrics',
        'XGB predictions & metrics',
        'Performance comparison table',
        'Top features ranked',
        'Best model + deployment plan'
    ],
    'Tool/Library': [
        'pandas',
        'sklearn.model_selection',
        'sklearn.preprocessing.StandardScaler',
        'imblearn.over_sampling.SMOTE',
        'sklearn.linear_model.LogisticRegression',
        'sklearn.ensemble.RandomForestClassifier',
        'xgboost.XGBClassifier',
        'sklearn.metrics',
        'feature_importances_ (trees) / coef_ (LR)',
        'Custom analysis'
    ]
}

exp_plan_df = pd.DataFrame(experimental_plan_data)

print("\n" + "="*100)
print("EXPERIMENTAL PLAN - DETAILED WORKFLOW")
print("="*100)
display(exp_plan_df)
print("="*100)
print("\n✓ This plan will be executed in Section V")