# 14 - Ablation Studies (Phase 3)

**Author:** Tan Ming Kai (24PMR12003)  
**Date:** 2025-11-24  
**Purpose:** Test hypotheses H₂, H₃, H₄ through ablation experiments

---

## Research Hypotheses

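**H₂:** Dual-branch processing improves test accuracy by ≥5 percentage points vs single-scale
- **Test:** CrossViT (dual-branch) vs ViT (single-scale)
- **Prediction:** CrossViT's mean test accuracy exceeds ViT's by at least 5 points, and the difference is statistically significant (paired t-test, α=0.05)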

**H₃:** CLAHE enhancement improves test accuracy by ≥2 percentage points vs no CLAHE
- **Test:** Train CrossViT on CLAHE-enhanced vs raw images
- **Prediction:** CLAHE raises local contrast, which should yield better features and higher accuracy

**H₄:** Conservative augmentation improves generalization without degrading accuracy
- **Test:** No augmentation vs Conservative vs Aggressive augmentation
- **Prediction:** Conservative augmentation gives the best test accuracy of the three (Conservative > None, Conservative ≥ Aggressive)

---

## Important Notes

**H₂ can be tested IMMEDIATELY** using existing Phase 2 results (CrossViT vs ViT)

**H₃ and H₄ require NEW training runs:**
- H₃: Train 1 model with/without CLAHE (2 runs × 1 seed = 2 GPU hours)
- H₄: Train 1 model with 3 augmentation levels (3 runs × 1 seed = 3 GPU hours)

**Time Budget:** ~5 GPU hours for complete ablation study

---

In [None]:
# Standard imports
import os, sys, warnings
from pathlib import Path
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

warnings.filterwarnings('ignore')
plt.style.use('seaborn-v0_8-darkgrid')

print("[OK] Imports complete")

In [None]:
# Configuration
RESULTS_DIR = Path("../experiments/phase2_systematic/results/metrics")
OUTPUT_DIR = Path("../experiments/phase3_analysis/ablation_studies")
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

print(f"[OK] Configuration set")

## 1. H₂: Dual-Branch vs Single-Scale

### Hypothesis
CrossViT's dual-branch architecture (12×12 and 16×16 patches) should capture multi-scale features better than ViT's single-scale architecture (16×16 patches only).

### Prediction
CrossViT accuracy ≥ ViT accuracy + 5 percentage points

### Test Method
Compare existing Phase 2 results (no additional training needed)
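
A paired t-test against zero only asks whether the two models differ, while the prediction above claims a gap of at least 5 points. One way to test that margin directly is a one-sided, one-sample t-test on the per-seed differences. The sketch below is a minimal illustration under stated assumptions: `margin_test` is a hypothetical helper, the accuracy lists are made-up numbers (not Phase 2 results), the runs are assumed to be paired by seed, and the `alternative` argument needs SciPy 1.6 or newer.

```python
import numpy as np
from scipy import stats

def margin_test(treatment_acc, baseline_acc, margin=5.0, alpha=0.05):
    """One-sided paired test of H1: mean(treatment - baseline) > margin."""
    diffs = np.asarray(treatment_acc, dtype=float) - np.asarray(baseline_acc, dtype=float)
    # A one-sample test on the differences is a paired test shifted by the margin
    t_stat, p_value = stats.ttest_1samp(diffs, popmean=margin, alternative="greater")
    return {
        "mean_diff": float(diffs.mean()),
        "t_stat": float(t_stat),
        "p_value": float(p_value),
        "exceeds_margin": bool(p_value < alpha),
    }

# Illustrative numbers only (NOT Phase 2 results), one entry per seed
demo_crossvit = [94.1, 93.8, 94.5, 94.0, 93.9]
demo_vit = [87.9, 88.4, 88.1, 87.6, 88.3]
result = margin_test(demo_crossvit, demo_vit)
print(result)
```

With only a handful of seeds this test has little power, so a non-significant result is better read as "not demonstrated" than as evidence against H₂.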

In [None]:
# Load results
crossvit_df = pd.read_csv(RESULTS_DIR / "crossvit_results.csv")
vit_df = pd.read_csv(RESULTS_DIR / "vit_results.csv")

crossvit_acc = crossvit_df['test_acc'].values
vit_acc = vit_df['test_acc'].values

print("H₂: DUAL-BRANCH VS SINGLE-SCALE ANALYSIS")
print("="*80)
print(f"\nCrossViT-Tiny (Dual-Branch):")
print(f"  Mean: {np.mean(crossvit_acc):.2f}%")
print(f"  Std:  {np.std(crossvit_acc, ddof=1):.2f}%")
print(f"  Seeds: {crossvit_acc}")

print(f"\nViT-Tiny (Single-Scale):")
print(f"  Mean: {np.mean(vit_acc):.2f}%")
print(f"  Std:  {np.std(vit_acc, ddof=1):.2f}%")
print(f"  Seeds: {vit_acc}")

# Calculate difference
mean_diff = np.mean(crossvit_acc) - np.mean(vit_acc)

print(f"\n{'='*80}")
print(f"Mean Difference: {mean_diff:+.2f}%")
print(f"{'='*80}")

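# Statistical test (paired by seed: row i of both CSVs must come from the same seed)
assert len(crossvit_acc) == len(vit_acc), "Paired t-test needs one ViT run per CrossViT run"
t_stat, p_value = stats.ttest_rel(crossvit_acc, vit_acc)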

print(f"\nPaired t-test:")
print(f"  t-statistic: {t_stat:+.3f}")
print(f"  p-value: {p_value:.4f}")
print(f"  Significant (α=0.05): {'Yes' if p_value < 0.05 else 'No'}")

# Hypothesis evaluation
print(f"\n{'='*80}")
print(f"H₂ EVALUATION: Dual-branch should improve accuracy by ≥5%")
print(f"{'='*80}")
print(f"Observed improvement: {mean_diff:+.2f}%")
print(f"Prediction: ≥5.00%")
print(f"Result: {'✓ SUPPORTED' if mean_diff >= 5.0 else '✗ NOT SUPPORTED'}")
print(f"Statistical significance: {'Yes (p<0.05)' if p_value < 0.05 else 'No (p≥0.05)'}")

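# Effect size (Cohen's d with pooled SD; conventional cut-offs 0.2 / 0.5 / 0.8)
cohens_d = mean_diff / np.sqrt((np.var(crossvit_acc, ddof=1) + np.var(vit_acc, ddof=1)) / 2)
abs_d = abs(cohens_d)
effect_label = 'negligible' if abs_d < 0.2 else 'small' if abs_d < 0.5 else 'medium' if abs_d < 0.8 else 'large'
print(f"Cohen's d: {cohens_d:.3f} ({effect_label} effect)")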
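
The Cohen's d above pools the two standard deviations, which treats the two groups as independent, whereas the t-test is paired. If the runs are indeed paired by seed, an effect size computed from the per-seed differences (often written d_z) may be a more consistent companion. A minimal sketch with illustrative numbers only; `paired_effect_size` is a hypothetical helper, not part of the Phase 2 code:

```python
import numpy as np

def paired_effect_size(treatment_acc, baseline_acc):
    """Return d_z = mean(diff) / std(diff), using the sample standard deviation."""
    diffs = np.asarray(treatment_acc, dtype=float) - np.asarray(baseline_acc, dtype=float)
    sd = diffs.std(ddof=1)
    if sd == 0:
        raise ValueError("All paired differences are identical; d_z is undefined")
    return float(diffs.mean() / sd)

# Illustrative numbers only (not measured results), one entry per seed
demo_a = [90.0, 91.0, 92.0, 93.0]
demo_b = [88.0, 90.0, 89.0, 91.0]
d_z = paired_effect_size(demo_a, demo_b)
print(f"d_z = {d_z:.3f}")
```

For n pairs the paired t-statistic equals d_z·√n, so the two quantities always tell a consistent story.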

## 2. H₃: CLAHE Enhancement Impact

### Hypothesis
CLAHE preprocessing enhances low-contrast chest X-rays → better feature extraction → higher accuracy

### Prediction
CLAHE accuracy ≥ No-CLAHE accuracy + 2 percentage points

### Test Method
**Option A (Recommended):** Quick pilot test with CrossViT seed=42 only
- Train 2 models: with/without CLAHE
- Compare accuracies
- **Time:** ~2 GPU hours

**Option B:** Full 5-seed replication (more rigorous but time-consuming)
- **Time:** ~10 GPU hours

### Status
⏭️ **NOT YET TESTED** - Requires new training runs

---

**To implement:**
1. Load raw images (no CLAHE)
2. Train CrossViT on raw data (seed=42)
3. Compare with Phase 2 CrossViT (seed=42, CLAHE-enhanced)
4. Calculate accuracy difference
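
Steps 3 and 4 above reduce to a threshold check on two accuracies. A minimal sketch, assuming accuracies are expressed in percent; `evaluate_gain` and the example values are hypothetical placeholders, not measured results. With one seed per arm there is no variance estimate, so the outcome is only indicative.

```python
def evaluate_gain(treated_acc, baseline_acc, min_gain=2.0):
    """Compare two accuracies (in percent) against a minimum gain in percentage points."""
    gain = treated_acc - baseline_acc
    # Small tolerance so that a gain of exactly min_gain is not lost to float rounding
    supported = gain >= min_gain - 1e-9
    return round(gain, 4), supported

# Placeholder values for illustration only; replace with the measured seed=42 accuracies
clahe_acc, raw_acc = 93.5, 91.0
gain, supported = evaluate_gain(clahe_acc, raw_acc)
print(f"CLAHE gain: {gain:+.2f} points -> {'SUPPORTED' if supported else 'NOT SUPPORTED'}")
```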

**Code placeholder below:**

In [None]:
# H₃ Testing (requires implementation)
print("H₃: CLAHE ENHANCEMENT IMPACT")
print("="*80)
print("\n⚠️ STATUS: NOT YET TESTED")
print("\nThis test requires training new models:")
print("1. Train CrossViT on RAW images (no CLAHE)")
print("2. Compare with existing CLAHE-enhanced model")
print("\nEstimated time: 1-2 GPU hours per seed")
print("Recommendation: Run with seed=42 only for pilot test")
print("\n[PLACEHOLDER] Implement training code when ready")

# Expected code structure:
# 1. Load test data WITHOUT CLAHE
# 2. Train model on raw images
# 3. Compare accuracy: clahe_acc vs raw_acc
# 4. Test if difference ≥ 2%

## 3. H₄: Data Augmentation Strategy

### Hypothesis
Conservative augmentation (±10° rotation, 50% flip, slight color jitter) improves generalization without introducing anatomically impossible transformations.

### Prediction
Conservative > No augmentation  
Conservative ≥ Aggressive augmentation

### Test Method
Train CrossViT with 3 augmentation strategies (seed=42):

1. **No Augmentation:** Only resize + normalize
2. **Conservative (Current):** ±10° rotation, horizontal flip, brightness/contrast ±0.1
3. **Aggressive:** ±30° rotation, horizontal + vertical flip, brightness/contrast ±0.3
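
The three strategies above can be written down as plain configuration objects, together with the H₄ decision rule. This is a sketch under stated assumptions: `AugmentConfig`, `AUGMENTATIONS`, and `h4_supported` are hypothetical names, and the accuracies are placeholders rather than results. Turning each config into real transforms (for example torchvision's `RandomRotation`, `RandomHorizontalFlip`, `RandomVerticalFlip`, and `ColorJitter`, if that is what the Phase 2 pipeline uses) is left to the training code.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AugmentConfig:
    rotation_deg: float         # maximum rotation in either direction
    horizontal_flip: bool
    vertical_flip: bool
    brightness_contrast: float  # jitter range for brightness and contrast

# Values copied from the three strategies listed above
AUGMENTATIONS = {
    "none": AugmentConfig(0.0, False, False, 0.0),
    "conservative": AugmentConfig(10.0, True, False, 0.1),
    "aggressive": AugmentConfig(30.0, True, True, 0.3),
}

def h4_supported(acc):
    """H4 decision rule: conservative > none AND conservative >= aggressive."""
    return acc["conservative"] > acc["none"] and acc["conservative"] >= acc["aggressive"]

# Placeholder accuracies for illustration only (not measured results)
demo_acc = {"none": 90.0, "conservative": 92.0, "aggressive": 91.0}
print(h4_supported(demo_acc))
```

Keeping the configs separate from the training loop makes it easy to confirm that the three runs differ only in augmentation.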

### Status
⏭️ **NOT YET TESTED** - Requires new training runs

**Time:** ~3 GPU hours (1 hour per configuration)

---

In [None]:
# H₄ Testing (requires implementation)
print("H₄: DATA AUGMENTATION STRATEGY")
print("="*80)
print("\n⚠️ STATUS: NOT YET TESTED")
print("\nThis test requires training 3 models:")
print("1. No augmentation (baseline)")
print("2. Conservative augmentation (current strategy)")
print("3. Aggressive augmentation")
print("\nEstimated time: 1 GPU hour per model × 3 = 3 hours")
print("\nRecommendation: Run with seed=42 only")
print("\n[PLACEHOLDER] Implement training code when ready")

# Expected code structure:
# 1. Define 3 augmentation transforms
# 2. Train CrossViT with each
# 3. Compare accuracies
# 4. Test: conservative > none AND conservative ≥ aggressive

## 4. Visualization: Ablation Results Summary

In [None]:
# Plot H₂ results (only completed test)
fig, ax = plt.subplots(figsize=(10, 6))

models = ['ViT-Tiny\n(Single-Scale)', 'CrossViT-Tiny\n(Dual-Branch)']
means = [np.mean(vit_acc), np.mean(crossvit_acc)]
stds = [np.std(vit_acc, ddof=1), np.std(crossvit_acc, ddof=1)]

x_pos = np.arange(len(models))
colors = ['#FF6B6B', '#4ECDC4']

bars = ax.bar(x_pos, means, yerr=stds, capsize=10, color=colors, alpha=0.7, edgecolor='black', linewidth=1.5)

# Add value labels
for i, (mean, std) in enumerate(zip(means, stds)):
    ax.text(i, mean + std + 1, f"{mean:.2f}%\n±{std:.2f}%", 
            ha='center', va='bottom', fontsize=11, fontweight='bold')

# Add significance indicator
if p_value < 0.05:
    y_max = max(means) + max(stds) + 3
    ax.plot([0, 1], [y_max, y_max], 'k-', linewidth=1.5)
    ax.text(0.5, y_max + 0.5, f"p = {p_value:.4f}*", ha='center', fontsize=10)

ax.set_ylabel('Test Accuracy (%)', fontsize=12, fontweight='bold')
ax.set_xticks(x_pos)
ax.set_xticklabels(models, fontsize=11)
ax.set_title('H₂: Dual-Branch vs Single-Scale Architecture', fontsize=14, fontweight='bold')
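# Truncated y-axis (starts at 80 to make the gap visible); use a lower bound if a bar would be cut off
y_lower = 80 if min(means) - max(stds) > 80 else max(0, min(means) - max(stds) - 5)
ax.set_ylim(y_lower, max(means) + max(stds) + 8)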
ax.grid(axis='y', alpha=0.3)

# Add hypothesis box
textstr = f"H₂: Dual-branch improves by ≥5%\nResult: {mean_diff:+.2f}% {'✓' if mean_diff >= 5.0 else '✗'}"
props = dict(boxstyle='round', facecolor='wheat', alpha=0.5)
ax.text(0.02, 0.98, textstr, transform=ax.transAxes, fontsize=10,
        verticalalignment='top', bbox=props)

plt.tight_layout()
plt.savefig(OUTPUT_DIR / 'h2_dual_branch_analysis.png', dpi=300, bbox_inches='tight')
plt.show()

print("[OK] H₂ visualization saved")

## 5. Summary Report

In [None]:
# Generate summary report (explicit UTF-8 so that H₂, ≥ and ± are written correctly on every platform)
with open(OUTPUT_DIR / 'ablation_studies_summary.txt', 'w', encoding='utf-8') as f:
    f.write("ABLATION STUDIES SUMMARY\n")
    f.write("="*80 + "\n\n")
    
    # H₂
    f.write("H₂: DUAL-BRANCH VS SINGLE-SCALE\n")
    f.write("-"*80 + "\n")
    f.write(f"Hypothesis: Dual-branch improves accuracy by ≥5%\n")
    f.write(f"CrossViT (Dual): {np.mean(crossvit_acc):.2f}% ± {np.std(crossvit_acc, ddof=1):.2f}%\n")
    f.write(f"ViT (Single):    {np.mean(vit_acc):.2f}% ± {np.std(vit_acc, ddof=1):.2f}%\n")
    f.write(f"Difference: {mean_diff:+.2f}%\n")
    f.write(f"Statistical test: t = {t_stat:+.3f}, p = {p_value:.4f}\n")
    f.write(f"Result: {'SUPPORTED' if mean_diff >= 5.0 else 'NOT SUPPORTED'}\n")
    f.write(f"Effect size: Cohen's d = {cohens_d:.3f}\n\n")
    
    # H₃
    f.write("H₃: CLAHE ENHANCEMENT IMPACT\n")
    f.write("-"*80 + "\n")
    f.write("Status: NOT YET TESTED\n")
    f.write("Requires: Training on raw (no CLAHE) images\n")
    f.write("Time estimate: 1-2 GPU hours\n\n")
    
    # H₄
    f.write("H₄: DATA AUGMENTATION STRATEGY\n")
    f.write("-"*80 + "\n")
    f.write("Status: NOT YET TESTED\n")
    f.write("Requires: Training with 3 augmentation levels\n")
    f.write("Time estimate: 3 GPU hours\n\n")
    
    f.write("RECOMMENDATIONS\n")
    f.write("-"*80 + "\n")
    f.write("1. H₂ complete - include in thesis\n")
    f.write("2. H₃ and H₄ optional for completion (5 GPU hours total)\n")
    f.write("3. If time-constrained, discuss H₂ and limitations of untested hypotheses\n")

print("\n[OK] Summary report saved to: ablation_studies_summary.txt")

print("\n" + "="*80)
print("ABLATION STUDIES STATUS")
print("="*80)
print("\n✓ H₂: Dual-Branch vs Single-Scale - COMPLETE")
print("⏭️ H₃: CLAHE Enhancement - NOT TESTED (requires 2 GPU hours)")
print("⏭️ H₄: Augmentation Strategy - NOT TESTED (requires 3 GPU hours)")
print("\nTotal remaining time: ~5 GPU hours")
print("\nRecommendation: H₂ sufficient for thesis. H₃ and H₄ optional if time permits.")