# 02 — Statistical Analysis

This notebook tests five clinical hypotheses about treatment effectiveness and patient risk factors.

For each hypothesis:
- Check assumptions first (normality via Shapiro-Wilk, equal variances via Levene's test)
- Pick the right test based on what the assumption checks tell us
- Report effect sizes and confidence intervals, not just p-values
- Apply Bonferroni correction since we're running 5 tests
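
The assumption-driven choice of test described in the list above can be sketched as follows. This is a minimal illustration on synthetic data, not the project's pipeline: the helper name `compare_two_groups` and the 0.05 threshold for the assumption checks are assumptions made for the example.

```python
import numpy as np
from scipy import stats

def compare_two_groups(a, b, alpha=0.05):
    """Pick a two-sample test from assumption checks (hypothetical helper)."""
    # Shapiro-Wilk: a small p-value is evidence against normality
    normal = stats.shapiro(a).pvalue > alpha and stats.shapiro(b).pvalue > alpha
    # Levene: a small p-value is evidence against equal variances
    equal_var = stats.levene(a, b).pvalue > alpha

    if not normal:
        name, res = "Mann-Whitney U", stats.mannwhitneyu(a, b, alternative="two-sided")
    elif equal_var:
        name, res = "Student's t-test", stats.ttest_ind(a, b, equal_var=True)
    else:
        name, res = "Welch's t-test", stats.ttest_ind(a, b, equal_var=False)
    return name, res.statistic, res.pvalue

# Synthetic, clearly skewed samples, so the non-parametric branch is taken
rng = np.random.default_rng(0)
skewed_a = rng.exponential(scale=1.0, size=300)
skewed_b = rng.exponential(scale=1.5, size=300)

test_name, stat, p = compare_two_groups(skewed_a, skewed_b)
print(test_name, stat, p)
```

With very large samples Shapiro-Wilk rejects even minor departures from normality, so in practice the decision is often cross-checked against a Q-Q plot.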

In [1]:
import sys, os
sys.path.insert(0, os.path.abspath('..'))

import pandas as pd
import numpy as np
import json
from IPython.display import display, Markdown, Image
from src.config import TABLES_DIR, FIGURES_DIR

# Load pre-computed hypothesis results
with open(os.path.join(TABLES_DIR, 'hypothesis_results.json'), 'r') as f:
    hypotheses = json.load(f)

# Load multiple testing correction
correction_df = pd.read_csv(os.path.join(TABLES_DIR, 'multiple_testing_correction.csv'))
print(f'Loaded {len(hypotheses)} hypothesis tests.')

Loaded 5 hypothesis tests.


## Hypothesis 1: Treatment Arm Affects Outcome Rate

**H₀:** Recovery outcome rate is equal across all treatment arms (Standard, Enhanced, Control)  
**H₁:** At least one treatment arm has a different outcome rate  
**Test:** Chi-Square Test of Independence  
**Why Chi-Square:** Both variables (treatment arm and outcome) are categorical. Chi-square is the standard test for association between two categorical variables.
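
As a rough sketch of how the chi-square statistic and Cramér's V relate, the example below runs the test on a contingency table of hypothetical counts (they are not the study data) and derives V as sqrt(chi2 / (n * min(rows - 1, cols - 1))).

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts, NOT the study data:
# rows = treatment arms, columns = (success, failure)
table = np.array([
    [380, 620],   # Standard
    [360, 640],   # Enhanced
    [270, 730],   # Control
])

chi2, p_value, dof, expected = chi2_contingency(table)

n = table.sum()                 # total number of patients
k = min(table.shape) - 1        # min(rows - 1, cols - 1)
cramers_v = np.sqrt(chi2 / (n * k))

print(f"chi2={chi2:.2f}, dof={dof}, p={p_value:.2e}, V={cramers_v:.3f}")
```

Because V divides the statistic by the sample size, a very small p-value can coexist with a small V when n is large.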

In [2]:
h1 = hypotheses[0]
print(f"Test: {h1['test']}")
print(f"Chi-square statistic: {h1['chi2']:.4f}")
print(f"Degrees of freedom: {h1['dof']}")
print(f"p-value: {h1['p_value']:.2e}")
print(f"Effect size (Cramer's V): {h1['effect_size_cramers_v']:.4f}")
print(f"Statistical power: {h1['power']:.4f}")
print(f"Bonferroni-adjusted p: {h1['bonferroni_p']:.2e}")
print(f"\nInterpretation: {h1['interpretation']}")

# Pairwise comparisons
print('\nPairwise Comparisons:')
for p in h1['pairwise']:
    sig = '***' if p['p_value'] < 0.001 else ('**' if p['p_value'] < 0.01 else ('*' if p['p_value'] < 0.05 else 'ns'))
    print(f"  {p['comparison']:25s}  rates: {p['rate_1']:.3f} vs {p['rate_2']:.3f}  p={p['p_value']:.4e} {sig}")

Test: Chi-Square Test of Independence
Chi-square statistic: 59.2669
Degrees of freedom: 2
p-value: 1.35e-13
Effect size (Cramer's V): 0.0861
Statistical power: 1.0000
Bonferroni-adjusted p: 6.75e-13

Interpretation: Enhanced treatment shows significantly higher outcome rate. This aligns with the clinical hypothesis that more intensive intervention improves treatment success, though indication bias must be considered.

Pairwise Comparisons:
  Standard vs Enhanced       rates: 0.378 vs 0.357  p=1.0822e-01 ns
  Standard vs Control        rates: 0.378 vs 0.271  p=1.7058e-13 ***
  Enhanced vs Control        rates: 0.357 vs 0.271  p=1.5117e-10 ***


## Hypothesis 2: HbA1c Levels Differ by Outcome

**H₀:** HbA1c is identically distributed in the good and poor outcome groups  
**H₁:** HbA1c tends to be higher in one outcome group than in the other  
**Assumption Tests:** Shapiro-Wilk for normality, Levene's for equal variance  
**Test:** Mann-Whitney U (non-parametric fallback, used because normality assumption violated)

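**Why Mann-Whitney over t-test:** Shapiro-Wilk rejected normality in both groups (p < 0.001) and Levene's test rejected equal variances (p < 0.001), so we use Mann-Whitney U, the rank-based alternative to the two-sample t-test. Cohen's d and the 95% CI for the mean difference are reported alongside it as mean-based effect measures.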
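
This notebook loads `cohens_d` and `ci_95` pre-computed, so how they were derived is not shown here. One common approach, sketched below on synthetic data, is Cohen's d with a pooled standard deviation and a Welch interval for the mean difference. The function names `cohens_d` and `welch_ci` and the simulated values are assumptions for illustration only.

```python
import numpy as np
from scipy import stats

def cohens_d(a, b):
    """Cohen's d using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * np.var(a, ddof=1) + (nb - 1) * np.var(b, ddof=1)) / (na + nb - 2)
    return (np.mean(a) - np.mean(b)) / np.sqrt(pooled_var)

def welch_ci(a, b, level=0.95):
    """Confidence interval for mean(a) - mean(b) without assuming equal variances."""
    va = np.var(a, ddof=1) / len(a)
    vb = np.var(b, ddof=1) / len(b)
    se = np.sqrt(va + vb)
    # Welch-Satterthwaite degrees of freedom
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    margin = stats.t.ppf(0.5 + level / 2, df) * se
    diff = np.mean(a) - np.mean(b)
    return diff - margin, diff + margin

# Synthetic HbA1c-like values, NOT the study data
rng = np.random.default_rng(1)
poor = rng.normal(8.0, 1.2, size=500)
good = rng.normal(7.0, 0.8, size=500)

d = cohens_d(poor, good)
low, high = welch_ci(poor, good)
print(f"d={d:.3f}, 95% CI=[{low:.3f}, {high:.3f}]")
```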

In [3]:
h2 = hypotheses[1]
print(f"Normality (good outcome): {h2['normality_good']}")
print(f"Normality (poor outcome): {h2['normality_poor']}")
print(f"Equal variance: {h2['equal_variance']}")
print(f"\nTest: {h2['test']}")
print(f"Statistic: {h2['statistic']:.1f}")
print(f"p-value: {h2['p_value']}")
print(f"Effect size (Cohen's d): {h2['cohens_d']:.3f} — LARGE effect")
print(f"95% CI for mean difference: {h2['ci_95']}")
print(f"Statistical power: {h2['power']:.4f}")
print(f"\nInterpretation: {h2['interpretation']}")

Normality (good outcome): Shapiro-Wilk: W=0.9692, p=7.5935e-24, Non-normal
Normality (poor outcome): Shapiro-Wilk: W=0.9530, p=2.3597e-37, Non-normal
Equal variance: Levene: F=106.2711, p=9.1850e-25, Unequal variance

Test: Mann-Whitney U (non-parametric fallback)
Statistic: 1853934.5
p-value: 0.0
Effect size (Cohen's d): 1.447 — LARGE effect
95% CI for mean difference: [0.5258, 0.5527]
Statistical power: 1.0000

Interpretation: HbA1c is significantly higher in poor outcome group (effect size d=1.45, large). This supports the clinical hypothesis that glycemic control is associated with treatment success. Financially, each 1% reduction in HbA1c is associated with reduced complication costs.
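
A printed p-value of exactly `0.0` most likely reflects floating-point underflow rather than a probability of zero. The sketch below shows the effect with a normal tail probability and a hypothetical z-score of 50 (not a value from this analysis); working on the log scale keeps the magnitude reportable.

```python
import math
from scipy.stats import norm

z = 50.0                      # hypothetical, very large standardized statistic

p = norm.sf(z)                # upper-tail probability; too small for a float64
log10_p = norm.logsf(z) / math.log(10)   # same quantity on the log10 scale

print(f"p = {p}")
print(f"log10(p) = {log10_p:.1f}")
```

In a report, such results are usually written as a bound (for example p < 0.001) instead of `0.0`.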


## Hypothesis 3: Smoking Status Affects Outcome

**H₀:** Outcome rate is independent of smoking status (Never / Former / Current)  
**H₁:** Smoking status is associated with outcome rate  
**Test:** Chi-Square Test of Independence

**Clinical context:** Smoking is known to interfere with wound healing, drug metabolism, and cardiovascular function. We expect current smokers to have the lowest recovery rate.

In [4]:
h3 = hypotheses[2]
print(f"Test: {h3['test']}")
print(f"Chi-square: {h3['chi2']:.4f}, df={h3['dof']}")
print(f"p-value: {h3['p_value']:.2e}")
print(f"Effect size (Cramer's V): {h3['effect_size_cramers_v']:.4f}")
print(f"Power: {h3['power']:.4f}")
print(f"\nInterpretation: {h3['interpretation']}")

Test: Chi-Square Test of Independence
Chi-square: 104.6789, df=2
p-value: 1.86e-23
Effect size (Cramer's V): 0.1144
Power: 1.0000

Interpretation: Smoking status shows significant association with outcome (V=0.114). Current smokers have the lowest treatment success rate, consistent with tobacco's known interference with wound healing, drug metabolism, and cardiovascular function.


## Hypothesis 4: CVD Risk Predicts Outcome

**H₀:** CVD risk composite score is equal across outcome groups  
**H₁:** Higher CVD risk is associated with worse outcomes (one-sided)  
**Test:** Mann-Whitney U (one-sided)
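
A one-sided Mann-Whitney test can be requested in SciPy through the `alternative` argument. The sketch below uses synthetic risk scores (not the study data) and also derives a rank-based effect size, U / (n1 * n2), which estimates the probability that a randomly chosen poor-outcome patient has a higher score than a randomly chosen good-outcome patient.

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Synthetic CVD-risk-like scores, NOT the study data
rng = np.random.default_rng(2)
risk_poor = rng.normal(0.6, 0.15, size=400)
risk_good = rng.normal(0.4, 0.15, size=400)

# H1: scores in the first sample (poor) tend to be greater than in the second (good)
one_sided = mannwhitneyu(risk_poor, risk_good, alternative="greater")
two_sided = mannwhitneyu(risk_poor, risk_good, alternative="two-sided")

# In recent SciPy versions, `statistic` is U for the first sample
n1, n2 = len(risk_poor), len(risk_good)
prob_superiority = one_sided.statistic / (n1 * n2)

print(f"one-sided p={one_sided.pvalue:.2e}, two-sided p={two_sided.pvalue:.2e}")
print(f"P(poor > good) = {prob_superiority:.3f}")
```

The one-sided p-value is only valid when the direction was fixed before looking at the data, as it is in H₁ above.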

In [5]:
h4 = hypotheses[3]
print(f"Test: {h4['test']}")
print(f"Statistic: {h4['statistic']:.1f}")
print(f"p-value: {h4['p_value']}")
print(f"Effect size (Cohen's d): {h4['cohens_d']:.3f} — VERY LARGE")
print(f"Power: {h4['power']:.4f}")
print(f"\nInterpretation: {h4['interpretation']}")

Test: Mann-Whitney U (one-sided: poor > good)
Statistic: 13790822.0
p-value: 0.0
Effect size (Cohen's d): 2.123 — VERY LARGE
Power: 1.0000

Interpretation: Composite CVD risk is significantly higher in the poor outcome group (d=2.12). This validates CVD risk as a potential stratification variable for targeting interventions to high-risk patients who would benefit most from enhanced treatment protocols.


## Hypothesis 5: Metabolic Syndrome Score Affects Outcome

**H₀:** Outcome rate is independent of metabolic syndrome score (0-5)  
**H₁:** Outcome rate is associated with metabolic syndrome score (expected direction: lower success at higher scores)  
**Test:** Chi-Square Test of Independence

In [6]:
h5 = hypotheses[4]
print(f"Test: {h5['test']}")
print(f"Chi-square: {h5['chi2']:.4f}, df={h5['dof']}")
print(f"p-value: {h5['p_value']}")
print(f"Effect size (Cramer's V): {h5['effect_size_cramers_v']:.4f} — LARGE")
print(f"Power: {h5['power']:.4f}")
print(f"\nInterpretation: {h5['interpretation']}")

Test: Chi-Square Test of Independence
Chi-square: 1452.2205, df=5
p-value: 0.0
Effect size (Cramer's V): 0.4261 — LARGE
Power: 1.0000

Interpretation: Metabolic Syndrome Score shows significant association with outcome (V=0.426, Power=1.0). Spearman correlation (rho=-0.417) indicates a negative monotonic trend, confirming that higher metabolic burden reduces treatment success.
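
The chi-square test treats the score as unordered categories, which is why the interpretation adds a Spearman correlation to capture direction. A minimal sketch of that combination on simulated data follows; the linear decline in success probability is an assumption of the simulation, not a finding of this analysis.

```python
import numpy as np
from scipy.stats import spearmanr

# Simulated data, NOT the study data
rng = np.random.default_rng(3)
score = rng.integers(0, 6, size=2000)        # ordinal score 0..5
p_success = 0.6 - 0.1 * score                # assumed: 0.6 at score 0 down to 0.1 at score 5
success = (rng.random(2000) < p_success).astype(int)

rho, p_value = spearmanr(score, success)

# Observed success rate at each score level
rates = [success[score == s].mean() for s in range(6)]

print(f"rho={rho:.3f}, p={p_value:.2e}")
print("success rate by score:", np.round(rates, 3))
```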


## Multiple Testing Correction

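With 5 hypothesis tests, we apply **Bonferroni correction** (equivalent to testing each raw p-value at α/5 = 0.05/5 = 0.01), which controls the family-wise Type I error rate, and **Benjamini-Hochberg**, which controls the false discovery rate (FDR). All 5 hypotheses remain significant under both corrections.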
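
Both adjustments are simple enough to reproduce by hand. The sketch below applies them to the raw p-values reported in this notebook, rounded to three significant figures, so the last digit can differ slightly from the stored results. The function names are hypothetical and this is not necessarily how `multiple_testing_correction.csv` was produced.

```python
import numpy as np

def bonferroni(p):
    """Multiply each p-value by the number of tests, capped at 1."""
    p = np.asarray(p, dtype=float)
    return np.minimum(p * p.size, 1.0)

def benjamini_hochberg(p):
    """BH adjusted p-values: p * m / rank, made monotone from the largest rank down."""
    p = np.asarray(p, dtype=float)
    m = p.size
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.minimum(scaled, 1.0)
    return adjusted

# Raw p-values for H1..H5 as printed above (0.0 means underflow)
raw_p = [1.35e-13, 0.0, 1.86e-23, 0.0, 0.0]

bonf = bonferroni(raw_p)
bh = benjamini_hochberg(raw_p)

for i, (r, b, h) in enumerate(zip(raw_p, bonf, bh), start=1):
    print(f"H{i}: raw={r:.2e}  bonferroni={b:.2e}  bh={h:.2e}  reject={b < 0.05}")
```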

In [7]:
display(correction_df.style.format({
    'raw_p': '{:.2e}',
    'bonferroni_p': '{:.2e}',
    'bh_p': '{:.2e}'
}).set_caption('Multiple Testing Correction Results'))

| Unnamed: 0 | hypothesis | raw_p | bonferroni_p | bonferroni_reject | bh_p | bh_reject |
|-----------|------|------------|-------|------|------|------|
| 0 | Treatment arm affects outcome rate | 1.35e-13 | 6.75e-13 | True | 1.35e-13 | True |
| 1 | HbA1c levels differ by outcome | 0.0 | 0.0 | True | 0.0 | True |
| 2 | Smoking status affects outcome | 1.86e-23 | 9.29e-23 | True | 2.32e-23 | True |
| 3 | Composite CVD risk predicts outcome | 0.0 | 0.0 | True | 0.0 | True |
| 4 | Metabolic Syndrome Score affects outcome | 0.0 | 0.0 | True | 0.0 | True |


## Summary of Statistical Findings

| Hypothesis | Test | Effect Size | Power | Significant? |
|-----------|------|------------|-------|------|
| Treatment -> Outcome | Chi-Square | V=0.086 (small) | 1.0 | Yes (p<0.001) |
| HbA1c -> Outcome | Mann-Whitney | d=1.447 (large) | 1.0 | Yes (p<0.001) |
| Smoking -> Outcome | Chi-Square | V=0.114 (small-medium) | 1.0 | Yes (p<0.001) |
| CVD Risk -> Outcome | Mann-Whitney | d=2.123 (very large) | 1.0 | Yes (p<0.001) |
| Metabolic Score -> Outcome | Chi-Square | V=0.426 (large) | 1.0 | Yes (p<0.001) |

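All five null hypotheses are rejected after Bonferroni correction. With power at 1.0 for every test, significance alone says little, so effect size is the more useful guide. The CVD risk composite (d=2.123) and HbA1c (d=1.447) show the largest standardized mean differences, and the metabolic syndrome score shows the strongest categorical association (V=0.426). Treatment arm has only a small effect (V=0.086): Standard and Enhanced both outperform Control (0.378 and 0.357 vs 0.271) but do not differ significantly from each other (p=0.108). These are associations rather than causal effects, so the CVD risk composite and metabolic syndrome score are best read as candidate stratification variables for clinical intervention programs.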