# Q3: Enrollment Performance

## Business objective

Understand how patient enrollment behaves across clinical trials by disentangling:

- **Structural effects** driven by trial design choices (phase, study design, sponsor), from
- **Apparent trends** driven by shifts in the composition of trials over time or across therapeutic areas.

The goal is to provide realistic expectations for portfolio planning, feasibility assessment, and interpretation of enrollment-related findings in Q2.

The analysis covers ClinicalTrials.gov studies starting from 1990 through the **latest available data at extraction time**.

---

## Research questions

### Q3.1 — Temporal trends (composition-adjusted)

**Question.**  
Has typical trial enrollment changed over time *once trial composition is held constant*?

**Operational definition of composition.**  
Trial composition is defined by **phase**, **study design (interventional vs observational)**, and **sponsor type (industry vs non-industry)**.

**Null hypothesis (H₀).**  
Conditional on trial composition, enrollment size is independent of calendar time.
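
As a minimal illustration of what "holding composition constant" means, the toy records below keep enrollment fixed within each phase while shifting the phase mix between decades. The values are hypothetical, not registry data, and the stratified medians stand in for the regression adjustment used later.

```python
import statistics
from collections import defaultdict

# Hypothetical toy records: (start_decade, phase, enrollment).
# Within each phase, enrollment is identical across decades; only the phase mix shifts.
records = (
    [("2000s", "Phase 1", 30)] * 8 + [("2000s", "Phase 3", 300)] * 2
    + [("2010s", "Phase 1", 30)] * 2 + [("2010s", "Phase 3", 300)] * 8
)

pooled = defaultdict(list)       # enrollment by decade, ignoring composition
stratified = defaultdict(list)   # enrollment by (phase, decade)
for decade, phase, n in records:
    pooled[decade].append(n)
    stratified[(phase, decade)].append(n)

pooled_median = {d: statistics.median(v) for d, v in pooled.items()}
stratified_median = {k: statistics.median(v) for k, v in stratified.items()}

print("Pooled medians:", pooled_median)
print("Within-phase medians:", stratified_median)
```

In this toy case the pooled median rises between decades purely because Phase 3 trials become more common, while the within-phase medians do not move. That is the pattern H₀ describes.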

---

### Q3.2 — Cross-sectional drivers of enrollment

Which trial characteristics are most strongly associated with enrollment size, and how large are these effects relative to one another?

This section focuses on **effect sizes**, not just statistical significance.

---

### Q3.3 — Therapeutic profiling of enrollment

Which conditions attract the most participants, and **in what sense**?

Rather than a single notion of "attractiveness," we profile conditions along **three complementary dimensions**:

- **Total enrollment** (ecosystem-level patient volume),
- **Trial count** (research intensity), and
- **Typical trial size** (median enrollment per trial; proxy for operational scale).
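
The three dimensions can rank conditions differently. The sketch below uses hypothetical condition names and enrollment values (not registry data) to show one way of computing the profile and the leader on each dimension.

```python
import statistics
from collections import defaultdict

# Hypothetical (condition, enrollment) pairs, for illustration only.
trials = [
    ("Condition A", 2000), ("Condition A", 2000),
    ("Condition B", 100), ("Condition B", 100), ("Condition B", 100),
    ("Condition B", 100), ("Condition B", 100),
    ("Condition C", 3000),
]

by_condition = defaultdict(list)
for condition, enrollment in trials:
    by_condition[condition].append(enrollment)

# One row per condition, one column per profiling dimension.
profile = {
    c: {
        "total_enrollment": sum(v),
        "trial_count": len(v),
        "median_enrollment": statistics.median(v),
    }
    for c, v in by_condition.items()
}

# The top-ranked condition on each dimension.
leaders = {
    metric: max(profile, key=lambda c: profile[c][metric])
    for metric in ("total_enrollment", "trial_count", "median_enrollment")
}
print(leaders)
```

In this toy data a different condition leads on each dimension, which is why a single "most participants" ranking would be ambiguous.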

---

## Scope and data considerations

- **Population:** ClinicalTrials.gov studies with a valid start year.
- **Primary analysis set:** Trials with enrollment > 0.
- **Missing enrollment:** Analyzed separately; evidence suggests non-random missingness (MNAR plausible but not formally established).
- **Interpretation:** Results describe registry-reported studies and reporting practices; all findings are descriptive, not causal.

---

## Methodological notes

Enrollment is heavy-tailed. We therefore rely on **median and IQR** as primary summaries and use **log(enrollment)** for regression-based analyses.
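
A small sketch of why these choices matter, using hypothetical enrollment counts (mostly small trials plus one very large study). The geometric mean, `exp(mean(log x))`, is shown as the log-scale analogue of a typical value; it is an illustration, not a statistic the notebook reports.

```python
import math
import statistics

# Hypothetical enrollment counts with a single extreme value.
enrollment = [20, 30, 45, 60, 70, 90, 120, 200, 400, 1_000_000]

mean_enr = statistics.mean(enrollment)
median_enr = statistics.median(enrollment)
q1, _, q3 = statistics.quantiles(enrollment, n=4, method="inclusive")

# Averaging on the log scale and back-transforming gives the geometric mean,
# which is far less sensitive to the extreme value than the arithmetic mean.
geo_mean = math.exp(statistics.mean(math.log(x) for x in enrollment))

print(f"mean={mean_enr:,.1f}  median={median_enr:,.1f}  "
      f"IQR={q3 - q1:,.1f}  geometric mean={geo_mean:,.1f}")
```

Differences on the log scale correspond to ratios on the original scale, which is why log(enrollment) models are read as multiplicative effects.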


In [1]:
# ============================================================
# Setup & Configuration (Q3)
# ============================================================

from __future__ import annotations

import sys
from pathlib import Path

import numpy as np
import pandas as pd
from scipy.stats import kruskal, mannwhitneyu, spearmanr, pearsonr, shapiro
from IPython.display import display, Markdown

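# ----------------------------
# Project root discovery (robust)
# ----------------------------
# NOTE: this first `src` import only works if the project is already importable
# (e.g. the kernel was started in the project root or the package is installed).
# The sys.path insertion below covers the remaining `src.*` imports.
from src.utils.notebook import find_project_root, check_dependencies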

PROJECT_ROOT = find_project_root()

# Make project importable (required for src.* imports)
if str(PROJECT_ROOT) not in sys.path:
    sys.path.insert(0, str(PROJECT_ROOT))

# ----------------------------
# Imports from project
# ----------------------------
from src.data.loader import load_sql_query, get_db_connection
from src.analysis.constants import (
    PHASE_ORDER_CLINICAL,
    COHORT_LABELS,
    ENROLLMENT_TYPE_BUCKETS,
)
from src.analysis.viz import (
    DEFAULT_COLORS,
    create_grouped_box_plot,
)
from src.analysis.metrics import (
    calc_missingness_by_dimension,
    calc_cramers_v,
    calc_enrollment_coverage,
    calc_enrollment_type_breakdown,
    assess_temporal_missingness,
    validate_abt,
    interpret_effect_size,
    # Group comparison helpers
    summarize_by_group,
    kruskal_with_epsilon,
    pairwise_mannwhitney,
    analyze_enrollment_by_factor,
)

# ----------------------------
# Paths & reproducibility
# ----------------------------
DB_PATH = PROJECT_ROOT / "data" / "database" / "clinical_trials.db"
SQL_PATH = PROJECT_ROOT / "sql" / "queries"

EXTRACTION_DATE = "2026-01-18"  # must match DB extraction metadata / pipeline run

RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)

# ----------------------------
# Display / pandas options
# ----------------------------
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", 100)
pd.set_option("display.precision", 2)

# Quick config echo (helps when screenshots get shared)
display(Markdown(f"""
**Config**
- Project root: `{PROJECT_ROOT}`
- DB: `{DB_PATH}`
- SQL path: `{SQL_PATH}`
- Extraction date: `{EXTRACTION_DATE}`
- Random seed: `{RANDOM_SEED}`
"""))

# ============================================================
# Shared Variables Registry
# ============================================================
# These variables are computed in sections and referenced in conclusions.
# Initializing them here ensures conclusion cells fail fast if prerequisites
# are not run, rather than silently using None values.

# Section 2: Effect sizes (initialized as None, set by respective cells)
epsilon_sq_phase: float | None = None
epsilon_sq_sponsor: float | None = None
epsilon_sq_design: float | None = None

# Section 3: Temporal analysis
rho: float | None = None  # Spearman correlation (time vs enrollment)
epsilon_sq_cohort: float | None = None
pct_decade3: float | None = None  # Adjusted %/decade from Model 3
p3: float | None = None  # p-value for time coefficient in Model 3
ci3: tuple[float, float] | None = None  # CI for time coefficient
interaction_significant: bool | None = None  # Time×Phase interaction test
pct_excluded: float | None = None  # % excluded due to non-clinical phases

# Section 4: Conditions
overlap_all: int | None = None  # Overlap count across rankings


**Config**
- Project root: `/Users/pedro/Work/Clinical-Trial-Analytics`
- DB: `/Users/pedro/Work/Clinical-Trial-Analytics/data/database/clinical_trials.db`
- SQL path: `/Users/pedro/Work/Clinical-Trial-Analytics/sql/queries`
- Extraction date: `2026-01-18`
- Random seed: `42`


In [2]:
# ============================================================
# Database connectivity & EXTRACTION_DATE validation
# ============================================================

assert DB_PATH.exists(), f"Database not found at {DB_PATH}"

# Lightweight connectivity test (open & close immediately)
try:
    with get_db_connection(DB_PATH) as conn:
        _ = pd.read_sql("SELECT 1;", conn)
except Exception as e:
    raise RuntimeError("Database connection test failed") from e

# ============================================================
# EXTRACTION_DATE validation
# ============================================================
# Verify that EXTRACTION_DATE is consistent with data in DB.

with get_db_connection(DB_PATH) as conn:
    date_check = pd.read_sql("""
        SELECT 
            MAX(start_date) AS max_start,
            MAX(completion_date) AS max_completion
        FROM studies
        WHERE start_date IS NOT NULL
    """, conn)
    
max_start = date_check["max_start"].iloc[0]
max_completion = date_check["max_completion"].iloc[0]

# Parse extraction date
extraction_dt = pd.to_datetime(EXTRACTION_DATE)

# Validation: extraction date should be after max observed dates
validation_passed = True
validation_msgs = []

if max_start and pd.to_datetime(max_start) > extraction_dt:
    validation_msgs.append(f"WARNING: max(start_date) = {max_start} is AFTER EXTRACTION_DATE")
    validation_passed = False

if max_completion and pd.to_datetime(max_completion) > extraction_dt:
    validation_msgs.append(f"WARNING: max(completion_date) = {max_completion} is AFTER EXTRACTION_DATE")
    validation_passed = False

if validation_passed and not validation_msgs:
    display(Markdown(f"""
**Database connectivity check passed**

**EXTRACTION_DATE validation:**
- Configured: `{EXTRACTION_DATE}`
- DB max(start_date): `{max_start}`
- DB max(completion_date): `{max_completion}`
- Dates are consistent
"""))
else:
    display(Markdown(f"""
**Database connectivity check passed**

**EXTRACTION_DATE validation warnings:**
- Configured: `{EXTRACTION_DATE}`
- DB max(start_date): `{max_start}`
- DB max(completion_date): `{max_completion}`

{chr(10).join(validation_msgs)}

*Review EXTRACTION_DATE in cell 1 if this is unexpected.*
"""))


**Database connectivity check passed**

**EXTRACTION_DATE validation warnings:**
- Configured: `2026-01-18`
- DB max(start_date): `2097-11-01`
- DB max(completion_date): `2100-12-01`

WARNING: max(start_date) = 2097-11-01 is AFTER EXTRACTION_DATE
WARNING: max(completion_date) = 2100-12-01 is AFTER EXTRACTION_DATE

*Review EXTRACTION_DATE in cell 1 if this is unexpected.*


In [3]:
# ============================================================
# Load ABT (Analytical Base Table)
# ============================================================
# Note: ClinicalTrials.gov uses "study" records; we refer to them as "trials".

# Create database connection (kept open for subsequent queries; closed in Cleanup)
conn = get_db_connection(DB_PATH)

df_abt = load_sql_query(
    'q3_enrollment_abt.sql',
    conn,
    SQL_PATH,
    params={'extraction_date': EXTRACTION_DATE}
)

# ============================================================
# 1. Validation (using helper)
# ============================================================
extraction_year = int(EXTRACTION_DATE[:4])
validation = validate_abt(
    df_abt,
    required_cols=['enrollment_type'],
    year_range=(1990, extraction_year),
)
n_trials = validation['n_rows']
min_year = validation['year_min']
max_year = validation['year_max']

# ============================================================
# 2. Enrollment coverage (using helper)
# ============================================================
coverage_df, coverage_stats = calc_enrollment_coverage(df_abt)
n_positive = coverage_stats['n_positive']

# Validate has_enrollment flag consistency
_is_positive = df_abt['enrollment'].fillna(0) > 0
assert (df_abt['has_enrollment'] == _is_positive.astype(int)).all(), \
    "has_enrollment flag inconsistent with enrollment values"

# ============================================================
# 3. Enrollment type breakdown (using helper)
# ============================================================
type_df, type_pcts = calc_enrollment_type_breakdown(df_abt)
pct_actual = type_pcts['pct_actual']
pct_anticipated = type_pcts['pct_anticipated']
pct_other = type_pcts['pct_other']
n_pos = type_pcts['n_total']

# ============================================================
# 4. Log-enrollment (only on primary analysis set)
# ============================================================
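# NOTE: We use log(x), not log1p(x), because the primary analysis set
#       is enrollment > 0. This preserves multiplicative interpretation.
# Mask missing/non-positive values first so np.log never sees 0
# (same result as before, without divide-by-zero RuntimeWarnings).
_enr_positive = df_abt['enrollment'].where(df_abt['enrollment'].fillna(0) > 0)
df_abt['log_enrollment'] = np.log(_enr_positive)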

# ============================================================
# 5. Output summary
# ============================================================
display(Markdown(f"""### ABT Summary

**Loaded:** {n_trials:,} trials | Start year: {min_year}–{max_year} | Extraction: {EXTRACTION_DATE}

---

**Enrollment coverage:**
"""))
display(coverage_df.style.format({'Count': '{:,}', 'Share': '{:.1%}'}).hide(axis='index'))

display(Markdown(f"""
---

**Enrollment type breakdown** (among trials with enrollment > 0, n={n_pos:,}):
"""))
display(type_df.style.format({'Count': '{:,}', 'Share': '{:.1%}'}).hide(axis='index'))

# Primary analysis set definition
pct_primary = n_positive / n_trials * 100

display(Markdown(f"""
---

### Primary analysis set

**Definition:** Trials with `enrollment > 0` (n = {n_positive:,}, {pct_primary:.1f}% of loaded trials).

**Enrollment reporting breakdown:**
- **ACTUAL:** {pct_actual:.1f}%
- **ANTICIPATED:** {pct_anticipated:.1f}%
- **OTHER/UNKNOWN:** {pct_other:.1f}%

**Note:** `log_enrollment` uses `log(enrollment)` (not `log1p`) to preserve multiplicative interpretation. Missing/zero enrollment → `log_enrollment = NaN`.
"""))


### ABT Summary

**Loaded:** 82,707 trials | Start year: 1990–2025 | Extraction: 2026-01-18

---

**Enrollment coverage:**


Category,Count,Share
Enrollment > 0,79441,96.1%
Enrollment missing (NULL),3266,3.9%



---

**Enrollment type breakdown** (among trials with enrollment > 0, n=79,441):


Enrollment Type,Count,Share
ACTUAL,58233,73.3%
OTHER/UNKNOWN,21208,26.7%



---

### Primary analysis set

**Definition:** Trials with `enrollment > 0` (n = 79,441, 96.1% of loaded trials).

**Enrollment reporting breakdown:**
- **ACTUAL:** 73.3%
- **ANTICIPATED:** 0.0%
- **OTHER/UNKNOWN:** 26.7%

**Note:** `log_enrollment` uses `log(enrollment)` (not `log1p`) to preserve multiplicative interpretation. Missing/zero enrollment → `log_enrollment = NaN`.


---

## 1. Data Quality and Distribution

Before analyzing enrollment patterns, we assess data quality: missingness rates by trial characteristics, and the statistical distribution of enrollment values. This informs methodological choices (e.g., median vs mean, log transformation) and identifies potential selection biases.

### 1.1 Enrollment missingness profile (selection-bias check)

Primary analyses in this notebook focus on trials with reported enrollment greater than zero. Before imposing this restriction, we examine where enrollment information is missing (or recorded as zero) and whether such missingness is systematically related to observable trial characteristics.

**What this analysis can establish:**
- Evidence against **MCAR** (Missing Completely At Random) when missingness is associated with observed trial features.

**What it cannot establish:**
- **MNAR** formally, which would require explicit sensitivity analyses or external validation data.

Given the large sample size, statistical significance alone is not informative. Interpretation therefore emphasizes effect sizes and practical relevance, rather than p-values.
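
To make the effect-size emphasis concrete, here is a minimal sketch of Cramér's V computed from a contingency table. It uses the plain Pearson chi-square without continuity correction, so it may differ slightly from the notebook's `calc_cramers_v` helper, whose internals are not shown here. The counts are approximate, back-calculated from the rounded sponsor rates in the Section 1.1 output.

```python
import math

def cramers_v(table):
    """Cramér's V for an r x c table of counts (Pearson chi-square, no continuity correction)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for row, r_tot in zip(table, row_totals):
        for observed, c_tot in zip(row, col_totals):
            expected = r_tot * c_tot / n
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(row_totals), len(col_totals)) - 1
    return math.sqrt(chi2 / (n * k))

# Rows: Non-industry, Industry. Columns: missing/zero, reported.
# Approximate counts reconstructed from rounded percentages (an assumption).
sponsor_table = [[2629, 59518], [637, 19923]]
v = cramers_v(sponsor_table)
print(f"Cramér's V = {v:.3f}")
```

With more than 80,000 trials, a one-percentage-point difference in missingness is easily "significant", yet V stays close to zero. This is why the interpretation relies on V rather than on the p-value.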

In [4]:
# ============================================================
# 1.1 Enrollment Missingness Profile
# ============================================================

from scipy.stats import chi2_contingency

display(Markdown("**Enrollment missing/zero rate by key dimensions**"))

tables = []

# Phase
tables.append(
    calc_missingness_by_dimension(df_abt, "phase_group")
    .assign(dimension="Phase")
    .rename(columns={"phase_group": "group"})
)

# Study design (interventional vs observational)
tables.append(
    calc_missingness_by_dimension(
        df_abt, "is_interventional",
        label_map={1: "Interventional", 0: "Observational"}
    )
    .assign(dimension="Study design")
    .rename(columns={"is_interventional": "group"})
)

# Sponsor class
tables.append(
    calc_missingness_by_dimension(
        df_abt, "is_industry_sponsor",
        label_map={1: "Industry", 0: "Non-industry"}
    )
    .assign(dimension="Sponsor")
    .rename(columns={"is_industry_sponsor": "group"})
)

missingness_tbl = pd.concat(tables, ignore_index=True)[["dimension", "group", "n", "pct_missing"]]
missingness_tbl = missingness_tbl.sort_values(["dimension", "pct_missing"], ascending=[True, False])

display(
    missingness_tbl
    .style
    .format({"n": "{:,.0f}", "pct_missing": "{:.2f}%"})
    .hide(axis="index")
)

display(Markdown("*Missing/zero = `has_enrollment = 0` (registry enrollment NULL or 0).*"))

# Overall (derived)
overall_missing = (1 - df_abt["has_enrollment"].mean()) * 100
display(Markdown(f"**Overall missing/zero:** {overall_missing:.2f}%"))

# ============================================================
# TIME-VARYING MISSINGNESS CHECK (using helper)
# ============================================================
display(Markdown("---\n**Time-varying missingness (critical for Section 3)**"))

# Use helper function for temporal assessment
temporal_result = assess_temporal_missingness(df_abt, cohort_order=COHORT_LABELS)

display(Markdown("**Missingness by start-year cohort:**"))
display(temporal_result['cohort_stats'].style.format({"n": "{:,.0f}", "pct_missing": "{:.1f}%"}).hide(axis="index"))

# Extract results for display
rho_miss = temporal_result['rho']
p_miss = temporal_result['p_value']
miss_min = temporal_result['miss_min']
miss_max = temporal_result['miss_max']
miss_range = temporal_result['range_pp']
severity = temporal_result['severity']

display(Markdown(f"""
**Temporal association:** Spearman ρ = {rho_miss:.3f} (p {"< 0.001" if p_miss < 0.001 else f"= {p_miss:.3f}"})

**Range across cohorts:** {miss_min:.1f}% – {miss_max:.1f}% (Δ = {miss_range:.1f} pp) → **{severity}** variation

{temporal_result['warning']}

*If enrollment reporting has changed over time, the "no trend" finding in Section 3 may reflect selection rather than true stability.*
"""))

# ----------------------------
# Inferential: sponsor × missingness (using helper for Cramér's V)
# ----------------------------
display(Markdown("---\n**Inferential check: sponsor class × missing/zero enrollment**"))

ct = pd.crosstab(df_abt["is_industry_sponsor"], df_abt["has_enrollment"])
chi2, p_val, dof, _ = chi2_contingency(ct)

n = ct.to_numpy().sum()
r, c = ct.shape
cramers_v = calc_cramers_v(chi2, n, min(r - 1, c - 1))
effect_label = interpret_effect_size(cramers_v, metric="v")

p_str = "p < 0.001" if p_val < 0.001 else f"p = {p_val:.3f}"

rate_ind = (1 - df_abt.loc[df_abt["is_industry_sponsor"] == 1, "has_enrollment"].mean()) * 100
rate_non = (1 - df_abt.loc[df_abt["is_industry_sponsor"] == 0, "has_enrollment"].mean()) * 100

display(Markdown(f"""
- χ²({dof}) = {chi2:,.1f}, {p_str}  
- Industry: {rate_ind:.1f}% missing/zero | Non-industry: {rate_non:.1f}% missing/zero  
- **Cramér's V = {cramers_v:.3f}** ({effect_label} association)

**Interpretation:** The association between sponsor class and missingness is {effect_label} (Cramér's V = {cramers_v:.3f}); variation across start-year cohorts is {severity}.
We proceed with `enrollment > 0` as the primary analysis set, noting these potential selection effects.
"""))

**Enrollment missing/zero rate by key dimensions**

dimension,group,n,pct_missing
Phase,Phase 2,9676,6.18%
Phase,Phase 1/2,2388,5.99%
Phase,Early Phase 1,837,5.85%
Phase,Phase 4,4794,5.36%
Phase,Phase 2/3,1034,5.03%
Phase,Phase 1,7574,4.13%
Phase,Phase 3,6313,3.91%
Phase,Not Applicable,50091,3.21%
Sponsor,Non-industry,62147,4.23%
Sponsor,Industry,20560,3.10%


*Missing/zero = `has_enrollment = 0` (registry enrollment NULL or 0).*

**Overall missing/zero:** 3.95%

---
**Time-varying missingness (critical for Section 3)**

**Missingness by start-year cohort:**

start_cohort,n,pct_missing
1990-1999,1067,20.8%
2000-2009,14314,5.0%
2010-2019,34468,4.0%
2020-2025,32858,2.9%



**Temporal association:** Spearman ρ = 0.067 (p < 0.001)

**Range across cohorts:** 2.9% – 20.8% (Δ = 17.9 pp) → **substantial** variation

**Warning:** Temporal analyses may be biased by time-varying selection.

*If enrollment reporting has changed over time, the "no trend" finding in Section 3 may reflect selection rather than true stability.*


---
**Inferential check: sponsor class × missing/zero enrollment**


- χ²(1) = 51.9, p < 0.001  
- Industry: 3.1% missing/zero | Non-industry: 4.2% missing/zero  
- **Cramér's V = 0.025** (negligible association)

**Interpretation:** The association between sponsor class and missingness is negligible (Cramér's V = 0.025); variation across start-year cohorts is substantial.
We proceed with `enrollment > 0` as the primary analysis set, noting these potential selection effects.


### 1.2 Distribution characteristics and metric choice

This section motivates the analytical choices used throughout Q3—**median/IQR for descriptive summaries** and **log(enrollment) for regression modeling**—and documents the composition of reported enrollment.

The objective is to assess whether standard summary statistics are appropriate and to quantify the extent to which extreme values shape observed enrollment patterns.

**Structure:**
1. Enrollment type coverage (ACTUAL vs non-ACTUAL)  
2. Distributional shape and normality  
3. Concentration of extreme values  
4. Resulting metric choices
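
The extreme-value step uses an IQR-based upper threshold (Q3 + 1.5×IQR). A minimal sketch follows: the quartiles are the ones reported for this extraction (Q1 = 30, Q3 = 196), while the `toy` enrollments and the helper name `upper_fence` are hypothetical.

```python
def upper_fence(q1: float, q3: float, k: float = 1.5) -> float:
    """IQR-based upper threshold: Q3 + k * (Q3 - Q1)."""
    return q3 + k * (q3 - q1)

# Quartiles reported for this extraction.
threshold = upper_fence(30, 196)

# Hypothetical enrollments, used only to show how the tail share is computed.
toy = [20, 40, 70, 150, 300, 5_000]
tail_share = sum(x for x in toy if x > threshold) / sum(toy)

print(f"threshold={threshold:.0f}  tail share of total enrollment={tail_share:.1%}")
```

The share of total enrollment above the threshold, rather than the count of trials above it, is what indicates how strongly the mean is driven by the tail.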

In [5]:
# ============================================================
# 1.2 Distribution Characteristics and Metric Choice
# ============================================================
from scipy.stats import shapiro

# ------------------------------------------------------------
# Analysis dataset (reported enrollment only)
# ------------------------------------------------------------
df_enr = df_abt[df_abt["has_enrollment"] == 1].copy()
n_enr = len(df_enr)

# ============================================================
# 1.2.1 Enrollment type coverage (ACTUAL vs ANTICIPATED)
# ============================================================
n_actual = (df_enr["enrollment_type"] == "ACTUAL").sum()
n_anticipated = (df_enr["enrollment_type"] == "ANTICIPATED").sum()
n_other_type = n_enr - n_actual - n_anticipated

pct_actual = n_actual / n_enr * 100
pct_anticipated = n_anticipated / n_enr * 100

display(Markdown(f"""
#### 1.2.1 Enrollment type coverage (n = {n_enr:,})

| Type | Count | Share |
|------|------:|------:|
| **ACTUAL** | {n_actual:,} | {pct_actual:.1f}% |
| **ANTICIPATED** | {n_anticipated:,} | {pct_anticipated:.1f}% |
| Other/Unknown | {n_other_type:,} | {n_other_type/n_enr*100:.1f}% |

**Note:** Primary analyses pool ACTUAL and non-ACTUAL (ANTICIPATED or other/unknown) enrollment to maximize coverage.  
Restricting to ACTUAL would reduce sample size substantially; this tradeoff is noted as a limitation in Section 5.
"""))

# ============================================================
# 1.2.2 Shape of the enrollment distribution
# ============================================================
q25 = df_enr["enrollment"].quantile(0.25)
q50 = df_enr["enrollment"].quantile(0.50)
q75 = df_enr["enrollment"].quantile(0.75)
iqr = q75 - q25
mean_enr = df_enr["enrollment"].mean()
max_enr = df_enr["enrollment"].max()
skew_val = df_enr["enrollment"].skew()
kurt_val = df_enr["enrollment"].kurtosis()

display(Markdown("#### 1.2.2 Shape of the distribution"))
display(
    pd.DataFrame({
        "Statistic": ["Median", "Q1", "Q3", "IQR", "Mean", "Max", "Skewness", "Kurtosis"],
        "Value": [q50, q25, q75, iqr, mean_enr, max_enr, skew_val, kurt_val]
    }).style.format({"Value": "{:,.1f}"}).hide(axis="index")
)

# Formal normality test (Shapiro-Wilk on subsample for feasibility)
sample_size = min(5000, n_enr)
sample = df_enr["enrollment"].sample(n=sample_size, random_state=RANDOM_SEED)
w_stat, p_norm = shapiro(sample)
# Report the computed p-value instead of a hardcoded string
p_norm_str = "p < 0.001" if p_norm < 0.001 else f"p = {p_norm:.3f}"

display(Markdown(f"""
**Normality check (Shapiro–Wilk, subsample n={sample_size:,}):**  
W = {w_stat:.4f}, {p_norm_str}

This strongly rejects normality and supports non-parametric summaries.
"""))

# ============================================================
# 1.2.3 Extreme values (IQR-based)
# ============================================================
upper_iqr = q75 + 1.5 * iqr
n_extreme = (df_enr["enrollment"] > upper_iqr).sum()
pct_extreme = n_extreme / n_enr * 100

# How much of total enrollment do extremes account for?
total_enrollment = df_enr["enrollment"].sum()
extreme_enrollment = df_enr.loc[df_enr["enrollment"] > upper_iqr, "enrollment"].sum()
pct_extreme_share = extreme_enrollment / total_enrollment * 100

display(Markdown(f"""
#### 1.2.3 Extreme values (IQR-based)

**Threshold:** Q3 + 1.5×IQR = {upper_iqr:,.0f} participants

| Metric | Value |
|--------|------:|
| Trials exceeding threshold | {n_extreme:,} ({pct_extreme:.1f}%) |
| Share of total enrollment | {pct_extreme_share:.1f}% |

A small fraction of trials ({pct_extreme:.1f}%) accounts for a disproportionate share of total enrollment ({pct_extreme_share:.1f}%), reinforcing the need for robust statistics.
"""))

# ============================================================
# 1.2.4 Metric decision
# ============================================================
display(Markdown(f"""
#### 1.2.4 Metric decision

Enrollment is highly right-skewed with heavy tails:
- Median ({q50:,.0f}) is far below the mean ({mean_enr:,.0f})
- {pct_extreme:.1f}% of trials exceed the IQR-based extreme threshold
- Normality is formally rejected (Shapiro–Wilk p < 0.001)

**Metric choices for Q3:**
- **Descriptive comparisons:** Median and IQR (robust to extremes)
- **Regression modeling:** log(enrollment) (reduces leverage from extreme values)
- **Reporting:** Means are shown only for completeness and are not used for interpretation.
"""))



#### 1.2.1 Enrollment type coverage (n = 79,441)

| Type | Count | Share |
|------|------:|------:|
| **ACTUAL** | 58,233 | 73.3% |
| **ANTICIPATED** | 0 | 0.0% |
| Other/Unknown | 21,208 | 26.7% |

**Note:** Primary analyses pool ACTUAL and non-ACTUAL (ANTICIPATED or other/unknown) enrollment to maximize coverage.  
Restricting to ACTUAL would reduce sample size substantially; this tradeoff is noted as a limitation in Section 5.


#### 1.2.2 Shape of the distribution

Statistic,Value
Median,70.0
Q1,30.0
Q3,196.0
IQR,166.0
Mean,6067.2
Max,99999999.0
Skewness,173.4
Kurtosis,32179.4



**Normality check (Shapiro–Wilk, subsample n=5,000):**  
W = 0.0064, p < 0.001

This strongly rejects normality and supports non-parametric summaries.



#### 1.2.3 Extreme values (IQR-based)

**Threshold:** Q3 + 1.5×IQR = 445 participants

| Metric | Value |
|--------|------:|
| Trials exceeding threshold | 10,015 (12.6%) |
| Share of total enrollment | 98.7% |

A small fraction of trials (12.6%) accounts for a disproportionate share of total enrollment (98.7%), reinforcing the need for robust statistics.



#### 1.2.4 Metric decision

Enrollment is highly right-skewed with heavy tails:
- Median (70) is far below the mean (6,067)
- 12.6% of trials exceed the IQR-based extreme threshold
- Normality is formally rejected (Shapiro–Wilk p < 0.001)

**Metric choices for Q3:**
- **Descriptive comparisons:** Median and IQR (robust to extremes)
- **Regression modeling:** log(enrollment) (reduces leverage from extreme values)
- **Reporting:** Means are shown only for completeness and are not used for interpretation.


### 1.3 Sensitivity Check: Enrollment Reporting Type

Primary analyses pool all trials with reported enrollment, regardless of whether the registry labels them as ACTUAL or non-ACTUAL. This section evaluates whether the key distributional properties motivating median-based analysis (right-skewness, heavy tails) hold across reporting types.
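
The comparison below relies on the rank-biserial correlation derived from the Mann-Whitney U statistic. As a hedged sketch, the helpers here (`mann_whitney_u`, `rank_biserial`, both hypothetical names) compute it by brute force and then reproduce the effect magnitude from the U and group sizes reported in this section's output, assuming the reported U is the statistic for the first sample (ACTUAL), as recent SciPy versions return.

```python
def mann_whitney_u(x, y):
    """U for sample x: number of (x_i, y_j) pairs with x_i > y_j, ties counted as 0.5."""
    return sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)

def rank_biserial(u_x, n_x, n_y):
    """Rank-biserial correlation in [-1, 1]; positive when x tends to exceed y."""
    return 2.0 * u_x / (n_x * n_y) - 1.0

# Toy check: every x is below every y, so U = 0 and r = -1.
r_toy = rank_biserial(mann_whitney_u([10, 20, 30], [40, 50, 60]), 3, 3)

# Figures reported for this extraction:
# U = 509,365,975 with n = 58,233 (ACTUAL) and n = 21,208 (NON-ACTUAL).
r_reported = rank_biserial(509_365_975, 58_233, 21_208)

print(f"toy r = {r_toy:+.1f}, reported r = {r_reported:+.3f}")
```

The notebook computes `1 - 2U/(n1*n2)`, which is the same quantity with the opposite sign. Because only |r| is reported, the two conventions agree on magnitude.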

In [6]:
# ============================================================
# 1.3 Sensitivity Check: Enrollment Reporting Type
# ============================================================

# Standardize enrollment_type into ACTUAL vs non-ACTUAL
df_enr['_enroll_type_clean'] = df_enr['enrollment_type'].fillna('UNKNOWN').str.upper().str.strip()
df_enr['_enroll_type_bucket'] = df_enr['_enroll_type_clean'].apply(
    lambda x: 'ACTUAL' if x == 'ACTUAL' else 'NON-ACTUAL'
)

# Summary by type
type_summary = (
    df_enr
    .groupby('_enroll_type_bucket')['enrollment']
    .agg(
        N='count',
        Median='median',
        Mean='mean',
        Q1=lambda x: x.quantile(0.25),
        Q3=lambda x: x.quantile(0.75),
    )
    .round(1)
)

display(Markdown("**Enrollment distribution by reporting type:**"))
display(type_summary.style.format({
    'N': '{:,.0f}',
    'Median': '{:,.0f}',
    'Mean': '{:,.1f}',
    'Q1': '{:,.0f}',
    'Q3': '{:,.0f}',
}))

# Statistical comparison (ACTUAL vs NON-ACTUAL)
actual_enr = df_enr.loc[df_enr['_enroll_type_bucket'] == 'ACTUAL', 'enrollment']
nonactual_enr = df_enr.loc[df_enr['_enroll_type_bucket'] == 'NON-ACTUAL', 'enrollment']

if len(actual_enr) > 100 and len(nonactual_enr) > 100:
    u_stat, p_val = mannwhitneyu(actual_enr, nonactual_enr, alternative='two-sided')
    
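    # Effect size (rank-biserial). The sign depends on which group's U is used,
    # so only the magnitude |r| is reported; direction comes from the medians below.
    n1, n2 = len(actual_enr), len(nonactual_enr)
    r_rb = 1 - (2 * u_stat) / (n1 * n2)
    effect_mag = abs(r_rb)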
    
    effect_label = (
        "negligible" if effect_mag < 0.1 else
        "small" if effect_mag < 0.3 else
        "medium" if effect_mag < 0.5 else
        "large"
    )
    
    p_str = "p < 0.001" if p_val < 0.001 else f"p = {p_val:.3f}"
    
    median_actual = actual_enr.median()
    median_nonactual = nonactual_enr.median()
    
    # Determine direction
    direction = "larger" if median_nonactual > median_actual else "smaller"
    
    display(Markdown(f"""
**Mann-Whitney comparison (ACTUAL vs NON-ACTUAL):**
- U = {u_stat:,.0f}, {p_str}
- |r| = {effect_mag:.3f} ({effect_label} effect)
- Medians: ACTUAL = {median_actual:,.0f}, NON-ACTUAL = {median_nonactual:,.0f}

**Interpretation:**  
Trials with non-ACTUAL enrollment labels tend to have {direction} median values.
However, the characteristics motivating the analytical choices in Q3 (strong right skew,
heavy tails, and dominance of a small fraction of large trials) are present in both groups.

Pooling enrollment types does not invalidate the use of median/IQR summaries
or log-transformed models, though it may shift estimates of typical trial size
relative to an ACTUAL-only analysis.
"""))
else:
    # One group has insufficient data
    n_actual = len(actual_enr)
    n_nonactual = len(nonactual_enr)
    
    if n_actual > 0 and n_nonactual > 0:
        median_actual = actual_enr.median()
        median_nonactual = nonactual_enr.median()
        display(Markdown(f"""
**Note:** Formal comparison not performed (N: ACTUAL={n_actual:,}, NON-ACTUAL={n_nonactual:,}).

Descriptive comparison:
- ACTUAL median: {median_actual:,.0f}
- NON-ACTUAL median: {median_nonactual:,.0f}

Both groups exhibit heavy right skew (mean >> median), supporting the use of median-based summaries regardless of reporting type.
"""))
    else:
        display(Markdown("*Insufficient data in one or both groups for comparison.*"))

# Cleanup temp columns
df_enr.drop(columns=['_enroll_type_clean', '_enroll_type_bucket'], inplace=True)

**Enrollment distribution by reporting type:**

_enroll_type_bucket,N,Median,Mean,Q1,Q3
ACTUAL,58233,61,2879.3,28,166
NON-ACTUAL,21208,100,14820.4,42,250



**Mann-Whitney comparison (ACTUAL vs NON-ACTUAL):**
- U = 509,365,975, p < 0.001
- |r| = 0.175 (small effect)
- Medians: ACTUAL = 61, NON-ACTUAL = 100

**Interpretation:**  
Trials with non-ACTUAL enrollment labels tend to have larger median values.
However, the characteristics motivating the analytical choices in Q3 (strong right skew,
heavy tails, and dominance of a small fraction of large trials) are present in both groups.

Pooling enrollment types does not invalidate the use of median/IQR summaries
or log-transformed models, though it may shift estimates of typical trial size
relative to an ACTUAL-only analysis.


---

## 2. Enrollment by Study Characteristics (Q3.2)

This section quantifies how enrollment varies across key trial dimensions: **phase**, **sponsor type**, and **study design**.

The goal is to identify **structural drivers** of enrollment size and assess their **magnitude** (effect sizes), not just statistical significance. These findings inform the covariate adjustment in Section 3 (temporal trends).


### 2.1 Enrollment by Trial Phase

Phase is expected to show the strongest **marginal association** with enrollment size, reflecting differences in study objectives, statistical power requirements, and regulatory expectations.

This section provides:
1. **Robust descriptive summaries** (median/IQR) by phase
2. **Effect size quantification** (ε² from Kruskal-Wallis)
3. **Visual confirmation** of the phase gradient

**Interpretation note:** Phase is not exogenous: it correlates with sponsor type, therapeutic area, study design, and calendar time. The effect size reported here is a **marginal association**, not a causal effect. Section 3.5 provides joint adjustment for multiple factors.
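
For readers unfamiliar with ε², the sketch below computes the Kruskal-Wallis H statistic and one common definition of epsilon-squared, ε² = H / (n − 1). Whether the notebook's `kruskal_with_epsilon` helper uses exactly this definition is an assumption. The implementation omits the tie correction and uses toy data.

```python
def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def kruskal_epsilon_sq(groups):
    """Return (H, epsilon_sq) with H uncorrected for ties and epsilon_sq = H / (n - 1)."""
    pooled = [x for g in groups for x in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12.0 / (n * (n + 1)) * total - 3.0 * (n + 1)
    return h, h / (n - 1)

# Toy data: three perfectly separated groups give an effect size near the top of the scale.
h_toy, eps_toy = kruskal_epsilon_sq([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(f"H = {h_toy:.2f}, epsilon-squared = {eps_toy:.2f}")
```

Unlike H, which grows with sample size, ε² is bounded between 0 and 1, so it can be compared across factors with very different group counts.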

In [7]:
# ============================================================
# 2.1 Evidence: Enrollment by Trial Phase
# ============================================================

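# epsilon_sq_phase is a notebook-level variable (see Shared Variables Registry);
# assigning it at cell top level updates it directly, so no `global` statement is needed.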

# ---------- Analysis using helper ----------
phase_result = analyze_enrollment_by_factor(
    df_enr,
    group_col="phase_group",
    order=PHASE_ORDER_CLINICAL + ["Not Applicable"],
    posthoc_pairs=[
        ("Phase 1", "Phase 2"),
        ("Phase 2", "Phase 3"),
        ("Phase 3", "Phase 4"),
        ("Phase 1", "Phase 3"),  # Main clinical contrast
    ],
)

# Extract key values for global registry
epsilon_sq_phase = phase_result['test']['epsilon_sq']
h_stat = phase_result['test']['h_stat']
k_phase = phase_result['test']['k']

# ---------- Display results ----------
display(Markdown("**Enrollment by Trial Phase (descriptive summary)**"))
display(
    phase_result['summary'].style.format({
        "N": "{:,.0f}",
        "Median": "{:,.0f}",
        "Mean": "{:,.1f}",
        "Q1": "{:,.0f}",
        "Q3": "{:,.0f}",
    })
)

if phase_result['n_excluded'] > 0:
    display(Markdown(f"*{phase_result['n_excluded']:,} trials with missing phase excluded.*"))

display(Markdown(f"""
**Kruskal-Wallis test:**  
H({k_phase-1}) = {h_stat:,.1f}, p < 0.001  
**Effect size:** ε² = {epsilon_sq_phase:.4f} ({phase_result['test']['effect_label']})
"""))

# Post-hoc results
if phase_result['posthoc'] is not None and len(phase_result['posthoc']) > 0:
    display(Markdown("**Post-hoc pairwise comparisons (Mann-Whitney, Bonferroni-adjusted):**"))
    display(
        phase_result['posthoc'][['Comparison', 'U', 'p_adj', 'r', 'effect_label', 'sig']]
        .style.format({
            "U": "{:,.0f}",
            "p_adj": "{:.2e}",
            "r": "{:.3f}",
        }).hide(axis="index")
    )

# ---------- Visual ----------
df_phase = df_enr[df_enr["phase_group"].notna()].copy()
phase_counts = df_phase.groupby("phase_group").size()
phase_labels_with_n = {
    p: f"{p}\n(n={phase_counts[p]:,})"
    for p in PHASE_ORDER_CLINICAL
    if p in phase_counts
}

df_phase_plot = df_phase[df_phase["phase_group"].isin(phase_labels_with_n)].copy()
df_phase_plot["phase_label"] = df_phase_plot["phase_group"].map(phase_labels_with_n)

fig = create_grouped_box_plot(
    df_phase_plot,
    x_col="phase_label",
    y_col="enrollment",
    category_order=list(phase_labels_with_n.values()),
    title="Enrollment by Trial Phase",
    subtitle="Clinical phases only · log scale",
)
fig.show()

# ---------- Interpretation ----------
display(Markdown(f"""
**Interpretation.**  
Enrollment differs across phases, with higher values in later development stages.
The effect size (ε² = {epsilon_sq_phase:.4f}) indicates a {phase_result['test']['effect_label']} effect.
Post-hoc comparisons confirm Phase 1 vs Phase 3 differs substantially.

**Implication.**  
Trial phase shows the largest association with enrollment size among the factors examined and should be controlled for in temporal or cross-sectional comparisons.
"""))

**Enrollment by Trial Phase (descriptive summary)**

phase_group,N,Median,Mean,Q1,Q3
Early Phase 1,788,24,66.3,11,56
Phase 1,7261,30,51.2,17,53
Phase 1/2,2245,40,76.7,20,80
Phase 2,9078,52,110.3,26,120
Phase 2/3,982,88,361.8,40,220
Phase 3,6066,240,796.8,94,530
Phase 4,4537,80,768.8,36,186
Not Applicable,48484,80,9729.1,36,208



**Kruskal-Wallis test:**  
H(7) = 9,307.0, p < 0.001  
**Effect size:** ε² = 0.1171 (medium)


**Post-hoc pairwise comparisons (Mann-Whitney, Bonferroni-adjusted):**

Comparison,U,p_adj,r,effect_label,sig
Phase 1 vs Phase 2,22462436,2.57e-268,0.318,medium,***
Phase 2 vs Phase 3,11702647,0.0,0.575,large,***
Phase 3 vs Phase 4,19382240,5.85e-284,-0.409,medium,***
Phase 1 vs Phase 3,5151014,0.0,0.766,large,***



**Interpretation.**  
Enrollment differs across phases, with higher values in later development stages.
The effect size (ε² = 0.1171) indicates a medium effect.
Post-hoc comparisons confirm Phase 1 vs Phase 3 differs substantially.

**Implication.**  
Trial phase shows the largest association with enrollment size among the factors examined and should be controlled for in temporal or cross-sectional comparisons.


### 2.2 Enrollment by Sponsor Type

Sponsor class may influence enrollment through funding capacity and portfolio focus.  
Here we quantify whether sponsor type is a meaningful enrollment driver **relative to phase**.

We report:
- Robust descriptive summaries (median/IQR)
- Mann–Whitney test (non-parametric) with effect size (|rank-biserial r|)
- Context relative to the phase effect from Section 2.1 (ε²)
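The rank-biserial effect size listed above can be read as "the share of cross-group pairs in which one group is larger, minus the share in which it is smaller". The sketch below computes it by direct pair counting, which is only practical for small samples. The function names and toy values are hypothetical, and the sign convention (positive when the first group tends to be larger) is an assumption; the notebook reports |r|, so the sign does not affect its conclusions.

```python
def rank_biserial(x, y):
    """Rank-biserial r via pair counting; ties count as half a win."""
    n1, n2 = len(x), len(y)
    u1 = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)
    return 2 * u1 / (n1 * n2) - 1


def label_effect(r):
    """Cohen-style magnitude labels, matching the thresholds used in Section 2.3."""
    mag = abs(r)
    if mag < 0.1:
        return "negligible"
    if mag < 0.3:
        return "small"
    if mag < 0.5:
        return "medium"
    return "large"


# Hypothetical enrollment values (not taken from the dataset)
industry_toy = [80, 120, 200, 45]
nonindustry_toy = [30, 66, 50, 90, 20]

r_toy = rank_biserial(industry_toy, nonindustry_toy)
print(f"r = {r_toy:.3f} ({label_effect(r_toy)})")
```

On the full dataset the same quantity is obtained far more cheaply from the U statistic, as the cell below does.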

In [8]:
# ============================================================
# 2.2 Evidence: Enrollment by Sponsor Type
# ============================================================

# Use global epsilon_sq_phase (defined in setup, set by Section 2.1)
global epsilon_sq_phase
phase_effect = epsilon_sq_phase

# Descriptive summary
sponsor_summary = (
    df_enr
    .groupby("is_industry_sponsor")["enrollment"]
    .agg(
        N="count",
        Median="median",
        Mean="mean",
        Q1=lambda x: x.quantile(0.25),
        Q3=lambda x: x.quantile(0.75),
    )
    .rename(index={0: "Non-Industry", 1: "Industry"})
    .round(1)
)

display(Markdown("**Enrollment by Sponsor Type (descriptive summary)**"))
display(
    sponsor_summary.style.format({
        "N": "{:,.0f}",
        "Median": "{:,.0f}",
        "Mean": "{:,.1f}",
        "Q1": "{:,.0f}",
        "Q3": "{:,.0f}",
    })
)

# Mann–Whitney U + rank-biserial effect size
industry_enr = df_enr.loc[df_enr["is_industry_sponsor"] == 1, "enrollment"]
nonindustry_enr = df_enr.loc[df_enr["is_industry_sponsor"] == 0, "enrollment"]

u_stat, p_val = mannwhitneyu(industry_enr, nonindustry_enr, alternative="two-sided")

n1, n2 = len(industry_enr), len(nonindustry_enr)
rank_biserial_sponsor = 1 - (2 * u_stat) / (n1 * n2)
effect_mag = abs(rank_biserial_sponsor)

# Cohen (1988) conventions for r (same thresholds as Section 2.3)
effect_label = (
    "negligible" if effect_mag < 0.1 else
    "small" if effect_mag < 0.3 else
    "medium" if effect_mag < 0.5 else
    "large"
)

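# Epsilon-squared (for comparability with phase in conclusion)
# Note: (H - k + 1) / (n - k) is the form often labeled eta-squared (η²_H) for Kruskal–Wallis;
# at this sample size it is numerically very close to ε² = H / (n - 1).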
global epsilon_sq_sponsor
k = 2
h_sponsor, _ = kruskal(industry_enr, nonindustry_enr)
epsilon_sq_sponsor = (h_sponsor - k + 1) / (n1 + n2 - k)

p_str = "p < 0.001" if p_val < 0.001 else f"p = {p_val:.3f}"

median_ind = sponsor_summary.loc["Industry", "Median"]
median_nonind = sponsor_summary.loc["Non-Industry", "Median"]

phase_context = (
    f"phase ε² = {phase_effect:.4f}" if phase_effect is not None else "phase ε² not available"
)

display(Markdown(f"""
**Mann–Whitney test:** U = {u_stat:,.0f}, {p_str}, |r| = {effect_mag:.3f} ({effect_label} effect)

**Interpretation:** Industry-sponsored trials have higher median enrollment ({median_ind:,.0f} vs {median_nonind:,.0f}),
but the effect magnitude is {effect_label}. With this sample size, the difference is statistically detectable even when
its practical size is small.

**Comparable effect size (ε²):** sponsor ε² = {epsilon_sq_sponsor:.4f} | {phase_context}
"""))

# ============================================================
# Phase-Stratified Analysis: Does sponsor effect persist within phases?
# ============================================================
# The univariate test may be confounded by phase composition.
# Industry sponsors may prefer later phases with inherently larger enrollment.
# Stratified analysis tests whether sponsor differences persist within phase strata.

display(Markdown("""
---
#### Phase-stratified sponsor comparison

The univariate comparison above may be confounded if industry sponsors concentrate in later phases
(which have larger enrollment). Below we test sponsor differences **within each phase stratum**.
"""))

# Stratified analysis
strat_results = []
for phase in PHASE_ORDER_CLINICAL:
    df_phase = df_enr[df_enr["phase_group"] == phase]
    ind = df_phase.loc[df_phase["is_industry_sponsor"] == 1, "enrollment"]
    non = df_phase.loc[df_phase["is_industry_sponsor"] == 0, "enrollment"]
    
    if len(ind) >= 20 and len(non) >= 20:
        u, p = mannwhitneyu(ind, non, alternative="two-sided")
        r = 1 - (2 * u) / (len(ind) * len(non))
        strat_results.append({
            "Phase": phase,
            "Industry N": len(ind),
            "Industry Median": ind.median(),
            "Non-Industry N": len(non),
            "Non-Industry Median": non.median(),
            "|r|": abs(r),
            "p-value": p,
        })

if strat_results:
    df_strat = pd.DataFrame(strat_results)
    df_strat["Sig"] = df_strat["p-value"].apply(lambda p: "***" if p < 0.001 else "**" if p < 0.01 else "*" if p < 0.05 else "")
    
    display(df_strat.style.format({
        "Industry N": "{:,.0f}",
        "Industry Median": "{:,.0f}",
        "Non-Industry N": "{:,.0f}",
        "Non-Industry Median": "{:,.0f}",
        "|r|": "{:.3f}",
        "p-value": "{:.4f}",
    }).hide(axis="index"))
    
    # Summarize
    sig_phases = df_strat[df_strat["p-value"] < 0.05]["Phase"].tolist()
    max_r = df_strat["|r|"].max()
    
    if sig_phases:
        display(Markdown(f"""
**Within-phase findings:** Sponsor differences remain statistically significant in {len(sig_phases)}/{len(df_strat)} phases,
with effect sizes ranging up to |r| = {max_r:.3f}. The sponsor effect appears to be a genuine (though modest) structural factor,
not purely an artifact of phase confounding.
"""))
    else:
        display(Markdown("""
**Within-phase findings:** Sponsor differences are not significant within individual phases,
suggesting the univariate difference may be partly confounded by phase composition.
"""))
else:
    display(Markdown("*Insufficient sample sizes for stratified analysis.*"))

display(Markdown("""
Sponsor type is treated as a **secondary control** in Section 3.
"""))

**Enrollment by Sponsor Type (descriptive summary)**

is_industry_sponsor,N,Median,Mean,Q1,Q3
Non-Industry,59518,66,7575.4,30,171
Industry,19923,80,1561.4,31,250



**Mann–Whitney test:** U = 630,785,061, p < 0.001, |r| = 0.064 (negligible effect)

**Interpretation:** Industry-sponsored trials have higher median enrollment (80 vs 66),
but the effect magnitude is negligible. With this sample size, the difference is statistically detectable even when
its practical size is small.

**Comparable effect size (ε²):** sponsor ε² = 0.0023 | phase ε² = 0.1171



---
#### Phase-stratified sponsor comparison

The univariate comparison above may be confounded if industry sponsors concentrate in later phases
(which have larger enrollment). Below we test sponsor differences **within each phase stratum**.


Phase,Industry N,Industry Median,Non-Industry N,Non-Industry Median,|r|,p-value,Sig
Early Phase 1,84,20,704,25,0.167,0.0122,*
Phase 1,4735,33,2526,24,0.211,0.0,***
Phase 1/2,942,50,1303,33,0.247,0.0,***
Phase 2,3445,86,5633,43,0.323,0.0,***
Phase 2/3,247,200,735,72,0.332,0.0,***
Phase 3,3625,317,2441,150,0.261,0.0,***
Phase 4,1065,118,3472,70,0.209,0.0,***



**Within-phase findings:** Sponsor differences remain statistically significant in 7/7 phases,
with effect sizes ranging up to |r| = 0.332. The sponsor effect appears to be a genuine (though modest) structural factor,
not purely an artifact of phase confounding.



Sponsor type is treated as a **secondary control** in Section 3.
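The pooled effect (|r| = 0.064) is smaller than every within-phase effect in the table above (up to |r| = 0.332). The synthetic sketch below shows one way such a pattern can arise: when two sponsor groups are distributed very differently across strata, a pooled comparison can understate, or even reverse, a difference that holds inside every stratum. All values and names are hypothetical, and whether this mechanism explains the notebook's results is not established here.

```python
def rank_biserial(x, y):
    """Rank-biserial r via pair counting; positive when x tends to exceed y."""
    u1 = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)
    return 2 * u1 / (len(x) * len(y)) - 1


# Hypothetical strata: industry trials are larger within each stratum,
# but most industry trials sit in the small-enrollment stratum.
strata = {
    "early": {"industry": [40, 45, 50, 55, 60, 65], "non_industry": [20, 30]},
    "late": {"industry": [500, 600], "non_industry": [200, 250, 300, 350, 400, 450]},
}

within = {
    name: rank_biserial(groups["industry"], groups["non_industry"])
    for name, groups in strata.items()
}

pooled_industry = [v for g in strata.values() for v in g["industry"]]
pooled_non = [v for g in strata.values() for v in g["non_industry"]]
pooled = rank_biserial(pooled_industry, pooled_non)

for name, r in within.items():
    print(f"{name}: r = {r:+.3f}")
print(f"pooled: r = {pooled:+.3f}")
```

This is why the stratified table, rather than the univariate test alone, is the more informative basis for judging the sponsor association.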


### 2.3 Enrollment by Study Design

Interventional and observational designs differ fundamentally in participant burden, eligibility constraints, and data collection mechanisms. These differences may translate into systematic differences in achievable enrollment sizes.

This section quantifies the magnitude of enrollment differences by **study design** and assesses whether design choice represents a meaningful structural driver, relative to phase and sponsor effects examined above.

In [9]:
# ============================================================
# 2.3 Evidence: Enrollment by Study Design
# ============================================================

# Use global effect sizes (defined in setup, set by Sections 2.1-2.2)
global epsilon_sq_phase, epsilon_sq_sponsor, epsilon_sq_design

# ---------- Descriptive summary ----------
design_summary = (
    df_enr
    .groupby("is_interventional")["enrollment"]
    .agg(
        N="count",
        Median="median",
        Mean="mean",
        Q1=lambda x: x.quantile(0.25),
        Q3=lambda x: x.quantile(0.75),
    )
    .rename(index={0: "Observational", 1: "Interventional"})
    .round(1)
)

display(Markdown("**Enrollment by Study Design (descriptive summary)**"))
display(
    design_summary.style.format({
        "N": "{:,.0f}",
        "Median": "{:,.0f}",
        "Mean": "{:,.1f}",
        "Q1": "{:,.0f}",
        "Q3": "{:,.0f}",
    })
)

# ---------- Mann–Whitney U + effect size ----------
obs_enr = df_enr.loc[df_enr["is_interventional"] == 0, "enrollment"]
int_enr = df_enr.loc[df_enr["is_interventional"] == 1, "enrollment"]

u_stat, p_val = mannwhitneyu(obs_enr, int_enr, alternative="two-sided")

n_obs, n_int = len(obs_enr), len(int_enr)
rank_biserial = 1 - (2 * u_stat) / (n_obs * n_int)
effect_mag = abs(rank_biserial)

# Cohen (1988) conventions for r
effect_label = (
    "negligible" if effect_mag < 0.1 else
    "small" if effect_mag < 0.3 else
    "medium" if effect_mag < 0.5 else
    "large"
)

p_str = "p < 0.001" if p_val < 0.001 else f"p = {p_val:.3f}"

median_obs = design_summary.loc["Observational", "Median"]
median_int = design_summary.loc["Interventional", "Median"]
direction = "higher" if median_obs > median_int else "lower" if median_obs < median_int else "similar"

# ---------- Epsilon-squared (comparable with phase/sponsor) ----------
k = 2  # number of groups
h_design, _ = kruskal(obs_enr, int_enr)
epsilon_sq_design = (h_design - k + 1) / (n_obs + n_int - k)

# Build context line with available effect sizes
context_parts = [f"design ε² = {epsilon_sq_design:.4f}"]
if epsilon_sq_sponsor is not None:
    context_parts.append(f"sponsor ε² = {epsilon_sq_sponsor:.4f}")
if epsilon_sq_phase is not None:
    context_parts.append(f"phase ε² = {epsilon_sq_phase:.4f}")
context_line = " | ".join(context_parts)

display(Markdown(f"""
**Mann–Whitney test:** U = {u_stat:,.0f}, {p_str}, |r| = {effect_mag:.3f} ({effect_label} effect)

**Interpretation:** Observational studies have {direction} median enrollment than interventional trials
({median_obs:,.0f} vs {median_int:,.0f}). The effect magnitude is **{effect_label}**.

**Comparable effect sizes (ε² from Kruskal–Wallis):**  
{context_line}
"""))

# ============================================================
# Phase-Stratified Analysis: Does design effect persist within phases?
# ============================================================
# Design differences may be confounded by phase composition.
# Observational studies are typically not assigned a clinical phase, so
# clinical-phase strata may contain too few (or no) observational studies.

display(Markdown("""
---
#### Phase-stratified design comparison

The univariate comparison above may be confounded by phase composition.
Below we test design differences **within each phase stratum**.
"""))

# Stratified analysis
strat_results = []
for phase in PHASE_ORDER_CLINICAL:
    df_phase = df_enr[df_enr["phase_group"] == phase]
    obs = df_phase.loc[df_phase["is_interventional"] == 0, "enrollment"]
    inter = df_phase.loc[df_phase["is_interventional"] == 1, "enrollment"]
    
    if len(obs) >= 20 and len(inter) >= 20:
        u, p = mannwhitneyu(obs, inter, alternative="two-sided")
        r = 1 - (2 * u) / (len(obs) * len(inter))
        strat_results.append({
            "Phase": phase,
            "Obs N": len(obs),
            "Obs Median": obs.median(),
            "Int N": len(inter),
            "Int Median": inter.median(),
            "|r|": abs(r),
            "p-value": p,
        })

if strat_results:
    df_strat_design = pd.DataFrame(strat_results)
    df_strat_design["Sig"] = df_strat_design["p-value"].apply(
        lambda p: "***" if p < 0.001 else "**" if p < 0.01 else "*" if p < 0.05 else ""
    )
    
    display(df_strat_design.style.format({
        "Obs N": "{:,.0f}",
        "Obs Median": "{:,.0f}",
        "Int N": "{:,.0f}",
        "Int Median": "{:,.0f}",
        "|r|": "{:.3f}",
        "p-value": "{:.4f}",
    }).hide(axis="index"))
    
    # Summarize
    sig_phases = df_strat_design[df_strat_design["p-value"] < 0.05]["Phase"].tolist()
    max_r = df_strat_design["|r|"].max()
    
    if sig_phases:
        display(Markdown(f"""
**Within-phase findings:** Design differences remain statistically significant in {len(sig_phases)}/{len(df_strat_design)} phases,
with effect sizes up to |r| = {max_r:.3f}. The design effect appears genuine, not purely an artifact of phase confounding.
"""))
    else:
        display(Markdown("""
**Within-phase findings:** Design differences are not consistently significant within individual phases,
suggesting the univariate difference may be partly explained by phase composition.
"""))
else:
    display(Markdown("*Insufficient sample sizes for stratified analysis.*"))

display(Markdown("""
Design differences are larger than sponsor differences but smaller than phase differences,
positioning study design as an **intermediate structural driver** of enrollment.

**Implication:** Study design is included as a control variable in temporal models (Section 3).
"""))

**Enrollment by Study Design (descriptive summary)**

is_interventional,N,Median,Mean,Q1,Q3
Observational,17901,130,25061.4,50,451
Interventional,61540,60,542.1,29,148



**Mann–Whitney test:** U = 724,372,955, p < 0.001, |r| = 0.315 (medium effect)

**Interpretation:** Observational studies have higher median enrollment than interventional trials
(130 vs 60). The effect magnitude is **medium**.

**Comparable effect sizes (ε² from Kruskal–Wallis):**  
design ε² = 0.0520 | sponsor ε² = 0.0023 | phase ε² = 0.1171



---
#### Phase-stratified design comparison

The univariate comparison above may be confounded by phase composition.
Below we test design differences **within each phase stratum**.


*Insufficient sample sizes for stratified analysis.*


Design differences are larger than sponsor differences but smaller than phase differences,
positioning study design as an **intermediate structural driver** of enrollment.

**Implication:** Study design is included as a control variable in temporal models (Section 3).


In [10]:
# ============================================================
# 2.4 Section 2 Conclusion — Enrollment Drivers (Q3.2)
# ============================================================

display(Markdown("### 2.4 Section 2 Conclusion — Enrollment Drivers (Q3.2)"))

# Use global effect sizes (set by Sections 2.1-2.3)
global epsilon_sq_phase, epsilon_sq_sponsor, epsilon_sq_design

# Fallbacks (if a section wasn't run)
phase_msg = f"{epsilon_sq_phase:.4f}" if epsilon_sq_phase is not None else "N/A (run Section 2.1)"
design_msg = f"{epsilon_sq_design:.4f}" if epsilon_sq_design is not None else "N/A (run Section 2.3)"
sponsor_msg = f"{epsilon_sq_sponsor:.4f}" if epsilon_sq_sponsor is not None else "N/A (run Section 2.2)"

# Optional: derive a ranking if all available
effects = {
    "Phase": epsilon_sq_phase,
    "Study design": epsilon_sq_design,
    "Sponsor": epsilon_sq_sponsor,
}
effects_available = {k: v for k, v in effects.items() if v is not None}

if len(effects_available) >= 2:
    ranking = " > ".join([k for k, _ in sorted(effects_available.items(), key=lambda kv: kv[1], reverse=True)])
    ranking_line = f"**Marginal ranking (by ε²):** {ranking}"
else:
    ranking_line = "**Marginal ranking (by ε²):** N/A (run Sections 2.1–2.3)"

display(Markdown(f"""
---

### Answer to Q3.2: Which trial characteristics are associated with enrollment size?

**Key findings (marginal associations):**

- **Trial phase shows the largest marginal association.**  
  Phase has the largest effect size (**ε² = {phase_msg}**), consistent with later-phase studies requiring larger samples.

- **Study design has a secondary association.**  
  Interventional vs observational differences are detectable (**ε² = {design_msg}**).

- **Sponsor type has a modest marginal association.**  
  Sponsor differences are detectable but small (**ε² = {sponsor_msg}**). Phase-stratified analyses (Section 2.2) suggest this persists within phases.

{ranking_line}

**Important caveat:** These are *marginal* (univariate) effect sizes. If industry sponsors concentrate in later phases,
some of the "sponsor effect" could reflect phase confounding. Section 3.5 provides a joint regression model
(`log_enrollment ~ phase + design + sponsor + time`) that adjusts for mutual confounding; coefficient magnitudes
there can be compared for relative importance under joint adjustment.

**Sensitivity note:** Section 2 pools ACTUAL and ANTICIPATED enrollment. The sensitivity analysis in Section 3.5
shows consistent patterns when restricted to ACTUAL only; we assume similar robustness applies to the driver rankings.

---

**Note on multiple comparisons.**  
Effect sizes (ε², |r|) guide interpretation over p-values. These are exploratory analyses; apart from the
Bonferroni-adjusted post-hoc comparisons in Section 2.1, no formal multiplicity correction is applied.
Interpret borderline results with caution.
"""))

### 2.4 Section 2 Conclusion — Enrollment Drivers (Q3.2)


---

### Answer to Q3.2: Which trial characteristics are associated with enrollment size?

**Key findings (marginal associations):**

- **Trial phase shows the largest marginal association.**  
  Phase has the largest effect size (**ε² = 0.1171**), consistent with later-phase studies requiring larger samples.

- **Study design has a secondary association.**  
  Interventional vs observational differences are detectable (**ε² = 0.0520**).

- **Sponsor type has a modest marginal association.**  
  Sponsor differences are detectable but small (**ε² = 0.0023**). Phase-stratified analyses (Section 2.2) suggest this persists within phases.

**Marginal ranking (by ε²):** Phase > Study design > Sponsor

**Important caveat:** These are *marginal* (univariate) effect sizes. If industry sponsors concentrate in later phases,
some of the "sponsor effect" could reflect phase confounding. Section 3.5 provides a joint regression model
(`log_enrollment ~ phase + design + sponsor + time`) that adjusts for mutual confounding; coefficient magnitudes
there can be compared for relative importance under joint adjustment.

**Sensitivity note:** Section 2 pools ACTUAL and ANTICIPATED enrollment. The sensitivity analysis in Section 3.5
shows consistent patterns when restricted to ACTUAL only; we assume similar robustness applies to the driver rankings.

---

**Note on multiple comparisons.**  
Effect sizes (ε², |r|) guide interpretation over p-values. These are exploratory analyses; apart from the
Bonferroni-adjusted post-hoc comparisons in Section 2.1, no formal multiplicity correction is applied.
Interpret borderline results with caution.


---

## 3. Temporal Analysis (Q3.1)

This section tests whether enrollment has exhibited a genuine secular trend over time, or whether apparent changes are driven by shifts in trial composition (phase, design, sponsor). We use both non-parametric tests and regression-based decomposition.


### 3.1 Hypothesis and Analytical Framing

This section evaluates whether trial enrollment has exhibited a **genuine secular trend** over time, as opposed to
apparent changes driven by shifts in the composition of the clinical trial portfolio.

Based on Section 2, enrollment is strongly structured by trial phase and, to a lesser extent, by study design and
sponsor class. Any aggregate temporal pattern must therefore be interpreted **conditional on these structural drivers**.

#### Hypotheses

- **H₀ (Null):** After accounting for changes in trial composition (phase, design, sponsor), enrollment size has remained stable
  over the observed period.
- **H₁ (Alternative):** Enrollment size has changed systematically over time, beyond what can be explained by
  compositional shifts in trial characteristics.

#### Strategy

We assess this hypothesis using complementary approaches that capture different aspects of temporal change:

1. **Rank-based association** (Spearman ρ) to detect monotonic trends robust to heavy tails.
2. **Distributional comparisons** across temporal cohorts (Kruskal–Wallis) to detect non-linear or non-monotonic shifts.
3. **Regression modeling** of log-enrollment:
   - unadjusted, to quantify aggregate trends,
   - phase-adjusted, to isolate within-phase temporal effects.

Throughout, emphasis is placed on **effect sizes and practical relevance**, not statistical significance alone.

### 3.2 Descriptive Temporal Patterns (Unadjusted)

Before formal hypothesis testing, we examine **unadjusted enrollment patterns across broad temporal cohorts** to identify apparent shifts in trial size over time.

This step is intentionally descriptive. Given the heavy-tailed nature of enrollment and the strong structural effects identified in Section 2, any aggregate temporal pattern may reflect **changes in trial composition** rather than within-trial growth.

Accordingly, enrollment is summarized by **start-year cohorts** (decade bins), using robust statistics (median and IQR) rather than year-by-year averages.
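A minimal sketch of the cohort assignment and the robust summary follows. In the notebook the cohorts come from the SQL ABT, so the `cohort_label` function here is only an assumed Python equivalent, and the sample values are hypothetical. The `inclusive` quantile method is chosen because it uses linear interpolation between order statistics, as pandas does by default.

```python
import statistics


def cohort_label(start_year, year_max):
    """Map a start year to a decade cohort; the last cohort is truncated at year_max."""
    if start_year < 1990 or start_year > year_max:
        return None
    if start_year >= 2020:
        return f"2020-{year_max}"
    decade = start_year // 10 * 10
    return f"{decade}-{decade + 9}"


def robust_summary(values):
    """Median and quartiles, which are insensitive to a few very large trials."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {"N": len(values), "Median": median, "Q1": q1, "Q3": q3}


# Hypothetical enrollment values with one extreme outlier
sample = [10, 20, 30, 40, 1000]
summary = robust_summary(sample)
print(cohort_label(2007, year_max=2025), summary, "mean =", statistics.mean(sample))
```

The single outlier pulls the mean far above every other observation while leaving the median and quartiles unchanged, which is the reason the cohort table is read through its median and IQR columns.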

In [11]:
# ============================================================
# 3.2 Descriptive Temporal Patterns (Unadjusted)
# ============================================================

display(Markdown("### 3.2 Descriptive Temporal Patterns (Unadjusted)"))

# Derive year range from data
year_min = int(df_enr['start_year'].min())
year_max = int(df_enr['start_year'].max())
extraction_year = int(EXTRACTION_DATE[:4])

# Cohort definition with metadata (aligned with ABT)
# Note: cohorts are defined in the SQL ABT and should match
cohort_meta = pd.DataFrame({
    "Cohort": ["1990–1999", "2000–2009", "2010–2019", f"2020–{year_max}"],
    "Years": [10, 10, 10, year_max - 2020 + 1],
    "Note": ["Pre-FDAAA", "Spans FDAAA (2007)", "Modern era", "Truncated; COVID disruption"]
})

display(Markdown("**Cohort definitions:**"))
display(cohort_meta.style.hide(axis="index"))

# Temporal cohort summary
cohort_summary = (
    df_enr
    .groupby('start_cohort', sort=False)
    .agg({
        'enrollment': [
            'count',
            'median',
            'mean',
            lambda x: x.quantile(0.25),
            lambda x: x.quantile(0.75)
        ]
    })
    .round(1)
    .reindex(COHORT_LABELS)
)
cohort_summary.columns = ['N', 'Median', 'Mean', 'Q1', 'Q3']

display(Markdown("**Enrollment by Temporal Cohort:**"))
display(cohort_summary.style.format({
    'N': '{:,.0f}',
    'Median': '{:,.0f}',
    'Mean': '{:,.1f}',
    'Q1': '{:,.0f}',
    'Q3': '{:,.0f}'
}))

# Interpretation
display(Markdown(f"""
**Observation:** Median enrollment varies across cohorts with no clear monotonic increase. The 1990s cohort shows higher median enrollment, which may reflect registry composition (fewer early-phase trials registered in that era).

**Caveat:** These are unadjusted patterns. Section 2 established that phase composition is the dominant driver of enrollment. Apparent temporal variation may reflect compositional shifts rather than genuine trends.

*Formal testing follows in Sections 3.3–3.5.*
"""))

### 3.2 Descriptive Temporal Patterns (Unadjusted)

**Cohort definitions:**

Cohort,Years,Note
1990–1999,10,Pre-FDAAA
2000–2009,10,Spans FDAAA (2007)
2010–2019,10,Modern era
2020–2025,6,Truncated; COVID disruption


**Enrollment by Temporal Cohort:**

start_cohort,N,Median,Mean,Q1,Q3
1990-1999,845,110,28970.9,40,404
2000-2009,13592,75,8539.9,30,232
2010-2019,33103,62,5642.1,28,175
2020-2025,31901,73,4848.1,35,190



**Observation:** Median enrollment varies across cohorts with no clear monotonic increase. The 1990s cohort shows higher median enrollment, which may reflect registry composition (fewer early-phase trials registered in that era).

**Caveat:** These are unadjusted patterns. Section 2 established that phase composition is the dominant driver of enrollment. Apparent temporal variation may reflect compositional shifts rather than genuine trends.

*Formal testing follows in Sections 3.3–3.5.*


### 3.3 Monotonic Association with Time (Spearman Rank Test)

As a first inferential step, we test whether trial enrollment exhibits a **global monotonic association** with calendar time, independent of distributional assumptions and robust to extreme outliers.

This analysis is intentionally **unadjusted** and serves as a diagnostic check rather than a definitive test of temporal change. Spearman's ρ detects only marginal monotonicity—it cannot distinguish structural (compositional) effects from genuine temporal trends.
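Because Spearman's ρ depends only on ranks, it is unaffected by any strictly increasing transform of enrollment, including the log. The sketch below checks this on hypothetical values by computing ρ as the Pearson correlation of average ranks; the function names and data are illustrative assumptions.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman's rho as the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical start years and enrollment values
toy_years = [2000, 2001, 2002, 2003, 2004, 2005]
toy_enrollment = [40, 1200, 35, 80, 60, 500]
toy_log = [math.log(v) for v in toy_enrollment]

rho_raw = spearman(toy_years, toy_enrollment)
rho_log = spearman(toy_years, toy_log)
print(f"rho (raw) = {rho_raw:.4f}, rho (log) = {rho_log:.4f}")
```

The two values coincide, so only a scale-sensitive measure such as Pearson's r can differ between the raw and log scales.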

In [12]:
# ============================================================
# 3.3 Spearman Rank Correlation (Monotonic Association Test)
# ============================================================

# ---------- Global variable declaration ----------
global rho

# ---------- Spearman correlation ----------
rho, p_val = spearmanr(df_enr['start_year'], df_enr['enrollment'])
n_obs = len(df_enr)

# Effect size interpretation (Cohen: |r| < 0.1 negligible, 0.1-0.3 small)
rho_mag = abs(rho)
rho_label = (
    "negligible" if rho_mag < 0.1 else
    "small" if rho_mag < 0.3 else
    "medium" if rho_mag < 0.5 else
    "large"
)
direction = "positive" if rho > 0 else "negative" if rho < 0 else "zero"

# ---------- Triangulation: log-scale correlations ----------
log_mask = df_enr['log_enrollment'].notna()
rho_log, p_log = spearmanr(df_enr.loc[log_mask, 'start_year'], df_enr.loc[log_mask, 'log_enrollment'])
pearson_log, p_pearson = pearsonr(df_enr.loc[log_mask, 'start_year'], df_enr.loc[log_mask, 'log_enrollment'])

# ---------- Output ----------
p_str = "p < 0.001" if p_val < 0.001 else f"p = {p_val:.3f}"

display(Markdown(f"""
**Results:**

| Metric | Value | Interpretation |
|--------|------:|----------------|
| Spearman ρ (enrollment) | {rho:.4f} | {rho_label.capitalize()} association |
| Spearman ρ (log-enrollment) | {rho_log:.4f} | Same conclusion on log scale |
| Pearson r (log-enrollment) | {pearson_log:.4f} | Linear correlation on log scale |

Despite statistical significance ({p_str}, n = {n_obs:,}), all correlations are effectively zero in magnitude.

---

**Interpretation:**

The association between calendar time and enrollment size is statistically detectable but **{rho_label}** in magnitude (|ρ| ≈ {rho_mag:.2f}). With a large sample size, even trivial effects are detectable. The near-zero correlations provide **no practically meaningful evidence against H₀** of stable enrollment over time.

Spearman's ρ is rank-based, so it is unchanged by a monotone log transform; the Pearson correlation on the log scale leads to the same substantive conclusion. The sign difference between Spearman and Pearson on the log scale reflects noise around zero rather than a meaningful directional trend.

---

**Caveats:**

1. Spearman's ρ treats year as an ordinal variable. With a limited number of unique calendar years and extensive ties, the test is less sensitive to complex or non-monotonic temporal patterns.

2. **Shocks vs gradual trends:** Spearman detects only monotonic (gradual) associations. An abrupt shift (e.g., COVID disruption) or U-shaped pattern would not be captured. Section 3.4 (cohort comparisons) partially addresses this.

---

**Implication:**

A simple monotonic time trend does not explain enrollment dynamics. Any apparent temporal heterogeneity is more plausibly attributable to compositional shifts in trial characteristics than to a secular increase or decrease in enrollment size.

*Sections 3.4–3.5 formally test for non-monotonic patterns and isolate within-phase temporal effects via regression adjustment.*
"""))



**Results:**

| Metric | Value | Interpretation |
|--------|------:|----------------|
| Spearman ρ (enrollment) | 0.0101 | Negligible association |
| Spearman ρ (log-enrollment) | 0.0101 | Same conclusion on log scale |
| Pearson r (log-enrollment) | -0.0113 | Linear correlation on log scale |

Despite statistical significance (p = 0.004, n = 79,441), all correlations are effectively zero in magnitude.

---

**Interpretation:**

The association between calendar time and enrollment size is statistically detectable but **negligible** in magnitude (|ρ| ≈ 0.01). With a sample this large, even trivial effects reach statistical significance. The near-zero correlations therefore provide **no practically meaningful evidence against H₀** of stable enrollment over time.

Both raw and log-transformed enrollment yield the same substantive conclusion. The sign difference between Spearman and Pearson on the log scale reflects noise around zero rather than a meaningful directional trend.

---

**Caveats:**

1. Spearman's ρ treats year as an ordinal variable. With a limited number of unique calendar years and extensive ties, the test is less sensitive to complex or non-monotonic temporal patterns.

2. **Shocks vs gradual trends:** Spearman's ρ captures only monotonic association. A transient shock (e.g., COVID disruption) or a U-shaped pattern can leave ρ near zero even when enrollment shifts materially. Section 3.4 (cohort comparisons) partially addresses this.

---

**Implication:**

A simple monotonic time trend does not explain enrollment dynamics. Any apparent temporal heterogeneity is more plausibly attributable to compositional shifts in trial characteristics rather than a secular increase or decrease in enrollment size.

*Sections 3.4–3.5 formally test for non-monotonic patterns and isolate within-phase temporal effects via regression adjustment.*
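
The second caveat can be illustrated with a small simulation. The sketch below uses synthetic data only (not the registry); the U-shaped coefficient, the noise level, and the five-year cohorts are arbitrary assumptions chosen for illustration. It shows how a symmetric U-shaped pattern can leave Spearman's ρ close to zero while a cohort-level Kruskal–Wallis test still detects the differences.

```python
import numpy as np
from scipy.stats import spearmanr, kruskal

rng = np.random.default_rng(0)

# 20 synthetic start years x 200 synthetic trials each
years = np.repeat(np.arange(2000, 2020), 200)
centered = years - years.mean()

# Log-enrollment is high at both ends of the period and low in the middle (U-shape)
log_enr = 4.0 + 0.01 * centered**2 + rng.normal(0.0, 1.0, size=years.size)

# Monotonic association: the two arms of the U cancel out
rho, p_rho = spearmanr(years, log_enr)

# Omnibus comparison across four 5-year cohorts
cohorts = [log_enr[(years >= lo) & (years < lo + 5)] for lo in range(2000, 2020, 5)]
h_stat, p_kw = kruskal(*cohorts)

print(f"Spearman rho = {rho:.3f}")
print(f"Kruskal-Wallis H = {h_stat:.1f}, p = {p_kw:.2e}")
```

A near-zero ρ therefore rules out a monotonic trend but not cohort-level heterogeneity, which is why the cohort comparison is run separately.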


### 3.4 Temporal Heterogeneity Across Cohorts (Kruskal–Wallis Test)

Section 3.3 found no meaningful monotonic trend. Here we test for **non-monotonic heterogeneity**: do enrollment distributions differ across temporal cohorts, regardless of direction?

**Method:** Kruskal–Wallis H test (non-parametric ANOVA on ranks).  
**Limitation:** K-W is an omnibus test: it detects *any* distributional difference but does not identify which cohorts differ.
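
One possible follow-up to the omnibus test, not part of the analysis in this notebook, is a set of pairwise Mann–Whitney U tests with a Bonferroni correction. The sketch below runs on synthetic groups; the function name `pairwise_mannwhitney` and the group labels are hypothetical.

```python
from itertools import combinations

import numpy as np
from scipy.stats import mannwhitneyu


def pairwise_mannwhitney(groups, alpha=0.05):
    """Pairwise two-sided Mann-Whitney U tests with Bonferroni adjustment.

    groups: dict mapping a label to a 1-D array of observations.
    Returns a list of (label_a, label_b, adjusted_p, is_significant).
    """
    pairs = list(combinations(groups, 2))
    n_tests = len(pairs)
    results = []
    for a, b in pairs:
        _, p = mannwhitneyu(groups[a], groups[b], alternative="two-sided")
        p_adj = min(p * n_tests, 1.0)  # Bonferroni: scale by the number of comparisons
        results.append((a, b, p_adj, p_adj < alpha))
    return results


# Synthetic demo: groups A and B share a distribution, C is shifted upward
rng = np.random.default_rng(1)
demo = {
    "A": rng.lognormal(4.0, 1.0, 500),
    "B": rng.lognormal(4.0, 1.0, 500),
    "C": rng.lognormal(5.0, 1.0, 500),
}
results = pairwise_mannwhitney(demo)
for a, b, p_adj, sig in results:
    print(f"{a} vs {b}: adjusted p = {p_adj:.3g}, significant = {sig}")
```

With samples as large as the registry cohorts, most pairs would likely be flagged, so pairwise effect sizes would matter more than the adjusted p-values.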

In [13]:


# ---------- Prepare cohort groups ----------

# Derive year range for display
year_max = int(df_enr['start_year'].max())
last_cohort_label = f"2020–{year_max}"
last_cohort_years = year_max - 2020 + 1

# Filter to valid cohorts and get group sizes
cohort_data = {
    c: df_enr.loc[df_enr['start_cohort'] == c, 'enrollment'].values
    for c in COHORT_LABELS
    if c in df_enr['start_cohort'].values
}

# Report cohort sizes (important for interpretation)
cohort_sizes = {c: len(v) for c, v in cohort_data.items()}
display(Markdown("**Cohort sizes:**"))
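# Years/Note are looked up per cohort so the table still builds if a cohort is absent.
# Assumes the last entry of COHORT_LABELS is the truncated 2020+ cohort.
cohort_years = {c: 10 for c in COHORT_LABELS}
cohort_years[COHORT_LABELS[-1]] = last_cohort_years
display(pd.DataFrame({
    'Cohort': list(cohort_sizes.keys()),
    'N': list(cohort_sizes.values()),
    'Years': [cohort_years[c] for c in cohort_sizes],
    'Note': ['Truncated + COVID' if c == COHORT_LABELS[-1] else '' for c in cohort_sizes]
}).style.hide(axis='index'))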

# ---------- Kruskal-Wallis test ----------
cohort_groups = list(cohort_data.values())
h_stat, p_val = kruskal(*cohort_groups)

n_total = sum(len(g) for g in cohort_groups)
k = len(cohort_groups)

# Effect size: epsilon-squared (Tomczak & Tomczak, 2014)
# More appropriate than eta-squared approximation for K-W
epsilon_sq_cohort = h_stat / (n_total - 1)

# Alternative: eta-squared H (for comparison with Section 2)
eta_sq_H_cohort = (h_stat - k + 1) / (n_total - k)

# Effect size interpretation (using eta_sq_H thresholds for consistency with Section 2)
effect_label = (
    "negligible" if eta_sq_H_cohort < 0.01 else
    "small" if eta_sq_H_cohort < 0.06 else
    "medium" if eta_sq_H_cohort < 0.14 else
    "large"
)

# p-value formatting
p_str = "p < 0.001" if p_val < 0.001 else f"p = {p_val:.3f}"

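# ---------- Context: compare with phase effect ----------
# epsilon_sq_phase is set in Section 2.1; fall back to None if that cell has not run.
epsilon_sq_phase = globals().get("epsilon_sq_phase")
if epsilon_sq_phase is not None and epsilon_sq_cohort > 0:
    # Like-for-like ratio (ε² vs ε²): how many times smaller the cohort effect is than the phase effect
    context_msg = f"For context, phase ε² = {epsilon_sq_phase:.4f} (Section 2.1). The cohort effect is **{epsilon_sq_phase / epsilon_sq_cohort:.1f}x smaller**."
else:
    context_msg = "Run Section 2.1 to compare with phase effect size."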

# ---------- Descriptive: median by cohort (to characterize pattern) ----------
cohort_medians = df_enr.groupby('start_cohort')['enrollment'].median().reindex(COHORT_LABELS)
median_trend = cohort_medians.values
is_monotonic_increasing = all(median_trend[i] <= median_trend[i+1] for i in range(len(median_trend)-1))
is_monotonic_decreasing = all(median_trend[i] >= median_trend[i+1] for i in range(len(median_trend)-1))

if is_monotonic_increasing:
    pattern_desc = "monotonically increasing"
elif is_monotonic_decreasing:
    pattern_desc = "monotonically decreasing"
else:
    pattern_desc = "non-monotonic"

display(Markdown(f"""
---

**Results:**

| Metric | Value |
|--------|-------|
| Kruskal–Wallis H | {h_stat:,.1f} |
| df | {k - 1} |
| p-value | {p_str} |
| ε² (epsilon-squared) | {epsilon_sq_cohort:.4f} |
| η²_H (for comparison) | {eta_sq_H_cohort:.4f} |
| Effect size | **{effect_label}** |

{context_msg}

---

**Interpretation:**

Enrollment distributions differ statistically across temporal cohorts (H = {h_stat:.1f}, {p_str}). However, the effect size is **{effect_label}** (ε² = {epsilon_sq_cohort:.4f}), indicating that cohort membership explains very little of the overall variance in enrollment.

**Pattern characterization:** Median enrollment across cohorts is **{pattern_desc}**:
- 1990–1999: {cohort_medians.iloc[0]:,.0f}
- 2000–2009: {cohort_medians.iloc[1]:,.0f}
- 2010–2019: {cohort_medians.iloc[2]:,.0f}
- {last_cohort_label}: {cohort_medians.iloc[3]:,.0f}

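{"This non-monotonic pattern does not support a simple monotonic trend in enrollment over time." if pattern_desc == "non-monotonic" else f"Cohort medians are {pattern_desc}, but the {effect_label} effect size indicates the shift is small relative to within-cohort variation."}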

**Caveats:**
1. K-W detects distributional differences (location *and* shape), not just median shifts
2. The {last_cohort_label} cohort is truncated (~{last_cohort_years} years) and includes COVID-era disruption
3. Cohort differences may reflect compositional changes (more Phase 1 trials registered over time) rather than genuine within-design trends

**Implication:** Calendar time explains negligible variance in enrollment. Apparent heterogeneity is likely driven by compositional shifts, motivating the adjusted regression in Section 3.5.
"""))

**Cohort sizes:**

| Cohort | N | Years | Note |
|--------|--:|------:|------|
| 1990-1999 | 845 | 10 | |
| 2000-2009 | 13,592 | 10 | |
| 2010-2019 | 33,103 | 10 | |
| 2020-2025 | 31,901 | 6 | Truncated + COVID |



---

**Results:**

| Metric | Value |
|--------|-------|
| Kruskal–Wallis H | 335.2 |
| df | 3 |
| p-value | p < 0.001 |
| ε² (epsilon-squared) | 0.0042 |
| η²_H (for comparison) | 0.0042 |
| Effect size | **negligible** |

For context, phase ε² = 0.1171 (Section 2.1). The cohort effect is **≈28x smaller** (0.1171 / 0.0042).

---

**Interpretation:**

Enrollment distributions differ statistically across temporal cohorts (H = 335.2, p < 0.001). However, the effect size is **negligible** (ε² = 0.0042), indicating that cohort membership explains very little of the overall variance in enrollment.

**Pattern characterization:** Median enrollment across cohorts is **non-monotonic**:
- 1990–1999: 110
- 2000–2009: 75
- 2010–2019: 62
- 2020–2025: 73

This non-monotonic pattern does not support a simple monotonic trend in enrollment over time.

**Caveats:**
1. K-W detects distributional differences (location *and* shape), not just median shifts
2. The 2020–2025 cohort is truncated (~6 years) and includes COVID-era disruption
3. Cohort differences may reflect compositional changes (more Phase 1 trials registered over time) rather than genuine within-design trends

**Implication:** Calendar time explains negligible variance in enrollment. Apparent heterogeneity is likely driven by compositional shifts, motivating the adjusted regression in Section 3.5.
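
As a check on the arithmetic, the two effect sizes reported above can be reproduced from the rounded H statistic and the cohort sizes (845 + 13,592 + 33,103 + 31,901 = 79,441). This is a minimal sketch: the helper name `kw_effect_sizes` is hypothetical, and because H is taken from the rounded table value, results agree only to the displayed precision.

```python
def kw_effect_sizes(h_stat, n_total, k):
    """Return (epsilon-squared, eta-squared_H) for a Kruskal-Wallis H statistic.

    epsilon-squared = H / (n - 1)
    eta-squared_H   = (H - k + 1) / (n - k)
    """
    eps_sq = h_stat / (n_total - 1)
    eta_sq_h = (h_stat - k + 1) / (n_total - k)
    return eps_sq, eta_sq_h


n_total = 845 + 13_592 + 33_103 + 31_901  # cohort sizes from the table above
eps_sq, eta_sq_h = kw_effect_sizes(h_stat=335.2, n_total=n_total, k=4)

# How many times larger the phase effect (ε² = 0.1171, Section 2.1) is than the cohort effect
phase_to_cohort_ratio = 0.1171 / eps_sq

print(f"epsilon-squared = {eps_sq:.4f}")
print(f"eta-squared_H   = {eta_sq_h:.4f}")
print(f"phase / cohort  = {phase_to_cohort_ratio:.1f}")
```

With n this large relative to k, the two estimators are nearly identical, which is why both rows of the results table show the same value.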


### 3.5 Disentangling Temporal Effects from Trial Composition

The descriptive and non-parametric analyses above suggest that apparent temporal changes in enrollment may reflect
**shifts in trial composition** rather than genuine growth in the size of comparable trials.

To formally test this hypothesis, we estimate linear models on log-transformed enrollment that:
1. capture the raw association between calendar time and enrollment, and
2. isolate the temporal effect after accounting for **structural trial characteristics**, including phase, study design,
   and sponsor class, identified in Section 2 as key drivers of enrollment size.

This decomposition allows us to distinguish **true temporal dynamics** from **compositional artifacts** arising from
changes in the mix of trials registered over time.
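
To make the idea of a compositional artifact concrete, the sketch below simulates it on synthetic data (all parameters are arbitrary assumptions, not estimates from the registry). Neither phase has any time trend, but the share of small Phase 1 trials grows over time, so an unadjusted regression shows a negative slope that disappears once phase is included.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 20_000

year = rng.uniform(0, 20, size=n)            # years since an arbitrary origin
p_phase1 = 0.2 + 0.03 * year                 # Phase 1 share rises from 20% to 80%
is_phase1 = (rng.uniform(size=n) < p_phase1).astype(float)

# Log-enrollment depends on phase only; there is no time term within either phase
log_enr = 5.5 - 2.0 * is_phase1 + rng.normal(0.0, 1.0, size=n)


def ols_coefs(columns, y):
    """Ordinary least squares with an intercept; returns [intercept, coef_1, ...]."""
    X = np.column_stack([np.ones(len(y)), *columns])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta


slope_unadjusted = ols_coefs([year], log_enr)[1]            # time only
slope_adjusted = ols_coefs([year, is_phase1], log_enr)[1]   # time + phase indicator

print(f"Unadjusted time slope: {slope_unadjusted:+.4f}")
print(f"Phase-adjusted slope:  {slope_adjusted:+.4f}")
```

The same logic motivates comparing the unadjusted and composition-adjusted models: a change in the time coefficient after adding structural covariates points to composition rather than to within-design change.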

In [14]:
# ============================================================
# 3.5 Regression: Temporal Trends with Compositional Adjustment
# ============================================================

import statsmodels.formula.api as smf
from statsmodels.stats.diagnostic import het_breuschpagan

display(Markdown("""### 3.5 Regression Decomposition: Time vs Composition

We model log(enrollment) to address:
1. Is there an aggregate time trend after adjusting for composition?
2. Do time trends **differ by phase** (time×phase interaction)?
3. Are results robust to separating ACTUAL vs ANTICIPATED enrollment?

Outcome: log(enrollment) for multiplicative interpretation.
"""))

# ----------------------------
# Global variable declarations
# ----------------------------
global pct_excluded, pct_decade3, p3, ci3, interaction_significant

# ----------------------------
# Data preparation
# ----------------------------
required = ["log_enrollment", "years_since_2000", "phase_group", "is_interventional", "is_industry_sponsor", "enrollment_type"]
reg_data = df_enr[required].dropna().copy()

clinical_phases = ["Phase 1", "Phase 2", "Phase 3", "Phase 4"]
n_before = len(reg_data)
reg_data = reg_data[reg_data["phase_group"].isin(clinical_phases)].copy()
n_after = len(reg_data)

# Store exclusion percentage for use in Section 3.6 conclusion
pct_excluded = (n_before - n_after) / n_before * 100

display(Markdown(f"""
**Sample restriction:** clinical phases only (Phase 1–4)  
- Before: {n_before:,} trials (enrollment > 0)  
- After: {n_after:,} trials  
- Excluded: {n_before - n_after:,} ({pct_excluded:.1f}%)
"""))

# ----------------------------
# Model specifications
# ----------------------------
# M1: Unadjusted
m1 = smf.ols("log_enrollment ~ years_since_2000", data=reg_data).fit(cov_type="HC3")

# M2: + Phase (main effects only)
m2 = smf.ols("log_enrollment ~ years_since_2000 + C(phase_group)", data=reg_data).fit(cov_type="HC3")

# M3: Full composition adjustment (main effects)
m3 = smf.ols(
    "log_enrollment ~ years_since_2000 + C(phase_group) + is_interventional + is_industry_sponsor",
    data=reg_data
).fit(cov_type="HC3")

# M4: Time × Phase interaction (allows different trends by phase)
m4 = smf.ols(
    "log_enrollment ~ years_since_2000 * C(phase_group) + is_interventional + is_industry_sponsor",
    data=reg_data
).fit(cov_type="HC3")

def summarize_time(model, name):
    beta = model.params["years_since_2000"]
    ci = model.conf_int().loc["years_since_2000"]
    p = model.pvalues["years_since_2000"]
    p_str = "< 0.001" if p < 0.001 else f"{p:.3f}"
    pct_decade = (float(np.exp(beta * 10)) - 1) * 100
    return {
        "Model": name,
        "β_time (per year)": beta,
        "95% CI": f"[{ci[0]:.4f}, {ci[1]:.4f}]",
        "p": p_str,
        "% / decade": f"{pct_decade:+.1f}%",
        "R²": model.rsquared,
        "Adj R²": model.rsquared_adj,
        "n": int(model.nobs),
    }

cmp = pd.DataFrame([
    summarize_time(m1, "1) Unadjusted"),
    summarize_time(m2, "2) + Phase"),
    summarize_time(m3, "3) + Phase + Design + Sponsor"),
    summarize_time(m4, "4) + Time×Phase interaction"),
])

display(Markdown("#### Model comparison (time coefficient; for Model 4, the Phase 1 reference slope)"))
display(cmp.style.format({"β_time (per year)": "{:.5f}", "R²": "{:.3f}", "Adj R²": "{:.3f}"}).hide(axis="index"))

# ----------------------------
# Time × Phase interaction analysis
# ----------------------------
display(Markdown("#### Time × Phase interaction (Model 4)"))

# Extract phase-specific time slopes
# Reference phase is Phase 1 (alphabetically first in C())
ref_phase = "Phase 1"
base_slope = m4.params["years_since_2000"]

# Get interaction terms
interaction_terms = [p for p in m4.params.index if "years_since_2000:C(phase_group)" in p]

phase_slopes = {ref_phase: base_slope}
for term in interaction_terms:
    phase_name = term.split("[T.")[1].rstrip("]")
    phase_slopes[phase_name] = base_slope + m4.params[term]

# Format as table
phase_slope_df = pd.DataFrame([
    {
        "Phase": phase,
        "β_time (per year)": slope,
        "% / decade": (np.exp(slope * 10) - 1) * 100
    }
    for phase, slope in sorted(phase_slopes.items())
])

display(Markdown("**Phase-specific time slopes (from interaction model):**"))
display(phase_slope_df.style.format({"β_time (per year)": "{:.5f}", "% / decade": "{:+.1f}%"}).hide(axis="index"))

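# Test whether the interaction terms jointly improve fit (nested-model F-test, M3 vs M4).
# Note: this classical F-test assumes homoscedastic errors; since inference elsewhere uses
# HC3 robust errors, treat it as a screening check rather than a definitive test.
from scipy import stats
df_diff = m3.df_resid - m4.df_resid
f_stat = ((m3.ssr - m4.ssr) / df_diff) / (m4.ssr / m4.df_resid)
f_pval = stats.f.sf(f_stat, df_diff, m4.df_resid)  # survival function avoids 1 - cdf round-off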

f_str = "< 0.001" if f_pval < 0.001 else f"{f_pval:.3f}"
interaction_sig = "significant" if f_pval < 0.05 else "not significant"

# Check for opposite signs
slopes_positive = sum(1 for s in phase_slopes.values() if s > 0)
slopes_negative = sum(1 for s in phase_slopes.values() if s < 0)
has_opposite = slopes_positive > 0 and slopes_negative > 0

display(Markdown(f"""
**Interaction test (M3 vs M4):** F = {f_stat:.2f}, p {'' if f_str.startswith('<') else '= '}{f_str}; the interaction is **{interaction_sig}**.

**Pattern check:** phases with positive slopes: {slopes_positive}; phases with negative slopes: {slopes_negative}.
{"[!] Opposite-sign slopes detected: aggregate trend could mask within-phase heterogeneity." if has_opposite else "All phases show same-direction (or near-zero) trends."}
"""))

# ----------------------------
# Stratified analysis by phase
# ----------------------------
display(Markdown("#### Stratified trends by phase"))

stratified_results = []
for phase in clinical_phases:
    phase_data = reg_data[reg_data["phase_group"] == phase]
    if len(phase_data) >= 100:  # Minimum sample
        m_phase = smf.ols("log_enrollment ~ years_since_2000", data=phase_data).fit(cov_type="HC3")
        beta = m_phase.params["years_since_2000"]
        p = m_phase.pvalues["years_since_2000"]
        stratified_results.append({
            "Phase": phase,
            "n": len(phase_data),
            "β_time": beta,
            "p": "< 0.001" if p < 0.001 else f"{p:.3f}",
            "% / decade": (np.exp(beta * 10) - 1) * 100,
            "Direction": "[+]" if beta > 0.001 else ("[-]" if beta < -0.001 else "[~]")
        })

strat_df = pd.DataFrame(stratified_results)
display(strat_df.style.format({"β_time": "{:.5f}", "% / decade": "{:+.1f}%"}).hide(axis="index"))

# ----------------------------
# Sensitivity: ACTUAL vs ANTICIPATED
# ----------------------------
display(Markdown("#### Sensitivity: ACTUAL vs ANTICIPATED enrollment"))

reg_data["is_actual"] = (reg_data["enrollment_type"].str.upper() == "ACTUAL").astype(int)
n_actual = reg_data["is_actual"].sum()
n_anticipated = len(reg_data) - n_actual

if n_actual >= 1000 and n_anticipated >= 1000:
    # Separate regressions
    m_actual = smf.ols(
        "log_enrollment ~ years_since_2000 + C(phase_group) + is_interventional + is_industry_sponsor",
        data=reg_data[reg_data["is_actual"] == 1]
    ).fit(cov_type="HC3")
    
    m_anticipated = smf.ols(
        "log_enrollment ~ years_since_2000 + C(phase_group) + is_interventional + is_industry_sponsor",
        data=reg_data[reg_data["is_actual"] == 0]
    ).fit(cov_type="HC3")
    
    sens_df = pd.DataFrame([
        {
            "Subset": "ACTUAL only",
            "n": int(m_actual.nobs),
            "β_time": m_actual.params["years_since_2000"],
            "p": "< 0.001" if m_actual.pvalues["years_since_2000"] < 0.001 else f"{m_actual.pvalues['years_since_2000']:.3f}",
            "% / decade": (np.exp(m_actual.params["years_since_2000"] * 10) - 1) * 100
        },
        {
            "Subset": "ANTICIPATED only",
            "n": int(m_anticipated.nobs),
            "β_time": m_anticipated.params["years_since_2000"],
            "p": "< 0.001" if m_anticipated.pvalues["years_since_2000"] < 0.001 else f"{m_anticipated.pvalues['years_since_2000']:.3f}",
            "% / decade": (np.exp(m_anticipated.params["years_since_2000"] * 10) - 1) * 100
        },
        {
            "Subset": "Pooled (main)",
            "n": int(m3.nobs),
            "β_time": m3.params["years_since_2000"],
            "p": "< 0.001" if m3.pvalues["years_since_2000"] < 0.001 else f"{m3.pvalues['years_since_2000']:.3f}",
            "% / decade": (np.exp(m3.params["years_since_2000"] * 10) - 1) * 100
        }
    ])
    display(sens_df.style.format({"β_time": "{:.5f}", "% / decade": "{:+.1f}%"}).hide(axis="index"))
    
    # Check if conclusions differ
    beta_actual = m_actual.params["years_since_2000"]
    beta_antic = m_anticipated.params["years_since_2000"]
    same_sign = (beta_actual * beta_antic) > 0
    
    display(Markdown(f"""
**Sensitivity conclusion:** ACTUAL and ANTICIPATED subsets show {"consistent" if same_sign else "**divergent**"} time trends.
{"Results are robust to enrollment type." if same_sign else "[!] Pooled results may mask differences between target and realized enrollment."}
"""))
else:
    display(Markdown(f"*Insufficient data for sensitivity split (ACTUAL: {n_actual:,}, ANTICIPATED: {n_anticipated:,})*"))

# ----------------------------
# Diagnostics (Model 3)
# ----------------------------
display(Markdown("#### Diagnostics (Model 3)"))

bp_stat, bp_p, _, _ = het_breuschpagan(m3.resid, m3.model.exog)
bp_str = "< 0.001" if bp_p < 0.001 else f"{bp_p:.3f}"

display(Markdown(f"""
- **Breusch–Pagan:** χ² = {bp_stat:.1f}, p {'' if bp_str.startswith('<') else '= '}{bp_str}  
- **Note:** HC3 robust standard errors are used for inference regardless of heteroscedasticity.
"""))

# ----------------------------
# Store key results for Section 3.6
# ----------------------------
beta3 = m3.params["years_since_2000"]
p3 = m3.pvalues["years_since_2000"]
ci3 = m3.conf_int().loc["years_since_2000"]
pct_decade3 = (float(np.exp(beta3 * 10)) - 1) * 100
interaction_significant = f_pval < 0.05
has_opposite_trends = has_opposite

display(Markdown(f"""
---

### Summary of findings

1. **Aggregate trend (M3):** β = {beta3:.5f}, implying {pct_decade3:+.1f}% per decade after adjusting for composition.

2. **Time×Phase interaction:** {"Significant" if interaction_significant else "Not significant"} (p {'' if f_str.startswith('<') else '= '}{f_str}).
   {"Within-phase trends may differ from the aggregate." if interaction_significant else "No evidence that trends differ by phase."}

3. **Stratified analysis:** Phase-specific trends {"show heterogeneity" if has_opposite_trends else "are directionally consistent"}.

4. **Sensitivity (ACTUAL vs ANTICIPATED):** {"Not run (insufficient data in one subset)." if 'same_sign' not in globals() else ("Conclusions are robust." if same_sign else "Time trends diverge in sign between subsets; the pooled estimate may not describe either.")}

**Implication:** The small aggregate trend should be interpreted with {"caution given phase heterogeneity" if has_opposite_trends or interaction_significant else "reasonable confidence"}.
"""))

### 3.5 Regression Decomposition: Time vs Composition

We model log(enrollment) to address:
1. Is there an aggregate time trend after adjusting for composition?
2. Do time trends **differ by phase** (time×phase interaction)?
3. Are results robust to separating ACTUAL vs ANTICIPATED enrollment?

Outcome: log(enrollment) for multiplicative interpretation.



**Sample restriction:** clinical phases only (Phase 1–4)  
- Before: 78,065 trials (enrollment > 0)  
- After: 26,036 trials  
- Excluded: 52,029 (66.6%)


#### Model comparison (time coefficient; for Model 4, the Phase 1 reference slope)

| Model | β_time (per year) | 95% CI | p | % / decade | R² | Adj R² | n |
|-------|------------------:|--------|---|-----------:|---:|-------:|--:|
| 1) Unadjusted | -0.00705 | [-0.0095, -0.0045] | < 0.001 | -6.8% | 0.001 | 0.001 | 26,036 |
| 2) + Phase | 0.00284 | [0.0007, 0.0050] | 0.009 | +2.9% | 0.251 | 0.250 | 26,036 |
| 3) + Phase + Design + Sponsor | 0.00383 | [0.0017, 0.0059] | < 0.001 | +3.9% | 0.279 | 0.279 | 26,036 |
| 4) + Time×Phase interaction | 0.00609 | [0.0027, 0.0095] | < 0.001 | +6.3% | 0.280 | 0.279 | 26,036 |
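
Because the outcome is log(enrollment), a per-year coefficient β maps to a multiplicative change of exp(10β) over a decade. The sketch below reproduces the "% / decade" column from the reported coefficients; the helper name `pct_change_per_decade` is hypothetical, and the inputs are the rounded values from the table, so agreement is to the displayed precision only.

```python
import math


def pct_change_per_decade(beta_per_year: float) -> float:
    """Convert a per-year slope on log(enrollment) to a percent change per decade."""
    return (math.exp(beta_per_year * 10) - 1) * 100


reported = [
    ("1) Unadjusted", -0.00705),
    ("3) + Phase + Design + Sponsor", 0.00383),
]
for label, beta in reported:
    print(f"{label}: {pct_change_per_decade(beta):+.1f}% per decade")

# The same transformation applies to confidence-interval bounds
ci_low_pct = pct_change_per_decade(0.0017)
ci_high_pct = pct_change_per_decade(0.0059)
print(f"M3 95% CI: {ci_low_pct:+.1f}% to {ci_high_pct:+.1f}% per decade")
```

For slopes this small the result is close to 1000·β percent, but the exponential form stays accurate when coefficients are larger.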


#### Time × Phase interaction (Model 4)

**Phase-specific time slopes (from interaction model):**

| Phase | β_time (per year) | % / decade |
|-------|------------------:|-----------:|
| Phase 1 | 0.00609 | +6.3% |
| Phase 2 | 0.00467 | +4.8% |
| Phase 3 | -0.00424 | -4.1% |
| Phase 4 | 0.01128 | +11.9% |



**Interaction test (M3 vs M4):** F = 7.21, p < 0.001; the interaction is **significant**.

**Pattern check:** phases with positive slopes: 3; phases with negative slopes: 1.
[!] Opposite-sign slopes detected: aggregate trend could mask within-phase heterogeneity.


#### Stratified trends by phase

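| Phase | n | β_time | p | % / decade | Direction |
|-------|--:|-------:|---|-----------:|-----------|
| Phase 1 | 7,129 | 0.00963 | < 0.001 | +10.1% | [+] |
| Phase 2 | 8,805 | 0.00331 | 0.049 | +3.4% | [+] |
| Phase 3 | 5,754 | -0.00525 | 0.038 | -5.1% | [-] |
| Phase 4 | 4,348 | 0.00375 | 0.253 | +3.8% | [+] |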


#### Sensitivity: ACTUAL vs ANTICIPATED enrollment

| Subset | n | β_time | p | % / decade |
|--------|--:|-------:|---|-----------:|
| ACTUAL only | 20,885 | -0.01125 | < 0.001 | -10.6% |
| ANTICIPATED only | 5,151 | 0.00474 | 0.026 | +4.9% |
| Pooled (main) | 26,036 | 0.00383 | < 0.001 | +3.9% |



**Sensitivity conclusion:** ACTUAL and ANTICIPATED subsets show **divergent** time trends.
[!] Pooled results may mask differences between target and realized enrollment.


#### Diagnostics (Model 3)


- **Breusch–Pagan:** χ² = 529.1, p < 0.001  
- **Note:** HC3 robust standard errors are used for inference regardless of heteroscedasticity.



---

### Summary of findings

1. **Aggregate trend (M3):** β = 0.00383, implying +3.9% per decade after adjusting for composition.

2. **Time×Phase interaction:** Significant (p < 0.001).
   Within-phase trends may differ from the aggregate.

3. **Stratified analysis:** Phase-specific trends show heterogeneity.

4. **Sensitivity (ACTUAL vs ANTICIPATED):** Time trends diverge in sign between subsets; the pooled estimate may not describe either.

**Implication:** The small aggregate trend should be interpreted with caution given phase heterogeneity.


In [15]:
# ============================================================
# 3.6 Section 3 Conclusion
# ============================================================

display(Markdown("### 3.6 Section 3 Conclusion"))

# ---------- Results from Sections 3.3-3.5 (default to None if a section was skipped) ----------
for _name in ("rho", "epsilon_sq_cohort", "pct_decade3", "p3", "ci3", "interaction_significant", "pct_excluded"):
    globals().setdefault(_name, None)

# ---------- Build evidence summary ----------
evidence_lines = []
if rho is not None:
    evidence_lines.append(f"Spearman ρ = {rho:.4f}")
if epsilon_sq_cohort is not None:
    evidence_lines.append(f"Cohort ε² = {epsilon_sq_cohort:.4f}")
if pct_decade3 is not None:
    evidence_lines.append(f"Adjusted trend: {pct_decade3:+.1f}%/decade")
if interaction_significant is not None:
    evidence_lines.append(f"Time×phase interaction: {'significant' if interaction_significant else 'not significant'}")

evidence_str = " | ".join(evidence_lines) if evidence_lines else "[Run Sections 3.3–3.5]"

# ---------- Trend interpretation ----------
if p3 is not None:
    if p3 > 0.05:
        trend_word = "no statistically significant"
    else:
        trend_word = "weak"
else:
    trend_word = "[pending]"

# Practical CI interpretation
if ci3 is not None and pct_decade3 is not None:
    ci_lower_pct = (np.exp(ci3[0] * 10) - 1) * 100
    ci_upper_pct = (np.exp(ci3[1] * 10) - 1) * 100
    ci_interpretation = f"The 95% CI spans **{ci_lower_pct:+.1f}% to {ci_upper_pct:+.1f}% per decade**—even at the upper bound, a modest change."
else:
    ci_interpretation = ""

# Interaction caveat
if interaction_significant:
    interaction_caveat = """
**Phase heterogeneity:** The time×phase interaction is significant, indicating trends differ across phases.
The aggregate near-zero trend may mask phase-specific patterns (see stratified analysis in Section 3.5).
"""
else:
    interaction_caveat = ""

# Exclusion note
if pct_excluded is not None:
    exclusion_note = f"Analysis restricted to clinical phases; ~{pct_excluded:.0f}% of trials excluded."
else:
    exclusion_note = "Analysis restricted to clinical phases."

display(Markdown(f"""
---

## Answer to Q3.1: Have trials become larger over time?

**Summary:** Among trials with reported enrollment, there is **{trend_word}** evidence of a uniform secular trend after adjusting for composition.

**Evidence:** {evidence_str}

{ci_interpretation}
{interaction_caveat}
---

### Critical caveat: Selection bias

This analysis conditions on trials having **reported enrollment >0**. Section 1.1 shows that missingness varies by trial characteristics and may vary over time (potential MNAR). If enrollment reporting practices have changed—e.g., if more early-terminated trials report zero enrollment in recent years—the apparent stability could partly reflect selection rather than true enrollment dynamics.

**Scope:** {exclusion_note} Results may not generalize to registrations outside Phase 1–4.

---

**Multiple comparisons:** Models M1–M4 and stratified analyses are exploratory. Effect sizes guide interpretation; borderline p-values should be interpreted with caution.

*See Section 5 for practical implications.*
"""))

### 3.6 Section 3 Conclusion


---

## Answer to Q3.1: Have trials become larger over time?

**Summary:** Among trials with reported enrollment, there is **weak** evidence of a uniform secular trend after adjusting for composition.

**Evidence:** Spearman ρ = 0.0101 | Cohort ε² = 0.0042 | Adjusted trend: +3.9%/decade | Time×phase interaction: significant

The 95% CI spans **+1.7% to +6.1% per decade**—even at the upper bound, a modest change.

**Phase heterogeneity:** The time×phase interaction is significant, indicating trends differ across phases.
The aggregate near-zero trend may mask phase-specific patterns (see stratified analysis in Section 3.5).

---

### Critical caveat: Selection bias

This analysis conditions on trials having **reported enrollment >0**. Section 1.1 shows that missingness varies by trial characteristics and may vary over time (potential MNAR). If enrollment reporting practices have changed—e.g., if more early-terminated trials report zero enrollment in recent years—the apparent stability could partly reflect selection rather than true enrollment dynamics.

**Scope:** Analysis restricted to clinical phases; ~67% of trials excluded. Results may not generalize to registrations outside Phase 1–4.

---

**Multiple comparisons:** Models M1–M4 and stratified analyses are exploratory. Effect sizes guide interpretation; borderline p-values should be interpreted with caution.

*See Section 5 for practical implications.*


---
## 4. Therapeutic Profiling (Q3.3)

**Q3.3 — Therapeutic profiling (conditions)**  
Which therapeutic areas concentrate the largest patient volumes within the clinical trial ecosystem?

Building on Sections 2 and 3, which established that enrollment size is driven primarily by trial structure (phase and study design) rather than calendar time, we now shift focus from *how trials are designed* to *what they study*.

This section profiles enrollment performance at the **condition level**, using two complementary perspectives:

1. **Total enrollment across trials**  
   Captures where patient participation is most concentrated in aggregate. This reflects research intensity, portfolio emphasis, and operational scale across conditions.

2. **Median enrollment per trial**  
   Captures the typical size of studies within each condition. This serves as a proxy for trial complexity, recruitment burden, and per-trial operational cost.

These metrics answer different questions and should not be interpreted interchangeably.

### Methodological note: condition multiplicity

Trials may be associated with multiple medical conditions (e.g., "Diabetes" and "Cardiovascular Disease"). As a result:

- Condition-level aggregates are **not mutually exclusive**
- Total enrollment **cannot be summed across conditions**
- Rankings reflect **condition-specific participation**, not system-wide totals

All condition-level statistics should therefore be interpreted independently.
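
A toy example may help show why condition-level totals cannot be summed. The data below are invented for illustration (three hypothetical trials, one of which maps to two conditions); the column names are assumptions and do not necessarily match the project schema.

```python
import pandas as pd

# Three hypothetical trials; trial A is tagged with two conditions
trials = pd.DataFrame({
    "study_id": ["A", "B", "C"],
    "enrollment": [100, 50, 50],
})
conditions = pd.DataFrame({
    "study_id": ["A", "A", "B", "C"],
    "condition": ["diabetes", "cardiovascular disease", "diabetes", "asthma"],
})

# One row per (trial, condition) pair: trial A's enrollment now appears twice
pairs = trials.merge(conditions, on="study_id", how="inner")
per_condition = pairs.groupby("condition")["enrollment"].sum()

unique_total = trials["enrollment"].sum()   # each trial counted once
summed_total = per_condition.sum()          # multi-condition trials counted repeatedly
inflation = summed_total / unique_total

print(per_condition)
print(f"Unique total: {unique_total}, summed across conditions: {summed_total}")
print(f"Inflation factor: {inflation:.2f}x")
```

Each per-condition total is correct on its own terms (everyone enrolled in a trial tagged with that condition), but the totals overlap, so their sum exceeds the number of enrollment slots.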

In [None]:
# ============================================================
# 4.1 Conditions with Highest Cumulative Enrollment
# ============================================================

# ---------- Validate dependencies ----------
check_dependencies(
    required_vars={"df_enr": "Section 1.2", "conn": "Section 1 (ABT load)"},
    required_cols={"df_enr": {"study_id", "enrollment", "phase_group"}},
    caller_globals=globals(),
)

import plotly.graph_objects as go
from src.analysis.viz import create_condition_ranking_chart

# ============================================================
# Query conditions and join with enrollment
# ============================================================
query_conditions = """
SELECT 
    c.study_id,
    LOWER(TRIM(c.condition_name)) AS condition_standardized
FROM conditions c
WHERE c.condition_name IS NOT NULL
  AND TRIM(c.condition_name) != ''
"""

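df_conditions = pd.read_sql_query(query_conditions, conn)
# Lower-casing and trimming can collapse spelling variants into one label;
# keep a single row per (study, condition) so enrollment is not double-counted.
df_conditions = df_conditions.drop_duplicates(subset=["study_id", "condition_standardized"])
display(Markdown(f"Loaded {len(df_conditions):,} condition-study mappings"))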

# Join with df_enr to get enrollment AND phase per condition
df_enr_cond = df_enr[["study_id", "enrollment", "phase_group"]].merge(
    df_conditions,
    on="study_id",
    how="inner"
)

display(Markdown(f"Matched {len(df_enr_cond):,} condition-enrollment pairs"))

# ============================================================
# Condition multiplicity assessment
# ============================================================
cond_per_trial = df_enr_cond.groupby("study_id").size()
n_trials_multi = (cond_per_trial > 1).sum()
pct_multi = n_trials_multi / len(cond_per_trial) * 100
avg_cond_per_trial = cond_per_trial.mean()

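# Calculate inflation factor: summed enrollment across conditions vs unique enrollment.
# The denominator is restricted to trials that matched at least one condition, so
# unmatched trials in df_enr do not deflate the ratio.
total_unique_enrollment = df_enr_cond.drop_duplicates("study_id")["enrollment"].sum()
total_condition_enrollment = df_enr_cond["enrollment"].sum()
inflation_factor = total_condition_enrollment / total_unique_enrollment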

display(Markdown(f"""
**Condition multiplicity:**
- Trials with enrollment data: **{len(cond_per_trial):,}**
- Trials mapping to >1 condition: **{n_trials_multi:,}** ({pct_multi:.1f}%)
- Average conditions per trial: **{avg_cond_per_trial:.1f}**

**Inflation factor:** Summing enrollment across conditions yields **{inflation_factor:.2f}×** the true total
(because multi-condition trials are counted multiple times).

*Implication: Condition-level totals should not be summed across conditions. Each ranking is valid within itself but cross-condition sums overcount by ~{(inflation_factor-1)*100:.0f}%.*
"""))

# ============================================================
# CREATE df_cond: Condition-level aggregation WITH PHASE COMPOSITION
# ============================================================
MIN_TRIALS_PER_CONDITION = 50

df_cond = (
    df_enr_cond
    .groupby("condition_standardized", as_index=False)
    .agg(
        trial_count=("study_id", "nunique"),
        total_enrollment_raw=("enrollment", "sum"),
        median_enrollment=("enrollment", "median"),
        mean_enrollment=("enrollment", "mean"),
        q25_enrollment=("enrollment", lambda x: x.quantile(0.25)),
        q75_enrollment=("enrollment", lambda x: x.quantile(0.75)),
        max_enrollment=("enrollment", "max"),
        pct_phase3=("phase_group", lambda x: (x == "Phase 3").mean() * 100),
        pct_phase4=("phase_group", lambda x: (x == "Phase 4").mean() * 100),
        pct_late_phase=("phase_group", lambda x: (x.isin(["Phase 3", "Phase 4"])).mean() * 100),
    )
)

# Filter to conditions with sufficient trials
df_cond = df_cond[df_cond["trial_count"] >= MIN_TRIALS_PER_CONDITION].copy()
df_cond = df_cond.rename(columns={"condition_standardized": "condition_name"})

# Concentration metric: single largest trial's share of total
df_cond["top1_share"] = (df_cond["max_enrollment"] / df_cond["total_enrollment_raw"]) * 100


p95_total = df_cond["total_enrollment_raw"].quantile(0.95)
n_above_p95 = (df_cond["total_enrollment_raw"] > p95_total).sum()

# Identify conditions where a single trial dominates (>50% of total)
n_concentrated = (df_cond["top1_share"] > 50).sum()

display(Markdown(f"""
**Condition-level aggregation complete:**
- Conditions with ≥{MIN_TRIALS_PER_CONDITION} trials: **{len(df_cond):,}**
- Conditions above p95 total ({p95_total:,.0f}): **{n_above_p95}**
- Conditions where top trial >50% of total: **{n_concentrated}** (interpret with caution)
"""))

# ============================================================
# Top 15 by cumulative enrollment (RAW values)
# ============================================================
# Ranking by raw totals preserves true ordering.
# Top-1 % column flags conditions dominated by outlier trials.

top_15 = df_cond.nlargest(15, "total_enrollment_raw").copy()
top_15["iqr_enrollment"] = top_15["q75_enrollment"] - top_15["q25_enrollment"]

# ---------- Table with phase composition ----------
display(Markdown("**Top 15 conditions by cumulative enrollment (enrollment slots, not unique patients)**"))

table_cols = ["condition_name", "trial_count", "total_enrollment_raw", "median_enrollment", "pct_late_phase", "top1_share"]
col_renames = {
    "condition_name": "Condition",
    "trial_count": "Trials",
    "total_enrollment_raw": "Enrollment Slots",
    "median_enrollment": "Median",
    "pct_late_phase": "% Phase 3/4",
    "top1_share": "Top-1 %"
}
formatters = {
    "Trials": "{:,.0f}",
    "Enrollment Slots": "{:,.0f}",
    "Median": "{:,.0f}",
    "% Phase 3/4": "{:.0f}%",
    "Top-1 %": "{:.1f}%"
}

table_41 = (
    top_15[table_cols]
    .rename(columns=col_renames)
    .sort_values("Enrollment Slots", ascending=False)
)

display(table_41.style.format(formatters).hide(axis="index"))

# ---------- Concentration & context ----------
total_all = df_cond["total_enrollment_raw"].sum()
total_top15 = top_15["total_enrollment_raw"].sum()
concentration_pct = total_top15 / total_all * 100

avg_late_phase_top15 = top_15["pct_late_phase"].mean()
avg_late_phase_all = df_cond["pct_late_phase"].mean()

# Flag conditions with high single-trial concentration
high_conc_conditions = top_15[top_15["top1_share"] > 30]["condition_name"].tolist()
if high_conc_conditions:
    conc_warning = f"Conditions with >30% from a single trial: {', '.join(high_conc_conditions[:3])}{'...' if len(high_conc_conditions) > 3 else '.'}"
else:
    conc_warning = ""

display(Markdown(f"""
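*Concentration: Top 15 conditions account for **{concentration_pct:.1f}%** of cumulative enrollment summed over conditions with ≥{MIN_TRIALS_PER_CONDITION} trials (non-exclusive: multi-condition trials count once per condition).*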

**Phase composition:** Top 15 average **{avg_late_phase_top15:.0f}%** Phase 3/4 vs **{avg_late_phase_all:.0f}%** overall.

**Outlier sensitivity:** The "Top-1 %" column shows each condition's dependence on its largest trial.
{conc_warning}
"""))

# ---------- Plot ----------
top_15_viz = top_15.sort_values("total_enrollment_raw", ascending=True)

fig = create_condition_ranking_chart(
    df=top_15_viz,
    y_col="condition_name",
    x_col="total_enrollment_raw",
    title="Top 15 Conditions by Cumulative Enrollment",
    subtitle=f"Enrollment slots (not unique patients) · ≥{MIN_TRIALS_PER_CONDITION} trials per condition",
    x_title="Total Enrollment",
    hover_fields=[
        ("trial_count", "Trials", ":,.0f"),
        ("median_enrollment", "Median", ":,.0f"),
        ("pct_late_phase", "% Phase 3/4", ":.0f"),
        ("top1_share", "Top-1 %", ":.1f%"),
    ],
)
fig.show()

display(Markdown("""
---
### Interpretation

High-volume conditions are often broad disease categories, but a high total can also come from one very large trial.
The "% Phase 3/4" column contextualizes rankings by portfolio maturity.
The "Top-1 %" column flags conditions where a single mega-trial dominates the total; where it exceeds 50%, the ranking reflects that one trial more than the condition.
Section 4.3 ranks by median enrollment to isolate typical trial scale.
"""))

Loaded 176,735 condition-study mappings

Matched 141,788 condition-enrollment pairs


**Condition multiplicity:**
- Trials with enrollment data: **79,441**
- Trials mapping to >1 condition: **28,278** (35.6%)
- Average conditions per trial: **1.8**

**Inflation factor:** Summing enrollment across conditions yields **3.19×** the true total
(because multi-condition trials are counted multiple times).

*Implication: Condition-level totals should not be summed across conditions. Each ranking is valid within itself but cross-condition sums overcount by ~219%.*
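
The inflation factor is an enrollment-weighted count of conditions per trial, so it can exceed the plain average of conditions per trial whenever larger trials map to more conditions. That is consistent with the gap between 3.19× and 1.8 reported above. A minimal sketch on made-up data (trial IDs, enrollments, and condition labels are hypothetical, not registry values):

```python
# Hypothetical trials: (study_id, enrollment, mapped conditions).
trials = [
    ("T1", 100, ["asthma"]),
    ("T2", 200, ["asthma", "copd"]),
    ("T3", 300, ["asthma", "copd", "obesity"]),
]

# True total counts each trial once.
true_total = sum(enrollment for _, enrollment, _ in trials)

# Condition-level totals count each trial once per mapped condition.
summed_over_conditions = sum(enrollment * len(conds) for _, enrollment, conds in trials)

inflation_factor = summed_over_conditions / true_total
avg_conditions_per_trial = sum(len(conds) for _, _, conds in trials) / len(trials)

print(f"Inflation factor: {inflation_factor:.2f}x")
print(f"Average conditions per trial: {avg_conditions_per_trial:.1f}")
```

In this toy case the average is 2.0 conditions per trial, but the inflation factor is higher because the largest trial carries the most condition labels.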



**Condition-level aggregation complete:**
- Conditions with ≥50 trials: **304**
- Conditions above p95 total (3,101,456): **16**
- Conditions where top trial >50% of total: **97** (interpret with caution)


**Top 15 conditions by cumulative enrollment (enrollment slots, not unique patients)**

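| Condition | Trials | Enrollment Slots | Median | % Phase 3/4 | Top-1 % |
|-----------|--------|------------------|--------|-------------|---------|
| solid tumors | 117 | 100,021,999 | 52 | 0% | 100.0% |
| covid-19 | 389 | 12,531,051 | 179 | 14% | 72.5% |
| heart failure | 481 | 10,915,018 | 100 | 15% | 72.0% |
| cardiovascular diseases | 335 | 9,904,019 | 174 | 8% | 55.1% |
| acute coronary syndrome | 135 | 9,189,913 | 241 | 26% | 85.5% |
| osteoarthritis | 233 | 8,384,199 | 100 | 28% | 99.1% |
| cardiovascular disease | 147 | 5,908,544 | 100 | 18% | 42.3% |
| atherosclerosis | 120 | 5,563,033 | 108 | 22% | 98.2% |
| hypertension | 586 | 4,949,295 | 122 | 19% | 39.2% |
| cirrhosis | 75 | 4,797,514 | 78 | 27% | 99.5% |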



*Concentration: Top 15 conditions account for **70.4%** of cumulative enrollment summed over conditions with ≥50 trials (non-exclusive: multi-condition trials count once per condition).*

**Phase composition:** Top 15 average **15%** Phase 3/4 vs **13%** overall.

**Outlier sensitivity:** The "Top-1 %" column shows each condition's dependence on its largest trial.
Conditions with >30% from a single trial: solid tumors, covid-19, heart failure...



---
### Interpretation

High-volume conditions are often broad disease categories, but a high total can also come from one very large trial.
The "% Phase 3/4" column contextualizes rankings by portfolio maturity.
The "Top-1 %" column flags conditions where a single mega-trial dominates the total; where it exceeds 50%, the ranking reflects that one trial more than the condition.
Section 4.3 ranks by median enrollment to isolate typical trial scale.
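
To make the "Top-1 %" reading concrete, the sketch below computes the same share for a single condition and adds a leave-one-out total as a simple sensitivity check. The helper `concentration_summary` and the enrollment values are hypothetical; the leave-one-out total is an illustrative addition and is not computed in the notebook.

```python
def concentration_summary(enrollments):
    """Return total, top-1 share (%), and the total excluding the largest trial."""
    total = sum(enrollments)
    largest = max(enrollments)
    return {
        "total": total,
        "top1_share": largest / total * 100,
        "total_excl_top1": total - largest,
    }

# Hypothetical condition: three ordinary trials plus one very large one.
summary = concentration_summary([50, 80, 120, 10_000])
print(f"Total: {summary['total']:,}")
print(f"Top-1 share: {summary['top1_share']:.1f}%")
print(f"Total excluding largest trial: {summary['total_excl_top1']:,}")
```

When the top-1 share is this high, the cumulative total says more about one study than about the condition, which is why such rows are flagged for caution.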


### 4.2 Conditions with the Highest Trial Count (Research Intensity)

Section 4.1 ranked conditions by cumulative enrollment (participant volume). Here we rank by **number of trials**, which captures *research intensity* rather than volume.

This lens is useful because:
- Some conditions accumulate many participants through **many moderately sized trials**.
- Others appear frequently in the registry but are typically composed of **many small studies**.

In the table below we report trial count together with **median enrollment** and **IQR** to separate “many trials” from “large trials.”
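
As a rough illustration of why count, median, and IQR are reported together, the sketch below profiles two made-up conditions using only the standard library. The helper `size_profile` and both value lists are hypothetical. `method="inclusive"` is chosen because it interpolates linearly between data points, as the default of pandas' `Series.quantile` does.

```python
from statistics import median, quantiles

def size_profile(enrollments):
    """Trial count, median, and IQR for one condition's enrollment values."""
    # Inclusive quartiles: linear interpolation between observed data points.
    q25, _, q75 = quantiles(enrollments, n=4, method="inclusive")
    return {"trials": len(enrollments), "median": median(enrollments), "iqr": q75 - q25}

# Hypothetical conditions (values are illustrative only).
many_small = [20, 30, 30, 40, 40, 50, 50, 60]
few_large = [500, 800, 1200]

print("many small trials:", size_profile(many_small))
print("few large trials: ", size_profile(few_large))
```

The first condition wins on trial count while the second wins on median and spread, which is the distinction the table columns are meant to expose.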

In [17]:
# ============================================================
# 4.2 Conditions by Trial Count (Research Intensity)
# ============================================================

# ---------- Validate dependencies ----------
check_dependencies(
    required_vars={"df_cond": "Section 4.1", "df_enr": "Section 1.2"},
    required_cols={"df_cond": {'trial_count', 'top1_share', 'condition_name', 'total_enrollment_raw', 'median_enrollment'}},
    caller_globals=globals(),
)


from src.analysis.viz import create_condition_ranking_chart

# ---------- Validate ----------
required_cols = {"condition_name", "trial_count", "total_enrollment_raw", "median_enrollment", "top1_share"}
missing_cols = required_cols - set(df_cond.columns)
assert not missing_cols, f"df_cond missing required columns: {sorted(missing_cols)}. Re-run Section 4.1."

# ---------- Top 15 by trial count ----------
top_count = df_cond.nlargest(15, "trial_count").copy()

# Add IQR if available
if "q25_enrollment" in df_cond.columns and "q75_enrollment" in df_cond.columns:
    top_count["iqr_enrollment"] = top_count["q75_enrollment"] - top_count["q25_enrollment"]
    has_iqr = True
else:
    has_iqr = False

# ---------- Table ----------
display(Markdown("**Top 15 conditions by trial count (research intensity)**"))

if has_iqr:
    table_cols = ["condition_name", "trial_count", "median_enrollment", "iqr_enrollment", "total_enrollment_raw", "top1_share"]
    col_names = {"condition_name": "Condition", "trial_count": "Trials", "median_enrollment": "Median",
                 "iqr_enrollment": "IQR", "total_enrollment_raw": "Total Enrollment", "top1_share": "Top-1 %"}
    formatters = {"Trials": "{:,.0f}", "Median": "{:,.0f}", "IQR": "{:,.0f}", "Total Enrollment": "{:,.0f}", "Top-1 %": "{:.1f}%"}
else:
    table_cols = ["condition_name", "trial_count", "median_enrollment", "total_enrollment_raw", "top1_share"]
    col_names = {"condition_name": "Condition", "trial_count": "Trials", "median_enrollment": "Median",
                 "total_enrollment_raw": "Total Enrollment", "top1_share": "Top-1 %"}
    formatters = {"Trials": "{:,.0f}", "Median": "{:,.0f}", "Total Enrollment": "{:,.0f}", "Top-1 %": "{:.1f}%"}

table_42 = (
    top_count[table_cols]
    .rename(columns=col_names)
    .sort_values("Trials", ascending=False)
)

display(table_42.style.format(formatters).hide(axis="index"))

# Note on 'healthy' if present
has_healthy = table_42["Condition"].str.contains("healthy", case=False).any()
if has_healthy:
    display(Markdown(
        "**Note:** `healthy` commonly appears as a condition label for studies in healthy participants. "
        "High trial count reflects registry labeling and early-development research, not a disease category."
    ))

# ---------- Cross-check with 4.1 ----------
top15_volume = set(df_cond.nlargest(15, "total_enrollment_raw")["condition_name"])
top15_trials = set(top_count["condition_name"])

overlap = sorted(top15_volume & top15_trials)
only_volume = sorted(top15_volume - top15_trials)
only_trials = sorted(top15_trials - top15_volume)

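def short_list(xs, k=6):
    # Join with "; " because some condition names contain commas (e.g. "diabetes mellitus, type 2").
    if not xs:
        return "None"
    return "; ".join(xs[:k]) + ("..." if len(xs) > k else "")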

display(Markdown(f"""
**Cross-check with Section 4.1 (cumulative enrollment):**
- Overlap: **{len(overlap)}/15**
- In both: {short_list(overlap)}
- Volume-only: {short_list(only_volume)}
- Trial-count-only: {short_list(only_trials)}
"""))

# ---------- Plot ----------
top_count_viz = top_count.sort_values("trial_count", ascending=True)

hover_fields = [
    ("median_enrollment", "Median enrollment", ":,.0f"),
    ("total_enrollment_raw", "Total enrollment", ":,.0f"),
    ("top1_share", "Top-1 share", ":.1f%"),
]

fig = create_condition_ranking_chart(
    df=top_count_viz,
    y_col="condition_name",
    x_col="trial_count",
    title="Top 15 Conditions by Trial Count",
    subtitle=f"Research intensity (number of studies) · ≥{MIN_TRIALS_PER_CONDITION} trials per condition",
    x_title="Number of trials",
    hover_fields=hover_fields,
)
fig.show()

# ---------- Interpretation ----------
overlap_pct = len(overlap) / 15 * 100
display(Markdown(f"""
**Interpretation:** Trial count captures research intensity. Overlap with volume ranking is **{overlap_pct:.0f}%**, 
showing that high study activity and high participant volume are related but not equivalent.

Trial-count-only conditions tend to reflect many smaller studies, while volume-only conditions achieve 
high totals through fewer, larger trials. Section 4.3 completes the picture by ranking typical trial size.
"""))


**Top 15 conditions by trial count (research intensity)**

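| Condition | Trials | Median | IQR | Total Enrollment | Top-1 % |
|-----------|--------|--------|-----|------------------|---------|
| healthy | 1,819 | 36 | 40 | 280,237 | 35.7% |
| breast cancer | 1,209 | 92 | 240 | 4,476,259 | 52.4% |
| obesity | 1,073 | 75 | 165 | 1,657,518 | 41.1% |
| stroke | 724 | 50 | 93 | 830,099 | 48.0% |
| depression | 617 | 100 | 250 | 2,243,927 | 46.7% |
| pain | 616 | 73 | 86 | 109,352 | 13.4% |
| prostate cancer | 613 | 66 | 171 | 433,815 | 46.1% |
| hypertension | 586 | 122 | 445 | 4,949,295 | 39.2% |
| cancer | 548 | 72 | 215 | 3,733,416 | 21.4% |
| asthma | 532 | 113 | 317 | 1,962,612 | 41.5% |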


**Note:** `healthy` commonly appears as a condition label for studies in healthy participants. High trial count reflects registry labeling and early-development research, not a disease category.


**Cross-check with Section 4.1 (cumulative enrollment):**
- Overlap: **4/15**
- In both: breast cancer; cancer; heart failure; hypertension
- Volume-only: acute coronary syndrome; atherosclerosis; atrial fibrillation; cardiovascular disease; cardiovascular diseases; chronic disease...
- Trial-count-only: anxiety; asthma; coronary artery disease; depression; diabetes mellitus, type 2; healthy...



**Interpretation:** Trial count captures research intensity. Overlap with volume ranking is **27%**, 
showing that high study activity and high participant volume are related but not equivalent.

Trial-count-only conditions tend to reflect many smaller studies, while volume-only conditions achieve 
high totals through fewer, larger trials. Section 4.3 completes the picture by ranking typical trial size.


### 4.3 Conditions with the Largest Typical Trial Size (Median Enrollment)

Sections 4.1–4.2 focused on *where* participants are concentrated and *how often* conditions are studied. Here we isolate a different dimension: **typical study scale**.

*When a trial is conducted for a given condition, how large is it usually?*

Because enrollment is heavily right-skewed, we rank by **median enrollment per trial**.
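
A small numeric illustration of why the median is preferred under right skew. The enrollment values are hypothetical and chosen only to show how one mega-trial affects each statistic.

```python
from statistics import mean, median

# Hypothetical enrollments for one condition: four ordinary trials and one mega-trial.
enrollments = [40, 60, 80, 100, 50_000]

print(f"mean:   {mean(enrollments):,.0f}")
print(f"median: {median(enrollments):,.0f}")

# Dropping the mega-trial moves the mean a lot and the median only slightly.
without_outlier = enrollments[:-1]
print(f"mean without outlier:   {mean(without_outlier):,.0f}")
print(f"median without outlier: {median(without_outlier):,.0f}")
```

Here the mean falls from 10,056 to 70 when the largest trial is removed, while the median moves only from 80 to 70.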

In [18]:
# ============================================================
# 4.3 Conditions by Median Enrollment (Typical Trial Size)
# ============================================================

# ---------- Validate dependencies ----------
check_dependencies(
    required_vars={"df_cond": "Section 4.1", "df_enr": "Section 1.2"},
    required_cols={"df_cond": {'trial_count', 'top1_share', 'condition_name', 'total_enrollment_raw', 'median_enrollment'}},
    caller_globals=globals(),
)


from src.analysis.viz import create_condition_ranking_chart

# ---------- Validate ----------
required_cols = {"condition_name", "trial_count", "total_enrollment_raw", "median_enrollment", "top1_share"}
missing_cols = required_cols - set(df_cond.columns)
assert not missing_cols, f"df_cond missing columns: {sorted(missing_cols)}. Re-run Section 4.1."

# ---------- Top 15 by median enrollment ----------
top_median = df_cond.nlargest(15, "median_enrollment").copy()

# Add IQR if available
if "q25_enrollment" in df_cond.columns and "q75_enrollment" in df_cond.columns:
    top_median["iqr_enrollment"] = top_median["q75_enrollment"] - top_median["q25_enrollment"]
    has_iqr = True
else:
    has_iqr = False

# Global context
global_median = df_cond["median_enrollment"].median()
registry_median = df_enr["enrollment"].median()

# ---------- Table ----------
display(Markdown("**Top 15 conditions by median enrollment per trial**"))

if has_iqr:
    table_cols = ["condition_name", "trial_count", "median_enrollment", "iqr_enrollment", "total_enrollment_raw", "top1_share"]
    col_renames = {
        "condition_name": "Condition", "trial_count": "Trials", "median_enrollment": "Median",
        "iqr_enrollment": "IQR", "total_enrollment_raw": "Total Enrollment", "top1_share": "Top-1 %"
    }
    formatters = {"Trials": "{:,.0f}", "Median": "{:,.0f}", "IQR": "{:,.0f}", "Total Enrollment": "{:,.0f}", "Top-1 %": "{:.1f}%"}
else:
    table_cols = ["condition_name", "trial_count", "median_enrollment", "total_enrollment_raw", "top1_share"]
    col_renames = {
        "condition_name": "Condition", "trial_count": "Trials", "median_enrollment": "Median",
        "total_enrollment_raw": "Total Enrollment", "top1_share": "Top-1 %"
    }
    formatters = {"Trials": "{:,.0f}", "Median": "{:,.0f}", "Total Enrollment": "{:,.0f}", "Top-1 %": "{:.1f}%"}

table_43 = (
    top_median[table_cols]
    .rename(columns=col_renames)
    .sort_values("Median", ascending=False)
)

display(table_43.style.format(formatters).hide(axis="index"))

display(Markdown(f"""
*Context: Registry-wide median enrollment = {registry_median:,.0f}. Cross-condition median = {global_median:,.0f}.*
"""))

# ---------- Cross-check with 4.1 and 4.2 ----------
top15_volume = set(df_cond.nlargest(15, "total_enrollment_raw")["condition_name"])
top15_trials = set(df_cond.nlargest(15, "trial_count")["condition_name"])
top15_median = set(top_median["condition_name"])

overlap_with_volume = sorted(top15_volume & top15_median)
overlap_with_trials = sorted(top15_trials & top15_median)
unique_to_median = sorted(top15_median - top15_volume - top15_trials)

def short_list(xs, k=5):
    return ", ".join(xs[:k]) + ("..." if len(xs) > k else "") if xs else "None"

display(Markdown(f"""
---

**Cross-check with Sections 4.1–4.2:**

| Comparison | Overlap |
|------------|---------|
| vs 4.1 (cumulative enrollment) | {len(overlap_with_volume)}/15 |
| vs 4.2 (trial count) | {len(overlap_with_trials)}/15 |
| **Unique to median ranking** | {len(unique_to_median)}/15 |

**Conditions unique to this ranking:** {short_list(unique_to_median)}
"""))

# ---------- Plot ----------
top_median_viz = top_median.sort_values("median_enrollment", ascending=True)

hover_fields = [
    ("trial_count", "Trials", ":,.0f"),
    ("total_enrollment_raw", "Total enrollment", ":,.0f"),
    ("top1_share", "Top-1 share", ":.1f%"),
]

fig = create_condition_ranking_chart(
    df=top_median_viz,
    y_col="condition_name",
    x_col="median_enrollment",
    title="Top 15 Conditions by Median Enrollment per Trial",
    subtitle=f"Typical study scale · ≥{MIN_TRIALS_PER_CONDITION} trials per condition",
    x_title="Median Enrollment",
    hover_fields=hover_fields,
    reference_line=(registry_median, f"Registry median: {registry_median:,.0f}"),
)
fig.show()

# ---------- Interpretation ----------
median_range_low = int(top_median["median_enrollment"].min())
median_range_high = int(top_median["median_enrollment"].max())
ratio_to_registry = median_range_low / registry_median

display(Markdown(f"""
---

### Interpretation

Median enrollment in the top 15 ranges from **{median_range_low:,}** to **{median_range_high:,}** participants,
**{ratio_to_registry:.1f}× to {median_range_high/registry_median:.1f}×** the registry-wide median of {registry_median:,.0f}.

**Key insight:** These conditions have **larger typical trials** (among those with reported enrollment), regardless of total volume or research intensity.
This may reflect outcomes-based designs, population-based studies, or regulatory requirements—though reporting patterns may also play a role.

**Distinction from Sections 4.1–4.2:**
- 4.1 (volume): Where patients *accumulate* across many trials
- 4.2 (count): How *often* a condition is studied
- 4.3 (median): How *large* each individual trial typically is

{len(unique_to_median)} conditions appear **only** in this ranking, confirming median captures a distinct dimension.
"""))


**Top 15 conditions by median enrollment per trial**

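| Condition | Trials | Median | IQR | Total Enrollment | Top-1 % |
|-----------|--------|--------|-----|------------------|---------|
| venous thromboembolism | 81 | 630 | 2,118 | 440,686 | 28.1% |
| diarrhea | 59 | 317 | 680 | 111,821 | 45.7% |
| influenza | 216 | 300 | 724 | 2,360,875 | 37.3% |
| pulmonary embolism | 68 | 296 | 1,404 | 860,956 | 77.3% |
| smoking cessation | 153 | 269 | 720 | 85,869 | 8.2% |
| hepatitis b | 84 | 250 | 448 | 1,881,063 | 93.0% |
| acute ischemic stroke | 88 | 245 | 443 | 146,624 | 37.5% |
| acute coronary syndrome | 135 | 241 | 934 | 9,189,913 | 85.5% |
| coronary disease | 62 | 237 | 1,134 | 1,918,987 | 91.6% |
| myocardial ischemia | 51 | 234 | 354 | 57,405 | 56.8% |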



*Context: Registry-wide median enrollment = 70. Cross-condition median = 80.*



---

**Cross-check with Sections 4.1–4.2:**

| Comparison | Overlap |
|------------|---------|
| vs 4.1 (cumulative enrollment) | 2/15 |
| vs 4.2 (trial count) | 0/15 |
| **Unique to median ranking** | 13/15 |

**Conditions unique to this ranking:** acute ischemic stroke, colorectal neoplasms, coronary disease, diarrhea, hepatitis b...



---

### Interpretation

Median enrollment in the top 15 ranges from **200** to **630** participants,
**2.9× to 9.0×** the registry-wide median of 70.

**Key insight:** These conditions have **larger typical trials** (among those with reported enrollment), regardless of total volume or research intensity.
This may reflect outcomes-based designs, population-based studies, or regulatory requirements—though reporting patterns may also play a role.

**Distinction from Sections 4.1–4.2:**
- 4.1 (volume): Where patients *accumulate* across many trials
- 4.2 (count): How *often* a condition is studied
- 4.3 (median): How *large* each individual trial typically is

13 conditions appear **only** in this ranking, confirming median captures a distinct dimension.


In [19]:
# ============================================================
# 4.4 Section 4 Conclusion
# ============================================================

# ---------- Validate dependencies ----------
check_dependencies(
    required_vars={"df_cond": "Section 4.1", "df_enr": "Section 1.2"},
    required_cols={"df_cond": {"condition_name", "total_enrollment_raw", "trial_count", "median_enrollment"}},
    caller_globals=globals(),
)

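# ---------- Shared state ----------
# overlap_all is assigned at cell level below, so Section 5 can read it directly;
# a `global` statement is only needed inside a function.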

# ---------- Extract top conditions dynamically ----------
top5_volume = df_cond.nlargest(5, "total_enrollment_raw")["condition_name"].tolist()
top5_trials = df_cond.nlargest(5, "trial_count")["condition_name"].tolist()
top5_median = df_cond.nlargest(5, "median_enrollment")["condition_name"].tolist()

def format_examples(conditions, n=4):
    """Format condition list for prose."""
    clean = [str(c).strip() for c in conditions if pd.notna(c) and str(c).strip()]
    clean = clean[:n]
    if not clean:
        return "(none)"
    if len(clean) == 1:
        return clean[0]
    return ", ".join(clean[:-1]) + f", and {clean[-1]}"

examples_volume = format_examples(top5_volume)
examples_trials = format_examples(top5_trials)
examples_median = format_examples(top5_median)

# ---------- Calculate overlap ----------
top15_volume = set(df_cond.nlargest(15, "total_enrollment_raw")["condition_name"])
top15_trials = set(df_cond.nlargest(15, "trial_count")["condition_name"])
top15_median = set(df_cond.nlargest(15, "median_enrollment")["condition_name"])

overlap_all = len(top15_volume & top15_trials & top15_median)

if overlap_all == 0:
    overlap_descriptor = "none"
elif overlap_all <= 5:
    overlap_descriptor = "limited"
elif overlap_all <= 10:
    overlap_descriptor = "moderate"
else:
    overlap_descriptor = "substantial"

display(Markdown(f"""
### 4.4 Section 4 Conclusion — Answer to Q3.3

**Question:** Which conditions attract the most reported participants?

**Top conditions by metric:**
- **Total enrollment:** {examples_volume}
- **Trial count:** {examples_trials}  
- **Median enrollment:** {examples_median}

**Ranking overlap:** {overlap_descriptor} ({overlap_all}/15 conditions appear in all three top-15 lists)

Each metric captures a distinct dimension of condition-level enrollment; rankings should not be conflated.
"""))


### 4.4 Section 4 Conclusion — Answer to Q3.3

**Question:** Which conditions attract the most reported participants?

**Top conditions by metric:**
- **Total enrollment:** solid tumors, covid-19, heart failure, and cardiovascular diseases
- **Trial count:** healthy, breast cancer, obesity, and stroke  
- **Median enrollment:** venous thromboembolism, diarrhea, influenza, and pulmonary embolism

**Ranking overlap:** none (0/15 conditions appear in all three top-15 lists)

Each metric captures a distinct dimension of condition-level enrollment; rankings should not be conflated.
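
The three-way count is stricter than any pairwise overlap. The pairwise figures reported earlier were 4/15 (volume vs trial count), 2/15 (volume vs median), and 0/15 (trial count vs median), and an empty pairwise intersection forces the three-way intersection to be empty as well. A minimal sketch with hypothetical, shortened top lists (the single-letter labels are placeholders, not conditions):

```python
# Hypothetical top lists, shortened to four names each.
top_volume = {"a", "b", "c", "d"}
top_trials = {"a", "b", "e", "f"}
top_median = {"c", "g", "h", "i"}

pairwise = {
    "volume & trials": top_volume & top_trials,
    "volume & median": top_volume & top_median,
    "trials & median": top_trials & top_median,
}
in_all_three = top_volume & top_trials & top_median

for label, shared in pairwise.items():
    print(f"{label}: {len(shared)} shared {sorted(shared)}")
print(f"all three: {len(in_all_three)} shared")
```

Reporting the pairwise counts alongside the three-way count avoids reading "0/15" as "the rankings have nothing in common."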


In [20]:
# ============================================================
# 5. Final Summary
# ============================================================

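# ---------- Dependency check ----------
# A module-level `global` statement is a no-op, and `name is not None` raises NameError
# if the upstream cell never ran. Default any missing upstream result to None instead.
for _name in (
    "epsilon_sq_phase", "epsilon_sq_sponsor", "epsilon_sq_design",
    "pct_decade3", "p3", "ci3", "interaction_significant", "pct_excluded", "overlap_all",
):
    globals().setdefault(_name, None)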

section_status = {
    "2.1 (Phase)": epsilon_sq_phase is not None,
    "2.2 (Sponsor)": epsilon_sq_sponsor is not None,
    "2.3 (Design)": epsilon_sq_design is not None,
    "3.5 (Regression)": pct_decade3 is not None,
    "4.4 (Conditions)": overlap_all is not None,
}
missing_sections = [s for s, ok in section_status.items() if not ok]

if missing_sections:
    display(Markdown(f"""
[!] **Note:** Sections not executed: {', '.join(missing_sections)}.
"""))

# ---------- Fetch key metrics ----------
eps_phase = epsilon_sq_phase
eps_design = epsilon_sq_design
eps_sponsor = epsilon_sq_sponsor

# Effect size summary
if all(v is not None for v in [eps_phase, eps_design, eps_sponsor]):
    effect_summary = f"(ε²: phase={eps_phase:.3f}, design={eps_design:.3f}, sponsor={eps_sponsor:.3f})"
else:
    effect_summary = ""

# Year range
year_min = int(df_enr['start_year'].min())
year_max = int(df_enr['start_year'].max())

# Overlap descriptor
if overlap_all is not None:
    overlap_display = f"{overlap_all}/15"
else:
    overlap_display = "few"

# Trend summary with CI
if pct_decade3 is not None and ci3 is not None:
    ci_lower_pct = (np.exp(ci3[0] * 10) - 1) * 100
    ci_upper_pct = (np.exp(ci3[1] * 10) - 1) * 100
    trend_detail = f"{pct_decade3:+.1f}%/decade (95% CI: {ci_lower_pct:+.1f}% to {ci_upper_pct:+.1f}%)"
elif pct_decade3 is not None:
    trend_detail = f"{pct_decade3:+.1f}%/decade"
else:
    trend_detail = "[run Section 3.5]"

# Interaction note
interaction_note = " Phase-specific trends may differ (significant interaction)." if interaction_significant else ""

# Exclusion scope
scope_note = f"~{pct_excluded:.0f}% excluded" if pct_excluded is not None else "subset excluded"

display(Markdown(f"""
---

## 5. Final Summary

Analysis of enrollment patterns in ClinicalTrials.gov ({year_min}–{year_max}).

---

### Q3.1 — Have trials become larger over time?

**Finding:** Among trials with reported enrollment, adjusted temporal trend is **{trend_detail}**.{interaction_note}

Compositional shifts (phase mix) explain most apparent temporal variation. However, **this conclusion is conditional on enrollment being reported**—if reporting practices changed over time, selection bias could mask or create trends.

---

### Q3.2 — Which characteristics are associated with enrollment size?

Phase shows the largest marginal association {effect_summary}. Design and sponsor show smaller effects. These are univariate associations; joint adjustment in Section 3.5 provides partial confounding control.

**Marginal ranking:** Phase > Design > Sponsor

---

### Q3.3 — Which conditions attract the most participants?

Top-15 rankings by total enrollment, trial count, and median size share **{overlap_display}** conditions across all three lists, suggesting that they capture different dimensions.

**Caveat:** High-volume conditions may reflect later-phase portfolios rather than inherent "attractiveness." The % Phase 3/4 column provides context but does not fully adjust for composition.

---

### Practical Implications

1. **Benchmarking:** Condition on phase and design, not calendar year.
2. **Therapeutic comparisons:** Distinguish volume vs intensity vs typical scale.
3. **Forecasting:** Adjusted trend of {trend_detail} gives no clear evidence of large "enrollment inflation," but selection bias limits certainty.

---

### Limitations

1. **Selection bias:** Analysis conditions on enrollment >0; missingness may be time-varying (MNAR).
2. **Scope:** Clinical phases only; {scope_note} due to ambiguous/non-clinical phases.
3. **Linearity assumption:** Temporal model assumes constant rate of change; U-shaped or abrupt shifts would be masked.
4. **ACTUAL vs ANTICIPATED pooled:** Sensitivity in Section 3.5 suggests robustness.
5. **Observational:** Associations are not causal.
6. **COVID/truncation:** 2020+ cohort is heterogeneous and truncated.
7. **Condition multiplicity:** Therapeutic profiles are not mutually exclusive; "enrollment slots" ≠ unique patients.
8. **Multiple comparisons:** Exploratory analysis; effect sizes prioritized over p-values.
"""))


---

## 5. Final Summary

Analysis of enrollment patterns in ClinicalTrials.gov (1990–2025).

---

### Q3.1 — Have trials become larger over time?

**Finding:** Among trials with reported enrollment, adjusted temporal trend is **+3.9%/decade (95% CI: +1.7% to +6.1%)**. Phase-specific trends may differ (significant interaction).

Compositional shifts (phase mix) explain most apparent temporal variation. However, **this conclusion is conditional on enrollment being reported**—if reporting practices changed over time, selection bias could mask or create trends.
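
The per-decade percentage follows from exponentiating a per-year slope, as the summary cell does with `np.exp(ci3[0] * 10)`. The sketch below assumes the trend model regresses the natural log of enrollment on start year; that assumption and both helper names are illustrative and not taken from Section 3.5.

```python
import math

def pct_change_per_decade(beta_per_year):
    """Convert a per-year slope on log(enrollment) into a percent change per decade."""
    return (math.exp(beta_per_year * 10) - 1) * 100

def beta_from_pct_per_decade(pct):
    """Inverse conversion: percent change per decade back to a per-year log slope."""
    return math.log(1 + pct / 100) / 10

# Per-year slope implied by the reported +3.9%/decade point estimate.
beta = beta_from_pct_per_decade(3.9)
print(f"implied per-year log slope: {beta:.5f}")
print(f"round trip: {pct_change_per_decade(beta):+.1f}%/decade")
```

Because the transformation is monotone, applying it to each confidence bound gives the bounds on the percentage scale, which is how the interval above is derived from `ci3`.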

---

### Q3.2 — Which characteristics are associated with enrollment size?

Phase shows the largest marginal association (ε²: phase=0.117, design=0.052, sponsor=0.002). Design and sponsor show smaller effects. These are univariate associations; joint adjustment in Section 3.5 provides partial confounding control.

**Marginal ranking:** Phase > Design > Sponsor

---

### Q3.3 — Which conditions attract the most participants?

Top-15 rankings by total enrollment, trial count, and median size share **0/15** conditions across all three lists, suggesting that they capture different dimensions.

**Caveat:** High-volume conditions may reflect later-phase portfolios rather than inherent "attractiveness." The % Phase 3/4 column provides context but does not fully adjust for composition.

---

### Practical Implications

1. **Benchmarking:** Condition on phase and design, not calendar year.
2. **Therapeutic comparisons:** Distinguish volume vs intensity vs typical scale.
3. **Forecasting:** Adjusted trend of +3.9%/decade (95% CI: +1.7% to +6.1%) gives no clear evidence of large "enrollment inflation," but selection bias limits certainty.

---

### Limitations

1. **Selection bias:** Analysis conditions on enrollment >0; missingness may be time-varying (MNAR).
2. **Scope:** Clinical phases only; ~67% excluded due to ambiguous/non-clinical phases.
3. **Linearity assumption:** Temporal model assumes constant rate of change; U-shaped or abrupt shifts would be masked.
4. **ACTUAL vs ANTICIPATED pooled:** Sensitivity in Section 3.5 suggests robustness.
5. **Observational:** Associations are not causal.
6. **COVID/truncation:** 2020+ cohort is heterogeneous and truncated.
7. **Condition multiplicity:** Therapeutic profiles are not mutually exclusive; "enrollment slots" ≠ unique patients.
8. **Multiple comparisons:** Exploratory analysis; effect sizes prioritized over p-values.


In [21]:
# ============================================================
# Cleanup
# ============================================================

if "conn" in globals() and conn is not None:
    conn.close()
    conn = None  # avoid closing twice if the cell is re-run
    display(Markdown("[ok] Database connection closed."))

[ok] Database connection closed.