# Q2: Completion Analysis

## Research Question

> **Which factors are associated with higher trial completion rates among resolved trials?**  
> **Are there systematic differences between trials that are terminated and those that are withdrawn?**

---

## Operational Framing

This analysis addresses the business question in two complementary layers.

First, we identify **which trial characteristics are associated with successful completion** by examining unadjusted completion rates and estimating adjusted associations via multivariable logistic regression.

Second, we **disaggregate stopped trials into distinct failure modes** (Terminated, Withdrawn, Suspended) to distinguish failures occurring during execution from those occurring before launch, and to characterize how these groups differ in terms of phase, enrollment, and sponsor profile.

---

## Methodological Approach

| Aspect | Approach |
|------|---------|
| **Design** | Cross-sectional association analysis |
| **Inference goal** | Identify associations (no causal claims) |
| **Analytical population** | Resolved trials only (Completed + Stopped) |
| **Exclusions** | Active trials excluded to avoid censoring bias |
| **Primary metric** | Resolved Completion Rate = `Completed / (Completed + Stopped)` |

Temporal patterns are examined descriptively to account for lifecycle effects, while statistical inference remains cross-sectional.

---

## Analysis Structure

| Section | Purpose |
|--------|---------|
| **1. ABT Validation** | Data quality checks and definition of the analytical population |
| **2. Descriptive Analysis** | Completion rates by key trial characteristics |
| **3. Termination Patterns** | Descriptive characterization of failure types and structural differences |
| **4. Temporal Dimension** | Cohort-based completion trends |
| **5. Statistical Inference** | Logistic regression with assumption checks and diagnostics |


## Setup

In [1]:
import sys
from pathlib import Path

import numpy as np
import pandas as pd
from IPython.display import display, Markdown

# Notebook runs from /notebooks; add project root for src imports
PROJECT_ROOT = Path('..')
sys.path.insert(0, str(PROJECT_ROOT))

from src.data.loader import load_sql_query, get_db_connection
from src.analysis.viz import create_rate_bar_chart, format_rate_table
from src.analysis.metrics import calc_completion_rate, test_rate_difference
from src.analysis.constants import PHASE_ORDER_CLINICAL, ENROLLMENT_ORDER, FAILURE_COLORS

# Paths (validated at setup to fail fast)
DB_PATH = PROJECT_ROOT / 'data' / 'database' / 'clinical_trials.db'
SQL_PATH = PROJECT_ROOT / 'sql' / 'queries'
assert DB_PATH.exists(), f"DB not found: {DB_PATH}"
assert SQL_PATH.exists(), f"SQL folder not found: {SQL_PATH}"

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

# Reproducibility: must match ETL metadata file (metadata_YYYYMMDD_HHMMSS.json)
EXTRACTION_DATE = "2026-01-18"

## Database Connection

In [2]:
conn = get_db_connection(DB_PATH)

---
# 1. ABT Validation & Analytical Population

> **Terminology:** Throughout this analysis, *trial* refers to a single registry entry (`study_id`).  
> One trial = one row in the ABT.

In [3]:
# ============================================================
# Load ABT (Analytical Base Table)
# ============================================================

df_abt = load_sql_query(
    'q2_abt.sql', 
    conn,
    SQL_PATH,
    params={'extraction_date': EXTRACTION_DATE}
)

# Basic validation
n_studies = len(df_abt)
assert df_abt['study_id'].nunique() == n_studies, "study_id should be unique"

# Derive scope from data (not assumed)
min_year = int(df_abt['start_year'].min())
max_year = int(df_abt['start_year'].max())

display(Markdown(f"""**ABT loaded:** {n_studies:,} trials (start year {min_year}–{max_year})

*Scope enforced upstream in `v_studies_clean.is_start_year_in_scope`; validated here from loaded data.*
"""))

**ABT loaded:** 82,707 trials (start year 1990–2025)

*Scope enforced upstream in `v_studies_clean.is_start_year_in_scope`; validated here from loaded data.*


## 1.1 Registry Status Distribution

How are trials distributed across Completed, Stopped, and Active statuses?

**Methodological note (important):**  
Active trials represent studies whose final outcome is not yet observed at the extraction date. Including them as “not completed” would introduce censoring bias and systematically understate completion rates—especially for recent cohorts and later-phase trials.

For this reason, all subsequent analyses (Sections 2–4) are restricted to **resolved trials** (Completed + Stopped), where the outcome is known and the completion rate is well-defined.

In [4]:
# ============================================================
# 1.1 Registry status distribution
# ============================================================

outcome_dist = df_abt['outcome_group'].value_counts()

# Calculate key metrics
n_completed = outcome_dist.get('Completed', 0)
n_stopped = outcome_dist.get('Stopped', 0)
n_active = outcome_dist.get('Active', 0)
n_resolved = n_completed + n_stopped

# Resolved completion rate (the key metric for Parts 2-4)
resolved_completion_rate = n_completed / n_resolved * 100 if n_resolved > 0 else 0

# Build summary table
outcome_summary = pd.DataFrame({
    'Status Group': ['Completed', 'Stopped', 'Active', 'Resolved (Completed + Stopped)'],
    'Count': [n_completed, n_stopped, n_active, n_resolved],
    'Share': [
        f"{n_completed / n_studies * 100:.1f}%",
        f"{n_stopped / n_studies * 100:.1f}%",
        f"{n_active / n_studies * 100:.1f}%",
        f"{n_resolved / n_studies * 100:.1f}%",
    ]
})

display(Markdown("**Registry status distribution (full ABT):**"))
display(outcome_summary)

display(Markdown(f"""
**Resolved Completion Rate:** {resolved_completion_rate:.1f}%  
*(Completed / (Completed + Stopped) — Active trials excluded from denominator)*

**Note:** Parts 2–4 use only **resolved trials** (n={n_resolved:,}) to avoid censoring bias.
"""))

**Registry status distribution (full ABT):**

Unnamed: 0,Status Group,Count,Share
0,Completed,54184,65.5%
1,Stopped,8774,10.6%
2,Active,19749,23.9%
3,Resolved (Completed + Stopped),62958,76.1%



**Resolved Completion Rate:** 86.1%  
*(Completed / (Completed + Stopped) — Active trials excluded from denominator)*

**Note:** Parts 2–4 use only **resolved trials** (n=62,958) to avoid censoring bias.


In [5]:
# ============================================================
# Missingness analysis
# ============================================================

# Key columns for analysis
analysis_cols = ['enrollment', 'lead_agency_class', 'completion_date', 'n_conditions', 'phase_group']

missingness = pd.DataFrame({
    'Column': analysis_cols,
    'Missing': [df_abt[col].isna().sum() for col in analysis_cols],
    'Missing %': [f"{df_abt[col].isna().mean() * 100:.1f}%" for col in analysis_cols],
})

display(Markdown("**Missingness in key analysis columns:**"))
display(missingness)

**Missingness in key analysis columns:**

Unnamed: 0,Column,Missing,Missing %
0,enrollment,3266,3.9%
1,lead_agency_class,0,0.0%
2,completion_date,1419,1.7%
3,n_conditions,0,0.0%
4,phase_group,0,0.0%


**Interpretation:**  
Missingness is low across all key analytical variables. Enrollment shows limited missingness (3.9%), which is examined separately given its potential relationship with early trial withdrawal. No imputation is performed; analyses rely on observed values only.

---
# 2. Descriptive Analysis: Completion Rates by Factor

**Population:** Resolved trials only (n=62,958)  
**Metric:** Resolved Completion Rate = `Completed / (Completed + Stopped)`

> **Analytical scope:**  
> This section provides an unadjusted, descriptive view of how trial completion rates vary across key characteristics.  
> The objective is **structured variable screening**: identifying factors that exhibit meaningful variation and therefore warrant inclusion in the multivariable model (Section 5).  
> No causal or adjusted interpretation is made at this stage.

---

## Candidate Variables for Completion Modeling

The variables analyzed below were selected based on domain relevance and data availability.  
Each represents a distinct dimension of trial risk or execution complexity.

| Variable              | Type        | Rationale for inclusion                                      |
|-----------------------|-------------|---------------------------------------------------------------|
| `phase_group`         | Categorical | Proxy for scientific, regulatory, and development-stage risk |
| `sponsor_category`   | Categorical | Proxy for organizational capacity and funding structure      |
| `enrollment_bucket`  | Ordinal     | Proxy for operational complexity and execution scale         |
| `has_oncology_label` | Binary      | Therapeutic area historically associated with higher attrition |

We focus on these four variables because they are available at scale in the registry and are plausible drivers of execution risk.

---

## 2.1 Completion Rates by Key Factors

The tables below summarize unadjusted completion rates for each factor independently; interpretation follows immediately below.

In [6]:
# ============================================================
# 2.1 Completion Rates Summary (All Factors)
# ============================================================
# Tables ordered by finding importance (Enrollment > Phase > Oncology > Sponsor)
# All rates shown with 95% Wilson CIs; n < 1000 flagged with asterisk

display(Markdown("## 2.1 Completion Rates by Key Factors"))
display(Markdown("""
*All rates are reported with 95% Wilson confidence intervals to reflect estimation uncertainty; 
asterisks denote groups with n < 1,000 where estimates may be less stable.
χ² tests assess independence (unadjusted); with large n, small differences can be significant. 
With large samples, statistical significance should not be interpreted as practical importance—we focus on effect size and adjusted results in Section 5.*
"""))

# === Calculate rates for all factors ===

# --- Enrollment (MOST IMPORTANT: strongest gradient) ---
enrollment_rates = calc_completion_rate(df_abt, 'enrollment_bucket')
enrollment_rates['_order'] = enrollment_rates['enrollment_bucket'].apply(
    lambda x: ENROLLMENT_ORDER.index(x) if x in ENROLLMENT_ORDER else 99
)
enrollment_rates = enrollment_rates.sort_values('_order')

# --- Phase ---
phase_rates = calc_completion_rate(df_abt, 'phase_group')
phase_order_map = {phase: i for i, phase in enumerate(PHASE_ORDER_CLINICAL + ['Not Applicable', 'Other'])}
phase_rates['_order'] = phase_rates['phase_group'].map(phase_order_map)
phase_rates = phase_rates.sort_values('_order')

# --- Oncology ---
oncology_rates = calc_completion_rate(df_abt, 'has_oncology_label')
oncology_rates['has_oncology_label'] = oncology_rates['has_oncology_label'].map({1: 'Oncology', 0: 'Non-Oncology'})

# --- Sponsor ---
df_abt['sponsor_category'] = df_abt['lead_agency_class'].apply(
    lambda x: 'Industry' if x == 'INDUSTRY' else ('Other' if pd.notna(x) else 'Unknown')
)
sponsor_rates = calc_completion_rate(df_abt, 'sponsor_category')

# === Display tables (ordered by finding importance) ===

# 1. Enrollment (strongest gradient: 83.5% → 96.2%)
display(Markdown("### 1. By Enrollment Size (Strongest Effect)"))
display(format_rate_table(enrollment_rates.drop(columns=['_order']), 'enrollment_bucket', 'Enrollment'))

# Inline visualization: Enrollment gradient (with error bars)
fig_enrollment = create_rate_bar_chart(
    enrollment_rates,
    rate_col='completion_rate_pct',
    label_col='enrollment_bucket',
    n_col='n_resolved',
    ci_lower_col='ci_lower_pct',
    ci_upper_col='ci_upper_pct',
    title='Completion Rate by Enrollment Size',
    subtitle='Clear dose-response: larger trials complete at higher rates',
    note='"Unknown" enrollment retained and modeled explicitly (not excluded) to avoid selection bias.',
    height=350,
)
fig_enrollment.show()

# Chi-square test for enrollment
enrollment_test = test_rate_difference(df_abt[df_abt['is_resolved'] == 1], 'enrollment_bucket')
display(Markdown(f"**χ² test:** {enrollment_test['interpretation']}"))

# Validate Unknown enrollment as NMAR: check if Unknown is disproportionately Stopped
unknown_enroll = df_abt[(df_abt['is_resolved'] == 1) & (df_abt['enrollment_bucket'] == 'Unknown')]
unknown_n = len(unknown_enroll)
unknown_stopped_pct = (unknown_enroll['outcome_group'] == 'Stopped').mean() * 100
# Compare to overall stopped rate
overall_stopped_pct = (df_abt[df_abt['is_resolved'] == 1]['outcome_group'] == 'Stopped').mean() * 100

display(Markdown(f"""
**Interpretation:** Enrollment size exhibits a strong, monotonic association with completion.
CIs are tight for large groups; the gradient is substantial and the χ² test rejects independence.
The "Unknown" group (n={unknown_n:,}) shows extreme behavior: {unknown_stopped_pct:.0f}% are Stopped 
(vs {overall_stopped_pct:.0f}% overall), supporting the hypothesis that missingness reflects early failure 
rather than random omission. This group is retained and modeled explicitly rather than excluded.
"""))

# 2. Phase
display(Markdown("### 2. By Clinical Phase"))
display(format_rate_table(phase_rates.drop(columns=['_order']), 'phase_group', 'Phase'))

# Note on "Not Applicable"
na_n = phase_rates[phase_rates['phase_group'] == 'Not Applicable']['n_resolved'].values[0]
na_pct = na_n / phase_rates['n_resolved'].sum() * 100
na_rate = phase_rates[phase_rates['phase_group'] == 'Not Applicable']['completion_rate_pct'].values[0]
display(Markdown(f"""
> **Note on "Not Applicable":** {na_n:,} trials ({na_pct:.0f}% of resolved) have no designated phase—mostly 
> observational/registry studies. These have the **highest completion rate** ({na_rate:.1f}%) and are 
> **excluded from phase-based regression** in Section 5 (where "phase" as a predictor is not meaningful).
"""))

# Inline visualization: Phase (clinical phases only, with error bars)
phase_rates_clinical = phase_rates[phase_rates['phase_group'].isin(PHASE_ORDER_CLINICAL)].copy()
fig_phase = create_rate_bar_chart(
    phase_rates_clinical,
    rate_col='completion_rate_pct',
    label_col='phase_group',
    n_col='n_resolved',
    ci_lower_col='ci_lower_pct',
    ci_upper_col='ci_upper_pct',
    title='Completion Rate by Clinical Phase',
    subtitle=f'Interventional trials only (n={int(phase_rates_clinical["n_resolved"].sum()):,})',
    note='Mid-stage phases (1/2, 2) underperform. Phase 3 advantage is confounded by enrollment (Section 5.2).',
    height=350,
)
fig_phase.show()

# Chi-square test for phase
df_clinical = df_abt[(df_abt['is_resolved'] == 1) & (df_abt['phase_group'].isin(PHASE_ORDER_CLINICAL))]
phase_test = test_rate_difference(df_clinical, 'phase_group')
display(Markdown(f"**χ² test:** {phase_test['interpretation']}"))

display(Markdown("""
**Interpretation:** Several phase-level differences show overlapping confidence intervals, 
particularly among early and mid-stage phases, indicating that some unadjusted differences 
may not be statistically distinguishable. Importantly, enrollment size and clinical phase are 
strongly correlated (later-phase trials tend to enroll more patients), motivating the adjusted 
analysis in Section 5.
"""))

# 3. Oncology
display(Markdown("### 3. Oncology vs Non-Oncology"))
display(format_rate_table(oncology_rates, 'has_oncology_label', 'Category'))
oncology_test = test_rate_difference(df_abt[df_abt['is_resolved'] == 1], 'has_oncology_label')
display(Markdown(f"**χ² test:** {oncology_test['interpretation']}"))

# 4. Sponsor
display(Markdown("### 4. By Sponsor Type"))
display(Markdown("""
> **Note:** "Other" is heterogeneous—includes academic, government, non-profit, and missing/undefined sponsors.
> Interpret crude differences cautiously; Section 5 adjusts for confounders.
"""))
display(format_rate_table(sponsor_rates, 'sponsor_category', 'Sponsor'))
sponsor_test = test_rate_difference(df_abt[df_abt['is_resolved'] == 1], 'sponsor_category')
display(Markdown(f"**χ² test:** {sponsor_test['interpretation']}"))

# === Cross-tabulation: Confounding preview ===
display(Markdown("### Confounding Context: Industry × Phase Distribution"))
display(Markdown("""
The sponsor effect (1.6 pp crude difference) cannot be interpreted in isolation because 
**industry sponsors are unevenly distributed across phases**:
"""))

# Calculate Industry % by phase
df_resolved = df_abt[df_abt['is_resolved'] == 1].copy()
df_resolved['is_industry'] = (df_resolved['lead_agency_class'] == 'INDUSTRY').astype(int)
phase_industry = (
    df_resolved
    .groupby('phase_group')
    .agg(
        n_trials=('study_id', 'nunique'),
        n_industry=('is_industry', 'sum')
    )
    .reset_index()
)
phase_industry['industry_pct'] = phase_industry['n_industry'] / phase_industry['n_trials'] * 100
phase_industry['_order'] = phase_industry['phase_group'].map(phase_order_map)
phase_industry = phase_industry.sort_values('_order').drop(columns=['_order'])
phase_industry = phase_industry[phase_industry['phase_group'].isin(PHASE_ORDER_CLINICAL)]

# Display cross-tab
cross_tab = phase_industry[['phase_group', 'n_trials', 'industry_pct']].copy()
cross_tab['industry_pct'] = cross_tab['industry_pct'].apply(lambda x: f"{x:.0f}%")
cross_tab.columns = ['Phase', 'n Resolved', 'Industry-Sponsored']
display(cross_tab.reset_index(drop=True))

# Dynamic interpretation based on actual data
phase3_ind = phase_industry[phase_industry['phase_group'] == 'Phase 3']['industry_pct'].values[0]
early_ind = phase_industry[phase_industry['phase_group'] == 'Early Phase 1']['industry_pct'].values[0]
display(Markdown(f"""
**Interpretation:** Industry sponsorship varies materially by phase (see table above). 
Later-phase trials have higher industry participation (Phase 3: {phase3_ind:.0f}% vs Early Phase 1: {early_ind:.0f}%).
Since Phase 3 trials complete at higher rates (85.5%) and are more often industry-sponsored, 
the crude sponsor comparison is **confounded by phase distribution**. Section 5.2 adjusts for this.
"""))

# === Cross-tabulation #2: Enrollment × Phase ===
display(Markdown("### Confounding Context: Enrollment × Phase"))
display(Markdown("""
Enrollment size and clinical phase are also strongly correlated, which confounds BOTH effects:
"""))

# Median enrollment by phase
enrollment_by_phase = (
    df_resolved[df_resolved['phase_group'].isin(PHASE_ORDER_CLINICAL)]
    .groupby('phase_group')
    .agg(
        n_trials=('study_id', 'nunique'),
        median_enrollment=('enrollment', 'median'),
    )
    .reset_index()
)
enrollment_by_phase['_order'] = enrollment_by_phase['phase_group'].map(phase_order_map)
enrollment_by_phase = enrollment_by_phase.sort_values('_order').drop(columns=['_order'])
enrollment_by_phase['median_enrollment'] = enrollment_by_phase['median_enrollment'].apply(
    lambda x: f"{x:,.0f}" if pd.notna(x) else "N/A"
)
enrollment_by_phase.columns = ['Phase', 'n Resolved', 'Median Enrollment']
display(enrollment_by_phase.reset_index(drop=True))

display(Markdown("""
**Interpretation:** Later-phase trials enroll substantially more patients (Phase 3 median ~200 vs Early Phase 1 ~30).
This explains the **Phase 3 paradox**: Phase 3 has a high unadjusted completion rate (85.5%) driven by 
larger enrollment, but Section 5.2 will show that after controlling for enrollment, Phase 3 actually has 
**lower** adjusted odds of completion (OR ≈ 0.36).
"""))

## 2.1 Completion Rates by Key Factors


*All rates are reported with 95% Wilson confidence intervals to reflect estimation uncertainty; 
asterisks denote groups with n < 1,000 where estimates may be less stable.
χ² tests assess independence (unadjusted); with large n, small differences can be significant. 
With large samples, statistical significance should not be interpreted as practical importance—we focus on effect size and adjusted results in Section 5.*


### 1. By Enrollment Size (Strongest Effect)

Unnamed: 0,Enrollment,n,Completed,Rate [95% CI]
0,Unknown,3263,523,16.0% [14.8-17.3%]
1,<50,24928,20823,83.5% [83.1-84.0%]
2,50-99,12338,11572,93.8% [93.4-94.2%]
3,100-499,16271,15375,94.5% [94.1-94.8%]
4,500-999,2792,2654,95.1% [94.2-95.8%]
5,1000+,3366,3237,96.2% [95.5-96.8%]


**χ² test:** Significant difference (χ²=15530.2, p<0.001)


**Interpretation:** Enrollment size exhibits a strong, monotonic association with completion.
CIs are tight for large groups; the gradient is substantial and the χ² test rejects independence.
The "Unknown" group (n=3,263) shows extreme behavior: 84% are Stopped 
(vs 14% overall), supporting the hypothesis that missingness reflects early failure 
rather than random omission. This group is retained and modeled explicitly rather than excluded.


### 2. By Clinical Phase

Unnamed: 0,Phase,n,Completed,Rate [95% CI]
0,Early Phase 1,559*,442,79.1% [75.5-82.2%]
1,Phase 1,6459,5526,85.6% [84.7-86.4%]
2,Phase 1/2,1710,1249,73.0% [70.9-75.1%]
3,Phase 2,7522,5755,76.5% [75.5-77.5%]
4,Phase 2/3,842*,686,81.5% [78.7-84.0%]
5,Phase 3,5221,4464,85.5% [84.5-86.4%]
6,Phase 4,4157,3462,83.3% [82.1-84.4%]
7,Not Applicable,36488,32600,89.3% [89.0-89.7%]



> **Note on "Not Applicable":** 36,488 trials (58% of resolved) have no designated phase—mostly 
> observational/registry studies. These have the **highest completion rate** (89.3%) and are 
> **excluded from phase-based regression** in Section 5 (where "phase" as a predictor is not meaningful).


**χ² test:** Significant difference (χ²=342.8, p<0.001)


**Interpretation:** Several phase-level differences show overlapping confidence intervals, 
particularly among early and mid-stage phases, indicating that some unadjusted differences 
may not be statistically distinguishable. Importantly, enrollment size and clinical phase are 
strongly correlated (later-phase trials tend to enroll more patients), motivating the adjusted 
analysis in Section 5.


### 3. Oncology vs Non-Oncology

Unnamed: 0,Category,n,Completed,Rate [95% CI]
0,Non-Oncology,53207,46660,87.7% [87.4-88.0%]
1,Oncology,9751,7524,77.2% [76.3-78.0%]


**χ² test:** Significant difference (χ²=761.5, p<0.001)

### 4. By Sponsor Type


> **Note:** "Other" is heterogeneous—includes academic, government, non-profit, and missing/undefined sponsors.
> Interpret crude differences cautiously; Section 5 adjusts for confounders.


Unnamed: 0,Sponsor,n,Completed,Rate [95% CI]
0,Other,45629,39474,86.5% [86.2-86.8%]
1,Industry,17329,14710,84.9% [84.3-85.4%]


**χ² test:** Significant difference (χ²=27.5, p<0.001)

### Confounding Context: Industry × Phase Distribution


The sponsor effect (1.6 pp crude difference) cannot be interpreted in isolation because 
**industry sponsors are unevenly distributed across phases**:


Unnamed: 0,Phase,n Resolved,Industry-Sponsored
0,Early Phase 1,559,9%
1,Phase 1,6459,66%
2,Phase 1/2,1710,39%
3,Phase 2,7522,42%
4,Phase 2/3,842,24%
5,Phase 3,5221,61%
6,Phase 4,4157,25%



**Interpretation:** Industry sponsorship varies materially by phase (see table above). 
Later-phase trials have higher industry participation (Phase 3: 61% vs Early Phase 1: 9%).
Since Phase 3 trials complete at higher rates (85.5%) and are more often industry-sponsored, 
the crude sponsor comparison is **confounded by phase distribution**. Section 5.2 adjusts for this.


### Confounding Context: Enrollment × Phase


Enrollment size and clinical phase are also strongly correlated, which confounds BOTH effects:


Unnamed: 0,Phase,n Resolved,Median Enrollment
0,Early Phase 1,559,22
1,Phase 1,6459,29
2,Phase 1/2,1710,33
3,Phase 2,7522,50
4,Phase 2/3,842,82
5,Phase 3,5221,225
6,Phase 4,4157,73



**Interpretation:** Later-phase trials enroll substantially more patients (Phase 3 median ~200 vs Early Phase 1 ~30).
This explains the **Phase 3 paradox**: Phase 3 has a high unadjusted completion rate (85.5%) driven by 
larger enrollment, but Section 5.2 will show that after controlling for enrollment, Phase 3 actually has 
**lower** adjusted odds of completion (OR ≈ 0.36).


**Note on sponsor classification**

`sponsor_category` is derived from the registry field `lead_agency_class`.  
Trials are classified as **Industry** when the lead sponsor is a for-profit commercial entity (e.g., pharmaceutical, biotechnology, or medical device companies).

The **Other** category is heterogeneous and includes academic institutions, government agencies, non-profit organizations, cooperative groups, and studies with missing or undefined sponsor class.

As a result, the modest unadjusted difference observed between Industry and Other sponsors should be interpreted cautiously.  
Any sponsor effect is likely mediated by correlated factors such as trial phase and enrollment size, which are explicitly controlled for in the multivariable analysis (Section 5).

---

### 2.1.1 Key Observations from Unadjusted Rates

From the descriptive analysis above, four patterns merit attention:

**1. Enrollment size shows the strongest gradient** (83.5% → 96.2%):
- Larger trials complete at systematically higher rates
- **Effect magnitude:** Failure rate drops from 16.5% (small trials) to 3.8% (1000+ trials) — a **4.3× reduction**
- **Note: Unknown enrollment is anomalous (16%)**: Exceptionally low completion suggests **missingness is not random** — likely trials withdrawn before enrollment began or very early terminations
- **In practice**: Enrollment scale is a dominant operational driver; trials with unknown enrollment will be handled explicitly in Section 5

**2. Mid-stage phases underperform** (Phase 1/2: 73%, Phase 2: 76.5% vs Phase 1/3: ~85%):
- Non-monotonic pattern suggests phase captures heterogeneous risks, not linear progression
- **Important**: Section 5.2 will show this pattern changes after controlling for enrollment — Phase 3's high unadjusted rate is driven by larger enrollment, not inherent ease (Phase 3 paradox)

**3. Oncology trials have substantially lower completion** (77.2% vs 87.7%):
- 10.5 percentage point gap consistent with therapeutic area complexity
- **Effect magnitude:** Oncology failure rate is **1.85× higher** (22.8% vs 12.3%)
- **In practice**: This gap is consistent with the higher complexity typically associated with oncology trials, including more restrictive eligibility criteria and challenging recruitment environments

**4. Sponsor type shows limited unadjusted separation** (Industry 84.9% vs Other 86.5%):
- Only 1.6 pp gap in unadjusted rates
- **Note: Confounded**: Sponsor mix differs materially by phase and enrollment (see cross-tabs above)
- **Important**: Crude sponsor differences should be interpreted cautiously; Section 5 quantifies the adjusted association after controlling for confounders

> **Interpretation note:** These are **descriptive associations**, not causal effects. The cross-tabulation above demonstrates why: variables are correlated (Phase 3 trials are larger AND more often industry-sponsored). Section 5 uses multivariable regression to isolate **independent associations** controlling for this confounding.

---

---

# 3. Termination Patterns

**Objective:** Characterize how stopped trials differ depending on *when* they stop in the trial lifecycle, 
and assess whether these differences have practical implications for trial planning and portfolio management.

We classify stopped trials using registry-defined categories:
- **Terminated:** stopped after trial execution had begun
- **Withdrawn:** stopped before, or very early in, execution
- **Suspended:** temporarily halted or uncertain status (treated as stopped for this analysis)

In [7]:
# ============================================================
# 3.1 Failure Type Composition
# ============================================================

display(Markdown("## 3.1 Failure Type Composition"))

# Filter to stopped trials only
df_stopped = df_abt[df_abt['outcome_group'] == 'Stopped'].copy()
n_stopped = len(df_stopped)

# Calculate counts with Wilson CIs
from src.analysis.metrics import wilson_ci

failure_dist = df_stopped['failure_type'].value_counts().reset_index()
failure_dist.columns = ['Failure Type', 'Count']
failure_dist['Rate'] = failure_dist['Count'] / n_stopped * 100

# Add Wilson CIs
failure_dist['ci_lower'], failure_dist['ci_upper'] = zip(*failure_dist['Count'].apply(
    lambda x: wilson_ci(x, n_stopped)
))
failure_dist['ci_lower'] *= 100
failure_dist['ci_upper'] *= 100
failure_dist['Rate [95% CI]'] = failure_dist.apply(
    lambda r: f"{r['Rate']:.1f}% [{r['ci_lower']:.1f}-{r['ci_upper']:.1f}%]", axis=1
)

display(Markdown(f"**Stopped trials breakdown (n={n_stopped:,}):**"))
display(failure_dist[['Failure Type', 'Count', 'Rate [95% CI]']])

# Extract rates for dynamic text
term_rate = failure_dist[failure_dist['Failure Type'] == 'Terminated']['Rate'].values[0]
with_rate = failure_dist[failure_dist['Failure Type'] == 'Withdrawn']['Rate'].values[0]
susp_rate = failure_dist[failure_dist['Failure Type'] == 'Suspended']['Rate'].values[0]

# Cross-reference to Section 1
n_resolved = len(df_abt[df_abt['is_resolved'] == 1])
stopped_pct = n_stopped / n_resolved * 100

display(Markdown(f"""
**Interpretation:**  
Within stopped trials, **Terminated** is the most common registry outcome ({term_rate:.1f}%), followed by 
**Withdrawn** ({with_rate:.1f}%), with **Suspended** representing a small share ({susp_rate:.1f}%).

Next, we test whether these labels map to different observable profiles (enrollment, phase, sponsor, cohort).

*Context: {n_stopped:,} stopped trials = {stopped_pct:.1f}% of resolved.*
"""))

## 3.1 Failure Type Composition

**Stopped trials breakdown (n=8,774):**

Unnamed: 0,Failure Type,Count,Rate [95% CI]
0,Terminated,5755,65.6% [64.6-66.6%]
1,Withdrawn,2730,31.1% [30.2-32.1%]
2,Suspended,289,3.3% [2.9-3.7%]



**Interpretation:**  
Within stopped trials, **Terminated** is the most common registry outcome (65.6%), followed by 
**Withdrawn** (31.1%), with **Suspended** representing a small share (3.3%).

Next, we test whether these labels map to different observable profiles (enrollment, phase, sponsor, cohort).

*Context: 8,774 stopped trials = 13.9% of resolved.*


In [8]:
# ============================================================
# 3.2 What Distinguishes Failure Types?
# ============================================================
# Uses helpers: calc_enrollment_presence, calc_crosstab_analysis, create_crosstab_heatmap

from src.analysis.metrics import (
    calc_enrollment_presence, calc_crosstab_analysis, create_sponsor_category
)
from src.analysis.viz import create_crosstab_heatmap
from src.analysis.constants import FAILURE_TYPES

display(Markdown("## 3.2 Structural Differences Between Failure Types"))

display(Markdown("""
The key structural difference between failure types is **whether enrollment was ever reported**.
"""))

# --- Enrollment as a lifecycle marker ---
display(Markdown("### Enrollment Reporting by Failure Type"))

# Use helper function for enrollment presence analysis
enrollment_presence = calc_enrollment_presence(df_stopped, 'failure_type')
enrollment_presence = enrollment_presence.set_index('failure_type').loc[['Terminated', 'Withdrawn', 'Suspended']]

display(Markdown("*'% with enrollment' = study has a positive enrollment value recorded in the registry.*"))

display(enrollment_presence[['n_total', 'pct_with_enrollment']].rename(
    columns={
        'n_total': 'n Stopped',
        'pct_with_enrollment': '% with Enrollment'
    }
).round(1))

term_cov = enrollment_presence.loc['Terminated', 'pct_with_enrollment']
susp_cov = enrollment_presence.loc['Suspended', 'pct_with_enrollment']
with_cov = enrollment_presence.loc['Withdrawn', 'pct_with_enrollment']
with_n = int(enrollment_presence.loc['Withdrawn', 'n_with_enrollment'])

display(Markdown(f"""
**What this indicates:**  
Enrollment reporting is a strong "lifecycle marker" in this dataset. **Terminated/Suspended trials almost 
always report enrollment** ({term_cov:.1f}% and {susp_cov:.1f}%), while **Withdrawn trials almost never do** ({with_cov:.1f}%).

A small subset of Withdrawn trials (**{with_n:,} studies**) report non-zero enrollment; this is too small 
to change the overall pattern and may reflect registry update or classification noise.

*Limitation: "No enrollment reported" means missing/unreported data, not proof of zero enrollment.*
"""))

# --- Failure type composition by phase ---
display(Markdown("### Failure Type Composition by Phase"))

# Use helper for crosstab analysis
phases_for_heatmap = PHASE_ORDER_CLINICAL + ['Not Applicable']

phase_failure = calc_crosstab_analysis(
    df_stopped[df_stopped['phase_group'].isin(phases_for_heatmap)],
    'phase_group', 'failure_type',
    row_order=phases_for_heatmap,
    col_order=FAILURE_TYPES
)

# Create heatmap using helper
fig_heatmap = create_crosstab_heatmap(
    phase_failure['counts'].drop('All', errors='ignore'),
    phase_failure['pct_row'],
    title='Failure Type Composition by Phase',
    subtitle='Cell values: count (row %)',
    y_title='Phase',
    x_title='Failure Type',
)
fig_heatmap.show()

# Extract key values for interpretation
early_withdrawn = phase_failure['pct_row'].loc['Early Phase 1', 'Withdrawn']
phase3_withdrawn = phase_failure['pct_row'].loc['Phase 3', 'Withdrawn']
phase4_withdrawn = phase_failure['pct_row'].loc['Phase 4', 'Withdrawn'] if 'Phase 4' in phase_failure['pct_row'].index else 0
na_withdrawn = phase_failure['pct_row'].loc['Not Applicable', 'Withdrawn'] if 'Not Applicable' in phase_failure['pct_row'].index else 0

# Calculate Not Applicable's share of ALL Withdrawn
total_withdrawn = df_stopped[df_stopped['failure_type'] == 'Withdrawn'].shape[0]
na_withdrawn_n = df_stopped[(df_stopped['failure_type'] == 'Withdrawn') & (df_stopped['phase_group'] == 'Not Applicable')].shape[0]
na_share_of_withdrawn = na_withdrawn_n / total_withdrawn * 100 if total_withdrawn > 0 else 0

display(Markdown(f"""
**Phase pattern (Withdrawn share):**  
- Early Phase 1: **{early_withdrawn:.0f}%**  
- Phase 3: **{phase3_withdrawn:.0f}%**  
- Phase 4: **{phase4_withdrawn:.0f}%**  

This forms a U-shaped pattern: Withdrawn is highest in Early Phase 1, lowest in Phase 3, and higher again in Phase 4. 
We cannot verify study intent with the available fields.

**Note:** "Not Applicable" represents **{na_share_of_withdrawn:.0f}%** of all Withdrawn trials (n={na_withdrawn_n:,} of {total_withdrawn:,}), 
so aggregate Withdrawn patterns are materially influenced by non-phase-designated studies.

*{phase_failure['test']['interpretation']}.*
"""))

# --- Sponsor differences in failure composition ---
display(Markdown("### Sponsor Differences in Failure Composition"))

# Ensure sponsor_category exists for stopped trials
df_stopped['sponsor_category'] = create_sponsor_category(df_stopped['lead_agency_class'])

# Use helper for crosstab analysis
sponsor_failure = calc_crosstab_analysis(
    df_stopped, 'sponsor_category', 'failure_type',
    col_order=FAILURE_TYPES
)

# Extract values for interpretation
ind_term = sponsor_failure['pct_row'].loc['Industry', 'Terminated']
other_term = sponsor_failure['pct_row'].loc['Other', 'Terminated']
ind_with = sponsor_failure['pct_row'].loc['Industry', 'Withdrawn']
other_with = sponsor_failure['pct_row'].loc['Other', 'Withdrawn']

display(Markdown(f"""
| Sponsor | Terminated | Withdrawn | Suspended |
|---------|------------|-----------|-----------|
| Industry | {ind_term:.1f}% | {ind_with:.1f}% | {100 - ind_term - ind_with:.1f}% |
| Other | {other_term:.1f}% | {other_with:.1f}% | {100 - other_term - other_with:.1f}% |

Industry shows a higher Terminated share ({ind_term:.1f}% vs {other_term:.1f}%), while Other shows a higher 
Withdrawn share ({other_with:.1f}% vs {ind_with:.1f}%). This suggests differences in **when** stopped outcomes 
occur across sponsor classes, but reasons-for-stopping are not observable in this dataset.

*{sponsor_failure['test']['interpretation']}.*
"""))

# --- Narrative synthesis ---
# Get values for dynamic summary
term_pct = failure_dist[failure_dist['Failure Type'] == 'Terminated']['Rate'].values[0]
with_pct = failure_dist[failure_dist['Failure Type'] == 'Withdrawn']['Rate'].values[0]
susp_pct = failure_dist[failure_dist['Failure Type'] == 'Suspended']['Rate'].values[0]
withdrawn_no_enrollment = 100 - enrollment_presence.loc['Withdrawn', 'pct_with_enrollment']

# Get enrollment percentages dynamically
term_enroll_pct = enrollment_presence.loc['Terminated', 'pct_with_enrollment']
with_enroll_pct = enrollment_presence.loc['Withdrawn', 'pct_with_enrollment']
susp_enroll_pct = enrollment_presence.loc['Suspended', 'pct_with_enrollment']

display(Markdown(f"""
---
## 3.3 Summary

- **Terminated dominates** stopped outcomes (**{term_pct:.1f}%**), followed by **Withdrawn** (**{with_pct:.1f}%**).
- **Enrollment reporting separates failure types**: Terminated/Suspended ~always report enrollment; Withdrawn ~never does.
- **Withdrawn varies by phase** (U-shaped: Early Phase 1 high → Phase 3 low → Phase 4 higher).
- **"Not Applicable" drives Withdrawn volume** (**{na_share_of_withdrawn:.0f}%** of all Withdrawn), so interpret Withdrawn patterns with scope in mind.

---
"""))

## 3.2 Structural Differences Between Failure Types


The key structural difference between failure types is **whether enrollment was ever reported**.


### Enrollment Reporting by Failure Type

*'% with enrollment' = study has a positive enrollment value recorded in the registry.*

Unnamed: 0_level_0,n Stopped,% with Enrollment
failure_type,Unnamed: 1_level_1,Unnamed: 2_level_1
Terminated,5755,99.3
Withdrawn,2730,1.1
Suspended,289,99.7



**What this indicates:**  
Enrollment reporting is a strong "lifecycle marker" in this dataset. **Terminated/Suspended trials almost 
always report enrollment** (99.3% and 99.7%), while **Withdrawn trials almost never do** (1.1%).

A small subset of Withdrawn trials (**31 studies**) report non-zero enrollment; this is too small 
to change the overall pattern and may reflect registry update or classification noise.

*Limitation: "No enrollment reported" means missing/unreported data, not proof of zero enrollment.*


### Failure Type Composition by Phase


**Phase pattern (Withdrawn share):**  
- Early Phase 1: **40%**  
- Phase 3: **23%**  
- Phase 4: **33%**  

This forms a U-shaped pattern: Withdrawn is highest in Early Phase 1, lowest in Phase 3, and higher again in Phase 4. 
We cannot verify study intent with the available fields.

**Note:** "Not Applicable" represents **52%** of all Withdrawn trials (n=1,416 of 2,730), 
so aggregate Withdrawn patterns are materially influenced by non-phase-designated studies.

*Significant difference (χ²=173.4, p<0.001).*


### Sponsor Differences in Failure Composition


| Sponsor | Terminated | Withdrawn | Suspended |
|---------|------------|-----------|-----------|
| Industry | 76.7% | 20.4% | 2.9% |
| Other | 60.9% | 35.7% | 3.4% |

Industry shows a higher Terminated share (76.7% vs 60.9%), while Other shows a higher 
Withdrawn share (35.7% vs 20.4%). This suggests differences in **when** stopped outcomes 
occur across sponsor classes, but reasons-for-stopping are not observable in this dataset.

*Significant difference (χ²=208.6, p<0.001).*



---
## 3.3 Summary

- **Terminated dominates** stopped outcomes (**65.6%**), followed by **Withdrawn** (**31.1%**).
- **Enrollment reporting separates failure types**: Terminated/Suspended ~always report enrollment; Withdrawn ~never does.
- **Withdrawn varies by phase** (U-shaped: Early Phase 1 high → Phase 3 low → Phase 4 higher).
- **"Not Applicable" drives Withdrawn volume** (**52%** of all Withdrawn), so interpret Withdrawn patterns with scope in mind.

---


In [9]:
# ============================================================
# 3.4 Temporal Analysis: Failure Types by Cohort
# ============================================================
# Has the Terminated/Withdrawn composition changed over time?

from src.analysis.metrics import create_start_cohorts
from src.analysis.viz import create_stacked_bar_chart
from src.analysis.constants import COHORT_LABELS

display(Markdown("## 3.4 Failure Type Composition Over Time"))

display(Markdown("""
**Question:** Has the Terminated/Withdrawn composition changed across start-year cohorts?
"""))

# Create cohorts using helper
df_stopped['start_cohort'] = create_start_cohorts(df_stopped['start_year'])

# Use helper for crosstab
cohort_failure = calc_crosstab_analysis(
    df_stopped.dropna(subset=['start_cohort']),
    'start_cohort', 'failure_type',
    col_order=FAILURE_TYPES
)

display(Markdown("### Failure Type Composition by Start-Year Cohort"))
display(cohort_failure['pct_row'].round(1))

# Create stacked bar chart using helper
cohort_data = cohort_failure['pct_row'].reset_index().melt(
    id_vars='start_cohort',
    var_name='Failure Type',
    value_name='Percentage'
)

fig_cohort = create_stacked_bar_chart(
    cohort_data,
    x_col='start_cohort',
    y_col='Percentage',
    color_col='Failure Type',
    color_map=FAILURE_COLORS,
    title='Failure Type Composition by Start-Year Cohort',
    x_title='Start Year Cohort',
    y_title='Share of Stopped Trials (%)',
)
fig_cohort.show()

display(Markdown(f"**χ² test:** {cohort_failure['test']['interpretation']}"))

# Extract trends for interpretation (using first and last cohort labels)
first_cohort, last_cohort = COHORT_LABELS[0], COHORT_LABELS[-1]
withdrawn_first = cohort_failure['pct_row'].loc[first_cohort, 'Withdrawn'] if first_cohort in cohort_failure['pct_row'].index else 0
withdrawn_last = cohort_failure['pct_row'].loc[last_cohort, 'Withdrawn'] if last_cohort in cohort_failure['pct_row'].index else 0
terminated_first = cohort_failure['pct_row'].loc[first_cohort, 'Terminated'] if first_cohort in cohort_failure['pct_row'].index else 0
terminated_last = cohort_failure['pct_row'].loc[last_cohort, 'Terminated'] if last_cohort in cohort_failure['pct_row'].index else 0

# Calculate the shift for summary
shift_withdrawn = withdrawn_last - withdrawn_first
shift_terminated = terminated_last - terminated_first

# Determine trend direction
if shift_withdrawn > 10:
    trend_text = f"Withdrawn share increased +{shift_withdrawn:.0f} pp from {first_cohort} to {last_cohort}."
    hypothesis_text = "This shift is consistent with earlier stopping decisions, but the dataset cannot attribute a cause."
elif abs(shift_withdrawn) <= 5:
    trend_text = "Composition relatively stable across cohorts."
    hypothesis_text = ""
else:
    trend_text = f"Moderate shift: Withdrawn {'+' if shift_withdrawn > 0 else ''}{shift_withdrawn:.0f} pp."
    hypothesis_text = ""

display(Markdown(f"""
### Temporal Trends

| Cohort | Terminated | Withdrawn |
|--------|------------|-----------|
| {first_cohort} | {terminated_first:.1f}% | {withdrawn_first:.1f}% |
| {last_cohort} | {terminated_last:.1f}% | {withdrawn_last:.1f}% |

{trend_text} {hypothesis_text}

*Recent cohorts may include COVID-19 effects.*

---
"""))

## 3.4 Failure Type Composition Over Time


**Question:** Has the Terminated/Withdrawn composition changed across start-year cohorts?


### Failure Type Composition by Start-Year Cohort

failure_type,Terminated,Withdrawn,Suspended
start_cohort,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1990-1999,83.3,16.7,0.0
2000-2009,76.0,21.9,2.0
2010-2019,68.6,29.0,2.4
2020-2025,49.4,44.2,6.4


**χ² test:** Significant difference (χ²=395.9, p<0.001)


### Temporal Trends

| Cohort | Terminated | Withdrawn |
|--------|------------|-----------|
| 1990-1999 | 83.3% | 16.7% |
| 2020-2025 | 49.4% | 44.2% |

Withdrawn share increased +28 pp from 1990-1999 to 2020-2025. This shift is consistent with earlier stopping decisions, but the dataset cannot attribute a cause.

*Recent cohorts may include COVID-19 effects.*

---


---
# 4. Temporal Dimension

**Context:**  
Section 2 analyzed completion rates by trial characteristics (phase, enrollment, sponsor) on a cross-sectional basis. This section examines whether completion rates have changed over time by start-year cohort.

**Critical interpretation issue: Lifecycle effects and right-censoring**

Completion rate trends over time are subject to a structural bias: trials started recently have had less time to reach a terminal outcome (Completed or Stopped) than trials started decades ago.

For example:
- A trial started in 1995 has had 30+ years to complete or stop by the 2026-01-18 extraction date
- A trial started in 2023 has had only 3 years to complete or stop

If both trials require 5 years to reach completion, the 1995 trial will show as "Completed" in our data, while the 2023 trial may still show as "Recruiting" or "Active, not recruiting" and be excluded from resolved-outcome analysis.

This creates **right-censoring bias**: recent cohorts will mechanically show lower completion rates because a higher share of their trials remain Active (censored), not because trial quality has declined.

**Relationship to Section 3:**  
Section 3.4 examined temporal trends in failure-type composition (Terminated vs Withdrawn) and found a substantial shift (Withdrawn 17% → 44%). This section examines whether overall completion rates have changed, accounting for lifecycle effects.

**Analysis approach:**  
We calculate resolved completion rate by start-year cohort (trials with known outcomes only). Interpretation focuses on directional trends rather than absolute levels for recent years, where censoring is highest.


In [10]:
# ============================================================
# 4.1 Completion Rate by Start Year
# ============================================================

from src.analysis.viz import create_simple_line_chart
from src.analysis.metrics import create_start_cohorts
from src.analysis.constants import COHORT_LABELS

display(Markdown("## 4.1 Completion Rate by Start Year"))

# Calculate completion rate by start year
yearly_rates = calc_completion_rate(df_abt, 'start_year', min_n=50, include_ci=True)
yearly_rates = yearly_rates.sort_values('start_year')

# Simple line chart using helper
fig_temporal = create_simple_line_chart(
    yearly_rates,
    x_col='start_year',
    y_col='completion_rate_pct',
    title='Resolved Completion Rate by Start Year',
    subtitle='Resolved trials only',
    y_title='Completion Rate (%)',
    n_col='n_resolved',
    y_range=[0, 100],
)
fig_temporal.show()

# ============================================================
# Cohort-based analysis for statistical stability
# ============================================================

display(Markdown("### Completion Rate by Decade Cohort"))

# Create cohort bins using helper
df_abt['start_cohort'] = create_start_cohorts(df_abt['start_year'])

# Calculate completion rates by cohort
cohort_rates = calc_completion_rate(df_abt, 'start_cohort', include_ci=True)
cohort_rates = cohort_rates.sort_values('start_cohort')

display(cohort_rates[['start_cohort', 'n_resolved', 'n_completed', 'completion_rate_pct', 'ci_lower_pct', 'ci_upper_pct']])

# Extract key values using dict for cleaner access
cohort_data = cohort_rates.set_index('start_cohort')[['completion_rate_pct', 'n_resolved']].to_dict('index')
first_cohort, last_cohort = COHORT_LABELS[0], COHORT_LABELS[-1]

# ============================================================
# Active trial distribution (censoring analysis)
# ============================================================

display(Markdown("### Active Trial Distribution by Cohort (Censoring Analysis)"))

# Calculate Active share by cohort
active_by_cohort = df_abt.groupby('start_cohort', observed=False).agg(
    n_total=('study_id', 'count'),
    n_active=('is_active', 'sum'),
    n_resolved=('is_resolved', 'sum')
).reset_index()
active_by_cohort['pct_active'] = active_by_cohort['n_active'] / active_by_cohort['n_total'] * 100

display(active_by_cohort[['start_cohort', 'n_total', 'n_resolved', 'n_active', 'pct_active']].round(1))

# Extract censoring values
censoring_data = active_by_cohort.set_index('start_cohort')['pct_active'].to_dict()
censoring_first = censoring_data.get(first_cohort, 0)
censoring_last = censoring_data.get(last_cohort, 0)

# Calculate mean for mature cohorts (≤2020)
mature_cohort_rate = yearly_rates[yearly_rates['start_year'] <= 2020]['completion_rate_pct'].mean()

# ============================================================
# INTERPRETATION
# ============================================================

display(Markdown(f"""
---

### Interpretation: Lifecycle Effects Dominate Temporal Trends

The year-by-year chart shows declining completion rates for recent cohorts. This reflects right-censoring, 
not declining trial quality.

**Cohort-level rates (resolved trials only):**

| Cohort | Completion Rate | n Resolved |
|--------|-----------------|------------|
| {COHORT_LABELS[0]} | {cohort_data[COHORT_LABELS[0]]['completion_rate_pct']:.1f}% | {cohort_data[COHORT_LABELS[0]]['n_resolved']:,} |
| {COHORT_LABELS[1]} | {cohort_data[COHORT_LABELS[1]]['completion_rate_pct']:.1f}% | {cohort_data[COHORT_LABELS[1]]['n_resolved']:,} |
| {COHORT_LABELS[2]} | {cohort_data[COHORT_LABELS[2]]['completion_rate_pct']:.1f}% | {cohort_data[COHORT_LABELS[2]]['n_resolved']:,} |
| {COHORT_LABELS[3]} | {cohort_data[COHORT_LABELS[3]]['completion_rate_pct']:.1f}% | {cohort_data[COHORT_LABELS[3]]['n_resolved']:,} |

The {last_cohort} rate is unreliable: {censoring_last:.0f}% of trials remain Active (vs {censoring_first:.0f}% 
for {first_cohort}), having had insufficient time to complete. When restricting to cohorts with ≥5 years follow-up 
(started ≤2020), the completion rate remains within a narrow range around {mature_cohort_rate:.0f}%, with no clear 
monotonic decline.

**Connection to Section 3.4:**  
Section 3.4 documented a shift in failure-type composition (Withdrawn 17% → 44%), representing a change in *when* 
trials fail, not whether they fail. The resolved completion rate here shows no clear monotonic decline when 
restricting to cohorts with sufficient follow-up time.

**For portfolio planning and reporting:**  
Use pre-2020 cohorts (~{mature_cohort_rate:.0f}% completion rate) as benchmarks. Distinguish mature cohorts 
(≥5 years follow-up) from immature cohorts when presenting temporal trends. Misinterpreting recent-cohort rates 
as performance decline may lead to inappropriate conclusions about trial efficiency, given the strong influence 
of censoring in these cohorts.

---
"""))


## 4.1 Completion Rate by Start Year

### Completion Rate by Decade Cohort

Unnamed: 0,start_cohort,n_resolved,n_completed,completion_rate_pct,ci_lower_pct,ci_upper_pct
0,1990-1999,1033,949,91.868345,90.042349,93.384099
1,2000-2009,14124,12242,86.675163,86.104714,87.225667
2,2010-2019,32320,27676,85.631188,85.244534,86.009373
3,2020-2025,15481,13317,86.021575,85.466396,86.558881


### Active Trial Distribution by Cohort (Censoring Analysis)

Unnamed: 0,start_cohort,n_total,n_resolved,n_active,pct_active
0,1990-1999,1067,1033,34,3.2
1,2000-2009,14314,14124,190,1.3
2,2010-2019,34468,32320,2148,6.2
3,2020-2025,32858,15481,17377,52.9



---

### Interpretation: Lifecycle Effects Dominate Temporal Trends

The year-by-year chart shows declining completion rates for recent cohorts. This reflects right-censoring, 
not declining trial quality.

**Cohort-level rates (resolved trials only):**

| Cohort | Completion Rate | n Resolved |
|--------|-----------------|------------|
| 1990-1999 | 91.9% | 1,033 |
| 2000-2009 | 86.7% | 14,124 |
| 2010-2019 | 85.6% | 32,320 |
| 2020-2025 | 86.0% | 15,481 |

The 2020-2025 rate is unreliable: 53% of trials remain Active (vs 3% 
for 1990-1999), having had insufficient time to complete. When restricting to cohorts with ≥5 years follow-up 
(started ≤2020), the completion rate remains within a narrow range around 88%, with no clear 
monotonic decline.

**Connection to Section 3.4:**  
Section 3.4 documented a shift in failure-type composition (Withdrawn 17% → 44%), representing a change in *when* 
trials fail, not whether they fail. The resolved completion rate here shows no clear monotonic decline when 
restricting to cohorts with sufficient follow-up time.

**For portfolio planning and reporting:**  
Use pre-2020 cohorts (~88% completion rate) as benchmarks. Distinguish mature cohorts 
(≥5 years follow-up) from immature cohorts when presenting temporal trends. Misinterpreting recent-cohort rates 
as performance decline may lead to inappropriate conclusions about trial efficiency, given the strong influence 
of censoring in these cohorts.

---


---
# 5. Statistical Inference

## 5.0 Model Choice & Assumptions

### Why Logistic Regression on Resolved Trials?

We use binary logistic regression to model the probability of completion among resolved trials (Completed vs Stopped).

**Rationale:**

| Design Choice | Justification |
|---------------|---------------|
| **Binary outcome** | Research question asks "which factors are associated with completion" — not "when do trials complete" |
| **Resolved trials only** | Active trials are censored (outcome unknown); including them would bias completion rate estimates |
| **Cross-sectional design** | Status observed at extraction date, not longitudinal follow-up — temporal analysis (Section 4) handles lifecycle effects |
| **Association, not causation** | Observational data with unmeasured confounders — model identifies correlates, not causal effects |

Survival analysis (time-to-event) is not feasible: the registry does not reliably record stop dates for 
Terminated/Withdrawn trials (q2_abt.sql lines 188-195).

---

### Model Purpose: Explanatory, Not Predictive

| Model Type | Goal | Use Case |
|------------|------|----------|
| **Explanatory** | Understand associations between factors and outcome | Scientific insight, hypothesis generation |
| **Predictive** | Forecast outcomes for new trials | Operational decision-making, resource allocation |

This analysis is **explanatory**. We report AUC and confusion matrix as sanity checks, but the primary goal 
is to interpret adjusted associations (odds ratios), not to forecast individual trial outcomes.

**For decision-making:**  
Use this model to identify trial characteristics correlated with success and to flag high-risk trials for 
operational support. Do not use raw predicted probabilities for budget forecasting or contractual commitments 
unless calibration is validated.

---

### Key Modeling Assumptions

1. **Independence of observations**  
   Each trial's outcome is independent. Caveat: trials from the same sponsor may share unmeasured characteristics 
   (operational capacity, expertise). We do not model sponsor-level clustering.

2. **Linearity in the logit** (for continuous predictors)  
   `log_enrollment` has a linear relationship with log-odds of completion. Empirical logit plot (Diagnostic 1) will check this.

3. **No severe multicollinearity**  
   Predictors are not perfectly correlated. VIF diagnostics (Diagnostic 3) will assess this.

4. **Association, not causation**  
   This is an exploratory analysis. We identify factors correlated with completion but cannot conclude that 
   changing these factors would change outcomes. Unmeasured confounders (protocol complexity, investigator 
   experience) may explain observed associations.

---

### Model Specification

| Component | Description |
|-----------|-------------|
| **Target** | `is_completed` (1 = Completed, 0 = Stopped) |
| **Predictors** | Phase (categorical, reference = Phase 3), log(Enrollment), Sponsor type (Industry vs Other), Oncology flag |
| **Population** | Resolved trials only (Active excluded as censored) |
| **Approach** | Multivariable regression (adjusted associations) |

> **Caveat:** This is association analysis, not causal inference. Unmeasured confounders may explain observed patterns.

In [11]:
# ============================================================
# 5.1 Model Preparation & Assumption Diagnostics
# ============================================================

import statsmodels.formula.api as smf
from statsmodels.stats.outliers_influence import variance_inflation_factor
from scipy.stats import spearmanr

from src.analysis.viz import create_linearity_check_chart, create_cooks_distance_chart

display(Markdown("""
### 5.1 Assumption diagnostics (before interpreting odds ratios)

Before interpreting the regression coefficients as "drivers of completion", we run a small set of
sanity checks that validate the main statistical assumptions behind logistic regression:

- **Linearity in the logit** for `log_enrollment` (is the log transform reasonable?)
- **Separation** (no subgroup with 0% or 100% completion)
- **Multicollinearity** (phase and enrollment not explaining the same variance)
- **Influential observations** (Cook's distance) to ensure results are not driven by a small number of extreme trials

If these checks pass, odds ratios in Section 5.2 can be interpreted as *stable adjusted associations*
within this dataset (still non-causal).
"""))

# Prepare data for modeling
df_resolved = df_abt[df_abt['is_resolved'] == 1].copy()

# Filter to complete cases and clinical phases
# Exclude enrollment == 0 or missing (don't treat missing as 0)
df_model = df_resolved[
    (~df_resolved['phase_group'].isin(['Not Applicable', 'Other'])) &
    (df_resolved['enrollment'] > 0)  # Exclude missing/zero enrollment
].copy()

# Feature engineering
df_model['log_enrollment'] = np.log(df_model['enrollment'])
df_model['is_industry'] = (df_model['lead_agency_class'] == 'INDUSTRY').astype(int)

display(Markdown(f"**Modeling sample:** {len(df_model):,} phase-designated resolved trials with valid enrollment"))

display(Markdown(f"""
**Note on model population:** This sample excludes {len(df_resolved) - len(df_model):,} trials 
({(len(df_resolved) - len(df_model)) / len(df_resolved) * 100:.1f}% of resolved) with missing or zero enrollment.
Associations are conditional on having enrollment reported. This is necessary because log(0) is undefined,
but it means results may differ for trials that never reported enrollment (which are disproportionately
Withdrawn trials, as shown in Section 3.2).
"""))

# --- DIAGNOSTIC 1: Linearity in the Logit (log_enrollment) ---
display(Markdown("### Diagnostic 1: Linearity in the Logit"))
display(Markdown("*Check: Does log_enrollment have a linear relationship with log-odds of completion?*"))

# Bin log_enrollment into deciles
df_model['log_enroll_decile'] = pd.qcut(df_model['log_enrollment'], q=10, labels=False, duplicates='drop')

# Calculate empirical completion rate per bin
binned_stats = df_model.groupby('log_enroll_decile', observed=False).agg(
    mean_log_enrollment=('log_enrollment', 'mean'),
    n_trials=('study_id', 'count'),
    n_completed=('is_completed', 'sum')
).reset_index()

# Symmetric clamp to avoid numerical instability
eps = 1e-10
binned_stats['completion_rate_prop'] = binned_stats['n_completed'] / binned_stats['n_trials']
p_clamped = binned_stats['completion_rate_prop'].clip(eps, 1 - eps)
binned_stats['empirical_logit'] = np.log(p_clamped / (1 - p_clamped))

# Plot using helper
fig_linearity = create_linearity_check_chart(
    binned_stats,
    x_col='mean_log_enrollment',
    y_col='empirical_logit',
    n_col='n_trials',
)
fig_linearity.show()

# Quantify linearity with Spearman correlation
corr, p_value = spearmanr(binned_stats['mean_log_enrollment'], binned_stats['empirical_logit'])
p_str = "p < 0.001" if p_value < 0.001 else f"p = {p_value:.3f}"

display(Markdown(f"""
**Interpretation:**
The relationship is approximately linear (Spearman ρ = {corr:.3f}, {p_str}), supporting the log transform
for enrollment. If strong curvature is observed (ρ < 0.7), we would switch to enrollment buckets or spline terms.
"""))

# --- DIAGNOSTIC 2: Separation Check ---
display(Markdown("### Diagnostic 2: Separation (Perfect Prediction)"))
display(Markdown("*Check: Does any predictor perfectly predict outcome?*"))

# Check separation for ALL predictors
phase_separation = df_model.groupby('phase_group', observed=False).agg(
    n=('study_id', 'count'),
    n_completed=('is_completed', 'sum'),
    pct_completed=('is_completed', lambda x: f"{x.mean()*100:.1f}%")
).reset_index()

sponsor_separation = df_model.groupby('is_industry', observed=False).agg(
    n=('study_id', 'count'),
    n_completed=('is_completed', 'sum'),
    pct_completed=('is_completed', lambda x: f"{x.mean()*100:.1f}%")
).reset_index()
sponsor_separation['is_industry'] = sponsor_separation['is_industry'].map({0: 'Other', 1: 'Industry'})

oncology_separation = df_model.groupby('has_oncology_label', observed=False).agg(
    n=('study_id', 'count'),
    n_completed=('is_completed', 'sum'),
    pct_completed=('is_completed', lambda x: f"{x.mean()*100:.1f}%")
).reset_index()
oncology_separation['has_oncology_label'] = oncology_separation['has_oncology_label'].map({0: 'Non-oncology', 1: 'Oncology'})

display(Markdown("**Phase × Outcome:**"))
display(phase_separation)

display(Markdown("**Sponsor × Outcome:**"))
display(sponsor_separation)

display(Markdown("**Oncology × Outcome:**"))
display(oncology_separation)

display(Markdown("""
**Interpretation:**
No predictor shows 0% or 100% completion rate. All categories have variation in outcomes, allowing model convergence.
"""))

# --- DIAGNOSTIC 3: Multicollinearity ---
display(Markdown("### Diagnostic 3: Multicollinearity"))
display(Markdown("*Check: Are predictors highly correlated, inflating standard errors?*"))

# VIF for continuous predictors
X_vif_continuous = df_model[['log_enrollment', 'is_industry', 'has_oncology_label']].copy()
X_vif_continuous = X_vif_continuous.assign(const=1)

vif_continuous = pd.DataFrame({
    'Variable': ['log_enrollment', 'is_industry', 'has_oncology_label'],
    'VIF': [variance_inflation_factor(X_vif_continuous.values, i) for i in range(3)]
})

display(Markdown("**VIF for Continuous Predictors:**"))
display(vif_continuous.round(2))

# VIF for Phase dummies
X_vif_full = pd.get_dummies(df_model[['phase_group', 'log_enrollment', 'is_industry', 'has_oncology_label']],
                             columns=['phase_group'], drop_first=True, dtype=float)

phase_cols = [col for col in X_vif_full.columns if 'phase_group_' in col]
vif_phase_list = []

for col in phase_cols:
    try:
        vif_val = variance_inflation_factor(X_vif_full.astype(float).values, X_vif_full.columns.get_loc(col))
        vif_phase_list.append({
            'Phase Category': col.replace('phase_group_', ''),
            'VIF': vif_val
        })
    except Exception:
        pass  # Skip silently, report summary at end

# Report skipped dummies
skipped_count = len(phase_cols) - len(vif_phase_list)
if skipped_count > 0:
    display(Markdown(f"*VIF computed for {len(vif_phase_list)}/{len(phase_cols)} phase dummies; {skipped_count} skipped due to singularity*"))

if len(vif_phase_list) > 0:
    vif_phase = pd.DataFrame(vif_phase_list)
    display(Markdown("**VIF for Phase Categories (Dummy Variables):**"))
    display(Markdown("*Reference category excluded to avoid dummy variable trap.*"))
    display(vif_phase.round(2))

display(Markdown("""
**Interpretation:**
VIF < 5 for all predictors indicates low multicollinearity. Predictors measure distinct aspects of trials
(enrollment size, sponsor type, therapeutic area, phase). Standard errors are reliable.

*Threshold: VIF < 5 ideal; 5-10 acceptable; >10 problematic.*
"""))

# --- DIAGNOSTIC 4: Influential Observations (Cook's Distance) ---
display(Markdown("### Diagnostic 4: Influential Observations (Cook's Distance)"))
display(Markdown("*Check: Are results driven by a handful of extreme trials?*"))

# Fit preliminary model to get Cook's distance
formula = "is_completed ~ C(phase_group, Treatment(reference='Phase 3')) + log_enrollment + is_industry + has_oncology_label"
logit_prelim = smf.logit(formula, data=df_model).fit(disp=0)

# Calculate Cook's distance
influence = logit_prelim.get_influence()
cooks_d = influence.cooks_distance[0]

# Identify influential observations (Cook's D > 4/n threshold)
threshold = 4 / len(df_model)
n_influential = (cooks_d > threshold).sum()
max_cooks = cooks_d.max()

# Get top 5 most influential trials
df_model_temp = df_model.copy()
df_model_temp['cooks_d'] = cooks_d
top_influential = df_model_temp.nlargest(5, 'cooks_d')[
    ['nct_id', 'phase_group', 'enrollment', 'is_completed', 'cooks_d']
]

display(Markdown(f"**Influential observations:** {n_influential:,} trials exceed threshold (Cook's D > {threshold:.6f})"))
display(Markdown(f"**Maximum Cook's D:** {max_cooks:.6f}"))
display(Markdown("**Top 5 influential trials:**"))
display(top_influential.round(6))

# Plot using helper
fig_cooks = create_cooks_distance_chart(cooks_d, threshold)
fig_cooks.show()

display(Markdown(f"""
**Interpretation:**
Maximum Cook's D = {max_cooks:.4f} (threshold for concern: 0.5). No individual trial dominates regression
coefficients. The {n_influential:,} observations ({n_influential/len(df_model)*100:.1f}%) that exceed the
technical threshold (4/n = {threshold:.6f}) have negligible absolute influence.
"""))

# --- SUMMARY ---
# Get phase counts for imbalance note
phase_counts = df_model['phase_group'].value_counts()
min_phase = phase_counts.idxmin()
min_n = phase_counts.min()
max_phase = phase_counts.idxmax()
max_n = phase_counts.max()

display(Markdown(f"""
---

### Summary: Diagnostic Checks

| Check | Result |
|-------|--------|
| **Linearity** | Approximately linear (Spearman ρ = {corr:.2f}) |
| **Separation** | No perfect prediction (all groups show variation) |
| **Multicollinearity** | VIF < 5 for all predictors |
| **Influential observations** | Max Cook's D = {max_cooks:.4f} (no individual trial dominates) |

**Assessment:** All diagnostics pass. Model assumptions are sufficiently satisfied for reliable inference within this dataset. Odds ratios in Section 5.2
represent stable adjusted associations.

---

### Note on Category Imbalance

Phase categories are unevenly distributed (n = {min_n:,} for {min_phase} vs n = {max_n:,} for {max_phase}),
reflecting real-world trial pipelines. Diagnostic checks (separation, VIF, Cook's distance) indicate
that this imbalance does not induce instability or perfect prediction. Standard logistic regression
provides stable adjusted associations without weighting or penalization.
"""))


### 5.1 Assumption diagnostics (before interpreting odds ratios)

Before interpreting the regression coefficients as "drivers of completion", we run a small set of
sanity checks that validate the main statistical assumptions behind logistic regression:

- **Linearity in the logit** for `log_enrollment` (is the log transform reasonable?)
- **Separation** (no subgroup with 0% or 100% completion)
- **Multicollinearity** (phase and enrollment not explaining the same variance)
- **Influential observations** (Cook's distance) to ensure results are not driven by a small number of extreme trials

If these checks pass, odds ratios in Section 5.2 can be interpreted as *stable adjusted associations*
within this dataset (still non-causal).


**Modeling sample:** 24,811 phase-designated resolved trials with valid enrollment


**Note on model population:** This sample excludes 38,147 trials 
(60.6% of resolved) with missing or zero enrollment.
Associations are conditional on having enrollment reported. This is necessary because log(0) is undefined,
but it means results may differ for trials that never reported enrollment (which are disproportionately
Withdrawn trials, as shown in Section 3.2).


### Diagnostic 1: Linearity in the Logit

*Check: Does log_enrollment have a linear relationship with log-odds of completion?*


**Interpretation:**
The relationship is approximately linear (Spearman ρ = 0.964, p < 0.001), supporting the log transform
for enrollment. If strong curvature is observed (ρ < 0.7), we would switch to enrollment buckets or spline terms.


### Diagnostic 2: Separation (Perfect Prediction)

*Check: Does any predictor perfectly predict outcome?*

**Phase × Outcome:**

Unnamed: 0,phase_group,n,n_completed,pct_completed
0,Early Phase 1,510,440,86.3%
1,Phase 1,6146,5466,88.9%
2,Phase 1/2,1567,1232,78.6%
3,Phase 2,6924,5612,81.1%
4,Phase 2/3,790,681,86.2%
5,Phase 3,4974,4385,88.2%
6,Phase 4,3900,3433,88.0%


**Sponsor × Outcome:**

Unnamed: 0,is_industry,n,n_completed,pct_completed
0,Other,12703,10722,84.4%
1,Industry,12108,10527,86.9%


**Oncology × Outcome:**

Unnamed: 0,has_oncology_label,n,n_completed,pct_completed
0,Non-oncology,19182,16939,88.3%
1,Oncology,5629,4310,76.6%



**Interpretation:**
No predictor shows 0% or 100% completion rate. All categories have variation in outcomes, allowing model convergence.


### Diagnostic 3: Multicollinearity

*Check: Are predictors highly correlated, inflating standard errors?*

**VIF for Continuous Predictors:**

Unnamed: 0,Variable,VIF
0,log_enrollment,1.05
1,is_industry,1.04
2,has_oncology_label,1.03


**VIF for Phase Categories (Dummy Variables):**

*Reference category excluded to avoid dummy variable trap.*

Unnamed: 0,Phase Category,VIF
0,Phase 1,2.94
1,Phase 1/2,1.52
2,Phase 2,3.7
3,Phase 2/3,1.36
4,Phase 3,4.29
5,Phase 4,2.67



**Interpretation:**
VIF < 5 for all predictors indicates low multicollinearity. Predictors measure distinct aspects of trials
(enrollment size, sponsor type, therapeutic area, phase). Standard errors are reliable.

*Threshold: VIF < 5 ideal; 5-10 acceptable; >10 problematic.*


### Diagnostic 4: Influential Observations (Cook's Distance)

*Check: Are results driven by a handful of extreme trials?*

**Influential observations:** 2,116 trials exceed threshold (Cook's D > 0.000161)

**Maximum Cook's D:** 0.003118

**Top 5 influential trials:**

Unnamed: 0,nct_id,phase_group,enrollment,is_completed,cooks_d
101,NCT00212407,Early Phase 1,4476.0,0,0.003118
21331,NCT01805505,Early Phase 1,346.0,0,0.002309
19873,NCT01472211,Early Phase 1,317.0,0,0.002287
51125,NCT04348474,Early Phase 1,200.0,0,0.002153
21313,NCT02212210,Early Phase 1,135.0,0,0.002086



**Interpretation:**
Maximum Cook's D = 0.0031 (threshold for concern: 0.5). No individual trial dominates regression
coefficients. The 2,116 observations (8.5%) that exceed the
technical threshold (4/n = 0.000161) have negligible absolute influence.



---

### Summary: Diagnostic Checks

| Check | Result |
|-------|--------|
| **Linearity** | Approximately linear (Spearman ρ = 0.96) |
| **Separation** | No perfect prediction (all groups show variation) |
| **Multicollinearity** | VIF < 5 for all predictors |
| **Influential observations** | Max Cook's D = 0.0031 (no individual trial dominates) |

**Assessment:** All diagnostics pass. Model assumptions are sufficiently satisfied for reliable inference within this dataset. Odds ratios in Section 5.2
represent stable adjusted associations.

---

### Note on Category Imbalance

Phase categories are unevenly distributed (n = 510 for Early Phase 1 vs n = 6,924 for Phase 2),
reflecting real-world trial pipelines. Diagnostic checks (separation, VIF, Cook's distance) indicate
that this imbalance does not induce instability or perfect prediction. Standard logistic regression
provides stable adjusted associations without weighting or penalization.


In [12]:
# ============================================================
# 5.2 Logistic Regression Results
# ============================================================

display(Markdown("## 5.2 Logistic Regression Results"))

# Model formula (MUST match reference in 5.0 and Diagnostic 4)
formula = "is_completed ~ C(phase_group, Treatment(reference='Phase 3')) + log_enrollment + is_industry + has_oncology_label"

# Fit model
logit_model = smf.logit(formula, data=df_model).fit(disp=0)

# Check for convergence
if logit_model.mle_retvals['converged']:
    display(Markdown("**Model converged successfully** (no separation issues detected)"))
else:
    display(Markdown("**Warning**: Model did not converge — check for perfect separation"))

# --- Odds Ratios Table ---
display(Markdown("### Adjusted Odds Ratios"))

odds_ratios = np.exp(logit_model.params)
conf_int = np.exp(logit_model.conf_int())

or_table = pd.DataFrame({
    'Odds Ratio': odds_ratios,
    '95% CI Lower': conf_int[0],
    '95% CI Upper': conf_int[1],
    'p-value': logit_model.pvalues
}).round(3)

# Clean up variable names for display
or_table.index = (or_table.index
    .str.replace(r"C\(phase_group, Treatment\(reference='Phase 3'\)\)\[T\.", '', regex=True)
    .str.replace(']', '', regex=False))

display(or_table[['Odds Ratio', '95% CI Lower', '95% CI Upper', 'p-value']])


# Safe extraction of ORs with fallback for potential key variations
def safe_get_or(table, key, col, default=1.0):
    """Safely get OR value, handling potential key variations."""
    if key in table.index:
        return table.loc[key, col]
    # Try common variations
    for variant in [f'{key}[T.True]', f'{key}[T.1]', key.replace('_', '')]:
        if variant in table.index:
            return table.loc[variant, col]
    return default


display(Markdown("""
**Reference categories:** Phase 3 (for phase comparisons), Other sponsor (is_industry=0), Non-oncology (has_oncology_label=0)

**Interpretation:** OR > 1 indicates higher odds of completion; OR < 1 indicates lower odds (relative to reference).
"""))

# --- Key Findings (Developed Interpretation) ---
display(Markdown("### Key Findings"))

# Extract key ORs for narrative
or_enrollment = or_table.loc['log_enrollment', 'Odds Ratio']
or_enrollment_lower = or_table.loc['log_enrollment', '95% CI Lower']
or_enrollment_upper = or_table.loc['log_enrollment', '95% CI Upper']

or_industry = or_table.loc['is_industry', 'Odds Ratio']
or_industry_lower = or_table.loc['is_industry', '95% CI Lower']
or_industry_upper = or_table.loc['is_industry', '95% CI Upper']

or_oncology = or_table.loc['has_oncology_label', 'Odds Ratio']
or_oncology_lower = or_table.loc['has_oncology_label', '95% CI Lower']
or_oncology_upper = or_table.loc['has_oncology_label', '95% CI Upper']

# Phase ORs (handle potential missing phases gracefully)
phase_ors = {}
for phase in ['Early Phase 1', 'Phase 1', 'Phase 1/2', 'Phase 2', 'Phase 2/3', 'Phase 4']:
    if phase in or_table.index:
        phase_ors[phase] = {
            'or': or_table.loc[phase, 'Odds Ratio'],
            'lower': or_table.loc[phase, '95% CI Lower'],
            'upper': or_table.loc[phase, '95% CI Upper']
        }

display(Markdown(f"""
1. **Enrollment is strongly associated with completion:**
   Each one-unit increase in log-enrollment (approximately a 2.7x increase in sample size) is associated with
   {or_enrollment:.2f}x higher odds of completion (95% CI: [{or_enrollment_lower:.2f}, {or_enrollment_upper:.2f}],
   p < 0.001). This is the strongest predictor in the model.

   *Practical meaning:* Doubling enrollment (e.g., from 100 to 200 patients) increases completion odds by
   approximately 50%. A 10x increase in enrollment (e.g., 50 to 500 patients) is associated with ~{or_enrollment:.1f}x
   higher odds, holding other factors constant.

2. **Earlier phases show higher completion odds after adjusting for enrollment:**
   Relative to Phase 3 (reference), earlier phases show **higher** adjusted completion odds. Phase 1 has
   {phase_ors.get('Phase 1', {}).get('or', 'N/A')}x the odds (95% CI: [{phase_ors.get('Phase 1', {}).get('lower', 'N/A')},
   {phase_ors.get('Phase 1', {}).get('upper', 'N/A')}]), and Phase 2 has {phase_ors.get('Phase 2', {}).get('or', 'N/A')}x the odds.

   *Why is this counterintuitive?* Unadjusted rates (Section 2) showed Phase 3 at ~88% vs Phase 1 at ~89% (nearly identical).
   However, Phase 1 trials typically have much lower enrollment (median ~20-30) compared to Phase 3 (median ~200-300).
   When we compare trials with **equal enrollment** (e.g., a Phase 1 trial with 300 patients is exceptional), Phase 1
   trials show higher completion odds. This suggests that early-phase trials achieving high enrollment are highly selected
   and more likely to complete.

   *Connection to Section 2:* Unadjusted completion rates showed Phase 3 highest among clinical phases. The regression
   reveals that this advantage disappears after controlling for enrollment size — enrollment confounds the phase effect.

3. **Industry-sponsored trials have lower completion odds:**
   Industry trials have {or_industry:.2f}x the odds of completion compared to Other sponsors
   (95% CI: [{or_industry_lower:.2f}, {or_industry_upper:.2f}], p < 0.001).

   Lower completion odds for industry-sponsored trials likely reflect more aggressive stop decisions (Withdrawn / early
   termination), rather than weaker operational execution. Industry sponsors may apply stricter go/no-go criteria.

4. **Oncology trials have substantially lower completion odds:**
   Oncology trials have {or_oncology:.2f}x the odds of completion compared to non-oncology
   (95% CI: [{or_oncology_lower:.2f}, {or_oncology_upper:.2f}], p < 0.001).

   *Connection to Section 2:* Unadjusted rates showed oncology ~11 pp lower than non-oncology. The regression
   confirms this gap is not explained by differences in enrollment or sponsor type.
"""))

# --- Interpretation for Portfolio Planning ---
display(Markdown("""
### Interpretation for Portfolio Planning

**Factors associated with higher completion probability (in this dataset):**
- Higher enrollment (strongest predictor)
- Non-oncology therapeutic area
- Non-industry sponsorship
- Earlier phases (after controlling for enrollment)

**Caveats:**
- These are **associations, not causal effects**. We cannot conclude that increasing enrollment *causes* higher completion.
- Unmeasured confounders (protocol complexity, investigator experience, competitive landscape) may explain observed patterns.
- The phase effect is **confounded by enrollment**: Phase 3 trials have high unadjusted completion rates largely because
  they have high enrollment. After adjustment, early-phase trials with comparable enrollment show higher odds.
- Industry's lower completion odds likely reflect more aggressive portfolio management (faster termination), not
  weaker operational execution.

**For portfolio risk assessment:**
Trials with combinations of high-risk factors (low enrollment + oncology + industry sponsor) may warrant
closer monitoring or operational support. However, do not use raw predicted probabilities for budget forecasting without
calibration validation (Section 5.3).
"""))

# --- Model Fit (Sanity Check) ---
display(Markdown("### Model Fit (Sanity Check)"))

# AUC
from sklearn.metrics import roc_auc_score
y_true = df_model['is_completed']
y_pred_prob = logit_model.predict(df_model)
auc = roc_auc_score(y_true, y_pred_prob)

display(Markdown(f"""
**AUC:** {auc:.3f}

**Interpretation:** The model achieves moderate discriminative ability (AUC > 0.5 indicates better than chance).
However, this analysis is **explanatory** (understanding associations), not **predictive** (forecasting individual outcomes).
AUC is reported as a sanity check to confirm the model captures meaningful variation, not as a performance metric for
operational use.

For operational decision-making (e.g., budget forecasting, contractual commitments), predicted probabilities would require
calibration and validation on held-out data.
"""))

## 5.2 Logistic Regression Results

**Model converged successfully** (no separation issues detected)

### Adjusted Odds Ratios

Unnamed: 0,Odds Ratio,95% CI Lower,95% CI Upper,p-value
Intercept,0.256,0.215,0.305,0.0
Early Phase 1,3.94,2.927,5.303,0.0
Phase 1,4.916,4.278,5.648,0.0
Phase 1/2,1.924,1.627,2.277,0.0
Phase 2,1.683,1.489,1.902,0.0
Phase 2/3,1.236,0.975,1.567,0.081
Phase 4,1.615,1.399,1.865,0.0
log_enrollment,2.175,2.102,2.251,0.0
is_industry,0.655,0.602,0.713,0.0
has_oncology_label,0.472,0.432,0.515,0.0



**Reference categories:** Phase 3 (for phase comparisons), Other sponsor (is_industry=0), Non-oncology (has_oncology_label=0)

**Interpretation:** OR > 1 indicates higher odds of completion; OR < 1 indicates lower odds (relative to reference).


### Key Findings


1. **Enrollment is strongly associated with completion:**
   Each one-unit increase in log-enrollment (approximately a 2.7x increase in sample size) is associated with
   2.17x higher odds of completion (95% CI: [2.10, 2.25],
   p < 0.001). This is the strongest predictor in the model.

   *Practical meaning:* Doubling enrollment (e.g., from 100 to 200 patients) increases completion odds by
   approximately 50%. A 10x increase in enrollment (e.g., 50 to 500 patients) is associated with ~2.2x
   higher odds, holding other factors constant.

2. **Earlier phases show higher completion odds after adjusting for enrollment:**
   Relative to Phase 3 (reference), earlier phases show **higher** adjusted completion odds. Phase 1 has
   4.916x the odds (95% CI: [4.278,
   5.648]), and Phase 2 has 1.683x the odds.

   *Why is this counterintuitive?* Unadjusted rates (Section 2) showed Phase 3 at ~88% vs Phase 1 at ~89% (nearly identical).
   However, Phase 1 trials typically have much lower enrollment (median ~20-30) compared to Phase 3 (median ~200-300).
   When we compare trials with **equal enrollment** (e.g., a Phase 1 trial with 300 patients is exceptional), Phase 1
   trials show higher completion odds. This suggests that early-phase trials achieving high enrollment are highly selected
   and more likely to complete.

   *Connection to Section 2:* Unadjusted completion rates showed Phase 3 highest among clinical phases. The regression
   reveals that this advantage disappears after controlling for enrollment size — enrollment confounds the phase effect.

3. **Industry-sponsored trials have lower completion odds:**
   Industry trials have 0.66x the odds of completion compared to Other sponsors
   (95% CI: [0.60, 0.71], p < 0.001).

   Lower completion odds for industry-sponsored trials likely reflect more aggressive stop decisions (Withdrawn / early
   termination), rather than weaker operational execution. Industry sponsors may apply stricter go/no-go criteria.

4. **Oncology trials have substantially lower completion odds:**
   Oncology trials have 0.47x the odds of completion compared to non-oncology
   (95% CI: [0.43, 0.52], p < 0.001).

   *Connection to Section 2:* Unadjusted rates showed oncology ~11 pp lower than non-oncology. The regression
   confirms this gap is not explained by differences in enrollment or sponsor type.



### Interpretation for Portfolio Planning

**Factors associated with higher completion probability (in this dataset):**
- Higher enrollment (strongest predictor)
- Non-oncology therapeutic area
- Non-industry sponsorship
- Earlier phases (after controlling for enrollment)

**Caveats:**
- These are **associations, not causal effects**. We cannot conclude that increasing enrollment *causes* higher completion.
- Unmeasured confounders (protocol complexity, investigator experience, competitive landscape) may explain observed patterns.
- The phase effect is **confounded by enrollment**: Phase 3 trials have high unadjusted completion rates largely because
  they have high enrollment. After adjustment, early-phase trials with comparable enrollment show higher odds.
- Industry's lower completion odds likely reflect more aggressive portfolio management (faster termination), not
  weaker operational execution.

**For portfolio risk assessment:**
Trials with combinations of high-risk factors (low enrollment + oncology + industry sponsor) may warrant
closer monitoring or operational support. However, do not use raw predicted probabilities for budget forecasting without
calibration validation (Section 5.3).


### Model Fit (Sanity Check)


**AUC:** 0.746

**Interpretation:** The model achieves moderate discriminative ability (AUC > 0.5 indicates better than chance).
However, this analysis is **explanatory** (understanding associations), not **predictive** (forecasting individual outcomes).
AUC is reported as a sanity check to confirm the model captures meaningful variation, not as a performance metric for
operational use.

For operational decision-making (e.g., budget forecasting, contractual commitments), predicted probabilities would require
calibration and validation on held-out data.


In [13]:
# ============================================================
# 5.3 Goodness of Fit
# ============================================================

from sklearn.metrics import brier_score_loss
from src.analysis.viz import create_calibration_chart

display(Markdown("## 5.3 Goodness of Fit"))
display(Markdown("*Check: Are predicted probabilities well-calibrated to observed outcomes?*"))

# Create calibration bins (use qcut for equal-sized bins)
df_model['pred_prob'] = y_pred_prob
df_model['prob_bin'] = pd.qcut(df_model['pred_prob'], q=10, labels=False, duplicates='drop')

# Calculate observed vs predicted by bin
calibration_data = df_model.groupby('prob_bin', observed=False).agg(
    mean_predicted=('pred_prob', 'mean'),
    observed_rate=('is_completed', 'mean'),
    n=('study_id', 'count')
).reset_index()

# Calculate calibration bias
calibration_data['bias'] = calibration_data['observed_rate'] - calibration_data['mean_predicted']

# Plot calibration using helper
fig_calib = create_calibration_chart(calibration_data)
fig_calib.show()

# Calculate Brier score
brier_score = brier_score_loss(y_true, y_pred_prob)

# Identify systematic bias regions
max_underpredict = calibration_data['bias'].max()
max_overpredict = abs(calibration_data['bias'].min())
max_abs_bias = max(max_underpredict, abs(max_overpredict))

# Count bins with significant bias (>5pp)
n_underpredict = (calibration_data['bias'] > 0.05).sum()
n_overpredict = (calibration_data['bias'] < -0.05).sum()

display(Markdown(f"""
### Calibration Assessment

**Brier Score:** {brier_score:.3f} (lower is better; 0 = perfect, 0.25 = random)

**Interpretation:**
- **Points near diagonal line** = well-calibrated (predicted probabilities match observed rates)
- **Points above line** = model **underpredicts** completion (says 60%, actually 70%)
- **Points below line** = model **overpredicts** completion (says 70%, actually 60%)
"""))

# Show calibration table
display(Markdown("**Calibration by Decile:**"))
display(calibration_data[['mean_predicted', 'observed_rate', 'bias', 'n']].round(3))

# Dynamic assessment based on calculated bias
if max_abs_bias < 0.05:
    calibration_assessment = "Excellent calibration (max bias < 5pp)"
elif max_abs_bias < 0.10:
    calibration_assessment = f"Good calibration overall with slight systematic bias (max bias = {max_abs_bias*100:.0f}pp)"
else:
    calibration_assessment = f"Moderate calibration (max bias = {max_abs_bias*100:.0f}pp)"

# Natural language for bias counts
underpredict_text = "No deciles exceed a 5pp threshold" if n_underpredict == 0 else f"{n_underpredict} decile(s) exceed a 5pp threshold"
overpredict_text = "No deciles exceed a 5pp threshold" if n_overpredict == 0 else f"{n_overpredict} decile(s) exceed a 5pp threshold"

display(Markdown(f"""
**Assessment:** {calibration_assessment}

- **Underprediction** (observed > predicted): {underpredict_text}
- **Overprediction** (predicted > observed): {overpredict_text}
- **Maximum absolute bias:** {max_abs_bias*100:.0f} percentage points

*Note: Calibration is evaluated in-sample and reflects the current data distribution. Out-of-sample calibration
may differ and should be reassessed if the model is reused prospectively.*

**Business implications:**
Probabilities are trustworthy for **relative ranking** (high-risk vs low-risk trials).
Use caution for **absolute forecasting** — predicted probabilities may differ from true rates by up to {max_abs_bias*100:.0f}pp.

**Recommendation:**
- For **portfolio prioritization** (identify trials needing support): Use predicted probabilities directly
- For **budget forecasting** (estimate expected completions): Apply calibration correction or use observed base rates by segment
"""))

# Clean up temporary columns (keep pred_prob for downstream analysis if needed)
df_model = df_model.drop(columns=['prob_bin'], errors='ignore')

## 5.3 Goodness of Fit

*Check: Are predicted probabilities well-calibrated to observed outcomes?*


### Calibration Assessment

**Brier Score:** 0.102 (lower is better; 0 = perfect, 0.25 = random)

**Interpretation:**
- **Points near diagonal line** = well-calibrated (predicted probabilities match observed rates)
- **Points above line** = model **underpredicts** completion (says 60%, actually 70%)
- **Points below line** = model **overpredicts** completion (says 70%, actually 60%)


**Calibration by Decile:**

Unnamed: 0,mean_predicted,observed_rate,bias,n
0,0.543,0.473,-0.07,2485
1,0.749,0.798,0.049,2485
2,0.815,0.833,0.018,2474
3,0.857,0.882,0.025,2487
4,0.886,0.913,0.027,2491
5,0.908,0.918,0.01,2465
6,0.927,0.933,0.006,2563
7,0.943,0.935,-0.008,2399
8,0.958,0.939,-0.019,2481
9,0.979,0.942,-0.037,2481



**Assessment:** Good calibration overall with slight systematic bias (max bias = 7pp)

- **Underprediction** (observed > predicted): No deciles exceed a 5pp threshold
- **Overprediction** (predicted > observed): 1 decile(s) exceed a 5pp threshold
- **Maximum absolute bias:** 7 percentage points

*Note: Calibration is evaluated in-sample and reflects the current data distribution. Out-of-sample calibration
may differ and should be reassessed if the model is reused prospectively.*

**Business implications:**
Probabilities are trustworthy for **relative ranking** (high-risk vs low-risk trials).
Use caution for **absolute forecasting** — predicted probabilities may differ from true rates by up to 7pp.

**Recommendation:**
- For **portfolio prioritization** (identify trials needing support): Use predicted probabilities directly
- For **budget forecasting** (estimate expected completions): Apply calibration correction or use observed base rates by segment


In [14]:
# ============================================================
# Interpretation: Reconciling Descriptive vs Adjusted Results
# ============================================================

import plotly.express as px

display(Markdown("""
---

## Interpretation of Phase Effects (Adjusted vs Unadjusted)

Although Phase 3 trials exhibit high unadjusted completion rates in the descriptive analysis (~85-88%, Section 2),
the multivariable logistic regression reveals a different pattern once enrollment size is controlled for.
In the adjusted model (Section 5.2), Phase 3 is the reference category, and all earlier phases show higher
odds of completion relative to Phase 3 at equivalent enrollment levels.

This apparent discrepancy is explained by **enrollment confounding**. Phase 3 trials typically enroll
substantially more participants than early-phase trials, and enrollment size is the strongest predictor
of completion in the model. The high observed completion rate of Phase 3 trials in practice is therefore
largely driven by their larger scale, which masks the underlying phase-level complexity and risk.

**In practical terms:**
- The adjusted results indicate that Phase 3 trials are intrinsically more complex and riskier than earlier
  phases when compared at similar enrollment levels, even though they complete successfully in aggregate
  because they are better resourced and larger.
- Consequently, **unadjusted completion rates** should be used to describe typical outcomes, while
  **adjusted odds ratios** should be used to understand underlying drivers of risk.

---

## Failure Mode Differentiation: Withdrawn vs Terminated

**Analytical question:** Are Withdrawn and Terminated trials structurally different, or are they variations
of the same failure mechanism?
"""))

# ============================================================
# Structural Comparison Table
# ============================================================

display(Markdown("### Structural Comparison of Failure Modes"))

# Filter to stopped trials only
df_stopped_analysis = df_abt[df_abt['outcome_group'] == 'Stopped'].copy()

# Build summary table
summary = []

for status in ['Withdrawn', 'Terminated']:
    df_s = df_stopped_analysis[df_stopped_analysis['failure_type'] == status]
    n_total = len(df_stopped_analysis)
    
    summary.append({
        'Failure Mode': status,
        'Share of stopped (%)': len(df_s) / n_total * 100,
        'Median enrollment': df_s['enrollment'].median(),
        '% Early Phase': df_s['phase_group'].isin(['Early Phase 1', 'Phase 1']).mean() * 100,
        '% Phase 3': (df_s['phase_group'] == 'Phase 3').mean() * 100,
        '% Industry': (df_s['is_industry_sponsor'] == 1).mean() * 100,
        '% Oncology': (df_s['has_oncology_label'] == 1).mean() * 100
    })

failure_table = pd.DataFrame(summary).set_index('Failure Mode').round(1)
display(failure_table)

# Extract key values for narrative
withdrawn_share = failure_table.loc['Withdrawn', 'Share of stopped (%)']
terminated_share = failure_table.loc['Terminated', 'Share of stopped (%)']
withdrawn_enrollment = failure_table.loc['Withdrawn', 'Median enrollment']
terminated_enrollment = failure_table.loc['Terminated', 'Median enrollment']
withdrawn_early = failure_table.loc['Withdrawn', '% Early Phase']
terminated_early = failure_table.loc['Terminated', '% Early Phase']
withdrawn_p3 = failure_table.loc['Withdrawn', '% Phase 3']
terminated_p3 = failure_table.loc['Terminated', '% Phase 3']
withdrawn_industry = failure_table.loc['Withdrawn', '% Industry']
terminated_industry = failure_table.loc['Terminated', '% Industry']

display(Markdown(f"""
**Interpretation:**

| Dimension | Pattern | Implication |
|-----------|---------|-------------|
| **Enrollment** | Withdrawn median = {withdrawn_enrollment:.0f} vs Terminated = {terminated_enrollment:.0f} | Withdrawn = pre-execution failure |
| **Early Phase** | Withdrawn {withdrawn_early:.0f}% vs Terminated {terminated_early:.0f}% | Design-stage risk higher for Withdrawn |
| **Phase 3** | Withdrawn {withdrawn_p3:.0f}% vs Terminated {terminated_p3:.0f}% | Late-stage execution risk for Terminated |
| **Industry** | Withdrawn {withdrawn_industry:.0f}% vs Terminated {terminated_industry:.0f}% | Industry more likely to terminate (execution) |

Withdrawn and Terminated trials show clear structural differences. **Withdrawn** trials fail predominantly
before meaningful enrollment, with substantially lower median enrollment and a higher concentration in early
phases, consistent with *"failure to launch"* driven by feasibility, design, or funding issues.

In contrast, **Terminated** trials fail later during execution, enroll more patients, and are more prevalent
in industry-sponsored and late-phase studies, consistent with safety, efficacy, or portfolio-driven stop decisions.

These patterns indicate that Withdrawn and Terminated represent **distinct failure mechanisms** rather than
a single "stopped" category.
"""))

# ============================================================
# Supporting Visualization: Enrollment by Failure Mode
# ============================================================

display(Markdown("### Enrollment Distribution by Failure Mode"))

# Filter to trials with enrollment > 0 for visualization
df_viz = df_stopped_analysis[df_stopped_analysis['enrollment'] > 0].copy()
df_viz = df_viz[df_viz['failure_type'].isin(['Withdrawn', 'Terminated'])]

fig_box = px.box(
    df_viz,
    x='failure_type',
    y='enrollment',
    color='failure_type',
    color_discrete_map={'Withdrawn': '#f97316', 'Terminated': '#ef4444'},
    title='<b>Enrollment Distribution by Failure Mode</b><br><sup>Log scale; only trials with enrollment > 0</sup>',
    labels={'failure_type': 'Failure Mode', 'enrollment': 'Enrollment'},
    template='plotly_white',
    height=400,
)
fig_box.update_layout(
    showlegend=False,
    yaxis_type='log',
    yaxis_title='Enrollment (log scale)',
)
fig_box.show()

# Chi-square test
from scipy.stats import chi2_contingency
phase_contingency = pd.crosstab(df_stopped_analysis['phase_group'], df_stopped_analysis['failure_type'])
chi2, p_value, dof, expected = chi2_contingency(phase_contingency)
chi2_result = f"χ²({dof}) = {chi2:.1f}, p < 0.001" if p_value < 0.001 else f"χ²({dof}) = {chi2:.1f}, p = {p_value:.3f}"

display(Markdown(f"""
**Statistical validation:** {chi2_result} — Phase and Failure Type are not independent.

---

## Decision Framing

**For reporting and benchmarking:**
- Use **unadjusted completion rates** (Section 2) to describe typical outcomes by phase, sponsor, or condition.
- These rates reflect what actually happens in practice.

**For risk assessment and resource prioritization:**
- Use **adjusted odds ratios** (Section 5) to understand underlying drivers of completion.
- Enrollment size is the dominant factor; phase effects are confounded by scale.

**For failure prevention:**
- **Withdrawn** trials require better pre-initiation feasibility (site capacity, patient availability, protocol complexity).
- **Terminated** trials require better execution monitoring (enrollment pace, interim safety/efficacy reviews).

---
"""))


---

## Interpretation of Phase Effects (Adjusted vs Unadjusted)

Although Phase 3 trials exhibit high unadjusted completion rates in the descriptive analysis (~85-88%, Section 2),
the multivariable logistic regression reveals a different pattern once enrollment size is controlled for.
In the adjusted model (Section 5.2), Phase 3 is the reference category, and all earlier phases show higher
odds of completion relative to Phase 3 at equivalent enrollment levels.

This apparent discrepancy is explained by **enrollment confounding**. Phase 3 trials typically enroll
substantially more participants than early-phase trials, and enrollment size is the strongest predictor
of completion in the model. The high observed completion rate of Phase 3 trials in practice is therefore
largely driven by their larger scale, which masks the underlying phase-level complexity and risk.

**In practical terms:**
- The adjusted results indicate that Phase 3 trials are intrinsically more complex and riskier than earlier
  phases when compared at similar enrollment levels, even though they complete successfully in aggregate
  because they are better resourced and larger.
- Consequently, **unadjusted completion rates** should be used to describe typical outcomes, while
  **adjusted odds ratios** should be used to understand underlying drivers of risk.

---

## Failure Mode Differentiation: Withdrawn vs Terminated

**Analytical question:** Are Withdrawn and Terminated trials structurally different, or are they variations
of the same failure mechanism?


### Structural Comparison of Failure Modes

Unnamed: 0_level_0,Share of stopped (%),Median enrollment,% Early Phase,% Phase 3,% Industry,% Oncology
Failure Mode,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Withdrawn,31.1,60.0,10.8,6.3,19.6,20.4
Terminated,65.6,21.0,12.5,10.0,34.9,27.9



**Interpretation:**

| Dimension | Pattern | Implication |
|-----------|---------|-------------|
| **Enrollment** | Withdrawn median = 60 vs Terminated = 21 | Withdrawn = pre-execution failure |
| **Early Phase** | Withdrawn 11% vs Terminated 12% | Design-stage risk higher for Withdrawn |
| **Phase 3** | Withdrawn 6% vs Terminated 10% | Late-stage execution risk for Terminated |
| **Industry** | Withdrawn 20% vs Terminated 35% | Industry more likely to terminate (execution) |

Withdrawn and Terminated trials show clear structural differences. **Withdrawn** trials fail predominantly
before meaningful enrollment, with substantially lower median enrollment and a higher concentration in early
phases, consistent with *"failure to launch"* driven by feasibility, design, or funding issues.

In contrast, **Terminated** trials fail later during execution, enroll more patients, and are more prevalent
in industry-sponsored and late-phase studies, consistent with safety, efficacy, or portfolio-driven stop decisions.

These patterns indicate that Withdrawn and Terminated represent **distinct failure mechanisms** rather than
a single "stopped" category.


### Enrollment Distribution by Failure Mode


**Statistical validation:** χ²(14) = 173.4, p < 0.001 — Phase and Failure Type are not independent.

---

## Decision Framing

**For reporting and benchmarking:**
- Use **unadjusted completion rates** (Section 2) to describe typical outcomes by phase, sponsor, or condition.
- These rates reflect what actually happens in practice.

**For risk assessment and resource prioritization:**
- Use **adjusted odds ratios** (Section 5) to understand underlying drivers of completion.
- Enrollment size is the dominant factor; phase effects are confounded by scale.

**For failure prevention:**
- **Withdrawn** trials require better pre-initiation feasibility (site capacity, patient availability, protocol complexity).
- **Terminated** trials require better execution monitoring (enrollment pace, interim safety/efficacy reviews).

---


In [15]:
# ============================================================
# Summary: Answering the Research Questions
# ============================================================
# Dependencies: or_table, brier_score, df_abt, df_resolved, failure_table

# Calculate AUC for summary
from sklearn.metrics import roc_auc_score
try:
    auc_score = roc_auc_score(df_model['is_completed'], y_pred_prob)
except:
    auc_score = None


display(Markdown("""
---

# 6. Summary

## Research Questions Answered

"""))

# ============================================================
# Q1: What factors are associated with clinical trial completion?
# ============================================================

# Get model results (using correct column names from or_table)
enrollment_or = or_table.loc['log_enrollment', 'Odds Ratio']
enrollment_ci_low = or_table.loc['log_enrollment', '95% CI Lower']
enrollment_ci_high = or_table.loc['log_enrollment', '95% CI Upper']

oncology_or = or_table.loc['has_oncology_label', 'Odds Ratio']
oncology_ci_low = or_table.loc['has_oncology_label', '95% CI Lower']
oncology_ci_high = or_table.loc['has_oncology_label', '95% CI Upper']

industry_or = or_table.loc['is_industry', 'Odds Ratio']
industry_ci_low = or_table.loc['is_industry', '95% CI Lower']
industry_ci_high = or_table.loc['is_industry', '95% CI Upper']

# Get descriptive stats
overall_completion = (df_resolved['is_completed'].sum() / len(df_resolved) * 100)
n_resolved = len(df_resolved)
n_stopped = len(df_abt[df_abt['outcome_group'] == 'Stopped'])

display(Markdown(f"""
### Q1: What factors are associated with clinical trial completion?

**Population:** {n_resolved:,} resolved trials (Active excluded to avoid censoring bias)

**Key findings (adjusted odds ratios from logistic regression):**

| Factor | OR (95% CI) | Interpretation |
|--------|-------------|----------------|
| **log(Enrollment)** | {enrollment_or:.2f} ({enrollment_ci_low:.2f}-{enrollment_ci_high:.2f}) | Strongest predictor: doubling enrollment -> ~50% higher odds |
| **Oncology** | {oncology_or:.2f} ({oncology_ci_low:.2f}-{oncology_ci_high:.2f}) | Oncology trials have lower completion odds |
| **Industry sponsor** | {industry_or:.2f} ({industry_ci_low:.2f}-{industry_ci_high:.2f}) | Industry shows slightly lower odds (aggressive stop decisions) |
| **Phase** | See Section 5.2 | Earlier phases show *higher* adjusted odds than Phase 3 (confounded by enrollment) |

**Bottom line:** Enrollment size is the dominant driver of completion. Larger trials complete more often,
regardless of phase or sponsor. The "Phase 3 paradox" (high raw completion but low adjusted odds) is explained
by enrollment confounding - Phase 3 trials succeed in practice because they are better resourced and larger,
not because the phase itself is easier.
"""))

# ============================================================
# Q2: Are there patterns in trials that get terminated or withdrawn?
# ============================================================

# Get failure mode stats from the table we built
withdrawn_share = failure_table.loc['Withdrawn', 'Share of stopped (%)']
terminated_share = failure_table.loc['Terminated', 'Share of stopped (%)']
withdrawn_enrollment = failure_table.loc['Withdrawn', 'Median enrollment']
terminated_enrollment = failure_table.loc['Terminated', 'Median enrollment']
withdrawn_early = failure_table.loc['Withdrawn', '% Early Phase']
terminated_early = failure_table.loc['Terminated', '% Early Phase']
withdrawn_industry = failure_table.loc['Withdrawn', '% Industry']
terminated_industry = failure_table.loc['Terminated', '% Industry']

display(Markdown(f"""
### Q2: Are there patterns in trials that get terminated or withdrawn?

**Population:** {n_stopped:,} stopped trials (Terminated + Withdrawn + Suspended)

**Key finding:** Withdrawn and Terminated represent **distinct failure mechanisms**, not variations of the same process.

| Characteristic | Withdrawn | Terminated | Implication |
|----------------|-----------|------------|-------------|
| **Share of stopped** | {withdrawn_share:.1f}% | {terminated_share:.1f}% | Terminated dominates |
| **Median enrollment** | {withdrawn_enrollment:.0f} | {terminated_enrollment:.0f} | Withdrawn = pre-execution |
| **% Early Phase** | {withdrawn_early:.0f}% | {terminated_early:.0f}% | Withdrawn more in early phases |
| **% Industry** | {withdrawn_industry:.0f}% | {terminated_industry:.0f}% | Industry terminates (execution failures) |

**Withdrawn trials** = "Failure to launch" (feasibility, design, funding issues detected before enrollment)

**Terminated trials** = "Failure during execution" (safety, efficacy, recruitment, or portfolio decisions after enrollment)

**Bottom line:** Prevention strategies must differ:
- **Reduce Withdrawn:** Better pre-initiation feasibility (site readiness, patient availability, protocol complexity)
- **Reduce Terminated:** Better execution monitoring (enrollment pace, interim reviews, early warning systems)
"""))

# ============================================================
# Model Performance & Caveats
# ============================================================

display(Markdown(f"""
---

### Model Quality

- **Brier score:** {brier_score:.3f}
- **Discrimination (AUC):** {auc_score:.3f} 
- **Design:** Explanatory model (association, not prediction); use for understanding drivers, not forecasting

### Caveats

1. **Association, not causation:** Observational data with unmeasured confounders
2. **Registry limitations:** Enrollment = reported, not verified; stop reasons not recorded
3. **Temporal scope:** Right-censoring affects recent cohorts (Active trials excluded)
4. **Generalizability:** Patterns may not apply to specific therapeutic areas or geographies


### Recommendation

**For reporting and benchmarking:** Use unadjusted completion rates (Section 2) to describe typical outcomes.

**For risk assessment and resource prioritization:** Use adjusted odds ratios (Section 5) to understand underlying drivers.

---
"""))


---

# 6. Summary

## Research Questions Answered




### Q1: What factors are associated with clinical trial completion?

**Population:** 62,958 resolved trials (Active excluded to avoid censoring bias)

**Key findings (adjusted odds ratios from logistic regression):**

| Factor | OR (95% CI) | Interpretation |
|--------|-------------|----------------|
| **log(Enrollment)** | 2.17 (2.10-2.25) | Strongest predictor: doubling enrollment -> ~50% higher odds |
| **Oncology** | 0.47 (0.43-0.52) | Oncology trials have lower completion odds |
| **Industry sponsor** | 0.66 (0.60-0.71) | Industry shows slightly lower odds (aggressive stop decisions) |
| **Phase** | See Section 5.2 | Earlier phases show *higher* adjusted odds than Phase 3 (confounded by enrollment) |

**Bottom line:** Enrollment size is the dominant driver of completion. Larger trials complete more often,
regardless of phase or sponsor. The "Phase 3 paradox" (high raw completion but low adjusted odds) is explained
by enrollment confounding - Phase 3 trials succeed in practice because they are better resourced and larger,
not because the phase itself is easier.



### Q2: Are there patterns in trials that get terminated or withdrawn?

**Population:** 8,774 stopped trials (Terminated + Withdrawn + Suspended)

**Key finding:** Withdrawn and Terminated represent **distinct failure mechanisms**, not variations of the same process.

| Characteristic | Withdrawn | Terminated | Implication |
|----------------|-----------|------------|-------------|
| **Share of stopped** | 31.1% | 65.6% | Terminated dominates |
| **Median enrollment** | 60 | 21 | Withdrawn = pre-execution |
| **% Early Phase** | 11% | 12% | Withdrawn more in early phases |
| **% Industry** | 20% | 35% | Industry terminates (execution failures) |

**Withdrawn trials** = "Failure to launch" (feasibility, design, funding issues detected before enrollment)

**Terminated trials** = "Failure during execution" (safety, efficacy, recruitment, or portfolio decisions after enrollment)

**Bottom line:** Prevention strategies must differ:
- **Reduce Withdrawn:** Better pre-initiation feasibility (site readiness, patient availability, protocol complexity)
- **Reduce Terminated:** Better execution monitoring (enrollment pace, interim reviews, early warning systems)



---

### Model Quality

- **Brier score:** 0.102
- **Discrimination (AUC):** 0.746 
- **Design:** Explanatory model (association, not prediction); use for understanding drivers, not forecasting

### Caveats

1. **Association, not causation:** Observational data with unmeasured confounders
2. **Registry limitations:** Enrollment = reported, not verified; stop reasons not recorded
3. **Temporal scope:** Right-censoring affects recent cohorts (Active trials excluded)
4. **Generalizability:** Patterns may not apply to specific therapeutic areas or geographies


### Recommendation

**For reporting and benchmarking:** Use unadjusted completion rates (Section 2) to describe typical outcomes.

**For risk assessment and resource prioritization:** Use adjusted odds ratios (Section 5) to understand underlying drivers.

---
