# Q1: Clinical Trial Landscape

## Research Question

> **What is the distribution of clinical trials by phase, status, and therapeutic area?  
> How has this evolved over time?**

## Analysis Structure

| Part | Focus | Sections |
|------|-------|----------|
| **Part 1** | Distribution Snapshot (cross-sectional) | 1.1 Phase, 1.2 Status, 1.3 Therapeutic Area, 1.4 Phase×Status |
| **Part 2** | Temporal Trends (longitudinal) | 2.1 Initiations by Phase, 2.2 Current Status by Initiation Year, 2.3 Top Conditions Over Time |

## Scope & Data Notes

- **Analysis scope:** Studies with start year 1990–2025
- **Data source:** ClinicalTrials.gov registry extract
- **Condition labels:** Free-text registry entries (not a standardized taxonomy)

In [16]:
# ============================================================
# Setup
# ============================================================

import sys
from pathlib import Path

import pandas as pd
from IPython.display import display, Markdown

# Add project root to path for imports
PROJECT_ROOT = Path('..')
sys.path.insert(0, str(PROJECT_ROOT))

# Shared utilities
from src.data.loader import load_sql_query, get_db_connection

# Visualization and constants from src/analysis
from src.analysis.viz import (
    create_horizontal_bar_chart,
    create_multi_line_chart,
    create_stacked_area_chart,
    create_temporal_heatmap,
    create_annotated_heatmap,
)
from src.analysis.constants import (
    PHASE_ORDER,
    PHASE_AGG_MAP,
    PHASE_AGG_ORDER,
    PHASE_AGG_COLORS,
    STATUS_ORDER,
    STATUS_AGG_MAP,
    STATUS_AGG_ORDER,
    STATUS_AGG_COLORS,
)

# Paths (validated at setup to fail fast)
DB_PATH = PROJECT_ROOT / 'data' / 'database' / 'clinical_trials.db'
SQL_PATH = PROJECT_ROOT / 'sql' / 'queries'
assert DB_PATH.exists(), f"DB not found: {DB_PATH}"
assert SQL_PATH.exists(), f"SQL folder not found: {SQL_PATH}"

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

In [17]:
# ============================================================
# Database connection
# ============================================================

conn = get_db_connection(DB_PATH)
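
`get_db_connection` and `load_sql_query` live in `src/data/loader.py` and are not shown in this notebook. As a rough sketch of what such helpers might look like, assuming a SQLite database and one query per `.sql` file (both `sketch_*` functions below are hypothetical stand-ins, not the project's actual implementation):

```python
import sqlite3
import tempfile
from pathlib import Path

import pandas as pd


def sketch_get_db_connection(db_path):
    """Hypothetical stand-in: open a SQLite connection to db_path."""
    return sqlite3.connect(str(db_path))


def sketch_load_sql_query(filename, conn, sql_dir):
    """Hypothetical stand-in: run the query stored in sql_dir/filename, return a DataFrame."""
    query = (Path(sql_dir) / filename).read_text(encoding="utf-8")
    return pd.read_sql_query(query, conn)


# Toy demonstration: in-memory database plus a temporary SQL folder.
demo_conn = sketch_get_db_connection(":memory:")
demo_conn.execute("CREATE TABLE studies (study_id TEXT, phase_group TEXT)")
demo_conn.executemany(
    "INSERT INTO studies VALUES (?, ?)",
    [("S1", "Phase 1"), ("S2", "Phase 2")],
)

with tempfile.TemporaryDirectory() as tmp:
    sql_file = Path(tmp) / "demo.sql"
    sql_file.write_text("SELECT * FROM studies ORDER BY study_id;", encoding="utf-8")
    demo_df = sketch_load_sql_query("demo.sql", demo_conn, tmp)

demo_conn.close()
print(demo_df)
```

Keeping each query in its own `.sql` file, as the notebook does, means the same query text can be reviewed and reused outside Python.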

In [18]:
# ============================================================
# Load base data (single source of truth)
# ============================================================

# Study-level base (phase, status) - one row per study
df_study = load_sql_query('q1_study_base.sql', conn, SQL_PATH)

# Condition-level base (study × condition) - one row per assignment
df_condition = load_sql_query('q1_condition_base.sql', conn, SQL_PATH)

# ============================================================
# Data validation (mandatory checks)
# ============================================================

# 1. Assert study_id uniqueness (q1_study_base.sql must return one row per study_id)
assert df_study["study_id"].is_unique, "q1_study_base.sql must return one row per study_id"

# 2. Cross-check count against v_studies_clean scope
scope_count = pd.read_sql_query(
    "SELECT COUNT(*) AS n FROM v_studies_clean WHERE is_start_year_in_scope = 1;",
    conn
)["n"].iloc[0]
n_studies = df_study["study_id"].nunique()
assert n_studies == int(scope_count), f"df_study count ({n_studies}) != scope count ({scope_count})"

# ============================================================
# Derive scope metadata
# ============================================================

# Condition metrics (with empty-data fallback)
if len(df_condition) > 0:
    n_studies_with_conditions = df_condition['study_id'].nunique()
    n_condition_assignments = len(df_condition)
    n_unique_conditions = df_condition['condition_name'].nunique()
    avg_conditions_per_study = round(n_condition_assignments / n_studies_with_conditions, 2)
    pct_with_conditions = round(n_studies_with_conditions / n_studies * 100, 1)
else:
    n_studies_with_conditions = 0
    n_condition_assignments = 0
    n_unique_conditions = 0
    avg_conditions_per_study = 0.0
    pct_with_conditions = 0.0

# Get year range from data
scope_years = pd.read_sql_query(
    """SELECT MIN(start_year) AS min_year, MAX(start_year) AS max_year
       FROM v_studies_clean WHERE is_start_year_in_scope = 1;""",
    conn
)
min_year = int(scope_years['min_year'].iloc[0])
max_year = int(scope_years['max_year'].iloc[0])

# Dataset summary (as table, not prints)
dataset_summary = pd.DataFrame({
    'Metric': [
        'Analysis scope (start_year)',
        'Studies in scope',
        'Studies with condition labels',
        'Condition label coverage',
        'Condition assignments (total)',
        'Unique condition labels',
        'Avg conditions per study',
    ],
    'Value': [
        f"{min_year}–{max_year}",
        f"{n_studies:,}",
        f"{n_studies_with_conditions:,}",
        f"{pct_with_conditions}%",
        f"{n_condition_assignments:,}",
        f"{n_unique_conditions:,}",
        f"{avg_conditions_per_study:.2f}",
    ]
})
dataset_summary

| Metric | Value |
|--------|-------|
| Analysis scope (start_year) | 1990–2025 |
| Studies in scope | 97943 |
| Studies with condition labels | 97942 |
| Condition label coverage | 100.0% |
| Condition assignments (total) | 173129 |
| Unique condition labels | 42348 |
| Avg conditions per study | 1.77 |
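
The summary reports 100.0% label coverage even though one in-scope study has no condition label. The short check below, using only the values printed above, shows that this comes from rounding to one decimal and reproduces the average of 1.77 labels per labelled study.

```python
# Values copied from the dataset summary above.
n_studies = 97_943
n_studies_with_conditions = 97_942
n_condition_assignments = 173_129

coverage_exact = n_studies_with_conditions / n_studies * 100
coverage_rounded = round(coverage_exact, 1)
n_without_labels = n_studies - n_studies_with_conditions
avg_labels = round(n_condition_assignments / n_studies_with_conditions, 2)

print(f"coverage (exact):        {coverage_exact:.4f}%")
print(f"coverage (rounded):      {coverage_rounded}%")
print(f"studies without labels:  {n_without_labels}")
print(f"avg labels per labelled study: {avg_labels}")
```

Note that the average uses studies with at least one label as the denominator, matching the `avg_conditions_per_study` calculation in the cell above.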


---

# Part 1: Distribution Snapshot (Cross-sectional)

This section describes the current distribution of trials across three dimensions:
- **Phase:** Clinical development stage
- **Status:** Registration/operational status
- **Therapeutic Area:** Condition labels (proxy)

## 1.1 Distribution by Phase

How are trials distributed across clinical development phases?

In [19]:
# ============================================================
# 1.1 Distribution by Phase
# ============================================================

# Aggregate by phase (one row per phase group)
phase_dist = (
    df_study
    .groupby(["phase_group", "phase_order"], as_index=False)
    .size()
    .rename(columns={"size": "trial_count"})
    .sort_values("phase_order")
)

# Split into buckets for interpretation
na_count = int(phase_dist.loc[phase_dist["phase_group"] == "Not Applicable", "trial_count"].sum())
other_count = int(phase_dist.loc[phase_dist["phase_group"] == "Other", "trial_count"].sum())
phase_designated = phase_dist[~phase_dist["phase_group"].isin(["Not Applicable", "Other"])].copy()

phase_designated_total = int(phase_designated["trial_count"].sum())

# Sanity check: totals should reconcile
assert na_count + other_count + phase_designated_total == n_studies, (
    f"Phase totals do not reconcile: designated={phase_designated_total:,}, "
    f"NA={na_count:,}, Other={other_count:,}, n_studies={n_studies:,}"
)

na_pct = round(na_count / n_studies * 100, 1) if n_studies else 0.0
designated_pct = round(phase_designated_total / n_studies * 100, 1) if n_studies else 0.0

# Create chart (phase-designated only, in clinical phase order)
fig_phase = create_horizontal_bar_chart(
    data=phase_designated,
    value_col="trial_count",
    label_col="phase_group",
    title="Trial Volume by Clinical Phase",
    subtitle=f"Phase-designated studies only · {min_year}–{max_year} (n={phase_designated_total:,})",
    note=(
        f"<b>Note</b>: 'Not Applicable' studies (observational/registry) are excluded here.<br>"
        f"They represent <b>{na_pct}%</b> of the {min_year}–{max_year} analysis scope (n={na_count:,}). "
        f"Phase-designated studies represent <b>{designated_pct}%</b>."
    ),
    order_by_value=False,  # Keep clinical phase order
    total_for_pct=phase_designated_total,  # % within phase-designated only
)

fig_phase.show()

In [20]:
# ============================================================
# Summary table (FULL SCOPE: all phases including Not Applicable)
# Chart above shows phase-designated only; this table shows the full scope.
# ============================================================

display(Markdown("**Summary table (full scope):** *Chart above shows phase-designated studies; table includes all phases (including Not Applicable).*"))

# Build a clean, ordered table (independent of any previous pct calculations)
phase_table = (
    phase_dist[["phase_group", "phase_order", "trial_count"]]
    .copy()
    .sort_values("phase_order")
)

# Exact % and cumulative % (avoid rounding drift by rounding only at display time)
phase_table["pct_exact"] = phase_table["trial_count"] / n_studies * 100
phase_table["cum_pct_exact"] = phase_table["pct_exact"].cumsum()

# Optional notes (keep short)
phase_table["Notes"] = ""
phase_table.loc[phase_table["phase_group"] == "Not Applicable", "Notes"] = "Observational/registry (phase not meaningful)"
phase_table.loc[phase_table["phase_group"] == "Other", "Notes"] = "Non-standard/mixed phase label"

# Display formatting (round only here)
phase_table_display = phase_table.copy()
phase_table_display["Trials"] = phase_table_display["trial_count"].apply(lambda x: f"{int(x):,}")
phase_table_display["% of Scope"] = phase_table_display["pct_exact"].apply(lambda x: f"{x:.1f}%")
phase_table_display["Cum %"] = phase_table_display["cum_pct_exact"].apply(lambda x: f"{x:.1f}%")

phase_table_display = phase_table_display[["phase_group", "Trials", "% of Scope", "Cum %", "Notes"]]
phase_table_display = phase_table_display.rename(columns={"phase_group": "Phase"})

# Total row (always consistent with n_studies)
total_row = pd.DataFrame([{
    "Phase": "Total",
    "Trials": f"{n_studies:,}",
    "% of Scope": "100.0%",
    "Cum %": "100.0%",
    "Notes": ""
}])

phase_table_display = pd.concat([phase_table_display, total_row], ignore_index=True)
phase_table_display

**Summary table (full scope):** *Chart above shows phase-designated studies; table includes all phases (including Not Applicable).*

| Phase | Trials | % of Scope | Cum % | Notes |
|-------|--------|------------|-------|-------|
| Early Phase 1 | 1020 | 1.0% | 1.0% | |
| Phase 1 | 8128 | 8.3% | 9.3% | |
| Phase 1/2 | 2794 | 2.9% | 12.2% | |
| Phase 2 | 10991 | 11.2% | 23.4% | |
| Phase 2/3 | 1264 | 1.3% | 24.7% | |
| Phase 3 | 7115 | 7.3% | 32.0% | |
| Phase 4 | 5853 | 6.0% | 37.9% | |
| Not Applicable | 60778 | 62.1% | 100.0% | Observational/registry (phase not meaningful) |
| Total | 97943 | 100.0% | 100.0% | |
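
The cell above rounds percentages only at display time to avoid rounding drift in the cumulative column. The sketch below uses the phase-designated counts from the table to compare two cumulative series: one accumulated from exact percentages and one accumulated from already-rounded percentages.

```python
from itertools import accumulate

# Phase-designated counts from the table above, in clinical phase order.
phase_counts = {
    "Early Phase 1": 1020,
    "Phase 1": 8128,
    "Phase 1/2": 2794,
    "Phase 2": 10991,
    "Phase 2/3": 1264,
    "Phase 3": 7115,
    "Phase 4": 5853,
}
n_scope = 97_943  # all studies in scope, including Not Applicable

exact_pct = [count / n_scope * 100 for count in phase_counts.values()]

# Accumulate first, round at display time (the approach used in the notebook).
cum_exact = list(accumulate(exact_pct))
# Round first, then accumulate (the approach the notebook avoids).
cum_of_rounded = list(accumulate(round(p, 1) for p in exact_pct))

for phase, exact, drifted in zip(phase_counts, cum_exact, cum_of_rounded):
    print(f"{phase:<14} exact={exact:5.1f}%  summed-rounded={drifted:5.1f}%")
```

By the Phase 4 row the exact cumulative share displays as 37.9%, while summing the rounded per-phase values gives 38.0%.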


## 1.2 Distribution by Status

What is the operational status distribution of registered trials?

In [21]:
# ============================================================
# 1.2 Distribution by Status
# ============================================================

# Aggregate by status (one row per status)
status_dist = (
    df_study
    .groupby(["status_group", "status_order"])
    .size()
    .reset_index(name="trial_count")
)

# Exact % (round only at display time)
status_dist["pct_exact"] = status_dist["trial_count"] / n_studies * 100

# Chart (ordered by volume)
fig_status = create_horizontal_bar_chart(
    data=status_dist,
    value_col="trial_count",
    label_col="status_group",
    title="Trial Volume by Registration Status",
    subtitle=f"{min_year}–{max_year} (n={n_studies:,})",
    order_by_value=True,
    total_for_pct=n_studies,
)

fig_status.show()

In [22]:
# ============================================================
# Summary table (status) — ordered by volume, cum % matches display order
# ============================================================
display(Markdown("**Summary table (status):** *Ordered by volume; cumulative % follows display order.*"))

status_table = (
    status_dist[["status_group", "trial_count", "pct_exact"]]
    .copy()
    .sort_values("trial_count", ascending=False)   # same order as the bar chart
)

# Cum % must be computed AFTER ordering
status_table["cum_pct_exact"] = status_table["pct_exact"].cumsum()

# Display formatting (round only here)
status_table_display = status_table.copy()
status_table_display["Trials"] = status_table_display["trial_count"].apply(lambda x: f"{int(x):,}")
status_table_display["% of Scope"] = status_table_display["pct_exact"].apply(lambda x: f"{x:.1f}%")
status_table_display["Cum %"] = status_table_display["cum_pct_exact"].apply(lambda x: f"{x:.1f}%")

status_table_display = status_table_display.rename(columns={"status_group": "Status"})[
    ["Status", "Trials", "% of Scope", "Cum %"]
]

# Total row
total_row = pd.DataFrame([{
    "Status": "Total",
    "Trials": f"{n_studies:,}",
    "% of Scope": "100.0%",
    "Cum %": "100.0%",
}])

status_table_display = pd.concat([status_table_display, total_row], ignore_index=True)
status_table_display

**Summary table (status):** *Ordered by volume; cumulative % follows display order.*

| Status | Trials | % of Scope | Cum % |
|--------|--------|------------|-------|
| Completed | 54184 | 55.3% | 55.3% |
| Unknown | 15225 | 15.5% | 70.9% |
| Recruiting | 11461 | 11.7% | 82.6% |
| Terminated | 5755 | 5.9% | 88.4% |
| Active, not recruiting | 3777 | 3.9% | 92.3% |
| Not yet recruiting | 3625 | 3.7% | 96.0% |
| Withdrawn | 2730 | 2.8% | 98.8% |
| Enrolling by invitation | 886 | 0.9% | 99.7% |
| Suspended | 289 | 0.3% | 100.0% |
| Other | 11 | 0.0% | 100.0% |


## 1.3 Distribution by Condition Labels
*(proxy for therapeutic areas)*

Which condition labels appear most frequently in the registry?

In [23]:
# ============================================================
# 1.3 Distribution by Condition
# ============================================================

# Aggregate by condition (unique studies per condition label)
condition_dist = (
    df_condition
    .groupby('condition_name')['study_id']
    .nunique()
    .reset_index(name='trial_count')
    .sort_values('trial_count', ascending=False)
    .head(20)
)

# Create chart using reusable function (neutral subtitle)
fig_conditions = create_horizontal_bar_chart(
    data=condition_dist,
    value_col='trial_count',
    label_col='condition_name',
    title="Top 20 Condition Labels by Trial Count",
    subtitle=f"{min_year}–{max_year} analysis scope · unique studies per label",
    note=(
        "<b>Note</b>: Condition labels are free-text registry entries, not a standardized taxonomy.<br>"
        "A study may have multiple labels; counts reflect unique studies per label.<br>"
        "'Healthy' typically reflects healthy volunteer studies (e.g., early-phase, PK/PD) rather than a disease area."
    ),
    order_by_value=True,
    show_pct=False,  # Just show counts for conditions
    height=650,
)

fig_conditions.show()

In [24]:
# ============================================================
# Summary table (conditions) — Top 20 with coverage context
# ============================================================

# Full condition distribution (all labels, not just top 20)
condition_full = (
    df_condition
    .groupby("condition_name")["study_id"]
    .nunique()
    .reset_index(name="trial_count")
    .sort_values("trial_count", ascending=False)
)

n_unique_labels = len(condition_full)
n_studies_with_conditions = df_condition["study_id"].nunique()

# Top 20 for table
top20 = condition_full.head(20).copy()
top20["pct"] = (top20["trial_count"] / n_studies_with_conditions * 100).round(1)

# Coverage: unique studies that have ANY of the top 20 labels
top20_labels = set(top20["condition_name"])
top20_studies = df_condition[df_condition["condition_name"].isin(top20_labels)]["study_id"].nunique()
top20_coverage_pct = round(top20_studies / n_studies_with_conditions * 100, 1)

display(Markdown(
    f"**Summary table (conditions):** *Top 20 of {n_unique_labels:,} unique labels. "
    f"These labels appear in {top20_coverage_pct}% of studies with ≥1 condition (n={n_studies_with_conditions:,}).*"
))

# Format for display
condition_table = top20.copy()
condition_table["Trials"] = condition_table["trial_count"].apply(lambda x: f"{int(x):,}")
condition_table["% of Studies*"] = condition_table["pct"].apply(lambda x: f"{x:.1f}%")
condition_table = condition_table.rename(columns={"condition_name": "Condition Label"})[
    ["Condition Label", "Trials", "% of Studies*"]
]

# Add context row
context_row = pd.DataFrame([{
    "Condition Label": f"... {n_unique_labels - 20:,} other labels",
    "Trials": "—",
    "% of Studies*": "—",
}])

condition_table = pd.concat([condition_table, context_row], ignore_index=True)
condition_table

**Summary table (conditions):** *Top 20 of 42,348 unique labels. These labels appear in 14.6% of studies with ≥1 condition (n=97,942).*

| Condition Label | Trials | % of Studies* |
|-----------------|--------|---------------|
| Healthy | 1958 | 2.0% |
| Breast Cancer | 1503 | 1.5% |
| Obesity | 1269 | 1.3% |
| Stroke | 909 | 0.9% |
| Depression | 745 | 0.8% |
| Pain | 734 | 0.7% |
| Prostate Cancer | 734 | 0.7% |
| Hypertension | 722 | 0.7% |
| Cancer | 646 | 0.7% |
| Coronary Artery Disease | 643 | 0.7% |
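
Because a study can carry several labels, the per-label percentages are not additive: the coverage figure in the caption counts each study once, however many top labels it has. A minimal sketch with toy assignments (illustrative values, not registry data) shows the difference between summing per-label shares and taking the union of studies.

```python
from collections import defaultdict

# Toy study × condition assignments (illustrative, not registry data).
assignments = [
    ("S1", "Healthy"),
    ("S2", "Breast Cancer"),
    ("S2", "Cancer"),  # S2 carries two labels
    ("S3", "Cancer"),
    ("S4", "Obesity"),
    ("S5", "Stroke"),
]

# Unique studies per label (the same idea as groupby(...).nunique()).
studies_by_label = defaultdict(set)
for study_id, label in assignments:
    studies_by_label[label].add(study_id)

n_with_labels = len({study_id for study_id, _ in assignments})

top_labels = ["Cancer", "Breast Cancer"]
per_label_pct = {
    label: len(studies_by_label[label]) * 100 / n_with_labels for label in top_labels
}

# Coverage counts each study once, even if it has several top labels.
covered = set().union(*(studies_by_label[label] for label in top_labels))
union_coverage_pct = len(covered) * 100 / n_with_labels

print("sum of per-label shares:", sum(per_label_pct.values()))
print("union coverage:", union_coverage_pct)
```

In the toy data the two labels sum to 60% of studies but cover only 40%, because S2 is counted under both.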


## 1.4 Cross-dimensional View: Phase × Status

How does trial volume distribute across the intersection of phase and status?

*For readability, rare status labels (< 1% of total) are excluded from the heatmap and reported in the footnote.*

In [25]:
# ============================================================
# 1.4 Phase × Status Heatmap
# ============================================================

# Threshold for "rare" status (excluded from heatmap for readability)
RARE_STATUS_THRESHOLD = 0.01  # 1% of n_studies

# Aggregate and filter to major statuses
phase_status = df_study.groupby(['phase_group', 'status_group']).size().reset_index(name='trial_count')
df_major = phase_status[phase_status['status_group'].isin(STATUS_ORDER)]
df_other = phase_status[~phase_status['status_group'].isin(STATUS_ORDER)]
other_count = int(df_other['trial_count'].sum())
other_pct = round(other_count / n_studies * 100, 2)

# Pivot and reorder
pivot = (
    df_major
    .pivot_table(index='phase_group', columns='status_group', values='trial_count', fill_value=0)
    .reindex(
        index=[p for p in PHASE_ORDER if p in df_major['phase_group'].unique()],
        columns=[s for s in STATUS_ORDER if s in df_major['status_group'].unique()]
    )
    .rename(index={'Not Applicable': 'Not Applicable*'})
)

# Stats for note
na_label = 'Not Applicable*'
na_count_heatmap = int(pivot.loc[na_label].sum()) if na_label in pivot.index else 0
na_pct_heatmap = round(na_count_heatmap / n_studies * 100, 1)

# Create heatmap
fig_heatmap = create_annotated_heatmap(
    pivot,
    title="Trial Volume by Phase and Status",
    subtitle=f"({min_year}–{max_year}, n={n_studies:,} within analysis scope)",
    note=(
        f"<b>* Not Applicable</b>: Phase concept does not apply (mostly observational/registry).<br>"
        f"{na_pct_heatmap}% of analysis scope (n={na_count_heatmap:,}). "
        f"Excluded rare statuses (<{RARE_STATUS_THRESHOLD*100:.0f}% each): {other_pct}% (n={other_count:,})."
    ),
)
fig_heatmap.show()
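
In the cell above, statuses are kept or dropped by membership in `STATUS_ORDER`; `RARE_STATUS_THRESHOLD` only appears in the footnote text. Whether the two agree depends on the contents of `STATUS_ORDER`, which are not shown here. As a sketch of how the excluded set could instead be derived from the threshold itself, using the status counts from the Section 1.2 table:

```python
RARE_STATUS_THRESHOLD = 0.01  # same 1% cut-off as in the heatmap cell

# Status counts copied from the Section 1.2 summary table.
status_counts = {
    "Completed": 54184,
    "Unknown": 15225,
    "Recruiting": 11461,
    "Terminated": 5755,
    "Active, not recruiting": 3777,
    "Not yet recruiting": 3625,
    "Withdrawn": 2730,
    "Enrolling by invitation": 886,
    "Suspended": 289,
    "Other": 11,
}
n_scope = sum(status_counts.values())

# A status is "rare" when its share of all in-scope studies is below the threshold.
rare = {s: c for s, c in status_counts.items() if c / n_scope < RARE_STATUS_THRESHOLD}
major = [s for s in status_counts if s not in rare]

rare_total = sum(rare.values())
rare_pct = round(rare_total / n_scope * 100, 2)

print("rare statuses:", sorted(rare))
print(f"excluded: {rare_pct}% (n={rare_total:,})")
```

With these counts the threshold excludes "Enrolling by invitation", "Suspended", and "Other", which together make up 1.21% of the scope (n=1,186).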

## 1.5 Key Observations (Distribution Snapshot)

- **Phase 2 is the most common phase among phase-designated studies**, followed by Phase 1 and Phase 3. In other words, trials that report a clinical phase are most often concentrated in mid-stage development within this sample.

- **Late-stage trials (Phases 3–4) account for a smaller share of the landscape** (13.2% of the full scope; 34.9% of phase-designated studies). This reflects a cross-sectional snapshot of registered studies rather than pipeline dynamics; assessing progression or attrition would require longitudinal analysis.

- **Most studies (62.1%) fall under "Not Applicable" for phase**, consistent with the presence of observational and registry-based studies where the clinical phase framework does not apply.

- **"Completed" is the most frequent registration status (55.3%)**, followed by "Unknown" (15.5%) and "Recruiting" (11.7%). This largely reflects the cumulative nature of the registry, where completed studies accumulate over time.

- **"Healthy" is the most frequent condition label**, primarily capturing healthy-volunteer studies (e.g. early-phase PK/PD trials). Cancer-related labels also appear repeatedly among the most common entries.

**Note:** Condition labels are free-text registry entries and should be interpreted as an indicative proxy rather than a standardized therapeutic taxonomy.

# Part 2: Temporal Trends (Longitudinal)

How has trial initiation activity changed over time?

This section examines:
- **2.1 Initiations by Phase:** Aggregated line chart (Early/Mid/Late development stages)
- **2.2 Current Status by Initiation Year:** 100% stacked area chart of aggregated status groups (Completed/Active/Stopped)
- **2.3 Top Conditions Over Time:** Heatmap illustrating changes in research focus

**Note:** Each study is counted once, at its reported initiation date.  
This captures research entry points rather than how trials move through phases over time.

In [26]:
# ============================================================
# Load temporal base data
# ============================================================

# Study-level temporal base (phase, status, start_year)
df_temporal = load_sql_query('q1_temporal_base.sql', conn, SQL_PATH)

# Condition-level temporal base (study × condition)
df_temporal_cond = load_sql_query('q1_temporal_condition_base.sql', conn, SQL_PATH)

# Validation
assert df_temporal["study_id"].is_unique, \
    "q1_temporal_base.sql must return one row per study_id"

# Summary
n_temporal = len(df_temporal)
temporal_year_range = f"{df_temporal['start_year'].min()}–{df_temporal['start_year'].max()}"
display(Markdown(
    f"**Temporal dataset:** {n_temporal:,} studies with valid start dates "
    f"({temporal_year_range}). Counts reflect the registry state at extraction time."
))

**Temporal dataset:** 97,943 studies with valid start dates (1990–2025). Counts reflect the registry state at extraction time.

## 2.1 Trial Initiations by Phase

How has trial initiation activity evolved over time, when broken down by clinical phase?

In [27]:
# ============================================================
# 2.1 Trial Initiations by Phase (Aggregated Line Chart)
# ============================================================
# Uses PHASE_AGG_MAP from setup cell
# Shows phase-designated studies only (excludes "Not Applicable")

# Aggregate by year and aggregated phase
df_phase_agg = df_temporal.copy()
df_phase_agg['phase_agg'] = df_phase_agg['phase_group'].map(PHASE_AGG_MAP)

yearly_phase_agg = (
    df_phase_agg
    .groupby(['start_year', 'phase_agg'])
    .size()
    .reset_index(name='trial_count')
)

# Pivot for line chart
pivot_phase_full = yearly_phase_agg.pivot_table(
    index='start_year',
    columns='phase_agg',
    values='trial_count',
    fill_value=0
)

# Calculate "Not Applicable" stats for reporting
na_total = int(pivot_phase_full['Not Applicable'].sum()) if 'Not Applicable' in pivot_phase_full.columns else 0
na_pct = round(na_total / len(df_temporal) * 100, 1)

# Exclude "Not Applicable" from chart (phase-designated only)
# Uses PHASE_AGG_ORDER from setup cell
phase_agg_cols = [p for p in PHASE_AGG_ORDER if p in pivot_phase_full.columns]
pivot_phase = pivot_phase_full[phase_agg_cols]

# Get peak year (phase-designated only)
peak_year = int(pivot_phase.sum(axis=1).idxmax())
peak_count = int(pivot_phase.sum(axis=1).max())
total_phase_designated = int(pivot_phase.sum().sum())

# Create chart using reusable function
fig_phase_trends = create_multi_line_chart(
    pivot_data=pivot_phase,
    title="Trial Initiations by Phase Over Time",
    subtitle=f"Phase-designated studies only · {temporal_year_range} (n={total_phase_designated:,})",
    note=(
        f"<b>Note</b>: Peak year: {peak_year} (n={peak_count:,}). "
        "Recent years may undercount due to reporting lag.<br>"
        f"'Not Applicable' (observational/registry) excluded: {na_pct}% of scope (n={na_total:,})."
    ),
    colors=PHASE_AGG_COLORS,
    show_total=True,
)

fig_phase_trends.show()
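
`PHASE_AGG_MAP` is imported from `src/analysis/constants` and its contents are not shown. The sketch below uses a hypothetical mapping: only the Mid bucket (Phase 1/2 + Phase 2 + Phase 2/3) is confirmed by the text of this notebook, while the Early and Late assignments are assumptions. It applies the mapping to the snapshot counts from Section 1.1 and collects any unmapped labels explicitly. This matters because `Series.map` returns NaN for keys missing from the map, and `groupby` drops NaN keys by default.

```python
# Hypothetical mapping: only the "Mid" bucket is confirmed by the notebook text;
# the Early/Late assignments are assumptions for illustration.
PHASE_AGG_MAP_SKETCH = {
    "Early Phase 1": "Early",
    "Phase 1": "Early",
    "Phase 1/2": "Mid",
    "Phase 2": "Mid",
    "Phase 2/3": "Mid",
    "Phase 3": "Late",
    "Phase 4": "Late",
    "Not Applicable": "Not Applicable",
}

# Snapshot counts from the Section 1.1 summary table.
snapshot_counts = {
    "Early Phase 1": 1020,
    "Phase 1": 8128,
    "Phase 1/2": 2794,
    "Phase 2": 10991,
    "Phase 2/3": 1264,
    "Phase 3": 7115,
    "Phase 4": 5853,
    "Not Applicable": 60778,
}

bucket_totals = {}
unmapped = {}
for phase, count in snapshot_counts.items():
    bucket = PHASE_AGG_MAP_SKETCH.get(phase)
    if bucket is None:
        unmapped[phase] = count  # surface labels the map does not cover
    else:
        bucket_totals[bucket] = bucket_totals.get(bucket, 0) + count

print(bucket_totals)
print("unmapped:", unmapped)
```

Under this assumed mapping the snapshot totals are Early 9,148, Mid 15,049, and Late 12,968, with no unmapped labels.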

## 2.2 Current Status of Trials by Initiation Year

For each start-year cohort, what is the distribution of current registry statuses today?

- **What this shows:** Current status (as of data extraction) of trials initiated in each year.
- **How to interpret:** Not a measure of "status at initiation"—differences across years largely reflect time since start.

In [28]:
# ============================================================
# 2.2 Current Status of Trials by Initiation Year
# ============================================================
# Uses STATUS_AGG_MAP from setup cell

# Aggregate by year and aggregated status
df_status_agg = df_temporal.copy()
df_status_agg['status_agg'] = (
    df_status_agg['status_group']
    .astype(str).str.strip()
    .map(STATUS_AGG_MAP)
    .fillna('Unknown/Other')  # Catch unmapped statuses
)

yearly_status_agg = (
    df_status_agg
    .groupby(['start_year', 'status_agg'])
    .size()
    .reset_index(name='trial_count')
)

# Pivot for 100% stacked area chart
pivot_status = yearly_status_agg.pivot_table(
    index='start_year',
    columns='status_agg',
    values='trial_count',
    fill_value=0
)

# Reindex to ensure all years are present (even if zero)
all_years = range(min_year, max_year + 1)
pivot_status = pivot_status.reindex(all_years, fill_value=0)

# Use order from constants (imported in setup)
status_agg_cols = [s for s in STATUS_AGG_ORDER if s in pivot_status.columns]
pivot_status = pivot_status[status_agg_cols]

# Create 100% stacked area chart (shows composition per cohort)
fig_status_trends = create_stacked_area_chart(
    pivot_data=pivot_status,
    title="Current Registry Status by Start-Year Cohort",
    subtitle=f"Share of trials initiated each year, by current status (as of extract) · {temporal_year_range}",
    note="Older cohorts have had more time to reach terminal statuses; recent cohorts are still ongoing.",
    colors=STATUS_AGG_COLORS,
    normalize=True,  # 100% stacked
)

fig_status_trends.show()
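
How `create_stacked_area_chart(normalize=True)` computes shares internally is not shown in this notebook. One plausible way to do the 100% normalization is sketched below on toy counts (illustrative values, not registry data). Each cohort row is divided by its own total, with a guard for cohorts that have no studies, which can occur because the pivot is reindexed to every year with `fill_value=0`.

```python
import pandas as pd

# Toy cohort × status counts (illustrative, not registry data).
toy_counts = pd.DataFrame(
    {"Completed": [90, 60, 0], "Active": [0, 30, 0], "Stopped": [10, 10, 0]},
    index=pd.Index([2000, 2010, 2020], name="start_year"),
)

row_totals = toy_counts.sum(axis=1)

# Mask zero totals before dividing so empty cohorts yield 0% instead of NaN.
shares = toy_counts.div(row_totals.where(row_totals > 0), axis=0).fillna(0) * 100

print(shares.round(1))
```

Each non-empty row of `shares` sums to 100, so differences between cohorts reflect composition rather than volume.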

## 2.3 Top Conditions Over Time

How has trial activity evolved for the most frequent condition labels?

In [29]:
# ============================================================
# 2.3 Top Conditions Over Time (Heatmap)
# ============================================================
# Using heatmap instead of line chart for better readability:
# - Avoids line crossings
# - Easy to spot peaks and patterns
# - Works well with 10+ categories

# Parameters
TOP_N_CONDITIONS = 10

# Get top N conditions within analysis scope (1990–2025)
top_conditions = (
    df_temporal_cond
    .groupby('condition_name')['study_id']
    .nunique()
    .nlargest(TOP_N_CONDITIONS)
    .index.tolist()
)

# Filter to top conditions
df_top_cond = df_temporal_cond[
    df_temporal_cond['condition_name'].isin(top_conditions)
]

# Aggregate by year and condition
yearly_cond = (
    df_top_cond
    .groupby(['start_year', 'condition_name'])['study_id']
    .nunique()
    .reset_index(name='study_count')
)

# Pivot for heatmap (conditions as rows, years as columns)
pivot_cond = yearly_cond.pivot_table(
    index='condition_name',
    columns='start_year',
    values='study_count',
    fill_value=0
)

# Reindex columns to full year range (avoid missing years)
pivot_cond = pivot_cond.reindex(columns=range(min_year, max_year + 1), fill_value=0)

# Order rows by total (descending) - highest at top
row_order = pivot_cond.sum(axis=1).sort_values(ascending=False).index.tolist()
pivot_cond = pivot_cond.loc[row_order]

# Create heatmap using reusable function
fig_cond_heatmap = create_temporal_heatmap(
    pivot_data=pivot_cond,
    title=f"Top {TOP_N_CONDITIONS} Condition Labels Over Time",
    subtitle=f"Unique studies per label per year ({temporal_year_range})",
    note=(
        "<b>Note</b>: Condition labels are free-text (not standardized). A study may have multiple labels.<br>"
        "Color shows absolute counts (not normalized); high-volume labels dominate the scale."
    ),
    height=450,
)

fig_cond_heatmap.show()

## 2.4 Key Observations (Temporal Trends)

- **Trial initiations show an overall upward trend across the analysis window (1990–2025)**, with growth becoming more pronounced from the early 2000s onward (within the registry extract and analysis scope).

- **At an aggregated level, the phase mix is relatively stable over time.** Over much of the observed period, the **Mid bucket (Phase 1/2 + Phase 2 + Phase 2/3)** contributes the largest share of phase-designated initiations.

- **Current status by initiation year shows a clear lifecycle effect.** Earlier cohorts skew toward terminal states (**Completed / Stopped**), while more recent cohorts skew toward **Active/Recruiting** statuses (status as of data extraction).

- **Top condition labels show heterogeneous patterns over time.** Some high-frequency labels (e.g., **“Healthy”**, **“Breast Cancer”**) appear across many years, while others show more episodic spikes depending on the period and the top-N selection.

- **Methodological note:** Each study is counted once at its reported start date and associated labels. These trends describe **research entry points**, not phase progression through the development pipeline. Counts for the latest years should be interpreted cautiously because registry records are updated over time.

---

# Summary and Limitations

## What this analysis establishes

### Part 1: Distribution Snapshot
1. **Phase (snapshot):** Among phase-designated studies in this extract, **Phase 2 is the largest single phase**. In the full analysis scope, **“Not Applicable”** accounts for 62.1% of studies (observational/registry designs where phase is not meaningful).
2. **Status (snapshot):** **Completed** is the most common current status, consistent with the cumulative nature of the registry.
3. **Therapeutic focus (snapshot):** The most frequent **condition labels** include “Healthy”, “Breast Cancer”, “Obesity”, and “Stroke”, noting that labels are free-text and not mutually exclusive.

### Part 2: Temporal Trends
4. **Growth trajectory:** Trial initiation volume increases over time across the analysis window, with stronger growth from the early 2000s onward (within scope).
5. **Phase mix:** Aggregated phase groups are broadly stable over time; the **Mid bucket (Phase 1/2 + Phase 2 + Phase 2/3)** contributes the largest share of phase-designated initiations across much of the period.
6. **Status lifecycle:** Earlier initiation cohorts show higher **Completed/Stopped** shares; recent cohorts show higher **Active/Recruiting** shares (status as of extraction).
7. **Condition trends:** High-frequency condition labels vary over time; interpretation remains label-level due to free-text taxonomy.

## Data limitations

- **Registry snapshot:** The dataset reflects a point-in-time extract from ClinicalTrials.gov, not a finalized historical record.
- **Classification:** “Not Applicable” is a broad bucket; condition labels are free-text and non-standardized. Studies can have multiple labels.
- **Temporal bias:** Recent years can be undercounted as registrations and updates arrive with delay.
- **No pipeline flow:** Phase and status are not tracked longitudinally per trial within this analysis; we describe distributions and entry points, not progression or attrition.

## Next steps

- **Q2 (Completion):** Compare completion / stopped patterns by phase and key attributes.
- **Q3 (Enrollment):** Evaluate enrollment distributions (median/IQR) by phase and condition labels.
- **Q4 (Geography):** Quantify geographic concentration and site footprints by phase/condition.

In [30]:
# ============================================================
# Cleanup
# ============================================================

conn.close()