# Q1: Clinical Trial Landscape

**Purpose:** Establish the baseline distribution of registered studies to inform the deeper analyses in Q2–Q5

**Three questions:**
1. Where is research volume concentrated across development phases?
2. How has trial initiation volume changed over time?
3. Which therapeutic areas show the highest trial counts?

**What this analysis does NOT cover:**
- Completion rates (Q2)
- Enrollment performance (Q3)
- Geographic patterns (Q4)
- Trial duration (Q5)

In [1]:
import sqlite3
import pandas as pd
import plotly.graph_objects as go
from pathlib import Path

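# Database connection
DB_PATH = Path('../data/database/clinical_trials.db')
# sqlite3.connect() creates an empty database if the file is missing, so fail early instead
assert DB_PATH.exists(), f"Database not found: {DB_PATH}"
conn = sqlite3.connect(str(DB_PATH))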

In [2]:
snapshot = pd.read_sql_query(
    """
    SELECT
        COUNT(*) AS n_registry,
        SUM(CASE WHEN has_valid_start_date = 1 THEN 1 ELSE 0 END) AS n_parseable_start_date,
        SUM(CASE WHEN is_start_year_in_scope = 1 THEN 1 ELSE 0 END) AS n_analysis
    FROM v_studies_clean;
    """,
    conn,
)

n_registry = int(snapshot["n_registry"].iloc[0])
n_parseable = int(snapshot["n_parseable_start_date"].iloc[0])
n_analysis = int(snapshot["n_analysis"].iloc[0])

n_missing_or_invalid = n_registry - n_parseable
n_out_of_range = n_parseable - n_analysis

scope_years = pd.read_sql_query(
    """
    SELECT MIN(start_year) AS min_year, MAX(start_year) AS max_year
    FROM v_studies_clean
    WHERE is_start_year_in_scope = 1;
    """,
    conn,
)

min_year = int(scope_years["min_year"].iloc[0])
max_year = int(scope_years["max_year"].iloc[0])

dataset_summary = pd.DataFrame(
    {
        "Metric": [
            "Registry size (all studies loaded)",
            "Analysis dataset (1990–2025 start years)",
            "Start date missing/invalid",
            "Start year out of analysis range",
            "Temporal coverage (analysis dataset)",
        ],
        "Value": [
            f"{n_registry:,}",
            f"{n_analysis:,}",
            f"{n_missing_or_invalid:,}",
            f"{n_out_of_range:,}",
            f"{min_year}–{max_year}",
        ],
    }
)

dataset_summary

Unnamed: 0,Metric,Value
0,Registry size (all studies loaded),100000
1,Analysis dataset (1990–2025 start years),97943
2,Start date missing/invalid,957
3,Start year out of analysis range,1100
4,Temporal coverage (analysis dataset),1990–2025
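
The summary rows should partition the registry: every loaded study is either in the analysis dataset, missing a valid start date, or outside the year range. A minimal sketch of that reconciliation, using the counts copied from the table above (the variable names mirror the cell, but the hard-coded values are only a snapshot of this run):

```python
# Counts copied from the dataset_summary output above (snapshot of this run).
n_registry = 100_000
n_analysis = 97_943
n_missing_or_invalid = 957
n_out_of_range = 1_100

# Every loaded study should fall into exactly one of the three buckets.
accounted_for = n_analysis + n_missing_or_invalid + n_out_of_range
assert accounted_for == n_registry, (
    f"{n_registry - accounted_for:,} studies unaccounted for"
)

# Share of the registry excluded from the analysis dataset.
n_excluded = n_registry - n_analysis
excluded_pct = round(n_excluded / n_registry * 100, 1)
print(f"Excluded from analysis: {n_excluded:,} ({excluded_pct}%)")
```

In the notebook itself, the same assertion could run directly on `n_registry`, `n_analysis`, `n_missing_or_invalid`, and `n_out_of_range` after they are read from `v_studies_clean`.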


In [3]:
phase_raw = pd.read_sql_query("""
    SELECT 
        COALESCE(phase, '[NULL]') AS raw_phase,
        COUNT(*) AS n_studies
    FROM v_studies_clean
    WHERE is_start_year_in_scope = 1
    GROUP BY COALESCE(phase, '[NULL]')
    ORDER BY n_studies DESC;
""", conn)

phase_raw

Unnamed: 0,raw_phase,n_studies
0,,37599
1,[NULL],23179
2,PHASE2,10991
3,PHASE1,8128
4,PHASE3,7115
5,PHASE4,5853
6,"PHASE1, PHASE2",2794
7,"PHASE2, PHASE3",1264
8,EARLY_PHASE1,1020
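
The raw labels above are later reported as phase groups such as "Phase 1/2" and "Not Applicable". The actual mapping lives in the SQL and is not shown here, so the sketch below is an assumption inferred from the outputs: `PHASE_GROUP_MAP` is a hypothetical name, and treating both the empty string and `[NULL]` as "Not Applicable" is a guess that happens to reproduce the 60,778 total reported in the summary table.

```python
import pandas as pd

# Raw counts copied from the phase_raw output above.
phase_raw = pd.DataFrame(
    {
        "raw_phase": [
            "", "[NULL]", "PHASE2", "PHASE1", "PHASE3", "PHASE4",
            "PHASE1, PHASE2", "PHASE2, PHASE3", "EARLY_PHASE1",
        ],
        "n_studies": [37599, 23179, 10991, 8128, 7115, 5853, 2794, 1264, 1020],
    }
)

# Hypothetical mapping, inferred from the notebook outputs (not read from the SQL).
PHASE_GROUP_MAP = {
    "": "Not Applicable",
    "[NULL]": "Not Applicable",
    "EARLY_PHASE1": "Early Phase 1",
    "PHASE1": "Phase 1",
    "PHASE1, PHASE2": "Phase 1/2",
    "PHASE2": "Phase 2",
    "PHASE2, PHASE3": "Phase 2/3",
    "PHASE3": "Phase 3",
    "PHASE4": "Phase 4",
}

# Anything unmapped falls into "Other" so that no study is silently dropped.
phase_raw["phase_group"] = phase_raw["raw_phase"].map(PHASE_GROUP_MAP).fillna("Other")
phase_groups = phase_raw.groupby("phase_group")["n_studies"].sum()
print(phase_groups)
```

Keeping an explicit "Other" bucket means a new raw label in a later extract shows up in the totals rather than disappearing.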


---

## 1. Phase Distribution

**Question:** Where is trial volume concentrated across development phases and statuses?

In [4]:
# Load phase × status distribution
with open('../sql/queries/q1_phase_status_distribution.sql', 'r', encoding='utf-8') as f:
    query_phase_status = f.read()

df_phase_status = pd.read_sql_query(query_phase_status, conn)

# Defensive checks
assert not df_phase_status.empty, "Phase × status query returned no rows"
expected_cols = {"phase_group", "status_label", "trial_count"}
assert expected_cols.issubset(df_phase_status.columns), (
    f"Unexpected columns returned: {df_phase_status.columns.tolist()}"
)

df_phase_status.head(10)

Unnamed: 0,phase_group,status_label,trial_count
0,Early Phase 1,Completed,442
1,Early Phase 1,Unknown,183
2,Early Phase 1,Recruiting,162
3,Early Phase 1,Not yet recruiting,70
4,Early Phase 1,Terminated,64
5,Early Phase 1,Withdrawn,47
6,Early Phase 1,"Active, not recruiting",36
7,Early Phase 1,Enrolling by invitation,10
8,Early Phase 1,Suspended,6
9,Phase 1,Completed,5526


### 1.1 Phase × Status Distribution

The heatmap below shows trial counts across clinical phases and registration statuses. 
Color intensity reflects volume; cell annotations display exact counts.
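
Because a few cells are far larger than the rest, the heatmap cell caps the color scale at the 95th percentile of non-zero counts. The sketch below illustrates the effect on ten counts copied from the `df_phase_status` preview above; the intensity calculation is a simplified stand-in for how a linear color scale behaves, not Plotly's internal code.

```python
import numpy as np

# Counts copied from the df_phase_status preview above:
# nine Early Phase 1 cells plus the Phase 1 / Completed cell.
z = np.array([442, 183, 162, 70, 64, 47, 36, 10, 6, 5526], dtype=float)

# Same rule as the heatmap cell: cap at the 95th percentile of non-zero cells.
zmax = float(np.percentile(z[z > 0], 95)) if (z > 0).any() else 1.0

# Linear color intensity in [0, 1], with and without the cap.
intensity_capped = np.clip(z / zmax, 0, 1)
intensity_uncapped = z / z.max()

for count, capped, uncapped in zip(z, intensity_capped, intensity_uncapped):
    print(f"{int(count):>5,}  capped={capped:.2f}  uncapped={uncapped:.2f}")
```

Cells above the cap all render at full intensity, which is why the exact counts are also printed as annotations.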

In [5]:
import numpy as np
import plotly.graph_objects as go

MAJOR_STATUSES = [
    "Completed",
    "Recruiting",
    "Active, not recruiting",
    "Not yet recruiting",
    "Enrolling by invitation",
    "Terminated",
    "Withdrawn",
    "Suspended",
    "Unknown",
]

PHASE_ORDER = [
    "Early Phase 1",
    "Phase 1",
    "Phase 1/2",
    "Phase 2",
    "Phase 2/3",
    "Phase 3",
    "Phase 4",
    "Not Applicable",
]

# --- Keep the heatmap readable: show major statuses only, but account for the rest ---
df_major = df_phase_status[df_phase_status["status_label"].isin(MAJOR_STATUSES)]
df_other = df_phase_status[~df_phase_status["status_label"].isin(MAJOR_STATUSES)]

major_count = int(df_major["trial_count"].sum())
other_count = int(df_other["trial_count"].sum())
assert major_count + other_count == n_analysis

other_pct = round(other_count / n_analysis * 100, 2) if n_analysis else 0.0

# Pivot (major statuses only); aggfunc="sum" because pivot_table defaults to the mean
pivot = (
    df_major.pivot_table(
        index="phase_group",
        columns="status_label",
        values="trial_count",
        aggfunc="sum",
        fill_value=0,
    )
    .reindex(index=[p for p in PHASE_ORDER if p in df_major["phase_group"].unique()],
             columns=[s for s in MAJOR_STATUSES if s in df_major["status_label"].unique()])
)

pivot = pivot.rename(index={"Not Applicable": "Not Applicable*"})
pivot_total = int(pivot.to_numpy().sum())
assert pivot_total == major_count

# Contrast: cap scale at p95 so smaller cells remain visible
z = pivot.to_numpy()
zmax = float(np.percentile(z[z > 0], 95)) if (z > 0).any() else 1.0

# Annotate all non-zero cells
annotations = [
    dict(
        x=col,
        y=row,
        text=f"{int(pivot.loc[row, col]):,}",
        showarrow=False,
        font=dict(
            size=12,
            color=("white" if pivot.loc[row, col] >= zmax * 0.6 else "#1f2a44"),
            family="Arial",
        ),
    )
    for row in pivot.index
    for col in pivot.columns
    if pivot.loc[row, col] > 0
]

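# Not Applicable share of the full analysis scope (all statuses, not only those shown)
na_label = "Not Applicable*"
na_count = int(
    df_phase_status.loc[df_phase_status["phase_group"] == "Not Applicable", "trial_count"].sum()
)
na_pct = round(na_count / n_analysis * 100, 1) if n_analysis else 0.0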

fig_heatmap = go.Figure(
    data=go.Heatmap(
        z=z,
        x=pivot.columns,
        y=pivot.index,
        zmin=0,
        zmax=zmax,
        colorscale=[[0, "#e5e7eb"], [1, "#2563eb"]],
        xgap=1,
        ygap=1,
        showscale=False,
        hovertemplate="<b>%{y}</b><br>%{x}: %{z:,}<extra></extra>",
    )
)

fig_heatmap.update_layout(
    title=dict(
        text=(
            "<b>Trial Volume by Phase and Status</b><br>"
            f"<span style='font-size:12px; color:#6b7280'>"
            f"({min_year}–{max_year}, n={n_analysis:,} within analysis scope)</span>"
        ),
        x=0.5,
        xanchor="center",
    ),
    xaxis=dict(tickangle=-30, tickfont=dict(size=11)),
    yaxis=dict(autorange="reversed", tickfont=dict(size=11)),
    annotations=annotations,
    height=650,
    template="plotly_white",
    font=dict(family="Arial", color="#374151"),
    margin=dict(t=80, b=220, l=95, r=30),
)

fig_heatmap.add_annotation(
    text=(
        "<b>* Not Applicable</b><br>"
        "Phase concept does not apply (mostly observational/registry).<br>"
        f"{na_pct}% of analysis scope (n={na_count:,}). "
        f"Excluded rare statuses: {other_pct}% (n={other_count:,})."
    ),
    xref="paper",
    yref="paper",
    x=0,
    y=-0.50,
    showarrow=False,
    align="left",
    font=dict(size=10, color="#6b7280", family="Arial"),
)

fig_heatmap.show()

### 1.2 Phase-Designated Trial Volume

Focusing on phase-designated studies only (excluding "Not Applicable"), the following chart 
shows the distribution across clinical development stages.

In [6]:
from plotly.colors import sample_colorscale
import plotly.graph_objects as go

# ------------------------------------------------------------
# Plot 2 — Phase-designated trial volume (analysis scope only)
# ------------------------------------------------------------

# Totals within analysis scope (1990–2025)
total_in_scope = int(df_phase_status["trial_count"].sum())
assert total_in_scope == n_analysis, (
    f"df_phase_status total {total_in_scope:,} != n_analysis {n_analysis:,}"
)

# Not Applicable share (within analysis scope)
na_trials = int(
    df_phase_status.loc[df_phase_status["phase_group"] == "Not Applicable", "trial_count"].sum()
)
na_pct = round(na_trials / n_analysis * 100, 1) if n_analysis else 0.0

# Phase-designated trials only (exclude Not Applicable + Other)
phase_summary = (
    df_phase_status
    .query("phase_group not in ['Not Applicable', 'Other']")
    .groupby("phase_group")["trial_count"]
    .sum()
)

# Clinical phase ordering (preferred over volume ordering)
phase_order = ["Early Phase 1", "Phase 1", "Phase 1/2", "Phase 2", "Phase 2/3", "Phase 3", "Phase 4"]
phase_summary = phase_summary.reindex([p for p in phase_order if p in phase_summary.index])

phase_designated_total = int(phase_summary.sum())
assert phase_designated_total > 0, "No phase-designated trials found (check filters)."

# Labels: % within phase-designated pipeline + count
labels = [
    f"{(v / phase_designated_total * 100):.1f}% ({int(v):,})"
    for v in phase_summary.values
]

# Color gradient aligned with Heatmap
# Slightly stronger low end for readability
PHASE_COLORSCALE = [[0.0, "#f1f5f9"], [1.0, "#2563eb"]]

min_v = float(phase_summary.min())
max_v = float(phase_summary.max())
ratios = [
    (float(v) - min_v) / (max_v - min_v) if max_v > min_v else 1.0
    for v in phase_summary.values
]
colors = sample_colorscale(PHASE_COLORSCALE, ratios)

fig_phase = go.Figure(
    go.Bar(
        x=phase_summary.values,
        y=phase_summary.index,
        orientation="h",
        text=labels,
        textposition="outside",
        marker_color=colors,
        cliponaxis=False,
        hovertemplate="<b>%{y}</b><br>Trials: %{x:,}<extra></extra>",
    )
)

fig_phase.update_layout(
    title=dict(
        text=(
            "<b>Phase-Designated Trial Volume by Clinical Phase</b><br>"
            f"<span style='font-size:12px; color:#6b7280'>"
            f"({min_year}–{max_year}, n={phase_designated_total:,} phase-designated trials)</span>"
        ),
        x=0.5,
        xanchor="center",
    ),
    xaxis=dict(showgrid=False, showticklabels=False, title=None, rangemode="tozero"),
    yaxis=dict(title=None, tickfont=dict(size=12), autorange="reversed"),
    height=450,
    template="plotly_white",
    font=dict(family="Arial", color="#374151"),
    margin=dict(l=120, r=260, t=80, b=170),
    bargap=0.22,
)

fig_phase.add_annotation(
    text=(
        "<b>Note</b>: 'Not Applicable' studies (observational/registry) are excluded here.<br>"
        f"They represent <b>{na_pct}%</b> of trials in the {min_year}–{max_year} analysis scope "
        f"(n = {na_trials:,})."
    ),
    xref="paper",
    yref="paper",
    x=0,
    y=-0.45,
    showarrow=False,
    align="left",
    font=dict(size=10, color="#6b7280", family="Arial"),
)

fig_phase.show()

### 1.3 Summary Table

Complete breakdown of trial counts by phase group, including cumulative percentages.

In [8]:
import pandas as pd

# ------------------------------------------------------------
# Phase distribution (analysis scope: start years 1990–2025)
# Source: df_phase_status already filtered by is_start_year_in_scope = 1
# ------------------------------------------------------------

phase_summary = (
    df_phase_status
    .groupby("phase_group")["trial_count"]
    .sum()
)

phase_order = [
    "Early Phase 1",
    "Phase 1",
    "Phase 1/2",
    "Phase 2",
    "Phase 2/3",
    "Phase 3",
    "Phase 4",
    "Other",
    "Not Applicable",
]

phase_summary = phase_summary.reindex([p for p in phase_order if p in phase_summary.index]).dropna()

n_scope = int(phase_summary.sum())
assert n_scope == n_analysis, f"Phase table total {n_scope:,} != n_analysis {n_analysis:,}"

phase_stats = phase_summary.reset_index()
phase_stats.columns = ["Phase Group", "Trials"]

# exact % for cum (avoid rounding drift)
phase_stats["pct_exact"] = phase_stats["Trials"] / n_scope * 100
phase_stats["cum_pct_exact"] = phase_stats["pct_exact"].cumsum()

# Notes
phase_stats["Notes"] = ""
phase_stats.loc[phase_stats["Phase Group"] == "Not Applicable", "Notes"] = (
    "Phase not applicable (primarily observational/registry studies)"
)
phase_stats.loc[phase_stats["Phase Group"] == "Other", "Notes"] = (
    "Non-standard or mixed phase label"
)

# Display formatting
phase_stats_display = phase_stats.copy()
phase_stats_display["Trials"] = phase_stats_display["Trials"].map(lambda x: f"{int(x):,}")
phase_stats_display["% of Scope"] = phase_stats_display["pct_exact"].map(lambda x: f"{x:.1f}%")
phase_stats_display["Cum %"] = phase_stats_display["cum_pct_exact"].map(lambda x: f"{x:.1f}%")

phase_stats_display = phase_stats_display[["Phase Group", "Trials", "% of Scope", "Cum %", "Notes"]]

# Total row
total_row = pd.DataFrame([{
    "Phase Group": "Total",
    "Trials": f"{n_scope:,}",
    "% of Scope": "100.0%",
    "Cum %": "100.0%",
    "Notes": ""
}])

phase_stats_display = pd.concat([phase_stats_display, total_row], ignore_index=True)

phase_stats_display

Unnamed: 0,Phase Group,Trials,% of Scope,Cum %,Notes
0,Early Phase 1,1020,1.0%,1.0%,
1,Phase 1,8128,8.3%,9.3%,
2,Phase 1/2,2794,2.9%,12.2%,
3,Phase 2,10991,11.2%,23.4%,
4,Phase 2/3,1264,1.3%,24.7%,
5,Phase 3,7115,7.3%,32.0%,
6,Phase 4,5853,6.0%,37.9%,
7,Not Applicable,60778,62.1%,100.0%,Phase not applicable (primarily observational/...
8,Total,97943,100.0%,100.0%,
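
The cell above accumulates exact percentages and rounds only for display. A short sketch of why that matters, using the counts from the table (a snapshot of this run): summing the already-rounded shares drifts away from the exact cumulative value by the time it reaches Phase 4.

```python
import pandas as pd

# Trial counts copied from the summary table above, in clinical phase order.
trials = pd.Series(
    {
        "Early Phase 1": 1020,
        "Phase 1": 8128,
        "Phase 1/2": 2794,
        "Phase 2": 10991,
        "Phase 2/3": 1264,
        "Phase 3": 7115,
        "Phase 4": 5853,
        "Not Applicable": 60778,
    }
)

pct_exact = trials / trials.sum() * 100

# Approach used in the notebook: accumulate exact values, round at the end.
cum_exact = pct_exact.cumsum().round(1)

# Naive alternative: round each share first, then accumulate.
cum_from_rounded = pct_exact.round(1).cumsum().round(1)

comparison = pd.DataFrame(
    {"cum_exact": cum_exact, "cum_from_rounded": cum_from_rounded}
)
print(comparison)
```

The two columns agree for most rows, but the naive version overstates the cumulative share at Phase 4 by a tenth of a point.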


### Key observations

- **Within the analysis scope (1990–2025), phase-designated studies are most concentrated in Phase 2** (10,991 studies, 11.2% of scope), followed by Phase 1 (8,128) and Phase 3 (7,115). This reflects where the largest volume of registered, phase-classified trials sits in the current sample.

- **Late-stage development (Phases 3–4) accounts for 12,968 studies: about 13% of the analysis scope, or roughly 35% of phase-designated studies.** This describes a cross-sectional distribution of registered trials; confirming pipeline attrition dynamics would require longitudinal, cohort-level analysis.

- **About 62% of studies in the analysis scope (60,778) are classified as "Not Applicable".** This group combines the empty and NULL raw phase values (37,599 and 23,179 studies). It is consistent with observational and registry-based designs, where the clinical phase framework is not meaningful, an expected pattern given ClinicalTrials.gov's role as a registry for both interventional and observational research.
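
How large a phase looks depends on the denominator. As a rough illustration using the counts from the summary table (a snapshot of this run), the sketch below computes the Phase 3–4 and Phase 2 shares against both the full analysis scope and the phase-designated subset.

```python
# Counts copied from the phase summary table (snapshot of this run).
phase_counts = {
    "Early Phase 1": 1020,
    "Phase 1": 8128,
    "Phase 1/2": 2794,
    "Phase 2": 10991,
    "Phase 2/3": 1264,
    "Phase 3": 7115,
    "Phase 4": 5853,
}
not_applicable = 60778

phase_designated = sum(phase_counts.values())
scope = phase_designated + not_applicable
late_stage = phase_counts["Phase 3"] + phase_counts["Phase 4"]


def share(part: int, whole: int) -> float:
    """Percentage of `whole` represented by `part`, rounded to one decimal."""
    return round(part / whole * 100, 1)


print(f"Phases 3-4 vs. scope:            {share(late_stage, scope)}%")
print(f"Phases 3-4 vs. phase-designated: {share(late_stage, phase_designated)}%")
print(f"Phase 2 vs. phase-designated:    {share(phase_counts['Phase 2'], phase_designated)}%")
```

Stating the denominator next to each percentage avoids mixing the two views when the figures are quoted elsewhere.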

---

## 2. Temporal Trends

**Question:** How has trial initiation volume evolved over time?

In [12]:
# ------------------------------------------------------------
# Load yearly trial initiation data (by phase)
# ------------------------------------------------------------
with open("../sql/queries/q1_yearly_trends.sql", "r", encoding="utf-8") as f:
    query_yearly = f.read()

df_yearly = pd.read_sql_query(query_yearly, conn)

# Defensive checks
assert not df_yearly.empty, "Yearly trends query returned no rows"
expected_cols = {"start_year", "phase_group", "trial_count"}
assert expected_cols.issubset(df_yearly.columns), (
    f"Unexpected columns returned: {df_yearly.columns.tolist()}"
)

# Ensure correct dtypes
df_yearly["start_year"] = pd.to_numeric(df_yearly["start_year"], errors="coerce").astype("Int64")
df_yearly["trial_count"] = pd.to_numeric(df_yearly["trial_count"], errors="coerce").astype("Int64")

# Sanity checks
assert df_yearly["start_year"].notna().all(), "Found null start_year values"
assert df_yearly["start_year"].between(min_year, max_year).all(), (
    "Found start_year values outside the declared analysis scope"
)

df_yearly.head(10)

Unnamed: 0,start_year,phase_group,trial_count
0,1990,Phase 1/2,2
1,1990,Phase 2,6
2,1990,Not Applicable,22
3,1991,Early Phase 1,1
4,1991,Phase 1,2
5,1991,Phase 2,5
6,1991,Phase 2/3,2
7,1991,Phase 3,6
8,1991,Not Applicable,17
9,1992,Phase 1,2


In [14]:
# Total initiations per year (all phases included)
df_year_total = (
    df_yearly.groupby("start_year", as_index=False)["trial_count"]
    .sum()
    .sort_values("start_year")
)

assert df_year_total["trial_count"].sum() == n_analysis, "Total yearly count != n_analysis"

# Phase-designated only (exclude Not Applicable + Other)
phase_designated = ["Early Phase 1", "Phase 1", "Phase 1/2", "Phase 2", "Phase 2/3", "Phase 3", "Phase 4"]
df_year_phase = df_yearly[df_yearly["phase_group"].isin(phase_designated)].copy()

# Ensure phase order for plotting
df_year_phase["phase_group"] = pd.Categorical(df_year_phase["phase_group"], categories=phase_designated, ordered=True)
df_year_phase = df_year_phase.sort_values(["start_year", "phase_group"])

In [17]:
import plotly.graph_objects as go

# ---- Defensive checks
assert {"start_year", "trial_count"}.issubset(df_year_total.columns)
df_year_total = df_year_total.dropna(subset=["start_year", "trial_count"]).copy()
df_year_total = df_year_total.sort_values("start_year")

# ---- Peak (dynamic)
peak_idx = df_year_total["trial_count"].idxmax()
peak_year = int(df_year_total.loc[peak_idx, "start_year"])
peak_count = int(df_year_total.loc[peak_idx, "trial_count"])

# ---- Ticks every 5 years
tick_step = 5
start_tick = ((min_year + tick_step - 1) // tick_step) * tick_step
tickvals = list(range(start_tick, max_year + 1, tick_step))

fig_total = go.Figure()

# Main line
fig_total.add_trace(
    go.Scatter(
        x=df_year_total["start_year"],
        y=df_year_total["trial_count"],
        mode="lines",
        line=dict(color="#2563eb", width=2.5),
        customdata=df_year_total["trial_count"],
        hovertemplate=(
            "<b>Start year</b>: %{x}<br>"
            "<b>Trials initiated</b>: %{y:,}<extra></extra>"
        ),
        showlegend=False,
    )
)

# Peak marker
fig_total.add_trace(
    go.Scatter(
        x=[peak_year],
        y=[peak_count],
        mode="markers",
        marker=dict(size=10, color="#1f2a44"),
        hovertemplate=(
            "<b>Peak year</b><br>"
            f"{peak_year}: {peak_count:,} trials<extra></extra>"
        ),
        showlegend=False,
    )
)

# Peak label 
fig_total.add_annotation(
    x=peak_year,
    y=peak_count,
    text=f"Peak: {peak_year} ({peak_count:,})",
    showarrow=True,
    arrowhead=2,
    ax=25,
    ay=-30,
    font=dict(size=11, color="#374151"),
    arrowcolor="#9ca3af",
)

fig_total.update_layout(
    title=dict(
        text=(
            "<b>Annual Clinical Trial Initiations</b><br>"
            f"<span style='font-size:12px; color:#6b7280'>"
            f"Analysis scope: {min_year}–{max_year} (n = {n_analysis:,} studies)</span>"
        ),
        x=0.5,
        xanchor="center",
    ),
    xaxis=dict(
        title="Start year",
        tickmode="array",
        tickvals=tickvals,
        tickformat="d",
        showgrid=False,
        showline=True,
        linecolor="#d1d5db",
    ),
    yaxis=dict(
        title="Trials initiated (count)",
        showgrid=True,
        gridcolor="#f3f4f6",
        rangemode="tozero",
        showline=True,
        linecolor="#d1d5db",
    ),
    height=480,
    template="plotly_white",
    font=dict(family="Arial", color="#374151"),
    margin=dict(t=80, b=55, l=70, r=30),
)

# Footnote: 
fig_total.add_annotation(
    text="Counts include all registered studies with a validated start year within the analysis scope.",
    xref="paper",
    yref="paper",
    x=0,
    y=-0.22,
    showarrow=False,
    align="left",
    font=dict(size=10, color="#6b7280", family="Arial"),
)

fig_total.show()

In [37]:
import pandas as pd
import plotly.graph_objects as go

# ------------------------------------------------------------
# Stacked area: trial initiations by phase over time
# (phase-designated only, within analysis scope)
# ------------------------------------------------------------

# Keep only phase-designated groups (exclude Not Applicable / Other if present)
phase_order = [
    "Early Phase 1",
    "Phase 1",
    "Phase 1/2",
    "Phase 2",
    "Phase 2/3",
    "Phase 3",
    "Phase 4",
]

df_plot = df_yearly[df_yearly["phase_group"].isin(phase_order)].copy()
assert not df_plot.empty, "No phase-designated data found for stacked trend (check df_yearly filters)."

# Ensure proper types
df_plot["start_year"] = pd.to_numeric(df_plot["start_year"], errors="coerce").astype("Int64")
df_plot["trial_count"] = pd.to_numeric(df_plot["trial_count"], errors="coerce").fillna(0).astype(int)
df_plot = df_plot.dropna(subset=["start_year"]).sort_values(["start_year", "phase_group"])

# Figure
fig_stack = go.Figure()

# Add phases in the desired order with legendrank to control legend order
for idx, phase in enumerate(phase_order):
    df_p = df_plot[df_plot["phase_group"] == phase].sort_values("start_year")
    if df_p.empty:
        continue

    fig_stack.add_trace(
        go.Scatter(
            x=df_p["start_year"],
            y=df_p["trial_count"],
            stackgroup="one",
            name=phase,
            mode="lines",
            line=dict(width=0.6),
            legendrank=idx + 1,  # 1-7 for phases in order
            hovertemplate=(
                f"<b>%{{x}}</b><br>"
                f"{phase}: %{{y:,}}<extra></extra>"
            ),
        )
    )

# Optional: total line (thin) on top of the stack to help read the scale
df_total = (
    df_plot.groupby("start_year", as_index=False)["trial_count"]
    .sum()
    .sort_values("start_year")
)
fig_stack.add_trace(
    go.Scatter(
        x=df_total["start_year"],
        y=df_total["trial_count"],
        mode="lines",
        name="Total (phase-designated)",
        line=dict(color="#1f2a44", width=2),
        legendrank=0,  # First in legend
        hovertemplate="<b>%{x}</b><br>Total: %{y:,}<extra></extra>",
    )
)

# Dynamic n for phase-designated-only
n_phase_designated = int(df_plot["trial_count"].sum())

# Layout polish
fig_stack.update_layout(
    title=dict(
        text=(
            "<b>Trial Initiations by Clinical Phase Over Time</b><br>"
            f"<span style='font-size:12px; color:#6b7280'>"
            f"Phase-designated studies only · {min_year}–{max_year} (n = {n_phase_designated:,})"
            "</span>"
        ),
        x=0.5,
        xanchor="center",
    ),
    xaxis=dict(
        title="Start year",
        showgrid=False,
        showline=True,
        linecolor="#d1d5db",
        tickmode="linear",
        dtick=5,
        tickformat="d",
    ),
    yaxis=dict(
        title="Trials initiated (count)",
        showgrid=True,
        gridcolor="#f3f4f6",
        rangemode="tozero",
        showline=True,
        linecolor="#d1d5db",
    ),
    template="plotly_white",
    font=dict(family="Arial", color="#374151"),
    height=560,
    margin=dict(t=90, b=170, l=70, r=30),
    legend=dict(
        orientation="h",
        yanchor="top",
        y=-0.20,        # legend under the plot
        xanchor="center",
        x=0.5,
        title=None,
        font=dict(size=11),
        traceorder="normal",  # Use legendrank to control order
    ),
)

# Note BELOW the legend
fig_stack.add_annotation(
    text=(
        "<b>Note</b>: 'Not Applicable' studies (observational/registry) are excluded to focus on phase-designated trials."
    ),
    xref="paper",
    yref="paper",
    x=0,
    y=-0.33,          # below legend
    showarrow=False,
    align="left",
    font=dict(size=10, color="#6b7280", family="Arial"),
)

fig_stack.show()
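
Not every phase has a row for every year (1990 has no Phase 1 or Phase 3 rows, for example), so the stacked traces rely on the plotting library's handling of gaps. One possible alternative, sketched here on the phase-designated 1990 and 1991 rows from the `df_yearly` preview, is to build a dense year × phase grid with explicit zeros before plotting. The names `df_sample`, `dense`, and `dense_long` are illustrative only.

```python
import pandas as pd

phase_order = [
    "Early Phase 1", "Phase 1", "Phase 1/2", "Phase 2",
    "Phase 2/3", "Phase 3", "Phase 4",
]

# Phase-designated rows for 1990 and 1991, copied from the df_yearly preview.
df_sample = pd.DataFrame(
    [
        (1990, "Phase 1/2", 2),
        (1990, "Phase 2", 6),
        (1991, "Early Phase 1", 1),
        (1991, "Phase 1", 2),
        (1991, "Phase 2", 5),
        (1991, "Phase 2/3", 2),
        (1991, "Phase 3", 6),
    ],
    columns=["start_year", "phase_group", "trial_count"],
)

# Dense grid: one row per year, one column per phase, zeros where no rows exist.
all_years = pd.Index(range(1990, 1992), name="start_year")
dense = df_sample.pivot_table(
    index="start_year",
    columns="phase_group",
    values="trial_count",
    aggfunc="sum",
    fill_value=0,
).reindex(index=all_years, columns=phase_order, fill_value=0)

# Back to long format, ready for one trace per phase.
dense_long = dense.reset_index().melt(
    id_vars="start_year", var_name="phase_group", value_name="trial_count"
)
print(dense)
```

With a dense grid, every trace shares the same x values, and the stacked total can be checked against `dense.sum(axis=1)`.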


### Q1.2 – Trial Initiations Over Time

#### Key observations

- **Trial initiation volume trends upward over the observed period (1990–2025)**, most clearly from the early 2000s onward.  
  This reflects a growing number of studies entering the registry over time within the defined analysis scope.

- **The highest annual initiation count in the current dataset is observed in 2023.**  
  Counts for the most recent years should be interpreted with caution, as they reflect the state of the registry at the time of data extraction rather than a finalized historical record.

- **Phase 2 generally represents the largest share of newly initiated phase-designated studies**, followed by Phase 1 and Phase 3.  
  This mirrors the phase distribution observed in the overall landscape snapshot (Q1.1). Early years are an exception because counts are small: in 1991, Phase 3 (6 studies) edges out Phase 2 (5).

- **Methodological note:** this analysis captures **trial initiation phase**, not phase progression.  
  Each study is counted once, in its start year, under the phase recorded for it in the registry. As a result, these trends describe **research entry points by phase**, not longitudinal movement of trials through the clinical development pipeline.
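
Whether one phase leads in every year can be checked directly rather than read off the chart. A minimal sketch, using only the phase-designated 1990 and 1991 rows from the `df_yearly` preview (so it illustrates the method, not the full result); `df_sample`, `by_year`, and `leading_phase` are illustrative names.

```python
import pandas as pd

# Phase-designated rows for 1990 and 1991, copied from the df_yearly preview.
df_sample = pd.DataFrame(
    [
        (1990, "Phase 1/2", 2),
        (1990, "Phase 2", 6),
        (1991, "Early Phase 1", 1),
        (1991, "Phase 1", 2),
        (1991, "Phase 2", 5),
        (1991, "Phase 2/3", 2),
        (1991, "Phase 3", 6),
    ],
    columns=["start_year", "phase_group", "trial_count"],
)

# Year x phase matrix of initiation counts.
by_year = df_sample.pivot_table(
    index="start_year",
    columns="phase_group",
    values="trial_count",
    aggfunc="sum",
    fill_value=0,
)

# Phase with the most initiations in each year, and Phase 2's share of that year.
leading_phase = by_year.idxmax(axis=1)
phase2_share = by_year["Phase 2"] / by_year.sum(axis=1) * 100

print(pd.DataFrame({"leading_phase": leading_phase, "phase2_share_pct": phase2_share}))
```

Applied to the full `df_year_phase` frame, `leading_phase.value_counts()` would show in how many years each phase leads.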

---

## 3. Therapeutic Concentration

**Question:** Which conditions show the highest trial volume?

In [11]:
# Load top conditions
with open('../sql/queries/q1_top_therapeutic_areas.sql', 'r', encoding='utf-8') as f:
    query_therapeutic = f.read()

df_therapeutic = pd.read_sql_query(query_therapeutic, conn)
df_therapeutic = df_therapeutic.sort_values('trial_count', ascending=False)
df_therapeutic.head(10)

Unnamed: 0,condition_name,trial_count,percentage_of_trials,completed_trials,recruiting_trials
0,Healthy,2131,1.97,1737,137
1,Breast Cancer,1664,1.54,829,296
2,Obesity,1381,1.28,918,166
3,Stroke,978,0.9,488,217
4,Pain,832,0.77,540,85
5,Depression,813,0.75,495,114
6,Prostate Cancer,803,0.74,394,151
7,Hypertension,801,0.74,505,106
8,HIV Infections,721,0.67,575,34
9,Cancer,720,0.67,381,141


In [12]:
# Horizontal bar chart (sorted ascending for bottom-to-top display)
df_sorted = df_therapeutic.sort_values('trial_count', ascending=True)

# Gradient color
max_val = df_sorted['trial_count'].max()
min_val = df_sorted['trial_count'].min()
colors = []
for val in df_sorted['trial_count'].values:
    ratio = (val - min_val) / (max_val - min_val) if max_val > min_val else 1
    r = int(229 - ratio * (229 - 37))
    g = int(231 - ratio * (231 - 99))
    b = int(235 - ratio * (235 - 235))
    colors.append(f'rgb({r}, {g}, {b})')

# Calculate title dynamically
top1 = df_therapeutic.iloc[0]['condition_name']
top1_count = int(df_therapeutic.iloc[0]['trial_count'])
top2 = df_therapeutic.iloc[1]['condition_name']
top2_count = int(df_therapeutic.iloc[1]['trial_count'])

fig_areas = go.Figure(go.Bar(
    x=df_sorted['trial_count'].values,
    y=df_sorted['condition_name'].values,
    orientation='h',
    text=[f'{int(v):,}' for v in df_sorted['trial_count'].values],
    marker_color=colors,
    textposition='outside',
    hovertemplate='<b>%{y}</b><br>%{x:,} trials<extra></extra>'
))

fig_areas.update_layout(
    title=f'<b>"{top1}" leads at {top1_count:,} trials; {top2} is top disease ({top2_count:,})</b>',
    xaxis=dict(showgrid=False, showticklabels=False, title=None),
    yaxis=dict(title=None, tickfont=dict(size=11)),
    height=650,
    template='plotly_white',
    font=dict(family="Arial", color="#374151"),
    margin=dict(r=50, t=60, b=50),
    bargap=0.15
)
fig_areas.show()

In [13]:
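# Category rollups as dataframe
# Keyword matching on free-text labels is a heuristic: conditions lacking these
# keywords are missed, and a study that lists several matching conditions may be
# counted more than once in a category total.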
cancer = df_therapeutic[df_therapeutic['condition_name'].str.contains('Cancer', case=False, na=False)]
cardio = df_therapeutic[df_therapeutic['condition_name'].str.contains(
    'Heart|Cardiovascular|Coronary|Hypertension|Stroke', case=False, na=False
)]
metabolic = df_therapeutic[df_therapeutic['condition_name'].str.contains(
    'Diabetes|Obesity', case=False, na=False
)]

categories = pd.DataFrame({
    'Category': ['Oncology', 'Cardiovascular', 'Metabolic'],
    'Total Trials': [
        cancer['trial_count'].sum(),
        cardio['trial_count'].sum(),
        metabolic['trial_count'].sum()
    ],
    'Conditions': [
        ', '.join(cancer['condition_name']),
        ', '.join(cardio['condition_name']),
        ', '.join(metabolic['condition_name'])
    ]
})
categories

Unnamed: 0,Category,Total Trials,Conditions
0,Oncology,4393,"Breast Cancer, Prostate Cancer, Cancer, Colore..."
1,Cardiovascular,3744,"Stroke, Hypertension, Coronary Artery Disease,..."
2,Metabolic,1970,"Obesity, Diabetes Mellitus, Type 2"


### What we see

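- **"Healthy" is the single most frequent condition label** (2,131 trials, 1.97% per the `percentage_of_trials` column), reflecting healthy-volunteer studies across indications
- **Oncology shows high concentration:** multiple cancer-related conditions appear in the top 20, totaling 4,393 trials in the keyword rollup and led by Breast Cancer (1,664)
- **Cardiovascular (3,744) and metabolic (1,970) conditions are also well represented**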
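
The rollup depends on keyword patterns, so it helps to see which labels match nothing or match more than one category. The sketch below applies the same three patterns to the top 10 rows shown earlier; `CATEGORY_PATTERNS` and `assign_category` are hypothetical helper names, and the totals cover only those ten rows, so they are smaller than the figures in the rollup table.

```python
import re

import pandas as pd

# Top 10 rows copied from the df_therapeutic preview.
top10 = pd.DataFrame(
    {
        "condition_name": [
            "Healthy", "Breast Cancer", "Obesity", "Stroke", "Pain",
            "Depression", "Prostate Cancer", "Hypertension",
            "HIV Infections", "Cancer",
        ],
        "trial_count": [2131, 1664, 1381, 978, 832, 813, 803, 801, 721, 720],
    }
)

# Same keyword patterns as the rollup cell.
CATEGORY_PATTERNS = {
    "Oncology": r"Cancer",
    "Cardiovascular": r"Heart|Cardiovascular|Coronary|Hypertension|Stroke",
    "Metabolic": r"Diabetes|Obesity",
}


def assign_category(name: str) -> str:
    """Return the single matching category, or flag no match / multiple matches."""
    hits = [
        category
        for category, pattern in CATEGORY_PATTERNS.items()
        if re.search(pattern, name, flags=re.IGNORECASE)
    ]
    if not hits:
        return "Unassigned"
    return hits[0] if len(hits) == 1 else "Ambiguous"


top10["category"] = top10["condition_name"].map(assign_category)
rollup = top10.groupby("category")["trial_count"].sum().sort_values(ascending=False)
print(rollup)
```

Listing the "Unassigned" labels is a quick way to spot conditions the patterns miss, such as a cancer type whose label does not contain the word "Cancer".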

### Implication

High-volume therapeutic areas may face competitive pressures for patient recruitment. **Q3 should examine enrollment performance by condition** to identify areas where trials struggle to meet targets.

---

## Summary

**What this analysis establishes:**

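1. **Pipeline composition:** "Not Applicable" studies dominate the analysis scope (62.1%); among phase-designated studies, Phase 2 is the largest group (10,991 studies, 11.2% of scope)
2. **Growth trajectory:** Trial initiations trend upward from the early 2000s, with the highest annual count observed in 2023
3. **Therapeutic allocation:** Among disease categories in the top conditions, oncology, cardiovascular, and metabolic labels show the highest keyword-rollup totals (4,393, 3,744, and 1,970 trials)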

**Why subsequent analyses are needed:**

- **Q2 (Completion):** Volume alone doesn't reveal pipeline efficiency—we need completion rates by phase
- **Q3 (Enrollment):** High trial counts don't guarantee adequate patient recruitment—we need enrollment metrics
- **Q4 (Geography):** This analysis ignores location patterns—we need geographic distribution
- **Q5 (Duration):** Growth trends don't show whether trials are getting longer or shorter—we need timeline analysis

---

## Data Limitations

**Sample constraints:**
- Dataset represents a sample of 100,000 loaded studies, not the complete registry
- Earlier trials are underrepresented (for example, 30 studies with a 1990 start year)

**Classification issues:**
- "Not Applicable" is a catch-all category (observational studies, expanded access, etc.)
- Condition labels are free-text and non-standardized (potential overlap)

**Temporal bias:**
- Recent years likely undercount due to registry reporting lag
- Status labels reflect point-in-time snapshots, not real-time data

In [14]:
# Close connection
conn.close()