# Q1: Clinical Trial Landscape

**Dataset:** 10,000 trials from ClinicalTrials.gov (1990–2025)  
**Purpose:** Establish baseline distribution to inform deeper analysis

**Three questions:**
1. Where is research volume concentrated across development phases?
2. How has trial initiation volume changed over time?
3. Which therapeutic areas show the highest trial counts?

**What this analysis does NOT cover:**
- Completion rates (Q2)
- Enrollment performance (Q3)
- Geographic patterns (Q4)
- Trial duration (Q5)

In [None]:
import sqlite3
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from pathlib import Path

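# Database connection. Check the file first: sqlite3.connect() creates an empty
# database when the path does not exist, which would hide a wrong path.
DB_PATH = Path('../data/database/clinical_trials.db')
if not DB_PATH.exists():
    raise FileNotFoundError(f"Database not found: {DB_PATH}")
conn = sqlite3.connect(str(DB_PATH))

print(f"✓ Connected: {DB_PATH.name} ({DB_PATH.stat().st_size / 1024 / 1024:.1f} MB)")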

---

## 1. Phase Distribution

**Question:** Where is trial volume concentrated across development phases and statuses?

In [None]:
# Load phase × status data
with open('../sql/queries/q1_phase_status_distribution.sql', 'r') as f:
    query_phase_status = f.read()

df_phase_status = pd.read_sql_query(query_phase_status, conn)
print(f"Loaded {len(df_phase_status)} phase-status combinations\n")
df_phase_status.head(10)

In [None]:
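# Pivot for heatmap (aggfunc='sum' because pivot_table defaults to the mean,
# which would misreport counts if a phase-status pair appears more than once)
pivot = df_phase_status.pivot_table(
    index='phase_group',
    columns='status_label',
    values='trial_count',
    aggfunc='sum',
    fill_value=0
)

# Order: early→late phases; statuses run Completed, then active, then terminal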
phase_order = ['Early Phase 1', 'Phase 1', 'Phase 1/2', 'Phase 2', 
               'Phase 2/3', 'Phase 3', 'Phase 4', 'Not Applicable']
status_order = ['Completed', 'Recruiting', 'Active, not recruiting', 
                'Not yet recruiting', 'Enrolling by invitation', 
                'Terminated', 'Withdrawn', 'Suspended', 'Unknown']

pivot = pivot.reindex(
    index=[p for p in phase_order if p in pivot.index],
    columns=[s for s in status_order if s in pivot.columns]
)

# Inline annotations (skip cells < 10)
annotations = []
max_val = pivot.values.max()

for i, row in enumerate(pivot.values):
    for j, val in enumerate(row):
        if val >= 10:  # Only show meaningful counts
            text_color = 'white' if val > max_val * 0.3 else '#1f2a44'
            annotations.append(dict(
                x=pivot.columns[j], y=pivot.index[i],
                text=str(int(val)),
                showarrow=False,
                font=dict(size=10, color=text_color)
            ))

# Plot
fig_heatmap = go.Figure(data=go.Heatmap(
    z=pivot.values,
    x=pivot.columns,
    y=pivot.index,
    colorscale=[[0, '#e5e7eb'], [1, '#2563eb']],
    xgap=1, ygap=1,
    showscale=False,
    hovertemplate='<b>%{y}</b><br>%{x}: %{z:,}<extra></extra>'
))

fig_heatmap.update_layout(
    title='<b>Phase 2 and "Not Applicable" account for 7,335 trials (73%)</b>',
    xaxis=dict(tickangle=-30, tickfont=dict(size=11)),
    yaxis=dict(autorange="reversed", tickfont=dict(size=11)),
    annotations=annotations,
    height=550,
    template='plotly_white',
    font=dict(family="Arial", color="#374151"),
    margin=dict(t=60, b=50)
)
fig_heatmap.show()

In [None]:
# Summary by phase
phase_summary = df_phase_status.groupby('phase_group')['trial_count'].sum()
phase_summary = phase_summary.sort_values(ascending=True)

# Gradient color by value
max_val = phase_summary.max()
min_val = phase_summary.min()
colors = []
for val in phase_summary.values:
    ratio = (val - min_val) / (max_val - min_val) if max_val > min_val else 1
    r = int(229 - ratio * (229 - 37))  # #e5e7eb to #2563eb
    g = int(231 - ratio * (231 - 99))
    b = int(235 - ratio * (235 - 235))
    colors.append(f'rgb({r}, {g}, {b})')

fig_phase = go.Figure(go.Bar(
    x=phase_summary.values,
    y=phase_summary.index,
    orientation='h',
    text=[f'{int(v):,}' for v in phase_summary.values],
    marker_color=colors,
    textposition='outside',
    hovertemplate='<b>%{y}</b><br>%{x:,} trials<extra></extra>'
))

fig_phase.update_layout(
    title='<b>"Not Applicable" dominates at 6,149 trials (61% of sample)</b>',
    xaxis=dict(showgrid=False, showticklabels=False, title=None),
    yaxis=dict(title=None, tickfont=dict(size=12)),
    height=450,
    template='plotly_white',
    font=dict(family="Arial", color="#374151"),
    margin=dict(r=50, t=60, b=50),
    bargap=0.15
)
fig_phase.show()

print("\nPhase totals:")
print(phase_summary.sort_values(ascending=False))

### What we see

- **6,149 trials (61%)** are classified as "Not Applicable": studies with no phase designation, likely observational studies and device or behavioral trials
- **Phase 2 is the largest phase-designated category** at 1,186 trials (12%), ahead of Phase 1 (831) and Phase 3 (746)
- **"Completed" is the dominant status** across all phases, reflecting the historical nature of this dataset
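
The percentages quoted above follow directly from the phase totals. A minimal sketch of that arithmetic, assuming the phase counts sum to the full 10,000-trial sample; the helper name `phase_shares` and the "Other phases" remainder row are illustrative, not part of the notebook:

```python
import pandas as pd

def phase_shares(phase_totals: pd.Series) -> pd.DataFrame:
    """Return count and percentage share per phase, largest first."""
    total = phase_totals.sum()
    out = pd.DataFrame({
        "trial_count": phase_totals,
        "share_pct": (phase_totals / total * 100).round(1),
    })
    return out.sort_values("trial_count", ascending=False)

# Counts quoted in this section; "Other phases" is the remainder,
# assuming every one of the 10,000 trials falls in some phase group.
example = pd.Series({
    "Not Applicable": 6149,
    "Phase 2": 1186,
    "Phase 1": 831,
    "Phase 3": 746,
})
example["Other phases"] = 10_000 - example.sum()

shares = phase_shares(example)
print(shares)
```

In the notebook itself, the same helper could be applied to `phase_summary` in place of the hand-typed series.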

### Implication

Phase 2 holds the most phase-designated trials, but volume alone cannot show whether mid-stage efficacy testing is an operational bottleneck. **Q2 should examine completion rates by phase** to test whether Phase 2 trials complete less often than trials in other phases.

---

## 2. Temporal Trends

**Question:** How has trial initiation volume evolved from 1990 to 2025?

In [None]:
# Load yearly data
with open('../sql/queries/q1_yearly_trends.sql', 'r') as f:
    query_yearly = f.read()

df_yearly = pd.read_sql_query(query_yearly, conn)
df_yearly['start_year'] = pd.to_numeric(df_yearly['start_year'])

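# Report the year range from the data instead of hardcoding it
year_min, year_max = int(df_yearly['start_year'].min()), int(df_yearly['start_year'].max())
print(f"Loaded {len(df_yearly)} years ({year_min}–{year_max})\n")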
df_yearly.head(10)

In [None]:
# Time series
fig_yearly = go.Figure()

fig_yearly.add_trace(go.Scatter(
    x=df_yearly['start_year'],
    y=df_yearly['trial_count'],
    mode='lines',
    line=dict(color='#2563eb', width=2.5),
    hovertemplate='<b>%{x}</b><br>%{y:,} trials<extra></extra>'
))

fig_yearly.update_layout(
    title='<b>Trial initiations peaked at 690 in 2023</b>',
    xaxis=dict(
        title=None,
        showgrid=False,
        linecolor='#374151',
        showline=True,
        type='linear',
        dtick=5,
        tickformat='d'
    ),
    yaxis=dict(
        title="Trials initiated",
        showgrid=True,
        gridcolor='#f3f4f6',
        rangemode='tozero',
        showline=True,
        linecolor='#374151'
    ),
    height=500,
    template='plotly_white',
    font=dict(family="Arial", color="#374151"),
    margin=dict(t=60, b=50)
)
fig_yearly.show()

In [None]:
# Key statistics
peak_year = df_yearly.loc[df_yearly['trial_count'].idxmax(), 'start_year']
peak_count = df_yearly['trial_count'].max()
avg_per_year = df_yearly['trial_count'].mean()
total = df_yearly['trial_count'].sum()

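print(f"Peak year: {int(peak_year)} ({int(peak_count)} trials)")
print(f"Average per year: {avg_per_year:.0f}")  # rounded, not truncated
# Label the total with the range actually present in the data
print(f"Total ({int(df_yearly['start_year'].min())}–{int(df_yearly['start_year'].max())}): {int(total):,}")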
print("\nRecent years (2020–2025):")
print(df_yearly[df_yearly['start_year'] >= 2020][['start_year', 'trial_count']].to_string(index=False))

### What we see

- **Steady growth from 1994 to 2023:** Initiations climbed from single digits in the mid-1990s to a peak of 690 in 2023
- **Post-2020 volume remains high:** 600–690 trials per year (2020–2025), consistent with pre-pandemic levels
- **9,812 trials (98%)** in this sample have start dates from 1990 onward

### Implication

Initiations appear to have leveled off at roughly 650 trials/year since 2020, although 2024–2025 counts may still rise as registry reporting catches up. **Q5 should analyze trial duration trends** to understand whether recent trials are completing faster or slower than historical benchmarks.
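
One way to check a plateau claim like this is to summarize the yearly counts over a fixed window and compare the spread to the mean. The sketch below assumes a frame shaped like `df_yearly` (columns `start_year` and `trial_count`); the helper name `window_summary` and the toy counts are placeholders, not values from the dataset:

```python
import pandas as pd

def window_summary(df: pd.DataFrame, first_year: int, last_year: int) -> dict:
    """Summarize yearly counts within [first_year, last_year], inclusive."""
    window = df[df["start_year"].between(first_year, last_year)]
    counts = window["trial_count"]
    mean = counts.mean()
    return {
        "years": len(window),
        "mean": round(float(mean), 1),
        "min": int(counts.min()),
        "max": int(counts.max()),
        # Range as a share of the mean: small values indicate a flat series
        "range_pct_of_mean": round(float((counts.max() - counts.min()) / mean * 100), 1),
    }

# Placeholder counts for illustration only, not values from the dataset
toy = pd.DataFrame({
    "start_year": [2018, 2019, 2020, 2021, 2022, 2023],
    "trial_count": [80, 90, 100, 104, 98, 102],
})

summary = window_summary(toy, 2020, 2023)
print(summary)
```

What counts as "flat" is a judgment call; the range-to-mean ratio is only one possible yardstick.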

---

## 3. Therapeutic Concentration

**Question:** Which conditions show the highest trial volume?

In [None]:
# Load top conditions
with open('../sql/queries/q1_top_therapeutic_areas.sql', 'r') as f:
    query_therapeutic = f.read()

df_therapeutic = pd.read_sql_query(query_therapeutic, conn)
print(f"Loaded top {len(df_therapeutic)} conditions (≥10 trials each)\n")
df_therapeutic.head(10)

In [None]:
# Horizontal bar chart
df_sorted = df_therapeutic.sort_values('trial_count', ascending=True)

# Gradient color
max_val = df_sorted['trial_count'].max()
min_val = df_sorted['trial_count'].min()
colors = []
for val in df_sorted['trial_count'].values:
    ratio = (val - min_val) / (max_val - min_val) if max_val > min_val else 1
    r = int(229 - ratio * (229 - 37))
    g = int(231 - ratio * (231 - 99))
    b = int(235 - ratio * (235 - 235))
    colors.append(f'rgb({r}, {g}, {b})')

fig_areas = go.Figure(go.Bar(
    x=df_sorted['trial_count'].values,
    y=df_sorted['condition_name'].values,
    orientation='h',
    text=[f'{int(v):,}' for v in df_sorted['trial_count'].values],
    marker_color=colors,
    textposition='outside',
    hovertemplate='<b>%{y}</b><br>%{x:,} trials<extra></extra>'
))

fig_areas.update_layout(
    title='<b>"Healthy" leads at 178 trials; Breast Cancer is top disease condition (175)</b>',
    xaxis=dict(showgrid=False, showticklabels=False, title=None),
    yaxis=dict(title=None, tickfont=dict(size=11)),
    height=650,
    template='plotly_white',
    font=dict(family="Arial", color="#374151"),
    margin=dict(r=50, t=60, b=50),
    bargap=0.15
)
fig_areas.show()

In [None]:
# Category rollups
cancer = df_therapeutic[df_therapeutic['condition_name'].str.contains('Cancer', case=False, na=False)]
cardio = df_therapeutic[df_therapeutic['condition_name'].str.contains(
    'Heart|Cardiovascular|Coronary|Hypertension|Stroke', case=False, na=False
)]
metabolic = df_therapeutic[df_therapeutic['condition_name'].str.contains(
    'Diabetes|Obesity', case=False, na=False
)]

print("Category totals (Top 20 only):")
print(f"  Oncology: {cancer['trial_count'].sum():,} trials ({', '.join(cancer['condition_name'])})")
print(f"  Cardiovascular: {cardio['trial_count'].sum():,} trials ({', '.join(cardio['condition_name'])})")
print(f"  Metabolic: {metabolic['trial_count'].sum():,} trials ({', '.join(metabolic['condition_name'])})")

### What we see

- **"Healthy" is the single largest category** at 178 trials (1.78%), reflecting healthy-volunteer studies across indications
- **Oncology shows high concentration:** 4 cancer-related conditions in the top 20 (Breast, Colorectal, Prostate, and generic "Cancer"), totaling 389 trials
- **Cardiovascular and metabolic conditions are well-represented:** 329 and 214 trials respectively in the top 20
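
These category totals come from keyword matching on free-text labels, so they depend on which keywords are listed. The sketch below wraps the same patterns used in the rollup cell into one function and lists the labels that match nothing; the function name `rollup_by_keyword` and the toy rows are illustrative assumptions:

```python
import pandas as pd

# Keyword patterns copied from the rollup cell; matching is a heuristic
CATEGORY_PATTERNS = {
    "Oncology": "Cancer",
    "Cardiovascular": "Heart|Cardiovascular|Coronary|Hypertension|Stroke",
    "Metabolic": "Diabetes|Obesity",
}

def rollup_by_keyword(df: pd.DataFrame, patterns: dict) -> pd.DataFrame:
    """Sum trial_count over condition names matching each category pattern."""
    rows = []
    for category, pattern in patterns.items():
        mask = df["condition_name"].str.contains(pattern, case=False, na=False)
        matched = df[mask]
        rows.append({
            "category": category,
            "n_conditions": len(matched),
            "trial_count": int(matched["trial_count"].sum()),
        })
    return pd.DataFrame(rows)

# Toy rows for illustration only, not values from the dataset
toy = pd.DataFrame({
    "condition_name": ["Breast Cancer", "Stroke", "Obesity", "Healthy", "Leukemia", None],
    "trial_count": [30, 20, 10, 40, 5, 1],
})

result = rollup_by_keyword(toy, CATEGORY_PATTERNS)

# Labels that no pattern catches, e.g. an oncology term without the word "Cancer"
any_pattern = "|".join(CATEGORY_PATTERNS.values())
matched_any = toy["condition_name"].str.contains(any_pattern, case=False, na=False)
unmatched = toy.loc[~matched_any, "condition_name"].dropna().tolist()

print(result)
print("Unmatched labels:", unmatched)
```

Reviewing the unmatched list is a cheap way to spot conditions that belong in a category but use different wording.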

### Implication

High-volume therapeutic areas may face stiffer competition for eligible patients. **Q3 should examine enrollment performance by condition** to identify areas where trials struggle to meet targets.

---

## Summary

**What this analysis establishes:**

1. **Pipeline composition:** Phase 2 trials represent 12% of the sample; "Not Applicable" trials dominate at 61%
2. **Growth trajectory:** Trial initiations grew steadily from the mid-1990s, reaching 600–690/year in 2020–2025 (peak: 690 in 2023)
3. **Therapeutic allocation:** Oncology, cardiovascular, and metabolic conditions show highest concentration

**Why subsequent analyses are needed:**

- **Q2 (Completion):** Volume alone doesn't reveal pipeline efficiency—we need completion rates by phase
- **Q3 (Enrollment):** High trial counts don't guarantee adequate patient recruitment—we need enrollment metrics
- **Q4 (Geography):** This analysis ignores location patterns—we need geographic distribution
- **Q5 (Duration):** Growth trends don't show whether trials are getting longer or shorter—we need timeline analysis

---

## Data Limitations

**Sample constraints:**
- Dataset represents 10,000 trials (a sample, not the complete registry)
- Coverage period: 1990–2025 (earlier trials underrepresented)

**Classification issues:**
- "Not Applicable" is a catch-all category (includes observational studies, expanded access, etc.)
- Condition labels are free-text and non-standardized (e.g., "Cancer" vs. "Breast Cancer" may overlap)
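
A rough way to surface the overlap described above is to look for labels that appear inside other labels. This is a sketch using a plain substring test, which is crude (it would also pair "Cancer" with a label like "Cancer Pain"); the function name `nested_labels` and the exact label strings are assumptions:

```python
def nested_labels(labels):
    """Return (broad, specific) pairs where the broad label appears inside the specific one."""
    # Drop empty values and duplicates, then sort for a stable output order
    cleaned = sorted({label.strip() for label in labels if label and label.strip()})
    pairs = []
    for broad in cleaned:
        for specific in cleaned:
            if broad != specific and broad.lower() in specific.lower():
                pairs.append((broad, specific))
    return pairs

# Label strings assumed for illustration
labels = ["Cancer", "Breast Cancer", "Prostate Cancer", "Stroke", "Healthy"]
pairs = nested_labels(labels)
print(pairs)
```

Flagged pairs are candidates for manual review, not confirmed duplicates.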

**Temporal bias:**
- Recent years (2024–2025) likely undercount due to registry reporting lag
- Status labels (Recruiting, Completed) reflect point-in-time snapshots, not real-time data
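
To keep lag-affected years from driving trend conclusions, the most recent years can be flagged as provisional and excluded from summary statistics. A minimal sketch, assuming a snapshot year of 2025 and a two-year lag (both are assumptions chosen to match the 2024–2025 range named above); `flag_provisional` is a hypothetical helper and the counts are toy values:

```python
import pandas as pd

def flag_provisional(df: pd.DataFrame, snapshot_year: int, lag_years: int = 2) -> pd.DataFrame:
    """Mark the last `lag_years` years up to the snapshot as provisional."""
    cutoff = snapshot_year - lag_years + 1
    return df.assign(provisional=df["start_year"] >= cutoff)

# Toy counts for illustration only, not values from the dataset
toy = pd.DataFrame({
    "start_year": [2021, 2022, 2023, 2024, 2025],
    "trial_count": [10, 12, 11, 9, 4],
})

flagged = flag_provisional(toy, snapshot_year=2025, lag_years=2)
settled = flagged[~flagged["provisional"]]  # years treated as complete

print(flagged)
```

The appropriate lag depends on how quickly sponsors register trials, which this notebook does not measure.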

In [None]:
# Close connection
conn.close()
print("✓ Analysis complete")