# Q1: Clinical Trial Landscape

**Purpose:** Establish the baseline distribution of trials by phase, start year, and condition to inform the deeper analyses in Q2 through Q5

**Three questions:**
1. Where is trial volume concentrated across development phases and statuses?
2. How has trial initiation volume changed over time?
3. Which therapeutic areas show the highest trial counts?

**What this analysis does NOT cover:**
- Completion rates (Q2)
- Enrollment performance (Q3)
- Geographic patterns (Q4)
- Trial duration (Q5)

In [None]:
import sqlite3
import pandas as pd
import plotly.graph_objects as go
from pathlib import Path

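# Database connection
DB_PATH = Path('../data/database/clinical_trials.db')
# sqlite3.connect() creates an empty database if the file is missing, so fail early instead
if not DB_PATH.exists():
    raise FileNotFoundError(f"Database not found: {DB_PATH}")
conn = sqlite3.connect(str(DB_PATH))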

In [None]:
# Dataset snapshot (n_studies counts only trials with a recorded start_date)
snapshot = pd.read_sql_query("""
    SELECT 
        COUNT(*) as n_studies,
        MIN(CAST(strftime('%Y', start_date) AS INTEGER)) as min_year,
        MAX(CAST(strftime('%Y', start_date) AS INTEGER)) as max_year
    FROM studies 
    WHERE start_date IS NOT NULL
""", conn)

n_studies = int(snapshot['n_studies'].iloc[0])
min_year = int(snapshot['min_year'].iloc[0])
max_year = int(snapshot['max_year'].iloc[0])

print(f"Dataset: {n_studies:,} trials · {min_year}–{max_year}")

---

## 1. Phase Distribution

**Question:** Where is trial volume concentrated across development phases and statuses?

In [None]:
# Load phase × status data
with open('../sql/queries/q1_phase_status_distribution.sql', 'r') as f:
    query_phase_status = f.read()

df_phase_status = pd.read_sql_query(query_phase_status, conn)
df_phase_status.head(10)

In [None]:
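# Pivot for heatmap
# aggfunc='sum' is explicit because pivot_table defaults to the mean,
# which would understate counts if a phase/status pair appears more than once
pivot = df_phase_status.pivot_table(
    index='phase_group', 
    columns='status_label', 
    values='trial_count', 
    aggfunc='sum',
    fill_value=0
)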

# Order: early→late phases, active→terminal statuses
phase_order = ['Early Phase 1', 'Phase 1', 'Phase 1/2', 'Phase 2', 
               'Phase 2/3', 'Phase 3', 'Phase 4', 'Not Applicable']
status_order = ['Completed', 'Recruiting', 'Active, not recruiting', 
                'Not yet recruiting', 'Enrolling by invitation', 
                'Terminated', 'Withdrawn', 'Suspended', 'Unknown']

pivot = pivot.reindex(
    index=[p for p in phase_order if p in pivot.index],
    columns=[s for s in status_order if s in pivot.columns]
)

# Annotations (only show if >= 200 to avoid clutter)
annotations = []
max_val = pivot.values.max()

for i, row in enumerate(pivot.values):
    for j, val in enumerate(row):
        if val >= 200:
            text_color = 'white' if val > max_val * 0.4 else '#1f2a44'
            annotations.append(dict(
                x=pivot.columns[j], y=pivot.index[i],
                text=str(int(val)),
                showarrow=False,
                font=dict(size=11, color=text_color, family='Arial')
            ))

# Calculate title dynamically
phase2_count = pivot.loc['Phase 2'].sum()
na_count = pivot.loc['Not Applicable'].sum()
combined_count = phase2_count + na_count
combined_pct = round(combined_count / n_studies * 100)

# Plot
fig_heatmap = go.Figure(data=go.Heatmap(
    z=pivot.values,
    x=pivot.columns,
    y=pivot.index,
    colorscale=[[0, '#e5e7eb'], [1, '#2563eb']],
    xgap=1, ygap=1,
    showscale=False,
    hovertemplate='<b>%{y}</b><br>%{x}: %{z:,}<extra></extra>'
))

fig_heatmap.update_layout(
    title=f'<b>Phase 2 and "Not Applicable" account for {combined_count:,} trials ({combined_pct}%)</b>',
    xaxis=dict(tickangle=-30, tickfont=dict(size=11)),
    yaxis=dict(autorange="reversed", tickfont=dict(size=11)),
    annotations=annotations,
    height=550,
    template='plotly_white',
    font=dict(family="Arial", color="#374151"),
    margin=dict(t=60, b=50)
)
fig_heatmap.show()

In [None]:
# Summary by phase (descending order)
phase_summary = df_phase_status.groupby('phase_group')['trial_count'].sum()
phase_summary = phase_summary.sort_values(ascending=False)

# Gradient color (lighter = smaller, darker = larger)
max_val = phase_summary.max()
min_val = phase_summary.min()
colors = []
for val in phase_summary.values:
    ratio = (val - min_val) / (max_val - min_val) if max_val > min_val else 1
    r = int(229 - ratio * (229 - 37))
    g = int(231 - ratio * (231 - 99))
    b = int(235 - ratio * (235 - 235))
    colors.append(f'rgb({r}, {g}, {b})')

# Calculate title dynamically
top_phase = phase_summary.index[0]
top_count = int(phase_summary.iloc[0])
top_pct = round(top_count / n_studies * 100)

fig_phase = go.Figure(go.Bar(
    x=phase_summary.values,
    y=phase_summary.index,
    orientation='h',
    text=[f'{int(v):,}' for v in phase_summary.values],
    marker_color=colors,
    textposition='outside',
    hovertemplate='<b>%{y}</b><br>%{x:,} trials<extra></extra>'
))

fig_phase.update_layout(
    title=f'<b>"{top_phase}" dominates at {top_count:,} trials ({top_pct}%)</b>',
    xaxis=dict(showgrid=False, showticklabels=False, title=None),
    yaxis=dict(title=None, tickfont=dict(size=12)),
    height=450,
    template='plotly_white',
    font=dict(family="Arial", color="#374151"),
    margin=dict(r=50, t=60, b=50),
    bargap=0.15
)
fig_phase.show()

In [None]:
# Key stats as dataframe
phase_stats = phase_summary.reset_index()
phase_stats.columns = ['Phase', 'Trial Count']
phase_stats['% of Total'] = (phase_stats['Trial Count'] / n_studies * 100).round(1)
phase_stats

### What we see

- **"Not Applicable" is the largest category**—likely observational studies or trials without phase designations
- **Phase 2 is the largest interventional category**, ahead of Phase 1 and Phase 3
- **"Completed" is the dominant status** across all phases, reflecting the historical nature of this dataset
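
The percentages in this section divide by `n_studies`, which the snapshot query restricts to trials with a non-null `start_date`. Whether the phase query applies the same filter is not visible in this notebook, so the shares are only exact if the two totals agree. A minimal sketch of that check follows; `reconcile_total` is a hypothetical helper and the numbers are illustrative, not taken from the dataset.

```python
import pandas as pd

def reconcile_total(counts: pd.Series, expected_total: int) -> dict:
    """Compare the sum of grouped counts with an independently computed total."""
    grouped_total = int(counts.sum())
    return {
        "grouped_total": grouped_total,
        "expected_total": int(expected_total),
        "difference": grouped_total - int(expected_total),
    }

# Illustrative counts only (not real results)
demo_counts = pd.Series({"Phase 2": 40, "Not Applicable": 50, "Phase 3": 15})
check = reconcile_total(demo_counts, expected_total=100)
print(check)
```

In the notebook this would be called as `reconcile_total(phase_summary, n_studies)`. A non-zero difference would suggest that the phase query and the snapshot query count different sets of trials, in which case the phase total is the safer denominator for the shares.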

### Implication

The concentration of trials in Phase 2 is consistent with mid-stage efficacy testing being an operational bottleneck, but counts alone cannot confirm it. **Q2 should examine completion rates by phase** to quantify how many Phase 2 trials reach completion.

---

## 2. Temporal Trends

**Question:** How has trial initiation volume evolved over time?

In [None]:
# Load yearly data
with open('../sql/queries/q1_yearly_trends.sql', 'r') as f:
    query_yearly = f.read()

df_yearly = pd.read_sql_query(query_yearly, conn)
df_yearly['start_year'] = pd.to_numeric(df_yearly['start_year'])

df_yearly.head(10)

In [None]:
# Calculate peak dynamically
peak_idx = df_yearly['trial_count'].idxmax()
peak_year = int(df_yearly.loc[peak_idx, 'start_year'])
peak_count = int(df_yearly.loc[peak_idx, 'trial_count'])

# Time series with peak annotation
fig_yearly = go.Figure()

fig_yearly.add_trace(go.Scatter(
    x=df_yearly['start_year'],
    y=df_yearly['trial_count'],
    mode='lines',
    line=dict(color='#2563eb', width=2.5),
    hovertemplate='<b>%{x}</b><br>%{y:,} trials<extra></extra>'
))

# Add peak marker
fig_yearly.add_trace(go.Scatter(
    x=[peak_year],
    y=[peak_count],
    mode='markers+text',
    marker=dict(size=12, color='#dc2626', symbol='circle'),
    text=[f'{peak_count:,}'],
    textposition='top center',
    textfont=dict(size=11, color='#dc2626', family='Arial'),
    showlegend=False,
    hoverinfo='skip'
))

fig_yearly.update_layout(
    title=f'<b>Trial initiations peaked at {peak_count:,} in {peak_year}</b>',
    xaxis=dict(
        title=None,
        showgrid=False,
        linecolor='#374151',
        showline=True,
        type='linear',
        dtick=5,
        tickformat='d'
    ),
    yaxis=dict(
        title="Trials initiated",
        showgrid=True,
        gridcolor='#f3f4f6',
        rangemode='tozero',
        showline=True,
        linecolor='#374151'
    ),
    height=500,
    template='plotly_white',
    font=dict(family="Arial", color="#374151"),
    margin=dict(t=60, b=50)
)
fig_yearly.show()

In [None]:
# Recent years (2020 onward) with status breakdown
recent_cols = ['start_year', 'trial_count', 'completed_count', 'recruiting_count']
recent = df_yearly.loc[df_yearly['start_year'] >= 2020, recent_cols].copy()
recent.columns = ['Year', 'Total', 'Completed', 'Recruiting']
recent

### What we see

- **Steady growth from the mid-1990s to the peak:** Annual initiations climbed from single digits to a peak in recent years (marked in the chart above)
- **Post-2020 volume remains high:** Annual counts are consistent with pre-pandemic levels
- **Most trials in the sample have start dates from 1990 onward**
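
The growth and plateau observations can be checked numerically rather than read off the chart. The sketch below assumes a frame shaped like `df_yearly` (columns `start_year` and `trial_count`); `add_growth_columns` and the three-year window are illustrative choices, and the demo values are made up.

```python
import pandas as pd

def add_growth_columns(df: pd.DataFrame, window: int = 3) -> pd.DataFrame:
    """Add year-over-year % change and a rolling mean of yearly trial counts."""
    out = df.sort_values("start_year").reset_index(drop=True).copy()
    # pct_change compares each row with the previous row, so it is only a true
    # year-over-year figure when no years are missing from the series
    out["yoy_pct"] = out["trial_count"].pct_change() * 100
    out["rolling_mean"] = out["trial_count"].rolling(window, min_periods=1).mean()
    return out

# Made-up values for illustration
demo = pd.DataFrame({
    "start_year": [2018, 2019, 2020, 2021],
    "trial_count": [100, 110, 121, 121],
})
trend = add_growth_columns(demo)
print(trend)
```

A run of `yoy_pct` values near zero alongside a flat `rolling_mean` would support the reading that volume has plateaued rather than continued to grow.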

### Implication

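Initiation volume appears to have stabilized at a high level, although the most recent years may be undercounted because of reporting lag (see Data Limitations). **Q5 should analyze trial duration trends** to understand whether recent trials are completing faster or slower than historical benchmarks.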

---

## 3. Therapeutic Concentration

**Question:** Which conditions show the highest trial volume?

In [None]:
# Load top conditions
with open('../sql/queries/q1_top_therapeutic_areas.sql', 'r') as f:
    query_therapeutic = f.read()

df_therapeutic = pd.read_sql_query(query_therapeutic, conn)
df_therapeutic = df_therapeutic.sort_values('trial_count', ascending=False)
df_therapeutic.head(10)

In [None]:
# Horizontal bar chart (sorted ascending for bottom-to-top display)
df_sorted = df_therapeutic.sort_values('trial_count', ascending=True)

# Gradient color
max_val = df_sorted['trial_count'].max()
min_val = df_sorted['trial_count'].min()
colors = []
for val in df_sorted['trial_count'].values:
    ratio = (val - min_val) / (max_val - min_val) if max_val > min_val else 1
    r = int(229 - ratio * (229 - 37))
    g = int(231 - ratio * (231 - 99))
    b = int(235 - ratio * (235 - 235))
    colors.append(f'rgb({r}, {g}, {b})')

# Calculate title dynamically
top1 = df_therapeutic.iloc[0]['condition_name']
top1_count = int(df_therapeutic.iloc[0]['trial_count'])
top2 = df_therapeutic.iloc[1]['condition_name']
top2_count = int(df_therapeutic.iloc[1]['trial_count'])

fig_areas = go.Figure(go.Bar(
    x=df_sorted['trial_count'].values,
    y=df_sorted['condition_name'].values,
    orientation='h',
    text=[f'{int(v):,}' for v in df_sorted['trial_count'].values],
    marker_color=colors,
    textposition='outside',
    hovertemplate='<b>%{y}</b><br>%{x:,} trials<extra></extra>'
))

fig_areas.update_layout(
    title=f'<b>"{top1}" leads at {top1_count:,} trials; {top2} is top disease ({top2_count:,})</b>',
    xaxis=dict(showgrid=False, showticklabels=False, title=None),
    yaxis=dict(title=None, tickfont=dict(size=11)),
    height=650,
    template='plotly_white',
    font=dict(family="Arial", color="#374151"),
    margin=dict(r=50, t=60, b=50),
    bargap=0.15
)
fig_areas.show()

In [None]:
# Category rollups as dataframe
cancer = df_therapeutic[df_therapeutic['condition_name'].str.contains('Cancer', case=False, na=False)]
cardio = df_therapeutic[df_therapeutic['condition_name'].str.contains(
    'Heart|Cardiovascular|Coronary|Hypertension|Stroke', case=False, na=False
)]
metabolic = df_therapeutic[df_therapeutic['condition_name'].str.contains(
    'Diabetes|Obesity', case=False, na=False
)]

categories = pd.DataFrame({
    'Category': ['Oncology', 'Cardiovascular', 'Metabolic'],
    'Total Trials': [
        cancer['trial_count'].sum(),
        cardio['trial_count'].sum(),
        metabolic['trial_count'].sum()
    ],
    'Conditions': [
        ', '.join(cancer['condition_name']),
        ', '.join(cardio['condition_name']),
        ', '.join(metabolic['condition_name'])
    ]
})
categories

### What we see

- **"Healthy" is the single largest category**, reflecting healthy-volunteer studies across indications
- **Oncology shows high concentration:** Multiple cancer-related conditions in top 20
- **Cardiovascular and metabolic conditions are well-represented**
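
The oncology rollup above matches only the literal word "Cancer", so labels such as "Carcinoma" or "Lymphoma" would fall outside it. One possible refinement is a keyword map that assigns each condition label to the first category it matches, so no label is counted twice. The extra oncology terms below are assumptions for illustration, not a validated vocabulary, and the demo rows are made up.

```python
import re
import pandas as pd

# Hypothetical keyword patterns; the cardiovascular and metabolic terms mirror
# the rollup cell, the additional oncology terms are illustrative assumptions
CATEGORY_PATTERNS = {
    "Oncology": r"cancer|carcinoma|neoplasm|tumor|leukemia|lymphoma|melanoma",
    "Cardiovascular": r"heart|cardiovascular|coronary|hypertension|stroke",
    "Metabolic": r"diabetes|obesity",
}

def assign_category(name) -> str:
    """Return the first category whose pattern matches the condition label."""
    for category, pattern in CATEGORY_PATTERNS.items():
        if re.search(pattern, str(name), flags=re.IGNORECASE):
            return category
    return "Other"

# Made-up rows shaped like df_therapeutic
demo = pd.DataFrame({
    "condition_name": ["Breast Cancer", "Hepatocellular Carcinoma",
                       "Heart Failure", "Healthy", "Type 2 Diabetes"],
    "trial_count": [5, 3, 4, 10, 2],
})
demo["category"] = demo["condition_name"].map(assign_category)
rollup = demo.groupby("category")["trial_count"].sum().sort_values(ascending=False)
print(rollup)
```

This prevents double counting of condition labels only. A single trial that lists several conditions can still contribute to more than one category, which cannot be corrected from aggregated counts.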

### Implication

High-volume therapeutic areas may face stronger competition for patient recruitment. **Q3 should examine enrollment performance by condition** to identify areas where trials struggle to meet targets.

---

## Summary

**What this analysis establishes:**

1. **Pipeline composition:** "Not Applicable" is the largest phase category overall; Phase 2 is the largest interventional phase
2. **Growth trajectory:** Trial initiations grew steadily from the mid-1990s and have remained high since 2020
3. **Therapeutic allocation:** "Healthy" is the largest single condition label; among diseases, oncology, cardiovascular, and metabolic conditions show the highest concentration

**Why subsequent analyses are needed:**

- **Q2 (Completion):** Volume alone doesn't reveal pipeline efficiency—we need completion rates by phase
- **Q3 (Enrollment):** High trial counts don't guarantee adequate patient recruitment—we need enrollment metrics
- **Q4 (Geography):** This analysis ignores location patterns—we need geographic distribution
- **Q5 (Duration):** Growth trends don't show whether trials are getting longer or shorter—we need timeline analysis

---

## Data Limitations

**Sample constraints:**
- The dataset is a sample of the registry, not the complete registry
- Earlier trials are underrepresented

**Classification issues:**
- "Not Applicable" is a catch-all category (observational studies, expanded access, etc.)
- Condition labels are free-text and non-standardized (potential overlap)

**Temporal bias:**
- Recent years likely undercount due to registry reporting lag
- Status labels reflect point-in-time snapshots, not real-time data
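
One way to keep the reporting-lag caveat visible in the trend analysis is to flag the most recent years as provisional before interpreting peaks or declines. The sketch below is a hypothetical helper; the two-year lag is an assumed placeholder, not a measured property of the registry, and the demo values are made up.

```python
import pandas as pd

def flag_provisional_years(df: pd.DataFrame, as_of_year: int,
                           lag_years: int = 2) -> pd.DataFrame:
    """Mark the trailing `lag_years` years up to `as_of_year` as provisional."""
    out = df.copy()
    out["provisional"] = out["start_year"] > (as_of_year - lag_years)
    return out

# Made-up values shaped like df_yearly
demo = pd.DataFrame({
    "start_year": [2019, 2020, 2021, 2022, 2023],
    "trial_count": [90, 95, 100, 80, 40],
})
flagged = flag_provisional_years(demo, as_of_year=2023)
print(flagged)
```

Provisional rows could then be drawn with a dashed line or excluded when locating the peak year.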

In [None]:
# Close connection
conn.close()