# Q3: Enrollment Performance

**Purpose:** Assess recruitment scale beyond trial counts

**Three questions:**
1. How does enrollment vary across development phases?
2. Which therapeutic areas achieve highest enrollment?
3. Does enrollment differ between completed and terminated trials?

**What this analysis does NOT cover:**
- Geographic site distribution (Q4)
- Enrollment speed or timeline (Q5)
- Enrollment targets vs actuals (data not available)

In [None]:
import sqlite3
import pandas as pd
import plotly.graph_objects as go
from pathlib import Path

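# Database connection
# Fail fast if the file is missing: sqlite3.connect() would otherwise create an empty database
DB_PATH = Path('../data/database/clinical_trials.db')
if not DB_PATH.exists():
    raise FileNotFoundError(f"Database not found: {DB_PATH}")
conn = sqlite3.connect(str(DB_PATH))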

In [None]:
# Dataset snapshot
snapshot = pd.read_sql_query("""
    SELECT 
        COUNT(*) as n_studies,
        COUNT(CASE WHEN enrollment IS NOT NULL AND enrollment > 0 THEN 1 END) as n_with_enrollment,
        ROUND(AVG(CASE WHEN enrollment IS NOT NULL AND enrollment > 0 THEN enrollment END), 0) as avg_enrollment
    FROM studies
""", conn)

n_studies = int(snapshot['n_studies'].iloc[0])
n_with_enrollment = int(snapshot['n_with_enrollment'].iloc[0])
avg_enrollment = int(snapshot['avg_enrollment'].iloc[0])
pct_with_enrollment = round(n_with_enrollment / n_studies * 100, 1)

print(f"Dataset: {n_studies:,} trials · {n_with_enrollment:,} with enrollment data ({pct_with_enrollment}%) · avg {avg_enrollment:,} participants")

---

## 1. Enrollment by Phase

**Question:** How does enrollment scale vary across development phases?
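
The aggregation itself lives in `q3_enrollment_by_phase.sql`, which is not reproduced in this notebook. As a rough sketch of what such a query might compute, the pandas version below derives the summary columns used later in this section (`total_trials`, `avg_enrollment`, `trials_100plus`, `trials_500plus`, `trials_1000plus`) from a per-study table. The input column `phase_group`, the `>=` threshold semantics, the function name `summarize_by_phase`, and the toy rows are all assumptions for illustration, not taken from the actual SQL.

```python
import pandas as pd

def summarize_by_phase(studies: pd.DataFrame) -> pd.DataFrame:
    """Aggregate per-study enrollment into per-phase summary columns (hypothetical helper)."""
    # Mirror the snapshot cell: ignore missing or non-positive enrollment
    # (NaN > 0 evaluates to False, so missing values are dropped here)
    valid = studies[studies["enrollment"] > 0]
    summary = pd.DataFrame({
        # Assumption: total_trials counts every trial, with or without enrollment data
        "total_trials": studies.groupby("phase_group").size(),
        "avg_enrollment": valid.groupby("phase_group")["enrollment"].mean().round(0),
    })
    # Count trials at or above each enrollment threshold
    for threshold in (100, 500, 1000):
        counts = valid[valid["enrollment"] >= threshold].groupby("phase_group").size()
        summary[f"trials_{threshold}plus"] = counts.reindex(summary.index, fill_value=0)
    return summary.rename_axis("phase_group").reset_index()

# Toy rows for illustration only; NOT taken from the dataset
toy_studies = pd.DataFrame({
    "phase_group": ["Phase 1", "Phase 1", "Phase 3", "Phase 3", "Phase 3"],
    "enrollment": [30, 80, 600, 1200, None],
})
print(summarize_by_phase(toy_studies))
```

Keeping the aggregation in SQL, as this notebook does, avoids pulling every study row into memory; the pandas form is mainly useful for spot-checking the query output.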

In [None]:
# Load enrollment by phase
with open('../sql/queries/q3_enrollment_by_phase.sql', 'r') as f:
    query_phase = f.read()

df_phase = pd.read_sql_query(query_phase, conn)
df_phase.head(10)

In [None]:
# Filter out 'Not Applicable' for clearer visualization of interventional phases
df_phase_clean = df_phase[df_phase['phase_group'] != 'Not Applicable'].copy()

# Calculate title dynamically
phase3_avg = int(df_phase_clean.loc[df_phase_clean['phase_group'] == 'Phase 3', 'avg_enrollment'].values[0])
phase1_avg = int(df_phase_clean.loc[df_phase_clean['phase_group'] == 'Phase 1', 'avg_enrollment'].values[0])

fig_phase = go.Figure()

# Bar chart with gradient
max_val = df_phase_clean['avg_enrollment'].max()
min_val = df_phase_clean['avg_enrollment'].min()
colors = []
for val in df_phase_clean['avg_enrollment'].values:
    ratio = (val - min_val) / (max_val - min_val) if max_val > min_val else 1
    r = int(229 - ratio * (229 - 37))
    g = int(231 - ratio * (231 - 99))
    b = int(235 - ratio * (235 - 235))
    colors.append(f'rgb({r}, {g}, {b})')

fig_phase.add_trace(go.Bar(
    x=df_phase_clean['phase_group'],
    y=df_phase_clean['avg_enrollment'],
    marker_color=colors,
    text=[f"{int(v):,}" for v in df_phase_clean['avg_enrollment']],
    textposition='outside',
    hovertemplate='<b>%{x}</b><br>Avg enrollment: %{y:,}<extra></extra>'
))

fig_phase.update_layout(
    title=f'<b>Phase 3 enrolls avg {phase3_avg:,} participants; Phase 1 enrolls {phase1_avg:,}</b>',
    xaxis=dict(title=None, tickfont=dict(size=11)),
    yaxis=dict(title='Average enrollment', rangemode='tozero'),
    height=500,
    template='plotly_white',
    font=dict(family="Arial", color="#374151"),
    margin=dict(t=60, b=50, r=50)
)
fig_phase.show()

In [None]:
# Scale distribution by phase
scale_stats = df_phase[['phase_group', 'total_trials', 'avg_enrollment', 'trials_100plus', 'trials_500plus', 'trials_1000plus']].copy()
scale_stats.columns = ['Phase', 'Total Trials', 'Avg Enrollment', '100+ enrolled', '500+ enrolled', '1000+ enrolled']
scale_stats

### What we see

- **Phase 3 achieves highest enrollment** at avg 707 participants, consistent with late-stage efficacy requirements
- **Phase 1 shows smallest enrollment** at avg 55 participants, reflecting safety testing scope
- **Progressive scaling:** Phase 1 (55) → Phase 2 (107) → Phase 2/3 (196) → Phase 3 (707) follows the expected progression
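
To make the progression concrete, the step-to-step multipliers can be computed directly from the averages quoted above. This is only arithmetic on the reported figures; the names `phase_avg`, `step_ratios`, and `overall_ratio` are introduced here for illustration.

```python
# Average enrollment per phase, copied from the bullets above
phase_avg = {
    "Phase 1": 55,
    "Phase 2": 107,
    "Phase 2/3": 196,
    "Phase 3": 707,
}

phases = list(phase_avg)
# Ratio of each phase's average to the preceding phase's average
step_ratios = {
    f"{prev} -> {curr}": round(phase_avg[curr] / phase_avg[prev], 1)
    for prev, curr in zip(phases, phases[1:])
}
# Overall scale-up from the smallest to the largest phase
overall_ratio = round(phase_avg["Phase 3"] / phase_avg["Phase 1"], 1)

for step, ratio in step_ratios.items():
    print(f"{step}: x{ratio}")
print(f"Phase 1 -> Phase 3: x{overall_ratio}")
```

On these figures the largest single jump is the final one, into Phase 3, rather than an even doubling at each step.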

### Implication

Enrollment scales appropriately by phase, but scale alone doesn't reveal site distribution. **Q4 should examine geographic patterns** to understand how trials distribute enrollment across locations.

---

## 2. Enrollment by Therapeutic Area

**Question:** Which conditions achieve highest enrollment?

In [None]:
# Load enrollment by condition
with open('../sql/queries/q3_enrollment_by_condition.sql', 'r') as f:
    query_condition = f.read()

df_condition = pd.read_sql_query(query_condition, conn)
df_condition.head(10)

In [None]:
# Horizontal bar chart (sorted for bottom-to-top display)
df_sorted = df_condition.sort_values('avg_enrollment', ascending=True)

# Gradient color
max_val = df_sorted['avg_enrollment'].max()
min_val = df_sorted['avg_enrollment'].min()
colors = []
for val in df_sorted['avg_enrollment'].values:
    ratio = (val - min_val) / (max_val - min_val) if max_val > min_val else 1
    r = int(229 - ratio * (229 - 37))
    g = int(231 - ratio * (231 - 99))
    b = int(235 - ratio * (235 - 235))
    colors.append(f'rgb({r}, {g}, {b})')

# Calculate title dynamically
# Read from the sorted frame so the title matches the chart regardless of SQL row order
top_condition = df_sorted.iloc[-1]['condition_name']
top_enrollment = int(df_sorted.iloc[-1]['avg_enrollment'])
second_condition = df_sorted.iloc[-2]['condition_name']
second_enrollment = int(df_sorted.iloc[-2]['avg_enrollment'])

fig_condition = go.Figure(go.Bar(
    x=df_sorted['avg_enrollment'].values,
    y=df_sorted['condition_name'].values,
    orientation='h',
    marker_color=colors,
    text=[f"{int(v):,}" for v in df_sorted['avg_enrollment'].values],
    textposition='outside',
    hovertemplate='<b>%{y}</b><br>Avg enrollment: %{x:,}<extra></extra>'
))

fig_condition.update_layout(
    title=f'<b>{top_condition} leads at avg {top_enrollment:,} enrolled; {second_condition} at {second_enrollment:,}</b>',
    xaxis=dict(showgrid=False, showticklabels=False, title=None),
    yaxis=dict(title=None, tickfont=dict(size=11)),
    height=650,
    template='plotly_white',
    font=dict(family="Arial", color="#374151"),
    margin=dict(r=50, t=60, b=50),
    bargap=0.15
)
fig_condition.show()

In [None]:
# Top conditions with large-scale trials
large_scale = df_condition[['condition_name', 'total_trials', 'avg_enrollment', 'trials_500plus']].copy()
large_scale.columns = ['Condition', 'Total Trials', 'Avg Enrollment', 'Trials 500+ enrolled']
large_scale.head(10)

### What we see

- **Cardiovascular diseases show highest enrollment** at avg 58,944 participants per trial
- **Cancer and COVID-19 follow** at 53,089 and 41,185 respectively
- **Healthy-volunteer trials show much lower enrollment** at avg 93, despite a high trial count

### Implication

Large-scale cardiovascular and cancer trials likely require multi-site coordination. **Q4 should examine site counts** to understand geographic complexity of high-enrollment trials.

---

## 3. Enrollment by Trial Status

**Question:** Does enrollment differ between completed and terminated trials?

In [None]:
# Load enrollment by status
with open('../sql/queries/q3_enrollment_by_status.sql', 'r') as f:
    query_status = f.read()

df_status = pd.read_sql_query(query_status, conn)
df_status.head(10)

In [None]:
# Bar chart comparing enrollment across statuses
# Filter to key statuses
key_statuses = ['Completed', 'Recruiting', 'Active, not recruiting', 'Terminated', 'Withdrawn']
df_status_key = df_status[df_status['status_group'].isin(key_statuses)].copy()
df_status_key = df_status_key.sort_values('avg_enrollment', ascending=True)

# Gradient color
max_val = df_status_key['avg_enrollment'].max()
min_val = df_status_key['avg_enrollment'].min()
colors = []
for val in df_status_key['avg_enrollment'].values:
    ratio = (val - min_val) / (max_val - min_val) if max_val > min_val else 1
    r = int(229 - ratio * (229 - 37))
    g = int(231 - ratio * (231 - 99))
    b = int(235 - ratio * (235 - 235))
    colors.append(f'rgb({r}, {g}, {b})')

# Calculate title dynamically
completed_enrollment = int(df_status.loc[df_status['status_group'] == 'Completed', 'avg_enrollment'].values[0])
terminated_enrollment = int(df_status.loc[df_status['status_group'] == 'Terminated', 'avg_enrollment'].values[0])

fig_status = go.Figure(go.Bar(
    x=df_status_key['avg_enrollment'].values,
    y=df_status_key['status_group'].values,
    orientation='h',
    marker_color=colors,
    text=[f"{int(v):,}" for v in df_status_key['avg_enrollment'].values],
    textposition='outside',
    hovertemplate='<b>%{y}</b><br>Avg enrollment: %{x:,}<extra></extra>'
))

fig_status.update_layout(
    title=f'<b>Completed trials avg {completed_enrollment:,} enrolled; Terminated avg {terminated_enrollment:,}</b>',
    xaxis=dict(showgrid=False, showticklabels=False, title=None),
    yaxis=dict(title=None, tickfont=dict(size=12)),
    height=450,
    template='plotly_white',
    font=dict(family="Arial", color="#374151"),
    margin=dict(r=50, t=60, b=50),
    bargap=0.15
)
fig_status.show()

In [None]:
# Status enrollment comparison
status_comparison = df_status[['status_group', 'total_trials', 'trials_with_enrollment', 'avg_enrollment', 'trials_100plus', 'trials_under_50']].copy()
status_comparison.columns = ['Status', 'Total Trials', 'With Enrollment', 'Avg Enrollment', '100+ enrolled', 'Under 50 enrolled']
status_comparison

### What we see

- **Completed trials show higher enrollment** at avg 2,438 compared to terminated at 172
- **407 terminated trials enrolled fewer than 50 participants**, suggesting early stoppage
- **Withdrawn trials show minimal enrollment** (avg 63), consistent with stoppage before or very soon after recruitment began
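
A raw count such as 407 is hard to interpret without a denominator. The sketch below turns the count into a share per status, assuming `df_status` carries the `trials_under_50` and `trials_with_enrollment` columns used in the comparison table above. The helper name `add_small_trial_share`, the output column `pct_under_50`, and the toy numbers are hypothetical.

```python
import pandas as pd

def add_small_trial_share(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with the percentage of trials enrolling fewer than 50 (hypothetical helper)."""
    out = df.copy()
    # Denominator is trials with enrollment data; zero becomes NaN to avoid division by zero
    denom = out["trials_with_enrollment"].where(out["trials_with_enrollment"] > 0)
    out["pct_under_50"] = (out["trials_under_50"] / denom * 100).round(1)
    return out

# Toy values for illustration only; NOT taken from the dataset
toy_status = pd.DataFrame({
    "status_group": ["Completed", "Terminated"],
    "trials_with_enrollment": [1000, 400],
    "trials_under_50": [300, 220],
})
print(add_small_trial_share(toy_status))
```

In this notebook the call would be `add_small_trial_share(df_status)`; comparing the resulting share for terminated trials against completed trials would show whether small trials are actually over-represented among terminations.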

### Implication

Lower enrollment in terminated trials is consistent with recruitment challenges contributing to stoppage, although stopping early also caps enrollment, so the direction of causation is unclear. **Q5 should analyze duration patterns** to determine whether terminated trials also show shorter timelines, indicating quick go/no-go decisions.

---

## Summary

**What this analysis establishes:**

1. **Phase-appropriate scaling:** Phase 3 achieves highest enrollment (707 avg); Phase 1 smallest (55 avg)
2. **Therapeutic variation:** Cardiovascular diseases enroll 58,944 avg; healthy volunteers only 93 avg
3. **Status correlation:** Completed trials enroll 2,438 avg; terminated trials only 172 avg

**Why subsequent analyses are needed:**

- **Q4 (Geography):** Enrollment numbers don't reveal how trials distribute across sites; location analysis is needed
- **Q5 (Duration):** Lower enrollment in terminated trials says nothing about timeline; we need to assess whether they stopped quickly or struggled over time

---

## Data Limitations

**Enrollment data availability:**
- 92.3% of trials have enrollment data; if the remaining 7.7% differ systematically, the averages are biased
- Enrollment is a point-in-time snapshot (counts for recruiting trials may still increase)

**No target comparison:**
- The dataset lacks enrollment targets, so we cannot assess whether trials met their goals
- High enrollment doesn't indicate success if the target was higher

**Outliers:**
- The 'Not Applicable' category contains extreme outliers (80M enrolled), likely data quality issues
- Averages are affected by long-tail distributions in large-scale conditions; medians would be more robust
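
To illustrate how strongly a single extreme record can pull an average, here is a minimal sketch on made-up numbers. The values and the 1,000,000 cutoff are hypothetical and chosen only to mimic the shape of the problem; they are not drawn from the dataset.

```python
import pandas as pd

# Hypothetical enrollment values: five ordinary trials plus one implausible record
enrollment = pd.Series([20, 45, 60, 120, 300, 80_000_000])

mean_all = enrollment.mean()
median_all = enrollment.median()
# Mean after dropping the single extreme record (arbitrary cutoff for illustration)
mean_trimmed = enrollment[enrollment < 1_000_000].mean()

print(f"mean (all):     {mean_all:,.0f}")
print(f"median (all):   {median_all:,.0f}")
print(f"mean (trimmed): {mean_trimmed:,.0f}")
```

The median barely notices the outlier while the mean is dominated by it, which is why the averages reported in this notebook should be read alongside medians or after outlier review.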

In [None]:
# Close connection
conn.close()