# Business Question #1: Trial Landscape Analysis

**Question**: What is the current landscape of clinical trials in our dataset?

This notebook analyzes:
1. Phase × Status distribution (where research activity is concentrated)
2. Yearly trial initiation trend (temporal evolution of research activity)
3. Top therapeutic areas by trial count (which diseases are most studied)

**Data Source**: ClinicalTrials.gov API v2 (10,000 trial sample)

**Database**: SQLite with 10,000 studies, 17,973 conditions, 16,143 sponsors, 64,198 locations

## Setup

In [None]:
import sqlite3
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from pathlib import Path

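# Database connection
DB_PATH = Path('../data/database/clinical_trials.db')
# sqlite3.connect() silently creates an empty file if the path is missing, so fail early
if not DB_PATH.exists():
    raise FileNotFoundError(f"Database not found: {DB_PATH}")
conn = sqlite3.connect(str(DB_PATH))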

print(f"✓ Connected to database: {DB_PATH}")
print(f"✓ Database size: {DB_PATH.stat().st_size / 1024 / 1024:.1f} MB")

## Output 1: Phase × Status Distribution

**Insight**: Understanding where research activity is concentrated across trial phases and their current status.

This cross-tabulation shows:
- Which phases have the most trials
- Distribution of trial statuses (completed, recruiting, terminated, etc.)
- Average enrollment per phase-status combination
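
The query below normalizes raw phase codes with a SQL `CASE` expression whose branch order matters (for example, `EARLY_PHASE1` must be matched before anything that looks for `PHASE1`). A minimal Python sketch of the same mapping makes that order easy to test. The function name `normalize_phase` and the comma-separated form of combined phases (`"PHASE1, PHASE2"`) are assumptions for illustration; the actual separator depends on how the data was loaded.

```python
def normalize_phase(raw):
    """Map a raw phase code to a display label, mirroring the SQL CASE order."""
    if raw is None or raw in ("", "NA"):
        return "Not Applicable"
    # Must come before the PHASE1 checks, since "EARLY_PHASE1" contains "PHASE1"
    if "EARLY_PHASE" in raw:
        return "Early Phase 1"
    single = {
        "PHASE1": "Phase 1",
        "PHASE2": "Phase 2",
        "PHASE3": "Phase 3",
        "PHASE4": "Phase 4",
    }
    if raw in single:
        return single[raw]
    # Combined phases: both codes appear in the same string (separator assumed)
    if "PHASE1" in raw and "PHASE2" in raw:
        return "Phase 1/2"
    if "PHASE2" in raw and "PHASE3" in raw:
        return "Phase 2/3"
    return "Other"


samples = [None, "NA", "EARLY_PHASE1", "PHASE2", "PHASE1, PHASE2", "PHASE2, PHASE3"]
labels = [normalize_phase(s) for s in samples]
print(labels)
```

Any raw value that falls through to `"Other"` is worth inspecting, since it may indicate a phase format the mapping does not anticipate.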

In [None]:
# Load OUTPUT 1: Phase × Status Distribution
query_phase_status = """
SELECT
    CASE
        WHEN phase IS NULL OR phase = '' THEN 'Not Applicable'
        WHEN phase = 'NA' THEN 'Not Applicable'
        WHEN phase LIKE '%EARLY_PHASE%' THEN 'Early Phase 1'
        WHEN phase = 'PHASE1' THEN 'Phase 1'
        WHEN phase = 'PHASE2' THEN 'Phase 2'
        WHEN phase = 'PHASE3' THEN 'Phase 3'
        WHEN phase = 'PHASE4' THEN 'Phase 4'
        WHEN phase LIKE '%PHASE1%' AND phase LIKE '%PHASE2%' THEN 'Phase 1/2'
        WHEN phase LIKE '%PHASE2%' AND phase LIKE '%PHASE3%' THEN 'Phase 2/3'
        ELSE 'Other'
    END AS phase_group,
    CASE
        WHEN status = 'COMPLETED' THEN 'Completed'
        WHEN status = 'RECRUITING' THEN 'Recruiting'
        WHEN status = 'ACTIVE_NOT_RECRUITING' THEN 'Active, not recruiting'
        WHEN status = 'NOT_YET_RECRUITING' THEN 'Not yet recruiting'
        WHEN status = 'TERMINATED' THEN 'Terminated'
        WHEN status = 'SUSPENDED' THEN 'Suspended'
        WHEN status = 'WITHDRAWN' THEN 'Withdrawn'
        WHEN status = 'ENROLLING_BY_INVITATION' THEN 'Enrolling by invitation'
        WHEN status = 'AVAILABLE' THEN 'Available'
        WHEN status = 'NO_LONGER_AVAILABLE' THEN 'No longer available'
        WHEN status = 'TEMPORARILY_NOT_AVAILABLE' THEN 'Temporarily not available'
        WHEN status = 'APPROVED_FOR_MARKETING' THEN 'Approved for marketing'
        WHEN status = 'WITHHELD' THEN 'Withheld'
        WHEN status = 'UNKNOWN' THEN 'Unknown'
        ELSE status
    END AS status_label,
    COUNT(*) AS trial_count,
    ROUND(AVG(enrollment), 0) AS avg_enrollment
FROM studies
GROUP BY phase_group, status_label
ORDER BY
    CASE phase_group
        WHEN 'Early Phase 1' THEN 1
        WHEN 'Phase 1' THEN 2
        WHEN 'Phase 1/2' THEN 3
        WHEN 'Phase 2' THEN 4
        WHEN 'Phase 2/3' THEN 5
        WHEN 'Phase 3' THEN 6
        WHEN 'Phase 4' THEN 7
        WHEN 'Not Applicable' THEN 8
        ELSE 9
    END,
    trial_count DESC;
"""

df_phase_status = pd.read_sql_query(query_phase_status, conn)
print(f"Loaded {len(df_phase_status)} phase-status combinations")
df_phase_status.head(10)

In [None]:
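# Pivot table for heatmap visualization
# Each phase/status pair appears once, so 'sum' keeps the counts unchanged
# (the default aggfunc is 'mean')
pivot_phase_status = df_phase_status.pivot_table(
    index='phase_group',
    columns='status_label',
    values='trial_count',
    aggfunc='sum',
    fill_value=0
)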

# Reorder phases
phase_order = ['Early Phase 1', 'Phase 1', 'Phase 1/2', 'Phase 2', 'Phase 2/3', 'Phase 3', 'Phase 4', 'Not Applicable', 'Other']
pivot_phase_status = pivot_phase_status.reindex([p for p in phase_order if p in pivot_phase_status.index])

# Create heatmap
fig_heatmap = go.Figure(data=go.Heatmap(
    z=pivot_phase_status.values,
    x=pivot_phase_status.columns,
    y=pivot_phase_status.index,
    colorscale='Blues',
    text=pivot_phase_status.values,
    texttemplate='%{text}',
    textfont={"size": 10},
    hovertemplate='Phase: %{y}<br>Status: %{x}<br>Count: %{z}<extra></extra>'
))

fig_heatmap.update_layout(
    title='Clinical Trials: Phase × Status Distribution',
    xaxis_title='Trial Status',
    yaxis_title='Clinical Phase',
    height=500,
    xaxis={'tickangle': 45}
)

fig_heatmap.show()

In [None]:
# Summary by phase (total trials per phase)
phase_summary = df_phase_status.groupby('phase_group')['trial_count'].sum().sort_values(ascending=False)

fig_phase = px.bar(
    x=phase_summary.values,
    y=phase_summary.index,
    orientation='h',
    title='Total Trials by Clinical Phase',
    labels={'x': 'Number of Trials', 'y': 'Clinical Phase'},
    text=phase_summary.values
)

fig_phase.update_traces(texttemplate='%{text}', textposition='outside')
fig_phase.update_layout(height=400, showlegend=False)
fig_phase.show()

print("\nPhase Distribution Summary:")
print(phase_summary)

### Key Findings: Phase × Status

**1. Phase Distribution:**
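- Most trials are classified as "Not Applicable": studies with no recorded phase (observational studies, expanded access) plus interventional trials without FDA-defined phases, such as device or behavioral studies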
- Among interventional trials, Phase 2 has the most trials, followed by Phase 1 and Phase 3
- Phase 4 (post-marketing) has fewer trials, as expected

**2. Status Distribution:**
- "Completed" is the dominant status across all phases
- "Unknown" status appears frequently, suggesting data quality issues in source data
- "Recruiting" trials are most common in early phases (Phase 1 and 2)

**3. Trial Termination:**
- Terminated trials are present across all phases
- Early-phase trials (Phase 1 and Phase 2) show relatively high termination rates
- This aligns with higher risk in early-stage research
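
The heatmap reports raw counts, so comparing termination across phases needs a rate rather than a count. A minimal sketch of that calculation follows, using a toy frame with the same columns as `df_phase_status`. The helper name `termination_rate_by_phase` and all numbers are illustrative assumptions, not results from the dataset.

```python
import pandas as pd

# Toy rows shaped like df_phase_status (values are illustrative, not real results)
toy = pd.DataFrame({
    "phase_group": ["Phase 1", "Phase 1", "Phase 2", "Phase 2", "Phase 3", "Phase 3"],
    "status_label": ["Completed", "Terminated", "Completed", "Terminated", "Completed", "Terminated"],
    "trial_count": [90, 10, 160, 40, 95, 5],
})


def termination_rate_by_phase(df):
    """Return terminated trials as a percentage of all trials in each phase."""
    totals = df.groupby("phase_group")["trial_count"].sum()
    terminated = (
        df[df["status_label"] == "Terminated"]
        .groupby("phase_group")["trial_count"].sum()
        .reindex(totals.index, fill_value=0)  # phases with no terminated trials get 0
    )
    return (terminated / totals * 100).round(1)


rates = termination_rate_by_phase(toy)
print(rates)
```

Calling `termination_rate_by_phase(df_phase_status)` in the notebook would give the per-phase rates behind the termination finding.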

## Output 2: Yearly Trial Initiation Trend

**Insight**: Understanding the temporal evolution of clinical trial activity.

This time series shows:
- Growth of clinical trial activity from 1990 to 2025
- Peak years of trial initiation
- Trends in completed vs recruiting trials over time
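
One caveat on year extraction: ClinicalTrials.gov reports some start dates as year-month only (for example `2015-06`). SQLite's documented time-string formats do not include a bare `YYYY-MM`, so `strftime('%Y', start_date)` is expected to return NULL for such values, and those trials would drop out of the trend. Whether this database stores partial dates is not known from the notebook. The sketch below uses a hypothetical in-memory table to compare `strftime` with a plain `substr` on the first four characters.

```python
import sqlite3

# In-memory table with hypothetical start_date values in several formats
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE studies (start_date TEXT)")
demo.executemany(
    "INSERT INTO studies VALUES (?)",
    [("2015-06-01",), ("2015-06",), ("2019-11-15",), ("",), (None,)],
)

# Year via strftime: depends on SQLite recognising the whole time string
by_strftime = demo.execute(
    "SELECT COUNT(*) FROM studies WHERE strftime('%Y', start_date) IS NOT NULL"
).fetchone()[0]

# Year via substr: only requires the value to start with four digits
years = [
    row[0]
    for row in demo.execute(
        """
        SELECT CAST(substr(start_date, 1, 4) AS INTEGER) AS start_year
        FROM studies
        WHERE start_date GLOB '[0-9][0-9][0-9][0-9]*'
        ORDER BY start_year
        """
    )
]

print(f"rows with a year via strftime: {by_strftime}")
print(f"rows with a year via substr:   {len(years)} -> {years}")
demo.close()
```

If the two counts differ on the real `studies` table, switching the query to the `substr` form would keep partially dated trials in the yearly trend.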

In [None]:
# Load OUTPUT 2: Yearly Trial Initiation Trend
query_yearly = """
SELECT
    CAST(strftime('%Y', start_date) AS INTEGER) AS start_year,
    COUNT(*) AS trial_count,
    ROUND(AVG(enrollment), 0) AS avg_enrollment,
    COUNT(CASE WHEN status = 'COMPLETED' THEN 1 END) AS completed_count,
    COUNT(CASE WHEN status IN ('RECRUITING', 'NOT_YET_RECRUITING', 'ENROLLING_BY_INVITATION') THEN 1 END) AS recruiting_count
FROM studies
WHERE
    start_date IS NOT NULL
    AND start_date != ''
    AND CAST(strftime('%Y', start_date) AS INTEGER) >= 1990
    AND CAST(strftime('%Y', start_date) AS INTEGER) <= 2025
GROUP BY start_year
HAVING trial_count >= 5
ORDER BY start_year;
"""

df_yearly = pd.read_sql_query(query_yearly, conn)
print(f"Loaded {len(df_yearly)} years of data (1990-2025)")
df_yearly.head(10)

In [None]:
# Time series: Total trials initiated per year
fig_yearly = go.Figure()

fig_yearly.add_trace(go.Scatter(
    x=df_yearly['start_year'],
    y=df_yearly['trial_count'],
    mode='lines+markers',
    name='Total Trials',
    line=dict(color='blue', width=2),
    marker=dict(size=6)
))

fig_yearly.update_layout(
    title='Clinical Trial Initiation Trend (1990-2025)',
    xaxis_title='Year',
    yaxis_title='Number of Trials Initiated',
    height=500,
    hovermode='x unified'
)

fig_yearly.show()

In [None]:
# Stacked area: Completed vs Recruiting over time
fig_status_yearly = go.Figure()

fig_status_yearly.add_trace(go.Scatter(
    x=df_yearly['start_year'],
    y=df_yearly['completed_count'],
    mode='lines',
    name='Completed',
    stackgroup='one',
    fillcolor='rgba(0, 128, 0, 0.5)'
))

fig_status_yearly.add_trace(go.Scatter(
    x=df_yearly['start_year'],
    y=df_yearly['recruiting_count'],
    mode='lines',
    name='Recruiting',
    stackgroup='one',
    fillcolor='rgba(255, 165, 0, 0.5)'
))

fig_status_yearly.update_layout(
    title='Trial Status Distribution Over Time',
    xaxis_title='Year',
    yaxis_title='Number of Trials',
    height=500,
    hovermode='x unified'
)

fig_status_yearly.show()

In [None]:
# Statistical summary
print("Yearly Trial Initiation Statistics:")
print(f"  Peak year: {df_yearly.loc[df_yearly['trial_count'].idxmax(), 'start_year']} ({df_yearly['trial_count'].max()} trials)")
print(f"  Average trials per year: {df_yearly['trial_count'].mean():.0f}")
print(f"  Total trials (1990-2025): {df_yearly['trial_count'].sum()}")
print("\nRecent trend (2020-2025):")
recent = df_yearly[df_yearly['start_year'] >= 2020]
print(recent[['start_year', 'trial_count', 'completed_count', 'recruiting_count']])

### Key Findings: Yearly Trend

**1. Growth Pattern:**
- Steady growth from 1990s through 2010s
- Peak activity around 2015-2020
- Recent years (2020-2025) show varied activity, possibly affected by the COVID-19 pandemic

**2. Trial Completion:**
- Older trials (pre-2015) have higher completion rates (as expected - more time to complete)
- Recent trials (2020+) still mostly recruiting or in progress

**3. Research Momentum:**
- Clinical trial activity has increased significantly over the past 30 years
- Modern era (2005+) shows sustained high activity with 200-400 trials initiated per year
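
The completion finding compares shares rather than counts, and the stacked area chart only shows counts. A minimal sketch of the share calculation follows, using a toy frame with the same columns as `df_yearly`. The numbers are illustrative assumptions, not values from the dataset.

```python
import pandas as pd

# Toy rows shaped like df_yearly (illustrative values only)
toy_yearly = pd.DataFrame({
    "start_year": [2010, 2015, 2020, 2024],
    "trial_count": [200, 400, 300, 100],
    "completed_count": [170, 300, 120, 5],
    "recruiting_count": [0, 10, 90, 80],
})

# Express each status as a percentage of the trials started that year
shares = toy_yearly.assign(
    completed_pct=lambda d: (d["completed_count"] / d["trial_count"] * 100).round(1),
    recruiting_pct=lambda d: (d["recruiting_count"] / d["trial_count"] * 100).round(1),
)
print(shares[["start_year", "completed_pct", "recruiting_pct"]])
```

Applying the same `assign` call to `df_yearly` would show how completion share declines for more recent start years.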

## Output 3: Top Therapeutic Areas

**Insight**: Understanding research priorities by identifying the most studied diseases/conditions.

This ranking shows:
- Top 20 conditions by trial count
- Percentage of total trials for each condition
- Completion and recruitment status

In [None]:
# Load OUTPUT 3: Top Therapeutic Areas
query_therapeutic = """
SELECT
    c.condition_name,
    COUNT(DISTINCT c.study_id) AS trial_count,
    ROUND(COUNT(DISTINCT c.study_id) * 100.0 / (SELECT COUNT(*) FROM studies), 2) AS percentage_of_trials,
    ROUND(AVG(s.enrollment), 0) AS avg_enrollment,
    COUNT(DISTINCT CASE WHEN s.status = 'COMPLETED' THEN s.study_id END) AS completed_trials,
    COUNT(DISTINCT CASE WHEN s.status IN ('RECRUITING', 'NOT_YET_RECRUITING', 'ENROLLING_BY_INVITATION') THEN s.study_id END) AS recruiting_trials
FROM conditions c
JOIN studies s ON c.study_id = s.study_id
GROUP BY c.condition_name
HAVING trial_count >= 10
ORDER BY trial_count DESC
LIMIT 20;
"""

df_therapeutic = pd.read_sql_query(query_therapeutic, conn)
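# Summing trial_count would double-count trials that list several conditions,
# so report the true study total instead
total_trials = pd.read_sql_query("SELECT COUNT(*) AS n FROM studies", conn)['n'].iloc[0]
print(f"Top {len(df_therapeutic)} therapeutic areas (out of {total_trials} total trials)")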
df_therapeutic

In [None]:
# Horizontal bar chart: Top 20 conditions
fig_therapeutic = px.bar(
    df_therapeutic,
    x='trial_count',
    y='condition_name',
    orientation='h',
    title='Top 20 Therapeutic Areas by Trial Count',
    labels={'trial_count': 'Number of Trials', 'condition_name': 'Condition'},
    text='trial_count',
    color='trial_count',
    color_continuous_scale='Blues'
)

fig_therapeutic.update_traces(texttemplate='%{text}', textposition='outside')
fig_therapeutic.update_layout(
    height=600,
    yaxis={'categoryorder': 'total ascending'},
    showlegend=False
)

fig_therapeutic.show()

In [None]:
# Completion rate by therapeutic area
df_therapeutic['completion_rate'] = (df_therapeutic['completed_trials'] / df_therapeutic['trial_count'] * 100).round(1)

fig_completion = px.scatter(
    df_therapeutic,
    x='trial_count',
    y='completion_rate',
    size='avg_enrollment',
    hover_data=['condition_name'],
    title='Trial Completion Rate vs Trial Volume by Therapeutic Area',
    labels={
        'trial_count': 'Number of Trials',
        'completion_rate': 'Completion Rate (%)',
        'avg_enrollment': 'Avg Enrollment'
    },
    text='condition_name'
)

fig_completion.update_traces(textposition='top center', textfont_size=8)
fig_completion.update_layout(height=600)
fig_completion.show()

In [None]:
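# Category breakdown (keyword match on the top-20 list only)
# Note: totals sum per-condition counts, so a trial that lists several
# matching conditions is counted more than once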
print("\nTherapeutic Area Categories:")
print("\nOncology (Cancer):")
cancer_conditions = df_therapeutic[df_therapeutic['condition_name'].str.contains('Cancer', case=False, na=False)]
print(f"  Total cancer trials: {cancer_conditions['trial_count'].sum()}")
print(f"  Conditions: {', '.join(cancer_conditions['condition_name'].tolist())}")

print("\nCardiovascular:")
cardio_keywords = ['Heart', 'Cardiovascular', 'Coronary', 'Hypertension', 'Stroke']
cardio_conditions = df_therapeutic[df_therapeutic['condition_name'].str.contains('|'.join(cardio_keywords), case=False, na=False)]
print(f"  Total cardio trials: {cardio_conditions['trial_count'].sum()}")
print(f"  Conditions: {', '.join(cardio_conditions['condition_name'].tolist())}")

print("\nMetabolic/Endocrine:")
metabolic_keywords = ['Diabetes', 'Obesity']
metabolic_conditions = df_therapeutic[df_therapeutic['condition_name'].str.contains('|'.join(metabolic_keywords), case=False, na=False)]
print(f"  Total metabolic trials: {metabolic_conditions['trial_count'].sum()}")
print(f"  Conditions: {', '.join(metabolic_conditions['condition_name'].tolist())}")

### Key Findings: Therapeutic Areas

**1. Most Studied Conditions:**
- **Healthy volunteers** top the list (178 trials) - used in Phase 1 safety studies
- **Breast Cancer** leads disease-specific trials (175 trials, 1.75% of all trials)
- **Obesity** and **Pain** are also heavily studied (117 and 98 trials respectively)

**2. Therapeutic Categories:**
- **Oncology** dominates: Breast Cancer, Colorectal Cancer, Prostate Cancer appear in top 20
- **Cardiovascular**: Heart Failure, Hypertension, Coronary Artery Disease, Stroke
- **Metabolic**: Obesity, Type 2 Diabetes, Diabetes
- **Infectious Disease**: COVID-19 (58 trials despite being recent), HIV Infections

**3. Research Priorities:**
- Cancer research is heavily prioritized across multiple cancer types
- Chronic diseases (cardiovascular, metabolic) have sustained research focus
- Mental health (Depression, Anxiety) present but less prominent

**4. Completion Rates:**
- Most conditions show 50-80% completion rates
- "Healthy" trials have highest completion rate (81%) - simpler Phase 1 studies
- Complex diseases show more varied completion rates
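
The category totals printed in the breakdown cell add up per-condition counts, so a trial that lists both Breast Cancer and Prostate Cancer contributes twice. A sketch of the difference follows, using a hypothetical in-memory `conditions` table with the same two columns the notebook's query relies on. The rows are made up for illustration.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE conditions (study_id TEXT, condition_name TEXT)")
demo.executemany(
    "INSERT INTO conditions VALUES (?, ?)",
    [
        ("S1", "Breast Cancer"),
        ("S1", "Prostate Cancer"),  # same trial, second cancer condition
        ("S2", "Breast Cancer"),
        ("S3", "Obesity"),
    ],
)

# Summing per-condition counts: trial S1 is counted under both conditions
summed = demo.execute(
    """
    SELECT SUM(n) FROM (
        SELECT COUNT(DISTINCT study_id) AS n
        FROM conditions
        WHERE condition_name LIKE '%cancer%'
        GROUP BY condition_name
    )
    """
).fetchone()[0]

# Counting distinct study IDs across the whole category: S1 is counted once
distinct = demo.execute(
    "SELECT COUNT(DISTINCT study_id) FROM conditions WHERE condition_name LIKE '%cancer%'"
).fetchone()[0]

print(f"summed per-condition counts: {summed}")
print(f"distinct cancer trials:      {distinct}")
demo.close()
```

Running the distinct-count query against the full `conditions` table, rather than the top-20 frame, would also pick up cancer conditions that fall outside the top 20.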

## Summary: Trial Landscape

### Overall Findings

**Research Activity:**
- 10,000 clinical trials in the dataset, spanning 1990-2025
- Peak activity in the 2015-2020 period
- Most trials have no applicable phase; among trials with a defined phase, Phase 2 (early efficacy testing) is the largest group

**Status Distribution:**
- Majority of trials are completed (older trials)
- Significant number still recruiting (recent trials)
- High termination rate in early phases reflects inherent research risk

**Therapeutic Focus:**
- Cancer research is top priority (multiple cancer types in top 20)
- Cardiovascular and metabolic diseases heavily studied
- COVID-19 research emerged rapidly (58 trials despite recent appearance)

### Implications

1. **Investment Concentration**: Trial counts suggest research effort is heavily concentrated in oncology and chronic diseases (trial count is only a proxy for funding)
2. **Phase 2 Bottleneck**: Large number of Phase 2 trials suggests this is where most drugs are tested for efficacy
3. **Completion Challenges**: Termination rates highlight difficulty of bringing drugs to market
4. **Temporal Trends**: Steady growth in research activity over 30 years shows increasing clinical research capacity

In [None]:
# Close database connection
conn.close()
print("✓ Database connection closed")