# Q4: Geographic Insights

**Purpose:** Understand where trials are conducted, looking beyond enrollment numbers

**Three questions:**
1. Which countries host the most trials?
2. Which cities serve as major research hubs?
3. How many sites do trials use across different phases?

**What this analysis does NOT cover:**
- Timeline trends across geographies (Q5)
- Enrollment speed by location (no data)
- Site-level success rates (limited data)

In [None]:
import sqlite3
import pandas as pd
import plotly.graph_objects as go
from pathlib import Path

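# Database connection
DB_PATH = Path('../data/database/clinical_trials.db')
# sqlite3.connect() silently creates an empty database if the file is missing
if not DB_PATH.exists():
    raise FileNotFoundError(f"Database not found: {DB_PATH}")
conn = sqlite3.connect(str(DB_PATH))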

In [None]:
# Dataset snapshot
snapshot = pd.read_sql_query("""
    SELECT 
        COUNT(DISTINCT study_id) as n_studies,
        COUNT(DISTINCT country) as n_countries,
        COUNT(DISTINCT city) as n_cities
    FROM locations
    WHERE country IS NOT NULL AND country != ''
""", conn)

n_studies = int(snapshot['n_studies'].iloc[0])
n_countries = int(snapshot['n_countries'].iloc[0])
n_cities = int(snapshot['n_cities'].iloc[0])

print(f"Dataset: {n_studies:,} trials · {n_countries:,} countries · {n_cities:,} cities")

---

## 1. Trial Distribution by Country

**Question:** Which countries host the most clinical research?

In [None]:
# Load country data
with open('../sql/queries/q4_trials_by_country.sql', 'r') as f:
    query_country = f.read()

df_country = pd.read_sql_query(query_country, conn)
df_country.head(10)

In [None]:
# Top 20 countries horizontal bar
df_top20 = df_country.head(20).sort_values('total_trials', ascending=True)

# Gradient color
max_val = df_top20['total_trials'].max()
min_val = df_top20['total_trials'].min()
colors = []
for val in df_top20['total_trials'].values:
    ratio = (val - min_val) / (max_val - min_val) if max_val > min_val else 1
    r = int(229 - ratio * (229 - 37))
    g = int(231 - ratio * (231 - 99))
    b = int(235 - ratio * (235 - 235))
    colors.append(f'rgb({r}, {g}, {b})')

# Calculate title dynamically
top_country = df_country.iloc[0]['country']
top_count = int(df_country.iloc[0]['total_trials'])
second_country = df_country.iloc[1]['country']
second_count = int(df_country.iloc[1]['total_trials'])

fig_country = go.Figure(go.Bar(
    x=df_top20['total_trials'].values,
    y=df_top20['country'].values,
    orientation='h',
    marker_color=colors,
    text=[f"{int(v):,}" for v in df_top20['total_trials'].values],
    textposition='outside',
    hovertemplate='<b>%{y}</b><br>%{x:,} trials<extra></extra>'
))

fig_country.update_layout(
    title=f'<b>{top_country} leads at {top_count:,} trials; {second_country} second at {second_count:,}</b>',
    xaxis=dict(showgrid=False, showticklabels=False, title=None),
    yaxis=dict(title=None, tickfont=dict(size=11)),
    height=650,
    template='plotly_white',
    font=dict(family="Arial", color="#374151"),
    margin=dict(r=50, t=60, b=50),
    bargap=0.15
)
fig_country.show()

In [None]:
# Country completion rates
country_cols = [
    'country', 'total_trials', 'completed_trials',
    'completion_rate', 'recruiting_trials', 'terminated_trials',
]
country_stats = df_country[country_cols].head(15).copy()
country_stats.columns = ['Country', 'Total Trials', 'Completed', 'Completion %', 'Recruiting', 'Terminated']
country_stats

### What we see

- **United States dominates** at 3,281 trials, about 3.7× second-place China (894 trials)
- **European countries well-represented:** France (733), Germany (464), UK (461) in top 10
- **Completion rates vary:** Russia (80.3%), India (71.7%) show high completion; China lower (34.5%)
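
The SQL files behind these numbers are not shown in this notebook. As a rough sketch of how per-country totals and a completion rate could be derived, the example below builds a tiny in-memory database. Only the `locations(study_id, country, city)` columns come from the notebook; the `studies` table, its `overall_status` column, the status values, and all rows are assumptions made for illustration.

```python
import sqlite3
import pandas as pd

# Hypothetical schema and rows: `studies` / `overall_status` are assumed,
# not taken from the real clinical_trials.db.
demo = sqlite3.connect(":memory:")
demo.executescript("""
    CREATE TABLE studies (study_id TEXT PRIMARY KEY, overall_status TEXT);
    CREATE TABLE locations (study_id TEXT, country TEXT, city TEXT);
    INSERT INTO studies VALUES
        ('S1', 'COMPLETED'), ('S2', 'RECRUITING'),
        ('S3', 'COMPLETED'), ('S4', 'TERMINATED');
    INSERT INTO locations VALUES
        ('S1', 'United States', 'Boston'),
        ('S1', 'United States', 'Houston'),
        ('S2', 'United States', 'Boston'),
        ('S3', 'United States', 'New York'),
        ('S3', 'China', 'Beijing'),
        ('S4', 'China', 'Beijing');
""")

# COUNT(DISTINCT study_id) keeps a multi-site trial from being counted
# once per site; a multinational trial still counts once per country.
demo_country = pd.read_sql_query("""
    SELECT
        l.country,
        COUNT(DISTINCT l.study_id) AS total_trials,
        COUNT(DISTINCT CASE WHEN s.overall_status = 'COMPLETED'
                            THEN l.study_id END) AS completed_trials,
        ROUND(100.0 * COUNT(DISTINCT CASE WHEN s.overall_status = 'COMPLETED'
                                          THEN l.study_id END)
              / COUNT(DISTINCT l.study_id), 1) AS completion_rate
    FROM locations l
    JOIN studies s ON s.study_id = l.study_id
    WHERE l.country IS NOT NULL AND l.country != ''
    GROUP BY l.country
    ORDER BY total_trials DESC
""", demo)
demo.close()
print(demo_country)
```

Under this counting rule, country totals can sum to more than the number of distinct trials, because a multinational trial contributes to every country where it has a site.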

### Implication

The US concentration is consistent with infrastructure advantages, while the spread in completion rates may reflect operational differences or simply a different phase mix (see Data Limitations). **Q5 should examine duration trends by country** to assess whether high-completion countries also complete faster.

---

## 2. Major Research Hubs

**Question:** Which cities serve as clinical research centers?

In [None]:
# Load city data
with open('../sql/queries/q4_top_cities.sql', 'r') as f:
    query_city = f.read()

df_city = pd.read_sql_query(query_city, conn)
df_city.head(10)

In [None]:
# Top 20 cities horizontal bar
df_city_top = df_city.head(20).sort_values('total_trials', ascending=True)

# Gradient color
max_val = df_city_top['total_trials'].max()
min_val = df_city_top['total_trials'].min()
colors = []
for val in df_city_top['total_trials'].values:
    ratio = (val - min_val) / (max_val - min_val) if max_val > min_val else 1
    r = int(229 - ratio * (229 - 37))
    g = int(231 - ratio * (231 - 99))
    b = int(235 - ratio * (235 - 235))
    colors.append(f'rgb({r}, {g}, {b})')

# Calculate title dynamically
top_city = df_city.iloc[0]['city']
top_city_count = int(df_city.iloc[0]['total_trials'])
# Count US cities within the top 30 only, to match the "of top 30" wording in the title
us_cities = int((df_city.head(30)['country'] == 'United States').sum())

# Create labels with country
labels = [f"{row['city']}, {row['country']}" for _, row in df_city_top.iterrows()]

fig_city = go.Figure(go.Bar(
    x=df_city_top['total_trials'].values,
    y=labels,
    orientation='h',
    marker_color=colors,
    text=[f"{int(v):,}" for v in df_city_top['total_trials'].values],
    textposition='outside',
    hovertemplate='<b>%{y}</b><br>%{x:,} trials<extra></extra>'
))

fig_city.update_layout(
    title=f'<b>{top_city} leads at {top_city_count:,} trials; {us_cities} of top 30 are US cities</b>',
    xaxis=dict(showgrid=False, showticklabels=False, title=None),
    yaxis=dict(title=None, tickfont=dict(size=10)),
    height=650,
    template='plotly_white',
    font=dict(family="Arial", color="#374151"),
    margin=dict(r=50, t=60, b=50),
    bargap=0.15
)
fig_city.show()

In [None]:
# City statistics
city_stats = df_city[['city', 'country', 'total_trials', 'completed_trials', 'completion_rate']].head(15).copy()
city_stats.columns = ['City', 'Country', 'Total Trials', 'Completed', 'Completion %']
city_stats

### What we see

- **US cities dominate top positions:** New York (424), Houston (393), Boston (341)
- **Major academic/medical centers represented:** Baltimore, Rochester, St Louis
- **International hubs:** Seoul (255), Paris (252), Beijing (248), London (239)

### Implication

City concentration reflects established research infrastructure. Multi-site trials likely draw on these hubs, which implies coordination across institutions; Section 3 quantifies how many sites trials actually use.

---

## 3. Site Count Distribution by Phase

**Question:** How many sites do trials use across development phases?

In [None]:
# Load site count data
with open('../sql/queries/q4_site_counts_by_phase.sql', 'r') as f:
    query_sites = f.read()

df_sites = pd.read_sql_query(query_sites, conn)
df_sites.head(10)

In [None]:
# Filter out 'Not Applicable' for clearer visualization
df_sites_clean = df_sites[df_sites['phase_group'] != 'Not Applicable'].copy()

# Calculate title dynamically
phase3_sites = df_sites_clean.loc[df_sites_clean['phase_group'] == 'Phase 3', 'avg_sites_per_trial'].values[0]
phase1_sites = df_sites_clean.loc[df_sites_clean['phase_group'] == 'Phase 1', 'avg_sites_per_trial'].values[0]

fig_sites = go.Figure()

# Bar chart with gradient
max_val = df_sites_clean['avg_sites_per_trial'].max()
min_val = df_sites_clean['avg_sites_per_trial'].min()
colors = []
for val in df_sites_clean['avg_sites_per_trial'].values:
    ratio = (val - min_val) / (max_val - min_val) if max_val > min_val else 1
    r = int(229 - ratio * (229 - 37))
    g = int(231 - ratio * (231 - 99))
    b = int(235 - ratio * (235 - 235))
    colors.append(f'rgb({r}, {g}, {b})')

fig_sites.add_trace(go.Bar(
    x=df_sites_clean['phase_group'],
    y=df_sites_clean['avg_sites_per_trial'],
    marker_color=colors,
    text=[f"{v:.1f}" for v in df_sites_clean['avg_sites_per_trial']],
    textposition='outside',
    hovertemplate='<b>%{x}</b><br>Avg sites: %{y:.1f}<extra></extra>'
))

fig_sites.update_layout(
    title=f'<b>Phase 3 uses avg {phase3_sites:.1f} sites; Phase 1 uses {phase1_sites:.1f}</b>',
    xaxis=dict(title=None, tickfont=dict(size=11)),
    yaxis=dict(title='Average sites per trial', rangemode='tozero'),
    height=500,
    template='plotly_white',
    font=dict(family="Arial", color="#374151"),
    margin=dict(t=60, b=50, r=50)
)
fig_sites.show()

In [None]:
# Site count details
site_cols = [
    'phase_group', 'trials_with_sites', 'avg_sites_per_trial',
    'avg_countries_per_trial', 'single_site_trials', 'multisite_10plus',
]
site_stats = df_sites[site_cols].copy()
site_stats.columns = ['Phase', 'Trials with Sites', 'Avg Sites', 'Avg Countries', 'Single Site', '10+ Sites']
site_stats

### What we see

- **Phase 3 shows highest site count** at avg 34.8 sites per trial, reflecting large-scale requirements
- **Phase 2 reaches 12.9 avg sites**, suggesting multi-site coordination even in mid-stage
- **Phase 1 mostly single-site** (505 single-site trials); the 3.1 average is pulled above 1 by the multi-site remainder
- **Multinational trials increase with phase:** Phase 3 uses avg 4.1 countries
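
An average of about 3 sites can coexist with most trials in a phase being single-site, because a few large trials pull the mean up. The sketch below shows how the summary columns used above (`trials_with_sites`, `avg_sites_per_trial`, `single_site_trials`, `multisite_10plus`) could be derived from one row per trial. The input frame, the `n_sites` column, and the added median column are hypothetical and not part of the notebook's query.

```python
import pandas as pd

# Hypothetical per-trial data: one row per trial with its phase and site count.
trial_sites = pd.DataFrame({
    "phase_group": ["Phase 1", "Phase 1", "Phase 1", "Phase 1", "Phase 3", "Phase 3"],
    "n_sites":     [1, 1, 1, 9, 12, 60],
})

# Boolean flags so the counts can be obtained with a plain sum.
trial_sites["is_single"] = trial_sites["n_sites"] == 1
trial_sites["is_10plus"] = trial_sites["n_sites"] >= 10

site_summary = trial_sites.groupby("phase_group").agg(
    trials_with_sites=("n_sites", "count"),
    avg_sites_per_trial=("n_sites", "mean"),
    median_sites_per_trial=("n_sites", "median"),  # extra column, not in the source query
    single_site_trials=("is_single", "sum"),
    multisite_10plus=("is_10plus", "sum"),
).reset_index()

print(site_summary)
```

In this toy data, three of four Phase 1 trials are single-site, yet the mean is 3.0 while the median is 1.0. Reporting a median next to the mean would make that skew visible.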

### Implication

Multi-site complexity peaks in Phase 3. **Q5 should examine whether multi-site trials take longer to complete**, assessing coordination overhead.

---

## Summary

**What this analysis establishes:**

1. **US dominance:** 3,281 trials in the US, about 3.7× second-place China (894)
2. **City concentration:** New York, Houston, Boston are top hubs; US cities dominate top 30
3. **Site scaling by phase:** Phase 3 uses avg 34.8 sites across 4.1 countries; Phase 1 uses 3.1 sites

**Why subsequent analyses are needed:**

- **Q5 (Duration):** Geographic patterns don't reveal timeline implications; Q5 needs to assess whether multi-site/multinational trials take longer

---

## Data Limitations

**Location data completeness:**
- Not all trials report location details
- City/country standardization varies ("United States" vs "USA")
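
One way to reduce the spelling-variant problem is to map known aliases to a single canonical name before counting. This is a minimal sketch; the alias map and the sample values are hypothetical, and the variants actually present in `locations` would need to be checked first.

```python
import pandas as pd

# Hypothetical alias map; extend it after inspecting the real distinct values.
COUNTRY_ALIASES = {
    "usa": "United States",
    "us": "United States",
    "united states of america": "United States",
    "uk": "United Kingdom",
}

def normalize_country(name):
    """Trim whitespace and map known aliases to one canonical spelling."""
    if name is None or pd.isna(name):
        return None
    cleaned = str(name).strip()
    if not cleaned:
        return None
    # Unknown names pass through unchanged rather than being guessed at.
    return COUNTRY_ALIASES.get(cleaned.lower(), cleaned)

raw = pd.Series(["United States", "USA", " usa ", "France", "", None])
normalized = raw.map(normalize_country)
print(normalized.value_counts())
```

Without this step, the three US variants in the sample would be counted as three separate countries, which inflates `n_countries` and understates the largest country's total.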

**Site-level performance:**
- Cannot assess individual site success rates
- No enrollment breakdown by site

**Completion rate bias:**
- Country-level completion rates don't account for phase mix
- China's low rate may reflect higher Phase 1/2 concentration
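
The phase-mix caveat can be made concrete with direct standardization: compute each country's completion rate per phase, then reweight by a common reference phase mix. The sketch below uses invented counts for two hypothetical countries, A and B, with identical per-phase rates but opposite phase mixes; none of these numbers come from the dataset.

```python
import pandas as pd

# Hypothetical counts: both countries complete 40% of Phase 1 and 80% of
# Phase 3 trials, but A runs mostly Phase 3 and B mostly Phase 1.
by_phase = pd.DataFrame({
    "country":   ["A", "A", "B", "B"],
    "phase":     ["Phase 1", "Phase 3", "Phase 1", "Phase 3"],
    "total":     [20, 80, 80, 20],
    "completed": [8, 64, 32, 16],
})
by_phase["rate"] = by_phase["completed"] / by_phase["total"]

# Crude rate: completed / total per country, ignoring phase.
crude = by_phase.groupby("country")[["completed", "total"]].sum()
crude["crude_rate"] = crude["completed"] / crude["total"]

# Reference mix: each phase's share of all trials across both countries.
phase_weights = by_phase.groupby("phase")["total"].sum() / by_phase["total"].sum()

# Standardized rate: per-phase rates weighted by the shared reference mix.
by_phase["weighted"] = by_phase["rate"] * by_phase["phase"].map(phase_weights)
standardized = by_phase.groupby("country")["weighted"].sum().rename("standardized_rate")

comparison = crude[["crude_rate"]].join(standardized)
print(comparison.round(3))
```

Here the crude rates differ (0.72 for A, 0.48 for B) while the standardized rates are equal (0.60 each), so the entire gap comes from phase mix. Whether that explains the country differences in this dataset would need the real per-phase counts.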

In [None]:
# Close connection
conn.close()