# Q4: Geographic Distribution & Regional Specialization

## Research Question

> **How are clinical trials distributed globally? Are there regional specializations in certain therapeutic areas?**

## Analysis Structure

| Section | Question | Approach |
|---------|----------|----------|
| **2. Distribution** | Where are trials conducted? | Frequency counts, concentration metrics |
| **3. Specialization** | Do countries specialize in therapeutic areas? | Location Quotient, χ² test |
| **4. Site Complexity** | How does multi-site scale vary by phase? | Descriptive + Mann-Whitney |
| **5. Temporal Trends** | Has geographic distribution shifted? | Cohort-based share analysis |

## Scope & Data Notes

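- **Analysis scope:** Studies with a start year in the analysis range (1990–2025 in this run; matches `v_studies_clean`)
- **Geographic unit:** Country. Trial presence is counted by country (a multi-country trial counts once per country); Sections 2 and 5 instead assign each trial to a single primary country (modal site location)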
- **Condition labels:** Free-text registry entries (not standardized taxonomy)
- **Interpretation:** Descriptive; associations not causal
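
The difference between counting trial presence by country and assigning a single primary country (modal site location, as used in Section 2) can be sketched in plain Python. This is a minimal illustration on made-up site lists: `trial_sites`, `primary_country`, and `presence_counts` are hypothetical names, not project code, and the alphabetical tie-break for the modal country is an assumption.

```python
from collections import Counter

# Hypothetical site lists: trial id -> country of each site (made-up data)
trial_sites = {
    "T1": ["United States", "United States", "Canada"],
    "T2": ["France"],
    "T3": ["Germany", "France", "France", "Spain"],
}

def primary_country(sites):
    """Modal site country; ties are broken alphabetically (an assumption)."""
    counts = Counter(sites)
    top = max(counts.values())
    return min(country for country, n in counts.items() if n == top)

def presence_counts(trials):
    """Count each trial once per country in which it has at least one site."""
    counts = Counter()
    for sites in trials.values():
        counts.update(set(sites))
    return counts

# One country per trial vs. one count per (trial, country) pair
primary = Counter(primary_country(sites) for sites in trial_sites.values())
presence = presence_counts(trial_sites)

print("Primary country:", dict(primary))
print("Presence:", dict(presence))
```

Presence counts sum to more than the number of trials (6 vs 3 in this toy data), so the two units should not be mixed within one table.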

In [1]:
# ============================================================
# Setup
# ============================================================

import sys
from pathlib import Path

import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency, mannwhitneyu, spearmanr
from IPython.display import display, Markdown
import plotly.graph_objects as go

# Project root for imports (assumes the notebook runs from a folder one level below the root)
PROJECT_ROOT = Path('..').resolve()
sys.path.insert(0, str(PROJECT_ROOT))

# Shared utilities
from src.data.loader import load_sql_query, get_db_connection
from src.analysis.viz import DEFAULT_COLORS
from src.analysis.metrics import calc_cramers_v, interpret_effect_size

# Paths (validated at setup)
DB_PATH = PROJECT_ROOT / 'data' / 'database' / 'clinical_trials.db'
SQL_PATH = PROJECT_ROOT / 'sql' / 'queries'
assert DB_PATH.exists(), f"DB not found: {DB_PATH}"
assert SQL_PATH.exists(), f"SQL folder not found: {SQL_PATH}"

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

In [2]:
# ============================================================
# Database connection
# ============================================================

conn = get_db_connection(DB_PATH)

---

## 1. Data Loading & Validation

In [3]:
# ============================================================
# 1.1 Load ABT (study-level with geographic features)
# ============================================================

df_abt = load_sql_query('q4_abt.sql', conn, SQL_PATH)

# Basic validation
n_studies = len(df_abt)
n_with_location = df_abt['has_location_data'].sum()
pct_location = n_with_location / n_studies * 100

# Year range
year_min = df_abt['start_year'].min()
year_max = df_abt['start_year'].max()

display(Markdown(f"""
**ABT loaded:** {n_studies:,} studies ({year_min}–{year_max})

**Location data coverage:**
- Studies with location data: {n_with_location:,} ({pct_location:.1f}%)
- Studies without location: {n_studies - n_with_location:,} ({100 - pct_location:.1f}%)

*Analysis proceeds with studies that have location data.*
"""))

# Filter to studies with location data for geographic analysis
df_geo = df_abt[df_abt['has_location_data'] == 1].copy()
n_geo = len(df_geo)


**ABT loaded:** 97,943 studies (1990–2025)

**Location data coverage:**
- Studies with location data: 88,404 (90.3%)
- Studies without location: 9,539 (9.7%)

*Analysis proceeds with studies that have location data.*


In [4]:
# ============================================================
# 1.2 Geographic summary statistics
# ============================================================

# Unique countries
n_countries = df_geo['primary_country'].nunique()

# Site complexity distribution
n_single_site = df_geo['is_single_site'].sum()
n_multinational = df_geo['is_multinational'].sum()
n_large_multisite = df_geo['is_large_multisite'].sum()

pct_single = n_single_site / n_geo * 100
pct_multinational = n_multinational / n_geo * 100
pct_large = n_large_multisite / n_geo * 100

# Site count statistics
median_sites = df_geo['n_sites'].median()
mean_sites = df_geo['n_sites'].mean()
max_sites = df_geo['n_sites'].max()
q75_sites = df_geo['n_sites'].quantile(0.75)

display(Markdown(f"""
**Geographic summary (n = {n_geo:,} studies with location data):**

| Metric | Value |
|--------|-------|
| Unique countries | {n_countries:,} |
| Single-site trials | {n_single_site:,} ({pct_single:.1f}%) |
| Multinational trials | {n_multinational:,} ({pct_multinational:.1f}%) |
| Large multi-site (≥10 sites) | {n_large_multisite:,} ({pct_large:.1f}%) |
| Median sites per trial | {median_sites:.0f} |
"""))


**Geographic summary (n = 88,404 studies with location data):**

| Metric | Value |
|--------|-------|
| Unique countries | 175 |
| Single-site trials | 64,557 (73.0%) |
| Multinational trials | 7,196 (8.1%) |
| Large multi-site (≥10 sites) | 8,959 (10.1%) |
| Median sites per trial | 1 |


---

## 2. Geographic Distribution

**Question:** Where are clinical trials conducted?

In [5]:
# ============================================================
# 2.1 Country-level distribution
# ============================================================

# Aggregate by primary country
df_country = (
    df_geo
    .groupby('primary_country')
    .agg(
        n_trials=('study_id', 'nunique'),
        n_interventional=('is_interventional', 'sum'),
        n_industry=('is_industry_sponsor', 'sum'),
        median_sites=('n_sites', 'median'),
        pct_multinational=('is_multinational', 'mean'),
    )
    .reset_index()
    .sort_values('n_trials', ascending=False)
)
df_country['pct_multinational'] = df_country['pct_multinational'] * 100
df_country['pct_interventional'] = df_country['n_interventional'] / df_country['n_trials'] * 100
df_country['pct_industry'] = df_country['n_industry'] / df_country['n_trials'] * 100

# Top 20 countries
top20 = df_country.head(20).copy()

# Concentration: share of top 5 / top 10
total_trials = df_country['n_trials'].sum()
top5_share = df_country.head(5)['n_trials'].sum() / total_trials * 100
top10_share = df_country.head(10)['n_trials'].sum() / total_trials * 100

# Count multinational share to contextualize primary_country limitation
pct_multinational_all = df_geo['is_multinational'].mean() * 100

display(Markdown(f"""
### 2.1 Country distribution

**Concentration:**
- Top 5 countries: {top5_share:.1f}% of trials
- Top 10 countries: {top10_share:.1f}% of trials

Ten countries account for {top10_share:.0f}% of trials in the registry.

**Caveat on multinational trials ({pct_multinational_all:.1f}% of dataset):**
Trials spanning multiple countries are assigned to their "primary country" (modal site location). This may overstate concentration for trials with evenly distributed sites.
"""))

# Display table
display(
    top20[['primary_country', 'n_trials', 'pct_interventional', 'pct_industry', 'median_sites']]
    .rename(columns={
        'primary_country': 'Country',
        'n_trials': 'Trials',
        'pct_interventional': 'Interventional %',
        'pct_industry': 'Industry %',
        'median_sites': 'Median Sites',
    })
    .style.format({
        'Trials': '{:,.0f}',
        'Interventional %': '{:.1f}%',
        'Industry %': '{:.1f}%',
        'Median Sites': '{:.0f}',
    }).hide(axis='index')
)


### 2.1 Country distribution

**Concentration:**
- Top 5 countries: 58.0% of trials
- Top 10 countries: 72.8% of trials

Ten countries account for 73% of trials in the registry.

**Caveat on multinational trials (8.1% of dataset):**
Trials spanning multiple countries are assigned to their "primary country" (modal site location). This may overstate concentration for trials with evenly distributed sites.


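| Country | Trials | Interventional % | Industry % | Median Sites |
|---------|--------|------------------|------------|--------------|
| United States | 31,791 | 83.8% | 29.8% | 1 |
| China | 7,780 | 75.6% | 22.9% | 1 |
| France | 5,254 | 60.9% | 11.1% | 1 |
| Turkey (Türkiye) | 3,274 | 66.2% | 2.0% | 1 |
| Canada | 3,212 | 82.3% | 12.3% | 1 |
| United Kingdom | 2,996 | 72.0% | 27.5% | 1 |
| Egypt | 2,689 | 76.8% | 1.0% | 1 |
| Germany | 2,664 | 70.8% | 38.1% | 1 |
| Italy | 2,504 | 57.5% | 10.9% | 1 |
| Spain | 2,186 | 76.4% | 19.9% | 1 |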
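
The top-5 and top-10 shares can be re-derived from the counts in the table above. The sketch below is a minimal check in plain Python; `top_k_share` is a hypothetical helper, and it assumes the denominator is the 88,404 studies with location data reported in Section 1.1 (that assumption reproduces the reported 58.0% and 72.8%).

```python
# Trial counts per primary country, copied from the table above
top10_counts = {
    "United States": 31791,
    "China": 7780,
    "France": 5254,
    "Turkey (Türkiye)": 3274,
    "Canada": 3212,
    "United Kingdom": 2996,
    "Egypt": 2689,
    "Germany": 2664,
    "Italy": 2504,
    "Spain": 2186,
}

# Assumed denominator: studies with location data (Section 1.1)
N_GEO = 88404

def top_k_share(counts, k, total):
    """Percentage of `total` held by the k largest entries of `counts`."""
    largest = sorted(counts.values(), reverse=True)[:k]
    return sum(largest) / total * 100

top5 = top_k_share(top10_counts, 5, N_GEO)
top10 = top_k_share(top10_counts, 10, N_GEO)
print(f"Top 5: {top5:.1f}% | Top 10: {top10:.1f}%")
```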


In [6]:
# ============================================================
# 2.2 Country distribution chart
# ============================================================

# Prepare data for plot
plot_data = top20.sort_values('n_trials', ascending=True).tail(15)

# Create horizontal bar chart
fig_country = go.Figure()

fig_country.add_trace(go.Bar(
    x=plot_data['n_trials'],
    y=plot_data['primary_country'],
    orientation='h',
    marker_color=DEFAULT_COLORS[0],
    text=[f"{v:,.0f}" for v in plot_data['n_trials']],
    textposition='outside',
    hovertemplate='<b>%{y}</b><br>%{x:,.0f} trials<extra></extra>',
))

# Title with key insight
top1 = df_country.iloc[0]
top2 = df_country.iloc[1]
title_text = f"Top 15 countries by trial count | {top1['primary_country']}: {top1['n_trials']:,.0f} | {top2['primary_country']}: {top2['n_trials']:,.0f}"

fig_country.update_layout(
    title=dict(text=title_text, font=dict(size=14)),
    xaxis=dict(showgrid=False, showticklabels=False, title=None),
    yaxis=dict(title=None, tickfont=dict(size=11)),
    height=500,
    template='plotly_white',
    margin=dict(l=120, r=80, t=60, b=40),
)
fig_country.show()

---

## 3. Regional Specialization

**Question:** Do countries show relative specialization in particular therapeutic areas?

**Approach:** Use Location Quotient (LQ) to measure over-/under-representation of conditions by country.

$$LQ = \frac{\text{share of country's trials in condition}}{\text{share of global trials in condition}}$$

- **LQ > 1.5**: Country relatively specialized
- **LQ < 0.5**: Country relatively under-represented
- **LQ ≈ 1**: Country matches global distribution
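
As a worked illustration of the formula, the sketch below computes LQ on a made-up two-country, two-condition table in plain Python. The counts and the `location_quotients` helper are hypothetical; the notebook's actual values come from `q4_country_condition.sql`, whose exact baseline (for example, which countries enter the global total) is not shown here.

```python
from collections import defaultdict

# Hypothetical trial counts: (country, condition) -> number of trials (made-up data)
counts = {
    ("A", "diabetes"): 30, ("A", "oncology"): 70,
    ("B", "diabetes"): 20, ("B", "oncology"): 180,
}

def location_quotients(counts):
    """LQ = (condition share within country) / (condition share across all countries)."""
    country_total = defaultdict(int)
    condition_total = defaultdict(int)
    for (country, condition), n in counts.items():
        country_total[country] += n
        condition_total[condition] += n
    grand_total = sum(counts.values())
    return {
        (country, condition): (n / country_total[country])
        / (condition_total[condition] / grand_total)
        for (country, condition), n in counts.items()
    }

lq = location_quotients(counts)
for key, value in sorted(lq.items()):
    print(key, round(value, 2))
```

In this toy table, country A runs 30% of its trials in diabetes against a global share of about 16.7%, giving LQ = 1.8 (above the 1.5 threshold), while country B's diabetes LQ is 0.6.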

In [7]:
# ============================================================
# 3.1 Load country × condition data for specialization analysis
# ============================================================

df_spec = load_sql_query('q4_country_condition.sql', conn, SQL_PATH)

n_combinations = len(df_spec)
n_countries_spec = df_spec['country'].nunique()
n_conditions_spec = df_spec['condition_standardized'].nunique()

# Quantify double-counting: conditions per trial
median_conds_per_trial = df_geo['n_conditions'].median()
mean_conds_per_trial = df_geo['n_conditions'].mean()
pct_multi_condition = (df_geo['n_conditions'] > 1).mean() * 100

display(Markdown(f"""
**Specialization data loaded:**
- {n_combinations:,} country × condition combinations
- {n_countries_spec} countries (≥50 trials each)
- {n_conditions_spec} conditions (≥100 trials globally)
- Minimum 5 trials per combination

**Double-counting note:** Trials map to multiple conditions.
- Median conditions per trial: {median_conds_per_trial:.0f}
- Mean conditions per trial: {mean_conds_per_trial:.1f}
- Trials with >1 condition: {pct_multi_condition:.1f}%

*This inflates counts proportionally—LQ ratios remain valid for relative comparisons, but absolute trial counts in Section 3.2 reflect condition-mappings, not unique trials.*
"""))


**Specialization data loaded:**
- 2,854 country × condition combinations
- 71 countries (≥50 trials each)
- 183 conditions (≥100 trials globally)
- Minimum 5 trials per combination

**Double-counting note:** Trials map to multiple conditions.
- Median conditions per trial: 1
- Mean conditions per trial: 1.8
- Trials with >1 condition: 35.9%

*This inflates counts proportionally—LQ ratios remain valid for relative comparisons, but absolute trial counts in Section 3.2 reflect condition-mappings, not unique trials.*


In [8]:
# ============================================================
# 3.2 Identify strong specializations (LQ > 1.5)
# ============================================================

# Filter to high LQ
df_high_lq = df_spec[df_spec['location_quotient'] > 1.5].copy()
df_high_lq = df_high_lq.sort_values('location_quotient', ascending=False)

# Top 3 specializations per major country, with n_trials shown as sample-size context
major_countries = ['United States', 'China', 'Germany', 'France', 'United Kingdom', 'Japan', 'Canada', 'Italy', 'Spain', 'India']

specializations = []
for country in major_countries:
    country_data = df_high_lq[df_high_lq['country'] == country].head(3)
    if len(country_data) > 0:
        # Include n_trials for each condition to show sample size
        top_conds = ', '.join([
            f"{row['condition_standardized']} (LQ={row['location_quotient']:.1f}, n={row['n_trials']:,.0f})"
            for _, row in country_data.iterrows()
        ])
        specializations.append({
            'Country': country,
            'Top Specializations (LQ > 1.5, with sample size)': top_conds,
            'n_specializations': len(df_high_lq[df_high_lq['country'] == country]),
        })

df_spec_summary = pd.DataFrame(specializations)

display(Markdown("### 3.2 Top specializations by country"))
display(Markdown("""
*Conditions where country's concentration exceeds 1.5× global average.*

**Reading the table:** LQ values with larger n are more stable; small n (e.g., <50) may reflect noise rather than true specialization.
"""))
display(df_spec_summary.style.hide(axis='index'))

### 3.2 Top specializations by country


*Conditions where country's concentration exceeds 1.5× global average.*

**Reading the table:** LQ values with larger n are more stable; small n (e.g., <50) may reflect noise rather than true specialization.


| Country | Top Specializations (LQ > 1.5, with sample size) | n_specializations |
|---------|--------------------------------------------------|-------------------|
| United States | unspecified adult solid tumor, protocol specific (LQ=2.3, n=112), leukemia (LQ=2.0, n=218), myelodysplastic syndrome (LQ=2.0, n=75) | 26 |
| China | advanced solid tumor (LQ=5.1, n=63), acute ischemic stroke (LQ=4.4, n=43), hepatocellular carcinoma (LQ=4.1, n=113) | 20 |
| Germany | carcinoma, non-small-cell lung (LQ=3.6, n=24), pulmonary disease, chronic obstructive (LQ=3.4, n=30), crohn's disease (LQ=3.0, n=22) | 31 |
| France | crohn disease (LQ=2.4, n=25), carcinoma, non-small-cell lung (LQ=2.3, n=23), crohn's disease (LQ=2.3, n=25) | 20 |
| United Kingdom | pulmonary disease, chronic obstructive (LQ=2.9, n=26), cystic fibrosis (LQ=2.6, n=34), colorectal neoplasms (LQ=2.5, n=14) | 30 |
| Japan | carcinoma, non-small-cell lung (LQ=7.1, n=15), crohn's disease (LQ=5.1, n=12), non-small cell lung cancer (LQ=3.9, n=22) | 39 |
| Canada | crohn's disease (LQ=3.2, n=27), ulcerative colitis (LQ=2.6, n=32), crohn disease (LQ=2.4, n=20) | 25 |
| Italy | carcinoma, non-small-cell lung (LQ=3.9, n=23), endometrial cancer (LQ=3.2, n=21), endometriosis (LQ=3.1, n=18) | 26 |
| Spain | carcinoma, non-small-cell lung (LQ=5.0, n=29), crohn's disease (LQ=3.4, n=22), advanced solid tumors (LQ=3.2, n=19) | 31 |
| India | diabetes mellitus, type 2 (LQ=5.7, n=39), type 2 diabetes mellitus (LQ=4.8, n=20), carcinoma, non-small-cell lung (LQ=4.5, n=7) | 17 |


In [9]:
# ============================================================
# 3.3 Statistical test: Is country-condition distribution non-random?
# ============================================================

# Build contingency table for top countries × conditions
# Use top 10 countries and top 20 conditions to avoid sparse cells

top_countries_list = df_country.head(10)['primary_country'].tolist()
top_conditions_list = (
    df_spec
    .groupby('condition_standardized')['n_trials'].sum()
    .nlargest(20)
    .index.tolist()
)

# Filter data
df_chi = df_spec[
    (df_spec['country'].isin(top_countries_list)) &
    (df_spec['condition_standardized'].isin(top_conditions_list))
].copy()

# Pivot to contingency table
ct = df_chi.pivot_table(
    index='country',
    columns='condition_standardized',
    values='n_trials',
    fill_value=0,
    aggfunc='sum'
)

# Chi-squared test
chi2, p_val, dof, expected = chi2_contingency(ct)

# Effect size: Cramér's V
n = ct.sum().sum()
min_dim = min(ct.shape[0] - 1, ct.shape[1] - 1)
cramers_v = calc_cramers_v(chi2, n, min_dim)
effect_label = interpret_effect_size(cramers_v, metric='v')

# Format p-value
p_str = "< 0.001" if p_val < 0.001 else f"= {p_val:.3f}"

# Sample-size caveat: the table cells sum condition-mappings, not unique trials
n_condition_mappings = df_chi['n_trials'].sum()
inflation_factor = mean_conds_per_trial

display(Markdown(f"""
### 3.3 Association test: Country × Condition

**χ² test (top 10 countries × top 20 conditions):**
- χ²({dof}) = {chi2:,.1f}, p {p_str}
- **Cramér's V = {cramers_v:.3f}** ({effect_label} association)
- n = {n:,.0f} condition-mappings (not unique trials)

**Interpretation:**
Conditions are unevenly distributed across countries. Some countries show higher concentration in specific therapeutic areas relative to the global average.

**Statistical caveat:** This test uses condition-mappings as observations, not unique trials. Since trials map to ~{inflation_factor:.1f} conditions on average, the sample size is inflated, which reduces p-values and may overstate Cramér's V. The test confirms non-uniformity in the country × condition matrix, but effect magnitude should be interpreted cautiously.
"""))


### 3.3 Association test: Country × Condition

**χ² test (top 10 countries × top 20 conditions):**
- χ²(171) = 1,428.8, p < 0.001
- **Cramér's V = 0.115** (small association)
- n = 11,959 condition-mappings (not unique trials)

**Interpretation:**
Conditions are unevenly distributed across countries. Some countries show higher concentration in specific therapeutic areas relative to the global average.

**Statistical caveat:** This test uses condition-mappings as observations, not unique trials. Since trials map to ~1.8 conditions on average, the sample size is inflated, which reduces p-values and may overstate Cramér's V. The test confirms non-uniformity in the country × condition matrix, but effect magnitude should be interpreted cautiously.
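
`calc_cramers_v` comes from the project's `src.analysis.metrics` module and is not shown in this notebook. The sketch below assumes it implements the standard formula V = sqrt(χ² / (n · min(r − 1, c − 1))), which does reproduce the reported 0.115 from the reported χ² and n. The `chi_square` helper and the 2 × 3 table are made up for illustration. The second half shows one aspect of the caveat: doubling every cell doubles χ² (and so shrinks the p-value) but leaves V unchanged, so any distortion of V would have to come from uneven multi-condition mapping rather than from the larger n alone.

```python
import math

def chi_square(table):
    """Pearson chi-square statistic and total count for a list-of-rows table
    (no continuity correction)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2, n

def cramers_v(chi2, n, min_dim):
    """Cramér's V from a chi-square statistic (assumed to match calc_cramers_v)."""
    return math.sqrt(chi2 / (n * min_dim))

# Values reported in Section 3.3: chi2 = 1,428.8, n = 11,959, 10 x 20 table
v_reported = cramers_v(1428.8, 11959, min(10 - 1, 20 - 1))
print(f"V from reported values: {v_reported:.3f}")

# Made-up 2 x 3 table, then the same table with every cell doubled
table = [[30, 50, 20], [20, 150, 30]]
doubled = [[2 * cell for cell in row] for row in table]

chi2_a, n_a = chi_square(table)
chi2_b, n_b = chi_square(doubled)
v_a = cramers_v(chi2_a, n_a, 1)  # min(2 - 1, 3 - 1) = 1
v_b = cramers_v(chi2_b, n_b, 1)
print(f"chi2: {chi2_a:.2f} -> {chi2_b:.2f} | V: {v_a:.3f} -> {v_b:.3f}")
```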


---

## 4. Site Complexity by Phase

**Question:** How does trial operational complexity (site count) vary by development phase?

In [10]:
# ============================================================
# 4.1 Site complexity by phase
# ============================================================

# Filter to interventional trials with valid phase (most relevant for phase scaling)
df_phase_sites = df_geo[
    (df_geo['is_interventional'] == 1) &
    (df_geo['phase_group'].notna()) &
    (df_geo['phase_group'] != 'Not Applicable') &
    (df_geo['phase_group'] != 'Other')
].copy()

# Order phases
phase_order = ['Early Phase 1', 'Phase 1', 'Phase 1/2', 'Phase 2', 'Phase 2/3', 'Phase 3', 'Phase 4']
df_phase_sites['phase_group'] = pd.Categorical(df_phase_sites['phase_group'], categories=phase_order, ordered=True)

# Aggregate by phase
phase_summary = (
    df_phase_sites
    .groupby('phase_group', observed=True)
    .agg(
        n_trials=('study_id', 'nunique'),
        median_sites=('n_sites', 'median'),
        mean_sites=('n_sites', 'mean'),
        q75_sites=('n_sites', lambda x: x.quantile(0.75)),
        pct_single_site=('is_single_site', 'mean'),
        pct_multinational=('is_multinational', 'mean'),
    )
    .reset_index()
)
phase_summary['pct_single_site'] = phase_summary['pct_single_site'] * 100
phase_summary['pct_multinational'] = phase_summary['pct_multinational'] * 100

display(Markdown("### 4.1 Site complexity by trial phase (interventional only)"))
display(
    phase_summary
    .rename(columns={
        'phase_group': 'Phase',
        'n_trials': 'N',
        'median_sites': 'Median Sites',
        'mean_sites': 'Mean Sites',
        'q75_sites': 'Q75 Sites',
        'pct_single_site': 'Single-site %',
        'pct_multinational': 'Multinational %',
    })
    .style.format({
        'N': '{:,.0f}',
        'Median Sites': '{:.0f}',
        'Mean Sites': '{:.1f}',
        'Q75 Sites': '{:.0f}',
        'Single-site %': '{:.1f}%',
        'Multinational %': '{:.1f}%',
    }).hide(axis='index')
)

### 4.1 Site complexity by trial phase (interventional only)

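| Phase | N | Median Sites | Mean Sites | Q75 Sites | Single-site % | Multinational % |
|-------|---|--------------|------------|-----------|---------------|-----------------|
| Early Phase 1 | 890 | 1 | 1.4 | 1 | 89.2% | 0.6% |
| Phase 1 | 7,482 | 1 | 2.8 | 2 | 70.4% | 8.9% |
| Phase 1/2 | 2,614 | 1 | 5.5 | 5 | 56.5% | 14.3% |
| Phase 2 | 10,159 | 1 | 11.1 | 8 | 54.4% | 16.8% |
| Phase 2/3 | 1,118 | 1 | 9.6 | 3 | 61.7% | 13.5% |
| Phase 3 | 6,333 | 4 | 37.2 | 37 | 41.1% | 31.6% |
| Phase 4 | 5,277 | 1 | 5.4 | 2 | 72.7% | 6.4% |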


In [11]:
# ============================================================
# 4.2 Visualization: median sites by phase
# ============================================================

# Focus on main clinical phases
main_phases = ['Phase 1', 'Phase 2', 'Phase 3', 'Phase 4']
plot_data = phase_summary[phase_summary['phase_group'].isin(main_phases)].copy()

# Get key values for title
p1_sites = plot_data.loc[plot_data['phase_group'] == 'Phase 1', 'median_sites'].values
p3_sites = plot_data.loc[plot_data['phase_group'] == 'Phase 3', 'median_sites'].values

p1_val = int(p1_sites[0]) if len(p1_sites) > 0 else 'N/A'
p3_val = int(p3_sites[0]) if len(p3_sites) > 0 else 'N/A'

fig_phase = go.Figure()

fig_phase.add_trace(go.Bar(
    x=plot_data['phase_group'].astype(str),
    y=plot_data['median_sites'],
    marker_color=DEFAULT_COLORS[0],
    text=[f"{v:.0f}" for v in plot_data['median_sites']],
    textposition='outside',
    hovertemplate='<b>%{x}</b><br>Median sites: %{y:.0f}<extra></extra>',
))

fig_phase.update_layout(
    title=dict(
        text=f"Median sites per trial by phase | Phase 1: {p1_val} | Phase 3: {p3_val}",
        font=dict(size=14)
    ),
    xaxis=dict(title=None, tickfont=dict(size=12)),
    yaxis=dict(title='Median sites', rangemode='tozero'),
    height=400,
    template='plotly_white',
    margin=dict(t=60, b=50, r=50),
)
fig_phase.show()

In [12]:
# ============================================================
# 4.3 Sponsor effect on site complexity (Phase 3)
# ============================================================

# Compare industry vs non-industry (for Phase 3, where difference is most meaningful)
p3_data = df_phase_sites[df_phase_sites['phase_group'] == 'Phase 3'].copy()

p3_industry = p3_data[p3_data['is_industry_sponsor'] == 1]['n_sites']
p3_non_industry = p3_data[p3_data['is_industry_sponsor'] == 0]['n_sites']

sponsor_comparison = pd.DataFrame({
    'Sponsor': ['Industry', 'Non-industry'],
    'N': [len(p3_industry), len(p3_non_industry)],
    'Median Sites': [p3_industry.median(), p3_non_industry.median()],
    'Mean Sites': [p3_industry.mean(), p3_non_industry.mean()],
    'Q75 Sites': [p3_industry.quantile(0.75), p3_non_industry.quantile(0.75)],
})

display(Markdown("### 4.3 Phase 3 site complexity by sponsor type"))
display(
    sponsor_comparison
    .style.format({
        'N': '{:,.0f}',
        'Median Sites': '{:.0f}',
        'Mean Sites': '{:.1f}',
        'Q75 Sites': '{:.0f}',
    }).hide(axis='index')
)

# Mann-Whitney U test
u_stat, p_mw = mannwhitneyu(p3_industry, p3_non_industry, alternative='two-sided')

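# Rank-biserial correlation (effect size).
# Sign note: assuming mannwhitneyu returns U for the first sample (industry),
# this formula yields a negative r when industry trials tend to have more sites.
# Only the magnitude (r_abs) is used for the effect label below.
n1, n2 = len(p3_industry), len(p3_non_industry)
r_biserial = 1 - (2 * u_stat) / (n1 * n2)
r_abs = abs(r_biserial)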

# Interpret effect size
if r_abs < 0.1:
    effect_label = "negligible"
elif r_abs < 0.3:
    effect_label = "small"
elif r_abs < 0.5:
    effect_label = "medium"
else:
    effect_label = "large"

p_str = "< 0.001" if p_mw < 0.001 else f"= {p_mw:.3f}"

display(Markdown(f"""
**Statistical test (Mann-Whitney U):**
- U = {u_stat:,.0f}, p {p_str}
- **Rank-biserial r = {r_biserial:.3f}** ({effect_label} effect)

**Interpretation:** Industry-sponsored Phase 3 trials show larger site counts than non-industry trials. The effect size is {effect_label}.

**Caveat:** This comparison does not control for condition mix (oncology trials may use more sites than dermatology) or multinational status (industry trials may be more often multinational). The observed difference reflects sponsor type confounded with these factors.
"""))

### 4.3 Phase 3 site complexity by sponsor type

| Sponsor | N | Median Sites | Mean Sites | Q75 Sites |
|---------|---|--------------|------------|-----------|
| Industry | 3,354 | 24 | 56.0 | 72 |
| Non-industry | 2,979 | 1 | 16.1 | 4 |



**Statistical test (Mann-Whitney U):**
- U = 7,844,790, p < 0.001
- **Rank-biserial r = -0.570** (large effect)

**Interpretation:** Industry-sponsored Phase 3 trials show larger site counts than non-industry trials. The effect size is large.

**Caveat:** This comparison does not control for condition mix (oncology trials may use more sites than dermatology) or multinational status (industry trials may be more often multinational). The observed difference reflects sponsor type confounded with these factors.
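
The negative sign of the reported r can look at odds with "industry trials show larger site counts". The sketch below works through the sign convention in plain Python on made-up site counts. It assumes the `u_stat` returned by `mannwhitneyu` is the U statistic of the first sample (industry), as recent SciPy versions document; `mann_whitney_u` is a hypothetical helper that counts pairs directly.

```python
def mann_whitney_u(x, y):
    """U statistic for sample x: number of (x, y) pairs with x > y, ties count 0.5."""
    return sum(
        1.0 if a > b else 0.5 if a == b else 0.0
        for a in x
        for b in y
    )

# Made-up site counts for illustration
industry = [24, 72, 10, 5]
non_industry = [1, 4, 1, 2, 12]

u1 = mann_whitney_u(industry, non_industry)
n_pairs = len(industry) * len(non_industry)

# Convention A: positive when the first sample tends to be larger
r_first_larger_positive = 2 * u1 / n_pairs - 1

# Convention B: the formula used in the cell above (same magnitude, opposite sign)
r_notebook = 1 - 2 * u1 / n_pairs

# Same formula applied to the values reported above
r_reported = 1 - 2 * 7844790 / (3354 * 2979)

print(f"U1 = {u1:.0f} of {n_pairs} pairs")
print(f"Convention A: {r_first_larger_positive:+.3f} | Convention B: {r_notebook:+.3f}")
print(f"Reported r recomputed: {r_reported:.3f}")
```

Under either convention the magnitude is the same, so the "large effect" label is unaffected; only the sign needs to be read with the formula in mind.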


---

## 5. Temporal Trends in Geographic Distribution

**Question:** Has the geographic distribution of trials shifted over time?

In [13]:
# ============================================================
# 5.1 Temporal trends by major country
# ============================================================

# Create cohorts
df_geo['start_cohort'] = pd.cut(
    df_geo['start_year'],
    bins=[1989, 1999, 2009, 2019, 2030],
    labels=['1990-1999', '2000-2009', '2010-2019', '2020+']
)

# Top 5 countries for trend analysis
top5_countries = df_country.head(5)['primary_country'].tolist()

# Aggregate by cohort and country
temporal_dist = (
    df_geo[df_geo['primary_country'].isin(top5_countries)]
    .groupby(['start_cohort', 'primary_country'], observed=True)
    .agg(n_trials=('study_id', 'nunique'))
    .reset_index()
)

# Calculate share within each cohort (denominator: trials in the top-5 countries only, not all trials)
cohort_totals = temporal_dist.groupby('start_cohort', observed=True)['n_trials'].sum().reset_index()
cohort_totals.columns = ['start_cohort', 'cohort_total']
temporal_dist = temporal_dist.merge(cohort_totals, on='start_cohort')
temporal_dist['pct_share'] = temporal_dist['n_trials'] / temporal_dist['cohort_total'] * 100

# Pivot for display
temporal_pivot = temporal_dist.pivot_table(
    index='primary_country',
    columns='start_cohort',
    values='pct_share',
    fill_value=0,
    observed=True
).round(1)

# Reorder by total trials
temporal_pivot = temporal_pivot.reindex(top5_countries)

display(Markdown("### 5.1 Share of trials among the top-5 countries by decade (%)"))
display(temporal_pivot.style.format("{:.1f}%"))

# Calculate shift from earliest to latest cohort
earliest_cohort = '2000-2009'  # Skip 1990s if sparse
latest_cohort = '2020+'

if earliest_cohort in temporal_pivot.columns and latest_cohort in temporal_pivot.columns:
    shift_summary = []
    for country in top5_countries:
        early = temporal_pivot.loc[country, earliest_cohort]
        late = temporal_pivot.loc[country, latest_cohort]
        delta = late - early
        shift_summary.append({
            'Country': country,
            f'{earliest_cohort}': f"{early:.1f}%",
            f'{latest_cohort}': f"{late:.1f}%",
            'Change (pp)': f"{delta:+.1f}"
        })
    
    df_shift = pd.DataFrame(shift_summary)
    display(Markdown(f"**Shift in share ({earliest_cohort} → {latest_cohort}):**"))
    display(df_shift.style.hide(axis='index'))

# Trend test: Spearman correlation between year and country share for top countries.
# This tests for a monotonic trend rather than a two-point shift. Shares here use all
# trials with location data as the denominator (unlike the top-5-only table above),
# and years in which a country has no trials are dropped by the merge, not counted as 0.

# Total trials per year (identical for every country, so computed once)
yearly_totals = df_geo.groupby('start_year').size().reset_index(name='total')

trend_results = []
for country in top5_countries:
    country_yearly = (
        df_geo[df_geo['primary_country'] == country]
        .groupby('start_year')
        .size()
        .reset_index(name='n')
    )
    # Merge with total per year
    country_yearly = country_yearly.merge(yearly_totals, on='start_year')
    country_yearly['share'] = country_yearly['n'] / country_yearly['total'] * 100
    
    # Spearman correlation
    if len(country_yearly) >= 5:
        rho, p_val = spearmanr(country_yearly['start_year'], country_yearly['share'])
        trend_results.append({
            'Country': country,
            'Spearman ρ': f"{rho:.3f}",
            'p-value': "< 0.001" if p_val < 0.001 else f"{p_val:.3f}",
            'Trend': '↑' if rho > 0.1 else ('↓' if rho < -0.1 else '→'),
        })

df_trend = pd.DataFrame(trend_results)

display(Markdown("**Trend test (Spearman ρ: share vs year):**"))
display(df_trend.style.hide(axis='index'))

display(Markdown("""
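**Interpretation:** Spearman ρ measures monotonic association between year and country share. Positive values indicate increasing share over time; negative values indicate declining share. Trend arrows reflect only the sign of ρ (|ρ| > 0.1), not statistical significance, so an arrow paired with a large p-value should be read as no clear trend. Trends are descriptive and do not control for compositional shifts (e.g., changes in phase/condition mix over time).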
"""))

### 5.1 Share of trials among the top-5 countries by decade (%)

| primary_country | 1990-1999 | 2000-2009 | 2010-2019 | 2020+ |
|-----------------|-----------|-----------|-----------|-------|
| United States | 93.4% | 82.7% | 65.8% | 47.9% |
| China | 0.8% | 2.6% | 12.2% | 24.2% |
| France | 3.0% | 6.8% | 11.5% | 10.6% |
| Turkey (Türkiye) | 0.0% | 0.6% | 3.3% | 12.3% |
| Canada | 2.8% | 7.2% | 7.2% | 5.0% |


**Shift in share (2000-2009 → 2020+):**

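| Country | 2000-2009 | 2020+ | Change (pp) |
|---------|-----------|-------|-------------|
| United States | 82.7% | 47.9% | -34.8 |
| China | 2.6% | 24.2% | +21.6 |
| France | 6.8% | 10.6% | +3.8 |
| Turkey (Türkiye) | 0.6% | 12.3% | +11.7 |
| Canada | 7.2% | 5.0% | -2.2 |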


**Trend test (Spearman ρ: share vs year):**

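| Country | Spearman ρ | p-value | Trend |
|---------|------------|---------|-------|
| United States | -0.963 | < 0.001 | ↓ |
| China | 0.876 | < 0.001 | ↑ |
| France | 0.546 | < 0.001 | ↑ |
| Turkey (Türkiye) | 0.915 | < 0.001 | ↑ |
| Canada | -0.123 | 0.496 | ↓ |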



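**Interpretation:** Spearman ρ measures monotonic association between year and country share. Positive values indicate increasing share over time; negative values indicate declining share. Trend arrows reflect only the sign of ρ (|ρ| > 0.1), not statistical significance, so an arrow paired with a large p-value should be read as no clear trend. Trends are descriptive and do not control for compositional shifts (e.g., changes in phase/condition mix over time).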
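
The trend test can be sketched without SciPy as the Pearson correlation of ranks. The example below uses made-up yearly counts; `average_ranks` and `spearman_rho` are hypothetical helpers, not project code. It also compares two ways of handling a year in which the country has no trials: dropping the year, which is what the inner merge in the cell above does, or counting it as a 0% share. The two choices can give different ρ values.

```python
import math

def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)

# Made-up data: the country has no trials in 2016
years = [2015, 2016, 2017, 2018, 2019, 2020]
country_counts = {2015: 2, 2017: 5, 2018: 9, 2019: 14, 2020: 20}
yearly_totals = {year: 100 for year in years}

# Variant 1: years without trials are dropped (inner-merge behaviour)
kept_years = [y for y in years if y in country_counts]
shares_dropped = [country_counts[y] / yearly_totals[y] * 100 for y in kept_years]
rho_dropped = spearman_rho(kept_years, shares_dropped)

# Variant 2: years without trials count as a 0% share
shares_filled = [country_counts.get(y, 0) / yearly_totals[y] * 100 for y in years]
rho_filled = spearman_rho(years, shares_filled)

print(f"rho (zero years dropped): {rho_dropped:.3f}")
print(f"rho (zero years as 0%):   {rho_filled:.3f}")
```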


---

## 6. Summary & Implications

In [14]:
# ============================================================
# 6.1 Summary: Answers to Research Questions
# ============================================================

# Get key metrics for summary
top_country_name = df_country.iloc[0]['primary_country']
top_country_trials = df_country.iloc[0]['n_trials']
second_country_name = df_country.iloc[1]['primary_country']
second_country_trials = df_country.iloc[1]['n_trials']

# Phase 3 median sites
p3_median = phase_summary.loc[phase_summary['phase_group'] == 'Phase 3', 'median_sites'].values
p3_median_val = int(p3_median[0]) if len(p3_median) > 0 else 'N/A'

display(Markdown(f"""
### Answers to Research Questions

---

**Q: How are clinical trials distributed globally?**

Clinical trial activity is highly concentrated:
- Top 5 countries account for {top5_share:.0f}% of trials with location data
- {top_country_name} leads with {top_country_trials:,} trials; {second_country_name} follows with {second_country_trials:,}
- {pct_single:.0f}% of trials operate at a single site; {pct_multinational:.0f}% span multiple countries

---

**Q: Are there regional specializations in certain therapeutic areas?**

Yes. The χ² test confirms non-random country × condition distribution (Cramér's V = {cramers_v:.2f}).
Location Quotients identify relative specializations—countries where a condition's trial share exceeds the global baseline.

See Section 3.2 for specific condition-country pairs with LQ > 1.5.

---

**Q: How does site complexity vary by phase?**

Site counts stay low in early phases and peak in Phase 3:
- Phase 1: predominantly single-site
- Phase 3: median {p3_median_val} sites, with the highest multinational share
- Industry-sponsored Phase 3 trials use more sites than non-industry ones (Mann-Whitney significant with {effect_label} effect)

---

### Actionable Implications

| Use Case | Recommendation |
|----------|----------------|
| **Site selection** | Leverage country-specific condition strengths (LQ > 1.5) for faster recruitment in specialized areas |
| **Portfolio planning** | Account for geographic concentration risk; diversify across top-10 countries for global reach |
| **Competitive intelligence** | Monitor temporal shifts—emerging regions may offer cost/speed advantages |
| **Feasibility assessment** | Plan for substantially more sites in Phase 3 than in Phase 1 (see Section 4.1 for medians and means); industry-sponsored Phase 3 trials tend to use larger footprints |

---

### Limitations

- **Registry bias:** ClinicalTrials.gov is US-centric; non-US trials may be under-represented
- **Location granularity:** "Primary country" is modal, may overstate concentration for multinationals ({pct_multinational_all:.0f}% of trials)
- **Condition labels:** Free-text entries; LQ reflects registry terms, not standardized taxonomy
- **Double-counting:** Trials map to {mean_conds_per_trial:.1f} conditions on average, which inflates condition-level counts
- **No causality:** Associations describe registry patterns, not determinants of trial placement
"""))


### Answers to Research Questions

---

**Q: How are clinical trials distributed globally?**

Clinical trial activity is highly concentrated:
- Top 5 countries account for 58% of trials with location data
- United States leads with 31,791 trials; China follows with 7,780
- 73% of trials operate at a single site; 8% span multiple countries

---

**Q: Are there regional specializations in certain therapeutic areas?**

Yes. The χ² test confirms non-random country × condition distribution (Cramér's V = 0.12).
Location Quotients identify relative specializations—countries where a condition's trial share exceeds the global baseline.

See Section 3.2 for specific condition-country pairs with LQ > 1.5.

---

**Q: How does site complexity vary by phase?**

Site counts stay low in early phases and peak in Phase 3:
- Phase 1: predominantly single-site
- Phase 3: median 4 sites, with the highest multinational share
- Industry-sponsored Phase 3 trials use more sites than non-industry ones (Mann-Whitney significant with large effect)

---

### Actionable Implications

| Use Case | Recommendation |
|----------|----------------|
| **Site selection** | Leverage country-specific condition strengths (LQ > 1.5) for faster recruitment in specialized areas |
| **Portfolio planning** | Account for geographic concentration risk; diversify across top-10 countries for global reach |
| **Competitive intelligence** | Monitor temporal shifts—emerging regions may offer cost/speed advantages |
| **Feasibility assessment** | Plan for substantially more sites in Phase 3 than in Phase 1 (see Section 4.1 for medians and means); industry-sponsored Phase 3 trials tend to use larger footprints |

---

### Limitations

- **Registry bias:** ClinicalTrials.gov is US-centric; non-US trials may be under-represented
- **Location granularity:** "Primary country" is modal, may overstate concentration for multinationals (8% of trials)
- **Condition labels:** Free-text entries; LQ reflects registry terms, not standardized taxonomy
- **Double-counting:** Trials map to 1.8 conditions on average, which inflates condition-level counts
- **No causality:** Associations describe registry patterns, not determinants of trial placement


---

## Cleanup

In [15]:
# ============================================================
# Close database connection
# ============================================================

conn.close()
print("Database connection closed.")

Database connection closed.
