# Analyzing Performance at the 2025 SF Marathon: Terrain, Pace, and Heart Rate

This notebook explores my performance at the 2025 San Francisco Marathon. The data is pulled from Strava records of my 26.2-mile race. This exploratory analysis examines how course structure and terrain (elevation, distance segments, and aid stations) relate to performance metrics such as pace and heart rate. The goal is to draw insights to guide future training and to serve as a reference if I run this marathon again.

The analysis is done using SQL and Python (pandas, matplotlib) in a Jupyter notebook. Raw data (TCX and GPX) was cleaned in separate Python scripts. 

## 1. Overview of the Data 

In [1]:
# Data, SQL, visualization imports
import sqlite3
import pandas as pd
import numpy as np
from scipy.stats import pearsonr
import statsmodels.api as sm
import matplotlib.pyplot as plt
from geopy.distance import geodesic
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go

In [2]:
# Connect to SQLite database, display tables and columns
con = sqlite3.connect("../data/marathon_data.db") 
cur = con.cursor()

# Table overview
tables = pd.read_sql_query("SELECT name FROM sqlite_master WHERE type='table';", con)
print("Tables:", tables['name'].tolist())

# Show structure for each table
for table in tables['name']:
    print(f"\n=== {table.upper()} ===")
    cols = pd.read_sql(f"PRAGMA table_info({table});", con)
    for _, row in cols.iterrows():
        print(f"  {row['name']}: {row['type']}")

Tables: ['tcx_data', 'official_route_data', 'aid_stations_data']

=== TCX_DATA ===
  time: TEXT
  latitude: REAL
  longitude: REAL
  elevation: REAL
  distance: REAL
  heart_rate: INTEGER

=== OFFICIAL_ROUTE_DATA ===
  latitude: REAL
  longitude: REAL
  elevation: REAL

=== AID_STATIONS_DATA ===
  name: TEXT
  latitude: REAL
  longitude: REAL
  type: TEXT


The database contains three tables: `tcx_data`, `official_route_data`, and `aid_stations_data`. The names and data types of the columns in each are as follows:

- `tcx_data`: data from my race recorded on my Apple Watch
    - __time__: timestamp in UTC for each row entry; TEXT (convertible to date-time)
    - __latitude__: GPS latitude in decimal degrees (WGS84 coordinate system); REAL
    - __longitude__: GPS longitude in decimal degrees (WGS84 coordinate system); REAL
    - __elevation__: altitude above sea level, in meters; REAL
    - __distance__: cumulative distance covered since the start, in meters; REAL
    - __heart_rate__: runner’s heart rate at that moment, in beats per minute (BPM); INTEGER

- `official_route_data`: official route data provided by SFM organizers
    - __latitude__: GPS latitude in decimal degrees (WGS84 coordinate system); REAL
    - __longitude__: GPS longitude in decimal degrees (WGS84 coordinate system); REAL
    - __elevation__: altitude above sea level, in meters; REAL

- `aid_stations_data`: official aid station waypoints provided by SFM organizers
    - __name__: name of the waypoint; TEXT
    - __latitude__: GPS latitude in decimal degrees (WGS84 coordinate system); REAL
    - __longitude__: GPS longitude in decimal degrees (WGS84 coordinate system); REAL
    - __type__: type of the waypoint; TEXT


## 2. Data Cleaning and Processing



Before analysis, the raw data was inspected and adjusted to improve consistency and interpretability:

- **Distance units:** Converted from meters to kilometers for readability.  
- **Route validation:** Compared the recorded elevation profile against the official Strava course to confirm alignment.  
- **Official course distance:** Since the official data only contained coordinates and elevation, distance in kilometers was calculated in Python for direct comparison.  
- **Data trimming:** Rows prior to the official start line were removed to ensure analyses begin from the race’s official start time.  
- **Elevation cleaning:** Short-term GPS anomalies and outliers (including a ~250 m spike) were identified using a rolling-baseline method. Points that deviated substantially from the local trend were marked as missing and then filled via linear interpolation, preserving the overall hill profile while removing unrealistic spikes.  
- **Split calculation:** Per-kilometer average pace and heart rate were computed by aggregating data into kilometer splits, providing a clearer view of discrete segment performance patterns.  
- **Rolling averages:** A **rolling 1km pace** was derived by calculating elapsed time over the preceding kilometer of distance, smoothing out short-term fluctuations and approximating how fitness apps typically present "live pace." Similarly, a **rolling 30-second heart rate** (smoothed HR over a 30-second time window) was calculated to analyze cardiovascular response patterns throughout the race.  
- **Heart rate zones:** Using the rolling heart rate, each data point was classified into one of five intensity zones based on Apple Watch’s in-device classification (Zone 1: ≤138 bpm, Zone 2: 139–151 bpm, Zone 3: 152–165 bpm, Zone 4: 166–179 bpm, Zone 5: ≥180 bpm). These zones reflect observed cardiovascular effort rather than age-predicted maximums, enabling analysis of effort distribution and physiological load across the course.  
- **Course “valves,” aid stations, and truncated recording:** Certain sections of the route were temporarily rerouted for traffic or safety, moving runners slightly off the main path while keeping the total distance essentially unchanged. Aid stations were not always precisely at their official locations. The watch stopped recording at the exact marathon distance before the finish line, which slightly truncates the final segment. These factors can cause minor GPS offsets or pace fluctuations and explain why the recorded course map may differ slightly from the official route.  
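
As a small illustration of the heart rate zone boundaries in the list above, the sketch below classifies a few made-up values with `pd.cut`. The sample values are hypothetical, and each upper bound is treated as inclusive (so 138 bpm falls in Zone 1). Because a rolling average produces fractional values, a reading such as 138.4 bpm lands in Zone 2 under this scheme.

```python
import numpy as np
import pandas as pd

# Hypothetical smoothed heart rate samples (bpm), chosen to sit near zone edges
sample_hr = pd.Series([120, 138, 138.4, 151, 151.5, 165, 179, 179.6, 180])

# Upper bounds are inclusive (right=True): (0, 138], (138, 151], ...
zone_edges = [0, 138, 151, 165, 179, np.inf]
zone_labels = [1, 2, 3, 4, 5]

sample_zones = pd.cut(sample_hr, bins=zone_edges, labels=zone_labels, right=True)
zone_table = pd.DataFrame({"hr_bpm": sample_hr, "zone": sample_zones})
print(zone_table)
```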


In [19]:
# Query tcx data with pre-processing

# Query tcx race data with distance in km
tcx_query = """
SELECT
    time,
    latitude,
    longitude,
    elevation,
    distance,
    distance / 1000.0 AS distance_km,
    heart_rate,
    FLOOR(distance / 1000.0) AS km_split
FROM tcx_data 
WHERE (
    time >= '2025-07-27 12:35:29'
    AND elevation <= 100
)
ORDER BY time
"""

tcx_df = pd.read_sql(tcx_query, con)

# Create 1km splits with pace calculation in SQL
km_split_query = """
WITH km_split_table AS (
    SELECT
        CAST(FLOOR(distance / 1000.0) AS INT) AS km_split,
        MIN(time) AS start_time,
        MAX(time) AS end_time,
        MIN(distance) AS start_distance,
        MAX(distance) AS end_distance,
        AVG(elevation) AS avg_elevation,
        AVG(heart_rate) AS avg_heart_rate
    FROM tcx_data
    WHERE time >= '2025-07-27 12:35:29'
    GROUP BY km_split
)
SELECT
    km_split,
    start_time,
    end_time,
    start_distance,
    end_distance,
    (end_distance - start_distance) / 1000.0 AS actual_distance_km,
    avg_elevation,
    avg_heart_rate,
    (strftime('%s', end_time) - strftime('%s', start_time)) AS time_diff_seconds,
    CASE 
        WHEN (end_distance - start_distance) > 0 
        THEN ((strftime('%s', end_time) - strftime('%s', start_time)) / 60.0) / ((end_distance - start_distance) / 1000.0)
        ELSE NULL 
    END AS pace
FROM km_split_table
ORDER BY km_split
"""

km_split_df = pd.read_sql(km_split_query, con)

# Format pace in Python 
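def format_pace_value(pace):
    """Format a pace in decimal minutes per km as 'm:ss'."""
    if pace is None or np.isnan(pace):
        return "N/A"
    # Round to whole seconds first so that e.g. 7.9999 becomes "8:00", not "7:60"
    minutes, seconds = divmod(int(round(pace * 60)), 60)
    return f"{minutes}:{seconds:02d}"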

km_split_df['pace_formatted'] = km_split_df['pace'].apply(format_pace_value)

km_split_df.tail()

| | km_split | start_time | end_time | start_distance | end_distance | actual_distance_km | avg_elevation | avg_heart_rate | time_diff_seconds | pace | pace_formatted |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 38 | 38 | 2025-07-27 17:59:01 | 2025-07-27 18:08:09 | 38000.8 | 38998.4 | 0.9976 | 12.792647 | 152.049632 | 548 | 9.155306 | 9:09 |
| 39 | 39 | 2025-07-27 18:08:10 | 2025-07-27 18:16:41 | 39000.7 | 39999.5 | 0.9988 | 4.502745 | 153.14902 | 511 | 8.526899 | 8:32 |
| 40 | 40 | 2025-07-27 18:16:42 | 2025-07-27 18:24:29 | 40001.7 | 40997.9 | 0.9962 | 5.005172 | 160.594828 | 467 | 7.813023 | 7:49 |
| 41 | 41 | 2025-07-27 18:24:30 | 2025-07-27 18:32:04 | 41000.3 | 41999.4 | 0.9991 | 4.665639 | 163.396476 | 454 | 7.573483 | 7:34 |
| 42 | 42 | 2025-07-27 18:32:05 | 2025-07-27 18:33:52 | 42002.6 | 42239.4 | 0.2368 | 5.027778 | 170.444444 | 107 | 7.530968 | 7:32 |
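
The pace column can be checked by hand. The short sketch below recomputes the km 38 split from the values in the output above (548 s over 0.9976 km). The variable names are illustrative only.

```python
# Values copied from the km 38 row of the split output
time_diff_seconds = 548
actual_distance_km = 0.9976

# Pace in decimal minutes per kilometer
pace_min_per_km = (time_diff_seconds / 60) / actual_distance_km

# Convert to m:ss by rounding to whole seconds first
minutes, seconds = divmod(round(pace_min_per_km * 60), 60)
pace_label = f"{minutes}:{seconds:02d}"
print(f"{pace_min_per_km:.3f} min/km -> {pace_label}")
```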


In [20]:
# Load and process official route data
def calculate_route_distances(route_df):
    """Calculate cumulative distances for route coordinates."""
    distances = [0.0]
    for i in range(1, len(route_df)):
        start_coords = (route_df.iloc[i-1]['latitude'], route_df.iloc[i-1]['longitude'])
        end_coords = (route_df.iloc[i]['latitude'], route_df.iloc[i]['longitude'])
        distances.append(geodesic(start_coords, end_coords).km)
    
    route_df = route_df.copy()
    route_df['segment_distance'] = distances
    route_df['cumulative_distance_km'] = route_df['segment_distance'].cumsum()
    return route_df

def map_stations_to_route(stations_df, route_df):
    """Map aid stations to their closest points on the official route."""
    def find_closest_route_distance(station_lat, station_lon):
        station_coords = (station_lat, station_lon)
        distances_to_route = [
            geodesic(station_coords, (row['latitude'], row['longitude'])).meters 
            for _, row in route_df.iterrows()
        ]
        closest_idx = np.argmin(distances_to_route)
        return route_df.iloc[closest_idx]['cumulative_distance_km']
    
    stations_df = stations_df.copy()
    stations_df['distance_from_start_km'] = stations_df.apply(
        lambda row: find_closest_route_distance(row['latitude'], row['longitude']), 
        axis=1
    )
    return stations_df.sort_values('distance_from_start_km').reset_index(drop=True)

# Load and process data
official_route_df = pd.read_sql("SELECT * FROM official_route_data;", con)
official_route_df = calculate_route_distances(official_route_df)

aid_stations_df = pd.read_sql("SELECT * FROM aid_stations_data;", con)
aid_stations_df = map_stations_to_route(aid_stations_df, official_route_df)
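
The nearest-point lookup above calls `geodesic` once for every station and route point pair. If that becomes slow, one lighter option is a vectorized haversine distance in NumPy. The sketch below is an alternative, not the method used in this notebook: it assumes a spherical Earth, so distances can differ slightly from `geodesic`, which is usually acceptable when only the closest point matters. The helper names and the toy coordinates are hypothetical.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; inputs in decimal degrees, broadcastable."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


def nearest_route_index(station_lat, station_lon, route_lat, route_lon):
    """Index of the route point closest to one station (hypothetical helper)."""
    d = haversine_km(station_lat, station_lon,
                     np.asarray(route_lat), np.asarray(route_lon))
    return int(np.argmin(d))


# Toy route: three made-up points heading north along one meridian
route_lat = [37.7900, 37.8000, 37.8100]
route_lon = [-122.4000, -122.4000, -122.4000]
idx = nearest_route_index(37.8010, -122.4005, route_lat, route_lon)
print(idx)
```

Either approach matches a station to the single closest route point, so on out-and-back sections a station could in principle be matched to the wrong pass of the course.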

In [21]:
# Create plots to compare my race to official route
fig = make_subplots(
    rows=1, cols=2,
    subplot_titles=('Elevation Comparison', 'Route Map Comparison'),
    horizontal_spacing=0.15,
    specs=[[{"secondary_y": False}, {"secondary_y": False}]]
)

# First subplot: Elevation comparison (left)
fig.add_trace(
    go.Scatter(
        x=tcx_df['distance_km'],
        y=tcx_df['elevation'],
        mode='lines',
        name='My Race',
        line=dict(color='blue', width=2),
        opacity=0.7,
        hovertemplate='Distance: %{x:.2f} km<br>My Elevation: %{y:.1f} m<extra></extra>'
    ),
    row=1, col=1
)

fig.add_trace(
    go.Scatter(
        x=official_route_df['cumulative_distance_km'],
        y=official_route_df['elevation'],
        mode='lines',
        name='Official Route',
        line=dict(color='red', width=2, dash='dash'),
        opacity=0.7,
        hovertemplate='Distance: %{x:.2f} km<br>Official Elevation: %{y:.1f} m<extra></extra>'
    ),
    row=1, col=1
)

# Second subplot: Route map (right)
fig.add_trace(
    go.Scatter(
        x=tcx_df['longitude'],
        y=tcx_df['latitude'],
        mode='lines',
        name='My Race Route',
        line=dict(color='blue', width=3),
        opacity=0.7,
        hovertemplate='Lat: %{y:.5f}<br>Lon: %{x:.5f}<br>My Route<extra></extra>',
        showlegend=False
    ),
    row=1, col=2
)

fig.add_trace(
    go.Scatter(
        x=official_route_df['longitude'],
        y=official_route_df['latitude'],
        mode='lines',
        name='Official Route',
        line=dict(color='red', width=2, dash='dash'),
        opacity=0.7,
        hovertemplate='Lat: %{y:.5f}<br>Lon: %{x:.5f}<br>Official Route<extra></extra>',
        showlegend=False
    ),
    row=1, col=2
)

# Add aid stations to map (right)
fig.add_trace(
    go.Scatter(
        x=aid_stations_df['longitude'],
        y=aid_stations_df['latitude'],
        mode='markers',
        name='Aid Stations',
        marker=dict(color='green', size=8, symbol='circle'),
        hovertemplate='%{text}<br>Lat: %{y:.5f}<br>Lon: %{x:.5f}<extra></extra>',
        text=aid_stations_df['name'] if 'name' in aid_stations_df.columns else 'Aid Station',
        showlegend=False
    ),
    row=1, col=2
)

# Update layout and axis labels
fig.update_layout(
    title_text='Marathon Analysis: Elevation and Route Comparison',
    height=500,
    width=1200,
    showlegend=True
)
fig.update_xaxes(title_text="Distance (km)", row=1, col=1)
fig.update_xaxes(title_text="Longitude", row=1, col=2)
fig.update_yaxes(title_text="Elevation (m)", row=1, col=1)
fig.update_yaxes(title_text="Latitude", row=1, col=2, scaleanchor="x2", scaleratio=1)

fig.show()

In [22]:
# Remove elevation outliers using rolling baseline and interpolation
baseline_window = 50
tcx_df['elev_baseline'] = tcx_df['elevation'].rolling(window=baseline_window, center=True, min_periods=1).mean()

deviation_threshold = 5  # meters
tcx_df.loc[(tcx_df['elevation'] - tcx_df['elev_baseline']).abs() > deviation_threshold, 'elevation'] = np.nan
tcx_df['elevation'] = tcx_df['elevation'].interpolate()
tcx_df.drop(columns=['elev_baseline'], inplace=True)

# Create elevation comparison plot
fig = go.Figure()

fig.add_trace(go.Scatter(
    x=tcx_df['distance_km'],
    y=tcx_df['elevation'],
    mode='lines',
    name='My Race Elevation',
    line=dict(color='blue', width=2),
    hovertemplate='Distance: %{x:.2f} km<br>Elevation: %{y:.1f} m<extra></extra>'
))

fig.add_trace(go.Scatter(
    x=official_route_df['cumulative_distance_km'],
    y=official_route_df['elevation'],
    mode='lines',
    name='Official Route Elevation',
    line=dict(color='red', width=2, dash='dash'),
    opacity=0.7,
    hovertemplate='Distance: %{x:.2f} km<br>Official Elevation: %{y:.1f} m<extra></extra>'
))

fig.update_layout(
    title='Elevation Profile: Recorded vs Official Route (After Outlier Removal)',
    xaxis_title='Distance (km)',
    yaxis_title='Elevation (m)',
    width=900,
    height=500,
    hovermode='x'
)

fig.show()
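
A rolling mean baseline can itself be pulled toward a large spike, which may cause valid neighboring points to be flagged. A rolling median is a common, more robust alternative. The sketch below shows the idea on synthetic data; it is not the method used in the cell above, and the window and threshold values simply mirror that cell as an assumption.

```python
import numpy as np
import pandas as pd

# Synthetic elevation profile: a gentle 0-20 m ramp with one injected spike
elevation = pd.Series(np.linspace(0.0, 20.0, 200))
elevation.iloc[100] = 250.0  # made-up GPS spike

# Rolling median baseline: barely moved by a single extreme value
baseline = elevation.rolling(window=50, center=True, min_periods=1).median()

# Mask points far from the baseline, then fill them by linear interpolation
threshold_m = 5.0
is_outlier = (elevation - baseline).abs() > threshold_m
cleaned = elevation.mask(is_outlier).interpolate()

print(int(is_outlier.sum()), round(cleaned.iloc[100], 2))
```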


In [23]:
# Find rolling pace and heart rate for cleaner visualization

# Find rolling average pace over 1 km
dist_m = tcx_df['distance'].to_numpy()
tcx_df['time'] = pd.to_datetime(tcx_df['time'])
tcx_df['dt'] = tcx_df['time'].diff().dt.total_seconds().fillna(0)

time_s = tcx_df['time'].astype('int64') // 1_000_000_000  # seconds since epoch

pace_1km = np.full(len(dist_m), np.nan)

window_m = 1000  # 1 km

for i in range(len(dist_m)):
    target_dist = dist_m[i] - window_m
    start_idx = np.searchsorted(dist_m, target_dist, side='right') - 1
    start_idx = max(0, start_idx)
    
    dt = time_s[i] - time_s[start_idx]
    dx_km = (dist_m[i] - dist_m[start_idx]) / 1000
    
    if dx_km > 0:
        pace_1km[i] = (dt / 60) / dx_km  # min/km

tcx_df['pace_1km'] = pace_1km
tcx_df['pace_1km_str'] = tcx_df['pace_1km'].apply(format_pace_value)


# Find rolling heart rate over 30 seconds
# Remove duplicate timestamps, keeping the first occurrence
tcx_df_clean = tcx_df.drop_duplicates(subset='time', keep='first')
tcx_df_temp = tcx_df_clean.set_index('time')
hr_30s_clean = (
    tcx_df_temp['heart_rate']
    .rolling('30s', min_periods=5, center=True)
    .mean()
)

# Map back to original DataFrame
tcx_df['hr_30s'] = tcx_df['time'].map(hr_30s_clean.to_dict())

# Define heart rate zone bins and labels
hr_bins = [0, 138, 151, 165, 179, np.inf]  # Upper bounds for each zone
hr_labels = [1, 2, 3, 4, 5]

# Assign zones based on rolling 30s heart rate
tcx_df['hr_zone'] = pd.cut(tcx_df['hr_30s'], bins=hr_bins, labels=hr_labels, right=True)


tcx_df.head()


| | time | latitude | longitude | elevation | distance | distance_km | heart_rate | km_split | dt | pace_1km | pace_1km_str | hr_30s | hr_zone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2025-07-27 12:35:29 | 37.794629 | -122.394148 | 5.0 | 14.3 | 0.0143 | 116 | 0.0 | 0.0 | | | 118.0 | 1 |
| 1 | 2025-07-27 12:35:30 | 37.794641 | -122.394156 | 4.6 | 15.9 | 0.0159 | 116 | 0.0 | 1.0 | 10.416667 | 10:25 | 118.307692 | 1 |
| 2 | 2025-07-27 12:35:33 | 37.794641 | -122.394156 | 4.6 | 21.6 | 0.0216 | 116 | 0.0 | 3.0 | 9.13242 | 9:08 | 118.625 | 1 |
| 3 | 2025-07-27 12:35:35 | 37.794641 | -122.394156 | 4.6 | 27.0 | 0.027 | 116 | 0.0 | 2.0 | 7.874016 | 7:52 | 118.666667 | 1 |
| 4 | 2025-07-27 12:35:36 | 37.794716 | -122.394173 | 4.6 | 28.3 | 0.0283 | 116 | 0.0 | 1.0 | 8.333333 | 8:20 | 118.684211 | 1 |
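
The rolling-pace loop above can also be written without an explicit Python loop, because `np.searchsorted` accepts an array of targets. The sketch below is one possible vectorized version, run on synthetic constant-speed data. The function name is hypothetical, and it assumes the distance array is non-decreasing.

```python
import numpy as np


def rolling_pace_min_per_km(dist_m, time_s, window_m=1000.0):
    """Pace over the trailing `window_m` metres for each sample (hypothetical helper).

    Assumes `dist_m` is non-decreasing. Samples with less than a full window
    behind them fall back to the first sample, as in the loop version.
    """
    dist_m = np.asarray(dist_m, dtype=float)
    time_s = np.asarray(time_s, dtype=float)
    # One searchsorted call finds the window start for every sample
    start_idx = np.searchsorted(dist_m, dist_m - window_m, side="right") - 1
    start_idx = np.clip(start_idx, 0, None)
    dt_min = (time_s - time_s[start_idx]) / 60.0
    dx_km = (dist_m - dist_m[start_idx]) / 1000.0
    # Leave NaN wherever no distance was covered (e.g. the first sample)
    pace = np.full(dist_m.shape, np.nan)
    np.divide(dt_min, dx_km, out=pace, where=dx_km > 0)
    return pace


# Synthetic run: one sample per second at a steady 3 m/s for 10 minutes
time_s = np.arange(0, 600)
dist_m = 3.0 * time_s
pace = rolling_pace_min_per_km(dist_m, time_s)
print(pace[0], round(pace[-1], 3))
```

At a steady 3 m/s every defined value should equal 1000 / 3 / 60 min/km, which makes the synthetic case an easy sanity check before applying the function to real data.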


## 3. Overall Trends

The following graph illustrates my rolling pace (1km window) and elevation across the 26.2-mile race. 

In [8]:
# Create dual-axis plot: pace and elevation vs distance
fig = go.Figure()

# Add pace trace (1km rolling average)
fig.add_trace(go.Scatter(
    x=tcx_df['distance_km'],
    y=tcx_df['pace_1km'],
    mode='lines',
    name='Pace (1km rolling avg)',
    line=dict(color='red', width=2),
    hovertemplate='Distance: %{x:.2f}km<br>Pace: %{customdata}<extra></extra>',
    customdata=tcx_df['pace_1km_str'],
    yaxis='y'
))

# Add elevation trace
fig.add_trace(go.Scatter(
    x=tcx_df['distance_km'],
    y=tcx_df['elevation'],
    mode='lines',
    name='Elevation (m)',
    line=dict(color='blue', width=1),
    opacity=0.7,
    hovertemplate='Distance: %{x:.2f}km<br>Elevation: %{y:.0f}m<extra></extra>',
    yaxis='y2'
))

# Add aid stations with labels
for i, (_, station) in enumerate(aid_stations_df.iterrows()):
    # Vertical line
    fig.add_vline(
        x=station['distance_from_start_km'],
        line=dict(color='green', width=2, dash='dot'),
        opacity=0.7
    )
    
    # Distance label
    fig.add_annotation(
        x=station['distance_from_start_km'],
        y=1, yref="paper",
        text=f"{station['distance_from_start_km']:.1f}",
        showarrow=False,
        font=dict(size=10, color='green'),
        yanchor="bottom"
    )

# Add legend entry for aid stations
fig.add_trace(go.Scatter(
    x=[None], y=[None],
    mode='lines',
    name='Aid Stations',
    line=dict(color='green', width=2, dash='dot'),
    showlegend=True
))

# Update layout
fig.update_layout(
    title='Marathon Performance: Pace vs Elevation Profile with Aid Stations',
    xaxis_title='Distance (km)',
    yaxis=dict(title='Pace (min/km)', side='left'),
    yaxis2=dict(title='Elevation (m)', side='right', overlaying='y'),
    width=1000,
    height=500,
    hovermode='x unified',
    legend=dict(orientation="h", yanchor="top", y=-0.15, xanchor="center", x=0.5)
)

fig.show()

The visualization above highlights several key performance patterns:

- **Start-line anomaly:** An initial pace spike appears within the first 0.2 km, reflecting bottleneck congestion near the starting line. 
- **Impact of terrain:** Pace visibly tracks elevation change. Uphill segments produce sustained slowdowns, while downhill segments allow partial recovery. 
- **Effects of fatigue:** Although earlier downhill sections produce clear accelerations, the effect diminishes in the latter half of the race. This suggests that fatigue not only leads to slower pace but also reduces responsiveness to favorable terrain (e.g., knee stress reducing ability to capitalize on downhills). 
- **End-of-race acceleration:** Despite late-race fatigue, pace quickens steadily over the last 5 km, from about 9:09/km at km 38 to about 7:32/km in the final partial split. This reflects a clear end-of-race push as the finish line draws closer. 
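
The terrain observation could be checked numerically by correlating net elevation change per kilometer with split pace. The sketch below shows the calculation on made-up samples, interpolating elevation and elapsed time at each kilometer mark. The data and variable names are hypothetical, and with so few points the result only illustrates the mechanics.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

# Made-up samples: cumulative distance (m), elevation (m), elapsed time (s)
samples = pd.DataFrame({
    "distance": [0, 500, 1000, 1500, 2000, 2500, 3000, 3500, 4000],
    "elevation": [5, 5, 5, 15, 25, 20, 15, 15, 15],
    "elapsed_s": [0, 150, 300, 480, 650, 780, 920, 1070, 1220],
})

# Interpolate elevation and elapsed time at every whole-kilometer mark
km_marks_m = np.arange(0, samples["distance"].max() + 1, 1000)
elev_at_km = np.interp(km_marks_m, samples["distance"], samples["elevation"])
time_at_km = np.interp(km_marks_m, samples["distance"], samples["elapsed_s"])

# Net climb and pace for each full kilometer
elev_change_m = np.diff(elev_at_km)
pace_min_per_km = np.diff(time_at_km) / 60.0

r, _ = pearsonr(elev_change_m, pace_min_per_km)
print(elev_change_m, pace_min_per_km.round(2), round(r, 3))
```

A positive `r` here means that kilometers with more climbing were slower. On real splits the usual caveat about autocorrelated, sequential data applies.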




In [24]:
# Count samples per HR zone per km split (a proxy for time; samples are roughly 1-3 s apart)
zone_counts = (
    tcx_df.groupby(['km_split', 'hr_zone'], observed=False)
    .size()
    .reset_index(name='count')
)

# Convert count to proportion within each km
zone_counts_total = zone_counts.groupby('km_split')['count'].transform('sum')
zone_counts['proportion'] = zone_counts['count'] / zone_counts_total

# Create stacked bar chart
fig = px.bar(
    zone_counts,
    x='km_split',
    y='proportion',
    color='hr_zone',
    color_discrete_sequence=['#d4f0f0','#a0d0d0','#70a0c0','#4070a0','#103060'],  # Zone 1-5
    labels={'proportion':'Proportion of Time', 'km_split':'Kilometer', 'hr_zone':'Heart Rate Zone'},
    title='Heart Rate Zone Distribution per Kilometer (Stacked Proportions)'
)

fig.update_layout(
    barmode='stack',
    xaxis=dict(dtick=1),
    width=1000,
    height=500
)

fig.show()
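
Row counts approximate time only when samples are evenly spaced, and the `dt` column in the earlier `tcx_df.head()` output shows gaps of 1 to 3 seconds. One alternative is to weight each sample by `dt` so the proportions reflect elapsed seconds. The sketch below uses toy data whose column names mirror the notebook; it is an illustration, not the calculation behind the chart above.

```python
import pandas as pd

# Toy samples: km split, HR zone, and seconds since the previous sample
toy = pd.DataFrame({
    "km_split": [0, 0, 0, 1, 1, 1],
    "hr_zone": [1, 1, 2, 2, 3, 3],
    "dt": [1.0, 3.0, 4.0, 2.0, 1.0, 1.0],
})

# Sum elapsed seconds per (km, zone) instead of counting rows
zone_seconds = (
    toy.groupby(["km_split", "hr_zone"])["dt"]
    .sum()
    .reset_index(name="seconds")
)

# Normalize within each kilometer
zone_seconds["proportion"] = (
    zone_seconds["seconds"]
    / zone_seconds.groupby("km_split")["seconds"].transform("sum")
)
print(zone_seconds)
```

In kilometer 0 of the toy data, Zone 1 has two of the three rows but only half of the elapsed seconds, which is the kind of difference the weighting is meant to capture.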


In [102]:
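# Bin elapsed race time into 10-minute windows labeled like '0-10 min'.
# NOTE: `time_bin` is not defined in any earlier cell shown here; this is an
# assumed reconstruction based on the label format the helpers below expect.
elapsed_min = (tcx_df['time'] - tcx_df['time'].iloc[0]).dt.total_seconds() / 60
bin_start = (elapsed_min // 10 * 10).astype(int)
bin_labels = [f"{s}-{s + 10} min" for s in sorted(bin_start.unique())]
tcx_df['time_bin'] = pd.Categorical(
    bin_start.map(lambda s: f"{s}-{s + 10} min"), categories=bin_labels, ordered=True
)

# Create violin plot showing heart rate distribution over 10 minute intervals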
fig = go.Figure()

# Format time for better x-axis labels and hover text 
def format_start_time(window_str):
    """Convert '0-10 min' format to '0:00' format (start time only)"""
    if pd.isna(window_str):
        return "Unknown"
    start_end = window_str.replace(' min', '').split('-')
    start_min = int(start_end[0])
    return f"{start_min//60}:{start_min%60:02d}"

def format_time_window(window_str):
    """Convert '0-10 min' format to '0:00-10:00' format"""
    if pd.isna(window_str):
        return "Unknown"
    start_end = window_str.replace(' min', '').split('-')
    start_min, end_min = int(start_end[0]), int(start_end[1])
    return f"{start_min//60}:{start_min%60:02d}-{end_min//60}:{end_min%60:02d}"

x_labels_short = [format_start_time(str(cat)) for cat in tcx_df['time_bin'].cat.categories]
x_labels_full = [format_time_window(str(cat)) for cat in tcx_df['time_bin'].cat.categories]

for window, label_short, label_full in zip(tcx_df['time_bin'].cat.categories, x_labels_short, x_labels_full):
    window_data = tcx_df[tcx_df['time_bin'] == window]['heart_rate'].dropna()
    if len(window_data) > 0:
        fig.add_trace(go.Violin(
            x=[label_short] * len(window_data),
            y=window_data,
            name=f"Window: {label_full}",
            box_visible=False,
            points=False,
            meanline_visible=True,
            showlegend=False,
            fillcolor='rgba(0,0,0,0)',
            line_color='darkblue',
            hovertemplate=f'Time Window: {label_full}<br>Heart Rate: %{{y:.0f}} bpm<extra></extra>'
        ))

# Add horizontal background zones with legend entries (Zone 5 to 1)
zones = [
    {"name": "Zone 5", "min": 180, "max": 190, "color": "rgba(247, 25, 80, 0.35)"}, 
    {"name": "Zone 4", "min": 165, "max": 180, "color": "rgba(247, 153, 30, 0.35)"}, 
    {"name": "Zone 3", "min": 151, "max": 165, "color": "rgba(180, 240, 58, 0.35)"}, 
    {"name": "Zone 2", "min": 138, "max": 151, "color": "rgba(53, 252, 252, 0.35)"}, 
    {"name": "Zone 1", "min": 110, "max": 138, "color": "rgba(54, 124, 245, 0.35)"}
]

for z in zones:
    fig.add_shape(
        type="rect",
        xref="paper", x0=0, x1=1,
        yref="y", y0=z["min"], y1=z["max"],
        fillcolor=z["color"],
        line=dict(width=0),
        layer="below"
    )
    fig.add_trace(go.Scatter(
        x=[None],
        y=[None],
        mode='markers',
        marker=dict(size=10, color=z["color"]),
        name=z["name"]
    ))

# Add min/max horizontal dashed lines 
hr_min = tcx_df['heart_rate'].min()
hr_max = tcx_df['heart_rate'].max()

fig.add_hline(y=hr_min, line=dict(color='grey', width=2, dash='2px'), 
              annotation_text=f"Actual Min: {hr_min} bpm", 
              annotation_position="bottom left",
              annotation=dict(yshift=-15))
fig.add_hline(y=hr_max, line=dict(color='grey', width=2, dash='2px'), 
              annotation_text=f"Actual Max: {hr_max} bpm", 
              annotation_position="top left",
              annotation=dict(yshift=10))

fig.update_layout(
    title='Heart Rate Distribution by 10-Minute Windows',
    xaxis_title='Interval Start Time (hh:mm)',
    yaxis_title='Heart Rate (bpm)',
    width=1200,
    height=600,
    xaxis=dict(
        tickangle=45,
        categoryorder='array',
        categoryarray=x_labels_short
    ),
    yaxis=dict(range=[110, 190])
)

fig.show()

The violin plot shows how the heart rate distribution shifts across successive 10-minute windows relative to the five zones.

## 4. Heart Rate vs Pace Analysis

In [10]:
# Create dual-axis plot: pace and heart rate vs distance
fig = go.Figure()

# Add pace trace (1km rolling average)
fig.add_trace(go.Scatter(
    x=tcx_df['distance_km'],
    y=tcx_df['pace_1km'],
    mode='lines',
    name='Pace (min/km; 1km rolling avg)',
    line=dict(color='red', width=2),
    hovertemplate='Distance: %{x:.2f}km<br>Pace: %{customdata}<extra></extra>',
    customdata=tcx_df['pace_1km_str'],
    yaxis='y'
))

# Add heart rate trace (30s rolling average)
fig.add_trace(go.Scatter(
    x=tcx_df['distance_km'],
    y=tcx_df['hr_30s'],
    mode='lines',
    name='Heart Rate (bpm; 30s rolling avg)',
    line=dict(color='blue', width=1),
    opacity=0.7,
    hovertemplate='Distance: %{x:.2f}km<br>Heart Rate: %{y:.0f}bpm<extra></extra>',
    yaxis='y2'
))


# Update layout
fig.update_layout(
    title='Marathon Performance: Pace vs Heart Rate Profile',
    xaxis_title='Distance (km)',
    yaxis=dict(title='Pace (min/km)', side='left'),
    yaxis2=dict(title='Heart Rate (bpm)', side='right', overlaying='y'),
    width=1000,
    height=500,
    hovermode='x unified',
    legend=dict(orientation="h", yanchor="top", y=-0.15, xanchor="center", x=0.5)
)

fig.show()

The line graph above shows some key insights:

- **HR vs. Pace trends**: Heart rate remains centered around ~150 bpm, while pace shows a gradual slowdown.  
- **Drift over time**: Early in the race, a high heart rate accompanies a fast pace; late in the race, heart rate stays high while pace slows, suggesting fatigue.  
- **End-phase variability**: More jitter near the end due to alternating walk/run efforts.  
- **Finish surge**: HR spikes again with the final acceleration at the race finish.  


In [11]:
# Calculate Pearson correlation between pace and heart rate
valid_data = km_split_df.dropna(subset=['pace', 'avg_heart_rate'])
correlation_coef, p_value = pearsonr(valid_data['pace'], valid_data['avg_heart_rate'])
r_squared = correlation_coef ** 2

fig = go.Figure(go.Scatter(
    x=km_split_df['pace'],
    y=km_split_df['avg_heart_rate'],
    mode='markers', 
    marker=dict(
        size=10,
        color=km_split_df['km_split'],
        colorscale='Viridis',
        showscale=True,
        colorbar=dict(title="Distance (km)")
    ),
    hovertemplate=(
        "KM: %{marker.color}<br>"
        "Pace: %{customdata}<br>"
        "HR: %{y:.0f} bpm<extra></extra>"
    ),
    customdata=km_split_df['pace_formatted'] 
))

fig.update_layout(
    title=f"HR vs Pace (1 km splits)<br><sub>Pearson r = {correlation_coef:.3f}, R² = {r_squared:.3f}, p = {p_value:.4f}</sub>",
    xaxis_title="Pace (min/km)",
    yaxis_title="Heart Rate (bpm)",
    width=900,
    height=600
)

fig.show()

The relationship between pace and heart rate shows a **modest negative correlation** (r = -0.36, p = 0.017). While not especially strong, a few patterns stand out:

- **Triangular distribution pattern**: Early splits cluster in the bottom-left quadrant (faster pace, moderate HR), while late-race splits drift toward the bottom-right (slower pace, similar HR), illustrating the classic marathon fatigue progression.
- **Cardiovascular efficiency decline**: Many splits fall within the **145–155 bpm range**, but earlier kilometers consistently achieved faster paces at similar heart rates, demonstrating how accumulated fatigue reduces running economy and stride efficiency over the race duration.
- **End-race physiological surge**: The final kilometers show a clear heart rate spike (160+ bpm) with improved pace, indicating a final effort that temporarily overcame the accumulated fatigue from the previous 20+ kilometers.  

Overall, this suggests that efficiency (pace at a given HR) declined over the race, with a clear end-race effort standing apart from the trend.
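
One way to put a number on "pace at a given HR" is distance covered per heartbeat. The sketch below applies that idea to the five splits shown in the earlier `km_split_df.tail()` output. The metric name `m_per_beat` is hypothetical, and these rows cover only the finish, so they illustrate the calculation rather than the race-long decline.

```python
import pandas as pd

# Last five splits, copied from the km_split_df.tail() output
tail_splits = pd.DataFrame({
    "km_split": [38, 39, 40, 41, 42],
    "pace": [9.155306, 8.526899, 7.813023, 7.573483, 7.530968],  # min/km
    "avg_heart_rate": [152.049632, 153.14902, 160.594828, 163.396476, 170.444444],
})

# Speed in m/min divided by beats/min gives metres travelled per heartbeat
speed_m_per_min = 1000.0 / tail_splits["pace"]
tail_splits["m_per_beat"] = speed_m_per_min / tail_splits["avg_heart_rate"]
print(tail_splits.round(3))
```

Applied to every split, a downward trend in this value across the race would be consistent with the efficiency decline described above.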

**Statistical caveat:** While Pearson's *r* = –0.36 (p = 0.017) suggests statistical significance, this correlation test assumes independent observations. Given that marathon splits represent sequential time-series data with inherent autocorrelation, the reported correlation strength and significance should be **interpreted cautiously** as they may be inflated.
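
A rough way to gauge how much autocorrelation matters is to estimate the lag-1 autocorrelation of each series and shrink the sample size accordingly. The sketch below uses the common first-order approximation n_eff = n(1 - r1x·r1y) / (1 + r1x·r1y), which assumes AR(1)-like series and is not an exact test. The function names and the synthetic data are hypothetical.

```python
import numpy as np
from scipy import stats


def lag1_autocorr(x):
    """Lag-1 autocorrelation of a 1-D series."""
    x = np.asarray(x, dtype=float)
    return np.corrcoef(x[:-1], x[1:])[0, 1]


def adjusted_pearson(x, y):
    """Pearson r with a p-value based on an effective sample size (approximation)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    r = np.corrcoef(x, y)[0, 1]
    rho = lag1_autocorr(x) * lag1_autocorr(y)
    # Keep n_eff between 3 (so df >= 1) and n (never claim extra information)
    n_eff = float(np.clip(n * (1 - rho) / (1 + rho), 3.0, n))
    t_stat = r * np.sqrt((n_eff - 2) / (1 - r ** 2))
    p_value = 2 * stats.t.sf(abs(t_stat), df=n_eff - 2)
    return r, n_eff, p_value


def ar1_series(phi, n, rng):
    """Synthetic AR(1) series used only for this demonstration."""
    noise = rng.normal(size=n)
    out = np.zeros(n)
    for i in range(1, n):
        out[i] = phi * out[i - 1] + noise[i]
    return out


rng = np.random.default_rng(0)
n = 43  # same number of rows as the km splits (0 through 42)
x = ar1_series(0.8, n, rng)
y = ar1_series(0.8, n, rng)

r, n_eff, p_adjusted = adjusted_pearson(x, y)
r_naive, p_naive = stats.pearsonr(x, y)
print(f"r={r:.3f}, n_eff={n_eff:.1f}, naive p={p_naive:.4f}, adjusted p={p_adjusted:.4f}")
```

With positively autocorrelated series the effective sample size drops below n, so the adjusted p-value is never smaller than the naive one.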


## 5. Heatmap

This map visualizes heart rate along the marathon route, with color indicating intensity.

In [12]:
# Create heatmap
fig = px.scatter_map(
    tcx_df,
    lat="latitude",
    lon="longitude",
    color="heart_rate",
    color_continuous_scale="Turbo",
    size_max=8,
    zoom=12,
    map_style="outdoors"
)

fig.update_traces(
    hovertemplate="Distance: %{customdata[0]:.2f} km<br>" +
                  "Elevation: %{customdata[1]:.1f} m<br>" +
                  "Heart Rate: %{marker.color:.0f} bpm<extra></extra>",
    customdata=tcx_df[['distance_km', 'elevation']].values
)

fig.update_layout(
    title="Heart Rate Heatmap of Race Route",
    margin={"r":0,"t":60,"l":0,"b":0}
)

fig.show()

Heart rate increases are most pronounced on steep uphill segments such as **4 km (Fort Mason Park)** and **18 km (Golden Gate Bridge on Sausalito side)**, illustrating how terrain influences physiological effort. Pace fluctuations align with these climbs, suggesting that hills are a primary driver of fatigue over the course.

A final surge in heart rate and pace occurs near **40 km (Mission Bay to South Beach)**, reflecting the end-of-race effort on the final flat stretch.


## 6. Conclusions

The analysis of marathon GPS and heart rate data highlights several key insights:

- **Terrain effects:** Pace and heart rate respond clearly to uphill and downhill segments, confirming that hills are a primary driver of performance variability.  
- **Cumulative fatigue:** Despite early fast splits, heart rate remains elevated while pace gradually slows, illustrating fatigue over the course.  
- **End-of-race behavior:** A noticeable surge in pace and heart rate occurs near the finish, showing a deliberate final push.  
- **Data handling:** Rolling averages, splits, and basic cleaning allowed meaningful visualizations while mitigating GPS and sampling noise.

Overall, this project demonstrates the ability to ingest, clean, aggregate, and visualize complex time-series and spatial data, providing actionable insights into running performance.


In [13]:
# Close the database connection (run only after all queries above have finished)
con.close()