# Analyzing Performance at the 2025 SF Marathon: Terrain, Pace, and Heart Rate

This notebook explores my performance at the 2025 San Francisco Marathon, using data pulled from the Strava record of my 26.2-mile race. The exploratory analysis focuses on how course structure and terrain (elevation, distance segments, and aid stations) relate to performance metrics such as pace and heart rate. The goal is to draw insights that guide future training and serve as a reference if I run this marathon again.

The analysis is done using SQL and Python (pandas, matplotlib) in a Jupyter notebook. Raw data (TCX and GPX) was cleaned in separate Python scripts. 

## 1. Overview of the Data 

In [None]:
# Data, SQL, visualization imports
import sqlite3
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from geopy.distance import geodesic
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go


In [None]:
# Connect to SQLite database, display tables and columns
con = sqlite3.connect("../data/marathon_data.db") 
cur = con.cursor()

# Table overview
tables = pd.read_sql_query("SELECT name FROM sqlite_master WHERE type='table';", con)
print("Tables:", tables['name'].tolist())

# Show structure for each table
for table in tables['name']:
    print(f"\n=== {table.upper()} ===")
    cols = pd.read_sql(f"PRAGMA table_info({table});", con)
    for _, row in cols.iterrows():
        print(f"  {row['name']}: {row['type']}")

The database contains three tables: `tcx_data`, `official_route_data`, and `aid_stations_data`. The names and data types of the columns in each are as follows:

- `tcx_data`: data from my race recorded on my Apple Watch
    - __time__: timestamp in UTC for each row entry; TEXT (convertible to date-time)
    - __latitude__: GPS latitude in decimal degrees (WGS84 coordinate system); REAL
    - __longitude__: GPS longitude in decimal degrees (WGS84 coordinate system); REAL
    - __elevation__: altitude above sea level, in meters; REAL
    - __distance__: cumulative distance covered since the start, in meters; REAL
    - __heart_rate__: runner’s heart rate at that moment, in beats per minute (BPM); INTEGER

- `official_route_data`: official route data provided by SFM organizers
    - __latitude__: GPS latitude in decimal degrees (WGS84 coordinate system); REAL
    - __longitude__: GPS longitude in decimal degrees (WGS84 coordinate system); REAL
    - __elevation__: altitude above sea level, in meters; REAL

- `aid_stations_data`: aid station waypoints provided by SFM organizers
    - __name__: name of the waypoint; TEXT
    - __latitude__: GPS latitude in decimal degrees (WGS84 coordinate system); REAL
    - __longitude__: GPS longitude in decimal degrees (WGS84 coordinate system); REAL
    - __type__: type of the waypoint; TEXT


## 2. Data Cleaning and Processing

Before analysis, the raw data was inspected and adjusted to improve consistency and interpretability:

- **Distance units:** Converted from meters to kilometers for readability.  
- **Route validation:** Compared the recorded elevation profile against the official Strava course to confirm alignment.  
- **Official course distance:** Since the official data only contained coordinates and elevation, distance in kilometers was calculated in Python for direct comparison.  
- **Data trimming:** Rows prior to the official start line were removed to ensure analyses begin from the race’s official start time.  
- **Elevation cleaning:** Short-term GPS anomalies and outliers (including a ~250 m spike) were identified using a rolling-baseline method. Points that deviated substantially from the local trend were marked as missing and then filled via linear interpolation, preserving the overall hill profile while removing unrealistic spikes.  
- **Split calculation:** Per-kilometer average pace was computed by aggregating data into kilometer splits, providing a clearer view of pacing patterns. In addition, a **rolling 1 km pace** was derived by calculating elapsed time over the preceding kilometer of distance, smoothing out short-term fluctuations and approximating how fitness apps typically present “live pace.”  
- **Known limitations:** While isolated elevation spikes were corrected via interpolation, a persistent anomaly (~+10 m) along the bridge segment could not be adjusted. These minor issues do not materially affect the overall analysis.  
- **Course “valves,” aid stations, and truncated recording:** Certain sections of the route were temporarily rerouted for traffic or safety, moving runners slightly off the main path while keeping the total distance essentially unchanged. Aid stations were not always precisely at their official locations. The watch stopped recording at the exact marathon distance before the finish line, which slightly truncates the final segment. These factors can cause minor GPS offsets or pace fluctuations and explain why the recorded course map may differ slightly from the official route.  
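
The elevation-cleaning step in the list above can be illustrated on synthetic data. This is a minimal sketch rather than the notebook's exact code: the `clean_elevation` helper, its default parameters, and the synthetic hill profile are all assumptions made for illustration. It uses a rolling median as the baseline, since a median is pulled less by a single large spike than a rolling mean is.

```python
import numpy as np
import pandas as pd


def clean_elevation(elev, window=51, threshold_m=5.0):
    """Mask points far from a rolling-median baseline, then interpolate.

    Hypothetical helper: `window` (in samples) and `threshold_m` are
    illustrative values, not tuned settings.
    """
    baseline = elev.rolling(window=window, center=True, min_periods=1).median()
    is_outlier = (elev - baseline).abs() > threshold_m
    cleaned = elev.mask(is_outlier)  # flagged points become NaN
    # Linear interpolation bridges the gaps; limit_direction="both" also
    # fills flagged points at the very start or end of the series.
    return cleaned.interpolate(limit_direction="both"), is_outlier


# Synthetic profile: a smooth 30 m hill with one 250 m GPS spike
x = np.linspace(0, np.pi, 500)
elevation = pd.Series(10 + 30 * np.sin(x))
elevation.iloc[200] += 250

cleaned, is_outlier = clean_elevation(elevation)
print("Flagged points:", int(is_outlier.sum()))
print("Max elevation after cleaning:", round(cleaned.max(), 1))
```

On real watch data the window and threshold would need checking against the plotted profile, because a window that is too short can mistake a steep but genuine climb for an anomaly.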


In [None]:
# Query tables and read to dataframes
# Query tcx race data with distance in km
tcx_query = """
SELECT
    time,
    latitude,
    longitude,
    elevation,
    distance,
    distance / 1000.0 AS distance_km,
    heart_rate
FROM tcx_data
"""

tcx_df = pd.read_sql(tcx_query, con)


# Query official route data and calculate cumulative distance
official_route_df = pd.read_sql("SELECT * FROM official_route_data;", con)

# Calculate segment distances between consecutive GPS points
distances = [0.0]
for i in range(1, len(official_route_df)):
    start = (official_route_df.iloc[i-1]['latitude'], official_route_df.iloc[i-1]['longitude'])
    end = (official_route_df.iloc[i]['latitude'], official_route_df.iloc[i]['longitude'])
    distances.append(geodesic(start, end).km)

official_route_df['segment_distance'] = distances
official_route_df['cumulative_distance_km'] = official_route_df['segment_distance'].cumsum()


# Query aid stations data and calculate distance from start
aid_stations_df = pd.read_sql("SELECT * FROM aid_stations_data;", con)

# Calculate distance from start for each aid station
def find_distance_from_start(station_lat, station_lon):
    station_coords = (station_lat, station_lon)
    distances = [geodesic(station_coords, (row['latitude'], row['longitude'])).meters 
                for _, row in official_route_df.iterrows()]
    closest_idx = np.argmin(distances)
    return official_route_df.iloc[closest_idx]['cumulative_distance_km']

aid_stations_df['distance_from_start_km'] = aid_stations_df.apply(
    lambda row: find_distance_from_start(row['latitude'], row['longitude']), axis=1
)

# Sort by distance and display summary
aid_stations_df = aid_stations_df.sort_values('distance_from_start_km').reset_index(drop=True)
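
The nearest-point lookup above calls `geodesic` once per route point for every aid station, which can be slow on a dense route. As a possible alternative, the sketch below uses a vectorized haversine formula in NumPy. The `haversine_km` and `nearest_route_index` helpers and the sample coordinates are hypothetical, and haversine assumes a spherical Earth (mean radius 6371 km), so results differ slightly from geopy's ellipsoidal `geodesic`.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; inputs in decimal degrees, arrays allowed."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


def nearest_route_index(station_lat, station_lon, route_lat, route_lon):
    """Index of the route point closest to the station."""
    d = haversine_km(station_lat, station_lon, np.asarray(route_lat), np.asarray(route_lon))
    return int(np.argmin(d))


# Made-up route: five points heading north along one meridian
route_lat = np.array([37.790, 37.795, 37.800, 37.805, 37.810])
route_lon = np.full(5, -122.400)

# Cumulative distance along the route, analogous to cumulative_distance_km
seg = haversine_km(route_lat[:-1], route_lon[:-1], route_lat[1:], route_lon[1:])
cum_km = np.concatenate([[0.0], np.cumsum(seg)])

idx = nearest_route_index(37.8012, -122.4003, route_lat, route_lon)
print(idx, round(cum_km[idx], 3))
```

As with the `geodesic` version, a nearest-point match can pick the wrong pass wherever a course doubles back over the same stretch of road, so the resulting distances are worth a visual check against the map.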


In [None]:
# Remove trackpoints recorded before the race started
# utc=True handles both naive timestamps and strings that already carry a UTC offset
tcx_df['time'] = pd.to_datetime(tcx_df['time'], utc=True)
official_start = pd.to_datetime("2025-07-27 05:35:29").tz_localize("America/Los_Angeles").tz_convert('UTC')
tcx_df = tcx_df[tcx_df['time'] >= official_start].copy()
tcx_df.reset_index(drop=True, inplace=True)

tcx_df.head()

In [None]:
# Create plots to compare my race to official route
fig = make_subplots(
    rows=1, cols=2,
    subplot_titles=('Elevation Comparison', 'Route Map Comparison'),
    horizontal_spacing=0.15,
    specs=[[{"secondary_y": False}, {"secondary_y": False}]]
)

# First subplot: Elevation comparison (left)
fig.add_trace(
    go.Scatter(
        x=tcx_df['distance_km'],
        y=tcx_df['elevation'],
        mode='lines',
        name='My Race',
        line=dict(color='blue', width=2),
        opacity=0.7,
        hovertemplate='Distance: %{x:.2f} km<br>My Elevation: %{y:.1f} m<extra></extra>'
    ),
    row=1, col=1
)

fig.add_trace(
    go.Scatter(
        x=official_route_df['cumulative_distance_km'],
        y=official_route_df['elevation'],
        mode='lines',
        name='Official Route',
        line=dict(color='red', width=2, dash='dash'),
        opacity=0.7,
        hovertemplate='Distance: %{x:.2f} km<br>Official Elevation: %{y:.1f} m<extra></extra>'
    ),
    row=1, col=1
)

# Second subplot: Route map (right)
fig.add_trace(
    go.Scatter(
        x=tcx_df['longitude'],
        y=tcx_df['latitude'],
        mode='lines',
        name='My Race Route',
        line=dict(color='blue', width=3),
        opacity=0.7,
        hovertemplate='Lat: %{y:.5f}<br>Lon: %{x:.5f}<br>My Route<extra></extra>',
        showlegend=False
    ),
    row=1, col=2
)

fig.add_trace(
    go.Scatter(
        x=official_route_df['longitude'],
        y=official_route_df['latitude'],
        mode='lines',
        name='Official Route',
        line=dict(color='red', width=2, dash='dash'),
        opacity=0.7,
        hovertemplate='Lat: %{y:.5f}<br>Lon: %{x:.5f}<br>Official Route<extra></extra>',
        showlegend=False
    ),
    row=1, col=2
)

# Add aid stations to map (right)
fig.add_trace(
    go.Scatter(
        x=aid_stations_df['longitude'],
        y=aid_stations_df['latitude'],
        mode='markers',
        name='Aid Stations',
        marker=dict(color='green', size=8, symbol='circle'),
        hovertemplate='%{text}<br>Lat: %{y:.5f}<br>Lon: %{x:.5f}<extra></extra>',
        text=aid_stations_df['name'] if 'name' in aid_stations_df.columns else 'Aid Station',
        showlegend=False
    ),
    row=1, col=2
)

# Update layout and axis labels
fig.update_layout(
    title_text='Marathon Analysis: Elevation and Route Comparison',
    height=500,
    width=1200,
    showlegend=True
)
fig.update_xaxes(title_text="Distance (km)", row=1, col=1)
fig.update_xaxes(title_text="Longitude", row=1, col=2)
fig.update_yaxes(title_text="Elevation (m)", row=1, col=1)
fig.update_yaxes(title_text="Latitude", row=1, col=2, scaleanchor="x2", scaleratio=1)

fig.show()

In [None]:
# Remove elevation outliers using rolling baseline and interpolation
baseline_window = 50
tcx_df['elev_baseline'] = tcx_df['elevation'].rolling(window=baseline_window, center=True, min_periods=1).mean()

deviation_threshold = 5  # meters
tcx_df.loc[(tcx_df['elevation'] - tcx_df['elev_baseline']).abs() > deviation_threshold, 'elevation'] = np.nan
tcx_df['elevation'] = tcx_df['elevation'].interpolate()
tcx_df.drop(columns=['elev_baseline'], inplace=True)

# Create elevation comparison plot
fig = go.Figure()

fig.add_trace(go.Scatter(
    x=tcx_df['distance_km'],
    y=tcx_df['elevation'],
    mode='lines',
    name='My Race Elevation',
    line=dict(color='blue', width=2),
    hovertemplate='Distance: %{x:.2f} km<br>Elevation: %{y:.1f} m<extra></extra>'
))

fig.add_trace(go.Scatter(
    x=official_route_df['cumulative_distance_km'],
    y=official_route_df['elevation'],
    mode='lines',
    name='Official Route Elevation',
    line=dict(color='red', width=2, dash='dash'),
    opacity=0.7,
    hovertemplate='Distance: %{x:.2f} km<br>Official Elevation: %{y:.1f} m<extra></extra>'
))

fig.update_layout(
    title='Elevation Profile: Recorded vs Official Route (After Outlier Removal)',
    xaxis_title='Distance (km)',
    yaxis_title='Elevation (m)',
    width=900,
    height=500,
    hovermode='x'
)

fig.show()


In [None]:
# Find rolling average pace over 1 km
dist_m = tcx_df['distance'].to_numpy()
tcx_df['time'] = pd.to_datetime(tcx_df['time'])

# Elapsed seconds since the first trackpoint (independent of timestamp resolution and timezone)
time_s = (tcx_df['time'] - tcx_df['time'].iloc[0]).dt.total_seconds().to_numpy()

# Array to store rolling pace
pace_1km = np.full(len(dist_m), np.nan)

window_m = 1000  # 1 km

for i in range(len(dist_m)):
    target_dist = dist_m[i] - window_m
    start_idx = np.searchsorted(dist_m, target_dist, side='right') - 1
    start_idx = max(0, start_idx)
    
    dt = time_s[i] - time_s[start_idx]
    dx_km = (dist_m[i] - dist_m[start_idx]) / 1000
    
    if dx_km > 0:
        pace_1km[i] = (dt / 60) / dx_km  # min/km

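def format_pace_value(pace):
    """Format a pace given in decimal minutes as M:SS."""
    if pace is None or np.isnan(pace):
        return "N/A"
    # Round total seconds first so 4.999 min becomes 5:00, not 4:60
    minutes, seconds = divmod(int(round(pace * 60)), 60)
    return f"{minutes}:{seconds:02d}"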


tcx_df['pace_1km'] = pace_1km
tcx_df['pace_1km_str'] = tcx_df['pace_1km'].apply(format_pace_value)

tcx_df.head()
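
The per-row loop above can also be written without a Python loop, because `np.searchsorted` accepts an array of targets. The following is a sketch on synthetic data; the `rolling_pace_min_per_km` helper is hypothetical and assumes cumulative distance is non-decreasing.

```python
import numpy as np


def rolling_pace_min_per_km(dist_m, time_s, window_m=1000.0):
    """Pace (min/km) over the trailing `window_m` of distance at each sample.

    Assumes `dist_m` is non-decreasing. Early samples, where less than
    `window_m` has been covered, use the distance since the start instead.
    """
    dist_m = np.asarray(dist_m, dtype=float)
    time_s = np.asarray(time_s, dtype=float)
    # Last index whose distance is <= (current distance - window), per sample
    start_idx = np.searchsorted(dist_m, dist_m - window_m, side="right") - 1
    start_idx = np.clip(start_idx, 0, None)
    dt_min = (time_s - time_s[start_idx]) / 60.0
    dx_km = (dist_m - dist_m[start_idx]) / 1000.0
    pace = np.full(len(dist_m), np.nan)
    moving = dx_km > 0  # leave NaN where no distance was covered
    pace[moving] = dt_min[moving] / dx_km[moving]
    return pace


# Synthetic runner: one sample per second at a constant 3 m/s for 10 minutes
t = np.arange(0, 601)
d = 3.0 * t
pace = rolling_pace_min_per_km(d, t)
print(pace[0], round(pace[-1], 3))
```

A constant-speed input is a convenient sanity check: every sample after the first should report the same pace, whatever the window length.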


In [None]:
# Create 1km pace splits DataFrame
tcx_df['km_split'] = np.floor(tcx_df['distance_km']).astype(int)

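km_splits = tcx_df.groupby('km_split').agg(
    time_start=('time', 'min'),
    time_end=('time', 'max'),
    dist_start_km=('distance_km', 'min'),
    dist_end_km=('distance_km', 'max'),
    elevation_avg=('elevation', 'mean'),
).reset_index()

km_splits['time_diff_seconds'] = (km_splits['time_end'] - km_splits['time_start']).dt.total_seconds()
# Normalize by the distance actually covered, so the partial final split
# (about 0.2 km) is reported in min/km rather than as raw elapsed minutes
km_splits['split_distance_km'] = (km_splits['dist_end_km'] - km_splits['dist_start_km']).replace(0, np.nan)
km_splits['pace'] = (km_splits['time_diff_seconds'] / 60.0) / km_splits['split_distance_km']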
km_splits['pace_formatted'] = km_splits['pace'].apply(format_pace_value)

km_splits.head()


## 3. Overall Trends

The following graph illustrates my rolling pace (1 km window) and elevation across the 26.2-mile race.

In [None]:
# Create dual-axis plot: pace and elevation vs distance
fig = go.Figure()

# Add pace trace (1km rolling average)
fig.add_trace(go.Scatter(
    x=tcx_df['distance_km'],
    y=tcx_df['pace_1km'],
    mode='lines',
    name='Pace (1km rolling avg)',
    line=dict(color='red', width=2),
    hovertemplate='Distance: %{x:.2f}km<br>Pace: %{customdata}<extra></extra>',
    customdata=tcx_df['pace_1km_str'],
    yaxis='y'
))

# Add elevation trace
fig.add_trace(go.Scatter(
    x=tcx_df['distance_km'],
    y=tcx_df['elevation'],
    mode='lines',
    name='Elevation (m)',
    line=dict(color='blue', width=1),
    opacity=0.7,
    hovertemplate='Distance: %{x:.2f}km<br>Elevation: %{y:.0f}m<extra></extra>',
    yaxis='y2'
))

# Add aid stations with labels
for i, (_, station) in enumerate(aid_stations_df.iterrows()):
    # Vertical line
    fig.add_vline(
        x=station['distance_from_start_km'],
        line=dict(color='green', width=2, dash='dot'),
        opacity=0.7
    )
    
    # Distance label
    fig.add_annotation(
        x=station['distance_from_start_km'],
        y=1, yref="paper",
        text=f"{station['distance_from_start_km']:.1f}",
        showarrow=False,
        font=dict(size=10, color='green'),
        yanchor="bottom"
    )

# Add legend entry for aid stations
fig.add_trace(go.Scatter(
    x=[None], y=[None],
    mode='lines',
    name='Aid Stations',
    line=dict(color='green', width=2, dash='dot'),
    showlegend=True
))

# Update layout
fig.update_layout(
    title='Marathon Performance: Pace vs Elevation Profile with Aid Stations',
    xaxis_title='Distance (km)',
    yaxis=dict(title='Pace (min/km)', side='left'),
    yaxis2=dict(title='Elevation (m)', side='right', overlaying='y'),
    width=1000,
    height=500,
    hovermode='x unified',
    legend=dict(orientation="h", yanchor="top", y=-0.15, xanchor="center", x=0.5)
)

fig.show()

The visualization above highlights several key performance patterns:

- **Start-line anomaly:** An initial pace spike appears within the first 0.2 km, reflecting bottleneck congestion near the starting line.
- **Impact of terrain:** Pace closely tracks elevation gain and loss. Uphill segments produce sustained slowdowns, while downhill segments allow partial recovery.
- **Effects of fatigue:** Although earlier downhill sections produce clear accelerations, the effect diminishes in the latter half of the race. This suggests that fatigue not only slows pace but also reduces responsiveness to favorable terrain (one possible factor is knee stress limiting the ability to capitalize on downhills).
- **End-of-race acceleration:** Despite late-race fatigue, pace quickens noticeably over the last 5 km, reflecting a clear end-of-race push as the finish line draws closer.
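
One way to put a number on the terrain observation is to correlate each kilometer split's net elevation change with its pace. The sketch below runs on synthetic data, so its output says nothing about the actual race; the `splits` frame, its column names (`elev_change_m`, `pace_min_per_km`), and the generated values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)

# Synthetic per-km splits: net elevation change (m) and pace (min/km).
# Pace is generated to slow by 0.02 min/km per meter climbed, plus noise.
n_splits = 42
elev_change_m = rng.normal(loc=0.0, scale=15.0, size=n_splits)
pace_min_per_km = 5.5 + 0.02 * elev_change_m + rng.normal(0.0, 0.1, n_splits)

splits = pd.DataFrame({
    "km_split": np.arange(n_splits),
    "elev_change_m": elev_change_m,
    "pace_min_per_km": pace_min_per_km,
})

# Pearson correlation between net climb and pace across splits
r = splits["elev_change_m"].corr(splits["pace_min_per_km"])

# Least-squares slope: extra minutes per km for each meter of net climb
slope, intercept = np.polyfit(splits["elev_change_m"], splits["pace_min_per_km"], deg=1)

print(f"r = {r:.2f}, slope = {slope:.3f} min/km per m")
```

With the notebook's data, the per-split elevation change could be taken from the first and last elevation values in each `km_split` group. A single correlation does not separate terrain from fatigue, so comparing the first and second halves of the race separately may be more informative.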




## 4. Heart Rate vs Pace Analysis

## 5. Heatmap