# Analyzing Performance at the 2025 SF Marathon: Terrain, Pace, and Heart Rate

This notebook explores my performance at the 2025 San Francisco Marathon, using Strava records of my 26.2-mile race. The exploratory analysis examines how course structure and terrain (elevation, distance segments, and aid stations) relate to performance metrics such as pace and heart rate. The goal is to draw insights that guide future training and serve as a reference if I run this marathon again.

The analysis is done using SQL and Python (pandas, NumPy, geopy, matplotlib, Plotly) in a Jupyter notebook. Raw data (TCX and GPX) was cleaned in separate Python scripts.

## 1. Overview of the Data 

In [1]:
# Data, SQL, visualization imports
import sqlite3
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from geopy.distance import geodesic
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go


In [31]:
# Connect to SQLite database, display tables and columns
con = sqlite3.connect("../data/marathon_data.db") 
cur = con.cursor()

# Table overview
tables = pd.read_sql_query("SELECT name FROM sqlite_master WHERE type='table';", con)
print("Tables:", tables['name'].tolist())

# Show structure for each table
for table in tables['name']:
    print(f"\n=== {table.upper()} ===")
    cols = pd.read_sql(f"PRAGMA table_info({table});", con)
    for _, row in cols.iterrows():
        print(f"  {row['name']}: {row['type']}")

Tables: ['tcx_data', 'official_route_data', 'aid_stations_data']

=== TCX_DATA ===
  time: TEXT
  latitude: REAL
  longitude: REAL
  elevation: REAL
  distance: REAL
  heart_rate: INTEGER

=== OFFICIAL_ROUTE_DATA ===
  latitude: REAL
  longitude: REAL
  elevation: REAL

=== AID_STATIONS_DATA ===
  name: TEXT
  latitude: REAL
  longitude: REAL
  type: TEXT


The database contains three tables: `tcx_data`, `official_route_data`, and `aid_stations_data`. The names and data types of the columns in each are as follows:

- `tcx_data`: data from my race recorded on my Apple Watch
    - __time__: timestamp in UTC for each row entry; TEXT (convertible to date-time)
    - __latitude__: GPS latitude in decimal degrees (WGS84 coordinate system); REAL
    - __longitude__: GPS longitude in decimal degrees (WGS84 coordinate system); REAL
    - __elevation__: altitude above sea level, in meters; REAL
    - __distance__: cumulative distance covered since the start, in meters; REAL
    - __heart_rate__: runner’s heart rate at that moment, in beats per minute (BPM); INTEGER

- `official_route_data`: official route data provided by SFM organizers
    - __latitude__: GPS latitude in decimal degrees (WGS84 coordinate system); REAL
    - __longitude__: GPS longitude in decimal degrees (WGS84 coordinate system); REAL
    - __elevation__: altitude above sea level, in meters; REAL

- `aid_stations_data`: aid station waypoints from the official route data provided by SFM organizers
    - __name__: name of the waypoint; TEXT
    - __latitude__: GPS latitude in decimal degrees (WGS84 coordinate system); REAL
    - __longitude__: GPS longitude in decimal degrees (WGS84 coordinate system); REAL
    - __type__: type of the waypoint; TEXT

The following cell shows the first 5 rows of each table. 

In [3]:
# Sample data from each table
tables_to_show = [
    ("tcx_data", "TCX Data"),
    ("official_route_data", "Official Route Data"),
    ("aid_stations_data", "Aid Stations Data")
]

for table, name in tables_to_show:
    print(f"=== {name} (First 5 Rows) ===")
    sample = pd.read_sql(f"SELECT * FROM {table} LIMIT 5;", con)
    display(sample)
    print() 
    

=== TCX Data (First 5 Rows) ===


Unnamed: 0,time,latitude,longitude,elevation,distance,heart_rate
0,2025-07-27 12:35:21,37.794481,-122.394033,6.2,0.0,116
1,2025-07-27 12:35:22,37.794512,-122.394049,6.0,1.4,116
2,2025-07-27 12:35:23,37.794526,-122.394061,6.0,2.8,116
3,2025-07-27 12:35:24,37.794539,-122.394074,6.4,4.1,116
4,2025-07-27 12:35:25,37.794552,-122.394086,6.4,5.5,116



=== Official Route Data (First 5 Rows) ===


Unnamed: 0,latitude,longitude,elevation
0,37.79517,-122.39377,3.74
1,37.79528,-122.39386,3.74
2,37.79534,-122.39378,3.33
3,37.795945,-122.394335,3.21
4,37.79655,-122.39489,3.12



=== Aid Stations Data (First 5 Rows) ===


Unnamed: 0,name,latitude,longitude,type
0,Water Stop/Aid Station,37.80709,-122.42587,Aid Station
1,Water Stop/Aid Station,37.80618,-122.46882,Aid Station
2,Water Stop/Aid Station,37.8324,-122.47991,Aid Station
3,Water Stop/Aid Station,37.83621,-122.47356,Aid Station
4,Water Stop/Aid Station,37.83245,-122.4818,Aid Station





## 2. Data Cleaning and Processing

Before analysis, the raw data was inspected and adjusted to improve consistency and interpretability:

- **Distance units:** Converted from meters to kilometers for readability.  
- **Route validation:** Compared the recorded elevation profile against the official Strava course to confirm alignment.  
- **Official course distance:** Since the official data only contained coordinates and elevation, distance in kilometers was calculated in Python for direct comparison.  
- **Data trimming:** Rows prior to the official start line were removed to ensure analyses begin from the race’s official start time.  
- **Elevation cleaning:** Short-term GPS anomalies and outliers (including a ~250 m spike) were identified using a rolling-baseline method. Points that deviated substantially from the local trend were marked as missing and then filled via linear interpolation, preserving the overall hill profile while removing unrealistic spikes.  
- **Split calculation (feature engineering):** Per-kilometer average pace was computed by aggregating data into kilometer splits, providing a clearer view of pacing patterns.  
- **Known limitations:** While isolated elevation spikes were corrected via interpolation, a persistent offset of roughly +10 m along the bridge segment could not be adjusted. This offset does not materially affect the overall analysis.  
- **Course “valves,” aid stations, and truncated recording:** Certain sections of the route were temporarily rerouted for traffic or safety, moving runners slightly off the main path while keeping the total distance essentially unchanged. Aid stations were not always precisely at their official locations. The watch stopped recording at the exact marathon distance before the finish line, which slightly truncates the final segment. These factors can cause minor GPS offsets or pace fluctuations and explain why the recorded course map may differ slightly from the official route. 

In [33]:
# Query tcx race data with distance in km
tcx_query = """
SELECT
    time,
    latitude,
    longitude,
    elevation,
    distance,
    distance / 1000.0 AS distance_km,
    heart_rate
FROM tcx_data
"""

tcx_df = pd.read_sql(tcx_query, con)

tcx_df.head()

Unnamed: 0,time,latitude,longitude,elevation,distance,distance_km,heart_rate
0,2025-07-27 12:35:21,37.794481,-122.394033,6.2,0.0,0.0,116
1,2025-07-27 12:35:22,37.794512,-122.394049,6.0,1.4,0.0014,116
2,2025-07-27 12:35:23,37.794526,-122.394061,6.0,2.8,0.0028,116
3,2025-07-27 12:35:24,37.794539,-122.394074,6.4,4.1,0.0041,116
4,2025-07-27 12:35:25,37.794552,-122.394086,6.4,5.5,0.0055,116


In [5]:
# Query official route data and calculate cumulative distance
official_route_df = pd.read_sql("SELECT * FROM official_route_data;", con)

# Calculate segment distances between consecutive GPS points
distances = [0.0]
for i in range(1, len(official_route_df)):
    start = (official_route_df.iloc[i-1]['latitude'], official_route_df.iloc[i-1]['longitude'])
    end = (official_route_df.iloc[i]['latitude'], official_route_df.iloc[i]['longitude'])
    distances.append(geodesic(start, end).km)

official_route_df['segment_distance'] = distances
official_route_df['cumulative_distance_km'] = official_route_df['segment_distance'].cumsum()

official_route_df.head()


Unnamed: 0,latitude,longitude,elevation,segment_distance,cumulative_distance_km
0,37.79517,-122.39377,3.74,0.0,0.0
1,37.79528,-122.39386,3.74,0.014557,0.014557
2,37.79534,-122.39378,3.33,0.009695,0.024252
3,37.795945,-122.394335,3.21,0.083058,0.10731
4,37.79655,-122.39489,3.12,0.083058,0.190368
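
The point-by-point loop above can also be written as a vectorized computation. The sketch below is an illustration rather than the notebook's method: it swaps geopy's ellipsoidal `geodesic` for the spherical haversine formula, so results differ slightly from the table above. The function name `cumulative_haversine_km` and the Earth radius constant are assumptions introduced here.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; spherical approximation


def cumulative_haversine_km(lat_deg, lon_deg):
    """Cumulative great-circle distance (km) along a sequence of GPS points."""
    lat = np.radians(np.asarray(lat_deg, dtype=float))
    lon = np.radians(np.asarray(lon_deg, dtype=float))
    dlat = np.diff(lat)
    dlon = np.diff(lon)
    # Haversine formula applied to every consecutive pair of points at once
    a = np.sin(dlat / 2) ** 2 + np.cos(lat[:-1]) * np.cos(lat[1:]) * np.sin(dlon / 2) ** 2
    segment_km = 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))
    # The first point has zero distance, matching the loop's initial 0.0
    return np.concatenate(([0.0], np.cumsum(segment_km)))


# First three official route points from the table above
lats = [37.79517, 37.79528, 37.79534]
lons = [-122.39377, -122.39386, -122.39378]
cum_km = cumulative_haversine_km(lats, lons)
print(cum_km)
```

For a route with a few thousand points either approach is fast enough; the vectorized form mainly avoids repeated `iloc` lookups.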


In [6]:
# Query aid stations data and calculate distance from start
aid_stations_df = pd.read_sql("SELECT * FROM aid_stations_data;", con)

# Calculate distance from start for each aid station
def find_distance_from_start(station_lat, station_lon):
    station_coords = (station_lat, station_lon)
    distances = [geodesic(station_coords, (row['latitude'], row['longitude'])).meters 
                for _, row in official_route_df.iterrows()]
    closest_idx = np.argmin(distances)
    return official_route_df.iloc[closest_idx]['cumulative_distance_km']

aid_stations_df['distance_from_start_km'] = aid_stations_df.apply(
    lambda row: find_distance_from_start(row['latitude'], row['longitude']), axis=1
)

# Sort by distance and display summary
aid_stations_df = aid_stations_df.sort_values('distance_from_start_km').reset_index(drop=True)

aid_stations_df.head()

Unnamed: 0,name,latitude,longitude,type,distance_from_start_km
0,Water Stop/Aid Station,37.80709,-122.42587,Aid Station,3.57485
1,Water Stop/Aid Station,37.80618,-122.46882,Aid Station,7.982627
2,Water Stop/Aid Station,37.8324,-122.47991,Aid Station,13.513509
3,Water Stop/Aid Station,37.83621,-122.47356,Aid Station,16.613082
4,Water Stop/Aid Station,37.83245,-122.4818,Aid Station,18.42491
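
The nearest-point lookup used for the aid stations can be sketched in NumPy as follows. This is a simplified illustration, not the notebook's code: it uses an equirectangular approximation instead of `geodesic`, and the function name `nearest_route_km` and the toy route values are hypothetical.

```python
import numpy as np


def nearest_route_km(station_lat, station_lon, route_lat, route_lon, route_cum_km):
    """Return the cumulative distance (km) of the route point closest to a station."""
    route_lat = np.asarray(route_lat, dtype=float)
    route_lon = np.asarray(route_lon, dtype=float)
    # Equirectangular approximation: scale longitude differences by cos(latitude)
    # so that degree offsets are comparable over a city-sized area.
    scale = np.cos(np.radians(station_lat))
    d2 = (route_lat - station_lat) ** 2 + ((route_lon - station_lon) * scale) ** 2
    return float(np.asarray(route_cum_km, dtype=float)[np.argmin(d2)])


# Hypothetical toy route heading west along a line of constant latitude
route_lat = [37.80, 37.80, 37.80, 37.80]
route_lon = [-122.40, -122.41, -122.42, -122.43]
route_cum_km = [0.0, 0.88, 1.76, 2.64]

km = nearest_route_km(37.8005, -122.4195, route_lat, route_lon, route_cum_km)
print(km)
```

One caveat of any nearest-point match: where a course passes the same location twice, the closest route point may belong to either pass, so the assigned distance is worth a visual check on the map.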


In [7]:
# Remove trackpoints recorded before the race started
tcx_df['time'] = pd.to_datetime(tcx_df['time']).dt.tz_localize('UTC')
official_start = pd.to_datetime("2025-07-27 05:35:29").tz_localize("America/Los_Angeles").tz_convert('UTC')
tcx_df = tcx_df[tcx_df['time'] >= official_start].copy()
tcx_df.reset_index(drop=True, inplace=True)

tcx_df.head()

Unnamed: 0,time,latitude,longitude,elevation,distance_km,heart_rate
0,2025-07-27 12:35:29+00:00,37.794629,-122.394148,5.0,0.0143,116
1,2025-07-27 12:35:30+00:00,37.794641,-122.394156,4.6,0.0159,116
2,2025-07-27 12:35:33+00:00,37.794641,-122.394156,4.6,0.0216,116
3,2025-07-27 12:35:35+00:00,37.794641,-122.394156,4.6,0.027,116
4,2025-07-27 12:35:36+00:00,37.794716,-122.394173,4.6,0.0283,116
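
As a quick sanity check on the start-time conversion, the local-to-UTC arithmetic can be reproduced with the standard library alone. The sketch assumes Pacific Daylight Time (UTC-7) applies on the race date and hard-codes that offset rather than looking up the `America/Los_Angeles` zone as the cell above does.

```python
from datetime import datetime, timedelta, timezone

# Assumption: Pacific Daylight Time (UTC-7) is in effect in late July
PDT = timezone(timedelta(hours=-7), name="PDT")

local_start = datetime(2025, 7, 27, 5, 35, 29, tzinfo=PDT)
utc_start = local_start.astimezone(timezone.utc)
print(utc_start.isoformat())  # 2025-07-27T12:35:29+00:00
```

This matches the first retained trackpoint timestamp in the table above.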


In [14]:
# Create plots to compare my race to official route
fig = make_subplots(
    rows=1, cols=2,
    subplot_titles=('Elevation Comparison', 'Route Map Comparison'),
    horizontal_spacing=0.15,
    specs=[[{"secondary_y": False}, {"secondary_y": False}]]
)

# First subplot: Elevation comparison (left)
fig.add_trace(
    go.Scatter(
        x=tcx_df['distance_km'],
        y=tcx_df['elevation'],
        mode='lines',
        name='My Race',
        line=dict(color='blue', width=2),
        opacity=0.7,
        hovertemplate='Distance: %{x:.2f} km<br>My Elevation: %{y:.1f} m<extra></extra>'
    ),
    row=1, col=1
)

fig.add_trace(
    go.Scatter(
        x=official_route_df['cumulative_distance_km'],
        y=official_route_df['elevation'],
        mode='lines',
        name='Official Route',
        line=dict(color='red', width=2, dash='dash'),
        opacity=0.7,
        hovertemplate='Distance: %{x:.2f} km<br>Official Elevation: %{y:.1f} m<extra></extra>'
    ),
    row=1, col=1
)

# Second subplot: Route map (right)
fig.add_trace(
    go.Scatter(
        x=tcx_df['longitude'],
        y=tcx_df['latitude'],
        mode='lines',
        name='My Race Route',
        line=dict(color='blue', width=3),
        opacity=0.7,
        hovertemplate='Lat: %{y:.5f}<br>Lon: %{x:.5f}<br>My Route<extra></extra>',
        showlegend=False
    ),
    row=1, col=2
)

fig.add_trace(
    go.Scatter(
        x=official_route_df['longitude'],
        y=official_route_df['latitude'],
        mode='lines',
        name='Official Route',
        line=dict(color='red', width=2, dash='dash'),
        opacity=0.7,
        hovertemplate='Lat: %{y:.5f}<br>Lon: %{x:.5f}<br>Official Route<extra></extra>',
        showlegend=False
    ),
    row=1, col=2
)

# Add aid stations to map (right)
fig.add_trace(
    go.Scatter(
        x=aid_stations_df['longitude'],
        y=aid_stations_df['latitude'],
        mode='markers',
        name='Aid Stations',
        marker=dict(color='green', size=8, symbol='circle'),
        hovertemplate='%{text}<br>Lat: %{y:.5f}<br>Lon: %{x:.5f}<extra></extra>',
        text=aid_stations_df['name'] if 'name' in aid_stations_df.columns else 'Aid Station',
        showlegend=False
    ),
    row=1, col=2
)

# Update layout and axis labels
fig.update_layout(
    title_text='Marathon Analysis: Elevation and Route Comparison',
    height=500,
    width=1200,
    showlegend=True
)
fig.update_xaxes(title_text="Distance (km)", row=1, col=1)
fig.update_xaxes(title_text="Longitude", row=1, col=2)
fig.update_yaxes(title_text="Elevation (m)", row=1, col=1)
fig.update_yaxes(title_text="Latitude", row=1, col=2, scaleanchor="x2", scaleratio=1)

fig.show()

In [9]:
# Remove elevation outliers using rolling baseline and interpolation
baseline_window = 50
tcx_df['elev_baseline'] = tcx_df['elevation'].rolling(window=baseline_window, center=True, min_periods=1).mean()

deviation_threshold = 5  # meters
tcx_df.loc[(tcx_df['elevation'] - tcx_df['elev_baseline']).abs() > deviation_threshold, 'elevation'] = np.nan
tcx_df['elevation'] = tcx_df['elevation'].interpolate()
tcx_df.drop(columns=['elev_baseline'], inplace=True)

# Create elevation comparison plot
fig = go.Figure()

fig.add_trace(go.Scatter(
    x=tcx_df['distance_km'],
    y=tcx_df['elevation'],
    mode='lines',
    name='My Race Elevation',
    line=dict(color='blue', width=2),
    hovertemplate='Distance: %{x:.2f} km<br>Elevation: %{y:.1f} m<extra></extra>'
))

fig.add_trace(go.Scatter(
    x=official_route_df['cumulative_distance_km'],
    y=official_route_df['elevation'],
    mode='lines',
    name='Official Route Elevation',
    line=dict(color='red', width=2, dash='dash'),
    opacity=0.7,
    hovertemplate='Distance: %{x:.2f} km<br>Official Elevation: %{y:.1f} m<extra></extra>'
))

fig.update_layout(
    title='Elevation Profile: Recorded vs Official Route (After Outlier Removal)',
    xaxis_title='Distance (km)',
    yaxis_title='Elevation (m)',
    width=900,
    height=500,
    hovermode='x'
)

fig.show()
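
A rolling mean is itself pulled toward a large spike, which can shift the baseline for neighboring points. A common alternative is a rolling median baseline. The sketch below is a variation on the cell above, not the method used in this notebook; the function name `despike_elevation`, the window of 51 samples, and the synthetic profile are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def despike_elevation(elev, window=51, threshold_m=5.0):
    """Replace points far from a rolling-median baseline by linear interpolation."""
    elev = pd.Series(elev, dtype=float)
    baseline = elev.rolling(window=window, center=True, min_periods=1).median()
    # Mask (set to NaN) every point deviating from the baseline by more than the threshold
    cleaned = elev.mask((elev - baseline).abs() > threshold_m)
    return cleaned.interpolate(limit_direction="both")


# Synthetic profile: a gentle 0-20 m climb with one artificial 250 m spike
profile = np.linspace(0.0, 20.0, 201)
profile[100] = 250.0
cleaned = despike_elevation(profile)
print(cleaned.iloc[98:103].round(2).tolist())
```

The window length and threshold still need tuning against the real track: a window that is too short follows the spikes, and a threshold that is too tight flattens genuine hills.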


In [39]:
# Find rolling average pace over 1 km
# Distance is already in meters; convert timestamps to elapsed seconds
dist_m = tcx_df['distance'].to_numpy()
tcx_df['time'] = pd.to_datetime(tcx_df['time'])

# Seconds since the first trackpoint, as a NumPy array for positional indexing
time_s = (tcx_df['time'] - tcx_df['time'].iloc[0]).dt.total_seconds().to_numpy()

# Array to store rolling pace
pace_1km = np.full(len(dist_m), np.nan)

window_m = 1000  # 1 km

for i in range(len(dist_m)):
    target_dist = dist_m[i] - window_m
    start_idx = np.searchsorted(dist_m, target_dist, side='right') - 1
    start_idx = max(0, start_idx)
    
    dt = time_s[i] - time_s[start_idx]
    dx_km = (dist_m[i] - dist_m[start_idx]) / 1000
    
    if dx_km > 0:
        pace_1km[i] = (dt / 60) / dx_km  # min/km

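def format_pace_value(pace):
    # Format a pace in decimal minutes per km as "M:SS"
    if pace is None or np.isnan(pace):
        return "N/A"
    # Round to whole seconds first so 6.9999 becomes "7:00" rather than "6:60"
    total_seconds = int(round(pace * 60))
    minutes, seconds = divmod(total_seconds, 60)
    return f"{minutes}:{seconds:02d}"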


tcx_df['pace_1km'] = pace_1km
tcx_df['pace_1km_str'] = tcx_df['pace_1km'].apply(format_pace_value)

tcx_df.head()


Unnamed: 0,time,latitude,longitude,elevation,distance,distance_km,heart_rate,pace_1km,pace_1km_str
0,2025-07-27 12:35:21,37.794481,-122.394033,6.2,0.0,0.0,116,,
1,2025-07-27 12:35:22,37.794512,-122.394049,6.0,1.4,0.0014,116,11.904762,11:54
2,2025-07-27 12:35:23,37.794526,-122.394061,6.0,2.8,0.0028,116,11.904762,11:54
3,2025-07-27 12:35:24,37.794539,-122.394074,6.4,4.1,0.0041,116,12.195122,12:12
4,2025-07-27 12:35:25,37.794552,-122.394086,6.4,5.5,0.0055,116,12.121212,12:07
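
The per-point loop above can be expressed without an explicit Python loop, since `np.searchsorted` accepts an array of targets. The following is a minimal sketch under the assumption that cumulative distance never decreases; the function name `rolling_pace_min_per_km` and the synthetic constant-speed track are illustrative only.

```python
import numpy as np


def rolling_pace_min_per_km(dist_m, time_s, window_m=1000.0):
    """Pace (min/km) over the trailing window ending at each sample."""
    dist_m = np.asarray(dist_m, dtype=float)
    time_s = np.asarray(time_s, dtype=float)
    # Index of the last sample at or before (current distance - window), clipped to 0
    start = np.searchsorted(dist_m, dist_m - window_m, side="right") - 1
    start = np.clip(start, 0, None)
    dt_min = (time_s - time_s[start]) / 60.0
    dx_km = (dist_m - dist_m[start]) / 1000.0
    # Leave NaN wherever no distance was covered (for example the first sample)
    pace = np.full(dist_m.shape, np.nan)
    np.divide(dt_min, dx_km, out=pace, where=dx_km > 0)
    return pace


# Synthetic track: constant 2.5 m/s, one sample every 10 m over 3 km
dist = np.arange(0.0, 3001.0, 10.0)
time = dist / 2.5
pace = rolling_pace_min_per_km(dist, time)
print(pace[:3], pace[-1])
```

Near the start, the window covers less than 1 km, so early values are computed over very short distances and are noisier than the rest of the series.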


In [None]:
# Create km splits and calculate pace per kilometer
tcx_df['km_split'] = tcx_df['distance_km'].astype(int)
tcx_df['time'] = pd.to_datetime(tcx_df['time'])

# For each point, calculate pace over previous 1km
for i in range(len(dist_m)):
    # Find the distance 1km back from current point
    target_dist = dist_m[i] - window_m
    
    # Find the closest point at or before target distance
    start_idx = np.searchsorted(dist_m, target_dist, side='right') - 1
    start_idx = max(0, start_idx)  # Ensure we don't go below 0
    
    # Calculate pace if we have sufficient distance
    dt = time_s[i] - time_s[start_idx]
    dx_km = (dist_m[i] - dist_m[start_idx]) / 1000
    
    if dx_km > 0.1:  # Only calculate if we have at least 100m
        pace_1km[i] = (dt / 60) / dx_km

# Assign back to DataFrame
tcx_df['pace_1km'] = pace_1km


Unnamed: 0,km_split,min,max,time_diff_seconds,pace_min_per_km,pace_formatted
0,0,2025-07-27 12:35:29+00:00,2025-07-27 12:42:09+00:00,400.0,6.666667,6:40
1,1,2025-07-27 12:42:10+00:00,2025-07-27 12:49:09+00:00,419.0,6.983333,6:59
2,2,2025-07-27 12:49:10+00:00,2025-07-27 12:56:30+00:00,440.0,7.333333,7:20
3,3,2025-07-27 12:56:31+00:00,2025-07-27 13:04:38+00:00,487.0,8.116667,8:07
4,4,2025-07-27 13:04:39+00:00,2025-07-27 13:11:34+00:00,415.0,6.916667,6:55
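
A per-kilometer table with the columns shown above could be produced with a groupby on the integer kilometer, mirroring the pattern of the 200 m split cell. This is a sketch under that assumption, not necessarily the code that generated the output; the function name `km_splits_from_track` and the toy track are hypothetical.

```python
import numpy as np
import pandas as pd


def km_splits_from_track(df):
    """Aggregate a track into whole-kilometer splits with pace in min/km."""
    df = df.copy()
    df["km_split"] = df["distance_km"].astype(int)
    splits = df.groupby("km_split")["time"].agg(["min", "max"]).reset_index()
    splits["time_diff_seconds"] = (splits["max"] - splits["min"]).dt.total_seconds()
    # Each complete split covers 1 km, so elapsed minutes equal min/km
    splits["pace_min_per_km"] = splits["time_diff_seconds"] / 60.0
    return splits


# Toy track (hypothetical data): 2.5 m/s sampled once per second for 2 km
seconds = np.arange(801)
track = pd.DataFrame({
    "time": pd.Timestamp("2025-07-27 12:35:29", tz="UTC") + pd.to_timedelta(seconds, unit="s"),
    "distance_km": seconds * 2.5 / 1000.0,
})
splits = km_splits_from_track(track)
print(splits)
```

Because each split is timed from its first to its last sample, the gap between the last sample of one split and the first sample of the next is not counted, so split times are understated by roughly one sampling interval. The final, partial split also needs separate handling since it covers less than 1 km.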


In [None]:
# Create 200m splits for granular pace analysis
tcx_df['split_200m'] = (tcx_df['distance_km'] * 1000 // 200).astype(int)
tcx_df['time'] = pd.to_datetime(tcx_df['time'])

# Aggregate time data by 200m segments
splits_200m = tcx_df.groupby('split_200m')['time'].agg(['min', 'max']).reset_index()
splits_200m.columns = ['split_200m', 'time_start', 'time_end']

# Calculate pace metrics for each 200m segment
splits_200m['time_diff_seconds'] = (splits_200m['time_end'] - splits_200m['time_start']).dt.total_seconds()
splits_200m['pace_min_per_km'] = (splits_200m['time_diff_seconds'] / 60.0) / 0.2
splits_200m['pace_formatted'] = splits_200m['pace_min_per_km'].apply(format_pace_value)
splits_200m['distance_km'] = (splits_200m['split_200m'] + 1) * 0.2
splits_200m['distance_km_center'] = splits_200m['split_200m'] * 0.2 + 0.1

# Apply 1km rolling average to smooth pace variations
rolling_window = 5  # 5 splits × 200m = 1km smoothing window
splits_200m['pace_min_per_km_rolling'] = splits_200m['pace_min_per_km'].rolling(
    window=rolling_window, center=True, min_periods=1
).mean()
splits_200m['pace_rolling_formatted'] = splits_200m['pace_min_per_km_rolling'].apply(format_pace_value)

splits_200m.head()

Unnamed: 0,split_200m,time_start,time_end,time_diff_seconds,pace_min_per_km,pace_formatted,distance_km,distance_km_center,pace_min_per_km_rolling,pace_rolling_formatted
0,0,2025-07-27 12:35:29+00:00,2025-07-27 12:36:48+00:00,79.0,6.583333,6:35,0.2,0.1,6.527778,6:32
1,1,2025-07-27 12:36:49+00:00,2025-07-27 12:38:07+00:00,78.0,6.5,6:30,0.4,0.3,6.5625,6:34
2,2,2025-07-27 12:38:08+00:00,2025-07-27 12:39:26+00:00,78.0,6.5,6:30,0.6,0.5,6.6,6:36
3,3,2025-07-27 12:39:27+00:00,2025-07-27 12:40:47+00:00,80.0,6.666667,6:40,0.8,0.7,6.666667,6:40
4,4,2025-07-27 12:40:48+00:00,2025-07-27 12:42:09+00:00,81.0,6.75,6:45,1.0,0.9,6.8,6:48


## 3. Overall Trends

The following graph illustrates my pace and elevation across the 26.2-mile race. 

In [40]:
# Create dual-axis plot: pace and elevation vs distance
fig = go.Figure()

# Add rolling average pace line using 1km rolling pace from tcx_df
fig.add_trace(
    go.Scatter(
        x=tcx_df['distance_km'],
        y=tcx_df['pace_1km'],
        mode='lines',
        name='Pace (1km rolling avg)',
        line=dict(color='red', width=2),
        hovertemplate='Distance: %{x:.2f}km<br>Pace: %{customdata}<extra></extra>',
        customdata=tcx_df['pace_1km_str'],
        yaxis='y'
    )
)

# Add elevation line on secondary y-axis using full GPS data
fig.add_trace(
    go.Scatter(
        x=tcx_df['distance_km'],
        y=tcx_df['elevation'],
        mode='lines',
        name='Elevation (m)',
        line=dict(color='blue', width=1),
        opacity=0.7,
        hovertemplate='Distance: %{x:.2f}km<br>Elevation: %{y:.0f}m<extra></extra>',
        yaxis='y2'
    )
)

# Add vertical lines for water stations with legend
for i, (_, station) in enumerate(aid_stations_df.iterrows()):
    fig.add_vline(
        x=station['distance_from_start_km'],
        line=dict(color='green', width=2, dash='dot'),
        opacity=0.7
    )
    # Add invisible trace for legend (only for first station)
    if i == 0:
        fig.add_trace(
            go.Scatter(
                x=[None], y=[None],
                mode='lines',
                name='Aid Stations',
                line=dict(color='green', width=2, dash='dot'),
                showlegend=True
            )
        )

# Update layout with dual y-axes
fig.update_layout(
    title='Marathon Performance: Pace vs Elevation Profile with Aid Stations',
    xaxis_title='Distance (km)',
    yaxis=dict(title='Pace (min/km)', side='left'),
    yaxis2=dict(title='Elevation (m)', side='right', overlaying='y'),
    width=1000,
    height=500,
    hovermode='x unified',
    legend=dict(x=0.02, y=0.98),
)

fig.show()

## 4. Heart Rate vs Pace Analysis

## 5. Time Losses at Events

## 6. Heatmap

In [13]:
con.commit()
con.close()


## 7. Conclusions