# Analyzing Performance at the 2025 SF Marathon: Terrain, Pace, and Heart Rate

This notebook explores my performance at the 2025 San Francisco Marathon, using Strava records of my 26.2-mile race. The exploratory analysis examines how course structure and terrain (elevation, distance segments, and aid stations) relate to performance metrics such as pace and heart rate. The goal is to draw insights that guide future training and serve as a reference if I run this marathon again.

The analysis is done using SQL and Python (pandas, matplotlib, Plotly) in a Jupyter notebook. Raw data (TCX and GPX) was cleaned in separate Python scripts.
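
The cleaning scripts are not part of this notebook. As a rough illustration of how trackpoint fields in a TCX file can be turned into rows with columns like those of `tcx_data`, the sketch below parses a small inline sample with the standard library. The function names `parse_trackpoints` and `_text` and the sample values are hypothetical; this is not the script that produced the database.

```python
import xml.etree.ElementTree as ET

# Namespace of Garmin's TrainingCenterDatabase v2 schema (the TCX format)
NS = {"tcx": "http://www.garmin.com/xml/schemas/TrainingCenterDatabase/v2"}

# Tiny hand-written sample for illustration only
SAMPLE_TCX = """<TrainingCenterDatabase xmlns="http://www.garmin.com/xml/schemas/TrainingCenterDatabase/v2">
  <Activities><Activity Sport="Running"><Lap><Track>
    <Trackpoint>
      <Time>2025-07-27T12:35:21Z</Time>
      <Position>
        <LatitudeDegrees>37.794481</LatitudeDegrees>
        <LongitudeDegrees>-122.394033</LongitudeDegrees>
      </Position>
      <AltitudeMeters>6.2</AltitudeMeters>
      <DistanceMeters>0.0</DistanceMeters>
      <HeartRateBpm><Value>116</Value></HeartRateBpm>
    </Trackpoint>
    <Trackpoint>
      <Time>2025-07-27T12:35:22Z</Time>
      <AltitudeMeters>6.0</AltitudeMeters>
      <DistanceMeters>1.4</DistanceMeters>
    </Trackpoint>
  </Track></Lap></Activity></Activities>
</TrainingCenterDatabase>"""


def _text(node, path, cast):
    """Return cast(text) of the first element matching path, or None if absent."""
    found = node.find(path, NS)
    return cast(found.text) if found is not None and found.text else None


def parse_trackpoints(tcx_text):
    """Parse every Trackpoint in a TCX document into a list of dicts."""
    root = ET.fromstring(tcx_text)
    rows = []
    for tp in root.iterfind(".//tcx:Trackpoint", NS):
        rows.append({
            "time": _text(tp, "tcx:Time", str),
            "latitude": _text(tp, "tcx:Position/tcx:LatitudeDegrees", float),
            "longitude": _text(tp, "tcx:Position/tcx:LongitudeDegrees", float),
            "elevation": _text(tp, "tcx:AltitudeMeters", float),
            "distance": _text(tp, "tcx:DistanceMeters", float),
            "heart_rate": _text(tp, "tcx:HeartRateBpm/tcx:Value", int),
        })
    return rows


rows = parse_trackpoints(SAMPLE_TCX)
```

Fields missing from a trackpoint (here, position and heart rate in the second sample) come back as `None`, which leaves the decision about dropping or filling them to the cleaning step.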

## 1. Overview of the Data 

In [24]:
# Data, SQL, visualization imports
import sqlite3
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from geopy.distance import geodesic
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go


In [9]:
# Connect to SQLite database, display tables and columns
con = sqlite3.connect("../data/marathon_data.db") 
cur = con.cursor()

# Table overview
tables = pd.read_sql_query("SELECT name FROM sqlite_master WHERE type='table';", con)
print("Tables:", tables['name'].tolist())

# Show structure for each table
for table in tables['name']:
    print(f"\n=== {table.upper()} ===")
    cols = pd.read_sql(f"PRAGMA table_info({table});", con)
    for _, row in cols.iterrows():
        print(f"  {row['name']}: {row['type']}")

Tables: ['tcx_data', 'official_route_data', 'aid_stations_data']

=== TCX_DATA ===
  time: TEXT
  latitude: REAL
  longitude: REAL
  elevation: REAL
  distance: REAL
  heart_rate: INTEGER

=== OFFICIAL_ROUTE_DATA ===
  latitude: REAL
  longitude: REAL
  elevation: REAL

=== AID_STATIONS_DATA ===
  name: TEXT
  latitude: REAL
  longitude: REAL
  type: TEXT


The database contains three tables: `tcx_data`, `official_route_data`, and `aid_stations_data`. The names and data types of the columns in each are as follows:

- `tcx_data`: data from my race recorded on my Apple Watch
    - __time__: timestamp in UTC for each row entry; TEXT (convertible to date-time)
    - __latitude__: GPS latitude in decimal degrees (WGS84 coordinate system); REAL
    - __longitude__: GPS longitude in decimal degrees (WGS84 coordinate system); REAL
    - __elevation__: altitude above sea level, in meters; REAL
    - __distance__: cumulative distance covered since the start, in meters; REAL
    - __heart_rate__: runner’s heart rate at that moment, in beats per minute (BPM); INTEGER

- `official_route_data`: official route data provided by SFM organizers
    - __latitude__: GPS latitude in decimal degrees (WGS84 coordinate system); REAL
    - __longitude__: GPS longitude in decimal degrees (WGS84 coordinate system); REAL
    - __elevation__: altitude above sea level, in meters; REAL

- `aid_stations_data`: aid station waypoints provided by SFM organizers
    - __name__: name of the waypoint; TEXT
    - __latitude__: GPS latitude in decimal degrees (WGS84 coordinate system); REAL
    - __longitude__: GPS longitude in decimal degrees (WGS84 coordinate system); REAL
    - __type__: type of the waypoint; TEXT

The following cell shows the first 5 rows of each table. 

In [10]:
# Sample data from each table
tables_to_show = [
    ("tcx_data", "TCX Data"),
    ("official_route_data", "Official Route Data"),
    ("aid_stations_data", "Aid Stations Data")
]

for table, name in tables_to_show:
    print(f"=== {name} (First 5 Rows) ===")
    sample = pd.read_sql(f"SELECT * FROM {table} LIMIT 5;", con)
    display(sample)
    print() 
    

=== TCX Data (First 5 Rows) ===


Unnamed: 0,time,latitude,longitude,elevation,distance,heart_rate
0,2025-07-27 12:35:21,37.794481,-122.394033,6.2,0.0,116
1,2025-07-27 12:35:22,37.794512,-122.394049,6.0,1.4,116
2,2025-07-27 12:35:23,37.794526,-122.394061,6.0,2.8,116
3,2025-07-27 12:35:24,37.794539,-122.394074,6.4,4.1,116
4,2025-07-27 12:35:25,37.794552,-122.394086,6.4,5.5,116



=== Official Route Data (First 5 Rows) ===


Unnamed: 0,latitude,longitude,elevation
0,37.79517,-122.39377,3.74
1,37.79528,-122.39386,3.74
2,37.79534,-122.39378,3.33
3,37.795945,-122.394335,3.21
4,37.79655,-122.39489,3.12



=== Aid Stations Data (First 5 Rows) ===


Unnamed: 0,name,latitude,longitude,type
0,Water Stop/Aid Station,37.80709,-122.42587,Aid Station
1,Water Stop/Aid Station,37.80618,-122.46882,Aid Station
2,Water Stop/Aid Station,37.8324,-122.47991,Aid Station
3,Water Stop/Aid Station,37.83621,-122.47356,Aid Station
4,Water Stop/Aid Station,37.83245,-122.4818,Aid Station





## 2. Data Cleaning and Processing

Before analysis, the raw data was checked and adjusted to improve consistency and interpretability:

- **Distance units:** Converted from meters to kilometers for readability.  
- **Route validation:** Compared the recorded elevation profile against the official Strava course to confirm alignment.  
- **Official course distance:** Since the official data only contained coordinates and elevation, cumulative distance in kilometers was calculated in Python (geodesic distance between consecutive points) for direct comparison.  
- **Elevation outliers:** A single unrealistic spike (~250 m) was masked (values above 100 m set to missing) and filled using linear interpolation.  
- **Split calculation (feature engineering):** Per-kilometer average pace was derived by aggregating data into kilometer splits, providing a clearer view of pacing patterns.  


In [18]:
# Query tcx race data with distance in km
tcx_query = """
SELECT
    time,
    latitude,
    longitude,
    elevation,
    distance / 1000.0 AS distance_km,
    heart_rate
FROM tcx_data
"""

tcx_df = pd.read_sql(tcx_query, con)

tcx_df.head()

Unnamed: 0,time,latitude,longitude,elevation,distance_km,heart_rate
0,2025-07-27 12:35:21,37.794481,-122.394033,6.2,0.0,116
1,2025-07-27 12:35:22,37.794512,-122.394049,6.0,0.0014,116
2,2025-07-27 12:35:23,37.794526,-122.394061,6.0,0.0028,116
3,2025-07-27 12:35:24,37.794539,-122.394074,6.4,0.0041,116
4,2025-07-27 12:35:25,37.794552,-122.394086,6.4,0.0055,116


In [22]:
# Query official route data and calculate cumulative distance
official_route_df = pd.read_sql("SELECT * FROM official_route_data;", con)

# Calculate segment distances between consecutive GPS points
distances = [0.0]
for i in range(1, len(official_route_df)):
    start = (official_route_df.iloc[i-1]['latitude'], official_route_df.iloc[i-1]['longitude'])
    end = (official_route_df.iloc[i]['latitude'], official_route_df.iloc[i]['longitude'])
    distances.append(geodesic(start, end).km)

official_route_df['segment_distance'] = distances
official_route_df['cumulative_distance_km'] = official_route_df['segment_distance'].cumsum()

official_route_df.head()


Unnamed: 0,latitude,longitude,elevation,segment_distance,cumulative_distance_km
0,37.79517,-122.39377,3.74,0.0,0.0
1,37.79528,-122.39386,3.74,0.014557,0.014557
2,37.79534,-122.39378,3.33,0.009695,0.024252
3,37.795945,-122.394335,3.21,0.083058,0.10731
4,37.79655,-122.39489,3.12,0.083058,0.190368


In [20]:
# Query aid stations data
aid_stations_df = pd.read_sql("SELECT * FROM aid_stations_data;", con)
aid_stations_df.head()

Unnamed: 0,name,latitude,longitude,type
0,Water Stop/Aid Station,37.80709,-122.42587,Aid Station
1,Water Stop/Aid Station,37.80618,-122.46882,Aid Station
2,Water Stop/Aid Station,37.8324,-122.47991,Aid Station
3,Water Stop/Aid Station,37.83621,-122.47356,Aid Station
4,Water Stop/Aid Station,37.83245,-122.4818,Aid Station


In [None]:
# Create plots to compare my race to official route
fig = make_subplots(
    rows=1, cols=2,
    subplot_titles=('Elevation Comparison', 'Route Map Comparison'),
    horizontal_spacing=0.15,
    specs=[[{"secondary_y": False}, {"secondary_y": False}]]
)

# First subplot: Elevation comparison (left)
fig.add_trace(
    go.Scatter(
        x=tcx_df['distance_km'],
        y=tcx_df['elevation'],
        mode='lines',
        name='My Race Elevation',
        line=dict(color='blue', width=2),
        opacity=0.7,
        hovertemplate='Distance: %{x:.2f} km<br>My Elevation: %{y:.1f} m<extra></extra>'
    ),
    row=1, col=1
)

fig.add_trace(
    go.Scatter(
        x=official_route_df['cumulative_distance_km'],
        y=official_route_df['elevation'],
        mode='lines',
        name='Official Route Elevation',
        line=dict(color='red', width=2),
        opacity=0.7,
        hovertemplate='Distance: %{x:.2f} km<br>Official Elevation: %{y:.1f} m<extra></extra>'
    ),
    row=1, col=1
)

# Second subplot: Route map (right)
fig.add_trace(
    go.Scatter(
        x=tcx_df['longitude'],
        y=tcx_df['latitude'],
        mode='lines',
        name='My Race Route',
        line=dict(color='blue', width=3),
        opacity=0.7,
        hovertemplate='Lat: %{y:.5f}<br>Lon: %{x:.5f}<br>My Route<extra></extra>',
        showlegend=False
    ),
    row=1, col=2
)

fig.add_trace(
    go.Scatter(
        x=official_route_df['longitude'],
        y=official_route_df['latitude'],
        mode='lines',
        name='Official Route',
        line=dict(color='red', width=2, dash='dash'),
        opacity=0.7,
        hovertemplate='Lat: %{y:.5f}<br>Lon: %{x:.5f}<br>Official Route<extra></extra>',
        showlegend=False
    ),
    row=1, col=2
)

# Add aid stations to map (right)
fig.add_trace(
    go.Scatter(
        x=aid_stations_df['longitude'],
        y=aid_stations_df['latitude'],
        mode='markers',
        name='Aid Stations',
        marker=dict(color='green', size=8, symbol='circle'),
        hovertemplate='%{text}<br>Lat: %{y:.5f}<br>Lon: %{x:.5f}<extra></extra>',
        text=aid_stations_df['name'] if 'name' in aid_stations_df.columns else 'Aid Station',
        showlegend=False
    ),
    row=1, col=2
)

# Update layout and axis labels
fig.update_layout(
    title_text='Marathon Analysis: Elevation and Route Comparison',
    height=500,
    width=1200,
    showlegend=True
)
fig.update_xaxes(title_text="Distance (km)", row=1, col=1)
fig.update_xaxes(title_text="Longitude", row=1, col=2)
fig.update_yaxes(title_text="Elevation (m)", row=1, col=1)
fig.update_yaxes(title_text="Latitude", row=1, col=2, scaleanchor="x2", scaleratio=1)

fig.show()
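
The comparison above is visual. As a complement, the alignment could also be summarized numerically. The sketch below resamples a reference elevation profile onto the recorded distances with `np.interp` and reports the mean and maximum absolute difference. The function name `elevation_mismatch` and the toy profiles are hypothetical, and the approach assumes the reference distances are increasing and that both tracks measure distance from the same start.

```python
import numpy as np


def elevation_mismatch(rec_dist_km, rec_elev_m, off_dist_km, off_elev_m):
    """Mean and max absolute elevation difference in meters.

    The official profile is linearly interpolated at the recorded distances,
    so both profiles are compared at the same points along the course.
    """
    rec_dist_km = np.asarray(rec_dist_km, dtype=float)
    rec_elev_m = np.asarray(rec_elev_m, dtype=float)
    # np.interp expects the reference x-values (off_dist_km) to be increasing
    official_on_recorded = np.interp(rec_dist_km, off_dist_km, off_elev_m)
    diff = np.abs(rec_elev_m - official_on_recorded)
    return float(diff.mean()), float(diff.max())


# Toy profiles (not race data)
off_dist = [0.0, 1.0, 2.0]
off_elev = [0.0, 10.0, 20.0]
rec_dist = [0.0, 0.5, 1.5, 2.0]
rec_elev = [1.0, 5.0, 16.0, 20.0]

mean_diff, max_diff = elevation_mismatch(rec_dist, rec_elev, off_dist, off_elev)
print(f"mean |diff| = {mean_diff:.2f} m, max |diff| = {max_diff:.2f} m")
```

With the notebook's data, the inputs would be `tcx_df['distance_km']`, `tcx_df['elevation']`, `official_route_df['cumulative_distance_km']`, and `official_route_df['elevation']`. A single outlier such as the ~250 m spike would dominate the maximum, so the number is more informative once that spike has been handled. Recorded GPS distance can also drift from the official distance, so matching by distance is only approximate.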

In [26]:
# Mask elevations above 100 m (the unrealistic spike) and fill the gap by linear interpolation
tcx_df.loc[tcx_df['elevation'] > 100, 'elevation'] = None
tcx_df['elevation'] = tcx_df['elevation'].interpolate()

fig = go.Figure()

fig.add_trace(go.Scatter(
    x=tcx_df['distance_km'],
    y=tcx_df['elevation'],
    mode='lines',
    name='Elevation',
    line=dict(color='blue', width=2),
    hovertemplate='Distance: %{x:.2f} km<br>Elevation: %{y:.1f} m<extra></extra>'
))

fig.update_layout(
    title='Elevation over Full Marathon Route (cleaned data)',
    xaxis_title='Distance (km)',
    yaxis_title='Elevation (m)',
    width=900,
    height=500,
    hovermode='x'
)

fig.show()


In [None]:
# Create km splits and calculate pace per kilometer
tcx_df['km_split'] = tcx_df['distance_km'].astype(int)
tcx_df['time'] = pd.to_datetime(tcx_df['time'])

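def format_pace_value(pace):
    """Format a pace given in decimal minutes per km as an 'M:SS' string."""
    if pace is None or np.isnan(pace):
        return None
    # Round to whole seconds first so 6.9999 becomes 7:00 rather than 6:60
    minutes, seconds = divmod(int(round(pace * 60)), 60)
    return f"{minutes}:{seconds:02d}"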

# Aggregate by km; elapsed minutes within each ~1 km split approximate pace in min/km
splits = tcx_df.groupby('km_split')['time'].agg(['min', 'max']).reset_index()
splits['time_diff_seconds'] = (splits['max'] - splits['min']).dt.total_seconds()
splits['pace_min_per_km'] = splits['time_diff_seconds'] / 60.0
splits['pace_formatted'] = splits['pace_min_per_km'].apply(format_pace_value)

splits.head()

Unnamed: 0,km_split,min,max,time_diff_seconds,pace_min_per_km,pace_formatted
0,0,2025-07-27 12:35:21,2025-07-27 12:42:09,408.0,6.8,6:48
1,1,2025-07-27 12:42:10,2025-07-27 12:49:09,419.0,6.983333,6:59
2,2,2025-07-27 12:49:10,2025-07-27 12:56:30,440.0,7.333333,7:20
3,3,2025-07-27 12:56:31,2025-07-27 13:04:38,487.0,8.116667,8:07
4,4,2025-07-27 13:04:39,2025-07-27 13:11:34,415.0,6.916667,6:55
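
In the split table above, `time_diff_seconds` is the elapsed time between the first and last sample in each kilometer bin. That equals pace only when the bin covers a full kilometer, so a partial final bin (a marathon ends about 0.2 km into its last bin) would read as an unrealistically fast pace. One possible refinement, sketched here on toy data, measures each split from its first sample to the first sample of the next split and divides by the distance actually covered. The function name `pace_per_split` and the toy track are hypothetical.

```python
import pandas as pd

# Toy track (not race data): steady 6:00 min/km, ending 0.2 km into the last split
track = pd.DataFrame({
    "time": pd.to_datetime([
        "2025-07-27 12:00:00", "2025-07-27 12:03:00", "2025-07-27 12:06:00",
        "2025-07-27 12:09:00", "2025-07-27 12:12:00", "2025-07-27 12:13:12",
    ]),
    "distance_km": [0.0, 0.5, 1.0, 1.5, 2.0, 2.2],
})
track["km_split"] = track["distance_km"].astype(int)


def pace_per_split(df):
    """Pace per split as elapsed minutes divided by distance covered."""
    grouped = df.groupby("km_split")
    start_time = grouped["time"].first()
    start_dist = grouped["distance_km"].first()
    # A split ends where the next one starts; the last split ends at the final sample
    end_time = start_time.shift(-1).fillna(df["time"].iloc[-1])
    end_dist = start_dist.shift(-1).fillna(df["distance_km"].iloc[-1])
    minutes = (end_time - start_time).dt.total_seconds() / 60.0
    km = end_dist - start_dist
    return pd.DataFrame({
        "minutes": minutes,
        "km": km,
        "pace_min_per_km": minutes / km,
    })


result = pace_per_split(track)
print(result)
```

In the toy track the last split lasts only 1.2 minutes but covers 0.2 km, so its normalized pace matches the 6.0 min/km of the full splits. A split that contains a single sample at the very end of the track would have zero distance and would need to be dropped before dividing.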


## 3. Overall Trends

The following graph illustrates my pace and elevation across the 26.2-mile race. 

## 4. Heart Rate vs Pace Analysis

## 5. Time Losses at Events

## 6. Heatmap

In [6]:
con.commit()
con.close()


## 7. Conclusions