# Strava Activities Analysis

This notebook analyzes GPS activity data from Strava to explore workout preferences, performance evolution, and spatial patterns.

We use Python tools such as `pandas`, `geopandas`, `matplotlib`, `seaborn`, and `folium` to visualize and analyze the data.


In [16]:
# Import the packages

import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
import folium
import ipyleaflet
import shapely
import pyproj
import contextily
import seaborn as sns
import h3

Once the data has been retrieved from the Strava API and stored in a CSV file, we can read it with pandas.

In [17]:
# Load the CSV
ruta_csv = "strava_data.csv"
activities = pd.read_csv(ruta_csv)
activities

Unnamed: 0,resource_state,name,distance,moving_time,elapsed_time,total_elevation_gain,type,sport_type,id,start_date,...,has_kudoed,athlete.id,athlete.resource_state,map.id,map.summary_polyline,map.resource_state,workout_type,average_watts,device_watts,kilojoules
0,2,Afternoon HIIT,0.0,4531,4531,0.0,Workout,HighIntensityIntervalTraining,14500223317,2025-05-16T15:44:26Z,...,False,109915717,1,a14500223317,,2,,,,
1,2,Evening HIIT,0.0,3774,3774,0.0,Workout,HighIntensityIntervalTraining,14490724599,2025-05-15T16:24:27Z,...,False,109915717,1,a14490724599,,2,,,,
2,2,Lunch Ride - Innersbachklamm,102228.0,21805,29696,545.7,Ride,Ride,14480127482,2025-05-14T09:02:05Z,...,False,109915717,1,a14480127482,oaybHkcqnAvMdFfBnr@{CZrCh@LbIqE`X~OpMaHzGqPnz@...,2,10.0,55.9,False,1219.1
3,2,Night Workout,0.0,4233,4233,0.0,Workout,Workout,14470958749,2025-05-13T18:38:54Z,...,False,109915717,1,a14470958749,,2,,,,
4,2,Afternoon Workout,0.0,7148,7148,0.0,Workout,Workout,14458609248,2025-05-12T15:59:28Z,...,False,109915717,1,a14458609248,,2,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
215,2,Morning Ride,10296.7,3654,3671,344.1,Ride,Ride,9244885459,2023-06-10T12:35:28Z,...,False,109915717,1,a9244885459,}ffYlumgMtCh@`@JzBb@l@^FF?NAJW~A{@nEQn@]t@aAbB...,2,,80.1,False,292.8
216,2,Lunch Walk,3998.6,3009,5817,7.7,Walk,Walk,8621748510,2023-02-25T17:39:16Z,...,False,109915717,1,a8621748510,mf`\f~`cMtCc@rE_@b@AZEFBP?REJCfAITIt@Aj@GdBGdB...,2,,,,
217,2,Arbolito 🌲,23473.4,7171,15255,452.4,Ride,Ride,8348110863,2023-01-06T11:39:16Z,...,False,109915717,1,a8348110863,egfYjumgMhI|Af@XAj@oAtG{@fCqBzCcAj@qFhBoQhNy@f...,2,10.0,,False,
218,2,Ricaurte - Agua de Dios,44084.1,6479,11483,375.5,Ride,Ride,8344653157,2023-01-05T21:17:54Z,...,False,109915717,1,a8344653157,c{eYrwmgMV_@v@wDdBwFdH_RtHeL|RqPkTuIsFeBuQiHgK...,2,10.0,,False,
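
The step that writes the API response to `strava_data.csv` is not shown in this notebook. Dotted column names such as `athlete.id` and `map.summary_polyline` are what `pd.json_normalize` produces when it flattens nested JSON, so that step may have looked roughly like the sketch below. It assumes the activities were already downloaded as a list of dictionaries; the `raw_activities` sample is made up, and an in-memory buffer stands in for the CSV file.

```python
import io

import pandas as pd

# Hypothetical sample shaped like two nested activity records;
# the real records would come from the Strava API.
raw_activities = [
    {"name": "Morning Ride", "distance": 10296.7,
     "athlete": {"id": 1}, "map": {"id": "a1", "summary_polyline": "abc"}},
    {"name": "Evening HIIT", "distance": 0.0,
     "athlete": {"id": 1}, "map": {"id": "a2", "summary_polyline": ""}},
]

# json_normalize flattens nested dicts into dotted column names
flat = pd.json_normalize(raw_activities)

# Round-trip through CSV text (a buffer is used instead of a file on disk)
buffer = io.StringIO()
flat.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

print(list(reloaded.columns))
print(reloaded["map.summary_polyline"].isna().sum())
```

Note that an empty polyline string comes back as a missing value after the round trip, which matches the empty `map.summary_polyline` cells shown for the indoor workouts.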


Next, we explore the data with the pandas `info` method. The output shows that the DataFrame has 220 records and 55 columns, along with the non-null count and data type of each column. It is also useful to list the column names as a single array, so we can identify and select the fields to process later.

In [18]:
#df info
activities.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 220 entries, 0 to 219
Data columns (total 55 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   resource_state                 220 non-null    int64  
 1   name                           220 non-null    object 
 2   distance                       220 non-null    float64
 3   moving_time                    220 non-null    int64  
 4   elapsed_time                   220 non-null    int64  
 5   total_elevation_gain           220 non-null    float64
 6   type                           220 non-null    object 
 7   sport_type                     220 non-null    object 
 8   id                             220 non-null    int64  
 9   start_date                     220 non-null    object 
 10  start_date_local               220 non-null    object 
 11  timezone                       220 non-null    object 
 12  utc_offset                     220 non-null    flo

In [19]:
#df columns
activities.columns

Index(['resource_state', 'name', 'distance', 'moving_time', 'elapsed_time',
       'total_elevation_gain', 'type', 'sport_type', 'id', 'start_date',
       'start_date_local', 'timezone', 'utc_offset', 'location_city',
       'location_state', 'location_country', 'achievement_count',
       'kudos_count', 'comment_count', 'athlete_count', 'photo_count',
       'trainer', 'commute', 'manual', 'private', 'visibility', 'flagged',
       'gear_id', 'start_latlng', 'end_latlng', 'average_speed', 'max_speed',
       'has_heartrate', 'average_heartrate', 'max_heartrate',
       'heartrate_opt_out', 'display_hide_heartrate_option', 'elev_high',
       'elev_low', 'upload_id', 'upload_id_str', 'external_id',
       'from_accepted_tag', 'pr_count', 'total_photo_count', 'has_kudoed',
       'athlete.id', 'athlete.resource_state', 'map.id',
       'map.summary_polyline', 'map.resource_state', 'workout_type',
       'average_watts', 'device_watts', 'kilojoules'],
      dtype='object')
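
The null counts reported by `info` can also be ranked directly, which makes sparse columns easier to spot. A small sketch on a made-up frame (the `sample` data is illustrative and not taken from the CSV):

```python
import pandas as pd

# Illustrative data only: three activities with some missing power fields
sample = pd.DataFrame({
    "name": ["Ride", "HIIT", "Walk"],
    "average_watts": [55.9, None, None],
    "kilojoules": [1219.1, None, 292.8],
})

# Count missing values per column and list the sparsest columns first
null_counts = sample.isna().sum().sort_values(ascending=False)
print(null_counts)
```

The same expression applied to `activities` would give one count per column for all 55 columns.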

We keep only the columns of interest, which improves performance and makes the DataFrame easier to read. The `start_date_local` column combines the date and the time, so we split it into two columns.

In [20]:
# Create a new DataFrame with only the columns of interest
cols = ['name', 'upload_id', 'type', 'distance', 'moving_time',
        'average_speed', 'max_speed', 'total_elevation_gain',
        'start_date_local', 'start_latlng', 'end_latlng']
activities = activities[cols]

In [21]:
# Split the timestamp into separate date and time columns
activities = activities.copy()
activities['start_date_local'] = pd.to_datetime(activities['start_date_local'])
activities['start_time'] = activities['start_date_local'].dt.time
activities['start_date_local'] = activities['start_date_local'].dt.date
activities.head(5)

Unnamed: 0,name,upload_id,type,distance,moving_time,average_speed,max_speed,total_elevation_gain,start_date_local,start_latlng,end_latlng,start_time
0,Afternoon HIIT,15470552955,Workout,0.0,4531,0.0,0.0,0.0,2025-05-16,[],[],17:44:26
1,Evening HIIT,15460331668,Workout,0.0,3774,0.0,0.0,0.0,2025-05-15,[],[],18:24:27
2,Lunch Ride - Innersbachklamm,15449061470,Ride,102228.0,21805,4.688,18.043,545.7,2025-05-14,"[47.810196, 13.038183]","[47.810232, 13.038135]",11:02:05
3,Night Workout,15439339287,Workout,0.0,4233,0.0,0.0,0.0,2025-05-13,[],[],20:38:54
4,Afternoon Workout,15426341743,Workout,0.0,7148,0.0,0.0,0.0,2025-05-12,[],[],17:59:28
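
After the split, `start_date_local` and `start_time` hold plain Python `date` and `time` objects, so the `.dt` accessor is no longer available on them. If time-of-day or weekday patterns are of interest, one option is to derive them while the column is still a datetime. A sketch, assuming `start_date_local` uses the same ISO format as `start_date`; the two sample values are illustrative, and `start_hour` and `weekday` are hypothetical column names:

```python
import pandas as pd

# Illustrative timestamps in the assumed ISO format
sample = pd.DataFrame(
    {"start_date_local": ["2025-05-16T17:44:26Z", "2025-05-14T11:02:05Z"]}
)

# Parse once, then derive the fields needed for later grouping
timestamps = pd.to_datetime(sample["start_date_local"])
sample["start_hour"] = timestamps.dt.hour
sample["weekday"] = timestamps.dt.day_name()

print(sample)
```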


To start working with the geographic data, the next cells print the records with missing coordinates. There are 114 such records, which represent $\frac{114}{220} \times 100 \approx 51.82\%$ of the data.


In [None]:
# Records with a missing start lat/lng
print(activities[activities['start_latlng'].isnull() | activities['start_latlng'].isin(['[]', '', 'None'])])


                             name    upload_id            type  distance  \
0                  Afternoon HIIT  15470552955         Workout       0.0   
1                    Evening HIIT  15460331668         Workout       0.0   
3                   Night Workout  15439339287         Workout       0.0   
4               Afternoon Workout  15426341743         Workout       0.0   
5                  Afternoon HIIT  15415012755         Workout       0.0   
..                            ...          ...             ...       ...   
191                Lunch Crossfit  12008937900        Crossfit       0.0   
192       Morning Weight Training  11996027862  WeightTraining       0.0   
193  Afternoon Swim at Uniandes 🐐  11987571574            Swim      33.5   
194              Morning Crossfit  11977681097        Crossfit       0.0   
196  Afternoon Swim at Uniandes 🐐  11942623045            Swim    1100.0   

     moving_time  average_speed  max_speed  total_elevation_gain  \
0           4531   

In [None]:
# Records with a missing end lat/lng
print(activities[activities['end_latlng'].isnull() | activities['end_latlng'].isin(['[]', '', 'None'])])


                             name    upload_id            type  distance  \
0                  Afternoon HIIT  15470552955         Workout       0.0   
1                    Evening HIIT  15460331668         Workout       0.0   
3                   Night Workout  15439339287         Workout       0.0   
4               Afternoon Workout  15426341743         Workout       0.0   
5                  Afternoon HIIT  15415012755         Workout       0.0   
..                            ...          ...             ...       ...   
191                Lunch Crossfit  12008937900        Crossfit       0.0   
192       Morning Weight Training  11996027862  WeightTraining       0.0   
193  Afternoon Swim at Uniandes 🐐  11987571574            Swim      33.5   
194              Morning Crossfit  11977681097        Crossfit       0.0   
196  Afternoon Swim at Uniandes 🐐  11942623045            Swim    1100.0   

     moving_time  average_speed  max_speed  total_elevation_gain  \
0           4531   
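
The share of records without coordinates can be computed rather than counted by hand. A sketch on a made-up column, using the same missing-value test as the cells above (the `missing_share` helper is hypothetical):

```python
import pandas as pd

def missing_share(series: pd.Series) -> float:
    """Return the percentage of entries that are null or an empty list string."""
    missing = series.isna() | series.isin(["[]", "", "None"])
    return 100 * missing.mean()

# Illustrative values: two of the four entries carry no coordinates
sample = pd.Series(["[]", "[47.81, 13.04]", None, "[4.30, -74.79]"])
print(round(missing_share(sample), 2))
```

Applied to `activities['start_latlng']`, this should reproduce the 51.82% figure quoted above.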

A function that extracts the latitude and longitude is defined and then applied with `map` in the following cells. Once it has been applied, the `activities` DataFrame has separate latitude and longitude columns.

In [24]:
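def extract_lat_lon(latlng_str):
    """Parse a '[lat, lon]' string into a (lat, lon) tuple of floats."""
    try:
        lat, lon = latlng_str.strip('[]').split(',')
        return float(lat), float(lon)
    except (AttributeError, ValueError):
        # NaN values (no .strip) and empty '[]' strings carry no coordinates
        return None, None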
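
As an alternative sketch, the stored strings look like JSON arrays, so `json.loads` can parse them and reject anything that is not a two-element list. The name `parse_latlng` is hypothetical and this is not the function used in the rest of the notebook.

```python
import json

def parse_latlng(value):
    """Parse a '[lat, lon]' string; return (None, None) when no pair is present."""
    if not isinstance(value, str):
        return None, None  # NaN or None read from the CSV
    try:
        lat, lon = json.loads(value)  # fails for '[]', '' or non-list content
        return float(lat), float(lon)
    except (ValueError, TypeError):
        return None, None

print(parse_latlng("[47.810196, 13.038183]"))
print(parse_latlng("[]"))
```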


In [25]:
activities['start_lat'], activities['start_lon'] = zip(*activities['start_latlng'].map(extract_lat_lon))
activities['end_lat'], activities['end_lon'] = zip(*activities['end_latlng'].map(extract_lat_lon))


In [26]:
activities

Unnamed: 0,name,upload_id,type,distance,moving_time,average_speed,max_speed,total_elevation_gain,start_date_local,start_latlng,end_latlng,start_time,start_lat,start_lon,end_lat,end_lon
0,Afternoon HIIT,15470552955,Workout,0.0,4531,0.000,0.000,0.0,2025-05-16,[],[],17:44:26,,,,
1,Evening HIIT,15460331668,Workout,0.0,3774,0.000,0.000,0.0,2025-05-15,[],[],18:24:27,,,,
2,Lunch Ride - Innersbachklamm,15449061470,Ride,102228.0,21805,4.688,18.043,545.7,2025-05-14,"[47.810196, 13.038183]","[47.810232, 13.038135]",11:02:05,47.810196,13.038183,47.810232,13.038135
3,Night Workout,15439339287,Workout,0.0,4233,0.000,0.000,0.0,2025-05-13,[],[],20:38:54,,,,
4,Afternoon Workout,15426341743,Workout,0.0,7148,0.000,0.000,0.0,2025-05-12,[],[],17:59:28,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
215,Morning Ride,9917797597,Ride,10296.7,3654,2.818,7.982,344.1,2023-06-10,"[4.297962170094252, -74.78609061799943]","[4.348327768966556, -74.83286884613335]",07:35:28,4.297962,-74.786091,4.348328,-74.832869
216,Lunch Walk,9255579410,Walk,3998.6,3009,1.329,1.909,7.7,2023-02-25,"[4.75921, -74.066105]","[4.724326, -74.073518]",12:39:16,4.759210,-74.066105,4.724326,-74.073518
217,Arbolito 🌲,8952619873,Ride,23473.4,7171,3.273,10.387,452.4,2023-01-06,"[4.297663, -74.784424]","[4.297661, -74.784449]",06:39:16,4.297663,-74.784424,4.297661,-74.784449
218,Ricaurte - Agua de Dios,8948754056,Ride,44084.1,6479,6.804,15.073,375.5,2023-01-05,"[4.295948, -74.786636]","[4.297545, -74.78435]",16:17:54,4.295948,-74.786636,4.297545,-74.784350


The `geopandas` and `shapely` libraries are used to build a GeoDataFrame from the georeferenced data. The data must be cleaned so that only the rows that contain a latitude and a longitude are kept. The final GeoDataFrame keeps just 106 records.

In [None]:
from shapely.geometry import Point

# Keep only the rows that have start coordinates
located = activities.dropna(subset=['start_lat', 'start_lon'])

# Create Point geometries (x = longitude, y = latitude)
start_geometry = [Point(xy) for xy in zip(located['start_lon'], located['start_lat'])]

# Create a GeoDataFrame with the WGS84 coordinate system
gdf = gpd.GeoDataFrame(located, geometry=start_geometry, crs="EPSG:4326")

gdf


Unnamed: 0,name,upload_id,type,distance,moving_time,average_speed,max_speed,total_elevation_gain,start_date_local,start_latlng,end_latlng,start_time,start_lat,start_lon,end_lat,end_lon,geometry
2,Lunch Ride - Innersbachklamm,15449061470,Ride,102228.0,21805,4.688,18.043,545.7,2025-05-14,"[47.810196, 13.038183]","[47.810232, 13.038135]",11:02:05,47.810196,13.038183,47.810232,13.038135,POINT (13.03818 47.8102)
6,Afternoon Ride,15405126253,Ride,25410.6,5436,4.675,10.180,36.7,2025-05-10,"[47.810012, 13.037966]","[47.810545, 13.03853]",16:26:26,47.810012,13.037966,47.810545,13.038530,POINT (13.03797 47.81001)
8,Lunch Ride - Wiestaltausee,15391491264,Ride,47500.6,9612,4.942,10.960,440.2,2025-05-09,"[47.810077, 13.037723]","[47.810784, 13.037866]",11:28:36,47.810077,13.037723,47.810784,13.037866,POINT (13.03772 47.81008)
9,Morning Ride - Königssee 🇩🇪,15370066510,Ride,70676.2,14971,4.721,11.700,350.0,2025-05-07,"[47.810075, 13.037336]","[47.810325, 13.038629]",10:34:10,47.810075,13.037336,47.810325,13.038629,POINT (13.03734 47.81008)
14,Morning Ride - Waginger See 🇩🇪,15293821114,Ride,75665.7,15721,4.813,13.200,422.3,2025-04-30,"[47.810186, 13.037924]","[47.810239, 13.038488]",09:56:00,47.810186,13.037924,47.810239,13.038488,POINT (13.03792 47.81019)
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
215,Morning Ride,9917797597,Ride,10296.7,3654,2.818,7.982,344.1,2023-06-10,"[4.297962170094252, -74.78609061799943]","[4.348327768966556, -74.83286884613335]",07:35:28,4.297962,-74.786091,4.348328,-74.832869,POINT (-74.78609 4.29796)
216,Lunch Walk,9255579410,Walk,3998.6,3009,1.329,1.909,7.7,2023-02-25,"[4.75921, -74.066105]","[4.724326, -74.073518]",12:39:16,4.759210,-74.066105,4.724326,-74.073518,POINT (-74.0661 4.75921)
217,Arbolito 🌲,8952619873,Ride,23473.4,7171,3.273,10.387,452.4,2023-01-06,"[4.297663, -74.784424]","[4.297661, -74.784449]",06:39:16,4.297663,-74.784424,4.297661,-74.784449,POINT (-74.78442 4.29766)
218,Ricaurte - Agua de Dios,8948754056,Ride,44084.1,6479,6.804,15.073,375.5,2023-01-05,"[4.295948, -74.786636]","[4.297545, -74.78435]",16:17:54,4.295948,-74.786636,4.297545,-74.784350,POINT (-74.78664 4.29595)
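
The end coordinates are parsed above but do not end up in the GeoDataFrame geometry. One possible use, sketched here under the assumption of a spherical Earth with a 6371 km mean radius, is to measure how far apart the start and end points are, for example to flag loop rides. `haversine_m` is a hypothetical helper.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius; spherical approximation

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

# Start and end of "Lunch Ride - Innersbachklamm" from the table above
gap = haversine_m(47.810196, 13.038183, 47.810232, 13.038135)
print(f"{gap:.1f} m")
```

A gap of a few metres suggests the ride started and ended at the same place, while a point-to-point ride would show a gap of kilometres.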


Finally, `folium` is used to display the start points of these activities on a map and to save the map to an HTML file.

In [29]:
# Create the folium map
m = folium.Map(location=[0, 0], zoom_start=2)

# Add each activity as a CircleMarker
for _, row in gdf.iterrows():
    folium.CircleMarker(
        location=[row['start_lat'], row['start_lon']],
        radius=5,
        popup=f"{row['name']} ({row['type']})",
        color='blue',
        fill=True,
        fill_opacity=0.7
    ).add_to(m)

# Save the map to an HTML file
m.save('strava_start_points.html')



In [30]:
m
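
With `location=[0, 0]` and `zoom_start=2` the map opens on a world view, although the start points cluster in a few regions. The view could instead be fitted to the data. The sketch below computes a bounding box in pure Python; `bounding_box` is a hypothetical helper, and the sample points are start coordinates taken from the table above.

```python
def bounding_box(points):
    """Return [[south, west], [north, east]] for an iterable of (lat, lon) pairs."""
    lats, lons = zip(*points)
    return [[min(lats), min(lons)], [max(lats), max(lons)]]

# Three start points from the activities table
points = [(47.810196, 13.038183), (4.297962, -74.786091), (4.75921, -74.066105)]
bounds = bounding_box(points)
print(bounds)
```

The result has the `[[south, west], [north, east]]` shape that folium's `Map.fit_bounds` accepts, so calling `m.fit_bounds(bounds)` on the map should frame all markers.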

In [31]:
m = folium.Map(location=[0, 0], zoom_start=2)

# The coordinates live in the start_lat / start_lon columns
for _, row in gdf.iterrows():
    folium.CircleMarker(
        location=[row['start_lat'], row['start_lon']],
        radius=2,
        color="blue"
    ).add_to(m)