In [2]:
import pandas as pd
import numpy as np 
from gpxutils import parse_gpx 
import matplotlib.pyplot as plt
%matplotlib inline

# Analysis of Cycling Data

We are provided with four files containing recordings of cycling activities that include GPS location data as
well as some measurements related to cycling performance, such as heart rate and power.  The goal is to perform
some exploration and analysis of this data. 

The data represents four races.  Two are time trials, where the rider rides alone on a set course.  Two are 
road races, where the rider rides with a peloton.  All were held on the same course, but the road races include
two laps whereas the time trials include just one. 

Questions to explore with the data:
* What is the overall distance travelled for each of the rides? What are the average speeds etc.  Provide a summary for each ride.
* Compare the range of speeds for each ride. Are time trials faster than road races? 
* Compare the speeds achieved in the two time trials (three years apart).  As well as looking at the averages, can you see where in the ride one or the other is faster?  
* From the elevation_gain field you can see whether the rider is _climbing_, _descending_ or on the _flat_.   Use this to calculate the average speeds in those three cases (climbing, flat or descending).  Note that _flat_ need not mean exactly zero elevation_gain; it might allow for slight climbs and falls.  

For time-varying data like this it is often useful to _smooth_ the data using, e.g., a [rolling mean](https://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.rolling_mean.html) (the linked page documents pandas 0.17; in current pandas the equivalent is `Series.rolling(...).mean()`).  You might want to experiment with smoothing in some of your analysis (not required but may be of interest).
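As a minimal sketch of the smoothing idea, the cell below applies a centred rolling mean to a short, made-up speed series. The data frame `ride`, its values and the window of 3 samples are assumptions chosen for illustration only; a real ride would need a window tuned to its sampling rate.

```python
import pandas as pd

# Hypothetical stand-in for one ride: six speed samples (km/h), one per second.
index = pd.date_range("2019-06-01 22:54:55", periods=6,
                      freq=pd.Timedelta(seconds=1), tz="UTC")
ride = pd.DataFrame({"speed": [0.0, 7.0, 7.0, 40.0, 7.0, 7.0]}, index=index)

# Centred rolling mean over 3 samples; min_periods=1 keeps the first and
# last rows (averaged over 2 samples) instead of turning them into NaN.
ride["speed_smooth"] = (
    ride["speed"].rolling(window=3, center=True, min_periods=1).mean()
)

print(ride)
```

The single 40 km/h spike is spread over its neighbours, so the smoothed column peaks at 18 km/h. A wider window smooths more but also blurs genuine accelerations.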

## Description of Fields

* _index_ is a datetime showing the time that the observation was made (I wasn't riding at night, this is converted to UTC)
* __latitude, longitude, elevation__ from the GPS, the position of the rider at each timepoint, elevation in m
* __temperature__ the current ambient temperature in degrees Celsius
* __power__ the power being generated by the rider in Watts
* __cadence__ the rotational speed of the pedals in revolutions per minute
* __hr__ heart rate in beats per minute
* __elevation_gain__ the change in elevation in m between two observations
* __distance__ distance travelled between observations in km
* __speed__ speed measured in km/h

You are provided with code in [gpxutils.py](gpxutils.py) to read the GPX XML format files that are exported by cycling computers and applications.  The sample files were exported from [Strava](https://strava.com/) and represent four races by Steve Cassidy.


In [3]:
# read the four data files
rr_2016 = parse_gpx('files/Calga_RR_2016.gpx')
tt_2016 = parse_gpx('files/Calga_TT_2016.gpx')
rr_2019 = parse_gpx('files/Calga_RR_2019.gpx')
tt_2019 = parse_gpx('files/Calga_TT_2019.gpx')

In [12]:
rr_2016.head()

Unnamed: 0,latitude,longitude,elevation,temperature,power,cadence,hr,distance,elevation_gain,speed,timedelta
2016-05-14 04:02:41+00:00,-33.415561,151.222303,208.6,29.0,0.0,40.0,102.0,0.0,0.0,0.0,
2016-05-14 04:02:42+00:00,-33.415534,151.222289,208.6,29.0,0.0,40.0,102.0,0.003271,0.0,11.77702,1.0
2016-05-14 04:02:46+00:00,-33.415398,151.22218,208.6,29.0,0.0,40.0,103.0,0.018194,0.0,16.375033,4.0
2016-05-14 04:02:49+00:00,-33.415264,151.222077,208.6,29.0,0.0,55.0,106.0,0.017703,0.0,21.243901,3.0
2016-05-14 04:02:51+00:00,-33.41516,151.222013,208.6,29.0,0.0,61.0,109.0,0.013001,0.0,23.401217,2.0


In [13]:
rr_2016.shape

(2822, 11)

In [14]:
tt_2016.head()

Unnamed: 0,latitude,longitude,elevation,temperature,power,cadence,hr,distance,elevation_gain,speed,timedelta
2016-07-02 23:05:30+00:00,-33.415971,151.222016,111.8,12.0,0.0,58.0,108.0,0.0,0.0,0.0,
2016-07-02 23:05:32+00:00,-33.416026,151.222008,111.8,12.0,0.0,58.0,105.0,0.006161,0.0,11.089134,2.0
2016-07-02 23:05:38+00:00,-33.416034,151.222023,111.8,12.0,0.0,58.0,105.0,0.001652,0.0,0.991282,6.0
2016-07-02 23:06:01+00:00,-33.416041,151.222038,111.8,13.0,0.0,58.0,100.0,0.001595,0.0,0.249655,23.0
2016-07-02 23:06:02+00:00,-33.416048,151.222053,111.8,13.0,0.0,65.0,101.0,0.001595,0.0,5.742071,1.0


In [15]:
tt_2016.shape

(1541, 11)

In [16]:
rr_2019.head()

Unnamed: 0,latitude,longitude,elevation,temperature,power,cadence,hr,distance,elevation_gain,speed,timedelta
2019-06-22 22:33:45+00:00,-33.416592,151.222853,215.0,0.0,0.0,58.0,71.0,0.0,0.0,0.0,
2019-06-22 22:33:46+00:00,-33.416629,151.222877,215.0,0.0,147.0,58.0,71.0,0.004679,0.0,16.842677,1.0
2019-06-22 22:33:47+00:00,-33.416677,151.222905,214.8,0.0,97.0,60.0,71.0,0.005936,-0.2,21.371074,1.0
2019-06-22 22:33:48+00:00,-33.41673,151.222937,214.8,0.0,74.0,61.0,71.0,0.006599,0.0,23.757913,1.0
2019-06-22 22:33:49+00:00,-33.416783,151.222972,214.8,0.0,136.0,62.0,71.0,0.006729,0.0,24.225566,1.0


In [17]:
rr_2019.shape

(5503, 11)

In [18]:
tt_2019.head()

Unnamed: 0,latitude,longitude,elevation,temperature,power,cadence,hr,distance,elevation_gain,speed,timedelta
2019-06-01 22:54:55+00:00,-33.415798,151.22206,219.4,13.0,0.0,0.0,88.0,0.0,0.0,0.0,
2019-06-01 22:54:56+00:00,-33.415782,151.222051,219.4,13.0,0.0,0.0,88.0,0.001965,0.0,7.075656,1.0
2019-06-01 22:54:57+00:00,-33.415767,151.222041,219.4,13.0,0.0,0.0,88.0,0.001909,0.0,6.871582,1.0
2019-06-01 22:54:58+00:00,-33.415751,151.222032,219.4,13.0,0.0,0.0,89.0,0.001965,0.0,7.075656,1.0
2019-06-01 22:54:59+00:00,-33.415735,151.222022,219.4,13.0,0.0,0.0,89.0,0.002007,0.0,7.223997,1.0


In [19]:
tt_2019.shape

(2655, 11)

**1. What is the overall distance travelled for each of the rides? What are the average speeds etc. Provide a summary for each ride.**

In [20]:
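# Print headline figures for one ride and return descriptive statistics.
def summary(ride):
    total_distance = ride.distance.sum()   # km
    total_time = ride.timedelta.sum()      # seconds; the first row's NaN is skipped
    # overall average speed: total distance over total time, converted to hours
    average_speed = total_distance / (total_time / 3600)
    print('Total Distance Covered  :  {:.2f} km'.format(total_distance))
    print('Total time taken  :  {:.2f} seconds'.format(total_time))
    print('Average Speed :  {:.2f} km/hr'.format(average_speed))
    # note: this is the sum of the per-sample power readings, not energy
    print('Total Power Generated :  {:2} watts '.format(ride.power.sum()))
    # position and sampling-interval columns are left out of the statistics table
    return ride.drop(columns=['latitude', 'longitude', 'timedelta']).describe()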

2016: Road Races

In [21]:
print ("Summary Data for Road Races 2016")
summary(rr_2016)

Summary Data for Road Races 2016
Total Distance Covered  :  49.05 km
Total time taken  :  5200.00 seconds
Average Speed :  33.96 km/hr
Total Power Generated :  0.0 watts 


Unnamed: 0,elevation,temperature,power,cadence,hr,distance,elevation_gain,speed
count,2822.0,2822.0,2822.0,2822.0,2822.0,2822.0,2822.0,2822.0
mean,232.404465,25.280652,0.0,65.987952,158.394401,0.017381,-0.003756,34.933085
std,29.725934,1.348746,0.0,34.425881,11.304588,0.015695,0.458872,10.738677
min,176.0,24.0,0.0,0.0,102.0,0.0,-1.6,0.0
25%,209.45,24.0,0.0,68.0,151.0,0.007894,-0.4,26.656312
50%,226.1,25.0,0.0,79.0,158.0,0.011794,0.0,33.307339
75%,258.2,26.0,0.0,87.0,166.0,0.016899,0.4,42.871885
max,295.8,30.0,0.0,117.0,205.0,0.076283,1.2,92.749036


2016: Time Trials

In [23]:
print ("Summary Data for Time Trials 2016")
summary(tt_2016)

Summary Data for Time Trials 2016
Total Distance Covered  :  24.80 km
Total time taken  :  2747.00 seconds
Average Speed :  32.50 km/hr
Total Power Generated :  0.0 watts 


Unnamed: 0,elevation,temperature,power,cadence,hr,distance,elevation_gain,speed
count,1541.0,1541.0,1541.0,1541.0,1541.0,1541.0,1541.0,1541.0
mean,139.068657,10.953277,0.0,83.277093,170.93965,0.016095,-0.002466,33.529963
std,30.301132,0.657937,0.0,21.169978,23.392548,0.015897,0.515343,11.519681
min,85.0,10.0,0.0,0.0,100.0,0.0,-7.4,0.0
25%,116.2,11.0,0.0,77.0,157.0,0.007622,-0.4,25.068271
50%,134.0,11.0,0.0,86.0,161.0,0.010974,0.0,32.840076
75%,165.4,11.0,0.0,96.0,180.0,0.01596,0.4,41.470522
max,202.6,13.0,0.0,118.0,251.0,0.288175,2.4,162.505764


2019: Road Races

In [24]:
print ("Summary Data for Road Races 2019")
summary(rr_2019)

Summary Data for Road Races 2019
Total Distance Covered  :  51.79 km
Total time taken  :  5502.00 seconds
Average Speed :  33.89 km/hr
Total Power Generated :  1175539.0 watts 


Unnamed: 0,elevation,temperature,power,cadence,hr,distance,elevation_gain,speed
count,5503.0,5503.0,5503.0,5503.0,5503.0,5503.0,5503.0,5503.0
mean,243.243576,5.997819,213.617845,70.004906,138.998546,0.009411,0.000254,33.879861
std,30.197981,0.806414,144.123686,29.869938,16.184123,0.002459,0.347322,8.853503
min,185.2,0.0,0.0,0.0,71.0,0.0,-2.0,0.0
25%,219.6,5.0,104.0,66.0,129.0,0.007839,-0.2,28.219962
50%,236.0,6.0,212.0,81.0,142.0,0.009178,0.0,33.04136
75%,269.6,7.0,308.0,89.0,152.0,0.010733,0.2,38.640026
max,310.4,7.0,785.0,120.0,170.0,0.019547,1.0,70.370469


2019: Time Trials

In [25]:
print ("Summary Data for Time Trials 2019")
summary(tt_2019)

Summary Data for Time Trials 2019
Total Distance Covered  :  24.38 km
Total time taken  :  2654.00 seconds
Average Speed :  33.07 km/hr
Total Power Generated :  683840.0 watts 


Unnamed: 0,elevation,temperature,power,cadence,hr,distance,elevation_gain,speed
count,2655.0,2655.0,2655.0,2655.0,2655.0,2655.0,2655.0,2655.0
mean,250.435104,10.19435,257.566855,89.979661,152.741243,0.009183,0.000377,33.057824
std,29.434104,0.833934,80.023555,17.543883,8.217632,0.002715,0.298628,9.773522
min,195.8,9.0,0.0,0.0,88.0,0.0,-1.6,0.0
25%,229.2,10.0,213.5,88.0,150.0,0.007345,-0.2,26.440757
50%,243.8,10.0,264.0,94.0,153.0,0.009228,0.0,33.220108
75%,276.2,10.0,308.0,98.0,158.0,0.010913,0.2,39.286242
max,312.2,13.0,522.0,111.0,166.0,0.017584,0.6,63.300734


**2. Compare the range of speeds for each ride, are time trials faster than road races?**

Road Races 

In [26]:
# Merge data from rr_2016 and rr_2019 into a single table.
print ("Range of Speeds for Road Races")
roadraces = pd.concat([rr_2016, rr_2019])
total_distance = roadraces.distance.sum()
total_time = roadraces.timedelta.sum()
average_speed = total_distance / ((total_time)/3600)
print (('Average Speed of road races :  {:.2f} km/hr').format(average_speed))
print (('Maximum speed of road races : {:.2f}').format(roadraces.speed.max()))
print (('mean speed of road races    : {:.2f}').format(roadraces.speed.mean()))

Range of Speeds for Road Races
Average Speed of road races :  33.92 km/hr
Maximum speed of road races : 92.75
mean speed of road races    : 34.24


Time Trials

In [27]:
# Merge data from tt_2016 and tt_2019 into a single table.
print ("Range of Speeds for Time Trials")
timetrials = pd.concat([tt_2016, tt_2019])
total_distance = timetrials.distance.sum()
total_time = timetrials.timedelta.sum()
average_speed = total_distance / ((total_time)/3600)
print (('Average Speed of time trials :  {:.2f} km/hr').format(average_speed))
print (('Maximum speed of time trials : {:.2f}').format(timetrials.speed.max()))
print (('mean speed of time trials    : {:.2f}').format(timetrials.speed.mean()))

Range of Speeds for Time Trials
Average Speed of time trials :  32.78 km/hr
Maximum speed of time trials : 162.51
mean speed of time trials    : 33.23


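In this data, time trials are not faster than road races. The road races have the higher overall average speed (33.92 km/h against 32.78 km/h, computed as total distance over total time) and the higher mean of the per-sample speeds (34.24 km/h against 33.23 km/h). The time trials do record the higher peak (162.51 km/h against 92.75 km/h), but a speed that high is not plausible on a bicycle and is probably a GPS error rather than a real measurement.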
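The two figures reported above (overall average and mean of the samples) differ because the samples are not evenly spaced in time. The sketch below shows the effect on a tiny made-up table; the values are hypothetical and only the column names `distance`, `timedelta` and `speed` follow the ride tables.

```python
import pandas as pd

# Hypothetical samples: distance in km and timedelta in seconds.
ride = pd.DataFrame({
    "distance":  [0.010, 0.010, 0.100],
    "timedelta": [1.0,   1.0,   20.0],
})
# Per-sample speed in km/h.
ride["speed"] = ride["distance"] / (ride["timedelta"] / 3600)

# Overall average speed: total distance over total time.
overall = ride["distance"].sum() / (ride["timedelta"].sum() / 3600)

# Mean of per-sample speeds: every row counts equally, however long it lasted.
sample_mean = ride["speed"].mean()

print("overall average: {:.2f} km/h".format(overall))
print("mean of samples: {:.2f} km/h".format(sample_mean))
```

Here the long, slow 20-second sample dominates the overall average (about 19.6 km/h) but counts as only one of three rows in the sample mean (30 km/h). For comparing rides, the overall average is usually the safer figure.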

**3. Compare the speeds achieved in the two time trials (three years apart). As well as looking at the averages, can you see where in the ride one or the other is faster.**
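One possible way to see where in the ride one time trial is faster is to split each ride into one-kilometre segments by cumulative distance and compare the average speed in each segment. The sketch below is an assumption about how this could be done, not the notebook's own method: the helper `speed_by_km` and the two small data frames `tt_a` and `tt_b` are hypothetical stand-ins for `tt_2016` and `tt_2019`.

```python
import numpy as np
import pandas as pd

def speed_by_km(ride):
    """Average speed (km/h) within each whole kilometre of a ride."""
    # Position at the start of each sample, in km from the start of the ride.
    start_km = ride["distance"].cumsum() - ride["distance"]
    segment = np.floor(start_km).astype(int).to_numpy()
    grouped = ride.groupby(segment)
    # Distance over time per segment; NaN timedeltas are skipped by sum().
    return grouped["distance"].sum() / (grouped["timedelta"].sum() / 3600)

# Hypothetical rides: eight samples of 0.25 km each, i.e. 2 km in total.
tt_a = pd.DataFrame({"distance": [0.25] * 8,
                     "timedelta": [30.0] * 4 + [25.0] * 4})
tt_b = pd.DataFrame({"distance": [0.25] * 8,
                     "timedelta": [25.0] * 4 + [30.0] * 4})

# Positive values: the second ride is faster in that kilometre.
difference = speed_by_km(tt_b) - speed_by_km(tt_a)
print(difference)
```

With the real rides, plotting `difference` against the kilometre index would show the stretches where each year was quicker. Because the two recordings do not start at exactly the same point, the segments only line up approximately.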

**4.  From the elevation_gain field you can see whether the rider is climbing , descending or on the flat. Use this to calculate the average speeds in those three cases (climbing, flat or descending). Note that flat might not be zero elevation_gain but might allow for slight climbs and falls.**
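A minimal sketch of one approach: label each sample as climbing, flat or descending by comparing `elevation_gain` with a small tolerance, then compute the average speed per label. The helper names and the tolerance `flat_band=0.1` (metres per sample) are assumptions for illustration; a suitable tolerance depends on the sampling interval of the ride.

```python
import numpy as np
import pandas as pd

def label_terrain(ride, flat_band=0.1):
    """Label samples by elevation_gain; flat_band (m) is a hypothetical tolerance."""
    gain = ride["elevation_gain"]
    return np.where(gain > flat_band, "climbing",
                    np.where(gain < -flat_band, "descending", "flat"))

def speed_by_terrain(ride, flat_band=0.1):
    """Average speed (km/h) per terrain label: distance over time in each group."""
    grouped = ride.groupby(label_terrain(ride, flat_band))
    return grouped["distance"].sum() / (grouped["timedelta"].sum() / 3600)

# Hypothetical ride: two climbing, two flat and two descending samples.
ride = pd.DataFrame({
    "elevation_gain": [0.4, 0.4, 0.0, 0.0, -0.4, -0.4],
    "distance":       [0.005, 0.005, 0.009, 0.009, 0.012, 0.012],
    "timedelta":      [1.0] * 6,
})

terrain_speed = speed_by_terrain(ride)
print(terrain_speed)
```

On this toy data the result is 18 km/h climbing, 32.4 km/h on the flat and 43.2 km/h descending. Smoothing `elevation_gain` before labelling may reduce flicker between labels caused by GPS noise.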

## Challenge: Gear Usage

A modern race bike has up to 22 different gears, with two chainrings on the front (attached to the pedals) and 10 or 11 sprockets at the back (attached to the wheel).   The ratio of the number of teeth on the front and rear cogs determines the distance travelled with one revolution of the pedals (often called __development__, measured in metres).  Low development is good for climbing hills while high development is for going fast downhill or in the final sprint. 

We have a measure of the number of rotations of the pedals per minute (__cadence__) and a measure of __speed__.  Using these two variables we should be able to derive a measure of __development__, which would effectively tell us which gear the rider was using at the time.   Development will normally range between __2m__ and __10m__.  Due to errors in GPS and cadence measurements you will see many points outside this range and you should just discard them as outliers. 

Write code to calculate __development__ in _metres_ for each row in a ride.  Plot the result in a _histogram_ and compare the plots for the four rides.   Comment on what you observe in the histograms.
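Since speed in km/h converts to metres per minute, and cadence is pedal revolutions per minute, development can be estimated as metres per minute divided by revolutions per minute. The sketch below assumes that relationship; the helper `development` and the sample values are hypothetical, while the 2 m to 10 m limits come from the text above.

```python
import numpy as np
import pandas as pd

def development(ride, low=2.0, high=10.0):
    """Estimated metres travelled per pedal revolution, outliers removed."""
    # Zero cadence (coasting or stopped) becomes NaN to avoid dividing by zero.
    cadence = ride["cadence"].where(ride["cadence"] > 0)
    metres_per_minute = ride["speed"] * 1000 / 60
    dev = metres_per_minute / cadence
    # Keep only plausible values; NaN rows fail both comparisons and drop out.
    return dev[(dev >= low) & (dev <= high)]

# Hypothetical samples: speed in km/h, cadence in rpm.
ride = pd.DataFrame({
    "speed":   [36.0, 30.0, 7.0, 60.0],
    "cadence": [90.0, 100.0, 0.0, 40.0],
})

dev = development(ride)
# Histogram counts in 0.5 m wide bins between 2 m and 10 m.
counts, edges = np.histogram(dev, bins=16, range=(2.0, 10.0))
print(dev)
print(counts)
```

Only the first two samples survive the filter (about 6.67 m and 5.0 m); the coasting sample and the 25 m outlier are discarded. With matplotlib available, `dev.plot.hist(bins=16)` would draw the same histogram for one ride, and repeating it for each of the four rides allows the comparison asked for above.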



