In [3]:
import pandas as pd
import numpy as np 
from gpxutils import parse_gpx 
import matplotlib.pyplot as plt
%matplotlib inline

# Analysis of Cycling Data

We are provided with four files containing recordings of cycling activities that include GPS location data as
well as some measurements related to cycling performance, such as heart rate and power.  The goal is to
explore and analyse this data.

The data represents four races.  Two are time trials, where the rider rides alone on a set course.  Two are
road races, where the rider rides with a peloton.  All were held on the same course, but the road races include
two laps whereas the time trials include just one.

Questions to explore with the data:
* What is the overall distance travelled for each of the rides? What are the average speeds? Provide a summary for each ride.
* Compare the range of speeds for each ride. Are time trials faster than road races?
* Compare the speeds achieved in the two time trials (three years apart).  As well as looking at the averages, can you see where in the ride one or the other is faster?
* From the elevation_gain field you can see whether the rider is _climbing_, _descending_ or on the _flat_.   Use this to calculate the average speeds in those three cases (climbing, flat or descending).  Note that _flat_ need not mean exactly zero elevation_gain; it can allow for slight climbs and falls.
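
One possible way to approach the last question is to bin `elevation_gain` with a small tolerance around zero and then group speeds by the resulting label. The sketch below is illustrative only: the helper name `classify_terrain` and the 0.1 m tolerance are assumptions rather than part of the provided code, and the toy frame stands in for a ride loaded with `parse_gpx`.

```python
import numpy as np
import pandas as pd

def classify_terrain(elevation_gain, flat_tolerance=0.1):
    """Label each observation as 'climbing', 'flat' or 'descending'.

    flat_tolerance (metres per observation) is an assumed threshold;
    tune it for your own data.
    """
    labels = np.select(
        [elevation_gain > flat_tolerance, elevation_gain < -flat_tolerance],
        ["climbing", "descending"],
        default="flat",
    )
    return pd.Series(labels, index=elevation_gain.index, name="terrain")

# Toy data standing in for a real ride
ride = pd.DataFrame({
    "elevation_gain": [0.4, 0.0, -0.6, 0.05, 0.8, -0.2],
    "speed": [20.0, 35.0, 50.0, 33.0, 18.0, 45.0],
})
ride["terrain"] = classify_terrain(ride["elevation_gain"])

# Mean of the speed samples within each terrain class
mean_speed_by_terrain = ride.groupby("terrain")["speed"].mean()
print(mean_speed_by_terrain)
```

On the real rides the same two lines (`classify_terrain` followed by `groupby`) could be applied to each DataFrame in turn; the tolerance is worth varying to see how sensitive the averages are to it.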

For time-varying data like this it is often useful to _smooth_ the data using, e.g., a [rolling mean](https://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.rolling_mean.html).  You might want to experiment with smoothing in some of your analysis (not required but may be of interest).
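
As a rough illustration of smoothing: in current pandas releases the older `pd.rolling_mean` function is no longer available, and the equivalent is `Series.rolling(window).mean()`. The sketch below uses a synthetic 1 Hz speed trace (an assumption, not data from the rides), and the 10-sample window is an arbitrary choice.

```python
import numpy as np
import pandas as pd

# Synthetic 1 Hz speed trace standing in for a ride's `speed` column:
# a slow trend plus sample-to-sample jitter
index = pd.date_range("2019-06-01 22:54:55", periods=60,
                      freq=pd.Timedelta(seconds=1), tz="UTC")
t = np.arange(60)
speed = pd.Series(30 + 5 * np.sin(2 * np.pi * t / 60) + 3 * (-1.0) ** t,
                  index=index, name="speed")

# Centred 10-sample rolling mean; min_periods=1 avoids NaN at the edges
smoothed = speed.rolling(window=10, center=True, min_periods=1).mean()

print(speed.std(), smoothed.std())
```

The smoothed series keeps the slow trend while damping the jitter, so it has a smaller standard deviation than the raw trace. Larger windows smooth more but blur short efforts such as sprints.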

## Description of Fields

* _index_ is a datetime showing the time that the observation was made (I wasn't riding at night; the time is converted to UTC)
* __latitude, longitude, elevation__ from the GPS, the position of the rider at each timepoint, elevation in m
* __temperature__ the current ambient temperature in degrees Celsius
* __power__ the power being generated by the rider in Watts
* __cadence__ the rotational speed of the pedals in revolutions per minute
* __hr__ heart rate in beats per minute
* __elevation_gain__ the change in elevation in m between two observations
* __distance__ distance travelled between observations in km
* __speed__ speed measured in km/h

You are provided with code in [gpxutils.py](gpxutils.py) to read the GPX XML format files that are exported by cycling computers and applications.  The sample files were exported from [Strava](https://strava.com/) and represent four races by Steve Cassidy.


In [4]:
# read the four data files
rr_2016 = parse_gpx('files/Calga_RR_2016.gpx')
tt_2016 = parse_gpx('files/Calga_TT_2016.gpx')
rr_2019 = parse_gpx('files/Calga_RR_2019.gpx')
tt_2019 = parse_gpx('files/Calga_TT_2019.gpx')

## Examining the 2016 Road Race

In [5]:
rr_2016.head()

Unnamed: 0,latitude,longitude,elevation,temperature,power,cadence,hr,distance,elevation_gain,speed,timedelta
2016-05-14 04:02:41+00:00,-33.415561,151.222303,208.6,29.0,0.0,40.0,102.0,0.0,0.0,0.0,
2016-05-14 04:02:42+00:00,-33.415534,151.222289,208.6,29.0,0.0,40.0,102.0,0.003271,0.0,11.77702,1.0
2016-05-14 04:02:46+00:00,-33.415398,151.22218,208.6,29.0,0.0,40.0,103.0,0.018194,0.0,16.375033,4.0
2016-05-14 04:02:49+00:00,-33.415264,151.222077,208.6,29.0,0.0,55.0,106.0,0.017703,0.0,21.243901,3.0
2016-05-14 04:02:51+00:00,-33.41516,151.222013,208.6,29.0,0.0,61.0,109.0,0.013001,0.0,23.401217,2.0


In [6]:
# Number of rows and number of columns in rr_2016
rr_2016.shape

(2822, 11)

In [23]:
rr_2016.corr()

Unnamed: 0,latitude,longitude,elevation,temperature,power,cadence,hr,distance,elevation_gain,speed,timedelta
latitude,1.0,0.89623,0.902604,-0.243038,,-0.053667,-0.164643,-0.028951,0.003261,0.017049,-0.018191
longitude,0.89623,1.0,0.705915,-0.262367,,-0.114852,-0.15183,-0.036753,-0.00701,0.020254,-0.028617
elevation,0.902604,0.705915,1.0,-0.116702,,0.02203,-0.240242,0.034963,0.015736,-0.035951,0.057167
temperature,-0.243038,-0.262367,-0.116702,1.0,,0.077495,-0.111024,0.002694,0.126452,-0.147017,0.045334
power,,,,,,,,,,,
cadence,-0.053667,-0.114852,0.02203,0.077495,,1.0,-0.188127,0.091947,0.505334,-0.441832,0.177566
hr,-0.164643,-0.15183,-0.240242,-0.111024,,-0.188127,1.0,-0.017371,-0.166509,0.2365,-0.067493
distance,-0.028951,-0.036753,0.034963,0.002694,,0.091947,-0.017371,1.0,-0.12152,0.149308,0.952109
elevation_gain,0.003261,-0.00701,0.015736,0.126452,,0.505334,-0.166509,-0.12152,1.0,-0.817032,0.053663
speed,0.017049,0.020254,-0.035951,-0.147017,,-0.441832,0.2365,0.149308,-0.817032,1.0,-0.100493


In [24]:
# Statistical summary of rr_2016
rr_2016.describe()

Unnamed: 0,latitude,longitude,elevation,temperature,power,cadence,hr,distance,elevation_gain,speed,timedelta
count,2822.0,2822.0,2822.0,2822.0,2822.0,2822.0,2822.0,2822.0,2822.0,2822.0,2821.0
mean,-33.368017,151.225527,232.404465,25.280652,0.0,65.987952,158.394401,0.017381,-0.003756,34.933085,1.843318
std,0.028329,0.006014,29.725934,1.348746,0.0,34.425881,11.304588,0.015695,0.458872,10.738677,1.692364
min,-33.416753,151.211496,176.0,24.0,0.0,0.0,102.0,0.0,-1.6,0.0,1.0
25%,-33.393691,151.221912,209.45,24.0,0.0,68.0,151.0,0.007894,-0.4,26.656312,1.0
50%,-33.37182,151.227236,226.1,25.0,0.0,79.0,158.0,0.011794,0.0,33.307339,1.0
75%,-33.342269,151.230069,258.2,26.0,0.0,87.0,166.0,0.016899,0.4,42.871885,2.0
max,-33.31689,151.235131,295.8,30.0,0.0,117.0,205.0,0.076283,1.2,92.749036,9.0


In [16]:
# Total distance travelled for rr_2016
TotalDistance2016 = rr_2016["distance"].sum()
print("Total Distance for rr_2016, (Rounded to the Nearest Whole Number):", "{0:.0f}".format(TotalDistance2016), "km")

Total Distance for rr_2016, (Rounded to the Nearest Whole Number): 49 km


## Examining the 2019 Road Race

In [6]:
rr_2019.head()

Unnamed: 0,latitude,longitude,elevation,temperature,power,cadence,hr,distance,elevation_gain,speed,timedelta
2019-06-22 22:33:45+00:00,-33.416592,151.222853,215.0,0.0,0.0,58.0,71.0,0.0,0.0,0.0,
2019-06-22 22:33:46+00:00,-33.416629,151.222877,215.0,0.0,147.0,58.0,71.0,0.004679,0.0,16.842677,1.0
2019-06-22 22:33:47+00:00,-33.416677,151.222905,214.8,0.0,97.0,60.0,71.0,0.005936,-0.2,21.371074,1.0
2019-06-22 22:33:48+00:00,-33.41673,151.222937,214.8,0.0,74.0,61.0,71.0,0.006599,0.0,23.757913,1.0
2019-06-22 22:33:49+00:00,-33.416783,151.222972,214.8,0.0,136.0,62.0,71.0,0.006729,0.0,24.225566,1.0


In [8]:
# Number of rows and number of columns in rr_2019
rr_2019.shape

(5503, 11)

In [19]:
# Statistical summary of rr_2019
rr_2019.describe()

Unnamed: 0,latitude,longitude,elevation,temperature,power,cadence,hr,distance,elevation_gain,speed,timedelta
count,5503.0,5503.0,5503.0,5503.0,5503.0,5503.0,5503.0,5503.0,5503.0,5503.0,5502.0
mean,-33.371644,151.225232,243.243576,5.997819,213.617845,70.004906,138.998546,0.009411,0.000254,33.879861,1.0
std,0.030592,0.006142,30.197981,0.806414,144.123686,29.869938,16.184123,0.002459,0.347322,8.853503,0.0
min,-33.422174,151.211507,185.2,0.0,0.0,0.0,71.0,0.0,-2.0,0.0,1.0
25%,-33.396939,151.221591,219.6,5.0,104.0,66.0,129.0,0.007839,-0.2,28.219962,1.0
50%,-33.373835,151.227064,236.0,6.0,212.0,81.0,142.0,0.009178,0.0,33.04136,1.0
75%,-33.344994,151.229977,269.6,7.0,308.0,89.0,152.0,0.010733,0.2,38.640026,1.0
max,-33.316865,151.235094,310.4,7.0,785.0,120.0,170.0,0.019547,1.0,70.370469,1.0


In [27]:
# Total distance travelled for rr_2019
TotalDistance2019 = rr_2019["distance"].sum()
print("Total Distance for rr_2019, (Rounded to the Nearest Whole Number):", "{0:.0f}".format(TotalDistance2019), "km")

Total Distance for rr_2019, (Rounded to the Nearest Whole Number): 52 km


In [29]:
# Summary of the per-observation distance (the `include` argument is ignored for a Series)
rr_2019["distance"].describe()

count    5503.000000
mean        0.009411
std         0.002459
min         0.000000
25%         0.007839
50%         0.009178
75%         0.010733
max         0.019547
Name: distance, dtype: float64

## Examining the 2016 Time Trial

In [10]:
tt_2016.head()

Unnamed: 0,latitude,longitude,elevation,temperature,power,cadence,hr,distance,elevation_gain,speed,timedelta
2016-07-02 23:05:30+00:00,-33.415971,151.222016,111.8,12.0,0.0,58.0,108.0,0.0,0.0,0.0,
2016-07-02 23:05:32+00:00,-33.416026,151.222008,111.8,12.0,0.0,58.0,105.0,0.006161,0.0,11.089134,2.0
2016-07-02 23:05:38+00:00,-33.416034,151.222023,111.8,12.0,0.0,58.0,105.0,0.001652,0.0,0.991282,6.0
2016-07-02 23:06:01+00:00,-33.416041,151.222038,111.8,13.0,0.0,58.0,100.0,0.001595,0.0,0.249655,23.0
2016-07-02 23:06:02+00:00,-33.416048,151.222053,111.8,13.0,0.0,65.0,101.0,0.001595,0.0,5.742071,1.0


In [11]:
# Calculating the number of rows and number of columns in tt_2016
tt_2016.shape

(1541, 11)

In [12]:
# Statistical summary of tt_2016
tt_2016.describe()

Unnamed: 0,latitude,longitude,elevation,temperature,power,cadence,hr,distance,elevation_gain,speed,timedelta
count,1541.0,1541.0,1541.0,1541.0,1541.0,1541.0,1541.0,1541.0,1541.0,1541.0,1540.0
mean,-33.368105,151.225411,139.068657,10.953277,0.0,83.277093,170.93965,0.016095,-0.002466,33.529963,1.783766
std,0.028055,0.006166,30.301132,0.657937,0.0,21.169978,23.392548,0.015897,0.515343,11.519681,1.822713
min,-33.418368,151.211206,85.0,10.0,0.0,0.0,100.0,0.0,-7.4,0.0,1.0
25%,-33.393795,151.22187,116.2,11.0,0.0,77.0,157.0,0.007622,-0.4,25.068271,1.0
50%,-33.370613,151.227333,134.0,11.0,0.0,86.0,161.0,0.010974,0.0,32.840076,1.0
75%,-33.343332,151.230072,165.4,11.0,0.0,96.0,180.0,0.01596,0.4,41.470522,2.0
max,-33.316888,151.235137,202.6,13.0,0.0,118.0,251.0,0.288175,2.4,162.505764,31.0


## Examining the 2019 Time Trial

In [13]:
tt_2019.head()

Unnamed: 0,latitude,longitude,elevation,temperature,power,cadence,hr,distance,elevation_gain,speed,timedelta
2019-06-01 22:54:55+00:00,-33.415798,151.22206,219.4,13.0,0.0,0.0,88.0,0.0,0.0,0.0,
2019-06-01 22:54:56+00:00,-33.415782,151.222051,219.4,13.0,0.0,0.0,88.0,0.001965,0.0,7.075656,1.0
2019-06-01 22:54:57+00:00,-33.415767,151.222041,219.4,13.0,0.0,0.0,88.0,0.001909,0.0,6.871582,1.0
2019-06-01 22:54:58+00:00,-33.415751,151.222032,219.4,13.0,0.0,0.0,89.0,0.001965,0.0,7.075656,1.0
2019-06-01 22:54:59+00:00,-33.415735,151.222022,219.4,13.0,0.0,0.0,89.0,0.002007,0.0,7.223997,1.0


In [14]:
# Calculating the number of rows and number of columns in tt_2019
tt_2019.shape

(2655, 11)

In [15]:
# Statistical summary of tt_2019
tt_2019.describe()

Unnamed: 0,latitude,longitude,elevation,temperature,power,cadence,hr,distance,elevation_gain,speed,timedelta
count,2655.0,2655.0,2655.0,2655.0,2655.0,2655.0,2655.0,2655.0,2655.0,2655.0,2654.0
mean,-33.368391,151.225397,250.435104,10.19435,257.566855,89.979661,152.741243,0.009183,0.000377,33.057824,1.0
std,0.028682,0.006234,29.434104,0.833934,80.023555,17.543883,8.217632,0.002715,0.298628,9.773522,0.0
min,-33.415798,151.211507,195.8,9.0,0.0,0.0,88.0,0.0,-1.6,0.0,1.0
25%,-33.39491,151.2214,229.2,10.0,213.5,88.0,150.0,0.007345,-0.2,26.440757,1.0
50%,-33.370118,151.227363,243.8,10.0,264.0,94.0,153.0,0.009228,0.0,33.220108,1.0
75%,-33.343803,151.23004,276.2,10.0,308.0,98.0,158.0,0.010913,0.2,39.286242,1.0
max,-33.316882,151.235098,312.2,13.0,522.0,111.0,166.0,0.017584,0.6,63.300734,1.0


Selecting the **DISTANCE** column for rr_2016 and rr_2019 and calculating the overall distance travelled.

In [7]:
# Total Distance for rr_2016
TotalDistance2016 = rr_2016["distance"].sum()
print("Total Distance for rr_2016, (Rounded to the Nearest Whole Number):", "{0:.0f}".format(TotalDistance2016), "km")

# Total Distance for rr_2019
TotalDistance2019 = rr_2019["distance"].sum()
print("Total Distance for rr_2019, (Rounded to the Nearest Whole Number):", "{0:.0f}".format(TotalDistance2019), "km")

Total Distance for rr_2016, (Rounded to the Nearest Whole Number): 49 km
Total Distance for rr_2019, (Rounded to the Nearest Whole Number): 52 km


Selecting the **SPEED** column for rr_2016 and rr_2019 and calculating the average speed.

In [21]:
# Average Speed for rr_2016
AverageSpeed2016 = rr_2016["speed"].mean()
print("Average Speed for rr_2016, (Rounded to the Nearest Whole Number):", "{0:.0f}".format(AverageSpeed2016), "km/hr")

# Average Speed for rr_2019
AverageSpeed2019 = rr_2019["speed"].mean()
print("Average Speed for rr_2019, (Rounded to the Nearest Whole Number):", "{0:.0f}".format(AverageSpeed2019), "km/hr")

Average Speed for rr_2016, (Rounded to the Nearest Whole Number): 35 km/hr
Average Speed for rr_2019, (Rounded to the Nearest Whole Number): 34 km/hr
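
The mean of the `speed` column weights every observation equally, but the 2016 files are sampled at irregular intervals (`timedelta` ranges from 1 to 9 in `rr_2016`). An alternative worth comparing is total distance divided by total elapsed time. The sketch below assumes `timedelta` is the number of seconds since the previous observation, which is consistent with the `head()` output above; the helper name `overall_average_speed` and the toy rows are hypothetical.

```python
import pandas as pd

def overall_average_speed(ride):
    """Average speed in km/h as total distance over total elapsed time.

    Assumes `distance` is in km and `timedelta` is in seconds; the first
    row's missing timedelta is skipped by sum().
    """
    total_km = ride["distance"].sum()
    total_hours = ride["timedelta"].sum() / 3600.0
    return total_km / total_hours

# Toy rows shaped like the head() output, with irregular sampling
ride = pd.DataFrame({
    "distance": [0.0, 0.01, 0.02, 0.03],
    "timedelta": [float("nan"), 1.0, 4.0, 3.0],
})
ride["speed"] = ride["distance"] / ride["timedelta"] * 3600

time_weighted = overall_average_speed(ride)
sample_mean = ride["speed"].mean()
print(time_weighted, sample_mean)
```

In this toy example the two figures differ (about 27 km/h against about 30 km/h) because the slow 4-second interval counts as a single sample in the plain mean. For the 2019 files, where every `timedelta` is 1, the two approaches should agree closely.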


In [16]:
tt_2016.corr()

Unnamed: 0,latitude,longitude,elevation,temperature,power,cadence,hr,distance,elevation_gain,speed,timedelta
latitude,1.0,0.890596,0.893507,-0.590807,,0.064267,-0.251437,-0.044196,0.015368,0.020488,-0.035231
longitude,0.890596,1.0,0.680784,-0.578757,,-0.013272,-0.265901,-0.021201,-0.00602,0.0084,-0.003513
elevation,0.893507,0.680784,1.0,-0.423521,,0.155272,-0.345915,0.03324,0.013113,-0.002949,0.043992
temperature,-0.590807,-0.578757,-0.423521,1.0,,0.053191,0.200181,-0.006791,0.137933,-0.154563,0.049447
power,,,,,,,,,,,
cadence,0.064267,-0.013272,0.155272,0.053191,,1.0,-0.088311,0.103936,0.029906,0.094116,0.045653
hr,-0.251437,-0.265901,-0.345915,0.200181,,-0.088311,1.0,-0.031376,-0.253651,0.399057,-0.156528
distance,-0.044196,-0.021201,0.03324,-0.006791,,0.103936,-0.031376,1.0,-0.276446,0.2073,0.889481
elevation_gain,0.015368,-0.00602,0.013113,0.137933,,0.029906,-0.253651,-0.276446,1.0,-0.695794,-0.103052
speed,0.020488,0.0084,-0.002949,-0.154563,,0.094116,0.399057,0.2073,-0.695794,1.0,-0.089224


In [14]:
tt_2019.corr()

Unnamed: 0,latitude,longitude,elevation,temperature,power,cadence,hr,distance,elevation_gain,speed,timedelta
latitude,1.0,0.895549,0.909724,-0.363884,-0.069783,0.064495,0.223411,-0.014569,0.001221,-0.014569,
longitude,0.895549,1.0,0.717345,-0.361646,-0.065396,-0.015984,0.072519,-0.028295,-0.005216,-0.028295,
elevation,0.909724,0.717345,1.0,-0.180429,-0.088002,0.133378,0.28919,-0.04496,0.003764,-0.04496,
temperature,-0.363884,-0.361646,-0.180429,1.0,0.079099,-0.048971,-0.375004,-0.14624,0.112273,-0.14624,
power,-0.069783,-0.065396,-0.088002,0.079099,1.0,0.500652,0.058352,-0.34544,0.660651,-0.34544,
cadence,0.064495,-0.015984,0.133378,-0.048971,0.500652,1.0,0.238539,0.092067,0.21747,0.092067,
hr,0.223411,0.072519,0.28919,-0.375004,0.058352,0.238539,1.0,-0.177191,0.156098,-0.177191,
distance,-0.014569,-0.028295,-0.04496,-0.14624,-0.34544,0.092067,-0.177191,1.0,-0.772041,1.0,
elevation_gain,0.001221,-0.005216,0.003764,0.112273,0.660651,0.21747,0.156098,-0.772041,1.0,-0.772041,
speed,-0.014569,-0.028295,-0.04496,-0.14624,-0.34544,0.092067,-0.177191,1.0,-0.772041,1.0,


Selecting the **DISTANCE** column for tt_2016 and tt_2019 and calculating the overall distance travelled.

In [91]:
# Total Distance for tt_2016
TotalDistance2016 = tt_2016["distance"].sum()
print("Total Distance for tt_2016, (Rounded to the Nearest Whole Number):", "{0:.0f}".format(TotalDistance2016), "km")

# Total Distance for tt_2019
TotalDistance2019 = tt_2019["distance"].sum()
print("Total Distance for tt_2019, (Rounded to the Nearest Whole Number):", "{0:.0f}".format(TotalDistance2019), "km")

Total Distance for tt_2016, (Rounded to the Nearest Whole Number): 25 km
Total Distance for tt_2019, (Rounded to the Nearest Whole Number): 24 km


Selecting the **SPEED** column for tt_2016 and tt_2019 and calculating the average speed.

In [95]:
# Average Speed for tt_2016
AverageSpeed2016 = tt_2016["speed"].mean()
print("Average Speed for tt_2016, (Rounded to the Nearest Whole Number):", "{0:.0f}".format(AverageSpeed2016), "km/hr")

# Average Speed for tt_2019
AverageSpeed2019 = tt_2019["speed"].mean()
print("Average Speed for tt_2019, (Rounded to the Nearest Whole Number):", "{0:.0f}".format(AverageSpeed2019), "km/hr")

Average Speed for tt_2016, (Rounded to the Nearest Whole Number): 34 km/hr
Average Speed for tt_2019, (Rounded to the Nearest Whole Number): 33 km/hr
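
Averages alone do not show where on the course one time trial was faster than the other. One possible approach is to express each ride as speed against cumulative distance and interpolate both onto a shared distance grid. The sketch below is an assumption-laden illustration: `speed_by_distance` is a hypothetical helper, the toy frames stand in for `tt_2016` and `tt_2019`, and `np.interp` expects the cumulative distance to be increasing, so rows with zero distance may need to be dropped first on real data.

```python
import numpy as np
import pandas as pd

def speed_by_distance(ride, grid_km):
    """Interpolate a ride's speed onto a common distance grid (km)."""
    travelled = ride["distance"].cumsum().to_numpy()
    return np.interp(grid_km, travelled, ride["speed"].to_numpy())

# Toy rides standing in for the two time trials
ride_a = pd.DataFrame({"distance": [0.0, 0.5, 0.5, 0.5, 0.5],
                       "speed": [30.0, 32.0, 28.0, 40.0, 35.0]})
ride_b = pd.DataFrame({"distance": [0.0, 0.4, 0.6, 0.4, 0.6],
                       "speed": [31.0, 30.0, 33.0, 38.0, 36.0]})

grid_km = np.linspace(0.0, 2.0, 5)  # every 0.5 km along the course
difference = speed_by_distance(ride_b, grid_km) - speed_by_distance(ride_a, grid_km)

# Positive values mark stretches where ride_b was faster
print(pd.Series(difference, index=grid_km, name="b_minus_a_kmh"))
```

Plotting the difference against distance (ideally after smoothing) would then show the sections of the course where each ride had the advantage.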


## Challenge: Gear Usage

A modern race bike has up to 22 different gears with two chainrings on the front (attached to the pedals) and 10 or 11 at the back (attached to the wheel).   The ratio of the number of teeth on the front and rear cogs determines the distance travelled with one revolution of the pedals (often called __development__, measured in metres).  Low development is good for climbing hills while high development is for going fast downhill or in the final sprint. 

We have a measure of the number of rotations of the pedals per minute (__cadence__) and a measure of __speed__.  Using these two variables we should be able to derive a measure of __development__, which would effectively tell us which gear the rider was using at the time.   Development will normally range between __2m__ and __10m__.  Due to errors in GPS and cadence measurements you will see many points outside this range, and you should just discard them as outliers.

Write code to calculate __development__ in _metres_ for each row in a ride.  Plot the result in a _histogram_ and compare the plots for the four rides.   Comment on what you observe in the histograms.
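
A minimal sketch of the calculation, under stated assumptions: with `speed` in km/h and `cadence` in revolutions per minute, the distance covered per minute is `speed * 1000 / 60` metres, so development is that figure divided by cadence. The helper name `development_m` and the toy rows are hypothetical; rows with zero cadence (coasting) are treated as missing rather than dividing by zero.

```python
import numpy as np
import pandas as pd

def development_m(ride, low=2.0, high=10.0):
    """Metres travelled per pedal revolution, with outliers dropped.

    development = (speed * 1000 / 60) / cadence, keeping only values
    inside the expected 2 m to 10 m range.
    """
    cadence = ride["cadence"].replace(0, np.nan)  # coasting: no valid gear estimate
    development = (ride["speed"] * 1000.0 / 60.0) / cadence
    return development[(development >= low) & (development <= high)]

# Toy rows standing in for a ride loaded with parse_gpx
ride = pd.DataFrame({
    "speed":   [36.0, 33.0, 45.0, 60.0, 5.0],
    "cadence": [90.0, 100.0, 0.0, 80.0, 90.0],
})
dev = development_m(ride)

# Histogram counts in 1 m bins between 2 m and 10 m
counts, bin_edges = np.histogram(dev, bins=8, range=(2.0, 10.0))
print(dev.round(2).tolist())
print(counts)
```

In the notebook, `dev.plot.hist(bins=40)` or `plt.hist(dev, bins=40)` would draw the histogram for one ride; repeating this for the four DataFrames on shared axes makes the comparison easier. Because a bike has a discrete set of gears, peaks in the histogram may correspond to individual gear combinations, though GPS noise will blur them.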



