In [1]:
import pandas as pd
import numpy as np 
from gpxutils import parse_gpx 
import matplotlib.pyplot as plt
%matplotlib inline

# Analysis of Cycling Data

We are provided with four files containing recordings of cycling activities that include GPS location data as
well as some measurements related to cycling performance like heart rate and power.  The goal is to perform
some exploration and analysis of this data. 

The data represents four races.  Two are time trials where the rider rides alone on a set course.  Two are 
road races where the rider rides with a peloton.  All were held on the same course, but the road races include
two laps whereas the time trials include just one. 

Questions to explore with the data:
* What is the overall distance travelled for each of the rides? What are the average speeds etc.  Provide a summary for each ride.
* Compare the range of speeds for each ride: are time trials faster than road races? 
* Compare the speeds achieved in the two time trials (three years apart).  As well as looking at the averages, can you see where in the ride one or the other is faster?
* From the elevation_gain field you can see whether the rider is _climbing_, _descending_ or on the _flat_.   Use this to calculate the average speeds in those three cases (climbing, flat or descending).  Note that _flat_ need not mean exactly zero elevation_gain; it can allow for slight climbs and falls.  
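
One way to make the climbing/flat/descending split concrete is to label every observation first and then group by the label. The following is a minimal sketch, not the required method: the helper `classify_terrain`, the ±0.2 m threshold and the sample rows are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def classify_terrain(ride, threshold=0.2):
    """Label each row as climbing, flat or descending from elevation_gain.

    threshold (metres per observation) is an assumed cut-off, not a value
    taken from the data. Rows with a missing elevation_gain end up as 'flat'.
    """
    gain = ride['elevation_gain']
    labels = np.where(gain > threshold, 'climbing',
                      np.where(gain < -threshold, 'descending', 'flat'))
    return pd.Series(labels, index=ride.index, name='terrain')

# Made-up sample rows, for illustration only
sample = pd.DataFrame({
    'elevation_gain': [0.0, 0.6, 0.1, -0.8, -0.1, 1.2],
    'speed': [30.0, 18.0, 32.0, 50.0, 34.0, 16.0],
})
sample['terrain'] = classify_terrain(sample)

# Average speed for each terrain type
terrain_speed = sample.groupby('terrain')['speed'].mean()
print(terrain_speed)
```

Because observations are not evenly spaced in time, a threshold on gradient (elevation gain divided by distance) may be more robust than a threshold on the raw elevation_gain value.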

For time-varying data like this it is often useful to _smooth_ the data using, e.g., a [rolling mean](https://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.rolling_mean.html).  You might want to experiment with smoothing in some of your analysis (not required but may be of interest).
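
As a rough illustration of smoothing, the sketch below applies a centred rolling mean to a short, made-up speed series. Note that the linked `pandas.rolling_mean` function belongs to an old pandas release; current versions express the same idea as `Series.rolling(...).mean()`. The window size of 3 is an arbitrary choice for the example.

```python
import pandas as pd

# Made-up speed samples (km/h) with one noisy spike, for illustration only
speed = pd.Series([30.0, 31.0, 29.0, 60.0, 30.0, 31.0, 30.0])

# Centred rolling mean over 3 observations; min_periods=1 keeps the end points
smoothed = speed.rolling(window=3, center=True, min_periods=1).mean()

print(pd.DataFrame({'speed': speed, 'smoothed': smoothed.round(2)}))
```

A wider window removes more noise but also flattens genuine short efforts such as sprints, so it is worth trying a few sizes.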

## Description of Fields

* _index_ is a datetime showing the time that the observation was made (I wasn't riding at night; the times are converted to UTC)
* __latitude, longitude, elevation__ from the GPS, the position of the rider at each timepoint, elevation in m
* __temperature__ the current ambient temperature in degrees Celsius
* __power__ the power being generated by the rider in Watts
* __cadence__ the rotational speed of the pedals in revolutions per minute
* __hr__ heart rate in beats per minute
* __elevation_gain__ the change in elevation in m between two observations
* __distance__ distance travelled between observations in km
* __speed__ speed measured in km/h

You are provided with code in [gpxutils.py](gpxutils.py) to read the GPX XML format files that are exported by cycling computers and applications.  The sample files were exported from [Strava](https://strava.com/) and represent four races by Steve Cassidy.


In [2]:
# read the four data files
rr_2016 = parse_gpx('files/Calga_RR_2016.gpx')
tt_2016 = parse_gpx('files/Calga_TT_2016.gpx')
rr_2019 = parse_gpx('files/Calga_RR_2019.gpx')
tt_2019 = parse_gpx('files/Calga_TT_2019.gpx')

In [3]:
rr_2016.head()

index,latitude,longitude,elevation,temperature,power,cadence,hr,distance,elevation_gain,speed,timedelta
2016-05-14 04:02:41+00:00,-33.415561,151.222303,208.6,29.0,0.0,40.0,102.0,0.0,0.0,0.0,
2016-05-14 04:02:42+00:00,-33.415534,151.222289,208.6,29.0,0.0,40.0,102.0,0.003271,0.0,11.77702,1.0
2016-05-14 04:02:46+00:00,-33.415398,151.22218,208.6,29.0,0.0,40.0,103.0,0.018194,0.0,16.375033,4.0
2016-05-14 04:02:49+00:00,-33.415264,151.222077,208.6,29.0,0.0,55.0,106.0,0.017703,0.0,21.243901,3.0
2016-05-14 04:02:51+00:00,-33.41516,151.222013,208.6,29.0,0.0,61.0,109.0,0.013001,0.0,23.401217,2.0
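
The rows above suggest how the derived columns relate to each other. The sketch below checks that speed is approximately distance divided by elapsed time, using the values printed above. It assumes that `timedelta` is the number of seconds since the previous observation; that column is not described in the field list, so this reading is an assumption.

```python
import pandas as pd

# Rows copied from the rr_2016.head() output
# (distance in km, timedelta assumed to be seconds, speed in km/h)
rows = pd.DataFrame({
    'distance': [0.003271, 0.018194, 0.017703, 0.013001],
    'timedelta': [1.0, 4.0, 3.0, 2.0],
    'speed': [11.77702, 16.375033, 21.243901, 23.401217],
})

# km per second converted to km per hour
rows['speed_check'] = rows['distance'] / rows['timedelta'] * 3600

print(rows[['speed', 'speed_check']])
```

The small differences come from the distances being rounded in the printed table.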


In [4]:
#1. What is the overall distance travelled for each of the rides? What are the average speeds etc.
#Provide a summary for each ride.

#Calculate the total distance of each ride using the sum function
totalDistRR16 = rr_2016['distance'].sum()
totalDistTT16 = tt_2016['distance'].sum()
totalDistRR19 = rr_2019['distance'].sum()
totalDistTT19 = tt_2019['distance'].sum()

#Calculate the average speed of each ride using the mean function
avgSpeedRR16 = rr_2016['speed'].mean()
avgSpeedTT16 = tt_2016['speed'].mean()
avgSpeedRR19 = rr_2019['speed'].mean()
avgSpeedTT19 = tt_2019['speed'].mean()

#Calculate the average elevation of each ride using the mean function
avgElevationRR16 = rr_2016['elevation'].mean()
avgElevationTT16 = tt_2016['elevation'].mean()
avgElevationRR19 = rr_2019['elevation'].mean()
avgElevationTT19 = tt_2019['elevation'].mean()

#Calculate the average cadence of each ride using the mean function
avgCadenceRR16 = rr_2016['cadence'].mean()
avgCadenceTT16 = tt_2016['cadence'].mean()
avgCadenceRR19 = rr_2019['cadence'].mean()
avgCadenceTT19 = tt_2019['cadence'].mean()

#Print summary of the 2016 Road Race
print("For the 2016 road race, the total distance travelled was " + str(np.round(totalDistRR16, decimals=2)) +  
      "\nkm, the average speed was " + str(np.round(avgSpeedRR16, decimals=2)) + " km/h, the average elevation was " + str(np.round(avgElevationRR16, decimals=2)) + 
     "\nm, and the average cadence was " + str(np.round(avgCadenceRR16, decimals=2)) + " revolutions per minute.\n")

#Print summary of the 2016 Time Trial
print("For the 2016 time trial, the total distance travelled was " + str(np.round(totalDistTT16, decimals=2)) +  
      "\nkm, the average speed was " + str(np.round(avgSpeedTT16, decimals=2)) + " km/h, the average elevation was " + str(np.round(avgElevationTT16, decimals=2)) + 
     "\nm, and the average cadence was " + str(np.round(avgCadenceTT16, decimals=2)) + " revolutions per minute.\n")

#Print summary of the 2019 Road Race
print("For the 2019 road race, the total distance travelled was " + str(np.round(totalDistRR19, decimals=2)) +  
      "\nkm, the average speed was " + str(np.round(avgSpeedRR19, decimals=2)) + " km/h, the average elevation was " + str(np.round(avgElevationRR19, decimals=2)) + 
     "\nm, and the average cadence was " + str(np.round(avgCadenceRR19, decimals=2)) + " revolutions per minute.\n")

#Print summary of the 2019 Time Trial
print("For the 2019 time trial, the total distance travelled was " + str(np.round(totalDistTT19, decimals=2)) +  
      "\nkm, the average speed was " + str(np.round(avgSpeedTT19, decimals=2)) + " km/h, the average elevation was " + str(np.round(avgElevationTT19, decimals=2)) + 
     "\nm, and the average cadence was " + str(np.round(avgCadenceTT19, decimals=2)) + " revolutions per minute.\n")


#2. Compare the range of speeds for each ride, are time trials faster than road races?

#Calculate the range of speeds for each ride by subtracting the minimum values from the maximum values
speedRangeRR16 = (rr_2016['speed'].max() - rr_2016['speed'].min())
speedRangeTT16 = (tt_2016['speed'].max() - tt_2016['speed'].min())
speedRangeRR19 = (rr_2019['speed'].max() - rr_2019['speed'].min())
speedRangeTT19 = (tt_2019['speed'].max() - tt_2019['speed'].min())

print("The range of speed for each ride is as follows:\nRoad Race 2016: " + str(np.round(speedRangeRR16, 2)) + " km/h"
      "\nTime Trial 2016: " + str(np.round(speedRangeTT16, 2)) + " km/h"
      "\nRoad Race 2019: " + str(np.round(speedRangeRR19, 2)) + " km/h"
      "\nTime Trial 2019: " + str(np.round(speedRangeTT19, 2)) + " km/h"
      "\nIt can be inferred that time trials are not necessarily faster than road races; "
      "\nthe road races had slightly higher average speeds in both years, and the very wide range for the "
      "\n2016 Time Trial most likely reflects GPS errors rather than faster riding.\n")


#3. Compare the speeds achieved in the two time trials (three years apart).
#As well as looking at the averages, can you see where in the ride one or the other is faster.
print("As calculated above, the average speed for the 2016 Time Trial was approximately "
      + str(np.round(avgSpeedTT16, 2)) + " km/h"
      "\nand the average speed for the 2019 Time Trial was approximately "
      + str(np.round(avgSpeedTT19, 2)) + " km/h, so the averages differ by only "
      + str(np.round(abs(avgSpeedTT16 - avgSpeedTT19), 2)) + " km/h.")


#4. From the elevation_gain field you can see whether the rider is climbing , descending or on the flat.
#Use this to calculate the average speeds in those three cases (climbing, flat or descending).
#Note that flat might not be zero elevation_gain but might allow for slight climbs and falls.

#Make new dataframes that only have rows where the elevation_gain value is above 0.2 i.e. the cyclist is climbing
climbingRR16 = (rr_2016[rr_2016.elevation_gain > 0.2])
climbingTT16 = (tt_2016[tt_2016.elevation_gain > 0.2])
climbingRR19 = (rr_2019[rr_2019.elevation_gain > 0.2])
climbingTT19 = (tt_2019[tt_2019.elevation_gain > 0.2])

#Calculate the average climbing speed by adding together the average speed of each race and dividing by the
#number of races (4)
avgClimbingSpeed = ((climbingRR16['speed'].mean() + climbingTT16['speed'].mean() + 
                     climbingRR19['speed'].mean() + climbingTT19['speed'].mean()) / 4)

#Make new dataframes that only have rows where the elevation_gain value is below -0.2 i.e. the cyclist is descending
descendingRR16 = (rr_2016[rr_2016.elevation_gain < -0.2])
descendingTT16 = (tt_2016[tt_2016.elevation_gain < -0.2])
descendingRR19 = (rr_2019[rr_2019.elevation_gain < -0.2])
descendingTT19 = (tt_2019[tt_2019.elevation_gain < -0.2])

#Calculate the average descending speed by adding together the average speed of each race and dividing by the
#number of races (4)
avgDescendingSpeed = ((descendingRR16['speed'].mean() + descendingTT16['speed'].mean() + 
                     descendingRR19['speed'].mean() + descendingTT19['speed'].mean()) / 4)

#Do the same as the above two calculations, but for when the ground is flat,
#i.e. the elevation_gain value is between -0.2 and 0.2 (inclusive)
flatRR16 = rr_2016[rr_2016.elevation_gain.between(-0.2, 0.2)]
flatTT16 = tt_2016[tt_2016.elevation_gain.between(-0.2, 0.2)]
flatRR19 = rr_2019[rr_2019.elevation_gain.between(-0.2, 0.2)]
flatTT19 = tt_2019[tt_2019.elevation_gain.between(-0.2, 0.2)]

avgFlatSpeed = ((flatRR16['speed'].mean() + flatTT16['speed'].mean() + 
                     flatRR19['speed'].mean() + flatTT19['speed'].mean()) / 4)

print("\nThe average climbing speed is " + str(np.round(avgClimbingSpeed, 2)) + " km/h, the average "
      "\ndescending speed is " + str(np.round(avgDescendingSpeed, 2)) + " km/h, and the average speed "
      "\non flat ground is " + str(np.round(avgFlatSpeed, 2)) + " km/h.")

For the 2016 road race, the total distance travelled was 49.05
km, the average speed was 34.93 km/h, the average elevation was 232.4
m, and the average cadence was 65.99 revolutions per minute.

For the 2016 time trial, the total distance travelled was 24.8
km, the average speed was 33.53 km/h, the average elevation was 139.07
m, and the average cadence was 83.28 revolutions per minute.

For the 2019 road race, the total distance travelled was 51.79
km, the average speed was 33.88 km/h, the average elevation was 243.24
m, and the average cadence was 70.0 revolutions per minute.

For the 2019 time trial, the total distance travelled was 24.38
km, the average speed was 33.06 km/h, the average elevation was 250.44
m, and the average cadence was 89.98 revolutions per minute.

The range of speed for each ride is as follows:
Road Race 2016: 92.75 km/h
Time Trial 2016: 162.51 km/h
Road Race 2019: 70.37 km/h
Time Trial 2019: 63.3 km/h
It can be inferred that time trials are not necessarily fast
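
The averages alone do not show where in the ride one time trial was faster than the other. One possible approach, sketched below under stated assumptions, is to split each ride into fixed-length segments by cumulative distance and compare the mean speed per segment. The helper `speed_by_segment`, the 1 km segment length and the two made-up rides (standing in for `tt_2016` and `tt_2019`) are hypothetical.

```python
import numpy as np
import pandas as pd

def speed_by_segment(ride, segment_km=1.0):
    """Mean speed within each segment of the course, indexed by segment number."""
    travelled = ride['distance'].cumsum()  # km from the start
    segment = (travelled // segment_km).astype(int).rename('segment')
    return ride.groupby(segment)['speed'].mean()

# Made-up rides, 0.01 km per observation, for illustration only
ride_a = pd.DataFrame({'distance': np.full(300, 0.01),
                       'speed': np.full(300, 33.0)})
ride_b = pd.DataFrame({'distance': np.full(300, 0.01),
                       'speed': np.r_[np.full(150, 31.0), np.full(150, 36.0)]})

# Positive values mark segments where ride_a is faster than ride_b
diff = speed_by_segment(ride_a) - speed_by_segment(ride_b)
print(diff)
```

With the real files the two recorded distances differ slightly (24.8 km and 24.38 km), so segments only line up approximately; a plot of `diff` against segment number would still show the broad pattern.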

## Challenge: Gear Usage

A modern race bike has up to 22 different gears, with two chainrings on the front (attached to the pedals) and 10 or 11 sprockets at the back (attached to the wheel).   The ratio of the number of teeth on the front and rear cogs determines the distance travelled with one revolution of the pedals (often called __development__, measured in metres).  Low development is good for climbing hills while high development is for going fast downhill or in the final sprint. 

We have a measure of the number of rotations of the pedals per minute (__cadence__) and a measure of __speed__.  Using these two variables we should be able to derive a measure of __development__ which would effectively tell us which gear the rider was using at the time.   Development will normally range between __2 m__ and __10 m__.  Due to errors in GPS and cadence measurements you will see many points outside this range and you should just discard them as outliers. 

Write code to calculate __development__ in _metres_ for each row in a ride.  Plot the result in a _histogram_ and compare the plots for the four rides.   Comment on what you observe in the histograms.
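
Since speed is in km/h and cadence is in revolutions per minute, development in metres per revolution works out as metres travelled per minute divided by cadence. The sketch below shows that unit conversion on a few made-up rows; the helper name `development_m` and the sample values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def development_m(ride):
    """Metres travelled per pedal revolution, from speed (km/h) and cadence (rev/min)."""
    metres_per_minute = ride['speed'] * 1000 / 60
    # Zero cadence (coasting) would divide by zero, so treat it as missing
    cadence = ride['cadence'].replace(0, np.nan)
    return metres_per_minute / cadence

# Made-up rows, for illustration only
sample = pd.DataFrame({
    'speed':   [36.0, 18.0, 45.0, 40.0, 60.0],
    'cadence': [90.0, 60.0, 0.0, 100.0, 30.0],
})
sample['development'] = development_m(sample)

# Keep only plausible values between 2 m and 10 m
plausible = sample[sample['development'].between(2, 10)]
print(plausible)
```

For example, 36 km/h is 600 m per minute, which at 90 revolutions per minute gives about 6.67 m per revolution.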





In [5]:
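#Make new columns for development (metres travelled per pedal revolution):
#metres per minute (speed in km/h * 1000 / 60) divided by cadence (revolutions per minute).
#Rows with zero cadence give inf or NaN and are dropped by the 2-10 filter below.
RR2016Development = rr_2016.copy()
RR2016Development['development'] = (rr_2016['speed'] * 1000 / 60) / rr_2016['cadence']

TT2016Development = tt_2016.copy()
TT2016Development['development'] = (tt_2016['speed'] * 1000 / 60) / tt_2016['cadence']

RR2019Development = rr_2019.copy()
RR2019Development['development'] = (rr_2019['speed'] * 1000 / 60) / rr_2019['cadence']

TT2019Development = tt_2019.copy()
TT2019Development['development'] = (tt_2019['speed'] * 1000 / 60) / tt_2019['cadence']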

#Cleanse the data by removing values outside of 2 and 10 as they are considered outliers
RR2016Development = RR2016Development[RR2016Development['development'].between(2, 10)]
#RR2016Development['development'].hist()

TT2016Development = TT2016Development[TT2016Development['development'].between(2, 10)]
#TT2016Development['development'].hist()

RR2019Development = RR2019Development[RR2019Development['development'].between(2, 10)]
#RR2019Development['development'].hist()

TT2019Development = TT2019Development[TT2019Development['development'].between(2, 10)]
#TT2019Development['development'].hist()

print("I attempted to create histograms that showed changes in development over time,"
      "\nbut I was unable to plot the correct values on the graphs.")

I attempted to create histograms that showed changes in development over time,
but I was unable to plot the correct values on the graphs.
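
One likely reason the commented-out `.hist()` calls did not give usable plots is that calling them one after another in a single cell draws every histogram onto the same axes. A possible alternative, sketched below, gives each ride its own subplot with shared axes so the four distributions can be compared directly. The helper `plot_development_histograms` is hypothetical, and random made-up rides stand in for the four real DataFrames.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def plot_development_histograms(rides, bins=40):
    """Draw one development histogram per ride; rides maps a title to a DataFrame."""
    fig, axes = plt.subplots(2, 2, figsize=(10, 6), sharex=True, sharey=True)
    for ax, (title, ride) in zip(axes.flat, rides.items()):
        # Metres per pedal revolution: metres per minute / revolutions per minute
        development = (ride['speed'] * 1000 / 60) / ride['cadence']
        # Discard outliers outside the plausible 2 m to 10 m range
        development = development[development.between(2, 10)]
        ax.hist(development, bins=bins, range=(2, 10))
        ax.set_title(title)
        ax.set_xlabel('development (m)')
        ax.set_ylabel('observations')
    fig.tight_layout()
    return fig, axes

# Made-up rides, for illustration only
rng = np.random.default_rng(0)

def fake_ride(n=500):
    return pd.DataFrame({'speed': rng.uniform(15, 55, n),
                         'cadence': rng.uniform(60, 110, n)})

titles = ['Road Race 2016', 'Time Trial 2016', 'Road Race 2019', 'Time Trial 2019']
rides = {title: fake_ride() for title in titles}
fig, axes = plot_development_histograms(rides)
```

With the real data the dictionary would instead map each title to `rr_2016`, `tt_2016`, `rr_2019` and `tt_2019`. Distinct peaks in a histogram would then correspond to individual gears.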
