# Analysis of Cycling Data

We are provided with four files containing recordings of cycling activities. They include GPS location data as
well as some measurements related to cycling performance, such as heart rate and power.  The goal is to
explore and analyse this data.

The data represents four races.  Two are time trials where the rider rides alone on a set course.  Two are 
road races where the rider rides with a peloton.  All were held on the same course, but the road races include
two laps whereas the time trials include just one. 

Questions to explore with the data:
* What is the overall distance travelled for each of the rides? What are the average speeds etc.  Provide a summary for each ride.
* Compare the range of speeds for each ride. Are time trials faster than road races?
* Compare the speeds achieved in the two time trials (three years apart).  As well as looking at the averages, can you see where in the ride one or the other is faster?
* From the elevation_gain field you can see whether the rider is _climbing_ , _descending_ or on the _flat_.   Use this to calculate the average speeds in those three cases (climbing, flat or descending).  Note that _flat_ might not be zero elevation_gain but might allow for slight climbs and falls.  

For time-varying data like this it is often useful to _smooth_ the data using, e.g., a [rolling mean](https://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.rolling_mean.html).  You might want to experiment with smoothing in some of your analysis (not required, but it may be of interest).
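As a rough illustration of smoothing, here is a minimal sketch on a made-up speed trace (the values, timestamps and the window of 10 samples are assumptions, not taken from the ride files). It uses the `Series.rolling(...).mean()` form found in current pandas releases, rather than the older `pandas.rolling_mean` function the link above describes.

```python
import numpy as np
import pandas as pd

# Hypothetical 1-second speed trace (km/h) with random noise; not real ride data.
rng = np.random.default_rng(0)
index = pd.date_range("2019-06-01 04:00:00", periods=60, freq=pd.Timedelta(seconds=1))
speed = pd.Series(35 + rng.normal(0, 3, size=60), index=index, name="speed")

# Centred rolling mean over 10 samples; min_periods=1 avoids NaN at the edges.
smoothed = speed.rolling(window=10, center=True, min_periods=1).mean()

# The smoothed series varies less than the raw one.
print(speed.std(), smoothed.std())
```

A larger window gives a smoother curve but blurs short efforts such as sprints, so the window size is worth experimenting with.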

## Description of Fields

* _index_ is a datetime showing the time that the observation was made (I wasn't riding at night; the times are converted to UTC)
* __latitude, longitude, elevation__ from the GPS, the position of the rider at each timepoint, elevation in m
* __temperature__ the current ambient temperature in degrees Celsius
* __power__ the power being generated by the rider in Watts
* __cadence__ the rotational speed of the pedals in revolutions per minute
* __hr__ heart rate in beats per minute
* __elevation_gain__ the change in elevation in m between two observations
* __distance__ distance travelled between observations in km
* __speed__ speed measured in km/h

You are provided with code in [gpxutils.py](gpxutils.py) to read the GPX XML format files that are exported by cycling computers and applications.  The sample files were exported from [Strava](https://strava.com/) and represent four races by Steve Cassidy.


In [1]:
import gpxpy
import gpxpy.gpx
from gpxutils import parse_gpx 

import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
# read the four data files
rr_16 = parse_gpx('files/Calga_RR_2016.gpx')
tt_16 = parse_gpx('files/Calga_TT_2016.gpx')
rr_19 = parse_gpx('files/Calga_RR_2019.gpx')
tt_19 = parse_gpx('files/Calga_TT_2019.gpx')

In [3]:
rr_16.describe()

| |latitude|longitude|elevation|temperature|power|cadence|hr|distance|elevation_gain|speed|timedelta|
|---|---|---|---|---|---|---|---|---|---|---|---|
|count|2822.0|2822.0|2822.0|2822.0|2822.0|2822.0|2822.0|2822.0|2822.0|2822.0|2821.0|
|mean|-33.368017|151.225527|232.404465|25.280652|0.0|65.987952|158.394401|0.017381|-0.003756|34.933085|1.843318|
|std|0.028329|0.006014|29.725934|1.348746|0.0|34.425881|11.304588|0.015695|0.458872|10.738677|1.692364|
|min|-33.416753|151.211496|176.0|24.0|0.0|0.0|102.0|0.0|-1.6|0.0|1.0|
|25%|-33.393691|151.221912|209.45|24.0|0.0|68.0|151.0|0.007894|-0.4|26.656312|1.0|
|50%|-33.37182|151.227236|226.1|25.0|0.0|79.0|158.0|0.011794|0.0|33.307339|1.0|
|75%|-33.342269|151.230069|258.2|26.0|0.0|87.0|166.0|0.016899|0.4|42.871885|2.0|
|max|-33.31689|151.235131|295.8|30.0|0.0|117.0|205.0|0.076283|1.2|92.749036|9.0|


In [4]:
# checking the shapes to confirm the datasets are large (too large to scroll through,
# so I will have to inspect them with code if I face any issues)

print(rr_16.shape, tt_16.shape, rr_19.shape, tt_19.shape)

(2822, 11) (1541, 11) (5503, 11) (2655, 11)


### What is the overall distance travelled for each of the rides? What are the average speeds etc.  Provide a summary for each ride.

In [5]:
rr_16['distance'].sum()

49.04858574628638

In [6]:
# 'distance' holds the distance travelled (in Km) between consecutive observations,
# so summing the column gives the total distance covered during the ride.
print(f"Total distance travelled for the 2016 road racer is {rr_16['distance'].sum():.4} Km")

Total distance travelled for the 2016 road racer is 49.05 Km


We can confirm the above by noting the timestamps at the start and end of the race and multiplying the elapsed time by the mean speed of the rider, using the simple formula:

$$\text{total distance} = \text{speed}\times\text{time}$$
$$\text{km} = \text{km/hr}\times\text{hr}$$

In [7]:
rr_16['speed'].mean()

34.93308475482947

In [8]:
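# elapsed time in seconds: finish time (05:29:21) minus start time (04:02:41).
# The brackets matter: without them only the hours of the start time are subtracted.
time = (5*60*60 + 29*60 + 21) - (4*60*60 + 2*60 + 41)

speed = rr_16['speed'].mean() / (60*60)  # converting the mean speed to Km/s

print(f"{time * speed:.2f}")

50.46

The result (50.46 Km) is close to the 49.05 Km obtained by summing the distances. The remaining difference arises because the observations are not evenly spaced in time (the `timedelta` column ranges from 1 to 9 seconds), so the plain mean of the `speed` column is not the true time-weighted average speed.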
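To see why a plain mean of the speed samples need not reproduce the summed distance, here is a small sketch on invented samples (the speeds and intervals below are assumptions chosen for illustration). When the sampling interval varies, only the time-weighted mean multiplied by the elapsed time matches the distance actually covered.

```python
# Hypothetical samples: (speed in km/h, seconds that speed was held).
samples = [(20.0, 1), (40.0, 1), (60.0, 1), (30.0, 9)]

total_seconds = sum(dt for _, dt in samples)
total_time_h = total_seconds / 3600

# Distance actually covered: each speed applies for its own interval.
distance_km = sum(v * dt / 3600 for v, dt in samples)

# Plain mean treats every sample equally; weighted mean accounts for duration.
unweighted_mean = sum(v for v, _ in samples) / len(samples)
weighted_mean = sum(v * dt for v, dt in samples) / total_seconds

print(unweighted_mean * total_time_h)  # overestimates in this example
print(weighted_mean * total_time_h)    # agrees with distance_km
print(distance_km)
```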

### **unsure of what exactly is required in this question**

The summary for each ride will cover the following information:
<ul>
    <li> the heart rate mean and max (no min because min will be before the race commences) </li>
    <li> temperature mean and max (no min because min will be before the race commences) </li>
    <li> speed mean and max </li>
    <li> time taken to complete the race </li>
    <li> total distance travelled </li>
    <li> mean and max cadence</li>
    <li> mean and max power </li>
    
</ul>

In [9]:
rr_16.describe()

| |latitude|longitude|elevation|temperature|power|cadence|hr|distance|elevation_gain|speed|timedelta|
|---|---|---|---|---|---|---|---|---|---|---|---|
|count|2822.0|2822.0|2822.0|2822.0|2822.0|2822.0|2822.0|2822.0|2822.0|2822.0|2821.0|
|mean|-33.368017|151.225527|232.404465|25.280652|0.0|65.987952|158.394401|0.017381|-0.003756|34.933085|1.843318|
|std|0.028329|0.006014|29.725934|1.348746|0.0|34.425881|11.304588|0.015695|0.458872|10.738677|1.692364|
|min|-33.416753|151.211496|176.0|24.0|0.0|0.0|102.0|0.0|-1.6|0.0|1.0|
|25%|-33.393691|151.221912|209.45|24.0|0.0|68.0|151.0|0.007894|-0.4|26.656312|1.0|
|50%|-33.37182|151.227236|226.1|25.0|0.0|79.0|158.0|0.011794|0.0|33.307339|1.0|
|75%|-33.342269|151.230069|258.2|26.0|0.0|87.0|166.0|0.016899|0.4|42.871885|2.0|
|max|-33.31689|151.235131|295.8|30.0|0.0|117.0|205.0|0.076283|1.2|92.749036|9.0|


In [10]:
print(f"The mean heart rate for the road racer in 2016 is {rr_16['hr'].mean()} bpm")

The mean heart rate for the road racer in 2016 is 158.39440113394755 bpm


In [11]:
print(f"The max heart rate for the road racer in 2016 is {rr_16['hr'].max()} bpm")

The max heart rate for the road racer in 2016 is 205.0 bpm


In [12]:
print(f"The mean temperature for the road racer in 2016 is {rr_16['temperature'].mean()} degrees Celsius")

The mean temperature for the road racer in 2016 is 25.280652019844084 degrees Celsius


In [13]:
print(f"The max temperature for the road racer in 2016 is {rr_16['temperature'].max()} degrees Celsius") 

The max temperature for the road racer in 2016 is 30.0 degrees Celsius


In [14]:
print(f"The mean speed for the road racer in 2016 is {rr_16['speed'].mean()} Km/hr")

The mean speed for the road racer in 2016 is 34.93308475482947 Km/hr


In [15]:
print(f"The max speed for the road racer in 2016 is {rr_16['speed'].max()} Km/hr")

The max speed for the road racer in 2016 is 92.74903649913952 Km/hr


In [16]:
print(f"The time taken for the road racer in 2016 is {rr_16['distance'].sum() / rr_16['speed'].mean()} hours")

The time taken for the road racer in 2016 is 1.404072560168207 hours
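The figure above is an estimate obtained by dividing the total distance by the plain mean of the speed samples. Since the index holds the observation timestamps, the elapsed time can also be read off directly. The sketch below uses a tiny stand-in DataFrame (the timestamps and distances are invented for illustration) and assumes the index is a pandas `DatetimeIndex`, as the field description suggests.

```python
import pandas as pd

# Hypothetical stand-in for a ride: distance in km, indexed by timestamp.
ride = pd.DataFrame(
    {"distance": [0.0, 0.010, 0.012, 0.030]},
    index=pd.to_datetime([
        "2016-05-14 04:02:41", "2016-05-14 04:02:42",
        "2016-05-14 04:02:43", "2016-05-14 04:02:46",
    ]),
)

# Elapsed time is simply last timestamp minus first timestamp.
elapsed = ride.index[-1] - ride.index[0]          # a pandas Timedelta
elapsed_hours = elapsed.total_seconds() / 3600

# Average speed over the whole ride: total distance / elapsed time.
avg_speed = ride["distance"].sum() / elapsed_hours
print(elapsed, avg_speed)
```

Computed this way, the time taken does not depend on how evenly the samples are spaced.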


In [17]:
print(f"The total distance travelled for the road racer in 2016 is { rr_16['distance'].sum() } Km")

The total distance travelled for the road racer in 2016 is 49.04858574628638 Km


In [18]:
print(f"The mean cadence for the road racer in 2016 is {rr_16['cadence'].mean() } rpm")

The mean cadence for the road racer in 2016 is 65.98795180722891 rpm


In [19]:
print(f"The max cadence for the road racer in 2016 is {rr_16['cadence'].max() } rpm")

The max cadence for the road racer in 2016 is 117.0 rpm


In [20]:
print("No power was recorded for the road racer in 2016 (the column is 0.0 in every row). This column should be dropped")

No power was recorded for the road racer in 2016 (the column is 0.0 in every row). This column should be dropped


As we can see, repeating these statements for every ride is very tedious. The summary is better wrapped in a function.

In [23]:
def summaryFn(myDF, raceTypeString, yearString):
    # prints the summary statistics for one ride; the two strings are only used as labels
    print(f"The mean heart rate for the {raceTypeString} racer in {yearString} is {myDF['hr'].mean():.4} bpm")
    print(" ")
    print(f"The max heart rate for the {raceTypeString} racer in {yearString} is {myDF['hr'].max():.4} bpm")
    print(" ")
    print(f"The mean temperature for the {raceTypeString} racer in {yearString} is {myDF['temperature'].mean():.4} degrees Celsius")
    print(" ")
    print(f"The max temperature for the {raceTypeString} racer in {yearString} is {myDF['temperature'].max():.4} degrees Celsius") 
    print(" ")
    print(f"The mean speed for the {raceTypeString} racer in {yearString} is {myDF['speed'].mean():.4} Km/hr")
    print(" ")
    print(f"The max speed for the {raceTypeString} racer in {yearString} is {myDF['speed'].max():.4} Km/hr")
    print(" ")
    # estimate: total distance divided by the (unweighted) mean speed
    print(f"The time taken for the {raceTypeString} racer in {yearString} is {myDF['distance'].sum() / myDF['speed'].mean():.4} hours")
    print(" ")
    print(f"The total distance travelled for the {raceTypeString} racer in {yearString} is { myDF['distance'].sum():.4} Km")
    print(" ")
    print(f"The mean cadence for the {raceTypeString} racer in {yearString} is {myDF['cadence'].mean():.4} rpm")
    print(" ")
    print(f"The max cadence for the {raceTypeString} racer in {yearString} is {myDF['cadence'].max():.4} rpm")

In [24]:
summaryFn(rr_16, 'road', '2016')

The mean heart rate for the road racer in 2016 is 158.4 bpm
 
The max heart rate for the road racer in 2016 is 205.0 bpm
 
The mean temperature for the road racer in 2016 is 25.28 degrees Celsius
 
The max temperature for the road racer in 2016 is 30.0 degrees Celsius
 
The mean speed for the road racer in 2016 is 34.93 Km/hr
 
The max speed for the road racer in 2016 is 92.75 Km/hr
 
The time taken for the road racer in 2016 is 1.404 hours
 
The total distance travelled for the road racer in 2016 is 49.05 Km
 
The mean cadence for the road racer in 2016 is 65.99 rpm
 
The max cadence for the road racer in 2016 is 117.0 rpm


In [25]:
summaryFn(tt_16, 'time trial', '2016')

The mean heart rate for the time trial racer in 2016 is 170.9 bpm
 
The max heart rate for the time trial racer in 2016 is 251.0 bpm
 
The mean temperature for the time trial racer in 2016 is 10.95 degrees Celsius
 
The max temperature for the time trial racer in 2016 is 13.0 degrees Celsius
 
The mean speed for the time trial racer in 2016 is 33.53 Km/hr
 
The max speed for the time trial racer in 2016 is 162.5 Km/hr
 
The time taken for the time trial racer in 2016 is 0.7397 hours
 
The total distance travelled for the time trial racer in 2016 is 24.8 Km
 
The mean cadence for the time trial racer in 2016 is 83.28 rpm
 
The max cadence for the time trial racer in 2016 is 118.0 rpm


In [26]:
summaryFn(rr_19, 'road', '2019')

The mean heart rate for the road racer in 2019 is 139.0 bpm
 
The max heart rate for the road racer in 2019 is 170.0 bpm
 
The mean temperature for the road racer in 2019 is 5.998 degrees Celsius
 
The max temperature for the road racer in 2019 is 7.0 degrees Celsius
 
The mean speed for the road racer in 2019 is 33.88 Km/hr
 
The max speed for the road racer in 2019 is 70.37 Km/hr
 
The time taken for the road racer in 2019 is 1.529 hours
 
The total distance travelled for the road racer in 2019 is 51.79 Km
 
The mean cadence for the road racer in 2019 is 70.0 rpm
 
The max cadence for the road racer in 2019 is 120.0 rpm


In [27]:
summaryFn(tt_19, 'time trial', '2019')

The mean heart rate for the time trial racer in 2019 is 152.7 bpm
 
The max heart rate for the time trial racer in 2019 is 166.0 bpm
 
The mean temperature for the time trial racer in 2019 is 10.19 degrees Celsius
 
The max temperature for the time trial racer in 2019 is 13.0 degrees Celsius
 
The mean speed for the time trial racer in 2019 is 33.06 Km/hr
 
The max speed for the time trial racer in 2019 is 63.3 Km/hr
 
The time taken for the time trial racer in 2019 is 0.7375 hours
 
The total distance travelled for the time trial racer in 2019 is 24.38 Km
 
The mean cadence for the time trial racer in 2019 is 89.98 rpm
 
The max cadence for the time trial racer in 2019 is 111.0 rpm


We will most likely need some of these values later on, so we can expect to modify the function so that it returns the required information instead of printing it. 

Dropping the 'power' column from the DataFrames where it is empty (only tt_19 has the power field filled), with help from https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html:

In [28]:
rr_16= rr_16.drop(['power'], axis=1)
rr_19= rr_19.drop(['power'], axis=1)
tt_16= tt_16.drop(['power'], axis=1)
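One common stumbling block with `drop`, sketched here on a toy DataFrame (the values are invented): it returns a new DataFrame rather than changing the original, and it raises a `KeyError` if the cell is run a second time after the column has already gone. Whether either of these was the cause of any problem in this notebook is an assumption.

```python
import pandas as pd

# Toy frame standing in for a ride with an empty power column.
df = pd.DataFrame({"power": [0.0, 0.0], "speed": [30.1, 31.4]})

# Without assignment the result is discarded and df is unchanged.
df.drop(columns=["power"])
print("power" in df.columns)   # still present

# Assigning the result back removes the column.
df = df.drop(columns=["power"])

# errors="ignore" makes the cell safe to re-run once the column is gone.
df = df.drop(columns=["power"], errors="ignore")
print(list(df.columns))
```

`columns=["power"]` is equivalent to the `['power'], axis=1` form used above.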

###  Q2: Compare the range of speeds for each ride, are time trials faster than road races? 

Let's start by modifying the `summaryFn` from before

In [None]:
# displaying the mean, median and standard deviation of age for each group

ageDF = {'DF': ['adult_male', 'adult_female', 'adult_male50k', 'adult_female50k'],
         'mean_age':[adult_male['age'].mean(),adult_female['age'].mean(),adult_male50k['age'].mean(),adult_female50k['age'].mean()],
         'median_age':[adult_male['age'].median(),adult_female['age'].median(),adult_male50k['age'].median(),adult_female50k['age'].median()],
         'standard_deviation_age':[adult_male['age'].std(),adult_female['age'].std(),adult_male50k['age'].std(),adult_female50k['age'].std()]}


# is there an easier way to display it like this?
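One possibly easier way to build such a table, shown as a sketch: keep the DataFrames in a dictionary and let `agg` compute the statistics for each. The two small frames below are invented stand-ins; in the notebook the dictionary would hold the four ride DataFrames instead.

```python
import pandas as pd

# Hypothetical stand-ins for the ride DataFrames (speed in km/h).
rides = {
    "rr_16": pd.DataFrame({"speed": [30.0, 35.0, 40.0]}),
    "tt_16": pd.DataFrame({"speed": [32.0, 34.0, 36.0]}),
}

stats = ["min", "mean", "median", "max", "std"]

# One column of statistics per ride, then transpose so each ride is a row.
summary = pd.DataFrame(
    {name: df["speed"].agg(stats) for name, df in rides.items()}
).T

print(summary)
```

With one row per ride, comparing the range of speeds between time trials and road races becomes a matter of reading down the `min` and `max` columns.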

In [29]:
def summaryFn2(myDF, raceTypeString, yearString):
    # same statistics as summaryFn, but returned instead of printed
    # (raceTypeString and yearString are no longer used inside the function)
    hrMean = myDF['hr'].mean()
    hrMax = myDF['hr'].max()
    tempMean = myDF['temperature'].mean()
    tempMax = myDF['temperature'].max()
    speedMean = myDF['speed'].mean()
    speedMax = myDF['speed'].max()
    timeTaken = myDF['distance'].sum() / myDF['speed'].mean()  # hours (estimate)
    totalDist = myDF['distance'].sum()  # Km
    cadenceMean = myDF['cadence'].mean()
    cadenceMax = myDF['cadence'].max()
    # NOTE: speedMax is computed above but is not included in the returned tuple
    return hrMean, hrMax, tempMean, tempMax, speedMean, timeTaken, totalDist, cadenceMean, cadenceMax

In [34]:
test = list(summaryFn2(tt_16, 'time trial','2016'))
type(test)

list

In [35]:
test

[170.93964957819597,
 251.0,
 10.953277092796885,
 13.0,
 33.52996304869014,
 0.7397230648686031,
 24.80288703130808,
 83.27709279688514,
 118.0]
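A bare list like the one above relies on remembering which position holds which statistic. As a sketch of an alternative, the function could return a labelled `pandas.Series`; the helper name `summary_series` and the toy rows below are hypothetical.

```python
import pandas as pd

def summary_series(ride: pd.DataFrame) -> pd.Series:
    """Hypothetical helper: a few ride statistics, each with a label."""
    return pd.Series({
        "hr_mean": ride["hr"].mean(),
        "hr_max": ride["hr"].max(),
        "speed_mean": ride["speed"].mean(),
        "speed_max": ride["speed"].max(),
        "total_dist_km": ride["distance"].sum(),
    })

# Made-up rows standing in for a ride.
toy = pd.DataFrame({
    "hr": [150.0, 160.0, 170.0],
    "speed": [30.0, 36.0, 33.0],
    "distance": [0.010, 0.012, 0.011],
})

stats = summary_series(toy)
print(stats["speed_max"])   # values are looked up by name, not position
```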

### Q3: Compare the speeds achieved in the two time trials (three years apart).  As well as looking at the averages, can you see where in the ride one or the other is faster.  
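A possible starting point for this question, sketched on made-up numbers rather than the real rides: since both time trials follow the same course, the two can be compared by position instead of by time. The hypothetical helper `speed_by_km` bins each observation by the distance already covered and averages the speed in each bin; the 1 km bin width is an arbitrary assumption.

```python
import pandas as pd

def speed_by_km(ride: pd.DataFrame, bin_km: float = 1.0) -> pd.Series:
    """Mean speed in each distance bin along the course (hypothetical helper)."""
    # Distance covered before each observation, in km.
    position = ride["distance"].cumsum() - ride["distance"]
    km_bin = (position // bin_km).astype(int).rename("km")
    return ride["speed"].groupby(km_bin).mean()

# Invented stand-ins for two time trials on the same course.
tt_a = pd.DataFrame({"distance": [0.5] * 6, "speed": [30.0, 30.0, 40.0, 40.0, 35.0, 35.0]})
tt_b = pd.DataFrame({"distance": [0.5] * 6, "speed": [32.0, 32.0, 38.0, 38.0, 35.0, 35.0]})

# Positive values: tt_b was faster in that kilometre; negative: tt_a was.
diff = speed_by_km(tt_b) - speed_by_km(tt_a)
print(diff)
```

Plotting `diff` against the kilometre bins would show where along the course one ride gains on the other.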

### Q4: From the elevation_gain field you can see whether the rider is _climbing_ , _descending_ or on the _flat_.   Use this to calculate the average speeds in those three cases (climbing, flat or descending).  Note that _flat_ might not be zero elevation_gain but might allow for slight climbs and falls.  
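One way this could be approached, shown as a sketch on invented values: label each observation by comparing `elevation_gain` against a small tolerance, then average the speed within each label. The helper name `speed_by_terrain` and the tolerance of 0.2 m are assumptions that would need tuning against the real data.

```python
import numpy as np
import pandas as pd

def speed_by_terrain(ride: pd.DataFrame, flat_tol: float = 0.2) -> pd.Series:
    """Mean speed while climbing, descending and on the flat (hypothetical helper)."""
    gain = ride["elevation_gain"]
    # Anything within +/- flat_tol metres of zero counts as flat.
    terrain = np.select(
        [gain > flat_tol, gain < -flat_tol],
        ["climbing", "descending"],
        default="flat",
    )
    return ride["speed"].groupby(terrain).mean()

# Made-up observations: elevation_gain in m, speed in km/h.
toy = pd.DataFrame({
    "elevation_gain": [0.4, 0.6, 0.0, -0.1, -0.8, -0.4],
    "speed": [20.0, 24.0, 35.0, 37.0, 50.0, 54.0],
})

result = speed_by_terrain(toy)
print(result)
```

Because the time between observations varies, a threshold on gradient (elevation gain divided by distance) might be more robust than a threshold on raw gain.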

## Challenge: Gear Usage

A modern race bike has up to 22 different gears, with two chainrings on the front (attached to the pedals) and 10 or 11 sprockets at the back (attached to the wheel).   The ratio of the number of teeth on the front and rear cogs determines the distance travelled with one revolution of the pedals (often called __development__, measured in metres).  Low development is good for climbing hills while high development is for going fast downhill or in the final sprint. 

We have a measure of the number of rotations of the pedals per minute (__cadence__) and a measure of __speed__.  Using these two variables we should be able to derive a measure of __development__ which would effectively tell us which gear the rider was using at the time.   Development will normally range between __2m__ and __10m__.  Due to errors in GPS and cadence measurements you will see many points outside this range and you should just discard them as outliers. 

Write code to calculate __development__ in _meters_ for each row in a ride.  Plot the result in a _histogram_ and compare the plots for the four rides.   Comment on what you observe in the histograms.
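A minimal sketch of the calculation, assuming speed is in km/h and cadence in rpm as described in the field list: speed in metres per minute divided by pedal revolutions per minute gives metres per revolution. The helper name `development` and the toy rows are hypothetical.

```python
import numpy as np
import pandas as pd

def development(ride: pd.DataFrame) -> pd.Series:
    """Metres travelled per pedal revolution, outliers removed (hypothetical helper)."""
    pedalling = ride[ride["cadence"] > 0]            # avoid dividing by zero when coasting
    metres_per_min = pedalling["speed"] * 1000 / 60  # km/h -> m/min
    dev = metres_per_min / pedalling["cadence"]
    return dev[(dev >= 2) & (dev <= 10)]             # keep the plausible 2 m to 10 m range

# Made-up observations: speed in km/h, cadence in rpm.
toy = pd.DataFrame({
    "speed": [36.0, 30.0, 60.0, 18.0],
    "cadence": [90.0, 0.0, 20.0, 100.0],
})

dev = development(toy)

# Histogram counts over 0.5 m wide bins between 2 m and 10 m.
counts, edges = np.histogram(dev, bins=16, range=(2, 10))
print(dev.tolist(), counts.sum())
```

In the notebook the same values could be passed to `plt.hist` for each of the four rides to compare them side by side.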





# NEED HELP FOR THE FOLLOWING: 

- why isn't the drop() function working?
- how to use a smoothing function?