# Boston Marathon Split Comparison
by Sam Tobin and Cole DeMeulemeester

#### Inspiration

This project compares the splits of a selected 2017 Boston Marathon finisher with the field average and graphs the results on a simple map of the course.

Being track runners ourselves, we wanted to explore how all runners, and most importantly the successful and unsuccessful ones, approached the race in comparison with the average.

Originally, we wanted to create something similar to the visualization of Napoleon's march to Russia. However, we soon decided on a simpler design given our time and skill constraints.

#### Collecting and Processing the Data

We found the JSON file for the Boston Marathon route here: https://gist.github.com/jwass/11119254

Our dataset of marathon finishers in 2017 came from the following source: https://www.kaggle.com/rojour/boston-results/data

However, using only this dataset, we were unable to effectively calculate averages. Dr. Z helped us by formatting the data, adding "seconds" and "splits" columns with the following cells.

In [4]:
import pandas as pd

df = pd.read_csv("marathon_results_2017.csv", index_col="Overall")

In [5]:
df.head()

Overall,Unnamed: 0,Bib,Name,Age,M/F,City,State,Country,Citizen,Unnamed: 9,...,Half,twentyfive,thirty,thirtyfive,forty,Pace,Proj_Time,Official_Time,Gender,Division
1,0,11,"Kirui, Geoffrey",24,M,Keringet,,KEN,,,...,1:04:35,1:16:59,1:33:01,1:48:19,2:02:53,0:04:57,-,2:09:37,1,1
2,1,17,"Rupp, Galen",30,M,Portland,OR,USA,,,...,1:04:35,1:16:59,1:33:01,1:48:19,2:03:14,0:04:58,-,2:09:58,2,2
3,2,23,"Osako, Suguru",25,M,Machida-City,,JPN,,,...,1:04:36,1:17:00,1:33:01,1:48:31,2:03:38,0:04:59,-,2:10:28,3,3
4,3,21,"Biwott, Shadrack",32,M,Mammoth Lakes,CA,USA,,,...,1:04:45,1:17:00,1:33:01,1:48:58,2:04:35,0:05:03,-,2:12:08,4,4
5,4,9,"Chebet, Wilson",31,M,Marakwet,,KEN,,,...,1:04:35,1:16:59,1:33:01,1:48:41,2:05:00,0:05:04,-,2:12:35,5,5


In [6]:
df.columns

Index(['Unnamed: 0', 'Bib', 'Name', 'Age', 'M/F', 'City', 'State', 'Country',
       'Citizen', 'Unnamed: 9', 'five', 'ten', 'fifteen', 'twenty', 'Half',
       'twentyfive', 'thirty', 'thirtyfive', 'forty', 'Pace', 'Proj_Time',
       'Official_Time', 'Gender', 'Division'],
      dtype='object')

In [8]:
splits = df[['Name', 'five', 'ten', 'fifteen', 'twenty', 'Half',
       'twentyfive', 'thirty', 'thirtyfive', 'forty', 'Pace', 'Official_Time']]

We added an "Official_Time_s" column to the exported CSV so that we can represent the total race time in our visualization. We added on to the existing cells as necessary.

In [9]:
splits.head()

Overall,Name,five,ten,fifteen,twenty,Half,twentyfive,thirty,thirtyfive,forty,Pace,Official_Time
1,"Kirui, Geoffrey",0:15:25,0:30:28,0:45:44,1:01:15,1:04:35,1:16:59,1:33:01,1:48:19,2:02:53,0:04:57,2:09:37
2,"Rupp, Galen",0:15:24,0:30:27,0:45:44,1:01:15,1:04:35,1:16:59,1:33:01,1:48:19,2:03:14,0:04:58,2:09:58
3,"Osako, Suguru",0:15:25,0:30:29,0:45:44,1:01:16,1:04:36,1:17:00,1:33:01,1:48:31,2:03:38,0:04:59,2:10:28
4,"Biwott, Shadrack",0:15:25,0:30:29,0:45:44,1:01:19,1:04:45,1:17:00,1:33:01,1:48:58,2:04:35,0:05:03,2:12:08
5,"Chebet, Wilson",0:15:25,0:30:28,0:45:44,1:01:15,1:04:35,1:16:59,1:33:01,1:48:41,2:05:00,0:05:04,2:12:35


In [10]:
def to_seconds(s):
    # convert an "H:MM:SS" string to a total number of seconds
    times = s.split(":")
    return int(times[0]) * 3600 + int(times[1]) * 60 + int(times[2])

In [13]:
split_cols = ['five', 'ten', 'fifteen', 'twenty',
       'Half', 'twentyfive', 'thirty', 'thirtyfive', 'forty', 'Official_Time']

df["five_s"] = df.five.map(to_seconds)

Wait, there are `"-"`'s in there?
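To see why a `"-"` entry is a problem, here is a small reproduction that reuses `to_seconds` exactly as defined above. The sample strings are Kirui's official time from the table and the placeholder `"-"`; nothing else is assumed.

```python
def to_seconds(s):
    # same logic as the notebook cell: "H:MM:SS" -> total seconds
    times = s.split(":")
    return int(times[0]) * 3600 + int(times[1]) * 60 + int(times[2])

# A well-formed time converts cleanly.
print(to_seconds("2:09:37"))

# The "-" placeholder has no hours, minutes, or seconds to parse,
# so int("-") raises ValueError before anything is returned.
try:
    to_seconds("-")
except ValueError as err:
    print("cannot convert:", err)
```

Dropping the rows that contain `"-"` before mapping avoids the error.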

In [14]:
# splits = df[split_cols]

for col in split_cols:
    df = df[~df[col].str.contains("-")]
    
df.head()

Overall,Unnamed: 0,Bib,Name,Age,M/F,City,State,Country,Citizen,Unnamed: 9,...,twentyfive,thirty,thirtyfive,forty,Pace,Proj_Time,Official_Time,Gender,Division,five_s
1,0,11,"Kirui, Geoffrey",24,M,Keringet,,KEN,,,...,1:16:59,1:33:01,1:48:19,2:02:53,0:04:57,-,2:09:37,1,1,925
2,1,17,"Rupp, Galen",30,M,Portland,OR,USA,,,...,1:16:59,1:33:01,1:48:19,2:03:14,0:04:58,-,2:09:58,2,2,924
3,2,23,"Osako, Suguru",25,M,Machida-City,,JPN,,,...,1:17:00,1:33:01,1:48:31,2:03:38,0:04:59,-,2:10:28,3,3,925
4,3,21,"Biwott, Shadrack",32,M,Mammoth Lakes,CA,USA,,,...,1:17:00,1:33:01,1:48:58,2:04:35,0:05:03,-,2:12:08,4,4,925
5,4,9,"Chebet, Wilson",31,M,Marakwet,,KEN,,,...,1:16:59,1:33:01,1:48:41,2:05:00,0:05:04,-,2:12:35,5,5,925


In [15]:
# map all the columns to seconds
for col in split_cols:
    df[col + "_s"] = df[col].map(to_seconds)

This is the stuff I did after you two left on 11/7.  Basically: we need a function that can subtract the previous checkpoint's time from each cumulative time.
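As a side note, the same subtraction can be sketched with pandas' `DataFrame.diff` along the column axis. This is an alternative illustration, not what the notebook cells do; the small frame holds the cumulative seconds of the first two finishers, computed from their times in the table above.

```python
import pandas as pd

# Cumulative times in seconds at the first four checkpoints
# for the top two finishers (values derived from the table above).
cumulative = pd.DataFrame(
    {
        "five_s": [925, 924],
        "ten_s": [1828, 1827],
        "fifteen_s": [2744, 2744],
        "twenty_s": [3675, 3675],
    },
    index=pd.Index([1, 2], name="Overall"),
)

# diff(axis=1) subtracts from each column the column to its left.
# The first column has nothing to its left and comes back as NaN,
# so it is filled with the cumulative time itself.
segment = cumulative.diff(axis=1)
segment["five_s"] = cumulative["five_s"]
segment = segment.astype(int)

print(segment)
```

This only works when the columns are already in course order, which is why the frame is built with the checkpoints listed from start to finish.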

In [18]:
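# make a pairing of things to subtract
# note: this condition is always true (no name equals both "Half" and "Final"),
# so every column in split_cols, including "Half", is kept
s = [split for split in split_cols if (split != "Half") | (split != "Final")]
split_col_pairs = [(s[i], s[i+1]) for i in range(len(s)-1)]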

# check it
split_col_pairs

[('five', 'ten'),
 ('ten', 'fifteen'),
 ('fifteen', 'twenty'),
 ('twenty', 'Half'),
 ('Half', 'twentyfive'),
 ('twentyfive', 'thirty'),
 ('thirty', 'thirtyfive'),
 ('thirtyfive', 'forty'),
 ('forty', 'Official_Time')]

In [19]:
# check that it works for the first dude

for col1, col2 in split_col_pairs:
    print(df.loc[1, col2 + "_s"] - df.loc[1, col1 + "_s"])

903
916
931
200
744
962
918
874
404


In [20]:
# Now do it for everyone
for col1, col2 in split_col_pairs:
    df[col2 + "_split"] = df[col2 + "_s"] - df[col1 + "_s"]
    
#check it
df.head()

Overall,Unnamed: 0,Bib,Name,Age,M/F,City,State,Country,Citizen,Unnamed: 9,...,Official_Time_s,ten_split,fifteen_split,twenty_split,Half_split,twentyfive_split,thirty_split,thirtyfive_split,forty_split,Official_Time_split
1,0,11,"Kirui, Geoffrey",24,M,Keringet,,KEN,,,...,7777,903,916,931,200,744,962,918,874,404
2,1,17,"Rupp, Galen",30,M,Portland,OR,USA,,,...,7798,903,917,931,200,744,962,918,895,404
3,2,23,"Osako, Suguru",25,M,Machida-City,,JPN,,,...,7828,904,915,932,200,744,961,930,907,410
4,3,21,"Biwott, Shadrack",32,M,Mammoth Lakes,CA,USA,,,...,7928,904,915,935,206,735,961,957,937,453
5,4,9,"Chebet, Wilson",31,M,Marakwet,,KEN,,,...,7955,903,916,931,200,744,962,940,979,455


In [21]:
# check that there's nothing strange going on 
df["thirtyfive_split"].describe()

count    26259.000000
mean      1871.579420
std        416.732748
min        828.000000
25%       1580.000000
50%       1804.000000
75%       2089.500000
max      10720.000000
Name: thirtyfive_split, dtype: float64

Only thing that's missing is you don't have a `five_split` column.  It's the same as `five_s`, but just to make iteration in your script a little easier, let's add that.

In [22]:
df["five_split"] = df["five_s"]

Looks great, let's ship it out.

In [23]:
df.to_csv("marathon_2017_with_seconds.csv")

Enjoy!

#### How it Works

After the user selects a number for a runner and presses the Generate Graph button, our code reads in both datasets to generate the plot. [line 68]

Two functions are then called to generate two arrays: the average splits across all runners, and the splits for the selected runner. [lines 197, 215]
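The logic of those two functions can be sketched in Python with pandas. This is only an illustration of the idea, not the project's own code: the helper names `average_splits` and `runner_splits` are hypothetical, the runner is assumed to be chosen by overall place, and the three rows and three split columns are a small excerpt of the processed table.

```python
import pandas as pd

# Shortened column list for the example; the processed CSV has more splits.
SPLIT_COLUMNS = ["five_split", "ten_split", "fifteen_split"]

# Three rows copied from the processed table; the real file has every finisher.
df = pd.DataFrame(
    {
        "five_split": [925, 924, 925],
        "ten_split": [903, 903, 904],
        "fifteen_split": [916, 917, 915],
    },
    index=pd.Index([1, 2, 3], name="Overall"),
)


def average_splits(frame):
    """Mean time in seconds for each segment across all runners (hypothetical helper)."""
    return frame[SPLIT_COLUMNS].mean().tolist()


def runner_splits(frame, overall_place):
    """Segment times in seconds for one runner, looked up by overall place (assumed key)."""
    return frame.loc[overall_place, SPLIT_COLUMNS].tolist()


print(average_splits(df))
print(runner_splits(df, 2))
```

Both functions return lists in the same column order, so the two results can be compared segment by segment.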

We create a projection to align our data (the Boston area) with the screen, and then plot the Marathon route as a path using this projection. [line 78]
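To illustrate what such a projection does, here is a minimal Python sketch that maps longitude/latitude pairs into a padded screen rectangle. The README does not say which projection the project uses, so the spherical Mercator formula, the `fit_to_screen` helper, and the three sample coordinates (rough points between Hopkinton and Boston) are all assumptions made for the example.

```python
import math


def mercator(lon, lat):
    """Spherical Mercator: degrees in, unitless x and y out (y grows northward)."""
    x = math.radians(lon)
    y = math.log(math.tan(math.pi / 4 + math.radians(lat) / 2))
    return x, y


def fit_to_screen(points, width, height, padding=20):
    """Scale and shift projected points so they fit inside a padded screen box.

    Assumes the points span a non-zero range in both directions.
    """
    projected = [mercator(lon, lat) for lon, lat in points]
    xs = [p[0] for p in projected]
    ys = [p[1] for p in projected]
    min_x, max_x = min(xs), max(xs)
    min_y, max_y = min(ys), max(ys)
    # One scale for both axes keeps the shape of the route undistorted.
    scale = min(
        (width - 2 * padding) / (max_x - min_x),
        (height - 2 * padding) / (max_y - min_y),
    )
    # Screen y grows downward, so the northernmost point maps to the top.
    return [
        (padding + (x - min_x) * scale, padding + (max_y - y) * scale)
        for x, y in projected
    ]


# Illustrative (longitude, latitude) pairs, not taken from the route file.
route = [(-71.52, 42.23), (-71.30, 42.30), (-71.08, 42.35)]
screen = fit_to_screen(route, width=800, height=400)
print(screen)
```

The westernmost point lands on the left padding edge and the northernmost point on the top padding edge; everything else falls inside the box.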

For the 5k split markers and the runner data, we create a group element and, using the arrays above as well as an array of 5k marker coordinates, append circles and rects containing the data from both arrays to the plot, iterating with forEach(). This happens in an each() function within a call on the group element. [lines 116-185]

#### Running the Program

Simply collect each file (CSV, JSON, HTML) into one folder and run a local server from that folder in the console. Then select your runner and press Generate Graph!
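Any static file server will do. As one possibility, here is a minimal sketch using Python's standard library; the `make_server` helper, the port number, and the choice of Python's `http.server` are assumptions for illustration, not something the project requires.

```python
from functools import partial
from http.server import HTTPServer, SimpleHTTPRequestHandler


def make_server(folder=".", port=8000):
    """Build a static file server for `folder` on localhost (hypothetical helper)."""
    handler = partial(SimpleHTTPRequestHandler, directory=folder)
    return HTTPServer(("localhost", port), handler)


# To serve the project folder, run the following from a script or the console:
#     make_server(".", 8000).serve_forever()
# then open http://localhost:8000/ in a browser and pick the HTML file.
```

Serving over HTTP rather than opening the HTML file directly matters because browsers typically block a page loaded from disk from fetching the CSV and JSON files.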