# Marathon Dataset

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline


In [None]:
data = pd.read_csv('marathon-data.csv')
data.head()

### Terminology

* Splits: a race or run’s total time divided into parts (usually per kilometre or per mile)
* Negative split: running the second half of a race faster than the first half; generally considered the ideal way to pace most races
* Positive split: running the second half of a race slower than the first half
* Even split: running the first and second halves of a race at a consistent pace
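
The definitions above can be sketched as a small helper. This is only an illustration: the function name `classify_split` and the `tolerance_sec` parameter are hypothetical and are not part of the notebook or the dataset.

```python
def classify_split(first_half_sec, second_half_sec, tolerance_sec=0.0):
    """Label a race as a negative, positive, or even split.

    tolerance_sec is a hypothetical allowance: halves that differ by no
    more than this many seconds are treated as an even split.
    """
    diff = second_half_sec - first_half_sec
    if abs(diff) <= tolerance_sec:
        return "even"
    # Second half faster (smaller time) means a negative split.
    return "negative" if diff < 0 else "positive"


print(classify_split(6000, 5900))      # negative
print(classify_split(6000, 6400))      # positive
print(classify_split(6000, 6005, 10))  # even
```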

In [None]:
""" By default, Pandas loaded the time columns as Python strings (type object); we can 
see this by looking at the dtypes attribute of the DataFrame:"""

# (Q) Find out the data types of the attributes in the data
data.dtypes

In [None]:
# Convert time into the desired format

def convert_time(s):
    # Parse an 'H:MM:SS' string into a pandas Timedelta
    # (pd.datetools is no longer available in current pandas releases)
    h, m, sec = map(int, s.split(':'))
    return pd.Timedelta(hours=h, minutes=m, seconds=sec)

data = pd.read_csv('marathon-data.csv', converters={'split': convert_time, 'final': convert_time})
data.head()
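
As a quick sanity check of the parsing logic, here is a minimal standalone sketch. It assumes the times are formatted as `H:MM:SS` and uses the standard library's `datetime.timedelta` as a stand-in so that it runs without the CSV file; the sample string is made up for illustration.

```python
from datetime import timedelta


def convert_time(s):
    """Parse an 'H:MM:SS' string into a timedelta (standard-library stand-in)."""
    h, m, sec = map(int, s.split(':'))
    return timedelta(hours=h, minutes=m, seconds=sec)


example = convert_time('01:05:38')
print(example)                  # 1:05:38
print(example.total_seconds())  # 3938.0
```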

In [None]:
# (Q) Find out the data types of the attributes in the data after conversion

In [None]:
"""That looks much better. For the purpose of our Seaborn plotting utilities, let’s next
add columns that give the times in seconds:"""
# Creating the split_sec attribute using the split attribute
data['split_sec'] = data['split'] / np.timedelta64(1, 's')

#(Q) repeat the same for the final attribute and create final_sec attribute
data['final_sec'] = data['final'] / np.timedelta64(1, 's')
data.head()
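
The division by `np.timedelta64(1, 's')` can be checked on a tiny frame. This is a sketch with made-up times rather than the real data; the `.dt.total_seconds()` accessor is shown as an equivalent alternative, assuming the column has a timedelta dtype.

```python
import numpy as np
import pandas as pd

# Two made-up times, parsed into a timedelta column.
demo = pd.DataFrame({"split": pd.to_timedelta(["01:05:38", "02:08:51"])})

# Dividing by a one-second timedelta yields plain floats (seconds).
via_division = demo["split"] / np.timedelta64(1, "s")
# Equivalent accessor-based form.
via_accessor = demo["split"].dt.total_seconds()

print(via_division.tolist())  # [3938.0, 7731.0]
print(via_accessor.tolist())  # [3938.0, 7731.0]
```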

In [None]:
# To get an idea of what the data looks like, we can plot a jointplot over the data

#(Q) Generate a seaborn joint plot using the split_sec and final_sec attributes

### Observations
The dotted line shows where someone’s time would lie if they ran the marathon at a
perfectly steady pace. The fact that the distribution lies above this indicates (as you
might expect) that most people slow down over the course of the marathon.

In [None]:
"""Let’s create another column in the data, the split fraction, which measures the degree
to which each runner negative-splits or positive-splits the race:"""

# generating split fraction attribute using the formula
data['split_frac'] = 1 - 2 * data['split_sec'] / data['final_sec']
data.head()
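
To see how the sign of this formula behaves, here is a small worked sketch. The helper name `split_fraction` and the sample times are hypothetical.

```python
def split_fraction(split_sec, final_sec):
    """Return 1 - 2 * split / final.

    Zero means an even split, a negative value means the second half was
    faster (negative split), a positive value means it was slower.
    """
    return 1 - 2 * split_sec / final_sec


print(split_fraction(3600, 7200))      # 0.0 (even split)
print(split_fraction(3938, 7731) < 0)  # True (second half faster)
print(split_fraction(3500, 7400) > 0)  # True (second half slower)
```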

In [None]:
"""Where this split fraction is less than zero, the person negative-split the race by that
fraction. Let’s do a distribution plot of this split fraction."""

#(Q) Generate a distribution plot for split fraction attribute

In [None]:
#(Q) Find number of points having split fraction less than 0
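
A minimal sketch of the counting step follows, using a made-up array in place of the real column: a boolean mask is built and then summed, since each `True` counts as 1.

```python
import numpy as np

# Made-up split fractions for illustration only.
split_frac = np.array([-0.019, 0.004, 0.052, -0.001, 0.110])

# The comparison yields a boolean mask; summing it counts the True values.
n_negative = int((split_frac < 0).sum())
print(n_negative)  # 2
```

The same pattern applies to a DataFrame column, for example `(data['split_frac'] < 0).sum()`.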

Out of nearly 40,000 participants, only 250 people negative-split their marathon.

Let’s see whether there is any correlation between this split fraction and other variables.
We’ll do this using a PairGrid, which draws plots of all these correlations.

In [None]:
#(Q) Generate a PairGrid plot with the variables : age, split_sec, final_sec, split_frac with the hue: gender as 'g'
# 'g' should be used to map the scatter plot

#Code goes here 
g.map(plt.scatter, alpha=0.8)
g.add_legend();

### Observations
It looks like the split fraction does not correlate particularly with age, but does correlate
with the final time: faster runners tend to have closer to even splits on their marathon
time.

In [None]:
"""The difference between men and women here is interesting. Let’s look at the histogram
of split fractions for these two groups"""

#Generate a KDE plot for all the men (fill=True replaces the deprecated shade=True)
sns.kdeplot(data.split_frac[data.gender=='M'], label='men', fill=True)
#(Q) Generate a KDE plot for all the women 
#Code goes here
plt.xlabel('split_frac');

The interesting thing here is that there are many more men than women who are
running close to an even split! This almost looks like some kind of bimodal distribution
among the men and women.

In [None]:
#A nice way to compare distributions is to use a violin plot
#(Q) Generate a violin plot for gender and split_frac with the palette


This is yet another way to compare the distributions between men and women.
Let’s look a little deeper, and compare these violin plots as a function of age. We’ll
start by creating a new column in the DataFrame that specifies the decade of age that each
person is in.

In [None]:
data['age_dec'] = data.age.map(lambda age: 10 * (age // 10))
data.head()
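
The floor-division trick in that lambda can be checked on a few made-up ages: `age // 10` drops the last digit, and multiplying by 10 gives the start of the decade.

```python
# Made-up ages for illustration only.
ages = [19, 20, 37, 40, 59, 86]

# Floor-divide by 10 to drop the units digit, then scale back up.
decades = [10 * (age // 10) for age in ages]
print(decades)  # [10, 20, 30, 40, 50, 80]
```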

In [None]:
men = (data.gender == 'M')
women = (data.gender == 'W')
#(Q) Generate a violin plot for age_dec and split_frac with hue: gender and the palette


Looking at this, we can see where the distributions of men and women differ: the split
distributions of men in their 20s to 50s show a pronounced over-density toward
lower splits when compared to women of the same age (or of any age, for that
matter).

### Back to the men with negative splits

Who are these runners? Does this split fraction correlate with finishing quickly? We can plot this very easily. We’ll use lmplot, which automatically fits a linear regression to the data.

In [None]:
#(Q) Generate an lmplot using seaborn for the final_sec and split_frac attributes with color as gender

Apparently the people with fast splits are the elite runners who are finishing within
~15,000 seconds, or about 4 hours. People slower than that are much less likely to
have a fast second split.
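
A plot of this kind could be sketched as follows. The data here is synthetic and the upward trend is built in purely for illustration, so it says nothing about the real runners. Using `col="gender"` (one panel per gender) is one reading of the exercise; `hue="gender"` would instead overlay both groups in different colors. The dotted line at zero marks an even split.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in data (NOT the marathon results).
rng = np.random.default_rng(1)
n = 400
final_sec = rng.uniform(8000, 25000, size=n)
# Assumed trend, for illustration only: slower finishers get larger fractions.
split_frac = 0.00001 * (final_sec - 8000) + rng.normal(0, 0.03, size=n)
demo = pd.DataFrame({
    "final_sec": final_sec,
    "split_frac": split_frac,
    "gender": rng.choice(["M", "W"], size=n),
})

# One panel per gender, each with a fitted linear regression.
g = sns.lmplot(x="final_sec", y="split_frac", col="gender", data=demo,
               markers=".", scatter_kws=dict(color="c"))

# Dotted reference line at zero: points below it are negative splits.
for ax in g.axes.flat:
    ax.axhline(0, color="k", ls=":")
```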