# Ultramarathon Study

In this study we consider how running ultramarathons before the age of 19 affects life outcomes at the age of 25, including but not limited to running status, alcohol usage, weight/height and employment.
An ultramarathon is any race longer than a marathon. Those daring enough to participate usually do not run for a time, but instead to say they completed the distance (50 miles, for example).
What is unusual about this study then? Ultramarathons are even harder on your body than a marathon. Most who participate do not run one until their late 20s or even their 30s, as you want your body to be fully developed before stressing it this much.

We start by importing our libraries and loading the ultramarathon study data.

In [1]:
import numpy as np
from bokeh.plotting import figure
from bokeh.io import show, output_notebook
from bokeh.layouts import gridplot
output_notebook()

import pandas as pd
ultramarathon = pd.read_csv('Childhood Ultramarathon Runner Study v0.92.csv')

Similarly to our linear regression lab, we will define our plot here to save time later.

In [2]:
def comparison_plot(x, Y, Yhat):
    '''Plots the residuals (predicted minus measured values) against x for analysis of regression'''
    # Local name differs from the function name so the function is not shadowed
    plot = figure(title='Difference between Measured and Predicted Values')
    plot.xaxis.axis_label='x'
    plot.yaxis.axis_label='Yhat-Y'
    plot.scatter(x=x,y=Yhat-Y)
    # Horizontal reference line at zero residual
    plot.line(x=[x.min(),x.max()],y=[0,0])
    return plot

### Quantity of Ultramarathons

We will first consider the quantity of ultramarathons run before the age of 19. We would like to see if running more ultramarathons is predictive of specific life outcomes. We will be looking at the linear regression case.

In the data there are runners who ran as many as 30 ultramarathons before the age of 19. We will begin by creating scatter plots where the x-axis is the quantity of ultramarathons. We first load the columns needed for these plots.

In [3]:
# accessing specific columns for our scatter plot
x = ultramarathon["NBR_UM_BEFORE_19"]
km_week = ultramarathon["RUN_DISTANCE_PW_LAST_12_MONTHS"]
work_hrs = ultramarathon["AVG_WORK_HOURS_PW_LAST_12_MONTHS"]
drink_day_pw = ultramarathon["DRINK_DAYS_PW_LAST_12_MONTHS"]

We now create our scatter plots to get a better image of our data.

In [4]:
mileage=figure(width=400,height=400,title='Ultramarathons Ran vs. Current Weekly Mileage')
mileage.scatter(x=x,y=km_week)
mileage.xaxis.axis_label='Ultramarathons Ran'
mileage.yaxis.axis_label='Weekly Mileage (km)'

work=figure(width=400,height=400,title='Ultramarathons Ran vs. Average Work Hours Per Week')
work.scatter(x=x,y=work_hrs)
work.xaxis.axis_label='Ultramarathons Ran'
work.yaxis.axis_label='Work Time per week (hrs)'

drinks=figure(width=400,height=400,title='Ultramarathons Ran vs. Drinking Days')
drinks.scatter(x=x,y=drink_day_pw)
drinks.xaxis.axis_label='Ultramarathons Ran'
drinks.yaxis.axis_label='Drinking Days Per Week'

show(mileage)
show(work)
show(drinks)

Least surprisingly to me, the number of ultramarathons does not appear to have an effect on working hours or drinking days. We can check how strongly these variables are related using the squared correlation coefficient, which gives us a scale from 0 to 1 for how related our data is. We define it as
$$
R^{2} = \frac{\sigma_{XY}^2}{\sigma_{X}^2\sigma_{Y}^2}
$$

We need to compute variances and covariances, which `np.corrcoef` handles for us below. Due to constraints on the questionnaire, the number of usable responses N can differ between columns, so each coefficient is computed separately. For the same reasons, we can no longer consider the working hours data.

In [5]:
km_coeff = (np.corrcoef(x, km_week)[0][1]) ** 2
drink_coeff = (np.corrcoef(x, drink_day_pw)[0][1]) ** 2

print('KM:', km_coeff)
print('Drinks:', drink_coeff)

KM: 0.20628286398406612
Drinks: 0.01210500585589698
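
As a sanity check on the formula above, the sketch below computes $R^2$ directly from the sample covariance and variances and compares it with the squared `np.corrcoef` value. The arrays `nbr_um` and `km` are made-up illustration data, not values from the study.

```python
import numpy as np

# Hypothetical illustration data (not from the study)
nbr_um = np.array([0, 1, 2, 3, 5, 8, 13, 21], dtype=float)
km = np.array([10, 25, 15, 40, 30, 55, 35, 60], dtype=float)

# 2x2 sample covariance matrix: variances on the diagonal, covariance off it
cov = np.cov(nbr_um, km)
r2_formula = cov[0, 1] ** 2 / (cov[0, 0] * cov[1, 1])

# Squared Pearson correlation, computed the same way as in the cell above
r2_corrcoef = np.corrcoef(nbr_um, km)[0, 1] ** 2

print(r2_formula, r2_corrcoef)
```

The two numbers should agree up to floating point error, since the normalization used for the variances and the covariance cancels in the ratio.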


Notice that the coefficient for weekly mileage is much higher than that of drinking days per week. Neither is a particularly strong correlation, however. The drinking result is much less surprising than that of weekly mileage: the expectation was for a higher quantity of ultramarathons to correlate strongly with a higher weekly mileage, but this was not the case. Let's consider the line of best fit for weekly mileage using the method of least squares, to see if we are missing anything in the data. We must first convert our data into appropriate arrays.

In [6]:
X = np.array(x)

In [7]:
X = X.reshape(-1,1)
ones = np.ones(X.shape)
X = np.concatenate([X, ones], axis = 1)

We check that our matrix X looks as it should (size 78, 2):

In [8]:
print(X.shape)

(78, 2)
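
The same two-column layout (predictor values, then a column of ones for the intercept) can also be built in one step. This is a minimal sketch on a made-up array `x_demo`, not the study data, and `np.column_stack` is simply an alternative to the reshape-and-concatenate approach above.

```python
import numpy as np

# Hypothetical predictor values (not from the study)
x_demo = np.array([1.0, 4.0, 9.0])

# Column of predictor values followed by a column of ones (the intercept term)
X_demo = np.column_stack([x_demo, np.ones_like(x_demo)])

print(X_demo.shape)
```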


The measurement we are considering is kilometers run per week.

In [9]:
Y = np.array(km_week)
Y = Y.reshape(-1,1)

We are looking to compute the least squares line.

In [10]:
# Normal equations: M = (X^T X)^{-1} X^T Y holds the slope and the intercept
D = np.dot(X.transpose(),X)
M = np.dot(np.linalg.inv(D),np.dot(X.transpose(), Y))
# Fitted values reuse M instead of recomputing the inverse
Yhat = np.dot(X, M)
mileage.scatter(x=x,y=Yhat[:,0],color='red')
mileage.line(x=x,y=Yhat[:,0],color='red')
show(mileage)

This is the line that minimizes the mean squared error. However, we can see that the quantity of ultramarathons is not predictive of current weekly mileage. Note that a large number of the points sit on the far left side of the plot, away from where the line of best fit would suggest. It would appear that running ultramarathons correlates with running, but that result is trivial and not one we are looking for.
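
For comparison, the normal-equation computation used above can be checked against NumPy's built-in least squares solver. The sketch below runs on made-up data (`x_demo`, `y_demo` are not from the study) and is only meant to show that both routes give the same slope and intercept.

```python
import numpy as np

# Hypothetical illustration data (not from the study)
x_demo = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_demo = np.array([1.0, 2.9, 5.1, 7.0, 9.2])

# Design matrix: predictor column, then a column of ones for the intercept
X_demo = np.column_stack([x_demo, np.ones_like(x_demo)])

# Normal equations, following the same steps as the cell above
M_normal = np.linalg.inv(X_demo.T @ X_demo) @ X_demo.T @ y_demo

# Least squares solver that avoids forming the inverse explicitly
M_lstsq, residuals, rank, _ = np.linalg.lstsq(X_demo, y_demo, rcond=None)

slope, intercept = M_lstsq
print(slope, intercept)
```

Solving the system directly tends to be the safer choice numerically when the columns of the design matrix are nearly dependent, although for a well-behaved two-column matrix like this one either approach should be fine.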

### Age of First Ultramarathon

The quantity of ultramarathons did not produce many results. Perhaps what matters is when participants ran their ultramarathons, since running one while still growing could slow development. This would be a noteworthy conclusion given the common advice to wait until you are fully developed before running an ultramarathon. We will now consider how height and weight at the age of 25 were affected by the age at which a participant ran their first ultramarathon.

In [11]:
age = ultramarathon["AGE_1ST_UM"]
height25 = ultramarathon["HEIGHT_AT_25"]
weight25 = ultramarathon["WEIGHT_AT_25"]

There were two sets of heights and weights to consider in the study: current height and weight, and height and weight at age 25. We use the age 25 values for consistency across participants.

In [12]:
height=figure(width=400,height=400,title='Age of 1st Ultramarathon and Height')
height.scatter(x=age,y=height25)
height.xaxis.axis_label='Age of 1st Ultramarathon'
height.yaxis.axis_label='Height at 25'

weight=figure(width=400,height=400,title='Age of 1st Ultramarathon and Weight')
weight.scatter(x=age,y=weight25)
weight.xaxis.axis_label='Age of 1st Ultramarathon'
weight.yaxis.axis_label='Weight at 25 (kg)'

show(height)
show(weight)

These graphs appear to be much more predictive than the earlier ones. We similarly compute the squared correlation coefficient.

In [13]:
print('Correlation')
height_coeff = np.corrcoef(age, height25)[0][1] ** 2
weight_coeff = np.corrcoef(age, weight25)[0][1] ** 2
print('Height:', height_coeff)
print('Weight:', weight_coeff)

Correlation
Height: 0.006927987546289202
Weight: 0.025083503746599928


We again find little correlation in our data, suggesting that the age at which you run your first ultramarathon is unlikely to affect height and weight in your mid-20s. We will still consider linear regression, however, to see if the method of least squares provides more insightful results. We consider the weight graph due to its higher correlation coefficient.
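
One thing worth keeping in mind here: for a straight-line fit with an intercept, the share of variance explained by the least squares line equals the squared correlation coefficient. The sketch below illustrates this on made-up numbers (`x_demo` and `y_demo` are hypothetical, not study data), using `np.polyfit` as a stand-in for the matrix computation.

```python
import numpy as np

# Hypothetical illustration data (not from the study)
x_demo = np.array([14.0, 15.0, 16.0, 17.0, 18.0, 18.0])
y_demo = np.array([62.0, 70.0, 58.0, 75.0, 66.0, 71.0])

# Straight-line least squares fit (slope first, then intercept)
slope, intercept = np.polyfit(x_demo, y_demo, deg=1)
y_hat = slope * x_demo + intercept

# Share of variance explained by the fitted line
ss_res = np.sum((y_demo - y_hat) ** 2)
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)
r2_fit = 1.0 - ss_res / ss_tot

# Squared Pearson correlation of the same data
r2_corr = np.corrcoef(x_demo, y_demo)[0, 1] ** 2

print(r2_fit, r2_corr)
```

So a low squared correlation already tells us how much of the variation a straight line can capture; the plot of the fitted line is still useful for seeing where the points fall relative to it.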

We compute the least squares line for weight using the same process as the previous example. We similarly check the shape of H.

In [14]:
H = np.array(age)
H = H.reshape(-1,1)
ones = np.ones(H.shape)
H = np.concatenate([H, ones], axis = 1)

print(H.shape)

(78, 2)


In [15]:
W = np.array(weight25)
W = W.reshape(-1,1)

# Normal equations: M = (H^T H)^{-1} H^T W holds the slope and the intercept
D = np.dot(H.transpose(),H)
M = np.dot(np.linalg.inv(D),np.dot(H.transpose(), W))
# Fitted values reuse M instead of recomputing the inverse
Yhat = np.dot(H, M)
weight.scatter(x=age,y=Yhat[:,0],color='red')
weight.line(x=age,y=Yhat[:,0],color='red')
show(weight)

Our line of best fit here does almost the opposite of what we saw previously. In our previous least squares example, much of the data was skewed toward the left side of the graph; here the data is skewed toward the right, making it difficult for the least squares method to show much.

### Injuries and Ultramarathons

Lastly, we consider injuries. Earlier on it was noted that ultramarathons are often not run this early in one's life due to the potential to harm development. There appeared to be minimal correlation between this and weight or height at the age of 25. We would now like to consider injuries, specifically the probability of injury given an ultramarathon.

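Let's consider the probability of injury given that a participant ran an ultramarathon. Using the definition of conditional probability, with $A$ the event of being injured and $B$ the event of having run an ultramarathon, we obtain the following: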
$$
P(A\vert B) = \frac{P(A\cap B)}{P(B)}
$$

Unfortunately for us, we do not have the percentage of under-19-year-olds who have run an ultramarathon. So let us instead consider how the age at a participant's first ultramarathon affects the probability of injury. We will have to group the ages into "buckets" in order to estimate a probability.

In [16]:
A = np.array(ultramarathon["AGE_1ST_UM"])
N = A.shape[0]
Injury = np.array(ultramarathon['SUFFERED_UM_INJURIES_UNDER_19'])

In [18]:
# helper to save time later: counts the entries of col that are strictly less than age
def count(age, col):
    total = 0
    for i in col:
        if i < age:
            total += 1
    return total

In [19]:
less_19 = count(19, A) / N
less_17 = count(17, A) / N
less_15 = count(15, A) / N
less_13 = count(13, A) / N

print('Less than 19:', less_19)
print('Less than 17:', less_17)
print('Less than 15:', less_15)
print('Less than 13:', less_13)

Less than 19: 1.0
Less than 17: 0.5
Less than 15: 0.16666666666666666
Less than 13: 0.07692307692307693
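
The counting loop can also be written with a boolean mask, which some readers may find easier to follow. This is a sketch on a made-up array `ages_demo` (not the study column), and `fraction_below` is a hypothetical helper name introduced only for this illustration.

```python
import numpy as np

# Hypothetical ages at first ultramarathon (not from the study)
ages_demo = np.array([12, 14, 15, 16, 16, 17, 18, 18])

def fraction_below(threshold, ages):
    """Fraction of entries strictly below `threshold`."""
    # A boolean mask averages to the share of True values
    return float(np.mean(ages < threshold))

for threshold in (19, 17, 15, 13):
    print('Less than', threshold, ':', fraction_below(threshold, ages_demo))
```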


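These are the probabilities of falling in a given age bucket. Note that the buckets are nested: everyone under 13 is also counted as under 15, and so on. Below is a function that finds the joint probability of having been hurt before the age of 19 and having run a first ultramarathon below a given age.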

In [20]:
def injuries(ages, injured, age):
    '''Joint probability that a participant started before `age` and was injured before 19'''
    total = 0
    for i in range(N):
        if ((ages[i] < age) and (injured[i] == 'Yes')):
            total += 1
    return total / N

Dividing each joint probability by the corresponding bucket probability gives the conditional probability of injury.

In [21]:
print('Given less than 19 the odds a participant was hurt:', injuries(A, Injury, 19) / less_19)
print('Given less than 17 the odds a participant was hurt:', injuries(A, Injury, 17) / less_17)
print('Given less than 15 the odds a participant was hurt:', injuries(A, Injury, 15) / less_15)
print('Given less than 13 the odds a participant was hurt:', injuries(A, Injury, 13) / less_13)

Given less than 19 the odds a participant was hurt: 0.23076923076923078
Given less than 17 the odds a participant was hurt: 0.2564102564102564
Given less than 15 the odds a participant was hurt: 0.3076923076923077
Given less than 13 the odds a participant was hurt: 0.5
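
The same conditional probability can be estimated by restricting to the bucket first and then averaging, which is equivalent to dividing the joint probability by the bucket probability. The sketch below uses made-up arrays (`ages_demo`, `injured_demo`) and a hypothetical helper `p_injured_given_below`; none of these come from the study.

```python
import numpy as np

# Hypothetical paired data (not from the study)
ages_demo = np.array([12, 12, 14, 16, 16, 17, 18, 18])
injured_demo = np.array(['Yes', 'No', 'Yes', 'No', 'No', 'Yes', 'No', 'No'])

def p_injured_given_below(threshold, ages, injured):
    """Estimate P(injured | age < threshold) from paired arrays."""
    below = ages < threshold
    if not below.any():
        return float('nan')  # empty bucket: the conditional probability is undefined
    # Averaging inside the bucket equals P(A and B) / P(B)
    return float(np.mean(injured[below] == 'Yes'))

for threshold in (19, 17, 13):
    print('Given less than', threshold, ':',
          p_injured_given_below(threshold, ages_demo, injured_demo))
```

Guarding against an empty bucket avoids a division by zero when no participant falls below the chosen age.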


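Notice that a participant's odds of having been injured increase as the age of their first ultramarathon decreases. This supports the idea that running an ultramarathon too early could lead to pain down the road, although the under-13 bucket contains only 6 of the 78 participants, so that estimate is rough.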

### Conclusions

We were not able to deduce much about how running ultramarathons before the age of 19 relates to many life outcomes. The participant pool was relatively small, which is expected as ultramarathon running is still a niche sport to this day. Where we did succeed was in seeing how the body handles running ultramarathons at an early age: there is a connection between injuries and the age of a first ultramarathon, for example.

One other factor to consider is the rest of the questionnaire from which this data comes. In addition to the quantitative data that we tested, there was qualitative data, which was more difficult to extract. Other data in the questionnaire with potential includes time-related factors. For example, we have height and weight data for participants at their current age (which varies).

#### Sources

https://www.kaggle.com/datasets/aiaiaidavid/childhood-ultramarathon-runner-study-jan-2020?select=Childhood+Ultramarathon+Runner+Study_January+18++2020_16.50.csv