# Analysis of 2020 US Ultra-Marathon Performance Data
## Exploring 50km and 50mi Race Trends

## Introduction
This project analyzes ultra-marathon running performance data from 2020 to uncover insights about athlete performance across different demographics and conditions. Ultra-marathons (races longer than traditional 26.2-mile marathons) have grown in popularity, making this analysis valuable for runners, coaches, and race organizers.

## Background
The dataset covers ultra-marathon races worldwide over two centuries. This analysis focuses on:
- 50km and 50mi races in the United States during 2020
- Individual athlete performance metrics
- Demographic information including age and gender

Ultra-running presents unique challenges compared to standard marathons, with factors like:
- Greater physical demands
- More varied terrain
- Longer duration (often requiring night running)
- Different nutritional requirements

Understanding performance patterns can help athletes optimize training and race selection.

## Tools I Used
For this analysis, I leveraged the following Python ecosystem tools:

**Core Libraries:**
- `Pandas` - For data manipulation and analysis
- `NumPy` - For numerical operations
- `SciPy` - For statistical analysis

**Visualization:**
- `Seaborn` - For creating informative statistical visualizations
- `Matplotlib` (implicitly through Seaborn) - For plot customization

**Data Cleaning:**
- Pandas string operations - For text processing
- Babel - For number formatting (though not heavily utilized in this analysis)

**Workflow:**
- Jupyter Notebook - For interactive analysis and documentation

## Project Overview
This analysis examines ultra-marathon performance data from 2020, focusing on 50km and 50mi races in the United States. The goal is to uncover trends related to:
- Gender performance differences
- Age group performance
- Seasonal variations in race performance

In [None]:
# Import required libraries (one import per line)
import babel as bl
import scipy as sp
import numpy as np
import pandas as pd
import seaborn as sns
from scipy import stats
from babel import numbers

## Data Loading and Initial Exploration

In [None]:
# Load the dataset
df = pd.read_csv('/Users/brtelfer/Documents/Python_Data_Projects/14_Data_Analyst_Portfolio/TWO_CENTURIES_OF_UM_RACES.csv')
df.head(5)
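
Beyond `head`, a few quick checks help size up a table before cleaning. The sketch below runs them on a tiny hand-made frame so it can be executed without the CSV. The rows are invented for illustration; only the column names are taken from the dataset.

```python
import pandas as pd

# Tiny stand-in for the real CSV; the values are made up for illustration.
sample = pd.DataFrame({
    'Year of event': [2020, 2020, 2019],
    'Event distance/length': ['50km', '50mi', '100km'],
    'Athlete year of birth': [1985.0, None, 1990.0],
})

n_rows, n_cols = sample.shape                  # table dimensions
missing = sample.isna().sum()                  # missing values per column
distance_counts = sample['Event distance/length'].value_counts()

print(n_rows, n_cols)
print(missing)
print(distance_counts)
```

The same three calls can be pointed at `df` to see how many rows survive each later filter and which columns carry missing values.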

## Data Cleaning and Preparation

In [None]:
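# Filter for 2020 US races of 50km or 50mi; copy so later assignments
# modify an independent frame rather than a view of df
df_f = df[(df['Year of event'] == 2020) &
          (df['Event name'].str.contains('USA', regex=False)) &
          (df['Event distance/length'].isin(['50km', '50mi']))].copy()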

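# Clean event names: remove the trailing "(USA)" country tag.
# (rstrip('(USA)') strips a set of characters, not the suffix.)
df_f['Event name'] = df_f['Event name'].str.replace(r'\s*\(USA\)$', '', regex=True)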

# Calculate athlete age
df_f['athlete_age'] = 2020 - df_f['Athlete year of birth']

# Clean performance data: drop the trailing "h" unit and any space before it
df_f['Athlete performance'] = df_f['Athlete performance'].str.replace(r'\s*h$', '', regex=True)

# Drop unnecessary columns
df_f = df_f.drop(['Athlete club','Athlete country', 'Athlete year of birth', 'Athlete age category'], axis=1)

# Handle missing values
df_f = df_f.dropna()

# Check for duplicates
print(f"Duplicate rows: {df_f.duplicated().sum()}")

# Reset index (reset_index returns a new frame, so assign the result)
df_f = df_f.reset_index(drop=True)

# Fix data types
df_f['athlete_age'] = df_f['athlete_age'].astype(int)
df_f['Athlete average speed'] = df_f['Athlete average speed'].astype(float)

# Rename columns
df_f = df_f.rename(columns={
    'Year of event':'Year_Of_Event',
    'Event dates':'Event_Dates',
    'Event name':'Event_Name',
    'Event distance/length':'Event_Distance/Length',
    'Event number of finishers':'Event_Number_Of_Finishers',
    'Athlete performance':'Athlete_Performance',
    'Athlete gender':'Athlete_Gender',
    'Athlete average speed':'Athlete_Average_Speed',
    'Athlete ID':'Athlete_ID',
    'athlete_age':'Athlete_Age'
})

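# Reorder columns by name (this also drops Year_Of_Event and Athlete_Performance)
df_f = df_f[['Event_Dates', 'Event_Name', 'Event_Distance/Length', 'Event_Number_Of_Finishers',
             'Athlete_ID', 'Athlete_Gender', 'Athlete_Age', 'Athlete_Average_Speed']]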
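
Two of the text-cleaning steps are easy to get subtly wrong, so here is a minimal sketch of each on invented rows. It assumes event names end in a `(USA)` tag and that finish times look like `H:MM:SS h`; both formats are assumptions about the CSV, and the event names below are hypothetical.

```python
import pandas as pd

# Hypothetical rows; real event names and times come from the CSV.
raw = pd.DataFrame({
    'Event name': ['Sample Trail Run (USA)', 'Coastal Classic USA (USA)'],
    'Athlete performance': ['4:51:39 h', '10:02:05 h'],
})

# rstrip treats its argument as a set of characters, not a suffix.
# Here it stops at the space before the tag and leaves it behind.
naive = raw['Event name'].map(lambda x: x.rstrip('(USA)'))

# Anchored regex: remove only a literal "(USA)" at the end, plus the space before it.
clean = raw['Event name'].str.replace(r'\s*\(USA\)$', '', regex=True)

# Drop the " h" unit and parse the rest as a duration (assumes "H:MM:SS h").
finish_time = pd.to_timedelta(raw['Athlete performance'].str.replace(' h', '', regex=False))
finish_hours = finish_time.dt.total_seconds() / 3600

print(naive.tolist())
print(clean.tolist())
print(finish_hours.round(2).tolist())
```

Parsing the finish time into a duration keeps it usable for arithmetic, which a plain string column is not.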

## Exploratory Data Analysis

### Distribution of Race Distances

In [None]:
sns.histplot(df_f['Event_Distance/Length'])

### Gender Distribution by Race Distance

In [None]:
sns.histplot(df_f, x='Event_Distance/Length', hue='Athlete_Gender')

### Speed Distribution for 50mi Races

In [None]:
sns.displot(df_f[df_f['Event_Distance/Length'] == '50mi']['Athlete_Average_Speed'])

### Gender Performance Comparison

In [None]:
sns.violinplot(data=df_f,
               x='Event_Distance/Length', 
               y='Athlete_Average_Speed', 
               hue='Athlete_Gender', 
               split=True, 
               inner='quart')

### Age vs. Speed Relationship

In [None]:
sns.lmplot(data=df_f,
           x='Athlete_Age', 
           y='Athlete_Average_Speed', 
           hue='Athlete_Gender')
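
The regression line in the plot can be backed with numbers. A minimal sketch using `scipy.stats.linregress` follows; the data is synthetic (generated with a known slope of -0.05 speed units per year of age), so the output only shows what the call returns, not a result from the race data.

```python
import numpy as np
from scipy import stats

# Synthetic data for illustration only: speed falls by 0.05 units per year of age.
rng = np.random.default_rng(0)
age = rng.integers(20, 70, size=200)
speed = 10.0 - 0.05 * age + rng.normal(0, 0.5, size=200)

# Ordinary least-squares fit of speed on age.
fit = stats.linregress(age, speed)
print(f"slope={fit.slope:.3f} per year, r={fit.rvalue:.2f}, p={fit.pvalue:.3g}")
```

On the real frame, the same call could be run per gender on `Athlete_Age` and `Athlete_Average_Speed` to compare slopes.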

## Key Findings

### Gender Performance Differences

In [None]:
# Mean speed by race distance and gender (50km vs 50mi)
df_f.groupby(['Event_Distance/Length', 'Athlete_Gender'])['Athlete_Average_Speed'].mean()
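
The grouped means can be turned into an explicit male-minus-female gap per distance, and SciPy (listed in the tools but otherwise unused here) can test whether a gap is larger than chance. The sketch below uses made-up speeds and assumes the gender column is coded `M` and `F`.

```python
import pandas as pd
from scipy import stats

# Made-up speeds for illustration; column names follow the renamed frame.
toy = pd.DataFrame({
    'Event_Distance/Length': ['50km'] * 6 + ['50mi'] * 6,
    'Athlete_Gender': ['M', 'M', 'M', 'F', 'F', 'F'] * 2,  # assumed coding
    'Athlete_Average_Speed': [8.0, 7.5, 7.0, 7.0, 6.5, 6.0,
                              7.0, 6.5, 6.0, 6.5, 6.0, 5.5],
})

# One row per distance, one column per gender, then the difference.
means = (toy.groupby(['Event_Distance/Length', 'Athlete_Gender'])['Athlete_Average_Speed']
            .mean()
            .unstack('Athlete_Gender'))
means['gap_M_minus_F'] = means['M'] - means['F']

# Welch's t-test (no equal-variance assumption) within the 50mi subset.
mi = toy[toy['Event_Distance/Length'] == '50mi']
result = stats.ttest_ind(
    mi.loc[mi['Athlete_Gender'] == 'M', 'Athlete_Average_Speed'],
    mi.loc[mi['Athlete_Gender'] == 'F', 'Athlete_Average_Speed'],
    equal_var=False,
)
print(means)
print(result.pvalue)
```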

### Top Performing Ages (50mi races, minimum 20 results per age)

In [None]:
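# Mean speed by age for 50mi races, keeping ages with at least 20 results.
# The chain is wrapped in parentheses so it can span several lines.
(df_f[df_f['Event_Distance/Length'] == '50mi']
    .groupby('Athlete_Age')['Athlete_Average_Speed']
    .agg(['mean', 'count'])
    .sort_values('mean', ascending=False)
    .query('count > 19'))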
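
Single-year ages give small, noisy groups. One alternative is to pool ages into bands with `pd.cut` before averaging. This is a sketch on invented numbers; the 10-year band width and the minimum count of 2 are arbitrary choices for the toy data, not values from the analysis.

```python
import pandas as pd

# Invented ages and speeds; only the column names mirror the analysis.
toy = pd.DataFrame({
    'Athlete_Age': [24, 28, 33, 37, 38, 41, 44, 52, 58, 63],
    'Athlete_Average_Speed': [7.1, 7.4, 7.6, 7.8, 7.7, 7.5, 7.3, 6.8, 6.4, 6.0],
})

# Left-closed 10-year bands: [20, 30), [30, 40), ..., [60, 70)
bins = list(range(20, 80, 10))
toy['Age_Band'] = pd.cut(toy['Athlete_Age'], bins=bins, right=False)

by_band = (toy.groupby('Age_Band', observed=True)['Athlete_Average_Speed']
              .agg(['mean', 'count']))
# Keep only bands with enough results (hypothetical threshold for the toy data).
by_band = by_band[by_band['count'] >= 2]
print(by_band)
```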

### Seasonal Performance Variations

In [None]:
# Add season information
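# Dates are assumed to be day.month.year; taking the second-to-last token
# also handles multi-day ranges, where index 1 may not be the month
df_f['Event_Months'] = df_f['Event_Dates'].str.split('.').str.get(-2).astype(int)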
df_f['Race_Season'] = df_f['Event_Months'].apply(lambda x: 
    'Winter' if x > 11 else 
    'Fall' if x > 8 else 
    'Summer' if x > 5 else 
    'Spring' if x > 2 else 'Winter')

# Overall seasonal performance
df_f.groupby('Race_Season')['Athlete_Average_Speed'].agg(['mean', 'count']).sort_values('mean', ascending=False)

# 50mi only seasonal performance (filter on df_f and its renamed column, not the original df)
df_f[df_f['Event_Distance/Length'] == '50mi'].groupby('Race_Season')['Athlete_Average_Speed'].agg(['mean', 'count']).sort_values('mean', ascending=False)
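
The month-to-season step can also be written as a lookup table, which makes the mapping easy to audit. The sketch below assumes dates are stored as `day.month.year` strings and that multi-day events appear as ranges such as `05.-06.09.2020`. Both formats are assumptions about the CSV, and the sample dates are hypothetical.

```python
import pandas as pd

# Hypothetical date strings in day.month.year form, including multi-day events.
dates = pd.Series(['11.01.2020', '05.-06.09.2020', '30.06.-01.07.2020', '12.12.2020'])

# The month is the second-to-last dot-separated token,
# which also holds for multi-day ranges (it picks the end month).
months = dates.str.split('.').str.get(-2).astype(int)

# Meteorological seasons, Northern Hemisphere: Dec-Feb is winter, and so on.
SEASON_BY_MONTH = {12: 'Winter', 1: 'Winter', 2: 'Winter',
                   3: 'Spring', 4: 'Spring', 5: 'Spring',
                   6: 'Summer', 7: 'Summer', 8: 'Summer',
                   9: 'Fall', 10: 'Fall', 11: 'Fall'}
seasons = months.map(SEASON_BY_MONTH)
print(seasons.tolist())
```

The dictionary produces the same month groupings as the chained conditional above.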

## Conclusions
1. **Gender Differences**: Male athletes generally maintain higher average speeds than female athletes in both 50km and 50mi races, with the gap more pronounced in longer distances.
2. **Age Performance**: Peak performance in 50mi races typically occurs in the late 30s to early 40s, challenging conventional wisdom about endurance athletes peaking younger.
3. **Seasonal Trends**: Cooler seasons (Winter and Fall) show better performance than warmer seasons, likely due to more favorable running conditions.
4. **Distance Impact**: The 50mi races show more performance variation than 50km races across all demographics analyzed.
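
The variation claim in point 4 can be checked with a spread measure per distance. A minimal sketch follows, using the coefficient of variation (standard deviation divided by mean) on invented speeds. The choice of this measure is an assumption, not necessarily the one behind the conclusion.

```python
import pandas as pd

# Invented speeds for illustration; column names follow the renamed frame.
toy = pd.DataFrame({
    'Event_Distance/Length': ['50km'] * 4 + ['50mi'] * 4,
    'Athlete_Average_Speed': [7.0, 7.5, 8.0, 8.5, 4.5, 6.0, 7.5, 9.0],
})

spread = toy.groupby('Event_Distance/Length')['Athlete_Average_Speed'].agg(['mean', 'std'])
# Coefficient of variation: spread relative to the typical speed at that distance.
spread['cv'] = spread['std'] / spread['mean']
print(spread)
```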

**Practical Applications**:
- Athletes might consider focusing on cooler-season races for potential performance benefits
- Coaches should recognize that ultra-running peak performance may come later than in other endurance sports
- Race organizers could use these insights when planning event dates and marketing to different demographics