# USA Swimming Data Analysis

Kendrick McDonald

At both the Men's and Women's NCAA Division 1 Swimming Championships this March, as well as conference championships earlier in the season, it seemed like NCAA records were falling at an unprecedented rate. Some of the top performers achieved times I never thought I would see in my lifetime. Perhaps even more interesting, it seemed that the top end of the NCAA as a whole was improving at a faster rate than ever before. Kyle Sockwell, among others, has a name for this phenomenon: Swimflation.

I wondered what this phenomenon really means and how we might measure it.

Swimming is an impressively quantifiable sport. 

On the individual level, there are:
- Finish times
- Event records
- Splits
- Stroke rates
- Underwaters
- Breathing patterns
- Reaction times

On the team level:
- Team points
- Team records
- Championship titles
- Championship-qualifying swimmers

There are many ways to study how swimming performance has changed over time. I decided to focus on the NCAA Division 1 level because short course yards is the fastest format in the sport and because I'm a former NCAA D1 swimmer myself. USA Swimming has a robust database of *individual* NCAA performances that can be downloaded in CSV format, which makes the data easy to collect and analyze. *(Note: The USA Swimming forms did not include relay data.)*

I decided not to limit my analysis to performances from the championships. Instead, I included the top individual performances from any point in the season, in particular to account for swimmers who may have peaked at conference championships with extraordinary performances. I also collected data showing the progression of NCAA records over time. It's unclear whether the USA Swimming database is as complete for records as it is for individual performances, since the downloadable format lists fewer record-breaking performances for each event.

Using top performance and record progression data, I can try to study Swimflation in several different ways:
- How have the top performances changed over time?
- How has the range of top performances (i.e. the difference between the 1st and 16th) changed over time?
- What do the distributions of top performances during different time periods look like?
- TK
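The second question above can be made concrete with a small sketch. The column names ('season', 'time_seconds') and the numbers are placeholders I made up for illustration, not the real USA Swimming fields. To keep it short, the toy data uses the gap between the 1st and 3rd fastest times rather than the 1st and 16th.

```python
import pandas as pd

# Toy data: made-up times in seconds, not real results.
top_times = pd.DataFrame({
    "season": [2022, 2022, 2022, 2023, 2023, 2023],
    "time_seconds": [41.0, 41.5, 42.0, 40.5, 40.75, 41.0],
})

DEPTH = 3  # would be 16 for the real top-16 comparison


def spread(times: pd.Series, depth: int = DEPTH) -> float:
    """Gap between the fastest time and the depth-th fastest time."""
    fastest = times.nsmallest(depth)  # sorted fastest-first
    return fastest.iloc[-1] - fastest.iloc[0]


# One spread value per season
spread_by_season = top_times.groupby("season")["time_seconds"].apply(spread)
print(spread_by_season)
```

A shrinking spread from one season to the next would suggest the field is getting deeper, not just faster at the very top.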

This notebook shows how I use usasw_scrape_data.py to collect data about top NCAA Division 1 swimming performances from the USA Swimming website, and then usasw_clean_data.py to clean the data and prepare it for analysis.

In [1]:
# Import swimming scripts

from usasw_clean_data import clean_ncaa_record_data
from usasw_clean_data import calculate_record_stats
from usasw_scrape_data import get_NCAA_results, fill_out_form

# Import other libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
records = 'data/NCAA_Records.csv'
records = clean_ncaa_record_data(records)
records = calculate_record_stats(records)

In the process of cleaning NCAA record data, I've removed records broken by the same swimmer at the same meet in the same year. It could be the case that a dominant swimmer breaks a record during prelims and again in finals, and while that's an impressive feat, I think it's more important to consider the final record set during a particular meet.
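Here is a minimal sketch of that rule, assuming hypothetical column names ('name', 'event', 'meet', 'season', 'time_seconds'). The real logic lives in clean_ncaa_record_data and may differ. Because a later record at the same meet is by definition faster, keeping the fastest row per swimmer, event, meet, and season keeps the final record.

```python
import pandas as pd

# Made-up rows: swimmer A lowers the record in prelims and again in finals.
raw = pd.DataFrame({
    "name": ["A", "A", "B"],
    "event": ["100 FR", "100 FR", "100 FR"],
    "meet": ["NCAA", "NCAA", "Conference"],
    "season": [2023, 2023, 2022],
    "time_seconds": [40.5, 40.25, 40.75],
})

# Sort fastest-first, then keep the first (fastest) row
# for each swimmer/event/meet/season combination.
final_records = (
    raw.sort_values("time_seconds")
       .drop_duplicates(subset=["name", "event", "meet", "season"], keep="first")
       .sort_values(["season", "time_seconds"])
       .reset_index(drop=True)
)
print(final_records)
```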

Here's a sample of the data I collected:

In [None]:
records.sample(frac=0.05, random_state=23)

In order to do some analysis, I need to remove the earliest record performances from the data, since they won't have information in the 'record_broken_by' or 'record_improvement' data fields.

In [None]:
def only_new_records(df):
    return df[df['record_broken_by'] != 'No Earlier Record']

new_records = only_new_records(records)

# Create new dataframes for all male and female records
male_records = records[records['gender'] == 'M']
female_records = records[records['gender'] == 'F']

# Create new dataframes for all new male and female records
male_new_records = only_new_records(male_records)
female_new_records = only_new_records(female_records)

I can create a dataframe that only includes the number of records broken in each season and use that to make a bar chart.

In [None]:
records_by_season = records.groupby('season').count().reset_index()
records_by_season = records_by_season[['season', 'name']].rename(columns={'name': 'count'}).sort_values('count', ascending=False)

bins = np.arange(2000, 2024)

# Plot a bar chart of the number of records per season
plt.bar(records_by_season['season'],
        records_by_season['count'],
        edgecolor='black',
        linewidth=1.2)
plt.xticks(bins, rotation=45, ha='center')
plt.xlabel("Year")
plt.ylabel("Number of Records")
plt.title("Number of Records Set Per Season")
plt.show()


Interestingly, the number of records broken in 2023 was only the fourth most in a season, according to the data USA Swimming has. One important note is that this doesn't account for relay records, all of which were broken in 2023.
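A ranking like this can be checked directly rather than read off the chart. The sketch below uses made-up counts (not the real data) in the same 'season'/'count' layout as records_by_season and ranks seasons so that tied seasons share the best rank.

```python
import pandas as pd

# Made-up counts for illustration only.
counts = pd.DataFrame({
    "season": [2019, 2020, 2021, 2022, 2023],
    "count": [9, 4, 7, 8, 6],
})

# Rank 1 = most records; method="min" gives tied seasons the same (best) rank.
counts["rank"] = counts["count"].rank(method="min", ascending=False).astype(int)

# Look up the rank of a single season
rank_2023 = counts.loc[counts["season"] == 2023, "rank"].item()
print(counts.sort_values("rank"))
print(f"2023 rank: {rank_2023}")
```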

Let's look at the breakdown for men and women:

In [None]:
male_records_by_season = male_records.groupby('season').count().reset_index()
male_records_by_season = male_records_by_season[['season', 'name']].rename(
    columns={'name': 'count'}).sort_values('count', ascending=False)

female_records_by_season = female_records.groupby(
    'season').count().reset_index()
female_records_by_season = female_records_by_season[['season', 'name']].rename(
    columns={'name': 'count'}).sort_values('count', ascending=False)

bins = np.arange(2005, 2024)

# Create subplots for male and female record counts
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 6))

# Plot bar charts of the number of records per season
ax1.bar(male_records_by_season['season'],
        male_records_by_season['count'],
        edgecolor='black',
        linewidth=1.2)
ax1.set_xticks(bins)
ax1.set_xticklabels(bins, rotation=75, ha='center')
ax1.set_title('Male Records')
ax1.set_xlabel("Year")
ax1.set_ylabel("Number of Records")

ax2.bar(female_records_by_season['season'],
        female_records_by_season['count'],
        edgecolor='black',
        linewidth=1.2)
ax2.set_xticks(bins)
ax2.set_xticklabels(bins, rotation=75, ha='center')
ax2.set_title('Female Records')
ax2.set_xlabel("Year")
ax2.set_ylabel("Number of Records")

plt.suptitle("Number of Records Set Per Season")
plt.show()


We can restrict this further by only considering truly "unique" records, i.e. each swimmer can set only one record per event per season. This means that if a swimmer breaks a record at conference championships and then again at the NCAA championships, only one of the two is counted.
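Note that drop_duplicates keeps the first matching row by default, so which swim survives depends on row order. The count is the same either way, but if the surviving row should be the later swim, one option is to sort by date first and pass keep="last". This is a sketch on made-up rows; the 'date' and 'time_seconds' columns are assumptions of mine, while 'athlete_id', 'event_id', and 'season' mirror the columns used in the next cell.

```python
import pandas as pd

# Made-up rows: athlete 1 breaks the same record at conference, then at NCAAs.
toy = pd.DataFrame({
    "athlete_id": [1, 1, 2],
    "event_id": [10, 10, 10],
    "season": [2023, 2023, 2022],
    "date": pd.to_datetime(["2023-02-20", "2023-03-25", "2022-03-26"]),
    "time_seconds": [40.5, 40.25, 40.75],
})

# Sort chronologically, then keep the last row per athlete/event/season.
latest_only = (
    toy.sort_values("date")
       .drop_duplicates(subset=["athlete_id", "event_id", "season"], keep="last")
)

# Number of unique records per season
unique_per_season = latest_only.groupby("season").size()
print(latest_only)
print(unique_per_season)
```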

In [None]:
# Create a dataframe where each athlete_id and event_id can only appear once per season
unique_records = records.drop_duplicates(
    subset=['athlete_id', 'event_id', 'season'])
unique_records_by_season = unique_records.groupby(
    'season').count().reset_index()
unique_records_by_season = unique_records_by_season[[
    'season', 'athlete_id']].rename(columns={'athlete_id': 'unique_records'})

male_unique_records = male_records.drop_duplicates(
    subset=['athlete_id', 'event_id', 'season'])
male_unique_records_by_season = male_unique_records.groupby(
    'season').count().reset_index()
male_unique_records_by_season = male_unique_records_by_season[[
    'season', 'athlete_id']].rename(columns={'athlete_id': 'unique_records'})

female_unique_records = female_records.drop_duplicates(
    subset=['athlete_id', 'event_id', 'season'])
female_unique_records_by_season = female_unique_records.groupby(
    'season').count().reset_index()
female_unique_records_by_season = female_unique_records_by_season[[
    'season', 'athlete_id']].rename(columns={'athlete_id': 'unique_records'})

bins = np.arange(2005, 2024)

# Create subplots for male and female record counts
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 6))

# Plot bar charts of the number of unique records per season
ax1.bar(male_unique_records_by_season['season'],
        male_unique_records_by_season['unique_records'],
        edgecolor='black',
        linewidth=1.2)
ax1.set_xticks(bins)
ax1.set_xticklabels(bins, rotation=75, ha='center')
ax1.set_title("Male Records")
ax1.set_xlabel("Year")
ax1.set_ylabel("Number of Records")

ax2.bar(female_unique_records_by_season['season'],
        female_unique_records_by_season['unique_records'],
        edgecolor='black',
        linewidth=1.2)
ax2.set_xticks(bins)
ax2.set_xticklabels(bins, rotation=75, ha='center')
ax2.set_title("Female Records")
ax2.set_xlabel("Year")
ax2.set_ylabel("Number of Records")

plt.suptitle("Number of Unique Records Set Per Season")
plt.show()


The next question we should consider is whether record improvement has been increasing over time. I can use the 'record_improvement' data field to answer this question.

I see two ways of considering this question:
- How has the average record improvement changed over time?
- In a given season, how much faster did an event get overall?
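Both views reduce to a groupby. The sketch below uses made-up rows and assumes 'record_improvement' is a numeric column; the 'event' column name is also an assumption of mine.

```python
import pandas as pd

# Made-up improvements, for illustration only.
toy_new_records = pd.DataFrame({
    "season": [2022, 2022, 2023, 2023, 2023],
    "event": ["50 FR", "100 FR", "50 FR", "50 FR", "100 FR"],
    "record_improvement": [0.5, 0.25, 0.25, 0.5, 1.0],
})

# View 1: average improvement per record, by season.
mean_by_season = toy_new_records.groupby("season")["record_improvement"].mean()

# View 2: total improvement per event within each season.
total_by_event = (
    toy_new_records.groupby(["season", "event"])["record_improvement"]
    .sum()
    .reset_index()
)
print(mean_by_season)
print(total_by_event)
```

If the improvement column turns out to be a percentage rather than a time difference, summing it within a season is only an approximation of the overall drop.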

In [None]:
# Sum record_broken_by per season and distance.
# Use new_records so the 'No Earlier Record' placeholder rows are excluded.
records_broken_by = new_records[['season', 'distance', 'record_broken_by']]
records_broken_by = records_broken_by.groupby(['season', 'distance'])['record_broken_by'].sum().reset_index()
