# USA Swimming Data Analysis

Kendrick McDonald

At both the Men's and Women's NCAA Division 1 Swimming Championships this March, as well as conference championships earlier in the season, it seemed like NCAA records were falling at an unprecedented rate. Some of the top performers achieved times I never thought I would see in my lifetime. Perhaps even more interesting, it seemed that the top end of the NCAA as a whole was improving at a faster rate than ever before. Kyle Sockwell, among others, has a name for this phenomenon: Swimflation.

I wondered what this phenomenon really means and how we might measure it.

Swimming is an impressively quantifiable sport. 

On the individual level, there are:
- Finish times
- Event records
- Splits
- Stroke rates
- Underwaters
- Breathing patterns
- Reaction times

On the team level:
- Team points
- Team records
- Championship titles
- Championship-qualifying swimmers

There are many ways to study how swimming performance has changed over time. I decided to focus on the NCAA Division 1 level because short course yards is the fastest format in the sport and because I'm a former NCAA D1 swimmer myself. USA Swimming has a robust database of NCAA performances that can be downloaded in CSV format, which makes the data easy to collect and analyze.

I decided not to limit my analysis to performances from the championships. Instead, I included the top performances from any point in the season, in particular to account for swimmers who peaked at conference championships with extraordinary performances. I also collected data showing the progression of NCAA records over time. It's unclear whether the USA Swimming database is as complete for records as it is for individual performances, since the format available for download contains fewer record-breaking performances for each event.

Using top performance and record progression data, I can try to study Swimflation in several different ways:
- How have the top performances changed over time?
- How has the range of top performances (i.e. the difference between the 1st and 16th) changed over time?
- What do the distributions of top performances during different time periods look like?
- TK
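
As a rough illustration of the second question, the gap between the fastest and the 16th-fastest time can be computed per event and season with a grouped aggregation. This is a minimal sketch, not the analysis used later in this notebook: the column names `event`, `season`, and `time_s` (time in seconds) and the function `top_n_spread` are assumptions made for the example, and the toy data is invented.

```python
import pandas as pd


def top_n_spread(df, n=16):
    """Gap between the fastest and the n-th fastest time per event and season.

    Assumes hypothetical columns: event, season, time_s (seconds).
    Groups with fewer than n swims are dropped.
    """
    keys = ["event", "season"]
    # Keep the n fastest swims in each event/season group
    top = df.sort_values("time_s").groupby(keys).head(n)
    out = top.groupby(keys)["time_s"].agg(fastest="min", nth="max", swims="count")
    out = out[out["swims"] == n].drop(columns="swims")
    out["spread"] = out["nth"] - out["fastest"]
    return out.reset_index()


# Invented toy data, with n=3 standing in for the top 16
toy = pd.DataFrame({
    "event": ["100 Free"] * 8,
    "season": [2023] * 4 + [2024] * 4,
    "time_s": [41.0, 41.5, 42.0, 43.0, 40.0, 40.2, 40.9, 44.0],
})
spread = top_n_spread(toy, n=3)
print(spread)
```

A shrinking `spread` alongside a falling `fastest` would suggest the whole top end is getting quicker, not just the winner.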

This notebook shows how I use usasw_scrape_data.py to collect data about top NCAA Division 1 swimming performances from the USA Swimming website, and then usasw_clean_data.py to clean the data and prepare it for analysis.

In [35]:
# Import swimming scripts
from usasw_clean_data import calculate_record_stats, clean_ncaa_record_data
from usasw_scrape_data import fill_out_form, get_NCAA_results

# Import other libraries
import numpy as np
import pandas as pd

In [36]:
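# Path to the NCAA record progression CSV
records_path = 'data/NCAA_Records.csv'

# Clean the raw records, then calculate record statistics
records = clean_ncaa_record_data(records_path)
records = calculate_record_stats(records)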

In the process of cleaning NCAA record data, I've removed the earlier records whenever the same swimmer broke a record more than once at the same meet in the same year. A dominant swimmer may break a record during prelims and again in finals, and while that's an impressive feat, I think it's more important to consider the final record set during a particular meet.
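
The deduplication described above could look something like the sketch below. It is not the code in `clean_ncaa_record_data`, which isn't shown here. The column names (`event`, `swimmer`, `meet`, `year`, `time_s`) and the function `keep_final_record_per_meet` are hypothetical, and the sketch assumes that the final record a swimmer sets at a meet is also their fastest swim there, since each new record must beat the previous one.

```python
import pandas as pd


def keep_final_record_per_meet(df):
    """Keep one record per swimmer, event, meet, and year.

    Assumes hypothetical columns: event, swimmer, meet, year, time_s.
    Assumption: the last record set at a meet is the fastest one, so
    the row with the minimum time in each group is kept.
    """
    keys = ["event", "swimmer", "meet", "year"]
    fastest_idx = df.groupby(keys)["time_s"].idxmin()
    kept = df.loc[fastest_idx]
    return kept.sort_values(["event", "year", "time_s"]).reset_index(drop=True)


# Invented toy data: Swimmer A breaks the record in prelims and again in finals
toy_records = pd.DataFrame({
    "event": ["100 Free", "100 Free", "100 Free"],
    "swimmer": ["Swimmer A", "Swimmer A", "Swimmer B"],
    "meet": ["NCAA Championships", "NCAA Championships", "Conference Championships"],
    "year": [2024, 2024, 2023],
    "time_s": [40.5, 40.2, 40.8],
})
deduped = keep_final_record_per_meet(toy_records)
print(deduped)
```

If the raw data includes a prelims/finals flag or a timestamp, selecting on that would be more direct than relying on the fastest-time assumption.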

Here's a sample of the data I collected:

In [37]:
records.sample(frac=0.05, random_state=23)

AttributeError: 'NoneType' object has no attribute 'sample'
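
The `AttributeError` above means `records` was `None` when `.sample` was called. One plausible cause, which is an assumption since the cleaning functions aren't shown, is that one of them edits its input in place and returns nothing. A small guard makes that kind of failure show up at the step that caused it. The helper `require_frame` and the stand-in function `drop_missing_in_place` are hypothetical names for illustration only.

```python
import pandas as pd


def require_frame(value, step_name):
    """Return value unchanged if it is a DataFrame, otherwise raise a clear error."""
    if not isinstance(value, pd.DataFrame):
        raise TypeError(
            f"{step_name} returned {type(value).__name__}, expected a DataFrame"
        )
    return value


def drop_missing_in_place(df):
    """Hypothetical cleaning step that edits df in place and returns None."""
    df.dropna(inplace=True)


demo = pd.DataFrame({"time_s": [40.2, None, 41.0]})
result = drop_missing_in_place(demo)  # result is None; demo itself was modified

try:
    require_frame(result, "drop_missing_in_place")
except TypeError as err:
    message = str(err)
    print(message)
```

Wrapping each pipeline step this way, for example `records = require_frame(clean_ncaa_record_data(records_path), "clean_ncaa_record_data")`, would point to the function that needs a `return` statement.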

In [12]:
def only_new_records(df):
    """Drop rows marked 'No Earlier Record' in the record_broken_by column."""
    return df[df['record_broken_by'] != 'No Earlier Record']