# CrossFit Athletes Performance Analysis & Prediction

## Objective
This project aims to analyze and model the performance of CrossFit athletes using data collected from competitions and training profiles worldwide.  
The dataset includes **demographics** (age, gender, height, weight), **performance metrics** (deadlift, snatch, Fran time, pullups, etc.), and **training/lifestyle habits** (nutrition, schedule, experience).

The main goals are:
1. **Exploratory Analysis**  
   - Understand the characteristics of CrossFit athletes (age, gender, body size, training background).  
   - Identify trends and patterns in performance across different demographics and experience levels.  

2. **Predictive Modeling**  
   - Build regression models to predict key performance outcomes (e.g., deadlift max, Fran time) from athlete demographics and training habits.  
   - Evaluate and compare models (baseline, linear, regularized, tree-based).  
   - Determine the most important features driving performance.  

3. **Practical Insights**  
   - Highlight which factors are most strongly associated with higher performance.  
   - Provide evidence-based insights that athletes and coaches can use to optimize training approaches.  

---


# Data Inspection

In [6]:
# Importing pandas library for loading the dataset
import pandas as pd

# Loading the dataset
df = pd.read_csv('../data/athletes.csv')
df.head()                

Unnamed: 0,athlete_id,name,region,team,affiliate,gender,age,height,weight,fran,...,snatch,deadlift,backsq,pullups,eat,train,background,experience,schedule,howlong
0,2554.0,Pj Ablang,South West,Double Edge,Double Edge CrossFit,Male,24.0,70.0,166.0,,...,,400.0,305.0,,,I workout mostly at a CrossFit Affiliate|I hav...,I played youth or high school level sports|I r...,I began CrossFit with a coach (e.g. at an affi...,I do multiple workouts in a day 2x a week|,4+ years|
1,3517.0,Derek Abdella,,,,Male,42.0,70.0,190.0,,...,,,,,,I have a coach who determines my programming|I...,I played youth or high school level sports|,I began CrossFit with a coach (e.g. at an affi...,I do multiple workouts in a day 2x a week|,4+ years|
2,4691.0,,,,,,,,,,...,,,,,,,,,,
3,5164.0,Abo Brandon,Southern California,LAX CrossFit,LAX CrossFit,Male,40.0,67.0,,211.0,...,200.0,375.0,325.0,25.0,I eat 1-3 full cheat meals per week|,I workout mostly at a CrossFit Affiliate|I hav...,I played youth or high school level sports|,I began CrossFit by trying it alone (without a...,I usually only do 1 workout a day|,4+ years|
4,5286.0,Bryce Abbey,,,,Male,32.0,65.0,149.0,206.0,...,150.0,,325.0,50.0,I eat quality foods but don't measure the amount|,I workout mostly at a CrossFit Affiliate|I inc...,I played college sports|,I began CrossFit by trying it alone (without a...,I usually only do 1 workout a day|I strictly s...,1-2 years|


## CrossFit Dataset Glossary

This dataset combines **demographics**, **strength lifts**, and **benchmark workouts (WODs)**.  
Here’s what each performance-related column means:

### Strength Lifts
- **snatch** → Olympic lift where the barbell moves from ground to overhead in one motion. (Score: max weight lifted, lbs)  
- **candj** → Clean & Jerk, Olympic lift (ground → shoulders → overhead). (Score: max weight lifted, lbs)  
- **deadlift** → Classic powerlift, barbell lifted from ground to hips. (Score: max weight lifted, lbs)  
- **backsq** → Back Squat, barbell squat with weight on the back. (Score: max weight lifted, lbs)  
- **pullups** → Max strict pull-ups performed without assistance. (Score: count of reps)

### Benchmark WODs (Workouts of the Day)
- **Fran (fran)** → 21-15-9 reps of thrusters (squat + press) and pull-ups. (Score: completion time, lower is better)  
- **Helen (helen)** → 3 rounds of 400m run, 21 kettlebell swings, 12 pull-ups. (Score: completion time, lower is better)  
- **Grace (grace)** → 30 clean & jerks (135 lb / 95 lb). (Score: completion time, lower is better)  
- **Filthy50 (filthy50)** → 50 reps each of 10 movements. (Score: completion time, lower is better)  
- **Fight Gone Bad (fgonebad)** → 3 rounds, 5 stations (wall balls, sumo deadlift high pull, box jump, push press, rowing). (Score: total reps + calories, higher is better)  
- **Run400 (run400)** → 400m run time. (Score: completion time, lower is better)  
- **Run5k (run5k)** → 5k run time. (Score: completion time, lower is better)  

### Demographics & Training Background
- **athlete_id, name** → Identifiers.  
- **region, team, affiliate** → Competition info / gym affiliation.  
- **gender, age, height, weight** → Demographics and body size.  
- **eat, train, background, experience, schedule, howlong** → Lifestyle, training style, athletic background, years of experience.
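
The habit columns appear in the `df.head()` preview as pipe-delimited strings, so a single cell can hold several answers. One way to make them usable as model features is to expand each column into 0/1 indicator columns. The sketch below is a minimal illustration under that reading of the data; the helper name `multiselect_to_indicators` is hypothetical, and the last toy value (two answers combined) is made up to show the multi-answer case.

```python
import pandas as pd

def multiselect_to_indicators(series, sep="|"):
    """Hypothetical helper: one 0/1 column per distinct answer in a delimited column."""
    # Turn each cell into a set of non-empty answers; missing cells become empty sets
    answers = series.fillna("").map(
        lambda cell: {part.strip() for part in cell.split(sep) if part.strip()}
    )
    # Collect every distinct answer seen anywhere in the column
    options = sorted(set().union(*answers))
    # One indicator column per answer: 1 if the athlete selected it, else 0
    return pd.DataFrame(
        {opt: answers.map(lambda chosen, opt=opt: int(opt in chosen)) for opt in options},
        index=series.index,
    )

# Toy values modeled on the `background` column in the preview
background = pd.Series([
    "I played youth or high school level sports|",
    "I played college sports|",
    None,
    # Made-up combination to illustrate a cell with two answers
    "I played youth or high school level sports|I played college sports|",
])
indicators = multiselect_to_indicators(background)
print(indicators)
```

Rows with a missing answer end up as all zeros, which keeps them distinguishable from rows where a specific option was selected only if a separate "answered" flag is kept; whether that matters depends on the later modeling step.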

---

### Summary
- **Lifts** = strength/power capacity.  
- **WODs & runs** = conditioning/endurance performance.  
- **Habits/demographics** = potential predictors of performance.


In [3]:
# Shape of the dataset, rows & columns

print("Dataset shape (rows, columns):", df.shape)

Dataset shape (rows, columns): (423006, 27)


In [4]:
# Column names

print("\nColumn names:")
print(df.columns.tolist())


Column names:
['athlete_id', 'name', 'region', 'team', 'affiliate', 'gender', 'age', 'height', 'weight', 'fran', 'helen', 'grace', 'filthy50', 'fgonebad', 'run400', 'run5k', 'candj', 'snatch', 'deadlift', 'backsq', 'pullups', 'eat', 'train', 'background', 'experience', 'schedule', 'howlong']


In [5]:
# Dataset information

print("\nDataFrame info:")
print(df.info())


DataFrame info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 423006 entries, 0 to 423005
Data columns (total 27 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   athlete_id  423003 non-null  float64
 1   name        331110 non-null  object 
 2   region      251262 non-null  object 
 3   team        155160 non-null  object 
 4   affiliate   241916 non-null  object 
 5   gender      331110 non-null  object 
 6   age         331110 non-null  float64
 7   height      159869 non-null  float64
 8   weight      229890 non-null  float64
 9   fran        55426 non-null   float64
 10  helen       30279 non-null   float64
 11  grace       40745 non-null   float64
 12  filthy50    19359 non-null   float64
 13  fgonebad    29738 non-null   float64
 14  run400      22246 non-null   float64
 15  run5k       36097 non-null   float64
 16  candj       104435 non-null  float64
 17  snatch      97280 non-null   float64
 18  deadlift    115323 non-null

## Data Inspection Results

- **Shape:** The dataset has **423,006 rows** (athletes) and **27 columns** (attributes).  
- **Column types:**  
  - 16 numeric columns (e.g., `age`, `height`, `weight`, `deadlift`, `snatch`, `backsq`, workout times).  
  - 11 text/categorical columns (e.g., `name`, `region`, `gender`, `eat`, `train`, `background`).  

- **Missing values:**  
  - Demographics (share of rows missing):  
    - `age` missing in ~22% of rows.  
    - `height` missing in ~62% of rows.  
    - `weight` missing in ~46% of rows.  
    - `gender` missing in ~22% of rows.  
  - Performance (non-null entries and share of rows filled):  
    - `deadlift`: 115k entries, ~27% filled.  
    - `backsq`: 110k entries, ~26% filled.  
    - `candj`: 104k entries, ~25% filled.  
    - `snatch`: 97k entries, ~23% filled.  
    - `fran`: 55k entries, ~13% filled.  
    - Other workouts (`helen`, `grace`, etc.) are each filled in <10% of rows.  
  - Training habits (e.g., `eat`, `train`, `schedule`): ~22–25% filled.  

- **Findings:**  
  - Not all athletes recorded full information.  
  - Strength metrics (`deadlift`, `backsq`, `snatch`) have the most usable data.  
  - Time-based WODs have much smaller sample sizes.  
  - For modeling, **deadlift/back squat** are the best first target candidates.
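
The percentages above can be derived directly from the frame instead of being read off `df.info()`. The following is a small sketch of that calculation; `coverage_table` is a hypothetical helper name, and the toy frame reuses the first five `age`, `deadlift`, and `fran` values from the `df.head()` preview rather than the full dataset.

```python
import pandas as pd

def coverage_table(frame):
    """Hypothetical helper: non-null count and percentage filled per column."""
    filled = frame.notna()
    table = pd.DataFrame({
        "non_null": filled.sum(),                     # number of rows with a value
        "pct_filled": (filled.mean() * 100).round(1)  # share of rows with a value
    })
    # Best-covered columns first, so target candidates are easy to spot
    return table.sort_values("pct_filled", ascending=False)

# First five rows of three columns, as shown in the preview
toy = pd.DataFrame({
    "age": [24.0, 42.0, None, 40.0, 32.0],
    "deadlift": [400.0, None, None, 375.0, None],
    "fran": [None, None, None, 211.0, 206.0],
})
coverage = coverage_table(toy)
print(coverage)
```

On the real data the same call would be `coverage_table(df)`, assuming `df` is the frame loaded earlier.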


# Data Consistency Check

In [10]:
# Check how many values are filled vs. missing in each key column.
# Also collect the minimum, maximum, and a few sample values to confirm whether the entries are realistic.
key_cols = ['age', 'height', 'weight', 
            'deadlift', 'backsq', 'snatch', 'candj', 'pullups', 
            'fran', 'helen', 'grace', 'filthy50', 'run400', 'run5k']

# empty list to store results
summary = []

# Loop through each column and collect stats
for col in key_cols:
    if col in df.columns:
        summary.append({
            "column": col,
            "non_missing": df[col].notna().sum(),
            "missing": df[col].isna().sum(),
            "min": df[col].min(),
            "max": df[col].max(),
            "sample_values": df[col].dropna().unique()[:5]
        })

# Turning the results into a DataFrame
range_check = pd.DataFrame(summary)

# Displaying the table 
range_check


Unnamed: 0,column,non_missing,missing,min,max,sample_values
0,age,331110,91896,13.0,125.0,"[24.0, 42.0, 40.0, 32.0, 37.0]"
1,height,159869,263137,0.0,8388607.0,"[70.0, 67.0, 65.0, 73.0, 72.0]"
2,weight,229890,193116,1.0,20175.0,"[166.0, 190.0, 149.0, 230.0, 175.0]"
3,deadlift,115323,307683,-500.0,8388607.0,"[400.0, 375.0, 435.0, 0.0, 365.0]"
4,backsq,110517,312489,-7.0,8388607.0,"[305.0, 325.0, 414.0, 0.0, 365.0]"
5,snatch,97280,325726,0.0,8388607.0,"[200.0, 150.0, 0.0, 185.0, 225.0]"
6,candj,104435,318571,-45.0,8388607.0,"[220.0, 245.0, 205.0, 265.0, 0.0]"
7,pullups,50608,372398,-6.0,2147484000.0,"[25.0, 50.0, 0.0, 81.0, 55.0]"
8,fran,55426,367580,1.0,8388607.0,"[211.0, 206.0, 205.0, 119.0, 304.0]"
9,helen,30279,392727,1.0,8388607.0,"[645.0, 465.0, 614.0, 417.0, 485.0]"
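
The minimum and maximum in this table are each set by a single entry, so one corrupted value is enough to make a column look broken. Percentiles give a complementary view of where the bulk of the values sit. The sketch below illustrates the idea on a toy column; the first five values are the `deadlift` samples from the table, the two extremes are the reported minimum and maximum, and the remaining values are made up.

```python
import pandas as pd

# Toy column: sample deadlifts, the reported min/max, and three made-up plausible values
deadlift = pd.Series(
    [400.0, 375.0, 435.0, 0.0, 365.0, -500.0, 8388607.0, 315.0, 225.0, 405.0]
)

# min/max are driven entirely by the two corrupted entries
print("min:", deadlift.min(), "max:", deadlift.max())

# Percentiles are far less sensitive to a handful of bad entries
quantiles = deadlift.quantile([0.25, 0.50, 0.75])
print(quantiles)
```

Applied to the real frame, `df[key_cols].quantile([0.01, 0.25, 0.50, 0.75, 0.99])` would return the same percentiles for every key column at once.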


## Consistency Check Results

- **Demographics**
  - `age`: Values range 13–125. The upper end (125) is unrealistic → needs capping (likely 14–80).
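  - `height`: Contains zeros and extremely large values (up to 8,388,607, which equals 2^23 − 1 and so looks like a placeholder or overflow value rather than a measurement). Normal human range is ~55–83 inches → cleaning required.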
  - `weight`: Goes from 1 to 20,175 lbs, which is impossible. Typical range is ~90–400 lbs.

- **Strength Lifts**
  - `deadlift`, `backsq`, `snatch`, `candj`: Contain impossible values: negatives (in `deadlift`, `backsq`, and `candj`; e.g., a `deadlift` minimum of -500), zeros, and maxima of 8,388,607.
    - Plausible ranges: 
      - Deadlift: 100–1,200 lbs 
      - Back Squat: 100–1,000 lbs 
      - Snatch: 75–400 lbs 
      - Clean & Jerk: 100–500 lbs
  - `pullups`: Negative values and a maximum in the billions. Realistic range: 0–100.

- **WOD Times**
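  - `fran`, `helen`, `grace`, `filthy50`, `run400`, `run5k`: Contain corrupted values (up to ~8.4M). Sample values such as 211 for `fran` and 645 for `helen` suggest that times are stored in seconds.
  - Typical ranges:  
    - Fran: 2–15 min (120–900 sec)  
    - Helen: 7–20 min (420–1,200 sec)  
    - Grace: 1–10 min (60–600 sec)  
    - Filthy50: 15–45 min (900–2,700 sec)  
    - Run400: 50–120 sec  
    - Run5k: 12–40 min (720–2,400 sec)  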

### Findings
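The dataset contains **unrealistic values** (negatives, zeros, and huge numbers) in every column shown in the range check. The min/max check proves that such values exist but not how many there are, so the next step should also count them.  
Before modeling, we'll define **cleaning rules** that restrict values to humanly plausible ranges and handle bad entries appropriately.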
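
As a starting point for those rules, the sketch below replaces out-of-range values with NaN and reports how many entries were affected per column. It is one possible approach rather than the notebook's final method: it masks instead of clipping, because clipping a value like 8,388,607 to 1,200 lbs would turn a corrupted entry into an apparent elite lift. The helper name `mask_implausible` is hypothetical, the bounds are the ranges listed above, and WOD times are converted to seconds on the assumption (suggested by sample values such as `fran` = 211) that times are stored in seconds.

```python
import numpy as np
import pandas as pd

# Bounds copied from the ranges listed above; WOD times converted from minutes
# to seconds (assumption: the time columns are stored in seconds)
PLAUSIBLE_RANGES = {
    "age": (14, 80),
    "height": (55, 83),
    "weight": (90, 400),
    "deadlift": (100, 1200),
    "backsq": (100, 1000),
    "snatch": (75, 400),
    "candj": (100, 500),
    "pullups": (0, 100),
    "fran": (120, 900),
    "helen": (420, 1200),
    "grace": (60, 600),
    "filthy50": (900, 2700),
    "run400": (50, 120),
    "run5k": (720, 2400),
}

def mask_implausible(frame, ranges=PLAUSIBLE_RANGES):
    """Hypothetical helper: set out-of-range values to NaN and count them per column."""
    cleaned = frame.copy()  # leave the original frame untouched
    masked = {}
    for col, (low, high) in ranges.items():
        if col not in cleaned.columns:
            continue
        # between() is inclusive; already-missing values are not counted as bad
        bad = cleaned[col].notna() & ~cleaned[col].between(low, high)
        masked[col] = int(bad.sum())
        cleaned.loc[bad, col] = np.nan
    return cleaned, pd.Series(masked, name="masked")

# Toy frame using extremes reported in the range check
toy = pd.DataFrame({
    "age": [24.0, 42.0, 125.0, 13.0],
    "deadlift": [400.0, -500.0, 8388607.0, 0.0],
})
cleaned, masked = mask_implausible(toy)
print(masked)
print(cleaned)
```

The bounds are deliberately simple and will also drop some legitimate results, for example the `fran` time of 119 seconds in the sample values, so they are worth revisiting once the counts are known.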
