# CSC 2621 Final Project: Running (Away)
### Members: Alex Ewart, Mikhail Filippov, Benjamin Liebl

In this final project, we perform statistical analyses and fit several predictive models on a [Running](https://www.kaggle.com/datasets/mexwell/long-distance-running-dataset?resource=download&select=run_ww_2019_w.csv) dataset.

Our research questions include:
1. Is there a significant difference in the distance athletes ran in 2019 vs 2020 due to the COVID pandemic?
2. Can we accurately predict an athlete's weekly running distance based on previous weeks?

Our hypotheses are:
1. During the year 2020, athletes ran **less** distance overall than the same athletes in the year 2019.
2. Using RandomForestRegressor, LSTM, and XGBoost models, we can predict week X's running distance from the previous X-1 weeks of 2019 with a lower RMSE than a baseline that predicts each athlete's week X as the mean of that athlete's previous X-1 weeks.
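
The mean baseline in the second hypothesis can be illustrated with a minimal sketch. The function name `mean_baseline_rmse` and the toy numbers are assumptions made for illustration, not part of the project code; the sketch only shows how an RMSE for "predict the mean of the previous weeks" could be computed.

```python
import numpy as np

def mean_baseline_rmse(history, target):
    """RMSE of predicting each athlete's target week as the mean of their history.

    history: array of shape (n_athletes, n_previous_weeks)
    target:  array of shape (n_athletes,)
    """
    history = np.asarray(history, dtype=float)
    target = np.asarray(target, dtype=float)
    predictions = history.mean(axis=1)  # one prediction per athlete
    return float(np.sqrt(np.mean((target - predictions) ** 2)))

# Toy data (made up): 3 athletes, 4 previous weeks each, distances in km.
history = np.array([
    [10.0, 12.0, 11.0, 11.0],
    [40.0, 42.0, 38.0, 40.0],
    [0.0, 0.0, 5.0, 3.0],
])
target = np.array([12.0, 37.0, 2.0])

baseline_rmse = mean_baseline_rmse(history, target)
print(f"baseline RMSE: {baseline_rmse:.3f} km")
```

Any model's RMSE on the same athletes and the same target week can then be compared against this number.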


### About the [dataset](https://www.kaggle.com/datasets/mexwell/long-distance-running-dataset?resource=download&select=run_ww_2019_w.csv)
The "Long-Distance Running Dataset" was obtained from Kaggle and contains the running statistics of 36,412 athletes from around the world. The data was collected from a large social network for athletes. The features present in the dataset include:
1. datetime: date of the running activity;
2. athlete: a computer-generated ID for the athlete (integer);
3. distance: distance of running (floating-point number, in kilometers);
4. duration: duration of running (floating-point number, in minutes);
5. gender: gender (string 'M' or 'F');
6. age_group: age interval (one of the strings '18 - 34', '35 - 54', or '55 +');
7. country: country of origin of the athlete (string);
8. major: marathon(s) and year(s) the athlete ran (comma-separated list of strings).

This dataset is conveniently provided at several sampling frequencies: daily, weekly, monthly, and quarterly. We use the weekly data, which keeps a reasonably fine granularity while keeping the dataset relatively small.

Our target variable is the running distance in week X, and the features are the preceding X-1 weeks together with gender, age_group, country, and major.

Note that for most of the modeling we use the **2019 dataset**, so as not to include abnormalities that could be present in 2020 due to COVID.

In [2]:
# imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm
import dill
import os

from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_squared_error

In [None]:
# import saved data to avoid rerunning
if os.path.exists('final_project.db'):
    dill.load_session('final_project.db')
if os.path.exists('random_forest.db'):
    dill.load_session('random_forest.db')

### Data Preprocessing

In [7]:
# load data
df_2019 = pd.read_csv('../run_ww_2019_w.csv')
df_2020 = pd.read_csv('../run_ww_2020_w.csv')

# convert 2019 objects to correct types
df_2019['datetime'] = pd.to_datetime(df_2019['datetime'], format='%Y-%m-%d')
df_2019['gender'] = df_2019['gender'].astype('category')
df_2019['age_group'] = df_2019['age_group'].astype('category')
df_2019['country'] = df_2019['country'].astype('category')
df_2019['major'] = df_2019['major'].astype('category')
df_2019.drop(columns=['Unnamed: 0'], inplace=True)

# convert 2020 objects to correct types
df_2020['datetime'] = pd.to_datetime(df_2020['datetime'], format='%Y-%m-%d')
df_2020['gender'] = df_2020['gender'].astype('category')
df_2020['age_group'] = df_2020['age_group'].astype('category')
df_2020['country'] = df_2020['country'].astype('category')
df_2020['major'] = df_2020['major'].astype('category')
df_2020.drop(columns=['Unnamed: 0'], inplace=True)

display(df_2019.head())
display(df_2020.head())
display(df_2019.info())
display(df_2020.info())

Unnamed: 0,datetime,athlete,distance,duration,gender,age_group,country,major
0,2019-01-01,0,0.0,0.0,F,18 - 34,United States,CHICAGO 2019
1,2019-01-01,1,5.27,30.2,M,35 - 54,Germany,BERLIN 2016
2,2019-01-01,2,9.3,98.0,M,35 - 54,United Kingdom,"LONDON 2018,LONDON 2019"
3,2019-01-01,3,103.13,453.4,M,18 - 34,United Kingdom,LONDON 2017
4,2019-01-01,4,34.67,185.65,M,35 - 54,United States,BOSTON 2017


Unnamed: 0,datetime,athlete,distance,duration,gender,age_group,country,major
0,2020-01-01,0,0.0,0.0,F,18 - 34,United States,CHICAGO 2019
1,2020-01-01,1,70.33,394.2,M,35 - 54,Germany,BERLIN 2016
2,2020-01-01,2,14.65,79.066667,M,35 - 54,United Kingdom,"LONDON 2018,LONDON 2019"
3,2020-01-01,3,41.41,195.666667,M,18 - 34,United Kingdom,LONDON 2017
4,2020-01-01,4,41.34,209.1,M,35 - 54,United States,BOSTON 2017


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1893424 entries, 0 to 1893423
Data columns (total 8 columns):
 #   Column     Dtype         
---  ------     -----         
 0   datetime   datetime64[ns]
 1   athlete    int64         
 2   distance   float64       
 3   duration   float64       
 4   gender     category      
 5   age_group  category      
 6   country    category      
 7   major      category      
dtypes: category(4), datetime64[ns](1), float64(2), int64(1)
memory usage: 68.7 MB


None

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1893424 entries, 0 to 1893423
Data columns (total 8 columns):
 #   Column     Dtype         
---  ------     -----         
 0   datetime   datetime64[ns]
 1   athlete    int64         
 2   distance   float64       
 3   duration   float64       
 4   gender     category      
 5   age_group  category      
 6   country    category      
 7   major      category      
dtypes: category(4), datetime64[ns](1), float64(2), int64(1)
memory usage: 68.7 MB


None
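
Because both files cover the same athlete IDs, the 2019 vs 2020 comparison behind the first hypothesis can be set up as a paired comparison of per-athlete yearly totals. The sketch below is one possible approach under that assumption, not necessarily the test used later in the project. The helper `paired_t_statistic` and the toy frames are hypothetical.

```python
import numpy as np
import pandas as pd

def paired_t_statistic(totals_a, totals_b):
    """t statistic for the mean of the paired differences (a - b)."""
    diff = np.asarray(totals_a, dtype=float) - np.asarray(totals_b, dtype=float)
    n = diff.size
    return float(diff.mean() / (diff.std(ddof=1) / np.sqrt(n)))

# Toy frames (made up) with the same column names as the weekly files.
toy_2019 = pd.DataFrame({
    'athlete': [0, 0, 1, 1, 2, 2],
    'distance': [10.0, 12.0, 30.0, 28.0, 5.0, 7.0],
})
toy_2020 = pd.DataFrame({
    'athlete': [0, 0, 1, 1, 2, 2],
    'distance': [8.0, 9.0, 25.0, 27.0, 6.0, 3.0],
})

# Yearly total per athlete, aligned on athlete ID so each pair is one person.
totals = pd.DataFrame({
    '2019': toy_2019.groupby('athlete')['distance'].sum(),
    '2020': toy_2020.groupby('athlete')['distance'].sum(),
}).dropna()

t_stat = paired_t_statistic(totals['2019'], totals['2020'])
print(totals)
print(f"paired t statistic (2019 - 2020): {t_stat:.3f}")
```

A positive statistic means the athletes ran more in 2019 than in 2020 on average. Turning it into a p-value requires a t distribution with n - 1 degrees of freedom.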

In [8]:
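# add marathon features: flag whether a week falls within one month of one of the athlete's major marathons
# NOTE: a single fixed month-day per event is applied to every year, so the race dates are approximate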
marathon_map = {
    'CHICAGO': '10-12',
    'BERLIN': '09-21',
    'LONDON': '04-27',
    'BOSTON': '04-21',
    'NEW YORK': '11-02'
}

df_expanded = df_2019.copy()
df_expanded['major_split'] = df_expanded['major'].str.split(',')
df_expanded = df_expanded.explode('major_split')

df_expanded[['event', 'year']] = df_expanded['major_split'].str.extract(r'(\D+)\s+(\d{4})')
df_expanded['event'] = df_expanded['event'].str.strip()
df_expanded['year'] = df_expanded['year'].astype(int)
df_expanded['major_date'] = pd.to_datetime(
    df_expanded['year'].astype(str) + '-' + df_expanded['event'].map(marathon_map),
    errors='coerce'
)
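
To make the split, explode, and extract steps concrete, here is a small sketch on made-up rows that mimic the format of the `major` column. The toy frame and the two-entry `marathon_map` are illustrative assumptions. The regular expression is the one used in the cell above.

```python
import pandas as pd

# Toy rows (made up) mimicking the 'major' column format.
toy = pd.DataFrame({
    'athlete': [1, 2],
    'major': ['BERLIN 2016', 'LONDON 2018,LONDON 2019'],
})

# One row per marathon entry; the original row index is repeated.
toy['major_split'] = toy['major'].str.split(',')
toy = toy.explode('major_split')

# Non-digit run = event name, four digits = year.
parts = toy['major_split'].str.extract(r'(\D+)\s+(\d{4})')
toy['event'] = parts[0].str.strip()
toy['year'] = parts[1].astype(int)

# Build an approximate race date from the year and a fixed month-day.
marathon_map = {'BERLIN': '09-21', 'LONDON': '04-27'}
toy['major_date'] = pd.to_datetime(
    toy['year'].astype(str) + '-' + toy['event'].map(marathon_map),
    errors='coerce',
)
print(toy[['athlete', 'event', 'year', 'major_date']])
```

The athlete with two London entries ends up with two rows that share index `1`. That shared index is what later allows the rows to be grouped back together.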

In [9]:
one_month = pd.Timedelta(days=30)

# Check conditions
df_expanded['within-month-before'] = (
    (df_expanded['datetime'] > df_expanded['major_date'] - one_month) &
    (df_expanded['datetime'] <= df_expanded['major_date'])
)

df_expanded['within-month-after'] = (
    (df_expanded['datetime'] > df_expanded['major_date']) &
    (df_expanded['datetime'] <= df_expanded['major_date'] + one_month)
)

# Group back to original rows and aggregate using any()
df_result = df_expanded.groupby(df_expanded.index)[['within-month-before', 'within-month-after']].any()
df_result

Unnamed: 0,within-month-before,within-month-after
0,False,False
1,False,False
2,False,False
3,False,False
4,False,False
...,...,...
1893419,False,False
1893420,False,False
1893421,False,False
1893422,False,False


In [10]:
df_2019 = df_2019.join(df_result)
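
The window checks and the `any()` aggregation can be traced on a tiny example. The toy frame below is made up and stands in for the exploded frame: index `0` appears twice because that row listed two marathons.

```python
import pandas as pd

one_month = pd.Timedelta(days=30)

# Toy exploded frame (made up): index 0 listed two marathons, index 1 listed one.
toy = pd.DataFrame(
    {
        'datetime': pd.to_datetime(['2019-04-09', '2019-04-09', '2019-05-07']),
        'major_date': pd.to_datetime(['2018-04-27', '2019-04-27', '2019-04-27']),
    },
    index=[0, 0, 1],
)

toy['within-month-before'] = (
    (toy['datetime'] > toy['major_date'] - one_month)
    & (toy['datetime'] <= toy['major_date'])
)
toy['within-month-after'] = (
    (toy['datetime'] > toy['major_date'])
    & (toy['datetime'] <= toy['major_date'] + one_month)
)

# any() collapses the duplicated index back to one row per original row.
flags = toy.groupby(toy.index)[['within-month-before', 'within-month-after']].any()
print(flags)
```

Only the 2019 race date can fall within 30 days of a 2019 week, so marathon entries from other years never set a flag.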

In [11]:
# pivot the datetime data to columns in order to effectively use it within the models as features
df_2019_new = df_2019.pivot_table(
    index='athlete',
    columns='datetime',
    values=['distance', 'duration', 'within-month-before', 'within-month-after'],
    aggfunc='sum',
    fill_value=0
)
df_2019_new.columns = [
    f'{val}_week_{date.isocalendar()[1]}' for val, date in df_2019_new.columns
]


df_2019_new = df_2019_new.reset_index()
# attach per-athlete attributes by athlete ID rather than by row position
athlete_info = df_2019.drop_duplicates('athlete').set_index('athlete')
for col in ['age_group', 'country', 'gender', 'major']:
    df_2019_new[col] = df_2019_new['athlete'].map(athlete_info[col])
age_map = {}
# convert each age group to its numeric midpoint; '55 +' has no upper bound, so 75 is assumed
for age_group in df_2019_new['age_group'].unique():
    ages_split = age_group.split()
    mean_age = 0
    if ages_split[1] == '-':
        mean_age = (int(ages_split[0]) + int(ages_split[2])) / 2
    else:
        mean_age = (55 + 75) / 2
    age_map[age_group] = mean_age
df_2019_new['age_group'] = pd.Series(df_2019_new['age_group'].map(age_map), dtype=float)
df_2019_new = pd.get_dummies(df_2019_new, columns=['country'])
df_2019_new

Unnamed: 0,athlete,distance_week_1,distance_week_2,distance_week_3,distance_week_4,distance_week_5,distance_week_6,distance_week_7,distance_week_8,distance_week_9,...,country_Uganda,country_Ukraine,country_United Arab Emirates,country_United Kingdom,country_United States,country_Uruguay,country_Uzbekistan,country_Venezuela,country_Vietnam,country_Zimbabwe
0,0,0.00,0.000,0.00,0.000,0.000,0.00,0.000,0.00,0.000,...,False,False,False,False,True,False,False,False,False,False
1,1,5.27,59.860,55.99,58.500,58.180,51.59,63.710,62.04,52.480,...,False,False,False,False,False,False,False,False,False,False
2,2,9.30,30.820,10.01,54.340,37.099,58.28,61.690,61.16,71.319,...,False,False,False,True,False,False,False,False,False,False
3,3,103.13,93.100,87.40,97.840,54.870,9.76,87.260,4.88,41.060,...,False,False,False,True,False,False,False,False,False,False
4,4,34.67,0.000,30.51,38.680,0.000,38.30,0.000,8.66,10.160,...,False,False,False,False,True,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
36407,37594,168.05,113.140,163.52,161.509,163.320,123.18,66.189,88.89,149.859,...,False,False,False,True,False,False,False,False,False,False
36408,37595,79.81,114.879,113.51,91.680,128.270,136.32,121.530,127.39,134.540,...,False,False,False,False,True,False,False,False,False,False
36409,37596,118.89,111.070,117.22,136.400,134.308,136.25,118.340,90.93,92.400,...,False,False,False,False,True,False,False,False,False,False
36410,37597,28.67,54.410,49.88,41.220,48.930,50.09,75.060,23.43,72.260,...,False,False,False,False,True,False,False,False,False,False
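
The long-to-wide reshaping and the column naming can be checked on a toy frame. The data below is made up; the sketch only uses `distance`, but the same pattern applies to the other value columns.

```python
import pandas as pd

# Toy long-format data (made up): 2 athletes, 2 weekly rows each.
toy = pd.DataFrame({
    'athlete': [0, 1, 0, 1],
    'datetime': pd.to_datetime(['2019-01-01', '2019-01-01', '2019-01-08', '2019-01-08']),
    'distance': [5.0, 20.0, 7.5, 0.0],
})

wide = toy.pivot_table(
    index='athlete',
    columns='datetime',
    values=['distance'],
    aggfunc='sum',
    fill_value=0,
)

# Columns are (value name, timestamp) pairs; flatten them using the ISO week.
wide.columns = [f'{val}_week_{date.isocalendar()[1]}' for val, date in wide.columns]
wide = wide.reset_index()
print(wide)
```

ISO week numbers follow ISO 8601, so dates at the very start or end of a calendar year can belong to a week of the adjacent ISO year. Two dates that map to the same week number would produce duplicate column names, so this is worth checking before reusing the pattern on another year.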


In [13]:
def get_features_and_target(week_x, df=df_2019_new, x=3, use_season=True):
    """
    Select the feature matrix and target variable for a given target week.

    Parameters:
    week_x (int): The week number to use as the target variable
    df (DataFrame): The pivoted DataFrame with one row per athlete
    x (int): The number of weeks back to consider for features
    use_season (bool): Whether to add the gender and age_group columns
        to the week-based features

    Returns:
    (X, y): feature DataFrame and target Series (distance in week_x)
    """
    target = f'distance_week_{week_x}'
    
    # Select time-based features
    features = [
        col for col in df.columns
        if 'week' in col
        # and 'within' not in col
        # and 'duration' not in col
        and (week_x - int(col.split('_')[2])) <= x
        and (week_x - int(col.split('_')[2])) > 0
    ]
    if use_season:
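        # Week-based features plus the athlete attributes
        X = df[features + ['gender', 'age_group']].copy()

        # Encode gender as a boolean (True for 'M')
        X['gender'] = X['gender'].eq('M')

        # Expand the boolean gender column into indicator columns;
        # age_group is already numeric, so it is left as-is
        X = pd.get_dummies(X, columns=['gender'], drop_first=False)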
    else:
        X = df[features]

    y = df[target]
    return X, y
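
The column filter inside `get_features_and_target` keeps every week-based column whose week number lies in the `x` weeks before the target week. The standalone sketch below reproduces that rule on a made-up list of column names. The helper `select_window_columns` is a hypothetical name used only for illustration.

```python
def select_window_columns(columns, week_x, x):
    """Return week columns whose week number is in [week_x - x, week_x - 1]."""
    selected = []
    for col in columns:
        if 'week' not in col:
            continue  # skip non-weekly columns such as 'athlete' or 'gender'
        week = int(col.split('_')[2])
        if 0 < week_x - week <= x:
            selected.append(col)
    return selected

# Made-up column names following the '<value>_week_<n>' pattern.
columns = [
    'athlete',
    'distance_week_1', 'distance_week_2', 'distance_week_3', 'distance_week_4',
    'duration_week_2', 'duration_week_3',
    'within-month-before_week_3',
    'gender',
]

window = select_window_columns(columns, week_x=4, x=2)
print(window)
```

With `week_x=4` and `x=2`, only columns for weeks 2 and 3 are kept. The target week itself and anything older than two weeks are excluded, so the target never leaks into the features.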

In [14]:
# get training and test data
def get_train_test_data(week_x, df=df_2019_new, x=10, use_season=True):
    """
    Build the features and target for a target week and split them
    into training and test sets (70% / 30%, split by athlete).

    Parameters:
    week_x (int): The week number to use as the target variable
    df (DataFrame): The pivoted DataFrame with one row per athlete
    x (int): The number of weeks back to consider for features
    use_season (bool): Whether to add the gender and age_group columns
        to the week-based features

    Returns:
    X_train, X_test, y_train, y_test
    """
    X, y = get_features_and_target(week_x, df, x, use_season)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    return X_train, X_test, y_train, y_test
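
As a rough illustration of how such a split could feed an RMSE comparison, the sketch below fits a `RandomForestRegressor` on synthetic data and scores it next to the mean-of-previous-weeks baseline. The synthetic data, the hyperparameters, and the variable names are all assumptions; nothing here reflects the project's actual results.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in (made up): 200 athletes, 4 previous weeks of distance.
X = rng.uniform(0, 80, size=(200, 4))
y = X.mean(axis=1) + rng.normal(0, 5, size=200)  # next week ~ recent average

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

model = RandomForestRegressor(n_estimators=50, random_state=42)
model.fit(X_train, y_train)

# RMSE = square root of the mean squared error on the held-out athletes.
model_rmse = float(np.sqrt(mean_squared_error(y_test, model.predict(X_test))))
baseline_rmse = float(np.sqrt(mean_squared_error(y_test, X_test.mean(axis=1))))
print(f"model RMSE: {model_rmse:.2f}, mean-baseline RMSE: {baseline_rmse:.2f}")
```

Each row is one athlete, so the split holds out athletes rather than weeks. The test score therefore measures how well the model generalizes to unseen athletes for the same target week.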

### Data Analysis

### Data Modeling

### Results