# Data resampling  

> [https://github.com/BMClab/covid19](https://github.com/BMClab/covid19)  
> [Laboratory of Biomechanics and Motor Control](http://pesquisa.ufabc.edu.br/bmclab/)  
> Federal University of ABC, Brazil

**This Jupyter notebook uses data created by the notebook `preprocessing.ipynb`; we are not sharing these data publicly for ethical reasons. To run this notebook, you will first have to run the 'Strava web scrapper' and the 'Data preprocessing' notebooks that we provide.**

<h1>Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Setup" data-toc-modified-id="Setup-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Setup</a></span><ul class="toc-item"><li><span><a href="#Environment" data-toc-modified-id="Environment-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Environment</a></span></li><li><span><a href="#Helping-functions" data-toc-modified-id="Helping-functions-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Helping functions</a></span></li></ul></li><li><span><a href="#Load-dataset" data-toc-modified-id="Load-dataset-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Load dataset</a></span><ul class="toc-item"><li><span><a href="#Checking-for-missing-values" data-toc-modified-id="Checking-for-missing-values-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Checking for missing values</a></span></li><li><span><a href="#Basic-information-about-the-dataset" data-toc-modified-id="Basic-information-about-the-dataset-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Basic information about the dataset</a></span></li></ul></li><li><span><a href="#Data-resampling" data-toc-modified-id="Data-resampling-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Data resampling</a></span><ul class="toc-item"><li><span><a href="#Verify-data" data-toc-modified-id="Verify-data-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Verify data</a></span></li><li><span><a href="#Export-data" data-toc-modified-id="Export-data-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Export data</a></span></li><li><span><a href="#Resample-by-year-and-at-different-periods" data-toc-modified-id="Resample-by-year-and-at-different-periods-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Resample by year and at different periods</a></span></li><li><span><a href="#Test-files" data-toc-modified-id="Test-files-3.4"><span class="toc-item-num">3.4&nbsp;&nbsp;</span>Test files</a></span></li></ul></li></ul></div>

## Setup

In [1]:
import sys, os
import numpy as np
import pandas as pd
from tqdm.notebook import tqdm
%load_ext watermark  

%watermark
%watermark --iversions

Last updated: 2022-01-18T00:10:44.928856-03:00

Python implementation: CPython
Python version       : 3.9.9
IPython version      : 8.0.0

Compiler    : GCC 9.4.0
OS          : Linux
Release     : 5.13.0-25-generic
Machine     : x86_64
Processor   : x86_64
CPU cores   : 12
Architecture: 64bit

numpy : 1.22.0
sys   : 3.9.9 | packaged by conda-forge | (main, Dec 20 2021, 02:41:03) 
[GCC 9.4.0]
pandas: 1.3.5
json  : 2.0.9



### Environment

In [2]:
path2 = r'./../data/'

pd.set_option('display.float_format', lambda x: '%.4g' % x)

### Helping functions

In [3]:
def resample(y, cat1='athlete', freq='d', observed=True):
    """Resample numerical columns of `y` and repeat the categorical columns.
    
    `y` is a dataframe with a datetime index, a categorical column named
    `cat1` and possibly other categorical columns.
    The numerical columns are summed over periods of length `freq` along the
    datetime index, separately for each value of `cat1`.
    For `freq` equal to '7d', 'm' or 'y', the sums are rescaled to periods of
    7, 30 or 365 days.
    Returns the resampled dataframe, the number of rows in each period and
    the number of rows for each value of `cat1`.
    """
    # numerical columns
    cols_num = y.select_dtypes(include='number').columns.to_list()
    # categorical columns
    cols_cat = y.select_dtypes(include='category').columns.to_list()
    if len(cols_cat) > 1:
        y2 = y.drop_duplicates(subset=cat1)[cols_cat
                                           ].sort_values(cat1).reset_index(drop=True)
    cols_cat.remove(cat1)
    # move the days left over after the last full 7-day period into that
    # period, to avoid a last week with fewer than 7 days
    # (note that this changes the index of the input `y`)
    nlastdays = 7 + y.index[-1].dayofyear % 7  # days in the resulting last week
    if freq.lower() == '7d' and nlastdays > 7:
        ts = pd.Timestamp(y.index[-1].date() - pd.to_timedelta(nlastdays-7, unit='D'))
        y.index = y.index.where(y.index <= ts, ts)
    t0 = pd.to_datetime({'year':[y.index[0].year], 'month':[1], 'day':[1]})[0]
    grouper = pd.Grouper(axis=0, freq=freq, sort=True, origin=t0)
    # resample only the numerical columns (selected explicitly, so that the
    # categorical columns are never passed to sum)
    y = y.groupby([grouper, cat1], sort=True, observed=observed)[cols_num].sum().reset_index(level=1)
    #y.fillna(0, inplace=True)
    # rescale the accumulated sums of the numerical columns to correct for
    # differences in the length of the week, month or year
    if freq.lower() == 'd':
        pass
    elif freq.lower() == '7d':
        # the last week holds nlastdays days; rescale it to 7 days
        y.loc[y.index[-1], cols_num] = y.loc[y.index[-1], cols_num] * (7 / nlastdays)
    elif freq.lower() == 'm':  # faster than using apply if not too many months
        for year in y.index.year.unique().astype(str):
            for month in y.loc[year].index.month.unique().astype(str):
                date = '{}-{}'.format(year, month)
                ndays = pd.Period(date).daysinmonth        
                y.loc[date, cols_num] = y.loc[date, cols_num] * (30 / ndays)
    elif freq.lower() == 'q':
        print('No corrections are implemented for freq {}.'.format(freq))
    elif freq.lower() == 'y':
        for year in y.index.year.unique().astype(str):
            date = year
            ndays = 366 if pd.Period(date).is_leap_year else 365   
            y.loc[date, cols_num] = y.loc[date, cols_num] * (365 / ndays)
    else:
        print('No corrections are implemented for freq {}.'.format(freq))
    # number of elements in each index level at each freq period
    nidx0 = y.groupby(level=0, observed=True).size()
    nidx1 = y.groupby(cat1, observed=True).size()
    # add back the categorical columns
    if len(cols_cat):
        y = y.join(y2.set_index(cat1), on=cat1)
    
    return y, nidx0, nidx1
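The weekly correction in `resample` can be easier to follow on a toy example. The sketch below is not part of the analysis: it assumes a made-up series with one unit of distance per day in 2019 and repeats only the steps for the 7-day period (bins anchored at January 1st, the leftover day merged into the last full week, and that week rescaled to 7 days).

```python
import pandas as pd

# Toy daily series for 2019 (assumed data): one unit of distance per day
days = pd.date_range('2019-01-01', '2019-12-31', freq='D')
toy = pd.DataFrame({'distance': 1.0}, index=days)

# Same rule as in `resample`: number of days that end up in the last bin
nlastdays = 7 + toy.index[-1].dayofyear % 7

# Move the leftover day(s) back onto the last day of the final full week
ts = toy.index[-1].normalize() - pd.Timedelta(days=nlastdays - 7)
toy.index = toy.index.where(toy.index <= ts, ts)

# 7-day bins anchored at January 1st of the first year in the data
t0 = pd.Timestamp(year=toy.index[0].year, month=1, day=1)
weekly = toy.groupby(pd.Grouper(freq='7D', origin=t0)).sum()

# Rescale the last, longer bin to a 7-day equivalent
weekly.iloc[-1] = weekly.iloc[-1] * (7 / nlastdays)

print(nlastdays, len(weekly))
print(weekly.tail(3))
```

With these assumptions the year is split into 52 bins, and the last one accumulates 8 days before being multiplied by 7/8.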

## Load dataset

In [4]:
df = pd.read_parquet(os.path.join(path2, 'run_ww_2019_2020.parquet'))
#df = df[['datetime', 'athlete', 'gender', 'age_group', 'distance', 'duration']]
df['athlete'] = df['athlete'].astype('category')  # bug in parquet
df

| | datetime | athlete | gender | age_group | distance | duration | country | major |
|---|---|---|---|---|---|---|---|---|
| 0 | 2019-01-01 00:00:00 | 261 | F | 18 - 34 | 4.71 | 27.25 | Canada | CHICAGO 2018 |
| 1 | 2019-01-01 00:00:00 | 1575 | M | 35 - 54 | 6.5 | 23.63 | United States | BOSTON 2019 |
| 2 | 2019-01-01 00:00:00 | 2010 | M | 35 - 54 | 6.64 | 31.95 | United States | NEW YORK 2018 |
| 3 | 2019-01-01 00:00:00 | 2937 | M | 35 - 54 | 42.34 | 194 | Germany | BERLIN 2019 |
| 4 | 2019-01-01 00:00:00 | 3172 | M | 18 - 34 | 6.59 | 38.58 | United States | BOSTON 2019 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 10703685 | 2020-12-31 23:54:00 | 223 | M | 35 - 54 | 5.21 | 30.1 | China | CHICAGO 2015 |
| 10703686 | 2020-12-31 23:56:00 | 20357 | F | 35 - 54 | 1.62 | 8.717 | United States | BOSTON 2019 |
| 10703687 | 2020-12-31 23:58:00 | 16899 | F | 18 - 34 | 6.57 | 49.5 | United States | NEW YORK 2017 |
| 10703688 | 2020-12-31 23:59:00 | 6134 | M | 35 - 54 | 5.56 | 34.05 | United States | NEW YORK 2019 |


### Checking for missing values

In [5]:
n = 0
for col in df:
    null = df[df[col].isnull()]['athlete'].unique().tolist()
    if null:
        print('Athlete: {}, null value in {}'.format(null, col))
        n = 1
if n == 0:
    print('No missing values found.')

Athlete: [12513, 35224, 28610, 15246, 36624, 13009, 26308, 36611, 993, 36185, 31524, 20933, 12232, 34331, 18431, 12764, 4140, 4003, 11248, 8167, 34353, 35667, 11667, 5904, 6259, 25997, 32351, 21841, 31482, 12336, 13819, 15873, 2369, 95, 25354, 10370, 4978, 8116, 22214, 23175, 16932, 17160, 8762, 24620, 35038, 33673, 16339, 6973, 23350, 26303, 15822, 27611, 21531, 35209, 10862, 15759, 9504, 22760, 4683, 26483, 12534, 2554, 22052, 18129, 16359, 34975, 27126, 35664, 25992, 29233, 31158, 3814, 27178, 23787, 27496, 34581, 29498, 3191, 26155, 10244, 16605, 29634, 28106, 36002, 3190, 21102, 8358, 20903, 37158, 15546, 35671, 4479, 13248, 368, 13693, 26867, 31665, 4317, 34117, 9821, 29991, 32203, 14764, 32797, 9820, 21182, 34145, 26423, 26063, 6843, 21728, 19960, 34716, 18018, 34600, 33751, 35405, 16862, 11139, 1547, 26001, 29338, 3968, 31744, 2073, 22009, 24084, 36970, 4365, 33190, 36232, 18537, 37311, 25813, 3994, 12323, 28156, 6101, 33313, 4126, 2842, 15999, 31505, 8956, 22942, 8862, 17536, 
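When only a compact summary is needed, the same check can be written without an explicit loop. The sketch below uses a small made-up DataFrame (the values are placeholders, not taken from the dataset) to count the missing values per column and to list the athletes that have at least one missing value.

```python
import numpy as np
import pandas as pd

# Toy data (assumed): two athletes with one missing value each
toy = pd.DataFrame({
    'athlete': pd.Categorical([1, 2, 3]),
    'distance': [4.7, np.nan, 6.6],
    'duration': [27.2, 23.6, np.nan]})

# Number of missing values in each column
n_missing = toy.isna().sum()
# Athletes with at least one missing value in any column
athletes_missing = toy.loc[toy.isna().any(axis=1), 'athlete'].unique().tolist()

print(n_missing)
print(athletes_missing)
```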

### Basic information about the dataset

In [6]:
df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10703690 entries, 0 to 10703689
Data columns (total 8 columns):
 #   Column     Dtype         
---  ------     -----         
 0   datetime   datetime64[ns]
 1   athlete    category      
 2   gender     category      
 3   age_group  category      
 4   distance   float64       
 5   duration   float64       
 6   country    category      
 7   major      category      
dtypes: category(5), datetime64[ns](1), float64(2)
memory usage: 348.5 MB


In [7]:
nday = df['datetime'].dt.date.value_counts().size
print('Number of days:', nday)
nathlete = df['athlete'].unique().size
print('Number of athletes:', nathlete)
nactivity = df.shape[0]
print('Number of running activities:', nactivity)

Number of days: 731
Number of athletes: 36412
Number of running activities: 10703690


## Data resampling

Resample the data with a custom function to speed up the process.  
This process consumes about 10 GB of RAM at peak. It could be split by year, but because we want to generate all possible combinations of days and athletes, we would first have to fill each year with all athletes.

In [8]:
df.set_index('datetime', inplace=True)
df, nathletes, nruns = resample(df, cat1='athlete', freq='d', observed=False)
df.reset_index(inplace=True)
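The effect of `observed=False` can be seen on a toy example. The sketch below is only an illustration with made-up data: athlete `'b'` has no run on the second day, and grouping by day and by a categorical `athlete` column with `observed=False` still returns a row for that pair, with a sum of zero.

```python
import pandas as pd

# Toy data (assumed): athlete 'b' has no run on the second day
toy = pd.DataFrame(
    {'athlete': pd.Categorical(['a', 'b', 'a']),
     'distance': [5.0, 3.0, 7.0]},
    index=pd.to_datetime(['2019-01-01', '2019-01-01', '2019-01-02']))

by_day = pd.Grouper(freq='D')
# Only the (day, athlete) pairs that are present in the data
observed = toy.groupby([by_day, 'athlete'], observed=True)['distance'].sum()
# Every (day, athlete) pair; pairs without runs get a sum of 0
full = toy.groupby([by_day, 'athlete'], observed=False)['distance'].sum()

print(observed)
print(full)
```

The number of rows grows from the number of observed pairs to days times athletes, which is why the full dataset needs so much memory at this step.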

### Verify data

We can see that the Cartesian product of days and athletes was generated and that the distance and duration columns were filled with zeros when there was no record for an athlete on a given day:

In [9]:
df

| | datetime | athlete | distance | duration | gender | age_group | country | major |
|---|---|---|---|---|---|---|---|---|
| 0 | 2019-01-01 | 261 | 4.71 | 27.25 | F | 18 - 34 | Canada | CHICAGO 2018 |
| 1 | 2019-01-01 | 1575 | 9.35 | 38.93 | M | 35 - 54 | United States | BOSTON 2019 |
| 2 | 2019-01-01 | 2010 | 6.64 | 31.95 | M | 35 - 54 | United States | NEW YORK 2018 |
| 3 | 2019-01-01 | 2937 | 42.34 | 194 | M | 35 - 54 | Germany | BERLIN 2019 |
| 4 | 2019-01-01 | 3172 | 6.59 | 38.58 | M | 18 - 34 | United States | BOSTON 2019 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 26617167 | 2020-12-31 | 16429 | 0 | 0 | M | 18 - 34 | United States | BOSTON 2018 |
| 26617168 | 2020-12-31 | 28336 | 9.66 | 53.57 | M | 18 - 34 | United States | BOSTON 2020 |
| 26617169 | 2020-12-31 | 29690 | 0 | 0 | M | 18 - 34 | United States | NEW YORK 2017 |
| 26617170 | 2020-12-31 | 9977 | 0 | 0 | M | 35 - 54 | Spain | CHICAGO 2018 |


In [10]:
nathlete * nday

26617172

In [11]:
nathletes

datetime
2019-01-01    36412
2019-01-02    36412
2019-01-03    36412
2019-01-04    36412
2019-01-05    36412
              ...  
2020-12-27    36412
2020-12-28    36412
2020-12-29    36412
2020-12-30    36412
2020-12-31    36412
Length: 731, dtype: int64
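The two checks above (total number of rows and number of athletes per day) can also be combined into a single test. The helper below is a hypothetical sketch, not a function used elsewhere in this notebook; it assumes the column names `datetime` and `athlete` and is shown on made-up data.

```python
import pandas as pd

def is_complete_product(data, time_col='datetime', cat_col='athlete'):
    """Return True if every (time, category) pair occurs exactly once."""
    n_expected = data[time_col].nunique() * data[cat_col].nunique()
    no_duplicates = not data.duplicated(subset=[time_col, cat_col]).any()
    return len(data) == n_expected and no_duplicates

# Toy data (assumed): 2 days x 2 athletes, all four pairs present
toy = pd.DataFrame({
    'datetime': pd.to_datetime(['2019-01-01', '2019-01-01',
                                '2019-01-02', '2019-01-02']),
    'athlete': ['a', 'b', 'a', 'b'],
    'distance': [5.0, 3.0, 7.0, 0.0]})

complete = is_complete_product(toy)             # all pairs present
incomplete = is_complete_product(toy.iloc[:3])  # one pair missing
print(complete, incomplete)
```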

In [12]:
df.drop_duplicates(subset='athlete')[['athlete', 'age_group', 'gender']
                                    ].groupby(['age_group', 'gender']
                                             ).count().unstack(level=0)

| gender | athlete, age_group 18 - 34 | athlete, age_group 35 - 54 | athlete, age_group 55 + |
|---|---|---|---|
| F | 3864 | 4629 | 396 |
| M | 8378 | 16972 | 2173 |


In [13]:
df.groupby([df['datetime'].dt.year]).describe()

| datetime | distance count | distance mean | distance std | distance min | distance 25% | distance 50% | distance 75% | distance max | duration count | duration mean | duration std | duration min | duration 25% | duration 50% | duration 75% | duration max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2019 | 13290000.0 | 4.176 | 7.335 | 0 | 0 | 0 | 7.52 | 390.3 | 13290000.0 | 22.93 | 43.55 | 0 | 0 | 0 | 40.88 | 2388 |
| 2020 | 13330000.0 | 3.865 | 6.662 | 0 | 0 | 0 | 7.07 | 347.9 | 13330000.0 | 21.39 | 39.27 | 0 | 0 | 0 | 39.93 | 2300 |


In [14]:
df.groupby([df['datetime'].dt.year, 'age_group', 'gender']).describe()

| datetime | age_group | gender | distance count | distance mean | distance std | distance min | distance 25% | distance 50% | distance 75% | distance max | duration count | duration mean | duration std | duration min | duration 25% | duration 50% | duration 75% | duration max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2019 | 18 - 34 | F | 1410000.0 | 3.492 | 6.557 | 0 | 0 | 0 | 5.83 | 180.4 | 1410000.0 | 19.92 | 39.8 | 0 | 0 | 0 | 33.53 | 1920 |
| 2019 | 18 - 34 | M | 3058000.0 | 4.219 | 7.452 | 0 | 0 | 0 | 7.57 | 221.1 | 3058000.0 | 21.4 | 40.81 | 0 | 0 | 0 | 38.37 | 2153 |
| 2019 | 35 - 54 | F | 1690000.0 | 3.961 | 6.97 | 0 | 0 | 0 | 6.86 | 217.8 | 1690000.0 | 23.99 | 45.41 | 0 | 0 | 0 | 41.92 | 2300 |
| 2019 | 35 - 54 | M | 6195000.0 | 4.343 | 7.537 | 0 | 0 | 0 | 8.05 | 390.3 | 6195000.0 | 23.58 | 44.49 | 0 | 0 | 0 | 42.5 | 2388 |
| 2019 | 55 + | F | 144500.0 | 4.318 | 7.068 | 0 | 0 | 0 | 7.71 | 162.4 | 144500.0 | 28.75 | 51.07 | 0 | 0 | 0 | 49.1 | 2115 |
| 2019 | 55 + | M | 793100.0 | 4.356 | 7.321 | 0 | 0 | 0 | 8.06 | 238.8 | 793100.0 | 25.87 | 46.54 | 0 | 0 | 0 | 46.7 | 1989 |
| 2020 | 18 - 34 | F | 1414000.0 | 3.155 | 5.779 | 0 | 0 | 0 | 5.31 | 168.6 | 1414000.0 | 18.09 | 35.1 | 0 | 0 | 0 | 31.88 | 1439 |
| 2020 | 18 - 34 | M | 3066000.0 | 3.919 | 6.818 | 0 | 0 | 0 | 7.09 | 201.4 | 3066000.0 | 19.91 | 36.71 | 0 | 0 | 0 | 36.87 | 1917 |
| 2020 | 35 - 54 | F | 1694000.0 | 3.722 | 6.306 | 0 | 0 | 0 | 6.77 | 167.1 | 1694000.0 | 22.77 | 41.09 | 0 | 0 | 0 | 42.07 | 1612 |
| 2020 | 35 - 54 | M | 6212000.0 | 3.999 | 6.851 | 0 | 0 | 0 | 7.61 | 347.9 | 6212000.0 | 21.88 | 40.09 | 0 | 0 | 0 | 41.07 | 2300 |


### Export data

The resampled data are saved in the Parquet format. See the Apache Arrow [docs](https://arrow.apache.org/docs/python/feather.html) on the Feather format and a comparison of [formats to save Pandas data](https://towardsdatascience.com/the-best-format-to-save-pandas-data-414dca023e0d).

In [15]:
df.to_parquet(os.path.join(path2, 'run_ww_2019_2020_d.parquet'))

### Resample by year and at different periods

Now that we have a DataFrame with all possible combinations of athletes and days (the Cartesian product between days and athletes, obtained by setting the parameter `observed` to `False` in the `resample` function), we can resample the dataset separately by year and at different periods.

In [16]:
years = ['2019', '2020']
freqs = ['d', '7d', 'm', 'q']
if not pd.api.types.is_datetime64_ns_dtype(df.index):
    df.set_index('datetime', inplace=True)
for year in tqdm(years, desc='Year'):
    for freq in tqdm(freqs, desc='Freq'):
        dfi = resample(df.loc[year], cat1='athlete', freq=freq, observed=True)[0]
        dfi.reset_index(inplace=True)
        if freq == '7d': freq = 'w'
        dfi.to_parquet(os.path.join(path2, 'run_ww_{}_{}.parquet'.format(year, freq)))

No corrections are implemented for freq q.


No corrections are implemented for freq q.
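For the periods that `resample` does correct, the rescaling is a simple ratio of days: monthly sums are brought to a 30-day equivalent and yearly sums to a 365-day equivalent. The sketch below only computes these factors for illustration; the dictionary names are ours and do not appear in `resample`.

```python
import pandas as pd

# Factors that bring each month of 2020 to a 30-day equivalent
months = pd.period_range('2020-01', '2020-12', freq='M')
month_factor = {str(m): 30 / m.days_in_month for m in months}

# Factors that bring each year to a 365-day equivalent
year_factor = {}
for year in (2019, 2020):
    is_leap = pd.Timestamp(year=year, month=1, day=1).is_leap_year
    year_factor[year] = 365 / (366 if is_leap else 365)

print(month_factor)
print(year_factor)
```

A sum for February 2020 is therefore multiplied by 30/29 and a sum for January by 30/31, while 30-day months are left unchanged.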


In [17]:
dfi

| | datetime | athlete | distance | duration | gender | age_group | country | major |
|---|---|---|---|---|---|---|---|---|
| 0 | 2020-03-31 | 261 | 74.08 | 523.4 | F | 18 - 34 | Canada | CHICAGO 2018 |
| 1 | 2020-03-31 | 1575 | 506.3 | 2168 | M | 35 - 54 | United States | BOSTON 2019 |
| 2 | 2020-03-31 | 2010 | 279.1 | 1527 | M | 35 - 54 | United States | NEW YORK 2018 |
| 3 | 2020-03-31 | 2937 | 562.8 | 2981 | M | 35 - 54 | Germany | BERLIN 2019 |
| 4 | 2020-03-31 | 3172 | 528.2 | 2361 | M | 18 - 34 | United States | BOSTON 2019 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 145643 | 2020-12-31 | 16429 | 0 | 0 | M | 18 - 34 | United States | BOSTON 2018 |
| 145644 | 2020-12-31 | 28336 | 195.5 | 1085 | M | 18 - 34 | United States | BOSTON 2020 |
| 145645 | 2020-12-31 | 29690 | 62.83 | 374.1 | M | 18 - 34 | United States | NEW YORK 2017 |
| 145646 | 2020-12-31 | 9977 | 0 | 0 | M | 35 - 54 | Spain | CHICAGO 2018 |


### Test files

In [18]:
for year in years:
    for freq in freqs:
        if freq == '7d': freq = 'w'
        df = pd.read_parquet(os.path.join(path2, 'run_ww_{}_{}.parquet'.format(year, freq)))
        df['athlete'] = df['athlete'].astype('category')  # bug in parquet
        #df.set_index('datetime', inplace=True)
        print('\nFile: run_ww_{}_{}.parquet'.format(year, freq))
        display(df.drop_duplicates(subset='athlete')[['athlete', 'age_group', 'gender']
                                                    ].groupby(['age_group', 'gender']
                                                             ).count().unstack(level=0))
        display(df.groupby([df['datetime'].dt.year, 'age_group', 'gender']).describe())


File: run_ww_2019_d.parquet


| gender | athlete, age_group 18 - 34 | athlete, age_group 35 - 54 | athlete, age_group 55 + |
|---|---|---|---|
| F | 3864 | 4629 | 396 |
| M | 8378 | 16972 | 2173 |


| datetime | age_group | gender | distance count | distance mean | distance std | distance min | distance 25% | distance 50% | distance 75% | distance max | duration count | duration mean | duration std | duration min | duration 25% | duration 50% | duration 75% | duration max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2019 | 18 - 34 | F | 1410000.0 | 3.492 | 6.557 | 0 | 0 | 0 | 5.83 | 180.4 | 1410000.0 | 19.92 | 39.8 | 0 | 0 | 0 | 33.53 | 1920 |
| 2019 | 18 - 34 | M | 3058000.0 | 4.219 | 7.452 | 0 | 0 | 0 | 7.57 | 221.1 | 3058000.0 | 21.4 | 40.81 | 0 | 0 | 0 | 38.37 | 2153 |
| 2019 | 35 - 54 | F | 1690000.0 | 3.961 | 6.97 | 0 | 0 | 0 | 6.86 | 217.8 | 1690000.0 | 23.99 | 45.41 | 0 | 0 | 0 | 41.92 | 2300 |
| 2019 | 35 - 54 | M | 6195000.0 | 4.343 | 7.537 | 0 | 0 | 0 | 8.05 | 390.3 | 6195000.0 | 23.58 | 44.49 | 0 | 0 | 0 | 42.5 | 2388 |
| 2019 | 55 + | F | 144500.0 | 4.318 | 7.068 | 0 | 0 | 0 | 7.71 | 162.4 | 144500.0 | 28.75 | 51.07 | 0 | 0 | 0 | 49.1 | 2115 |
| 2019 | 55 + | M | 793100.0 | 4.356 | 7.321 | 0 | 0 | 0 | 8.06 | 238.8 | 793100.0 | 25.87 | 46.54 | 0 | 0 | 0 | 46.7 | 1989 |



File: run_ww_2019_w.parquet


| gender | athlete, age_group 18 - 34 | athlete, age_group 35 - 54 | athlete, age_group 55 + |
|---|---|---|---|
| F | 3864 | 4629 | 396 |
| M | 8378 | 16972 | 2173 |


| datetime | age_group | gender | distance count | distance mean | distance std | distance min | distance 25% | distance 50% | distance 75% | distance max | duration count | duration mean | duration std | duration min | duration 25% | duration 50% | duration 75% | duration max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2019 | 18 - 34 | F | 200900.0 | 24.45 | 27.52 | 0 | 0.0 | 15.81 | 39.56 | 419.7 | 200900.0 | 139.5 | 156.4 | 0 | 0.0 | 94.3 | 228.5 | 3442 |
| 2019 | 18 - 34 | M | 435700.0 | 29.55 | 33.21 | 0 | 0.0 | 19.18 | 46.85 | 510.3 | 435700.0 | 149.9 | 170.0 | 0 | 0.0 | 100.5 | 239.9 | 6810 |
| 2019 | 35 - 54 | F | 240700.0 | 27.74 | 26.33 | 0 | 4.87 | 22.7 | 43.69 | 450.4 | 240700.0 | 168.0 | 162.1 | 0 | 30.22 | 141.4 | 263.1 | 7649 |
| 2019 | 35 - 54 | M | 882500.0 | 30.41 | 30.37 | 0 | 4.58 | 23.14 | 47.89 | 711.1 | 882500.0 | 165.1 | 166.7 | 0 | 25.27 | 129.0 | 258.7 | 5162 |
| 2019 | 55 + | F | 20590.0 | 30.23 | 24.32 | 0 | 9.67 | 28.23 | 46.54 | 479.5 | 20590.0 | 201.3 | 183.8 | 0 | 64.0 | 187.0 | 301.0 | 8239 |
| 2019 | 55 + | M | 113000.0 | 30.5 | 27.26 | 0 | 6.537 | 26.06 | 47.94 | 331.4 | 113000.0 | 181.1 | 168.5 | 0 | 41.18 | 155.1 | 279.3 | 6418 |



File: run_ww_2019_m.parquet


| gender | athlete, age_group 18 - 34 | athlete, age_group 35 - 54 | athlete, age_group 55 + |
|---|---|---|---|
| F | 3864 | 4629 | 396 |
| M | 8378 | 16972 | 2173 |


| datetime | age_group | gender | distance count | distance mean | distance std | distance min | distance 25% | distance 50% | distance 75% | distance max | duration count | duration mean | duration std | duration min | duration 25% | duration 50% | duration 75% | duration max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2019 | 18 - 34 | F | 46370.0 | 104.8 | 107.9 | 0 | 15.56 | 75.12 | 160.0 | 1793.0 | 46370.0 | 597.9 | 593.8 | 0 | 97.82 | 453.8 | 927.5 | 14420.0 |
| 2019 | 18 - 34 | M | 100500.0 | 126.6 | 130.4 | 0 | 21.64 | 89.33 | 190.5 | 1554.0 | 100500.0 | 642.2 | 647.6 | 0 | 119.6 | 477.0 | 978.6 | 23340.0 |
| 2019 | 35 - 54 | F | 55550.0 | 118.9 | 101.2 | 0 | 36.69 | 102.0 | 177.6 | 1944.0 | 55550.0 | 720.0 | 597.0 | 0 | 240.7 | 641.7 | 1072.0 | 16450.0 |
| 2019 | 35 - 54 | M | 203700.0 | 130.4 | 117.4 | 0 | 35.55 | 105.3 | 194.4 | 1295.0 | 203700.0 | 707.6 | 620.1 | 0 | 206.2 | 589.6 | 1057.0 | 9230.0 |
| 2019 | 55 + | F | 4752.0 | 129.6 | 90.55 | 0 | 59.53 | 123.2 | 189.0 | 877.6 | 4752.0 | 862.6 | 645.1 | 0 | 417.3 | 821.4 | 1231.0 | 16150.0 |
| 2019 | 55 + | M | 26080.0 | 130.8 | 103.7 | 0 | 46.63 | 116.1 | 194.6 | 787.7 | 26080.0 | 776.3 | 618.6 | 0 | 288.6 | 693.5 | 1138.0 | 10440.0 |



File: run_ww_2019_q.parquet


| gender | athlete, age_group 18 - 34 | athlete, age_group 35 - 54 | athlete, age_group 55 + |
|---|---|---|---|
| F | 3864 | 4629 | 396 |
| M | 8378 | 16972 | 2173 |


| datetime | age_group | gender | distance count | distance mean | distance std | distance min | distance 25% | distance 50% | distance 75% | distance max | duration count | duration mean | duration std | duration min | duration 25% | duration 50% | duration 75% | duration max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2019 | 18 - 34 | F | 15460.0 | 318.6 | 308.9 | 0 | 72.22 | 235.7 | 474.2 | 4308 | 15460.0 | 1818 | 1684 | 0 | 457.5 | 1420 | 2754 | 35620.0 |
| 2019 | 18 - 34 | M | 33510.0 | 385.0 | 372.9 | 0 | 92.76 | 280.8 | 569.7 | 3232 | 33510.0 | 1953 | 1820 | 0 | 516.2 | 1495 | 2918 | 37200.0 |
| 2019 | 35 - 54 | F | 18520.0 | 361.4 | 287.6 | 0 | 134.0 | 318.3 | 524.3 | 4079 | 18520.0 | 2189 | 1672 | 0 | 874.2 | 1993 | 3170 | 29650.0 |
| 2019 | 35 - 54 | M | 67890.0 | 396.3 | 333.6 | 0 | 133.7 | 328.5 | 578.0 | 2808 | 67890.0 | 2151 | 1743 | 0 | 783.6 | 1843 | 3131 | 18840.0 |
| 2019 | 55 + | F | 1584.0 | 394.0 | 254.4 | 0 | 207.0 | 369.8 | 569.2 | 1695 | 1584.0 | 2623 | 1768 | 0 | 1431.0 | 2515 | 3605 | 23990.0 |
| 2019 | 55 + | M | 8692.0 | 397.5 | 291.9 | 0 | 168.5 | 356.1 | 576.8 | 1829 | 8692.0 | 2360 | 1722 | 0 | 1050.0 | 2130 | 3377 | 19430.0 |



File: run_ww_2020_d.parquet


| gender | athlete, age_group 18 - 34 | athlete, age_group 35 - 54 | athlete, age_group 55 + |
|---|---|---|---|
| F | 3864 | 4629 | 396 |
| M | 8378 | 16972 | 2173 |


| datetime | age_group | gender | distance count | distance mean | distance std | distance min | distance 25% | distance 50% | distance 75% | distance max | duration count | duration mean | duration std | duration min | duration 25% | duration 50% | duration 75% | duration max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2020 | 18 - 34 | F | 1414000.0 | 3.155 | 5.779 | 0 | 0 | 0 | 5.31 | 168.6 | 1414000.0 | 18.09 | 35.1 | 0 | 0 | 0 | 31.88 | 1439 |
| 2020 | 18 - 34 | M | 3066000.0 | 3.919 | 6.818 | 0 | 0 | 0 | 7.09 | 201.4 | 3066000.0 | 19.91 | 36.71 | 0 | 0 | 0 | 36.87 | 1917 |
| 2020 | 35 - 54 | F | 1694000.0 | 3.722 | 6.306 | 0 | 0 | 0 | 6.77 | 167.1 | 1694000.0 | 22.77 | 41.09 | 0 | 0 | 0 | 42.07 | 1612 |
| 2020 | 35 - 54 | M | 6212000.0 | 3.999 | 6.851 | 0 | 0 | 0 | 7.61 | 347.9 | 6212000.0 | 21.88 | 40.09 | 0 | 0 | 0 | 41.07 | 2300 |
| 2020 | 55 + | F | 144900.0 | 4.29 | 6.503 | 0 | 0 | 0 | 8.05 | 109.4 | 144900.0 | 29.66 | 48.0 | 0 | 0 | 0 | 52.7 | 1297 |
| 2020 | 55 + | M | 795300.0 | 4.092 | 6.689 | 0 | 0 | 0 | 8.05 | 266.2 | 795300.0 | 24.73 | 42.81 | 0 | 0 | 0 | 46.75 | 1857 |



File: run_ww_2020_w.parquet


| gender | athlete, age_group 18 - 34 | athlete, age_group 35 - 54 | athlete, age_group 55 + |
|---|---|---|---|
| F | 3864 | 4629 | 396 |
| M | 8378 | 16972 | 2173 |


| datetime | age_group | gender | distance count | distance mean | distance std | distance min | distance 25% | distance 50% | distance 75% | distance max | duration count | duration mean | duration std | duration min | duration 25% | duration 50% | duration 75% | duration max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2020 | 18 - 34 | F | 200900.0 | 22.11 | 27.46 | 0 | 0.0 | 11.5 | 35.48 | 528.5 | 200900.0 | 126.8 | 158.1 | 0 | 0.0 | 70.9 | 206.3 | 6913 |
| 2020 | 18 - 34 | M | 435700.0 | 27.46 | 32.9 | 0 | 0.0 | 15.9 | 43.42 | 476.8 | 435700.0 | 139.5 | 165.8 | 0 | 0.0 | 84.61 | 222.9 | 5421 |
| 2020 | 35 - 54 | F | 240700.0 | 26.07 | 27.19 | 0 | 0.0 | 19.68 | 41.92 | 579.2 | 240700.0 | 159.5 | 167.2 | 0 | 0.0 | 124.0 | 253.3 | 5408 |
| 2020 | 35 - 54 | M | 882500.0 | 28.02 | 30.66 | 0 | 0.0 | 19.39 | 44.4 | 624.6 | 882500.0 | 153.2 | 168.4 | 0 | 0.0 | 108.8 | 242.9 | 5065 |
| 2020 | 55 + | F | 20590.0 | 30.06 | 26.71 | 0 | 5.27 | 26.74 | 46.84 | 281.7 | 20590.0 | 207.8 | 201.4 | 0 | 39.57 | 182.8 | 309.6 | 3285 |
| 2020 | 55 + | M | 113000.0 | 28.66 | 28.02 | 0 | 2.876 | 23.03 | 45.08 | 384.4 | 113000.0 | 173.3 | 173.3 | 0 | 20.31 | 141.0 | 268.4 | 3302 |



File: run_ww_2020_m.parquet


| gender | athlete, age_group 18 - 34 | athlete, age_group 35 - 54 | athlete, age_group 55 + |
|---|---|---|---|
| F | 3864 | 4629 | 396 |
| M | 8378 | 16972 | 2173 |


| datetime | age_group | gender | distance count | distance mean | distance std | distance min | distance 25% | distance 50% | distance 75% | distance max | duration count | duration mean | duration std | duration min | duration 25% | duration 50% | duration 75% | duration max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2020 | 18 - 34 | F | 46370.0 | 94.71 | 110.9 | 0 | 3.405 | 55.17 | 149.2 | 1600.0 | 46370.0 | 543.1 | 624.9 | 0 | 23.5 | 337.4 | 865.8 | 22150.0 |
| 2020 | 18 - 34 | M | 100500.0 | 117.6 | 132.0 | 0 | 10.43 | 73.63 | 181.0 | 1377.0 | 100500.0 | 597.5 | 650.8 | 0 | 60.0 | 396.2 | 930.6 | 9243.0 |
| 2020 | 35 - 54 | F | 55550.0 | 111.7 | 108.5 | 0 | 17.16 | 88.22 | 173.8 | 1962.0 | 55550.0 | 683.3 | 651.2 | 0 | 117.1 | 562.7 | 1057.0 | 16990.0 |
| 2020 | 35 - 54 | M | 203700.0 | 120.0 | 122.1 | 0 | 17.87 | 87.62 | 184.7 | 1560.0 | 203700.0 | 656.6 | 654.7 | 0 | 107.6 | 494.4 | 1011.0 | 10710.0 |
| 2020 | 55 + | F | 4752.0 | 128.7 | 106.2 | 0 | 35.59 | 117.4 | 197.5 | 935.5 | 4752.0 | 890.0 | 793.4 | 0 | 261.3 | 806.4 | 1295.0 | 9606.0 |
| 2020 | 55 + | M | 26080.0 | 122.8 | 110.8 | 0 | 29.97 | 101.9 | 186.9 | 1020.0 | 26080.0 | 742.2 | 673.7 | 0 | 197.6 | 625.5 | 1112.0 | 8395.0 |



File: run_ww_2020_q.parquet


| gender | athlete, age_group 18 - 34 | athlete, age_group 35 - 54 | athlete, age_group 55 + |
|---|---|---|---|
| F | 3864 | 4629 | 396 |
| M | 8378 | 16972 | 2173 |


| datetime | age_group | gender | distance count | distance mean | distance std | distance min | distance 25% | distance 50% | distance 75% | distance max | duration count | duration mean | duration std | duration min | duration 25% | duration 50% | duration 75% | duration max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2020 | 18 - 34 | F | 15460.0 | 288.7 | 323.3 | 0 | 26.71 | 175.9 | 447.3 | 3179 | 15460.0 | 1656 | 1801 | 0 | 178.6 | 1082 | 2601 | 41650.0 |
| 2020 | 18 - 34 | M | 33510.0 | 358.6 | 382.6 | 0 | 54.31 | 236.9 | 543.4 | 3705 | 33510.0 | 1822 | 1871 | 0 | 311.7 | 1277 | 2786 | 24520.0 |
| 2020 | 35 - 54 | F | 18520.0 | 340.5 | 313.8 | 0 | 74.89 | 275.1 | 521.1 | 3765 | 18520.0 | 2083 | 1867 | 0 | 517.3 | 1762 | 3163 | 31000.0 |
| 2020 | 35 - 54 | M | 67890.0 | 365.9 | 352.2 | 0 | 79.52 | 276.0 | 554.1 | 2954 | 67890.0 | 2002 | 1873 | 0 | 475.2 | 1561 | 3024 | 19860.0 |
| 2020 | 55 + | F | 1584.0 | 392.6 | 307.4 | 0 | 134.6 | 358.8 | 589.7 | 2386 | 1584.0 | 2714 | 2294 | 0 | 991.7 | 2451 | 3891 | 24280.0 |
| 2020 | 55 + | M | 8692.0 | 374.4 | 317.0 | 0 | 115.7 | 317.1 | 556.5 | 2543 | 8692.0 | 2263 | 1916 | 0 | 751.8 | 1943 | 3313 | 24240.0 |
