# `train_dev_test_split.ipynb`

### Author: Anthony Hein

#### Last updated: 11/14/2021

# Overview:

Create the training set, dev set, and test set used in this analysis.

It is important that all races in the training set come chronologically _before_ those in the dev set, and that all races in the dev set come chronologically _before_ those in the test set, so that the model does not learn information about future races during training.

This mirrors real-world use, where the model only ever predicts races that take place after the ones it was trained on, and it reflects the fact that we expect the real-world distribution to change over time.
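The ordering requirement above can be sketched on a toy dataframe. This is a minimal illustration, not the notebook's own code: the helper name `chronological_split`, the toy rows, and the two cutoff dates are all hypothetical, and only the `rid` and `datetime` column names are taken from the real data.

```python
import pandas as pd


def chronological_split(df, time_col, dev_start, test_start):
    """Split df into (train, dev, test) using two timestamp cutoffs.

    Hypothetical helper: rows before dev_start go to train, rows from
    dev_start up to (but excluding) test_start go to dev, the rest to test.
    """
    train = df[df[time_col] < dev_start]
    dev = df[(df[time_col] >= dev_start) & (df[time_col] < test_start)]
    test = df[df[time_col] >= test_start]
    return train, dev, test


# Toy data (made up): five rows from four races.
toy = pd.DataFrame({
    "rid": [1, 1, 2, 3, 4],
    "datetime": pd.to_datetime([
        "2016-01-01 14:00", "2016-01-01 14:00",
        "2017-06-01 15:30", "2018-03-10 13:00", "2019-09-05 16:45",
    ]),
})

train, dev, test = chronological_split(
    toy,
    "datetime",
    dev_start=pd.Timestamp("2017-01-01"),   # assumed cutoff, for illustration
    test_start=pd.Timestamp("2019-01-01"),  # assumed cutoff, for illustration
)
```

Because every row is assigned by comparing its timestamp to the same two cutoffs, each row lands in exactly one set and no training row can be later than a dev or test row.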

---

## Setup

In [1]:
from datetime import datetime
import git
import os
import re
from typing import List
from sklearn.preprocessing import StandardScaler
from tqdm import tqdm
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [2]:
BASE_DIR = git.Repo(os.getcwd(), search_parent_directories=True).working_dir
BASE_DIR

'/Users/anthonyhein/Desktop/SML310/project'

---

## Load `horses_featurized_jockey_paired_input_with_race_data.csv`

In [3]:
horses_paired_input = pd.read_csv(
    f"{BASE_DIR}/data/streamline/horses_featurized_jockey_paired_input_with_race_data.csv",
    low_memory=False
)
horses_paired_input.head()

Unnamed: 0,rid,horse1_horseName,horse1_age,horse1_saddle,horse1_decimalPrice,horse1_isFav,horse1_trainerName,horse1_jockeyName,horse1_position,horse1_positionL,...,pressure_level_2,pressure_level_3,pressure_level_4,is_raining,rhum_level_0,rhum_level_1,rhum_level_2,rhum_level_3,rhum_level_4,entropy of odds
0,377929,Strawberry Roan,3.0,1.0,0.714286,1,A P O'Brien,C Roche,1,0.0,...,1,0,0,0,0,0,0,1,0,1.173658
1,377929,Strawberry Roan,3.0,1.0,0.714286,1,A P O'Brien,C Roche,1,0.0,...,1,0,0,0,0,0,0,1,0,1.173658
2,377929,Strawberry Roan,3.0,1.0,0.714286,1,A P O'Brien,C Roche,1,0.0,...,1,0,0,0,0,0,0,1,0,1.173658
3,377929,Magical Cliche,3.0,3.0,0.090909,0,D K Weld,Mick Kinane,3,0.75,...,1,0,0,0,0,0,0,1,0,1.173658
4,377929,Magical Cliche,3.0,3.0,0.090909,0,D K Weld,Mick Kinane,3,0.75,...,1,0,0,0,0,0,0,1,0,1.173658


In [4]:
horses_paired_input.shape

(1143824, 295)

---

## Train-Dev-Test Split

First, convert the `datetime` column to a proper datetime type and sort the data by it so that we can select a cutoff.

In [5]:
horses_paired_input['datetime']

0          1997-05-11 14:00:00
1          1997-05-11 14:00:00
2          1997-05-11 14:00:00
3          1997-05-11 14:00:00
4          1997-05-11 14:00:00
                  ...         
1143819    1999-12-29 14:15:00
1143820    1999-12-29 14:15:00
1143821    1999-12-29 14:15:00
1143822    1999-12-29 14:15:00
1143823    1999-12-29 14:15:00
Name: datetime, Length: 1143824, dtype: object

In [6]:
horses_paired_input['datetime'] = pd.to_datetime(horses_paired_input['datetime'])

In [7]:
horses_paired_input['datetime']

0         1997-05-11 14:00:00
1         1997-05-11 14:00:00
2         1997-05-11 14:00:00
3         1997-05-11 14:00:00
4         1997-05-11 14:00:00
                  ...        
1143819   1999-12-29 14:15:00
1143820   1999-12-29 14:15:00
1143821   1999-12-29 14:15:00
1143822   1999-12-29 14:15:00
1143823   1999-12-29 14:15:00
Name: datetime, Length: 1143824, dtype: datetime64[ns]

In [8]:
horses_paired_input = horses_paired_input.sort_values(by='datetime')
horses_paired_input.head()

Unnamed: 0,rid,horse1_horseName,horse1_age,horse1_saddle,horse1_decimalPrice,horse1_isFav,horse1_trainerName,horse1_jockeyName,horse1_position,horse1_positionL,...,pressure_level_2,pressure_level_3,pressure_level_4,is_raining,rhum_level_0,rhum_level_1,rhum_level_2,rhum_level_3,rhum_level_4,entropy of odds
98,341451,Dance Design,3.0,6.0,0.181818,0,D K Weld,Mick Kinane,2,1.5,...,0,1,0,0,0,0,1,0,0,1.601872
99,341451,Idris,6.0,1.0,0.066667,0,J S Bolger,Kevin Manning,5,nk,...,0,1,0,0,0,0,1,0,0,1.601872
100,50025,Azra,2.0,11.0,0.090909,0,J S Bolger,Kevin Manning,3,1,...,0,1,0,0,0,0,1,0,0,2.103465
101,50025,Azra,2.0,11.0,0.090909,0,J S Bolger,Kevin Manning,3,1,...,0,1,0,0,0,0,1,0,0,2.103465
102,50025,Johan Cruyff,2.0,5.0,0.083333,0,A P O'Brien,Johnny Murtagh,5,nk,...,0,1,0,0,0,0,1,0,0,2.103465


In [9]:
horses_paired_input['datetime']

98       1996-09-14 15:00:00
99       1996-09-14 15:00:00
100      1996-09-21 15:30:00
101      1996-09-21 15:30:00
102      1996-09-21 15:30:00
                 ...        
335959   2020-12-05 14:28:00
335960   2020-12-05 14:28:00
335961   2020-12-05 14:28:00
335952   2020-12-05 14:28:00
335950   2020-12-05 14:28:00
Name: datetime, Length: 1143824, dtype: datetime64[ns]

Our target is a test set containing the most recent 10% of the data.

In [56]:
target_size = round(0.1 * len(horses_paired_input))
target_size

114382

In [57]:
horses_paired_input[-target_size:]

Unnamed: 0,rid,horse1_horseName,horse1_age,horse1_saddle,horse1_decimalPrice,horse1_isFav,horse1_trainerName,horse1_jockeyName,horse1_position,horse1_positionL,...,pressure_level_2,pressure_level_3,pressure_level_4,is_raining,rhum_level_0,rhum_level_1,rhum_level_2,rhum_level_3,rhum_level_4,entropy of odds
112137,136782,Captain Power,7.0,6.0,0.076923,0,Gordon Elliott,Gary Halpin,8,1.5,...,1,0,0,0,0,0,1,0,0,2.306015
112132,136782,Sestriere,3.0,7.0,0.029412,0,Kevin Prendergast,Chris Hayes,7,.75,...,1,0,0,0,0,0,1,0,0,2.306015
112125,136782,Sestriere,3.0,7.0,0.029412,0,Kevin Prendergast,Chris Hayes,7,.75,...,1,0,0,0,0,0,1,0,0,2.306015
112123,136782,Pillar,6.0,8.0,0.090909,0,Adrian McGuinness,Shane Foley,6,hd,...,1,0,0,0,0,0,1,0,0,2.306015
112138,136782,Captain Power,7.0,6.0,0.076923,0,Gordon Elliott,Gary Halpin,8,1.5,...,1,0,0,0,0,0,1,0,0,2.306015
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
335959,415052,Knockmoylan,5.0,7.0,0.066667,0,Miss Ellmarie Holden,Mr Derek O'Connor,4,7.5,...,1,0,0,0,0,0,0,1,0,1.643973
335960,415052,Alko Rouge,6.0,1.0,0.200000,0,Thomas Gibney,Mr N McParlan,6,2.5,...,1,0,0,0,0,0,0,1,0,1.643973
335961,415052,Alko Rouge,6.0,1.0,0.200000,0,Thomas Gibney,Mr N McParlan,6,2.5,...,1,0,0,0,0,0,0,1,0,1.643973
335952,415052,Ballycairn,5.0,2.0,0.200000,0,Gordon Elliott,Mr J J Codd,3,4.5,...,1,0,0,0,0,0,0,1,0,1.643973


In [58]:
horses_paired_input[-target_size:].iloc[0]['datetime']

Timestamp('2019-07-24 16:50:00')

In [59]:
# normalize() resets the time to midnight, including any seconds
cutoff = horses_paired_input[-target_size:].iloc[0]['datetime'].normalize()
cutoff

Timestamp('2019-07-24 00:00:00')

In [60]:
horses_paired_input[
    (horses_paired_input['datetime'] >= cutoff) &
    (horses_paired_input['datetime'] <= cutoff.replace(hour=23, minute=59))
]['datetime']

112127   2019-07-24 16:50:00
112128   2019-07-24 16:50:00
112129   2019-07-24 16:50:00
112130   2019-07-24 16:50:00
112131   2019-07-24 16:50:00
                 ...        
112053   2019-07-24 20:00:00
112050   2019-07-24 20:00:00
112060   2019-07-24 20:00:00
112061   2019-07-24 20:00:00
112062   2019-07-24 20:00:00
Name: datetime, Length: 140, dtype: datetime64[ns]

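The earliest start time on this day is 16:50, which is also the start time of the boundary row, so the boundary falls in the first race of the day. Rounding the cutoff down to midnight therefore only pulls in the remaining rows from that start time.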

In [38]:
len(horses_paired_input[horses_paired_input['datetime'] >= cutoff])

114392

So our cutoff is 2019-07-24.
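The cutoff selection above (take the last fraction of the sorted rows, read off the first timestamp, round it down to midnight) can be wrapped in a small function. This is a sketch under assumptions: the name `day_cutoff_for_tail` and the toy data are hypothetical, and the function assumes a non-empty dataframe.

```python
import pandas as pd


def day_cutoff_for_tail(df, time_col, fraction):
    """Midnight of the day on which the last `fraction` of rows begins.

    Hypothetical helper; assumes df is non-empty.
    """
    ordered = df.sort_values(by=time_col)
    target_size = max(1, round(fraction * len(ordered)))
    boundary = ordered[time_col].iloc[-target_size]
    # Round down to the start of the day so a race day is never split.
    return boundary.normalize()


# Toy data (made up): one row per day for a week, then three rows on the last day.
toy = pd.DataFrame({
    "datetime": pd.to_datetime(
        [f"2020-01-0{d} 14:00" for d in range(1, 8)]
        + ["2020-01-08 14:00", "2020-01-08 14:00", "2020-01-08 15:00"]
    ),
})

cutoff = day_cutoff_for_tail(toy, "datetime", fraction=0.2)
test_rows = toy[toy["datetime"] >= cutoff]
```

Rounding down to midnight means the resulting set can be slightly larger than the target, as it is here (3 rows for a target of 2) and in the notebook (114392 rows for a target of 114382).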

In [39]:
assert (len(horses_paired_input[horses_paired_input['datetime'] >= cutoff]) + \
        len(horses_paired_input[horses_paired_input['datetime'] < cutoff])) == len(horses_paired_input)

In [40]:
X_test = horses_paired_input[horses_paired_input['datetime'] >= cutoff]
X_train_dev = horses_paired_input[horses_paired_input['datetime'] < cutoff]

In [41]:
len(X_test), len(X_train_dev)

(114392, 1029432)

Our target is a dev set containing 20% of the full dataset (not 20% of the remaining train/dev rows).

In [61]:
target_size = round(0.2 * len(horses_paired_input))
target_size

228765

In [62]:
X_train_dev[-target_size:]

Unnamed: 0,rid,horse1_horseName,horse1_age,horse1_saddle,horse1_decimalPrice,horse1_isFav,horse1_trainerName,horse1_jockeyName,horse1_position,horse1_positionL,...,pressure_level_2,pressure_level_3,pressure_level_4,is_raining,rhum_level_0,rhum_level_1,rhum_level_2,rhum_level_3,rhum_level_4,entropy of odds
1075803,159686,Mothers Finest,4.0,4.0,0.142857,0,Adrian Paul Keatley,Gary Carroll,7,2.5,...,1,0,0,1,0,0,0,0,1,1.852846
1075802,159686,Mothers Finest,4.0,4.0,0.142857,0,Adrian Paul Keatley,Gary Carroll,7,2.5,...,1,0,0,1,0,0,0,0,1,1.852846
1075801,159686,Mothers Finest,4.0,4.0,0.142857,0,Adrian Paul Keatley,Gary Carroll,7,2.5,...,1,0,0,1,0,0,0,0,1,1.852846
1075799,159686,Mothers Finest,4.0,4.0,0.142857,0,Adrian Paul Keatley,Gary Carroll,7,2.5,...,1,0,0,1,0,0,0,0,1,1.852846
1075798,159686,Mothers Finest,4.0,4.0,0.142857,0,Adrian Paul Keatley,Gary Carroll,7,2.5,...,1,0,0,1,0,0,0,0,1,1.852846
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
112041,1269,Rian Thomas,6.0,5.0,0.181818,0,Gordon Elliott,Mr J J Codd,2,2,...,1,0,0,0,0,0,0,1,0,1.863178
112040,1269,Above The City,5.0,7.0,0.500000,1,Desmond McDonogh,Mr P W Mullins,1,0,...,1,0,0,0,0,0,0,1,0,1.863178
112039,1269,Above The City,5.0,7.0,0.500000,1,Desmond McDonogh,Mr P W Mullins,1,0,...,1,0,0,0,0,0,0,1,0,1.863178
112038,1269,Above The City,5.0,7.0,0.500000,1,Desmond McDonogh,Mr P W Mullins,1,0,...,1,0,0,0,0,0,0,1,0,1.863178


In [63]:
X_train_dev[-target_size:].iloc[0]['datetime']

Timestamp('2016-07-28 14:30:00')

In [64]:
# normalize() resets the time to midnight, including any seconds
cutoff = X_train_dev[-target_size:].iloc[0]['datetime'].normalize()
cutoff

Timestamp('2016-07-28 00:00:00')

In [65]:
X_train_dev[
    (X_train_dev['datetime'] >= cutoff) &
    (X_train_dev['datetime'] <= cutoff.replace(hour=23, minute=59))
]['datetime']

1075788   2016-07-28 14:30:00
1075803   2016-07-28 14:30:00
1075802   2016-07-28 14:30:00
1075801   2016-07-28 14:30:00
1075799   2016-07-28 14:30:00
1075798   2016-07-28 14:30:00
1075797   2016-07-28 14:30:00
1075796   2016-07-28 14:30:00
1075795   2016-07-28 14:30:00
1075794   2016-07-28 14:30:00
1075793   2016-07-28 14:30:00
1075792   2016-07-28 14:30:00
1075791   2016-07-28 14:30:00
1075789   2016-07-28 14:30:00
1075787   2016-07-28 14:30:00
1075790   2016-07-28 14:30:00
1075800   2016-07-28 14:30:00
1075785   2016-07-28 14:30:00
1075763   2016-07-28 14:30:00
1075764   2016-07-28 14:30:00
1075765   2016-07-28 14:30:00
1075762   2016-07-28 14:30:00
1075766   2016-07-28 14:30:00
1075767   2016-07-28 14:30:00
1075768   2016-07-28 14:30:00
1075769   2016-07-28 14:30:00
1075770   2016-07-28 14:30:00
1075771   2016-07-28 14:30:00
1075772   2016-07-28 14:30:00
1075786   2016-07-28 14:30:00
1075774   2016-07-28 14:30:00
1075775   2016-07-28 14:30:00
1075776   2016-07-28 14:30:00
1075777   

Again, the boundary row falls in the first race of the day, so a midnight cutoff does not split a race between sets.

In [66]:
len(X_train_dev[X_train_dev['datetime'] >= cutoff])

228766

So our cutoff is 2016-07-28.

In [67]:
assert (len(X_train_dev[X_train_dev['datetime'] >= cutoff]) + \
        len(X_train_dev[X_train_dev['datetime'] < cutoff])) == len(X_train_dev)

In [68]:
X_dev = X_train_dev[X_train_dev['datetime'] >= cutoff]
X_train = X_train_dev[X_train_dev['datetime'] < cutoff]

In [69]:
len(X_test), len(X_dev), len(X_train)

(114392, 228766, 800666)

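Final sanity check: the three sets should account for every row, and no race (`rid`) should appear in more than one set.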

In [71]:
# Every row belongs to exactly one of the three sets
assert len(X_test) + len(X_dev) + len(X_train) == len(horses_paired_input)
# No race id is shared between any two sets
assert set(X_train['rid']).isdisjoint(X_dev['rid'])
assert set(X_train['rid']).isdisjoint(X_test['rid'])
assert set(X_dev['rid']).isdisjoint(X_test['rid'])
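The checks above cover row counts and race overlap. Since the stated goal is chronological separation, one could also compare timestamps directly. The sketch below is an optional addition, not part of the original notebook: `assert_chronological` is a hypothetical helper, and the toy frames stand in for `X_train`, `X_dev`, and `X_test`.

```python
import pandas as pd


def assert_chronological(train, dev, test, time_col="datetime"):
    """Hypothetical check: each set ends strictly before the next begins."""
    assert train[time_col].max() < dev[time_col].min(), "train overlaps dev in time"
    assert dev[time_col].max() < test[time_col].min(), "dev overlaps test in time"


# Toy stand-ins (made up) for X_train, X_dev and X_test.
train = pd.DataFrame({"datetime": pd.to_datetime(["2015-05-01 14:00", "2016-07-27 18:00"])})
dev = pd.DataFrame({"datetime": pd.to_datetime(["2016-07-28 14:30", "2019-07-23 20:00"])})
test = pd.DataFrame({"datetime": pd.to_datetime(["2019-07-24 16:50", "2020-12-05 14:28"])})

assert_chronological(train, dev, test)
```

In the notebook this would be called as `assert_chronological(X_train, X_dev, X_test)`.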

---

## Save Dataframes

In [72]:
X_train.to_csv(f"{BASE_DIR}/data/analysis/X_train_everything.csv", index=False)

In [73]:
X_dev.to_csv(f"{BASE_DIR}/data/analysis/X_dev_everything.csv", index=False)

In [74]:
X_test.to_csv(f"{BASE_DIR}/data/analysis/X_test_everything.csv", index=False)
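CSV files do not store column types, which is why `datetime` loaded with dtype `object` earlier in this notebook. When reading the saved splits back, the column can be parsed at load time with the `parse_dates` argument of `pd.read_csv`. The sketch below uses a made-up two-row frame and a temporary file rather than the real output paths.

```python
import os
import tempfile

import pandas as pd

# Toy frame (made up) standing in for one of the saved splits.
toy = pd.DataFrame({
    "rid": [1, 2],
    "datetime": pd.to_datetime(["2019-07-24 16:50:00", "2019-07-24 20:00:00"]),
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "X_toy.csv")  # hypothetical file name
    toy.to_csv(path, index=False)

    # Without parse_dates the column comes back as text.
    plain = pd.read_csv(path)
    # With parse_dates it is converted to a datetime column on load.
    parsed = pd.read_csv(path, parse_dates=["datetime"])
```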

---