# `streamline_data_selection.ipynb`

### Author: Anthony Hein

#### Last updated: 11/3/2021

# Overview:

This notebook was written well after the original data cleaning took place. Its purpose is to reproduce the data selection in a much cleaner fashion: because we now know exactly which data will be used, fewer avenues have to be explored along the way. This does not invalidate the earlier work, nor can it substitute for it, since that would reverse cause and effect. In other words, we are able to write this slimmer notebook precisely because writing the larger notebooks made us knowledgeable about the data.

This is primarily for ease of reproduction by other users.

---

## Setup

In [18]:
from datetime import datetime
import git
import os
import re
from typing import List
from tqdm import tqdm
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [19]:
BASE_DIR = git.Repo(os.getcwd(), search_parent_directories=True).working_dir
BASE_DIR

'/Users/anthonyhein/Desktop/SML310/project'

---

## Load `horses_all.csv`

In [20]:
horses_all = pd.read_csv(f"{BASE_DIR}/data/streamline/horses_all.csv", low_memory=False) 
horses_all.head()

Unnamed: 0,rid,horseName,age,saddle,decimalPrice,isFav,trainerName,jockeyName,position,positionL,...,TR,OR,father,mother,gfather,runners,margin,weight,res_win,res_place
0,267255,Going For Broke,3.0,4.0,0.1,0,P C Haslam,Seb Sanders,1,,...,62.0,62.0,Simply Great,Empty Purse,Pennine Walk,6,1.168254,58,1.0,1.0
1,267255,Pinchincha,3.0,3.0,0.266667,0,Dave Morris,Tony Clark,2,4.0,...,56.0,65.0,Priolo,Western Heights,Shirley Heights,6,1.168254,60,0.0,1.0
2,267255,Skelton Sovereign,3.0,5.0,0.142857,0,Reg Hollinshead,D Griffiths,3,3.0,...,40.0,60.0,Contract Law,Mrs Lucky,Royal Match,6,1.168254,55,0.0,0.0
3,267255,Fast Spin,3.0,6.0,0.380952,1,David Barron,Tony Culhane,4,7.0,...,30.0,59.0,Formidable I,Topwinder,Topsider,6,1.168254,57,0.0,0.0
4,267255,As-Is,3.0,2.0,0.166667,0,Mark Johnston,J Weaver,5,7.0,...,21.0,65.0,Lomond,Capriati I,Diesis,6,1.168254,60,0.0,0.0


In [21]:
horses_all.shape

(4107315, 27)

In [22]:
horses_selected_streamline = horses_all.copy()
horses_selected_streamline.head()

Unnamed: 0,rid,horseName,age,saddle,decimalPrice,isFav,trainerName,jockeyName,position,positionL,...,TR,OR,father,mother,gfather,runners,margin,weight,res_win,res_place
0,267255,Going For Broke,3.0,4.0,0.1,0,P C Haslam,Seb Sanders,1,,...,62.0,62.0,Simply Great,Empty Purse,Pennine Walk,6,1.168254,58,1.0,1.0
1,267255,Pinchincha,3.0,3.0,0.266667,0,Dave Morris,Tony Clark,2,4.0,...,56.0,65.0,Priolo,Western Heights,Shirley Heights,6,1.168254,60,0.0,1.0
2,267255,Skelton Sovereign,3.0,5.0,0.142857,0,Reg Hollinshead,D Griffiths,3,3.0,...,40.0,60.0,Contract Law,Mrs Lucky,Royal Match,6,1.168254,55,0.0,0.0
3,267255,Fast Spin,3.0,6.0,0.380952,1,David Barron,Tony Culhane,4,7.0,...,30.0,59.0,Formidable I,Topwinder,Topsider,6,1.168254,57,0.0,0.0
4,267255,As-Is,3.0,2.0,0.166667,0,Mark Johnston,J Weaver,5,7.0,...,21.0,65.0,Lomond,Capriati I,Diesis,6,1.168254,60,0.0,0.0


---

## Load `races_all.csv`

In [23]:
races_all = pd.read_csv(f"{BASE_DIR}/data/streamline/races_all.csv", low_memory=False) 
races_all.head()

Unnamed: 0,rid,course,time,date,title,rclass,band,ages,distance,condition,hurdles,prizes,winningTime,prize,metric,countryCode,ncond,class
0,267255,Southwell (AW),03:40,97/01/01,New Year Handicap Class E,Class 5,0-70,3yo,1m,Standard,,"[2752.25, 833.0, 406.5, 193.25]",106.9,4184.0,1609.0,GB,0,5
1,297570,Southwell (AW),12:35,97/01/01,Resolution Claiming Stakes Class F (Div I),Class 6,,4yo+,7f,Standard,,"[1944.0, 544.0, 264.0]",91.0,2752.0,1407.0,GB,0,6
2,334421,Southwell (AW),01:05,97/01/01,One Too Many Median Auction Maiden Apprentices...,Class 6,,4-6yo,1m3f,Standard,,"[2502.0, 702.0, 342.0]",150.7,3546.0,2212.0,GB,0,6
3,366304,Southwell (AW),03:10,97/01/01,Morning Call Selling Stakes Class G Southwell ...,Class 6,,3yo,1m,Standard,,"[2189.0, 614.0, 299.0]",108.6,3102.0,1609.0,GB,0,6
4,13063,Southwell (AW),02:40,97/01/01,Thinking &amp; Drinking Handicap Class E,Class 5,0-70,4yo+,2m½f,Standard,,"[2726.25, 825.0, 402.5, 191.25]",231.4,4144.0,3318.5,GB,0,5


In [24]:
races_all.shape

(396572, 18)

In [25]:
races_selected_streamline = races_all.copy()
races_selected_streamline.head()

Unnamed: 0,rid,course,time,date,title,rclass,band,ages,distance,condition,hurdles,prizes,winningTime,prize,metric,countryCode,ncond,class
0,267255,Southwell (AW),03:40,97/01/01,New Year Handicap Class E,Class 5,0-70,3yo,1m,Standard,,"[2752.25, 833.0, 406.5, 193.25]",106.9,4184.0,1609.0,GB,0,5
1,297570,Southwell (AW),12:35,97/01/01,Resolution Claiming Stakes Class F (Div I),Class 6,,4yo+,7f,Standard,,"[1944.0, 544.0, 264.0]",91.0,2752.0,1407.0,GB,0,6
2,334421,Southwell (AW),01:05,97/01/01,One Too Many Median Auction Maiden Apprentices...,Class 6,,4-6yo,1m3f,Standard,,"[2502.0, 702.0, 342.0]",150.7,3546.0,2212.0,GB,0,6
3,366304,Southwell (AW),03:10,97/01/01,Morning Call Selling Stakes Class G Southwell ...,Class 6,,3yo,1m,Standard,,"[2189.0, 614.0, 299.0]",108.6,3102.0,1609.0,GB,0,6
4,13063,Southwell (AW),02:40,97/01/01,Thinking &amp; Drinking Handicap Class E,Class 5,0-70,4yo+,2m½f,Standard,,"[2726.25, 825.0, 402.5, 191.25]",231.4,4144.0,3318.5,GB,0,5


---

## Drop Horses w/ Bad Data

A horse has bad data if
* it has a `position` which is null or non-positive (`<= 0`),
* it has a `positionL` which is null _but the horse's position is not 1 or 40_,
* it lacks a `trainerName`, or
* it lacks a `jockeyName`.

_This is not a complete definition; we will add to it as needed when we clean the data._

These criteria are chosen because many of our engineered features rely on these fields, and a widespread lack of such data may make the missing values difficult to impute.

Note that if one horse in a given race has bad data, then we must drop all horses in that race.

In [27]:
print(len(horses_selected_streamline))

entries_w_bad_data = horses_selected_streamline[
    (horses_selected_streamline['position'] <= 0) |
    (horses_selected_streamline['position'].isnull()) |
    (~horses_selected_streamline['position'].isin([1,40]) & horses_selected_streamline['positionL'].isnull()) |
    (horses_selected_streamline['trainerName'].isnull()) |
    (horses_selected_streamline['jockeyName'].isnull())
]['rid']

horses_selected_streamline = horses_selected_streamline[
    ~horses_selected_streamline['rid'].isin(entries_w_bad_data)
]

print(len(horses_selected_streamline))

horses_selected_streamline.head()

4107315
3814910


Unnamed: 0,rid,horseName,age,saddle,decimalPrice,isFav,trainerName,jockeyName,position,positionL,...,TR,OR,father,mother,gfather,runners,margin,weight,res_win,res_place
0,267255,Going For Broke,3.0,4.0,0.1,0,P C Haslam,Seb Sanders,1,,...,62.0,62.0,Simply Great,Empty Purse,Pennine Walk,6,1.168254,58,1.0,1.0
1,267255,Pinchincha,3.0,3.0,0.266667,0,Dave Morris,Tony Clark,2,4.0,...,56.0,65.0,Priolo,Western Heights,Shirley Heights,6,1.168254,60,0.0,1.0
2,267255,Skelton Sovereign,3.0,5.0,0.142857,0,Reg Hollinshead,D Griffiths,3,3.0,...,40.0,60.0,Contract Law,Mrs Lucky,Royal Match,6,1.168254,55,0.0,0.0
3,267255,Fast Spin,3.0,6.0,0.380952,1,David Barron,Tony Culhane,4,7.0,...,30.0,59.0,Formidable I,Topwinder,Topsider,6,1.168254,57,0.0,0.0
4,267255,As-Is,3.0,2.0,0.166667,0,Mark Johnston,J Weaver,5,7.0,...,21.0,65.0,Lomond,Capriati I,Diesis,6,1.168254,60,0.0,0.0
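
As a small illustration of the race-level rule above (one bad entry removes the whole race), here is a minimal sketch on toy data. The helper name `drop_races_with_bad_entries` and the toy values are hypothetical and are not part of the original notebook; the conditions mirror the ones used in the cell above.

```python
import pandas as pd

# Toy data (hypothetical): race 1 is clean, race 2 has one entry without a jockey.
toy = pd.DataFrame({
    "rid": [1, 1, 2, 2],
    "position": [1, 2, 1, 2],
    "positionL": [None, 1.5, None, 0.5],
    "trainerName": ["A", "B", "C", "D"],
    "jockeyName": ["W", "X", None, "Z"],
})


def drop_races_with_bad_entries(horses: pd.DataFrame) -> pd.DataFrame:
    """Drop every row of any race that contains at least one bad entry."""
    bad = (
        horses["position"].isnull()
        | (horses["position"] <= 0)
        # a null positionL is only acceptable for positions 1 and 40
        | (~horses["position"].isin([1, 40]) & horses["positionL"].isnull())
        | horses["trainerName"].isnull()
        | horses["jockeyName"].isnull()
    )
    # collect the race ids of bad entries, then remove those races entirely
    bad_rids = horses.loc[bad, "rid"].unique()
    return horses[~horses["rid"].isin(bad_rids)]


toy_clean = drop_races_with_bad_entries(toy)
```

Only the two rows of race 1 survive: the second entry of race 2 is itself valid, but it is dropped because it shares a `rid` with a bad entry.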


## Select Subset of Races

We will only look at races which
* take place in Ireland, and
* do not have hurdles.

_This is not a complete definition; we will add to it as needed._

To know the number of runners, we will first have to build a dictionary of runner counts from the `horses_selected_streamline` dataset and then use it to create a new column in the `races_selected_streamline` dataset.

In [28]:
print(len(races_selected_streamline))

races_selected_streamline = races_selected_streamline[
    (races_selected_streamline['countryCode'] == 'IE') &
    (races_selected_streamline['hurdles'].isnull())
]

print(len(races_selected_streamline))

races_selected_streamline.head()

396572
36445


Unnamed: 0,rid,course,time,date,title,rclass,band,ages,distance,condition,hurdles,prizes,winningTime,prize,metric,countryCode,ncond,class
42,31147,Thurles (IRE),03:50,97/01/06,Archerstown I.N.H. Flat Race,,,5-6yo,2m,Good,,[],243.5,,3218.0,IE,1,0
63,302858,Thurles (IRE),01:15,97/01/09,Liffey Maiden Hurdle (Div 1),,,5yo+,2m3f,Good,,[],277.2,,3821.0,IE,1,0
72,68706,Thurles (IRE),01:45,97/01/09,Liffey Maiden Hurdle (Div 2),,,5yo+,2m3f,Good,,[],272.9,,3821.0,IE,1,0
74,195897,Thurles (IRE),03:45,97/01/09,Dodder I N H Flat Race,,,5yo+,2m,Good,,[],245.0,,3218.0,IE,1,0
95,305185,Leopardstown (IRE),03:40,97/01/11,Taney I.N.H. Flat Race,,,4yo,2m,Good To Yielding,,[],241.8,,3218.0,IE,4,0


---

## Symmetric Difference of Datasets

In [29]:
print(len(horses_selected_streamline))

horses_selected_streamline = horses_selected_streamline[
    horses_selected_streamline['rid'].isin(races_selected_streamline['rid'])
]

print(len(horses_selected_streamline))

3814910
339771


In [30]:
print(len(races_selected_streamline))

races_selected_streamline = races_selected_streamline[
    races_selected_streamline['rid'].isin(horses_selected_streamline['rid'])
]

print(len(races_selected_streamline))

36445
28269


In [31]:
assert set(horses_selected_streamline['rid']).symmetric_difference(set(races_selected_streamline['rid'])) == set()
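
The two filters followed by the assertion form a pattern that is repeated later in this notebook, so it could be wrapped in a helper. The sketch below is one possible way to do that; the name `sync_on_rid` and the toy frames are hypothetical. Two passes are enough: after the first filter every horse `rid` appears in the races, and after the second every race `rid` appears in the (unchanged) horses, so the two sets of `rid` values are equal.

```python
import pandas as pd


def sync_on_rid(horses: pd.DataFrame, races: pd.DataFrame):
    """Keep only the rows whose rid appears in both dataframes."""
    # first pass: drop horses whose race is not in the races dataframe
    horses = horses[horses["rid"].isin(races["rid"])]
    # second pass: drop races that no longer have any horses
    races = races[races["rid"].isin(horses["rid"])]
    return horses, races


# Toy data (hypothetical): rid 3 exists only for horses, rid 4 only for races.
toy_horses = pd.DataFrame({"rid": [1, 1, 2, 3]})
toy_races = pd.DataFrame({"rid": [1, 2, 4]})

synced_horses, synced_races = sync_on_rid(toy_horses, toy_races)

# the same check as in the cell above
assert set(synced_horses["rid"]).symmetric_difference(set(synced_races["rid"])) == set()
```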

---

## Only Look at Races w/ >= 3 and <= 14 Horses

In [32]:
# create dictionary from rid to runners

rid_to_runners = {}
for rid in tqdm(races_selected_streamline['rid']):
    rid_to_runners[rid] = len(horses_selected_streamline[horses_selected_streamline['rid'] == rid])

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 28269/28269 [00:12<00:00, 2247.69it/s]
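
The loop above filters the horses dataframe once per race. As a possible alternative, the same counts can be computed in a single pass with `value_counts` and attached with `map`. This is a sketch on toy data (the frames below are hypothetical) and not the method used in this notebook; it assumes, as is the case here after the previous section, that every race `rid` appears at least once among the horses, since a missing `rid` would otherwise produce `NaN`.

```python
import pandas as pd

# Toy data (hypothetical): three horses in race 10, two in race 20, one in race 30.
toy_horses = pd.DataFrame({"rid": [10, 10, 10, 20, 20, 30]})
toy_races = pd.DataFrame({"rid": [10, 20]})

# number of rows per rid, indexed by rid
runner_counts = toy_horses["rid"].value_counts()

# look up each race's count by its rid and store it as a new column
toy_races = toy_races.assign(runners=toy_races["rid"].map(runner_counts))
```

Here `toy_races` ends up with 3 runners for race 10 and 2 for race 20, and the count for race 30 is ignored because that race is not in `toy_races`.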


In [33]:
# create dataframe from dictionary

rename_cols = {
    'index': 'rid',
    0: 'runners',
}

df_runners = pd.DataFrame.from_dict(rid_to_runners, orient='index').reset_index().rename(columns=rename_cols)
df_runners.head()

Unnamed: 0,rid,runners
0,302858,6
1,291347,9
2,75447,8
3,358038,10
4,78982,4


In [34]:
# merge w/ races_selected_streamline

races_selected_streamline = races_selected_streamline.merge(df_runners, how='inner', on='rid')
races_selected_streamline.head()

Unnamed: 0,rid,course,time,date,title,rclass,band,ages,distance,condition,hurdles,prizes,winningTime,prize,metric,countryCode,ncond,class,runners
0,302858,Thurles (IRE),01:15,97/01/09,Liffey Maiden Hurdle (Div 1),,,5yo+,2m3f,Good,,[],277.2,,3821.0,IE,1,0,6
1,291347,Punchestown (IRE),03:40,97/02/16,Ericsson G.S.M. Grand National Trial Handicap ...,,,5yo+,3m2f,Soft,,[],447.2,,5229.0,IE,5,0,9
2,75447,Listowel (IRE),03:00,97/03/01,Ballybunion E.B.F. Beginners S'chase,,,4yo+,2m2f,Soft,,[],318.4,,3620.0,IE,5,0,8
3,358038,Punchestown (IRE),02:40,97/04/24,Quinns Of Baltinglass Chase (La Touche) (Cross...,,,5yo+,4m1f,Good,,[],533.9,,6637.0,IE,1,0,10
4,78982,Dundalk (IRE),05:15,97/05/02,Carlingford Handicap Chase,,0-109,4yo+,3m,Firm,,[],370.3,,4827.0,IE,8,0,4


In [35]:
print(len(races_selected_streamline))

races_selected_streamline = races_selected_streamline[
    (races_selected_streamline['runners'] >= 3) &
    (races_selected_streamline['runners'] <= 14)
]

print(len(races_selected_streamline))

races_selected_streamline.head()

28269
20574


Unnamed: 0,rid,course,time,date,title,rclass,band,ages,distance,condition,hurdles,prizes,winningTime,prize,metric,countryCode,ncond,class,runners
0,302858,Thurles (IRE),01:15,97/01/09,Liffey Maiden Hurdle (Div 1),,,5yo+,2m3f,Good,,[],277.2,,3821.0,IE,1,0,6
1,291347,Punchestown (IRE),03:40,97/02/16,Ericsson G.S.M. Grand National Trial Handicap ...,,,5yo+,3m2f,Soft,,[],447.2,,5229.0,IE,5,0,9
2,75447,Listowel (IRE),03:00,97/03/01,Ballybunion E.B.F. Beginners S'chase,,,4yo+,2m2f,Soft,,[],318.4,,3620.0,IE,5,0,8
3,358038,Punchestown (IRE),02:40,97/04/24,Quinns Of Baltinglass Chase (La Touche) (Cross...,,,5yo+,4m1f,Good,,[],533.9,,6637.0,IE,1,0,10
4,78982,Dundalk (IRE),05:15,97/05/02,Carlingford Handicap Chase,,0-109,4yo+,3m,Firm,,[],370.3,,4827.0,IE,8,0,4


---

## Symmetric Difference of Datasets

In [36]:
print(len(horses_selected_streamline))

horses_selected_streamline = horses_selected_streamline[
    horses_selected_streamline['rid'].isin(races_selected_streamline['rid'])
]

print(len(horses_selected_streamline))

339771
205138


In [37]:
print(len(races_selected_streamline))

races_selected_streamline = races_selected_streamline[
    races_selected_streamline['rid'].isin(horses_selected_streamline['rid'])
]

print(len(races_selected_streamline))

20574
20574


In [38]:
assert set(horses_selected_streamline['rid']).symmetric_difference(set(races_selected_streamline['rid'])) == set()

---

## Save Dataframes

In [39]:
horses_selected_streamline.to_csv(f"{BASE_DIR}/data/streamline/horses_selected.csv", index=False)

In [40]:
races_selected_streamline.to_csv(f"{BASE_DIR}/data/streamline/races_selected.csv", index=False)

---