# `remove_hurdles_horses_atic_and_races_atic.ipynb`

### Author: Anthony Hein

#### Last updated: 9/25/2021

# Overview:

Even after removing races with hurdles, we will still have enough entries, so we remove them to make the data more uniform. The trade-off is that the model is no longer expected to hold for races with hurdles; we will still test it against such races out of curiosity.

---

## Setup

In [1]:
import git
import os
from typing import List
from tqdm import tqdm
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
BASE_DIR = git.Repo(os.getcwd(), search_parent_directories=True).working_dir
BASE_DIR

'/Users/anthonyhein/Desktop/SML310/project'

---

## Load `horses_all_trim_intxn_clean.csv`

In [3]:
horses_all_trim_intxn_clean = pd.read_csv(f"{BASE_DIR}/data/csv/horses_all_trim_intxn_clean.csv", low_memory=False) 
horses_all_trim_intxn_clean.head()

Unnamed: 0,rid,horseName,age,saddle,decimalPrice,trainerName,jockeyName,position,positionL,dist,outHandicap,RPR,TR,OR,father,mother,gfather,weight
0,267255,Going For Broke,3.0,4.0,0.1,P C Haslam,Seb Sanders,1,,0.0,0.0,72.0,62.0,62.0,Simply Great,Empty Purse,Pennine Walk,58
1,267255,Pinchincha,3.0,3.0,0.266667,Dave Morris,Tony Clark,2,4.0,0.0,0.0,66.0,56.0,65.0,Priolo,Western Heights,Shirley Heights,60
2,267255,Skelton Sovereign,3.0,5.0,0.142857,Reg Hollinshead,D Griffiths,3,3.0,7.0,0.0,55.0,40.0,60.0,Contract Law,Mrs Lucky,Royal Match,55
3,267255,Fast Spin,3.0,6.0,0.380952,David Barron,Tony Culhane,4,7.0,14.0,0.0,38.0,30.0,59.0,Formidable I,Topwinder,Topsider,57
4,267255,As-Is,3.0,2.0,0.166667,Mark Johnston,J Weaver,5,7.0,21.0,0.0,29.0,21.0,65.0,Lomond,Capriati I,Diesis,60


In [4]:
horses_all_trim_intxn_clean.shape

(3199312, 18)

---

## Load `races_all_trim_intxn_clean.csv`

In [5]:
races_all_trim_intxn_clean = pd.read_csv(f"{BASE_DIR}/data/csv/races_all_trim_intxn_clean.csv", low_memory=False) 
races_all_trim_intxn_clean.head()

Unnamed: 0,rid,course,time,date,hurdles,prizes,winningTime,metric,countryCode,ncond,class
0,267255,Southwell (AW),03:40,97/01/01,,"[2752.25, 833.0, 406.5, 193.25]",106.9,1609.0,GB,0,5
1,297570,Southwell (AW),12:35,97/01/01,,"[1944.0, 544.0, 264.0]",91.0,1407.0,GB,0,6
2,334421,Southwell (AW),01:05,97/01/01,,"[2502.0, 702.0, 342.0]",150.7,2212.0,GB,0,6
3,366304,Southwell (AW),03:10,97/01/01,,"[2189.0, 614.0, 299.0]",108.6,1609.0,GB,0,6
4,13063,Southwell (AW),02:40,97/01/01,,"[2726.25, 825.0, 402.5, 191.25]",231.4,3318.5,GB,0,5


In [6]:
races_all_trim_intxn_clean.shape

(307896, 11)

---

## Remove Races w/ Hurdles

In [7]:
# A null `hurdles` entry is treated as "this race has no hurdles".
races_without_hurdles = races_all_trim_intxn_clean[races_all_trim_intxn_clean['hurdles'].isnull()]
len(races_without_hurdles)

208780

In [8]:
print(f"old shape {races_all_trim_intxn_clean.shape}")

races_all_trim_intxn_clean = races_without_hurdles

print(f"new shape {races_all_trim_intxn_clean.shape}")

assert races_all_trim_intxn_clean.shape[0] == len(races_without_hurdles)

old shape (307896, 11)
new shape (208780, 11)


In [9]:
print(f"old shape {horses_all_trim_intxn_clean.shape}")

# Keep only horses whose race ID survived the hurdle filter.
horses_all_trim_intxn_clean = horses_all_trim_intxn_clean[
    horses_all_trim_intxn_clean['rid'].isin(races_all_trim_intxn_clean['rid'])
]

print(f"new shape {horses_all_trim_intxn_clean.shape}")

old shape (3199312, 18)
new shape (2200511, 18)
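
As a quick check on the claim that enough entries remain, the share of rows kept can be computed directly from the shapes printed above. This is only a small arithmetic sketch; the variable names are illustrative and not part of the notebook.

```python
# Row counts copied from the shapes printed by the notebook.
races_before, races_after = 307896, 208780
horses_before, horses_after = 3199312, 2200511

# Fraction of rows that survive the hurdle filter.
races_kept = races_after / races_before
horses_kept = horses_after / horses_before

print(f"races kept:  {races_kept:.1%}")
print(f"horses kept: {horses_kept:.1%}")
```

Both ratios come out at a little over two thirds, so roughly one third of the races and of the horse entries are dropped.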


In [11]:
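# Both tables must cover exactly the same set of race IDs.
assert set(horses_all_trim_intxn_clean['rid']).symmetric_difference(set(races_all_trim_intxn_clean['rid'])) == set()
# No race with a non-null `hurdles` entry should remain.
assert len(races_all_trim_intxn_clean[~races_all_trim_intxn_clean['hurdles'].isnull()]) == 0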
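
The filter, the sync step, and the consistency checks can be bundled into one small function. The sketch below does this on toy data. The helper name `drop_hurdle_races` and the toy rows are made up for illustration; only the column names `rid` and `hurdles` come from the notebook.

```python
import pandas as pd


def drop_hurdle_races(races: pd.DataFrame, horses: pd.DataFrame):
    """Hypothetical helper: keep races with a null `hurdles` entry and their runners."""
    races_kept = races[races["hurdles"].isnull()].copy()
    horses_kept = horses[horses["rid"].isin(races_kept["rid"])].copy()
    return races_kept, horses_kept


# Toy data, invented for this example (not taken from the real CSVs).
toy_races = pd.DataFrame({"rid": [1, 2, 3], "hurdles": [None, "8 hurdles", None]})
toy_horses = pd.DataFrame({"rid": [1, 1, 2, 3], "horseName": ["A", "B", "C", "D"]})

toy_races_kept, toy_horses_kept = drop_hurdle_races(toy_races, toy_horses)

# Same two checks as in the notebook: matching race IDs, no hurdle races left.
assert set(toy_horses_kept["rid"]) == set(toy_races_kept["rid"])
assert toy_races_kept["hurdles"].notnull().sum() == 0
```

The set-equality check is equivalent to an empty symmetric difference. It passes only if every remaining race has at least one horse entry, which is what the notebook's assertion confirms for the real data.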

---

## Save Dataframes

In [12]:
horses_all_trim_intxn_clean.to_csv(f"{BASE_DIR}/data/csv/horses_aticn.csv", index=False)

In [13]:
races_all_trim_intxn_clean.to_csv(f"{BASE_DIR}/data/csv/races_aticn.csv", index=False)
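
The `head()` outputs earlier in this notebook show an `Unnamed: 0` column. pandas produces that column when a CSV is written with its index and then read back without `index_col`. Whether the input files were written that way is an assumption, since the notebook does not say. Because the column is already part of both dataframes, `index=False` in the cells above prevents a second index column but does not remove the existing one from the saved files. The sketch below reproduces the effect on an in-memory toy frame:

```python
import io

import pandas as pd

# Toy frame, invented for this example.
toy = pd.DataFrame({"rid": [267255, 297570], "course": ["Southwell (AW)", "Southwell (AW)"]})

# Write WITH the index, then read back: the index comes back as "Unnamed: 0".
buf = io.StringIO()
toy.to_csv(buf)
buf.seek(0)
with_index = pd.read_csv(buf)

# Write WITHOUT the index: the round trip preserves the original columns.
buf = io.StringIO()
toy.to_csv(buf, index=False)
buf.seek(0)
without_index = pd.read_csv(buf)

print(list(with_index.columns))
print(list(without_index.columns))
```

If the extra column is not wanted downstream, one option is to drop it before saving, for example with `df.drop(columns="Unnamed: 0")`.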

---