# `clean_horses_all_trim_intxn_and_races_all_trim_intxn.ipynb`

### Author: Anthony Hein

#### Last updated: 9/24/2021

# Overview:

In attempting to combine `horses_all.csv` and `races_all.csv`, we found that this is computationally intractable due to the vast size of these files.

Now, we attempt to clean `horses_all(_trim(_intxn)).csv` and `races_all(_trim(_intxn)).csv` to remove rows with missing or incorrect data.

---

## Setup

In [1]:
import git
import os
import json
import math
from typing import List, Dict, Union
from tqdm import tqdm
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
BASE_DIR = git.Repo(os.getcwd(), search_parent_directories=True).working_dir
BASE_DIR

'/Users/anthonyhein/Desktop/SML310/project'

In [3]:
import sys
sys.path.append(f'{BASE_DIR}/utils/')

from rid_to_runners import RID_TO_RUNNERS
from length_abbrv_to_dist import LENGTH_ABBRV_TO_DIST

---

## Load `horses_all(_trim(_intxn)).csv`

In [4]:
horses_all_trim_intxn = pd.read_csv(f"{BASE_DIR}/data/csv/horses_all_trim_intxn.csv", low_memory=False) 
horses_all_trim_intxn.head()

Unnamed: 0,rid,horseName,age,saddle,decimalPrice,trainerName,jockeyName,position,positionL,dist,outHandicap,RPR,TR,OR,father,mother,gfather,weight
0,267255,Going For Broke,3.0,4.0,0.1,P C Haslam,Seb Sanders,1,,,,72.0,62.0,62.0,Simply Great,Empty Purse,Pennine Walk,58
1,267255,Pinchincha,3.0,3.0,0.266667,Dave Morris,Tony Clark,2,4.0,,,66.0,56.0,65.0,Priolo,Western Heights,Shirley Heights,60
2,267255,Skelton Sovereign,3.0,5.0,0.142857,Reg Hollinshead,D Griffiths,3,3.0,7.0,,55.0,40.0,60.0,Contract Law,Mrs Lucky,Royal Match,55
3,267255,Fast Spin,3.0,6.0,0.380952,David Barron,Tony Culhane,4,7.0,14.0,,38.0,30.0,59.0,Formidable I,Topwinder,Topsider,57
4,267255,As-Is,3.0,2.0,0.166667,Mark Johnston,J Weaver,5,7.0,21.0,,29.0,21.0,65.0,Lomond,Capriati I,Diesis,60


In [5]:
horses_all_trim_intxn.shape

(4107315, 18)

In [6]:
horses_all_trim_intxn_clean = horses_all_trim_intxn.copy(deep=True)

---

## Load `races_all(_trim(_intxn)).csv`

In [7]:
races_all_trim_intxn = pd.read_csv(f"{BASE_DIR}/data/csv/races_all_trim_intxn.csv", low_memory=False) 
races_all_trim_intxn.head()

Unnamed: 0,rid,course,time,date,hurdles,prizes,winningTime,metric,countryCode,ncond,class
0,267255,Southwell (AW),03:40,97/01/01,,"[2752.25, 833.0, 406.5, 193.25]",106.9,1609.0,GB,0,5
1,297570,Southwell (AW),12:35,97/01/01,,"[1944.0, 544.0, 264.0]",91.0,1407.0,GB,0,6
2,334421,Southwell (AW),01:05,97/01/01,,"[2502.0, 702.0, 342.0]",150.7,2212.0,GB,0,6
3,366304,Southwell (AW),03:10,97/01/01,,"[2189.0, 614.0, 299.0]",108.6,1609.0,GB,0,6
4,13063,Southwell (AW),02:40,97/01/01,,"[2726.25, 825.0, 402.5, 191.25]",231.4,3318.5,GB,0,5


In [8]:
races_all_trim_intxn.shape

(395186, 11)

In [9]:
races_all_trim_intxn_clean = races_all_trim_intxn.copy(deep=True)

---

## Clean Data in `races_all(_trim(_intxn)).csv`

In [10]:
print("The following columns have null values:")

for column in races_all_trim_intxn.columns:
    # `.any()` is vectorized, unlike Python's built-in `sum` over the Series
    if races_all_trim_intxn[column].isnull().any():
        print(f"  - {column}")

The following columns have null values:
  - hurdles


Only `hurdles` has null values. We will punt on this until we know whether we can safely remove hurdles and still have enough data.

## Clean Data in `horses_all(_trim(_intxn)).csv`

In [11]:
print("The following columns have null values:")

for column in horses_all_trim_intxn.columns:
    # `.any()` is vectorized, unlike Python's built-in `sum` over the Series
    if horses_all_trim_intxn[column].isnull().any():
        print(f"  - {column}")

The following columns have null values:
  - age
  - saddle
  - trainerName
  - jockeyName
  - positionL
  - dist
  - outHandicap
  - RPR
  - TR
  - OR
  - father
  - mother
  - gfather


We now investigate these one at a time.

### `age`

As we suspect that age may be an important feature in horse racing, a horse with a missing or negative age should be dropped from the dataset, unless this age can be inferred from other entries.

In [12]:
# find all entries in df with bad age data
entries_w_bad_age = horses_all_trim_intxn_clean[
    horses_all_trim_intxn_clean['age'].isnull() |
    (horses_all_trim_intxn_clean['age'] < 0)
]

In [13]:
len(entries_w_bad_age)

512

In [14]:
# find all unique horses with bad age data
horses_w_bad_age = entries_w_bad_age['horseName'].unique()

In [15]:
len(horses_w_bad_age)

96

In [16]:
# for every unique horse with bad age data
# see if they raced in other races
# and if they did,
# then did they have a valid age in that race?

horses_w_ages_can_be_corrected = []
horses_corrective_age_data = {}

for horse_name in horses_w_bad_age:
    entries_same_horse = horses_all_trim_intxn_clean[horses_all_trim_intxn_clean['horseName'] == horse_name]
    entries_same_horse_valid_age = entries_same_horse[
                                       entries_same_horse['age'].notnull() &
                                       (entries_same_horse['age'] >= 0)
                                   ]
    races_and_ages = entries_same_horse_valid_age[['rid', 'age']]
    
    if len(races_and_ages) > 0:
        horses_w_ages_can_be_corrected.append(horse_name)
        horses_corrective_age_data[horse_name] = {
            'rid': races_and_ages['rid'].iloc[0],
            'age': races_and_ages['age'].iloc[0]
        }

In [17]:
horses_w_ages_can_be_corrected

['We Will See',
 'Polar Gale',
 "Ain't Misbehavin",
 'Father Pat',
 'Big Country',
 'Moon Tiger',
 'Combe Hay',
 'Long Island',
 'Western Frontier',
 'Going Ballistic',
 'Young At Heart',
 'Happy Forever',
 'Gateado',
 'Local Hero',
 'Don Gwarli']

In [18]:
# counts of rows that can be corrected for each name
entries_w_bad_age[entries_w_bad_age['horseName'].isin(horses_w_ages_can_be_corrected)]['horseName'].value_counts()

Father Pat          27
Going Ballistic     27
Western Frontier     7
Young At Heart       7
Ain't Misbehavin     5
Local Hero           5
Polar Gale           4
Moon Tiger           4
We Will See          3
Big Country          3
Gateado              3
Combe Hay            2
Long Island          2
Happy Forever        1
Don Gwarli           1
Name: horseName, dtype: int64

In [19]:
horses_corrective_age_data

{'We Will See': {'rid': 388337, 'age': 3.0},
 'Polar Gale': {'rid': 88503, 'age': 9.0},
 "Ain't Misbehavin": {'rid': 95672, 'age': 4.0},
 'Father Pat': {'rid': 1274, 'age': 4.0},
 'Big Country': {'rid': 48806, 'age': 5.0},
 'Moon Tiger': {'rid': 250504, 'age': 4.0},
 'Combe Hay': {'rid': 145460, 'age': 2.0},
 'Long Island': {'rid': 141347, 'age': 3.0},
 'Western Frontier': {'rid': 163614, 'age': 2.0},
 'Going Ballistic': {'rid': 394841, 'age': 0.0},
 'Young At Heart': {'rid': 241006, 'age': 5.0},
 'Happy Forever': {'rid': 376944, 'age': 3.0},
 'Gateado': {'rid': 16814, 'age': 4.0},
 'Local Hero': {'rid': 287144, 'age': 2.0},
 'Don Gwarli': {'rid': 234470, 'age': 3.0}}

In [20]:
def get_year_from_date_str(date_str: str) -> int:
    """
    Given a `date_str` of the form `YY/MM/DD`, return
    `YYYY` as an `int`.
    """
    year = int(date_str[:2])
    year = (year + 2000) if year < 50 else (year + 1900)
    return year
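
As a quick check of the two-digit-year rule, the following self-contained sketch repeats the function and applies it to a few dates. Only the first date appears in the dataset preview; the others are illustrative. The pivot at 50 is the notebook's own choice: two-digit years below 50 map to the 2000s, the rest to the 1900s.

```python
def get_year_from_date_str(date_str: str) -> int:
    """Map a `YY/MM/DD` string to a four-digit year using a pivot of 50."""
    year = int(date_str[:2])
    return year + 2000 if year < 50 else year + 1900


# "97/01/01" comes from the preview of races_all_trim_intxn; the rest are made up
samples = ["97/01/01", "13/06/15", "49/12/31", "50/01/01"]
years = {s: get_year_from_date_str(s) for s in samples}

for date_str, year in years.items():
    print(date_str, "->", year)
```

Any date whose two-digit year is below 50 is read as 20YY, so this rule would misread races from before 1950, should the data ever contain any.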

In [21]:
suspected_age_data = {}

for horse_name in horses_w_ages_can_be_corrected:
    
    # all rows with this horse
    entries_same_horse = horses_all_trim_intxn_clean[horses_all_trim_intxn_clean['horseName'] == horse_name]
    
    # only those with bad age data or with rid of known age
    entries_same_horse_bad_age = entries_same_horse[
        entries_same_horse['age'].isnull() | 
        (entries_same_horse['age'] < 0) |
        (entries_same_horse['rid'] == horses_corrective_age_data[horse_name]['rid'])
    ]
    
    races_same_horse_bad_age = races_all_trim_intxn_clean.loc[
        races_all_trim_intxn_clean['rid'].isin(entries_same_horse_bad_age['rid'])
    ]
    
    rid_of_known_age = horses_corrective_age_data[horse_name]['rid']
    race_of_known_age = races_same_horse_bad_age.loc[races_same_horse_bad_age['rid'] == rid_of_known_age]
    date_of_known_age = race_of_known_age['date'].iloc[0]
    year_of_known_age = get_year_from_date_str(date_of_known_age)
    age_at_year_of_known_age = horses_corrective_age_data[horse_name]['age']
    
    suspected_ages_at_races_same_horse_bad_age = {
        race['rid']: {
            'year': get_year_from_date_str(race['date']),
            'age': age_at_year_of_known_age + (get_year_from_date_str(race['date']) - year_of_known_age)
        }
        for _, race
        in races_same_horse_bad_age[['rid', 'date']].iterrows()
    }
    
    suspected_age_data[horse_name] = suspected_ages_at_races_same_horse_bad_age

suspected_age_data_slice = {k:v for i, (k,v) in enumerate(suspected_age_data.items()) if i < 3}

print(f"Preview suspected age data:\n{json.dumps(suspected_age_data_slice, indent=2)}")

Preview suspected age data:
{
  "We Will See": {
    "303071": {
      "year": 1994,
      "age": -16.0
    },
    "51656": {
      "year": 1994,
      "age": -16.0
    },
    "236425": {
      "year": 1994,
      "age": -16.0
    },
    "388337": {
      "year": 2013,
      "age": 3.0
    }
  },
  "Polar Gale": {
    "325801": {
      "year": 1994,
      "age": -6.0
    },
    "83954": {
      "year": 1995,
      "age": -5.0
    },
    "256349": {
      "year": 1995,
      "age": -5.0
    },
    "202663": {
      "year": 1995,
      "age": -5.0
    },
    "88503": {
      "year": 2009,
      "age": 9.0
    }
  },
  "Ain't Misbehavin": {
    "50453": {
      "year": 1995,
      "age": -17.0
    },
    "252762": {
      "year": 1995,
      "age": -17.0
    },
    "278651": {
      "year": 1995,
      "age": -17.0
    },
    "360471": {
      "year": 1995,
      "age": -17.0
    },
    "85527": {
      "year": 1995,
      "age": -17.0
    },
    "95672": {
      "year": 2016,
      "age"

Unfortunately, we still cannot correct many entries: extrapolating from the known age produces negative ages for several of the entries that have `null` values. This must be due to different physical horses sharing the same name (and racing at different points in time). Let's remove these predictions.
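
To make the failure mode concrete, here is a minimal sketch of the extrapolation on a single horse. It assumes, as the cell above does, that a horse ages one year per calendar year; the helper name `extrapolate_age` is hypothetical. The figures mirror the _We Will See_ preview: a known age of 3 in 2013, extrapolated back to a race in 1994.

```python
def extrapolate_age(known_age: float, known_year: int, target_year: int) -> float:
    """Shift a known age by the number of calendar years between two races."""
    return known_age + (target_year - known_year)


# Reference race: age 3 in 2013 (values from the preview output)
known_age, known_year = 3.0, 2013

# Extrapolating to a 1994 race gives an impossible age, which suggests the
# 1994 runner is a different horse that happens to share the name.
age_1994 = extrapolate_age(known_age, known_year, 1994)
print(age_1994)  # -16.0

# Only non-negative extrapolations are candidates for correction.
is_candidate = age_1994 >= 0
print(is_candidate)  # False
```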

In [22]:
def remove_negative_suspected_ages(
        d: Dict[str, Dict[str, Union[int, float]]]
    ) -> Dict[str, Dict[str, Union[int, float]]]:
    """
    For a given dictionary `d` which has several keys which are rids and each contain
    a dictionary with a `year` and `age` key, return the same dictionary with any rids
    containing negative `age` values removed.
    """
    return {k:v for (k,v) in d.items() if v['age'] >= 0}

In [23]:
suspected_age_data_only_positive = {
    k: remove_negative_suspected_ages(v)
    for k, v in suspected_age_data.items()
}
print(f"Suspected age data:\n{json.dumps(suspected_age_data_only_positive, indent=2)}")

Suspected age data:
{
  "We Will See": {
    "388337": {
      "year": 2013,
      "age": 3.0
    }
  },
  "Polar Gale": {
    "88503": {
      "year": 2009,
      "age": 9.0
    }
  },
  "Ain't Misbehavin": {
    "95672": {
      "year": 2016,
      "age": 4.0
    }
  },
  "Father Pat": {
    "1274": {
      "year": 2007,
      "age": 4.0
    }
  },
  "Big Country": {
    "48806": {
      "year": 2018,
      "age": 5.0
    }
  },
  "Moon Tiger": {
    "250504": {
      "year": 1997,
      "age": 4.0
    }
  },
  "Combe Hay": {
    "145460": {
      "year": 2015,
      "age": 2.0
    }
  },
  "Long Island": {
    "141347": {
      "year": 2018,
      "age": 3.0
    }
  },
  "Western Frontier": {
    "163614": {
      "year": 2018,
      "age": 2.0
    }
  },
  "Going Ballistic": {
    "394841": {
      "year": 2011,
      "age": 0.0
    }
  },
  "Young At Heart": {
    "241006": {
      "year": 1996,
      "age": 5.0
    },
    "303011": {
      "year": 2012,
      "age": 21.0
    },
 

Now, recall that our suspected age data contains the race on which we are basing all other ages. Therefore, we also remove those horses for which only one race is listed, since that single race is data already in the dataset.

In [24]:
suspected_age_data_only_positive_trunc = {
    k: v
    for k, v in suspected_age_data_only_positive.items()
    if len(v) > 1
}
print(f"Suspected age data:\n{json.dumps(suspected_age_data_only_positive_trunc, indent=2)}")

Suspected age data:
{
  "Young At Heart": {
    "241006": {
      "year": 1996,
      "age": 5.0
    },
    "303011": {
      "year": 2012,
      "age": 21.0
    },
    "382835": {
      "year": 2012,
      "age": 21.0
    },
    "250387": {
      "year": 2011,
      "age": 20.0
    },
    "201817": {
      "year": 2011,
      "age": 20.0
    },
    "14908": {
      "year": 2011,
      "age": 20.0
    },
    "36674": {
      "year": 2011,
      "age": 20.0
    },
    "308265": {
      "year": 2011,
      "age": 20.0
    }
  },
  "Gateado": {
    "16814": {
      "year": 2004,
      "age": 4.0
    },
    "213732": {
      "year": 2002,
      "age": 2.0
    },
    "35588": {
      "year": 2002,
      "age": 2.0
    },
    "326168": {
      "year": 2002,
      "age": 2.0
    }
  },
  "Don Gwarli": {
    "234470": {
      "year": 2003,
      "age": 3.0
    },
    "244303": {
      "year": 2002,
      "age": 2.0
    }
  }
}


There are now few enough entries that we can inspect these individually.

Clearly, the suspected age data for _Young At Heart_ is not correct, since even the oldest horses that compete usually do not exceed 13 years of age. A closer look at all races with _Young At Heart_ reveals that there are two different horses with this name, with age data systematically misentered for one of them. There is no corrective action we can take.
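
This kind of plausibility check could also be automated. The sketch below flags any horse whose extrapolated ages exceed a cap. The cap of 13 comes from the rule of thumb in the text and is an assumption, not a hard limit, and `flag_implausible` is a hypothetical helper. The input mirrors the structure of `suspected_age_data_only_positive_trunc`, abbreviated to two races per horse.

```python
from typing import Dict, List

MAX_PLAUSIBLE_AGE = 13  # assumption: rule-of-thumb cap, not a hard limit


def flag_implausible(data: Dict[str, Dict[str, Dict[str, float]]],
                     max_age: float = MAX_PLAUSIBLE_AGE) -> List[str]:
    """Return horse names with at least one extrapolated age above `max_age`."""
    return [
        name
        for name, races in data.items()
        if any(entry["age"] > max_age for entry in races.values())
    ]


# Abbreviated copy of the suspected age data (two rids per horse)
suspected = {
    "Young At Heart": {"241006": {"year": 1996, "age": 5.0},
                       "303011": {"year": 2012, "age": 21.0}},
    "Gateado": {"16814": {"year": 2004, "age": 4.0},
                "213732": {"year": 2002, "age": 2.0}},
    "Don Gwarli": {"234470": {"year": 2003, "age": 3.0},
                   "244303": {"year": 2002, "age": 2.0}},
}

flagged = flag_implausible(suspected)
print(flagged)  # ['Young At Heart']
```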

In [25]:
horses_all_trim_intxn_clean[horses_all_trim_intxn_clean['horseName'] == 'Young At Heart']

Unnamed: 0,rid,horseName,age,saddle,decimalPrice,trainerName,jockeyName,position,positionL,dist,outHandicap,RPR,TR,OR,father,mother,gfather,weight
184591,241006,Young At Heart,5.0,9.0,0.047619,M J Haynes,D Skyrme,40,,,4.0,,,95.0,Classic Secret,Yenillik,Rusticaro,63
203556,394305,Young At Heart,3.0,0.0,0.111111,M J Haynes,John Reid,6,1,10.25,,78.0,80.0,,Classic Secret,Yenillik,Rusticaro,57
214636,173518,Young At Heart,3.0,0.0,0.090909,M J Haynes,Frankie Dettori,8,2,9.75,,59.0,66.0,72.0,Classic Secret,Yenillik,Rusticaro,56
226155,12584,Young At Heart,3.0,0.0,0.111111,M J Haynes,Frankie Dettori,11,hd,10.5,,54.0,54.0,70.0,Classic Secret,Yenillik,Rusticaro,53
228546,389071,Young At Heart,3.0,0.0,0.111111,M J Haynes,Pat Eddery,9,1.5,14.5,,50.0,61.0,65.0,Classic Secret,Yenillik,Rusticaro,56
231625,25475,Young At Heart,3.0,0.0,0.111111,M J Haynes,John Reid,6,5,10.75,,52.0,24.0,,Classic Secret,Yenillik,Rusticaro,53
237678,6890,Young At Heart,3.0,0.0,0.058824,M J Haynes,Dominic Toole,6,20,29.5,,,5.0,57.0,Classic Secret,Yenillik,Rusticaro,57
242882,54353,Young At Heart,3.0,0.0,0.066667,M J Haynes,D Skyrme,1,,,,91.0,70.0,,Classic Secret,Yenillik,Rusticaro,68
683302,314548,Young At Heart,2.0,0.0,0.029412,M J Haynes,John Reid,8,4,17.5,,,44.0,,Classic Secret,Yenillik,Rusticaro,57
686903,254015,Young At Heart,2.0,0.0,0.038462,M J Haynes,Gary Carter,5,2,6.25,,70.0,72.0,,Classic Secret,Yenillik,Rusticaro,57


We can, however, manually correct _Gateado_ and _Don Gwarli_, which we do below.

In [26]:
# correct Gateado
idx = horses_all_trim_intxn_clean.index[(horses_all_trim_intxn_clean['horseName'] == "Gateado") &
                                        (horses_all_trim_intxn_clean['rid'] == 213732)][0]
horses_all_trim_intxn_clean.at[idx, 'age'] = 2.0


idx = horses_all_trim_intxn_clean.index[(horses_all_trim_intxn_clean['horseName'] == "Gateado") &
                                        (horses_all_trim_intxn_clean['rid'] == 35588)][0]
horses_all_trim_intxn_clean.at[idx, 'age'] = 2.0


idx = horses_all_trim_intxn_clean.index[(horses_all_trim_intxn_clean['horseName'] == "Gateado") &
                                        (horses_all_trim_intxn_clean['rid'] == 326168)][0]
horses_all_trim_intxn_clean.at[idx, 'age'] = 2.0

In [27]:
# correct Don Gwarli
idx = horses_all_trim_intxn_clean.index[(horses_all_trim_intxn_clean['horseName'] == "Don Gwarli") &
                                        (horses_all_trim_intxn_clean['rid'] == 244303)][0]
horses_all_trim_intxn_clean.at[idx, 'age'] = 2.0

In [28]:
# check that Gateado and Don Gwarli are OK
assert len(horses_all_trim_intxn_clean[(horses_all_trim_intxn_clean['horseName'] == "Gateado") &
                                       (horses_all_trim_intxn_clean['age'].isnull())]) == 0
assert len(horses_all_trim_intxn_clean[(horses_all_trim_intxn_clean['horseName'] == "Don Gwarli") &
                                       (horses_all_trim_intxn_clean['age'].isnull())]) == 0

At this point in the code, we have corrected all the missing and negative ages we could. So, we will remove whichever ones we could not correct.
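
The cells that follow remove whole races rather than single rows, presumably so that every remaining race keeps its full field of runners. The sketch below shows the pattern on toy frames (the data and the names `horses`, `races`, and `bad_rids` are illustrative, not the notebook's variables): collect the `rid`s of the bad rows, drop those `rid`s from both tables, and check that the two tables still cover the same races.

```python
import pandas as pd

# Toy data: race 2 contains one runner with a missing age.
horses = pd.DataFrame({
    "rid": [1, 1, 2, 2, 3],
    "horseName": ["A", "B", "C", "D", "E"],
    "age": [3.0, 4.0, None, 5.0, 2.0],
})
races = pd.DataFrame({"rid": [1, 2, 3], "course": ["X", "Y", "Z"]})

# rids of every race that has at least one bad entry
bad_rows = horses["age"].isnull() | (horses["age"] < 0)
bad_rids = horses.loc[bad_rows, "rid"].unique()

# drop those races from both tables
races_clean = races[~races["rid"].isin(bad_rids)]
horses_clean = horses[~horses["rid"].isin(bad_rids)]

# both tables should still describe exactly the same set of races
assert set(horses_clean["rid"]) == set(races_clean["rid"])
print(races_clean["rid"].tolist())
```

Note that runner `D` is dropped even though its own age is valid, because it shares a race with a bad entry.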

In [29]:
# find all entries in df with bad age data
entries_w_bad_age = horses_all_trim_intxn_clean[
    horses_all_trim_intxn_clean['age'].isnull() |
    (horses_all_trim_intxn_clean['age'] < 0)
]

In [30]:
len(entries_w_bad_age)

508

In [31]:
# drop all races where there is a horse without an age
old_shape = races_all_trim_intxn_clean.shape
print(f"old shape {old_shape}")

races_all_trim_intxn_clean = races_all_trim_intxn_clean[
                                 ~races_all_trim_intxn_clean['rid'].isin(entries_w_bad_age['rid'])
                             ]

new_shape = races_all_trim_intxn_clean.shape
print(f"new shape {new_shape}")

assert old_shape[0] - len(entries_w_bad_age['rid'].unique()) == new_shape[0]

old shape (395186, 11)
new shape (394690, 11)


In [32]:
# drop all horse entries that are part of a race where there is a horse without an age
old_shape = horses_all_trim_intxn_clean.shape
print(f"old shape {old_shape}")

horses_all_trim_intxn_clean = horses_all_trim_intxn_clean[
                                 ~horses_all_trim_intxn_clean['rid'].isin(entries_w_bad_age['rid'])
                             ]

new_shape = horses_all_trim_intxn_clean.shape
print(f"new shape {new_shape}")

old shape (4107315, 18)
new shape (4100741, 18)


In [33]:
# sanity checks
assert set(horses_all_trim_intxn_clean['rid']).symmetric_difference(set(races_all_trim_intxn_clean['rid'])) == set()
assert len(horses_all_trim_intxn_clean['rid'][
    horses_all_trim_intxn_clean['age'].isnull() |
    (horses_all_trim_intxn_clean['age'] < 0)
]) == 0

### `saddle`

We similarly suspect `saddle` to be correlated with performance in horse races so we check to see if there is bad saddle data.

In [34]:
len(horses_all_trim_intxn_clean[horses_all_trim_intxn_clean['saddle'] == 0])

546345

In [35]:
len(horses_all_trim_intxn_clean[horses_all_trim_intxn_clean['saddle'] < 0])

0

In [36]:
len(horses_all_trim_intxn_clean[horses_all_trim_intxn_clean['saddle'].isnull()])

178606

Clearly, `saddle` data is missing from a substantial portion of the data. However, we start with a massive amount of data, which makes each truncation more permissible. For now, we will delete rows with bad saddle data. If we need to be less aggressive, we can always do so.

In [37]:
# find all entries in df with bad saddle data
entries_w_bad_saddle = horses_all_trim_intxn_clean[
    horses_all_trim_intxn_clean['saddle'].isnull() |
    (horses_all_trim_intxn_clean['saddle'] == 0)
]

In [38]:
len(entries_w_bad_saddle['rid'])

724951

We can try to correct races that have exactly one entry with bad `saddle` data.
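
The idea is that if a race with `n` runners uses each saddle number from 1 to `n` exactly once, the numbers sum to `n(n+1)/2`, so a single missing number equals that total minus the sum of the known numbers. A minimal sketch with a hypothetical helper `infer_missing_saddle` and toy inputs follows. The cell below accepts any positive result; the sketch instead uses a stricter set comparison, to show how non-consecutive numbering could be detected.

```python
from typing import List, Optional


def infer_missing_saddle(known: List[int], n_runners: int) -> Optional[int]:
    """
    Infer the single missing saddle number in a race with `n_runners` runners,
    assuming saddles are numbered 1..n_runners with no gaps or repeats.
    Return None when the known numbers are inconsistent with that assumption.
    """
    expected_sum = n_runners * (n_runners + 1) // 2
    candidate = expected_sum - sum(known)

    # The candidate is only trustworthy if it completes the set 1..n exactly.
    if set(known) | {candidate} == set(range(1, n_runners + 1)):
        return candidate
    return None


# Five runners, saddle 3 is missing: 15 - (1 + 2 + 4 + 5) = 3
print(infer_missing_saddle([1, 2, 4, 5], 5))  # 3

# Non-consecutive numbering (e.g. a runner wearing 8) cannot be resolved
print(infer_missing_saddle([1, 2, 3, 8], 5))  # None
```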

In [39]:
rids_exactly_one_bad_saddle = []

for k, v in entries_w_bad_saddle['rid'].value_counts().items():
    if v == 1:
        rids_exactly_one_bad_saddle.append(k)

In [40]:
len(rids_exactly_one_bad_saddle)

44

In [41]:
# the existence of alternates who start at nonconsecutive saddle numbers
# makes this a touch difficult, still cannot correct _all_ races with
# exactly one bad saddle

rids_have_alternate_saddle = []

for rid in rids_exactly_one_bad_saddle:
    saddle_nums = horses_all_trim_intxn_clean[
        (horses_all_trim_intxn_clean['rid'] == rid) &
        (horses_all_trim_intxn_clean['saddle'] > 0) &
        (~horses_all_trim_intxn_clean['saddle'].isnull())
    ]['saddle']
    expected_sum = RID_TO_RUNNERS[rid] * (RID_TO_RUNNERS[rid] + 1) // 2
    actual_sum = sum(saddle_nums)
    missing_saddle = float(round(expected_sum - actual_sum))
    
    if missing_saddle > 0:
        
        idx = horses_all_trim_intxn_clean.index[(horses_all_trim_intxn_clean['rid'] == rid) &
                                                ((horses_all_trim_intxn_clean['saddle'] == 0) |
                                                (horses_all_trim_intxn_clean['saddle'].isnull()))][0]
        horses_all_trim_intxn_clean.at[idx, 'saddle'] = missing_saddle
        
    else:
        rids_have_alternate_saddle.append(rid)

In [42]:
len(rids_have_alternate_saddle)

7

In [43]:
entries_w_bad_saddle = horses_all_trim_intxn_clean[
    horses_all_trim_intxn_clean['saddle'].isnull() |
    (horses_all_trim_intxn_clean['saddle'] == 0)  
]

In [44]:
len(entries_w_bad_saddle)

724914

In [45]:
rids_exactly_one_bad_saddle = []

for k, v in entries_w_bad_saddle['rid'].value_counts().items():
    if v == 1:
        rids_exactly_one_bad_saddle.append(k)
        
assert len(rids_exactly_one_bad_saddle) == len(rids_have_alternate_saddle)

At this point, we have corrected all that we can hope to correct in terms of `saddle`. Accordingly, let's drop all those races which cannot be corrected.

In [46]:
entries_w_bad_saddle = horses_all_trim_intxn_clean[
    horses_all_trim_intxn_clean['saddle'].isnull() |
    (horses_all_trim_intxn_clean['saddle'] == 0)  
]

In [47]:
len(entries_w_bad_saddle)

724914

In [48]:
# drop all races where there is a horse without a saddle
old_shape = races_all_trim_intxn_clean.shape
print(f"old shape {old_shape}")

races_all_trim_intxn_clean = races_all_trim_intxn_clean[
                                 ~races_all_trim_intxn_clean['rid'].isin(entries_w_bad_saddle['rid'])
                             ]

new_shape = races_all_trim_intxn_clean.shape
print(f"new shape {new_shape}")

assert old_shape[0] - len(entries_w_bad_saddle['rid'].unique()) == new_shape[0]

old shape (394690, 11)
new shape (322776, 11)


In [49]:
# drop all horse entries that are part of a race where there is a horse without a saddle
old_shape = horses_all_trim_intxn_clean.shape
print(f"old shape {old_shape}")

horses_all_trim_intxn_clean = horses_all_trim_intxn_clean[
                                 ~horses_all_trim_intxn_clean['rid'].isin(entries_w_bad_saddle['rid'])
                             ]

new_shape = horses_all_trim_intxn_clean.shape
print(f"new shape {new_shape}")

old shape (4100741, 18)
new shape (3375555, 18)


In [50]:
# sanity checks
assert set(horses_all_trim_intxn_clean['rid']).symmetric_difference(set(races_all_trim_intxn_clean['rid'])) == set()
assert len(horses_all_trim_intxn_clean['rid'][
    horses_all_trim_intxn_clean['saddle'].isnull() |
    (horses_all_trim_intxn_clean['saddle'] == 0)
]) == 0

In case we need to be less aggressive, one theory about the data is that when a `saddle` number is bad, _all `saddle` numbers within that race are bad_. If this were the case, we could simply replace these with `saddle = 0` (or some other preset value). Let's test this theory:

In [51]:
len(entries_w_bad_saddle['rid'].unique())

71914

In [52]:
rids_some_saddle_some_not = []

for k, v in entries_w_bad_saddle['rid'].value_counts().items():
    try:
        assert v == RID_TO_RUNNERS[k]
    except AssertionError:
        rids_some_saddle_some_not.append(k)

In [53]:
len(rids_some_saddle_some_not)

58
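
Only 58 of the 71,914 affected races fail the check, so the theory holds for all but a small fraction of them. The same check can be written without the `try`/`assert` pattern. In the sketch below, toy values stand in for `RID_TO_RUNNERS` and for the `rid` column of the bad entries; a race counts as mixed when its number of bad entries differs from its number of runners.

```python
from collections import Counter

# Toy stand-ins (illustrative values only)
rid_to_runners = {101: 3, 102: 4, 103: 2}        # runners per race
bad_entry_rids = [101, 101, 101, 102, 103, 103]  # one rid per bad-saddle row

bad_counts = Counter(bad_entry_rids)

# races where some, but not all, runners have bad saddle data
rids_some_saddle_some_not = [
    rid for rid, n_bad in bad_counts.items() if n_bad != rid_to_runners[rid]
]
print(rids_some_saddle_some_not)  # [102]
```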

### `position`

Through other exploratory data analysis, we have found that there are rows where `position = 0`. We will try to understand this better.

In [54]:
len(horses_all_trim_intxn_clean[
    horses_all_trim_intxn_clean['position'].isnull()
])

0

In [55]:
len(horses_all_trim_intxn_clean[
    horses_all_trim_intxn_clean['position'] < 0
])

0

In [56]:
horses_zero_position = horses_all_trim_intxn_clean[
    horses_all_trim_intxn_clean['position'] == 0
]
horses_zero_position.head()

Unnamed: 0,rid,horseName,age,saddle,decimalPrice,trainerName,jockeyName,position,positionL,dist,outHandicap,RPR,TR,OR,father,mother,gfather,weight
1521633,290934,Pont A Marcq,3.0,10.0,0.010309,J Heloury,Matthieu Autier,0,,,,,,,Russian Blue,Pausanias,Gold Away,54
1521634,290934,Askell Gwen,3.0,14.0,0.013514,G Collet,Nicolas Larenaudie,0,,,,,,,Sandwaki,Dolly Wice,Dolpour,50
1521635,290934,Pasco,3.0,4.0,0.03125,F Vermeulen,Cesar Passerat,0,,,,,,,Kavafi,Pekkouqui,Doyoun,55
1521636,290934,Light Opera,3.0,6.0,0.071429,Daniel Rabhi,Stephane Richardot,0,,,,,,,Dubai Destination,Polite Reply,With Approval,56
1521637,290934,Zatkova,3.0,16.0,0.038462,L Racco,Thierry Thulliez,0,,,,,,,Ekraar,Zabetta,Zamindar,54


In [57]:
for column in ['positionL', 'dist', 'outHandicap', 'RPR', 'TR', 'OR']:
    arr = horses_zero_position[column].unique()
    assert len(arr) == 1 and math.isnan(arr[0])

Entries where `position = 0` curiously have _several other fields with `NaN`_. This occurs without fail. Therefore, we conclude that these must be alternates who were on standby but never ended up running in the race. Before we remove these entries, let's make sure that for any race with alternates there is at least one runner.

In [58]:
horses_nonzero_position = horses_all_trim_intxn_clean[
    ~ (horses_all_trim_intxn_clean['position'] == 0)
]

assert len(np.intersect1d(horses_nonzero_position['rid'], horses_zero_position['rid'])) == \
       len(horses_zero_position['rid'].unique())

Since all races would still have at least one entry even after removing alternates, we may proceed.
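
The check above amounts to a subset test: every `rid` that has an alternate must also appear among the rows with a nonzero position. A minimal sketch with toy rows (illustrative values, hypothetical variable names):

```python
# (rid, position) pairs; position 0 marks a suspected alternate
rows = [(1, 1), (1, 2), (1, 0), (2, 1), (2, 0), (2, 0), (3, 1)]

rids_with_alternates = {rid for rid, pos in rows if pos == 0}
rids_with_runners = {rid for rid, pos in rows if pos != 0}

# every race with alternates still has at least one actual runner
all_races_keep_a_runner = rids_with_alternates <= rids_with_runners
print(all_races_keep_a_runner)  # True

# dropping the alternates therefore removes rows but no races
remaining = [(rid, pos) for rid, pos in rows if pos != 0]
assert {rid for rid, _ in remaining} == {rid for rid, _ in rows}
```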

In [59]:
# drop all horse entries that have `position = 0`
old_shape = horses_all_trim_intxn_clean.shape
print(f"old shape {old_shape}")

horses_all_trim_intxn_clean = horses_all_trim_intxn_clean[
                                 ~ (horses_all_trim_intxn_clean['position'] == 0)
                             ]

new_shape = horses_all_trim_intxn_clean.shape
print(f"new shape {new_shape}")

old shape (3375555, 18)
new shape (3374991, 18)


In [60]:
# sanity checks
assert set(horses_all_trim_intxn_clean['rid']).symmetric_difference(set(races_all_trim_intxn_clean['rid'])) == set()
assert len(horses_all_trim_intxn_clean['rid'][
    horses_all_trim_intxn_clean['position'].isnull() |
    (horses_all_trim_intxn_clean['position'] == 0)
]) == 0

### `positionL`

`positionL` will be important because we will use `positionL` and `dist` as proxies for the finishing time of a horse in place of actual finishing times (which are only provided for the winner of each race). Note, we expect that horses with `position = 1` or `position = 40` have `positionL = NaN` due to the nature of this variable. These can be corrected easily and so are not bad entries.

In [61]:
entries_w_nan_positionL = horses_all_trim_intxn_clean[
    horses_all_trim_intxn_clean['positionL'].isnull() &
    (~horses_all_trim_intxn_clean['position'].isin([1,40]))
]
len(entries_w_nan_positionL)

8286

In [62]:
len(entries_w_nan_positionL['rid'].unique())

2646

Since there is no way to correct `positionL = NaN` without inferring it to be `positionL = 0` (where this might introduce bias that makes it appear like horses are closer than they actually are), we are forced to drop these entries and their respective races.

In [63]:
# drop all races where there is a horse without a positionL
old_shape = races_all_trim_intxn_clean.shape
print(f"old shape {old_shape}")

races_all_trim_intxn_clean = races_all_trim_intxn_clean[
                                 ~races_all_trim_intxn_clean['rid'].isin(entries_w_nan_positionL['rid'])
                             ]

new_shape = races_all_trim_intxn_clean.shape
print(f"new shape {new_shape}")

assert old_shape[0] - len(entries_w_nan_positionL['rid'].unique()) == new_shape[0]

old shape (322776, 11)
new shape (320130, 11)


In [64]:
# drop all horse entries that are part of a race where there is a horse without a positionL
old_shape = horses_all_trim_intxn_clean.shape
print(f"old shape {old_shape}")

horses_all_trim_intxn_clean = horses_all_trim_intxn_clean[
                                 ~horses_all_trim_intxn_clean['rid'].isin(entries_w_nan_positionL['rid'])
                             ]

new_shape = horses_all_trim_intxn_clean.shape
print(f"new shape {new_shape}")

old shape (3374991, 18)
new shape (3339586, 18)


In [65]:
# sanity checks
assert set(horses_all_trim_intxn_clean['rid']).symmetric_difference(set(races_all_trim_intxn_clean['rid'])) == set()
assert len(horses_all_trim_intxn_clean['rid'][
    horses_all_trim_intxn_clean['positionL'].isnull() &
    (~horses_all_trim_intxn_clean['position'].isin([1,40]))
]) == 0

As for the entries where `positionL = NaN` and `position = 1` or `position = 40`, we will leave `positionL = NaN` in place; we will eventually drop the `positionL` column in favor of the `dist` column anyway, and will calculate the `dist` of a nonfinishing horse by a different means that handles `positionL = NaN`.

### `dist`

`dist` is important because it measures the distance to a winner. By inspection of the dataset, there is currently no value for `dist` for horses which place or do not finish the race. Let's confirm that these are the only places where `dist` is missing. If so, then we can easily infer meaningful values for these entries (e.g. a horse placing first has `dist = 0`).

Note that in the original dataset, the meaning of `dist` is distance to a winner (including a horse that placed). Instead, we will use `dist` to be the distance to a horse that finishes first in a race. This is much easier to reason about, both from our perspective and a model's perspective.
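
Under this definition, `dist` for a finisher is the running total of the `positionL` gaps of every horse up to and including it, with the first-placed horse at 0. The sketch below illustrates this on `positionL` values like those of race 84848. The numeric values assigned to the abbreviations are placeholders chosen for illustration and are not the contents of `LENGTH_ABBRV_TO_DIST`; ties and nonfinishers are ignored here.

```python
from itertools import accumulate

# Placeholder conversions (assumed values, in lengths)
abbrv_to_lengths = {"nse": 0.05, "shd": 0.1, "hd": 0.2, "nk": 0.3}


def to_lengths(value) -> float:
    """Convert a `positionL` value to lengths; a missing value counts as 0."""
    if value is None:
        return 0.0
    try:
        return float(value)
    except ValueError:
        # not a number, so treat it as a length abbreviation
        return abbrv_to_lengths[value]


# `positionL` by finishing order for positions 1..8
position_l_by_rank = [None, "shd", "shd", "2", "2", "3", "1.75", "5"]

# running total of gaps = distance behind the first-placed horse
dist = list(accumulate(to_lengths(v) for v in position_l_by_rank))
print([round(d, 2) for d in dist])
```

With these placeholder values the second-placed horse gets a small positive `dist` instead of a missing one, and every later horse is measured from the same reference point.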

Before we can deal with null values, we first have to deal with a few odd cases where this column, which is supposed to contain only numerical data, contains a string. These appear to be remnants of a failed conversion from length specifier to distance.

In [66]:
non_numerical_dist = []

for dist in horses_all_trim_intxn_clean['dist'].unique():
    try:
        float(dist)
    except (TypeError, ValueError):
        # anything that cannot be parsed as a number, e.g. 'hd' or 'nk'
        non_numerical_dist.append(dist)
        
non_numerical_dist

['hd', 'nk', 'snk', 'sht-hd', 'nse']

In [67]:
entries_non_numerical_dist = horses_all_trim_intxn_clean[
    horses_all_trim_intxn_clean['dist'].isin(non_numerical_dist)
]
entries_non_numerical_dist.head()

Unnamed: 0,rid,horseName,age,saddle,decimalPrice,trainerName,jockeyName,position,positionL,dist,outHandicap,RPR,TR,OR,father,mother,gfather,weight
14492,84848,Jendorcet,7.0,13.0,0.058824,Chris Fairhurst,J Callaghan,3,shd,hd,,76.0,,75.0,Grey Ghost I,Jendor,Condorcet,63
14726,109134,Castle Sweep,6.0,1.0,0.066667,D Nicholson,Richard Johnson,3,hd,nk,,157.0,147.0,155.0,Castle Keep,Fairy Shot,Random Shot,76
21261,82154,Silent Miracle,3.0,6.0,0.5,Michael Bell,Micky Fenton,3,shd,hd,,69.0,30.0,,Night Shift,Curie Point,Sharpen Up,55
21456,179365,Petoskin,5.0,4.0,0.066667,Jeff Pearce,C Teague,3,shd,nk,,52.0,40.0,50.0,Petoski,Farcical,Pharly,60
21640,233260,Mels Baby,4.0,1.0,0.125,Les Eyre,S Buckley,3,shd,hd,,78.0,44.0,70.0,Contract Law,Launch The Raft,Home Guard,62


In [68]:
len(entries_non_numerical_dist)

2714

**NOTE**: The below function is used extensively and handles bad `dist` data of all kinds including runners with `position = 2`, runners with non-numerical distance data, and runners in `position = 40`.

In [69]:
def get_dist_of_horse(df: pd.core.frame.DataFrame, idx: int) -> float:
    """
    Given a dataframe `df` that represents a race, calculate the `dist`
    of the horse with identifier `idx`.
    """
    
    df = df.sort_values(by=['position'])
    
    # cumulative distance to the runner in 1st place
    dist = 0
    
    # there may be ties for a position
    curr_position = 1
    
    # for all entries in this race
    for _, row in df.iterrows():
        
        # convert `positionL` to a numerical value,
        # it is safe to assume that the length is 0 where `positionL = NaN`, see prior discussion
        if row['positionL'] and (isinstance(row['positionL'], str) or not math.isnan(row['positionL'])):
            lengths = row['positionL']
        else:
            lengths = 0
        
        # some `positionL` values are strings encoding length information
        try:
            lengths = float(lengths)
        except ValueError:
            lengths = LENGTH_ABBRV_TO_DIST[lengths]
            
        # be careful about changing the current position bc of ties
        curr_position = row['position']
       
        # accumulate distance so long as we are not looking at nonfinishing horses
        if curr_position != 40:
            dist += lengths
        
        # found the desired horse
        if int(row.name) == idx:
            return dist + (LENGTH_ABBRV_TO_DIST['dist'] if curr_position == 40 else 0)

In [70]:
df = horses_all_trim_intxn_clean[horses_all_trim_intxn_clean['rid'] == 84848]
df

Unnamed: 0,rid,horseName,age,saddle,decimalPrice,trainerName,jockeyName,position,positionL,dist,outHandicap,RPR,TR,OR,father,mother,gfather,weight
14490,84848,Rothari,5.0,8.0,0.125,Brian Rothwell,R Supple,1,,,,88.0,,87.0,Nashwan,Royal Lorna,Val De L'Orne,68
14491,84848,Diamond Beach,4.0,12.0,0.066667,George Moore,N Bentley,2,shd,,,81.0,,88.0,Lugana Beach,Cannon Boy,Canonero,65
14492,84848,Jendorcet,7.0,13.0,0.058824,Chris Fairhurst,J Callaghan,3,shd,hd,,76.0,,75.0,Grey Ghost I,Jendor,Condorcet,63
14493,84848,Ferrers,6.0,7.0,0.222222,Pam Sly,Warren Marston,4,2,2.25,,89.0,,90.0,Homeboy,Gay Twenties,Lord Gayle,70
14494,84848,Pangeran,5.0,6.0,0.153846,Ann Duffield,J Supple,5,2,4.25,,87.0,,90.0,Forty Niner,Smart Heiress,Vaguely Noble,70
14495,84848,Beau Matelot,5.0,9.0,0.058824,Miss Kate Milligan,N Horrocks,6,3,7.25,,79.0,,85.0,Handsome Sailor,Bellanoora,Ahonoora,64
14496,84848,Adamatic,6.0,1.0,0.111111,Dick Allan,B Storey,7,1.75,9,,95.0,,103.0,Henbit,Arpal Magic,Master Owen,76
14497,84848,Falcon's Flame,4.0,10.0,0.066667,Victor Thompson,Mr M Thompson,8,5,14,,72.0,,93.0,Hawkster,Staunch Flame,Bold Forbes,68
14498,84848,Ten Past Six,5.0,14.0,0.133333,Martyn Wane,L O'Hara,9,3,17,3.0,59.0,,75.0,Kris,Tashinsky,Nijinsky,63
14499,84848,Rubislaw,5.0,16.0,0.029412,Mrs K M Lamb,Mrs Sarah Cutts,10,3.5,20.5,20.0,56.0,,75.0,Dunbeath,Larnem,Meldrum,60


In [71]:
assert get_dist_of_horse(df, 14492) == 0.5

In [72]:
df = horses_all_trim_intxn_clean[horses_all_trim_intxn_clean['rid'] == 82154]
df

Unnamed: 0,rid,horseName,age,saddle,decimalPrice,trainerName,jockeyName,position,positionL,dist,outHandicap,RPR,TR,OR,father,mother,gfather,weight
21259,82154,Mouche,3.0,4.0,0.125,Mrs J R Ramsden,Jimmy Fortune,1,,,,70.0,31.0,,Warning,Case For The Crown,Bates Motel,55
21260,82154,Lamarita,3.0,3.0,0.153846,James Eustace,Ray Cochrane,2,shd,,,70.0,30.0,,Emarati,Bentinck Hotel,Red God,55
21261,82154,Silent Miracle,3.0,6.0,0.5,Michael Bell,Micky Fenton,3,shd,hd,,69.0,30.0,,Night Shift,Curie Point,Sharpen Up,55
21262,82154,Tajrebah,3.0,7.0,0.166667,P T Walwyn,Richard Hills,4,2.5,2.75,,62.0,23.0,,Dayjur,Petrava,Imposing,55
21263,82154,Sang D'Antibes,3.0,5.0,0.066667,D J S Cosgrove,M Rimmer,5,2.5,5.25,,54.0,15.0,,Sanglamore,Baratoga,Bering,55
21264,82154,Aquatic Queen,3.0,1.0,0.019608,R J Weaver,M Wigham,6,1.75,7,,49.0,10.0,,Rudimentary,Aquarula,Dominion,55
21265,82154,Corinchili,3.0,2.0,0.125,George Margarson,Gary Carter,7,5,12,,34.0,,,Chilibang,Corinthia,Empery,55


In [73]:
assert get_dist_of_horse(df, 21261) == 0.5
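
To make the two asserts above concrete, here is a plain-Python sketch of the same accumulation, without pandas. The table `ABBRV_TO_LENGTHS` is a hypothetical stand-in for `LENGTH_ABBRV_TO_DIST`: the value `'shd' = 0.25` is only inferred from the asserts (0 + x + x = 0.5), and `to_lengths` and `cumulative_dist` are illustrative names.

```python
import math

# Hypothetical stand-in for LENGTH_ABBRV_TO_DIST; 'shd' = 0.25 is inferred
# from the asserts above, and the real table may hold other entries.
ABBRV_TO_LENGTHS = {"shd": 0.25}


def to_lengths(position_l) -> float:
    """Convert one `positionL` value to a number of lengths (NaN counts as 0)."""
    if position_l is None:
        return 0.0
    if isinstance(position_l, float) and math.isnan(position_l):
        return 0.0
    try:
        return float(position_l)
    except ValueError:
        # not a numeric string, so treat it as a length abbreviation
        return ABBRV_TO_LENGTHS[position_l]


def cumulative_dist(position_ls):
    """Running total of lengths behind the winner, in finishing order."""
    total = 0.0
    result = []
    for position_l in position_ls:
        total += to_lengths(position_l)
        result.append(total)
    return result


# positionL of the first three finishers in races 84848 and 82154
first_three = cumulative_dist([float("nan"), "shd", "shd"])
```

Under these assumptions the third finisher ends up 0.5 lengths behind the winner, which is the value both asserts expect.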

In [74]:
for idx, entry in tqdm(entries_non_numerical_dist.iterrows()):
    df = horses_all_trim_intxn_clean[horses_all_trim_intxn_clean['rid'] == entry['rid']]
    dist = get_dist_of_horse(df, idx)
    horses_all_trim_intxn_clean.at[idx, 'dist'] = dist

2714it [00:12, 223.97it/s]


In [75]:
entries_non_numerical_dist = horses_all_trim_intxn_clean[
    horses_all_trim_intxn_clean['dist'].isin(non_numerical_dist)
]
assert len(entries_non_numerical_dist) == 0

Now that `dist` contains only numerical values and `NaN`, we can convert it to a numerical column, check that all numerical values are valid, and replace `dist = NaN` with `dist = 0`.

In [76]:
horses_all_trim_intxn_clean.dtypes

rid               int64
horseName        object
age             float64
saddle          float64
decimalPrice    float64
trainerName      object
jockeyName       object
position          int64
positionL        object
dist             object
outHandicap     float64
RPR             float64
TR              float64
OR              float64
father           object
mother           object
gfather          object
weight            int64
dtype: object

In [77]:
horses_all_trim_intxn_clean['dist'] = pd.to_numeric(horses_all_trim_intxn_clean['dist'], downcast='float')

In [78]:
horses_all_trim_intxn_clean.dtypes

rid               int64
horseName        object
age             float64
saddle          float64
decimalPrice    float64
trainerName      object
jockeyName       object
position          int64
positionL        object
dist            float32
outHandicap     float64
RPR             float64
TR              float64
OR              float64
father           object
mother           object
gfather          object
weight            int64
dtype: object

In [79]:
len(horses_all_trim_intxn_clean[horses_all_trim_intxn_clean['dist'] < 0])

0

In [80]:
len(horses_all_trim_intxn_clean[horses_all_trim_intxn_clean['dist'].isnull()])

864391

Great, there are no negative values for `dist`, but there are still rows with `dist = NaN`. Let's check whether `dist = NaN` only occurs where `position = 1`, `position = 2`, or `position = 40`. For `position = 1` or `position = 2`, the horse technically has no "distance to a winner", which is how `dist` is explained (we will change this later); `position = 40` means the horse did not finish, so `dist` is potentially infinite.

In [81]:
len(horses_all_trim_intxn_clean[
    horses_all_trim_intxn_clean['dist'].isnull() &
    (~horses_all_trim_intxn_clean['position'].isin([1,2,40]))
])

236

So the hypothesis does not hold exactly, although 236 is far smaller than the 864,391 we started with. Let's see where it fails.

In [82]:
horses_all_trim_intxn_clean[
    horses_all_trim_intxn_clean['dist'].isnull() &
    (~horses_all_trim_intxn_clean['position'].isin([1,2,40]))
]['position'].unique()

array([ 3,  4,  5, 11,  6, 12,  7, 10,  9])

In [83]:
len(horses_all_trim_intxn_clean[
    horses_all_trim_intxn_clean['dist'].isnull() &
    (~horses_all_trim_intxn_clean['position'].isin([1,2,40]))
]['rid'].unique())

236

We can correct these `dist` values using the same function as before.

In [84]:
entries_nan_dist = horses_all_trim_intxn_clean[
    horses_all_trim_intxn_clean['dist'].isnull() &
    (~horses_all_trim_intxn_clean['position'].isin([1,2,40]))
]
entries_nan_dist.head()

Unnamed: 0,rid,horseName,age,saddle,decimalPrice,trainerName,jockeyName,position,positionL,dist,outHandicap,RPR,TR,OR,father,mother,gfather,weight
22083,342240,Going Primitive,6.0,3.0,0.166667,J Hetherton,Mr S Swiers,3,nk,,,98.0,,,Primitive Rising,Good Going Girl,Import,69
22578,75990,Manolo I,4.0,4.0,0.222222,J Berry,Kevin Darley,4,.5,,,61.0,57.0,,Cricket Ball,Malouna,General Holme,58
736826,129604,Plenty Quick,3.0,4.0,0.119048,Donnie K Von Hemel,David Cabrera,3,1.5,,,,,,Alternation,Try To Catch Her,Broken Vow,54
753493,165034,Vin De Dance,3.0,2.0,0.066667,Murray Baker & Andrew Forsman,Jason Waddell,4,1.25,,,107.0,,,Roc De Cambes,Explosive Dancer,San Luis,56
758283,87146,Sounds Delicious,4.0,8.0,0.298507,Linda Rice,Junior Alvarado,4,1,,,98.0,,,Yes It's True,Dulce Realidad,Sweetsouthernsaint,53


In [85]:
for idx, entry in tqdm(entries_nan_dist.iterrows()):
    df = horses_all_trim_intxn_clean[horses_all_trim_intxn_clean['rid'] == entry['rid']]
    dist = get_dist_of_horse(df, idx)
    horses_all_trim_intxn_clean.at[idx, 'dist'] = dist

236it [00:01, 208.30it/s]


In [86]:
entries_nan_dist = horses_all_trim_intxn_clean[
    horses_all_trim_intxn_clean['dist'].isnull() &
    (~horses_all_trim_intxn_clean['position'].isin([1,2,40]))
]
assert len(entries_nan_dist) == 0

Next, let's look at the entries where `position = 40` and `dist` is still `NaN`, so that we can decide how to fix them.

In [87]:
entries_nan_dist_p40 = horses_all_trim_intxn_clean[
    horses_all_trim_intxn_clean['dist'].isnull() &
    (horses_all_trim_intxn_clean['position'] == 40)
]
entries_nan_dist_p40

Unnamed: 0,rid,horseName,age,saddle,decimalPrice,trainerName,jockeyName,position,positionL,dist,outHandicap,RPR,TR,OR,father,mother,gfather,weight
45,13063,Sharp Command,4.0,5.0,0.200000,P Eccles,J Weaver,40,,,,,,53.0,Sharpo,Bluish,Alleged,55
609,302858,Graignamanagh,6.0,3.0,0.307692,Harry De Bromhead,J R Barry,40,,,,,,,Tremblant,Feathermore,Crash Course,73
696,49027,Ask The Butler,6.0,1.0,0.444444,C Roche,Conor O'Dwyer,40,,,,,,,Carlingford Castle,Ask Breda,Ya Zaman,74
716,294935,Alicharger,7.0,4.0,0.019608,P Monteith,Tony Dobbin,40,,,,,,,Alias Smith,Amirati,Amber Rama,71
717,294935,Herbert Lodge,8.0,6.0,0.363636,Kim Bailey,Conor O'Dwyer,40,,,,,,,Montelimar,Mindyourbusiness,Run The Gantlet,71
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4107302,201116,Camair Crusader,5.0,8.0,0.058824,W McKeown,Alan Dempsey,40,,,,73.0,73.0,66.0,Jolly Jake,Sigrid's Dream,Triple Bend,64
4107303,201116,Classic Exhibit,10.0,4.0,0.066667,A Streeter,Russ Garritty,40,,,,,,77.0,Tate Gallery,See The Tops,Cure The Blues,69
4107304,201116,Faustnluce Lady,10.0,13.0,0.009901,W J Smith,P Griffiths,40,,,6.0,,,67.0,Faustus,Miss Friendly,Status Seeker,62
4107309,203745,Ealing Court,10.0,5.0,0.133333,Tony Carroll,Ollie McPhail,40,,,,,,90.0,Blazing Saddles,Fille De General,Brigadier Gerard,68


Of the horses with `position = 40` and `dist = NaN`, how many were in a race that actually had 40 runners? These cases are harder to reason about than races with fewer than 40 runners, where we can simply say that the horse _truly did not finish_ rather than that it arrived in 40th place with an unknown distance.

In [88]:
entries_nan_dist_and_40_runners = []
races_nan_dist_and_40_runners = set()

for _, row in entries_nan_dist_p40.iterrows():
    if RID_TO_RUNNERS[row['rid']] == 40:
        entries_nan_dist_and_40_runners.append(row['rid'])
        races_nan_dist_and_40_runners.add(row['rid'])
        
len(entries_nan_dist_and_40_runners), len(races_nan_dist_and_40_runners)

(399, 16)

In [89]:
np.unique(np.array(entries_nan_dist_and_40_runners), return_counts = True)

(array([  2778,  31320,  47950,  54291,  77273,  95922, 159269, 206187,
        211821, 239445, 241345, 299491, 320946, 328597, 337325, 394140]),
 array([22, 21, 21, 36, 31, 25, 21, 23, 19, 26, 26, 28, 23, 23, 29, 25]))

This is odd: there are usually _multiple horses recorded in 40th place in a race with 40 horses_. While it seems possible to correct these programmatically, this probably represents bad manual input. In any case, races whose number of horses exceeds some threshold (below 40) will later be dropped to keep the problem tractable, so we preemptively drop these races now.
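
The reasoning above can be sketched on toy data. The `(rid, position)` pairs and the `runners` dictionary (a hypothetical stand-in for `RID_TO_RUNNERS`) are made up for illustration. The idea is to separate races where `position = 40` can only be a "did not finish" marker from races where it could be a genuine placing.

```python
from collections import Counter

# (rid, position) pairs; illustrative values only
entries = [
    (1, 1), (1, 2), (1, 40), (1, 40),   # race 1: two horses recorded in "40th"
    (2, 1), (2, 40),                    # race 2: one non-finisher
    (3, 1), (3, 2),                     # race 3: everyone finished
]

# hypothetical stand-in for RID_TO_RUNNERS
runners = {1: 40, 2: 8, 3: 2}

# number of entries with position 40 in each race
count_p40 = Counter(rid for rid, position in entries if position == 40)

# races that really have 40 runners: position 40 is ambiguous here
ambiguous = {rid for rid in count_p40 if runners[rid] == 40}

# races with several horses in "40th": hard to read as genuine placings
repeated = {rid for rid, n in count_p40.items() if n > 1}
```

In this toy example only race 1 is ambiguous, and it is also the race with repeated 40th places, which mirrors the pattern seen in the counts above.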

In [90]:
# drop all races where there is a horse finishing in 40th place with a race of 40 runners and has a null distance
old_shape = races_all_trim_intxn_clean.shape
print(f"old shape {old_shape}")

races_all_trim_intxn_clean = races_all_trim_intxn_clean[
                                 ~races_all_trim_intxn_clean['rid'].isin(races_nan_dist_and_40_runners)
                             ]

new_shape = races_all_trim_intxn_clean.shape
print(f"new shape {new_shape}")

assert old_shape[0] - len(races_nan_dist_and_40_runners) == new_shape[0]

old shape (320130, 11)
new shape (320114, 11)


In [91]:
# drop all horse entries that place in 40th with a race of 40 runners and have null distance
old_shape = horses_all_trim_intxn_clean.shape
print(f"old shape {old_shape}")

horses_all_trim_intxn_clean = horses_all_trim_intxn_clean[
                                 ~horses_all_trim_intxn_clean['rid'].isin(races_nan_dist_and_40_runners)
                             ]

new_shape = horses_all_trim_intxn_clean.shape
print(f"new shape {new_shape}")

old shape (3339586, 18)
new shape (3338946, 18)


In [92]:
# sanity checks
assert set(horses_all_trim_intxn_clean['rid']).symmetric_difference(set(races_all_trim_intxn_clean['rid'])) == set()
assert len(horses_all_trim_intxn_clean['rid'][
    horses_all_trim_intxn_clean['rid'].isin(races_nan_dist_and_40_runners)
]) == 0

Now, we will infer data by approximating that a horse that does not finish a race finishes one "dist" (30 lengths) behind the last finishing horse. Note that 30 lengths is a lot in horse racing, so this should be a fair stand-in for what would logically be an infinite distance to the winner. Dropping these entries may introduce bias into our data, whereas it might be important to know that a horse frequently does not finish.

In [93]:
entries_nan_dist_p40 = horses_all_trim_intxn_clean[
    horses_all_trim_intxn_clean['dist'].isnull() &
    (horses_all_trim_intxn_clean['position'] == 40)
]
entries_nan_dist_p40

Unnamed: 0,rid,horseName,age,saddle,decimalPrice,trainerName,jockeyName,position,positionL,dist,outHandicap,RPR,TR,OR,father,mother,gfather,weight
45,13063,Sharp Command,4.0,5.0,0.200000,P Eccles,J Weaver,40,,,,,,53.0,Sharpo,Bluish,Alleged,55
609,302858,Graignamanagh,6.0,3.0,0.307692,Harry De Bromhead,J R Barry,40,,,,,,,Tremblant,Feathermore,Crash Course,73
696,49027,Ask The Butler,6.0,1.0,0.444444,C Roche,Conor O'Dwyer,40,,,,,,,Carlingford Castle,Ask Breda,Ya Zaman,74
716,294935,Alicharger,7.0,4.0,0.019608,P Monteith,Tony Dobbin,40,,,,,,,Alias Smith,Amirati,Amber Rama,71
717,294935,Herbert Lodge,8.0,6.0,0.363636,Kim Bailey,Conor O'Dwyer,40,,,,,,,Montelimar,Mindyourbusiness,Run The Gantlet,71
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4107302,201116,Camair Crusader,5.0,8.0,0.058824,W McKeown,Alan Dempsey,40,,,,73.0,73.0,66.0,Jolly Jake,Sigrid's Dream,Triple Bend,64
4107303,201116,Classic Exhibit,10.0,4.0,0.066667,A Streeter,Russ Garritty,40,,,,,,77.0,Tate Gallery,See The Tops,Cure The Blues,69
4107304,201116,Faustnluce Lady,10.0,13.0,0.009901,W J Smith,P Griffiths,40,,,6.0,,,67.0,Faustus,Miss Friendly,Status Seeker,62
4107309,203745,Ealing Court,10.0,5.0,0.133333,Tony Carroll,Ollie McPhail,40,,,,,,90.0,Blazing Saddles,Fille De General,Brigadier Gerard,68


In [94]:
df = horses_all_trim_intxn_clean[horses_all_trim_intxn_clean['rid'] == 201116]
df

Unnamed: 0,rid,horseName,age,saddle,decimalPrice,trainerName,jockeyName,position,positionL,dist,outHandicap,RPR,TR,OR,father,mother,gfather,weight
4107293,201116,Empire State,4.0,3.0,0.222222,Richard Fahey,Derek Byrne,1,,,,93.0,87.0,78.0,High Estate,Palm Dove,Storm Bird,70
4107294,201116,Earp,7.0,1.0,0.153846,George Moore,N Hannity,2,12.0,,,90.0,84.0,87.0,Anita's Prince,Ottavia Abu,Octavo,72
4107295,201116,Navan Project,5.0,7.0,0.117647,A R Dicken,R Supple,3,2.0,14.0,,65.0,62.0,67.0,Project Manager,Just Possible,Kalaglow,65
4107296,201116,Lucy Tufty,8.0,11.0,0.038462,George Prodromou,Lee Vickers,4,4.0,18.0,4.0,57.0,54.0,63.0,Vin St Benet,Manor Farm Toots,Royalty,60
4107297,201116,Fortune Hopper,5.0,5.0,0.142857,Martin Todhunter,Brian Harding,5,10.0,28.0,,70.0,58.0,76.0,Rock Hopper,Lots Of Luck,Neltino,69
4107298,201116,Brancepeth Belle,9.0,2.0,0.266667,N B Mason,G F Ryan,6,11.0,39.0,,53.0,50.0,80.0,Supreme Leader,Head Of The Gang,Pollerton,71
4107299,201116,Cee-N-K,5.0,10.0,0.038462,D McCain,G Lake,7,21.0,60.0,2.0,15.0,12.0,63.0,Thatching,Valois,Lyphard,60
4107300,201116,Count Keni,4.0,12.0,0.009901,I Park,N Old Smith,8,3.0,63.0,6.0,12.0,9.0,63.0,Formidable I,Flying Amy,Norwick,63
4107301,201116,Rose Flyer,9.0,9.0,0.038462,Michael Chapman,W Worthington,9,30.0,93.0,,-9.0,1.0,64.0,Nordico,String Of Straw,Thatching,63
4107302,201116,Camair Crusader,5.0,8.0,0.058824,W McKeown,Alan Dempsey,40,,,,73.0,73.0,66.0,Jolly Jake,Sigrid's Dream,Triple Bend,64


In [96]:
assert get_dist_of_horse(df, 4107302) == 123

In [97]:
assert get_dist_of_horse(df, 4107303) == 123
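
The value 123 in the two asserts above can be checked by hand from the `positionL` column of race 201116. A short sketch of the arithmetic, where `NON_FINISH_LENGTHS` is an illustrative name for the 30-length "dist" approximation:

```python
# the notebook's approximation: a non-finisher is one "dist" (30 lengths) back
NON_FINISH_LENGTHS = 30.0

# positionL of the finishers in positions 1..9 of race 201116 (NaN treated as 0)
finisher_gaps = [0.0, 12.0, 2.0, 4.0, 10.0, 11.0, 21.0, 3.0, 30.0]

# cumulative distance of the last finisher behind the winner
last_finisher_dist = sum(finisher_gaps)

# every non-finisher gets the same value, however many there are
non_finisher_dist = last_finisher_dist + NON_FINISH_LENGTHS
```

The last finisher is 93 lengths behind the winner, matching the `dist` shown for position 9, so each non-finisher is assigned 93 + 30 = 123.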

In [98]:
# takes 15 minutes to run

for idx, entry in tqdm(entries_nan_dist_p40.iterrows()):
    df = horses_all_trim_intxn_clean[horses_all_trim_intxn_clean['rid'] == entry['rid']]
    dist = get_dist_of_horse(df, idx)
    horses_all_trim_intxn_clean.at[idx, 'dist'] = dist

223768it [16:25, 227.03it/s]
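
Most of the runtime above likely comes from re-filtering the full table once per entry. One possible alternative, shown here as a plain-Python sketch on made-up rows, is to group the entries by `rid` in a single pass and then look each race up directly. The function name `group_by_race` is illustrative; on the real dataframe, pandas `groupby('rid')` would play the same role.

```python
from collections import defaultdict


def group_by_race(entries):
    """Map each rid to the list of its entries, scanning the data once."""
    by_rid = defaultdict(list)
    for entry in entries:
        by_rid[entry["rid"]].append(entry)
    return by_rid


# illustrative rows, not taken from the dataset
entries = [
    {"idx": 10, "rid": 7, "position": 1},
    {"idx": 11, "rid": 7, "position": 40},
    {"idx": 12, "rid": 9, "position": 1},
]

races = group_by_race(entries)

# constant-time lookup of the race that contains entry 11
race_of_entry_11 = races[7]
```

With such an index, each call to a function like `get_dist_of_horse` would receive only the rows of one race instead of triggering a scan over millions of rows.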


In [101]:
entries_nan_dist_p40 = horses_all_trim_intxn_clean[
    horses_all_trim_intxn_clean['dist'].isnull() &
    (horses_all_trim_intxn_clean['position'] == 40)
]
assert len(entries_nan_dist_p40) == 0

We should just be left with `dist = NaN` where `position = 1` or `position = 2`.

In [102]:
entries_nan_dist = horses_all_trim_intxn_clean[
    horses_all_trim_intxn_clean['dist'].isnull() &
    (~horses_all_trim_intxn_clean['position'].isin([1,2]))
]
assert len(entries_nan_dist) == 0

Now, for horses that finish with `position = 1` or `position = 2` (which all horses with `dist = NaN` currently satisfy), we can safely fill in `dist = 0`; this matches the definition of `dist` and _does not infer any data_.

In [103]:
horses_all_trim_intxn_clean['dist'].fillna(0, inplace=True)

In [104]:
entries_nan_dist = horses_all_trim_intxn_clean[
    horses_all_trim_intxn_clean['dist'].isnull()
]
assert len(entries_nan_dist) == 0

### `outHandicap`

Anywhere `outHandicap = NaN`, it can safely be replaced with `outHandicap = 0`, since a handicap is an "opt-in" service and not the default.

In [105]:
len(horses_all_trim_intxn_clean[
    horses_all_trim_intxn_clean['outHandicap'] < 0
])

0

Good, no negative values for `outHandicap`.

In [106]:
entries_nan_outhandicap = horses_all_trim_intxn_clean[
    horses_all_trim_intxn_clean['outHandicap'].isnull()
]
len(entries_nan_outhandicap)

3254031

Most entries do not have an `outHandicap`, so this feature will probably be fairly inconsequential; nonetheless, we can fill these values in.

In [107]:
horses_all_trim_intxn_clean['outHandicap'].fillna(0, inplace=True)

In [108]:
entries_nan_outhandicap = horses_all_trim_intxn_clean[
    horses_all_trim_intxn_clean['outHandicap'].isnull()
]
assert len(entries_nan_outhandicap) == 0

### `RPR`

Since this value is a rating, we will accept any real value here. That is, we will only correct where `RPR = NaN`.

In [109]:
entries_nan_rpr = horses_all_trim_intxn_clean[
    horses_all_trim_intxn_clean['RPR'].isnull()
]
len(entries_nan_rpr)

571970

Since this represents a sizable proportion of the dataset (571,970 entries), we do not want to simply drop every row where `NaN` occurs. Instead, we will replace `RPR = NaN` with `RPR = mean(RPR)`. Consider what this buys us: when we eventually standardize the data, this feature will zero out everywhere that we originally had `RPR = NaN` and _provide no additional information to the model_. That is, if this feature zeros out, it will not affect the prediction; it will be as if we never observed it at all.
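
A small sketch of why mean imputation zeroes out under standardization, using made-up ratings and only the standard library. It assumes the column is later standardized with its own mean; if the mean used for standardization differs from the fill value (for example, because rows are dropped in between), the imputed entries would only be close to zero.

```python
from statistics import fmean, pstdev

# illustrative values; None marks a missing RPR
ratings = [60.0, 80.0, None, 100.0]

# mean over the observed entries only, as in the notebook
observed = [r for r in ratings if r is not None]
mean_observed = fmean(observed)

# fill the gaps with that mean
filled = [mean_observed if r is None else r for r in ratings]

# standardize the filled column with its own mean and standard deviation
mu = fmean(filled)
sigma = pstdev(filled)
z_scores = [(r - mu) / sigma for r in filled]
```

Filling with the observed mean leaves the column mean unchanged, so the imputed entry maps to a z-score of exactly 0 here. One side effect worth keeping in mind is that the standard deviation of the filled column is smaller than that of the observed values alone.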

In [110]:
mean_rpr = np.mean(horses_all_trim_intxn_clean[~horses_all_trim_intxn_clean['RPR'].isnull()]['RPR'])
mean_rpr

71.87665017694407

In [111]:
horses_all_trim_intxn_clean['RPR'].fillna(mean_rpr, inplace=True)

In [112]:
entries_nan_rpr = horses_all_trim_intxn_clean[
    horses_all_trim_intxn_clean['RPR'].isnull()
]
assert len(entries_nan_rpr) == 0

### `OR`

We will treat `OR` similarly to `RPR`.

In [113]:
entries_nan_or = horses_all_trim_intxn_clean[
    horses_all_trim_intxn_clean['OR'].isnull()
]
len(entries_nan_or)

1272628

In [114]:
mean_or = np.mean(horses_all_trim_intxn_clean[~horses_all_trim_intxn_clean['OR'].isnull()]['OR'])
mean_or

79.65460350246187

In [115]:
horses_all_trim_intxn_clean['OR'].fillna(mean_or, inplace=True)

In [116]:
entries_nan_or = horses_all_trim_intxn_clean[
    horses_all_trim_intxn_clean['OR'].isnull()
]
assert len(entries_nan_or) == 0

### `TR`

Finally, `TR` is treated like `OR` and `RPR`.

In [117]:
entries_nan_tr = horses_all_trim_intxn_clean[
    horses_all_trim_intxn_clean['TR'].isnull()
]
len(entries_nan_tr)

1304471

In [118]:
mean_tr = np.mean(horses_all_trim_intxn_clean[~horses_all_trim_intxn_clean['TR'].isnull()]['TR'])
mean_tr

51.591987121984786

In [119]:
horses_all_trim_intxn_clean['TR'].fillna(mean_tr, inplace=True)

In [120]:
entries_nan_tr = horses_all_trim_intxn_clean[
    horses_all_trim_intxn_clean['TR'].isnull()
]
assert len(entries_nan_tr) == 0

### `trainerName`

We speculate that `trainerName` may be helpful in predicting the success of a horse, since trainers have a good sense of the chances that a horse will win a race and enter their horses accordingly.

In [121]:
len(horses_all_trim_intxn_clean[horses_all_trim_intxn_clean['trainerName'].isnull()])

539

In [122]:
len(horses_all_trim_intxn_clean[horses_all_trim_intxn_clean['trainerName'] == ''])

0

Very few horses are lacking `trainerName`. We must drop these entries. One suggestion may be to use the trainer of this horse if this horse raced in a different race. However, this does not work because a horse may have several different trainers (especially as they are bought / sold).

In [123]:
entries_w_nan_trainer = horses_all_trim_intxn_clean[
    horses_all_trim_intxn_clean['trainerName'].isnull()
]

In [124]:
# drop all races where there is a horse without trainer information
old_shape = races_all_trim_intxn_clean.shape
print(f"old shape {old_shape}")

races_all_trim_intxn_clean = races_all_trim_intxn_clean[
                                 ~races_all_trim_intxn_clean['rid'].isin(entries_w_nan_trainer['rid'])
                             ]

new_shape = races_all_trim_intxn_clean.shape
print(f"new shape {new_shape}")

assert old_shape[0] - len(entries_w_nan_trainer['rid'].unique()) == new_shape[0]

old shape (320114, 11)
new shape (319778, 11)


In [125]:
# drop entries from all races where one or more horses do not have trainer information
old_shape = horses_all_trim_intxn_clean.shape
print(f"old shape {old_shape}")

horses_all_trim_intxn_clean = horses_all_trim_intxn_clean[
                                 ~horses_all_trim_intxn_clean['rid'].isin(entries_w_nan_trainer['rid'])
                             ]

new_shape = horses_all_trim_intxn_clean.shape
print(f"new shape {new_shape}")

old shape (3338946, 18)
new shape (3335293, 18)


In [126]:
# sanity checks
assert set(horses_all_trim_intxn_clean['rid']).symmetric_difference(set(races_all_trim_intxn_clean['rid'])) == set()
assert len(horses_all_trim_intxn_clean['rid'][
    horses_all_trim_intxn_clean['rid'].isin(entries_w_nan_trainer['rid'])
]) == 0
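
The drop pattern used above (remove every race that contains a bad entry from both tables, then check that the two tables still agree on `rid`) recurs for several columns. A plain-Python sketch of it on made-up rows, where `drop_races` is an illustrative helper and not a function from the notebook:

```python
def drop_races(horses, races, bad_rids):
    """Remove every race in `bad_rids` from both the horse and race tables."""
    bad = set(bad_rids)
    kept_horses = [h for h in horses if h["rid"] not in bad]
    kept_races = [r for r in races if r["rid"] not in bad]
    return kept_horses, kept_races


# illustrative rows, not taken from the dataset
horses = [
    {"rid": 1, "horseName": "A", "trainerName": "T1"},
    {"rid": 1, "horseName": "B", "trainerName": None},
    {"rid": 2, "horseName": "C", "trainerName": "T2"},
]
races = [{"rid": 1}, {"rid": 2}]

# a race is bad if any of its horses lacks a trainer
bad_rids = {h["rid"] for h in horses if h["trainerName"] is None}

horses_clean, races_clean = drop_races(horses, races, bad_rids)

# same sanity check as the notebook: both tables cover the same races
rids_horses = {h["rid"] for h in horses_clean}
rids_races = {r["rid"] for r in races_clean}
```

Note that horse "A" is removed even though its own row is complete, because the whole race is dropped rather than the single bad row; this is what keeps the two tables consistent.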

### `jockeyName`

We speculate that `jockeyName` may be helpful in predicting the success of a horse. Since being a jockey is a profession, some jockeys are obviously better than others. Additionally, jockeys may ride different horses, and the same horse may have different jockeys.

In [127]:
len(horses_all_trim_intxn_clean[horses_all_trim_intxn_clean['jockeyName'].isnull()])

45

In [128]:
len(horses_all_trim_intxn_clean[horses_all_trim_intxn_clean['jockeyName'] == ''])

0

Very few horses are lacking `jockeyName`. For the same reason as `trainerName`, we must drop these entries.

In [129]:
entries_w_nan_jockey = horses_all_trim_intxn_clean[
    horses_all_trim_intxn_clean['jockeyName'].isnull()
]

In [130]:
# drop all races where there is a horse without jockey information
old_shape = races_all_trim_intxn_clean.shape
print(f"old shape {old_shape}")

races_all_trim_intxn_clean = races_all_trim_intxn_clean[
                                 ~races_all_trim_intxn_clean['rid'].isin(entries_w_nan_jockey['rid'])
                             ]

new_shape = races_all_trim_intxn_clean.shape
print(f"new shape {new_shape}")

assert old_shape[0] - len(entries_w_nan_jockey['rid'].unique()) == new_shape[0]

old shape (319778, 11)
new shape (319760, 11)


In [131]:
# drop entries from all races where one or more horses do not have jockey information
old_shape = horses_all_trim_intxn_clean.shape
print(f"old shape {old_shape}")

horses_all_trim_intxn_clean = horses_all_trim_intxn_clean[
                                 ~horses_all_trim_intxn_clean['rid'].isin(entries_w_nan_jockey['rid'])
                             ]

new_shape = horses_all_trim_intxn_clean.shape
print(f"new shape {new_shape}")

old shape (3335293, 18)
new shape (3335069, 18)


In [132]:
# sanity checks
assert set(horses_all_trim_intxn_clean['rid']).symmetric_difference(set(races_all_trim_intxn_clean['rid'])) == set()
assert len(horses_all_trim_intxn_clean['rid'][
    horses_all_trim_intxn_clean['rid'].isin(entries_w_nan_jockey['rid'])
]) == 0

### `father`

We suspect this feature is less important than the others and only really carries value for new horses, but let's try to understand what data is missing here.

In [133]:
len(horses_all_trim_intxn_clean[horses_all_trim_intxn_clean['father'].isnull()])

148

In [134]:
len(horses_all_trim_intxn_clean[horses_all_trim_intxn_clean['father'] == ''])

0

Let's try to correct this missing data by seeing if this horse is elsewhere in the dataset with a valid `father`.

In [135]:
entries_w_nan_father = horses_all_trim_intxn_clean[horses_all_trim_intxn_clean['father'].isnull()]
horses_w_nan_father = entries_w_nan_father['horseName'].unique()

In [136]:
horses_w_father_can_be_corrected = []
horses_corrective_father_data = {}

for horse_name in tqdm(horses_w_nan_father):
    entries_same_horse = horses_all_trim_intxn_clean[horses_all_trim_intxn_clean['horseName'] == horse_name]
    entries_same_horse_valid_father = entries_same_horse[entries_same_horse['father'].notnull()]
    
    if len(entries_same_horse_valid_father) > 0:
        horses_w_father_can_be_corrected.append(horse_name)
        horses_corrective_father_data[horse_name] = entries_same_horse_valid_father['father'].iloc[0]

100%|██████████| 101/101 [00:21<00:00,  4.60it/s]


In [137]:
len(horses_w_father_can_be_corrected)

17

In [138]:
horses_corrective_father_data

{'Angel Face': 'King Kamehameha',
 'Royal Touch': 'Dubawi',
 'Midnight Soprano': 'Celtic Swing',
 'Magical Dream': 'Dream Ahead',
 'Cubanita': 'Selkirk',
 'Silky': 'Montjeu',
 'Say': 'Galileo',
 "We'll Go Walking": 'Authorized',
 'Testosterone': 'Dansili',
 'Beach Of Falesa': 'Dylan Thomas',
 'Feel The Heat': 'Firebreak',
 'Trajectory': 'Dubai Destination',
 'Don Gwarli': 'Eton Lad',
 'Christian Soldier': 'Tickled Pink',
 'Fire King': 'Falbrav',
 'Statesmanship': 'Dubawi',
 'Benji': 'Elusive Fort'}

In [139]:
for idx, row in entries_w_nan_father.iterrows():   
    if row['horseName'] in horses_w_father_can_be_corrected:
        horses_all_trim_intxn_clean.at[idx, 'father'] = horses_corrective_father_data[row['horseName']]

Now, drop those entries that cannot be corrected.

In [140]:
entries_w_nan_father = horses_all_trim_intxn_clean[horses_all_trim_intxn_clean['father'].isnull()]
len(entries_w_nan_father)

120

In [141]:
# drop all races where there is a horse without father information
old_shape = races_all_trim_intxn_clean.shape
print(f"old shape {old_shape}")

races_all_trim_intxn_clean = races_all_trim_intxn_clean[
                                 ~races_all_trim_intxn_clean['rid'].isin(entries_w_nan_father['rid'])
                             ]

new_shape = races_all_trim_intxn_clean.shape
print(f"new shape {new_shape}")

assert old_shape[0] - len(entries_w_nan_father['rid'].unique()) == new_shape[0]

old shape (319760, 11)
new shape (319658, 11)


In [142]:
# drop entries from all races where one or more horses do not have father information
old_shape = horses_all_trim_intxn_clean.shape
print(f"old shape {old_shape}")

horses_all_trim_intxn_clean = horses_all_trim_intxn_clean[
                                 ~horses_all_trim_intxn_clean['rid'].isin(entries_w_nan_father['rid'])
                             ]

new_shape = horses_all_trim_intxn_clean.shape
print(f"new shape {new_shape}")

old shape (3335069, 18)
new shape (3334017, 18)


In [143]:
# sanity checks
assert set(horses_all_trim_intxn_clean['rid']).symmetric_difference(set(races_all_trim_intxn_clean['rid'])) == set()
assert len(horses_all_trim_intxn_clean['rid'][
    horses_all_trim_intxn_clean['rid'].isin(entries_w_nan_father['rid'])
]) == 0

### `mother`

Same as `father`.

In [144]:
len(horses_all_trim_intxn_clean[horses_all_trim_intxn_clean['mother'].isnull()])

477

In [145]:
len(horses_all_trim_intxn_clean[horses_all_trim_intxn_clean['mother'] == ''])

0

Let's try to correct this missing data by seeing if this horse is elsewhere in the dataset with a valid `mother`.

In [146]:
entries_w_nan_mother = horses_all_trim_intxn_clean[horses_all_trim_intxn_clean['mother'].isnull()]
horses_w_nan_mother = entries_w_nan_mother['horseName'].unique()

In [147]:
horses_w_mother_can_be_corrected = []
horses_corrective_mother_data = {}

for horse_name in tqdm(horses_w_nan_mother):
    entries_same_horse = horses_all_trim_intxn_clean[horses_all_trim_intxn_clean['horseName'] == horse_name]
    entries_same_horse_valid_mother = entries_same_horse[entries_same_horse['mother'].notnull()]
    
    if len(entries_same_horse_valid_mother) > 0:
        horses_w_mother_can_be_corrected.append(horse_name)
        horses_corrective_mother_data[horse_name] = entries_same_horse_valid_mother['mother'].iloc[0]

100%|██████████| 382/382 [01:34<00:00,  4.04it/s]


In [148]:
len(horses_w_mother_can_be_corrected)

57

In [149]:
horses_corrective_mother_data

{'Angel Face': 'One For Rose',
 'Overland': 'Miss Oversea',
 'Many Rivers': 'Ferry Boat Lady',
 'Shy Baby': 'Ballycurrane',
 'Indigo Girl': 'Montare',
 'Groomsman': 'Trois Heures Apres',
 'Giovanni': 'Golden Wave',
 'Felon': 'Farah',
 'No Reply': 'En Grisaille',
 'Deputy Indy': 'Rate Shock',
 'By Far': 'Perfidie',
 'San Lorenzo': 'Sanchez',
 'Noble Crusader': 'Suitably Discreet',
 'Flashing Star': 'Fair Dream',
 'Green Warrior': 'Starlit Sky',
 'Secret Ridge': 'Love Secret',
 'Miracle Man': 'Stormy Zaph',
 'Bright Gold': 'Miss Brightside',
 'Divide And Conquer': 'Madam Rocher',
 'Lucky Thirteen': 'Lingua Franca',
 'Chestnut Lady': 'Lady Dee',
 'Transmit': 'Apple Brandy',
 'Host': 'Colonna Traiana',
 'Royal Wave': 'Air Biscuit',
 'Rotary': 'Tarry',
 'Lucky Baby': 'Make Me Strong',
 'Everyman': 'Maid To Dance',
 'Sweet Street': 'Beg La Eile',
 'In The Spotlight': 'Radiate',
 'Solid Strike': 'Solid Land',
 'Elsie Jo': 'Joy St Clair',
 'Miss Van Gogh': 'Accede',
 'Appenzell': 'Autumn Fores

In [150]:
for idx, row in entries_w_nan_mother.iterrows():   
    if row['horseName'] in horses_w_mother_can_be_corrected:
        horses_all_trim_intxn_clean.at[idx, 'mother'] = horses_corrective_mother_data[row['horseName']]

Now, drop those entries that cannot be corrected.

In [151]:
entries_w_nan_mother = horses_all_trim_intxn_clean[horses_all_trim_intxn_clean['mother'].isnull()]
len(entries_w_nan_mother)

391

In [152]:
# drop all races where there is a horse without mother information
old_shape = races_all_trim_intxn_clean.shape
print(f"old shape {old_shape}")

races_all_trim_intxn_clean = races_all_trim_intxn_clean[
                                 ~races_all_trim_intxn_clean['rid'].isin(entries_w_nan_mother['rid'])
                             ]

new_shape = races_all_trim_intxn_clean.shape
print(f"new shape {new_shape}")

assert old_shape[0] - len(entries_w_nan_mother['rid'].unique()) == new_shape[0]

old shape (319658, 11)
new shape (319409, 11)


In [153]:
# drop entries from all races where one or more horses do not have mother information
old_shape = horses_all_trim_intxn_clean.shape
print(f"old shape {old_shape}")

horses_all_trim_intxn_clean = horses_all_trim_intxn_clean[
                                 ~horses_all_trim_intxn_clean['rid'].isin(entries_w_nan_mother['rid'])
                             ]

new_shape = horses_all_trim_intxn_clean.shape
print(f"new shape {new_shape}")

old shape (3334017, 18)
new shape (3331580, 18)
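The two drop cells above repeat for each ancestry column. As a rough sketch, they could be folded into one helper that removes the affected races from both dataframes at once. The function name `drop_races_with_missing` and the toy frames are hypothetical and not part of the original notebook.

```python
import pandas as pd


def drop_races_with_missing(horses: pd.DataFrame, races: pd.DataFrame, col: str):
    """Drop every race (by `rid`) in which at least one horse has a null `col`.

    Hypothetical helper: mirrors the filter-by-`rid` pattern used in the cells above.
    """
    bad_rids = horses.loc[horses[col].isnull(), 'rid'].unique()
    horses_kept = horses[~horses['rid'].isin(bad_rids)]
    races_kept = races[~races['rid'].isin(bad_rids)]
    return horses_kept, races_kept


# toy data (made up for illustration): race 1 has a horse with no mother
toy_horses = pd.DataFrame({
    'rid': [1, 1, 2, 3],
    'mother': ['M1', None, 'M2', 'M3'],
})
toy_races = pd.DataFrame({'rid': [1, 2, 3]})

toy_horses_kept, toy_races_kept = drop_races_with_missing(toy_horses, toy_races, 'mother')
```

Returning both frames together keeps their `rid` sets in sync, which is the property the sanity checks verify.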


In [154]:
# sanity checks
assert set(horses_all_trim_intxn_clean['rid']).symmetric_difference(set(races_all_trim_intxn_clean['rid'])) == set()
assert len(horses_all_trim_intxn_clean['rid'][
    horses_all_trim_intxn_clean['rid'].isin(entries_w_nan_mother['rid'])
]) == 0

### `gfather`

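Handled the same way as `father` and `mother`: fill missing values from other entries of the same horse where possible, then drop the races that still contain a missing value.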

In [155]:
len(horses_all_trim_intxn_clean[horses_all_trim_intxn_clean['gfather'].isnull()])

107673

In [156]:
len(horses_all_trim_intxn_clean[horses_all_trim_intxn_clean['gfather'] == ''])

0

Let's try to correct the missing data by checking whether each affected horse appears elsewhere in the dataset with a valid `gfather`.

In [157]:
entries_w_nan_gfather = horses_all_trim_intxn_clean[horses_all_trim_intxn_clean['gfather'].isnull()]
horses_w_nan_gfather = entries_w_nan_gfather['horseName'].unique()

In [158]:
# takes about 40 minutes: each iteration rescans the whole dataframe for one horse name

horses_w_gfather_can_be_corrected = []
horses_corrective_gfather_data = {}

for horse_name in tqdm(horses_w_nan_gfather):
    entries_same_horse = horses_all_trim_intxn_clean[horses_all_trim_intxn_clean['horseName'] == horse_name]
    entries_same_horse_valid_gfather = entries_same_horse[entries_same_horse['gfather'].notnull()]
    
    if len(entries_same_horse_valid_gfather) > 0:
        horses_w_gfather_can_be_corrected.append(horse_name)
        horses_corrective_gfather_data[horse_name] = entries_same_horse_valid_gfather['gfather'].iloc[0]

100%|██████████| 11499/11499 [41:21<00:00,  4.63it/s]


In [159]:
len(horses_w_gfather_can_be_corrected)

2885

In [160]:
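# fill each missing `gfather` from another entry of the same horse, where one exists
for idx, row in tqdm(entries_w_nan_gfather.iterrows()):
    # dict membership is O(1); the dict has exactly the correctable horse names as keys
    if row['horseName'] in horses_corrective_gfather_data:
        horses_all_trim_intxn_clean.at[idx, 'gfather'] = horses_corrective_gfather_data[row['horseName']]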

107673it [00:11, 9573.75it/s] 
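The lookup and fill loops above scan the dataframe once per horse name. A vectorized alternative would build the lookup with a single `groupby` and fill with `map`. This is a minimal sketch, assuming that horse names identify horses the same way they do in the loops above. The function names and toy data are hypothetical and not part of the original notebook.

```python
import pandas as pd


def first_valid_by_key(df: pd.DataFrame, key: str, col: str) -> pd.Series:
    """Map each `key` value to its first non-null `col` value, in row order.

    Keys that never have a valid value are simply absent from the result.
    """
    return df.dropna(subset=[col]).groupby(key)[col].first()


def fill_from_same_key(df: pd.DataFrame, key: str, col: str) -> pd.DataFrame:
    """Fill null `col` entries using another row that shares the same `key`."""
    lookup = first_valid_by_key(df, key, col)
    out = df.copy()
    # rows whose key has no valid value map to NaN and therefore stay null
    out[col] = out[col].fillna(out[key].map(lookup))
    return out


# toy data (made up for illustration): horse "B" has no valid gfather anywhere
toy = pd.DataFrame({
    'horseName': ['A', 'A', 'B', 'C', 'C'],
    'gfather': ['X', None, None, None, 'Y'],
})
toy_filled = fill_from_same_key(toy, 'horseName', 'gfather')
```

Like the loops, this takes the first valid value in row order, so it should give the same fills while avoiding the per-horse scan.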


Now, drop every race that contains an entry whose `gfather` could not be corrected.

In [161]:
entries_w_nan_gfather = horses_all_trim_intxn_clean[horses_all_trim_intxn_clean['gfather'].isnull()]
len(entries_w_nan_gfather)

79824

In [162]:
# drop all races where there is a horse without gfather information
old_shape = races_all_trim_intxn_clean.shape
print(f"old shape {old_shape}")

races_all_trim_intxn_clean = races_all_trim_intxn_clean[
                                 ~races_all_trim_intxn_clean['rid'].isin(entries_w_nan_gfather['rid'])
                             ]

new_shape = races_all_trim_intxn_clean.shape
print(f"new shape {new_shape}")

assert old_shape[0] - len(entries_w_nan_gfather['rid'].unique()) == new_shape[0]

old shape (319409, 11)
new shape (307896, 11)


In [163]:
# drop entries from all races where one or more horses do not have gfather information
old_shape = horses_all_trim_intxn_clean.shape
print(f"old shape {old_shape}")

horses_all_trim_intxn_clean = horses_all_trim_intxn_clean[
                                 ~horses_all_trim_intxn_clean['rid'].isin(entries_w_nan_gfather['rid'])
                             ]

new_shape = horses_all_trim_intxn_clean.shape
print(f"new shape {new_shape}")

old shape (3331580, 18)
new shape (3199312, 18)


In [164]:
# sanity checks
assert set(horses_all_trim_intxn_clean['rid']).symmetric_difference(set(races_all_trim_intxn_clean['rid'])) == set()
assert len(horses_all_trim_intxn_clean['rid'][
    horses_all_trim_intxn_clean['rid'].isin(entries_w_nan_gfather['rid'])
]) == 0

---

## Sanity Checks

In [165]:
# null check for the races dataframe
print("The following columns have null values:")

for column in races_all_trim_intxn_clean.columns:
    if races_all_trim_intxn_clean[column].isnull().any():
        print(f"  - {column}")

The following columns have null values:
  - hurdles


In [166]:
# null check for the horses dataframe
print("The following columns have null values:")

for column in horses_all_trim_intxn_clean.columns:
    if horses_all_trim_intxn_clean[column].isnull().any():
        print(f"  - {column}")

The following columns have null values:
  - positionL
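The checks above list which columns still contain nulls, but not how many. A per-column count can be produced in one expression, as in this sketch. The helper name `null_counts` and the toy frame are hypothetical.

```python
import pandas as pd


def null_counts(df: pd.DataFrame) -> pd.Series:
    """Return the number of nulls per column, keeping only columns that have any."""
    counts = df.isnull().sum()
    return counts[counts > 0]


# toy data (made up for illustration): only `positionL` has a missing value
toy_check = pd.DataFrame({
    'rid': [1, 2, 3],
    'positionL': [None, 4.0, 3.0],
})
toy_summary = null_counts(toy_check)
```

Applied to `races_all_trim_intxn_clean` or `horses_all_trim_intxn_clean`, the same call would show how many nulls remain in each flagged column.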


## Save Dataframes

In [167]:
races_all_trim_intxn_clean.to_csv(f"{BASE_DIR}/data/csv/races_all_trim_intxn_clean.csv", index=False)

In [168]:
horses_all_trim_intxn_clean.to_csv(f"{BASE_DIR}/data/csv/horses_all_trim_intxn_clean.csv", index=False)

---