# `featurize_horses.ipynb`

### Author: Anthony Hein

#### Last updated: 10/19/2021

# Overview:

We can finally featurize the horse dataset. This primarily means adding fields such as average time, best time, average time of parent, and average time of grandparent. We also want to include fields such as average time under particular weather conditions.

---

## Setup

In [1]:
from datetime import datetime
import git
import os
import re
from typing import List
from tqdm import tqdm
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
BASE_DIR = git.Repo(os.getcwd(), search_parent_directories=True).working_dir
BASE_DIR

'/Users/anthonyhein/Desktop/SML310/project'

---

## Load `horses_augment_times.csv`

In [3]:
horses_augment_times = pd.read_csv(f"{BASE_DIR}/data/csv/horses_augment_times.csv", low_memory=False) 
horses_augment_times.head()

Unnamed: 0,rid,horseName,age,saddle,decimalPrice,trainerName,jockeyName,position,outHandicap,RPR,TR,OR,father,mother,gfather,weight,time
0,302858,Kings Return,6.0,4.0,0.6,W P Mullins,D J Casey,1,0.0,102.0,51.591987,79.654604,King's Ride,Browne's Return,Deep Run,73,277.2
1,302858,Majestic Red I,6.0,5.0,0.047619,John Hackett,Conor O'Dwyer,2,0.0,94.0,51.591987,79.654604,Long Pond,Courtlough Lady,Giolla Mear,73,278.679948
2,302858,Clearly Canadian,6.0,2.0,0.166667,D T Hughes,G Cotter,3,0.0,92.0,51.591987,79.654604,Nordico,Over The Seas,North Summit,71,278.957438
3,302858,Bernestic Wonder,8.0,1.0,0.058824,E McNamara,J Old Jones,4,0.0,71.87665,51.591987,79.654604,Roselier,Miss Reindeer,Reindeer,73,284.507242
4,302858,Beauty's Pride,5.0,6.0,0.038462,J J Lennon,T Martin,5,0.0,71.87665,51.591987,79.654604,Noalto,Elena's Beauty,Tarqogan,66,290.057045


In [4]:
horses_augment_times.shape

(194573, 17)

In [5]:
horses_featurized = horses_augment_times.copy()
horses_featurized.head()

Unnamed: 0,rid,horseName,age,saddle,decimalPrice,trainerName,jockeyName,position,outHandicap,RPR,TR,OR,father,mother,gfather,weight,time
0,302858,Kings Return,6.0,4.0,0.6,W P Mullins,D J Casey,1,0.0,102.0,51.591987,79.654604,King's Ride,Browne's Return,Deep Run,73,277.2
1,302858,Majestic Red I,6.0,5.0,0.047619,John Hackett,Conor O'Dwyer,2,0.0,94.0,51.591987,79.654604,Long Pond,Courtlough Lady,Giolla Mear,73,278.679948
2,302858,Clearly Canadian,6.0,2.0,0.166667,D T Hughes,G Cotter,3,0.0,92.0,51.591987,79.654604,Nordico,Over The Seas,North Summit,71,278.957438
3,302858,Bernestic Wonder,8.0,1.0,0.058824,E McNamara,J Old Jones,4,0.0,71.87665,51.591987,79.654604,Roselier,Miss Reindeer,Reindeer,73,284.507242
4,302858,Beauty's Pride,5.0,6.0,0.038462,J J Lennon,T Martin,5,0.0,71.87665,51.591987,79.654604,Noalto,Elena's Beauty,Tarqogan,66,290.057045


---

## Load `races_clean_augment_clean.csv`

In [None]:
races_clean_augment_clean = pd.read_csv(f"{BASE_DIR}/data/csv/races_clean_augment_clean.csv", low_memory=False) 
races_clean_augment_clean.head()

In [None]:
races_clean_augment_clean.shape

---

## Understand Population of Horses

Now that we have a clean dataset, we want to know how many times each horse actually appears in it, which tells us how reliably we can compute a horse's average time.

In [None]:
horses_featurized['horseName'].value_counts()

In [None]:
len(horses_featurized) / len(horses_featurized['horseName'].unique())

In [None]:
plt.hist(horses_featurized['horseName'].value_counts())

In [None]:
horses_featurized['horseName'].value_counts()[:12960]

In [None]:
horses_featurized['horseName'].value_counts()[:31852]

There are clearly horses that race multiple times. Additionally, about 13000 horses have run at least 5 times, which gives us more than enough data to calculate an average or to use the last $k$ race times for a reasonable $k$. However, about 16000 horses have run only once, so we will have to infer values for these entries.
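
The counts above were read off by slicing the sorted `value_counts()` output. As a minimal sketch, the same numbers could be obtained by comparing the counts against a threshold. The frame `toy` and the helper `count_horses_with_at_least` are hypothetical names used only for illustration, and the data is made up.

```python
import pandas as pd

# Made-up stand-in for horses_featurized: one row per race entry.
toy = pd.DataFrame({
    "horseName": ["A", "A", "A", "B", "B", "C"],
})

# Number of race entries per horse.
races_per_horse = toy["horseName"].value_counts()

def count_horses_with_at_least(counts: pd.Series, k: int) -> int:
    """Number of horses that appear in at least k race entries."""
    return int((counts >= k).sum())

n_at_least_2 = count_horses_with_at_least(races_per_horse, 2)
n_exactly_1 = int((races_per_horse == 1).sum())
print(n_at_least_2, n_exactly_1)
```

On the real data, `k = 5` would correspond to the "at least 5 times" figure quoted above.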

Let's also check how often the father appears as a racer.

In [None]:
horses_featurized[horses_featurized['father'].isin(horses_featurized['horseName'].unique())]

In [None]:
len(
    horses_featurized[
        horses_featurized['father'].isin(horses_featurized['horseName'].unique())
    ]['horseName'].unique()
) / len(horses_featurized['horseName'].unique())

In [None]:
horses_featurized[horses_featurized['horseName'] == 'Orchestra']

Better than we expected: about 25% of horses have a father who is also recorded in this dataset.

We repeat this for the mother.

In [None]:
horses_featurized[horses_featurized['mother'].isin(horses_featurized['horseName'].unique())]

In [None]:
len(
    horses_featurized[
        horses_featurized['mother'].isin(horses_featurized['horseName'].unique())
    ]['horseName'].unique()
) / len(horses_featurized['horseName'].unique())

In [None]:
horses_featurized[horses_featurized['horseName'] == 'Gravieres']

Again, somewhat surprisingly, about 17% of horses have a mother who is also recorded in this dataset.

We repeat this for the grandfather.

In [None]:
horses_featurized[horses_featurized['gfather'].isin(horses_featurized['horseName'].unique())]

In [None]:
len(
    horses_featurized[
        horses_featurized['gfather'].isin(horses_featurized['horseName'].unique())
    ]['horseName'].unique()
) / len(horses_featurized['horseName'].unique())

In [None]:
horses_featurized[horses_featurized['horseName'] == 'Raise You Ten']

Here, our luck seems to run dry, with only 8% of the population having a grandfather who also raced in this dataset. For this reason, we will drop the `gfather` column, as too many of the values derived from it would have to be inferred. We can also argue that the `father` column already carries information about the `gfather`.

In [None]:
horses_featurized = horses_featurized.drop(columns=['gfather'])
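
The father, mother, and grandfather checks in this section all follow one pattern, so they could be folded into a single helper. This is only a sketch on made-up data; `ancestor_coverage` and `toy` are hypothetical names, and the helper assumes `horseName` has no missing values.

```python
import pandas as pd

# Made-up entries: "Sire" races and is also the father of two other racers.
toy = pd.DataFrame({
    "horseName": ["Sire", "Foal1", "Foal1", "Foal2", "Loner"],
    "father":    ["Old",  "Sire",  "Sire",  "Sire",  "Unknown"],
})

def ancestor_coverage(df: pd.DataFrame, col: str) -> float:
    """Fraction of distinct horses whose `col` ancestor also appears as a racer."""
    racers = df["horseName"].unique()
    covered = df.loc[df[col].isin(racers), "horseName"].nunique()
    return covered / df["horseName"].nunique()

coverage = ancestor_coverage(toy, "father")
print(coverage)
```

Calling the helper with `"mother"` or `"gfather"` on a frame that has those columns would reproduce the other two ratios.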

---

## Map Horse Name to all Races

In [None]:
all_horse_names = np.concatenate((
    horses_featurized['horseName'].unique(),
    horses_featurized['father'].unique(),
    horses_featurized['mother'].unique(),
))

all_horse_names = np.unique(all_horse_names)

In [None]:
horse_to_races = {}

for horse_name in tqdm(all_horse_names):
    horse_to_races[horse_name] = horses_featurized[horses_featurized['horseName'] == horse_name]

In [None]:
horse_to_races['Gravieres']
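
The loop above filters the full frame once per name. A possible alternative, sketched here on made-up data, is to let `groupby` split the frame in a single pass and then add empty entries for parents that never raced. The names `toy`, `horse_to_races_toy`, and `empty` are hypothetical.

```python
import pandas as pd

# Made-up entries; "X" appears only as a father, never as a racer.
toy = pd.DataFrame({
    "rid":       [1, 1, 2, 3],
    "horseName": ["A", "B", "A", "C"],
    "father":    ["X", "A", "X", "B"],
})

# One pass over the frame: each group is the sub-frame for one horse.
horse_to_races_toy = {name: group for name, group in toy.groupby("horseName")}

# Parents that never raced still get an (empty) entry, as in the loop above.
empty = toy.iloc[0:0]
for name in toy["father"].unique():
    horse_to_races_toy.setdefault(name, empty)

print(sorted(horse_to_races_toy))
```

On a frame with mothers as well, the same `setdefault` loop would be repeated for the `mother` column.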

---

## Map Horse and Race to Previous Races

We cannot use the overall average race time of a horse as a feature, since this would encode present and future information that would not actually be available at the time of a given race. We have to be careful to use only past information at any given race. To do so, we first construct a map from a given horse and race to all previous races with that horse.

In [None]:
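def get_all_races(horse_name: str) -> pd.DataFrame:
    """Return every race entry for `horse_name`, joined with its race details."""
    df = horse_to_races[horse_name]
    if len(df) == 0:
        # Horses that appear only as parents have no entries of their own.
        return pd.DataFrame()
    else:
        return df.merge(races_clean_augment_clean, how='inner', on='rid')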

In [None]:
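def get_prev_races(horse_name: str, rid: int) -> pd.DataFrame:
    """Return the entries of `horse_name` dated strictly before race `rid`."""
    df = horse_to_races[horse_name]
    if len(df) <= 1:
        # A horse with a single entry has no earlier race to draw on.
        return pd.DataFrame()
    else:
        df = df.merge(races_clean_augment_clean, how='inner', on='rid')
        # Strict '<' keeps the current race itself out of the history.
        return df[df['datetime'] < df[df['rid'] == rid].iloc[0]['datetime']]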

In [None]:
get_all_races('Gravieres')

In [None]:
get_prev_races('Gravieres', 27686)

Looks like it's working, so let's run it.

In [None]:
horse_idx_to_prev_races = {}

for idx, row in tqdm(horses_featurized.iterrows(), total=len(horses_featurized)):
    horse_idx_to_prev_races[idx] = get_prev_races(row['horseName'], row['rid'])
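
As a quick sanity check of the strict-past rule, the sketch below merges a made-up entries table with a made-up races table once and then filters per race. The names `entries`, `races`, `history`, and `races_before` are hypothetical, and the toy parses `datetime` with `pd.to_datetime`, which is an assumption about how the real column is stored.

```python
import pandas as pd

# Made-up entries for one horse and a matching races table.
entries = pd.DataFrame({
    "rid": [10, 11, 12],
    "horseName": ["A", "A", "A"],
    "time": [60.0, 62.0, 61.0],
})
races = pd.DataFrame({
    "rid": [10, 11, 12],
    "datetime": pd.to_datetime(["2020-01-01", "2020-02-01", "2020-03-01"]),
})

# Merge once, then filter per race.
history = entries.merge(races, how="inner", on="rid")

def races_before(history: pd.DataFrame, rid: int) -> pd.DataFrame:
    """Rows of `history` dated strictly before race `rid`."""
    current = history.loc[history["rid"] == rid, "datetime"].iloc[0]
    return history[history["datetime"] < current]

first = races_before(history, 10)  # earliest race: no history at all
last = races_before(history, 12)   # latest race: both earlier races
print(len(first), list(last["rid"]))
```

The first race of a horse gets an empty history, and no race ever sees itself, which is the property the features below rely on.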

---

## Map Horse and Race to Average Speed

We cannot use the average time directly, since race distances vary. Average speed (`metric` divided by `time`) is somewhat better, though it is unnecessarily harsh on horses that frequently run long distances (and thus must pace themselves).

In [None]:
horse_idx_to_prev_races[116083][['time', 'metric']]

In [None]:
def get_average_speed(df) -> float:
    """Mean of per-race speed (`metric` / `time`) over `df`, or NaN if empty."""
    if len(df) == 0:
        return float('nan')
    else:
        return np.mean(df['metric'] / df['time'])

In [None]:
horse_idx_to_avg_speed = {}

for idx, _ in tqdm(horses_featurized.iterrows(), total=len(horses_featurized)):
    horse_idx_to_avg_speed[idx] = get_average_speed(horse_idx_to_prev_races[idx])

In [None]:
horse_idx_to_avg_speed[116083]
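
The same past-only average could also be computed without per-row dictionaries, using grouped cumulative sums. The sketch below is an alternative on made-up data, not the method used in this notebook. It assumes each horse has at most one entry per `datetime` and no missing `metric` or `time` values; `toy`, `prior_sum`, and `prior_count` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Made-up entries with the race datetime already attached.
toy = pd.DataFrame({
    "horseName": ["A", "A", "A", "B"],
    "datetime": pd.to_datetime(
        ["2020-01-01", "2020-02-01", "2020-03-01", "2020-01-15"]
    ),
    "metric": [1000.0, 2000.0, 1200.0, 1600.0],
    "time": [100.0, 250.0, 100.0, 160.0],
})
toy["speed"] = toy["metric"] / toy["time"]

# Order each horse's races chronologically.
toy = toy.sort_values(["horseName", "datetime"])
grouped = toy.groupby("horseName")["speed"]

# Running total and count over strictly earlier races:
# the cumulative sum minus the current race, and the 0-based position.
prior_sum = grouped.cumsum() - toy["speed"]
prior_count = grouped.cumcount()

# A horse's first race has no history, so its average is NaN.
toy["avg_speed"] = prior_sum / prior_count.where(prior_count > 0)
print(toy[["horseName", "datetime", "speed", "avg_speed"]])
```

Subtracting the current race's speed from the cumulative sum is what keeps present information out of the feature.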

---

## Map Horse and Race to Most Recent Speed

In [None]:
horse_idx_to_prev_races[116083][['time', 'metric', 'datetime']]

In [None]:
def get_prev_speed(df) -> float:
    """Speed in the latest race in `df`, or NaN if `df` is empty."""
    if len(df) == 0:
        return float('nan')
    else:
        # Scan for the row with the latest datetime; ties keep the first one seen.
        previous_datetime = df.iloc[0]['datetime']
        previous_speed = df.iloc[0]['metric'] / df.iloc[0]['time']
        for _, row in df.iterrows():
            if row['datetime'] > previous_datetime:
                previous_datetime = row['datetime']
                previous_speed = row['metric'] / row['time']
        return previous_speed

In [None]:
horse_idx_to_prev_speed = {}

for idx, _ in tqdm(horses_featurized.iterrows(), total=len(horses_featurized)):
    horse_idx_to_prev_speed[idx] = get_prev_speed(horse_idx_to_prev_races[idx])

In [None]:
horse_idx_to_prev_speed[116083]
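
The speed in the most recent earlier race can also be expressed as a grouped shift. This is a sketch of an alternative on made-up data, assuming each horse has at most one entry per `datetime`; `toy` and `ordered` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Made-up entries, deliberately not in chronological order.
toy = pd.DataFrame({
    "horseName": ["A", "B", "A", "A"],
    "datetime": pd.to_datetime(
        ["2020-03-01", "2020-01-15", "2020-01-01", "2020-02-01"]
    ),
    "metric": [1200.0, 1600.0, 1000.0, 2000.0],
    "time": [100.0, 160.0, 100.0, 250.0],
})
toy["speed"] = toy["metric"] / toy["time"]

# Within each horse, in date order, shift speeds down by one race.
ordered = toy.sort_values(["horseName", "datetime"])
# Assignment aligns on the index, so the original row order is preserved.
toy["prev_speed"] = ordered.groupby("horseName")["speed"].shift(1)
print(toy[["horseName", "datetime", "speed", "prev_speed"]])
```

A horse's first race has nothing to shift in, so it receives `NaN`, matching the empty-history case of `get_prev_speed`.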

---

## Map Horse to Father Average Speed

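Here, we assume that a horse only races after its parent has stopped racing. This simplifies the calculation: every race the parent ever ran then counts as past information, so we can average over all of them without filtering by date.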

In [None]:
horse_to_father_avg_speed = {}

for horse_name in tqdm(all_horse_names):
    df = horses_featurized[horses_featurized['horseName'] == horse_name]
    if len(df) > 0:
        father = df.iloc[0]['father']
        horse_to_father_avg_speed[horse_name] = get_average_speed(get_all_races(father))

In [None]:
get_all_races('Musical Waves')

In [None]:
get_all_races('Orchestra')

In [None]:
horse_to_father_avg_speed['Musical Waves']

In [None]:
horse_idx_to_father_avg_speed = {}

for idx, row in tqdm(horses_featurized.iterrows(), total=len(horses_featurized)):
    horse_idx_to_father_avg_speed[idx] = horse_to_father_avg_speed[row['horseName']]
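
The assumption that a horse only races after its parent has stopped could be checked by comparing each horse's first race with its father's last race. The sketch below does this on made-up data in which `datetime` is already attached to each entry (in this notebook it would come from merging with the races table). All names are hypothetical.

```python
import pandas as pd

# Made-up entries: "Early" starts racing before its father "Sire" retires.
toy = pd.DataFrame({
    "horseName": ["Sire", "Sire", "Foal", "Foal", "Early"],
    "father":    ["Old",  "Old",  "Sire", "Sire", "Sire"],
    "datetime": pd.to_datetime(
        ["2010-05-01", "2012-05-01", "2015-05-01", "2016-05-01", "2011-05-01"]
    ),
})

first_race = toy.groupby("horseName")["datetime"].min()
last_race = toy.groupby("horseName")["datetime"].max()

# One row per (horse, father) pair; keep only fathers that raced themselves.
pairs = toy[["horseName", "father"]].drop_duplicates()
pairs = pairs[pairs["father"].isin(last_race.index)].copy()
pairs["child_first"] = pairs["horseName"].map(first_race)
pairs["father_last"] = pairs["father"].map(last_race)

# The assumption is violated when the child starts before the father stops.
pairs["overlaps"] = pairs["child_first"] <= pairs["father_last"]
overlap_share = pairs["overlaps"].mean()
print(overlap_share)
```

A small share of overlapping pairs on the real data would support the simplification; a large one would suggest filtering the parent's races by date as well.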

---

## Map Horse to Mother Average Speed

In [None]:
horse_to_mother_avg_speed = {}

for horse_name in tqdm(all_horse_names):
    df = horses_featurized[horses_featurized['horseName'] == horse_name]
    if len(df) > 0:
        mother = df.iloc[0]['mother']
        horse_to_mother_avg_speed[horse_name] = get_average_speed(get_all_races(mother))

In [None]:
get_all_races('Homer I')

In [None]:
get_average_speed(get_all_races('Gravieres'))

In [None]:
horse_to_mother_avg_speed['Homer I']

In [None]:
horse_idx_to_mother_avg_speed = {}

for idx, row in tqdm(horses_featurized.iterrows(), total=len(horses_featurized)):
    horse_idx_to_mother_avg_speed[idx] = horse_to_mother_avg_speed[row['horseName']]

---

## Map Horse to Trainer Average Speed and Position

A similar problem arises here: we must make sure that we do not inadvertently include information from the race to which we are attaching this feature. To ensure this, we only use races that occurred before the race in question.

Also, here we introduce the idea of average position, which captures the fact that trainers are involved in the decision of whether a horse will run on a given day.

In [None]:
horses_featurized['trainerName'].value_counts()[:1000]

In [None]:
trainer_to_races = {}

for trainer_name in tqdm(horses_featurized['trainerName'].unique()):
    trainer_to_races[trainer_name] = horses_featurized[horses_featurized['trainerName'] == trainer_name]

In [None]:
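def get_prev_trainer_races(trainer_name: str, rid: int) -> pd.DataFrame:
    """Return the trainer's entries dated strictly before race `rid`."""
    df = trainer_to_races[trainer_name]
    if len(df) <= 1:
        # A trainer with a single entry has no earlier race to draw on.
        return pd.DataFrame()
    else:
        df = df.merge(races_clean_augment_clean, how='inner', on='rid')
        # Strict '<' also excludes the trainer's other entries at the same datetime.
        return df[df['datetime'] < df[df['rid'] == rid].iloc[0]['datetime']]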

In [None]:
horse_idx_to_prev_trainer_races = {}

for idx, row in tqdm(horses_featurized.iterrows(), total=len(horses_featurized)):
    horse_idx_to_prev_trainer_races[idx] = get_prev_trainer_races(row['trainerName'], row['rid'])

In [None]:
horse_idx_to_prev_trainer_races[30149]

In [None]:
def get_average_position(df) -> float:
    if len(df) == 0:
        return float('nan')
    else:
        return np.mean(df['position'])

In [None]:
horse_to_trainer_avg_speed = {}
horse_to_trainer_avg_position = {}

for idx, row in tqdm(horses_featurized.iterrows(), total=len(horses_featurized)):
    prev_trainer_races = horse_idx_to_prev_trainer_races[idx]
    horse_to_trainer_avg_speed[idx] = get_average_speed(prev_trainer_races)
    horse_to_trainer_avg_position[idx] = get_average_position(prev_trainer_races)

In [None]:
horses_featurized[horses_featurized['trainerName'] == 'M P Cash']

In [None]:
horse_to_trainer_avg_speed[85688]

In [None]:
horse_to_trainer_avg_speed[76966]

In [None]:
horse_to_trainer_avg_speed[113600]

In [None]:
horse_to_trainer_avg_position[85688]

In [None]:
horse_to_trainer_avg_position[76966]

In [None]:
horse_to_trainer_avg_position[113600]
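
A trainer can have several horses in the same race, so a past-only trainer average has to exclude every entry at the current datetime, not just the current row. The sketch below shows one possible vectorized way to do that on made-up data: collapse to one row per trainer and datetime, take running totals, subtract the current race, and merge back. All names are hypothetical, and this is an alternative to the dictionary approach above, not a description of it.

```python
import numpy as np
import pandas as pd

# Made-up entries: trainer "T" has two horses in the first race.
toy = pd.DataFrame({
    "trainerName": ["T", "T", "T", "T", "U"],
    "datetime": pd.to_datetime(
        ["2020-01-01", "2020-01-01", "2020-02-01", "2020-03-01", "2020-01-01"]
    ),
    "position": [1, 3, 5, 4, 5],
})

# One row per (trainer, datetime): summed positions and number of entries.
per_race = (
    toy.groupby(["trainerName", "datetime"])["position"]
       .agg(pos_sum="sum", n_entries="count")
       .sort_index()
)

# Totals over strictly earlier datetimes: cumulative minus current.
by_trainer = per_race.groupby(level="trainerName")
prior_sum = by_trainer["pos_sum"].cumsum() - per_race["pos_sum"]
prior_count = by_trainer["n_entries"].cumsum() - per_race["n_entries"]
per_race["trainer_avg_position"] = prior_sum / prior_count.where(prior_count > 0)

# Attach the per-race value back to every entry of that trainer and datetime.
toy = toy.merge(
    per_race["trainer_avg_position"].reset_index(),
    on=["trainerName", "datetime"],
    how="left",
)
print(toy)
```

Both of trainer "T"'s entries in the first race receive `NaN`, because neither may see the other's result.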

---

## Checkpoint `horses_featurized`

As in `augment_races.ipynb`, we first make a DataFrame for each feature.

In [None]:
rename_cols = {
    0: 'avg_speed',
}

df_avg_speed = pd.DataFrame.from_dict(horse_idx_to_avg_speed, orient='index').rename(columns=rename_cols)
df_avg_speed.sample(5)

In [None]:
rename_cols = {
    0: 'prev_speed',
}

df_prev_speed = pd.DataFrame.from_dict(horse_idx_to_prev_speed, orient='index').rename(columns=rename_cols)
df_prev_speed.sample(5)

In [None]:
rename_cols = {
    0: 'father_avg_speed',
}

df_father_avg_speed = pd.DataFrame.from_dict(horse_idx_to_father_avg_speed, orient='index').rename(columns=rename_cols)
df_father_avg_speed.sample(5)

In [None]:
rename_cols = {
    0: 'mother_avg_speed',
}

df_mother_avg_speed = pd.DataFrame.from_dict(horse_idx_to_mother_avg_speed, orient='index').rename(columns=rename_cols)
df_mother_avg_speed.sample(5)

In [None]:
rename_cols = {
    0: 'trainer_avg_speed',
}

df_trainer_avg_speed = pd.DataFrame.from_dict(horse_to_trainer_avg_speed, orient='index').rename(columns=rename_cols)
df_trainer_avg_speed.sample(5)

In [None]:
rename_cols = {
    0: 'trainer_avg_position',
}

df_trainer_avg_position = pd.DataFrame.from_dict(horse_to_trainer_avg_position, orient='index').rename(columns=rename_cols)
df_trainer_avg_position.sample(5)

In [None]:
horses_featurized = horses_featurized.join(df_avg_speed) \
                                     .join(df_prev_speed) \
                                     .join(df_father_avg_speed) \
                                     .join(df_mother_avg_speed) \
                                     .join(df_trainer_avg_speed) \
                                     .join(df_trainer_avg_position)
horses_featurized.head()

In [None]:
horses_featurized.shape
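
The six `from_dict` / `rename` cells above could be condensed by building one feature frame from named `pd.Series` objects and joining it once. A minimal sketch with made-up dictionaries (`avg_speed` and `prev_speed` here are hypothetical stand-ins for the per-row maps built earlier):

```python
import numpy as np
import pandas as pd

# Made-up base frame and per-row feature dictionaries keyed by row index.
toy = pd.DataFrame({"horseName": ["A", "A", "B"]}, index=[0, 1, 2])
avg_speed = {0: float("nan"), 1: 10.0, 2: float("nan")}
prev_speed = {0: float("nan"), 1: 10.0, 2: float("nan")}

# Each dictionary becomes one named column, aligned on the row index.
features = pd.DataFrame({
    "avg_speed": pd.Series(avg_speed),
    "prev_speed": pd.Series(prev_speed),
})

# A single index-based join attaches all features at once.
toy_featurized = toy.join(features)
print(toy_featurized)
```

Because the join is on the index, the row count of the base frame is unchanged, which is what the shape check above verifies.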

In [None]:
horses_featurized.to_csv(f"{BASE_DIR}/data/csv/horses_featurized.csv", index=False)

---