# `streamline_horses_augmentation.ipynb`

### Author: Anthony Hein

#### Last updated: 11/3/2021

# Overview:

Augment the horses dataset so that each row includes an additional column: the _estimated_ finishing time of the horse. The estimate is calculated from the horse's distance (in lengths) behind the first-place finisher, the distance of the race, and the winning time of the first-place finisher. A second column, the ratio of this finishing time to the winning time, is added as well.

Some of this borders on featurization. For us, the difference between augmenting and featurizing the dataset is that featurization considers a row with respect to other rows in the dataset, whereas augmentation can be computed for a row in isolation.

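The distinction can be illustrated with a minimal sketch. The toy rows and the `is_winner` and `field_size` columns below are hypothetical and are not part of this notebook; they only show that an augmentation needs one row at a time, while a featurization needs the other rows of the same race.

```python
# Hypothetical toy rows; only `rid` mirrors a real column name in this notebook.
rows = [
    {"rid": 1, "horse": "A", "lengths_behind": 0.0},
    {"rid": 1, "horse": "B", "lengths_behind": 8.0},
    {"rid": 2, "horse": "C", "lengths_behind": 0.0},
]

def augment(row):
    """Augmentation: the new value depends only on the row itself."""
    return {**row, "is_winner": row["lengths_behind"] == 0.0}

def featurize(all_rows):
    """Featurization: the new value depends on other rows (same race id)."""
    field_size = {}
    for r in all_rows:
        field_size[r["rid"]] = field_size.get(r["rid"], 0) + 1
    return [{**r, "field_size": field_size[r["rid"]]} for r in all_rows]

augmented = [augment(r) for r in rows]   # row by row, in isolation
featurized = featurize(rows)             # needs the whole dataset at once
```
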
---

## Setup

In [1]:
from datetime import datetime
import git
import os
import re
from typing import List
from tqdm import tqdm
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
BASE_DIR = git.Repo(os.getcwd(), search_parent_directories=True).working_dir
BASE_DIR

'/Users/anthonyhein/Desktop/SML310/project'

---

## Load `horses_selected_trimmed_clean.csv`

In [3]:
horses_selected_trimmed_clean = pd.read_csv(f"{BASE_DIR}/data/streamline/horses_selected_trimmed_clean.csv",
                                            low_memory=False) 
horses_selected_trimmed_clean.head()

Unnamed: 0,rid,horseName,age,saddle,decimalPrice,isFav,trainerName,jockeyName,position,positionL,...,RPR,TR,OR,father,mother,gfather,weight,res_win,res_place,res_show
0,302858,Kings Return,6.0,4.0,0.6,1,W P Mullins,D J Casey,1,0,...,102.0,,,King's Ride,Browne's Return,Deep Run,73,1,1,0
1,302858,Majestic Red I,6.0,5.0,0.047619,0,John Hackett,Conor O'Dwyer,2,8,...,94.0,,,Long Pond,Courtlough Lady,Giolla Mear,73,0,1,0
2,302858,Clearly Canadian,6.0,2.0,0.166667,0,D T Hughes,G Cotter,3,1.5,...,92.0,,,Nordico,Over The Seas,North Summit,71,0,0,0
3,302858,Bernestic Wonder,8.0,1.0,0.058824,0,E McNamara,J Old Jones,4,dist,...,,,,Roselier,Miss Reindeer,Reindeer,73,0,0,0
4,302858,Beauty's Pride,5.0,6.0,0.038462,0,J J Lennon,T Martin,5,dist,...,,,,Noalto,Elena's Beauty,Tarqogan,66,0,0,0


In [4]:
horses_selected_trimmed_clean.shape

(202304, 22)

In [5]:
horses_selected_trimmed_clean_augmented = horses_selected_trimmed_clean.copy()
horses_selected_trimmed_clean_augmented.head()

Unnamed: 0,rid,horseName,age,saddle,decimalPrice,isFav,trainerName,jockeyName,position,positionL,...,RPR,TR,OR,father,mother,gfather,weight,res_win,res_place,res_show
0,302858,Kings Return,6.0,4.0,0.6,1,W P Mullins,D J Casey,1,0,...,102.0,,,King's Ride,Browne's Return,Deep Run,73,1,1,0
1,302858,Majestic Red I,6.0,5.0,0.047619,0,John Hackett,Conor O'Dwyer,2,8,...,94.0,,,Long Pond,Courtlough Lady,Giolla Mear,73,0,1,0
2,302858,Clearly Canadian,6.0,2.0,0.166667,0,D T Hughes,G Cotter,3,1.5,...,92.0,,,Nordico,Over The Seas,North Summit,71,0,0,0
3,302858,Bernestic Wonder,8.0,1.0,0.058824,0,E McNamara,J Old Jones,4,dist,...,,,,Roselier,Miss Reindeer,Reindeer,73,0,0,0
4,302858,Beauty's Pride,5.0,6.0,0.038462,0,J J Lennon,T Martin,5,dist,...,,,,Noalto,Elena's Beauty,Tarqogan,66,0,0,0


---

## Load `races_selected_trimmed_clean.csv`

In [6]:
races_selected_trimmed_clean = pd.read_csv(f"{BASE_DIR}/data/streamline/races_selected_trimmed_clean.csv",
                                           low_memory=False) 
races_selected_trimmed_clean.head()

Unnamed: 0,rid,course,title,winningTime,metric,ncond,class,runners,margin,1st_place_rank_in_odds,...,station name,station lat,station lng,dist to station,station reading date,temp,msl,rain,rhum,station reading timedelta
0,302858,Thurles,Liffey Maiden Hurdle (Div 1),277.2,3821.0,1,0,6,1.219263,1,...,BIRR,53.0525,-7.5325,45.288813,1/9/97 12:00,1.6,1012.4,0.0,87,15.0
1,291347,Punchestown,Ericsson G.S.M. Grand National Trial Handicap ...,447.2,5229.0,5,0,9,1.218049,4,...,CASEMENT,53.182,-6.262,24.477602,2/16/97 15:00,8.0,992.5,0.4,87,20.0
2,75447,Listowel,Ballybunion E.B.F. Beginners S'chase,318.4,3620.0,5,0,8,1.27732,3,...,SHANNON AIRPORT,52.4125,-8.5505,63.534139,3/1/97 14:00,12.0,1003.5,0.0,73,0.0
3,358038,Punchestown,Quinns Of Baltinglass Chase (La Touche) (Cross...,533.9,6637.0,1,0,10,1.286595,1,...,CASEMENT,53.182,-6.262,24.477602,4/24/97 14:00,12.6,1011.9,0.0,72,20.0
4,89211,Tipperary,Topaz Sprint Stakes (Listed),59.9,1005.0,4,0,5,1.217043,4,...,SHANNON AIRPORT,52.4125,-8.5505,25.222137,5/8/97 17:00,11.1,994.2,0.0,59,30.0


In [7]:
races_selected_trimmed_clean.shape

(20201, 34)

---

## Helper Dictionaries

In [8]:
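# Map each race id (`rid`) to its race distance (`metric`) and winning time in seconds.
rid_to_distance = dict(zip(races_selected_trimmed_clean['rid'],
                           races_selected_trimmed_clean['metric']))
rid_to_winning_time = dict(zip(races_selected_trimmed_clean['rid'],
                               races_selected_trimmed_clean['winningTime']))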

---

## Augment w/ Finishing Time

The function below estimates how many seconds one "length" is worth in a given horse race, inspired by [https://edge.twinspires.com/racing/the-real-value-of-a-length/](https://edge.twinspires.com/racing/the-real-value-of-a-length/).

$$\text{time of length in seconds} = 1\ /\ [\ \text{distance}\ /\ \text{winning time}\ /\ \text{average horse length}\ ]$$

Note that the distance and the average horse length must have the same units to cancel out.

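Simplifying the expression above makes the units easier to check. As an illustrative check, take the first race shown earlier (rid 302858, distance 3821 m, winning time 277.2 s) and the average horse length of 2.55 m used in this notebook:

$$\text{time of length in seconds} = \frac{\text{winning time} \times \text{average horse length}}{\text{distance}} = \frac{277.2 \times 2.55}{3821} \approx 0.185$$

A horse finishing 8 lengths behind the winner of that race would then be estimated at $277.2 + 8 \times 0.185 \approx 278.68$ seconds.
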
In [9]:
AVERAGE_HORSE_LENGTH = 2.55 # meters

In [10]:
def get_time_of_length(distance: float, winning_time: float) -> float:
    # Seconds per length: one horse length divided by the winner's average speed.
    return 1 / (distance / winning_time / AVERAGE_HORSE_LENGTH)

In [11]:
def get_horse_finish_time(row) -> float:
    # Winning time plus the time needed to cover the lengths behind the winner (`dist`).
    winning_time = rid_to_winning_time[row['rid']]
    length_time = get_time_of_length(rid_to_distance[row['rid']], winning_time)
    return winning_time + row['dist'] * length_time

In [12]:
horses_selected_trimmed_clean_augmented['finishing time'] = horses_selected_trimmed_clean_augmented.apply(
    get_horse_finish_time,
    axis=1
)
horses_selected_trimmed_clean_augmented

Unnamed: 0,rid,horseName,age,saddle,decimalPrice,isFav,trainerName,jockeyName,position,positionL,...,TR,OR,father,mother,gfather,weight,res_win,res_place,res_show,finishing time
0,302858,Kings Return,6.0,4.0,0.600000,1,W P Mullins,D J Casey,1,0,...,,,King's Ride,Browne's Return,Deep Run,73,1,1,0,277.200000
1,302858,Majestic Red I,6.0,5.0,0.047619,0,John Hackett,Conor O'Dwyer,2,8,...,,,Long Pond,Courtlough Lady,Giolla Mear,73,0,1,0,278.679948
2,302858,Clearly Canadian,6.0,2.0,0.166667,0,D T Hughes,G Cotter,3,1.5,...,,,Nordico,Over The Seas,North Summit,71,0,0,0,278.957438
3,302858,Bernestic Wonder,8.0,1.0,0.058824,0,E McNamara,J Old Jones,4,dist,...,,,Roselier,Miss Reindeer,Reindeer,73,0,0,0,284.507242
4,302858,Beauty's Pride,5.0,6.0,0.038462,0,J J Lennon,T Martin,5,dist,...,,,Noalto,Elena's Beauty,Tarqogan,66,0,0,0,290.057045
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
202299,227139,Old Tim,6.0,5.0,0.142857,0,Donal Hassett,Mr B Hassett,8,15,...,,,Poet's Dream I,Settled,Blue Cashmere,73,0,0,0,263.612842
202300,227139,Our Ling,6.0,12.0,0.111111,0,C P Donoghue,Philip Dempsey,9,13,...,,,Vaour,May-Ling,Mississippi,73,0,0,0,266.195410
202301,227139,Ballinarrid,6.0,2.0,0.111111,0,Seamus P Murphy,Mr P Fenton,10,0.75,...,,,John French,Cuckaloo,Master Buck,76,0,0,0,266.344405
202302,227139,Fountain Pen,9.0,3.0,0.047619,0,William J Fitzpatrick,Mr P Fahey,11,dist,...,,,Royal Fountain,Monday's Pet,Menelek,74,0,0,0,272.304178

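Row-wise `apply` is easy to read but can be slow on roughly 200,000 rows. The sketch below shows one possible vectorized equivalent, not the method used in this notebook. The toy frames `races` and `horses` are hypothetical stand-ins built from the values of race 302858 shown above, and the `dist` values (cumulative lengths behind the winner) are an assumption inferred from the printed finishing times.

```python
import pandas as pd

AVERAGE_HORSE_LENGTH = 2.55  # meters, as in the notebook

# Hypothetical toy stand-ins for the two tables (values from race 302858).
races = pd.DataFrame({"rid": [302858], "metric": [3821.0], "winningTime": [277.2]})
horses = pd.DataFrame({
    "rid": [302858, 302858, 302858],
    "horseName": ["Kings Return", "Majestic Red I", "Clearly Canadian"],
    "dist": [0.0, 8.0, 9.5],  # assumed cumulative lengths behind the winner
})

# Look up each horse's race distance and winning time by race id.
race_info = races.set_index("rid")
distance = horses["rid"].map(race_info["metric"])
winning_time = horses["rid"].map(race_info["winningTime"])

# Same arithmetic as get_time_of_length / get_horse_finish_time, on whole columns.
length_time = winning_time * AVERAGE_HORSE_LENGTH / distance
horses["finishing time"] = winning_time + horses["dist"] * length_time
horses["finishing time ratio"] = horses["finishing time"] / winning_time
```

On the full dataset this would replace both `apply` calls with column arithmetic, assuming every `rid` in the horses table also appears in the races table.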

---

## Augment w/ `(finishing time ratio) = (finishing time) / (winning time)`

This is a useful summary statistic: it does not depend on the absolute race time, so it is comparable across races, and it captures how far a horse finished behind the winner in a single value. The winner of each race has a ratio of exactly 1.

In [13]:
def get_horse_finishing_time_ratio(row) -> float:
    return row['finishing time'] / rid_to_winning_time[row['rid']]

In [14]:
horses_selected_trimmed_clean_augmented['finishing time ratio'] = horses_selected_trimmed_clean_augmented.apply(
    get_horse_finishing_time_ratio,
    axis=1
)

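As a small check on why the ratio is agnostic to the nominal time, the sketch below compares the ratio computed through the finishing time with an algebraically equivalent closed form in which the winning time cancels. The function names are hypothetical and the 300-second winning time is an arbitrary illustrative value.

```python
AVERAGE_HORSE_LENGTH = 2.55  # meters, as in the notebook

def ratio_via_times(lengths_behind, distance_m, winning_time_s):
    """Ratio computed as in the notebook: finishing time / winning time."""
    length_time = winning_time_s * AVERAGE_HORSE_LENGTH / distance_m
    finishing_time = winning_time_s + lengths_behind * length_time
    return finishing_time / winning_time_s

def ratio_closed_form(lengths_behind, distance_m):
    """Equivalent closed form: the winning time cancels out."""
    return 1 + lengths_behind * AVERAGE_HORSE_LENGTH / distance_m

a = ratio_via_times(8, 3821.0, 277.2)   # race 302858, 8 lengths behind
b = ratio_via_times(8, 3821.0, 300.0)   # same race, hypothetical winning time
c = ratio_closed_form(8, 3821.0)        # no winning time needed

# All three agree up to floating-point rounding.
print(a, b, c)
```
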
---

## Save Dataframe

In [15]:
horses_selected_trimmed_clean_augmented.to_csv(
    f"{BASE_DIR}/data/streamline/horses_selected_trimmed_clean_augmented.csv",
    index=False
)

---