# `get_horses_all_trim_races_all_trim_intersection.ipynb`

### Author: Anthony Hein

#### Last updated: 9/19/2021

# Overview:

In attempting to combine `horses_all(_trim).csv` and `races_all(_trim).csv`, we found the merge to be computationally intractable because of the vast size of these files.

Here, we make these files smaller by removing horses that participated in races not in `races_all(_trim).csv`, and by removing races for which `horses_all(_trim).csv` has no information on the participating horses.
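The idea can be illustrated on toy data before running it on the full files. This is only a minimal sketch: the two small DataFrames and their values are made up for illustration, and only the `rid` column corresponds to the real tables.

```python
import pandas as pd

# Toy stand-ins for the real tables (hypothetical values); only `rid` matters here.
races = pd.DataFrame({"rid": [1, 2, 3], "course": ["A", "B", "C"]})
horses = pd.DataFrame({"rid": [1, 1, 2, 4], "horseName": ["w", "x", "y", "z"]})

# Race ids that appear in both tables.
shared_rids = set(races["rid"]) & set(horses["rid"])

# Keep only the rows whose race id appears in both tables.
races_intxn = races[races["rid"].isin(shared_rids)]
horses_intxn = horses[horses["rid"].isin(shared_rids)]
```

In this toy case, race 3 is dropped because it has no horses, and horse `z` is dropped because its race 4 is not in the race table.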

---

## Setup

In [1]:
import git
import os
from typing import List
from tqdm import tqdm
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
BASE_DIR = git.Repo(os.getcwd(), search_parent_directories=True).working_dir
BASE_DIR

'/Users/anthonyhein/Desktop/SML310/project'

---

## Load `horses_all_trim.csv`

In [3]:
horses_all_trim = pd.read_csv(f"{BASE_DIR}/data/csv/horses_all_trim.csv", low_memory=False) 
horses_all_trim.head()

| | rid | horseName | age | saddle | decimalPrice | trainerName | jockeyName | position | positionL | dist | outHandicap | RPR | TR | OR | father | mother | gfather | weight |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 267255 | Going For Broke | 3.0 | 4.0 | 0.1 | P C Haslam | Seb Sanders | 1 | | | | 72.0 | 62.0 | 62.0 | Simply Great | Empty Purse | Pennine Walk | 58 |
| 1 | 267255 | Pinchincha | 3.0 | 3.0 | 0.266667 | Dave Morris | Tony Clark | 2 | 4.0 | | | 66.0 | 56.0 | 65.0 | Priolo | Western Heights | Shirley Heights | 60 |
| 2 | 267255 | Skelton Sovereign | 3.0 | 5.0 | 0.142857 | Reg Hollinshead | D Griffiths | 3 | 3.0 | 7.0 | | 55.0 | 40.0 | 60.0 | Contract Law | Mrs Lucky | Royal Match | 55 |
| 3 | 267255 | Fast Spin | 3.0 | 6.0 | 0.380952 | David Barron | Tony Culhane | 4 | 7.0 | 14.0 | | 38.0 | 30.0 | 59.0 | Formidable I | Topwinder | Topsider | 57 |
| 4 | 267255 | As-Is | 3.0 | 2.0 | 0.166667 | Mark Johnston | J Weaver | 5 | 7.0 | 21.0 | | 29.0 | 21.0 | 65.0 | Lomond | Capriati I | Diesis | 60 |


In [4]:
horses_all_trim.shape

(4107315, 18)

---

## Load `races_all_trim.csv`

In [5]:
races_all_trim = pd.read_csv(f"{BASE_DIR}/data/csv/races_all_trim.csv", low_memory=False) 
races_all_trim.head()

| | rid | course | time | date | hurdles | prizes | winningTime | metric | countryCode | ncond | class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 267255 | Southwell (AW) | 03:40 | 97/01/01 | | [2752.25, 833.0, 406.5, 193.25] | 106.9 | 1609.0 | GB | 0 | 5 |
| 1 | 297570 | Southwell (AW) | 12:35 | 97/01/01 | | [1944.0, 544.0, 264.0] | 91.0 | 1407.0 | GB | 0 | 6 |
| 2 | 334421 | Southwell (AW) | 01:05 | 97/01/01 | | [2502.0, 702.0, 342.0] | 150.7 | 2212.0 | GB | 0 | 6 |
| 3 | 366304 | Southwell (AW) | 03:10 | 97/01/01 | | [2189.0, 614.0, 299.0] | 108.6 | 1609.0 | GB | 0 | 6 |
| 4 | 13063 | Southwell (AW) | 02:40 | 97/01/01 | | [2726.25, 825.0, 402.5, 191.25] | 231.4 | 3318.5 | GB | 0 | 5 |


In [6]:
races_all_trim.shape

(396572, 11)

---

## Delete Races w/o Horse Information 

In [7]:
bad_rids = set(races_all_trim['rid']) - set(horses_all_trim['rid'])
len(bad_rids)

1386

In [8]:
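# Keep only the races that have at least one row in horses_all_trim.
races_all_trim_intxn = races_all_trim[~races_all_trim['rid'].isin(bad_rids)]

# Holds when each bad rid occurs on exactly one row of races_all_trim.
assert len(races_all_trim) - len(bad_rids) == len(races_all_trim_intxn)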

len(races_all_trim_intxn)

395186

---

## Delete Horses w/o Race Information

In [9]:
bad_rids = set(horses_all_trim['rid']) - set(races_all_trim['rid'])
len(bad_rids)

0

Luckily, every horse row refers to a race in `races_all_trim.csv`, so there is nothing to remove here.

In [10]:
# No rows to drop; note this is the same DataFrame object, not a copy.
horses_all_trim_intxn = horses_all_trim
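Had the set difference above been non-empty, the horse rows could have been filtered the same way as the races. The following is a minimal sketch on toy data, not part of the original notebook: the helper name `keep_rows_with_known_rids` and all values are hypothetical.

```python
import pandas as pd

def keep_rows_with_known_rids(df: pd.DataFrame, known_rids: set) -> pd.DataFrame:
    """Hypothetical helper: keep the rows of `df` whose `rid` is in `known_rids`."""
    return df[df["rid"].isin(known_rids)]

# Toy data (hypothetical values): race 9 has a horse row but no race row.
toy_races = pd.DataFrame({"rid": [7, 8]})
toy_horses = pd.DataFrame({"rid": [7, 8, 8, 9], "horseName": ["a", "b", "c", "d"]})

# Drop horse rows that refer to a race missing from the race table.
toy_horses_intxn = keep_rows_with_known_rids(toy_horses, set(toy_races["rid"]))
```

Because the filter returns a new DataFrame, the original `toy_horses` is left unchanged.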

---

## Sanity Check

In [11]:
assert set(horses_all_trim_intxn['rid']).symmetric_difference(set(races_all_trim_intxn['rid'])) == set()

---

## Save Dataframes

In [12]:
races_all_trim_intxn.to_csv(f"{BASE_DIR}/data/csv/races_all_trim_intxn.csv", index=False)

In [13]:
horses_all_trim_intxn.to_csv(f"{BASE_DIR}/data/csv/horses_all_trim_intxn.csv", index=False)

---