# `select_countries_horses_aticnm_and_races_aticnm_ie.ipynb`

### Author: Anthony Hein

#### Last updated: 10/18/2021

# Overview:

We are still whittling the dataset down into something more manageable and more targeted. To that end, we now drop all races that do not belong to a subset of countries of our choosing, picked from the most prolific countries in the dataset.

**Note**: This is nearly identical to another notebook bearing a similar name except _it chooses Ireland instead of Great Britain due to the availability of data_.

---

## Setup

In [1]:
import git
import os
from typing import List
from tqdm import tqdm
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
BASE_DIR = git.Repo(os.getcwd(), search_parent_directories=True).working_dir
BASE_DIR

'/Users/anthonyhein/Desktop/SML310/project'

---

## Load `horses_aticnm.csv`

In [3]:
horses_aticnm = pd.read_csv(f"{BASE_DIR}/data/csv/horses_aticnm.csv", low_memory=False) 
horses_aticnm.head()

Unnamed: 0,rid,horseName,age,saddle,decimalPrice,trainerName,jockeyName,position,positionL,dist,outHandicap,RPR,TR,OR,father,mother,gfather,weight
0,267255,Going For Broke,3.0,4.0,0.1,P C Haslam,Seb Sanders,1,,0.0,0.0,72.0,62.0,62.0,Simply Great,Empty Purse,Pennine Walk,58
1,267255,Pinchincha,3.0,3.0,0.266667,Dave Morris,Tony Clark,2,4.0,0.0,0.0,66.0,56.0,65.0,Priolo,Western Heights,Shirley Heights,60
2,267255,Skelton Sovereign,3.0,5.0,0.142857,Reg Hollinshead,D Griffiths,3,3.0,7.0,0.0,55.0,40.0,60.0,Contract Law,Mrs Lucky,Royal Match,55
3,267255,Fast Spin,3.0,6.0,0.380952,David Barron,Tony Culhane,4,7.0,14.0,0.0,38.0,30.0,59.0,Formidable I,Topwinder,Topsider,57
4,267255,As-Is,3.0,2.0,0.166667,Mark Johnston,J Weaver,5,7.0,21.0,0.0,29.0,21.0,65.0,Lomond,Capriati I,Diesis,60


In [4]:
horses_aticnm.shape

(1674531, 18)

---

## Load `races_aticnm.csv`

In [5]:
races_aticnm = pd.read_csv(f"{BASE_DIR}/data/csv/races_aticnm.csv", low_memory=False) 
races_aticnm.head()

Unnamed: 0,rid,course,time,date,hurdles,prizes,winningTime,metric,countryCode,ncond,class
0,267255,Southwell (AW),03:40,97/01/01,,"[2752.25, 833.0, 406.5, 193.25]",106.9,1609.0,GB,0,5
1,297570,Southwell (AW),12:35,97/01/01,,"[1944.0, 544.0, 264.0]",91.0,1407.0,GB,0,6
2,334421,Southwell (AW),01:05,97/01/01,,"[2502.0, 702.0, 342.0]",150.7,2212.0,GB,0,6
3,366304,Southwell (AW),03:10,97/01/01,,"[2189.0, 614.0, 299.0]",108.6,1609.0,GB,0,6
4,13063,Southwell (AW),02:40,97/01/01,,"[2726.25, 825.0, 402.5, 191.25]",231.4,3318.5,GB,0,5


In [6]:
races_aticnm.shape

(178067, 11)

---

## Find Most Prolific Countries in the Dataset

In [7]:
races_aticnm['countryCode'].value_counts()

GB      123972
IE       19510
FR       15056
US        7111
HK        3513
AU        2277
AE        2019
GER        986
ARG        674
CAN        494
IT         488
BRZ        483
JP         297
CHI        292
SAF        275
NZ         153
JER        100
SWE         92
URU         71
TUR         56
NOR         39
SWI         35
DEN         21
SIN         20
SPA         10
KSA          7
KOR          4
BEL          4
ARAB         2
CZE          2
BHR          1
GUE          1
MAC          1
PER          1
Name: countryCode, dtype: int64

Although Great Britain has by far the most races, its weather data is markedly messier and more frustrating to work with, so we choose the country with the second most races, Ireland. Note that this would still constitute the largest analysis of horse racing (the previous largest covered about 13K horses).
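To put the ranking above in relative terms, `value_counts(normalize=True)` returns each country's share of races instead of raw counts. The following is a minimal sketch on a toy frame; `toy_races` and its rows are made up for illustration and are not taken from `races_aticnm.csv`.

```python
import pandas as pd

# Toy stand-in for races_aticnm; the values are illustrative only.
toy_races = pd.DataFrame({
    "rid": list(range(1, 11)),
    "countryCode": ["GB"] * 6 + ["IE"] * 3 + ["FR"],
})

# Fraction of races per country, sorted from largest to smallest.
shares = toy_races["countryCode"].value_counts(normalize=True)

# The second entry of the sorted index is the second most prolific country.
second = shares.index[1]

print(shares)
print(f"second most prolific: {second}")
```

On the real data, the counts listed above put Ireland at 19510 of 178067 races, or roughly 11% of the dataset.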

In [8]:
print(f"old shape {races_aticnm.shape}")

# Keep only the races run in Ireland (country code 'IE').
races_aticnmg = races_aticnm[races_aticnm['countryCode'] == 'IE']

print(f"new shape {races_aticnmg.shape}")

old shape (178067, 11)
new shape (19510, 11)


In [9]:
print(f"old shape {horses_aticnm.shape}")

# Keep only the horse rows whose race id (`rid`) belongs to a retained race.
horses_aticnmg = horses_aticnm[horses_aticnm['rid'].isin(races_aticnmg['rid'])]

print(f"new shape {horses_aticnmg.shape}")

old shape (1674531, 18)
new shape (197491, 18)
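Filtering horses by `rid` membership acts as a semi-join against the retained races, so it can be worth checking that the two frames still agree with each other. The sketch below repeats the two filters on toy data and then tests both directions; `toy_races`, `toy_horses`, and the check names are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Toy stand-ins for races_aticnm and horses_aticnm (illustrative values only).
toy_races = pd.DataFrame({"rid": [10, 11, 12], "countryCode": ["IE", "GB", "IE"]})
toy_horses = pd.DataFrame({
    "rid": [10, 10, 11, 12, 12, 12],
    "horseName": ["A", "B", "C", "D", "E", "F"],
})

# Same two filters as in the cells above.
ie_races = toy_races[toy_races["countryCode"] == "IE"]
ie_horses = toy_horses[toy_horses["rid"].isin(ie_races["rid"])]

# Horse rows that point at a race we did not keep (should be none).
orphan_horses = ~ie_horses["rid"].isin(ie_races["rid"])

# Kept races that have no horse rows at all (worth knowing about if any exist).
empty_races = ~ie_races["rid"].isin(ie_horses["rid"])

# Number of horse rows per retained race.
runners_per_race = ie_horses.groupby("rid").size()

print(f"orphan horse rows: {orphan_horses.sum()}")
print(f"races without horses: {empty_races.sum()}")
print(runners_per_race)
```

For the shapes printed above, 197491 horse rows over 19510 races works out to an average of roughly 10 rows per Irish race.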


---

## Save DataFrames

In [10]:
horses_aticnmg.to_csv(f"{BASE_DIR}/data/csv/horses_aticnmi.csv", index=False)

In [11]:
races_aticnmg.to_csv(f"{BASE_DIR}/data/csv/races_aticnmi.csv", index=False)
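Because the frames are written with `index=False`, the non-contiguous index left over from filtering is not stored, and a later `pd.read_csv` assigns a fresh `0..n-1` index. A minimal round-trip sketch, assuming a temporary directory and a toy frame (`toy` and `toy_races.csv` are illustrative names, not files from this project):

```python
import os
import tempfile

import pandas as pd

# Toy frame with a gapped index, as left behind by boolean filtering.
toy = pd.DataFrame({"rid": [10, 12], "countryCode": ["IE", "IE"]}, index=[0, 2])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_races.csv")
    toy.to_csv(path, index=False)   # the index is not written to disk
    reloaded = pd.read_csv(path)    # gets a new default integer index

# The data survives the round trip; only the index is reset.
same_shape = reloaded.shape == toy.shape
same_columns = list(reloaded.columns) == list(toy.columns)

print(same_shape, same_columns)
print(list(reloaded.index))
```

The same kind of check could be applied to the saved files by reloading them and comparing their shapes against the ones printed earlier.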

---