# `make_rid_to_station_dict_ie.ipynb`

### Author: Anthony Hein

#### Last updated: 10/18/2021

# Overview:

At this point we have a dictionary that maps each course to a location (latitude and longitude), along with metadata on every weather station in Ireland that has published hourly data. This metadata includes each station's open and close years as well as its latitude and longitude. The next step is to match each race to a weather station: for each race, we select the station nearest to the race's course among those that were active on the race date.

---

## Setup

In [2]:
from datetime import datetime
import git
import os
from typing import List, Union
from tqdm import tqdm
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [3]:
BASE_DIR = git.Repo(os.getcwd(), search_parent_directories=True).working_dir
BASE_DIR

'/Users/anthonyhein/Desktop/SML310/project'

In [4]:
import sys

sys.path.append(f'{BASE_DIR}/utils/')

from course_and_country_to_location import COURSE_AND_COUNTRY_TO_LOCATION
from rid_to_course_and_country import RID_TO_COURSE_AND_COUNTRY

---

## Load `horses_aticnmi.csv`

In [5]:
horses_aticnmi = pd.read_csv(f"{BASE_DIR}/data/csv/horses_aticnmi.csv", low_memory=False) 
horses_aticnmi.head()

Unnamed: 0,rid,horseName,age,saddle,decimalPrice,trainerName,jockeyName,position,positionL,dist,outHandicap,RPR,TR,OR,father,mother,gfather,weight
0,302858,Kings Return,6.0,4.0,0.6,W P Mullins,D J Casey,1,,0.0,0.0,102.0,51.591987,79.654604,King's Ride,Browne's Return,Deep Run,73
1,302858,Majestic Red I,6.0,5.0,0.047619,John Hackett,Conor O'Dwyer,2,8,0.0,0.0,94.0,51.591987,79.654604,Long Pond,Courtlough Lady,Giolla Mear,73
2,302858,Clearly Canadian,6.0,2.0,0.166667,D T Hughes,G Cotter,3,1.5,9.5,0.0,92.0,51.591987,79.654604,Nordico,Over The Seas,North Summit,71
3,302858,Bernestic Wonder,8.0,1.0,0.058824,E McNamara,J Old Jones,4,dist,39.5,0.0,71.87665,51.591987,79.654604,Roselier,Miss Reindeer,Reindeer,73
4,302858,Beauty's Pride,5.0,6.0,0.038462,J J Lennon,T Martin,5,dist,69.5,0.0,71.87665,51.591987,79.654604,Noalto,Elena's Beauty,Tarqogan,66


In [6]:
horses_aticnmi.shape

(197491, 18)

---

## Load `races_aticnmi.csv`

In [7]:
races_aticnmi = pd.read_csv(f"{BASE_DIR}/data/csv/races_aticnmi.csv", low_memory=False) 
races_aticnmi.head()

Unnamed: 0,rid,course,time,date,hurdles,prizes,winningTime,metric,countryCode,ncond,class
0,302858,Thurles (IRE),01:15,97/01/09,,[],277.2,3821.0,IE,1,0
1,291347,Punchestown (IRE),03:40,97/02/16,,[],447.2,5229.0,IE,5,0
2,377929,Leopardstown (IRE),03:00,97/05/11,,[],106.4,1609.0,IE,4,0
3,275117,Curragh (IRE),03:35,97/05/25,,[],125.9,2011.0,IE,4,0
4,66511,Leopardstown (IRE),04:30,97/06/02,,[],116.3,1810.0,IE,1,0


In [8]:
races_aticnmi.shape

(19510, 11)

---

## Load `ireland_stations_metadata.csv`

In [9]:
ireland_stations_metadata = pd.read_csv(f"{BASE_DIR}/data/csv/ireland_stations_metadata.csv", low_memory=False) 
ireland_stations_metadata.head()

Unnamed: 0,County,Station Number,name,Height(m),Easting,Northing,Latitude,Longitude,Open Year,Close Year
0,Westmeath,2222,MULLINGAR S.W.S.,111,242700,252700,53.312,-7.212,1943,1974
1,Monaghan,2437,CLONES,89,250000,326300,54.11,-7.14,1950,2008
2,Galway,2021,GALWAY S.W.S.,20,132700,225600,53.1634,-9.0034,1978,1990
3,Offaly,4919,BIRR,72,207400,204400,53.0525,-7.5325,1954,2009
4,Kilkenny,3613,KILKENNY,65,249400,157400,52.3955,-7.161,1957,2008


In [10]:
ireland_stations_metadata.shape

(33, 10)

---

## Date Helper Functions

In [11]:
def get_date_from_race_data(date: str) -> datetime:
    # Workaround until the data is fixed elsewhere: some dates carry a trailing
    # ' 00:00' (stripped here) and some have a one-digit year (zero-padded below).
    if date.find(' 00:00') >= 0:
        date = date[:date.find(' 00:00')]
    date = '0' + date if date[1] == '/' else date
    return datetime.strptime(date, '%y/%m/%d')

In [12]:
def get_date_from_stations_metadata(date: str) -> datetime:
    return datetime.strptime(date, '%Y')

In [13]:
def get_open_stations(df: pd.core.frame.DataFrame, race_date: datetime) -> List[bool]:
    return [
        (get_date_from_stations_metadata(str(row['Open Year'])) < race_date) and \
        (get_date_from_stations_metadata(str(row['Close Year'])) > race_date)
        for _, row
        in df.iterrows()
    ]

In [14]:
def station_is_open(row: pd.Series, race_date: datetime) -> bool:
    return (get_date_from_stations_metadata(str(row['Open Year'])) < race_date) and \
           (get_date_from_stations_metadata(str(row['Close Year'])) > race_date)
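
The same open/closed test can also be written as a vectorized boolean mask instead of a row-by-row loop. The following is a minimal sketch rather than the notebook's own method: the function name `open_station_mask` and the three-row `stations` frame are illustrative assumptions (the rows are copied from the station tables shown in this notebook). It keeps the same strict comparisons, so a year-only value is read as 1 January of that year.

```python
from datetime import datetime

import pandas as pd

# Toy stand-in for ireland_stations_metadata (rows copied from the tables in this notebook).
stations = pd.DataFrame({
    "Station Number": [2437, 875, 2175],
    "name": ["CLONES", "MULLINGAR", "CLAREMORRIS"],
    "Open Year": [1950, 2002, 2010],
    "Close Year": [2008, 2022, 2022],
})


def open_station_mask(df: pd.DataFrame, race_date: datetime) -> pd.Series:
    """Hypothetical helper: True for stations whose interval strictly contains race_date."""
    # Year-only values parse to 1 January of that year, as in get_date_from_stations_metadata.
    opened = pd.to_datetime(df["Open Year"].astype(str), format="%Y")
    closed = pd.to_datetime(df["Close Year"].astype(str), format="%Y")
    return (opened < race_date) & (closed > race_date)


mask = open_station_mask(stations, datetime(2006, 7, 15))
open_names = stations.loc[mask, "name"].tolist()
```

One consequence of parsing years this way is that a station with `Close Year` 2008 is treated as closed for any race after 1 January 2008, even if it was still reporting later that year.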

---

## Distance Helper Functions

**Note**: The exact distance between two points given as `(latitude, longitude)` coordinates is not the Euclidean distance between the coordinate pairs. Latitude and longitude are angles on a sphere, so the true distance follows the curvature of Earth, and a degree of longitude covers less ground than a degree of latitude away from the equator. However, for points that are near each other (and not across the world), the Euclidean distance is a fair approximation. Additionally, the exact formula, given here [https://stackoverflow.com/questions/28994289/calculate-euclidean-distance-with-google-maps-coordinates#:~:text=You%20can%2C%20but%20not%20by,from%20a%20degree%20of%20latitude.](https://stackoverflow.com/questions/28994289/calculate-euclidean-distance-with-google-maps-coordinates#:~:text=You%20can%2C%20but%20not%20by,from%20a%20degree%20of%20latitude.) (among other sources), involves trigonometric functions such as cosine, which are more computationally expensive. Furthermore, we find it unlikely that several stations are approximately equidistant from a given track, so the approximation should rarely change which station is nearest. For all these reasons, we use the Euclidean distance for this decision. Since only the ranking of stations matters, the code compares squared distances and skips the square root.
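
To check whether the approximation changes which station is nearest, the degree-based metric can be compared against the great-circle (haversine) distance. The sketch below is illustrative only: the function names, the 6371 km mean Earth radius, and the coordinates are assumptions chosen to show a case where two candidates are at similar distances, not values from the data.

```python
import math


def haversine_km(lat1: float, lng1: float, lat2: float, lng2: float) -> float:
    """Great-circle distance in km on a sphere of radius 6371 km (assumed mean Earth radius)."""
    r = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def squared_degree_distance(lat1: float, lng1: float, lat2: float, lng2: float) -> float:
    """The metric used in this notebook: squared Euclidean distance in degrees."""
    return (lat1 - lat2) ** 2 + (lng1 - lng2) ** 2


# Hypothetical coordinates in decimal degrees, not taken from the data.
track = (53.0, -7.0)
candidates = {"north": (53.5, -7.0), "east": (53.0, -6.4)}

nearest_by_degrees = min(candidates, key=lambda k: squared_degree_distance(*track, *candidates[k]))
nearest_by_haversine = min(candidates, key=lambda k: haversine_km(*track, *candidates[k]))
```

In this constructed case the two metrics disagree, because the degree-based metric weights east-west offsets as heavily as north-south ones. Running the same comparison over the real courses and stations would show whether that ever happens in practice.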

In [15]:
def get_distance_to_station(df: pd.core.frame.DataFrame, track_lat: float, track_lng: float) -> tuple:
    # Returns (Station Number, name) of the station nearest to the track,
    # not the distance itself.
    champion_station_number = ''
    champion_name = ''
    champion_distance = np.inf

    for _, row in df.iterrows():
        # Squared Euclidean distance in degrees; the square root is skipped
        # because only the ordering of distances matters.
        dist = (row['Latitude'] - track_lat) ** 2 + (row['Longitude'] - track_lng) ** 2

        if dist < champion_distance:
            champion_station_number = row['Station Number']
            champion_name = row['name']
            champion_distance = dist

    return (champion_station_number, champion_name)

---

## Make `rid` to Station Dict

Rather than scanning every station for every race, as `get_distance_to_station` would, we precompute the distance from each course to every station for easy lookup. That is, we build a dictionary from each `(course, countryCode)` pair to a list of `((Station Number, name), distance)` entries sorted by increasing distance, where each distance is the squared Euclidean distance in degrees.

In [16]:
COURSE_AND_COUNTRY_TO_LOCATION_IE = {}

for key, val in COURSE_AND_COUNTRY_TO_LOCATION.items():
    if key[1] == 'IE':
        COURSE_AND_COUNTRY_TO_LOCATION_IE[key] = val

In [17]:
COURSE_AND_COUNTRY_TO_STATION = {}

for key, val in tqdm(COURSE_AND_COUNTRY_TO_LOCATION_IE.items()):
    distances = []
    for _, row in ireland_stations_metadata.iterrows():
        distance = (row['Latitude'] - val['lat']) ** 2 + (row['Longitude'] - val['lng']) ** 2
        distances.append(((row['Station Number'], row['name']), distance))
    distances = sorted(distances, key=lambda x: x[1])
    COURSE_AND_COUNTRY_TO_STATION[key] = distances

100%|██████████| 29/29 [00:00<00:00, 445.41it/s]


Now, for each race, we look up the sorted list of stations for its course and take the first station in that list whose operating interval contains the race date.

In [18]:
d = {}

for idx, row in tqdm(races_aticnmi.iterrows()):
    
    race_date = get_date_from_race_data(row['date'])
    
    # get sorted list of stations
    lst = COURSE_AND_COUNTRY_TO_STATION[(row['course'], row['countryCode'])]
    
    # elt is ((Station Number, name), squared distance)
    for elt in lst:
        station_row = ireland_stations_metadata[ireland_stations_metadata['Station Number'] == elt[0][0]].iloc[0]
        if station_is_open(station_row, race_date):
            d[row['rid']] = elt[0]
            break

19510it [00:12, 1531.67it/s]
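
The loop above filters `ireland_stations_metadata` once per candidate station for every race. One possible alternative, sketched here under assumptions, is to look up each station's years in a plain dictionary built once. The names `station_years`, `ranked`, and `first_open_station` are hypothetical, and the distances in `ranked` are made up for illustration.

```python
from datetime import datetime

# Hypothetical miniature versions of the structures built above.
# Station Number -> (Open Year, Close Year)
station_years = {2437: (1950, 2008), 875: (2002, 2022)}
# ((Station Number, name), squared distance), sorted by increasing distance
ranked = [((2437, "CLONES"), 0.10), ((875, "MULLINGAR"), 0.25)]


def first_open_station(ranked_stations, years_by_station, race_date):
    """Return the nearest (Station Number, name) open on race_date, or None if there is none."""
    for station, _dist in ranked_stations:
        open_year, close_year = years_by_station[station[0]]
        # Same strict comparison as station_is_open: years are read as 1 January.
        if datetime(open_year, 1, 1) < race_date < datetime(close_year, 1, 1):
            return station
    return None


match_2006 = first_open_station(ranked, station_years, datetime(2006, 7, 15))
match_2012 = first_open_station(ranked, station_years, datetime(2012, 10, 28))
match_1940 = first_open_station(ranked, station_years, datetime(1940, 1, 1))
```

In the notebook, such a dictionary could be built once from the `Station Number`, `Open Year`, and `Close Year` columns. Returning `None` explicitly also makes it easy to spot races for which no station was open.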


In [19]:
len(d)

19510

In [20]:
smple = races_aticnmi.sample(5)

In [21]:
smple

Unnamed: 0,rid,course,time,date,hurdles,prizes,winningTime,metric,countryCode,ncond,class
6537,393164,Curragh (IRE),02:30,06/07/15,,"[13020.0, 3820.0, 1820.0, 620.0]",86.5,1407.0,IE,2,0
15043,372664,Down Royal (IRE),03:45,01/09/22,,"[10912.02, 2359.35, 1032.22]",154.5,2413.0,IE,8,0
3850,410163,Limerick (IRE),01:15,20/09/11 00:00,,"[5310.0, 1710.0, 810.0, 360.0, 180.0, 90.0]",90.7,1306.5,IE,5,0
6117,231235,Wexford (RH) (IRE),04:25,12/10/28,,"[4830.0, 1120.0, 490.0, 280.0]",230.6,3218.0,IE,1,0
9736,27621,Galway (IRE),06:10,11/07/25,,"[11730.0, 2720.0, 1190.0, 680.0]",90.68,1407.0,IE,1,0


In [22]:
[(rid, d[rid]) for rid in smple['rid']]

[(393164, (875, 'MULLINGAR')),
 (372664, (2437, 'CLONES')),
 (410163, (518, 'SHANNON AIRPORT')),
 (231235, (375, 'OAK PARK')),
 (27621, (2175, 'CLAREMORRIS'))]

In [23]:
ireland_stations_metadata[ireland_stations_metadata['Station Number'].isin([d[rid][0] for rid in smple['rid']])]

Unnamed: 0,County,Station Number,name,Height(m),Easting,Northing,Latitude,Longitude,Open Year,Close Year
1,Monaghan,2437,CLONES,89,250000,326300,54.11,-7.14,1950,2008
16,Mayo,2175,CLAREMORRIS,68,134523,273883,53.4239,-8.5933,2010,2022
17,Westmeath,875,MULLINGAR,101,243000,254300,53.3214,-7.2144,2002,2022
21,Clare,518,SHANNON AIRPORT,15,137900,160300,52.4125,-8.5505,1937,2022
22,Carlow,375,OAK PARK,62,273000,179500,52.514,-6.5455,2003,2022


These assignments look correct by inspection: each selected station's open and close years bracket the corresponding race date.

---

## Write to File in `utils`

In [24]:
s = f"RID_TO_STATION_IE = {d}"
s[:100]

"RID_TO_STATION_IE = {302858: (4919, 'BIRR'), 291347: (3723, 'CASEMENT'), 377929: (532, 'DUBLIN AIRPO"

In [25]:
with open(f"{BASE_DIR}/utils/rid_to_station_ie.py", 'w', encoding='utf-8') as f:
    f.write(s)
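
Because the file is generated from the `repr` of a dictionary, it can be worth confirming that it loads back to the same mapping. The sketch below writes to a temporary directory rather than to `utils`, and uses only the first two entries shown in the output above. It assumes that keys and values are plain Python `int` and `str` objects.

```python
import os
import runpy
import tempfile

# First two entries from the output above, standing in for the full dictionary.
d = {302858: (4919, "BIRR"), 291347: (3723, "CASEMENT")}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "rid_to_station_ie.py")
    with open(path, "w", encoding="utf-8") as f:
        f.write(f"RID_TO_STATION_IE = {d}")
    # run_path executes the file and returns its global namespace as a dict.
    loaded = runpy.run_path(path)["RID_TO_STATION_IE"]

round_trip_ok = loaded == d
```

If the keys or station numbers were NumPy scalars, their `repr` under NumPy 2.x would look like `np.int64(4919)`, and the generated file would not load without importing NumPy. Casting with `int(...)` before writing would avoid that.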

---