## Split Margin Importance by Track, Distance

Here we investigate how important the SplitMargin is at each (Track, Distance) combination. We define 'importance' as the win rate of greyhounds that reach the split marker first: the higher that win rate, the more important the SplitMargin is at that track and distance.
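
As a minimal sketch of this definition, the toy example below computes the percentage of races won by the dog that led at the split marker. The data and the `split_leader_win_rate` helper are hypothetical; only the column names `FasttrackRaceId`, `SplitMargin`, and `Place` follow the dataset used in this notebook, and a lower `SplitMargin` is assumed to mean reaching the marker sooner (consistent with the ascending sort used later).

```python
import pandas as pd

def split_leader_win_rate(results: pd.DataFrame) -> float:
    """Percentage of races won by the dog with the lowest SplitMargin (hypothetical helper)."""
    # Row label of the split leader in each race
    leader_idx = results.groupby('FasttrackRaceId')['SplitMargin'].idxmin()
    leaders = results.loc[leader_idx]
    # Share of split leaders that also finished first
    return 100 * (leaders['Place'] == 1).mean()

# Toy data (made up): two races with three dogs each
toy = pd.DataFrame({
    'FasttrackRaceId': [1, 1, 1, 2, 2, 2],
    'SplitMargin':     [5.10, 5.25, 5.30, 6.40, 6.35, 6.50],
    'Place':           [1, 2, 3, 1, 2, 3],
})

win_rate = split_leader_win_rate(toy)
print(win_rate)
```

In the toy data the split leader wins race 1 but finishes second in race 2, so the split leaders win half of the races.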

Import libraries, packages, and greyhound data

In [7]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as stats
import os
import decouple
import sys
config = decouple.AutoConfig(' ')
os.chdir(config('ROOT_DIRECTORY'))
sys.path.insert(0, '')

from scipy.stats import zscore
from multielo import MultiElo, Player, Tracker
from multielo.multielo import defaults

# Read in data
df_raw = pd.read_csv('./data/clean/dog_results.csv')

display(df_raw)

Unnamed: 0,FasttrackDogId,Place,DogName,Box,Rug,Weight,StartPrice,Margin1,Margin2,PIR,...,FasttrackRaceId,TrainerId,TrainerName,Distance,RaceGrade,Track,RaceNum,TrackDist,RaceDate,FieldSize
0,157500927,1,RAINE ALLEN,1,1,27.4,2.4,2.30,,Q/111,...,335811282,7683,C GRENFELL,500.0,Restricted Win,Bendigo,1.0,Bendigo500,2018-07-01,6
1,1820620018,2,SURF A LOT,2,2,32.8,6.3,2.30,2.30,M/332,...,335811282,137227,C TYLEY,500.0,Restricted Win,Bendigo,1.0,Bendigo500,2018-07-01,6
2,1950680026,3,PINGIN' BEE,6,6,25.5,9.3,3.84,1.54,S/443,...,335811282,132763,P DAPIRAN,500.0,Restricted Win,Bendigo,1.0,Bendigo500,2018-07-01,6
3,1524380048,4,LUCAS THE GREAT,7,7,32.2,9.1,5.27,1.43,M/655,...,335811282,116605,E HAMILTON,500.0,Restricted Win,Bendigo,1.0,Bendigo500,2018-07-01,6
4,124225458,5,QUAVO,4,4,28.9,3.4,5.56,0.29,M/766,...,335811282,132763,P DAPIRAN,500.0,Restricted Win,Bendigo,1.0,Bendigo500,2018-07-01,6
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
782997,491585906,3,GLORIOUS GUNN,8,8,27.1,3.8,3.75,2.43,6644,...,745616339,87891,G HORNE,520.0,Grade 5,Cannington,12.0,Cannington520,2021-12-31,7
782998,485659451,4,WOOD FIRE,3,3,32.1,4.1,3.75,0.14,3233,...,745616339,68549,C HALSE,520.0,Grade 5,Cannington,12.0,Cannington520,2021-12-31,7
782999,528381655,5,TRENDING QUARTER,6,6,31.8,16.2,5.25,1.43,4566,...,745616339,83581,J DAILLY,520.0,Grade 5,Cannington,12.0,Cannington520,2021-12-31,7
783000,537992387,6,ELITE WEAPON,1,1,26.7,2.9,5.25,0.00,1455,...,745616339,293372,S WILLIAMS,520.0,Grade 5,Cannington,12.0,Cannington520,2021-12-31,7


Determine SplitMargin importance for each (Track, Distance) and bin it into quantiles that can be merged back to the original dataframe

In [11]:
'''
Create a temporary dataframe that we will merge back to the original dataframe.
'''

df = df_raw.copy()

# Create a SplitMarginPlace column: rank within each race by SplitMargin (lowest first; tied dogs receive consecutive places)
df = df.sort_values(by=['FasttrackRaceId', 'SplitMargin'], ascending=True)
df['SplitMarginPlace'] = df.groupby('FasttrackRaceId').cumcount()+1

# Create a SplitMarginWin column (1 for the greyhound that reached the split marker first)
df['SplitMarginWin'] = (df['SplitMarginPlace'] == 1).astype(int)

# Keep only greyhounds that reached the split marker first, then calculate their win rate by track and distance
df = df[df['SplitMarginWin'] == 1].copy()  # copy to avoid a SettingWithCopyWarning on the next assignment
df['Win'] = (df['Place'] == 1).astype(int)
df = df.groupby('TrackDist', as_index=False).agg(NumberOfWins = ('Win', 'sum'),
                                                 SampleSize = ('Win', 'count'))
df['WinRate'] = round(100*df['NumberOfWins']/df['SampleSize'], 2)
df = df.sort_values(by='WinRate')

# Keep only track/distance combinations with a sample size of at least 1000
df = df[df.SampleSize >= 1000]

# Break win rates into deciles (1 = lowest win rate, 10 = highest) and keep only the columns needed for a later merge
df['TrackSplitMarginQuantile'] = pd.qcut(df['WinRate'], 10, labels=False)+1
df = df[['TrackDist', 'TrackSplitMarginQuantile']]

display(df)

Unnamed: 0,TrackDist,TrackSplitMarginQuantile
46,Healesville300,1
2,Albion Park520,1
7,Angle Park515,1
86,Sale440,1
99,The Meadows525,1
89,Sandown Park515,2
55,Ipswich520,2
26,Cannington520,2
18,Bendigo500,2
42,Geelong460,3
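
To illustrate how the quantile labels above are produced, the sketch below applies `pd.qcut` with `labels=False` to a small set of made-up win rates. It uses 5 bins instead of the notebook's 10 purely to keep the example short; the values in `win_rates` are hypothetical.

```python
import pandas as pd

# Hypothetical win rates (%) for ten track/distance combinations
win_rates = pd.Series([18.0, 19.5, 21.0, 22.5, 24.0, 25.5, 27.0, 28.5, 30.0, 31.5])

# labels=False returns 0-based bin indices, so adding 1 makes the lowest bin 1
bins = pd.qcut(win_rates, 5, labels=False) + 1
print(bins.tolist())
```

Because `pd.qcut` splits on sample quantiles, each bin holds roughly the same number of track/distance combinations, with bin 1 containing the lowest win rates.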


Save to ./data/features as a .csv

In [12]:
df.to_csv('./data/features/split-margin-importance-by-trackdist.csv', index=False)
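
The saved feature is keyed on `TrackDist`, so it can later be joined back onto the race-level data. The following is a minimal sketch of such a merge, not necessarily the approach used elsewhere in this project. The `races` rows, the dog names, and the `Unknown999` value are hypothetical; the two quantile values are taken from the output above.

```python
import pandas as pd

# Hypothetical stand-in for the race-level dataframe
races = pd.DataFrame({
    'DogName': ['DOG A', 'DOG B', 'DOG C'],
    'TrackDist': ['Bendigo500', 'Cannington520', 'Unknown999'],
})

# Stand-in for the saved feature table (one row per TrackDist)
features = pd.DataFrame({
    'TrackDist': ['Bendigo500', 'Cannington520'],
    'TrackSplitMarginQuantile': [2, 2],
})

# A left join keeps every race row; TrackDist values missing from the
# feature table (e.g. those under the sample-size cut-off) get NaN
merged = races.merge(features, on='TrackDist', how='left', validate='many_to_one')
print(merged)
```

`validate='many_to_one'` makes the merge raise an error if a `TrackDist` appears more than once in the feature table, which would otherwise silently duplicate race rows.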