**Description of Parameters**

Standard Columns (known before race)


* AgeRestriction: Age of horse allowed to participate in the race
* Barrier: Position in the starting stall, 0 being the inner most lane
* ClassRestriction:
* CourseIndicator:
* Dam ID: ID of the mother of the horse
* Distance: Length of the race in metres
* Foaling Country: The country the horse was born in
* Foaling Date: The day the horse was born
* FrontShoes:
* Gender:
* GoingAbbrey
* GoingID
* HandicapDistance
* HandicapType
* HindShoes:
* HorseAge
* HorseID
* JockeyID
* RaceGroup
* RaceID
* RacePrizemoney
* RaceStartTime
* RacingSubType
* Saddlecloth
* SexRestriction: M if male only race, F if female only race
* SireID: ID of the father of the horse
* StartType
* StartingLine
* Surface
* TrackID
* TrainerID
* WeightCarried
* WetnessScale


Performace (result of the race)


*   BeatenMargin: number of horse lengths the horse has been beaten by?
*   Disqualified: True if horse was disqualified, False if it was not. Some examples include:
  * Gallop
  * Trap
  * Aubin
  * Pace
* FinishPosition: Place finished in the race or reason for not placing
  * BS = Break Stride
  * PU = Pulled Up
  * FL = Fell
  * NP = Took no Part

* PIRPosition
* Prizemoney
* RaceOverallTime
* PriceSP
* NoFrontCover
* PositionInRunning
* WideOffRail







In [151]:
# Import Packages
import numpy as np
import pandas as pd
import pyarrow.parquet as pq

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

In [152]:
# read data file
df = pd.read_parquet("trots_2013-2022.parquet", engine = 'pyarrow')

**Data Cleaning and Preprocessing**


In [153]:
# drop class restriction
df = df.drop(columns=["ClassRestriction"])

In [154]:
# break age restriction into lower age limit and upper age limit
df["LowerAge"] = df["AgeRestriction"].astype(str).str[0]
df["LowerAge"] = df["LowerAge"].replace("P", 9)
df["LowerAge"] = pd.to_numeric(df["LowerAge"])
df["LowerAge"] = df["LowerAge"].fillna(0)
df["UpperAge"] = df["AgeRestriction"].astype(str).str.replace("yo", "").astype(str).str[-1]
df["UpperAge"] = df["UpperAge"].replace('0', 10)
df["UpperAge"] = df["UpperAge"].replace("+", 15)
df["UpperAge"] = pd.to_numeric(df["UpperAge"])
df["UpperAge"] = df["UpperAge"].fillna(20)
df = df.drop(columns = ["AgeRestriction"])

In [155]:
# break race start time into year, month, day, hour, and minute
df["RaceStartYear"] = df["RaceStartTime"].dt.year
df["RaceStartMonth"] = df["RaceStartTime"].dt.month
df["RaceStartDay"] = df["RaceStartTime"].dt.day
df["RaceStartHour"] = df["RaceStartTime"].dt.hour
df["RaceStartMinute"] = df["RaceStartTime"].dt.minute

In [156]:
# break foaling date into year, month, and day
df["FoalingDateYear"] = df["FoalingDate"].dt.year
df["FoalingDateMonth"] = df["FoalingDate"].dt.month
df["FoalingDateDay"] = df["FoalingDate"].dt.day

In [157]:
# drop foaling date column
df = df.drop(columns = ["FoalingDate"])

In [158]:
# break finish position into place and DNF reason
df["DNFReason"] = df["FinishPosition"].astype(str).str.replace("1","").astype(str)
for i in ["2","3","4","5","6","7","8","9","0", " "]:
  df["DNFReason"] = df["DNFReason"].astype(str).str.replace(i,"").astype(str)
for i in ["BS","UN","PU","DQ","FL","NP","UR","WC"]:
  df["FinishPosition"] = df["FinishPosition"].astype(str).str.replace(i,'-1').astype(str)
df["FinishPosition"] = df["FinishPosition"].astype(int)

In [159]:
# add win column
df['Win'] = df["FinishPosition"]
for i in ["2","3","4","5","6","7","8","9","0", "10","11","-1"]:
  df["Win"] = df["Win"].astype(str).str.replace(i,"0").astype(str)
df["Win"] = df["Win"].astype(int)
pd.value_counts(df.Win)

0    1111506
1      88906
Name: Win, dtype: int64

In [160]:
# add dummy variables for categorical variables
df = pd.get_dummies(df,columns=["CourseIndicator", "DNFReason","Disqualified", "FoalingCountry", "Gender", "GoingAbbrev", "HandicapType", "RaceGroup", "RacingSubType", "SexRestriction", "StartType", "Surface"], drop_first = True)

In [161]:
# split data into train and test set
train = df[df["RaceStartTime"] < "2021-11-01 00:00:00"]
test = df[df["RaceStartTime"] >= "2021-11-01 00:00:00"]

In [162]:
# remove RaceStartTime and columns
train = train.drop(columns = ["RaceStartTime"])
test = test.drop(columns = ["RaceStartTime"])

**Model 1 - drops all performance variables**

In [163]:
# gets values for X and y
y_train = train.Win.values
X_train = train.drop(columns = ["Win", "BeatenMargin", 'Disqualified_True', "FinishPosition", 'DNFReason_BS',
       'DNFReason_DQ', 'DNFReason_FL', 'DNFReason_NP', 'DNFReason_PU',
       'DNFReason_UN', 'DNFReason_UR', 'DNFReason_WC', "PIRPosition", "Prizemoney", "RaceOverallTime",
                             "PriceSP" , "NoFrontCover", "PositionInRunning" , "WideOffRail" ]).values
y_test = test.Win.values
X_test = test.drop(columns = ["Win", "BeatenMargin", 'Disqualified_True', "FinishPosition", 'DNFReason_BS',
       'DNFReason_DQ', 'DNFReason_FL', 'DNFReason_NP', 'DNFReason_PU',
       'DNFReason_UN', 'DNFReason_UR', 'DNFReason_WC', "PIRPosition", "Prizemoney", "RaceOverallTime",
                             "PriceSP" , "NoFrontCover", "PositionInRunning" , "WideOffRail" ]).values

In [164]:
# define the logistic regression model
logreg = LogisticRegression(solver='lbfgs', max_iter=10000)
lr = logreg.fit(X_train, y_train)

In [165]:
# predict probabilities for win probability column for train data
train['WinProbability'] = lr.predict_proba(X_train)[:,1]
train['WinProbability'] = train['WinProbability'] / train.groupby('RaceID')['WinProbability'].transform('sum')

In [166]:
# predict probabilities for win probability column for test data
test['WinProbability'] = lr.predict_proba(X_test)[:,1]
test['WinProbability'] = test['WinProbability'] / test.groupby('RaceID')['WinProbability'].transform('sum')

In [167]:
# predict the winner for each race by using the highest win probability
train["PredictedWin"] = 0
train.loc[train.groupby('RaceID')['WinProbability'].transform(max) == train['WinProbability'],"PredictedWin"] = 1
test["PredictedWin"] = 0
test.loc[test.groupby('RaceID')['WinProbability'].transform(max) == test['WinProbability'],"PredictedWin"] = 1

In [168]:
# First, get tp, tn, fp, fn
tp = sum(np.logical_and(test['PredictedWin'] == 1, test['Win'] == 1))
tn = sum(np.logical_and(test['PredictedWin'] == 0, test['Win'] == 0))
fp = sum(np.logical_and(test['PredictedWin'] == 1, test['Win'] == 0))
fn = sum(np.logical_and(test['PredictedWin'] == 0, test['Win'] == 1))

print(f"tp: {tp} tn: {tn} fp: {fp} fn: {fn}")

# Accuracy
acc = (tp + tn) / (tp + tn + fp + fn)

# Precision
precision = tp / (tp + fp)

# Recall
recall = tp / (tp + fn)

# Sensitivity
sensitivity = recall

# Specificity
specificity = tn / (fp + tn)

# Print results
print("Accuracy:",round(acc,3),"Recall:",round(recall,3),"Precision:",round(precision,3),
          "Sensitivity:",round(sensitivity,3),"Specificity:",round(specificity,3))

tp: 261 tn: 24098 fp: 1879 fn: 1882
Accuracy: 0.866 Recall: 0.122 Precision: 0.122 Sensitivity: 0.122 Specificity: 0.928


**Model 2 - put all performance variable to 0 in test set**

In [169]:
# get values for X and y
columns_to_zero = ["BeatenMargin", 'Disqualified_True', "FinishPosition", 'DNFReason_BS',
       'DNFReason_DQ', 'DNFReason_FL', 'DNFReason_NP', 'DNFReason_PU',
       'DNFReason_UN', 'DNFReason_UR', 'DNFReason_WC', "PIRPosition", "Prizemoney", "RaceOverallTime",
                             "PriceSP" , "NoFrontCover", "PositionInRunning" , "WideOffRail"]
train2 = train
test2 = test

test2.loc[:, columns_to_zero] = 0

y_train = train.Win.values
X_train = train2.drop(columns = ["Win"]).values
y_test = test.Win.values
X_test = test2.drop(columns = ["Win"]).values

  test2.loc[:, columns_to_zero] = 0


In [170]:
# define the logistic regression model
logreg2 = LogisticRegression(solver='lbfgs', max_iter=10000)
lr2 = logreg2.fit(X_train, y_train)

In [171]:
# predict probabilities for win probability column for train data
train2['WinProbability'] = lr2.predict_proba(X_train)[:,1]
train2['WinProbability'] = train2['WinProbability'] / train2.groupby('RaceID')['WinProbability'].transform('sum')

In [172]:
# predict probabilities for win probability column for test data
test2['WinProbability'] = lr2.predict_proba(X_test)[:,1]
test2['WinProbability'] = test2['WinProbability'] / test2.groupby('RaceID')['WinProbability'].transform('sum')

In [173]:
# predict the winner for each race by using the highest win probability
train2["PredictedWin"] = 0
train2.loc[train2.groupby('RaceID')['WinProbability'].transform(max) == train2['WinProbability'],"PredictedWin"] = 1
test2["PredictedWin"] = 0
test2.loc[test2.groupby('RaceID')['WinProbability'].transform(max) == test2['WinProbability'],"PredictedWin"] = 1

In [174]:
# First, get tp, tn, fp, fn
tp = sum(np.logical_and(test2['PredictedWin'] == 1, test2['Win'] == 1))
tn = sum(np.logical_and(test2['PredictedWin'] == 0, test2['Win'] == 0))
fp = sum(np.logical_and(test2['PredictedWin'] == 1, test2['Win'] == 0))
fn = sum(np.logical_and(test2['PredictedWin'] == 0, test2['Win'] == 1))

print(f"tp: {tp} tn: {tn} fp: {fp} fn: {fn}")

# Accuracy
acc = (tp + tn) / (tp + tn + fp + fn)

# Precision
precision = tp / (tp + fp)

# Recall
recall = tp / (tp + fn)

# Sensitivity
sensitivity = recall

# Specificity
specificity = tn / (fp + tn)

# Print results
print("Accuracy:",round(acc,3),"Recall:",round(recall,3),"Precision:",round(precision,3),
          "Sensitivity:",round(sensitivity,3),"Specificity:",round(specificity,3))

tp: 238 tn: 24075 fp: 1902 fn: 1905
Accuracy: 0.865 Recall: 0.111 Precision: 0.111 Sensitivity: 0.111 Specificity: 0.927


**Model 1 ran better**

In [175]:
# combine train and test (used model 1)
final_data = pd.concat([train, test])
final_data

Unnamed: 0,Barrier,BeatenMargin,DamID,Distance,FinishPosition,FrontShoes,GoingID,HandicapDistance,HindShoes,HorseAge,...,RaceGroup_G3,RacingSubType_TM,SexRestriction_C&G,SexRestriction_F,SexRestriction_M,StartType_V,Surface_S,Surface_T,WinProbability,PredictedWin
0,5,1.55,1491946,2150.0,2,0,4,0.0,0,6,...,0,0,0,0,1,0,1,0,3.201728e-04,0
1,6,3.55,1509392,2150.0,4,0,4,0.0,0,6,...,0,0,0,0,1,0,1,0,2.441597e-11,0
2,7,5.55,1507967,2150.0,6,0,4,0.0,0,6,...,0,0,0,0,1,0,1,0,7.196308e-14,0
3,8,999.00,1508536,2150.0,-1,0,4,0.0,0,6,...,0,0,0,0,1,0,1,0,1.020950e-14,0
4,9,999.00,1514055,2150.0,-1,0,4,0.0,0,6,...,0,0,0,0,1,0,1,0,1.084446e-14,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1192908,0,0.00,1562349,2850.0,0,3,4,0.0,2,3,...,0,0,0,1,0,1,0,0,7.070688e-02,0
1192909,0,0.00,1555764,2850.0,0,2,4,0.0,2,3,...,0,0,0,1,0,1,0,0,7.131926e-02,0
1192910,0,0.00,1503535,2850.0,0,2,4,0.0,2,3,...,0,0,0,1,0,1,0,0,7.009350e-02,0
1192911,0,0.00,1510529,2850.0,0,2,4,0.0,2,3,...,0,0,0,1,0,1,0,0,7.355970e-02,0


In [176]:
# write the final data that contains the win probability column to a parquet file
final_data.to_parquet('final_data.parquet')