# Machine Learning for Horse Racing

----

Charles Spencer GA DSI8 Singapore

#### Read Me: https://docs.google.com/document/d/15_TQGryrslBMF6tk0QzfTw_YMzcEhzGqaSD0I440_CI/edit

## Problem Statement; Steps & Conclusions:

- Our goal is to develop a model which can be used to predict the probability of a horse winning a race. To do this we will follow 3 steps:
 1. Create a model to predict the order of finish.
 2. Use the past race data on lbw (lengths behind winner) and the implied probability of winning (1/odds) to calculate our own estimated odds.
 3. Compare the actual odds with our estimated odds to identify bets worth making.

- Horse racing uses a [pari-mutuel](https://en.wikipedia.org/wiki/Parimutuel_betting) wagering system.  Simply put, you are not betting against 'the house'; you are betting against the other players.  However, the [Singapore Turf Club](http://www.turfclub.com.sg/Pages/Homepage.aspx) ('the house') takes roughly 20% on win bets (the vigorish), but also offers a 10% rebate on losing bets in excess of SGD2,000 (see the [Rebate Program](http://www.singaporepools.com.sg/en/HorseRacing/BetGuide/Pages/RebateProgram.aspx)).
- Therefore, our goal is not just to find the most likely winner in each race, but rather which horse or horses are offering odds that exceed their actual chances of winning. This point underscores the importance of expected value, as a central concept in any probabilistic exercise. 
- As such, our model will have to outperform the collective wisdom of the betting crowd (the baseline) for it to be considered successful.  To set the baseline we will engineer a new feature called 'public_prob_win', which is calculated from the 'Win Div' for each horse using the following calculation: (($5/'Win Div')*.80)  
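
The expected-value idea in the bullets above can be sketched in a few lines. This is a minimal illustration, not the notebook's method. It assumes that the public's implied probability is the $5 stake divided by the dividend, scaled by the 0.80 payout share, and that every losing $5 bet earns a $0.50 rebate, mirroring the `loss_rebate` column built later in the notebook. The function names and the 30% model estimate are hypothetical.

```python
def implied_prob(win_div, stake=5.0, payout_share=0.80):
    """Public win probability implied by a dividend quoted per `stake` dollars (assumed formula)."""
    return stake / win_div * payout_share


def expected_value(p_win, win_div, stake=5.0, loss_rebate=0.5):
    """Expected profit of one win bet: collect the dividend on a win, the rebate on a loss."""
    return p_win * win_div - stake + (1.0 - p_win) * loss_rebate


public_p = implied_prob(20)                  # a horse paying $20 per $5 staked
ev_at_public = expected_value(public_p, 20)  # EV if the public's estimate is right
ev_at_model = expected_value(0.30, 20)       # EV under a hypothetical model estimate of 30%

print(f"public probability: {public_p:.2f}")
print(f"EV at public estimate: {ev_at_public:+.2f}")
print(f"EV at model estimate:  {ev_at_model:+.2f}")
```

Under these assumptions a bet is only worth making when the model's probability exceeds the public's by enough to turn the expected value positive.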

**Step One: Market Efficiency**
In 1988, Nobel Prize-winning economist Richard Thaler wrote [Anomalies: Parimutuel Betting Markets: Racetracks and Lotteries](https://www.jstor.org/stable/1942856?seq=1#page_scan_tab_contents), which identified a bias in horse race betting markets. Thaler's study found that, on average, bets on longshots lose much more frequently than their odds suggest, whilst bets on favorites lose modestly less. This implies that longshots are overbet and favorites are underbet. Using a large data set of parimutuel horse races, we show that the favorite-longshot bias also exists in Singapore. This bias may create opportunities for success in our horse racing modeling exercise.  

***Conclusion - Clear & Useful Wagering Bias: There is clear evidence of a strong wagering bias, with favorites (paying SGD20 or less) winning more frequently than their odds might suggest, relative to long-shots (paying SGD21 and above). We should be able to use this to our advantage in Step Four, the implementation of our predictive model, by focusing on wagers at or below SGD20 ('Win_Div_3') and assigning a larger hurdle to wagering opportunities at SGD21 and above.***

**Step Two: SLR on 

**Step Two: Improve Performance with MLR**
We plan on using a **Multiple Linear Regression model**, trained on ~10,000 rows of past race data. Roughly 20 features, including a number of engineered features, will need to be included in our model to give it a high degree of predictive power.  Our model will have to outperform the collective wisdom of the betting crowd for it to be considered successful.  
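
As a rough sketch of how such a model can be scored against a naive benchmark, the snippet below cross-validates a multiple linear regression alongside scikit-learn's `DummyRegressor`. Everything in it is an assumption for illustration: the data are synthetic, the three features are made up, and the dummy model only stands in for a baseline (the notebook's actual baseline is the betting public's implied probability).

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in data: 500 rows, 3 made-up features, linear target plus noise.
X = rng.normal(size=(500, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=1.0, size=500)

cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Scaling inside the pipeline keeps each fold's test rows out of the scaler fit.
mlr = make_pipeline(StandardScaler(), LinearRegression())
baseline = DummyRegressor(strategy="mean")  # always predicts the training mean

mlr_r2 = cross_val_score(mlr, X, y, cv=cv, scoring="r2").mean()
baseline_r2 = cross_val_score(baseline, X, y, cv=cv, scoring="r2").mean()

print(f"MLR mean R^2:      {mlr_r2:.3f}")
print(f"Baseline mean R^2: {baseline_r2:.3f}")
```

With the real race data, `X` would hold the chosen features and `y` the response (for example `lbw`); the comparison logic stays the same.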

**Step Four: Interesting Stats**
- average lost lengths in a race, versus the LBW 
- avg jockey lost lengths & performance if he saved ground?


**Step Three: Data Visualization**


**Step Four: Implementation**


### Conclusion - Clear & Useful Wagering Bias:

There is clear evidence of a **strong wagering bias** with favorites winning more frequently than their odds might suggest, relative to long-shots.

Horses **paying 20 or less** carry a **public probability estimate of winning of 25% or more**, yet these horses show a **materially smaller PL_Percent Loss** relative to long-shots.

We can use this in the implementation of our predictive model by focusing on wagers at or below 20 ('Win_Div_3'), and assigning a larger hurdle to wagering opportunities above 20.
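
One way to tabulate this kind of comparison is a per-bin summary of win rate, average public probability, and profit or loss as a percentage of money wagered. The sketch below uses a handful of synthetic rows, not the race data, and assumes that "PL_Percent" means total `profit_loss` divided by total `total_wager`; that definition is not given in this section. Column names follow the ones engineered later in the notebook, with shortened bin labels.

```python
import pandas as pd

# Synthetic rows for illustration only; these are not taken from the race data.
toy = pd.DataFrame({
    "win_div_3": [10, 15, 20, 40, 80, 150],
    "win_count": [1, 0, 0, 0, 0, 0],
})
toy["total_wager"] = 5
toy["loss_rebate"] = (toy["win_count"] == 0) * 0.5
toy["profit_loss"] = (toy["win_div_3"] * toy["win_count"]
                      - toy["total_wager"] + toy["loss_rebate"])
toy["public_prob"] = 5 / toy["win_div_3"] * 100
toy["bin"] = pd.cut(toy["win_div_3"], bins=[0, 20.5, 5000.5],
                    labels=["$6-20", "$21+"])

summary = toy.groupby("bin", observed=True).agg(
    bets=("win_count", "size"),
    win_rate_pct=("win_count", lambda s: s.mean() * 100),
    avg_public_prob=("public_prob", "mean"),
    profit_loss=("profit_loss", "sum"),
    total_wager=("total_wager", "sum"),
)
# Assumed definition: profit or loss as a percentage of the amount wagered.
summary["pl_percent"] = summary["profit_loss"] / summary["total_wager"] * 100
print(summary)
```

Comparing `win_rate_pct` with `avg_public_prob` within each bin shows whether that group of horses wins more or less often than the public's odds imply.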

### Import Libraries:

In [1]:
import pandas as pd
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

import scipy.stats

from sklearn import linear_model, metrics
from sklearn.linear_model import LinearRegression, Ridge, LassoCV, RidgeCV
from sklearn.preprocessing import StandardScaler, PolynomialFeatures, PowerTransformer
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict, KFold
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.dummy import DummyRegressor

import statsmodels.api as sm

%config InlineBackend.figure_format = 'retina'

plt.style.use('fivethirtyeight')



### Loading Data; Exploratory Data Analysis & Cleaning:

Our raw dataset totals 10,233 rows and 47 columns of past race data; dropping the empty, duplicated, and unused columns leaves 27 unique columns for our MLR analysis.  

Some modest cleaning was performed to prepare us for **Step One** analysis.


In [2]:
# Import data: 
#df = pd.read_csv('myfile.csv', parse_dates=['Date'], dayfirst=True)
df = pd.read_csv('./datasets/stc_data.csv', parse_dates=['date'], dayfirst=True)

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
df.tail()

Unnamed: 0,date,race,class,distance,surface,horse_number,horse_name,gear,horse_rating,horse_weight,hcp_weight,c_wt,bar,jockey,trainer,running_position,pl,time,lbw,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23,l100m_time_2,l100m_distance_2,off_rail_2,l100m_km_hr_2,Unnamed: 28,time_2,peak_km_hr_2,avg_km_hr_2,distance_traveled_2,margin_2,finish_3,no_3,horse_name_3,win_div_3,rating_3,c_weight_3,jockey_3,trainer_3,draw_3,running_pos_3,finish_time_3,lbw_3,h_weight_3
10228,2019-07-21,10,C4,1400,T,12,MINGS MAN,"WK, TT",52,465,52.5,49.5,7,APP N ZYRUL,RB MARSH,5-6-10,10,1:24.97,6.9,,,MINGS MAN,,,6.57,101.0,10.2,55.3,,1:24.97,66.7,60.2,1423.0,3/4,10,12,MINGS MAN,$356,52,49.5,N ZYRUL,RB MARSH,7,5-6-10,1:24.97,6.9L,465kg
10229,2019-07-21,10,C4,1400,T,3,BILLY BRITAIN,,64,516,58.5,57.5,10,APP I AMIRUL,S GRAY,9-10-11,11,1:24.99,7.1,,,BILLY BRITAIN,,,6.55,101.0,8.5,55.5,,1:24.99,67.1,59.8,1414.0,head,11,3,BILLY BRITAIN,$227,64,57.5,I AMIRUL,S GRAY,10,9-10-11,1:24.99,7.1L,516kg
10230,2019-07-21,10,C4,1400,T,10,OXBOW SUN,B,57,520,55.0,55.0,12,CC WONG,D KOH,2-2-12,12,1:25.45,9.9,,,OXBOW SUN,,,7.01,101.0,7.0,51.8,,1:25.45,67.7,59.7,1420.0,2 3/4,12,10,OXBOW SUN,$210,57,55.0,CC WONG,D KOH,12,2-2-12,1:25.45,9.85L,520kg
10231,2019-07-21,10,C4,1400,T,13,MY FRIENDS,,52,505,52.5,51.5,1,APP J SEE,L KHOO,1-3-13,13,1:26.33,15.4,,,MY FRIENDS,,,7.21,101.0,4.4,50.4,,1:26.33,67.7,58.9,1414.0,5 1/2,13,13,MY FRIENDS,$259,52,51.5,J SEE,L KHOO,1,1-3-13,1:26.33,15.35L,505kg
10232,2019-07-21,10,C4,1400,T,6,DRONE,,62,504,57.5,57.5,8,J AZZOPARDI,M CLEMENTS,8-8-14,14,1:26.49,16.4,,,DRONE,,,7.18,101.0,13.0,50.6,,1:26.49,65.3,59.1,1422.0,1,14,6,DRONE,$193,62,57.5,J AZZOPARDI,M CLEMENTS,8,8-8-14,1:26.49,16.35L,504kg


In [3]:
# Inspect column dtypes and non-null counts:

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10233 entries, 0 to 10232
Data columns (total 47 columns):
date                   10233 non-null datetime64[ns]
race                   10233 non-null int64
class                  10233 non-null object
distance               10233 non-null int64
surface                10233 non-null object
horse_number           10233 non-null int64
horse_name             10233 non-null object
gear                   7477 non-null object
horse_rating           10233 non-null int64
horse_weight           10233 non-null int64
hcp_weight             10233 non-null float64
c_wt                   10233 non-null float64
bar                    10233 non-null int64
jockey                 10233 non-null object
trainer                10233 non-null object
running_position       10233 non-null object
pl                     10233 non-null int64
time                   10216 non-null object
lbw                    10233 non-null float64
Unnamed: 19            0 non-null

### Data wrangling:

In [4]:

# 'date' stays datetime64; call df['date'].dt.strftime('%d/%m/%Y') only where a display string is needed.

# Indexing:
df['indexing'] = np.arange(len(df))  # row counter, used later to restore the original (date & race) order
df['surf_numb'] = np.where(df['surface'] == 'T', 1.0, 0.1)  # encode 'surface': T -> 1, otherwise (P) -> 0.1
df['indexing_surf_dist'] = df['surf_numb'] * df['distance']  # surface & distance
df['indexing_surf_dist_bar'] = df['indexing_surf_dist'] * df['bar']  # surface, distance & bar
df['indexing_date_race'] = df['date'].map(str) + df['race'].map(str)  # categorical key: date & race
df['indexing_date_horse'] = df['date'].map(str) + df['horse_name'].map(str)  # categorical key: date & horse

df['indexing_surf_dist_10'] = df['indexing_surf_dist'].astype(str)
df['indexing_surf_dist_horse'] = df['indexing_surf_dist_10'].map(str) + df['horse_name'].map(str)  # key: surface, distance & horse
#df['indexing_surf_dist_class'] = df["indexing_surf_dist_10"].map(str) + df["class"].map(str)

# Cleaning:
# 'win_div_3': strip the literal '$' (regex=False, because '$' is a regex anchor) and convert to int
df['win_div_3'] = df['win_div_3'].str.replace('$', '', regex=False).astype(int)
df['lbw'] = 0 - df['lbw']  # store LBW as negative numbers (the winner stays at 0)

# Additional Calculations for our Engineered Features & Analysis:
# Favorite/Longshot Bias Bins (pd.cut intervals are right-inclusive):
bin_ranges = [0, 20.5, 5000.5]
bin_names = ['$6-20; Prob. >=25%', '$21+; Prob. <25%']

df['binned_win_div_3'] = pd.cut(df['win_div_3'], bins=bin_ranges, labels=bin_names)
df['public_prob'] = 5 / df['win_div_3'] * 100  # implied win probability (%) for a $5 stake
df['total_count'] = 1  # one runner per row
df['win_count'] = (df['lbw'] == 0).astype(int)  # 1 for the winner (LBW of 0), else 0
df['total_wager'] = 5  # flat $5 win bet on every runner
df['return_wager'] = df['win_div_3'] * df['win_count']  # dividend collected if the horse won
df['loss_rebate'] = np.where(df['win_count'] == 0, 0.5, 0.0)  # 10% rebate on a losing $5 bet
df['profit_loss'] = df['return_wager'] - df['total_wager'] + df['loss_rebate']

#drop unused columns:
df = df.drop(['Unnamed: 19', 'Unnamed: 20', 'Unnamed: 21', 'Unnamed: 22', 'Unnamed: 23',
              'Unnamed: 28', 'time_2', 'margin_2', 'finish_3', 'no_3', 'horse_name_3', 'rating_3',
              'c_weight_3', 'jockey_3', 'trainer_3', 'draw_3', 'running_pos_3', 'finish_time_3',
              'lbw_3', 'h_weight_3'], axis=1)

#drop other currently unused columns:
df = df.drop(['horse_number', 'gear', 'running_position', 'pl', 'time', 'off_rail_2'], axis=1)
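
The `$`-stripping and binning steps in the cell above can be sanity-checked on a few values. The snippet below is a small standalone sketch: the first three dividend strings appear in the notebook's outputs, while `$20` and `$21` are added here as hypothetical boundary cases to confirm which bin each falls into.

```python
import pandas as pd

# Dividend strings in the raw 'win_div_3' format, plus two boundary values.
raw = pd.Series(['$356', '$227', '$19', '$20', '$21'])

# regex=False treats '$' as a literal character rather than an end-of-string anchor.
win_div = raw.str.replace('$', '', regex=False).astype(int)

bin_ranges = [0, 20.5, 5000.5]
bin_names = ['$6-20; Prob. >=25%', '$21+; Prob. <25%']
binned = pd.cut(win_div, bins=bin_ranges, labels=bin_names)

check = pd.DataFrame({'raw': raw, 'win_div': win_div, 'bin': binned})
print(check)
```

Because the bin edge sits at 20.5 and dividends are whole dollars, a dividend of 20 lands in the favorite bin and 21 in the longshot bin.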

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10233 entries, 0 to 10232
Data columns (total 37 columns):
date                        10233 non-null datetime64[ns]
race                        10233 non-null int64
class                       10233 non-null object
distance                    10233 non-null int64
surface                     10233 non-null object
horse_name                  10233 non-null object
horse_rating                10233 non-null int64
horse_weight                10233 non-null int64
hcp_weight                  10233 non-null float64
c_wt                        10233 non-null float64
bar                         10233 non-null int64
jockey                      10233 non-null object
trainer                     10233 non-null object
lbw                         10233 non-null float64
l100m_time_2                10161 non-null float64
l100m_distance_2            10161 non-null float64
l100m_km_hr_2               10233 non-null float64
peak_km_hr_2                1020

In [11]:
# check for accuracy:

# visualize the distribution of average speed for each surface/distance combination using box plots
plt.style.use('dark_background')
sns.catplot(data=df, x='indexing_surf_dist', y='avg_km_hr_2',
            kind='box', height=7, aspect=0.7)

In [6]:
# df.head().T

df[['win_div_3','binned_win_div_3', 'public_prob', 'total_count', 'indexing_surf_dist_horse',
   'win_count', 'total_wager', 'return_wager', 'loss_rebate', 'profit_loss']].head()

Unnamed: 0,win_div_3,binned_win_div_3,public_prob,total_count,indexing_surf_dist_horse,win_count,total_wager,return_wager,loss_rebate,profit_loss
0,37,$21+; Prob. <25%,13.513514,1,1200.0ROCKET FIGHTER,1,5,37,0.0,32.0
1,19,$6-20; Prob. >=25%,26.315789,1,1200.0DOMINY,0,5,0,0.5,-4.5
2,11,$6-20; Prob. >=25%,45.454545,1,1200.0GREATBALLS OF FIRE,0,5,0,0.5,-4.5
3,138,$21+; Prob. <25%,3.623188,1,1200.0EDEN GARDEN,0,5,0,0.5,-4.5
4,147,$21+; Prob. <25%,3.401361,1,1200.0GOOD WARRIOR,0,5,0,0.5,-4.5


In [7]:
# return df to original order:
df = df.sort_values(by='indexing', ascending=True).reset_index(drop=True)

In [8]:
# save cleaned dataset back to folder
df.to_csv('./datasets/stc_data_cleaned.csv')

In [9]:
df.tail()

Unnamed: 0,date,race,class,distance,surface,horse_name,horse_rating,horse_weight,hcp_weight,c_wt,bar,jockey,trainer,lbw,l100m_time_2,l100m_distance_2,l100m_km_hr_2,peak_km_hr_2,avg_km_hr_2,distance_traveled_2,win_div_3,indexing,surf_numb,indexing_surf_dist,indexing_surf_dist_bar,indexing_date_race,indexing_date_horse,indexing_surf_dist_10,indexing_surf_dist_horse,binned_win_div_3,public_prob,total_count,win_count,total_wager,return_wager,loss_rebate,profit_loss
10228,2019-07-21,10,C4,1400,T,MINGS MAN,52,465,52.5,49.5,7,APP N ZYRUL,RB MARSH,-6.9,6.57,101.0,55.3,66.7,60.2,1423.0,356,10228,1.0,1400.0,9800.0,2019-07-21 00:00:0010,2019-07-21 00:00:00MINGS MAN,1400.0,1400.0MINGS MAN,$21+; Prob. <25%,1.404494,1,0,5,0,0.5,-4.5
10229,2019-07-21,10,C4,1400,T,BILLY BRITAIN,64,516,58.5,57.5,10,APP I AMIRUL,S GRAY,-7.1,6.55,101.0,55.5,67.1,59.8,1414.0,227,10229,1.0,1400.0,14000.0,2019-07-21 00:00:0010,2019-07-21 00:00:00BILLY BRITAIN,1400.0,1400.0BILLY BRITAIN,$21+; Prob. <25%,2.202643,1,0,5,0,0.5,-4.5
10230,2019-07-21,10,C4,1400,T,OXBOW SUN,57,520,55.0,55.0,12,CC WONG,D KOH,-9.9,7.01,101.0,51.8,67.7,59.7,1420.0,210,10230,1.0,1400.0,16800.0,2019-07-21 00:00:0010,2019-07-21 00:00:00OXBOW SUN,1400.0,1400.0OXBOW SUN,$21+; Prob. <25%,2.380952,1,0,5,0,0.5,-4.5
10231,2019-07-21,10,C4,1400,T,MY FRIENDS,52,505,52.5,51.5,1,APP J SEE,L KHOO,-15.4,7.21,101.0,50.4,67.7,58.9,1414.0,259,10231,1.0,1400.0,1400.0,2019-07-21 00:00:0010,2019-07-21 00:00:00MY FRIENDS,1400.0,1400.0MY FRIENDS,$21+; Prob. <25%,1.930502,1,0,5,0,0.5,-4.5
10232,2019-07-21,10,C4,1400,T,DRONE,62,504,57.5,57.5,8,J AZZOPARDI,M CLEMENTS,-16.4,7.18,101.0,50.6,65.3,59.1,1422.0,193,10232,1.0,1400.0,11200.0,2019-07-21 00:00:0010,2019-07-21 00:00:00DRONE,1400.0,1400.0DRONE,$21+; Prob. <25%,2.590674,1,0,5,0,0.5,-4.5
