# Predicting Greyhound Finishing Position, 1st to 6th.
## 1. Description.
### The dataset contains 2,000 races, all run over 380 metres at Crayford. It can be used to predict either the race winner or the finishing position of each greyhound. This notebook builds a classification model of greyhound finishing position.
### Research Question:
### Can the model outperform the market in predicting the finish position of greyhounds in competitive six-runner races, 1st to 6th?

## 1.1 Loading Packages and Data

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing
from sklearn.ensemble import  GradientBoostingClassifier # classifier

In [2]:
df = pd.read_csv("../input/greyhound-racing-uk-predict-finish-position/data_final.csv")

## 1.2 Exploring the dataset

In [3]:
df.head()

Unnamed: 0,Race_ID,Trap,Odds,BSP,Public_Estimate,Last_Run,Distance_All,Finish_All,Distance_Places_All,Races_All,...,Early_380,Grade_380,Time_380,Early_Time_380,Stay_380,Favourite,Finished,Wide_380,Dist_By,Winner
0,0,6,2.75,4.0,1,12,456.47,4.09,402.86,17,...,2.0,4.0,17.84,3.63,0.5,6.0,4,0.0,-10.5,0
1,0,3,5.0,7.6,4,5,410.48,3.53,414.0,21,...,3.43,3.29,24.18,3.7,0.28,6.0,1,0.14,-4.71,1
2,0,5,5.0,9.4,6,9,386.45,3.39,380.0,31,...,3.43,3.71,24.06,3.67,-0.43,6.0,3,0.0,-2.86,0
3,0,4,7.0,7.8,5,9,380.0,3.03,380.0,21,...,2.43,4.43,24.14,3.65,0.28,6.0,2,0.0,-2.71,0
4,0,2,5.0,5.1,2,13,385.0,2.59,388.33,40,...,3.14,2.71,24.05,3.64,-0.43,6.0,6,0.0,-2.32,0


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12006 entries, 0 to 12005
Data columns (total 28 columns):
Race_ID                12006 non-null int64
Trap                   12006 non-null int64
Odds                   12006 non-null float64
BSP                    12006 non-null float64
Public_Estimate        12006 non-null int64
Last_Run               12006 non-null int64
Distance_All           12006 non-null float64
Finish_All             12006 non-null float64
Distance_Places_All    12006 non-null float64
Races_All              12006 non-null int64
Distance_Recent        12006 non-null float64
Finish_Recent          12006 non-null float64
Odds_Recent            12006 non-null float64
Early_Recent           12006 non-null float64
Races_380              12006 non-null int64
Wins_380               12006 non-null float64
Finish_380             12006 non-null float64
Odds_380               12006 non-null float64
Early_380              12006 non-null float64
Grade_380              12006 

## 1.3 Building the classification model, predicting finish position 1st to 6th.
>  I will build a multiclass classifier using sklearn's Gradient Boosting Classifier. The target variable is 'Finished', which takes the values 1 to 6 (1st to 6th).  
>  The variable 'Race_ID' is for identification only and will not be used. 'Winner', the binary win/lose target (1 or 0), will also be removed: it is derived from the race result, so using it as a feature would leak the target into the model.


### 1.3.1 The features / predictor variables

In [5]:
# Features
features = ['Trap', 'BSP', 'Time_380', 'Finish_Recent', 'Finish_All', 'Stay_380',
            'Races_All', 'Odds_Recent', 'Odds_380', 'Distance_Places_All', 'Dist_By',
            'Races_380', 'Odds', 'Last_Run', 'Early_Time_380', 'Early_Recent',
            'Distance_All', 'Wins_380', 'Grade_380', 'Finish_380', 'Early_380',
            'Distance_Recent', 'Public_Estimate', 'Wide_380', 'Favourite']
# Target
target = ['Finished']

In [6]:
df[features].corr()

Unnamed: 0,Trap,BSP,Time_380,Finish_Recent,Finish_All,Stay_380,Races_All,Odds_Recent,Odds_380,Distance_Places_All,...,Early_Recent,Distance_All,Wins_380,Grade_380,Finish_380,Early_380,Distance_Recent,Public_Estimate,Wide_380,Favourite
Trap,1.0,-0.010928,0.006039,0.00496,0.008526,0.277346,-0.001243,-0.073423,-0.076825,-0.072892,...,-0.290623,-0.07082,-0.009368,0.002056,0.015788,-0.287677,-0.034304,8.6e-05,0.44007,0.0
BSP,-0.010928,1.0,0.043502,0.104114,0.090383,0.043106,0.156367,0.219201,0.2321,0.059333,...,0.046036,0.059474,-0.054812,0.055517,0.142594,0.060096,0.071785,0.804775,-0.008039,0.007842
Time_380,0.006039,0.043502,1.0,-0.107252,-0.063403,0.008318,0.060019,0.024338,0.027301,-0.001194,...,-0.108189,-0.006385,-0.051596,0.160579,-0.042169,-0.040519,-0.027158,0.031984,0.057427,0.024255
Finish_Recent,0.00496,0.104114,-0.107252,1.0,0.594869,0.30822,-0.082271,0.273774,0.244595,0.065376,...,0.39936,0.069811,-0.304652,0.087636,0.762498,0.237432,0.049758,0.090531,0.036313,-0.001296
Finish_All,0.008526,0.090383,-0.063403,0.594869,1.0,0.152343,-0.216865,0.126639,0.154431,-0.019067,...,0.317192,0.007917,-0.422467,0.343719,0.548077,0.24539,0.027377,0.079254,0.016793,0.011693
Stay_380,0.277346,0.043106,0.008318,0.30822,0.152343,1.0,0.115,0.089389,0.102632,-0.114033,...,-0.644926,-0.12016,-0.087712,0.030113,0.44908,-0.743963,-0.09011,0.054093,0.138498,-0.013295
Races_All,-0.001243,0.156367,0.060019,-0.082271,-0.216865,0.115,1.0,0.248957,0.276539,0.113777,...,-0.164811,0.086732,-0.026721,0.01864,0.011291,-0.115735,-0.005865,0.158367,-0.004473,-0.009261
Odds_Recent,-0.073423,0.219201,0.024338,0.273774,0.126639,0.089389,0.248957,1.0,0.834641,0.124121,...,0.107965,0.131159,-0.120799,-0.033214,0.295658,0.124591,0.079469,0.213441,-0.053853,-0.004395
Odds_380,-0.076825,0.2321,0.027301,0.244595,0.154431,0.102632,0.276539,0.834641,1.0,0.117071,...,0.113046,0.110731,-0.158212,0.03722,0.335256,0.139906,0.060118,0.225681,-0.050937,-0.007944
Distance_Places_All,-0.072892,0.059333,-0.001194,0.065376,-0.019067,-0.114033,0.113777,0.124121,0.117071,1.0,...,0.118519,0.935596,-0.105647,0.001758,0.064551,0.171412,0.575218,0.061142,-0.029622,-0.012474


### There are three odds related features, 'BSP','Odds' & 'Public_Estimate'.
> These are highly correlated with one another (the cell below computes their correlation matrix).
> Betfair Starting Price (BSP) is the most highly correlated with the target 'Finished', so this feature will be kept.
> I will remove the other two odds features, 'Odds' and 'Public_Estimate', from the list of features used in the prediction model.
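
As a minimal sketch of the selection rule described above, the snippet below ranks a set of candidate columns by the absolute value of their correlation with the target and keeps the strongest one. The toy data and the helper name `strongest_feature` are hypothetical and are not part of this notebook; only `DataFrame.corrwith` is a real pandas call.

```python
import numpy as np
import pandas as pd

def strongest_feature(frame, candidates, target):
    """Return the candidate with the largest absolute Pearson correlation with target."""
    corr_with_target = frame[candidates].corrwith(frame[target]).abs()
    return corr_with_target.idxmax(), corr_with_target

# Toy data (hypothetical): 'a' tracks the target closely, 'b' is much noisier.
rng = np.random.default_rng(0)
y = rng.integers(1, 7, size=500)          # labels 1..6, like 'Finished'
toy = pd.DataFrame({
    "target": y,
    "a": y + rng.normal(0, 1, size=500),
    "b": y + rng.normal(0, 5, size=500),
})

best, scores = strongest_feature(toy, ["a", "b"], "target")
print(best)
print(scores.round(3))
```

On the real data the same helper could be called with `['BSP', 'Odds', 'Public_Estimate']` and `'Finished'`, assuming `df` is loaded as above.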

In [7]:
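# Correlation between the three odds features and the target.
# Note: Jupyter only displays the last expression in a cell, so this matrix
# is not shown here; wrap the line in print() to see it.
df[['BSP','Odds','Public_Estimate','Finished']].corr()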
features.remove('Odds')
features.remove('Public_Estimate')
print(features)
print("\nThere are now",len(features),"features remaining.")

['Trap', 'BSP', 'Time_380', 'Finish_Recent', 'Finish_All', 'Stay_380', 'Races_All', 'Odds_Recent', 'Odds_380', 'Distance_Places_All', 'Dist_By', 'Races_380', 'Last_Run', 'Early_Time_380', 'Early_Recent', 'Distance_All', 'Wins_380', 'Grade_380', 'Finish_380', 'Early_380', 'Distance_Recent', 'Wide_380', 'Favourite']

There are now 23 features remaining.


### 1.3.2 Splitting the data, train & test.

In [8]:
train=df.sample(frac=0.80,random_state=10) #random state is a seed value
test=df.drop(train.index)

In [9]:
# train_X, train_y
train_X = train[features]
train_y = train[target]

# test_X, test_y
test_X = test[features]
test_y = test[target]
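
The `sample` / `drop` pattern used above can be checked on a small frame. This is only an illustrative sketch; the frame `toy` and its column `x` are hypothetical stand-ins for `df`.

```python
import pandas as pd

# Toy frame (hypothetical) standing in for df.
toy = pd.DataFrame({"x": range(10)})

# Sample 80% of the rows for training; random_state fixes the shuffle.
toy_train = toy.sample(frac=0.80, random_state=10)
# Every row that was not sampled becomes the test set.
toy_test = toy.drop(toy_train.index)

print(len(toy_train), len(toy_test))
# The two index sets do not overlap.
print(toy_train.index.intersection(toy_test.index).empty)
```

Note that this split is row-wise: runners from the same race can end up on both sides of the split.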

### 1.3.3 Train the model

In [10]:
# Create model
model = GradientBoostingClassifier(n_estimators = 10, max_features = None, min_samples_split = 2)
model.fit(train_X, train_y.values.ravel())

GradientBoostingClassifier(criterion='friedman_mse', init=None,
                           learning_rate=0.1, loss='deviance', max_depth=3,
                           max_features=None, max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=1, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=10,
                           n_iter_no_change=None, presort='auto',
                           random_state=None, subsample=1.0, tol=0.0001,
                           validation_fraction=0.1, verbose=0,
                           warm_start=False)
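
Because the classifier is multiclass, it can also return a probability for each finishing position rather than a single predicted label. The sketch below shows this with `predict_proba` on random toy data; the arrays `X` and `y` and the name `clf` are hypothetical, and `random_state=0` is set only to make the example repeatable.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Toy data (hypothetical): 4 random features, labels 1..6 like 'Finished'.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = rng.integers(1, 7, size=300)

clf = GradientBoostingClassifier(n_estimators=10, random_state=0)
clf.fit(X, y)

# One row per sample, one column per class, in the order given by clf.classes_.
proba = clf.predict_proba(X[:3])
print(clf.classes_)
print(proba.round(3))
```

With the fitted `model` above, `model.predict_proba(test_X)` would give the same kind of table for the test runners.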

### 1.3.4 Evaluate the Model on the Training Data

In [11]:
# evaluate the model on TRAINING DATA
accuracy = model.score(train_X, train_y)
print('    Training Model Accuracy:    ' + str(round(accuracy*100,2)) + '%')

    Training Model Accuracy:    27.36%


### 1.3.5 Test the Model

In [12]:
# evaluate the model on Test data
accuracy = model.score(test_X, test_y)
print('    Test Model Accuracy:  ' + str(round(accuracy*100,2)) + '%')

    Test Model Accuracy:  22.12%
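
For a classifier, `score` returns the mean accuracy, that is, the fraction of rows whose predicted label equals the true label. The sketch below reproduces that calculation by hand on made-up labels (`y_true` and `y_pred` are hypothetical) and prints the chance level for a uniform random guess among six positions as a point of reference.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels (hypothetical), finishing positions 1..6.
y_true = np.array([1, 2, 3, 4, 5, 6, 1, 2])
y_pred = np.array([1, 3, 3, 4, 6, 6, 2, 2])

# Accuracy is the share of exact matches.
manual = (y_true == y_pred).mean()
print(manual, accuracy_score(y_true, y_pred))   # 0.625 0.625

# A uniform random guess among six positions is right one time in six.
chance = 1 / 6
print(round(chance * 100, 2))                   # 16.67
```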


### 1.3.6 Calculate Market Predictions of Greyhound Finishing Position
> Does the model beat the market in predicting finish position?
> Yes, slightly: 22.12% test accuracy for the model against 20.9% for the market.

In [17]:
# Evaluate the market on the test data.
# 'Public_Estimate' is the market's predicted finishing position for each greyhound.
# The mean of the element-wise comparison is the fraction of exact matches.
market_accuracy = (test['Public_Estimate'] == test['Finished']).mean()
print('    Test Market Accuracy:      ' + str(round(market_accuracy*100, 1)) + '%')

    Test Market Accuracy:      20.9%


## 2. Conclusion

#### The model slightly outperformed the market in predicting greyhound finishing position on the test set, by about 1.2 percentage points.
#### Model Accuracy = 22.12%
#### Market Accuracy = 20.9%