# Horse Racing Prediction by Deep Learning

This project uses a small fully connected neural network to predict the finishing position of each horse in a race. The training data was obtained from the Hong Kong Jockey Club (HKJC), and the model is built and trained with TensorFlow. The task is framed as supervised classification: each horse is assigned to one of 14 possible finishing positions.

<H2> Part 1: Data Input and Preprocessing </H2>

In this part of the program, we will import the data obtained from the HKJC. First of all, the following features were selected based on my past horse picking experience, namely:

- position: The starting position (draw) of the horse. Position "1" is the closest to the inside rail, which should be beneficial on non-straight race courses.

- load: The weight carried by the horse, in pounds. The maximum is 133.

- ON odds: This is the overnight odds of the horse provided by the HKJC.

- odds: This is the odds of the horse 15 min before the race.

- class: The class of the race. It is common to all horses in a race, except in special races.

- num horses: The number of horses that participated in the race.
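
As a rough illustration of how these features might be laid out, the sketch below parses a tiny in-memory CSV and separates the six feature columns from the final finishing-position column, mirroring the `iloc[:, :-1]` / `iloc[:, -1]` slicing used when the data is loaded. The column names and sample values are hypothetical; the actual header of the HKJC-derived file is not shown in this notebook.

```python
import csv
import io

# Hypothetical column names and made-up values, for illustration only.
SAMPLE_CSV = """position,load,on_odds,odds,class,num_horses,finishing_position
1,133,3.5,2.8,4,12,2
7,126,12.0,15.0,4,12,9
"""

def split_features_and_label(csv_text):
    """Return (X, y): all columns but the last as floats, the last column as int."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the header row
    X, y = [], []
    for row in reader:
        X.append([float(v) for v in row[:-1]])
        y.append(int(row[-1]))
    return X, y

X_demo, y_demo = split_features_and_label(SAMPLE_CSV)
print(X_demo)
print(y_demo)
```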


In [52]:
#Loading the data and preprocessing it.

import pandas as pd
import tensorflow as tf
tf.__version__

print("tensorflow version = " + tf.__version__)

PATH_TRAINING_DATA = 'training_data/horse_data_train_test.csv'

dataset = pd.read_csv(PATH_TRAINING_DATA)
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0, shuffle=False)


tensorflow version = 2.5.0
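
With `shuffle=False`, `train_test_split` keeps the rows in file order and holds out the final 20% as the test set. The sketch below reproduces that behaviour in plain Python on stand-in data. `ordered_split` is a hypothetical helper, not part of scikit-learn, and rounding the test size up is intended to mirror what scikit-learn does for a fractional `test_size`. The row count of 867 is inferred from the 693 training rows and 174 test rows reported elsewhere in this notebook.

```python
import math

def ordered_split(rows, test_size=0.2):
    """Split rows without shuffling: the last test_size fraction becomes the test set."""
    n_test = math.ceil(test_size * len(rows))
    n_train = len(rows) - n_test
    return rows[:n_train], rows[n_train:]

rows = list(range(867))  # stand-in for the dataset rows, in file order
train_rows, test_rows = ordered_split(rows)
print(len(train_rows), len(test_rows))   # 693 174
print(test_rows[0], test_rows[-1])       # 693 866
```

If the CSV is stored in chronological order (an assumption), this means the model is validated on the most recent races.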


In [53]:
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train) #use "fit_transform" for training data
X_test = sc.transform(X_test)       #use "transform" for testing data

y_train = tf.keras.utils.to_categorical((y_train-1), 14)
y_test = tf.keras.utils.to_categorical((y_test-1), 14)

print(y_train)

[[1. 0. 0. ... 0. 0. 0.]
 [0. 1. 0. ... 0. 0. 0.]
 [0. 0. 1. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 1. ... 0. 0. 0.]]
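
To make the two preprocessing steps concrete, here is a dependency-free sketch of what they compute. `standardize` and `one_hot` are hypothetical stand-ins for `StandardScaler` and `to_categorical`, and the sample numbers are made up. The key points are that the mean and standard deviation come from the training data only, and that a 1-based finishing position `p` maps to index `p - 1`.

```python
import statistics

def standardize(train_col, other_col):
    """Scale both columns with the mean and population std of the training column."""
    mean = statistics.mean(train_col)
    std = statistics.pstdev(train_col)

    def scale(col):
        return [(v - mean) / std for v in col]

    return scale(train_col), scale(other_col)

def one_hot(positions, n_classes=14):
    """Encode 1-based finishing positions as one-hot rows (position p -> index p - 1)."""
    rows = []
    for p in positions:
        row = [0.0] * n_classes
        row[p - 1] = 1.0
        rows.append(row)
    return rows

train_scaled, test_scaled = standardize([120.0, 126.0, 132.0], [133.0])
print(train_scaled)
print(test_scaled)
print(one_hot([1, 3])[1][:4])   # [0.0, 0.0, 1.0, 0.0]
```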


<H2> Part 2: Building the Model </H2>

The model is a small feed-forward network: four fully connected sigmoid layers (7, 12, 12 and 12 units), a dropout layer, and a 14-unit softmax output layer.
I tried both relu and sigmoid as the activation function and found that sigmoid seems to be the better choice for this project.


In [54]:
#create a new model
def build_model():

    # Initializing the ANN
    model = tf.keras.models.Sequential()

    # Adding the first fully connected layer (7 units)
    # Keras infers the input size (the six features) when the model first sees data
    model.add(tf.keras.layers.Dense(units=7, activation='sigmoid'))
    #model.add(tf.keras.layers.Dense(units=7, activation='relu'))

    # Adding the hidden layers
    # The commented-out lines are earlier experiments; three 12-unit layers are used now
    #model.add(tf.keras.layers.Dense(units=5, activation='sigmoid')) 
    #model.add(tf.keras.layers.Dense(units=6, activation='sigmoid')) 
    #model.add(tf.keras.layers.Dense(units=10, activation='sigmoid')) 
    #model.add(tf.keras.layers.Dense(units=10, activation='relu')) 
    #model.add(tf.keras.layers.Dense(units=6, activation='sigmoid')) 
    #model.add(tf.keras.layers.Dense(units=3, activation='sigmoid')) 
    model.add(tf.keras.layers.Dense(units=12, activation='sigmoid')) 
    model.add(tf.keras.layers.Dense(units=12, activation='sigmoid')) 
    model.add(tf.keras.layers.Dense(units=12, activation='sigmoid')) 
    

    # Adding a drop out layer
    #model.add(tf.keras.layers.Dropout(0.2))
    #model.add(tf.keras.layers.Dropout(0.3))
    #model.add(tf.keras.layers.Dropout(0.35277))
    #model.add(tf.keras.layers.Dropout(0.17036843627853782))
    model.add(tf.keras.layers.Dropout(0.10513604980689638))

    #model.add(tf.keras.layers.Dense(units=6, activation='sigmoid')) 
    #model.add(tf.keras.layers.Dropout(0.2))

    # Adding the output layer
    # We have 14 possible finishing positions, so the output layer has a size of 14
    model.add(tf.keras.layers.Dense(units=14, activation='softmax'))

    # Compiling the model
    #opt = tf.keras.optimizers.Adam(learning_rate=0.03) 
    opt = tf.keras.optimizers.Adam(learning_rate=0.01)
    model.compile(loss = 'categorical_crossentropy', optimizer = opt, metrics = ['accuracy'])
    
    return model

model = build_model() 

<H2> Part 3a: [Optional] Training and saving a new Model </H2>

In this section, we either train the model from scratch or load a pre-trained model that is shipped with this Jupyter notebook.

- If the training data or the model definition has changed, train a new model in code cell 3a)
- If you want to save some time, just skip cell 3a) and load a pre-trained model in code cell 3b)


In [44]:
#This is code cell 3a) that trains a model from scratch and then saves it. Please note that it takes hours to train!

import tensorflow as tf
import datetime, os

%load_ext tensorboard
#%reload_ext tensorboard

logdir = os.path.join("logs", datetime.datetime.now().strftime("%Y%m%d-%H%M%S"))
tensorboard_callback = tf.keras.callbacks.TensorBoard(logdir, histogram_freq=1)

print("rows in X_train = " + str(X_train.shape[0]) )
print("rows in y_train = " + str(y_train.shape[0]) )

print("Training model...")
model.fit(x=X_train, 
          y=y_train, 
          batch_size = 14, 
          epochs = 5000, #10, #10000, #30000 #50000 #20000
          validation_data=(X_test, y_test), 
          callbacks=[tensorboard_callback])

# save the model for later use
print("Saving model...")
model.save('saved_model/my_model')

print("Model saved!")

The tensorboard extension is already loaded. To reload it, use:
  %reload_ext tensorboard
rows in X_train = 693
rows in y_train = 693
Training model...
Epoch 1/5000


ValueError: in user code:

    C:\Users\longi\anaconda3\envs\Tensorflow\lib\site-packages\tensorflow\python\keras\engine\training.py:855 train_function  *
        return step_function(self, iterator)
    C:\Users\longi\anaconda3\envs\Tensorflow\lib\site-packages\tensorflow\python\keras\engine\training.py:845 step_function  **
        outputs = model.distribute_strategy.run(run_step, args=(data,))
    C:\Users\longi\anaconda3\envs\Tensorflow\lib\site-packages\tensorflow\python\distribute\distribute_lib.py:1285 run
        return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
    C:\Users\longi\anaconda3\envs\Tensorflow\lib\site-packages\tensorflow\python\distribute\distribute_lib.py:2833 call_for_each_replica
        return self._call_for_each_replica(fn, args, kwargs)
    C:\Users\longi\anaconda3\envs\Tensorflow\lib\site-packages\tensorflow\python\distribute\distribute_lib.py:3608 _call_for_each_replica
        return fn(*args, **kwargs)
    C:\Users\longi\anaconda3\envs\Tensorflow\lib\site-packages\tensorflow\python\keras\engine\training.py:838 run_step  **
        outputs = model.train_step(data)
    C:\Users\longi\anaconda3\envs\Tensorflow\lib\site-packages\tensorflow\python\keras\engine\training.py:797 train_step
        y, y_pred, sample_weight, regularization_losses=self.losses)
    C:\Users\longi\anaconda3\envs\Tensorflow\lib\site-packages\tensorflow\python\keras\engine\compile_utils.py:204 __call__
        loss_value = loss_obj(y_t, y_p, sample_weight=sw)
    C:\Users\longi\anaconda3\envs\Tensorflow\lib\site-packages\tensorflow\python\keras\losses.py:155 __call__
        losses = call_fn(y_true, y_pred)
    C:\Users\longi\anaconda3\envs\Tensorflow\lib\site-packages\tensorflow\python\keras\losses.py:259 call  **
        return ag_fn(y_true, y_pred, **self._fn_kwargs)
    C:\Users\longi\anaconda3\envs\Tensorflow\lib\site-packages\tensorflow\python\util\dispatch.py:206 wrapper
        return target(*args, **kwargs)
    C:\Users\longi\anaconda3\envs\Tensorflow\lib\site-packages\tensorflow\python\keras\losses.py:1644 categorical_crossentropy
        y_true, y_pred, from_logits=from_logits)
    C:\Users\longi\anaconda3\envs\Tensorflow\lib\site-packages\tensorflow\python\util\dispatch.py:206 wrapper
        return target(*args, **kwargs)
    C:\Users\longi\anaconda3\envs\Tensorflow\lib\site-packages\tensorflow\python\keras\backend.py:4862 categorical_crossentropy
        target.shape.assert_is_compatible_with(output.shape)
    C:\Users\longi\anaconda3\envs\Tensorflow\lib\site-packages\tensorflow\python\framework\tensor_shape.py:1161 assert_is_compatible_with
        raise ValueError("Shapes %s and %s are incompatible" % (self, other))

    ValueError: Shapes (None, 1) and (None, 14) are incompatible
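
The `ValueError` at the end of this traceback says that the labels passed to `fit` had shape `(None, 1)` (one integer per row), while the softmax output has shape `(None, 14)`. A likely cause, judging from the cell numbers, is that this cell was run while `y_train` held raw finishing positions rather than the one-hot matrix produced in Part 1. The sketch below is a hypothetical pre-flight check (`check_label_shape` is not a Keras function) that fails early with a clearer message.

```python
def check_label_shape(labels, n_classes=14):
    """Raise a descriptive error unless every label row is a length-n_classes vector."""
    for i, row in enumerate(labels):
        if not isinstance(row, (list, tuple)) or len(row) != n_classes:
            raise ValueError(
                f"label {i} is not a {n_classes}-way one-hot vector; "
                "one-hot encode the labels before calling fit"
            )

raw_labels = [3, 1, 7]  # integer finishing positions
encoded = [[1.0 if k == p - 1 else 0.0 for k in range(14)] for p in raw_labels]

check_label_shape(encoded)  # passes silently
try:
    check_label_shape(raw_labels)
except ValueError as err:
    print("caught:", err)
```

An alternative would be to keep 0-based integer labels and compile the model with the `sparse_categorical_crossentropy` loss instead of `categorical_crossentropy`.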


<H2> Part 3b: Loading a pre-trained Model </H2>

If you have run cell 3a), this loads the model that you trained. Otherwise, it loads the pre-trained model that is shipped with this Jupyter notebook.

In [55]:
#This is code cell 3b) that loads a pre-trained model. 

model = tf.keras.models.load_model('saved_model/my_model')

print("model is loaded")

model is loaded


<H2>Part 3c: Model Summary</H2>
    
Let's take a look at the model:

In [56]:
model.summary()

Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_12 (Dense)             (None, 7)                 49        
_________________________________________________________________
dense_13 (Dense)             (None, 12)                96        
_________________________________________________________________
dense_14 (Dense)             (None, 12)                156       
_________________________________________________________________
dense_15 (Dense)             (None, 12)                156       
_________________________________________________________________
dropout_4 (Dropout)          (None, 12)                0         
_________________________________________________________________
dense_16 (Dense)             (None, 14)                182       
Total params: 639
Trainable params: 639
Non-trainable params: 0
________________________________________________________
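
The parameter counts in this summary can be checked by hand: a fully connected layer has one weight per input-unit pair plus one bias per unit. The short sketch below redoes that arithmetic (`dense_params` is a hypothetical helper). The 49 parameters of the first layer imply 6 inputs (6 × 7 weights + 7 biases), which is consistent with the six features listed in Part 1.

```python
def dense_params(n_inputs, n_units):
    """Parameters of a fully connected layer: weights (inputs x units) plus one bias per unit."""
    return n_inputs * n_units + n_units

# 6 input features, followed by the unit counts of the five Dense layers.
layer_sizes = [6, 7, 12, 12, 12, 14]
counts = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
print(counts)        # [49, 96, 156, 156, 182]
print(sum(counts))   # 639
```

The dropout layer contributes no parameters, so the total matches the 639 reported above.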

<H2> Part 4: Evaluation </H2>

This part calculates the confusion matrix and accuracy_score of the model.

In [57]:
# Reload the data, then rebuild the train/test split and the scaler
import numpy as np
import pandas as pd

dataset = pd.read_csv('training_data/horse_data_train_test.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

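from sklearn.model_selection import train_test_split
# NOTE: unlike the split in Part 1, this call does not pass shuffle=False, so the rows are
# shuffled and this test set is not the same hold-out set that was used during training.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)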


import tensorflow as tf
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Predicting the Test set results
y_pred = model.predict( X_test )  #y_pred = new_model.predict( sc.fit_transform(X_test) ) 
y_pred = np.argmax(y_pred, axis = 1)

In [58]:
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, (y_pred+1))
print("cm=", cm)
print("accuracy_score=", accuracy_score(y_test, (y_pred+1)))

cm= [[ 6  0  0  2  1  2  1  1  0  1  0  0  0  0]
 [ 2  6  0  3  1  2  0  0  0  0  0  0  0  0]
 [ 0  0  2  1  1  2  0  1  1  0  1  1  0  0]
 [ 3  1  0  3  1  2  0  0  1  1  0  0  1  0]
 [ 1  0  0  3  3  0  0  0  2  0  0  0  0  0]
 [ 0  0  1  2  2  3  0  0  1  0  0  0  0  0]
 [ 0  0  0  1  3  2  4  2  1  0  0  0  0  0]
 [ 0  0  1  4  0  2  1  4  2  0  0  0  0  0]
 [ 1  0  0  0  0  1  0  2  5  1  1  1  1  0]
 [ 0  1  0  2  4  0  0  1  0  5  2  0  1  0]
 [ 0  0  1  2  1  1  0  0  4  1  2  0  1  0]
 [ 0  1  1  3  1  0  0  0  2  0  3  2  1  0]
 [ 0  0  0  0  0  0  0  0  0  1  0  0 10  0]
 [ 0  0  0  0  1  0  1  0  0  0  0  0  0  9]]
accuracy_score= 0.367816091954023
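
The accuracy score is simply the sum of the diagonal of the confusion matrix divided by the total number of test rows. The sketch below recomputes it from the matrix printed above and also derives the per-position recall (the share of each true position that was predicted correctly). The variable names are illustrative only.

```python
# Confusion matrix copied from the output above (rows = true position, columns = predicted).
cm_rows = [
    [6, 0, 0, 2, 1, 2, 1, 1, 0, 1, 0, 0, 0, 0],
    [2, 6, 0, 3, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 2, 1, 1, 2, 0, 1, 1, 0, 1, 1, 0, 0],
    [3, 1, 0, 3, 1, 2, 0, 0, 1, 1, 0, 0, 1, 0],
    [1, 0, 0, 3, 3, 0, 0, 0, 2, 0, 0, 0, 0, 0],
    [0, 0, 1, 2, 2, 3, 0, 0, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 3, 2, 4, 2, 1, 0, 0, 0, 0, 0],
    [0, 0, 1, 4, 0, 2, 1, 4, 2, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 1, 0, 2, 5, 1, 1, 1, 1, 0],
    [0, 1, 0, 2, 4, 0, 0, 1, 0, 5, 2, 0, 1, 0],
    [0, 0, 1, 2, 1, 1, 0, 0, 4, 1, 2, 0, 1, 0],
    [0, 1, 1, 3, 1, 0, 0, 0, 2, 0, 3, 2, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 10, 0],
    [0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 9],
]

correct = sum(cm_rows[i][i] for i in range(len(cm_rows)))
total = sum(sum(row) for row in cm_rows)
accuracy = correct / total
recall = [cm_rows[i][i] / sum(cm_rows[i]) for i in range(len(cm_rows))]

print(correct, total)          # 64 174
print(round(accuracy, 4))      # 0.3678
print([round(r, 2) for r in recall])
```

For reference, guessing uniformly at random among 14 positions would be correct about 1/14 ≈ 7.1% of the time. The per-position recall shows that positions 13 and 14 are recognised far more reliably than the others.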


<H2> Part 5: [Optional] Visualization of the model by Tensorboard </H2>

***Note:*** Run this section only if you have run Part 3a.

By making use of the visual tools provided by TensorBoard, we can tune the hyperparameters of the model and visualize the results easily. If epoch_accuracy on the validation set increases steadily with the epochs, we can be reasonably confident that the model is on track.

If we are not satisfied with the performance shown in TensorBoard, we can go back to refine the data processing, the model, and the hyperparameters, then check TensorBoard again, iterating as needed.

In [49]:
print("Calling tensorboard...")

#import datetime, os
#%load_ext tensorboard
#logdir = os.path.join("logs", datetime.datetime.now().strftime("%Y%m%d-%H%M%S"))

%tensorboard --logdir logs --port=6008 #use a different port if tensorboard fails to load!


Calling tensorboard...


Reusing TensorBoard on port 6008 (pid 13820), started 6 days, 6:25:33 ago. (Use '!kill 13820' to kill it.)

<H2> Part 6: Prediction Checking </H2>

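In this section, we predict the results of a single race and see how accurate the model is by comparing the prediction with the real results. The cell below currently selects race 3 on 2022/01/16; the 2021/09/22 race, for which the real finishing positions are recorded, is left commented out.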
 

In [59]:
#This is the csv containing the data of a new race:

#CURRENT_RACE_DATA = 'new_data/horse_data_20210926_race8.csv'

#CURRENT_RACE_DATA = 'new_data/horse_data_20220101_race5.csv'
#CURRENT_RACE_DATA = 'new_data/horse_data_20220105_race3.csv'
#CURRENT_RACE_DATA = 'new_data/horse_data_20211205_race6.csv'
#CURRENT_RACE_DATA = 'new_data/horse_data_20220109_race4.csv'
CURRENT_RACE_DATA = 'new_data/horse_data_20220116_race3.csv'

#CURRENT_RACE_DATA = 'new_data/horse_data_20210922_race3.csv'
#REAL_FINISHING_POSITIONS = '[ 1  6  4  2 12 10  8  3  7  5 11  9 13 14]'

In [61]:
import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

PATH_TRAINING_DATA = 'training_data/horse_data_train_test.csv'

#training data
dataset = pd.read_csv(PATH_TRAINING_DATA)
X = dataset.iloc[:, :-1].values
Y = dataset.iloc[:, -1].values

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2, random_state = 0, shuffle=False)

sc = StandardScaler()
X_train = sc.fit_transform(X_train)

#live data
dataset = pd.read_csv(CURRENT_RACE_DATA)
X_live = dataset.iloc[:, :-1].values

#use "transform" for live data
X_live = sc.transform(X_live)

#loading pretrained model
model = tf.keras.models.load_model('saved_model/my_model')

# Predicting the class probabilities for the live race
y_pred = model.predict(X_live)  

# Predicting the finishing position
y_pred_finishing = np.argmax(y_pred, axis = 1) 
print("Expected finishing positions=", y_pred_finishing+1)

#print("The real finishing positions= " + REAL_FINISHING_POSITIONS)


Expected finishing positions= [ 4  2 12  2 10  5  9  1  5  5  9]
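
Because `argmax` is applied to each horse independently, several horses can receive the same predicted position, as in the output above (for example, three horses at position 5). One possible way to obtain a unique ordering, which is not what this notebook does, is to rank the horses by their expected finishing position under the predicted probabilities. The sketch below uses a small made-up probability table with 3 classes instead of 14, and `rank_by_expected_position` is a hypothetical helper.

```python
def rank_by_expected_position(probabilities):
    """Return a unique rank (1 = best) per horse, ordered by expected finishing position."""
    expected = [sum((k + 1) * p for k, p in enumerate(row)) for row in probabilities]
    order = sorted(range(len(expected)), key=lambda i: expected[i])
    ranks = [0] * len(expected)
    for rank, horse in enumerate(order, start=1):
        ranks[horse] = rank
    return ranks

# Made-up softmax outputs for three horses over three positions.
probs = [
    [0.5, 0.3, 0.2],   # argmax -> position 1
    [0.6, 0.3, 0.1],   # argmax -> position 1 as well
    [0.1, 0.2, 0.7],   # argmax -> position 3
]
print(rank_by_expected_position(probs))   # [2, 1, 3]
```

With the real model, `probs` would be the `y_pred` array returned by `model.predict(X_live)`.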


To make a prediction of your own, create a new csv file containing the data of the new race in the folder new_data, point CURRENT_RACE_DATA at it, and then run the prediction.
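
Note that the prediction cell drops the last column of the new file (`iloc[:, :-1]`), so the file presumably needs the same column layout as the training CSV, including a final finishing-position column that can hold a placeholder. The sketch below writes such a file with Python's `csv` module. The column names, the values and the file name are all hypothetical and should be replaced with the real header of `horse_data_train_test.csv`.

```python
import csv
import os
import tempfile

# Hypothetical header and values; mirror the real training CSV in practice.
HEADER = ["position", "load", "on_odds", "odds", "class", "num_horses", "finishing_position"]
entries = [
    [1, 133, 3.5, 2.8, 4, 2, 0],     # trailing 0 is a placeholder for the unknown result
    [2, 126, 12.0, 15.0, 4, 2, 0],
]

def write_race_csv(path, header, rows):
    """Write one race to a CSV file: a header line followed by one line per horse."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)

# Written to a temporary directory here; the notebook expects files under new_data/.
out_path = os.path.join(tempfile.mkdtemp(), "horse_data_example.csv")
write_race_csv(out_path, HEADER, entries)

with open(out_path, newline="") as f:
    print(f.read())
```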