# Horse Racing Prediction by Deep Learning

This project uses a 3-layer neural network to predict the finishing position of each horse in a race. The training data was obtained from the Hong Kong Jockey Club (HKJC). The model is built and trained with TensorFlow, using supervised learning to classify each horse's expected finishing position.
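
Because the task is framed as classification, each finishing position becomes one class. The following is a minimal sketch, in plain Python, of the one-hot target encoding this implies. It assumes 14 classes (the value passed to `to_categorical` in the preprocessing cell), and the helper name `one_hot_position` is hypothetical.

```python
def one_hot_position(position, num_classes=14):
    """Return a one-hot list for a 1-based finishing position."""
    if not 1 <= position <= num_classes:
        raise ValueError(f"position must be in 1..{num_classes}, got {position}")
    vector = [0.0] * num_classes
    vector[position - 1] = 1.0  # shift to a 0-based class index
    return vector

winner = one_hot_position(1)   # first place maps to class index 0
last = one_hot_position(14)    # fourteenth place maps to class index 13
print(winner)
print(last)
```

The subtraction of 1 mirrors the `y_train-1` shift used in the notebook, since class indices start at 0 while finishing positions start at 1.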

<H2> Part 1: Data Input and Preprocessing </H2>

In this part, we import the data obtained from the HKJC. The following features were selected based on my past horse-picking experience:

- position: The starting position (draw) of the horse. A position of "1" is the closest to the inside rail, which should be beneficial on non-straight race courses.

- load: The weight carried by the horse, in pounds. The maximum is 133.

- ON odds: The overnight odds of the horse provided by the HKJC.

- odds: The odds of the horse 15 minutes before the race.

- class: The class of the race. It is common to all horses in a race, except in special races.

- num horses: The number of horses participating in the race.


In [1]:
# Load the data and split it into training and test sets.

import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split

print("tensorflow version = " + str(tf.__version__))

PATH_TRAINING_DATA = 'training_data/horse_data_train_test.csv'

dataset = pd.read_csv(PATH_TRAINING_DATA)
X = dataset.iloc[:, :-1].values  # all columns except the last are features
y = dataset.iloc[:, -1].values   # the last column is the label (finishing position)

# Split the dataset into the training set and test set.
# With shuffle=False the last 20% of rows form the test set, and random_state has no effect.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0, shuffle=False)


tensorflow version = 2.5.0


In [2]:
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train) #use "fit_transform" for training data
X_test = sc.transform(X_test)       #use "transform" for testing data

# Convert 1-based finishing positions into one-hot vectors over 14 classes
y_train = tf.keras.utils.to_categorical((y_train-1), 14)
y_test = tf.keras.utils.to_categorical((y_test-1), 14)

print(X_train)

[[-0.11692486  0.80348406 -0.43178501 -0.66488144 -0.61242914 -1.14322466]
 [-0.61373421 -0.62764251 -0.75689373 -0.87496493 -0.61242914 -1.14322466]
 [ 0.62828917 -1.01794976 -0.55682683 -0.69289257 -0.61242914 -1.14322466]
 ...
 [-1.60735291 -0.4975401  -0.27548275 -0.27272558 -0.61242914 -2.02474729]
 [ 0.13147982 -0.75774493 -0.55682683 -0.55283691 -0.61242914 -2.02474729]
 [ 0.62828917 -1.14805217 -0.58808728 -0.66488144 -0.61242914 -2.02474729]]
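
The scaling step above can be sketched without scikit-learn to show why the training and test sets are treated differently. This is an illustration only: the helper names `fit_scaler` and `transform` and the load values are hypothetical, and it assumes the population standard deviation, which is what `StandardScaler` uses by default.

```python
from statistics import mean, pstdev

def fit_scaler(column):
    """Learn the mean and population standard deviation of one feature."""
    return mean(column), pstdev(column)

def transform(column, mu, sigma):
    """Apply z-score scaling with statistics learned elsewhere."""
    return [(value - mu) / sigma for value in column]

# Hypothetical load values in pounds, for illustration only
train_load = [120.0, 126.0, 133.0, 117.0]
test_load = [124.0, 130.0]

mu, sigma = fit_scaler(train_load)               # statistics come from training rows only
train_scaled = transform(train_load, mu, sigma)
test_scaled = transform(test_load, mu, sigma)    # reuse them; do not refit on test rows

print(train_scaled)
print(test_scaled)
```

Fitting on the training rows alone keeps information about the test rows out of the preprocessing, which is the reason the notebook calls `fit_transform` on `X_train` and only `transform` on `X_test`.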


<H2> Part 2: Building the Model </H2>

We will construct a basic 3-layer model: one input layer, one output layer and one hidden layer.
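
To get a feel for the size of such a network, the number of trainable parameters can be worked out by hand: a fully connected layer with `n_inputs` inputs and `n_units` units has `(n_inputs + 1) * n_units` parameters, the `+ 1` being the bias. The sketch below assumes six input features, the 7-unit first layer used in `build_model`, one tuned hidden layer of 3 units (one configuration in the search space) and 14 outputs. The helper names are hypothetical.

```python
def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit for a fully connected layer."""
    return (n_inputs + 1) * n_units

def count_params(n_features, layer_sizes):
    """Sum trainable parameters across a stack of dense layers."""
    total, n_inputs = 0, n_features
    for n_units in layer_sizes:
        total += dense_params(n_inputs, n_units)
        n_inputs = n_units  # the next layer is fed by this layer's outputs
    return total

# 6 features -> Dense(7) -> Dense(3) -> Dense(14); dropout layers add no parameters
total = count_params(6, [7, 3, 14])
print(total)  # 49 + 24 + 56 = 129
```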

We will use Keras Tuner to find the best number of neurons in the hidden layer.
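
The idea behind a random search can be sketched in plain Python: draw a configuration at random from the search space, score it, and keep the best one seen. This is not Keras Tuner's actual implementation. The search space mirrors the `hp.Choice` and `hp.Float` calls in `build_model`, while `sample_config`, `random_search` and `toy_objective` are hypothetical names, and the objective is a stand-in for training a network and reading its validation accuracy.

```python
import random

# Search space mirroring the hyperparameters declared in build_model
SEARCH_SPACE = {
    "num_hidden_layers": [1, 2, 3],
    "num_units": list(range(3, 15)),   # 3 to 14 inclusive
    "dropout_rate": (0.1, 0.5),        # continuous range
}

def sample_config(rng):
    """Draw one random configuration from the search space."""
    low, high = SEARCH_SPACE["dropout_rate"]
    return {
        "num_hidden_layers": rng.choice(SEARCH_SPACE["num_hidden_layers"]),
        "num_units": rng.choice(SEARCH_SPACE["num_units"]),
        "dropout_rate": rng.uniform(low, high),
    }

def random_search(evaluate, max_trials, seed=0):
    """Keep the highest-scoring configuration over max_trials random draws."""
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(max_trials):
        config = sample_config(rng)
        score = evaluate(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score

def toy_objective(config):
    """Hypothetical stand-in score; the notebook uses val_accuracy instead."""
    return -abs(config["num_units"] - 8) - config["dropout_rate"]

best_config, best_score = random_search(toy_objective, max_trials=20)
print(best_config, best_score)
```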


In [3]:
from keras_tuner.tuners import RandomSearch

#create a new model
def build_model(hp):

    num_hidden_layers = hp.Choice('num_hidden_layers', values=[1, 2, 3])
    num_units = hp.Choice('num_units', values=[3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14])
    dropout_rate = hp.Float('dropout_rate', min_value=0.1, max_value=0.5)
    
    # Initializing the ANN
    model = tf.keras.models.Sequential()

    # Adding the first dense layer
    # Keras infers the input size (six features) from the data and adds the bias terms
    # automatically, so this is a 7-unit hidden layer rather than a separate input layer
    #model.add(tf.keras.layers.Dense(units=6, activation='sigmoid'))
    model.add(tf.keras.layers.Dense(units=7, activation='sigmoid'))
    #model.add(tf.keras.layers.Dense(units=7, activation='relu'))

    # Adding the hidden layers
    # Earlier idea: set the size of the hidden layer to the mean of the input and output layer sizes, i.e. 10
    #model.add(tf.keras.layers.Dense(units=hp.Int('units',
    #                                             min_value=3,
    #                                             max_value=25, 
    #                                             step=1), 
    #                                activation='sigmoid')) 
    
    for _ in range(0, num_hidden_layers):
        model.add(tf.keras.layers.Dense(num_units, activation='sigmoid'))
        model.add(tf.keras.layers.Dropout(dropout_rate))
    
    #model.add(tf.keras.layers.Dense(
    #  hp.Choice('units', [8, 16, 32]),
    #  activation='sigmoid'))

    # Adding a dropout layer
    model.add(tf.keras.layers.Dropout(0.2))

    # Adding the output layer
    # We have 14 outputs, so the output layer has a size of 14
    model.add(tf.keras.layers.Dense(units=14, activation='softmax'))

    # Compiling the model
    #opt = tf.keras.optimizers.Adam(learning_rate=0.03) 
    opt = tf.keras.optimizers.Adam(learning_rate=0.01)
    model.compile(loss = 'categorical_crossentropy', optimizer = opt, metrics = ['accuracy'])
    
    return model


In [4]:
tuner = RandomSearch(
    build_model,
    objective = 'val_accuracy',
    max_trials = 20 #100
)

INFO:tensorflow:Reloading Oracle from existing project .\untitled_project\oracle.json
INFO:tensorflow:Reloading Tuner from .\untitled_project\tuner0.json


<H2> Part 3a: Using Keras Tuner to find the best number of layers and neurons </H2>

In this section, we investigate the best number of layers and neurons for our model.
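
The tuner ranks trials by the objective, `val_accuracy`. The sketch below applies the same ranking by hand to the five complete trials listed in the results summary of this run (the truncated last entry is left out). It suggests that several quite different configurations tie for the top score, so the "best" model returned by the tuner is only one of several equally scoring candidates.

```python
# (num_hidden_layers, num_units, dropout_rate, val_accuracy), copied from the results summary
trials = [
    (1, 3, 0.17036843627853782, 0.2520325183868408),
    (3, 12, 0.10513604980689638, 0.2520325183868408),
    (1, 6, 0.35276679031830505, 0.2520325183868408),
    (2, 14, 0.3278931705437662, 0.24390244483947754),
    (1, 14, 0.2666663831817352, 0.24390244483947754),
]

# Sort by validation accuracy, highest first (the sort is stable, so ties keep their order)
ranked = sorted(trials, key=lambda trial: trial[3], reverse=True)
best_score = ranked[0][3]
tied = [trial for trial in ranked if trial[3] == best_score]

print("best val_accuracy:", best_score)
print("configurations sharing it:", len(tied))
```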


In [5]:
# This is code cell 3a), which runs the hyperparameter search by training each trial model from scratch.
# Please note that it takes hours to train!

print("rows in X_train = " + str(X_train.shape[0]) )
print("rows in y_train = " + str(y_train.shape[0]) )

print("Training model...")
tuner.search(X_train, y_train, epochs=5000, validation_data=(X_test, y_test))
best_model = tuner.get_best_models()[0]

tuner.results_summary()


Trial 1 Complete [00h 06m 29s]
val_accuracy: 0.1953125

Best val_accuracy So Far: 0.2520325183868408
Total elapsed time: 00h 06m 29s
INFO:tensorflow:Oracle triggered exit
Results summary
Results in .\untitled_project
Showing 10 best trials
Objective(name='val_accuracy', direction='max')
Trial summary
Hyperparameters:
num_hidden_layers: 1
num_units: 3
dropout_rate: 0.17036843627853782
Score: 0.2520325183868408
Trial summary
Hyperparameters:
num_hidden_layers: 3
num_units: 12
dropout_rate: 0.10513604980689638
Score: 0.2520325183868408
Trial summary
Hyperparameters:
num_hidden_layers: 1
num_units: 6
dropout_rate: 0.35276679031830505
Score: 0.2520325183868408
Trial summary
Hyperparameters:
num_hidden_layers: 2
num_units: 14
dropout_rate: 0.3278931705437662
Score: 0.24390244483947754
Trial summary
Hyperparameters:
num_hidden_layers: 1
num_units: 14
dropout_rate: 0.2666663831817352
Score: 0.24390244483947754
Trial summary
Hyperparameters:
num_hidden_layers: 1
num_units: 14
dropout_rate: 0.24

In [6]:
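# best_model is already a Keras model instance, so summary() is called on it directly.
# Calling best_model() invokes the model with no inputs and raises:
#   ValueError: The first argument to `Layer.call` must always be passed.
best_model.build(input_shape=(None, X_train.shape[1]))  # make sure layer shapes are known
best_model.summary()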