# Predicting how points end in tennis

## Abstract

This is part of the code that I used in my solution to the CrowdAnalytix competition. It uses a neural network with three hidden layers to classify how a tennis point ends into one of three classes (Winner, Forced error, Unforced error). The solution described here is quite raw, and I think its accuracy could still be improved, for example through better feature engineering or model ensembling. My final model achieved an accuracy of around 90%; the version shown in this notebook reaches about 86% on a held-out 20% split.

## Motivation
Tennis, one of the most popular professional sports around the world, still uses manual coding of point outcomes. This is not only labor-intensive but also raises concerns that outcome categories may not be applied consistently from one coder to the next. The purpose of this contest is to find a better approach.

## Point Endings
Every tennis match is made up of a sequence of points. A point begins with a serve and players exchange shots until a player makes an error or is unable to return a shot in play. 

Traditionally, the shot ending a point in tennis has been described in one of three mutually exclusive ways: a winner, an unforced error, or a forced error. A winner is a shot that was in play, not touched by the opponent, and ends with the point going to the player who made the shot. The other two categories are distinct types of errors, both of which end with the point going to the player who did not make the shot. The distinction between an unforced and a forced error is based on the nature of the incoming shot and a judgment about whether that shot was playable or not. As you can imagine, this distinction is not a perfect science.

## Outcome Coding
Point endings give us insight into player performance. For this reason, accurate statistics about point outcomes are essential to the sport. At professional tennis tournaments, human coders are trained to label and document outcomes during matches. This is the primary way that the sport gathers information about winners and errors. 

## Tracking Data
The adoption of the player challenge system in the mid-2000s has led to the use of multi-camera tracking systems for the majority of top professional matches. These tracking systems monitor the 3D coordinates of the ball position and the 2D coordinates of the player position throughout a match. The richness of these data holds considerable promise for addressing many challenging questions in the sport.

## Objective

The objective of this contest is as follows:

* Predict how a point ends in tennis using modern tracking data.

In [1]:
import warnings; warnings.simplefilter('ignore')

from time import time
import numpy as np
import pandas as pd

from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score

import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.optimizers import SGD, Adam
from keras.callbacks import TensorBoard
from keras.wrappers.scikit_learn import KerasClassifier

Using TensorFlow backend.


## Data

In [57]:
# Train data.
df_mens = pd.read_csv('data/mens_train_file.csv', sep=',',header=0)
df_womens = pd.read_csv('data/womens_train_file.csv', sep=',',header=0)
frames = [df_mens, df_womens]
df = pd.concat(frames)

# Submission data.
df_mens_test = pd.read_csv('data/mens_test_file.csv', sep=',',header=0)
df_womens_test = pd.read_csv('data/womens_test_file.csv', sep=',',header=0)
frames = [df_mens_test, df_womens_test]
df_test = pd.concat(frames)
df_test['submission_id'] = df_test['id'].map(str) + '_' + df_test['gender'].map(str)
df_submission = pd.read_csv('data/AUS_SubmissionFormat.csv', sep=',',header=0)
df_test = pd.merge(df_submission, df_test, how='outer', on='submission_id')
df_test.drop(['submission_id', 'train_x', 'UE', 'FE', 'W'], axis=1, inplace=True)
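The submission-alignment step can be illustrated on toy data. This is only a sketch: the frames and values below are made up, and it uses `how='left'` rather than the `how='outer'` of the cell above, because a left merge is documented to preserve the key order of the left frame (here, the submission template).

```python
import pandas as pd

# Toy stand-ins for the two test files and the submission template (made-up values).
mens = pd.DataFrame({"id": [1, 2], "gender": ["mens", "mens"], "speed": [30.1, 28.4]})
womens = pd.DataFrame({"id": [1, 2], "gender": ["womens", "womens"], "speed": [25.7, 27.9]})
template = pd.DataFrame({"submission_id": ["2_mens", "1_womens", "1_mens", "2_womens"]})

# Same key construction as in the notebook: "<id>_<gender>".
test = pd.concat([mens, womens], ignore_index=True)
test["submission_id"] = test["id"].map(str) + "_" + test["gender"].map(str)

# Left merge: one output row per template row, in the template's order.
aligned = pd.merge(template, test, how="left", on="submission_id")
print(aligned[["submission_id", "speed"]])
```

Keeping the template order matters because the predictions are later written back row by row.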

In [58]:
print(df.head())

   rally  serve hitpoint      speed  net.clearance  distance.from.sideline  \
0      4      1        B  35.515042      -0.021725                3.474766   
1      4      2        B  33.382640       1.114202                2.540801   
2     23      1        B  22.316690      -0.254046                3.533166   
3      9      1        F  36.837309       0.766694                0.586885   
4      4      1        B  35.544208       0.116162                0.918725   

      depth  outside.sideline  outside.baseline  player.distance.travelled  \
0  6.797621             False             False                   1.467570   
1  2.608708             False              True                   2.311931   
2  9.435749             False             False                   3.903728   
3  3.342180              True             False                   0.583745   
4  5.499119             False             False                   2.333456   

    ...    opponent.depth  opponent.distance.from.center  same

In [59]:
X = df.iloc[:, 1:24].values
Y = df.iloc[:, 26].values
X_pred = df_test.iloc[:, 1:24].values
print(X)
print(Y)
print(X.shape)
print(Y.shape)

[[1 'B' 35.51504197 ..., 'F' 0.445317963 False]
 [2 'B' 33.38264003 ..., 'B' 0.43243397299999997 False]
 [1 'B' 22.3166902 ..., 'F' 0.397537762 True]
 ..., 
 [2 'F' 16.90628902 ..., 'B' 0.966185615 False]
 [2 'F' 15.19971253 ..., 'B' 0.887608207 False]
 [1 'F' 30.67953985 ..., 'B' 0.562388497 True]]
['UE' 'FE' 'FE' ..., 'W' 'W' 'UE']
(10000, 23)
(10000,)


### Pre-processing

In [61]:
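# Encoding categorical data. The encoder is refitted on each training column
# and then reused for the same submission column, so both share the same codes.
# Note: transform raises a ValueError if the submission data contains a
# category that never appears in the training data.
labelEncoder = LabelEncoder()
for col in [1,6,7,19,20,22]:
    X[:, col] = labelEncoder.fit_transform(X[:, col])
    X_pred[:, col] = labelEncoder.transform(X_pred[:, col])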

# Categorical representation: ['FE', 'UE', 'W']
Y = keras.utils.to_categorical(labelEncoder.fit_transform(Y), num_classes=3)

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, shuffle=True)

# Feature Scaling.
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
X_pred = sc.transform(X_pred)


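The target encoding used in the pre-processing cell can be reproduced with NumPy alone. This is a minimal sketch on a made-up label vector, not the notebook's own code: `np.unique` sorts the class names the same way `LabelEncoder` does, and indexing an identity matrix stands in for `keras.utils.to_categorical`.

```python
import numpy as np

# Made-up labels using the three outcome classes.
labels = np.array(["UE", "FE", "FE", "W", "W", "UE"])

# Sorted class names plus one integer code per label.
classes, codes = np.unique(labels, return_inverse=True)

# One-hot rows: row `code` of the identity matrix.
one_hot = np.eye(len(classes))[codes]

# Decoding a prediction: argmax over the class axis, then look up the name.
decoded = classes[one_hot.argmax(axis=1)]

print(classes)  # class order is alphabetical: FE, UE, W
print(one_hot)
```

The same lookup (`classes[probabilities.argmax(axis=1)]`) turns softmax outputs back into outcome names when preparing a submission.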

In [47]:
# Check shapes.
print("X_train: ", X_train.shape)
print("Y_train: ", Y_train.shape)
print("X_test: ", X_test.shape)
print("Y_test: ", Y_test.shape)


('X_train: ', (8000, 24))
('Y_train: ', (8000, 3))
('X_test: ', (2000, 24))
('Y_test: ', (2000, 3))
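The scaling step fits the scaler on the training split only and reuses its statistics for the held-out rows, so no information from the test rows leaks into training. A NumPy-only sketch of that idea, on random made-up data and assuming the population standard deviation (`ddof=0`) that `StandardScaler` uses:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(10, 3))  # made-up feature matrix
X_train, X_test = X[:8], X[8:]                    # 80/20 split, as in the notebook

# Statistics come from the training rows only.
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)  # population std (ddof=0)

X_train_scaled = (X_train - mean) / std
X_test_scaled = (X_test - mean) / std  # reuses the training mean and std

print(X_train_scaled.mean(axis=0))  # approximately zero per column
print(X_train_scaled.std(axis=0))   # approximately one per column
```

The held-out columns will generally not have exactly zero mean or unit variance after scaling, which is expected.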


### Model

In [38]:
def classifier():

    model = Sequential()

    model.add(Dense(64, activation='relu', input_dim=X_train.shape[1]))
    model.add(Dropout(0.3))
    model.add(Dense(64, activation='relu'))
    model.add(Dropout(0.3))
    model.add(Dense(64, activation='relu'))
    model.add(Dropout(0.3))
    model.add(Dense(3, activation='softmax'))

    model.compile(loss='categorical_crossentropy',
                optimizer='adam',
                metrics=['accuracy'])

    return model

![title](img/graph.png)
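To get a feel for the size of this network, the number of trainable parameters can be counted by hand: each `Dense` layer contributes one weight per input-output pair plus one bias per output, and `Dropout` adds none. The helper names below are hypothetical, and the input width of 24 is taken from the printed `X_train` shape.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


def count_params(input_dim, hidden=(64, 64, 64), n_classes=3):
    """Total trainable parameters of a stack of Dense layers."""
    sizes = [input_dim, *hidden, n_classes]
    return sum(dense_params(a, b) for a, b in zip(sizes[:-1], sizes[1:]))


total = count_params(24)
print(total)  # 1600 + 4160 + 4160 + 195 = 10115
```

With roughly ten thousand parameters and 8,000 training rows, the dropout layers are a reasonable guard against overfitting.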

### Train

In [39]:

model = classifier()

tensorboard = TensorBoard(log_dir="logs/{}".format(time()))

model.fit(X_train, Y_train,
          epochs=120,
          batch_size=25,
          callbacks=[tensorboard])


Epoch 1/120
...
Epoch 120/120


<keras.callbacks.History at 0x7f2c1e117b50>

![acc](img/acc.png)

### Evaluation

In [40]:
print('Testing:')
score = model.evaluate(X_test, Y_test)
# Print each metric name next to its value (works in Python 2 and 3).
for name, value in zip(model.metrics_names, score):
    print('{} :  {}'.format(name, value))


Testing:
loss :  0.398554241061 
acc :  0.864
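As a rough illustration of what these two numbers measure, the sketch below computes the mean categorical cross-entropy and the accuracy from one-hot targets and predicted class probabilities. The arrays are made up and the function names are hypothetical; this is not Keras's internal implementation.

```python
import numpy as np


def categorical_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean over samples of -sum(y_true * log(y_prob))."""
    y_prob = np.clip(y_prob, eps, 1.0)  # avoid log(0)
    return float(np.mean(-np.sum(y_true * np.log(y_prob), axis=1)))


def accuracy(y_true, y_prob):
    """Fraction of samples whose most probable class is the true class."""
    return float(np.mean(y_true.argmax(axis=1) == y_prob.argmax(axis=1)))


# Four made-up points; columns follow the class order FE, UE, W.
y_true = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]], dtype=float)
y_prob = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.8, 0.1],
                   [0.2, 0.3, 0.5],
                   [0.6, 0.3, 0.1]])

loss = categorical_crossentropy(y_true, y_prob)
acc = accuracy(y_true, y_prob)
print(loss, acc)  # acc is 0.75: the last point is misclassified
```

The loss penalizes confident wrong predictions more than the accuracy does, which is why the two metrics can move differently during training.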
