# Tech how-to: Build your own expected-goals model

This how-to walks you through building your own expected-goals model with popular data science and machine learning tools such as pandas, XGBoost, and scikit-learn. We cover the following steps:
1. Loading the data
2. Preparing the data
3. Constructing examples and datasets
4. Learning a model
5. Evaluating the model

As part of this how-to, we release an artificial but realistic shots dataset containing information on 127,643 shots. To represent the shots, we adopt the SPADL representation, which we introduce in more detail in the following paper:

**Actions Speak Louder Than Goals: Valuing Player Actions in Soccer**  
Tom Decroos, Lotte Bransen, Jan Van Haaren, and Jesse Davis  
[Read the full paper on arXiv](https://arxiv.org/abs/1802.07127)

In [None]:
# Install missing packages
!pip install pandas pyarrow xgboost scikit-learn scikit-plot scipy

In [4]:
%load_ext autoreload
%autoreload 2

# Import standard modules
import os
import sys

# Import Pandas library
import pandas as pd

# Import XGBoost classifier
from xgboost import XGBClassifier

# Import scikit-learn functions
from sklearn.metrics import average_precision_score
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Import scikit-plot functions
from scikitplot.metrics import plot_roc_curve
from scikitplot.metrics import plot_precision_recall_curve
from scikitplot.metrics import plot_calibration_curve

# Import SciPy function
from scipy.spatial import distance



## Download the dataset

In [3]:
!wget "https://github.com/JanVanHaaren/how-to-expected-goals/raw/master/shots.parquet" -O "shots.parquet"

--2019-01-25 11:14:04--  https://github.com/JanVanHaaren/how-to-expected-goals/raw/master/shots.parquet
Resolving github.com (github.com)... 140.82.118.3, 140.82.118.4
Connecting to github.com (github.com)|140.82.118.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/JanVanHaaren/how-to-expected-goals/master/shots.parquet [following]
--2019-01-25 11:14:05--  https://raw.githubusercontent.com/JanVanHaaren/how-to-expected-goals/master/shots.parquet
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.192.133, 151.101.128.133, 151.101.64.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.192.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7793234 (7,4M) [application/octet-stream]
Saving to: ‘shots.parquet’


2019-01-25 11:14:06 (14,7 MB/s) - ‘shots.parquet’ saved [7793234/7793234]



## Load the dataset

For the purpose of this how-to, we constructed an artificial but realistic shots dataset containing information on 127,643 shots. For each shot, the dataset contains the following information for the shot as well as the two actions immediately preceding the shot:
* `game_id`: a unique identifier of the game;
* `team_id`: a unique identifier of the team who performed the action;
* `player_id`: a unique identifier of the player who performed the action;
* `period`: 1 for the first half and 2 for the second half;
* `seconds`: the time elapsed in seconds since the start of the half;
* `type_id`: the identifier for the type of action;
* `type_name`: the name for the type of action;
* `body_part_id`: 0 for foot, 1 for head, 2 for other body part;
* `result`: the result of the action: 0 for failure, 1 for success;
* `start_x`: the x coordinate for the location where the action started, ranges from 0 to 105;
* `start_y`: the y coordinate for the location where the action started, ranges from 0 to 68;
* `end_x`: the x coordinate for the location where the action ended, ranges from 0 to 105;
* `end_y`: the y coordinate for the location where the action ended, ranges from 0 to 68.

The prefix `action` refers to the shot itself, whereas the prefixes `action1` and `action2` refer to the last and second-to-last actions before the shot, respectively.
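
To make the naming scheme concrete, the following sketch rebuilds the expected column names by combining the three prefixes with the thirteen fields listed above. The variable names `prefixes`, `fields`, and `expected_columns` are our own and not part of the dataset.

```python
# Prefixes: the shot itself, the last action, and the second-to-last action
prefixes = ['action', 'action1', 'action2']

# Fields described in the list above, in the same order
fields = [
    'game_id', 'team_id', 'player_id', 'period', 'seconds',
    'type_id', 'type_name', 'body_part_id', 'result',
    'start_x', 'start_y', 'end_x', 'end_y',
]

# Combine every prefix with every field, e.g. 'action1_start_x'
expected_columns = [
    '{}_{}'.format(prefix, field) for prefix in prefixes for field in fields
]

print(len(expected_columns))
print(expected_columns[:3])
```

Once the dataset is loaded, comparing `expected_columns` with `list(df_dataset.columns)` should be a quick sanity check.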

The mapping between the `type_id` and `type_name` values is as follows:
* 0: pass
* 1: cross
* 2: throw in
* 3: freekick crossed
* 4: freekick short
* 5: corner crossed
* 6: corner short
* 7: take on
* 8: foul
* 9: tackle
* 10: interception
* 11: shot
* 12: shot penalty
* 13: shot freekick
* 14: keeper save
* 18: clearance
* 21: dribble
* 22: goalkick
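
For convenience, the mapping above can be kept in a plain dictionary. This is a minimal sketch: the names `TYPE_NAMES` and `SHOT_TYPE_IDS` are our own, and the exact strings stored in the `type_name` column may differ from the labels below (for example, in their use of underscores).

```python
# Mapping from `type_id` to a readable label, copied from the list above
TYPE_NAMES = {
    0: 'pass',
    1: 'cross',
    2: 'throw in',
    3: 'freekick crossed',
    4: 'freekick short',
    5: 'corner crossed',
    6: 'corner short',
    7: 'take on',
    8: 'foul',
    9: 'tackle',
    10: 'interception',
    11: 'shot',
    12: 'shot penalty',
    13: 'shot freekick',
    14: 'keeper save',
    18: 'clearance',
    21: 'dribble',
    22: 'goalkick',
}

# Identifiers whose label starts with 'shot'
SHOT_TYPE_IDS = [type_id for type_id, name in TYPE_NAMES.items() if name.startswith('shot')]
print(SHOT_TYPE_IDS)
```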

In [5]:
df_dataset = pd.read_parquet('shots.parquet')

In [6]:
number_of_shots = len(df_dataset)

print('Our dataset contains {} shots.'.format(number_of_shots))

Our dataset contains 127643 shots.


In [9]:
pd.Series(df_dataset.columns)

0           action_game_id
1           action_team_id
2         action_player_id
3            action_period
4           action_seconds
5           action_type_id
6         action_type_name
7      action_body_part_id
8            action_result
9           action_start_x
10          action_start_y
11            action_end_x
12            action_end_y
13         action1_game_id
14         action1_team_id
15       action1_player_id
16          action1_period
17         action1_seconds
18         action1_type_id
19       action1_type_name
20    action1_body_part_id
21          action1_result
22         action1_start_x
23         action1_start_y
24           action1_end_x
25           action1_end_y
26         action2_game_id
27         action2_team_id
28       action2_player_id
29          action2_period
30         action2_seconds
31         action2_type_id
32       action2_type_name
33    action2_body_part_id
34          action2_result
35         action2_start_x
36         action2_start_y
3

## Normalize the location features

In order to help the learning algorithm, we rescale the location features from their original scales to a normalized scale ranging from 0 to 1. More specifically, we divide the x coordinates by 105 and the y coordinates by 68.

In [None]:
for action in ['action', 'action1', 'action2']:
    for side in ['start', 'end']:
        
        # Normalize the X location
        key_x = '{}_{}_x'.format(action, side)
        df_dataset[key_x] = df_dataset[key_x] / 105
               
        # Normalize the Y location
        key_y = '{}_{}_y'.format(action, side)
        df_dataset[key_y] = df_dataset[key_y] / 68

## Construct the examples

In order to predict the outcome of each shot, we need to transform our shots dataset into examples that we can feed into our machine learning algorithm. To this end, we perform the following three steps:

1. We compute the Euclidean distances between the start locations of each of the three actions and the center of the opposing goal. We add these three distances as features to our dataset as we expect them to help our machine learning algorithm learn a more accurate model.

2. We construct our dataset by selecting a subset of the available features.

3. We split the dataset into a train set for training the model and a hold-out test set for evaluating the model. This is an important step as we aim to learn a predictive model that generalizes well to unseen examples. By evaluating our model on a hold-out test set, we can investigate whether we are overfitting on the train data.

### Compute additional features
We compute the Euclidean distances between the start location of each of the three actions and the center of the opposing goal, which is located at coordinates (1, 0.5) in our normalized coordinate representation.

In [None]:
# Normalized location for the center of the opposing goal
goal = (1, 0.5)

In [None]:
# Compute distance to goal for each action's start location
for action in ['action', 'action1', 'action2']:
    key_start_x = '{action}_start_x'.format(action=action)
    key_start_y = '{action}_start_y'.format(action=action)
    key_start_distance = '{action}_start_distance'.format(action=action)

    df_dataset[key_start_distance] = df_dataset.apply(lambda s: distance.euclidean((s[key_start_x], s[key_start_y]), goal), axis=1)
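
The row-wise `apply` above calls a Python function once per shot, which can be slow on a dataset of this size. As an alternative, here is a minimal vectorized sketch that should yield the same distances. The helper `add_start_distances` and the tiny example frame are our own illustrations, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Normalized location for the center of the opposing goal (same value as above)
goal = (1, 0.5)

def add_start_distances(df, actions=('action', 'action1', 'action2')):
    """Return a copy of df with an '<action>_start_distance' column per action."""
    df = df.copy()
    for action in actions:
        # Column-wise offsets between the start location and the goal center
        dx = df['{}_start_x'.format(action)] - goal[0]
        dy = df['{}_start_y'.format(action)] - goal[1]
        # np.hypot computes sqrt(dx**2 + dy**2) for all rows at once
        df['{}_start_distance'.format(action)] = np.hypot(dx, dy)
    return df

# Tiny hypothetical example with normalized coordinates
df_example = pd.DataFrame({
    'action_start_x': [1.0, 0.9],
    'action_start_y': [0.5, 0.5],
})
df_example = add_start_distances(df_example, actions=['action'])
print(df_example['action_start_distance'].tolist())
```

Note that the x and y coordinates were divided by different constants, so this distance is not proportional to a distance in meters. Multiplying `dx` by 105 and `dy` by 68 before calling `np.hypot` would give meters instead.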

In [None]:
# Determine body part used for each action
for action in ['action', 'action1', 'action2']:
    key_body_part_id = '{action}_body_part_id'.format(action=action)
    
    key_is_foot = '{action}_is_foot'.format(action=action)
    key_is_head = '{action}_is_head'.format(action=action)
    key_is_other = '{action}_is_other'.format(action=action)

    df_dataset[key_is_foot] = df_dataset[key_body_part_id] == 0
    df_dataset[key_is_head] = df_dataset[key_body_part_id] == 1
    df_dataset[key_is_other] = df_dataset[key_body_part_id] == 2

In [None]:
df_dataset.head(3).T

### Construct the dataset
We construct our dataset by selecting a subset of the available features. In this how-to, we use a limited number of features: the location of the shot (`action_start_x` and `action_start_y`), the body part used by the shot taker (`action_is_foot` and `action_is_head`), and the distances between the start locations of the three actions and the center of the opposing goal (`action_start_distance`, `action1_start_distance`, and `action2_start_distance`).

We encourage you to try other features as well and to investigate what effect they have on the performance of your expected-goals model. For example, you could try to include the angle between the shot location and the center of the goal or the angle between the shot location and the goal posts as a feature too. 
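
As a starting point for the angle features suggested above, here is a minimal sketch. It assumes a 105 m by 68 m pitch (matching the coordinate ranges in this how-to) and a goal width of 7.32 m, and it assumes that the coordinates have already been normalized. The function name `shot_angles` and the constants are our own.

```python
import numpy as np

# Assumed pitch and goal dimensions in meters
PITCH_LENGTH = 105.0
PITCH_WIDTH = 68.0
GOAL_WIDTH = 7.32

def shot_angles(x_norm, y_norm):
    """Return (angle to the goal center, angle subtended by the goal posts) in radians."""
    # Convert normalized coordinates back to meters
    x = np.asarray(x_norm, dtype=float) * PITCH_LENGTH
    y = np.asarray(y_norm, dtype=float) * PITCH_WIDTH

    # Offsets from the shot location to the goal line and to the goal center
    dx = PITCH_LENGTH - x
    dy = PITCH_WIDTH / 2 - y

    # Angle between the line to the goal center and the line perpendicular to the goal line
    angle_center = np.arctan2(np.abs(dy), dx)

    # Angle between the lines from the shot location to the two goal posts
    angle_post_a = np.arctan2(dy + GOAL_WIDTH / 2, dx)
    angle_post_b = np.arctan2(dy - GOAL_WIDTH / 2, dx)
    angle_posts = np.abs(angle_post_a - angle_post_b)

    return angle_center, angle_posts

# Example: a central shot taken 11 m from the goal line
angle_center, angle_posts = shot_angles(94 / 105, 0.5)
print(angle_center, angle_posts)
```

To use these as features, you could store the two returned arrays in new columns (for example, hypothetical `action_angle_center` and `action_angle_posts` columns) and add those column names to the feature list.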

In [None]:
# Features
columns_features = [
    'action_start_x',
    'action_start_y',
    'action_is_foot',
    'action_is_head',
    'action_start_distance',
    'action1_start_distance',
    'action2_start_distance'
]

# Label: 1 if a goal, 0 otherwise
column_target = 'action_result'

In [None]:
X = df_dataset[columns_features]
y = df_dataset[column_target]

### Split the dataset into a train set and a test set
We train our expected-goals model on 90% of the data and evaluate the model on the remaining 10% of the data.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10)
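
The split above is random, so the results differ slightly from run to run. If you want a repeatable split that also preserves the share of goals in both sets, `train_test_split` accepts the `random_state` and `stratify` parameters. The sketch below uses synthetic stand-in data (`X_demo`, `y_demo`) rather than the shots dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1,000 examples with roughly 10% positives
rng = np.random.default_rng(42)
X_demo = rng.random((1000, 2))
y_demo = (rng.random(1000) < 0.10).astype(int)

# stratify keeps the share of positives (nearly) equal in both sets;
# random_state makes the split repeatable
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.10, stratify=y_demo, random_state=42
)

print(len(X_tr), len(X_te))
print(y_tr.mean(), y_te.mean())
```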

## Learn the model
We learn our expected-goals model using the XGBoost algorithm, which is popular in machine learning competitions such as those hosted on Kaggle. The algorithm is particularly appealing as it requires minimal parameter tuning to provide decent performance on many standard machine learning tasks.

[Visit the XGBoost website for more information](http://xgboost.readthedocs.io/en/latest/model.html)

We train an XGBoost classifier on our train set. We train 100 trees (`n_estimators=100`) and set their maximum depth to 4 (`max_depth=4`).

In [None]:
classifier = XGBClassifier(objective='binary:logistic', max_depth=4, n_estimators=100)
classifier.fit(X_train, y_train)

In [None]:
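# XGBClassifier has no `best_params_` (that attribute belongs to a fitted GridSearchCV), so we inspect get_params()
classifier.get_params()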

## Evaluate the model
We evaluate the accuracy of our expected-goals model by making predictions for the shots in our test set.

### Predict the test examples

In [None]:
# For each shot, predict the probability of the shot resulting in a goal
y_pred = classifier.predict_proba(X_test)
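
For a binary problem, `predict_proba` returns one row per example and one column per class, which is why the later cells select `y_pred[:, 1]` as the probability of a goal. The sketch below illustrates this layout with a scikit-learn `LogisticRegression` as a lightweight stand-in for the XGBoost classifier, trained on synthetic data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: the label is more likely to be 1 for larger feature values
rng = np.random.default_rng(0)
X_demo = rng.random((200, 1))
y_demo = (rng.random(200) < X_demo[:, 0]).astype(int)

model = LogisticRegression().fit(X_demo, y_demo)
proba = model.predict_proba(X_demo)

# Columns follow the order of model.classes_
print(proba.shape)
print(model.classes_)

# Probability of the positive class (label 1) for each example
p_positive = proba[:, 1]
```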

### Compute area under the curve: receiver operating characteristic (AUC-ROC)
To measure the accuracy of our expected-goals model, we compute the AUC-ROC obtained on the test set. The values for the AUC-ROC metric range from 0 to 1. The higher the AUC-ROC value is, the better the classifier is, where an AUC-ROC value of 0.50 corresponds to random guessing. That is, if we randomly predicted whether a shot results in a goal or not, we would obtain an AUC-ROC of 0.50.
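
One way to read the AUC-ROC is as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative example. The following sketch checks this on a four-example toy problem (the labels and scores are made up); both the pairwise computation and `roc_auc_score` give 0.75.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels and predicted scores
y_demo = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

positives = scores[y_demo == 1]
negatives = scores[y_demo == 0]

# Count (positive, negative) pairs where the positive is ranked higher; ties count half
wins = (positives[:, None] > negatives[None, :]).sum()
ties = (positives[:, None] == negatives[None, :]).sum()
auc_pairwise = (wins + 0.5 * ties) / (len(positives) * len(negatives))

auc_sklearn = roc_auc_score(y_demo, scores)
print(auc_pairwise, auc_sklearn)
```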

In [None]:
y_total = y_train.count()
y_positive = y_train.sum()

print('The training set contains {} examples of which {} are positives.'.format(y_total, y_positive))

In [None]:
auc_roc = roc_auc_score(y_test, y_pred[:, 1])

print('Our classifier obtains an AUC-ROC of {}.'.format(auc_roc))

### Compute area under the curve: precision-recall (AUC-PR)
Since the AUC-ROC metric can give an overly optimistic picture when the classes are imbalanced (i.e., the number of positive examples is much lower or higher than the number of negative examples), we also compute the AUC-PR obtained on the test set. The values for the AUC-PR metric range from 0 to 1 too. The higher the AUC-PR value is, the better the classifier is. Unlike AUC-ROC, however, the value for random guessing does not necessarily correspond to 0.50 for imbalanced classes, but corresponds to the ratio of positive examples in the evaluated set. Because we split the data at random, we approximate this ratio by the ratio of positive examples in the train set.
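
To illustrate the baseline, the sketch below scores synthetic, imbalanced labels with purely random predictions. Under these assumptions (about 10% positives, uniformly random scores), the average precision should land close to the ratio of positive examples rather than close to 0.50.

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Synthetic labels with roughly 10% positives and uninformative random scores
rng = np.random.default_rng(0)
y_demo = (rng.random(20000) < 0.10).astype(int)
random_scores = rng.random(20000)

positive_ratio = y_demo.mean()
ap_random = average_precision_score(y_demo, random_scores)

print(positive_ratio, ap_random)
```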

In [None]:
auc_pr_baseline = y_positive / y_total

print('The baseline performance for AUC-PR is {}.'.format(auc_pr_baseline))

In [None]:
auc_pr = average_precision_score(y_test, y_pred[:, 1])

print('Our classifier obtains an AUC-PR of {}.'.format(auc_pr))

### Plot ROC curve

In [None]:
plot_roc_curve(y_test, y_pred, curves='each_class')

### Plot precision-recall curve

In [None]:
plot_precision_recall_curve(y_test, y_pred, curves='each_class')

### Plot calibration curve
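We plot a calibration curve to investigate how well our expected-goals model is calibrated. The predictions are grouped into bins, and the plot shows the mean predicted probability per bin on the horizontal axis and the fraction of positive examples (i.e., goals) in that bin on the vertical axis. For a perfectly calibrated model, the curve follows the diagonal.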
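
To show what goes into such a curve, here is a minimal sketch that computes the points by hand with equal-width bins. The helper `calibration_table` is our own, and the binning details in scikit-plot may differ. The demo data is synthetic and calibrated by construction, so the two values in each row should be close.

```python
import numpy as np

def calibration_table(y_true, y_prob, n_bins=10):
    """Return (mean predicted probability, fraction of positives, count) per non-empty bin."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)

    # Interior edges of n_bins equal-width bins on [0, 1]
    interior_edges = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
    # Bin index in 0..n_bins-1 for every prediction
    bin_ids = np.digitize(y_prob, interior_edges)

    rows = []
    for bin_id in range(n_bins):
        in_bin = bin_ids == bin_id
        if in_bin.any():
            rows.append((y_prob[in_bin].mean(), y_true[in_bin].mean(), int(in_bin.sum())))
    return rows

# Synthetic predictions: each label is drawn with its own predicted probability
rng = np.random.default_rng(0)
p_demo = rng.random(50000)
y_demo = rng.random(50000) < p_demo

table = calibration_table(y_demo, p_demo)
for mean_predicted, fraction_positive, count in table:
    print(round(mean_predicted, 3), round(fraction_positive, 3), count)
```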

In [None]:
plot_calibration_curve(y_test, [y_pred])

## Optional: Perform grid search to find optimal parameters

In [None]:
from sklearn.model_selection import GridSearchCV

parameters = {
    'objective': ['binary:logistic'],
    'max_depth': [4, 5, 6, 7, 8],
    'n_estimators': [100, 250, 500, 1000, 1500, 2000]
}

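# Evaluate all 30 parameter combinations with cross-validation; this can take a long time
classifier = XGBClassifier()
classifier = GridSearchCV(classifier, parameters, scoring='roc_auc', verbose=2)
classifier.fit(X_train, y_train)

# Best parameter combination found by the grid search
classifier.best_params_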
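
After fitting, a `GridSearchCV` object exposes the best parameter combination (`best_params_`), its cross-validated score (`best_score_`), and the full results (`cv_results_`). The sketch below shows how to inspect them. To keep it quick to run, it uses a small decision tree and synthetic data as stand-ins for the XGBoost classifier and the shots dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)

# Small grid over the tree depth, scored by AUC-ROC with 3-fold cross-validation
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {'max_depth': [2, 4]},
    scoring='roc_auc',
    cv=3,
)
search.fit(X_demo, y_demo)

# Best parameter combination and its mean cross-validated score
print(search.best_params_)
print(search.best_score_)

# One row per parameter combination, best first
results = pd.DataFrame(search.cv_results_)[
    ['param_max_depth', 'mean_test_score', 'rank_test_score']
]
print(results.sort_values('rank_test_score'))
```

By default, the fitted search object is refit on the full train set with the best parameters, so it can be used for `predict_proba` in the same way as a plain classifier.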