# Build your own expected-assists model

This notebook guides you through the process of building your own expected-assists model using popular data science and machine learning tools like Pandas, XGBoost, CatBoost and scikit-learn. In this guide, we discuss the following steps:
1. Loading the data
2. Preparing the data
3. Constructing examples and datasets
4. Learning a model
5. Evaluating the model

In this notebook, we adopt the SPADL representation, which we introduce in more detail in the following paper:

**Actions Speak Louder Than Goals: Valuing Player Actions in Soccer**  
Tom Decroos, Lotte Bransen, Jan Van Haaren, and Jesse Davis  
[Read the full paper on arXiv](https://arxiv.org/abs/1802.07127)

In [0]:
# Install Python packages
!pip install matplotsoccer pandas pyarrow xgboost scikit-learn scikit-plot scipy seaborn numpy==1.15.4 catboost==0.14

In [0]:
%load_ext autoreload
%autoreload 2

# Import standard modules
import os
import sys

# Import Pandas library
import pandas as pd

# Import XGBoost classifier
from xgboost import XGBClassifier

# Import CatBoost classifier
from catboost import CatBoostClassifier
from catboost import FeaturesData
from catboost import Pool

# Import scikit-learn functions
from sklearn.metrics import average_precision_score
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Import scikit-plot functions
from scikitplot.metrics import plot_roc
from scikitplot.metrics import plot_precision_recall
from scikitplot.metrics import plot_calibration_curve

# Import SciPy function
from scipy.spatial import distance

# Import plotting libraries
import matplotsoccer
import seaborn

# Import Google Colab function
from google.colab import drive

## Load the data

For each action, the dataset contains the following information:
* `action_id`: a unique identifier of the action;
* `game_id`: a unique identifier of the game;
* `period_id`: 1 for the first half and 2 for the second half;
* `time_seconds`: the time elapsed in seconds since the start of the half;
* `team_id`: a unique identifier of the team who performed the action;
* `player_id`: a unique identifier of the player who performed the action;
* `start_x`: the x coordinate for the location where the action started, ranges from 0 to 105;
* `start_y`: the y coordinate for the location where the action started, ranges from 0 to 68;
* `end_x`: the x coordinate for the location where the action ended, ranges from 0 to 105;
* `end_y`: the y coordinate for the location where the action ended, ranges from 0 to 68;
* `body_part_id`: 0 for foot, 1 for head, 2 for other body part;
* `type_name`: the name for the type of action;
* `type_id`: the identifier for the type of action;
* `result`: the result of the action: 0 for failure, 1 for success, 2 for special cases.

The mapping between the `type_id` and `type_name` values is as follows:
* 0: pass
* 1: cross
* 2: throw in
* 3: freekick crossed
* 4: freekick short
* 5: corner crossed
* 6: corner short
* 7: take on
* 8: foul
* 9: tackle
* 10: interception
* 11: shot
* 12: shot penalty
* 13: shot freekick
* 14: keeper save
* 18: clearance
* 21: dribble
* 22: goalkick
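
For convenience, the mapping above can also be written as a Python dictionary, for instance to look up a `type_name` from a `type_id`. This is only a sketch: the names `TYPE_NAMES` and `type_name_for` are illustrative and not part of the dataset or any library, and the exact strings stored in the `type_name` column may differ slightly (for example in spacing), so check them against your own data.

```python
# Mapping from type_id to type_name, copied from the list above
TYPE_NAMES = {
    0: 'pass', 1: 'cross', 2: 'throw in', 3: 'freekick crossed',
    4: 'freekick short', 5: 'corner crossed', 6: 'corner short',
    7: 'take on', 8: 'foul', 9: 'tackle', 10: 'interception',
    11: 'shot', 12: 'shot penalty', 13: 'shot freekick',
    14: 'keeper save', 18: 'clearance', 21: 'dribble', 22: 'goalkick',
}


def type_name_for(type_id):
    """Return the action type name, or 'unknown' for ids not listed above."""
    return TYPE_NAMES.get(type_id, 'unknown')
```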

In [0]:
# Access your Google Drive account
mount_point = '/content/gdrive'
drive.mount(mount_point)

In [0]:
drive_folder = 'My Drive'
directory = 'Wyscout'
path = os.path.join(mount_point, drive_folder, directory)

In [0]:
# Specify the relevant seasons
# season_ids = [10992, 181334, 185611]  # Eredivisie 2016/2017 - 2018/2019
# season_ids = [10883, 181150, 185618]  # Premier League 2016/2017 - 2018/2019
season_ids = [10883, 10992, 181150, 181334, 185611, 185618]  # Eredivisie 2016/2017 - 2018/2019 + Premier League 2016/2017 - 2018/2019

In [0]:
# Load the data from Google Drive
matches = []
players = []
teams = []
actions = []

for season_id in season_ids:

  # Matches
  df_matches = pd.read_hdf(os.path.join(path, f'season-{season_id}', 'matches.h5'), key='matches')
  matches.append(df_matches)

  # Players
  df_players = pd.read_hdf(os.path.join(path, f'season-{season_id}', 'players.h5'), key='players')
  players.append(df_players)

  # Teams
  df_teams = pd.read_hdf(os.path.join(path, f'season-{season_id}', 'teams.h5'), key='teams')
  teams.append(df_teams)

  # Actions
  store_actions = pd.HDFStore(os.path.join(path, f'season-{season_id}', 'actions.h5'))
  for match in store_actions.keys():
    actions.append(store_actions.get(match))
  store_actions.close()

### Load the matches

In [0]:
df_matches = pd.concat(matches).drop_duplicates(subset='match_id', keep='last').sort_values('match_id', ascending=True).reset_index(drop=True)

In [0]:
df_matches

### Load the players

In [0]:
df_players = pd.concat(players).drop_duplicates(subset='player_id', keep='last').sort_values('player_id', ascending=True).reset_index(drop=True)

In [0]:
df_players

In [0]:
mapping_players = pd.Series(data=df_players['short_name'].values, index=df_players['player_id'].values)

### Load the teams

In [0]:
df_teams = pd.concat(teams).drop_duplicates(subset='team_id', keep='last').sort_values('team_id', ascending=True).reset_index(drop=True)

In [0]:
df_teams

In [0]:
mapping_teams = pd.Series(data=df_teams['short_name'].values, index=df_teams['team_id'].values)

### Load the actions

In [0]:
df_actions = pd.concat(actions).sort_values(['game_id', 'action_id'], ascending=True).reset_index(drop=True)

In [0]:
df_actions

In [0]:
df_actions['player_name'] = df_actions['player_id'].map(mapping_players)

In [0]:
df_actions['team_name'] = df_actions['team_id'].map(mapping_teams)

In [0]:
df_actions

In [0]:
df_actions.columns

In [0]:
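# Mark successful passes that are directly followed by a shot from the same team in the same game
df_actions_next = df_actions.shift(-1)
df_actions['assist'] = ((df_actions['type_name'] == 'pass') & (df_actions_next['type_name'] == 'shot') &
                        (df_actions_next['team_id'] == df_actions['team_id']) &
                        (df_actions_next['game_id'] == df_actions['game_id']) &  # do not pair actions across games
                        (df_actions['result'] == 1)).fillna(False).astype(int)
df_actions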
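
The labelling rule can be checked on a small example. The sketch below uses a made-up DataFrame `toy` (all values are hypothetical) and includes a same-game check, which matters because `shift(-1)` also pairs the last action of one game with the first action of the next game.

```python
import pandas as pd

# Made-up action sequence spanning two games
toy = pd.DataFrame({
    'game_id':   [1, 1, 1, 1, 1, 2, 2],
    'team_id':   [10, 10, 10, 20, 10, 10, 10],
    'type_name': ['pass', 'shot', 'pass', 'shot', 'pass', 'shot', 'pass'],
    'result':    [1, 0, 1, 0, 1, 1, 1],
})

# The next action for each row; the last row has no next action (NaN)
nxt = toy.shift(-1)

toy['assist'] = (
    (toy['type_name'] == 'pass')
    & (toy['result'] == 1)
    & (nxt['type_name'] == 'shot')
    & (nxt['team_id'] == toy['team_id'])
    & (nxt['game_id'] == toy['game_id'])  # ignore pairs that cross a game boundary
).astype(int)
```

Only the first pass is labelled as an assist: the second pass is followed by an opponent's shot, the third is followed by a shot in a different game, and the last has no next action.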

In [0]:
number_of_passes = len(df_actions[df_actions['type_name'] == 'pass'])

print(f'Our dataset contains {number_of_passes} passes.')

In [0]:
number_of_assists = len(df_actions[df_actions['assist'] == 1])

print(f'Our dataset contains {number_of_assists} assists.')

In [0]:
# Select all passes, excluding those where the next action is performed by the same player
df_passes = df_actions[(df_actions['type_name'] == 'pass') & (df_actions['player_id'] != df_actions_next['player_id'])].copy()

## Normalize the location features

In order to help the learning algorithm, we rescale the location features from their original scales to a normalized scale ranging from 0 to 1. More specifically, we divide the x coordinates by 105 and the y coordinates by 68.

In [0]:
df_passes[['start_x', 'start_y', 'end_x', 'end_y']]

In [0]:
for side in ['start', 'end']:
    
  # Normalize the X location
  df_passes[f'{side}_x'] = df_passes[f'{side}_x'] / 105
          
  # Normalize the Y location
  df_passes[f'{side}_y'] = df_passes[f'{side}_y'] / 68

In [0]:
df_passes[['start_x', 'start_y', 'end_x', 'end_y']]
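
Because the plots later in this notebook need the original pitch coordinates again, it can be handy to wrap the scaling in a reusable function. The following is a minimal sketch; the helper `scale_locations` and the constants `PITCH_LENGTH` and `PITCH_WIDTH` are illustrative names that do not appear elsewhere in this notebook.

```python
import pandas as pd

PITCH_LENGTH = 105  # range of the x coordinates
PITCH_WIDTH = 68    # range of the y coordinates


def scale_locations(df, to_unit=True):
    """Return a copy with start/end coordinates scaled to [0, 1] or back to pitch units."""
    df = df.copy()
    for side in ['start', 'end']:
        if to_unit:
            df[f'{side}_x'] = df[f'{side}_x'] / PITCH_LENGTH
            df[f'{side}_y'] = df[f'{side}_y'] / PITCH_WIDTH
        else:
            df[f'{side}_x'] = df[f'{side}_x'] * PITCH_LENGTH
            df[f'{side}_y'] = df[f'{side}_y'] * PITCH_WIDTH
    return df


# Made-up pass from the centre spot to a corner of the pitch
example = pd.DataFrame({'start_x': [52.5], 'start_y': [34.0], 'end_x': [105.0], 'end_y': [0.0]})
scaled = scale_locations(example)
restored = scale_locations(scaled, to_unit=False)
```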

## Construct the examples

In order to predict the outcome of each pass, we need to transform our passes database into a dataset that we can feed into our machine learning algorithm. To this end, we perform the following two steps:

1. We construct our dataset by selecting a subset of the available features.

2. We split the dataset into a train set for training the model and a hold-out test set for evaluating the model. This is an important step as we aim to learn a predictive model that generalizes well to unseen examples. By evaluating our model on a hold-out test set, we can investigate whether we are overfitting on the train data.

### Compute additional features

In [0]:
# Determine body part used for each pass
df_passes['is_foot'] = df_passes['body_part_id'] == 0
df_passes['is_head'] = df_passes['body_part_id'] == 1
df_passes['is_body'] = df_passes['body_part_id'] == 2

In [0]:
df_passes.head(10).T

### Construct the dataset
We construct our dataset by selecting a subset of the available features. In this notebook, we use a limited number of features: the start location of the pass (`start_x` and `start_y`) and the body part used by the passer (`is_foot`, `is_head`, and `is_body`, which are derived from `body_part_id`).

We encourage you to try other features as well and to investigate what effect they have on the performance of your expected-assists model. For example, you could include the end location of the pass (`end_x` and `end_y`), the distance between the pass location and the center of the opposing goal, or the angle between the pass location and the goal posts.
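
As a starting point for such experiments, the sketch below computes a distance and an angle to the goal for a given location. It rests on several assumptions that you should verify against your data: coordinates are in the original 0 to 105 by 0 to 68 scale (so apply it before normalizing, or rescale first), the team in possession attacks towards `x = 105`, and the goal centre lies at `(105, 34)`. The function `add_goal_features` and the resulting column names are hypothetical.

```python
import numpy as np
import pandas as pd

GOAL_X, GOAL_Y = 105.0, 34.0  # assumed centre of the opposing goal


def add_goal_features(df, prefix='end'):
    """Return a copy with '<prefix>_distance' and '<prefix>_angle' columns added."""
    df = df.copy()
    dx = GOAL_X - df[f'{prefix}_x']
    dy = GOAL_Y - df[f'{prefix}_y']
    # Euclidean distance to the goal centre
    df[f'{prefix}_distance'] = np.sqrt(dx ** 2 + dy ** 2)
    # Angle (in radians) between the line to the goal centre and the long axis of the pitch
    df[f'{prefix}_angle'] = np.arctan2(np.abs(dy), dx)
    return df


# Two made-up pass end locations: one straight in front of goal, one on the goal line
toy_passes = pd.DataFrame({'end_x': [93.0, 105.0], 'end_y': [34.0, 24.0]})
toy_features = add_goal_features(toy_passes, prefix='end')
```

If you add such columns, remember to append their names to `columns_features` so that the model actually uses them.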

In [0]:
# Features
columns_features = [
    'start_x',
    'start_y',
    'is_foot',
    'is_head',
    'is_body'
]

# Label: 1 if an assist, 0 otherwise
column_target = 'assist'

### Split the dataset into a train set and a test set
We train our expected-assists model on 90% of the data and evaluate the model on the remaining 10% of the data.

In [0]:
X = df_passes[columns_features]
y = df_passes[column_target]

In [0]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10)

In [0]:
len(X_train), len(X_test), len(y_train), len(y_test)
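
Because assists are rare, a purely random split can leave the train and test sets with slightly different fractions of positive examples, and the split changes on every run. One option, sketched below on synthetic data, is to pass `stratify` and `random_state` to `train_test_split`. The arrays `X_toy` and `y_toy` are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic dataset: 1000 examples with 5% positives
rng = np.random.RandomState(0)
X_toy = rng.rand(1000, 2)
y_toy = np.array([1] * 50 + [0] * 950)

# stratify keeps the positive rate equal in both parts; random_state makes the split repeatable
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.10, stratify=y_toy, random_state=42
)
```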

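Alternatively, we train the model on a subset of the seasons and evaluate it on the remaining ones. The cells below train on two Eredivisie seasons and test on the third, and they overwrite the random split created above.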

In [0]:
# season_ids = [10992, 181334, 185611]  # Eredivisie 2016/2017 - 2018/2019
# season_ids = [10883, 181150, 185618]  # Premier League 2016/2017 - 2018/2019

seasons_train = [10992, 181334]
seasons_test = [185611]

In [0]:
matches_train = df_matches[df_matches['season_id'].isin(seasons_train)]['match_id'].unique()
matches_test = df_matches[df_matches['season_id'].isin(seasons_test)]['match_id'].unique()

In [0]:
X_train = df_passes[df_passes['game_id'].isin(matches_train)][columns_features]
X_test = df_passes[df_passes['game_id'].isin(matches_test)][columns_features]
y_train = df_passes[df_passes['game_id'].isin(matches_train)][column_target]
y_test = df_passes[df_passes['game_id'].isin(matches_test)][column_target]

In [0]:
len(X_train), len(X_test), len(y_train), len(y_test)

## Learn the model with XGBoost
We learn our expected-assists model using the XGBoost algorithm, which is a popular algorithm in machine learning competitions like Kaggle. The algorithm is particularly appealing as it requires minimal parameter tuning to provide decent performance on many standard machine learning tasks.

[Visit the XGBoost website for more information](http://xgboost.readthedocs.io/en/latest/model.html)

We train an XGBoost classifier on our train set. We train 100 trees and set their maximum depth to 4.

In [0]:
classifier = XGBClassifier(objective='binary:logistic', max_depth=4, n_estimators=100)
classifier.fit(X_train, y_train)

## Learn the model with CatBoost
We learn our expected-assists model using the CatBoost algorithm, which is a machine learning algorithm that is quickly gaining popularity. Like XGBoost, the algorithm is particularly appealing as it requires minimal parameter tuning to provide decent performance on many standard machine learning tasks. Note that the cell below reuses the `classifier` variable, so the evaluation that follows applies to whichever model was trained last.

[Visit the CatBoost website for more information](https://catboost.ai/docs/)

In [0]:
classifier = CatBoostClassifier(objective='Logloss', max_depth=4, n_estimators=100)
classifier.fit(X_train, y_train)

## Evaluate the model
We evaluate the accuracy of our expected-assists model by making predictions for the passes in our test set.

### Predict the test examples

In [0]:
# For each pass, predict the probability of the pass being an assist
y_pred = classifier.predict_proba(X_test)

In [0]:
y_pred

### Compute area under the curve: receiver operating characteristic (AUC-ROC)
To measure the accuracy of our expected-assists model, we compute the AUC-ROC obtained on the test set. The values for the AUC-ROC metric range from 0 to 1. The higher the AUC-ROC value is, the better the classifier is, where an AUC-ROC value of 0.50 corresponds to random guessing. That is, if we randomly predicted whether a pass results in an assist or not, we would obtain an AUC-ROC of 0.50.

In [0]:
y_total = y_train.count()
y_positive = y_train.sum()

print(f'The training set contains {y_total} examples of which {y_positive} are positives.')

In [0]:
auc_roc = roc_auc_score(y_test, y_pred[:, 1])

print(f'Our classifier obtains an AUC-ROC of {auc_roc}.')
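
To build some intuition for the two reference points mentioned above, the following sketch computes the AUC-ROC for random scores and for perfect scores on synthetic labels. All data here is generated for illustration and is unrelated to the passes dataset.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Synthetic labels with roughly 5% positives
rng = np.random.RandomState(0)
y_true = (rng.rand(20000) < 0.05).astype(int)

# Scores that carry no information about the labels
random_scores = rng.rand(20000)
auc_random = roc_auc_score(y_true, random_scores)

# Scores that rank every positive above every negative
auc_perfect = roc_auc_score(y_true, y_true.astype(float))
```

Here `auc_random` ends up close to 0.50 regardless of the class imbalance, while `auc_perfect` equals 1.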

### Compute area under the curve: precision-recall (AUC-PR)
Since the AUC-ROC metric can give an overly optimistic picture under class imbalance (i.e., when the number of positive examples is much lower or higher than the number of negative examples), we also compute the AUC-PR obtained on the test set. The values for the AUC-PR metric also range from 0 to 1, and a higher value indicates a better classifier. Unlike for AUC-ROC, however, the value for random guessing is not 0.50 but the fraction of positive examples in the data, which we estimate below from the train set.

In [0]:
auc_pr_baseline = y_positive / y_total

print(f'The baseline performance for AUC-PR is {auc_pr_baseline}.')

In [0]:
auc_pr = average_precision_score(y_test, y_pred[:, 1])

print(f'Our classifier obtains an AUC-PR of {auc_pr}.')
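
The baseline for AUC-PR can be verified empirically. The sketch below, again on synthetic data, scores examples at random and compares the resulting average precision with the fraction of positive examples.

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Synthetic labels with roughly 5% positives
rng = np.random.RandomState(0)
y_true = (rng.rand(20000) < 0.05).astype(int)

# Scores that carry no information about the labels
random_scores = rng.rand(20000)

positive_rate = y_true.mean()
ap_random = average_precision_score(y_true, random_scores)
```

With random scores, `ap_random` stays close to `positive_rate`, which is why a model's AUC-PR should be judged against that rate rather than against 0.50.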

### Plot ROC curve

In [0]:
plot_roc(y_test, y_pred)

### Plot precision-recall curve

In [0]:
plot_precision_recall(y_test, y_pred)

### Plot calibration curve
We plot a calibration curve to investigate how well our expected-assists model is calibrated. The plot shows the mean predicted probability on the horizontal axis and the observed fraction of positive examples on the vertical axis. For a well-calibrated model, the curve lies close to the diagonal.

In [0]:
plot_calibration_curve(y_test, [y_pred])
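
If you prefer numbers over a plot, scikit-learn's `calibration_curve` returns the values behind such a curve. The sketch below applies it to synthetic predictions that are well calibrated by construction; the arrays `p` and `y_true` are made up for illustration.

```python
import numpy as np
from sklearn.calibration import calibration_curve

# Synthetic predictions: each label is drawn with exactly its predicted probability
rng = np.random.RandomState(0)
p = rng.rand(50000)
y_true = (rng.rand(50000) < p).astype(int)

# Observed fraction of positives and mean predicted probability per bin
frac_pos, mean_pred = calibration_curve(y_true, p, n_bins=10)

# Largest gap between prediction and observation over all bins
max_gap = np.max(np.abs(frac_pos - mean_pred))
```

For these synthetic predictions `max_gap` is small. On a real model, large gaps in particular bins indicate ranges of predicted probability where the model is over- or under-confident.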

# Use your own expected-assists model

In [0]:
# Select a season
season_id = 185611

In [0]:
df_matches

In [0]:
# Retrieve matches in selected season
match_ids = df_matches[df_matches['season_id'] == season_id]['match_id'].unique()

In [0]:
# Retrieve passes in selected season
df_sample = df_passes[df_passes['game_id'].isin(match_ids)].reset_index(drop=True)

In [0]:
df_sample

In [0]:
# Estimate expected-assists value for each pass
predictions = classifier.predict_proba(df_sample[columns_features])

In [0]:
df_predictions = pd.DataFrame(data=predictions, columns=['xna', 'xa'])

In [0]:
# Combine expected-assists values with pass information
df_results = df_sample.merge(df_predictions, left_index=True, right_index=True)

# Inspect passes with high or low expected-assists values

In [0]:
# Get pass with highest xA value
highest_xa = df_results.sort_values('xa', ascending=False).reset_index(drop=True).loc[0]

In [0]:
highest_xa

In [0]:
# Select the four actions leading up to the pass, the pass itself, and the action that follows it
df_actions_sample = df_actions[(df_actions['game_id'] == highest_xa['game_id']) &
                               (df_actions['action_id'] <= highest_xa['action_id'] + 1) &
                               (df_actions['action_id'] > highest_xa['action_id'] - 5)]

In [0]:
matplotsoccer.actions(
    location=df_actions_sample[['start_x', 'start_y', 'end_x', 'end_y']],
    action_type=df_actions_sample['type_name'],
    team=df_actions_sample['team_name'],
    result=df_actions_sample['result'],
    label=df_actions_sample[['time_seconds', 'type_name', 'player_name', 'team_name']],
    labeltitle=['time', 'type', 'player', 'team'],
    zoom=False
)

In [0]:
player_name = 'H. Ziyech'
ax = matplotsoccer.field(show=False)
passes_player = df_results[df_results['player_name'] == player_name]
ax.set_title(f'Passes by {player_name}')
ax = seaborn.kdeplot(passes_player['start_x'] * 105, passes_player['start_y'] * 68, shade=True, n_levels=5, ax=ax)

In [0]:
threshold = 0.2
player_name = 'H. Ziyech'
ax = matplotsoccer.field(show=False)
passes_player = df_results[(df_results['player_name'] == player_name) & (df_results['xa'] > threshold)]
ax.set_title(f'Passes by {player_name} with xA above {threshold}')
ax = seaborn.kdeplot(passes_player['start_x'] * 105, passes_player['start_y'] * 68, shade=True, n_levels=5, ax=ax)

# Produce overview

In [0]:
# Produce overview
df_overview = df_results.groupby(['player_name', 'team_name'])[['assist', 'xa']].sum()

In [0]:
# Sort overview and rename columns
df_overview = df_overview.sort_values('assist', ascending=False).reset_index(drop=False)
df_overview.columns = ['player_name', 'team_name', 'assists', 'expected_assists']

In [0]:
df_overview.head(10)
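
A common next step is to compare each player's actual assists with their expected assists. The sketch below does this on a made-up results table; the players, teams, and values are hypothetical, and the `difference` column is an illustrative addition that is not part of the overview above.

```python
import pandas as pd

# Made-up pass-level results for two players
toy_results = pd.DataFrame({
    'player_name': ['A', 'A', 'B', 'B', 'B'],
    'team_name':   ['X', 'X', 'Y', 'Y', 'Y'],
    'assist':      [1, 0, 0, 0, 1],
    'xa':          [0.30, 0.10, 0.05, 0.20, 0.25],
})

# Sum assists and expected assists per player, as in the overview above
overview = (
    toy_results.groupby(['player_name', 'team_name'])[['assist', 'xa']].sum()
    .rename(columns={'assist': 'assists', 'xa': 'expected_assists'})
    .reset_index()
)

# Positive values mean more assists than the model expected
overview['difference'] = overview['assists'] - overview['expected_assists']
```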

In [0]:
df_overview.to_csv(os.path.join(path, 'expected-assists.csv'))

In [0]:
drive.mount(mount_point, force_remount=True)