# Modelling possession value using a neural network

In soccer/football, estimating the value of each player's action on the field is a critical part of analytics: it lets us understand the risk and reward of each pass or tackle, and it can help players make better decisions in the future. This value is sometimes called possession value.

Quite a few works estimate possession value, including

[Karun Singh's expected threat](https://karun.in/blog/expected-threat.html),

[@thecomeonman xPo](https://thecomeonman.github.io/xPo/), and

[Tom Decroos et al. VAEP](https://arxiv.org/abs/1802.07127).

However, because of limited data access, all of these works are based on event data only, so they depend on the data provider and cannot capture everything that happens on the field. With the [Metrica Sports sample tracking data](https://github.com/metrica-sports/sample-data), we can now try to estimate possession value from player positions and velocities instead.


The NFL Big Data Bowl 2020 provided tracking data for modelling NFL plays, and the [winning solution](https://www.kaggle.com/c/nfl-big-data-bowl-2020/discussion/119400) used a convolutional neural network. The code below is an attempt to apply a similar model to estimate the chance of a shot attempt from tracking data.

In [1]:
import pandas as pd
import numpy as np
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import random
import os
from sklearn.metrics import confusion_matrix
from sklearn.metrics import mean_squared_error

pd.set_option('mode.chained_assignment', None)

## Preprocess

In [2]:
##Preprocess code by @EightyFivePoint https://github.com/Friends-of-Tracking-Data-FoTD/LaurieOnTracking

import Metrica_IO as mio
import Metrica_Viz as mviz
import Metrica_Velocities as mvel

DATADIR = 'data/'

game_id = 2 # let's look at sample match 2

# read in the event data
events = mio.read_event_data(DATADIR,game_id)

# count the number of each event type in the data
print( events['Type'].value_counts() )

# Bit of housekeeping: unit conversion from Metrica data units to meters
events = mio.to_metric_coordinates(events)

# Get events by team
home_events = events[events['Team']=='Home']
away_events = events[events['Team']=='Away']

# Frequency of each event type by team
home_events['Type'].value_counts()
away_events['Type'].value_counts()

# Get all shots
shots = events[events['Type']=='SHOT']
home_shots = home_events[home_events.Type=='SHOT']
away_shots = away_events[away_events.Type=='SHOT']

# Look at frequency of each shot Subtype
home_shots['Subtype'].value_counts()
away_shots['Subtype'].value_counts()


# Get the shots that led to a goal
home_goals = home_shots[home_shots['Subtype'].str.contains('-GOAL')].copy()
away_goals = away_shots[away_shots['Subtype'].str.contains('-GOAL')].copy()

# Add a column event 'Minute' to the data frame
home_goals['Minute'] = home_goals['Start Time [s]']/60.


#### TRACKING DATA ####

# READING IN TRACKING DATA
tracking_home = mio.tracking_data(DATADIR,game_id,'Home')
tracking_away = mio.tracking_data(DATADIR,game_id,'Away')



# Convert positions from metrica units to meters 
tracking_home = mio.to_metric_coordinates(tracking_home)
tracking_away = mio.to_metric_coordinates(tracking_away)

PASS              964
CHALLENGE         311
RECOVERY          248
BALL LOST         233
SET PIECE          80
BALL OUT           49
SHOT               24
FAULT RECEIVED     20
CARD                6
Name: Type, dtype: int64
Reading team: home
Reading team: away


In [3]:
# Calculate player velocities
tracking_home = mvel.calc_player_velocities(tracking_home,  filter_='moving average')
tracking_away = mvel.calc_player_velocities(tracking_away,  filter_='moving average')

In [4]:
# Keep only frames where the ball is in play (no missing ball coordinates)

tracking_home_inbound = tracking_home[~pd.isna(tracking_home.ball_x)]
tracking_away_inbound = tracking_away[~pd.isna(tracking_away.ball_x)]

## Model input

To turn the data into model input, we extract the players' positions and velocities. Inspired by the NFL competition solution, the inputs are:

1. Difference in x,y position between each pair of home and away players
2. Difference in x,y velocity between each pair of home and away players
3. Difference in x,y position between each home player and the ball
4. Difference in x,y position between each away player and the ball
5. x,y velocity of each home player
6. x,y velocity of each away player

That gives 6x2 = 12 layers of input or, in neural network terms, 12-channel data.
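As a rough illustration of how one such channel is built, the toy sketch below uses three players per side instead of eleven. The array names and values are made up for this example. NumPy broadcasting turns two position vectors into a matrix of pairwise differences:

```python
import numpy as np

# Toy x positions (metres) for three home and three away players.
home_x = np.array([10.0, 20.0, 30.0])
away_x = np.array([12.0, 25.0, 40.0])

# Column vector minus row vector broadcasts to a (3, 3) matrix:
# entry [i, j] is away player i's x minus home player j's x.
diff_x = away_x.reshape(-1, 1) - home_x.reshape(1, -1)

# Stacking several such matrices gives a (channels, 3, 3) array,
# analogous to the (12, 11, 11) input built for each frame.
channels = np.stack([diff_x, -diff_x])

print(diff_x)
print(channels.shape)
```

With eleven players per side, each channel becomes an 11x11 grid, which is why the full input has shape (frames, 12, 11, 11).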

In [5]:
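def create_faetures(tracking_home,tracking_away):
    # Work from the arguments rather than module-level variables:
    # keep only frames where the ball is in play (ball coordinates present).
    tracking_home_inbound = tracking_home[~pd.isna(tracking_home.ball_x)]
    tracking_away_inbound = tracking_away[~pd.isna(tracking_away.ball_x)]

    ball_loc = tracking_home_inbound[['ball_x','ball_y']].values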

    player_home_inbound = tracking_home_inbound[tracking_home_inbound.columns[~tracking_home_inbound.columns.str.contains('ball')]]
    player_away_inbound = tracking_away_inbound[tracking_away_inbound.columns[~tracking_away_inbound.columns.str.contains('ball')]]


    
    home_x = player_home_inbound[player_home_inbound.columns[player_home_inbound.columns.str.endswith('_x')]].values
    home_y = player_home_inbound[player_home_inbound.columns[player_home_inbound.columns.str.endswith('_y')]].values
    home_vx = player_home_inbound[player_home_inbound.columns[player_home_inbound.columns.str.endswith('_vx')]].values
    home_vy = player_home_inbound[player_home_inbound.columns[player_home_inbound.columns.str.endswith('_vy')]].values

    away_x = player_away_inbound[player_away_inbound.columns[player_away_inbound.columns.str.endswith('_x')]].values
    away_y = player_away_inbound[player_away_inbound.columns[player_away_inbound.columns.str.endswith('_y')]].values
    away_vx = player_away_inbound[player_away_inbound.columns[player_away_inbound.columns.str.endswith('_vx')]].values
    away_vy = player_away_inbound[player_away_inbound.columns[player_away_inbound.columns.str.endswith('_vy')]].values
    
    del player_home_inbound
    del player_away_inbound
    
    player_vector = []

    for frame in range(len(home_x)):
        home_x_input = home_x[frame][~np.isnan(home_x[frame])]
        home_y_input = home_y[frame][~np.isnan(home_y[frame])]
        home_vx_input = home_vx[frame][~np.isnan(home_vx[frame])]
        home_vy_input = home_vy[frame][~np.isnan(home_vy[frame])]

        away_x_input = away_x[frame][~np.isnan(away_x[frame])]
        away_y_input = away_y[frame][~np.isnan(away_y[frame])]
        away_vx_input = away_vx[frame][~np.isnan(away_vx[frame])]
        away_vy_input = away_vy[frame][~np.isnan(away_vy[frame])]

        ball_loc_input = ball_loc[frame]

        player_vector.append(player_feature(home_x_input,away_x_input,home_y_input,away_y_input,home_vx_input,away_vx_input,
                                           home_vy_input,away_vy_input,ball_loc_input))
    
    return np.array(player_vector)

    
def player_feature(home_x,away_x,home_y,away_y,home_sx,away_sx,home_sy,away_sy,ball_loc):
    # Left-pad each team to 11 entries with a sentinel value when fewer players are on the pitch
    if len(home_x) < 11:
        home_x = np.pad(home_x,(11-len(home_x),0), 'constant', constant_values=-999)
        home_y = np.pad(home_y,(11-len(home_y),0), 'constant', constant_values=-999)
        home_sx = np.pad(home_sx,(11-len(home_sx),0), 'constant', constant_values=-999)
        home_sy = np.pad(home_sy,(11-len(home_sy),0), 'constant', constant_values=-999)
    if len(away_x) < 11:
        away_x = np.pad(away_x,(11-len(away_x),0), 'constant', constant_values=-999)
        away_y = np.pad(away_y,(11-len(away_y),0), 'constant', constant_values=-999)
        away_sx = np.pad(away_sx,(11-len(away_sx),0), 'constant', constant_values=-999)
        away_sy = np.pad(away_sy,(11-len(away_sy),0), 'constant', constant_values=-999)

    dist_away_home_x = away_x.reshape(-1,1)-home_x.reshape(1,-1)
    dist_away_home_sx = away_sx.reshape(-1,1)-home_sx.reshape(1,-1)
    dist_away_home_y = away_y.reshape(-1,1)-home_y.reshape(1,-1)
    dist_away_home_sy = away_sy.reshape(-1,1)-home_sy.reshape(1,-1)
    dist_home_ball_x = home_x.reshape(-1,1)-np.repeat(ball_loc[0],11).reshape(1,-1)
    dist_home_ball_y = home_y.reshape(-1,1)-np.repeat(ball_loc[1],11).reshape(1,-1)    
    dist_away_ball_x = away_x.reshape(-1,1)-np.repeat(ball_loc[0],11).reshape(1,-1)
    dist_away_ball_y = away_y.reshape(-1,1)-np.repeat(ball_loc[1],11).reshape(1,-1)
    home_sx = np.repeat(home_sx,11).reshape(11,-1)
    home_sy = np.repeat(home_sy,11).reshape(11,-1)
    away_sx = np.repeat(away_sx,11).reshape(11,-1)
    away_sy = np.repeat(away_sy,11).reshape(11,-1)
    feats = [dist_away_home_x, dist_away_home_sx, dist_away_home_y,dist_away_home_sy, dist_home_ball_x,dist_home_ball_y,dist_away_ball_x,dist_away_ball_y,
            home_sx,home_sy,away_sx,away_sy]
    
    return np.stack(feats)


In [6]:
x = create_faetures(tracking_home,tracking_away )

In [7]:
def scaling(feats, sctype="standard"):
    v1 = []
    v2 = []
    for i in range(feats.shape[1]):
        feats_ = feats[:, i, :]
        if sctype == "standard":
            mean_ = np.mean(feats_)
            std_ = np.std(feats_)
            feats[:, i, :] -= mean_
            feats[:, i, :] /= std_
            v1.append(mean_)
            v2.append(std_)
        elif sctype == "minmax":
            max_ = np.max(feats_)
            min_ = np.min(feats_)
            feats[:, i, :] = (feats_ - min_) / (max_ - min_)
            v1.append(max_)
            v2.append(min_)

    return feats, v1, v2

x, sc_mean, sc_std = scaling(x) #Standardization of data
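The per-channel standardization above can be checked on a toy array. The sketch below is a vectorized variant written for illustration, not the notebook's own function: it computes one mean and one standard deviation per channel over all frames and grid cells, and it does not modify its input in place. The random data and variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy batch shaped (frames, channels, 11, 11), like the model input.
feats = rng.normal(loc=5.0, scale=3.0, size=(100, 12, 11, 11))

# One mean and one standard deviation per channel, computed over
# all frames and all cells of the 11x11 grid.
mean = feats.mean(axis=(0, 2, 3), keepdims=True)
std = feats.std(axis=(0, 2, 3), keepdims=True)
scaled = (feats - mean) / std

# After scaling, every channel has mean ~0 and standard deviation ~1.
print(np.abs(scaled.mean(axis=(0, 2, 3))).max())
print(np.abs(scaled.std(axis=(0, 2, 3)) - 1.0).max())
```

Keeping `mean` and `std` around matters if the model is later applied to frames from another game, since new data should be scaled with the same statistics.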

## Process event data as output

For the target variable (the model output), people normally use the chance of scoring a goal within the next n events (e.g. 5 events in xT, or 10 events). However, since only one game of training data is used here, the number of goals is very low, so the chance of a shot attempt within the next 10 events is used as the output instead.

To tie this to the tracking data, "shot attempt chance in the next 10 events" becomes "the tracking frame falls within the 10 events leading up to a shot attempt".
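The look-ahead labelling can be sketched on a toy event sequence. This is a minimal example with made-up data and a window of 3 instead of 10: reversing the series, taking a rolling maximum, and reversing back flags every event whose window contains a shot.

```python
import pandas as pd

# Toy event sequence: 1.0 marks a shot, 0.0 any other event.
is_shot = pd.Series([0.0, 0.0, 0.0, 1.0, 0.0, 0.0])

# Reverse the series, take a rolling max, and reverse back, so each
# event looks at itself and the events that follow it.
window = 3  # the notebook uses 10
label = is_shot[::-1].rolling(window, min_periods=1).max()[::-1]

print(label.tolist())  # [0.0, 1.0, 1.0, 1.0, 0.0, 0.0]
```

Note that the window includes the current event, so a window of 10 covers an event and the 9 events after it.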

In [8]:
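# Frame at which the following event starts; the last event gets a large sentinel value (200000)
events['next_event_start'] = events['Start Frame'].shift(-1).fillna(200000) 

# Eliminate events with no duration (e.g. yellow card); copy so new columns can be added safely
events_slice = events[events.next_event_start > events['Start Frame']].copy()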



In [9]:
# Flag each event if a shot occurs within the 10-event window starting at that event
# (reverse the series, take a rolling max, then reverse back)

events_slice['rolling_home_shot'] = ((events_slice.Type == 'SHOT') & (events_slice.Team == 'Home'))[::-1].rolling(10,1).max()[::-1].fillna(0)
events_slice['rolling_away_shot'] = ((events_slice.Type == 'SHOT') & (events_slice.Team == 'Away'))[::-1].rolling(10,1).max()[::-1].fillna(0)

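# Map each event to the half-open frame range [Start Frame, next_event_start).
# closed='left' keeps the event's first frame inside its own interval
# (the default closed='right' would exclude it).
s_home = pd.Series(events_slice['rolling_home_shot'].values, pd.IntervalIndex.from_arrays(events_slice['Start Frame'], events_slice['next_event_start'], closed='left'))
s_away = pd.Series(events_slice['rolling_away_shot'].values, pd.IntervalIndex.from_arrays(events_slice['Start Frame'], events_slice['next_event_start'], closed='left'))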

tracking_home_inbound = tracking_home_inbound.reset_index()
tracking_home_inbound['event_rolling_home_shot'] = tracking_home_inbound['Frame'].map(s_home).fillna(0)
tracking_home_inbound['event_rolling_away_shot'] = tracking_home_inbound['Frame'].map(s_away).fillna(0)

Finally, check the input data:

In [10]:
tracking_home_inbound.head()

| | Frame | Period | Time [s] | Home_11_x | Home_11_y | Home_1_x | Home_1_y | Home_2_x | Home_2_y | Home_3_x | ... | Home_7_vy | Home_7_speed | Home_8_vx | Home_8_vy | Home_8_speed | Home_9_vx | Home_9_vy | Home_9_speed | event_rolling_home_shot | event_rolling_away_shot |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 51 | 1 | 2.04 | 47.47846 | 0.68952 | 15.67422 | 15.61892 | 18.82878 | 5.0116 | 19.23158 | ... | 0.454143 | 0.464687 | 0.1855 | -0.055857 | 0.193727 | -0.2915 | 0.522143 | 0.598001 | 0.0 | 0.0 |
| 1 | 52 | 1 | 2.08 | 47.46574 | 0.6766 | 15.68482 | 15.6366 | 18.8309 | 5.01228 | 19.18706 | ... | 0.449286 | 0.460766 | 0.181714 | -0.043714 | 0.186898 | -0.314214 | 0.522143 | 0.609396 | 0.0 | 0.0 |
| 2 | 53 | 1 | 2.12 | 47.45196 | 0.663 | 15.6933 | 15.65496 | 18.83302 | 5.00684 | 19.1436 | ... | 0.442 | 0.45543 | 0.170357 | -0.041286 | 0.175289 | -0.329357 | 0.534286 | 0.627644 | 0.0 | 0.0 |
| 3 | 54 | 1 | 2.16 | 47.44136 | 0.65348 | 15.7039 | 15.67876 | 18.83514 | 5.00888 | 19.1012 | ... | 0.420143 | 0.437259 | 0.159 | -0.029143 | 0.161649 | -0.355857 | 0.512429 | 0.623873 | 0.0 | 0.0 |
| 4 | 55 | 1 | 2.2 | 47.43076 | 0.64668 | 15.71556 | 15.70256 | 18.8362 | 5.0116 | 19.05668 | ... | 0.395857 | 0.417444 | 0.151429 | -0.021857 | 0.152998 | -0.378571 | 0.488143 | 0.617738 | 0.0 | 0.0 |


In [11]:
x.shape

(83272, 12, 11, 11)

In [12]:
y = tracking_home_inbound[['event_rolling_home_shot','event_rolling_away_shot']].values

## Create model and training loop

Here the model is built with PyTorch, though the code would look quite similar in Keras.

In [13]:
##Training script by @takuoko1  https://www.kaggle.com/takuok/1st-place-reproduction-10feats-dev


N_CLASSES = 2 # Home and away team shot attempt chance in the next 10 events
N_CHANNELS = 12 # number of input channels
class Flatten(nn.Module):
    def forward(self, input):
        return input.view(input.size(0), -1)


class CnnModel(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        self.conv1 = nn.Sequential(
            nn.Conv2d(N_CHANNELS, 128, kernel_size=1, stride=1, bias=False),
            nn.SELU(inplace=True), #Since data has both positive and negative input using ReLU will kill half of data and decrease accuracy
            nn.Conv2d(128, 160, kernel_size=1, stride=1, bias=False),
            nn.SELU(inplace=True),
            nn.Conv2d(160, 128, kernel_size=1, stride=1, bias=False),
            nn.SELU(inplace=True)
        )
        self.pool1 = nn.AdaptiveAvgPool2d((1, 11))

        self.conv2 = nn.Sequential(
            nn.BatchNorm2d(128),
            nn.Conv2d(128, 160, kernel_size=(1, 1), stride=1, bias=False),
            nn.SELU(inplace=True),
            nn.BatchNorm2d(160),
            nn.Conv2d(160, 96, kernel_size=(1, 1), stride=1, bias=False),
            nn.SELU(inplace=True),
            nn.BatchNorm2d(96),
            nn.Conv2d(96, 96, kernel_size=(1, 1), stride=1, bias=False),
            nn.SELU(inplace=True),
            nn.BatchNorm2d(96),
        )
        self.pool2 = nn.AdaptiveAvgPool2d((1, 1))

        self.last_linear = nn.Sequential(
            Flatten(),
            nn.Linear(96, 256),
            nn.LayerNorm(256),
            nn.SELU(inplace=True),
            nn.Dropout(0.3),
            nn.Linear(256, num_classes)
        )

    def forward(self, x):
        x = self.conv1(x)
        x = self.pool1(x)
        x = self.conv2(x)
        x = self.pool2(x)
        x = torch.sigmoid(self.last_linear(x))


        return x
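One property of this kind of architecture is worth spelling out: 1x1 convolutions act on each grid cell independently, and average pooling ignores order, so the output should not depend on how players are ordered along an axis. The sketch below checks this on a small stand-in network; the channel sizes, the `embed` helper, and the random input are assumptions made for illustration, not the model above.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Small stand-in: one 1x1 convolution followed by the same two
# average-pooling steps (channel sizes are arbitrary).
conv = nn.Conv2d(12, 8, kernel_size=1, bias=False)
pool_rows = nn.AdaptiveAvgPool2d((1, 11))  # average over the row axis
pool_cols = nn.AdaptiveAvgPool2d((1, 1))   # then over the column axis


def embed(x):
    h = torch.selu(conv(x))         # (batch, 8, 11, 11)
    h = pool_rows(h)                # (batch, 8, 1, 11)
    return pool_cols(h).flatten(1)  # (batch, 8)


x = torch.randn(4, 12, 11, 11)
perm = torch.randperm(11)
x_shuffled = x[:, :, perm, :]       # reorder players along the row axis

with torch.no_grad():
    out = embed(x)
    out_shuffled = embed(x_shuffled)

print(out.shape)
print(torch.allclose(out, out_shuffled, atol=1e-5))
```

This is convenient for tracking data, where the column order of players in the file carries no meaning.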

In [14]:

def train_one_epoch(model, train_loader, criterion, optimizer, device, 
                    steps_upd_logging=500, accumulation_steps=1, scheduler=None):
    model.train()

    total_loss = 0.0
    optimizer.zero_grad()
    for step, (x, targets) in enumerate(train_loader):
        #x= x.to(device)
        #targets = targets.to(device)

        logits = model(x)  # the model already applies a sigmoid, so these are probabilities

        loss = criterion(logits, targets)
        loss.backward()

        if (step + 1) % accumulation_steps == 0:  # Wait for several backward steps
            optimizer.step()  # Now we can do an optimizer step
            optimizer.zero_grad()  # Reset gradients only after they have been applied

        total_loss += loss.item()
        
        if scheduler is not None:
            scheduler.step()

    return total_loss / (step + 1)

def validate(model, val_loader, criterion, device):
    model.eval()

    val_loss = 0.0
    true_ans_list = []
    preds_cat = []
    with torch.no_grad():  # gradients are not needed for validation
        for step, (x, targets) in enumerate(val_loader):
            #x= x.to(device)
            #targets = targets.to(device)

            logits = model(x)
            # _, targets = targets.max(dim=1)
            loss = criterion(logits, targets)
            true_ans_list.append(targets.float().detach().numpy())
            preds_cat.append(logits.float().detach().numpy())
            val_loss += loss.item()
        
    all_true_ans = np.concatenate(true_ans_list, axis=0)
    all_preds = np.concatenate(preds_cat, axis=0)
    return all_preds, all_true_ans, val_loss / (step + 1)

To reduce computation time, 5,000 frames of tracking data are used to train the model and another 5,000 are used for validation.

In [15]:
sample_row = np.random.choice(len(x),10000, replace=False)
train_row = sample_row[:len(sample_row)//2]
val_row = sample_row[len(sample_row)//2:]

x_train = x[train_row]
y_train = y[train_row]

x_val = x[val_row]
y_val = y[val_row]

In [16]:
BATCH_SIZE = 64

train_dataset = TensorDataset(torch.tensor(x_train, dtype=torch.float32), torch.tensor(y_train, dtype=torch.float32))
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True, num_workers=0, pin_memory=True)

val_dataset = TensorDataset(torch.tensor(x_val, dtype=torch.float32), torch.tensor(y_val, dtype=torch.float32))
val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=False, num_workers=0, pin_memory=True)
del train_dataset, val_dataset

In [17]:
device = "cpu"
epochs = 50
SEED = 71

model = CnnModel(num_classes=N_CLASSES)
#model.to(device)

num_steps = len(x_train) // BATCH_SIZE
criterion = torch.nn.BCELoss() #Binary Cross Entropy loss
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler  = torch.optim.lr_scheduler.OneCycleLR(optimizer,max_lr=3e-3,epochs=epochs+1,steps_per_epoch=num_steps)
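Because the training loop steps the scheduler once per batch, it is worth checking that the scheduler's step budget covers the whole run; as far as I know, OneCycleLR raises an error once it is stepped beyond its total number of steps. The arithmetic below is a rough check that assumes the 5,000-frame training sample and batch size of 64 used above; the variable names are made up for this example.

```python
import math

n_train = 5000   # training frames sampled above
batch_size = 64
epochs = 50

# DataLoader keeps the last, smaller batch by default (drop_last=False),
# so the number of batches per epoch is rounded up, not down.
batches_per_epoch = math.ceil(n_train / batch_size)
steps_taken = epochs * batches_per_epoch

# Step budget handed to OneCycleLR in the cell above.
steps_per_epoch = n_train // batch_size
scheduler_budget = (epochs + 1) * steps_per_epoch

print(batches_per_epoch, steps_taken, scheduler_budget)  # 79 3950 3978
```

With these numbers the extra epoch passed to the scheduler leaves enough headroom for the partial batch at the end of every epoch.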


In [18]:
# Fix random seeds for reproducibility

def seed_torch(seed=71):
    random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True

In [19]:

for epoch in range(1, epochs + 1):
    seed_torch(SEED + epoch)

    print("Starting {} epoch...".format(epoch))
    tr_loss = train_one_epoch(model, train_loader, criterion, optimizer, device, scheduler=scheduler)
    print('Mean train loss: {}'.format(round(tr_loss, 5)))

    val_pred, y_true, val_loss = validate(model, val_loader, criterion, device)
    print('Mean val loss: {}'.format(round(val_loss, 5)))

Starting 1 epoch...
Mean train loss: 0.51975
Mean val loss: 0.36683
Starting 2 epoch...
Mean train loss: 0.30027
Mean val loss: 0.2386
Starting 3 epoch...
Mean train loss: 0.22934
Mean val loss: 0.20986
Starting 4 epoch...
Mean train loss: 0.22356
Mean val loss: 0.21607
Starting 5 epoch...
Mean train loss: 0.21943
Mean val loss: 0.20476
Starting 6 epoch...
Mean train loss: 0.21503
Mean val loss: 0.1987
Starting 7 epoch...
Mean train loss: 0.20454
Mean val loss: 0.20071
Starting 8 epoch...
Mean train loss: 0.20596
Mean val loss: 0.21782
Starting 9 epoch...
Mean train loss: 0.19535
Mean val loss: 0.18813
Starting 10 epoch...
Mean train loss: 0.18835
Mean val loss: 0.19178
Starting 11 epoch...
Mean train loss: 0.18317
Mean val loss: 0.17597
Starting 12 epoch...
Mean train loss: 0.18161
Mean val loss: 0.17033
Starting 13 epoch...
Mean train loss: 0.17903
Mean val loss: 0.17797
Starting 14 epoch...
Mean train loss: 0.1819
Mean val loss: 0.17416
Starting 15 epoch...
Mean train loss: 0.16717


## Result Evaluation

We can now check the mean squared error and the confusion matrix.

In [20]:
print("Home team error = %1.4f" % mean_squared_error(y_true[:,0], val_pred[:,0]))

Home team error = 0.0073


In [21]:
print("Away team error = %1.4f" % mean_squared_error(y_true[:,1], val_pred[:,1]))

Away team error = 0.0036


In [22]:
#Confusion matrix for home team
confusion_matrix(y_true[:,0], np.round(val_pred[:,0]))

array([[4660,    6],
       [  39,  295]], dtype=int64)

In [23]:
#Confusion matrix for away team
confusion_matrix(y_true[:,1], np.round(val_pred[:,1]))

array([[4753,   17],
       [   4,  226]], dtype=int64)

We can see that the model predicts shot attempt chance well despite the heavy class imbalance, although the result is not that meaningful given the limited data.
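With classes this imbalanced, accuracy alone says little, so it helps to read precision and recall off the confusion matrices. The sketch below copies the two matrices printed above and assumes scikit-learn's layout (rows are true classes, columns are predicted classes); the `summarize` helper is a name made up for this example.

```python
import numpy as np

# Confusion matrices copied from the outputs above
# (rows: true class, columns: predicted class).
cm_home = np.array([[4660, 6], [39, 295]])
cm_away = np.array([[4753, 17], [4, 226]])


def summarize(cm):
    tn, fp, fn, tp = cm.ravel()
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "accuracy": (tp + tn) / cm.sum(),
        # accuracy of always predicting "no shot"
        "baseline_accuracy": (tn + fp) / cm.sum(),
    }


home = summarize(cm_home)
away = summarize(cm_away)
print({k: round(float(v), 3) for k, v in home.items()})
print({k: round(float(v), 3) for k, v in away.items()})
```

For the home team this gives a precision of about 0.980 and a recall of about 0.883, against a baseline accuracy of about 0.933 for always predicting "no shot".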

## Possible future improvement

This is just a simple demo of what a neural network can do. Possible future improvements include:

1. Only two games of data are available; more tracking data would obviously help
2. The network structure is not fine-tuned for soccer and has room for improvement
3. Only a single snapshot of tracking data is used as model input; accuracy could improve if past frames were considered as well

and possibly many more.

A similar input structure could also be used for other tasks, e.g. pass completion probability or a scenario autoencoder.

Karun Singh mentioned a few in an Opta Forum presentation: https://vimeo.com/398489039/80d8dcfb58