## 🌀 It's all starting to Unravel!

First run `pip install unravelsports` if you haven't already!


-----


In [None]:
%pip install unravelsports

### 1. Introduction

This notebook shows how to use this package to convert [Kloppy](https://github.com/PySport/kloppy) tracking data format into Graphs. These Graphs can subsequently be used to train a Graph Neural Network with the [Spektral](https://graphneural.network/) library as discussed in [A Graph Neural Network Deep-dive into Successful Counterattacks {A. Sahasrabudhe & J. Bekkers}](https://github.com/USSoccerFederation/ussf_ssac_23_soccer_gnn/tree/main).

This example follows these steps:
- [2. Imports](#2-imports)
- [3. Open SkillCorner Data](#3-open-skillcorner-data)
- [4. Graph Neural Network Converter](#4-graph-neural-network-converter)
- [5. Load Kloppy Data, Convert and Store](#5-load-kloppy-data-convert-and-store)
- [6. Creating a Custom Graph Dataset](#6-creating-a-custom-graph-dataset)
- [7. Prepare for Training](#7-prepare-for-training)
- [8. Fit Model](#8-fit-model)


### 2. Imports

We import `GraphConverter` to help us convert from Kloppy tracking data frames to graphs.

In [13]:
from unravel.soccer import GraphConverter

from kloppy import skillcorner

-------
### 3. Open SkillCorner Data

The `GraphConverter` class supports the conversion of every tracking data provider supported by [PySport Kloppy](https://github.com/PySport/kloppy), namely:
- Sportec
- Tracab
- SecondSpectrum
- SkillCorner
- StatsPerform
- Metrica

In this example we're going to use tracking data frames from 4 matches of [Open SkillCorner Data](https://github.com/SkillCorner/opendata). 

All we need to know for now is that this data is from the following matches:

|    | date_time            |   id | home_team.short_name   | away_team.short_name   |
|---:|:---------------------|-----:|:-----------------------|:-----------------------|
|  0 | 2020-07-02T19:15:00Z | 4039 | Manchester City        | Liverpool              |
|  1 | 2020-05-26T16:30:00Z | 3749 | Dortmund               | Bayern Munchen         |
|  2 | 2020-03-08T19:45:00Z | 3518 | Juventus               | Inter                  |
|  3 | 2020-03-01T20:00:00Z | 3442 | Real Madrid            | FC Barcelona           |

-------
### 4. Graph Neural Network Converter

To get started with the `GraphConverter` we need to pass two required parameters:
- `dataset` (of type `TrackingDataset` (Kloppy)) 
- `labels` (a dict with `frame_id` as keys and `True`, `False`, `1` or `0` as values), for example `{frame_id: 1, frame_id: 0}`. You will need to create your own labels! In this example we'll use `dummy_labels(dataset)` to generate a fake label for each frame.
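
As a minimal sketch of what such a `labels` dict could look like, the snippet below builds one from plain Python values. The frame ids and the set of "positive" frames are hypothetical placeholders; in practice both would come from your own data.

```python
# Hypothetical frame ids and a hypothetical per-frame outcome; replace both
# with values derived from your own data.
frame_ids = [1001, 1002, 1003, 1004]
successful_frames = {1002, 1004}  # assumption: frames you consider "positive"

# One binary label per frame_id, as expected by the `labels` parameter.
labels = {frame_id: frame_id in successful_frames for frame_id in frame_ids}

print(labels)
```

Every frame you want to convert needs its own key in this dict.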


#### Graph Identifier(s):
When training a GNN it's highly recommended to split data into test/train(/validation) on match level, or on sequence/possession level, such that all values from one match/sequence/possession end up in the same test, train or validation set. This avoids leaking information between the test, train and validation sets.

To make this simple we have two options we can pass to `GraphConverter`, namely:
- `graph_id`. This is a single identifier (str or int) for a whole match, for example the unique match id.
- `graph_ids`. This is a dictionary with the same keys as `labels`, but the values are now the unique identifiers. This option can be used if we want to split by sequence or possession id. For example: `{frame_id: 'matchId-sequenceId1', frame_id: 'matchId-sequenceId2'}` etc. You will need to create your own possession/sequence ids. Note that if `labels` and `graph_ids` don't have the exact same keys, an error is raised. In this example we'll use `graph_id=match_id` as the unique identifier, but feel free to change that to `graph_ids=dummy_graph_ids(dataset)` to test out that behavior.
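
A minimal sketch of building a `graph_ids` dict by hand is shown below. The frame ids, the frame-to-sequence mapping and the `"<match_id>-<sequence_id>"` format are illustrative assumptions; the key check mirrors the requirement that `labels` and `graph_ids` share the same keys.

```python
match_id = 4039

# Hypothetical mapping from frame_id to a possession/sequence number.
sequence_by_frame = {1001: 1, 1002: 1, 1003: 2, 1004: 2}
labels = {1001: False, 1002: True, 1003: False, 1004: True}

# One identifier per frame, formatted as "<match_id>-<sequence_id>".
graph_ids = {
    frame_id: f"{match_id}-{sequence_id}"
    for frame_id, sequence_id in sequence_by_frame.items()
}

# `labels` and `graph_ids` must share exactly the same keys.
if labels.keys() != graph_ids.keys():
    raise ValueError("labels and graph_ids must have identical frame_id keys")

print(graph_ids)
```

Running the key check yourself before calling the converter makes a mismatch easier to track down.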

Correctly splitting the final dataset into train, test and validation sets is incorporated into `CustomGraphDataset` (see section 7 for more information).


#### Graph Converter Settings:
| Parameter | Type | Description | Default |
|-----------|------|-------------|---------|
| `ball_carrier_threshold` | float | The distance threshold to determine the ball carrier in meters. If no ball carrier within ball_carrier_threshold, we skip the frame. | 25.0 |
| `max_player_speed` | float | The maximum speed of a player in meters per second. Used for normalizing node features. | 12.0 |
| `max_ball_speed` | float | The maximum speed of the ball in meters per second. Used for normalizing node features. | 28.0 |
| `boundary_correction` | float | A correction factor for boundary calculations, used to correct out-of-bounds positions as a percentage (used as 1+boundary_correction, e.g., 0.05). Without it, players outside the pitch markings can have values that fall slightly outside of our normalization range. When boundary_correction is set, any player outside the pitch is moved to the closest line. | None |
| `self_loop_ball` | bool | Flag to indicate if the ball node should have a self-loop, aka be connected with itself and not only player(s) | True |
| `adjacency_matrix_connect_type` | str | The type of connection used in the adjacency matrix, typically related to the ball. Choose from 'ball', 'ball_carrier' or 'no_connection' | 'ball' |
| `adjacency_matrix_type` | str | The type of adjacency matrix, indicating how connections are structured, such as split by team. Choose from 'delaunay', 'split_by_team', 'dense', 'dense_ap' or 'dense_dp' | 'split_by_team' |
| `infer_ball_ownership` | bool | Infers 'attacking_team' if no 'ball_owning_team' exist (in Kloppy TrackingDataset) by finding the player closest to the ball using ball xyz, uses 'ball_carrier_threshold' as a cut-off. | True |
| `infer_goalkeepers` | bool | Set True if no GK label is provided, set False for incomplete (broadcast tracking) data that might not have a GK in every frame. | True |
| `defending_team_node_value` | float | Value for the node feature when player is on defending team. Should be between 0 and 1, inclusive. | 0.1 |
| `non_potential_receiver_node_value` | float | Value for the node feature when player is NOT a potential receiver of a pass (when on opposing team or in possession of the ball). Should be between 0 and 1, inclusive. | 0.1 |
| `label_type` | str | The type of prediction label used. Currently only supports 'binary' | 'binary' |
| `random_seed` | int, bool | When a random_seed is given, it will randomly shuffle an individual Graph without changing the underlying structure. When set to True, it will shuffle every frame differently; False won't shuffle. Advised to set True when creating an actual dataset to support Permutation Invariance. | False |
| `pad` | bool | True pads to a total amount of 22 players and ball (so 23x23 adjacency matrix). It dynamically changes the edge feature padding size based on the combination of AdjacencyMatrixConnectType and AdjacencyMatrixType, and self_loop_ball. No need to set padding because smaller and larger graphs can all be used in the same dataset. | False |
| `verbose` | bool | The converter logs warnings / error messages when specific frames have no coordinates, or other missing information. False mutes all of these warnings. | False |
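
To make the role of the ball carrier threshold more concrete, here is a rough sketch of a "closest player within a cut-off" lookup. The function name, the 2D simplification and the toy coordinates are assumptions for illustration only; this is not the library's actual implementation.

```python
import math


def closest_player_within(ball_xy, players_xy, threshold=25.0):
    """Return the id of the player closest to the ball, or None if nobody
    is within `threshold` meters (in which case the frame would be skipped)."""
    best_id, best_dist = None, float("inf")
    for player_id, (x, y) in players_xy.items():
        dist = math.hypot(x - ball_xy[0], y - ball_xy[1])
        if dist < best_dist:
            best_id, best_dist = player_id, dist
    return best_id if best_dist <= threshold else None


# Toy positions in meters (hypothetical).
players = {"p1": (3.0, 4.0), "p2": (30.0, 0.0)}
print(closest_player_within((0.0, 0.0), players))       # p1 is 5 m from the ball
print(closest_player_within((0.0, 0.0), players, 2.0))  # nobody within 2 m
```

A lower threshold makes the ball carrier assignment stricter, at the cost of skipping more frames.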

-------
### 5. Load Kloppy Data, Convert and Store

Here we loop over 4 SkillCorner matches and convert the first 500 frames.

Important things to note:
- We import `dummy_labels` to randomly generate binary labels.
- We import `dummy_graph_ids` to generate fake graph ids.
- Our `GraphConverter` uses the Kloppy `DatasetTransformer` under the hood, which will take care of setting up playing orientation and coordinate system correctly.
- Technically, setting the coordinate system does not matter because the `DatasetTransformer` transforms everything to `coordinates="secondspectrum"` anyway, but setting it up front will speed up parsing a bit.
- In this example we don't have any _actual_ labels for our tracking data frames, you are going to have to create your own. In this example we use `dummy_labels(dataset)` to randomly generate `True` or `False` labels for each frame. This also means training with these random labels will not create a good model.
- We store the data as individual pickle files, one for each match. The data that gets stored in the pickle is a list of dicts. One dict per frame. Each dict has keys:
    - 'a' (adjacency matrix) [np.array of shape (players+ball, players+ball)]
    - 'x' (node features) [np.array of shape (n_nodes, n_node_features)]. The currently implemented node features (in order) are:
        - normalized x-coordinate
        - normalized y-coordinate
        - x component of the velocity unit vector
        - y component of the velocity unit vector
        - normalized speed
        - normalized angle of velocity vector
        - normalized distance to goal
        - normalized angle to goal
        - normalized distance to ball
        - normalized angle to ball
        - attacking (1) or defending team (`defending_team_node_value`) 
        - potential receiver (1) else `non_potential_receiver_node_value` 
    - 'e' (edge features) [np.array of shape (np.count_nonzero(a), n_edge_features)]. The currently implemented edge features (in order) are:
        - normalized inter-player distance
        - normalized inter-player speed difference
        - inter-player angle cosine
        - inter-player angle sine
        - inter-player velocity vector cosine
        - inter-player velocity vector sine 
        - optional: 1 if two players are connected else 0 according to the delaunay adjacency matrix. Only if adjacency_matrix_type is NOT 'delaunay'
    - 'y' (label) [np.array] 
- We will end up with fewer than 2,000 graphs, even though we set `limit=500` frames for each of the 4 matches, because we set `include_empty_frames=False`
- When using different providers always set `include_empty_frames=False` or `only_alive=True`
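
The "list of dicts, one dict per frame" layout described above can be sketched with a toy example. The values below are made up and use plain lists instead of NumPy arrays, and the file name is hypothetical; the point is only the round trip through `pickle` and the relation between the non-zero entries in 'a' and the rows in 'e'.

```python
import os
import pickle
import tempfile

# Toy stand-in for converter output: a list with one dict per frame.
# Real entries hold NumPy arrays; plain lists are used here for brevity.
frames = [
    {
        "a": [[0, 1], [1, 0]],            # 2 nodes, 2 non-zero entries
        "x": [[0.1, 0.2], [0.3, 0.4]],    # one row of features per node
        "e": [[0.5], [0.5]],              # one row of features per edge
        "y": [1],                         # label
    },
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_match.pickle")
    with open(path, "wb") as file:
        pickle.dump(frames, file)
    with open(path, "rb") as file:
        loaded = pickle.load(file)

first = loaded[0]
# The number of edge-feature rows equals the number of non-zero entries in 'a'.
n_edges = sum(1 for row in first["a"] for value in row if value != 0)
print(sorted(first.keys()), n_edges, len(first["e"]))
```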

In [14]:
from os.path import exists

from unravel.utils import dummy_labels, dummy_graph_ids

match_ids = [4039, 3749, 3518, 3442]
pickle_file_path = "pickle_files/{match_id}.pickle"

for match_id in match_ids:
    match_pickle_file_path = pickle_file_path.format(match_id=match_id)
    # if the output file already exists, skip this whole step
    if not exists(match_pickle_file_path):

        # Load Kloppy dataset
        dataset = skillcorner.load_open_data(
            match_id=match_id,
            coordinates="secondspectrum",
            include_empty_frames=False,
            limit=500,  # limit to 500 frames in this example
        )

        # Initialize the GNN Converter, with dataset, labels and settings
        converter = GraphConverter(
            dataset=dataset,
            # create fake labels
            labels=dummy_labels(dataset),
            graph_id=match_id,
            # graph_ids=dummy_graph_ids(dataset),
            # settings
            ball_carrier_treshold=25.0,
            max_player_speed=12.0,
            max_ball_speed=28.0,
            boundary_correction=None,
            self_loop_ball=False,
            adjacency_matrix_connect_type="ball",
            adjacency_matrix_type="split_by_team",
            label_type="binary",
            infer_ball_ownership=True,
            infer_goalkeepers=True,
            defending_team_node_value=0.1,
            non_potential_receiver_node_value=0.1,
            random_seed=False,
            pad=True,
            verbose=False,
        )
        # Compute the graphs and directly store them as a pickle file
        converter.to_pickle(file_path=match_pickle_file_path)

-------
### 6. Creating a Custom Graph Dataset

- `CustomGraphDataset` (or `CounterDataset` as it's named in [U.S. Soccer Federation GNN Repository](https://github.com/USSoccerFederation/ussf_ssac_23_soccer_gnn/blob/main/counterattack.ipynb)) is a [`spektral.data.Dataset`](https://graphneural.network/creating-dataset/). 
This type of dataset is required to properly load and train a Spektral GNN.
- The `CustomGraphDataset` has a custom method `add()` that allows us to add more Graphs. This is useful because we can load an individual match pickle file and add its graphs directly to `dataset`. (Note that this `dataset` is different from the previously loaded Kloppy dataset!)

In [15]:
import pickle


def load_pickle(file_path):
    with open(file_path, "rb") as file:
        # Deserialize the object from the file
        graph_data = pickle.load(file)
    return graph_data

In [16]:
from unravel.soccer import CustomGraphDataset

dataset: CustomGraphDataset = None

for match_id in match_ids:
    graph_data = load_pickle(file_path=pickle_file_path.format(match_id=match_id))

    if dataset is None:
        dataset = CustomGraphDataset(data=graph_data)
    else:
        dataset.add(graph_data, verbose=True)

print("Complete:", dataset)

Loading 477 graphs into CustomGraphDataset...


Adding 380 graphs to CustomGraphDataset...
Adding 336 graphs to CustomGraphDataset...
Adding 411 graphs to CustomGraphDataset...
Complete: CustomGraphDataset(n_graphs=1604)


---------
### 7. Prepare for Training

Now that we have all the data converted as Graphs inside our `CustomGraphDataset` object, we can prepare to train the GNN model.

We first get all necessary information from our dataset that we need to train our model, namely:
- N = Max amount of nodes in a single graph
- F = Number of Node Features
- S = Number of Edge Features
- n_out = Dimension of the target
- n = Number of samples in dataset

Please ignore the 'weird' naming convention, this simply copies the Spektral documentation.


In [17]:
N, F, S, n_out, n = dataset.dimensions()
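
For intuition, the sketch below derives the same five quantities from a toy list of graphs. How `dimensions()` computes them internally is an assumption here; the toy graphs use plain lists and made-up sizes (12 node features and 7 edge features, matching the feature lists in section 5).

```python
# Toy graphs in the 'x' / 'e' / 'y' layout, using plain lists instead of NumPy arrays.
graphs = [
    {"x": [[0.0] * 12] * 23, "e": [[0.0] * 7] * 40, "y": [1]},
    {"x": [[0.0] * 12] * 21, "e": [[0.0] * 7] * 36, "y": [0]},
]

N = max(len(g["x"]) for g in graphs)  # max number of nodes in a single graph
F = len(graphs[0]["x"][0])            # number of node features
S = len(graphs[0]["e"][0])            # number of edge features
n_out = len(graphs[0]["y"])           # dimension of the target
n = len(graphs)                       # number of samples

print(N, F, S, n_out, n)  # 23 12 7 1 2
```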

#### 7.1 Split Dataset

Our `dataset` object has two custom methods to help split the data into train, test and validation sets.
Either use `dataset.split_test_train()` if we don't need a validation set, or `dataset.split_test_train_validation()` if we do also require a validation set.

We can split our data amongst these subsets 'by_graph_id' if we have provided Graph Ids in our `GraphConverter` using the 'graph_id' or 'graph_ids' parameter.
The 'split_train', 'split_test' and 'split_validation' parameters can either be ratios, percentages or relative size compared to total. 

Note: Because we are splitting by only 4 different graph_ids here (the 4 match_ids), the ratios aren't perfectly 4 to 1 to 1. If you change the `graph_id=match_id` parameter in the `GraphConverter` to `graph_ids=dummy_graph_ids(dataset)` you'll see that it's easier to get close to the correct ratios, simply because we have a lot more graph_ids to split across.

In [18]:
train, test, val = dataset.split_test_train_validation(
    split_train=4, split_test=1, split_validation=1, by_graph_id=True, random_seed=42
)
print("Train:", train)
print("Test:", test)
print("Validation:", val)

Train: CustomGraphDataset(n_graphs=791)
Test: CustomGraphDataset(n_graphs=477)
Validation: CustomGraphDataset(n_graphs=336)
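
The uneven subset sizes follow from assigning whole groups to a subset. The sketch below illustrates this with a hypothetical greedy strategy (each group goes to the subset that is furthest below its target size). It is not the library's splitting logic, so the resulting sizes differ from the output above; the per-match graph counts are taken from the loading output in section 6.

```python
def split_by_group(group_sizes, ratios):
    """Assign whole groups to subsets, roughly following `ratios`.

    Hypothetical greedy strategy: each group goes to the subset with the
    largest remaining deficit relative to its target size.
    """
    total = sum(group_sizes.values())
    targets = [total * ratio / sum(ratios) for ratio in ratios]
    subsets = [[] for _ in ratios]
    counts = [0] * len(ratios)
    for group_id, size in group_sizes.items():
        deficits = [target - count for target, count in zip(targets, counts)]
        index = deficits.index(max(deficits))
        subsets[index].append(group_id)
        counts[index] += size
    return subsets, counts


# Graphs per match_id, as reported when loading the pickle files.
graphs_per_match = {4039: 477, 3749: 380, 3518: 336, 3442: 411}
subsets, counts = split_by_group(graphs_per_match, ratios=(4, 1, 1))
print(subsets)  # [[4039, 3749], [3518], [3442]]
print(counts)   # [857, 336, 411]
```

With only four groups, no assignment of whole matches can reach an exact 4:1:1 ratio, whichever strategy is used.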


#### 7.2 Model Configurations

In [19]:
learning_rate = 1e-3
epochs = 5  # Increase for actual training
batch_size = 32
channels = 128
n_layers = 3  # Number of CrystalConv layers

#### 7.3 Create DataLoaders

Create a Spektral [`DisjointLoader`](https://graphneural.network/loaders/#disjointloader). This DisjointLoader will help us to load batches of Graphs for training purposes.

We'll skip creating the validation loader for now.

In [20]:
from spektral.data import DisjointLoader

loader_tr = DisjointLoader(train, batch_size=batch_size, epochs=epochs)
loader_te = DisjointLoader(test, batch_size=batch_size, epochs=1, shuffle=False)
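
In disjoint mode, a batch stacks the nodes of several graphs and adds an index vector that records which graph each node belongs to. The simplified sketch below shows only the node features and that index vector (the `i` that the model's `call` method receives); the function name and toy values are hypothetical, and the adjacency matrix and edge features are left out.

```python
def disjoint_batch(node_features_per_graph):
    """Stack node features of several graphs and record each node's graph index."""
    x, i = [], []
    for graph_index, nodes in enumerate(node_features_per_graph):
        x.extend(nodes)
        i.extend([graph_index] * len(nodes))
    return x, i


graph_a = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # 3 nodes
graph_b = [[0.7, 0.8], [0.9, 1.0]]              # 2 nodes
x, i = disjoint_batch([graph_a, graph_b])
print(len(x), i)  # 5 [0, 0, 0, 1, 1]
```

The pooling layer uses this index vector to average node embeddings per graph rather than over the whole batch.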

#### 7.4 Build GNN Model

This GNN Model has the same architecture as described in [A Graph Neural Network Deep-dive into Successful Counterattacks {A. Sahasrabudhe & J. Bekkers}](https://github.com/USSoccerFederation/ussf_ssac_23_soccer_gnn/tree/main)

In [21]:
from spektral.layers import GlobalAvgPool, CrystalConv
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.losses import BinaryCrossentropy
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam


class GNN(Model):
    """
    Graph Neural Network configuration, subclassing the Keras `Model` class
    and built from Spektral layers.
    """

    def __init__(self, n_layers, channels, n_out):
        """
        Constructor code for setting up the layers needed for training the model.
        """
        super().__init__()
        self.conv1 = CrystalConv()
        self.convs = []
        for _ in range(1, n_layers):
            self.convs.append(CrystalConv())
        self.pool = GlobalAvgPool()
        self.dense1 = Dense(channels, activation="relu")
        self.dropout = Dropout(0.5)
        self.dense2 = Dense(channels, activation="relu")
        self.dense3 = Dense(n_out, activation="sigmoid")

    def call(self, inputs):
        """
        Build the neural network.
        """
        x, a, e, i = inputs
        x = self.conv1([x, a, e])
        for conv in self.convs:
            x = conv([x, a, e])
        x = self.pool([x, i])
        x = self.dense1(x)
        x = self.dropout(x)
        x = self.dense2(x)
        x = self.dropout(x)
        return self.dense3(x)


# Build model
model = GNN(n_layers=n_layers, channels=channels, n_out=n_out)
# Setup the optimizer
optimizer = Adam(learning_rate)
# Set up the logloss function
loss_fn = BinaryCrossentropy()

--------
### 8. Fit Model

In [22]:
import tensorflow as tf


# Compile the training step into a TensorFlow graph using the loader's input signature
@tf.function(input_signature=loader_tr.tf_signature())
def train_step(inputs, target):
    with tf.GradientTape() as tape:
        predictions = model(inputs, training=True)
        loss = loss_fn(target, predictions) + sum(model.losses)
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    return loss


# Print loss at each step of training.
step = loss = 0
for batch in loader_tr:
    step += 1
    loss += train_step(*batch)
    if step == loader_tr.steps_per_epoch:
        step = 0
        print("Loss: {}".format(loss / loader_tr.steps_per_epoch))
        loss = 0

Loss: 5.872689723968506
Loss: 1.6423248052597046
Loss: 0.9415183067321777
Loss: 0.7331467270851135
Loss: 0.7081347107887268
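
Because the labels in this example are random, the loss should not be expected to drop much further. As a rough point of reference, the sketch below computes the binary cross-entropy by hand for a model that always predicts 0.5; the helper function is an illustrative reimplementation, not the Keras one.

```python
import math


def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy over a batch of (label, probability) pairs."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


# Always predicting 0.5 gives a loss of ln(2), regardless of the labels.
loss = binary_cross_entropy([1, 0, 1, 0], [0.5, 0.5, 0.5, 0.5])
print(round(loss, 4))  # 0.6931
```

The final epoch loss printed above is close to this value, which is consistent with a model that has no real signal to learn from.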
