## 🌀 unravel kloppy into graph neural network!

First run `pip install unravelsports` if you haven't already!


-----


In [None]:
%pip install unravelsports --quiet

In this in-depth walkthrough we'll discuss everything the `unravelsports` package has to offer for converting a [Kloppy](https://github.com/PySport/kloppy) dataset of soccer tracking data into graphs for training binary classification graph neural networks using the [Spektral](https://graphneural.network/) library.

This walkthrough will touch on a lot of the concepts from [A Graph Neural Network Deep-dive into Successful Counterattacks {A. Sahasrabudhe & J. Bekkers}](https://github.com/USSoccerFederation/ussf_ssac_23_soccer_gnn). It is strongly advised to first read the [research paper (pdf)](https://ussf-ssac-23-soccer-gnn.s3.us-east-2.amazonaws.com/public/Sahasrabudhe_Bekkers_SSAC23.pdf). Some concepts are also explained in the [Graphs FAQ](graphs_faq.md).

Step by step we'll show how this package can be used to load soccer positional (tracking) data with `kloppy`, how to convert this data into "graphs", train a Graph Neural Network with `spektral`, evaluate its performance, save and load the model, and finally apply the model to unseen data to make predictions.

The powerful Kloppy package allows us to load and standardize data from many providers: Metrica, Sportec, Tracab, SecondSpectrum, StatsPerform and SkillCorner. In this guide we'll use some matches from the [Public SkillCorner Dataset](https://github.com/SkillCorner/opendata).

<br>
<i>Before we get started it is important to note that the <b>unravelsports</b> library does not have built-in functionality to create binary labels; these will need to be supplied by the reader. In this example we use the <b>dummy_labels()</b> functionality that comes with the package. This function creates a single binary label for each frame by randomly assigning it a 0 or 1 value.
</i>
<br>

##### **Contents**

- [**1. Imports**](#1-imports).
- [**2. Public SkillCorner Data**](#2-public-skillcorner-data).
- [**3. Graph Converter**](#3-graph-converter).
- [**4. Load Kloppy Data, Convert & Store**](#4-load-kloppy-data-convert-and-store).
- [**5. Creating a Custom Graph Dataset**](#5-creating-a-custom-graph-dataset).
- [**6. Prepare for Training**](#6-prepare-for-training).
    - [6.1 Split Dataset](#61-split-dataset)
    - [6.2 Model Configurations](#62-model-configurations)
    - [6.3 Build GNN Model](#63-build-gnn-model)
    - [6.4 Create DataLoaders](#64-create-dataloaders)
- [**7. GNN Training + Prediction**](#7-training-and-prediction).
    - [7.1 Compile Model](#71-compile-model)
    - [7.2 Fit Model](#72-fit-model)
    - [7.3 Save & Load Model](#73-save--load-model)
    - [7.4 Evaluate Model](#74-evaluate-model)
    - [7.5 Predict on New Data](#75-predict-on-new-data)

ℹ️ [**Graphs FAQ**](graphs_faq.md)

-----

### 1. Imports

We import `SoccerGraphConverter` to help us convert from Kloppy positional tracking frames to graphs.

With the power of **Kloppy** we can also load data from many providers by importing `metrica`, `sportec`, `tracab`, `secondspectrum`, or `statsperform` from `kloppy`.

In [4]:
from unravel.soccer import SoccerGraphConverter

from kloppy import skillcorner

-----

### 2. Public SkillCorner Data

The `SoccerGraphConverter` class can process data from every tracking data provider supported by [PySport Kloppy](https://github.com/PySport/kloppy), namely:
- Sportec
- Tracab
- SecondSpectrum
- SkillCorner
- StatsPerform
- Metrica

In this example we're going to use a sample of tracking data from 4 matches of [publicly available SkillCorner data](https://github.com/SkillCorner/opendata). 

All we need to know for now is that this data is from the following matches:

|  id | date_time           | home_team   | away_team   |
|---:|:---------------------:|:-----------------------|:-----------------------|
|  4039 | 2020-07-02T19:15:00Z | Manchester City        | Liverpool              |
|  3749 | 2020-05-26T16:30:00Z | Dortmund               | Bayern Munchen         |
|  3518 | 2020-03-08T19:45:00Z | Juventus               | Inter                  |
|  3442 | 2020-03-01T20:00:00Z | Real Madrid            | FC Barcelona           |

-----

### 3. Graph Converter

ℹ️ For more information on:
- What a Graph is, check out [Graph FAQ Section A](graphs_faq.ipynb)
- What parameters we can pass to the `SoccerGraphConverter`, check out [Graph FAQ Section B](graphs_faq.ipynb)
- What features each Graph has, check out [Graph FAQ Section C](graphs_faq.ipynb)

---

To get started with the `SoccerGraphConverter` we need to pass one _required_ parameter:
- `dataset` (of type `TrackingDataset` (Kloppy)) 

And one parameter that's required when we're converting for training purposes (more on this later):
- `labels` (a dictionary with `frame_id`s as keys and a value of `{True, False, 1 or 0}`).
```python
# All three dictionaries below are treated as equivalent labels
{83340: True, 83341: False}  # etc.
{83340: 1, 83341: 0}
{83340: 1, 83341: False}
```
⚠️ As mentioned before you will need to create your own labels! In this example we'll use `dummy_labels(dataset)` to generate a fake label for each frame.
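
Since the labels have to come from your own data, one way to build the `labels` dictionary is to start from the frame ids at which the outcome of interest occurred. The sketch below is a minimal illustration; the helper `labels_from_positive_frames` and the frame ids are hypothetical and not part of `unravelsports`.

```python
from typing import Dict, Iterable, Set


def labels_from_positive_frames(
    frame_ids: Iterable[int], positive_frames: Set[int]
) -> Dict[int, int]:
    """Hypothetical helper: label a frame 1 if it is in positive_frames, else 0."""
    return {frame_id: int(frame_id in positive_frames) for frame_id in frame_ids}


# Made-up frame ids, purely for illustration
frame_ids = range(83340, 83345)
positive_frames = {83340, 83343}

labels = labels_from_positive_frames(frame_ids, positive_frames)
print(labels)
```

A dictionary built this way has the same shape as the one returned by `dummy_labels(dataset)`: one entry per frame id, with a binary value.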

#### Graph Identifier(s):
When training a model on tracking data it's highly recommended to split data into test/train(/validation) sets by match or period, such that all data from one match or period ends up in the same test, train or validation set. This should be done to avoid leaking information between the test, train and validation sets. To make this simple, there are two _optional_ parameters we can pass to `SoccerGraphConverter`, namely:
- `graph_id`. This is a single identifier (str or int) for a whole match, for example the unique match id.
- `graph_ids`. This is a dictionary with the same keys as `labels`, but the values are now the unique identifiers. This option can be used if we want to split by sequence or possession_id. For example: `{frame_id: 'matchId-sequenceId', frame_id: 'matchId-sequenceId2'}`, etc. You will need to create your own ids. Note that an error is raised if `labels` and `graph_ids` don't have the exact same keys.

In this example we'll use the `graph_id=match_id` as the unique identifier, but feel free to change that for `graph_ids=dummy_graph_ids(dataset)` to test out that behavior.

Correctly splitting the final dataset into train, test and validation sets using these Graph Identifiers is incorporated into `GraphDataset` (see [Section 6.1](#61-split-dataset) for more information).
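
To make the `graph_ids` option more concrete, here is a minimal sketch of building such a dictionary from per-frame sequence ids and checking that its keys match `labels` before conversion. The helper `build_graph_ids`, the sequence ids and the frame ids are all hypothetical; how you define a sequence depends on your own data.

```python
from typing import Dict, Hashable


def build_graph_ids(
    match_id: Hashable, frame_to_sequence: Dict[int, int]
) -> Dict[int, str]:
    """Hypothetical helper: combine a match id and a per-frame sequence id."""
    return {
        frame_id: f"{match_id}-{sequence_id}"
        for frame_id, sequence_id in frame_to_sequence.items()
    }


# Made-up frames: the first two belong to sequence 1, the last two to sequence 2
frame_to_sequence = {83340: 1, 83341: 1, 83342: 2, 83343: 2}
labels = {83340: 1, 83341: 0, 83342: 0, 83343: 1}

graph_ids = build_graph_ids(4039, frame_to_sequence)

# Both dictionaries need exactly the same keys, so check this up front
mismatched = set(labels) ^ set(graph_ids)
if mismatched:
    raise ValueError(f"labels and graph_ids differ for frames: {sorted(mismatched)}")

print(graph_ids)
```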

------


### 4. Load Kloppy Data, Convert and Store

As mentioned in [Section 2](#2-public-skillcorner-data) we will use 4 matches of SkillCorner data. In the example below we load the first 500 frames from each of these 4 games (we set `limit=500`), for a maximum of 2,000 samples. We will actually end up with fewer than 2,000 samples, because setting `include_empty_frames=False` means some frames are skipped in the conversion step.

Important things to note:
- We import `dummy_labels` to randomly generate binary labels. Training with these random labels will not create a good model.
- We import `dummy_graph_ids` to generate fake graph ids.
- The `SoccerGraphConverter` handles all necessary steps (like setting the correct coordinate system, and left-right normalization).
- We will end up with fewer than 2,000 samples even though we set `limit=500`, because we set `include_empty_frames=False` and all frames without ball coordinates are automatically omitted.
- When using other providers always set `include_empty_frames=False` or `only_alive=True`.
- We store the data as individual compressed pickle files, one file per match. The data stored in each pickle is a list of dictionaries, one dictionary per frame. Each dictionary has keys for the adjacency matrix, node features, edge features, label and graph id.

In [5]:
from os.path import exists

from unravel.utils import dummy_labels, dummy_graph_ids

match_ids = [4039, 3749, 3518, 3442]
pickle_folder = "pickles"
compressed_pickle_file_path = "{pickle_folder}/{match_id}.pickle.gz"

for match_id in match_ids:
    match_pickle_file_path = compressed_pickle_file_path.format(
        pickle_folder=pickle_folder, match_id=match_id
    )
    # if the output file already exists, skip this whole step
    if not exists(match_pickle_file_path):

        # Load Kloppy dataset
        dataset = skillcorner.load_open_data(
            match_id=match_id,
            coordinates="secondspectrum",
            include_empty_frames=False,
            limit=500,  # limit to 500 frames in this example
        )

        # Initialize the Graph Converter, with dataset, labels and settings
        converter = SoccerGraphConverter(
            dataset=dataset,
            # create fake labels
            labels=dummy_labels(dataset),
            graph_id=match_id,
            # graph_ids=dummy_graph_ids(dataset),
            # Settings
            ball_carrier_treshold=25.0,
            max_player_speed=12.0,
            max_ball_speed=28.0,
            boundary_correction=None,
            self_loop_ball=True,
            adjacency_matrix_connect_type="ball",
            adjacency_matrix_type="split_by_team",
            label_type="binary",
            infer_ball_ownership=True,
            infer_goalkeepers=True,
            defending_team_node_value=0.1,
            non_potential_receiver_node_value=0.1,
            random_seed=False,
            pad=True,
            verbose=False,
        )
        # Compute the graphs and directly store them as a pickle file
        converter.to_pickle(file_path=match_pickle_file_path)

Processing frames: 100%|██████████| 500/500 [00:02<00:00, 244.81it/s]
Processing frames: 100%|██████████| 500/500 [00:01<00:00, 285.65it/s]
Processing frames: 100%|██████████| 500/500 [00:01<00:00, 343.58it/s] 
Processing frames: 100%|██████████| 500/500 [00:01<00:00, 285.17it/s]


ℹ️ For a full table of parameters we can pass to the `SoccerGraphConverter` check out [Graph FAQ Section B](graphs_faq.ipynb)
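
The `.pickle.gz` files described above can be inspected without the package. The following sketch writes and reads a list of per-frame dictionaries with Python's `gzip` and `pickle` modules. The key names (`a`, `x`, `e`, `y`, `id`) and the tiny matrices are assumptions for illustration; the exact layout that `to_pickle` writes may differ.

```python
import gzip
import pickle
import tempfile
from pathlib import Path

# Stand-in for one match: a list with one dictionary per frame.
# Key names and values are illustrative assumptions, not the package's schema.
frames = [
    {"a": [[1, 1], [1, 1]], "x": [[0.1, 0.2], [0.3, 0.4]], "e": [[0.5]], "y": 1, "id": 4039},
    {"a": [[1, 0], [0, 1]], "x": [[0.2, 0.1], [0.4, 0.3]], "e": [[0.6]], "y": 0, "id": 4039},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    file_path = Path(tmp_dir) / "4039.pickle.gz"

    # Write: pickle the list straight into a gzip stream
    with gzip.open(file_path, "wb") as f:
        pickle.dump(frames, f)

    # Read: the same two modules reverse the process
    with gzip.open(file_path, "rb") as f:
        loaded = pickle.load(f)

print(len(loaded), "frames, keys:", sorted(loaded[0]))
```

Only unpickle files you created yourself or otherwise trust, since loading a pickle can execute arbitrary code.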

-----

### 5. Creating a Custom Graph Dataset

To easily train our model with the Spektral library we need to use a Spektral dataset object. The `GraphDataset` class helps us create such an object really easily.

- `GraphDataset` is a [`spektral.data.Dataset`](https://graphneural.network/creating-dataset/). 
This type of dataset makes it very easy to properly load, train and predict with a Spektral GNN.
- The `GraphDataset` has an option to load from a folder of compressed pickle files, all we have to do is pass the pickle_folder location.

ℹ️ For more information on the `GraphDataset` please check the [Graphs FAQ Section D](graphs_faq.ipynb)

In [None]:
from unravel.utils import GraphDataset

dataset = GraphDataset(pickle_folder=pickle_folder)

### 6. Prepare for Training

Now that we have all the data converted into Graphs inside our `GraphDataset` object, we can prepare to train the GNN model.


#### 6.1 Split Dataset

Our `dataset` object has two custom methods to help split the data into train, test and validation sets.
Either use `dataset.split_test_train()` if we don't need a validation set, or `dataset.split_test_train_validation()` if we do also require a validation set.

We can split our data `by_graph_id` if we have provided Graph Ids in our `SoccerGraphConverter` using the `graph_id` or `graph_ids` parameter.

The `split_train`, `split_test` and `split_validation` parameters can be given as ratios, percentages or sizes relative to the total.

We opt to create a test, train _and_ validation set to use in our example.

In [7]:
train, test, val = dataset.split_test_train_validation(
    split_train=4, split_test=1, split_validation=1, by_graph_id=True, random_seed=42
)
print("Train:", train)
print("Test:", test)
print("Validation:", val)

Train: CustomSpektralDataset(n_graphs=791)
Test: CustomSpektralDataset(n_graphs=477)
Validation: CustomSpektralDataset(n_graphs=336)


🗒️ We can see that, because we are splitting by only 4 different graph_ids here (the 4 match_ids), the ratios aren't perfectly 4 to 1 to 1. If you change the `graph_id=match_id` parameter in the `SoccerGraphConverter` to `graph_ids=dummy_graph_ids(dataset)` you'll see that it's easier to get close to the correct ratios, simply because we have a lot more graph_ids to split across.
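
The effect described in the note above can be reproduced with a small stand-alone sketch. It assigns whole groups to splits with a simple greedy rule, so no group ever ends up in two splits. This is a simplified illustration with made-up group sizes and a hypothetical function `split_by_group`; it is not the algorithm used inside `GraphDataset`.

```python
import random
from collections import Counter
from typing import Dict, Hashable, List, Sequence


def split_by_group(
    group_ids: Sequence[Hashable], ratios: Sequence[float], seed: int = 42
) -> List[List[int]]:
    """Assign whole groups to splits and return the sample indices per split."""
    counts = Counter(group_ids)
    groups = sorted(counts, key=str)
    random.Random(seed).shuffle(groups)

    total = len(group_ids)
    targets = [r / sum(ratios) * total for r in ratios]
    sizes = [0] * len(ratios)
    assignment: Dict[Hashable, int] = {}

    # Greedy: put each group in the split that is furthest below its target size
    for group in groups:
        split = max(range(len(ratios)), key=lambda s: targets[s] - sizes[s])
        assignment[group] = split
        sizes[split] += counts[group]

    splits: List[List[int]] = [[] for _ in ratios]
    for index, group in enumerate(group_ids):
        splits[assignment[group]].append(index)
    return splits


# Few large groups: four "matches" with made-up numbers of usable frames
group_ids = ["A"] * 400 + ["B"] * 450 + ["C"] * 350 + ["D"] * 400
train_idx, test_idx, val_idx = split_by_group(group_ids, ratios=(4, 1, 1))
print("4 groups: ", [len(s) for s in (train_idx, test_idx, val_idx)])

# Many small groups: 40 "sequences" of 40 frames each
many_ids = [f"seq-{k}" for k in range(40) for _ in range(40)]
train_many, test_many, val_many = split_by_group(many_ids, ratios=(4, 1, 1))
print("40 groups:", [len(s) for s in (train_many, test_many, val_many)])
```

With only four groups the split sizes can land far from the requested 4:1:1, while with forty small groups they end up much closer.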

#### 6.2 Model Configurations

In [8]:
learning_rate = 1e-3
epochs = 5  # Increase for actual training
batch_size = 32
channels = 128
n_layers = 3  # Number of CrystalConv layers

#### 6.3 Build GNN Model

This GNN Model has the same architecture as described in [A Graph Neural Network Deep-dive into Successful Counterattacks {A. Sahasrabudhe & J. Bekkers}](https://github.com/USSoccerFederation/ussf_ssac_23_soccer_gnn/tree/main)

This exact model can also simply be loaded as:

`from unravel.classifiers import CrystalGraphClassifier` as shown in [Quick Start Guide](0_quick_start_guide.ipynb)

Below we show the exact same code to make it easier to adjust.

In [9]:
from spektral.layers import GlobalAvgPool, CrystalConv
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.models import Model


class CrystalGraphClassifier(Model):
    def __init__(
        self,
        n_layers: int = 3,
        channels: int = 128,
        drop_out: float = 0.5,
        n_out: int = 1,
        **kwargs
    ):
        super().__init__(**kwargs)

        self.n_layers = n_layers
        self.channels = channels
        self.drop_out = drop_out
        self.n_out = n_out

        self.conv1 = CrystalConv()
        self.convs = [CrystalConv() for _ in range(1, self.n_layers)]
        self.pool = GlobalAvgPool()
        self.dense1 = Dense(self.channels, activation="relu")
        self.dropout = Dropout(self.drop_out)
        self.dense2 = Dense(self.channels, activation="relu")
        self.dense3 = Dense(self.n_out, activation="sigmoid")

    def call(self, inputs):
        x, a, e, i = inputs
        x = self.conv1([x, a, e])
        for conv in self.convs:
            x = conv([x, a, e])
        x = self.pool([x, i])
        x = self.dense1(x)
        x = self.dropout(x)
        x = self.dense2(x)
        x = self.dropout(x)
        return self.dense3(x)

#### 6.4 Create DataLoaders

Create a Spektral [`DisjointLoader`](https://graphneural.network/loaders/#disjointloader). This DisjointLoader will help us to load batches of Disjoint Graphs for training purposes.
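
As a rough picture of what a disjoint batch is, the sketch below merges two toy graphs into one big graph: node features are stacked, adjacency matrices are placed on a block diagonal, and an index vector `i` records which graph each node belongs to. It is written in plain Python with dense lists and leaves out edge features, so it only illustrates the idea behind the `x, a, e, i` inputs and the pooling step; it is not Spektral's implementation.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def disjoint_batch(graphs: List[Tuple[Matrix, Matrix]]) -> Tuple[Matrix, Matrix, List[int]]:
    """Merge (x, a) pairs into stacked x, block-diagonal a and batch index i."""
    n_total = sum(len(x) for x, _ in graphs)
    x_batch: Matrix = []
    a_batch: Matrix = [[0.0] * n_total for _ in range(n_total)]
    i_batch: List[int] = []

    offset = 0
    for graph_index, (x, a) in enumerate(graphs):
        n_nodes = len(x)
        x_batch.extend(x)
        i_batch.extend([graph_index] * n_nodes)
        # Copy this graph's adjacency matrix onto its diagonal block
        for row in range(n_nodes):
            for col in range(n_nodes):
                a_batch[offset + row][offset + col] = a[row][col]
        offset += n_nodes
    return x_batch, a_batch, i_batch


def global_avg_pool(x: Matrix, i: List[int]) -> Matrix:
    """Average node features per graph, using i to tell the graphs apart."""
    n_graphs = max(i) + 1
    n_features = len(x[0])
    sums = [[0.0] * n_features for _ in range(n_graphs)]
    counts = [0] * n_graphs
    for features, graph_index in zip(x, i):
        counts[graph_index] += 1
        for k, value in enumerate(features):
            sums[graph_index][k] += value
    return [[s / c for s in row] for row, c in zip(sums, counts)]


# Two toy graphs: one with 2 nodes, one with 3 nodes, 1 feature per node
g1 = ([[1.0], [3.0]], [[0.0, 1.0], [1.0, 0.0]])
g2 = ([[2.0], [4.0], [6.0]], [[0.0, 1.0, 1.0], [1.0, 0.0, 1.0], [1.0, 1.0, 0.0]])

x, a, i = disjoint_batch([g1, g2])
print(i)                      # [0, 0, 1, 1, 1]
print(global_avg_pool(x, i))  # [[2.0], [4.0]]
```

Because the blocks never connect, message passing on the merged graph cannot mix information between the original graphs, and the index vector lets the pooling layer produce one output row per graph.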

Note that these Spektral `Loaders` return a generator, so if we want to retrain the model, we need to reload these loaders.

In [10]:
from spektral.data import DisjointLoader

loader_tr = DisjointLoader(train, batch_size=batch_size, epochs=epochs)
loader_va = DisjointLoader(val, epochs=1, shuffle=False, batch_size=batch_size)

--------

### 7. Training and Prediction

Below we outline how to train the model, make predictions and add the predicted values back to the Kloppy dataframe.

#### 7.1 Compile Model

1. Initialize the `CrystalGraphClassifier` (or create your own Graph Classifier).
2. Compile the model with a loss function, optimizer and your preferred metrics.

In [None]:
from tensorflow.keras.metrics import AUC, BinaryAccuracy
from tensorflow.keras.losses import BinaryCrossentropy
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping

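# Use the settings from Section 6.2 instead of relying on the defaults
model = CrystalGraphClassifier(n_layers=n_layers, channels=channels)

model.compile(
    loss=BinaryCrossentropy(),
    optimizer=Adam(learning_rate=learning_rate),
    metrics=[AUC(), BinaryAccuracy()],
)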

#### 7.2 Fit Model

1. We have a [`DisjointLoader`](https://graphneural.network/loaders/#disjointloader) for both the training and validation sets.
2. Fit the model. 
3. We add `EarlyStopping` and a `validation_data` dataset to monitor performance, and set `use_multiprocessing=True` to improve training speed.

⚠️ When trying to fit the model _again_ make sure to reload Data Loaders in [Section 6.4](#64-create-dataloaders), because they are generators.
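
To see why the loaders need to be recreated, here is a toy generator that behaves like a loader with a fixed number of epochs. The function `batch_generator` is a hypothetical plain-Python stand-in, not Spektral's implementation.

```python
from typing import Iterator, List


def batch_generator(
    samples: List[int], batch_size: int, epochs: int
) -> Iterator[List[int]]:
    """Toy stand-in for a loader: yields batches for a fixed number of epochs."""
    for _ in range(epochs):
        for start in range(0, len(samples), batch_size):
            yield samples[start : start + batch_size]


samples = list(range(10))

loader = batch_generator(samples, batch_size=4, epochs=2)
first_run = list(loader)   # consumes every batch of both epochs
second_run = list(loader)  # the generator is now exhausted

print(len(first_run), len(second_run))  # 6 0

# Creating a fresh generator is the equivalent of re-running the loader cell
loader = batch_generator(samples, batch_size=4, epochs=2)
print(len(list(loader)))  # 6
```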

In [None]:
model.fit(
    loader_tr.load(),
    steps_per_epoch=loader_tr.steps_per_epoch,
    epochs=epochs,
    use_multiprocessing=True,
    validation_data=loader_va.load(),
    callbacks=[EarlyStopping(monitor="loss", patience=5, restore_best_weights=True)],
)

#### 7.3 Save & Load Model

This step is solely included to show how to restore a model.

In [None]:
from tensorflow.keras.models import load_model

model_path = "models/my-first-graph-classifier"
model.save(model_path)
loaded_model = load_model(model_path)

#### 7.4 Evaluate Model

1. Create another `DisjointLoader`, this time for the test set.
2. Evaluate model performance on the test set. This evaluation function uses the `metrics` passed to `model.compile`.

🗒️ Our performance is really bad because we're using random labels, very few epochs and a small dataset.
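
As a reference point for reading the evaluation output: a model that learns nothing from random, roughly balanced binary labels ends up predicting about 0.5 for every frame. The sketch below computes binary cross-entropy and accuracy by hand for that case. It uses plain Python rather than the Keras metrics, the labels are made up, and the assumption that the dummy labels are roughly balanced is ours.

```python
import math
import random
from typing import Sequence


def binary_cross_entropy(y_true: Sequence[int], y_pred: Sequence[float]) -> float:
    """Mean binary cross-entropy, written out explicitly."""
    eps = 1e-7  # avoid log(0)
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


def binary_accuracy(
    y_true: Sequence[int], y_pred: Sequence[float], threshold: float = 0.5
) -> float:
    """Share of predictions that fall on the correct side of the threshold."""
    hits = sum(int(p > threshold) == y for y, p in zip(y_true, y_pred))
    return hits / len(y_true)


rng = random.Random(0)
y_true = [rng.randint(0, 1) for _ in range(1000)]  # random labels (made up)
y_pred = [0.5] * len(y_true)                       # an uninformative model

print(round(binary_cross_entropy(y_true, y_pred), 4))  # 0.6931, i.e. ln(2)
print(binary_accuracy(y_true, y_pred))                 # close to 0.5
```

A loss near 0.69 and an accuracy near 0.5 on the test set are therefore what we should expect in this walkthrough, and only values clearly better than that indicate the model has learned something.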

📖 For more information on evaluation in sports analytics see: [Methodology and evaluation in sports analytics: challenges, approaches, and lessons learned {J. Davis et al. (2024)}](https://link.springer.com/article/10.1007/s10994-024-06585-0)


In [14]:
loader_te = DisjointLoader(test, epochs=1, shuffle=False, batch_size=batch_size)
results = model.evaluate(loader_te.load())



#### 7.5 Predict on New Data

1. Load new, unseen data from the SkillCorner dataset.
2. Convert this data, making sure we use the exact same converter settings as in [Section 4](#4-load-kloppy-data-convert-and-store).
3. If we set `prediction=True` we do not have to supply labels to the `SoccerGraphConverter`.

In [15]:
kloppy_dataset = skillcorner.load_open_data(
    match_id=2068,  # A game we have not yet used in section 4
    include_empty_frames=False,
    limit=500,
)

preds_converter = SoccerGraphConverter(
    dataset=kloppy_dataset,
    prediction=True,
    ball_carrier_treshold=25.0,
    max_player_speed=12.0,
    max_ball_speed=28.0,
    boundary_correction=None,
    self_loop_ball=True,
    adjacency_matrix_connect_type="ball",
    adjacency_matrix_type="split_by_team",
    label_type="binary",
    infer_ball_ownership=True,
    infer_goalkeepers=True,
    defending_team_node_value=0.1,
    non_potential_receiver_node_value=0.1,
    random_seed=False,
    pad=True,
    verbose=False,
)

4. Make a prediction on all the frames of this dataset using `model.predict`.

In [None]:
# Compute the graphs and add them to the GraphDataset
pred_dataset = GraphDataset(graphs=preds_converter.to_spektral_graphs())

loader_pred = DisjointLoader(
    pred_dataset, batch_size=batch_size, epochs=1, shuffle=False
)
preds = model.predict(loader_pred.load(), use_multiprocessing=True)

Processing frames: 100%|██████████| 500/500 [00:01<00:00, 326.02it/s]




5. Convert the Kloppy dataset to a dataframe and merge the predictions back in using the frame_ids.

In [17]:
import pandas as pd

kloppy_df = kloppy_dataset.to_df()

preds_df = pd.DataFrame(
    {"frame_id": [x.id for x in pred_dataset], "y": preds.flatten()}
)

kloppy_df = pd.merge(kloppy_df, preds_df, on="frame_id", how="left")

kloppy_df[300:305][["frame_id", "period_id", "timestamp", "y"]]

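|     | frame_id | period_id | timestamp              | y        |
|----:|---------:|----------:|:-----------------------|---------:|
| 300 | 2166     | 1         | 0 days 00:00:33.300000 | 0.259016 |
| 301 | 2167     | 1         | 0 days 00:00:33.400000 | 0.251124 |
| 302 | 2168     | 1         | 0 days 00:00:33.500000 | 0.258305 |
| 303 | 2169     | 1         | 0 days 00:00:33.600000 | 0.256378 |
| 304 | 2170     | 1         | 0 days 00:00:33.700000 | 0.305434 |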


🗒️ Not all frames have a prediction because of missing (ball) data, so we show rows 300 to 304, which do have one.
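
The effect of the left merge can be mimicked in plain Python: every frame is kept, and frames that could not be converted into a graph simply have no prediction. The frame ids and values below are made up for illustration.

```python
from typing import Dict, List, Optional

# Frames in the tracking data (made-up ids)
frame_ids: List[int] = [2166, 2167, 2168, 2169]

# Predictions exist only for frames that could be converted into a graph
predictions: Dict[int, float] = {2166: 0.26, 2168: 0.25}

# Equivalent of a left join on frame_id: keep all frames, fill gaps with None
merged: List[Dict[str, Optional[float]]] = [
    {"frame_id": frame_id, "y": predictions.get(frame_id)} for frame_id in frame_ids
]

missing = [row["frame_id"] for row in merged if row["y"] is None]
print(missing)  # [2167, 2169]
```

In the pandas dataframe the missing predictions show up as `NaN` in the `y` column, and they can be filtered out with `kloppy_df.dropna(subset=["y"])` if only predicted frames are needed.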