## 🌀 unravel kloppy into graph neural network using the _new_ Polars back-end!

First run `pip install unravelsports` if you haven't already!


-----


In [1]:
%pip install unravelsports --quiet

Note: you may need to restart the kernel to use updated packages.


In this in-depth walkthrough we'll discuss everything the `unravelsports` package has to offer for converting a [Kloppy](https://github.com/PySport/kloppy) dataset of soccer tracking data into graphs for training binary classification graph neural networks using the [Spektral](https://graphneural.network/) library, and a newly added (version==0.3.0+) [Polars](https://pola.rs/) back-end.

This walkthrough will touch on a lot of the concepts from [A Graph Neural Network Deep-dive into Successful Counterattacks {A. Sahasrabudhe & J. Bekkers}](https://github.com/USSoccerFederation/ussf_ssac_23_soccer_gnn). It is strongly advised to first read the [research paper (pdf)](https://ussf-ssac-23-soccer-gnn.s3.us-east-2.amazonaws.com/public/Sahasrabudhe_Bekkers_SSAC23.pdf). Some concepts are also explained in the [Graphs FAQ](graphs_faq.md).

Step by step we'll show how this package can be used to load soccer positional (tracking) data with `kloppy`, convert this data into a `KloppyPolarsDataset`, convert it into "graphs", train a Graph Neural Network with `spektral`, evaluate its performance, save and load the model, and finally apply the model to unseen data to make predictions.

The powerful Kloppy package allows us to load and standardize data from many providers: Metrica, Sportec, Tracab, SecondSpectrum, StatsPerform and SkillCorner. In this guide we'll use some matches from the [Public Sportec (DFL) Dataset (Bassek et al. 2025)](https://www.nature.com/articles/s41597-025-04505-y).

<br>
<i>Before we get started, it is important to note that the <b>unravelsports</b> library does not have built-in functionality to create binary labels; these need to be supplied by the reader. In this example we use dummy labels instead.
</i>
<br>

##### **Contents**

- [**1. Imports**](#1-imports).
- [**2. Public Sportec (DFL) Data**](#2-public-sportec-data).
- [**3. ⭐ _KloppyPolarsDataset_ and _SoccerGraphConverterPolars_**](#3--kloppypolarsdataset-and-soccergraphconverterpolars).
- [**4. Load Kloppy Data, Convert & Store**](#4-load-kloppy-data-convert-and-store).
- [**5. Creating a Custom Graph Dataset**](#5-creating-a-custom-graph-dataset).
- [**6. Prepare for Training**](#6-prepare-for-training).
    - [6.1 Split Dataset](#61-split-dataset)
    - [6.2 Model Configurations](#62-model-configurations)
    - [6.3 Build GNN Model](#63-build-gnn-model)
    - [6.4 Create DataLoaders](#64-create-dataloaders)
- [**7. GNN Training + Prediction**](#7-training-and-prediction).
    - [7.1 Compile Model](#71-compile-model)
    - [7.2 Fit Model](#72-fit-model)
    - [7.3 Save & Load Model](#73-save--load-model)
    - [7.4 Evaluate Model](#74-evaluate-model)
    - [7.5 Predict on New Data](#75-predict-on-new-data)

ℹ️ [**Graphs FAQ**](graphs_faq.md)

-----

### 1. Imports

We import `SoccerGraphConverterPolars` to help us convert from Kloppy positional tracking frames to graphs.

With the power of **Kloppy** we can also load data from many providers by importing `metrica`, `sportec`, `tracab`, `secondspectrum`, or `statsperform` from `kloppy`.

In [2]:
from unravel.soccer import SoccerGraphConverterPolars, KloppyPolarsDataset

from kloppy import sportec

-----

### 2. Public Sportec Data

The `SoccerGraphConverterPolars` class allows processing data from every tracking data provider supported by [PySport Kloppy](https://github.com/PySport/kloppy), namely:
- Sportec
- Tracab
- SecondSpectrum
- SkillCorner
- StatsPerform
- Metrica
- PFF (beta)
- HawkEye (alpha)
- Signality (alpha)

You can choose any of the following games from the open Sportec dataset:

```python
matches = {
    'J03WMX': "1. FC Köln vs. FC Bayern München",
    'J03WN1': "VfL Bochum 1848 vs. Bayer 04 Leverkusen",
    'J03WPY': "Fortuna Düsseldorf vs. 1. FC Nürnberg",
    'J03WOH': "Fortuna Düsseldorf vs. SSV Jahn Regensburg",
    'J03WQQ': "Fortuna Düsseldorf vs. FC St. Pauli",
    'J03WOY': "Fortuna Düsseldorf vs. F.C. Hansa Rostock",
    'J03WR9': "Fortuna Düsseldorf vs. 1. FC Kaiserslautern"
}
```

-----

### 3. ⭐ _KloppyPolarsDataset_ and _SoccerGraphConverterPolars_

ℹ️ For more information on:
- What a Graph is, check out [Graph FAQ Section A](graphs_faq.ipynb)
- What parameters we can pass to the `SoccerGraphConverterPolars`, check out [Graph FAQ Section B](graphs_faq.ipynb)
- What features each Graph has, check out [Graph FAQ Section C](graphs_faq.ipynb)

------

To get started we need to load our tracking data using Kloppy, and subsequently pass this to the `KloppyPolarsDataset`. 

This `KloppyPolarsDataset` also takes the following optional parameters:
- ball_carrier_threshold: float = 25.0
- max_player_speed: float = 12.0
- max_ball_speed: float = 28.0
- max_player_acceleration: float = 6.0
- max_ball_acceleration: float = 13.5
- orient_ball_owning: bool = True

🗒️ KloppyPolarsDataset sets the orientation to `Orientation.BALL_OWNING_TEAM` (ball owning team plays left to right) when `orient_ball_owning=True`. This is preferred behaviour in this use-case.

If our dataset does not have the ball owning team we infer the ball owning team automatically using the `ball_carrier_threshold` and subsequently change the orientation automatically to be left to right for the ball owning team too. Additionally, we automatically identify the ball carrying player as the player on the ball owning team closest to the ball.
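
The inference described above can be illustrated with a small, self-contained sketch. It is not the library's implementation: the function name `infer_ball_carrier`, the `(team, player_id, x, y)` tuple layout and the choice to treat the closest player overall as the ball carrier are assumptions made for illustration only.

```python
import math


def infer_ball_carrier(players, ball_xy, ball_carrier_threshold=25.0):
    """Return (team, player_id) of the closest player within the threshold, else None.

    `players` is a list of (team, player_id, x, y) tuples. This layout is an
    assumption for the sketch, not the KloppyPolarsDataset schema.
    """
    team, player_id, distance = min(
        ((t, p, math.dist((x, y), ball_xy)) for t, p, x, y in players),
        key=lambda item: item[2],
    )
    if distance > ball_carrier_threshold:
        return None  # nobody is close enough to be considered in possession
    return team, player_id


players = [
    ("home", "h7", 10.0, 5.0),
    ("away", "a4", 12.0, 5.0),
    ("home", "h9", -20.0, 0.0),
]
carrier = infer_ball_carrier(players, ball_xy=(11.5, 5.0))
print(carrier)  # ('away', 'a4')
```

In this toy frame the away player is closest to the ball, so the away team would be treated as the ball owning team and the frame would be oriented left to right for them.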

🗒️ In `SoccerGraphConverter` [deprecated] if the ball owning team was not available we set the orientation to STATIC_HOME_AWAY meaning attacking could happen in two directions. 

<div style="border: 2px solid #ddd; border-radius: 5px; padding: 10px; background-color: #282C34;">
<pre>
kloppy_dataset = sportec.load_open_tracking_data(
    match_id=match_id,
    coordinates="secondspectrum",
    only_alive=True,
    limit=500,  
)
kloppy_polars_dataset = KloppyPolarsDataset(
    kloppy_dataset=kloppy_dataset,
    ball_carrier_threshold=25.0
)
</pre>
</div>

#### Graph Identifier(s):
After loading the `kloppy_polars_dataset` we now add graph identifiers. We can do this by passing a list of column names on which we want to split our data.

🗒️ When training a model on tracking data it's highly recommended to split data into test/train(/validation) sets by match or period, such that all data from one match or period ends up in the same test, train or validation set. This avoids leaking information between the test, train and validation sets. Correctly splitting the final dataset into train, test and validation sets using these Graph Identifiers is incorporated into `CustomSpektralDataset` (see [Section 6.1](#61-split-dataset) for more information).

<div style="border: 2px solid #ddd; border-radius: 5px; padding: 10px; background-color: #282C34;">
<pre>
kloppy_polars_dataset.add_graph_ids(by=["game_id", "period_id"])
</pre>
</div>

#### Graph Labels

Now, we can add our (binary) labels to the dataset. In all examples we do this using `kloppy_polars_dataset.add_dummy_labels()`, but these are random labels and will not help with training.

To add useful labels for your task you need to "join" a Polars dataframe that contains a column with the required labels to the `kloppy_polars_dataset.data` Polars dataframe. Please note that in this dataframe each row is a single player (or ball) object, and thus each `frame_id` has 23 rows (if all players and ball are observed). All these rows (for a single frame_id) need to have _the same_ label. If your label column is not named `"label"` you need to pass the `label_col` (str) parameter to `SoccerGraphConverterPolars`.

We add _real_ labels as follows:

<div style="border: 2px solid #ddd; border-radius: 5px; padding: 10px; background-color: #282C34;">
<pre>
kloppy_polars_dataset.data = (
    kloppy_polars_dataset.data
    .join(
        some_label_dataframe.select(["game_id", "period_id", "frame_id", "label"]), 
        on=["game_id", "period_id", "frame_id"],
        how="left"
    )
)
</pre>
</div>
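
Because every row of a frame must carry the same label, it can be worth checking the joined data before converting it to graphs. The sketch below is one way to do that on plain Python dictionaries (for example the output of Polars' `DataFrame.to_dicts()`); the helper name `frames_with_inconsistent_labels` is hypothetical and not part of `unravelsports`.

```python
from collections import defaultdict


def frames_with_inconsistent_labels(rows, label_col="label"):
    """Return the (game_id, period_id, frame_id) keys that have more than one label value."""
    labels_per_frame = defaultdict(set)
    for row in rows:
        key = (row["game_id"], row["period_id"], row["frame_id"])
        labels_per_frame[key].add(row[label_col])
    return sorted(key for key, labels in labels_per_frame.items() if len(labels) != 1)


rows = [
    {"game_id": "J03WMX", "period_id": 1, "frame_id": 10000, "label": 1},
    {"game_id": "J03WMX", "period_id": 1, "frame_id": 10000, "label": 1},
    {"game_id": "J03WMX", "period_id": 1, "frame_id": 10001, "label": 0},
    {"game_id": "J03WMX", "period_id": 1, "frame_id": 10001, "label": 1},  # conflicting label
]
bad_frames = frames_with_inconsistent_labels(rows)
print(bad_frames)  # [('J03WMX', 1, 10001)]
```

An empty result means every frame has exactly one label value across all of its player and ball rows.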

For a full list of other parameters we can pass to the `SoccerGraphConverterPolars`, check out [Graph FAQ Section B](graphs_faq.ipynb)

------


### 4. Load Kloppy Data, Convert and Store

As mentioned in [Section 2](#2-public-sportec-data) we will use matches of Sportec data. In the below example we will load the first 500 frames of data from 4 games (we set `limit=500`) to create a dataset of 2,000 samples.

Important things to note:
- The `SoccerGraphConverterPolars` handles all necessary steps.
- We will not necessarily end up with exactly 2,000 graphs even though we set `limit=500` frames: we set `only_alive=True`, and all frames without ball coordinates are automatically omitted.
- When using other providers always set `include_empty_frames=False` or `only_alive=True`.
- We store the data as individual compressed pickle files, one file per match. The data that gets stored in the pickle is a list of dictionaries, one dictionary per frame. Each dictionary has keys for the adjacency matrix, node features, edge features, label and graph id.

In [3]:
from os.path import exists

matches = {
    "J03WMX": "1. FC Köln vs. FC Bayern München",
    "J03WN1": "VfL Bochum 1848 vs. Bayer 04 Leverkusen",
    "J03WPY": "Fortuna Düsseldorf vs. 1. FC Nürnberg",
    "J03WOH": "Fortuna Düsseldorf vs. SSV Jahn Regensburg",
    # 'J03WQQ': "Fortuna Düsseldorf vs. FC St. Pauli",
    # 'J03WOY': "Fortuna Düsseldorf vs. F.C. Hansa Rostock",
    # 'J03WR9': "Fortuna Düsseldorf vs. 1. FC Kaiserslautern"
}
pickle_folder = "pickles"
compressed_pickle_file_path = "{pickle_folder}/{match_id}.pickle.gz"

for match_id in matches.keys():
    match_pickle_file_path = compressed_pickle_file_path.format(
        pickle_folder=pickle_folder, match_id=match_id
    )
    # if the output file already exists, skip this whole step
    if not exists(match_pickle_file_path):
        # Load Kloppy dataset
        kloppy_dataset = sportec.load_open_tracking_data(
            match_id=match_id,
            coordinates="secondspectrum",
            only_alive=True,
            limit=500,
        )
        dataset = KloppyPolarsDataset(
            kloppy_dataset=kloppy_dataset,
            ball_carrier_threshold=25.0,
            max_ball_speed=28.0,
            max_player_speed=12.0,
            max_player_acceleration=6.0,
            max_ball_acceleration=13.5,
            orient_ball_owning=True,
        )
        dataset.add_graph_ids()
        dataset.add_dummy_labels()

        # Initialize the Graph Converter, with dataset, labels and settings
        converter = SoccerGraphConverterPolars(
            dataset=dataset,
            # Settings
            self_loop_ball=True,
            adjacency_matrix_connect_type="ball",
            adjacency_matrix_type="split_by_team",
            label_type="binary",
            defending_team_node_value=0.1,
            non_potential_receiver_node_value=0.1,
            random_seed=False,
            pad=False,
            verbose=False,
        )
        # Compute the graphs and directly store them as a pickle file
        converter.to_pickle(file_path=match_pickle_file_path)
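
To make the storage format described above more concrete, here is a minimal round trip of a list of per-frame dictionaries through a gzip-compressed pickle, using only the standard library. The key names (`"a"`, `"x"`, `"e"`, `"y"`, `"id"`) and the nested lists are placeholders chosen for this sketch; the actual files written by `converter.to_pickle` may use different keys and array types.

```python
import gzip
import pickle
import tempfile
from pathlib import Path

# One dictionary per frame (placeholder keys and toy values, see the note above)
graphs = [
    {
        "a": [[0, 1], [1, 0]],          # adjacency matrix
        "x": [[0.1, 0.2], [0.3, 0.4]],  # node features
        "e": [[0.5], [0.5]],            # edge features
        "y": 1,                         # label
        "id": "J03WMX",                 # graph id
    },
]

with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / "J03WMX.pickle.gz"
    with gzip.open(path, "wb") as f:
        pickle.dump(graphs, f)
    with gzip.open(path, "rb") as f:
        loaded = pickle.load(f)

print(len(loaded), sorted(loaded[0]))  # 1 ['a', 'e', 'id', 'x', 'y']
```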


-----

### 5. Creating a Custom Graph Dataset

To easily train our model with the Spektral library we need to use a Spektral dataset object. The `CustomSpektralDataset` class helps us create such an object really easily.

- `CustomSpektralDataset` is a [`spektral.data.Dataset`](https://graphneural.network/creating-dataset/). 
This type of dataset makes it very easy to properly load, train and predict with a Spektral GNN.
- The `CustomSpektralDataset` has an option to load from a folder of compressed pickle files; all we have to do is pass the `pickle_folder` location.

ℹ️ For more information on the `CustomSpektralDataset` please check the [Graphs FAQ Section D](graphs_faq.ipynb)

In [4]:
from unravel.utils import CustomSpektralDataset

dataset = CustomSpektralDataset(pickle_folder=pickle_folder)
dataset

CustomSpektralDataset(n_graphs=2004)

### 6. Prepare for Training

Now that we have all the data converted into Graphs inside our `CustomSpektralDataset` object, we can prepare to train the GNN model.


#### 6.1 Split Dataset

Our `dataset` object has two custom methods to help split the data into train, test and validation sets.
Either use `dataset.split_test_train()` if we don't need a validation set, or `dataset.split_test_train_validation()` if we do also require a validation set.

We can split our data with `by_graph_id=True` if we have added Graph Ids to our `KloppyPolarsDataset` using `add_graph_ids()` (see Section 3).

The 'split_train', 'split_test' and 'split_validation' parameters can either be ratios, percentages or relative size compared to total. 

We opt to create a test, train _and_ validation set to use in our example.

In [5]:
train, test, val = dataset.split_test_train_validation(
    split_train=4, split_test=1, split_validation=1, by_graph_id=True, random_seed=42
)
print("Train:", train)
print("Test:", test)
print("Validation:", val)

Train: CustomSpektralDataset(n_graphs=1002)
Test: CustomSpektralDataset(n_graphs=501)
Validation: CustomSpektralDataset(n_graphs=501)


🗒️ Because we are splitting by only 4 different graph_ids here (the 4 match_ids), the resulting ratios aren't exactly 4 to 1 to 1. With more graph_ids (for example by loading more matches, or by passing finer-grained columns to `add_graph_ids(by=[...])`) it becomes easier to get close to the requested ratios, simply because there are more graph_ids to split across.
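
The effect of splitting by graph id can be reproduced with a small stand-alone sketch. It keeps every graph id in exactly one set and greedily fills the set that is furthest below its target size. This is an illustrative algorithm only, not the one used by `CustomSpektralDataset`; the function name `split_by_graph_id` and the tie-breaking rule are assumptions.

```python
import random
from collections import Counter


def split_by_graph_id(graph_ids, split_train, split_test, split_validation, seed=42):
    """Assign whole graph ids to train/test/validation (hypothetical helper)."""
    counts = Counter(graph_ids)  # number of graphs per graph id
    total = sum(counts.values())
    weights = {"train": split_train, "test": split_test, "validation": split_validation}
    weight_sum = sum(weights.values())
    targets = {name: total * w / weight_sum for name, w in weights.items()}

    ids = sorted(counts)
    random.Random(seed).shuffle(ids)

    assigned = {name: [] for name in weights}
    sizes = {name: 0 for name in weights}
    for gid in ids:
        # Pick the set furthest below its target; break ties by the smallest set
        name = max(weights, key=lambda n: (targets[n] - sizes[n], -sizes[n]))
        assigned[name].append(gid)
        sizes[name] += counts[gid]
    return assigned, sizes


# 4 matches with 501 graphs each, split 4 : 1 : 1
graph_ids = [m for m in ["J03WMX", "J03WN1", "J03WPY", "J03WOH"] for _ in range(501)]
assigned, sizes = split_by_graph_id(graph_ids, 4, 1, 1)
print(sizes)  # {'train': 1002, 'test': 501, 'validation': 501}
```

With only four equally sized groups, a 2 : 1 : 1 split is the closest this sketch can get to 4 : 1 : 1 without placing graphs from one match in two different sets.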

#### 6.2 Model Configurations

In [6]:
learning_rate = 1e-3
epochs = 5  # Increase for actual training
batch_size = 32
channels = 128
n_layers = 3  # Number of CrystalConv layers

#### 6.3 Build GNN Model

This GNN Model has the same architecture as described in [A Graph Neural Network Deep-dive into Successful Counterattacks {A. Sahasrabudhe & J. Bekkers}](https://github.com/USSoccerFederation/ussf_ssac_23_soccer_gnn/tree/main)

This exact model can also simply be loaded as:

`from unravel.classifiers import CrystalGraphClassifier` as shown in [Quick Start Guide](0_quick_start_guide.ipynb)

Below we show the exact same code to make it easier to adjust.

In [7]:
from spektral.layers import GlobalAvgPool, CrystalConv
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.models import Model


class CrystalGraphClassifier(Model):
    def __init__(
        self,
        n_layers: int = 3,
        channels: int = 128,
        drop_out: float = 0.5,
        n_out: int = 1,
        **kwargs
    ):
        super().__init__(**kwargs)

        self.n_layers = n_layers
        self.channels = channels
        self.drop_out = drop_out
        self.n_out = n_out

        self.conv1 = CrystalConv()
        self.convs = [CrystalConv() for _ in range(1, self.n_layers)]
        self.pool = GlobalAvgPool()
        self.dense1 = Dense(self.channels, activation="relu")
        self.dropout = Dropout(self.drop_out)
        self.dense2 = Dense(self.channels, activation="relu")
        self.dense3 = Dense(self.n_out, activation="sigmoid")

    def call(self, inputs):
        x, a, e, i = inputs
        x = self.conv1([x, a, e])
        for conv in self.convs:
            x = conv([x, a, e])
        x = self.pool([x, i])
        x = self.dense1(x)
        x = self.dropout(x)
        x = self.dense2(x)
        x = self.dropout(x)
        return self.dense3(x)
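
The `call` method unpacks `x, a, e, i`. In Spektral's disjoint mode a batch is represented as one large graph, and `i` records which graph each node belongs to, so that the pooling layer can average node features per graph. The pure-Python sketch below mimics that averaging step on toy data; it is a simplified stand-in for `GlobalAvgPool`, not its implementation.

```python
def global_avg_pool(x, i):
    """Average node feature vectors per graph, given a node-to-graph index `i`."""
    sums, counts = {}, {}
    for features, graph_index in zip(x, i):
        acc = sums.setdefault(graph_index, [0.0] * len(features))
        for k, value in enumerate(features):
            acc[k] += value
        counts[graph_index] = counts.get(graph_index, 0) + 1
    return [[s / counts[g] for s in sums[g]] for g in sorted(sums)]


# Three nodes in total: the first two belong to graph 0, the last one to graph 1
x = [[1.0, 2.0], [3.0, 4.0], [10.0, 20.0]]
i = [0, 0, 1]
pooled = global_avg_pool(x, i)
print(pooled)  # [[2.0, 3.0], [10.0, 20.0]]
```

The result has one row per graph, which is what the dense layers that follow the pooling step operate on.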

#### 6.4 Create DataLoaders

Create a Spektral [`DisjointLoader`](https://graphneural.network/loaders/#disjointloader). This DisjointLoader will help us to load batches of Disjoint Graphs for training purposes.

Note that these Spektral `Loaders` return a generator, so if we want to retrain the model, we need to reload these loaders.
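
Why a loader has to be recreated before retraining can be seen with a toy generator. The `batch_generator` function below is a hypothetical stand-in for a loader's `load()` method (it does no shuffling or graph collation); it only demonstrates that a generator is exhausted after it has yielded all batches for the requested number of epochs.

```python
import math


def batch_generator(items, batch_size, epochs):
    """Yield consecutive batches of `items`, repeated for a fixed number of epochs."""
    for _ in range(epochs):
        for start in range(0, len(items), batch_size):
            yield items[start:start + batch_size]


items = list(range(10))
gen = batch_generator(items, batch_size=4, epochs=2)

steps_per_epoch = math.ceil(len(items) / 4)  # 3 batches per epoch
first_pass = list(gen)   # consumes all 2 * 3 = 6 batches
second_pass = list(gen)  # the generator is now exhausted

print(steps_per_epoch, len(first_pass), len(second_pass))  # 3 6 0
```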

In [8]:
from spektral.data import DisjointLoader

loader_tr = DisjointLoader(train, batch_size=batch_size, epochs=epochs)
loader_va = DisjointLoader(val, epochs=1, shuffle=False, batch_size=batch_size)

--------

### 7. Training and Prediction

Below we outline how to train the model, make predictions and add the predicted values back to the Kloppy dataframe.

#### 7.1 Compile Model

1. Initialize the `CrystalGraphClassifier` (or create your own Graph Classifier).
2. Compile the model with a loss function, optimizer and your preferred metrics.

In [9]:
from tensorflow.keras.metrics import AUC, BinaryAccuracy
from tensorflow.keras.losses import BinaryCrossentropy
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping

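# Use the configuration from Section 6.2 (these values equal the defaults)
model = CrystalGraphClassifier(n_layers=n_layers, channels=channels)

model.compile(
    loss=BinaryCrossentropy(),
    optimizer=Adam(learning_rate=learning_rate),
    metrics=[AUC(), BinaryAccuracy()],
)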



#### 7.2 Fit Model

1. We have a [`DisjointLoader`](https://graphneural.network/loaders/#disjointloader) for the training set and one for the validation set.
2. Fit the model. 
3. We add `EarlyStopping` and a `validation_data` dataset to monitor performance, and set `use_multiprocessing=True` to improve training speed.

⚠️ When trying to fit the model _again_ make sure to reload Data Loaders in [Section 6.4](#64-create-dataloaders), because they are generators.
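
The patience mechanism behind `EarlyStopping` can be sketched in a few lines of plain Python. This is a simplified model of the callback (no `min_delta`, no weight restoring), and the helper name `early_stopping_epoch` is hypothetical.

```python
def early_stopping_epoch(losses, patience):
    """Return (epoch at which training stops, epoch with the best loss), both 1-indexed."""
    best = float("inf")
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best:
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(losses), best_epoch


stopped = early_stopping_epoch([0.70, 0.65, 0.66, 0.67, 0.64, 0.66], patience=2)
print(stopped)  # (4, 2)

# With 5 epochs and patience=5 the stopping condition can never be reached
never_stops = early_stopping_epoch([0.5, 0.6, 0.6, 0.6, 0.6], patience=5)
print(never_stops)  # (5, 1)
```

Under this simplified model, a patience equal to the number of epochs means training always runs to completion, so a smaller patience or more epochs is needed for early stopping to have any effect.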

In [10]:
model.fit(
    loader_tr.load(),
    steps_per_epoch=loader_tr.steps_per_epoch,
    epochs=epochs,
    use_multiprocessing=True,
    validation_data=loader_va.load(),
    callbacks=[EarlyStopping(monitor="loss", patience=5, restore_best_weights=True)],
)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5

<keras.src.callbacks.History at 0x359511d50>

#### 7.3 Save & Load Model

This step is solely included to show how to restore a model.

In [11]:
from tensorflow.keras.models import load_model

model_path = "models/my-first-graph-classifier"
model.save(model_path)
loaded_model = load_model(model_path)

INFO:tensorflow:Assets written to: models/my-first-graph-classifier/assets


#### 7.4 Evaluate Model

1. Create another `DisjointLoader`, this time for the test set.
2. Evaluate model performance on the test set. This evaluation function uses the `metrics` passed to `model.compile`.

🗒️ Our performance is really bad because we're using random labels, very few epochs and a small dataset.

📖 For more information on evaluation in sports analytics see: [Methodology and evaluation in sports analytics: challenges, approaches, and lessons learned {J. Davis et al. (2024)}](https://link.springer.com/article/10.1007/s10994-024-06585-0)
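
As a rough illustration of what the loss and one of the metrics measure, the sketch below computes binary cross-entropy and binary accuracy (threshold 0.5) by hand on four toy predictions. It is a simplified version of what `BinaryCrossentropy` and `BinaryAccuracy` compute; the clipping constant `eps` is an assumption of this sketch.

```python
import math


def binary_cross_entropy(y_true, y_hat, eps=1e-7):
    """Mean negative log-likelihood of the true labels under the predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_hat):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


def binary_accuracy(y_true, y_hat, threshold=0.5):
    """Share of predictions that land on the correct side of the threshold."""
    hits = sum(int(p > threshold) == y for y, p in zip(y_true, y_hat))
    return hits / len(y_true)


y_true = [1, 0, 1, 0]
y_hat = [0.9, 0.2, 0.4, 0.6]

accuracy = binary_accuracy(y_true, y_hat)
loss = binary_cross_entropy(y_true, y_hat)
print(accuracy)  # 0.5
print(round(loss, 4))
```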


In [12]:
loader_te = DisjointLoader(test, epochs=1, shuffle=False, batch_size=batch_size)
results = model.evaluate(loader_te.load())



#### 7.5 Predict on New Data

1. Load new, unseen data from the Sportec dataset.
2. Convert this data, making sure we use the exact same settings as in [Section 4](#4-load-kloppy-data-convert-and-store).
3. If we set `prediction=True` we do not have to supply labels to the `SoccerGraphConverterPolars`.
4. We do still need to add graph_ids. It is advised to do this by `"frame_id"` for the prediction, such that we can more easily merge the predictions back to the correct frame.

In [13]:
kloppy_dataset = sportec.load_open_tracking_data(
    match_id="J03WR9",  # A game we have not yet used in section 4
    coordinates="secondspectrum",  # same coordinate system as in section 4
    only_alive=True,
    limit=500,
)
dataset = KloppyPolarsDataset(
    kloppy_dataset=kloppy_dataset, ball_carrier_threshold=25.0
)
dataset.add_graph_ids(by=["frame_id"])

preds_converter = SoccerGraphConverterPolars(
    dataset=dataset,
    # Settings
    prediction=True,
    self_loop_ball=True,
    adjacency_matrix_connect_type="ball",
    adjacency_matrix_type="split_by_team",
    label_type="binary",
    defending_team_node_value=0.1,
    non_potential_receiver_node_value=0.1,
    random_seed=False,
    pad=False,
    verbose=False,
)

5. Make a prediction on all the frames of this dataset using `model.predict`.

In [14]:
# Compute the graphs and add them to the CustomSpektralDataset
pred_dataset = CustomSpektralDataset(graphs=preds_converter.to_spektral_graphs())

loader_pred = DisjointLoader(
    pred_dataset, batch_size=batch_size, epochs=1, shuffle=False
)
preds = model.predict(loader_pred.load(), use_multiprocessing=True)

6. Create a predictions Polars DataFrame.

We merge the `preds_df` to the `KloppyPolarsDataset` named `dataset` that we just applied the predictions to.

🗒️ We join on `"frame_id"` because the `id` of each graph in the prediction dataset (`x.id`) is the Graph Id we added above, and we based that Graph Id on `"frame_id"`.

In [15]:
import polars as pl

preds_df = pl.DataFrame(
    {"frame_id": [int(x.id) for x in pred_dataset], "y_hat": preds.flatten()}
)
preds_df.sort("y_hat")

dataset.data = dataset.data.join(preds_df, on="frame_id", how="left")

dataset.data[300:305][["frame_id", "period_id", "timestamp", "y_hat"]]

| frame_id (i64) | period_id (i64) | timestamp (duration[μs]) | y_hat (f32) |
|---|---|---|---|
| 10300 | 1 | 12s | 0.991924 |
| 10301 | 1 | 12s 40ms | 0.991924 |
| 10302 | 1 | 12s 80ms | 0.991924 |
| 10303 | 1 | 12s 120ms | 0.991924 |
| 10304 | 1 | 12s 160ms | 0.991924 |
