# Creating a training dataset

This notebook creates a dataset of passes and generates features and labels.

In [None]:
from pathlib import Path

import pandas as pd
pd.set_option('display.max_columns', None)

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
from unxpass.databases import SQLiteDatabase
from unxpass.datasets import PassesDataset, CompletedPassesDataset, FailedPassesDataset

## Configure folder names

First, we define where the processed data should be stored.

In [None]:
DATA_DIR = Path("../stores/")

## Create database connection

We need a database with StatsBomb 360 data to extract passes from.

In [None]:
DB_PATH = DATA_DIR / "database.sql"
db = SQLiteDatabase(DB_PATH)

In [None]:
from socceraction.spadl.utils import add_names

game_id = 3795107

# load SPADL actions
df_actions = add_names(db.actions(game_id))
df_actions.head()

## Select passes

We only use passes that satisfy all of the following:
- they are performed by foot
- they are part of open play
- their start and end locations are both included in the 360 snapshot

In [None]:
passes_idx = PassesDataset.actionfilter(df_actions)
df_actions.loc[passes_idx].head()
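
The selection criteria listed in this section can be approximated with plain pandas boolean masks. The sketch below runs on toy data and is not the implementation of `PassesDataset.actionfilter`. The columns `type_name` and `bodypart_name` follow the SPADL naming added by `add_names`, whereas `start_visible` and `end_visible` are hypothetical stand-ins for the 360 snapshot check. Treating every action of type `pass` as open play is also an assumption.

```python
import pandas as pd

# Toy actions table. `type_name` and `bodypart_name` mimic SPADL naming;
# `start_visible` and `end_visible` are hypothetical stand-ins for the
# "location is inside the 360 snapshot" check.
toy_actions = pd.DataFrame(
    {
        "type_name": ["pass", "pass", "throw_in", "pass", "shot"],
        "bodypart_name": ["foot", "head", "other", "foot", "foot"],
        "start_visible": [True, True, True, True, True],
        "end_visible": [True, True, True, False, True],
    }
)


def select_passes(actions: pd.DataFrame) -> pd.Series:
    """Return a boolean mask keeping open-play foot passes with visible endpoints."""
    # Assumption: set pieces have their own action types, so "pass" means open play.
    is_open_play_pass = actions["type_name"] == "pass"
    by_foot = actions["bodypart_name"].str.contains("foot")
    in_snapshot = actions["start_visible"] & actions["end_visible"]
    return is_open_play_pass & by_foot & in_snapshot


mask = select_passes(toy_actions)
print(toy_actions.loc[mask])
```

Only the first toy row passes all three checks; the others fail on body part, action type, or visibility.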

## Compute features and labels

The `unxpass.features` and `unxpass.labels` modules implement various feature generation and labeling functions, respectively.

In [None]:
from unxpass import features as fs
from unxpass import labels as ls

# List of available features
print("Features:", [f.__name__ for f in fs.all_features])

# List of available labels
print("Labels:", [f.__name__ for f in ls.all_labels])
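
Because every function carries a `__name__`, a collection of feature functions can be addressed by name. The following is a minimal sketch of such a lookup, assuming two hypothetical feature functions (`actiontype` and `distance`) that operate on a plain dictionary. The real functions in `unxpass.features` have their own signatures, and the way the library resolves names may differ.

```python
# Hypothetical feature functions; the real ones live in `unxpass.features`.
def actiontype(action: dict) -> dict:
    return {"actiontype": action["type_name"]}


def distance(action: dict) -> dict:
    dx = action["end_x"] - action["start_x"]
    dy = action["end_y"] - action["start_y"]
    return {"distance": (dx**2 + dy**2) ** 0.5}


all_features = [actiontype, distance]

# Build a name -> function registry from the functions' own names.
registry = {f.__name__: f for f in all_features}


def compute(action: dict, names: list[str]) -> dict:
    """Apply the named feature functions and merge their outputs."""
    out = {}
    for name in names:
        out.update(registry[name](action))
    return out


example = {"type_name": "pass", "start_x": 0.0, "start_y": 0.0, "end_x": 3.0, "end_y": 4.0}
print(compute(example, ["actiontype", "distance"]))
```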

Some of these functions require data from the entire game (e.g., to determine the current scoreline), so they should always be applied to the game state representation of the full game. The relevant actions can be selected afterwards.
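
To see why the order matters, consider a running scoreline. The sketch below uses a toy table with hypothetical columns (`team`, `type_name`, `is_goal`). It does not show how `unxpass` computes the score; it only illustrates the difference between computing on the full game first and filtering first.

```python
import pandas as pd

# Toy game with hypothetical columns (not the SPADL schema).
game = pd.DataFrame(
    {
        "team": ["A", "A", "B", "A", "B"],
        "type_name": ["pass", "shot", "pass", "pass", "pass"],
        "is_goal": [False, True, False, False, False],
    }
)


def goals_before(actions: pd.DataFrame, team: str) -> pd.Series:
    """Number of goals `team` scored strictly before each action."""
    scored = (actions["is_goal"] & (actions["team"] == team)).astype(int)
    return scored.cumsum() - scored


is_pass = game["type_name"] == "pass"

# Compute on the full game, then select the passes.
full_then_filter = goals_before(game, "A")[is_pass]

# Filtering first drops the shot, so the goal is never counted.
filter_then_compute = goals_before(game[is_pass], "A")

print(full_then_filter.tolist())
print(filter_then_compute.tolist())
```

In the second variant every pass appears to be played at 0 goals for team A, because the goal-scoring shot was removed before the score was computed.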

In [None]:
from socceraction.vaep.features import gamestates as to_gamestates
from unxpass.utils import play_left_to_right

# convert actions to gamestates
home_team_id, _ = db.get_home_away_team_id(game_id)
gamestates = play_left_to_right(to_gamestates(df_actions, nb_prev_actions=3), home_team_id)

In [31]:
dataset = PassesDataset(
    path=DATA_DIR / "datasets" / "euro2020" / "train",
    xfns=["actiontype"],
    yfns=["success"]
)
#dataset.create(db)

## The "PassesDataset" interface

To make things easier, we provide an interface that does all of the above. Additionally, it can store all computed features and labels locally. This is recommended when experimenting with multiple model configurations. It also functions as a PyTorch dataset.
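
The two ideas in this paragraph, caching computed values on disk and exposing examples through indexing, can be illustrated with a small stand-alone class. This is a hypothetical sketch, not the `PassesDataset` implementation: the class name `CachedDataset`, its JSON cache file, and the `compute` callback are invented for illustration. It implements `__len__` and `__getitem__`, the map-style protocol that PyTorch datasets follow.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


class CachedDataset:
    """Hypothetical map-style dataset that caches its examples as JSON."""

    def __init__(self, path: Path):
        self.path = Path(path)
        self.examples: list[dict] = []
        if self.path.exists():
            # Reuse previously computed examples instead of recomputing them.
            self.examples = json.loads(self.path.read_text())

    def create(self, compute: Callable[[], list[dict]]) -> None:
        """Compute all examples once and store them on disk."""
        self.examples = compute()
        self.path.parent.mkdir(parents=True, exist_ok=True)
        self.path.write_text(json.dumps(self.examples))

    def __len__(self) -> int:
        return len(self.examples)

    def __getitem__(self, idx: int) -> dict:
        return self.examples[idx]


cache_file = Path(tempfile.mkdtemp()) / "train" / "examples.json"

first = CachedDataset(cache_file)
first.create(lambda: [{"actiontype": "pass", "success": 1}, {"actiontype": "pass", "success": 0}])

# A second instance loads the cached examples without calling `create`.
second = CachedDataset(cache_file)
print(len(second), second[0])
```

Storing the results once pays off when several model configurations reuse the same features, since only the first run incurs the computation cost.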

In [28]:
dataset = PassesDataset(
    path=DATA_DIR / "datasets" / "euro2020",
    xfns=["pass_options"],
    yfns=["receiver"]
)
dataset.create(db, [{"competition_id": 55, "season_id": 43, "game_id": 3795506}])
dataset = PassesDataset(
    path=DATA_DIR / "datasets_pass" / "euro2020" / "train",
    xfns=["pass_options"],
    yfns=["receiver"]
)
dataset.create(db)

In [24]:
dataset = CompletedPassesDataset(
    path=DATA_DIR / "datasets" / "completed",
    xfns=[f.__name__ for f in fs.all_features],
    yfns=[f.__name__ for f in ls.all_labels]
)
dataset.create(db)
# dataset = FailedPassesDataset(
#     path=DATA_DIR / "datasets" / "failed",
#     xfns=[f.__name__ for f in fs.all_features],
#     yfns=[f.__name__ for f in ls.all_labels]
# )
# dataset.create(db)

You can now retrieve the computed features and labels as pandas DataFrames.

In [None]:
dataset.features

In [None]:
dataset.labels

Alternatively, you can index into the dataset or iterate over it. Each example is returned as a dictionary containing its features and labels.

In [None]:
dataset[0]

In [None]:
db.close()