# Data Preparation

This notebook loads the 2018 World Cup dataset provided by StatsBomb and converts it to the [SPADL format](https://github.com/ML-KULeuven/socceraction).

**Disclaimer**: this notebook is compatible with the following package versions:

- tqdm 4.42.1
- pandas 1.0
- socceraction 0.1.1
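
To compare a local environment against the versions listed above, something like the following sketch could be run first. It assumes Python 3.8 or later (for `importlib.metadata`); the helper names `installed_version` and `check_versions` are illustrative and not part of the notebook. The comparison is a simple prefix match, so `1.0` accepts `1.0.3`.

```python
from importlib import metadata  # standard library since Python 3.8

# Versions listed in the disclaimer above.
EXPECTED_VERSIONS = {"tqdm": "4.42.1", "pandas": "1.0", "socceraction": "0.1.1"}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_versions(expected=EXPECTED_VERSIONS):
    """Map each package to (installed version, whether it starts with the expected one)."""
    report = {}
    for package, wanted in expected.items():
        found = installed_version(package)
        # Prefix match: an expected "1.0" accepts an installed "1.0.3".
        report[package] = (found, found is not None and found.startswith(wanted))
    return report


for package, (found, ok) in check_versions().items():
    print(f"{package}: installed={found}, matches={ok}")
```

A mismatch does not necessarily mean the notebook will fail, only that it has not been checked against that version.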

In [1]:
import os
import sys
from tqdm.notebook import tqdm

import math
import pandas as pd

import socceraction.spadl as spadl
import socceraction.spadl.statsbomb as statsbomb

## Configure leagues and seasons to download and convert
The two dictionaries below map StatsBomb's season IDs and competition names to my internal season and league IDs. Using internal IDs makes it easier to work with data from multiple providers.

In [2]:
seasons = {
    3: '2018',
}
leagues = {
    'FIFA World Cup': 'WC',
}
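
As a small illustration of how these mappings are used, the sketch below translates one StatsBomb competition name and season ID into the internal IDs. The helper `to_internal_ids` is hypothetical and only meant to make the direction of the lookup explicit.

```python
# Same mappings as in the cell above: StatsBomb identifier -> internal identifier.
seasons = {3: "2018"}
leagues = {"FIFA World Cup": "WC"}


def to_internal_ids(competition_name, statsbomb_season_id):
    """Translate a StatsBomb competition name and season ID to internal IDs.

    Raises KeyError when the competition or season has not been configured.
    """
    return leagues[competition_name], seasons[statsbomb_season_id]


competition_id, season_id = to_internal_ids("FIFA World Cup", 3)
print(competition_id, season_id)  # WC 2018
```

Supporting a second provider would then only require another pair of dictionaries that map that provider's identifiers to the same internal IDs.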

## Configure folder names and download URLs

The two cells below define the URL from which the data are downloaded and the folders where the data are stored.

In [3]:
free_open_data_remote = "https://raw.githubusercontent.com/statsbomb/open-data/master/data/"

In [4]:
spadl_datafolder = "../data/statsbomb_opensource"
raw_datafolder = "../data/statsbomb_opensource/raw"

# Create data folder if it doesn't exist
for d in [raw_datafolder, spadl_datafolder]:
    if not os.path.exists(d):
        os.makedirs(d, exist_ok=True)
        print(f"Directory {d} created ")

Directory ../data/statsbomb_opensource/raw created 


## Set up the StatsBombLoader

In [5]:
SBL = statsbomb.StatsBombLoader(root=free_open_data_remote, getter="remote")
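
For orientation, the sketch below builds the URLs of the JSON files that the public StatsBomb open-data repository provides for one competition, season, and match. The layout (`competitions.json`, `matches/<competition_id>/<season_id>.json`, `lineups/<match_id>.json`, `events/<match_id>.json`) follows that repository; that `StatsBombLoader` requests exactly these paths is an assumption, the helper `open_data_urls` is hypothetical, and the match ID is a placeholder.

```python
free_open_data_remote = "https://raw.githubusercontent.com/statsbomb/open-data/master/data/"


def open_data_urls(root, competition_id, season_id, match_id):
    """Return the open-data JSON URLs for one competition, season, and match."""
    root = root.rstrip("/") + "/"  # tolerate a root given without a trailing slash
    return {
        "competitions": f"{root}competitions.json",
        "matches": f"{root}matches/{competition_id}/{season_id}.json",
        "lineups": f"{root}lineups/{match_id}.json",
        "events": f"{root}events/{match_id}.json",
    }


# competition_id=43 and season_id=3 identify the 2018 FIFA World Cup in StatsBomb's data;
# match_id=12345 is a placeholder, not a real match from the notebook.
urls = open_data_urls(free_open_data_remote, competition_id=43, season_id=3, match_id=12345)
for name, url in urls.items():
    print(f"{name}: {url}")
```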

## Select competitions to load and convert

In [6]:
# View all available competitions
df_competitions = SBL.competitions()
set(df_competitions.competition_name)

{'Champions League',
 "FA Women's Super League",
 'FIFA World Cup',
 'La Liga',
 'NWSL',
 'Premier League',
 "Women's World Cup"}

In [7]:
df_selected_competitions = df_competitions[df_competitions.competition_name.isin(
    leagues.keys()
)]

df_selected_competitions

| | competition_id | season_id | country_name | competition_name | competition_gender | season_name | match_updated | match_available |
|---|---|---|---|---|---|---|---|---|
| 17 | 43 | 3 | International | FIFA World Cup | male | 2018 | 2019-12-16T23:09:16.168756 | 2019-12-16T23:09:16.168756 |


## Convert to the SPADL format

In [8]:
for competition in df_selected_competitions.itertuples():
    # Get all matches of the selected competition and season
    matches = SBL.matches(competition.competition_id, competition.season_id)

    matches_verbose = tqdm(list(matches.itertuples()), desc="Loading match data")
    teams, players, player_games = [], [], []
    
    competition_id = leagues[competition.competition_name]
    season_id = seasons[competition.season_id]
    spadl_h5 = os.path.join(spadl_datafolder, f"spadl-statsbomb_opensource-{competition_id}-{season_id}.h5")
    with pd.HDFStore(spadl_h5) as spadlstore:
        
        spadlstore["actiontypes"] = spadl.actiontypes_df()
        spadlstore["results"] = spadl.results_df()
        spadlstore["bodyparts"] = spadl.bodyparts_df()
        
        for match in matches_verbose:
            # load data
            teams.append(SBL.teams(match.match_id))
            players.append(SBL.players(match.match_id))
            events = SBL.events(match.match_id)

            # convert data
            player_games.append(statsbomb.extract_player_games(events))
            spadlstore[f"actions/game_{match.match_id}"] = statsbomb.convert_to_actions(events, match.home_team_id)

        games = matches.rename(columns={"match_id": "game_id", "match_date": "game_date"})
        # Replace StatsBomb's IDs with the internal season and league IDs
        games["season_id"] = season_id
        games["competition_id"] = competition_id
        spadlstore["games"] = games
        spadlstore["teams"] = pd.concat(teams).drop_duplicates("team_id").reset_index(drop=True)
        spadlstore["players"] = pd.concat(players).drop_duplicates("player_id").reset_index(drop=True)
        spadlstore["player_games"] = pd.concat(player_games).reset_index(drop=True)

your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block1_values] [items->Index(['game_date', 'kick_off', 'competition_id', 'country_name',
       'competition_name', 'season_id', 'season_name', 'home_team_name',
       'home_team_gender', 'home_team_group', 'name', 'managers',
       'away_team_name', 'away_team_gender', 'away_team_group', 'match_status',
       'last_updated', 'data_version'],
      dtype='object')]

  exec(code_obj, self.user_global_ns, self.user_ns)
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block1_values] [items->Index(['player_name', 'player_nickname', 'country_name', 'extra'], dtype='object')]

  exec(code_obj, self.user_global_ns, self.user_ns)
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed-integer,key->block2_values] 