# Data Preparation

This notebook downloads the open-source [Wyscout match event dataset](https://figshare.com/collections/Soccer_match_event_dataset/4415000/5) and converts it to the [SPADL format](https://github.com/ML-KULeuven/socceraction). The dataset contains all spatio-temporal events (passes, shots, fouls, etc.) that occurred during all matches of the 2017/18 season of the top-5 European leagues (La Liga, Serie A, Bundesliga, Premier League, Ligue 1), as well as the FIFA World Cup 2018 and UEFA Euro 2016.

**Disclaimer**: this notebook is compatible with the following package versions:

- tqdm 4.42.1
- pandas 1.0
- socceraction 0.1.1

In [1]:
import os
import sys

from tqdm.notebook import tqdm

import math

import pandas as pd
pd.set_option('display.max_columns', None)

from io import BytesIO
from pathlib import Path

from urllib.parse import urlparse
from urllib.request import urlopen, urlretrieve
# optional workaround for an SSL CERTIFICATE_VERIFY_FAILED exception; note that it disables certificate verification
import ssl; ssl._create_default_https_context = ssl._create_unverified_context

from zipfile import ZipFile, is_zipfile

import socceraction.spadl as spadl
import socceraction.spadl.wyscout as wyscout

## Configure leagues and seasons to download and convert
The two dictionaries below map Wyscout's season IDs and competition names to my internal season and league IDs. Using internal IDs makes it easier to work with data from multiple providers.

In [2]:
seasons = {
    181248: '1718',
    181150: '1718',
    181144: '1718',
    181189: '1718',
    181137: '1718'
}
leagues = {
    'England':'ENG',
    'France':'FRA',
    'Germany':'GER',
    'Italy':'ITA',
    'Spain':'ESP'
}
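
As a minimal sketch of how these two dictionaries are used further down, the lookups below resolve a Wyscout season ID and a competition name to the internal IDs. The dictionaries are repeated so the snippet runs on its own, and the `'NULL'` fallback mirrors the one used when the competitions are selected.

```python
# Wyscout season ID -> internal season ID (copied from the cell above)
seasons = {181248: '1718', 181150: '1718', 181144: '1718', 181189: '1718', 181137: '1718'}
# Wyscout competition (area) name -> internal league ID (copied from the cell above)
leagues = {'England': 'ENG', 'France': 'FRA', 'Germany': 'GER', 'Italy': 'ITA', 'Spain': 'ESP'}

season_id = seasons[181248]                    # a KeyError here means the season is not configured
league_id = leagues.get('Italy', 'NULL')
unknown_id = leagues.get('World Cup', 'NULL')  # competitions that are not listed fall back to 'NULL'

print(season_id, league_id, unknown_id)
```

To convert an additional competition, both dictionaries would need a matching entry: the season ID found in its matches file and the name used in its file names.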

## Configure folder names and download URLs

The two cells below define the URLs from which the data is downloaded and the folders where it is stored.

In [3]:
# https://figshare.com/collections/Soccer_match_event_dataset/4415000/5
dataset_urls = dict(
    competitions = "https://ndownloader.figshare.com/files/15073685",
    teams = "https://ndownloader.figshare.com/files/15073697",
    players = "https://ndownloader.figshare.com/files/15073721",
    matches = "https://ndownloader.figshare.com/files/14464622",
    events = "https://ndownloader.figshare.com/files/14464685"
)

In [4]:
raw_datafolder = "../data/wyscout_opensource/raw"
spadl_datafolder = "../data/wyscout_opensource"

# Create data folder if it doesn't exist
for d in [raw_datafolder, spadl_datafolder]:
    if not os.path.exists(d):
        os.makedirs(d, exist_ok=True)
        print(f"Directory {d} created ")

Directory ../data/wyscout_opensource/raw created 


## Download Wyscout data

The following cell loops through the `dataset_urls` dict and stores each downloaded file in `raw_datafolder` on the local file system.

If a downloaded file is a ZIP archive, the JSON files it contains are extracted into the same folder.

In [5]:
for url in tqdm(dataset_urls.values()):
    url_obj = urlopen(url).geturl()
    path = Path(urlparse(url_obj).path)
    file_name = os.path.join(raw_datafolder, path.name)
    file_local, _ = urlretrieve(url_obj, file_name)
    if is_zipfile(file_local):
        with ZipFile(file_local) as zip_file:
            zip_file.extractall(raw_datafolder)

print("Downloaded files:")
os.listdir(raw_datafolder)

  0%|          | 0/5 [00:00<?, ?it/s]

Downloaded files:


['events_France.json',
 'events_Spain.json',
 'matches_World_Cup.json',
 'events_Germany.json',
 'matches_Italy.json',
 'matches.zip',
 'teams.json',
 'matches_Germany.json',
 'events_European_Championship.json',
 'events_World_Cup.json',
 'competitions.json',
 'matches_England.json',
 'events.zip',
 'events_Italy.json',
 'matches_France.json',
 'matches_Spain.json',
 'players.json',
 'events_England.json',
 'matches_European_Championship.json']
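
The ZIP handling can be tried without downloading anything. The sketch below wraps the same `is_zipfile` / `extractall` logic in a hypothetical helper, `extract_if_zip`, and runs it on two throwaway files in a temporary directory; the helper and the file names are illustrative and not part of the notebook.

```python
import os
import tempfile
from zipfile import ZipFile, is_zipfile


def extract_if_zip(file_path, target_dir):
    """Extract file_path into target_dir if it is a ZIP archive.

    Returns the names of the extracted members (empty for non-ZIP files).
    """
    if not is_zipfile(file_path):
        return []
    with ZipFile(file_path) as zip_file:
        zip_file.extractall(target_dir)
        return zip_file.namelist()


with tempfile.TemporaryDirectory() as tmp:
    # Throwaway archive standing in for a downloaded ZIP file
    archive = os.path.join(tmp, "demo.zip")
    with ZipFile(archive, "w") as zf:
        zf.writestr("matches_Demo.json", "[]")
    # Plain JSON file standing in for a download that is not an archive
    plain = os.path.join(tmp, "teams_demo.json")
    with open(plain, "w") as f:
        f.write("[]")

    extracted = extract_if_zip(archive, tmp)
    skipped = extract_if_zip(plain, tmp)
    extracted_exists = os.path.exists(os.path.join(tmp, "matches_Demo.json"))

print(extracted, skipped, extracted_exists)
```

Note that the archives themselves stay in the folder after extraction, which is why `matches.zip` and `events.zip` appear in the listing above.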

## Preprocess Wyscout data

The `read_json_file` function reads and returns the content of a given JSON file. The function handles the encoding of special characters (e.g., accents in names of players and teams) that the `pd.read_json` function cannot handle properly.

In [6]:
def read_json_file(filename):
    """Read a JSON file as bytes and decode its escape sequences into a str."""
    with open(filename, 'rb') as json_file:
        return json_file.read().decode('unicode_escape')
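
To make the decoding step concrete, here is a small sketch of what `decode('unicode_escape')` does to raw bytes. The sample record is hypothetical, and the snippet assumes that accented characters are stored on disk as `\uXXXX` escape sequences.

```python
import json

# Hypothetical raw bytes as they might appear on disk: the accented
# character is stored as the six ASCII characters \u00e9.
raw = b'[{"shortName": "Andr\\u00e9"}]'

# The escape sequence is turned into the actual character
decoded = raw.decode('unicode_escape')
print(decoded)

# The decoded string is still valid JSON
records = json.loads(decoded)
print(records[0]["shortName"])
```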

Wyscout does not distinguish between headers and other body parts on shots. The socceraction converter simply labels all shots as performed by foot. I think it is better to label shots tagged `head/body` as headers, which is what the function below does.

In [7]:
def determine_bodypart_id(event):
    """Determine the body part used for an event.

    Args:
        event (pd.Series): Wyscout event Series

    Returns:
        int: id of the body part used for the action
    """
    if event["subtype_id"] in [81, 36, 21, 90, 91]:
        body_part = "other"
    elif event["subtype_id"] == 82 or event["head/body"]:
        body_part = "head"
    else:  # all other cases
        body_part = "foot"
    return spadl.config.bodyparts.index(body_part)


# Override socceraction's default implementation with the function above
wyscout.determine_bodypart_id = determine_bodypart_id
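
The branching can be checked on a few toy events without socceraction installed. In this sketch, `BODYPARTS` is an assumed stand-in for `spadl.config.bodyparts` (the real list may differ between versions), and the events are hypothetical plain dicts rather than Wyscout Series.

```python
# Assumed stand-in for spadl.config.bodyparts; the real list may differ
# between socceraction versions.
BODYPARTS = ["foot", "head", "other"]


def bodypart_id(event):
    """Same branching as determine_bodypart_id, applied to a plain dict."""
    if event["subtype_id"] in [81, 36, 21, 90, 91]:
        body_part = "other"
    elif event["subtype_id"] == 82 or event["head/body"]:
        body_part = "head"
    else:
        body_part = "foot"
    return BODYPARTS.index(body_part)


# Hypothetical events; subtype 100 is just a value outside both checks above
tagged_head = {"subtype_id": 100, "head/body": True}
untagged = {"subtype_id": 100, "head/body": False}
in_other_list = {"subtype_id": 36, "head/body": False}

ids = [bodypart_id(e) for e in (tagged_head, untagged, in_other_list)]
names = [BODYPARTS[i] for i in ids]
print(names)
```

Because the subtype check comes first, an event whose subtype is in the "other" list keeps that label even if it also carries the `head/body` tag.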

### Select competitions to load and convert

In [8]:
json_competitions = read_json_file(f"{raw_datafolder}/competitions.json")
df_competitions = pd.read_json(json_competitions)
# Rename competitions to the names used in the file names
df_competitions['name'] = df_competitions.apply(lambda x: x.area['name'] if x.area['name'] != "" else x['name'], axis=1)
df_competitions['id'] = df_competitions.apply(lambda x: leagues.get(x.area['name'], 'NULL'), axis=1)
# View all available competitions
set(df_competitions.name)

{'England',
 'European Championship',
 'France',
 'Germany',
 'Italy',
 'Spain',
 'World Cup'}
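
The renaming and ID assignment above can be illustrated with plain dictionaries. This is only a sketch: the two competition records are hypothetical and heavily trimmed, and it assumes that competitions without a country have an empty area name, as the `!= ""` check in the cell above suggests.

```python
leagues = {'England': 'ENG', 'France': 'FRA', 'Germany': 'GER', 'Italy': 'ITA', 'Spain': 'ESP'}

# Hypothetical, heavily trimmed competition records
competitions = [
    {"name": "Italian first division", "area": {"name": "Italy"}},
    {"name": "World Cup", "area": {"name": ""}},
]

for comp in competitions:
    area_name = comp["area"]["name"]
    # Use the area name when present, otherwise keep the competition name
    comp["name"] = area_name if area_name != "" else comp["name"]
    # Internal league ID, or 'NULL' for competitions that are not configured
    comp["id"] = leagues.get(area_name, "NULL")

# Keep only the competitions that have an internal league ID configured
selected = [c for c in competitions if c["name"] in leagues]
print([(c["name"], c["id"]) for c in selected])
```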

In [9]:
df_selected_competitions = df_competitions[df_competitions.name.isin(leagues.keys())]
df_selected_competitions

| | name | wyId | format | area | type | id |
|---|---|---|---|---|---|---|
| 0 | Italy | 524 | Domestic league | {'name': 'Italy', 'id': '380', 'alpha3code': '... | club | ITA |
| 1 | England | 364 | Domestic league | {'name': 'England', 'id': '0', 'alpha3code': '... | club | ENG |
| 2 | Spain | 795 | Domestic league | {'name': 'Spain', 'id': '724', 'alpha3code': '... | club | ESP |
| 3 | France | 412 | Domestic league | {'name': 'France', 'id': '250', 'alpha3code': ... | club | FRA |
| 4 | Germany | 426 | Domestic league | {'name': 'Germany', 'id': '276', 'alpha3code':... | club | GER |


## Convert to the SPADL format

In [10]:
json_teams = read_json_file(f"{raw_datafolder}/teams.json")
df_teams = wyscout.convert_teams(pd.read_json(json_teams))

json_players = read_json_file(f"{raw_datafolder}/players.json")
df_players = wyscout.convert_players(pd.read_json(json_players))


for competition in df_selected_competitions.itertuples():
    json_matches = read_json_file(f"{raw_datafolder}/matches_{competition.name}.json")
    df_matches = pd.read_json(json_matches)
    season_id = seasons[df_matches.seasonId.unique()[0]]
    df_games = wyscout.convert_games(df_matches)
    df_games['competition_id'] = competition.id
    df_games['season_id'] = season_id
    
    json_events = read_json_file(f"{raw_datafolder}/events_{competition.name}.json")
    df_events = pd.read_json(json_events).groupby('matchId', as_index=False)
    
    player_games = []
    
    spadl_h5 = os.path.join(spadl_datafolder, f"spadl-wyscout_opensource-{competition.id}-{season_id}.h5")

    # Store all spadl data in h5-file
    print(f"Converting {competition.id} {season_id}")
    with pd.HDFStore(spadl_h5) as spadlstore:
        
        spadlstore["actiontypes"] = spadl.actiontypes_df()
        spadlstore["results"] = spadl.results_df()
        spadlstore["bodyparts"] = spadl.bodyparts_df()
        spadlstore["games"] = df_games

        for game in tqdm(list(df_games.itertuples())):
            game_id = game.game_id
            game_events = df_events.get_group(game_id)

            # collect the players that were lined up in this game
            player_games.append(wyscout.get_player_games(df_matches[df_matches.wyId == game_id].iloc[0], game_events))

            # convert events to SPADL actions
            home_team = game.home_team_id
            df_actions = wyscout.convert_actions(game_events, home_team)
            df_actions["action_id"] = range(len(df_actions))
            spadlstore[f"actions/game_{game_id}"] = df_actions

        player_games = pd.concat(player_games).reset_index(drop=True)  
        spadlstore["player_games"] = player_games
        spadlstore["players"] = df_players[df_players.player_id.isin(player_games.player_id)]
        spadlstore["teams"] = df_teams[df_teams.team_id.isin(df_games.home_team_id) | df_teams.team_id.isin(df_games.away_team_id)]

Converting ITA 1718


  0%|          | 0/380 [00:00<?, ?it/s]

Converting ENG 1718


  0%|          | 0/380 [00:00<?, ?it/s]

Converting ESP 1718


  0%|          | 0/380 [00:00<?, ?it/s]

Converting FRA 1718


  0%|          | 0/380 [00:00<?, ?it/s]

Converting GER 1718


  0%|          | 0/306 [00:00<?, ?it/s]