# Wyscout Data Preparation

This notebook downloads the open-source [Wyscout match event dataset](https://figshare.com/collections/Soccer_match_event_dataset/4415000/2) and converts it to the [SPADL format](https://github.com/ML-KULeuven/socceraction). The dataset contains all spatiotemporal events (passes, shots, fouls, etc.) that occurred during all matches of the 2017/18 season of the top-5 European leagues (La Liga, Serie A, Bundesliga, Premier League, Ligue 1), as well as the FIFA World Cup 2018 and UEFA Euro 2016.

In [1]:
from pathlib import Path

from socceraction.data.wyscout import PublicWyscoutLoader
from socceraction.spadl.wyscout import convert_to_actions

In [2]:
%load_ext autoreload
%autoreload 2

from soccer_xg.data import HDFDataset


## Configuration
We will load all matches of the 2017/18 season of the top-5 European leagues.

In [3]:
comps = [
    { "league": { "name": "ITA", "wy_id": 524 }, "season": { "name": "1718", "wy_id": 181248 } },
    { "league": { "name": "ENG", "wy_id": 364 }, "season": { "name": "1718", "wy_id": 181150 } },
    { "league": { "name": "ESP", "wy_id": 795 }, "season": { "name": "1718", "wy_id": 181144 } },
    { "league": { "name": "FRA", "wy_id": 412 }, "season": { "name": "1718", "wy_id": 181189 } },
    { "league": { "name": "GER", "wy_id": 426 }, "season": { "name": "1718", "wy_id": 181137 } }
]

The cell below defines where the data will be stored.

In [4]:
raw_datafolder = Path("../data/wyscout/raw")
spadl_datafolder = Path("../data")

# Create the data folders if they don't exist
raw_datafolder.mkdir(parents=True, exist_ok=True)
spadl_datafolder.mkdir(parents=True, exist_ok=True)

## Set up a data loader

We use the [API clients included in the socceraction library](https://socceraction.readthedocs.io/en/latest/documentation/data/index.html) to fetch data. These clients fetch event streams and their corresponding metadata as pandas DataFrames using a unified data model. Below we set up a data loader to fetch data from the public Wyscout dataset.

In [5]:
WYL = PublicWyscoutLoader(root=raw_datafolder)

Let's fetch all available competitions and check whether we've set the correct IDs above.

In [6]:
# View all available competitions
df_competitions = WYL.competitions()
set(df_competitions.competition_name)

{'English first division',
 'European Championship',
 'French first division',
 'German first division',
 'Italian first division',
 'Spanish first division',
 'World Cup'}

In [7]:
df_competitions \
 .set_index(["competition_id", "season_id"]) \
 .loc[[(c['league']['wy_id'], c['season']['wy_id']) for c in comps]]

| competition_id | season_id | country_name | competition_name | competition_gender | season_name |
|---|---|---|---|---|---|
| 524 | 181248 | Italy | Italian first division | male | 2017/2018 |
| 364 | 181150 | England | English first division | male | 2017/2018 |
| 795 | 181144 | Spain | Spanish first division | male | 2017/2018 |
| 412 | 181189 | France | French first division | male | 2017/2018 |
| 426 | 181137 | Germany | German first division | male | 2017/2018 |
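
The lookup above can also be written as an explicit check that fails loudly when an ID is mistyped. The following is a minimal sketch, not part of the original notebook: the helper `missing_competitions` is hypothetical, and the `available` set is copied by hand from the table above so that the snippet runs on its own. In the notebook it could presumably be built from `df_competitions` instead.

```python
# Configured (competition_id, season_id) pairs, copied from `comps` above.
configured = {
    "ITA": (524, 181248),
    "ENG": (364, 181150),
    "ESP": (795, 181144),
    "FRA": (412, 181189),
    "GER": (426, 181137),
}

# Pairs offered by the loader, copied from the table above. In the notebook,
# something like set(zip(df_competitions.competition_id,
# df_competitions.season_id)) should give the same information.
available = {
    (524, 181248),
    (364, 181150),
    (795, 181144),
    (412, 181189),
    (426, 181137),
}


def missing_competitions(configured, available):
    """Return the names of configured competitions that are not available."""
    return sorted(name for name, key in configured.items() if key not in available)


missing = missing_competitions(configured, available)
assert not missing, f"Unknown competition/season IDs: {missing}"
```

Keeping the check as a function makes it easy to rerun after editing `comps`, for example when adding the World Cup or the European Championship.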


## Download and store data

In [8]:
# Create an HDF dataset
dataset = HDFDataset(
    path=spadl_datafolder / "spadl-wyscout-bigfive-1718.h5", 
    mode="w"
)
for comp in comps:
    # get name and id of competition
    competition_name, competition_id = comp['league']['name'], comp['league']['wy_id']
    season_name, season_id = comp['season']['name'], comp['season']['wy_id']
    print(f"Importing {competition_name} {season_name} ...")
    # import data
    dataset.import_data(
        WYL, 
        convert_to_actions, 
        competition_id, 
        season_id
    )

Importing ITA 1718 ...
Loading game data...: 100%|██████████| 380/380 [04:29<00:00,  1.41it/s]

Importing ENG 1718 ...
Loading game data...: 100%|██████████| 380/380 [04:22<00:00,  1.44it/s]

Importing ESP 1718 ...
Loading game data...: 100%|██████████| 380/380 [04:30<00:00,  1.41it/s]

Importing FRA 1718 ...
Loading game data...: 100%|██████████| 380/380 [04:21<00:00,  1.45it/s]

Importing GER 1718 ...
Loading game data...: 100%|██████████| 306/306 [03:26<00:00,  1.48it/s]


The HDF database now contains all games, teams, players and actions performed during each game.
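
As a quick sanity check on the import, the game counts reported by the progress bars can be compared with the size of a full double round-robin season, in which each of the n teams plays every other team home and away, giving n(n-1) games. The sketch below is an illustration and not part of the original notebook: the team counts (20 per league, 18 for the Bundesliga) are general knowledge about the 2017/18 season rather than values read from the dataset, and the function name is hypothetical.

```python
def double_round_robin_games(n_teams: int) -> int:
    """Number of games when every team hosts every other team once."""
    return n_teams * (n_teams - 1)


# Assumed number of teams per league in 2017/18 (not taken from the dataset).
teams_per_league = {"ITA": 20, "ENG": 20, "ESP": 20, "FRA": 20, "GER": 18}

# Game counts copied from the progress bars in the import output above.
imported_games = {"ITA": 380, "ENG": 380, "ESP": 380, "FRA": 380, "GER": 306}

for league, n_teams in teams_per_league.items():
    expected = double_round_robin_games(n_teams)
    assert imported_games[league] == expected, (league, expected)

total_games = sum(imported_games.values())
```

In the notebook itself, the imported counts could be read from `dataset.games()` rather than copied by hand.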

In [9]:
dataset.games().head()

| game_id | competition_id | season_id | game_date | game_day | home_team_id | away_team_id |
|---|---|---|---|---|---|---|
| 2576335 | 524 | 181248 | 2018-05-20 18:45:00 | 38 | 3162 | 3161 |
| 2576336 | 524 | 181248 | 2018-05-20 18:45:00 | 38 | 3315 | 3158 |
| 2576329 | 524 | 181248 | 2018-05-20 16:00:00 | 38 | 3173 | 3172 |
| 2576330 | 524 | 181248 | 2018-05-20 16:00:00 | 38 | 3165 | 3219 |
| 2576331 | 524 | 181248 | 2018-05-20 16:00:00 | 38 | 3163 | 3166 |


In [10]:
dataset.teams().head()

| team_id | team_name_short | team_name |
|---|---|---|
| 3166 | Bologna | Bologna FC 1909 |
| 3185 | Torino | Torino FC |
| 3197 | Crotone | FC Crotone |
| 3157 | Milan | AC Milan |
| 3161 | Internazionale | FC Internazionale Milano |


In [11]:
dataset.players().head()

| player_id | team_id | player_name | nickname |
|---|---|---|---|
| 21384 | 3162 | Ciro Immobile | C. Immobile |
| 20550 | 3162 | Ştefan Daniel Radu | Ş. Radu |
| 130 | 3162 | Stefan de Vrij | S. de Vrij |
| 346908 | 3162 | Alessandro Murgia | A. Murgia |
| 376362 | 3162 | Luiz Felipe Ramos Marchi | Luiz Felipe |


In [12]:
dataset.events(game_id=2576335).head()

| event_id | game_id | period_id | milliseconds | team_id | player_id | type_id | type_name | subtype_id | subtype_name | positions | tags |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 253668302 | 2576335 | 1 | 2417.59 | 3161 | 3344 | 8 | Pass | 85 | Simple pass | [{'y': 50, 'x': 49}, {'y': 58, 'x': 38}] | [{'id': 1801}] |
| 253668303 | 2576335 | 1 | 3904.412 | 3161 | 116349 | 8 | Pass | 85 | Simple pass | [{'y': 58, 'x': 38}, {'y': 91, 'x': 37}] | [{'id': 1801}] |
| 253668304 | 2576335 | 1 | 6484.211 | 3161 | 135903 | 8 | Pass | 85 | Simple pass | [{'y': 91, 'x': 37}, {'y': 72, 'x': 34}] | [{'id': 1801}] |
| 253668306 | 2576335 | 1 | 10043.835 | 3161 | 138408 | 8 | Pass | 85 | Simple pass | [{'y': 72, 'x': 34}, {'y': 14, 'x': 36}] | [{'id': 1801}] |
| 253668308 | 2576335 | 1 | 14032.07 | 3161 | 21094 | 8 | Pass | 85 | Simple pass | [{'y': 14, 'x': 36}, {'y': 39, 'x': 30}] | [{'id': 1801}] |


In [13]:
dataset.actions(game_id=2576335).head()

| action_id | game_id | period_id | time_seconds | team_id | player_id | start_x | start_y | end_x | end_y | original_event_id | bodypart_id | type_id | result_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2576335 | 1 | 2.41759 | 3161 | 3344 | 53.55 | 34.0 | 65.1 | 39.44 | 253668302 | 0 | 0 | 1 |
| 1 | 2576335 | 1 | 3.904412 | 3161 | 116349 | 65.1 | 39.44 | 66.15 | 61.88 | 253668303 | 0 | 0 | 1 |
| 2 | 2576335 | 1 | 6.484211 | 3161 | 135903 | 66.15 | 61.88 | 69.3 | 48.96 | 253668304 | 0 | 0 | 1 |
| 3 | 2576335 | 1 | 10.043835 | 3161 | 138408 | 69.3 | 48.96 | 67.2 | 9.52 | 253668306 | 0 | 0 | 1 |
| 4 | 2576335 | 1 | 14.03207 | 3161 | 21094 | 67.2 | 9.52 | 73.5 | 26.52 | 253668308 | 0 | 0 | 1 |
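
Comparing the raw events with the converted actions shows how the two coordinate systems relate. Wyscout positions are percentages of the pitch (0 to 100), while the SPADL actions above use metres on a 105 x 68 pitch and seconds instead of milliseconds. The sketch below reproduces the first rows of the two outputs above. It is an illustration inferred from those rows, not the library's implementation: the function names are hypothetical, and because all rows shown belong to the away team (3161), the home-team branch is an assumption that cannot be verified from this output.

```python
import math

PITCH_LENGTH = 105.0  # metres
PITCH_WIDTH = 68.0  # metres


def wyscout_to_spadl_xy(x_pct, y_pct, is_home_team):
    """Map Wyscout percentage coordinates to metres on a 105 x 68 pitch."""
    x = x_pct / 100 * PITCH_LENGTH
    # Flip the vertical axis before scaling.
    y = (100 - y_pct) / 100 * PITCH_WIDTH
    if not is_home_team:
        # Mirror away-team actions. This branch is consistent with the rows
        # shown above; the home-team branch is an assumption.
        x = PITCH_LENGTH - x
        y = PITCH_WIDTH - y
    return x, y


def milliseconds_to_seconds(ms):
    """Convert the event timestamp to the unit used by `time_seconds`."""
    return ms / 1000


# First event above: start {'x': 49, 'y': 50}, end {'x': 38, 'y': 58},
# performed by team 3161, the away team in game 2576335.
start = wyscout_to_spadl_xy(49, 50, is_home_team=False)
end = wyscout_to_spadl_xy(38, 58, is_home_team=False)

# Compare with the first action above: (53.55, 34.0) -> (65.1, 39.44).
assert math.isclose(start[0], 53.55) and math.isclose(start[1], 34.0)
assert math.isclose(end[0], 65.1) and math.isclose(end[1], 39.44)
```

The same mapping also matches the later rows, for example the end location of the fourth event, `{'x': 36, 'y': 14}`, which corresponds to (67.2, 9.52) in the actions table.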


In [14]:
dataset.close()