This notebook converts the Wyscout open data from JSON into parquet files. Parquet files are very fast to load, which makes them a good fit for prototyping analyses.

**References:**

Pappalardo, Luca; Massucco, Emanuele (2019): Soccer match event dataset. figshare. Collection. https://doi.org/10.6084/m9.figshare.c.4415000

Pappalardo, L., Cintia, P., Rossi, A. et al. A public data set of spatio-temporal match events in soccer competitions. Sci Data 6, 236 (2019). https://doi.org/10.1038/s41597-019-0247-7

Data link: https://figshare.com/collections/Soccer_match_event_dataset/4415000/2

The dataframes have the following number of entries:

* df_coach: 208 entries
* df_player: 3603 entries
* df_team: 142 entries
* df_competition: 7 entries
* df_match: 1941 entries
* df_formation: 74098 entries
* df_substitution: 11097 entries
* df_event: 3251294 entries
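
The counts above can double as a quick sanity check after a run. The following is a minimal sketch, not part of the original notebook: `EXPECTED_ROWS` and `check_row_counts` are hypothetical names, and the toy dataframes stand in for the real ones.

```python
import pandas as pd

# Row counts quoted in the list above
EXPECTED_ROWS = {
    'coach': 208, 'player': 3603, 'team': 142, 'competition': 7,
    'match': 1941, 'formation': 74098, 'substitution': 11097, 'event': 3251294,
}


def check_row_counts(frames, expected=EXPECTED_ROWS):
    """Return {name: (actual, expected)} for every dataframe whose length differs."""
    return {name: (len(df), expected[name])
            for name, df in frames.items()
            if name in expected and len(df) != expected[name]}


# Toy illustration: 'competition' has the right length, 'team' does not
toy = {'competition': pd.DataFrame({'competition_id': range(7)}),
       'team': pd.DataFrame({'team_id': range(3)})}
mismatches = check_row_counts(toy)
print(mismatches)  # {'team': (3, 142)}
```

With the real data, `frames` would be something like `{'coach': df_coach, 'player': df_player, ...}`, and an empty result would mean every count matches.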

In [1]:
import requests
import zipfile
import os
import pandas as pd
import numpy as np
import glob
from mplsoccer.statsbomb import _split_location_cols, _split_dict_col, _list_dictionary_to_df

# Path
Choose the path where the Wyscout open data will be stored; the intention is to process only new files.

In [2]:
cwd = os.getcwd()
# save files in a folder in the current directory; change this to save elsewhere
DATA_FOLDER = os.path.join(cwd, 'data', 'wyscout')
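
The cells in this notebook reprocess every JSON file on each run. One possible way to process only new files is sketched below, assuming a JSON file counts as processed once a parquet file with the same stem exists in the output folder. The helper name `unprocessed_json_files` is hypothetical, and the temporary folders only mimic the `json` and `event_raw` layout.

```python
import glob
import os
import tempfile


def unprocessed_json_files(json_folder, parquet_folder, pattern='*.json'):
    """Return the JSON paths that have no same-named parquet file yet."""
    todo = []
    for json_path in sorted(glob.glob(os.path.join(json_folder, pattern))):
        stem = os.path.splitext(os.path.basename(json_path))[0]
        if not os.path.exists(os.path.join(parquet_folder, f'{stem}.parquet')):
            todo.append(json_path)
    return todo


# Toy demonstration with empty placeholder files
with tempfile.TemporaryDirectory() as tmp:
    json_dir = os.path.join(tmp, 'json')
    raw_dir = os.path.join(tmp, 'event_raw')
    os.makedirs(json_dir)
    os.makedirs(raw_dir)
    for name in ['events_Spain.json', 'events_Italy.json']:
        open(os.path.join(json_dir, name), 'w').close()
    # pretend Spain has already been converted
    open(os.path.join(raw_dir, 'events_Spain.parquet'), 'w').close()
    remaining = [os.path.basename(p)
                 for p in unprocessed_json_files(json_dir, raw_dir, 'events*.json')]

print(remaining)  # ['events_Italy.json']
```

The match and event loops could then iterate over the returned list instead of the full `glob` result.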

# Links to the data

In [3]:
# files that are jsons
JSON_LINKS = ['https://ndownloader.figshare.com/files/15073868',  # coaches
              'https://ndownloader.figshare.com/files/15073721',  # players
              'https://ndownloader.figshare.com/files/15073697',  # teams
              'https://ndownloader.figshare.com/files/15073685',  # competitions
              'https://raw.githubusercontent.com/andrewRowlinson/mplsoccer-assets/main/wyscout_event_tags.json',  # decoded event tags
              ]
JSON_FILES = ['coach.json', 
              #'referees.json',  # <- not downloaded as corrupt
              'player.json', 'team.json', 'competition.json',
              'event_tag.json']

In [4]:
# Files that are zipped
ZIP_LINKS = ['https://ndownloader.figshare.com/files/14464685',  # events
             'https://ndownloader.figshare.com/files/14464622']  # matches
ZIP_FILES = ['events.zip', 'matches.zip']

# Make the directory structure

In [5]:
# make the directory structure (makedirs also creates DATA_FOLDER itself if it is missing)
for folder in ['json', 'event_raw', 'match_raw', 'formation_raw', 'substitution_raw']:
    path = os.path.join(DATA_FOLDER, folder)
    os.makedirs(path, exist_ok=True)

# Download files

In [6]:
def download_url(url, save_path, chunk_size=128, json=False):
    '''Source: https://stackoverflow.com/questions/9419162/download-returned-zip-file-from-url '''
    r = requests.get(url, stream=True)
    r.raise_for_status()  # fail early rather than saving an error page to disk
    if json:
        # note: this only affects r.text; iter_content below writes the raw bytes unchanged
        r.encoding = 'unicode-escape'
    with open(save_path, 'wb') as fd:
        for chunk in r.iter_content(chunk_size=chunk_size):
            fd.write(chunk)

In [7]:
# download json files
for i, link in enumerate(JSON_LINKS):
    download_url(link, os.path.join(DATA_FOLDER, 'json', JSON_FILES[i]), json=True)

In [8]:
# download zip files, extract jsons, and remove original zip files
for i, link in enumerate(ZIP_LINKS):
    save_path = os.path.join(DATA_FOLDER, 'json', ZIP_FILES[i])
    download_url(link, save_path)
    with zipfile.ZipFile(save_path, 'r') as zip_ref:
        zip_ref.extractall(os.path.join(DATA_FOLDER, 'json'))
    os.remove(save_path)

# Rename dictionary for consistency with StatsBomb

In [9]:
team_rename = {'Real Club Celta de Vigo': 'Celta Vigo',
               'Valencia Club de Fútbol': 'Valencia',
               'FC Barcelona': 'Barcelona',
               'Real Betis Balompié': 'Real Betis',
               'Girona FC': 'Girona',
               'CD Leganés': 'Leganés',
               'Real Sociedad de Fútbol': 'Real Sociedad',
               'Real Club Deportivo de La Coruña': 'Deportivo La Coruna',
               'Sevilla FC': 'Sevilla',
               'Getafe Club de Fútbol': 'Getafe',
               'Athletic Club Bilbao': 'Athletic Bilbao',
               'Real Madrid Club de Fútbol': 'Real Madrid',
               'Málaga Club de Fútbol': 'Málaga',
               'Levante UD': 'Levante',
               'Reial Club Deportiu Espanyol': 'Espanyol',
               'UD Las Palmas': 'Las Palmas',
               'SD Eibar': 'Eibar',
               'Villarreal Club de Fútbol': 'Villarreal',
               'Manchester United FC': 'Manchester United',
               'Manchester City FC': 'Manchester City',
               'Tottenham Hotspur FC': 'Tottenham Hotspur',
               'AS Monaco FC': 'AS Monaco',
               'Newcastle United FC': 'Newcastle United',
               'Leicester City FC': 'Leicester City',
               'Juventus FC': 'Juventus',
               'BV Borussia 09 Dortmund': 'Borussia Dortmund',
               'Everton FC': 'Everton',
               'Arsenal FC': 'Arsenal',
               'Southampton FC': 'Southampton',
               'Liverpool FC': 'Liverpool',
               'Chelsea FC': 'Chelsea',
               'Club Atlético de Madrid': 'Atlético Madrid',
               'Korea Republic': 'South Korea'}
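
The rename dictionary only lists the teams that need a new name, so it matters that `Series.replace` leaves unmatched values alone. A small sketch of the difference from `Series.map`, using one entry from `team_rename` and one made-up unmatched name for illustration:

```python
import pandas as pd

rename = {'FC Barcelona': 'Barcelona'}  # one entry from team_rename
# 'Olympique Lyonnais' is only an illustrative value that is not in the dictionary
names = pd.Series(['FC Barcelona', 'Olympique Lyonnais'])

replaced = names.replace(rename)  # unmatched names pass through unchanged
mapped = names.map(rename)        # unmatched names become NaN

print(replaced.tolist())      # ['Barcelona', 'Olympique Lyonnais']
print(mapped.isna().tolist())  # [False, True]
```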

# Coach

In [10]:
df_coach = pd.read_json(os.path.join(DATA_FOLDER, 'json', 'coach.json'), encoding='unicode-escape')
for col in ['passportArea', 'birthArea']:
    df_coach = _split_dict_col(df_coach, col)
# rename before saving so the parquet file also contains coach_id (as in the other tables)
df_coach.rename({'wyId': 'coach_id'}, axis=1, inplace=True)
df_coach.to_parquet(os.path.join(DATA_FOLDER, 'coach.parquet'))
df_coach.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 208 entries, 0 to 207
Data columns (total 15 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   coach_id                 208 non-null    int64 
 1   shortName                208 non-null    object
 2   firstName                208 non-null    object
 3   middleName               208 non-null    object
 4   lastName                 208 non-null    object
 5   birthDate                206 non-null    object
 6   currentTeamId            208 non-null    int64 
 7   passportArea_id          208 non-null    int64 
 8   passportArea_alpha2code  208 non-null    object
 9   passportArea_alpha3code  208 non-null    object
 10  passportArea_name        208 non-null    object
 11  birthArea_id             208 non-null    int64 
 12  birthArea_alpha2code     208 non-null    object
 13  birthArea_alpha3code     208 non-null    object
 14  birthArea_name           208 non-null    object
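
`_split_dict_col` is a private mplsoccer helper. Judging from the column names in the output above (`passportArea_id`, `birthArea_name`, ...), it appears to expand each dictionary key into a `<column>_<key>` column. The following pandas-only sketch shows that idea; `split_dict_col` is a hypothetical stand-in rather than the library's implementation, it assumes every row holds a dictionary, and the toy values are made up.

```python
import pandas as pd


def split_dict_col(df, col):
    """Expand a column of dictionaries into one prefixed column per key."""
    expanded = pd.DataFrame(df[col].tolist(), index=df.index).add_prefix(f'{col}_')
    return pd.concat([df.drop(columns=col), expanded], axis=1)


# Toy rows shaped like the coach data
toy = pd.DataFrame({'wyId': [1, 2],
                    'birthArea': [{'id': 76, 'name': 'Brazil'},
                                  {'id': 380, 'name': 'Italy'}]})
out = split_dict_col(toy, 'birthArea')
print(list(out.columns))  # ['wyId', 'birthArea_id', 'birthArea_name']
```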

# Players

In [11]:
df_player = pd.read_json(os.path.join(DATA_FOLDER, 'json', 'player.json'), encoding='unicode-escape')
for col in ['passportArea', 'role', 'birthArea']:
    df_player = _split_dict_col(df_player, col)
# some of the ids are missing (null) and some are the text 'null' :)
for col in ['currentTeamId', 'currentNationalTeamId', 'passportArea_id', 'birthArea_id']:
    mask_null = (df_player[col].isnull())|(df_player[col] == 'null')
    df_player.loc[mask_null, col] = np.nan
    df_player[col] = df_player[col].astype(np.float32)
df_player.rename({'wyId': 'player_id'}, axis=1, inplace=True)
df_player.to_parquet(os.path.join(DATA_FOLDER, 'player.parquet'))
df_player.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3603 entries, 0 to 3602
Data columns (total 22 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   weight                   3603 non-null   int64  
 1   firstName                3603 non-null   object 
 2   middleName               3603 non-null   object 
 3   lastName                 3603 non-null   object 
 4   currentTeamId            3468 non-null   float32
 5   birthDate                3603 non-null   object 
 6   height                   3603 non-null   int64  
 7   player_id                3603 non-null   int64  
 8   foot                     3603 non-null   object 
 9   shortName                3603 non-null   object 
 10  currentNationalTeamId    1357 non-null   float32
 11  passportArea_name        3603 non-null   object 
 12  passportArea_id          3603 non-null   float32
 13  passportArea_alpha3code  3603 non-null   object 
 14  passportArea_alpha2code 
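
The id clean-up in the cell above handles two kinds of missing value: real nulls and the text `'null'`. A compact alternative is sketched here with `pd.to_numeric`, on made-up ids. One caveat: `errors='coerce'` turns any unparsable text into NaN, not just `'null'`, so it is a slightly broader rule than the explicit mask.

```python
import pandas as pd

# Toy id column mixing numbers, a real null, and the text 'null'
ids = pd.Series([3161, None, 'null', 4418], dtype=object)

# Unparsable entries become NaN, then the column is stored as float32
clean = pd.to_numeric(ids, errors='coerce').astype('float32')

print(clean.isna().tolist())  # [False, True, True, False]
print(clean.dtype)            # float32
```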

# Teams

In [12]:
df_team = pd.read_json(os.path.join(DATA_FOLDER, 'json', 'team.json'), encoding='unicode-escape')
df_team = _split_dict_col(df_team, 'area')
df_team['area_id'] = df_team.area_id.astype(np.int32)
df_team.rename({'wyId': 'team_id'}, axis=1, inplace=True)
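# assign the result back; calling replace(inplace=True) on a selected column is not reliable in recent pandas
df_team['officialName'] = df_team['officialName'].replace(team_rename)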
df_team.to_parquet(os.path.join(DATA_FOLDER, 'team.parquet'))
df_team.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 142 entries, 0 to 141
Data columns (total 9 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   city             142 non-null    object
 1   name             142 non-null    object
 2   team_id          142 non-null    int64 
 3   officialName     142 non-null    object
 4   type             142 non-null    object
 5   area_name        142 non-null    object
 6   area_id          142 non-null    int32 
 7   area_alpha3code  142 non-null    object
 8   area_alpha2code  142 non-null    object
dtypes: int32(1), int64(1), object(7)
memory usage: 9.6+ KB


# Competitions

In [13]:
df_competition = pd.read_json(os.path.join(DATA_FOLDER, 'json', 'competition.json'), encoding='unicode-escape')
df_competition = _split_dict_col(df_competition, 'area')
# the area id is '0' (as text) for international competitions, so set it to missing
df_competition.loc[df_competition.format=='International cup', 'area_id'] = np.nan
df_competition['area_id'] = df_competition.area_id.astype(np.float32)
# make same format as StatsBomb: competition_country_name
mask = df_competition.type=='club'
df_competition.loc[mask, 'competition_country_name'] = df_competition.loc[mask, 'area_name']
mask = df_competition.type=='international'
df_competition.loc[mask, 'competition_country_name'] = 'International'
# add gender
df_competition['competition_gender'] = 'male'
# replace with competition real names
df_competition['name'] = df_competition['name'].replace({'Spanish first division': 'La Liga',
                                                         'World Cup': 'FIFA World Cup',
                                                         'Italian first division': 'Serie A',
                                                         'English first division': 'Premier League',
                                                         'French first division': 'Ligue 1',
                                                         'German first division': 'Bundesliga',
                                                         'European Championship': 'UEFA Euro'})
# rename competition name
df_competition.rename({'name': 'competition_name', 'wyId': 'competition_id'}, axis=1, inplace=True)
# add season name
df_competition.loc[df_competition.type == 'club', 'season_name'] = '2017/2018'
df_competition.loc[df_competition.competition_name == 'UEFA Euro', 'season_name'] = '2016'
df_competition.loc[df_competition.competition_name == 'FIFA World Cup', 'season_name'] = '2018'
df_competition.to_parquet(os.path.join(DATA_FOLDER, 'competition.parquet'))
df_competition.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 11 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   competition_name          7 non-null      object 
 1   competition_id            7 non-null      int64  
 2   format                    7 non-null      object 
 3   type                      7 non-null      object 
 4   area_name                 7 non-null      object 
 5   area_id                   5 non-null      float32
 6   area_alpha3code           7 non-null      object 
 7   area_alpha2code           7 non-null      object 
 8   competition_country_name  7 non-null      object 
 9   competition_gender        7 non-null      object 
 10  season_name               7 non-null      object 
dtypes: float32(1), int64(1), object(9)
memory usage: 716.0+ bytes


# Matches

In [14]:
# list of match files
match_list = glob.glob(os.path.join(DATA_FOLDER, 'json', 'matches*.json'))

# loop through match files and save as separate parquet files
for file in match_list:
    
    # match dataframe
    df_match = pd.read_json(file, encoding='unicode-escape')
    
    # split the team information from the teamsData column into two separate columns
    col = 'teamsData'
    df_match[col] = df_match[col].apply(lambda x: {} if pd.isna(x) else x)
    df_match['team1'] = df_match.teamsData.apply(lambda x: x[list(x.keys())[0]])
    df_match['team2'] = df_match.teamsData.apply(lambda x: x[list(x.keys())[1]])
    
    # split team information stored as a dictionary into separate columns
    df_match = _split_dict_col(df_match, 'team1')
    df_match = _split_dict_col(df_match, 'team2')
    
    # add home and away teams and scores up to extra time
    mask = df_match.team1_side == 'home'
    mask_et = (df_match.team1_scoreET > 0) | (df_match.team2_scoreET > 0)
    df_match.loc[mask,'home_score'] = df_match.loc[mask,'team1_score']
    df_match.loc[mask,'away_score'] = df_match.loc[mask,'team2_score']
    df_match.loc[~mask,'home_score'] = df_match.loc[~mask,'team2_score']
    df_match.loc[~mask,'away_score'] = df_match.loc[~mask,'team1_score']
    df_match.loc[mask_et & mask,'home_score'] = df_match.loc[mask_et & mask,'team1_scoreET']
    df_match.loc[mask_et & mask,'away_score'] = df_match.loc[mask_et & mask,'team2_scoreET']
    df_match.loc[mask_et & ~mask,'home_score'] = df_match.loc[mask_et & ~mask,'team2_scoreET']
    df_match.loc[mask_et & ~mask,'away_score'] = df_match.loc[mask_et & ~mask,'team1_scoreET']    
    
    # add away/ home team info
    df_match.loc[mask, 'home_team_id'] = df_match.loc[mask, 'team1_teamId']
    df_match.loc[~mask, 'home_team_id'] = df_match.loc[~mask, 'team2_teamId']
    df_match.loc[mask, 'away_team_id'] = df_match.loc[mask, 'team2_teamId']
    df_match.loc[~mask, 'away_team_id'] = df_match.loc[~mask, 'team1_teamId']
    
    # add away/home coach info
    df_match.loc[mask, 'home_team_coach_id'] = df_match.loc[mask, 'team1_coachId']
    df_match.loc[~mask, 'home_team_coach_id'] = df_match.loc[~mask, 'team2_coachId']
    df_match.loc[mask, 'away_team_coach_id'] = df_match.loc[mask, 'team2_coachId']
    df_match.loc[~mask, 'away_team_coach_id'] = df_match.loc[~mask, 'team1_coachId']

    # format date columns
    df_match['dateutc'] = pd.to_datetime(df_match.dateutc)
    df_match['kick_off'] = pd.to_datetime(df_match.date.astype(str).str[:-6])
    
    # rename columns
    df_match.rename({'wyId': 'match_id',
                     'gameweek': 'match_week',
                     'seasonId': 'season_id',
                     'competitionId': 'competition_id',
                     'venue': 'stadium_name'}, axis=1, inplace=True)
    
    # add competition info
    df_match = df_match.merge(df_competition[['competition_id',
                                              'competition_country_name',
                                              'competition_name',
                                              'season_name',
                                              'competition_gender']], on='competition_id', how='left')
    
    # add team info
    df_match = df_match.merge(df_team[['team_id', 'officialName']],
                              left_on='home_team_id', right_on='team_id', how='left')
    df_match = df_match.merge(df_team[['team_id', 'officialName']],
                              left_on='away_team_id', right_on='team_id', how='left', suffixes=['_home', '_away'])
    
    df_match.rename({'officialName_home': 'home_team_name',
                     'officialName_away': 'away_team_name'}, axis=1, inplace=True)
    
    # replace some team names to be the same as StatsBomb
    df_match['home_team_name'] = df_match['home_team_name'].replace(team_rename)
    df_match['away_team_name'] = df_match['away_team_name'].replace(team_rename)

    # dataframes with the team id for adding to the formation/ substitutions
    df_team1 = df_match[['match_id', 'team1_teamId']].rename({'team1_teamId': 'team_id'}, axis=1)
    df_team2 = df_match[['match_id', 'team2_teamId']].rename({'team2_teamId': 'team_id'}, axis=1)
    
    # formation lineup dataframe
    df_team1_formation_lineup = _list_dictionary_to_df(df_match, 'team1_formation_lineup', 'lineup', 'lineup_id', 'match_id')
    df_team2_formation_lineup = _list_dictionary_to_df(df_match, 'team2_formation_lineup', 'lineup', 'lineup_id', 'match_id')
    df_team1_formation_lineup = df_team1_formation_lineup.merge(df_team1, on='match_id', how='left')
    df_team2_formation_lineup = df_team2_formation_lineup.merge(df_team2, on='match_id', how='left')
    df_formation_lineup = pd.concat([df_team1_formation_lineup, df_team2_formation_lineup])
    df_formation_lineup = _split_dict_col(df_formation_lineup, 'lineup')
    df_formation_lineup['bench'] = False
    
    # formation bench lineup
    df_team1_formation_bench = _list_dictionary_to_df(df_match, 'team1_formation_bench', 'lineup', 'lineup_id', 'match_id')
    df_team2_formation_bench = _list_dictionary_to_df(df_match, 'team2_formation_bench', 'lineup', 'lineup_id', 'match_id')
    df_team1_formation_bench = df_team1_formation_bench.merge(df_team1, on='match_id', how='left')
    df_team2_formation_bench = df_team2_formation_bench.merge(df_team2, on='match_id', how='left')
    df_formation_bench = pd.concat([df_team1_formation_bench, df_team2_formation_bench])
    df_formation_bench = _split_dict_col(df_formation_bench, 'lineup')
    df_formation_bench['bench'] = True
    
    # combine lineup from bench/ not from bench
    df_formation = pd.concat([df_formation_lineup, df_formation_bench])
        
    df_formation.rename({'lineup_playerId': 'player_id', 'lineup_ownGoals': 'ownGoals',
                         'lineup_redCards': 'redCards', 'lineup_goals': 'goals', 'lineup_yellowCards': 'yellowCards'},
                        axis=1, inplace=True)
    
    # fix an error where the goalkeeper (Hitz) isn't starting (Jakob is in error): Borussia Dortmund vs Augsburg 2018-02-26
    df_formation.loc[(df_formation.match_id == 2516947) & (df_formation.player_id == 14914), 'bench'] = False
    df_formation.loc[(df_formation.match_id == 2516947) & (df_formation.player_id == 391449), 'bench'] = True
    
    # get a substitutions dataframe
    df_team1_formation_substitutions = _list_dictionary_to_df(df_match, 'team1_formation_substitutions',
                                                              'lineup', 'sub_id', 'match_id')
    df_team2_formation_substitutions = _list_dictionary_to_df(df_match, 'team2_formation_substitutions', 
                                                              'lineup', 'sub_id', 'match_id')
    df_team1_formation_substitutions = df_team1_formation_substitutions.merge(df_team1, on='match_id', how='left')
    df_team2_formation_substitutions = df_team2_formation_substitutions.merge(df_team2, on='match_id', how='left')
    df_formation_substitutions = pd.concat([df_team1_formation_substitutions, df_team2_formation_substitutions])
    df_formation_substitutions = df_formation_substitutions[df_formation_substitutions.lineup != 'null'].copy()
    df_formation_substitutions = _split_dict_col(df_formation_substitutions, 'lineup')
    df_formation_substitutions.rename({'id': 'match_id', 'lineup_playerIn': 'player_id_in',
                                       'lineup_playerOut': 'player_id_out', 'lineup_minute': 'minute'},
                                      axis=1, inplace=True)
    
    # drop columns
    df_match.drop(['date', 'status', 'winner', 'referees', 'team_id_away', 'team_id_home',
                   'team1_formation_bench', 'team1_formation_lineup', 'team1_formation_substitutions',
                   'team2_formation_bench', 'team2_formation_lineup', 'team2_formation_substitutions',
                   'team1_hasFormation', 'team2_hasFormation',
                   'team1_score', 'team1_scoreP', 'team1_scoreHT', 'team1_scoreET',
                   'team2_score', 'team2_scoreP', 'team2_scoreHT', 'team2_scoreET',
                   'teamsData', 'team1_teamId', 'team2_teamId', 'team2_side',
                   'team1_side', 'team1_coachId', 'team2_coachId'], axis=1, inplace=True)
    
    save_path = os.path.join(DATA_FOLDER, 'match_raw', f'{os.path.basename(file)[:-4]}parquet')
    df_match.to_parquet(save_path)
    
    save_path = os.path.join(DATA_FOLDER, 'formation_raw', f'{os.path.basename(file)[:-4]}parquet')
    df_formation.to_parquet(save_path)
    
    save_path = os.path.join(DATA_FOLDER, 'substitution_raw', f'{os.path.basename(file)[:-4]}parquet')
    df_formation_substitutions.to_parquet(save_path)
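
The home/away score logic in the cell above takes eight `.loc` assignments. The same rule can be sketched with `np.where`: use the extra-time score when either side has one, then assign it by side. The column names follow the notebook; the two toy matches are made up. Unlike the `.loc` version, which produces float columns, this keeps the integer dtype.

```python
import numpy as np
import pandas as pd

# Toy matches: team1 at home without extra time, then team1 away with extra time
df = pd.DataFrame({'team1_side': ['home', 'away'],
                   'team1_score': [2, 0], 'team2_score': [1, 3],
                   'team1_scoreET': [0, 1], 'team2_scoreET': [0, 4]})

is_home = df['team1_side'] == 'home'
went_to_et = (df['team1_scoreET'] > 0) | (df['team2_scoreET'] > 0)

# final score per team: the extra-time score if there is one, else the regular score
team1_final = np.where(went_to_et, df['team1_scoreET'], df['team1_score'])
team2_final = np.where(went_to_et, df['team2_scoreET'], df['team2_score'])

df['home_score'] = np.where(is_home, team1_final, team2_final)
df['away_score'] = np.where(is_home, team2_final, team1_final)

print(df[['home_score', 'away_score']].values.tolist())  # [[2, 1], [4, 1]]
```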

Get matches as a single dataframe

In [15]:
match_files = glob.glob(os.path.join(DATA_FOLDER, 'match_raw', '*.parquet'))
df_match = pd.concat([pd.read_parquet(file) for file in match_files])
df_match.to_parquet(os.path.join(DATA_FOLDER, 'match.parquet'))
df_match.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1941 entries, 0 to 63
Data columns (total 23 columns):
 #   Column                    Non-Null Count  Dtype         
---  ------                    --------------  -----         
 0   roundId                   1941 non-null   int64         
 1   match_week                1941 non-null   int64         
 2   season_id                 1941 non-null   int64         
 3   dateutc                   1941 non-null   datetime64[ns]
 4   stadium_name              1941 non-null   object        
 5   match_id                  1941 non-null   int64         
 6   label                     1941 non-null   object        
 7   duration                  1941 non-null   object        
 8   competition_id            1941 non-null   int64         
 9   home_score                1941 non-null   float64       
 10  away_score                1941 non-null   float64       
 11  home_team_id              1941 non-null   float64       
 12  away_team_id          
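
The index in the output above runs "0 to 63" across 1941 rows because `pd.concat` keeps each file's own index, so index labels repeat. If a unique index is preferred, `ignore_index=True` renumbers the rows. A small sketch on toy frames:

```python
import pandas as pd

# Two toy per-competition frames, each with its own 0-based index
parts = [pd.DataFrame({'match_id': [1, 2]}),
         pd.DataFrame({'match_id': [3]})]

combined = pd.concat(parts)                       # index labels repeat
combined_reset = pd.concat(parts, ignore_index=True)  # index is renumbered

print(combined.index.tolist())        # [0, 1, 0]
print(combined_reset.index.tolist())  # [0, 1, 2]
```

Repeated labels are harmless for saving to parquet, but they can surprise later code that selects rows with `.loc`.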

Get the formation as a single dataframe

In [16]:
files = glob.glob(os.path.join(DATA_FOLDER, 'formation_raw', '*.parquet'))
df_formation = pd.concat([pd.read_parquet(file) for file in files])
df_formation.to_parquet(os.path.join(DATA_FOLDER, 'formation.parquet'))
df_formation.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 74098 entries, 0 to 738
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   match_id        74098 non-null  int64 
 1   lineup_id       74098 non-null  int64 
 2   team_id         74098 non-null  int64 
 3   player_id       74098 non-null  int64 
 4   ownGoals        74098 non-null  object
 5   redCards        74098 non-null  object
 6   goals           74098 non-null  object
 7   yellowCards     74098 non-null  object
 8   bench           74098 non-null  bool  
 9   lineup_assists  5211 non-null   object
dtypes: bool(1), int64(4), object(5)
memory usage: 5.7+ MB


Get the substitutions as a single dataframe

In [17]:
files = glob.glob(os.path.join(DATA_FOLDER, 'substitution_raw', '*.parquet'))
df_substitution = pd.concat([pd.read_parquet(file) for file in files])
df_substitution.to_parquet(os.path.join(DATA_FOLDER, 'substitution.parquet'))
df_substitution.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 11097 entries, 0 to 188
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   match_id        11097 non-null  int64 
 1   sub_id          11097 non-null  int64 
 2   team_id         11097 non-null  int64 
 3   player_id_in    11097 non-null  int64 
 4   player_id_out   11097 non-null  int64 
 5   minute          11097 non-null  int64 
 6   lineup_assists  674 non-null    object
dtypes: int64(6), object(1)
memory usage: 693.6+ KB


# Events

In [18]:
# list of event files
events_list = glob.glob(os.path.join(DATA_FOLDER, 'json', 'events*.json'))

# loop through event files and save as separate parquet files
for file in events_list:
    
    print(os.path.basename(file))
    
    # load as dataframe
    df_event = pd.read_json(file, encoding='unicode-escape')
    
    # split start and end positions
    _split_location_cols(df_event, 'positions', ['start', 'end'])
    
    # create separate columns for the x/y coordinates
    for col in ['start', 'end']:
        df_event = _split_dict_col(df_event, col)
        
    # set dodgy end coordinates to null
    mask = df_event.eventName.isin(['Shot', 'Interruption', 'Offside'])
    mask2 = df_event.subEventName.isin(['Free kick shot', 'Hand foul', 'Late card foul', 'Out of game foul', 'Protest',
                                        'Simulation', 'Time lost foul', 'Violent Foul'])
    df_event.loc[mask | mask2, 'end_x'] = np.nan
    df_event.loc[mask | mask2, 'end_y'] = np.nan
    
    # Wyscout has some dodgy end_x/end_y values in the corners. Convert them to np.nan
    mask_dodgy_end = (((df_event.end_y == 100) & (df_event.end_x == 100)) | 
                      ((df_event.end_x == 0) & (df_event.end_y == 0)))
    df_event.loc[mask_dodgy_end, 'end_y'] = np.nan
    df_event.loc[mask_dodgy_end, 'end_x'] = np.nan
    
    # set dodgy start coordinates to null
    df_event.loc[df_event.eventName.isin(['Save attempt', 'Goalkeeper leaving line']), 'start_x'] = np.nan
    df_event.loc[df_event.eventName.isin(['Save attempt', 'Goalkeeper leaving line']), 'start_y'] = np.nan
    
    # fix start coordinates for goal kicks
    df_event.loc[df_event.subEventName == 'Goal kick', 'start_x'] = 6.
    df_event.loc[df_event.subEventName == 'Goal kick', 'start_y'] = 50.
    
    # create a separate column for each tag in the dictionary
    df_new = pd.DataFrame(df_event['tags'].tolist(), index=df_event.index)
    for tag in df_new.columns:
        df_new.loc[df_new[tag].notnull(), tag] = df_new.loc[df_new[tag].notnull(), tag].apply(lambda x: x['id'])
        
    # summarise tag id columns into boolean columns for each tag and a string column for position 
    cols_to_drop = df_new.columns
    df_tag = pd.read_json(os.path.join(DATA_FOLDER, 'json', 'event_tag.json'))
    position_tags = df_tag.loc[df_tag.tag_name.str[:8] == 'position', 'tag_id'].values
    for i, row in df_tag.iterrows():
        if row['tag_id'] not in position_tags:
            df_new.loc[(df_new==row['tag_id']).any(axis=1), row['tag_name']] = True
        else:
            df_new.loc[(df_new==row['tag_id']).any(axis=1), 'position'] = row['tag_name']
            
    # remove 'position' and '_' from text in the position column
    df_new['position'] = df_new.position.str[9:].str.replace('_', ' ')
    df_new.loc[df_new['position'].isnull(), 'position'] = None
    
    # replace missing with False for boolean columns
    other_tags = df_tag.loc[df_tag.tag_name.str[:8] != 'position', 'tag_name'].values
    df_new[other_tags] = df_new[other_tags].replace({np.nan: False})
    
    # drop tag id columns
    df_new.drop(cols_to_drop, axis=1, inplace=True)                                               
                                        
    # add tags to the dataset
    df_event = pd.concat([df_event, df_new], axis=1)
    
    # drop tag column
    df_event.drop('tags', axis=1, inplace=True)
    
    # deal with blank subEventId
    df_event.loc[df_event.subEventId=='', 'subEventId'] = None
    df_event['subEventId'] = df_event['subEventId'].astype(np.float32)
    
    # rename columns for consistency with other datasets
    df_event.rename({'playerId': 'player_id',
                     'start_y': 'y',
                     'start_x': 'x',
                     'matchId': 'match_id',
                     'teamId': 'team_id',}, axis=1, inplace=True)
    
    # save to parquet
    save_path = os.path.join(DATA_FOLDER, 'event_raw', f'{os.path.basename(file)[:-4]}parquet')
    df_event.to_parquet(save_path)

events_England.json
events_European_Championship.json
events_France.json
events_Germany.json
events_Italy.json
events_Spain.json
events_World_Cup.json
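
The tag handling in the cell above turns each event's list of `{'id': ...}` dictionaries into one boolean column per tag. A simplified sketch of that idea on toy data follows. `TAG_NAMES` is a hypothetical two-entry subset, assuming Wyscout's ids 101 (goal) and 1801 (accurate); the notebook reads the full mapping from `event_tag.json`, and position tags are not covered here.

```python
import pandas as pd

# Illustrative subset only; the full mapping comes from event_tag.json
TAG_NAMES = {101: 'goal', 1801: 'accurate'}

# Toy events with the raw Wyscout tag structure
events = pd.DataFrame({'tags': [[{'id': 101}, {'id': 1801}],
                                [{'id': 1801}],
                                []]})

# collect the tag ids of each event into a set, then test membership per tag
tag_sets = events['tags'].map(lambda tags: {tag['id'] for tag in tags})
for tag_id, name in TAG_NAMES.items():
    events[name] = tag_sets.map(lambda ids, tag_id=tag_id: tag_id in ids)

print(events['goal'].tolist())      # [True, False, False]
print(events['accurate'].tolist())  # [True, True, False]
```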


Get events as a single dataframe

In [19]:
event_files = glob.glob(os.path.join(DATA_FOLDER, 'event_raw', '*.parquet'))
df_event = pd.concat([pd.read_parquet(file) for file in event_files])
df_event.sort_values(['match_id', 'matchPeriod', 'eventSec'], inplace=True)
df_event.to_parquet(os.path.join(DATA_FOLDER, 'event.parquet'))
df_event.info(verbose=True, show_counts=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3251294 entries, 0 to 647371
Data columns (total 51 columns):
 #   Column               Non-Null Count    Dtype  
---  ------               --------------    -----  
 0   eventId              3251294 non-null  int64  
 1   subEventName         3251294 non-null  object 
 2   player_id            3251294 non-null  int64  
 3   match_id             3251294 non-null  int64  
 4   eventName            3251294 non-null  object 
 5   team_id              3251294 non-null  int64  
 6   matchPeriod          3251294 non-null  object 
 7   eventSec             3251294 non-null  float64
 8   subEventId           3243112 non-null  float32
 9   id                   3251294 non-null  int64  
 10  y                    3227510 non-null  float64
 11  x                    3227510 non-null  float64
 12  end_y                3024965 non-null  float64
 13  end_x                3024965 non-null  float64
 14  goal                 3251294 non-null  bool   
 15 