# StatsBomb Data Preparation

This notebook loads StatsBomb's 2015/16 Big 5 Leagues Free Data Release, converts it to the [SPADL format](https://socceraction.readthedocs.io/en/latest/documentation/spadl/index.html) and stores it in an HDF5 database.

In [1]:
from pathlib import Path

from socceraction.data.statsbomb import StatsBombLoader
from socceraction.spadl.statsbomb import convert_to_actions

In [2]:
%load_ext autoreload
%autoreload 2

from soccer_xg.data import HDFDataset


## Configuration
We will load the StatsBomb data for the Big 5 leagues in 2015/16. 

In [3]:
comps = [
    { "league": { "name": "GER", "sb_id":  9 }, "season": { "name": "2015/16", "sb_id": 27 } },
    { "league": { "name": "ENG", "sb_id":  2 }, "season": { "name": "2015/16", "sb_id": 27 } },
    { "league": { "name": "ESP", "sb_id": 11 }, "season": { "name": "2015/16", "sb_id": 27 } },
    { "league": { "name": "FRA", "sb_id":  7 }, "season": { "name": "2015/16", "sb_id": 27 } },
    { "league": { "name": "ITA", "sb_id": 12 }, "season": { "name": "2015/16", "sb_id": 27 } }
]

The cell below defines where the data will be stored.

In [4]:
spadl_datafolder = Path("../data")

# Create data folder if it doesn't exist
spadl_datafolder.mkdir(parents=True, exist_ok=True)

## Set up a data loader

We use the [API clients included in the socceraction library](https://socceraction.readthedocs.io/en/latest/documentation/data/index.html) to fetch data. These clients fetch event streams and their corresponding metadata as pandas DataFrames using a unified data model. Below we set up a data loader to fetch data from [StatsBomb's open data repository](https://github.com/statsbomb/open-data).

In [5]:
SBL = StatsBombLoader(getter="remote")

In [6]:
import warnings
# suppress warning about missing authentication while downloading public StatsBomb data
from statsbombpy.api_client import NoAuthWarning
warnings.simplefilter('ignore', NoAuthWarning)
# suppress warnings regarding the data fidelity version
warnings.filterwarnings("ignore", message=".*fidelity.*")
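
To see how the `message` filter behaves, here is a minimal, self-contained sketch using only the standard library. The two warning texts are made up for illustration and are not the messages that socceraction or statsbombpy actually emit. The point is that `message` is a regular expression matched case-insensitively against the start of the warning text, which is why the pattern is wrapped in `.*`.

```python
import warnings


def collect_warnings(messages, ignore_pattern):
    """Emit each message as a UserWarning and return the ones not filtered out."""
    with warnings.catch_warnings(record=True) as caught:
        # Record every warning by default ...
        warnings.simplefilter("always")
        # ... except those whose text matches the pattern.
        warnings.filterwarnings("ignore", message=ignore_pattern)
        for text in messages:
            warnings.warn(text, UserWarning)
    return [str(w.message) for w in caught]


# Hypothetical warning texts, for illustration only.
demo_messages = [
    "Unknown data fidelity version, assuming version 1",
    "Column 'foo' is deprecated",
]
remaining = collect_warnings(demo_messages, ".*fidelity.*")
print(remaining)
```

Without the leading `.*`, the pattern `"fidelity"` would only suppress warnings whose text begins with that word, so both demo messages would still be reported.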

Let's fetch all available competitions and check whether we've set the correct IDs above.

In [7]:
# View all available competitions
df_competitions = SBL.competitions()
set(df_competitions.competition_name)

{'1. Bundesliga',
 'Champions League',
 'Copa del Rey',
 "FA Women's Super League",
 'FIFA U20 World Cup',
 'FIFA World Cup',
 'Indian Super league',
 'La Liga',
 'Liga Profesional',
 'Ligue 1',
 'Major League Soccer',
 'NWSL',
 'North American League',
 'Premier League',
 'Serie A',
 'UEFA Euro',
 'UEFA Europa League',
 "UEFA Women's Euro",
 "Women's World Cup"}

In [8]:
df_competitions \
 .set_index(["competition_id", "season_id"]) \
 .loc[[(c['league']['sb_id'], c['season']['sb_id']) for c in comps]]

competition_id,season_id,competition_name,country_name,competition_gender,season_name
9,27,1. Bundesliga,Germany,male,2015/2016
2,27,Premier League,England,male,2015/2016
11,27,La Liga,Spain,male,2015/2016
7,27,Ligue 1,France,male,2015/2016
12,27,Serie A,Italy,male,2015/2016
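
The same check can be made programmatically instead of by eye. The sketch below is self-contained: `available` is a hand-copied stand-in for the `(competition_id, season_id)` pairs in `df_competitions` (in the notebook one could presumably build it with `set(zip(df_competitions.competition_id, df_competitions.season_id))`), and `missing_competitions` is a hypothetical helper, not part of socceraction.

```python
def missing_competitions(comps, available):
    """Return the (league name, season name) pairs whose IDs are not in `available`."""
    missing = []
    for comp in comps:
        key = (comp["league"]["sb_id"], comp["season"]["sb_id"])
        if key not in available:
            missing.append((comp["league"]["name"], comp["season"]["name"]))
    return missing


# Stand-in for the (competition_id, season_id) pairs, copied from the table above.
available = {(9, 27), (2, 27), (11, 27), (7, 27), (12, 27)}

comps_demo = [
    {"league": {"name": "GER", "sb_id": 9}, "season": {"name": "2015/16", "sb_id": 27}},
    {"league": {"name": "ENG", "sb_id": 2}, "season": {"name": "2015/16", "sb_id": 27}},
    # Deliberately wrong season ID to show what a failed lookup looks like.
    {"league": {"name": "ESP", "sb_id": 11}, "season": {"name": "2015/16", "sb_id": 999}},
]
print(missing_competitions(comps_demo, available))
```

An empty list means every configured competition can be downloaded. Any entry that is returned points at an ID in `comps` that needs correcting.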


## Download and store data

Next, we download the data, convert it to the SPADL format and store it in an HDF5 file.

In [9]:
# create an HDF5 dataset
dataset = HDFDataset(
    path=spadl_datafolder / "spadl-statsbomb-bigfive-1516.h5", 
    mode="w"
)
for comp in comps:
    # get name and id of competition
    competition_name, competition_id = comp['league']['name'], comp['league']['sb_id']
    season_name, season_id = comp['season']['name'], comp['season']['sb_id']
    print(f"Importing {competition_name} {season_name} ...")
    # import data
    dataset.import_data(
        SBL, 
        convert_to_actions, 
        competition_id, 
        season_id
    )

Importing GER 2015/16 ...
Loading game data...: 100%|██████████| 306/306 [10:19<00:00,  2.02s/it]

Importing ENG 2015/16 ...
Loading game data...: 100%|██████████| 380/380 [12:13<00:00,  1.93s/it]

Importing ESP 2015/16 ...
Loading game data...: 100%|██████████| 380/380 [12:04<00:00,  1.91s/it]

Importing FRA 2015/16 ...
Loading game data...: 100%|██████████| 377/377 [13:02<00:00,  2.08s/it]

Importing ITA 2015/16 ...
Loading game data...: 100%|██████████| 380/380 [14:30<00:00,  2.29s/it]
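
As a rough guide to how long a full import takes, the game counts and wall-clock times reported by the progress bars above can be added up. This is plain arithmetic on the numbers printed in this run; timings will differ with network speed and hardware.

```python
# (games, "MM:SS") per league, copied from the progress bars above.
runs = {
    "GER": (306, "10:19"),
    "ENG": (380, "12:13"),
    "ESP": (380, "12:04"),
    "FRA": (377, "13:02"),
    "ITA": (380, "14:30"),
}


def to_seconds(mmss):
    """Convert a 'MM:SS' string to a number of seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


total_games = sum(games for games, _ in runs.values())
total_seconds = sum(to_seconds(elapsed) for _, elapsed in runs.values())

print(f"{total_games} games in {total_seconds // 60} min {total_seconds % 60} s")
print(f"{total_seconds / total_games:.2f} s per game on average")
```

For this run that comes to 1,823 games in roughly 62 minutes.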


The HDF5 database now contains all games, teams and players, along with the raw events and the SPADL actions of each game.

In [10]:
dataset.games().head()

game_id,season_id,competition_id,competition_stage,game_day,game_date,home_team_id,away_team_id,home_score,away_score,venue,referee
3890561,27,9,Regular Season,34,2016-05-14 15:30:00,175,181,1,4,PreZero Arena,Felix Brych
3890505,27,9,Regular Season,28,2016-04-02 15:30:00,169,184,1,0,Allianz Arena,Florian Meyer
3890511,27,9,Regular Season,29,2016-04-08 20:30:00,173,178,2,2,Olympiastadion Berlin,Benjamin Brand
3890515,27,9,Regular Season,29,2016-04-09 15:30:00,171,872,1,2,Volksparkstadion,Peter Sippel
3890411,27,9,Regular Season,17,2015-12-20 16:30:00,173,177,2,0,Olympiastadion Berlin,Peter Sippel


In [11]:
dataset.teams().head()

team_id,team_name
179,Wolfsburg
184,Eintracht Frankfurt
174,VfB Stuttgart
186,FC Köln
172,Augsburg


In [12]:
dataset.players().head()

player_id,team_id,player_name,nickname
3053,181,Leroy Sané,
3499,181,Jean-Eric Maxim Choupo-Moting,Eric Maxim Choupo-Moting
3502,181,Joël Andre Job Matip,Joël Matip
3510,181,Sead Kolašinac,
5242,181,Younès Belhanda,


In [13]:
dataset.events(game_id=3890561).head()

event_id,game_id,period_id,team_id,player_id,type_id,type_name,index,timestamp,minute,second,...,team_name,duration,extra,related_events,player_name,position_id,position_name,location,under_pressure,counterpress
41bd60ac-9b2c-4cb8-85aa-23ae71825c1e,3890561,1,175,,35,Starting XI,1,0 days 00:00:00,0,0,...,Hoffenheim,0.0,"{'tactics': {'formation': 3421, 'lineup': [{'p...",[],,,,,False,False
fbca533d-f3f4-4a86-b4a3-4fcae63592cf,3890561,1,181,,35,Starting XI,2,0 days 00:00:00,0,0,...,Schalke 04,0.0,"{'tactics': {'formation': 4141, 'lineup': [{'p...",[],,,,,False,False
b15ba6b1-61ac-4d9c-b2a3-096ce31bcf01,3890561,1,175,,18,Half Start,3,0 days 00:00:00,0,0,...,Hoffenheim,0.0,{},[442128f8-2e38-491c-bf1e-b336e91757fa],,,,,False,False
442128f8-2e38-491c-bf1e-b336e91757fa,3890561,1,181,,18,Half Start,4,0 days 00:00:00,0,0,...,Schalke 04,0.0,{},[b15ba6b1-61ac-4d9c-b2a3-096ce31bcf01],,,,,False,False
644e16d7-10ca-45f0-8128-fc0055d6f753,3890561,1,175,8387.0,30,Pass,5,0 days 00:00:00.482000,0,0,...,Hoffenheim,0.453238,"{'pass': {'recipient': {'id': 5460, 'name': 'A...",[7602c8d9-d988-4eae-bb9f-309fbad4c7c5],Mark Uth,18.0,Right Attacking Midfield,"[61.0, 40.1]",False,False


In [14]:
dataset.actions(game_id=3890561).head()

action_id,game_id,original_event_id,period_id,time_seconds,team_id,player_id,start_x,start_y,end_x,end_y,type_id,result_id,bodypart_id
0,3890561,644e16d7-10ca-45f0-8128-fc0055d6f753,1,0.482,175,8387.0,53.33125,33.9575,52.63125,35.8275,0,1,4
1,3890561,329a1879-2521-4614-8c68-b4798b0e5d23,1,0.935,175,5460.0,52.63125,35.8275,51.93125,35.4875,21,1,0
2,3890561,77e2ddaf-6de3-49e7-a318-7d765799b543,1,1.015,175,5460.0,51.93125,35.4875,47.11875,32.2575,0,1,4
3,3890561,1b91a029-f722-4b0d-b9d5-53cdc776f9e3,1,2.167,175,6039.0,47.11875,32.2575,45.71875,29.6225,21,1,0
4,3890561,2c51f271-c812-45af-896b-06f49a14a5bb,1,2.954,175,6039.0,45.71875,29.6225,29.96875,15.3425,0,1,5
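
Comparing the first pass in the events table with action 0 shows how the conversion rescales coordinates. StatsBomb reports `location` on a 120 × 80 pitch with the y-axis pointing down, while SPADL uses a 105 × 68 m pitch with the y-axis pointing up. The sketch below reproduces the `start_x`, `start_y` and `time_seconds` values of action 0 from the raw event. The half-cell offset of 0.05 is an assumption inferred from these numbers (it matches a 0.1-unit location grid), and `statsbomb_to_spadl` is a hypothetical helper, not socceraction's own function.

```python
from datetime import timedelta

SPADL_LENGTH, SPADL_WIDTH = 105.0, 68.0  # SPADL pitch size in meters
SB_LENGTH, SB_WIDTH = 120.0, 80.0        # StatsBomb pitch size in its own units
CELL_CENTER = 0.05                       # assumed half-cell offset (0.1-unit grid)


def statsbomb_to_spadl(location):
    """Rescale a StatsBomb [x, y] location to SPADL coordinates."""
    x, y = location
    spadl_x = (x - CELL_CENTER) / SB_LENGTH * SPADL_LENGTH
    # Flip the y-axis: StatsBomb measures from the top, SPADL from the bottom.
    spadl_y = SPADL_WIDTH - (y - CELL_CENTER) / SB_WIDTH * SPADL_WIDTH
    return spadl_x, spadl_y


# First pass of game 3890561, taken from the events table above:
# location [61.0, 40.1], timestamp "0 days 00:00:00.482000".
start_x, start_y = statsbomb_to_spadl([61.0, 40.1])
time_seconds = timedelta(milliseconds=482).total_seconds()

print(round(start_x, 5), round(start_y, 4), time_seconds)
```

The values agree with the first row of the actions table (`start_x` 53.33125, `start_y` 33.9575, `time_seconds` 0.482), which suggests the action was derived from that event as described.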


In [15]:
dataset.close()