# StatsBomb Data Preparation

This notebook loads [StatsBomb's 2015/16 Big 5 Leagues Free Data Release](https://statsbomb.com/what-we-do/hub/free-data/) and stores it in an HDF5 database.

To run it, install socceraction with the optional `statsbombpy` and `pytables` dependencies:

```
pip install "socceraction[statsbomb,hdf]"
```

## Configuration
We will load the StatsBomb data for the Big 5 leagues in 2015/16. The IDs of these competitions are defined in the cell below.

In [1]:
from socceraction.data import PartitionIdentifier

comps = [
    PartitionIdentifier(competition_id=12, season_id=27),  # ITA 2015/16
    PartitionIdentifier(competition_id=2,  season_id=27),  # ENG 2015/16
    PartitionIdentifier(competition_id=7,  season_id=27),  # FRA 2015/16
    PartitionIdentifier(competition_id=9,  season_id=27),  # GER 2015/16
    PartitionIdentifier(competition_id=11, season_id=27),  # ESP 2015/16
]

We will store the data in `../../data/`. If it does not yet exist, we create the directory now.

In [2]:
from pathlib import Path

data_dir = Path("../../data")

# Create data folder if it doesn't exist
data_dir.mkdir(parents=True, exist_ok=True)
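
As a minimal sketch of why both flags are passed to `mkdir`, the snippet below uses a temporary directory (an assumption made here so that nothing is written to your project): `parents=True` creates missing intermediate folders, and `exist_ok=True` makes a repeated call harmless.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    nested = Path(tmp) / "project" / "data"

    # Creates "project" and "data" in one call.
    nested.mkdir(parents=True, exist_ok=True)
    # Calling it again does not raise because of exist_ok=True.
    nested.mkdir(parents=True, exist_ok=True)
    created = nested.is_dir()

    # Without parents=True, a missing intermediate folder is an error.
    try:
        (Path(tmp) / "a" / "b").mkdir()
        raised = False
    except FileNotFoundError:
        raised = True
```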

## Set up a data loader

We use the API clients included in the socceraction library to fetch data. These clients fetch event streams and their corresponding metadata as pandas DataFrames using a unified data model. Below we set up a data loader to fetch data from [StatsBomb's open data repository](https://github.com/statsbomb/open-data). The documentation provides instructions on [how to connect with other data sources](https://socceraction.readthedocs.io/en/latest/documentation/data/index.html).

In [3]:
from socceraction.data import StatsBombLoader

SBL = StatsBombLoader(getter="remote")

In [4]:
# suppress warning about missing authentication while downloading public StatsBomb data
import warnings
from statsbombpy.api_client import NoAuthWarning
warnings.simplefilter('ignore', NoAuthWarning)
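
The filter above stays active for the rest of the session. If you would rather silence the warning only while downloading, a scoped filter is one option. The sketch below uses only the standard library; `DemoNoAuthWarning` and `fetch_public_data` are hypothetical stand-ins for `NoAuthWarning` and a loader call, so it runs without statsbombpy installed.

```python
import warnings


class DemoNoAuthWarning(UserWarning):
    """Hypothetical stand-in for statsbombpy's NoAuthWarning."""


def fetch_public_data():
    """Hypothetical stand-in for a loader call that warns about credentials."""
    warnings.warn("credentials were not supplied", DemoNoAuthWarning)
    return {"status": "ok"}


# The outer context only records warnings so that the effect can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")

    # Inside this block the warning is ignored; the filter is undone on exit.
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", DemoNoAuthWarning)
        quiet_result = fetch_public_data()
    n_inside = len(caught)

    # Outside the block the warning is reported again.
    fetch_public_data()
    n_outside = len(caught)
```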

Let's fetch all available competitions and check whether we've set the correct IDs above.

In [5]:
df_competitions = SBL.competitions()
(
    df_competitions
    .set_index(["competition_id", "season_id"])
    .loc[[(c.competition_id, c.season_id) for c in comps]]
)

competition_id,season_id,competition_name,country_name,competition_gender,season_name
12,27,Serie A,Italy,male,2015/2016
2,27,Premier League,England,male,2015/2016
7,27,Ligue 1,France,male,2015/2016
9,27,1. Bundesliga,Germany,male,2015/2016
11,27,La Liga,Spain,male,2015/2016
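
Checking the table by eye works for five competitions. For longer lists, an explicit check may be easier to trust. The following is a sketch in plain Python, with the `(competition_id, season_id)` pairs copied from the table above; the helper name `find_missing` is made up for this illustration and is not part of socceraction.

```python
# (competition_id, season_id) pairs configured at the top of the notebook.
configured = [(12, 27), (2, 27), (7, 27), (9, 27), (11, 27)]

# Pairs and names as listed in the competitions table above.
available = {
    (12, 27): "Serie A",
    (2, 27): "Premier League",
    (7, 27): "Ligue 1",
    (9, 27): "1. Bundesliga",
    (11, 27): "La Liga",
}


def find_missing(pairs, known):
    """Return the configured pairs that are not present in `known`."""
    return [pair for pair in pairs if pair not in known]


missing = find_missing(configured, available)
if missing:
    raise KeyError(f"Unknown competition/season pairs: {missing}")
```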


With the API client you can also get all available games in a season, the teams and players that participated in a game, and the event stream of a game as convenient pandas DataFrames.

In [6]:
df_games = SBL.games(competition_id=2, season_id=27)
df_games.head()

Unnamed: 0,game_id,season_id,competition_id,competition_stage,game_day,game_date,home_team_id,away_team_id,home_score,away_score,venue,referee
0,3754058,27,2,Regular Season,20,2016-01-02 16:00:00,22,28,0,0,King Power Stadium,Andre Marriner
1,3754245,27,2,Regular Season,9,2015-10-17 16:00:00,27,41,1,0,The Hawthorns,Martin Atkinson
2,3754136,27,2,Regular Season,17,2015-12-19 18:30:00,37,59,1,1,St. James'' Park,Martin Atkinson
3,3754037,27,2,Regular Season,36,2016-04-30 16:00:00,29,28,2,1,Goodison Park,Neil Swarbrick
4,3754039,27,2,Regular Season,26,2016-02-13 16:00:00,31,23,1,2,Selhurst Park,Robert Madley


In [7]:
df_teams = SBL.teams(game_id=3754058)
df_teams.head()

Unnamed: 0,team_id,team_name
0,28,AFC Bournemouth
1,22,Leicester City


In [8]:
df_players = SBL.players(game_id=3754058)
df_players.head()

Unnamed: 0,game_id,team_id,player_id,player_name,nickname,jersey_number,is_starter,starting_position_id,starting_position_name,minutes_played
0,3754058,28,3049,Matt Ritchie,,30,True,12,Right Midfield,95
1,3754058,28,3085,Glenn Murray,,27,False,0,Substitute,44
2,3754058,28,3304,Harry Arter,,8,True,15,Left Center Midfield,60
3,3754058,28,3341,Steve Cook,,3,True,5,Left Center Back,95
4,3754058,28,3343,Dan Gosling,,4,True,13,Right Center Midfield,95


In [9]:
df_events = SBL.events(game_id=3754058)
df_events.head()

Unnamed: 0,game_id,event_id,period_id,team_id,player_id,type_id,type_name,index,timestamp,minute,...,team_name,duration,extra,related_events,player_name,position_id,position_name,location,under_pressure,counterpress
0,3754058,9153e9f4-f69c-4e04-8f64-505592e212cd,1,22,,35,Starting XI,1,0 days 00:00:00,0,...,Leicester City,0.0,"{'tactics': {'formation': 442, 'lineup': [{'pl...",[],,,,,False,False
1,3754058,3fbcf4e7-94d1-485a-be85-fd26a6af0318,1,28,,35,Starting XI,2,0 days 00:00:00,0,...,AFC Bournemouth,0.0,"{'tactics': {'formation': 4141, 'lineup': [{'p...",[],,,,,False,False
2,3754058,06a9a4dc-d9c9-40f6-bd89-437ba7fe682d,1,28,,18,Half Start,3,0 days 00:00:00,0,...,AFC Bournemouth,0.0,{},[100362ee-9311-4187-bd8a-0201d9db2565],,,,,False,False
3,3754058,100362ee-9311-4187-bd8a-0201d9db2565,1,22,,18,Half Start,4,0 days 00:00:00,0,...,Leicester City,0.0,{},[06a9a4dc-d9c9-40f6-bd89-437ba7fe682d],,,,,False,False
4,3754058,2ca23eea-a984-47e4-8243-8f00880ad1c9,1,28,3343.0,30,Pass,5,0 days 00:00:01.753000,0,...,AFC Bournemouth,0.308263,"{'pass': {'recipient': {'id': 3346, 'name': 'J...",[1f98c89e-2326-4200-8c12-a987fdbbaf2e],Dan Gosling,13.0,Right Center Midfield,"[61.0, 40.1]",False,False
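
Judging by the output above, the `extra` column holds a nested dictionary whose keys depend on the event type, and it is empty for events such as "Half Start". A small helper can read nested fields without raising on missing keys. This is a sketch: `get_nested` is a hypothetical helper, and the recipient name is a placeholder because the real value is truncated in the output.

```python
def get_nested(mapping, *keys, default=None):
    """Walk nested dictionaries and return `default` if any key is missing."""
    current = mapping
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Shaped like the `extra` value of the pass event shown above.
# The name is a placeholder, not the actual player.
pass_extra = {"pass": {"recipient": {"id": 3346, "name": "Example Player"}}}
# Shaped like the `extra` value of a "Half Start" event.
half_start_extra = {}

recipient_id = get_nested(pass_extra, "pass", "recipient", "id")
no_recipient = get_nested(half_start_extra, "pass", "recipient", "id")
```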


## Download and store data

Instead of downloading and converting the data to DataFrames every time you need it, it is usually better to store the data locally in a structured database. For this purpose, SoccerAction provides the `socceraction.data.HDFDataset` class, a wrapper around [`pandas.HDFStore`](https://pandas.pydata.org/pandas-docs/stable/reference/io.html#hdfstore-pytables-hdf5) that adds a convenient interface for storing and retrieving an event stream dataset. If you prefer SQLite over HDF, SoccerAction also provides a `socceraction.data.SQLDataset`, or you can implement an interface for your own custom data storage solution by extending the `socceraction.data.Dataset` class.
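
To illustrate the general idea of caching fetched tables locally, here is a sketch that uses Python's built-in `sqlite3` module. It is not how `HDFDataset` or `SQLDataset` are implemented; the table layout is an assumption made for this example, and the two rows are taken from the games table shown earlier.

```python
import sqlite3

# (game_id, competition_id, season_id, home_team_id, away_team_id,
#  home_score, away_score), copied from the games output above.
games = [
    (3754058, 2, 27, 22, 28, 0, 0),
    (3754245, 2, 27, 27, 41, 1, 0),
]

# An in-memory database keeps the example free of side effects;
# pass a file path instead to persist the data between sessions.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE games (
        game_id INTEGER PRIMARY KEY,
        competition_id INTEGER,
        season_id INTEGER,
        home_team_id INTEGER,
        away_team_id INTEGER,
        home_score INTEGER,
        away_score INTEGER
    )
    """
)
# INSERT OR REPLACE makes a repeated import of the same game idempotent.
conn.executemany("INSERT OR REPLACE INTO games VALUES (?, ?, ?, ?, ?, ?, ?)", games)
conn.commit()

rows = conn.execute(
    "SELECT game_id, home_score, away_score FROM games "
    "WHERE competition_id = ? AND season_id = ? ORDER BY game_id",
    (2, 27),
).fetchall()
conn.close()
```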

In [10]:
from socceraction.data import HDFDataset

# Create an HDF dataset.
# Note: `mode="w"` recreates the H5 file if it already exists.
# To add data to an existing dataset, use `mode="a"`.
dataset = HDFDataset(
    path=(data_dir / "statsbomb-bigfive-1516.h5"),
    mode="w",
)

In [11]:
for comp in comps:
    dataset.import_data(SBL, partition=comp)

Loading game data...: 100%|██████████| 380/380 [09:03<00:00,  1.43s/it]
Loading game data...: 100%|██████████| 380/380 [09:20<00:00,  1.48s/it]
Loading game data...: 100%|██████████| 377/377 [09:29<00:00,  1.51s/it]
Loading game data...: 100%|██████████

The HDF database now contains all games, teams, and players, along with the events performed during each game.

In [12]:
dataset.games().head()

Unnamed: 0,game_id,season_id,competition_id,competition_stage,game_day,game_date,home_team_id,away_team_id,home_score,away_score,venue,referee
0,3879863,27,12,Regular Season,37,2016-05-08 15:00:00,228,230,1,1,Gewiss Stadium,Nicola Rizzoli
1,3879773,27,12,Regular Season,28,2016-03-06 16:00:00,291,230,2,0,Stadio Comunale Matusa,Nicola Rizzoli
2,3879847,27,12,Regular Season,36,2016-04-30 18:00:00,230,241,1,5,Dacia Arena,Maurizio Mariani
3,3879862,27,12,Regular Season,37,2016-05-08 20:45:00,241,227,1,2,Stadio Olimpico Grande Torino,Antonio Damato
4,3879817,27,12,Regular Season,33,2016-04-16 20:45:00,238,227,2,0,Stadio Giuseppe Meazza,Gianluca Rocchi


In [13]:
dataset.teams().head()

Unnamed: 0,team_id,team_name
0,290,Empoli
1,231,Chievo
0,224,Juventus
1,230,Udinese
0,241,Torino


In [14]:
dataset.players().head()

Unnamed: 0,team_id,player_id,player_name
0,228,6941,Jasmin Kurtič
1,228,6992,Andrea Masiello
2,228,6994,Marten de Roon
3,228,7002,Rafael Tolói
4,228,7108,Berat Djimsiti


In [15]:
dataset.events(game_id=3754058).head()

Unnamed: 0,game_id,event_id,period_id,team_id,player_id,type_id,type_name,index,timestamp,minute,...,team_name,duration,extra,related_events,player_name,position_id,position_name,location,under_pressure,counterpress
0,3754058,9153e9f4-f69c-4e04-8f64-505592e212cd,1,22,,35,Starting XI,1,0 days 00:00:00,0,...,Leicester City,0.0,"{'tactics': {'formation': 442, 'lineup': [{'pl...",[],,,,,False,False
1,3754058,3fbcf4e7-94d1-485a-be85-fd26a6af0318,1,28,,35,Starting XI,2,0 days 00:00:00,0,...,AFC Bournemouth,0.0,"{'tactics': {'formation': 4141, 'lineup': [{'p...",[],,,,,False,False
2,3754058,06a9a4dc-d9c9-40f6-bd89-437ba7fe682d,1,28,,18,Half Start,3,0 days 00:00:00,0,...,AFC Bournemouth,0.0,{},[100362ee-9311-4187-bd8a-0201d9db2565],,,,,False,False
3,3754058,100362ee-9311-4187-bd8a-0201d9db2565,1,22,,18,Half Start,4,0 days 00:00:00,0,...,Leicester City,0.0,{},[06a9a4dc-d9c9-40f6-bd89-437ba7fe682d],,,,,False,False
4,3754058,2ca23eea-a984-47e4-8243-8f00880ad1c9,1,28,3343.0,30,Pass,5,0 days 00:00:01.753000,0,...,AFC Bournemouth,0.308263,"{'pass': {'recipient': {'id': 3346, 'name': 'J...",[1f98c89e-2326-4200-8c12-a987fdbbaf2e],Dan Gosling,13.0,Right Center Midfield,"[61.0, 40.1]",False,False


Additionally, the `HDFDataset` provides a number of methods that make it convenient to access the dataset. Below are a few examples.

In [16]:
# Find a player in the dataset
dataset.search_player("Kevin")

Unnamed: 0,team_id,player_id,player_name
10,243,8668,Kevin-Prince Boateng
17,1683,7793,Kevin Lasagna
8,229,6980,Kevin Strootman
26,59,75899,Kevin Toner
10,24,11992,Kevin Linford Stewart
8,38,3611,Kevin Wimmer
3,36,3089,Kevin De Bruyne
22,29,4317,Kevin Mirallas
10,37,16027,Kevin Mbabu
25,40,21540,Kevin Nolan


In [17]:
# Find a team in the dataset
dataset.search_team("Manchester")

Unnamed: 0,team_id,team_name
0,39,Manchester United
1,36,Manchester City


In [18]:
dataset.close()
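
If a cell raises before `close()` is reached, the file handle may stay open. Any object with a `close()` method can be wrapped in `contextlib.closing` to guard against that. The sketch below uses a hypothetical `DemoDataset` class in place of `HDFDataset` so that it runs on its own; whether `HDFDataset` also supports the `with` statement directly is not shown in this notebook.

```python
from contextlib import closing


class DemoDataset:
    """Hypothetical stand-in for a dataset object that exposes close()."""

    def __init__(self):
        self.is_open = True

    def games(self):
        if not self.is_open:
            raise ValueError("dataset is closed")
        return [3754058]

    def close(self):
        self.is_open = False


demo = DemoDataset()
try:
    # closing() calls demo.close() on exit, even when the body raises.
    with closing(demo) as ds:
        first_games = ds.games()
        raise RuntimeError("simulated failure while processing")
except RuntimeError:
    pass
```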