# StatsBomb Data Preparation

This notebook loads [StatsBomb's 2015/16 Big 5 Leagues Free Data Release](https://statsbomb.com/what-we-do/hub/free-data/) and stores it in an HDF5 database.

To be able to run it, you'll have to install socceraction with the optional `statsbombpy` and `pytables` dependencies:

```
pip install "socceraction[statsbomb,hdf]"
```

## Configuration
We will load the StatsBomb data for the Big 5 leagues in 2015/16. The IDs of these competitions are defined in the cell below.

In [1]:
comps = [
    { "league": { "name": "ENG", "sb_id":  2 }, "season": { "name": "2015/16", "sb_id": 27 } },
    { "league": { "name": "FRA", "sb_id":  7 }, "season": { "name": "2015/16", "sb_id": 27 } },
    { "league": { "name": "GER", "sb_id":  9 }, "season": { "name": "2015/16", "sb_id": 27 } },
    { "league": { "name": "ESP", "sb_id": 11 }, "season": { "name": "2015/16", "sb_id": 27 } },
    { "league": { "name": "ITA", "sb_id": 12 }, "season": { "name": "2015/16", "sb_id": 27 } }
]

We will store the data in `../../data/`. If it does not yet exist, we create the directory now.

In [2]:
from pathlib import Path

data_dir = Path("../../data")

# Create data folder if it doesn't exist
data_dir.mkdir(parents=True, exist_ok=True)

## Set up a data loader

We use the [API clients included in the socceraction library](https://socceraction.readthedocs.io/en/latest/documentation/data/index.html) to fetch data. These clients fetch event streams and their corresponding metadata as Pandas DataFrames using a unified data model. Below we set up a data loader to fetch data from [StatsBomb's open data repository](https://github.com/statsbomb/open-data).

In [3]:
from socceraction.data import StatsBombLoader

SBL = StatsBombLoader(getter="remote")

In [4]:
import warnings
# suppress warning about missing authentication while downloading public StatsBomb data
from statsbombpy.api_client import NoAuthWarning
warnings.simplefilter('ignore', NoAuthWarning)

Let's fetch all available competitions and check whether we've set the correct IDs above.

In [5]:
# View all available competitions
df_competitions = SBL.competitions()
set(df_competitions.competition_name)

{'1. Bundesliga',
 'African Cup of Nations',
 'Champions League',
 'Copa del Rey',
 "FA Women's Super League",
 'FIFA U20 World Cup',
 'FIFA World Cup',
 'Indian Super league',
 'La Liga',
 'Liga Profesional',
 'Ligue 1',
 'Major League Soccer',
 'NWSL',
 'North American League',
 'Premier League',
 'Serie A',
 'UEFA Euro',
 'UEFA Europa League',
 "UEFA Women's Euro",
 "Women's World Cup"}

In [6]:
df_competitions \
 .set_index(["competition_id", "season_id"]) \
 .loc[[(c['league']['sb_id'], c['season']['sb_id']) for c in comps]]

| competition_id | season_id | competition_name | country_name | competition_gender | season_name |
|---|---|---|---|---|---|
| 2 | 27 | Premier League | England | male | 2015/2016 |
| 7 | 27 | Ligue 1 | France | male | 2015/2016 |
| 9 | 27 | 1. Bundesliga | Germany | male | 2015/2016 |
| 11 | 27 | La Liga | Spain | male | 2015/2016 |
| 12 | 27 | Serie A | Italy | male | 2015/2016 |
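
The lookup above confirms the configured IDs by eye. A minimal sketch of the same check done programmatically is shown here, reporting by league name any configured entry that is not available. The `available` set and the `missing_competitions` helper are hypothetical names introduced for illustration; the ID pairs are copied from the table above.

```python
# The same configuration as the `comps` list defined earlier in this notebook.
comps = [
    {"league": {"name": "ENG", "sb_id": 2}, "season": {"name": "2015/16", "sb_id": 27}},
    {"league": {"name": "FRA", "sb_id": 7}, "season": {"name": "2015/16", "sb_id": 27}},
    {"league": {"name": "GER", "sb_id": 9}, "season": {"name": "2015/16", "sb_id": 27}},
    {"league": {"name": "ESP", "sb_id": 11}, "season": {"name": "2015/16", "sb_id": 27}},
    {"league": {"name": "ITA", "sb_id": 12}, "season": {"name": "2015/16", "sb_id": 27}},
]

# Hypothetical stand-in for the (competition_id, season_id) pairs
# found in the competitions table.
available = {(2, 27), (7, 27), (9, 27), (11, 27), (12, 27)}


def missing_competitions(comps, available):
    """Return the league names whose (competition_id, season_id) pair is not available."""
    return [
        c["league"]["name"]
        for c in comps
        if (c["league"]["sb_id"], c["season"]["sb_id"]) not in available
    ]


missing = missing_competitions(comps, available)
print(missing or "All configured competitions are available.")
```

With the real data, `available` could be built from the DataFrame, for example with `set(zip(df_competitions.competition_id, df_competitions.season_id))`.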


## Download and store data

Next, we download the data and store it in an HDF file. For this purpose, SoccerAction provides the `socceraction.data.HDFDataset` class, a wrapper around `pandas.HDFStore` that adds a convenient interface for storing and retrieving an event stream dataset. If you prefer SQLite over HDF, SoccerAction also provides `socceraction.data.SQLDataset`, or you can implement an interface for your own data storage solution by extending the `socceraction.data.Dataset` class.
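
To make the idea of a storage wrapper concrete, here is a minimal sketch that uses only Python's built-in `sqlite3` module. It is not socceraction's `SQLDataset` and does not extend `socceraction.data.Dataset`; the class `TinySQLiteStore`, its methods, and its table layout are hypothetical and only illustrate what "a convenient interface for storing and retrieving" can look like. The two sample rows reuse IDs from the games table shown later in this notebook.

```python
import sqlite3


class TinySQLiteStore:
    """Hypothetical minimal game store; NOT socceraction's SQLDataset."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS games ("
            "game_id INTEGER PRIMARY KEY, competition_id INTEGER, "
            "season_id INTEGER, home_team_id INTEGER, away_team_id INTEGER)"
        )

    def add_games(self, rows):
        # INSERT OR REPLACE makes a repeated import overwrite rows
        # instead of duplicating them.
        self.conn.executemany(
            "INSERT OR REPLACE INTO games VALUES (?, ?, ?, ?, ?)", rows
        )
        self.conn.commit()

    def games(self, competition_id=None):
        # Return all games, optionally restricted to one competition.
        query = (
            "SELECT game_id, competition_id, season_id, "
            "home_team_id, away_team_id FROM games"
        )
        params = ()
        if competition_id is not None:
            query += " WHERE competition_id = ?"
            params = (competition_id,)
        return self.conn.execute(query + " ORDER BY game_id", params).fetchall()

    def close(self):
        self.conn.close()


store = TinySQLiteStore()
store.add_games([(3754058, 2, 27, 22, 28), (3754245, 2, 27, 27, 41)])
store.add_games([(3754058, 2, 27, 22, 28)])  # importing the same game twice
games = store.games(competition_id=2)
store.close()
print(games)
```

The design choice worth noting is the primary key on `game_id`: it lets the store be re-run over the same competition without accumulating duplicate rows.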

In [7]:
from socceraction.data import HDFDataset

# create a HDF dataset
dataset = HDFDataset(
    path=(data_dir / "statsbomb-bigfive-1516.h5"), 
    mode="w"  # note: using `mode=w` will recreate the H5 file if it already exists. To add data to an existing dataset, use `mode=a`.
)

In [8]:
for comp in comps:
    # get name and id of competition
    competition_name, competition_id = comp['league']['name'], comp['league']['sb_id']
    season_name, season_id = comp['season']['name'], comp['season']['sb_id']
    print(f"Importing {competition_name} {season_name} ...")
    # import data
    dataset.import_data(SBL, competition_id, season_id)

Importing ENG 2015/16 ...

Loading game data...: 100%|██████████| 380/380 [08:56<00:00,  1.41s/it]

Importing FRA 2015/16 ...

Loading game data...: 100%|██████████| 377/377 [09:10<00:00,  1.46s/it]

Importing GER 2015/16 ...

Loading game data...: 100%|██████████| 306/306 [07:24<00:00,  1.45s/it]

Importing ESP 2015/16 ...

Loading game data...: 100%|██████████| 380/380 [09:03<00:00,  1.43s/it]

Importing ITA 2015/16 ...

Loading game data...: 100%|██████████| 380/380 [09:22<00:00,  1.48s/it]

The HDF database now contains all games, teams, and players, along with the events recorded in each game.

In [9]:
dataset.games().head()

| game_id | season_id | competition_id | competition_stage | game_day | game_date | home_team_id | away_team_id | home_score | away_score | venue | referee |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 3754058 | 27 | 2 | Regular Season | 20 | 2016-01-02 16:00:00 | 22 | 28 | 0 | 0 | King Power Stadium | Andre Marriner |
| 3754245 | 27 | 2 | Regular Season | 9 | 2015-10-17 16:00:00 | 27 | 41 | 1 | 0 | The Hawthorns | Martin Atkinson |
| 3754136 | 27 | 2 | Regular Season | 17 | 2015-12-19 18:30:00 | 37 | 59 | 1 | 1 | St. James' Park | Martin Atkinson |
| 3754037 | 27 | 2 | Regular Season | 36 | 2016-04-30 16:00:00 | 29 | 28 | 2 | 1 | Goodison Park | Neil Swarbrick |
| 3754039 | 27 | 2 | Regular Season | 26 | 2016-02-13 16:00:00 | 31 | 23 | 1 | 2 | Selhurst Park | Robert Madley |


In [10]:
dataset.teams().head()

| team_id | team_name |
|---|---|
| 31 | Crystal Palace |
| 41 | Sunderland |
| 25 | Southampton |
| 37 | Newcastle United |
| 30 | Stoke City |


In [11]:
dataset.players().head()

| player_id | team_id | player_name |
|---|---|---|
| 3049 | 28 | Matt Ritchie |
| 3085 | 28 | Glenn Murray |
| 3304 | 28 | Harry Arter |
| 3341 | 28 | Steve Cook |
| 3343 | 28 | Dan Gosling |


In [12]:
dataset.events(game_id=3754058).head()

| event_id | game_id | period_id | team_id | player_id | type_id | type_name | index | timestamp | minute | second | ... | team_name | duration | extra | related_events | player_name | position_id | position_name | location | under_pressure | counterpress |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9153e9f4-f69c-4e04-8f64-505592e212cd | 3754058 | 1 | 22 | | 35 | Starting XI | 1 | 0 days 00:00:00 | 0 | 0 | ... | Leicester City | 0.0 | {'tactics': {'formation': 442, 'lineup': [{'pl... | [] | | | | | False | False |
| 3fbcf4e7-94d1-485a-be85-fd26a6af0318 | 3754058 | 1 | 28 | | 35 | Starting XI | 2 | 0 days 00:00:00 | 0 | 0 | ... | AFC Bournemouth | 0.0 | {'tactics': {'formation': 4141, 'lineup': [{'p... | [] | | | | | False | False |
| 06a9a4dc-d9c9-40f6-bd89-437ba7fe682d | 3754058 | 1 | 28 | | 18 | Half Start | 3 | 0 days 00:00:00 | 0 | 0 | ... | AFC Bournemouth | 0.0 | {} | [100362ee-9311-4187-bd8a-0201d9db2565] | | | | | False | False |
| 100362ee-9311-4187-bd8a-0201d9db2565 | 3754058 | 1 | 22 | | 18 | Half Start | 4 | 0 days 00:00:00 | 0 | 0 | ... | Leicester City | 0.0 | {} | [06a9a4dc-d9c9-40f6-bd89-437ba7fe682d] | | | | | False | False |
| 2ca23eea-a984-47e4-8243-8f00880ad1c9 | 3754058 | 1 | 28 | 3343.0 | 30 | Pass | 5 | 0 days 00:00:01.753000 | 0 | 1 | ... | AFC Bournemouth | 0.308263 | {'pass': {'recipient': {'id': 3346, 'name': 'J... | [1f98c89e-2326-4200-8c12-a987fdbbaf2e] | Dan Gosling | 13.0 | Right Center Midfield | [61.0, 40.1] | False | False |


Additionally, the `HDFDataset` provides a number of methods that make it convenient to access the dataset. Below are a few examples.

In [13]:
# Find a player in the dataset
dataset.search_player("Kevin")

| player_id | team_id | player_name |
|---|---|---|
| 75899 | 59 | Kevin Toner |
| 11992 | 24 | Kevin Linford Stewart |
| 3611 | 38 | Kevin Wimmer |
| 3089 | 36 | Kevin De Bruyne |
| 4317 | 29 | Kevin Mirallas |
| 16027 | 37 | Kevin Mbabu |
| 21540 | 40 | Kevin Nolan |
| 4440 | 131 | Kevin Trapp |
| 4902 | 131 | Kevin Rimane |
| 8215 | 175 | Kevin Volland |


In [14]:
# Find a team in the dataset
dataset.search_team("Manchester")

| team_id | team_name |
|---|---|
| 39 | Manchester United |
| 36 | Manchester City |
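
The two searches above return every row whose name contains the query. One plausible reading of that behaviour is a case-insensitive substring match, sketched below in plain Python. The function `search_by_name` is a hypothetical illustration and not the library's implementation, whose exact matching rules may differ; the sample names are taken from the tables in this notebook.

```python
# A few player_id -> player_name pairs copied from the outputs above.
players = {
    75899: "Kevin Toner",
    3089: "Kevin De Bruyne",
    3049: "Matt Ritchie",
    3085: "Glenn Murray",
}


def search_by_name(names_by_id, query):
    """Return the entries whose name contains `query`, ignoring case."""
    needle = query.casefold()
    return {
        key: name
        for key, name in names_by_id.items()
        if needle in name.casefold()
    }


matches = search_by_name(players, "kevin")
print(matches)
```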


In [15]:
dataset.close()
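
An explicit `close()` call is skipped if an earlier cell raises an error. For any object that exposes a `close()` method, the standard library's `contextlib.closing` guarantees the call is made on exit from the `with` block. The sketch below uses a hypothetical stand-in class, `FakeDataset`, because this notebook does not show whether `HDFDataset` itself supports the `with` statement.

```python
from contextlib import closing


class FakeDataset:
    """Hypothetical stand-in for any object that exposes close()."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


dataset = FakeDataset()
try:
    with closing(dataset):
        # Simulate an import that fails halfway through.
        raise RuntimeError("simulated failure during import")
except RuntimeError:
    pass

# close() was still called, even though the block raised.
print(dataset.closed)
```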