## Importing Data

This notebook serves as a step-by-step guide to importing match and event data from the official StatsBomb repository. It documents the exact process used to retrieve, filter, and store the raw data for further analysis.

We use the `statsbombpy` library, which provides a **Pythonic interface** to interact with StatsBomb's open-data GitHub repository. It abstracts the underlying JSON structure and URL handling, allowing straightforward access to competitions, matches, events, and lineups through simple function calls.
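To make the "URL handling" concrete, the sketch below builds the raw-file URLs that a client has to resolve for each data type. The paths reflect the public layout of the `statsbomb/open-data` repository as we understand it. The helper names are hypothetical and are not part of `statsbombpy`, whose internals may differ.

```python
# Assumed base location of the raw JSON files in the statsbomb/open-data repository.
BASE_URL = "https://raw.githubusercontent.com/statsbomb/open-data/master/data"


def competitions_url() -> str:
    """URL of the single file listing every competition-season pair."""
    return f"{BASE_URL}/competitions.json"


def matches_url(competition_id: int, season_id: int) -> str:
    """URL of the match list for one competition-season pair."""
    return f"{BASE_URL}/matches/{competition_id}/{season_id}.json"


def events_url(match_id: int) -> str:
    """URL of the event stream for one match."""
    return f"{BASE_URL}/events/{match_id}.json"


def lineups_url(match_id: int) -> str:
    """URL of the lineups for one match."""
    return f"{BASE_URL}/lineups/{match_id}.json"


def frames_url(match_id: int) -> str:
    """URL of the 360 freeze frames for one match (not published for every match)."""
    return f"{BASE_URL}/three-sixty/{match_id}.json"


if __name__ == "__main__":
    # Copa America 2024 uses competition_id=223 and season_id=282 in this notebook.
    print(matches_url(223, 282))
    print(events_url(3943077))
```

With `statsbombpy`, the same lookups reduce to calls such as `sb.matches(competition_id=223, season_id=282)` and `sb.events(match_id=3943077)`.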


In [1]:
import pandas as pd
from statsbombpy import sb

In [2]:
import sys
import os

project_root = os.path.abspath(os.path.join(os.getcwd(), '..'))
if project_root not in sys.path:
    sys.path.append(project_root)

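# Import only the constants this notebook uses instead of a wildcard import
from src.constants import COMPETITIONS_COLUMNS_DROP, EVENT_COLUMNS_SELECT

# Project helper module with the data-retrieval functions
import src.get_data as gd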

In [3]:
# Suppress the NoAuthWarning that statsbombpy emits when no API credentials are supplied
from statsbombpy.api_client import NoAuthWarning
import warnings
warnings.filterwarnings("ignore", category=NoAuthWarning)

In [4]:
# Show all columns when displaying DataFrames
pd.set_option('display.max_columns', None)

### 1. Competitions Data

In [5]:
# Load the competitions available in the open-data repository
competitions = sb.competitions()

# Filter: keep only men's competitions, sorted with the newest seasons first
competitions = competitions[competitions['competition_gender'] == 'male'].sort_values('season_name', ascending=False)

# Filter: keep seasons after 2014. season_name is a string, so the comparison is
# lexicographic: '2014/2015' passes, but a bare '2014' season would be excluded.
competitions = competitions[competitions['season_name'] > '2014']

# Filter: exclude the Indian Super League
competitions = competitions[competitions['competition_name'] != 'Indian Super league']
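Because `season_name` is stored as text, the `> '2014'` filter compares strings character by character rather than comparing years. The sketch below uses illustrative season labels (not the full StatsBomb list) to show which labels pass, and a hypothetical helper, `season_start_year`, that compares on the numeric start year instead.

```python
# Illustrative season labels in the two formats that appear in the data.
season_names = ["2024", "2023/2024", "2015/2016", "2014/2015", "2014", "2013/2014", "2003/2004"]

# Lexicographic filter, equivalent to the string comparison used in the notebook.
kept_lexicographic = [s for s in season_names if s > "2014"]


def season_start_year(season_name: str) -> int:
    """Return the first year of a label such as '2014' or '2014/2015'."""
    return int(season_name.split("/")[0])


# Numeric filter: keeps every season that starts in 2014 or later.
kept_numeric = [s for s in season_names if season_start_year(s) >= 2014]

if __name__ == "__main__":
    print(kept_lexicographic)  # a bare '2014' is missing here
    print(kept_numeric)        # a bare '2014' is included here
```

The two filters differ only for a season labelled exactly `'2014'`, so the choice matters only if such a season should be kept.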

In [6]:
# Drop unnecessary columns and reset the index after filtering.
# Reassigning (instead of inplace=True) avoids modifying a filtered slice in place.
competitions = (
    competitions
    .drop(columns=COMPETITIONS_COLUMNS_DROP)
    .reset_index(drop=True)
)

In [7]:
competitions

| | competition_id | season_id | country_name | competition_name | competition_international | season_name |
|---|---|---|---|---|---|---|
| 0 | 223 | 282 | South America | Copa America | True | 2024 |
| 1 | 55 | 282 | Europe | UEFA Euro | True | 2024 |
| 2 | 9 | 281 | Germany | 1. Bundesliga | False | 2023/2024 |
| 3 | 44 | 107 | United States of America | Major League Soccer | False | 2023 |
| 4 | 1267 | 107 | Africa | African Cup of Nations | True | 2023 |
| 5 | 7 | 235 | France | Ligue 1 | False | 2022/2023 |
| 6 | 43 | 106 | International | FIFA World Cup | True | 2022 |
| 7 | 7 | 108 | France | Ligue 1 | False | 2021/2022 |
| 8 | 11 | 90 | Spain | La Liga | False | 2020/2021 |
| 9 | 55 | 43 | Europe | UEFA Euro | True | 2020 |


After filtering, 26 competition-season combinations remain.

In [8]:
competitions.to_csv('../data/competitions.csv', index=False)

### 2. Matches

In [8]:
df_matches = gd.get_matches(competitions)
df_matches.head()

| | match_id | match_date | kick_off | competition | season | home_team | away_team | home_score | away_score | match_week | competition_stage | stadium | referee | home_managers | away_managers | data_version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3943077 | 2024-07-15 | 04:15:00.000 | South America - Copa America | 2024 | Argentina | Colombia | 1 | 0 | 6 | Final | Hard Rock Stadium | Raphael Claus | Lionel Sebastián Scaloni | Néstor Gabriel Lorenzo | 1.1.0 |
| 1 | 3943076 | 2024-07-14 | 03:00:00.000 | South America - Copa America | 2024 | Canada | Uruguay | 2 | 2 | 6 | 3rd Place Final | Bank of America Stadium | Alexis Herrera | Jesse Marsch | Marcelo Alberto Bielsa Caldera | 1.1.0 |
| 2 | 3942852 | 2024-07-11 | 03:00:00.000 | South America - Copa America | 2024 | Uruguay | Colombia | 0 | 1 | 5 | Semi-finals | Bank of America Stadium | César Arturo Ramos Palazuelos | Marcelo Alberto Bielsa Caldera | Néstor Gabriel Lorenzo | 1.1.0 |
| 3 | 3942785 | 2024-07-10 | 03:00:00.000 | South America - Copa America | 2024 | Argentina | Canada | 2 | 0 | 5 | Semi-finals | MetLife Stadium | Piero Maza Gomez | Lionel Sebastián Scaloni | Jesse Marsch | 1.1.0 |
| 4 | 3942416 | 2024-07-07 | 01:00:00.000 | South America - Copa America | 2024 | Colombia | Panama | 5 | 0 | 4 | Quarter-finals | State Farm Stadium | Maurizio Mariani | Néstor Gabriel Lorenzo | Thomas Christiansen Tarín | 1.1.0 |


In [9]:
df_matches.groupby('competition').size()

competition
Africa - African Cup of Nations                    52
England - Premier League                          380
Europe - Champions League                           5
Europe - UEFA Euro                                102
France - Ligue 1                                  435
Germany - 1. Bundesliga                           340
International - FIFA World Cup                    128
Italy - Serie A                                   380
South America - Copa America                       32
Spain - La Liga                                   590
United States of America - Major League Soccer      6
dtype: int64

In [10]:
df_matches.to_csv('../data/matches.csv', index=False)
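The `gd.get_matches` helper lives in `src/get_data.py` and is not shown in this notebook. As a rough idea of what such a helper might do, the sketch below loops over the competition-season pairs, fetches each match list, and concatenates the results. The function name `collect_matches` and its `fetch` parameter are hypothetical. By default it falls back to `sb.matches(competition_id, season_id)`, and the injectable `fetch` makes it possible to try the logic offline.

```python
from typing import Callable, Optional

import pandas as pd


def collect_matches(
    competitions: pd.DataFrame,
    fetch: Optional[Callable[[int, int], pd.DataFrame]] = None,
) -> pd.DataFrame:
    """Hypothetical stand-in for gd.get_matches; the real helper may differ."""
    if fetch is None:
        # Imported lazily so the sketch can be exercised without network access.
        from statsbombpy import sb
        fetch = sb.matches

    per_season = []
    for row in competitions.itertuples(index=False):
        # One request per competition-season pair.
        per_season.append(fetch(int(row.competition_id), int(row.season_id)))

    if not per_season:
        return pd.DataFrame()
    return pd.concat(per_season, ignore_index=True)


if __name__ == "__main__":
    # Offline demonstration with a fake fetcher that returns one row per pair.
    def fake_fetch(competition_id: int, season_id: int) -> pd.DataFrame:
        return pd.DataFrame({"match_id": [competition_id * 1000 + season_id]})

    demo = pd.DataFrame({"competition_id": [223, 55], "season_id": [282, 282]})
    print(collect_matches(demo, fetch=fake_fetch))
```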

### 3. Lineups

<i>Lineups are imported here, although they will not be used for this project.</i>

In [11]:
df_lineups = gd.get_lineups(df_matches)

[0] Processed match_id 3943077
[50] Processed match_id 3930179
[100] Processed match_id 3895320
[150] Processed match_id 3920412
[200] Processed match_id 3837752
[250] Processed match_id 3857292
[300] Processed match_id 3773631
[350] Processed match_id 3794692
[400] Processed match_id 303696
[450] Processed match_id 22912
[500] Processed match_id 7572
[550] Processed match_id 9754
[600] Processed match_id 3754053
[650] Processed match_id 3754093
[700] Processed match_id 3754223
[750] Processed match_id 3754343
[800] Processed match_id 3754016
[850] Processed match_id 3754139
[900] Processed match_id 3754239
[950] Processed match_id 3754333
[1000] Processed match_id 3890545
[1050] Processed match_id 3890492
[1100] Processed match_id 3890442
[1150] Processed match_id 3890387
[1200] Processed match_id 3890335
[1250] Processed match_id 3890284
[1300] Processed match_id 3825839
[1350] Processed match_id 3825803
[1400] Processed match_id 3825711
[1450] Processed match_id 3825623
[1500] Proce

In [12]:
df_lineups.head()

| | player_id | player_name | player_nickname | jersey_number | country | cards | positions | match_id |
|---|---|---|---|---|---|---|---|---|
| 0 | 2995 | Ángel Fabián Di María Hernández | Ángel Di María | 11 | Argentina | [] | [{'position_id': 12, 'position': 'Right Midfie... | 3943077 |
| 1 | 3090 | Nicolás Hernán Otamendi | Nicolás Otamendi | 19 | Argentina | [] | [{'position_id': 12, 'position': 'Right Midfie... | 3943077 |
| 2 | 3313 | Giovani Lo Celso | | 16 | Argentina | [{'time': '11:26', 'card_type': 'Yellow Card',... | [{'position_id': 11, 'position': 'Left Defensi... | 3943077 |
| 3 | 5503 | Lionel Andrés Messi Cuccittini | Lionel Messi | 10 | Argentina | [] | [{'position_id': 22, 'position': 'Right Center... | 3943077 |
| 4 | 5507 | Nicolás Alejandro Tagliafico | Nicolás Tagliafico | 3 | Argentina | [] | [{'position_id': 6, 'position': 'Left Back', '... | 3943077 |


In [13]:
df_lineups.to_csv('../data/lineups.csv', index=False)
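The `cards` and `positions` columns hold lists of dictionaries, and writing them to CSV stores only their string representation. The sketch below uses a small made-up DataFrame (not real lineup data) to show that a plain `pd.read_csv` returns strings for such a column, and that passing `ast.literal_eval` through the `converters` argument restores the lists.

```python
import ast
import io

import pandas as pd

# Toy data shaped like the lineups table; the values are illustrative only.
lineups = pd.DataFrame({
    "player_id": [1, 2],
    "cards": [[], [{"time": "11:26", "card_type": "Yellow Card"}]],
    "match_id": [100, 100],
})

# Write to an in-memory buffer instead of '../data/lineups.csv'.
buffer = io.StringIO()
lineups.to_csv(buffer, index=False)

# Plain read: the list column comes back as text.
buffer.seek(0)
restored_raw = pd.read_csv(buffer)

# Read with a converter: the text is parsed back into Python lists.
buffer.seek(0)
restored = pd.read_csv(buffer, converters={"cards": ast.literal_eval})

if __name__ == "__main__":
    print(type(restored_raw.loc[1, "cards"]))
    print(type(restored.loc[1, "cards"]))
```

`ast.literal_eval` only parses Python literals, so it is a safer choice than `eval` for this purpose.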

### 4. Events

In [None]:
start_i = 0
end_i = 10

df_events = gd.get_events(df_matches, EVENT_COLUMNS_SELECT, start_i, end_i)

[0] Processed match_id=3943077 (4108 rows)
Completed 10 matches.
Total rows: 31102
Columns: 122
Null match_id values: 0


In [26]:
df_events[df_events['type']=='Shot'].shape[0]

274

In [None]:
df_events.to_csv(f'../data/events_{start_i}_{end_i}.csv', index=False, sep=';')
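The cell above exports only the first 10 matches. To cover every match in batches, the `(start_i, end_i)` pairs can be generated up front. The sketch below assumes that `gd.get_events` treats `end_i` as exclusive, which is consistent with the "Completed 10 matches" output for `start_i = 0` and `end_i = 10` but is not confirmed by the notebook. The helper name `batch_bounds` is hypothetical.

```python
from typing import Iterator, Tuple


def batch_bounds(n_matches: int, batch_size: int) -> Iterator[Tuple[int, int]]:
    """Yield half-open (start, end) index pairs that cover range(n_matches)."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, n_matches, batch_size):
        # The last batch is clipped so it never runs past the final match.
        yield start, min(start + batch_size, n_matches)


if __name__ == "__main__":
    for start_i, end_i in batch_bounds(25, 10):
        print(start_i, end_i)

    # Possible use in this notebook (left commented out because it needs the project modules):
    # for start_i, end_i in batch_bounds(len(df_matches), 10):
    #     df_events = gd.get_events(df_matches, EVENT_COLUMNS_SELECT, start_i, end_i)
    #     df_events.to_csv(f'../data/events_{start_i}_{end_i}.csv', index=False, sep=';')
```

Because the files are written with `sep=';'`, they need to be read back with `pd.read_csv(path, sep=';')`.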

### 5. Frames

In [None]:
df_frames = gd.get_frames(df_matches)

In [None]:
df_frames.to_csv('../data/frames.csv', index=False)
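StatsBomb publishes 360 freeze-frame data for only part of the open-data matches, so a frame loader has to tolerate matches for which the request fails. `gd.get_frames` is not shown in this notebook, so the sketch below is a guess at the general shape, not the project's implementation. `collect_frames` and its `fetch` parameter are hypothetical names, and the default fetcher is `sb.frames(match_id)`.

```python
from typing import Callable, Iterable, List, Optional, Tuple

import pandas as pd


def collect_frames(
    match_ids: Iterable[int],
    fetch: Optional[Callable[[int], pd.DataFrame]] = None,
) -> Tuple[pd.DataFrame, List[int]]:
    """Fetch frames per match; return the combined frames and the skipped match ids."""
    if fetch is None:
        # Imported lazily so the sketch can be exercised without network access.
        from statsbombpy import sb
        fetch = sb.frames

    collected, skipped = [], []
    for match_id in match_ids:
        try:
            frames = fetch(match_id)
        except Exception:
            # Assumption: a failed request means no 360 data exists for this match.
            skipped.append(match_id)
            continue
        frames = frames.copy()
        frames["match_id"] = match_id  # harmless if the column is already present
        collected.append(frames)

    combined = pd.concat(collected, ignore_index=True) if collected else pd.DataFrame()
    return combined, skipped


if __name__ == "__main__":
    # Offline demonstration: the fake fetcher has data for even ids only.
    def fake_fetch(match_id: int) -> pd.DataFrame:
        if match_id % 2:
            raise ValueError("no 360 data")
        return pd.DataFrame({"id": ["a", "b"]})

    frames, missing = collect_frames([2, 3, 4], fetch=fake_fetch)
    print(len(frames), missing)
```

Returning the skipped ids alongside the data makes it easy to check afterwards which matches have no frames.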