## Importing Data

This notebook serves as a step-by-step guide to importing match and event data from the official StatsBomb repository. It documents the exact process used to retrieve, filter, and store the raw data for further analysis.

We use the `statsbombpy` library, which provides a **Pythonic interface** to interact with StatsBomb's open-data GitHub repository. It abstracts the underlying JSON structure and URL handling, allowing straightforward access to competitions, matches, events, and lineups through simple function calls.
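
To make concrete what `statsbombpy` hides, the sketch below builds the raw file URLs for each kind of data. The helper names (`competitions_url`, `matches_url`, `events_url`, `lineups_url`) are hypothetical and not part of `statsbombpy`. The path layout is an assumption based on the public `statsbomb/open-data` repository and may change.

```python
# Assumed layout of the public statsbomb/open-data repository (may change).
BASE_URL = "https://raw.githubusercontent.com/statsbomb/open-data/master/data"


def competitions_url() -> str:
    """URL of the single JSON file listing every competition-season."""
    return f"{BASE_URL}/competitions.json"


def matches_url(competition_id: int, season_id: int) -> str:
    """URL of the match list for one competition-season."""
    return f"{BASE_URL}/matches/{competition_id}/{season_id}.json"


def events_url(match_id: int) -> str:
    """URL of the event stream for one match."""
    return f"{BASE_URL}/events/{match_id}.json"


def lineups_url(match_id: int) -> str:
    """URL of the lineups for one match."""
    return f"{BASE_URL}/lineups/{match_id}.json"


print(matches_url(competition_id=223, season_id=282))
print(events_url(match_id=3943077))
```

With the library, the equivalent calls are `sb.competitions()`, `sb.matches(competition_id=223, season_id=282)`, `sb.events(match_id=3943077)`, and `sb.lineups(match_id=3943077)`, each returning parsed data rather than a URL.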


In [1]:
import pandas as pd
from statsbombpy import sb

In [2]:
import sys
import os

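# Add the project root (the notebook's parent directory) to sys.path so `src` is importable
project_root = os.path.abspath(os.path.join(os.getcwd(), '..'))
if project_root not in sys.path:
    sys.path.append(project_root)

# Provides COMPETITIONS_COLUMNS_DROP and MATCHES_COLUMNS_DROP, used below
from src.constants import *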

In [3]:
# Suppress the NoAuthWarning that statsbombpy emits when no API credentials are supplied
# (the open data does not require any)
from statsbombpy.api_client import NoAuthWarning
import warnings
warnings.filterwarnings("ignore", category=NoAuthWarning)
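
Passing `category` to `warnings.filterwarnings` silences only that warning class and leaves others visible. The sketch below shows this with a hypothetical `DemoAuthWarning` standing in for `NoAuthWarning`, so it runs without `statsbombpy`. It also uses `warnings.catch_warnings()` to confine the filter to one block, which is an alternative to the session-wide filter used here, not what the notebook does.

```python
import warnings


class DemoAuthWarning(UserWarning):
    """Hypothetical stand-in for statsbombpy's NoAuthWarning."""


def noisy_call() -> str:
    """Emit one warning we want to hide and one we want to keep."""
    warnings.warn("credentials were not supplied", DemoAuthWarning)
    warnings.warn("something else worth seeing", RuntimeWarning)
    return "data"


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # record every warning by default
    warnings.filterwarnings("ignore", category=DemoAuthWarning)
    result = noisy_call()

# Only the RuntimeWarning is recorded; the DemoAuthWarning was filtered out.
remaining = [w.category.__name__ for w in caught]
print(remaining)
```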

In [4]:
# Show all columns when displaying DataFrames
pd.set_option('display.max_columns', None)

### 1. Competitions Data

In [5]:
# Load and Filter Competitions
competitions = sb.competitions()

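# Filter: keep only male competitions, newest seasons first
competitions = competitions[competitions['competition_gender'] == 'male']
competitions = competitions.sort_values('season_name', ascending=False)

# Filter: keep seasons starting in 2014 or later. This is a string comparison:
# '2014/2015' > '2014' holds, but a season named exactly '2014' would be dropped.
competitions = competitions[competitions['season_name'] > '2014']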

# Filter: exclude Indian Super League
competitions = competitions[competitions['competition_name']!='Indian Super league']
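
The season filter in the cell above compares `season_name` as text, which works because every name starts with a four-digit year. A more explicit alternative, sketched here on made-up season names rather than the real data, is to parse the start year and compare numerically. The variable names are illustrative only.

```python
import pandas as pd

# Made-up season names for illustration; not taken from the StatsBomb data.
seasons = pd.DataFrame(
    {"season_name": ["2024", "2023/2024", "2014/2015", "2014", "2013/2014", "1990"]}
)

# String comparison: "2014" is not greater than itself, so it is dropped.
by_string = seasons[seasons["season_name"] > "2014"]

# Numeric comparison on the four-digit start year keeps "2014" as well.
start_year = seasons["season_name"].str[:4].astype(int)
by_year = seasons[start_year >= 2014]

print(by_string["season_name"].tolist())
print(by_year["season_name"].tolist())
```

The two filters differ only for a season named exactly `2014`, so the choice matters only if such a season exists in the data.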

In [6]:
# Drop unnecessary columns and reset the index after filtering
competitions = (
    competitions
    .drop(columns=COMPETITIONS_COLUMNS_DROP)
    .reset_index(drop=True)
)

In [7]:
competitions

| | competition_id | season_id | country_name | competition_name | competition_international | season_name |
|---|---|---|---|---|---|---|
| 0 | 223 | 282 | South America | Copa America | True | 2024 |
| 1 | 55 | 282 | Europe | UEFA Euro | True | 2024 |
| 2 | 9 | 281 | Germany | 1. Bundesliga | False | 2023/2024 |
| 3 | 44 | 107 | United States of America | Major League Soccer | False | 2023 |
| 4 | 1267 | 107 | Africa | African Cup of Nations | True | 2023 |
| 5 | 7 | 235 | France | Ligue 1 | False | 2022/2023 |
| 6 | 43 | 106 | International | FIFA World Cup | True | 2022 |
| 7 | 7 | 108 | France | Ligue 1 | False | 2021/2022 |
| 8 | 11 | 90 | Spain | La Liga | False | 2020/2021 |
| 9 | 55 | 43 | Europe | UEFA Euro | True | 2020 |


This leaves 26 competition-season combinations across the various competitions; the output above shows only the first 10.

In [8]:
competitions.to_csv('../data/competitions.csv', index=False)
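
Passing `index=False` keeps the DataFrame's row index out of the file. Without it, reading the CSV back yields an extra `Unnamed: 0` column. The sketch below shows the difference using an in-memory buffer and two rows copied from the competitions table above, so nothing is written to disk.

```python
import io

import pandas as pd

# Two rows copied from the competitions table above.
df = pd.DataFrame(
    {
        "competition_id": [223, 55],
        "season_id": [282, 282],
        "competition_name": ["Copa America", "UEFA Euro"],
    }
)

# Round-trip through CSV text, once with the index and once without it.
with_index = pd.read_csv(io.StringIO(df.to_csv()))
without_index = pd.read_csv(io.StringIO(df.to_csv(index=False)))

print(with_index.columns.tolist())
print(without_index.columns.tolist())
```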

### 2. Matches

In [10]:
def get_matches(competitions: pd.DataFrame) -> pd.DataFrame:
    """Fetch and combine the matches of every competition-season row in `competitions`."""

    # Fetch one DataFrame per (competition_id, season_id) pair
    frames = [
        sb.matches(competition_id=competition_id, season_id=season_id)
        for competition_id, season_id in zip(competitions['competition_id'],
                                             competitions['season_id'])
    ]

    # Concatenate once; ignore_index gives a clean 0..n-1 index
    df_matches = pd.concat(frames, ignore_index=True)

    # Drop columns that are not needed for the analysis
    df_matches = df_matches.drop(columns=MATCHES_COLUMNS_DROP)

    return df_matches
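
The per-season fetch-and-combine step in `get_matches` can be exercised without network access by passing the fetch function in as a parameter. This is a minimal sketch of that idea, not the notebook's implementation: `collect_matches` and `fake_matches` are hypothetical names, and `fake_matches` returns dummy rows in place of `sb.matches`.

```python
from typing import Callable

import pandas as pd


def collect_matches(
    competitions: pd.DataFrame,
    fetch: Callable[[int, int], pd.DataFrame],
) -> pd.DataFrame:
    """Call `fetch` once per competition-season row and stack the results."""
    frames = [
        fetch(int(row.competition_id), int(row.season_id))
        for row in competitions.itertuples(index=False)
    ]
    # A single concat with ignore_index=True yields a continuous 0..n-1 index.
    return pd.concat(frames, ignore_index=True)


def fake_matches(competition_id: int, season_id: int) -> pd.DataFrame:
    """Hypothetical stand-in for sb.matches: two dummy matches per season."""
    first_id = competition_id * 1000 + season_id
    return pd.DataFrame(
        {
            "match_id": [first_id, first_id + 1],
            "competition_id": competition_id,
            "season_id": season_id,
        }
    )


demo_competitions = pd.DataFrame({"competition_id": [223, 55], "season_id": [282, 282]})
demo_matches = collect_matches(demo_competitions, fake_matches)
print(demo_matches)
```

Against the real data, the same function could be called with `lambda c, s: sb.matches(competition_id=c, season_id=s)` as the `fetch` argument.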

In [11]:
df_matches = get_matches(competitions)
df_matches

| | match_id | match_date | kick_off | competition | season | home_team | away_team | home_score | away_score | match_week | competition_stage | stadium | referee | home_managers | away_managers | data_version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3943077 | 2024-07-15 | 04:15:00.000 | South America - Copa America | 2024 | Argentina | Colombia | 1 | 0 | 6 | Final | Hard Rock Stadium | Raphael Claus | Lionel Sebastián Scaloni | Néstor Gabriel Lorenzo | 1.1.0 |
| 1 | 3943076 | 2024-07-14 | 03:00:00.000 | South America - Copa America | 2024 | Canada | Uruguay | 2 | 2 | 6 | 3rd Place Final | Bank of America Stadium | Alexis Herrera | Jesse Marsch | Marcelo Alberto Bielsa Caldera | 1.1.0 |
| 2 | 3942852 | 2024-07-11 | 03:00:00.000 | South America - Copa America | 2024 | Uruguay | Colombia | 0 | 1 | 5 | Semi-finals | Bank of America Stadium | César Arturo Ramos Palazuelos | Marcelo Alberto Bielsa Caldera | Néstor Gabriel Lorenzo | 1.1.0 |
| 3 | 3942785 | 2024-07-10 | 03:00:00.000 | South America - Copa America | 2024 | Argentina | Canada | 2 | 0 | 5 | Semi-finals | MetLife Stadium | Piero Maza Gomez | Lionel Sebastián Scaloni | Jesse Marsch | 1.1.0 |
| 4 | 3942416 | 2024-07-07 | 01:00:00.000 | South America - Copa America | 2024 | Colombia | Panama | 5 | 0 | 4 | Quarter-finals | State Farm Stadium | Maurizio Mariani | Néstor Gabriel Lorenzo | Thomas Christiansen Tarín | 1.1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2445 | 266871 | 2015-02-08 | 21:00:00.000 | Spain - La Liga | 2014/2015 | Athletic Club | Barcelona | 2 | 5 | 22 | Regular Season | San Mamés Barria | Antonio Miguel Mateu Lahoz | Ernesto Valverde Tejedor | Luis Enrique Martínez García | 1.1.0 |
| 2446 | 266967 | 2015-03-14 | 18:00:00.000 | Spain - La Liga | 2014/2015 | Eibar | Barcelona | 0 | 2 | 27 | Regular Season | Estadio Municipal de Ipurúa | | Gaizka Garitano Aguirre | Luis Enrique Martínez García | 1.1.0 |
| 2447 | 266929 | 2015-04-05 | 21:00:00.000 | Spain - La Liga | 2014/2015 | Celta Vigo | Barcelona | 0 | 1 | 29 | Regular Season | Abanca-Balaídos | | Manuel Eduardo Berizzo | Luis Enrique Martínez García | 1.1.0 |
| 2448 | 266770 | 2014-09-21 | 21:00:00.000 | Spain - La Liga | 2014/2015 | Levante UD | Barcelona | 0 | 5 | 4 | Regular Season | Estadio Ciudad de Valencia | José Luis González González | José Luis Mendilibar Etxebarria | Luis Enrique Martínez García | 1.1.0 |
