# Task 2 - Data Preparation (StatsBomb Big 5 Leagues 2015/2016)

This notebook focuses on the preparation of the **StatsBomb open data** related to the Big 5 European leagues for the 2015/2016 season. The data will be loaded through the `statsbombpy` library, with an initial exploration of the available structures at both team and player level. Basic cleaning procedures will be applied to ensure fair comparisons across players. Finally, the processed datasets will be saved for subsequent analyses. 

## Note on Available Stats (Open Data Limitation)

**StatsBombpy library** provides convenient aggregated endpoints such as `team_season_stats`, `team_match_stats`, `player_season_stats`, and `player_match_stats`. However, these endpoints are **not available in the public open-data release** and require commercial credentials. As a result, this notebook **builds all team- and player-level statistics from scratch** using only the open-data endpoints:

- `sb.competitions()` – list of competitions/seasons  

- `sb.matches(competition_id, season_id)` – list of matches per competition/season  

- `sb.events(match_id)` – full on-ball event log for a match (shots, passes, dribbles, duels, pressures, etc.)  

- `sb.lineups(match_id)` – squads and players (used to infer minutes played together with events/substitutions)

In [437]:
from statsbombpy import sb

# Demo: Attempt to use an aggregated stats endpoint
# The function sb.player_season_stats() would normally return 
# season-level player statistics if commercial credentials were provided
# However, this endpoint is NOT available in the open-data release

try:
    # Example: attempt to load Premier League 2015/16 season stats
    _ = sb.player_season_stats(competition_id=2, season_id=27)
    
except Exception as e:
    # This error confirms that aggregated stats are not part of the open dataset
    print("Aggregated endpoint not available in open data. Falling back to events/lineups.")
    print(f"ERROR returned by statsbombpy: {e}")


Aggregated endpoint not available in open data. Falling back to events/lineups.
ERROR returned by statsbombpy: There is currently no open data for aggregated stats, please provide credentials


## Imports and Global Settings

In [443]:
import pandas as pd
import numpy as np
import os

from random import randint
from tqdm import tqdm
from statsbombpy import sb

import warnings
warnings.filterwarnings("ignore")

## Load Competitions and Filter 2015/16 Big 5

In [439]:
# Load all available competitions
competitions = sb.competitions()

display(competitions.columns.tolist())

print("All competitions available:")
display(competitions[["competition_id", "season_id", "competition_name", "season_name"]].head())


['competition_id',
 'season_id',
 'country_name',
 'competition_name',
 'competition_gender',
 'competition_youth',
 'competition_international',
 'season_name',
 'match_updated',
 'match_updated_360',
 'match_available_360',
 'match_available']

All competitions available:


Unnamed: 0,competition_id,season_id,competition_name,season_name
0,9,281,1. Bundesliga,2023/2024
1,9,27,1. Bundesliga,2015/2016
2,1267,107,African Cup of Nations,2023
3,16,4,Champions League,2018/2019
4,16,1,Champions League,2017/2018


In [440]:
# Filter competitions for season 2015/2016
season_year = "2015/2016"
competitions_1516 = competitions[competitions["season_name"] == season_year]

print("Competitions for season 2015/2016:")
display(competitions_1516[["competition_id", "season_id", "competition_name", "season_name"]])

Competitions for season 2015/2016:


Unnamed: 0,competition_id,season_id,competition_name,season_name
1,9,27,1. Bundesliga,2015/2016
6,16,27,Champions League,2015/2016
43,11,27,La Liga,2015/2016
60,7,27,Ligue 1,2015/2016
64,2,27,Premier League,2015/2016
66,12,27,Serie A,2015/2016


In [441]:
# Select Big 5 leagues and count matches
big5 = ["Premier League", "La Liga", "Serie A", "1. Bundesliga", "Ligue 1"]

competitions_big5_1516 = competitions_1516[
    competitions_1516["competition_name"].isin(big5)
].copy()

# Count matches for each competition
match_counts = []
for _, row in competitions_big5_1516.iterrows():

    # Retrieve competition id and season id
    comp_id = row["competition_id"]
    season_id = row["season_id"]

    # Retrieve the matches for each competition-season
    matches = sb.matches(competition_id=comp_id, season_id=season_id)

    # Count the number of matches and store it
    n_matches = matches.shape[0]
    match_counts.append(n_matches)

# Add matches column to the dataframe
competitions_big5_1516["num_matches"] = match_counts

# Display the results
print("Big 5 competitions in 2015/2016 with match counts:")
display(competitions_big5_1516[["competition_id", "season_id","competition_name", "season_name", "num_matches"]])

# Total
total_matches = competitions_big5_1516["num_matches"].sum()
print(f"Total matches in Big 5 competitions (2015/2016): {total_matches}")


Big 5 competitions in 2015/2016 with match counts:


Unnamed: 0,competition_id,season_id,competition_name,season_name,num_matches
1,9,27,1. Bundesliga,2015/2016,306
43,11,27,La Liga,2015/2016,380
60,7,27,Ligue 1,2015/2016,377
64,2,27,Premier League,2015/2016,380
66,12,27,Serie A,2015/2016,380


Total matches in Big 5 competitions (2015/2016): 1823


> **NOTE**: For the 2015/2016 season, the StatsBomb open data provides the full set of matches for all Big 5 leagues except Ligue 1.  
> In Ligue 1, only 377 matches are available instead of the expected 380, due to a few games not being released in the public dataset.  
> This minor discrepancy (less than 1% of the total league games) is not considered problematic, as it does not significantly affect aggregated player or team statistics.

### Identifying Missing Matches in Ligue 1 (2015/2016)

Ligue 1 should contain 380 matches in the 2015/2016 season, but only 377 are available in the StatsBomb open data. Let's detect the match weeks where games are missing and to identify the teams involved by comparing the line-up of teams in each round with the complete set of Ligue 1 participants

In [442]:
# Load Ligue 1 2015/16 matches
# Competition "Ligue 1" id: 7
# Season "2015/2016" id: 27
matches_ligue1 = sb.matches(competition_id=7, season_id=27)

# Group by match week and count matches
# .groupby("match_week") groups the DataFrame by each round of the season
# .size() counts the number of rows (i.e., matches) per group
matches_per_week = matches_ligue1.groupby("match_week").size()

# Identify the match weeks with fewer than 10 matches in that round
incomplete_weeks = matches_per_week[matches_per_week < 10]

print("Match weeks with missing games:\n")
print(incomplete_weeks)

# Retrieve the full set of teams that appear across the season
all_teams = set(matches_ligue1["home_team"]).union(set(matches_ligue1["away_team"]))

# Loop through each incomplete week to identify missing teams
for week in incomplete_weeks.index:
    print(f"\nMatch Week {week}")
    
    # Extract all matches for that week
    week_matches = matches_ligue1[matches_ligue1["match_week"] == week]
    
    # Collect all teams that played (both home and away) during that week
    played_teams = set(week_matches["home_team"]).union(set(week_matches["away_team"]))
    
    # Identify the teams that did not play in that week
    missing_teams = all_teams - played_teams

    # Print the missing teams that should form the missing match
    if missing_teams:
        print(f"Missing match: {list(missing_teams)} did not play")


Match weeks with missing games:

match_week
14    9
23    9
36    9
dtype: int64

Match Week 14
Missing match: ['Gazélec Ajaccio', 'Bastia'] did not play

Match Week 23
Missing match: ['Paris Saint-Germain', 'Saint-Étienne'] did not play

Match Week 36
Missing match: ['Troyes', 'Bordeaux'] did not play


The identification of three missing matches in the Ligue 1 dataset for the 2015/2016 season does not pose a significant issue for the analysis. Most of the teams involved did not have players realistically competing for the Ballon d’Or. The only notable exception is *Paris Saint-Germain*; however, given the substantial number of their matches still available, the absence of this single fixture is unlikely to materially affect the aggregated player statistics considered in the study.

## Building Player and Team Statistics from Events and Lineups

### Event Categorization for Ballon d’Or Player Evaluation

In [None]:
def list_event_types(competition_id: int, season_id: int):
    """
    Print all unique event types in a competition/season.
    
    Args:
        competition_id (int): StatsBomb competition ID 
        season_id (int): StatsBomb season ID 
        limit_matches (int, optional): limit number of matches to speed up. Default None.
    """
    # Load matches
    matches = sb.matches(competition_id=competition_id, season_id=season_id)
    
    event_types = set()
    
    for _, match in tqdm(matches.iterrows(), total=matches.shape[0]):
        match_id = match["match_id"]
        events = sb.events(match_id=match_id)
        event_types.update(events["type"].unique())
    
    print(f"Unique event types in competition {competition_id}, season {season_id}:")
    for etype in sorted(event_types):
        print("-", etype)
    
    return event_types

# Example: Premier League 2015/16 (competition_id=2, season_id=27)
event_types = list_event_types(competition_id=2, season_id=27)
print(f"Total unique event types found: {len(event_types)}")


100%|██████████| 380/380 [03:31<00:00,  1.80it/s]

Unique event types in competition 2, season 27:
- 50/50
- Bad Behaviour
- Ball Receipt*
- Ball Recovery
- Block
- Carry
- Clearance
- Dispossessed
- Dribble
- Dribbled Past
- Duel
- Error
- Foul Committed
- Foul Won
- Goal Keeper
- Half End
- Half Start
- Injury Stoppage
- Interception
- Miscontrol
- Offside
- Own Goal Against
- Own Goal For
- Pass
- Player Off
- Player On
- Pressure
- Referee Ball-Drop
- Shield
- Shot
- Starting XI
- Substitution
- Tactical Shift
Total unique event types found: 33





#### Considerations

After the event-level analysis, only those categories that provide **clear and actionable insights** into individual player performance were retained.  
Events considered marginal, redundant, or not directly informative for evaluation have been excluded.

**1. Offensive & Possession Actions**

Events directly related to attacking play, chance creation, and ball progression:

- *Shot*  
- *Pass*  
- *Carry*  
- *Dribble*  

**2. Defensive Actions**

Events that measure defensive contribution and ball recovery:

- *Duel*  
- *Dribbled Past*  
- *Interception*  
- *Block*  
- *Clearance*  
- *Ball Recovery*  
- *Pressure*  
- *Dispossessed*  

**3. Goalkeeping**

Events specifically describing goalkeeper activity:

- *Goal Keeper*  

**4. Discipline & Fouls**

Events linked to fouls, discipline, and negative contributions:

- *Foul Committed*  
- *Foul Won*  
- *Own Goal For / Against*  

> Note: for card-related information (yellow/red cards), we leverage the more detailed data available from `sb.lineups(match_id)`.

**5. Context & Playing Time**

Events providing information on player availability, minutes played, and tactical role:

- *Starting XI*  
- *Substitution*  
- *Half Start / Half End*  

**Excluded Events**

The following events were excluded from further analysis as they provide limited, redundant, or indirect information about individual performance:

- *Tactical Shift* → Indicates formation or role changes; excluded for simplicity.  
- *Player On / Player Off* → Redundant; already covered by lineups and substitution events.  
- *Injury Stoppage* → Contextual interruption; no performance insight.  
- *Referee Ball-Drop* → Administrative event; no performance value.  
- *Shield* → Hard to quantify in terms of individual performance.  
- *Error* → Ambiguous; overlaps with dispossession or miscontrol events.  
- *Miscontrol* → Already captured under *Dispossessed*.  
- *Offside* → Primarily a team-level outcome; limited individual insight.  
- *Ball Receipt* → Redundant; every completed pass implies a ball reception.  
- *50/50* → Already encompassed within *Duel*.  
- *Bad Behaviour* → Less detailed compared to card information from lineups.  


#### Note on Event Columns in StatsBomb Data

The StatsBomb event dataset contains a mixture of **shared attributes** (present in all events) and **event-specific attributes** (only relevant for certain event types). When these events are flattened into a DataFrame, only the columns that actually appear in that match are created. As a result, **the number of columns in the events DataFrame can vary from match to match**, depending on the types of actions recorded.  

What remains consistent are the shared fields, while event-specific fields appear only when relevant for that particular match.


In [None]:
# Load matches from Premier League 2015/16 (comp_id=2, season_id=27)
matches = sb.matches(competition_id=2, season_id=27)

# Pick the first 10 matches
sample_matches = matches.head(10)

print("Number of columns in events DataFrame for 10 matches:\n")

for i, row in sample_matches.iterrows():
    match_id = row['match_id']
    events = sb.events(match_id=match_id)
    print(f"Match {i+1}: {row['home_team']} vs {row['away_team']} -> {events.shape[1]} columns")

Number of columns in events DataFrame for 10 matches:

Match 1: Leicester City vs AFC Bournemouth -> 90 columns
Match 2: West Bromwich Albion vs Sunderland -> 92 columns
Match 3: Newcastle United vs Aston Villa -> 89 columns
Match 4: Everton vs AFC Bournemouth -> 88 columns
Match 5: Crystal Palace vs Watford -> 95 columns
Match 6: Arsenal vs Aston Villa -> 95 columns
Match 7: West Bromwich Albion vs Liverpool -> 93 columns
Match 8: Tottenham Hotspur vs AFC Bournemouth -> 89 columns
Match 9: Leicester City vs Manchester City -> 88 columns
Match 10: Crystal Palace vs Everton -> 90 columns


### Example Match Extraction for Function Testing

In [405]:
# Load matches for Premier League 2015/16 (comp_id=2, season_id=27)
matches = sb.matches(competition_id=2, season_id=27)

# Select the match at index 0
first_match = matches.iloc[0]
match_id = first_match['match_id']

# Print summary information about the selected match
print("EXAMPLE MATCH SELECTED")
print(f"Competition : Premier League")
print(f"Season      : 2015/16")
print(f"Matchweek   : {first_match['match_week']}")
print(f"Date        : {first_match['match_date']}")
print(f"Home Team   : {first_match['home_team']}")
print(f"Away Team   : {first_match['away_team']}")
print(f"Final Score : {first_match['home_score']} - {first_match['away_score']}")
print(f"Match ID    : {match_id}")

EXAMPLE MATCH SELECTED
Competition : Premier League
Season      : 2015/16
Matchweek   : 20
Date        : 2016-01-02
Home Team   : Leicester City
Away Team   : AFC Bournemouth
Final Score : 0 - 0
Match ID    : 3754058


### 1. Offensive Actions

This function extracts all the key **offensive-related metrics** for a player given its events.  
The analysis is based on StatsBomb event types that directly capture attacking contribution, chance creation, and ball progression:

- **Shots** → number of attempts, goals, shots on target, xG (total and average), penalties, headers

- **Passes** → attempted, completed, accuracy, assists, key passes, progressive passes (≥15m forward), crosses, switches of play, average angle and length

- **Carries** → number of carries, total distance covered, progressive carries (≥10m forward), and carries ending inside the penalty area

- **Dribbles** → attempted, completed, success rate, and overruns (failed dribbles losing control)

In [406]:
def extract_offensive_stats(player_events, pitch_length=120):
    """
    Extract offensive statistics from match/player events
    Processes StatsBomb event types: Shot, Pass, Carry, Dribble.
    
    Args:
        player_events (pd.DataFrame): StatsBomb events for a single match
        pitch_length (float): Pitch length in meters (default 120, StatsBomb standard)
    
    Returns:
        dict: Dictionary with aggregated offensive metrics
    """

    stats = {}

    # SHOTS EVENTS
    shots = player_events[player_events['type'] == 'Shot']

    # Total number of shots attempted
    stats['shots_attempted'] = len(shots)

    # Goals scored (shot_outcome == 'Goal')
    stats['goals'] = (shots['shot_outcome'] == 'Goal').sum()

    # Shots on target (goal, saved by goalkeeper, or hitting the post)
    stats['shots_on_target'] = shots['shot_outcome'].isin(
        ['Goal', 'Saved', 'Saved To Post']
    ).sum()

    # Expected Goals (sum of StatsBomb xG values)
    stats['xg_total'] = shots['shot_statsbomb_xg'].sum(skipna=True)

    # Average xG per shot (quality of average shooting chance)
    stats['xg_avg'] = shots['shot_statsbomb_xg'].mean(skipna=True)

    # Penalties attempted (shot_type == 'Penalty')
    stats['penalties'] = (shots['shot_type'] == 'Penalty').sum()

    # Headers attempted (body part == Head)
    stats['headers'] = (shots['shot_body_part'] == 'Head').sum()



    # PASSES EVENTS
    passes = player_events[player_events['type'] == 'Pass']

    # Total passes attempted
    stats['passes_attempted'] = len(passes)

    # Completed passes (StatsBomb: pass_outcome is NaN if successful)
    stats['passes_completed'] = passes['pass_outcome'].isna().sum()

    # Passing accuracy
    stats['pass_accuracy'] = (
        stats['passes_completed'] / stats['passes_attempted']
        if stats['passes_attempted'] > 0 else np.nan
    )

    # Assists 
    assists = 0
    shots_goals = shots[shots['shot_outcome'] == 'Goal']
    for _, shot in shots_goals.iterrows():
        key_pass_id = shot.get('shot_key_pass_id', None)
        if pd.notna(key_pass_id) and key_pass_id in passes['id'].values:
            assists += 1
    stats['assists'] = assists

    # Key passes (passes leading directly to a shot)
    stats['key_passes'] = passes['pass_shot_assist'].fillna(False).sum()

    # Progressive passes (forward passes advancing ≥15m)
    progressive_passes = 0
    for _, row in passes.iterrows():
        start = row.get('location', None)
        end = row.get('pass_end_location', None)
        if isinstance(start, list) and isinstance(end, list):
            if (end[0] - start[0]) >= 15:
                progressive_passes += 1
    stats['progressive_passes'] = progressive_passes

    # Crosses attempted
    stats['crosses'] = passes['pass_cross'].fillna(False).sum()

    # Switches of play
    stats['switches'] = passes['pass_switch'].fillna(False).sum()

    # Average pass angle (measure of verticality vs lateral passing)
    stats['avg_pass_angle'] = passes['pass_angle'].mean(skipna=True)

    # Average pass length (directness, tendency to play long vs short)
    stats['avg_pass_length'] = passes['pass_length'].mean(skipna=True)



    # CARRIES EVENTS
    carries = player_events[player_events['type'] == 'Carry']

    # Total carries (times player moved the ball by running with it)
    stats['carries_attempted'] = len(carries)

    # Total distance carried (sum of carry lengths)
    total_carry_distance = 0
    for _, row in carries.iterrows():
        start = row.get('location', None)
        end = row.get('carry_end_location', None)
        if isinstance(start, list) and isinstance(end, list):
            dist = np.linalg.norm(np.array(end) - np.array(start))
            total_carry_distance += dist
    stats['carry_distance_total'] = total_carry_distance

    # Progressive carries (advancing ≥10m towards goal)
    progressive_carries = 0
    for _, row in carries.iterrows():
        start = row.get('location', None)
        end = row.get('carry_end_location', None)
        if isinstance(start, list) and isinstance(end, list):
            if (end[0] - start[0]) >= 10:
                progressive_carries += 1
    stats['progressive_carries'] = progressive_carries

    # Carries ending inside the penalty area (define the insertions in the area)
    carries_to_box = 0
    for loc in carries['carry_end_location']:
        if isinstance(loc, list):
            if loc[0] >= (pitch_length - 18) and 18 <= loc[1] <= 62:
                carries_to_box += 1
    stats['carries_to_penalty_area'] = carries_to_box



    # DRIBBLES EVENTS
    dribbles = player_events[player_events['type'] == 'Dribble']

    # Total dribbles attempted
    stats['dribbles_attempted'] = len(dribbles)

    # Successful dribbles (outcome == 'Complete')
    stats['dribbles_completed'] = (dribbles['dribble_outcome'] == 'Complete').sum()

    # Dribble success rate (success %)
    stats['dribble_success_rate'] = (
        stats['dribbles_completed'] / stats['dribbles_attempted']
        if stats['dribbles_attempted'] > 0 else np.nan
    )

    # Dribble overruns (failed dribble due to losing control of the ball)
    stats['dribble_overruns'] = dribbles['dribble_overrun'].fillna(False).sum()

    # Round only selected float stats
    for key in ['xg_total', 'xg_avg', 'pass_accuracy', 
                'avg_pass_angle', 'avg_pass_length', 
                'carry_distance_total', 'dribble_success_rate']:
        if key in stats and isinstance(stats[key], (float, np.floating)):
            stats[key] = round(stats[key], 2)

    return stats

In [407]:
# TEST ON A SINGLE PLAYER

# Load events for that match 
player_events = sb.events(match_id=match_id)

# Extract unique players from events (skip NaNs)
players_in_match = player_events[['player_id', 'player', 'team']].dropna().drop_duplicates()

# Pick one player random
player_row = players_in_match.iloc[randint(0, len(players_in_match)-1)]
player_id = player_row['player_id']
player_name = player_row['player']
team_name = player_row['team']

# Filter events for that player
player_events = player_events[player_events['player_id'] == player_id]

print("\nEXAMPLE PLAYER SELECTED")
print(f"Player : {player_name}")
print(f"Team   : {team_name}")
print(f"Total events for player in match: {len(player_events)}")

# Extract offensive stats
player_stats = extract_offensive_stats(player_events)

# Print summary
print("Offensive Stats for Player:")
print(f"Player : {player_name}")
print(f"Team   : {team_name}")
print(f"Match  : {first_match['home_team']} vs {first_match['away_team']} (ID {match_id})\n")

print("Extracted offensive stats:\n")
for k, v in player_stats.items():
    print(f"{k}: {v}")



EXAMPLE PLAYER SELECTED
Player : Nathan Dyer
Team   : Leicester City
Total events for player in match: 95
Offensive Stats for Player:
Player : Nathan Dyer
Team   : Leicester City
Match  : Leicester City vs AFC Bournemouth (ID 3754058)

Extracted offensive stats:

shots_attempted: 0
goals: 0
shots_on_target: 0
xg_total: 0.0
xg_avg: nan
penalties: 0
headers: 0
passes_attempted: 20
passes_completed: 17
pass_accuracy: 0.85
assists: 0
key_passes: 0
progressive_passes: 3
crosses: 1
switches: 0
avg_pass_angle: 1.0
avg_pass_length: 14.37
carries_attempted: 26
carry_distance_total: 162.51
progressive_carries: 3
carries_to_penalty_area: 2
dribbles_attempted: 1
dribbles_completed: 0
dribble_success_rate: 0.0
dribble_overruns: 0


### 2. Defensive Actions

This function extracts the main **defensive contribution metrics** for a player given its events.  
The analysis is based on StatsBomb event types that describe defensive activity, ball recovery, and duels:

- **Duels** → attempted, won, lost, and duel success ratio

- **Interceptions** → attempted, successful (won), lost, and interception ratio

- **Blocks** → number of blocks made against opponent passes or shots

- **Clearances** → defensive actions to remove danger by clearing the ball

- **Ball Recoveries** → regaining possession of the ball

- **Pressures** → pressing actions applied on opponents

- **Dispossessed** → number of times the player lost possession under pressure.

In [408]:
def extract_defensive_stats(player_events):
    """
    Extract defensive statistics from match/player events
    Processes StatsBomb event types: Duel, Interception,
    Block, Clearance, Ball Recovery, Pressure, Dispossessed
    
    Args:
        player_events (pd.DataFrame): StatsBomb events for a single player
    
    Returns:
        dict: Dictionary with aggregated defensive metrics
    """

    stats = {}

    # DUELS EVENTS
    duels = player_events[player_events['type'] == 'Duel']

    # Duels Attempted (total duels)
    stats['duels_attempted'] = len(duels)

    # Duels Won
    stats['duels_won'] = (duels['duel_outcome'] == 'Won').sum()

    # Duels Lost (total duels - duels won)
    stats['duels_lost'] = stats['duels_attempted'] - stats['duels_won']

    # Duels Ratio (number of duels won / total duels attempted)
    stats['duels_ratio'] = (
        stats['duels_won'] / stats['duels_attempted']
        if stats['duels_attempted'] > 0 else np.nan
    )



    # INTERCEPTIONS EVENTS
    interceptions = player_events[player_events['type'] == 'Interception']

    # Interceptions Attempted
    stats['interceptions_attempted'] = len(interceptions)

    if 'interception_outcome' in interceptions:
        # Interceptions Won (outcome == 'Won')
        stats['interceptions_won'] = (interceptions['interception_outcome'] == 'Won').sum()
        stats['interceptions_lost'] = stats['interceptions_attempted'] - stats['interceptions_won']
    else:
        # Fallback: assume all successful
        stats['interceptions_won'] = stats['interceptions_attempted']
        stats['interceptions_lost'] = 0

    # Interceptions Ratio
    stats['interceptions_ratio'] = (
        stats['interceptions_won'] / stats['interceptions_attempted']
        if stats['interceptions_attempted'] > 0 else 1.0
    )



    # BLOCKS EVENTS
    blocks = player_events[player_events['type'] == 'Block']
    stats['blocks'] = len(blocks)

    # CLEARANCES EVENTS
    clearances = player_events[player_events['type'] == 'Clearance']
    stats['clearances'] = len(clearances)

    # BALL RECOVERIES EVENTS
    recoveries = player_events[player_events['type'] == 'Ball Recovery']
    stats['ball_recoveries'] = len(recoveries)

    # PRESSURES EVENTS
    pressures = player_events[player_events['type'] == 'Pressure']
    stats['pressures'] = len(pressures)

    # DISPOSSESSED EVENTS
    dispossessed = player_events[player_events['type'] == 'Dispossessed']
    stats['dispossessed'] = len(dispossessed)



    # Round ratios only
    for key in ['duels_ratio', 'interceptions_ratio']:
        if key in stats and isinstance(stats[key], (float, np.floating)):
            stats[key] = round(stats[key], 2)

    return stats


In [412]:
# TEST ON A SINGLE PLAYER

# Load events for that match 
player_events = sb.events(match_id=match_id)

# Extract unique players from events (skip NaNs)
players_in_match = player_events[['player_id', 'player', 'team']].dropna().drop_duplicates()

# Pick one player random
player_row = players_in_match.iloc[randint(0, len(players_in_match)-1)]
player_id = player_row['player_id']
player_name = player_row['player']
team_name = player_row['team']

# Filter events for that player
player_events = player_events[player_events['player_id'] == player_id]

print("\nEXAMPLE PLAYER SELECTED")
print(f"Player : {player_name}")
print(f"Team   : {team_name}")
print(f"Total events for player in match: {len(player_events)}")

# Extract defensive stats
player_stats = extract_defensive_stats(player_events)

# Print summary
print("\nExtracted Defensive stats:")
for k, v in player_stats.items():
    print(f"{k}: {v}")



EXAMPLE PLAYER SELECTED
Player : Simon Francis
Team   : AFC Bournemouth
Total events for player in match: 131

Extracted Defensive stats:
duels_attempted: 2
duels_won: 0
duels_lost: 2
duels_ratio: 0.0
interceptions_attempted: 2
interceptions_won: 1
interceptions_lost: 1
interceptions_ratio: 0.5
blocks: 0
clearances: 8
ball_recoveries: 0
pressures: 3
dispossessed: 0


### 3. Goalkeeper Actions

This function extracts the main **goalkeeping performance metrics** for a player given its events.  
It requires both the full match events (`events_df`) and the goalkeeper’s own events (`gk_events`) to correctly account for goals conceded, including own goals.

Metrics include:

- **Goals Conceded** → from goalkeeper events (*Goal Conceded*, *Penalty Conceded*) and own goals when the GK was on the pitch.

- **Clean Sheet** → 1 if no goals conceded, else 0

- **Shots Faced** → number of shots registered against the goalkeeper

- **Saves** → total saves made, including penalties saved

- **Save Ratio** → saves / (saves + goals conceded)

- **Penalties Saved** → successful penalty saves

- **Area Command** → claims, punches, and clearances performed

- **Sweeper / Smother Actions** → defensive actions outside the goal line

- **Reliability** → errors and negative outcomes (failures, no touch, dangerous plays).  

In [413]:
def extract_goalkeeper_stats(events_df, gk_events):
    """
    Extract goalkeeper statistics from match/player events.
    Needs both full match events (events_df) and the goalkeeper's own events (gk_events).
    
    Args:
        events_df (pd.DataFrame): StatsBomb events for the full match
        gk_events (pd.DataFrame): StatsBomb events filtered only for the goalkeeper
    
    Returns:
        dict: Dictionary with aggregated goalkeeper metrics
    """

    if gk_events.empty:
        stats.update({
            'gk_shots_faced': 0,
            'gk_saves': 0,
            'gk_save_ratio': 0,
            'gk_penalties_saved': 0,
            'gk_claims': 0,
            'gk_punches': 0,
            'gk_clearances': 0,
            'gk_smother': 0,
            'gk_sweeper': 0,
            'gk_errors': 0
        })
        return stats

    stats = {}

     # GOALS CONCEDED
    # From GK events: Goal Conceded + Penalty Conceded (goal conceded from penalty)
    goals_conceded = gk_events['goalkeeper_type'].eq('Goal Conceded').sum() + \
                     gk_events['goalkeeper_type'].eq('Penalty Conceded').sum()

    # Add Own Goals (only if this GK was on the pitch at that moment)
    if 'match_id' in events_df.columns and not events_df[events_df['type'] == 'Own Goal Against'].empty:
        match_id = events_df['match_id'].iloc[0]
        lineups_dict = sb.lineups(match_id=match_id)

        gk_id = gk_events['player_id'].iloc[0]
        gk_team = gk_events['team'].iloc[0]

        # Get intervals of play for this goalkeeper
        play_spans = []
        for _, team_df in lineups_dict.items():
            row = team_df[team_df["player_id"] == gk_id]
            if not row.empty:
                positions = row.iloc[0].get("positions", [])
                for pos in positions:
                    if pos.get("position") == "Goalkeeper":
                        start_min = int(pos.get("from", "0:00").split(":")[0])
                        to_str = pos.get("to")
                        end_min = int(to_str.split(":")[0]) if to_str else 120
                        play_spans.append((start_min, end_min))
                break


        # Check own goals against GK's team
        own_goals = events_df[events_df['type'] == 'Own Goal Against']
        for _, og in own_goals.iterrows():
            if og['team'] == gk_team and any(s <= og["minute"] <= e for s, e in play_spans):
                goals_conceded += 1

    stats['gk_goals_conceded'] = int(goals_conceded)


    # CLEAN SHEET
    stats['gk_clean_sheet'] = 1 if goals_conceded == 0 else 0

    # SHOT STOPPING EVENTS
    stats['gk_shots_faced'] = (gk_events['goalkeeper_type'] == 'Shot Faced').sum()

    stats['gk_saves'] = gk_events['goalkeeper_type'].isin([
        'Save','Shot Saved','Shot Saved Off','Shot Saved to Post',
        'Saved to Post','Saved Twice','Penalty Saved','Penalty Saved to Post'
    ]).sum()

    stats['gk_penalties_saved'] = gk_events['goalkeeper_type'].isin([
        'Penalty Saved','Penalty Saved to Post'
    ]).sum()



    # AREA COMMAND EVENTS
    stats['gk_claims'] = gk_events['goalkeeper_type'].isin(['Collected','Collected Twice','Claim']).sum()
    stats['gk_punches'] = gk_events['goalkeeper_type'].isin(['Punch','Punched out']).sum()
    stats['gk_clearances'] = (gk_events['goalkeeper_outcome'] == 'Clear').sum()



    # SWEEPER / SMOTHER EVENTS
    stats['gk_smother'] = (gk_events['goalkeeper_type'] == 'Smother').sum()
    stats['gk_sweeper'] = (gk_events['goalkeeper_type'] == 'Keeper Sweeper').sum()



    # RELIABILITY EVENTS (errors, dangerous actions)
    stats['gk_errors'] = gk_events['goalkeeper_outcome'].isin([
        'Fail','No Touch','In Play Danger','Touched in','Lost in play','Lost out'
    ]).sum()



    # ROUND RATIOS
    stats['gk_save_ratio'] = (
        stats['gk_saves'] / (stats['gk_saves'] + goals_conceded)
        if (stats['gk_saves'] + goals_conceded) > 0 else 0
    )
    if isinstance(stats['gk_save_ratio'], (float, np.floating)):
        stats['gk_save_ratio'] = round(stats['gk_save_ratio'], 2)

    return stats


In [414]:
# TEST ON ONE GOALKEEPER

player_events = sb.events(match_id=match_id)
players_in_match = player_events[['player_id','player','team','position']].dropna().drop_duplicates()

# Pick a random GK
gk_row = players_in_match[players_in_match['position'] == 'Goalkeeper'].sample(1).iloc[0]
gk_id, gk_name, gk_team = gk_row['player_id'], gk_row['player'], gk_row['team']

print("EXAMPLE GOALKEEPER SELECTED")
print(f"Goalkeeper : {gk_name}")
print(f"Team       : {gk_team}")

# Get GK events
gk_events = player_events[player_events['player_id'] == gk_id]

# Extract GK stats
gk_stats = extract_goalkeeper_stats(player_events, gk_events)

print("\nExtracted Goalkeeper stats:\n")
for k,v in gk_stats.items():
    print(f"{k}: {v}")


EXAMPLE GOALKEEPER SELECTED
Goalkeeper : Artur Boruc
Team       : AFC Bournemouth

Extracted Goalkeeper stats:

gk_goals_conceded: 0
gk_clean_sheet: 1
gk_shots_faced: 14
gk_saves: 3
gk_penalties_saved: 1
gk_claims: 5
gk_punches: 1
gk_clearances: 0
gk_smother: 0
gk_sweeper: 0
gk_errors: 0
gk_save_ratio: 1.0


### 4. Discipline and Fouls

This function extracts the main **discipline and foul-related metrics** for a player given its events.  
It combines information from both the events (fouls, own goals) and the lineups (cards).

Metrics include:

- **Fouls Committed** → number of fouls committed by the player

- **Fouls Won** → number of fouls gained

- **Fouls Balance** → fouls won minus fouls committed, to highlight fair play or aggressiveness

- **Own Goals** → number of own goals scored

- **Yellow Cards** → retrieved from `sb.lineups(match_id)`

- **Red Cards** → includes both straight red cards and second yellow

In [415]:
def extract_discipline_stats(player_events):
    """
    Extract discipline statistics from match/player events.
    Uses both events (fouls, own goals) and lineups (cards).

    Args:
        player_events (pd.DataFrame): StatsBomb events for the single player in one match

    Returns:
        dict: Dictionary with aggregated discipline metrics
    """

    stats = {}

    # HANDLE CASE WITH EMPTY DF
    if player_events.empty:
        stats.update({
            'fouls_committed': 0,
            'fouls_won': 0,
            'fouls_balance': 0,
            'own_goals': 0,
            'yellow_cards': 0,
            'red_cards': 0
        })
        return stats



    # MATCH ID
    if "match_id" not in player_events.columns:
        raise ValueError("events_df must contain 'match_id' column")
    match_id = player_events["match_id"].iloc[0]

    # LOAD LINEUPS
    lineups_dict = sb.lineups(match_id=match_id)



    # FOULS EVENTS
    fouls_committed = player_events[player_events["type"] == "Foul Committed"]
    fouls_won = player_events[player_events["type"] == "Foul Won"]

    stats["fouls_committed"] = len(fouls_committed)
    stats["fouls_won"] = len(fouls_won)
    stats["fouls_balance"] = stats["fouls_won"] - stats["fouls_committed"]



    # OWN GOALS EVENTS
    own_goals = player_events[player_events["type"].isin(["Own Goal For", "Own Goal Against"])]
    stats["own_goals"] = len(own_goals)



    # CARDS EVENTS (from lineups)
    player_id = player_events["player_id"].iloc[0]
    yellow_cards, red_cards = 0, 0

    for _, team_df in lineups_dict.items():
        row = team_df[team_df["player_id"] == player_id]
        if not row.empty:
            cards_list = row.iloc[0]["cards"]
            if isinstance(cards_list, list):
                for card in cards_list:
                    ctype = card.get("card_type")
                    if ctype == "Yellow Card":
                        yellow_cards += 1
                    elif ctype in ["Red Card", "Second Yellow"]:
                        red_cards += 1
            break

    stats["yellow_cards"] = yellow_cards
    stats["red_cards"] = red_cards

    return stats


In [430]:
# TEST ON A SINGLE PLAYER

# Load events for that match 
player_events = sb.events(match_id=match_id)

# Extract unique players from events (skip NaNs)
players_in_match = player_events[['player_id', 'player', 'team']].dropna().drop_duplicates()

# Pick one player random
player_row = players_in_match.iloc[randint(0, len(players_in_match)-1)]
player_id = player_row['player_id']
player_name = player_row['player']
team_name = player_row['team']

# Filter events for that player
player_events = player_events[player_events['player_id'] == player_id]

print("\nEXAMPLE PLAYER SELECTED")
print(f"Player : {player_name}")
print(f"Team   : {team_name}")
print(f"Total events for player in match: {len(player_events)}")

# Extract discipline stats
player_stats = extract_discipline_stats(player_events)

# Print summary
print("\nDiscipline Stats for Player:")
for k, v in player_stats.items():
    print(f"{k}: {v}")



EXAMPLE PLAYER SELECTED
Player : Marc Albrighton
Team   : Leicester City
Total events for player in match: 104

Discipline Stats for Player:
fouls_committed: 2
fouls_won: 1
fouls_balance: -1
own_goals: 0
yellow_cards: 1
red_cards: 0


### 5. Context and Playing Time

This function computes a player’s **availability and minutes played** in a match using only event data and lineups

- **Match duration** is computed from `Half Start` / `Half End` events, so it **includes added time** (and extra time if present)

- **Substitutions** are read from `Substitution` events:
  - If the player **comes on** (`substitution_replacement_id`) → minutes = `match_duration - minute_in`.
  - If the player **goes off** (`player_id`) → minutes = `minute_out`
  - If both happen → minutes = `minute_out - minute_in`
  - If neither → the player is assumed to have played the **full match duration**

- **Starter flag** is derived from `lineups` (`positions[0]['from'] == "0:00"`)

- **Positions played** are collected from the `positions` list in lineups and deduplicated.

**Returned metrics**

- `minutes_played` — total minutes on pitch (with seconds in 60ths, e.g., `78.67` ≈ 78’40”)

- `matches_started` — 1 if the player started, else 0 

- `substitutions_in` — 1 if the player came on 

- `substitutions_out` — 1 if the player went off  

- `full_matches` — 1 if started and was never subbed off (played entire match duration)  

- `positions_played` — list of role names played in the match

In [431]:
def extract_context_playing_time(events_df, player_events):
    """
    Extract context and playing time statistics for a single-player performance
    Uses events_df (full match) + lineups (no sb.matches)
    Match duration is computed from Half Start / Half End events

    Args:
        events_df (pd.DataFrame): StatsBomb events for the full match
        player_events (pd.DataFrame): StatsBomb events for the single player

    Returns:
        dict: Dictionary with aggregated playing time metrics
    """

    stats = {}

    if player_events.empty:
        stats.update({
            "minutes_played": 0,
            "matches_started": 0,
            "substitutions_in": 0,
            "substitutions_out": 0,
            "full_matches": 0,
            "positions_played": []
        })
        return stats

    # PLAYER ID
    player_id = player_events["player_id"].iloc[0]

    # MATCH ID
    match_id = events_df["match_id"].iloc[0]
    lineups_dict = sb.lineups(match_id=match_id)

    # MATCH DURATION (from Half Start / Half End)
    duration = 0.0
    half_start = events_df[events_df["type"] == "Half Start"]
    half_end = events_df[events_df["type"] == "Half End"]

    for period in sorted(events_df["period"].unique()):
        start_ev = half_start[half_start["period"] == period]
        end_ev = half_end[half_end["period"] == period]
        if not start_ev.empty and not end_ev.empty:
            start_min = int(start_ev.iloc[0]["minute"]) + int(start_ev.iloc[0]["second"]) / 60.0
            end_min = int(end_ev.iloc[0]["minute"]) + int(end_ev.iloc[0]["second"]) / 60.0
            duration += (end_min - start_min)
    match_duration = round(duration, 2)

    # INIT
    minutes_played = 0.0
    matches_started = 0
    subs_in, subs_out, full_matches = 0, 0, 0
    positions_played = []

    # SUBSTITUTION EVENTS
    subs_events = events_df[events_df["type"] == "Substitution"]

    sub_in_time, sub_out_time = None, None
    if not subs_events.empty:
        # Player out
        if (subs_events["player_id"] == player_id).any():
            sub_row = subs_events[subs_events["player_id"] == player_id].iloc[0]
            sub_out_time = int(sub_row["minute"]) + int(sub_row["second"]) / 60.0
        # Player in
        if (subs_events["substitution_replacement_id"] == player_id).any():
            sub_row = subs_events[subs_events["substitution_replacement_id"] == player_id].iloc[0]
            sub_in_time = int(sub_row["minute"]) + int(sub_row["second"]) / 60.0

    # POSITIONS (for role list + check starter)
    for _, team_df in lineups_dict.items():
        row = team_df[team_df["player_id"] == player_id]
        if not row.empty:
            positions = row.iloc[0].get("positions", [])
            if isinstance(positions, list) and len(positions) > 0:
                for pos in positions:
                    if "position" in pos:
                        positions_played.append(pos["position"])
                if positions[0].get("from") in ["0:00", "00:00"]:
                    matches_started = 1
            break

    # COMPUTE MINUTES
    if sub_in_time is not None and sub_out_time is not None:
        minutes_played = sub_out_time - sub_in_time
        subs_in, subs_out = 1, 1
    elif sub_in_time is not None:
        minutes_played = match_duration - sub_in_time
        subs_in = 1
    elif sub_out_time is not None:
        minutes_played = sub_out_time
        subs_out = 1
    else:
        minutes_played = match_duration

    # FULL MATCH?
    if matches_started == 1 and subs_out == 0 and abs(minutes_played - match_duration) < 1.0:
        full_matches = 1

    # SAVE
    stats["minutes_played"] = round(minutes_played, 2)
    stats["matches_started"] = matches_started
    stats["substitutions_in"] = subs_in
    stats["substitutions_out"] = subs_out
    stats["full_matches"] = full_matches
    stats["positions_played"] = list(set(positions_played))

    return stats


In [432]:
# TEST ON A SINGLE PLAYER

# Load events for that match 
player_events = sb.events(match_id=match_id)

# Extract unique players from events (skip NaNs)
players_in_match = player_events[['player_id', 'player', 'team']].dropna().drop_duplicates()

# Pick one player random
player_row = players_in_match.iloc[randint(0, len(players_in_match)-1)]
player_id = player_row['player_id']
player_name = player_row['player']
team_name = player_row['team']

# Filter events for that player
player_events = player_events[player_events['player_id'] == player_id]


print("EXAMPLE PLAYER SELECTED")
print(f"Player : {player_name}")
print(f"Team   : {team_name}")
print(f"Total events for player in match: {len(player_events)}")

# Extract context stats
player_stats = extract_context_playing_time(player_events, player_events)

# Print summary
print("\nContext Stats for Player:")
for k, v in player_stats.items():
    print(f"{k}: {v}")


EXAMPLE PLAYER SELECTED
Player : Riyad Mahrez
Team   : Leicester City
Total events for player in match: 208

Context Stats for Player:
minutes_played: 0.0
matches_started: 1
substitutions_in: 0
substitutions_out: 0
full_matches: 1
positions_played: ['Right Midfield']


### Aggregated Player Statistics for match

This function collects all the relevant statistics for **every player in a single match**. It combines the multiple stat-extraction functions impleted above into one pipeline:

- **Offensive stats** → shooting, passing, carrying, dribbling

- **Defensive stats** → duels, interceptions, pressures, recoveries, etc.

- **Discipline stats** → fouls, own goals, cards (via `sb.lineups`)

- **Context & playing time** → minutes played, starter status, substitutions, positions played

- **Goalkeeper stats** → only for players identified as goalkeepers in the lineups

**Inputs**

- `match_id`: unique match identifier

- `competition`: competition name (e.g., Premier League)

- `season`: season name (e.g., 2015/2016)

**Outputs**

- A `DataFrame` with **one row per player in the match**, enriched with all computed statistics


In [None]:
def collect_player_match_stats(match_id, competition, season):
    """
    Collect all player-match statistics for a given match_id.

    For each player in the match, combine:
      - Offensive stats
      - Defensive stats
      - Discipline stats
      - Context & playing time stats
      - Goalkeeper stats (only if player is GK)

    Args:
        match_id (int): Match identifier
        competition (str): Competition name (from sb.matches)
        season (str): Season name (from sb.matches)

    Returns:
        pd.DataFrame: one row per player with all stats for the match
    """

    # Load events & lineups
    events_df = sb.events(match_id=match_id)
    lineups_dict = sb.lineups(match_id=match_id)

    # Unique players in the match (from events)
    players_in_match = events_df[['player_id', 'player', 'team']].dropna().drop_duplicates()

    rows = []

    for _, player_row in players_in_match.iterrows():
        player_id = int(player_row['player_id'])
        player_name = player_row['player']
        team_name = player_row['team']

        # Events for this player
        player_events = events_df[events_df['player_id'] == player_id]

        # Base record
        base = {
            "competition": competition,
            "season": season,
            "match_id": match_id,
            "team": team_name,
            "player_id": player_id,
            "player_name": player_name,
        }

        # Call Stats functions
        off_stats = extract_offensive_stats(player_events)
        def_stats = extract_defensive_stats(player_events)
        disc_stats = extract_discipline_stats(player_events)   # discipline → player_events
        ctx_stats = extract_context_playing_time(events_df, player_events)  # context → full events + player_events

        # GK stats (only if player is GK)
        gk_stats = {}
        for _, team_df in lineups_dict.items():
            row = team_df[team_df["player_id"] == player_id]
            if not row.empty:
                positions = row.iloc[0].get("positions", [])
                if isinstance(positions, list):
                    if any("Goalkeeper" in str(pos.get("position", "")) for pos in positions):
                        gk_stats = extract_goalkeeper_stats(events_df, player_events)  # GK → full events + player events
                break

        # Merge all
        row_dict = {**base, **off_stats, **def_stats, **disc_stats, **ctx_stats, **gk_stats}
        rows.append(row_dict)

    return pd.DataFrame(rows)


In [404]:
# TEST 

# Get match data from StatsBomb
matches = sb.matches(competition_id=2, season_id=27)  # 2 = Premier League, 27 = 2015/16
example_match = matches.iloc[0]   # first available match

match_id = example_match["match_id"]
competition = example_match["competition"]
season = example_match["season"]

print("Testing on match:")
print(f"Match ID: {match_id}")
print(f"Competition: {competition}")
print(f"Season: {season}")

# Estrazione statistiche per tutti i giocatori del match
df_stats = collect_player_match_stats(match_id, competition, season)

print("\nPlayer stats DataFrame:")
df_stats


Testing on match:
Match ID: 3754058
Competition: England - Premier League
Season: 2015/2016

Player stats DataFrame:


Unnamed: 0,competition,season,match_id,team,player_id,player_name,shots_attempted,goals,shots_on_target,xg_total,xg_avg,penalties,headers,passes_attempted,passes_completed,pass_accuracy,assists,key_passes,progressive_passes,crosses,switches,avg_pass_angle,avg_pass_length,carries_attempted,carry_distance_total,progressive_carries,carries_to_penalty_area,dribbles_attempted,dribbles_completed,dribble_success_rate,dribble_overruns,duels_attempted,duels_won,duels_lost,duels_ratio,interceptions_attempted,interceptions_won,interceptions_lost,interceptions_ratio,blocks,clearances,ball_recoveries,pressures,dispossessed,fouls_committed,fouls_won,fouls_balance,own_goals,yellow_cards,red_cards,minutes_played,matches_started,substitutions_in,substitutions_out,full_matches,positions_played,gk_goals_conceded,gk_clean_sheet,gk_shots_faced,gk_saves,gk_penalties_saved,gk_claims,gk_punches,gk_clearances,gk_smother,gk_sweeper,gk_errors,gk_save_ratio
0,England - Premier League,2015/2016,3754058,AFC Bournemouth,3343,Dan Gosling,2,0,0,0.3,0.15,0,0,36,31,0.86,0,0,3,1,1,-0.54,17.3,24,124.22,0,1,0,0,,0,6,2,4,0.33,2,2,0,1.0,3,3,2,38,4,0,2,2,0,0,0,95.75,1,0,0,1,[Right Center Midfield],,,,,,,,,,,,
1,England - Premier League,2015/2016,3754058,AFC Bournemouth,3346,Joshua King,2,0,0,0.42,0.21,0,1,7,5,0.71,0,1,0,1,0,-1.4,12.19,14,97.56,2,3,3,3,1.0,0,2,0,2,0.0,0,0,0,1.0,0,0,2,17,3,0,2,2,0,0,0,50.7,1,0,1,0,[Center Forward],,,,,,,,,,,,
2,England - Premier League,2015/2016,3754058,AFC Bournemouth,3344,Andrew Surman,0,0,0,0.0,,0,0,49,41,0.84,0,2,8,0,2,0.11,21.42,29,92.51,1,0,0,0,,0,5,0,5,0.0,3,1,2,0.33,2,2,5,12,0,2,2,0,0,0,0,95.75,1,0,0,1,[Center Defensive Midfield],,,,,,,,,,,,
3,England - Premier League,2015/2016,3754058,AFC Bournemouth,6409,Adam Smith,2,0,0,0.04,0.02,0,0,46,36,0.78,0,0,9,0,1,-0.87,16.09,37,327.82,7,0,4,2,0.5,1,10,2,8,0.2,3,1,2,0.33,1,3,4,22,2,2,1,-1,0,0,0,95.75,1,0,0,1,[Right Back],,,,,,,,,,,,
4,England - Premier League,2015/2016,3754058,AFC Bournemouth,3608,Simon Francis,0,0,0,0.0,,0,0,44,38,0.86,0,0,22,1,2,-0.2,27.02,32,228.83,3,0,0,0,,0,2,0,2,0.0,2,1,1,0.5,0,8,0,3,0,1,0,-1,0,0,1,95.75,1,0,0,1,"[Left Center Midfield, Right Center Back]",,,,,,,,,,,,
5,England - Premier League,2015/2016,3754058,AFC Bournemouth,3341,Steve Cook,0,0,0,0.0,,0,0,58,49,0.84,0,0,22,0,6,0.2,28.08,35,386.28,12,0,0,0,,0,2,0,2,0.0,3,1,2,0.33,1,12,2,7,0,0,0,0,0,0,0,95.75,1,0,0,1,"[Right Center Back, Left Center Back]",,,,,,,,,,,,
6,England - Premier League,2015/2016,3754058,AFC Bournemouth,3345,Charlie Daniels,1,0,0,0.01,0.01,0,0,56,41,0.73,0,0,16,1,0,1.05,20.44,34,210.96,6,0,2,2,1.0,0,3,0,3,0.0,1,0,1,0.0,2,5,3,11,0,0,0,0,0,0,0,95.75,1,0,0,1,[Left Back],,,,,,,,,,,,
7,England - Premier League,2015/2016,3754058,Leicester City,3270,Danny Simpson,0,0,0,0.0,,0,0,46,39,0.85,0,0,11,2,0,-1.01,18.27,24,115.61,2,1,0,0,,0,2,0,2,0.0,2,2,0,1.0,4,3,3,11,0,0,0,0,0,0,0,73.18,1,0,1,0,[Right Back],,,,,,,,,,,,
8,England - Premier League,2015/2016,3754058,Leicester City,40123,Robert Huth,1,0,1,0.06,0.06,0,1,35,32,0.91,0,0,7,0,0,0.23,19.75,21,130.1,2,0,0,0,,0,1,0,1,0.0,0,0,0,1.0,1,1,3,11,0,1,0,-1,0,0,0,95.75,1,0,0,1,[Left Center Back],,,,,,,,,,,,
9,England - Premier League,2015/2016,3754058,AFC Bournemouth,3049,Matt Ritchie,0,0,0,0.0,,0,0,43,30,0.7,0,2,10,2,2,-0.25,23.68,28,177.7,1,0,1,1,1.0,0,5,1,4,0.2,3,0,3,0.0,2,2,8,26,2,0,2,2,0,0,0,95.75,1,0,0,1,[Right Midfield],,,,,,,,,,,,


### Aggregated Player Statistics for a Full Season (Top 5 European Leagues)

This function aggregates **player performance across an entire season** for a given competition.  
It builds on the per-match statistics collected with `collect_player_match_stats` and consolidates them at season level.

**Workflow**
1. Retrieve all matches for the given `competition_id` and `season_id`

2. For each match, compute **player-match statistics**

3. Concatenate all match data into a single dataset

4. Aggregate across the season:

   - **Numeric metrics** → summed (e.g., goals, passes, recoveries)

   - **Ratios & accuracies** → averaged (e.g., duel ratio, pass accuracy)

   - **Minutes played** → summed

   - **Positions played** → collected into a unique list of roles

   - **Teams** → collected into a list (to handle mid-season transfers)

**Outputs**

- A `DataFrame` with **one row per player per competition per season**, enriched with all aggregated statistics

- Players who changed team(s) during the season will have all their performances combined, with `teams` showing the list of clubs they represented


In [None]:
def collect_player_season_stats(competition_id, season_id):
    """
    Collect season-level aggregated statistics for all players in a given competition & season.

    Steps:
      - Iterate all matches of that season
      - Collect per-match player stats
      - Aggregate at season level
      - Handle players who changed team (teams → list)

    Args:
        competition_id (int): StatsBomb competition id
        season_id (int): StatsBomb season id

    Returns:
        pd.DataFrame: one row per player+competition+season with aggregated stats
    """

    # LOAD MATCHES
    matches = sb.matches(competition_id=competition_id, season_id=season_id)

    all_rows = []

    # ITERATE MATCHES
    for _, m in tqdm(matches.iterrows(), total=len(matches), desc=f"Season {season_id}"):
        match_id = int(m["match_id"])
        competition = m["competition_name"]
        season = m["season_name"]

        # PER-MATCH PLAYER STATS
        df_match = collect_player_match_stats(match_id, competition, season)
        all_rows.append(df_match)

    # CONCAT ALL MATCHES
    df_all = pd.concat(all_rows, ignore_index=True)

    # AGGREGATION
    # numeric cols
    numeric_cols = df_all.select_dtypes(include=["number"]).columns.tolist()
    meta_cols = ["competition", "season", "player_id", "player_name"]

    # build aggregation dict
    agg_dict = {}
    for col in numeric_cols:
        if col in meta_cols:   # skip metadata
            continue
        if "ratio" in col or "accuracy" in col:   # quality metrics → mean
            agg_dict[col] = "mean"
        elif col in ["minutes_played"]:           # minutes → sum
            agg_dict[col] = "sum"
        else:                                     # default → sum
            agg_dict[col] = "sum"

    # aggregate positions
    def agg_positions(series):
        positions = []
        for lst in series:
            if isinstance(lst, list):
                positions.extend(lst)
        return list(set(positions))

    # aggregate teams
    def agg_teams(series):
        return list(set(series))

    # aggregate numeric
    df_season = df_all.groupby(meta_cols).agg(agg_dict).reset_index()

    # aggregate positions
    df_pos = df_all.groupby(meta_cols)["positions_played"].apply(agg_positions).reset_index()
    df_season = df_season.merge(df_pos, on=meta_cols, how="left")

    # aggregate teams
    df_teams = df_all.groupby(meta_cols)["team"].apply(agg_teams).reset_index().rename(columns={"team": "teams"})
    df_season = df_season.merge(df_teams, on=meta_cols, how="left")

    # REORDER by competition, team, player_name
    df_season = df_season.sort_values(by=["competition", "teams", "player_name"]).reset_index(drop=True)

    return df_season

#### Premier League 2015/16

In [None]:
# Create the "data" folder if it doesn't exist
os.makedirs("data", exist_ok=True)

In [None]:
# Premier League 2015/16
df_premier = collect_player_season_stats(competition_id=2, season_id=27)
df_premier.to_csv("../task2_ballon_dor/data/premier_league_2015_16.csv", index=False)

In [None]:
# Test
df_test = pd.read_csv("../task2_ballon_dor/data/premier_league_2015_16.csv")
print("Premier League 2015/16 dataset shape:", df_test.shape)
df_test.head()

#### LaLiga 2015/16

In [None]:
# La Liga 2015/16
df_laliga = collect_player_season_stats(competition_id=11, season_id=27)
df_laliga.to_csv("../task2_ballon_dor/data/laliga_2015_16.csv", index=False)


In [None]:
# Test
df_test = pd.read_csv("../task2_ballon_dor/data/laliga_2015_16.csv")
print("La Liga 2015/16 dataset shape:", df_test.shape)
df_test.head()

#### Serie A 2015/16

In [None]:
# Serie A 2015/16
df_seriea = collect_player_season_stats(competition_id=55, season_id=43)
df_seriea.to_csv("../task2_ballon_dor/data/seriea_2015_16.csv", index=False)

In [None]:
# TEST
df_test = pd.read_csv("../task2_ballon_dor/data/seriea_2015_16.csv")
print("Serie A 2015/16 dataset shape:", df_test.shape)
df_test.head()

#### 1.Bundesliga 2015/16

In [None]:
# Bundesliga 2015/16
df_bundesliga = collect_player_season_stats(competition_id=9, season_id=27)
df_bundesliga.to_csv("../task2_ballon_dor/data/bundesliga_2015_16.csv", index=False)

In [None]:
# TEST
df_test = pd.read_csv("../task2_ballon_dor/data/bundesliga_2015_16.csv")
print("Bundesliga 2015/16 dataset shape:", df_test.shape)
df_test.head()

#### Ligue 1 2015/16

In [None]:
# Ligue 1 2015/16
df_ligue1 = collect_player_season_stats(competition_id=12, season_id=27)
df_ligue1.to_csv("../task2_ballon_dor/data/ligue1_2015_16.csv", index=False)

In [None]:
# TEST
df_test = pd.read_csv("../task2_ballon_dor/data/ligue1_2015_16.csv")
print("Ligue 1 2015/16 dataset shape:", df_test.shape)
df_test.head()
