# Task 2 - Data Preparation (StatsBomb Big 5 Leagues 2015/2016)

This notebook focuses on the preparation of the **StatsBomb open data** related to the Big 5 European leagues for the 2015/2016 season. The data will be loaded through the `statsbombpy` library, with an initial exploration of the available structures at both team and player level. Basic cleaning procedures will be applied to ensure fair comparisons across players. Finally, the processed datasets will be saved for subsequent analyses. 

## Note on Available Stats (Open Data Limitation)

The **`statsbombpy` library** provides convenient aggregated endpoints such as `team_season_stats`, `team_match_stats`, `player_season_stats`, and `player_match_stats`. However, these endpoints are **not available in the public open-data release** and require commercial credentials. As a result, this notebook **builds all team- and player-level statistics from scratch** using only the open-data endpoints:

- `sb.competitions()` – list of competitions/seasons  

- `sb.matches(competition_id, season_id)` – list of matches per competition/season  

- `sb.events(match_id)` – full on-ball event log for a match (shots, passes, dribbles, duels, pressures, etc.)  

- `sb.lineups(match_id)` – squads and players (used to infer minutes played together with events/substitutions)

In [1]:
from statsbombpy import sb

# Demo: Attempt to use an aggregated stats endpoint
# The function sb.player_season_stats() would normally return 
# season-level player statistics if commercial credentials were provided
# However, this endpoint is NOT available in the open-data release

try:
    # Example: attempt to load Premier League 2015/16 season stats
    _ = sb.player_season_stats(competition_id=2, season_id=27)
    
except Exception as e:
    # This error confirms that aggregated stats are not part of the open dataset
    print("Aggregated endpoint not available in open data. Falling back to events/lineups.")
    print(f"ERROR returned by statsbombpy: {e}")


Aggregated endpoint not available in open data. Falling back to events/lineups.
ERROR returned by statsbombpy: There is currently no open data for aggregated stats, please provide credentials




## Imports and Global Settings

In [2]:
import pandas as pd
import numpy as np
import os

from random import randint
from tqdm import tqdm
from statsbombpy import sb
from collections import Counter

import warnings
warnings.filterwarnings("ignore")

## Load Competitions and Filter 2015/16 Big 5

In [3]:
# Load all available competitions
competitions = sb.competitions()

display(competitions.columns.tolist())

print("All competitions available:")
display(competitions[["competition_id", "season_id", "competition_name", "season_name"]].head())


['competition_id',
 'season_id',
 'country_name',
 'competition_name',
 'competition_gender',
 'competition_youth',
 'competition_international',
 'season_name',
 'match_updated',
 'match_updated_360',
 'match_available_360',
 'match_available']

All competitions available:


| | competition_id | season_id | competition_name | season_name |
|---|---|---|---|---|
| 0 | 9 | 281 | 1. Bundesliga | 2023/2024 |
| 1 | 9 | 27 | 1. Bundesliga | 2015/2016 |
| 2 | 1267 | 107 | African Cup of Nations | 2023 |
| 3 | 16 | 4 | Champions League | 2018/2019 |
| 4 | 16 | 1 | Champions League | 2017/2018 |


In [4]:
# Filter competitions for season 2015/2016
season_year = "2015/2016"
competitions_1516 = competitions[competitions["season_name"] == season_year]

print("Competitions for season 2015/2016:")
display(competitions_1516[["competition_id", "season_id", "competition_name", "season_name"]])

Competitions for season 2015/2016:


| | competition_id | season_id | competition_name | season_name |
|---|---|---|---|---|
| 1 | 9 | 27 | 1. Bundesliga | 2015/2016 |
| 6 | 16 | 27 | Champions League | 2015/2016 |
| 43 | 11 | 27 | La Liga | 2015/2016 |
| 60 | 7 | 27 | Ligue 1 | 2015/2016 |
| 64 | 2 | 27 | Premier League | 2015/2016 |
| 66 | 12 | 27 | Serie A | 2015/2016 |


In [5]:
# Select Big 5 leagues and count matches
big5 = ["Premier League", "La Liga", "Serie A", "1. Bundesliga", "Ligue 1"]

competitions_big5_1516 = competitions_1516[
    competitions_1516["competition_name"].isin(big5)
].copy()

# Count matches for each competition
match_counts = []
for _, row in competitions_big5_1516.iterrows():

    # Retrieve competition id and season id
    comp_id = row["competition_id"]
    season_id = row["season_id"]

    # Retrieve the matches for each competition-season
    matches = sb.matches(competition_id=comp_id, season_id=season_id)

    # Count the number of matches and store it
    n_matches = matches.shape[0]
    match_counts.append(n_matches)

# Add matches column to the dataframe
competitions_big5_1516["num_matches"] = match_counts

# Display the results
print("Big 5 competitions in 2015/2016 with match counts:")
display(competitions_big5_1516[["competition_id", "season_id","competition_name", "season_name", "num_matches"]])

# Total
total_matches = competitions_big5_1516["num_matches"].sum()
print(f"Total matches in Big 5 competitions (2015/2016): {total_matches}")


Big 5 competitions in 2015/2016 with match counts:


| | competition_id | season_id | competition_name | season_name | num_matches |
|---|---|---|---|---|---|
| 1 | 9 | 27 | 1. Bundesliga | 2015/2016 | 306 |
| 43 | 11 | 27 | La Liga | 2015/2016 | 380 |
| 60 | 7 | 27 | Ligue 1 | 2015/2016 | 377 |
| 64 | 2 | 27 | Premier League | 2015/2016 | 380 |
| 66 | 12 | 27 | Serie A | 2015/2016 | 380 |


Total matches in Big 5 competitions (2015/2016): 1823


> **NOTE**: For the 2015/2016 season, the StatsBomb open data provides the full set of matches for all Big 5 leagues except Ligue 1.  
> In Ligue 1, only 377 matches are available instead of the expected 380, due to a few games not being released in the public dataset.  
> This minor discrepancy (less than 1% of the total league games) is not considered problematic, as it does not significantly affect aggregated player or team statistics.
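
The expected totals follow from the double round-robin format, in which a league of `n` teams plays `n * (n - 1)` matches. The sketch below redoes that arithmetic with the match counts reported above. The helper name `expected_matches` and the team counts per league are assumptions added for illustration; they are not read from the StatsBomb data.

```python
def expected_matches(n_teams: int) -> int:
    """Matches in a double round-robin: every team hosts every other team once."""
    return n_teams * (n_teams - 1)

# Assumption: team counts are general knowledge about the 2015/16 leagues
league_sizes = {"1. Bundesliga": 18, "La Liga": 20, "Ligue 1": 20,
                "Premier League": 20, "Serie A": 20}

# Match counts as reported by sb.matches() in the output above
available = {"1. Bundesliga": 306, "La Liga": 380, "Ligue 1": 377,
             "Premier League": 380, "Serie A": 380}

# Difference between the expected and the available number of matches
missing = {name: expected_matches(n) - available[name]
           for name, n in league_sizes.items()}
missing_share = missing["Ligue 1"] / expected_matches(league_sizes["Ligue 1"])

print(missing)
print(f"Ligue 1 missing share: {missing_share:.2%}")
```

Only Ligue 1 shows a non-zero difference, and its three missing fixtures amount to under 1% of the league's schedule.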

### Identifying Missing Matches in Ligue 1 (2015/2016)

Ligue 1 should contain 380 matches in the 2015/2016 season, but only 377 are available in the StatsBomb open data. The cell below detects the match weeks where games are missing and identifies the teams involved by comparing the teams that played in each round with the complete set of Ligue 1 participants.

In [6]:
# Load Ligue 1 2015/16 matches
# Competition "Ligue 1" id: 7
# Season "2015/2016" id: 27
matches_ligue1 = sb.matches(competition_id=7, season_id=27)

# Group by match week and count matches
# .groupby("match_week") groups the DataFrame by each round of the season
# .size() counts the number of rows (i.e., matches) per group
matches_per_week = matches_ligue1.groupby("match_week").size()

# Identify the match weeks with fewer than 10 matches in that round
incomplete_weeks = matches_per_week[matches_per_week < 10]

print("Match weeks with missing games:\n")
print(incomplete_weeks)

# Retrieve the full set of teams that appear across the season
all_teams = set(matches_ligue1["home_team"]).union(set(matches_ligue1["away_team"]))

# Loop through each incomplete week to identify missing teams
for week in incomplete_weeks.index:
    print(f"\nMatch Week {week}")
    
    # Extract all matches for that week
    week_matches = matches_ligue1[matches_ligue1["match_week"] == week]
    
    # Collect all teams that played (both home and away) during that week
    played_teams = set(week_matches["home_team"]).union(set(week_matches["away_team"]))
    
    # Identify the teams that did not play in that week
    missing_teams = all_teams - played_teams

    # Print the missing teams that should form the missing match
    if missing_teams:
        print(f"Missing match: {list(missing_teams)} did not play")


Match weeks with missing games:

match_week
14    9
23    9
36    9
dtype: int64

Match Week 14
Missing match: ['Bastia', 'Gazélec Ajaccio'] did not play

Match Week 23
Missing match: ['Paris Saint-Germain', 'Saint-Étienne'] did not play

Match Week 36
Missing match: ['Troyes', 'Bordeaux'] did not play


The identification of three missing matches in the Ligue 1 dataset for the 2015/2016 season does not pose a significant issue for the analysis. Most of the teams involved did not have players realistically competing for the Ballon d’Or. The only notable exception is *Paris Saint-Germain*; however, given the substantial number of their matches still available, the absence of this single fixture is unlikely to materially affect the aggregated player statistics considered in the study.

## Building Player and Team Statistics from Events and Lineups

### Event Categorization for Ballon d’Or Player Evaluation

In [7]:
def list_event_types(competition_id: int, season_id: int):
    """
    Print and return all unique event types in a competition/season.
    
    Args:
        competition_id (int): StatsBomb competition ID 
        season_id (int): StatsBomb season ID 

    Returns:
        set: Unique event type names found across all matches
    """
    # Load matches
    matches = sb.matches(competition_id=competition_id, season_id=season_id)
    
    event_types = set()
    
    for _, match in tqdm(matches.iterrows(), total=matches.shape[0]):
        match_id = match["match_id"]
        events = sb.events(match_id=match_id)
        event_types.update(events["type"].unique())
    
    print(f"Unique event types in competition {competition_id}, season {season_id}:")
    for etype in sorted(event_types):
        print("-", etype)
    
    return event_types

# Example: Premier League 2015/16 (competition_id=2, season_id=27)
event_types = list_event_types(competition_id=2, season_id=27)
print(f"Total unique event types found: {len(event_types)}")


100%|██████████| 380/380 [03:34<00:00,  1.77it/s]

Unique event types in competition 2, season 27:
- 50/50
- Bad Behaviour
- Ball Receipt*
- Ball Recovery
- Block
- Carry
- Clearance
- Dispossessed
- Dribble
- Dribbled Past
- Duel
- Error
- Foul Committed
- Foul Won
- Goal Keeper
- Half End
- Half Start
- Injury Stoppage
- Interception
- Miscontrol
- Offside
- Own Goal Against
- Own Goal For
- Pass
- Player Off
- Player On
- Pressure
- Referee Ball-Drop
- Shield
- Shot
- Starting XI
- Substitution
- Tactical Shift
Total unique event types found: 33





#### Considerations

After the event-level analysis, only those categories that provide **clear and actionable insights** into individual player performance were retained.  
Events considered marginal, redundant, or not directly informative for evaluation have been excluded.

**1. Offensive & Possession Actions**

Events directly related to attacking play, chance creation, and ball progression:

- *Shot*  
- *Pass*  
- *Carry*  
- *Dribble*  

**2. Defensive Actions**

Events that measure defensive contribution and ball recovery:

- *Duel*  
- *Dribbled Past*  
- *Interception*  
- *Block*  
- *Clearance*  
- *Ball Recovery*  
- *Pressure*  
- *Dispossessed*  

**3. Goalkeeping**

Events specifically describing goalkeeper activity:

- *Goal Keeper*  

**4. Discipline & Fouls**

Events linked to fouls, discipline, and negative contributions:

- *Foul Committed*  
- *Foul Won*  
- *Own Goal For / Against*  

> Note: for card-related information (yellow/red cards), we leverage the more detailed data available from `sb.lineups(match_id)`.

**5. Context & Playing Time**

Events providing information on player availability, minutes played, and tactical role:

- *Starting XI*  
- *Substitution*  
- *Half Start / Half End*  

**Excluded Events**

The following events were excluded from further analysis as they provide limited, redundant, or indirect information about individual performance:

- *Tactical Shift* → Indicates formation or role changes; excluded for simplicity.  
- *Player On / Player Off* → Redundant; already covered by lineups and substitution events.  
- *Injury Stoppage* → Contextual interruption; no performance insight.  
- *Referee Ball-Drop* → Administrative event; no performance value.  
- *Shield* → Hard to quantify in terms of individual performance.  
- *Error* → Ambiguous; overlaps with dispossession or miscontrol events.  
- *Miscontrol* → Already captured under *Dispossessed*.  
- *Offside* → Primarily a team-level outcome; limited individual insight.  
- *Ball Receipt* → Redundant; every completed pass implies a ball reception.  
- *50/50* → Already encompassed within *Duel*.  
- *Bad Behaviour* → Less detailed compared to card information from lineups.  
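
The grouping above can be written down as a lookup table, which makes it easy to check that every one of the 33 event types is either retained or excluded. This is a minimal sketch: the names `EVENT_CATEGORIES`, `EXCLUDED_EVENTS`, and `categorize` are illustrative and are not used elsewhere in the notebook.

```python
# Retained event types, grouped as in the five categories listed above
EVENT_CATEGORIES = {
    "offensive": {"Shot", "Pass", "Carry", "Dribble"},
    "defensive": {"Duel", "Dribbled Past", "Interception", "Block", "Clearance",
                  "Ball Recovery", "Pressure", "Dispossessed"},
    "goalkeeping": {"Goal Keeper"},
    "discipline": {"Foul Committed", "Foul Won", "Own Goal For", "Own Goal Against"},
    "context": {"Starting XI", "Substitution", "Half Start", "Half End"},
}

# Event types dropped from the analysis
EXCLUDED_EVENTS = {"Tactical Shift", "Player On", "Player Off", "Injury Stoppage",
                   "Referee Ball-Drop", "Shield", "Error", "Miscontrol", "Offside",
                   "Ball Receipt*", "50/50", "Bad Behaviour"}

# Reverse lookup: event type -> category name
EVENT_TO_CATEGORY = {etype: category
                     for category, types in EVENT_CATEGORIES.items()
                     for etype in types}

def categorize(event_type: str) -> str:
    """Return the category of a retained event type, or 'excluded' otherwise."""
    return EVENT_TO_CATEGORY.get(event_type, "excluded")

retained = set(EVENT_TO_CATEGORY)
print(f"Retained: {len(retained)}, excluded: {len(EXCLUDED_EVENTS)}")
```

With such a mapping, an events DataFrame could be filtered with `events[events["type"].isin(retained)]` before any per-player aggregation.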


#### Note on Event Columns in StatsBomb Data

The StatsBomb event dataset contains a mixture of **shared attributes** (present in all events) and **event-specific attributes** (only relevant for certain event types). When these events are flattened into a DataFrame, only the columns that actually appear in that match are created. As a result, **the number of columns in the events DataFrame can vary from match to match**, depending on the types of actions recorded.  

What remains consistent are the shared fields, while event-specific fields appear only when relevant for that particular match.
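
Because event-specific columns may be absent in a given match, code that reads them benefits from a guard. The sketch below uses two toy DataFrames in place of real `sb.events()` output; the helper name `flag_count` is hypothetical and the data is invented for illustration.

```python
import pandas as pd

def flag_count(events: pd.DataFrame, column: str) -> int:
    """Count True values in an optional boolean column; return 0 if it is absent."""
    if column not in events.columns:
        return 0
    # Missing values compare unequal to True, so they are not counted
    return int(events[column].eq(True).sum())

# Two toy "matches": only the first contains a cross, so only it has the column
match_a = pd.DataFrame({"type": ["Pass", "Pass", "Shot"],
                        "pass_cross": [True, None, None]})
match_b = pd.DataFrame({"type": ["Pass", "Carry"]})

print(flag_count(match_a, "pass_cross"), flag_count(match_b, "pass_cross"))

# Concatenating matches aligns on the union of columns and fills gaps with NaN
combined = pd.concat([match_a, match_b], ignore_index=True)
print(sorted(combined.columns), combined.shape)
```

The same guard works on the concatenated frame, since rows from matches that lacked the column simply hold missing values.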


In [8]:
# Load matches from Premier League 2015/16 (comp_id=2, season_id=27)
matches = sb.matches(competition_id=2, season_id=27)

# Pick the first 10 matches
sample_matches = matches.head(10)

print("Number of columns in events DataFrame for 10 matches:\n")

for i, row in sample_matches.iterrows():
    match_id = row['match_id']
    events = sb.events(match_id=match_id)
    print(f"Match {i+1}: {row['home_team']} vs {row['away_team']} -> {events.shape[1]} columns")

Number of columns in events DataFrame for 10 matches:

Match 1: Leicester City vs AFC Bournemouth -> 90 columns
Match 2: West Bromwich Albion vs Sunderland -> 92 columns
Match 3: Newcastle United vs Aston Villa -> 89 columns
Match 4: Everton vs AFC Bournemouth -> 88 columns
Match 5: Crystal Palace vs Watford -> 95 columns
Match 6: Arsenal vs Aston Villa -> 95 columns
Match 7: West Bromwich Albion vs Liverpool -> 93 columns
Match 8: Tottenham Hotspur vs AFC Bournemouth -> 89 columns
Match 9: Leicester City vs Manchester City -> 88 columns
Match 10: Crystal Palace vs Everton -> 90 columns


### Example Match Extraction for Function Testing

In [9]:
# Load matches for Premier League 2015/16 (comp_id=2, season_id=27)
matches = sb.matches(competition_id=2, season_id=27)

# Select the match at index 0
first_match = matches.iloc[0]
match_id = first_match['match_id']

# Print summary information about the selected match
print("EXAMPLE MATCH SELECTED")
print(f"Competition : Premier League")
print(f"Season      : 2015/16")
print(f"Matchweek   : {first_match['match_week']}")
print(f"Date        : {first_match['match_date']}")
print(f"Home Team   : {first_match['home_team']}")
print(f"Away Team   : {first_match['away_team']}")
print(f"Final Score : {first_match['home_score']} - {first_match['away_score']}")
print(f"Match ID    : {match_id}")

EXAMPLE MATCH SELECTED
Competition : Premier League
Season      : 2015/16
Matchweek   : 20
Date        : 2016-01-02
Home Team   : Leicester City
Away Team   : AFC Bournemouth
Final Score : 0 - 0
Match ID    : 3754058


### 1. Offensive Actions

This function extracts the key **offensive metrics** for a player from that player's events.  
The analysis is based on StatsBomb event types that directly capture attacking contribution, chance creation, and ball progression:

- **Shots** → number of attempts, goals, shots on target, xG (total and average), penalties, headers

- **Passes** → attempted, completed, accuracy, assists, key passes, progressive passes (gaining ≥15 units along the pitch length in StatsBomb coordinates), crosses, switches of play, average angle and length

- **Carries** → number of carries, total distance covered, progressive carries (gaining ≥10 units along the pitch length), and carries ending inside the penalty area

- **Dribbles** → attempted, completed, success rate, and overruns (failed dribbles losing control)
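
The location-based metrics in this list reduce to simple coordinate checks. The sketch below illustrates them on made-up coordinates, assuming the StatsBomb pitch grid (x from 0 to 120, y from 0 to 80) with the attacking direction towards increasing x. The helper names are illustrative and the thresholds are passed in explicitly.

```python
import math

PITCH_LENGTH = 120  # StatsBomb x-axis runs from 0 to 120

def is_progressive(start, end, min_gain):
    """True if the action moves the ball at least `min_gain` units along the x-axis."""
    return (end[0] - start[0]) >= min_gain

def ends_in_penalty_area(end, pitch_length=PITCH_LENGTH):
    """True if `end` lies in the attacking penalty area (last 18 units, 18 <= y <= 62)."""
    return end[0] >= pitch_length - 18 and 18 <= end[1] <= 62

def carry_length(start, end):
    """Euclidean distance between the start and end locations of a carry."""
    return math.dist(start, end)

# Hypothetical coordinates, for illustration only
pass_start, pass_end = [40.0, 30.0], [58.0, 45.0]
carry_start, carry_end = [95.0, 40.0], [104.0, 52.0]

print(is_progressive(pass_start, pass_end, min_gain=15))    # pass gains 18 units
print(is_progressive(carry_start, carry_end, min_gain=10))  # carry gains only 9 units
print(ends_in_penalty_area(carry_end))
print(carry_length(carry_start, carry_end))
```

Note that this notion of "progressive" only looks at the gain along the pitch length; a long diagonal carry can cover 15 units of distance and still fall short of the threshold.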

In [10]:
def extract_offensive_stats(events_df, player_events, pitch_length=120):
    """
    Extract offensive statistics from a player's events in a single match.
    Processes StatsBomb event types: Shot, Pass, Carry, Dribble.
    
    Args:
        events_df (pd.DataFrame): StatsBomb events for the entire match
            (not used in the current calculations)
        player_events (pd.DataFrame): StatsBomb events for a single player in that match
        pitch_length (float): Pitch length in StatsBomb coordinate units (default 120)
    
    Returns:
        dict: Dictionary with aggregated offensive metrics
    """

    stats = {}

    # SHOTS EVENTS
    shots = player_events[player_events['type'] == 'Shot']

    # Total number of shots attempted
    stats['shots_attempted'] = len(shots)

    # Goals scored (shot_outcome == 'Goal')
    stats['goals'] = (shots['shot_outcome'] == 'Goal').sum()

    # Shots on target (goals and saves, including saves deflected onto the post)
    # Both capitalisations of the post outcome are accepted defensively
    stats['shots_on_target'] = shots['shot_outcome'].isin(
        ['Goal', 'Saved', 'Saved to Post', 'Saved To Post']
    ).sum()

    # Expected Goals (sum of StatsBomb xG values)
    stats['xg_total'] = shots['shot_statsbomb_xg'].sum(skipna=True)

    # Average xG per shot (quality of average shooting chance)
    stats['xg_avg'] = shots['shot_statsbomb_xg'].mean(skipna=True)

    # Penalties attempted (shot_type == 'Penalty')
    stats['penalties'] = (shots['shot_type'] == 'Penalty').sum()

    # Headers attempted (body part == Head)
    stats['headers'] = (shots['shot_body_part'] == 'Head').sum()



    # PASSES EVENTS
    passes = player_events[player_events['type'] == 'Pass']


    # Assists
    # Assists can be derived in several ways: from the "pass_goal_assist" flag,
    # or by linking "shot_key_pass_id" / "pass_assisted_shot_id" to goals in the shots DataFrame
    # The simplest option, used here, is the "pass_goal_assist" flag on the passes DataFrame
    # None of these methods reproduces the real assist count exactly
    # (see the bottom of this notebook to understand why)
    if "pass_goal_assist" in passes.columns:
        assists = passes["pass_goal_assist"].fillna(False).sum()
    else:
        assists = 0

    stats["assists"] = int(assists)

    # Key passes (passes leading directly to a shot)
    stats['key_passes'] = passes['pass_shot_assist'].fillna(False).sum()

    # Total passes attempted
    stats['passes_attempted'] = len(passes)

    # Completed passes (StatsBomb: pass_outcome is NaN if successful)
    stats['passes_completed'] = passes['pass_outcome'].isna().sum()

    # Passing accuracy
    stats['pass_accuracy'] = (
        stats['passes_completed'] / stats['passes_attempted']
        if stats['passes_attempted'] > 0 else np.nan
    )

    # Progressive passes (forward passes advancing ≥15m)
    progressive_passes = 0
    for _, row in passes.iterrows():
        start = row.get('location', None)
        end = row.get('pass_end_location', None)
        if isinstance(start, list) and isinstance(end, list):
            if (end[0] - start[0]) >= 15:
                progressive_passes += 1
    stats['progressive_passes'] = progressive_passes

    # Crosses attempted
    stats['crosses'] = passes['pass_cross'].fillna(False).sum()

    # Switches of play
    stats['switches'] = passes['pass_switch'].fillna(False).sum()

    # Average pass angle (measure of verticality vs lateral passing)
    stats['avg_pass_angle'] = passes['pass_angle'].mean(skipna=True)

    # Average pass length (directness, tendency to play long vs short)
    stats['avg_pass_length'] = passes['pass_length'].mean(skipna=True)



    # CARRIES EVENTS
    carries = player_events[player_events['type'] == 'Carry']

    # Total carries (times player moved the ball by running with it)
    stats['carries_attempted'] = len(carries)

    # Total distance carried (sum of carry lengths)
    total_carry_distance = 0
    for _, row in carries.iterrows():
        start = row.get('location', None)
        end = row.get('carry_end_location', None)
        if isinstance(start, list) and isinstance(end, list):
            dist = np.linalg.norm(np.array(end) - np.array(start))
            total_carry_distance += dist
    stats['carry_distance_total'] = total_carry_distance

    # Progressive carries (advancing ≥10m towards goal)
    progressive_carries = 0
    for _, row in carries.iterrows():
        start = row.get('location', None)
        end = row.get('carry_end_location', None)
        if isinstance(start, list) and isinstance(end, list):
            if (end[0] - start[0]) >= 10:
                progressive_carries += 1
    stats['progressive_carries'] = progressive_carries

    # Carries ending inside the attacking penalty area (x >= pitch_length - 18, 18 <= y <= 62)
    carries_to_box = 0
    for loc in carries['carry_end_location']:
        if isinstance(loc, list):
            if loc[0] >= (pitch_length - 18) and 18 <= loc[1] <= 62:
                carries_to_box += 1
    stats['carries_to_penalty_area'] = carries_to_box



    # DRIBBLES EVENTS
    dribbles = player_events[player_events['type'] == 'Dribble']

    # Total dribbles attempted
    stats['dribbles_attempted'] = len(dribbles)

    # Successful dribbles (outcome == 'Complete')
    stats['dribbles_completed'] = (dribbles['dribble_outcome'] == 'Complete').sum()

    # Dribble success rate (success %)
    stats['dribble_success_rate'] = (
        stats['dribbles_completed'] / stats['dribbles_attempted']
        if stats['dribbles_attempted'] > 0 else np.nan
    )

    # Dribble overruns (failed dribble due to losing control of the ball)
    stats['dribble_overruns'] = dribbles['dribble_overrun'].fillna(False).sum() if 'dribble_overrun' in dribbles else 0

    # Round only selected float stats
    for key in ['xg_total', 'xg_avg', 'pass_accuracy', 
                'avg_pass_angle', 'avg_pass_length', 
                'carry_distance_total', 'dribble_success_rate']:
        if key in stats and isinstance(stats[key], (float, np.floating)):
            stats[key] = round(stats[key], 2)

    return stats

In [11]:
# TEST ON A SINGLE PLAYER

# Load events for that match 
events_df = sb.events(match_id=match_id)

# Extract unique players from events (skip NaNs)
players_in_match = events_df[['player_id', 'player', 'team']].dropna().drop_duplicates()

# Pick one player at random
player_row = players_in_match.iloc[randint(0, len(players_in_match)-1)]
player_id = player_row['player_id']
player_name = player_row['player']
team_name = player_row['team']

# Filter events for that player
player_events = events_df[events_df['player_id'] == player_id]

print("EXAMPLE PLAYER SELECTED")
print(f"Player : {player_name}")
print(f"Team   : {team_name}")
print(f"Total events for player in match: {len(player_events)}")

# Extract offensive stats
player_stats = extract_offensive_stats(events_df, player_events)

# Print summary
print("Offensive Stats for Player:")
print(f"Player : {player_name}")
print(f"Team   : {team_name}")
print(f"Match  : {first_match['home_team']} vs {first_match['away_team']} (ID {match_id})\n")

print("Extracted offensive stats:")
for k, v in player_stats.items():
    print(f"{k}: {v}")


EXAMPLE PLAYER SELECTED
Player : Steve Cook
Team   : AFC Bournemouth
Total events for player in match: 163
Offensive Stats for Player:
Player : Steve Cook
Team   : AFC Bournemouth
Match  : Leicester City vs AFC Bournemouth (ID 3754058)

Extracted offensive stats:
shots_attempted: 0
goals: 0
shots_on_target: 0
xg_total: 0.0
xg_avg: nan
penalties: 0
headers: 0
assists: 0
key_passes: 0
passes_attempted: 58
passes_completed: 49
pass_accuracy: 0.84
progressive_passes: 22
crosses: 0
switches: 6
avg_pass_angle: 0.2
avg_pass_length: 28.08
carries_attempted: 35
carry_distance_total: 386.28
progressive_carries: 12
carries_to_penalty_area: 0
dribbles_attempted: 0
dribbles_completed: 0
dribble_success_rate: nan
dribble_overruns: 0


### 2. Defensive Actions

This function extracts the main **defensive contribution metrics** for a player from that player's events.  
The analysis is based on StatsBomb event types that describe defensive activity, ball recovery, and duels:

- **Duels** → attempted, won, lost, and duel success ratio

- **Interceptions** → attempted, successful (won), lost, and interception ratio

- **Blocks** → number of blocks made against opponent passes or shots

- **Clearances** → defensive actions to remove danger by clearing the ball

- **Ball Recoveries** → regaining possession of the ball

- **Pressures** → pressing actions applied on opponents

- **Dispossessed** → number of times the player lost possession under pressure.
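
Ratio metrics need a convention for players with no attempts. One option, sketched below, is to report such ratios as undefined (NaN) so that they are neither rewarded nor penalised when players are compared. The helper name `safe_ratio` is hypothetical and is not part of the notebook's functions.

```python
import math

def safe_ratio(successes: int, attempts: int, ndigits: int = 2) -> float:
    """Return successes / attempts rounded to `ndigits`, or NaN with no attempts."""
    if attempts <= 0:
        return math.nan
    return round(successes / attempts, ndigits)

print(safe_ratio(1, 3))  # one of three interceptions won
print(safe_ratio(0, 5))  # five duels, none won
print(safe_ratio(0, 0))  # no attempts: undefined rather than 0 or 1
```

NaN values are skipped by default in pandas aggregations such as `mean()`, so players without attempts do not distort season-level averages.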

In [12]:
def extract_defensive_stats(player_events):
    """
    Extract defensive statistics from match/player events
    Processes StatsBomb event types: Duel, Interception,
    Block, Clearance, Ball Recovery, Pressure, Dispossessed
    
    Args:
        player_events (pd.DataFrame): StatsBomb events for a single player
    
    Returns:
        dict: Dictionary with aggregated defensive metrics
    """

    stats = {}

    # DUELS EVENTS
    duels = player_events[player_events['type'] == 'Duel']

    # Duels Attempted (total duels)
    stats['duels_attempted'] = len(duels)

    # Duels Won
    stats['duels_won'] = (duels['duel_outcome'] == 'Won').sum()

    # Duels Lost (total duels - duels won)
    stats['duels_lost'] = stats['duels_attempted'] - stats['duels_won']

    # Duels Ratio (number of duels won / total duels attempted)
    stats['duels_ratio'] = (
        stats['duels_won'] / stats['duels_attempted']
        if stats['duels_attempted'] > 0 else np.nan
    )



    # INTERCEPTIONS EVENTS
    interceptions = player_events[player_events['type'] == 'Interception']

    # Interceptions Attempted
    stats['interceptions_attempted'] = len(interceptions)

    if 'interception_outcome' in interceptions:
        # Interceptions Won (outcome == 'Won')
        stats['interceptions_won'] = (interceptions['interception_outcome'] == 'Won').sum()
        stats['interceptions_lost'] = stats['interceptions_attempted'] - stats['interceptions_won']
    else:
        # Fallback: assume all successful
        stats['interceptions_won'] = stats['interceptions_attempted']
        stats['interceptions_lost'] = 0

    # Interceptions Ratio (NaN when no interceptions were attempted, consistent with duels_ratio)
    stats['interceptions_ratio'] = (
        stats['interceptions_won'] / stats['interceptions_attempted']
        if stats['interceptions_attempted'] > 0 else np.nan
    )



    # BLOCKS EVENTS
    blocks = player_events[player_events['type'] == 'Block']
    stats['blocks'] = len(blocks)

    # CLEARANCES EVENTS
    clearances = player_events[player_events['type'] == 'Clearance']
    stats['clearances'] = len(clearances)

    # BALL RECOVERIES EVENTS
    recoveries = player_events[player_events['type'] == 'Ball Recovery']
    stats['ball_recoveries'] = len(recoveries)

    # PRESSURES EVENTS
    pressures = player_events[player_events['type'] == 'Pressure']
    stats['pressures'] = len(pressures)

    # DISPOSSESSED EVENTS
    dispossessed = player_events[player_events['type'] == 'Dispossessed']
    stats['dispossessed'] = len(dispossessed)



    # Round ratios only
    for key in ['duels_ratio', 'interceptions_ratio']:
        if key in stats and isinstance(stats[key], (float, np.floating)):
            stats[key] = round(stats[key], 2)

    return stats


In [13]:
# TEST ON A SINGLE PLAYER

# Load events for that match 
events_df = sb.events(match_id=match_id)

# Extract unique players from events (skip NaNs)
players_in_match = events_df[['player_id', 'player', 'team']].dropna().drop_duplicates()

# Pick one player at random
player_row = players_in_match.iloc[randint(0, len(players_in_match)-1)]
player_id = player_row['player_id']
player_name = player_row['player']
team_name = player_row['team']

# Filter events for that player
player_events = events_df[events_df['player_id'] == player_id]

print("\nEXAMPLE PLAYER SELECTED")
print(f"Player : {player_name}")
print(f"Team   : {team_name}")
print(f"Total events for player in match: {len(player_events)}")

# Extract defensive stats
player_stats = extract_defensive_stats(player_events)

# Print summary
print("\nExtracted Defensive stats:")
for k, v in player_stats.items():
    print(f"{k}: {v}")



EXAMPLE PLAYER SELECTED
Player : Andrew Surman
Team   : AFC Bournemouth
Total events for player in match: 146

Extracted Defensive stats:
duels_attempted: 5
duels_won: 0
duels_lost: 5
duels_ratio: 0.0
interceptions_attempted: 3
interceptions_won: 1
interceptions_lost: 2
interceptions_ratio: 0.33
blocks: 2
clearances: 2
ball_recoveries: 5
pressures: 12
dispossessed: 0


### 3. Goalkeeper Actions

This function extracts the main **goalkeeping performance metrics** for a goalkeeper.  
It requires both the full match events (`events_df`) and the goalkeeper’s own events (`gk_events`) to correctly account for goals conceded, including own goals.

Metrics include:

- **Goals Conceded** → from goalkeeper events (*Goal Conceded*, *Penalty Conceded*) and own goals when the GK was on the pitch.

- **Clean Sheet** → 1 if no goals conceded, else 0

- **Shots Faced** → number of shots registered against the goalkeeper

- **Saves** → total saves made, including penalties saved

- **Save Ratio** → saves / (saves + goals conceded)

- **Penalties Saved** → successful penalty saves

- **Area Command** → claims, punches, and clearances performed

- **Sweeper / Smother Actions** → defensive actions outside the goal line

- **Reliability** → errors and negative outcomes (failures, no touch, dangerous plays).  
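
Attributing own goals to a goalkeeper depends on knowing when that goalkeeper was on the pitch. The sketch below shows one way to turn lineup position entries into minute intervals and test a goal minute against them. It assumes each entry is a dictionary with `position`, `from`, and `to` keys holding `MM:SS` strings, with `to` empty for a player who finished the match; the helper names and the sample entry are invented for illustration.

```python
def to_minute(clock, default):
    """Convert an 'MM:SS' string to whole minutes; use `default` if it is missing."""
    return int(clock.split(":")[0]) if clock else default

def goalkeeper_spans(positions, match_end=120):
    """Collect (start_minute, end_minute) intervals in which the player was in goal."""
    return [
        (to_minute(p.get("from"), 0), to_minute(p.get("to"), match_end))
        for p in positions
        if p.get("position") == "Goalkeeper"
    ]

def on_pitch(spans, minute):
    """True if `minute` falls inside any of the playing intervals."""
    return any(start <= minute <= end for start, end in spans)

# Hypothetical lineup entry: a goalkeeper replaced at 60:12
positions = [{"position": "Goalkeeper", "from": "00:00", "to": "60:12"}]
spans = goalkeeper_spans(positions)

print(spans)
print(on_pitch(spans, 35), on_pitch(spans, 75))
```

Working in whole minutes is a simplification: a goal conceded in the same minute as a substitution is counted for the goalkeeper on either side of the change.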

In [14]:
def extract_goalkeeper_stats(events_df, gk_events):
    """
    Extract goalkeeper statistics from match/player events.
    Needs both full match events (events_df) and the goalkeeper's own events (gk_events).
    
    Args:
        events_df (pd.DataFrame): StatsBomb events for the full match
        gk_events (pd.DataFrame): StatsBomb events filtered only for the goalkeeper
    
    Returns:
        dict: Dictionary with aggregated goalkeeper metrics
    """

    stats = {}

    # GOALS CONCEDED
    # From GK events: Goal Conceded + Penalty Conceded (goal conceded from penalty)
    goals_conceded = gk_events['goalkeeper_type'].eq('Goal Conceded').sum() + \
                     gk_events['goalkeeper_type'].eq('Penalty Conceded').sum()

    # Add Own Goals (only if this GK was on the pitch at that moment)
    if 'match_id' in events_df.columns and not events_df[events_df['type'] == 'Own Goal Against'].empty:
        match_id = events_df['match_id'].iloc[0]
        lineups_dict = sb.lineups(match_id=match_id)

        gk_id = gk_events['player_id'].iloc[0]
        gk_team = gk_events['team'].iloc[0]

        # Get intervals of play for this goalkeeper
        play_spans = []
        for _, team_df in lineups_dict.items():
            row = team_df[team_df["player_id"] == gk_id]
            if not row.empty:
                positions = row.iloc[0].get("positions", [])
                for pos in positions:
                    if pos.get("position") == "Goalkeeper":
                        start_min = int(pos.get("from", "0:00").split(":")[0])
                        to_str = pos.get("to")
                        end_min = int(to_str.split(":")[0]) if to_str else 120
                        play_spans.append((start_min, end_min))
                break


        # Check own goals against GK's team
        own_goals = events_df[events_df['type'] == 'Own Goal Against']
        for _, og in own_goals.iterrows():
            if og['team'] == gk_team and any(s <= og["minute"] <= e for s, e in play_spans):
                goals_conceded += 1

    stats['gk_goals_conceded'] = int(goals_conceded)


    # CLEAN SHEET
    stats['gk_clean_sheet'] = 1 if goals_conceded == 0 else 0

    # SHOT STOPPING EVENTS
    stats['gk_shots_faced'] = (gk_events['goalkeeper_type'] == 'Shot Faced').sum()

    stats['gk_saves'] = gk_events['goalkeeper_type'].isin([
        'Save','Shot Saved','Shot Saved Off','Shot Saved to Post',
        'Saved to Post','Saved Twice','Penalty Saved','Penalty Saved to Post'
    ]).sum()

    stats['gk_penalties_saved'] = gk_events['goalkeeper_type'].isin([
        'Penalty Saved','Penalty Saved to Post'
    ]).sum()



    # AREA COMMAND EVENTS
    stats['gk_claims'] = gk_events['goalkeeper_type'].isin(['Collected','Collected Twice','Claim']).sum()
    stats['gk_punches'] = gk_events['goalkeeper_type'].isin(['Punch','Punched out']).sum()
    stats['gk_clearances'] = (gk_events['goalkeeper_outcome'] == 'Clear').sum()



    # SWEEPER / SMOTHER EVENTS
    stats['gk_smother'] = (gk_events['goalkeeper_type'] == 'Smother').sum()
    stats['gk_sweeper'] = (gk_events['goalkeeper_type'] == 'Keeper Sweeper').sum()



    # RELIABILITY EVENTS (errors, dangerous actions)
    stats['gk_errors'] = gk_events['goalkeeper_outcome'].isin([
        'Fail','No Touch','In Play Danger','Touched in','Lost in play','Lost out'
    ]).sum()



    # ROUND RATIOS
    stats['gk_save_ratio'] = (
        stats['gk_saves'] / (stats['gk_saves'] + goals_conceded)
        if (stats['gk_saves'] + goals_conceded) > 0 else 0
    )
    if isinstance(stats['gk_save_ratio'], (float, np.floating)):
        stats['gk_save_ratio'] = round(stats['gk_save_ratio'], 2)

    return stats


In [15]:
# TEST ON ONE GOALKEEPER

player_events = sb.events(match_id=match_id)
players_in_match = player_events[['player_id','player','team','position']].dropna().drop_duplicates()

# Pick a random GK
gk_row = players_in_match[players_in_match['position'] == 'Goalkeeper'].sample(1).iloc[0]
gk_id, gk_name, gk_team = gk_row['player_id'], gk_row['player'], gk_row['team']

print("EXAMPLE GOALKEEPER SELECTED")
print(f"Goalkeeper : {gk_name}")
print(f"Team       : {gk_team}")

# Get GK events
gk_events = player_events[player_events['player_id'] == gk_id]

# Extract GK stats
gk_stats = extract_goalkeeper_stats(player_events, gk_events)

print("\nExtracted Goalkeeper stats:\n")
for k,v in gk_stats.items():
    print(f"{k}: {v}")


EXAMPLE GOALKEEPER SELECTED
Goalkeeper : Kasper Schmeichel
Team       : Leicester City

Extracted Goalkeeper stats:

gk_goals_conceded: 0
gk_clean_sheet: 1
gk_shots_faced: 10
gk_saves: 0
gk_penalties_saved: 0
gk_claims: 1
gk_punches: 0
gk_clearances: 0
gk_smother: 0
gk_sweeper: 1
gk_errors: 0
gk_save_ratio: 0


### 4. Discipline and Fouls

This function extracts the main **discipline and foul-related metrics** for a player from that player's events in a match.  
It combines information from both the events (fouls, own goals) and the lineups (cards).

Metrics include:

- **Fouls Committed** → number of fouls committed by the player

- **Fouls Won** → number of fouls won (i.e., suffered) by the player

- **Fouls Balance** → fouls won minus fouls committed, to highlight fair play or aggressiveness

- **Own Goals** → number of own goals scored

- **Yellow Cards** → retrieved from `sb.lineups(match_id)`

- **Red Cards** → includes both straight red cards and second yellow cards
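
As a minimal sketch of the card-counting rule described above, the snippet below counts cards from a lineup-style list of dictionaries. The helper name `count_cards` and the sample data are hypothetical and only illustrate the idea; the `card_type` labels are the ones the extraction function in this notebook checks for.

```python
def count_cards(cards):
    """Count yellow and red cards from a lineup-style list of card dicts."""
    yellow, red = 0, 0
    # A missing or non-list value (e.g. NaN) is treated as "no cards"
    if not isinstance(cards, list):
        return {"yellow_cards": 0, "red_cards": 0}
    for card in cards:
        card_type = card.get("card_type")
        if card_type == "Yellow Card":
            yellow += 1
        elif card_type in ("Red Card", "Second Yellow"):
            # A second yellow is a dismissal, so it is counted as a red card
            red += 1
    return {"yellow_cards": yellow, "red_cards": red}


# Hypothetical player: booked once, then dismissed for a second booking
example_cards = [
    {"time": "23:10", "card_type": "Yellow Card"},
    {"time": "71:45", "card_type": "Second Yellow"},
]
card_counts = count_cards(example_cards)
print(card_counts)
```

With this rule, a player sent off for two bookings ends the match with one yellow card and one red card.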

In [16]:
def extract_discipline_stats(player_events):
    """
    Extract discipline statistics from match/player events.
    Uses both events (fouls, own goals) and lineups (cards).

    Args:
        player_events (pd.DataFrame): StatsBomb events for the single player in one match

    Returns:
        dict: Dictionary with aggregated discipline metrics
    """

    stats = {}

    # HANDLE CASE WITH EMPTY DF
    if player_events.empty:
        stats.update({
            'fouls_committed': 0,
            'fouls_won': 0,
            'fouls_balance': 0,
            'own_goals': 0,
            'yellow_cards': 0,
            'red_cards': 0
        })
        return stats



    # MATCH ID
    if "match_id" not in player_events.columns:
        raise ValueError("player_events must contain 'match_id' column")
    match_id = player_events["match_id"].iloc[0]

    # LOAD LINEUPS
    lineups_dict = sb.lineups(match_id=match_id)



    # FOULS EVENTS
    fouls_committed = player_events[player_events["type"] == "Foul Committed"]
    fouls_won = player_events[player_events["type"] == "Foul Won"]

    stats["fouls_committed"] = len(fouls_committed)
    stats["fouls_won"] = len(fouls_won)
    stats["fouls_balance"] = stats["fouls_won"] - stats["fouls_committed"]



    # OWN GOALS EVENTS
    own_goals = player_events[player_events["type"].isin(["Own Goal For", "Own Goal Against"])]
    stats["own_goals"] = len(own_goals)



    # CARDS EVENTS (from lineups)
    player_id = player_events["player_id"].iloc[0]
    yellow_cards, red_cards = 0, 0

    for _, team_df in lineups_dict.items():
        row = team_df[team_df["player_id"] == player_id]
        if not row.empty:
            cards_list = row.iloc[0]["cards"]
            if isinstance(cards_list, list):
                for card in cards_list:
                    ctype = card.get("card_type")
                    if ctype == "Yellow Card":
                        yellow_cards += 1
                    elif ctype in ["Red Card", "Second Yellow"]:
                        red_cards += 1
            break

    stats["yellow_cards"] = yellow_cards
    stats["red_cards"] = red_cards

    return stats


In [17]:
# TEST ON A SINGLE PLAYER

# Load events for that match 
player_events = sb.events(match_id=match_id)

# Extract unique players from events (skip NaNs)
players_in_match = player_events[['player_id', 'player', 'team']].dropna().drop_duplicates()

# Pick one player at random
player_row = players_in_match.iloc[randint(0, len(players_in_match)-1)]
player_id = player_row['player_id']
player_name = player_row['player']
team_name = player_row['team']

# Filter events for that player
player_events = player_events[player_events['player_id'] == player_id]

print("\nEXAMPLE PLAYER SELECTED")
print(f"Player : {player_name}")
print(f"Team   : {team_name}")
print(f"Total events for player in match: {len(player_events)}")

# Extract discipline stats
player_stats = extract_discipline_stats(player_events)

# Print summary
print("\nDiscipline Stats for Player:")
for k, v in player_stats.items():
    print(f"{k}: {v}")



EXAMPLE PLAYER SELECTED
Player : Glenn Murray
Team   : AFC Bournemouth
Total events for player in match: 105

Discipline Stats for Player:
fouls_committed: 0
fouls_won: 1
fouls_balance: 1
own_goals: 0
yellow_cards: 0
red_cards: 0


### 5. Context and Playing Time

This function computes a player’s **availability and minutes played** in a match using only event data and lineups.

- **Match duration** is computed from `Half Start` / `Half End` events, so it **includes added time** (and extra time if present)

- **Substitutions** are read from `Substitution` events:
  - If the player **comes on** (`substitution_replacement_id`) → minutes = `match_duration - minute_in`.
  - If the player **goes off** (`player_id`) → minutes = `minute_out`
  - If both happen → minutes = `minute_out - minute_in`
  - If neither → the player is assumed to have played the **full match duration**

- **Starter flag** is derived from `lineups` (`positions[0]['from'] == "0:00"`)

- **Positions played** are collected from the `positions` list in lineups and deduplicated.
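
The four substitution cases above can be sketched as a small standalone function. This is an illustration only: `minutes_on_pitch` and the sample times are hypothetical, and all times are assumed to be decimal minutes on the match clock.

```python
def minutes_on_pitch(match_duration, minute_in=None, minute_out=None):
    """Apply the four substitution cases to get minutes played."""
    if minute_in is not None and minute_out is not None:
        # Came on and was later taken off
        return round(minute_out - minute_in, 2)
    if minute_in is not None:
        # Came on and stayed until the final whistle
        return round(match_duration - minute_in, 2)
    if minute_out is not None:
        # Started and was taken off
        return round(minute_out, 2)
    # Never substituted: assumed to have played the whole match
    return round(match_duration, 2)


match_duration = 95.75  # hypothetical: 90 minutes plus 5'45" of added time
cases = {
    "full match": minutes_on_pitch(match_duration),
    "came on at 60:30": minutes_on_pitch(match_duration, minute_in=60.5),
    "went off at 78:40": minutes_on_pitch(match_duration, minute_out=78 + 40 / 60),
    "on at 46:00, off at 80:00": minutes_on_pitch(match_duration, minute_in=46.0, minute_out=80.0),
}
for label, minutes in cases.items():
    print(f"{label}: {minutes}")
```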

**Returned metrics**

- `presences`: 1 if the player appeared in the match (`minutes_played > 0`), else 0

- `minutes_played`: total minutes on the pitch as a decimal number, with seconds expressed as a fraction of a minute (e.g., `78.67` ≈ 78’40”)

- `matches_started`: 1 if the player started, else 0

- `substitutions_in`: 1 if the player came on, else 0

- `substitutions_out`: 1 if the player went off, else 0

- `full_matches`: 1 if the player started and was never subbed off (played the entire match duration), else 0

- `positions_played`: list of role names played in the match
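
Because `minutes_played` is stored as decimal minutes, it can help to convert a value back to minutes and seconds when checking results by eye. A minimal sketch follows; `format_minutes` is a hypothetical helper, not part of the notebook's pipeline.

```python
def format_minutes(minutes_played):
    """Convert decimal minutes (e.g. 78.67) to a minutes'seconds" string."""
    # Work in whole seconds so that rounding can never produce 60 seconds
    total_seconds = round(minutes_played * 60)
    minutes, seconds = divmod(total_seconds, 60)
    return f"{minutes}'{seconds:02d}\""


print(format_minutes(78.67))
print(format_minutes(95.75))
```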

In [18]:
def extract_context_playing_time(events_df, player_events):
    """
    Extract context and playing time statistics for a single-player performance.
    Uses events_df (full match) + lineups.
    Match duration is computed from Half Start / Half End events.

    Args:
        events_df (pd.DataFrame): StatsBomb events for the full match
        player_events (pd.DataFrame): StatsBomb events for the single player

    Returns:
        dict: Dictionary with aggregated playing time metrics
    """

    stats = {}

    if player_events.empty:
        stats.update({
            "presences": 0,
            "minutes_played": 0,
            "matches_started": 0,
            "substitutions_in": 0,
            "substitutions_out": 0,
            "full_matches": 0,
            "positions_played": []
        })
        return stats

    # PLAYER ID
    player_id = player_events["player_id"].iloc[0]

    # MATCH ID
    match_id = events_df["match_id"].iloc[0]
    lineups_dict = sb.lineups(match_id=match_id)

    # MATCH DURATION (from Half Start / Half End)
    duration = 0.0
    half_start = events_df[events_df["type"] == "Half Start"]

    half_end = events_df[events_df["type"] == "Half End"]

    for period in sorted(events_df["period"].unique()):
        start_ev = half_start[half_start["period"] == period]
        end_ev = half_end[half_end["period"] == period]
        if not start_ev.empty and not end_ev.empty:
            start_min = float(start_ev.iloc[0]["minute"]) + float(start_ev.iloc[0]["second"]) / 60.0
            end_min = float(end_ev.iloc[0]["minute"]) + float(end_ev.iloc[0]["second"]) / 60.0
            duration += (end_min - start_min)
    match_duration = round(duration, 2)


    # INIT
    minutes_played = 0.0
    matches_started = 0
    subs_in, subs_out, full_matches = 0, 0, 0
    positions_played = []

    # SUBSTITUTION EVENTS
    subs_events = events_df[events_df["type"] == "Substitution"]

    sub_in_time, sub_out_time = None, None
    if not subs_events.empty:
        # Player out
        if (subs_events["player_id"] == player_id).any():
            sub_row = subs_events[subs_events["player_id"] == player_id].iloc[0]
            sub_out_time = float(sub_row["minute"]) + float(sub_row["second"]) / 60.0
        # Player in
        if (subs_events["substitution_replacement_id"] == player_id).any():
            sub_row = subs_events[subs_events["substitution_replacement_id"] == player_id].iloc[0]
            sub_in_time = float(sub_row["minute"]) + float(sub_row["second"]) / 60.0

    # POSITIONS (roles + starter check)
    for _, team_df in lineups_dict.items():
        row = team_df[team_df["player_id"] == player_id]
        if not row.empty:
            positions = row.iloc[0].get("positions", [])
            if isinstance(positions, list) and len(positions) > 0:
                for pos in positions:
                    if "position" in pos:
                        positions_played.append(pos["position"])
                # starter if "from" == "0:00"
                if positions[0].get("from") in ["0:00", "00:00"]:
                    matches_started = 1
            break

    # COMPUTE MINUTES
    if sub_in_time is not None and sub_out_time is not None:
        # came in, then out
        minutes_played = sub_out_time - sub_in_time
        subs_in, subs_out = 1, 1
    elif sub_in_time is not None:
        # came in only
        minutes_played = match_duration - sub_in_time
        subs_in = 1
    elif sub_out_time is not None:
        # started and subbed out
        minutes_played = sub_out_time
        subs_out = 1
    else:
        # played whole match
        minutes_played = match_duration

    # FULL MATCH? (started + no sub out + ~full duration)
    if matches_started == 1 and subs_out == 0 and abs(minutes_played - match_duration) < 1.0:
        full_matches = 1

    # SAVE
    stats["presences"] = 1 if minutes_played > 0 else 0
    stats["minutes_played"] = round(minutes_played, 2)  # keep 2 decimals
    stats["matches_started"] = matches_started
    stats["substitutions_in"] = subs_in
    stats["substitutions_out"] = subs_out
    stats["full_matches"] = full_matches
    stats["positions_played"] = list(dict.fromkeys(positions_played))  # deduplicate, keeping first-seen order

    return stats


In [19]:
# TEST ON A SINGLE PLAYER

# Load events for that match 
events_df = sb.events(match_id=match_id)

# Extract unique players from events (skip NaNs)
players_in_match = events_df[['player_id', 'player', 'team']].dropna().drop_duplicates()

# Pick one player at random
player_row = players_in_match.iloc[randint(0, len(players_in_match)-1)]
player_id = player_row['player_id']
player_name = player_row['player']
team_name = player_row['team']

# Filter events for that player
player_events = events_df[events_df['player_id'] == player_id]


print("EXAMPLE PLAYER SELECTED")
print(f"Player : {player_name}")
print(f"Team   : {team_name}")
print(f"Total events for player in match: {len(player_events)}")

# Extract context stats
player_stats = extract_context_playing_time(events_df, player_events)

# Print summary
print("\nContext Stats for Player:")
for k, v in player_stats.items():
    print(f"{k}: {v}")


EXAMPLE PLAYER SELECTED
Player : Junior Stanislas
Team   : AFC Bournemouth
Total events for player in match: 146

Context Stats for Player:
presences: 1
minutes_played: 95.75
matches_started: 1
substitutions_in: 0
substitutions_out: 0
full_matches: 1
positions_played: ['Left Midfield']


### Aggregated Player Statistics for a Match

This function collects all the relevant statistics for **every player in a single match**. It combines the stat-extraction functions implemented above into one pipeline:

- **Offensive stats** → shooting, passing, carrying, dribbling

- **Defensive stats** → duels, interceptions, pressures, recoveries, etc.

- **Discipline stats** → fouls, own goals, cards (via `sb.lineups`)

- **Context & playing time** → minutes played, starter status, substitutions, positions played

- **Goalkeeper stats** → only for players identified as goalkeepers in the lineups

**Inputs**

- `match_id`: unique match identifier

- `competition`: competition name (e.g., Premier League)

- `season`: season name (e.g., 2015/2016)

**Outputs**

- A `DataFrame` with **one row per player in the match**, enriched with all computed statistics
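
The merge step can be illustrated with plain dictionaries: each extractor returns a dict, the dicts are unpacked into a single record, and the records become DataFrame rows. The sketch below uses hypothetical players and values. It also leaves the goalkeeper dict empty for the outfield player, relying on pandas to fill the missing keys with `NaN`, which is a simplification of what the function in this notebook does.

```python
import pandas as pd

# Hypothetical per-player dictionaries, standing in for the extractor outputs
outfield_row = {
    **{"match_id": 1, "team": "Team A", "player_id": 10, "player_name": "Player X"},
    **{"goals": 1, "shots_attempted": 3},        # offensive stats
    **{"fouls_committed": 2, "yellow_cards": 1},  # discipline stats
    **{},                                         # no goalkeeper stats
}
keeper_row = {
    **{"match_id": 1, "team": "Team A", "player_id": 1, "player_name": "Player Y"},
    **{"goals": 0, "shots_attempted": 0},
    **{"fouls_committed": 0, "yellow_cards": 0},
    **{"gk_saves": 4},                            # goalkeeper stats
}

# Keys missing from a record become NaN in the resulting DataFrame
df_example = pd.DataFrame([outfield_row, keeper_row])
print(df_example)
```

When two dictionaries share a key, the later one wins, so the extractors need distinct key names for the merge to be lossless.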


In [20]:
def collect_player_match_stats(match_id, competition, season):
    """
    Collect all player-match statistics for a given match_id.

    For each player in the match, combine:
      - Offensive stats
      - Defensive stats
      - Discipline stats
      - Context & playing time stats
      - Goalkeeper stats (only if player is GK)

    Args:
        match_id (int): Match identifier
        competition (str): Competition name (from sb.matches)
        season (str): Season name (from sb.matches)

    Returns:
        pd.DataFrame: one row per player with all stats for the match
    """

    # Load Base Data
    # Load event-level data for the match (all passes, shots, tackles, etc.)
    events_df = sb.events(match_id=match_id)
    # Load lineups information (contains players, positions, starters, subs, GK info)
    lineups_dict = sb.lineups(match_id=match_id)

    # Extract all players that actually appear in the event data
    # dropna to drop any rows with missing player information (base information rows for tactics)
    players_in_match = events_df[['player_id', 'player', 'team']].dropna().drop_duplicates()

    rows = []

    # Iterate over each player
    for _, player_row in players_in_match.iterrows():
        # Player metadata
        player_id = int(player_row['player_id'])
        player_name = player_row['player']
        team_name = player_row['team']

        # Check if the player has a nickname in lineups
        for _, team_df in lineups_dict.items():
            row = team_df[team_df["player_id"] == player_id]
            if not row.empty:
                nickname = row.iloc[0].get("player_nickname", None)
                if pd.notna(nickname) and nickname != "":
                    player_name = nickname  # override with nickname if available
                    
                break

        # Filter events for this player only
        player_events = events_df[events_df['player_id'] == player_id]

        # Base record (metadata for every row)
        base = {
            "competition": competition,  # competition name (e.g., Serie A)
            "season": season,            # season name (e.g., 2015/2016)
            "match_id": match_id,        # match identifier
            "team": team_name,           # team name
            "player_id": player_id,      # player unique ID
            "player_name": player_name,  # player name (nickname if available)
        }

        # Call Stats Functions

        # Offensive stats (shots, passes, dribbles, etc.)
        off_stats = extract_offensive_stats(events_df, player_events)

        # Defensive stats (tackles, interceptions, clearances, etc.)
        def_stats = extract_defensive_stats(player_events)

        # Discipline stats (fouls, yellow/red cards, etc.)
        disc_stats = extract_discipline_stats(player_events)

        # Context & playing time (minutes played, starter/bench, subs, roles, etc.)
        ctx_stats = extract_context_playing_time(events_df, player_events)

        # Goalkeeper stats
        gk_stats = {}
        # Check if this player was ever registered as a Goalkeeper in lineups
        for _, team_df in lineups_dict.items():
            row = team_df[team_df["player_id"] == player_id]
            if not row.empty:
                positions = row.iloc[0].get("positions", [])
                if isinstance(positions, list):
                    if any("Goalkeeper" in str(pos.get("position", "")) for pos in positions):
                        # If player is a GK, extract goalkeeper-specific stats
                        gk_stats = extract_goalkeeper_stats(events_df, player_events)
                    else:
                        # Otherwise, fill GK stats with None (not applicable)
                        gk_stats = {
                            'gk_shots_faced': None,
                            'gk_saves': None,
                            'gk_save_ratio': None,
                            'gk_penalties_saved': None,
                            'gk_claims': None,
                            'gk_punches': None,
                            'gk_clearances': None,
                            'gk_smother': None,
                            'gk_sweeper': None,
                            'gk_errors': None
                        }
                break  

        # Merge All Stats
        # Combine all dictionaries (metadata + stats) into one single record
        row_dict = {**base, **off_stats, **def_stats, **disc_stats, **ctx_stats, **gk_stats}
        rows.append(row_dict)

    # Return as a DataFrame
    # One row per player with all stats for this match
    return pd.DataFrame(rows)


In [21]:
# TEST 

# Get match data from StatsBomb
matches = sb.matches(competition_id=2, season_id=27)  # 2 = Premier League, 27 = 2015/16
example_match = matches.iloc[0]   # first available match

match_id = example_match["match_id"]
competition = example_match["competition"]
season = example_match["season"]

print("Testing on match:")
print(f"Match ID: {match_id}")
print(f"Competition: {competition}")
print(f"Season: {season}")

# Extract statistics for all players in the match
df_stats = collect_player_match_stats(match_id, competition, season)

print("\nPlayer stats DataFrame:")
df_stats.head()


Testing on match:
Match ID: 3754058
Competition: England - Premier League
Season: 2015/2016

Player stats DataFrame:


Unnamed: 0,competition,season,match_id,team,player_id,player_name,shots_attempted,goals,shots_on_target,xg_total,...,gk_save_ratio,gk_penalties_saved,gk_claims,gk_punches,gk_clearances,gk_smother,gk_sweeper,gk_errors,gk_goals_conceded,gk_clean_sheet
0,England - Premier League,2015/2016,3754058,AFC Bournemouth,3343,Dan Gosling,2,0,0,0.3,...,,,,,,,,,,
1,England - Premier League,2015/2016,3754058,AFC Bournemouth,3346,Joshua King,2,0,0,0.42,...,,,,,,,,,,
2,England - Premier League,2015/2016,3754058,AFC Bournemouth,3344,Andrew Surman,0,0,0,0.0,...,,,,,,,,,,
3,England - Premier League,2015/2016,3754058,AFC Bournemouth,6409,Adam Smith,2,0,0,0.04,...,,,,,,,,,,
4,England - Premier League,2015/2016,3754058,AFC Bournemouth,3608,Simon Francis,0,0,0,0.0,...,,,,,,,,,,


### Aggregated Player Statistics for a Full Season (Top 5 European Leagues)

This function aggregates **player performance across an entire season** for a given competition.  
It builds on the per-match statistics collected with `collect_player_match_stats` and consolidates them at season level.

**Workflow**
1. Retrieve all matches for the given `competition_id` and `season_id`

2. For each match, compute **player-match statistics**

3. Concatenate all match data into a single dataset

4. Aggregate across the season:

   - **Numeric metrics** → summed (e.g., goals, passes, recoveries)

   - **Ratios & accuracies** → averaged across matches as an unweighted mean of the per-match values (e.g., duel ratio, pass accuracy)

   - **Minutes played** → summed

   - **Positions played** → collected into a unique list of roles

   - **Teams** → collected into a list (to handle mid-season transfers)

**Outputs**

- A `DataFrame` with **one row per player per competition per season**, enriched with all aggregated statistics

- Players who changed team(s) during the season will have all their performances combined, with `teams` showing the list of clubs they represented
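
The sum-versus-mean rule in step 4 can be sketched on a toy per-match table. All column names and values below are hypothetical. The last line also recomputes the accuracy from season totals, to show how it can differ from the mean of per-match values when match volumes are uneven.

```python
import pandas as pd

# Hypothetical per-match rows: two matches for Player X, one for Player Y
df_matches = pd.DataFrame({
    "player_id": [10, 10, 20],
    "player_name": ["Player X", "Player X", "Player Y"],
    "goals": [1, 2, 0],
    "minutes_played": [90.0, 45.5, 93.2],
    "passes_completed": [9, 40, 30],
    "passes_attempted": [10, 80, 40],
    "pass_accuracy": [0.9, 0.5, 0.75],
})

stat_cols = ["goals", "minutes_played", "passes_completed",
             "passes_attempted", "pass_accuracy"]

# Quality metrics (matched by name) are averaged, everything else is summed
agg_rules = {
    col: "mean" if any(key in col for key in ("ratio", "accuracy", "rate")) else "sum"
    for col in stat_cols
}

df_season_example = (
    df_matches.groupby(["player_id", "player_name"]).agg(agg_rules).reset_index()
)

# Alternative view: accuracy recomputed from the summed counts
df_season_example["pass_accuracy_from_totals"] = (
    df_season_example["passes_completed"] / df_season_example["passes_attempted"]
)
print(df_season_example)
```

For Player X the mean of the per-match accuracies is 0.70, while the accuracy over all 90 attempted passes is about 0.54, so the averaged ratios are best read as "typical match" values rather than season-wide rates.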


In [22]:
def collect_player_season_stats(competition_id, season_id):
    """
    Collect season-level aggregated statistics for all players in a given competition & season.

    Steps:
      - Iterate all matches of that season
      - Collect per-match player stats
      - Aggregate at season level
      - Identify main_team (first team seen)
      - Identify main_role (most frequent across the season)
      - Build set of competitions and teams (unique lists)
      - Sort output by (main_team, custom main_role order, player_name)
      - Clean final output (round floats, minutes formatting, remove unused fields)

    Args:
        competition_id (int): StatsBomb competition id
        season_id (int): StatsBomb season id

    Returns:
        pd.DataFrame: one row per player+season with aggregated stats
    """

    # LOAD MATCHES
    matches = sb.matches(competition_id=competition_id, season_id=season_id)

    # Season and competition info
    season = matches["season"].iloc[0]
    competition = matches["competition"].iloc[0]

    all_rows = []

    # ITERATE MATCHES
    for _, m in tqdm(matches.iterrows(), total=len(matches), desc=f"{competition} {season}"):
        match_id = int(m["match_id"])
        df_match = collect_player_match_stats(match_id, competition, season)
        all_rows.append(df_match)

    # CONCATENATE ALL MATCHES
    df_all = pd.concat(all_rows, ignore_index=True)

    # AGGREGATION
    numeric_cols = df_all.select_dtypes(include=["number"]).columns.tolist()
    meta_cols = ["season", "player_id", "player_name"]

    agg_dict = {}
    for col in numeric_cols:
        if col in meta_cols:   # skip metadata
            continue
        if "ratio" in col or "accuracy" in col or "rate" in col:   # quality metrics → mean
            agg_dict[col] = "mean"
        elif col in ["minutes_played"]:           # minutes → sum
            agg_dict[col] = "sum"
        else:                                     # default → sum
            agg_dict[col] = "sum"

    # numeric aggregation
    df_season = df_all.groupby(meta_cols).agg(agg_dict).reset_index()

    # HELPER FUNCTIONS
    def agg_list(series):
        """Flatten and return unique items as list"""
        items = []
        for lst in series:
            if isinstance(lst, list):
                items.extend(lst)
            else:
                items.append(lst)
        return list(set(items))

    def most_frequent_role(series):
        """Return the most frequent role in a season"""
        roles = []
        for lst in series:
            if isinstance(lst, list):
                roles.extend(lst)
            elif pd.notna(lst):
                roles.append(lst)
        if not roles:
            return None
        counts = Counter(roles)
        return counts.most_common(1)[0][0]

    # MAIN FIELDS
    # main team (first seen in season if a transfer occurs)
    df_main_team = df_all.groupby(meta_cols)["team"].first().reset_index()
    df_main_team = df_main_team.rename(columns={"team": "main_team"})
    df_season = df_season.merge(df_main_team, on=meta_cols, how="left")

    # main role (most frequent across the season)
    df_main_role = df_all.groupby(meta_cols)["positions_played"].apply(most_frequent_role).reset_index()
    df_main_role = df_main_role.rename(columns={"positions_played": "main_role"})
    df_season = df_season.merge(df_main_role, on=meta_cols, how="left")

    # LIST FIELDS
    # competitions set (list unique)
    df_comps = df_all.groupby(meta_cols)["competition"].apply(agg_list).reset_index()
    df_comps = df_comps.rename(columns={"competition": "competitions"})
    df_season = df_season.merge(df_comps, on=meta_cols, how="left")

    # teams set (list unique)
    df_teams = df_all.groupby(meta_cols)["team"].apply(agg_list).reset_index()
    df_teams = df_teams.rename(columns={"team": "teams"})
    df_season = df_season.merge(df_teams, on=meta_cols, how="left")

    # CUSTOM ROLE ORDER (ranking main_role according to fixed list)
    role_order = [
        "Goalkeeper",
        "Right Back", "Right Center Back", "Center Back", "Left Center Back", "Left Back",
        "Right Wing Back", "Left Wing Back",
        "Right Defensive Midfield", "Center Defensive Midfield", "Left Defensive Midfield",
        "Right Midfield", "Right Center Midfield", "Center Midfield", "Left Center Midfield", "Left Midfield",
        "Right Wing", "Right Attacking Midfield", "Center Attacking Midfield", "Left Attacking Midfield", "Left Wing",
        "Right Center Forward", "Striker", "Left Center Forward", "Secondary Striker"
    ]
    role_rank = {role: i for i, role in enumerate(role_order)}

    # assign numeric rank for sorting; unknown roles go last
    df_season["role_rank"] = df_season["main_role"].map(role_rank).fillna(len(role_order))

    # FINAL SORT
    df_season = df_season.sort_values(
        by=["main_team", "role_rank", "player_name"]
    ).reset_index(drop=True)

    # REORDER COLUMNS (remove roles list and match_id if present)
    first_cols = [
        "competitions", "season",
        "main_team", "teams",
        "main_role",
        "player_id", "player_name",
        "presences", "matches_started",
        "full_matches", "minutes_played", 
        "substitutions_in", "substitutions_out",
        "yellow_cards", "red_cards"
    ]
    
    # exclude helper col role_rank, roles list, match_id if exists
    drop_cols = ["role_rank", "roles", "match_id"]
    other_cols = [c for c in df_season.columns if c not in first_cols + drop_cols]
    df_season = df_season[first_cols + other_cols]

    # ROUND FLOATS to 2 decimals
    float_cols = df_season.select_dtypes(include=["float"]).columns
    df_season[float_cols] = df_season[float_cols].round(2)

    return df_season

#### Premier League 2015/16

In [23]:
# Create the output folder used by the to_csv calls below, if it doesn't exist
os.makedirs("../task2_ballon_dor/data", exist_ok=True)

In [24]:
# Premier League 2015/16
df_premier = collect_player_season_stats(competition_id=2, season_id=27)
df_premier.to_csv("../task2_ballon_dor/data/premier_league_2015_16.csv", index=False)

England - Premier League 2015/2016: 100%|██████████| 380/380 [08:33<00:00,  1.35s/it]


In [25]:
# TEST
df_test = pd.read_csv("../task2_ballon_dor/data/premier_league_2015_16.csv")
print("Premier League 2015/16 dataset shape:", df_test.shape)

df_test.head()

Premier League 2015/16 dataset shape: (549, 69)


Unnamed: 0,competitions,season,main_team,teams,main_role,player_id,player_name,presences,matches_started,full_matches,...,gk_save_ratio,gk_penalties_saved,gk_claims,gk_punches,gk_clearances,gk_smother,gk_sweeper,gk_errors,gk_goals_conceded,gk_clean_sheet
0,['England - Premier League'],2015/2016,AFC Bournemouth,['AFC Bournemouth'],Goalkeeper,24888,Adam Federici,6,6,5,...,0.35,0.0,2.0,5.0,0.0,0.0,0.0,19.0,15.0,0.0
1,['England - Premier League'],2015/2016,AFC Bournemouth,['AFC Bournemouth'],Goalkeeper,20074,Artur Boruc,32,32,32,...,0.58,2.0,43.0,20.0,1.0,1.0,14.0,78.0,51.0,7.0
2,['England - Premier League'],2015/2016,AFC Bournemouth,['AFC Bournemouth'],Goalkeeper,3807,Ryan Allsop,1,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
3,['England - Premier League'],2015/2016,AFC Bournemouth,['AFC Bournemouth'],Right Back,6409,Adam Smith,31,22,20,...,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,['England - Premier League'],2015/2016,AFC Bournemouth,['AFC Bournemouth'],Right Center Back,3608,Simon Francis,38,38,38,...,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [26]:
# Top 5 by goals (one row per player, season totals)
top_goals = df_test.nlargest(5, "goals")[["player_name", "goals"]]

# Top 5 by assists
top_assists = df_test.nlargest(5, "assists")[["player_name", "assists"]]

# Top 5 by yellow cards
top_yellow_cards = df_test.nlargest(5, "yellow_cards")[["player_name", "yellow_cards"]]

# Top 5 by red cards
top_red_cards = df_test.nlargest(5, "red_cards")[["player_name", "red_cards"]]

# Display
print("Top 5 players by goals:")
display(top_goals)

print("Top 5 players by assists:")
display(top_assists)

print("Top 5 players by yellow cards:")
display(top_yellow_cards)

print("Top 5 players by red cards:")
display(top_red_cards)

Top 5 players by goals:


Unnamed: 0,player_name,goals
469,Harry Kane,25
190,Jamie Vardy,24
247,Sergio Agüero,24
167,Romelu Lukaku,18
182,Riyad Mahrez,17


Top 5 players by assists:


Unnamed: 0,player_name,assists
44,Mesut Özil,19
467,Christian Eriksen,12
540,Dimitri Payet,12
209,James Milner,11
358,Dušan Tadić,11


Top 5 players by yellow cards:


Unnamed: 0,player_name,yellow_cards
295,Jack Colback,11
209,James Milner,10
321,Alexander Tettey,10
374,Erik Pieters,10
456,Eric Dier,10


Top 5 players by red cards:


Unnamed: 0,player_name,red_cards
349,Victor Wanyama,3
83,Thibaut Courtois,2
89,John Terry,2
162,Kevin Mirallas,2
308,Aleksandar Mitrović,2


#### La Liga 2015/16

In [27]:
# La Liga 2015/16
df_laliga = collect_player_season_stats(competition_id=11, season_id=27)
df_laliga.to_csv("../task2_ballon_dor/data/laliga_2015_16.csv", index=False)


Spain - La Liga 2015/2016: 100%|██████████| 380/380 [11:23<00:00,  1.80s/it]


In [28]:
# TEST
df_test = pd.read_csv("../task2_ballon_dor/data/laliga_2015_16.csv")
print("La Liga 2015/16 dataset shape:", df_test.shape)
df_test.head()

La Liga 2015/16 dataset shape: (539, 69)


Unnamed: 0,competitions,season,main_team,teams,main_role,player_id,player_name,presences,matches_started,full_matches,...,gk_save_ratio,gk_penalties_saved,gk_claims,gk_punches,gk_clearances,gk_smother,gk_sweeper,gk_errors,gk_goals_conceded,gk_clean_sheet
0,['Spain - La Liga'],2015/2016,Athletic Club,['Athletic Club'],Goalkeeper,6576,Gorka Iraizoz,37,37,37,...,0.66,1.0,39.0,16.0,29.0,0.0,63.0,50.0,37.0,14.0
1,['Spain - La Liga'],2015/2016,Athletic Club,['Athletic Club'],Goalkeeper,6662,Iago Herrerín,2,1,1,...,0.12,0.0,1.0,1.0,2.0,1.0,5.0,8.0,8.0,0.0
2,['Spain - La Liga'],2015/2016,Athletic Club,['Athletic Club'],Right Back,6649,De Marcos,34,33,30,...,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,['Spain - La Liga'],2015/2016,Athletic Club,['Athletic Club'],Right Back,6386,Eneko Bóveda,23,15,13,...,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,['Spain - La Liga'],2015/2016,Athletic Club,['Athletic Club'],Right Center Back,26087,Carlos Gurpegi,15,12,10,...,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [29]:
# Top 5 by goals (one row per player, season totals)
top_goals = df_test.nlargest(5, "goals")[["player_name", "goals"]]

# Top 5 by assists
top_assists = df_test.nlargest(5, "assists")[["player_name", "assists"]]

# Top 5 by yellow cards
top_yellow_cards = df_test.nlargest(5, "yellow_cards")[["player_name", "yellow_cards"]]

# Top 5 by red cards
top_red_cards = df_test.nlargest(5, "red_cards")[["player_name", "red_cards"]]

# Display
print("Top 5 players by goals:")
display(top_goals)

print("Top 5 players by assists:")
display(top_assists)

print("Top 5 players by yellow cards:")
display(top_yellow_cards)

print("Top 5 players by red cards:")
display(top_red_cards)

Top 5 players by goals:


Unnamed: 0,player_name,goals
72,Luis Suárez,40
393,Cristiano Ronaldo,35
68,Lionel Messi,26
71,Neymar,24
396,Karim Benzema,24


Top 5 players by assists:


Unnamed: 0,player_name,assists
68,Lionel Messi,15
72,Luis Suárez,15
41,Koke,14
71,Neymar,12
390,Gareth Bale,10


Top 5 players by yellow cards:


Unnamed: 0,player_name,yellow_cards
191,Rubén Pérez,17
274,Recio,16
88,Pablo Hernández,15
111,Gonzalo Escalante,15
113,Dani García,15


Top 5 players by red cards:


Unnamed: 0,player_name,red_cards
6,Aymeric Laporte,2
79,Gustavo Cabral,2
83,Jonny Castro,2
139,Víctor Sánchez,2
213,Aythami Artiles,2


#### Serie A 2015/16

In [30]:
# Serie A 2015/16
df_seriea = collect_player_season_stats(competition_id=12, season_id=27)
df_seriea.to_csv("../task2_ballon_dor/data/seriea_2015_16.csv", index=False)

Italy - Serie A 2015/2016: 100%|██████████| 380/380 [11:36<00:00,  1.83s/it]


In [31]:
# TEST
df_test = pd.read_csv("../task2_ballon_dor/data/seriea_2015_16.csv")
print("Serie A 2015/16 dataset shape:", df_test.shape)
df_test.head()

Serie A 2015/16 dataset shape: (551, 69)


Unnamed: 0,competitions,season,main_team,teams,main_role,player_id,player_name,presences,matches_started,full_matches,...,gk_save_ratio,gk_penalties_saved,gk_claims,gk_punches,gk_clearances,gk_smother,gk_sweeper,gk_errors,gk_goals_conceded,gk_clean_sheet
0,['Italy - Serie A'],2015/2016,AC Milan,['AC Milan'],Goalkeeper,26197,Christian Abbiati,1,0,0,...,1.0,0.0,0.0,0.0,0.0,0.0,1.0,2.0,0.0,1.0
1,['Italy - Serie A'],2015/2016,AC Milan,['AC Milan'],Goalkeeper,6768,Diego López,8,8,8,...,0.52,0.0,6.0,5.0,2.0,0.0,4.0,16.0,14.0,0.0
2,['Italy - Serie A'],2015/2016,AC Milan,['AC Milan'],Goalkeeper,7036,Gianluigi Donnarumma,30,30,29,...,0.75,0.0,20.0,23.0,11.0,1.0,39.0,50.0,29.0,11.0
3,['Italy - Serie A'],2015/2016,AC Milan,['AC Milan'],Right Back,7032,Davide Calabria,6,3,2,...,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,['Italy - Serie A'],2015/2016,AC Milan,['AC Milan'],Right Back,7463,Ignazio Abate,27,27,23,...,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [32]:
# Top 5 by goals (each row already holds one player's season total, so no aggregation is needed)
top_goals = df_test.nlargest(5, "goals")[["player_name", "goals"]]

# Top 5 by assists
top_assists = df_test.nlargest(5, "assists")[["player_name", "assists"]]

# Top 5 by yellow cards
top_yellow_cards = df_test.nlargest(5, "yellow_cards")[["player_name", "yellow_cards"]]

# Top 5 by red cards
top_red_cards = df_test.nlargest(5, "red_cards")[["player_name", "red_cards"]]

# Display
print("Top 5 players by goals:")
display(top_goals)

print("Top 5 players by assists:")
display(top_assists)

print("Top 5 players by yellow cards:")
display(top_yellow_cards)

print("Top 5 players by red cards:")
display(top_red_cards)

Top 5 players by goals:


Unnamed: 0,player_name,goals
410,Gonzalo Higuaín,36
358,Paulo Dybala,19
26,Carlos Bacca,18
334,Mauro Icardi,16
47,Mohamed Salah,14


Top 5 players by assists:


Unnamed: 0,player_name,assists
43,Miralem Pjanić,11
404,Marek Hamšík,11
197,Riccardo Saponara,10
356,Paul Pogba,10
409,Lorenzo Insigne,10


Top 5 players by yellow cards:


Unnamed: 0,player_name,yellow_cards
369,Maurício,14
457,Fernando,14
131,Riccardo Gagliolo,12
145,Lorenzo Lollo,12
235,Leonardo Blanchard,12


Top 5 players by red cards:


Unnamed: 0,player_name,red_cards
319,Jeison Murillo,3
68,Gabriel Paletta,2
103,Amadou Diawara,2
145,Lorenzo Lollo,2
256,Armando Izzo,2


#### 1. Bundesliga 2015/16

In [33]:
# Bundesliga 2015/16
df_bundesliga = collect_player_season_stats(competition_id=9, season_id=27)
df_bundesliga.to_csv("../task2_ballon_dor/data/bundesliga_2015_16.csv", index=False)

Germany - 1. Bundesliga 2015/2016: 100%|██████████| 306/306 [09:40<00:00,  1.90s/it]


In [34]:
# TEST
df_test = pd.read_csv("../task2_ballon_dor/data/bundesliga_2015_16.csv")
print("Bundesliga 2015/16 dataset shape:", df_test.shape)
df_test.head()

Bundesliga 2015/16 dataset shape: (475, 69)


Unnamed: 0,competitions,season,main_team,teams,main_role,player_id,player_name,presences,matches_started,full_matches,...,gk_save_ratio,gk_penalties_saved,gk_claims,gk_punches,gk_clearances,gk_smother,gk_sweeper,gk_errors,gk_goals_conceded,gk_clean_sheet
0,['Germany - 1. Bundesliga'],2015/2016,Augsburg,['Augsburg'],Goalkeeper,42462,Alexander Manninger,2,1,1,...,0.2,0.0,0.0,0.0,0.0,0.0,0.0,4.0,3.0,1.0
1,['Germany - 1. Bundesliga'],2015/2016,Augsburg,['Augsburg'],Goalkeeper,8314,Marwin Hitz,33,33,32,...,0.74,1.0,34.0,19.0,5.0,0.0,31.0,67.0,49.0,11.0
2,['Germany - 1. Bundesliga'],2015/2016,Augsburg,['Augsburg'],Right Back,9400,Daniel Opare,4,4,4,...,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,['Germany - 1. Bundesliga'],2015/2016,Augsburg,['Augsburg'],Right Back,8237,Paul Verhaegh,25,25,24,...,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,['Germany - 1. Bundesliga'],2015/2016,Augsburg,['Augsburg'],Right Center Back,40543,Hong Jeong-Ho,23,19,16,...,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [35]:
# Top 5 by goals (each row already holds one player's season total, so no aggregation is needed)
top_goals = df_test.nlargest(5, "goals")[["player_name", "goals"]]

# Top 5 by assists
top_assists = df_test.nlargest(5, "assists")[["player_name", "assists"]]

# Top 5 by yellow cards
top_yellow_cards = df_test.nlargest(5, "yellow_cards")[["player_name", "yellow_cards"]]

# Top 5 by red cards
top_red_cards = df_test.nlargest(5, "red_cards")[["player_name", "red_cards"]]

# Display
print("Top 5 players by goals:")
display(top_goals)

print("Top 5 players by assists:")
display(top_assists)

print("Top 5 players by yellow cards:")
display(top_yellow_cards)

print("Top 5 players by red cards:")
display(top_red_cards)

Top 5 players by goals:


Unnamed: 0,player_name,goals
75,Robert Lewandowski,30
99,Pierre-Emerick Aubameyang,25
73,Thomas Müller,20
51,Javier Hernández Balcázar,17
199,Anthony Modeste,15


Top 5 players by assists:


Unnamed: 0,player_name,assists
94,Henrikh Mkhitaryan,13
46,Karim Bellarabi,11
124,Raffael,10
71,Douglas Costa,9
440,Zlatko Junuzović,9


Top 5 players by yellow cards:


Unnamed: 0,player_name,yellow_cards
134,Peter Niemeyer,13
427,Clemens Fritz,13
13,Dominik Kohr,12
22,Caiuby,11
38,Wendell,10


Top 5 players by red cards:


Unnamed: 0,player_name,red_cards
114,Granit Xhaka,3
233,Johan Djourou,2
455,Dante,2
6,Jeffrey Gouweleeuw,1
25,Raúl Bobadilla,1


#### Ligue 1 2015/16

In [36]:
# Ligue 1 2015/16
df_ligue1 = collect_player_season_stats(competition_id=7, season_id=27)
df_ligue1.to_csv("../task2_ballon_dor/data/ligue1_2015_16.csv", index=False)

France - Ligue 1 2015/2016: 100%|██████████| 377/377 [12:03<00:00,  1.92s/it]


In [37]:
# TEST
df_test = pd.read_csv("../task2_ballon_dor/data/ligue1_2015_16.csv")
print("Ligue 1 2015/16 dataset shape:", df_test.shape)
df_test.head()

Ligue 1 2015/16 dataset shape: (573, 69)


Unnamed: 0,competitions,season,main_team,teams,main_role,player_id,player_name,presences,matches_started,full_matches,...,gk_save_ratio,gk_penalties_saved,gk_claims,gk_punches,gk_clearances,gk_smother,gk_sweeper,gk_errors,gk_goals_conceded,gk_clean_sheet
0,['France - Ligue 1'],2015/2016,AS Monaco,['AS Monaco'],Goalkeeper,3444,Danijel Subašić,36,36,36,...,0.71,2.0,38.0,17.0,5.0,1.0,25.0,62.0,50.0,12.0
1,['France - Ligue 1'],2015/2016,AS Monaco,['AS Monaco'],Goalkeeper,24021,Paul Nardi,2,2,2,...,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0
2,['France - Ligue 1'],2015/2016,AS Monaco,['AS Monaco'],Right Back,3204,Almamy Touré,10,9,7,...,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,['France - Ligue 1'],2015/2016,AS Monaco,['AS Monaco'],Right Back,3247,Fabinho,34,34,34,...,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,['France - Ligue 1'],2015/2016,AS Monaco,['AS Monaco'],Right Back,3217,Jemerson,4,3,3,...,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [38]:
# Top 5 by goals (each row already holds one player's season total, so no aggregation is needed)
top_goals = df_test.nlargest(5, "goals")[["player_name", "goals"]]

# Top 5 by assists
top_assists = df_test.nlargest(5, "assists")[["player_name", "assists"]]

# Top 5 by yellow cards
top_yellow_cards = df_test.nlargest(5, "yellow_cards")[["player_name", "yellow_cards"]]

# Top 5 by red cards
top_red_cards = df_test.nlargest(5, "red_cards")[["player_name", "red_cards"]]

# Display
print("Top 5 players by goals:")
display(top_goals)

print("Top 5 players by assists:")
display(top_assists)

print("Top 5 players by yellow cards:")
display(top_yellow_cards)

print("Top 5 players by red cards:")
display(top_red_cards)

Top 5 players by goals:


Unnamed: 0,player_name,goals
426,Zlatan Ibrahimović,36
282,Alexandre Lacazette,21
423,Edinson Cavani,19
308,Michy Batshuayi,17
390,Hatem Ben Arfa,17


Top 5 players by assists:


Unnamed: 0,player_name,assists
422,Ángel Di María,17
426,Zlatan Ibrahimović,13
331,Ryad Boudebouz,11
17,Nabil Dirar,8
134,Julien Féret,8


Top 5 players by yellow cards:


Unnamed: 0,player_name,yellow_cards
151,Jérôme Le Moigne,14
71,Yannick Cahuzac,13
41,Romain Saïss,12
178,Mustapha Diallo,12
502,Jaba Kankava,12


Top 5 players by red cards:


Unnamed: 0,player_name,red_cards
499,Antoine Devaux,3
43,Cheikh N'Doye,2
71,Yannick Cahuzac,2
74,François Kamano,2
86,André Poko,2


## Note on Assists Definitions in StatsBomb Data

In the StatsBomb open event data, there are several fields that can be used to identify assists:

1. **`pass_goal_assist`**  
   Boolean flag attached to a pass, indicating that the shot it set up was scored. The column can be missing from the events dataframe of a match, which happens when no pass in that match carries the flag (for example, when no goal was assisted). When present, it is treated here as the *official StatsBomb definition* of an assist.  

2. **`pass_assisted_shot_id`**  
   A reference from a pass to the shot it created. If that shot ends in a goal, the pass is effectively an assist.  
   **Limitation**: the outcome of the referenced shot must be checked. If the shot is not scored, the pass does not count as an assist.  

3. **`shot_key_pass_id`**  
   A reference from a shot to the event that created it (the “key pass”). If the shot is a goal and the key pass was played by the player, this also identifies an assist.  
   **Limitation**: as with the previous method, the shot outcome must be checked to confirm that it was a goal.  
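
The three definitions above can be sketched on a handful of toy events. This is a minimal illustration, not the notebook's pipeline: the records are plain dictionaries that reuse the flattened column names (`pass_goal_assist`, `pass_assisted_shot_id`, `shot_key_pass_id`, `shot_outcome`), and every id, player name, and helper function below is invented for the example.

```python
# Toy events that mimic the flattened column names used in this notebook.
# All ids, players, and values are invented for illustration only.
events = [
    {"id": "p1", "type": "Pass", "player": "A",
     "pass_goal_assist": True, "pass_assisted_shot_id": "s1"},
    {"id": "s1", "type": "Shot", "player": "B",
     "shot_outcome": "Goal", "shot_key_pass_id": "p1"},
    {"id": "p2", "type": "Pass", "player": "A", "pass_assisted_shot_id": "s2"},
    {"id": "s2", "type": "Shot", "player": "B",
     "shot_outcome": "Saved", "shot_key_pass_id": "p2"},
    {"id": "p3", "type": "Pass", "player": "C"},
]


def goal_shots(events):
    """Return the shots that ended in a goal."""
    return [e for e in events
            if e["type"] == "Shot" and e.get("shot_outcome") == "Goal"]


def assists_by_flag(events, player):
    """Method 1: passes explicitly flagged with pass_goal_assist."""
    return {e["id"] for e in events
            if e["type"] == "Pass" and e["player"] == player
            and e.get("pass_goal_assist") is True}


def assists_by_assisted_shot(events, player):
    """Method 2: passes whose pass_assisted_shot_id points to a scored shot."""
    goal_ids = {s["id"] for s in goal_shots(events)}
    return {e["id"] for e in events
            if e["type"] == "Pass" and e["player"] == player
            and e.get("pass_assisted_shot_id") in goal_ids}


def assists_by_key_pass(events, player):
    """Method 3: scored shots whose shot_key_pass_id points to the player's event."""
    key_ids = {s["shot_key_pass_id"] for s in goal_shots(events)
               if s.get("shot_key_pass_id")}
    return {e["id"] for e in events
            if e["id"] in key_ids and e["player"] == player}


for method in (assists_by_flag, assists_by_assisted_shot, assists_by_key_pass):
    print(method.__name__, sorted(method(events, "A")))
```

In this toy data all three methods return only `p1` for player `A`: pass `p2` led to a saved shot, so methods 2 and 3 discard it after checking the shot outcome.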

When all three methods are applied to Lionel Messi in La Liga 2015/16, the counts are consistent across definitions: **15 assists**. However, according to official season statistics, Messi recorded **16 assists**. This discrepancy points to two possible explanations:

- The open StatsBomb dataset may not be fully updated or corrected for every match in the 2015/16 season.

- Some situations in which a player is widely considered to have created a goal are not captured by StatsBomb’s official event definitions and therefore remain uncounted.  

As a consequence, the counts derived from the open data may be lower than those reported by other providers or seen in match highlights.  

This limitation is not unique to assists. Other advanced statistics derived from event data (such as blocks or duels) are also constrained by the way StatsBomb defines and encodes events. Importantly, the same issue is consistently observed across all of the Big 5 leagues in the 2015/16 open dataset.  

**[!!!]** For the purposes of this assignment, we chose to rely on the **official StatsBomb definitions** for assists as well as for all the other performance metrics.
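
Equal totals across the three definitions do not by themselves show that the same passes were identified each time. A possible extra check, sketched below under the assumption that each method yields a list of event ids, is to compare the id sets directly. The function name `compare_assist_ids` and the sample ids are hypothetical.

```python
def compare_assist_ids(official, linked, key_pass):
    """Compare the event ids found by each assist definition.

    Returns the ids found by every method, plus, for each method,
    the ids it found that at least one other method missed.
    """
    methods = {
        "official": set(official),
        "linked": set(linked),
        "key_pass": set(key_pass),
    }
    agreed = set.intersection(*methods.values())
    disputed = {name: ids - agreed for name, ids in methods.items()}
    return agreed, disputed


# Invented ids: every method reports two assists, but not the same two.
agreed, disputed = compare_assist_ids(["a", "b"], ["a", "c"], ["a", "c"])
print("Agreed by all methods:", sorted(agreed))
for name, ids in disputed.items():
    print(f"Not unanimous ({name}):", sorted(ids))
```

If every entry of `disputed` is empty, the three definitions agree on the events themselves and not only on the totals.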


In [39]:
import pandas as pd
from statsbombpy import sb
from tqdm import tqdm

# PARAMETERS
competition_id = 11   # La Liga
season_id = 27        # 2015/2016
player_name = "Lionel Andrés Messi Cuccittini"

# Load all matches
matches = sb.matches(competition_id=competition_id, season_id=season_id)

records = []

for _, match in tqdm(matches.iterrows(), total=matches.shape[0]):
    match_id = match["match_id"]

    # Load events and lineups
    events = sb.events(match_id=match_id).reset_index(drop=True)
    lineups = sb.lineups(match_id=match_id)

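    # Collect all players listed in either team's lineup
    all_players = []
    for lineup_df in lineups.values():  # avoid reusing the name `df`, which is assigned after the loop
        all_players.extend(lineup_df["player_name"].tolist())

    # Skip matches where the player is not listed in either lineup
    if player_name not in all_players:
        continue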

    # Official assists (pass_goal_assist)
    passes = events[(events["player"] == player_name) & (events["type"] == "Pass")]
    if "pass_goal_assist" in passes.columns:
        assists_official = passes.loc[passes["pass_goal_assist"] == True, "id"].tolist()
    else:
        assists_official = []

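    # Assists via pass_assisted_shot_id (pass linked to a shot that ended in a goal)
    shots = events[(events["type"] == "Shot") & (events["shot_outcome"] == "Goal")]
    if "pass_assisted_shot_id" in passes.columns:
        merged = passes.merge(
            shots[["id", "shot_outcome"]],
            left_on="pass_assisted_shot_id",
            right_on="id",
            how="inner"
        )
        # "id_x" is the pass id; "id_y" is the id of the scored shot
        assists_linked = merged["id_x"].tolist()
    else:
        # Guard against matches where the column is absent
        assists_linked = []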

    # Assists via shot_key_pass_id
    gca = []
    if "shot_key_pass_id" in shots.columns:
        key_ids = shots.loc[shots["shot_key_pass_id"].notna(), "shot_key_pass_id"].tolist()
        gca_events = events[events["id"].isin(key_ids)]
        gca_events = gca_events[gca_events["player"] == player_name]
        gca = gca_events["id"].tolist()


    # Save match record only if Messi had any assist
    if assists_official or assists_linked or gca:
        records.append({
            "match_week": match["match_week"],
            "match_date": match["match_date"],
            "home_team": match["home_team"],
            "away_team": match["away_team"],
            "assists_official": len(assists_official),
            "assists_linked": len(assists_linked),
            "gca": len(gca),
        })

# Final result
df = pd.DataFrame(records).sort_values("match_date")

print(f"\nTotal official assists (pass_goal_assist): {df['assists_official'].sum()}")
print(f"Total assists (pass_assisted_shot_id): {df['assists_linked'].sum()}")
print(f"Total goal-creating actions (shot_key_pass_id): {df['gca'].sum()}")

100%|██████████| 380/380 [05:18<00:00,  1.19it/s]


Total official assists (pass_goal_assist): 15
Total assists (pass_assisted_shot_id): 15
Total goal-creating actions (shot_key_pass_id): 15



