# Task 2 - Data Preparation (StatsBomb Big 5 Leagues 2015/2016)

This notebook focuses on the preparation of the **StatsBomb open data** related to the Big 5 European leagues for the 2015/2016 season. The data will be loaded through the `statsbombpy` library, with an initial exploration of the available structures at both team and player level. Basic cleaning procedures will be applied to ensure fair comparisons across players. Finally, the processed datasets will be saved for subsequent analyses. 

## Note on Available Stats (Open Data Limitation)

The **`statsbombpy` library** provides convenient aggregated endpoints such as `team_season_stats`, `team_match_stats`, `player_season_stats`, and `player_match_stats`. However, these endpoints are **not available in the public open-data release** and require commercial credentials. As a result, this notebook **builds all team- and player-level statistics from scratch** using only the open-data endpoints:

- `sb.competitions()` – list of competitions/seasons  
- `sb.matches(competition_id, season_id)` – list of matches per competition/season  
- `sb.events(match_id)` – full on-ball event log for a match (shots, passes, dribbles, duels, pressures, etc.)  
- `sb.lineups(match_id)` – squads and players (used to infer minutes played together with events/substitutions)

In [26]:
from statsbombpy import sb

# Demo: Attempt to use an aggregated stats endpoint
# The function sb.player_season_stats() would normally return 
# season-level player statistics if commercial credentials were provided
# However, this endpoint is NOT available in the open-data release

try:
    # Example: attempt to load Premier League 2015/16 season stats
    _ = sb.player_season_stats(competition_id=2, season_id=27)
    
except Exception as e:
    # This error confirms that aggregated stats are not part of the open dataset
    print("Aggregated endpoint not available in open data. Falling back to events/lineups.")
    print(f"ERROR returned by statsbombpy: {e}")


Aggregated endpoint not available in open data. Falling back to events/lineups.
ERROR returned by statsbombpy: There is currently no open data for aggregated stats, please provide credentials


## Imports and Global Settings

In [118]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from random import randint

from tqdm import tqdm

from statsbombpy import sb

import warnings
warnings.filterwarnings("ignore")

# Display settings for pandas
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", 100)
pd.set_option("display.width", 150)

## Load Competitions and Filter 2015/16 Big 5

In [5]:
# Load all available competitions
competitions = sb.competitions()

display(competitions.columns.tolist())

print("All competitions available:")
display(competitions[["competition_id", "season_id", "competition_name", "season_name"]])


['competition_id',
 'season_id',
 'country_name',
 'competition_name',
 'competition_gender',
 'competition_youth',
 'competition_international',
 'season_name',
 'match_updated',
 'match_updated_360',
 'match_available_360',
 'match_available']

All competitions available:


| | competition_id | season_id | competition_name | season_name |
|---|---|---|---|---|
| 0 | 9 | 281 | 1. Bundesliga | 2023/2024 |
| 1 | 9 | 27 | 1. Bundesliga | 2015/2016 |
| 2 | 1267 | 107 | African Cup of Nations | 2023 |
| 3 | 16 | 4 | Champions League | 2018/2019 |
| 4 | 16 | 1 | Champions League | 2017/2018 |
| 5 | 16 | 2 | Champions League | 2016/2017 |
| 6 | 16 | 27 | Champions League | 2015/2016 |
| 7 | 16 | 26 | Champions League | 2014/2015 |
| 8 | 16 | 25 | Champions League | 2013/2014 |
| 9 | 16 | 24 | Champions League | 2012/2013 |


In [6]:
# Filter competitions for season 2015/2016
season_year = "2015/2016"
competitions_1516 = competitions[competitions["season_name"] == season_year]

print("Competitions for season 2015/2016:")
display(competitions_1516[["competition_id", "season_id", "competition_name", "season_name"]])

Competitions for season 2015/2016:


| | competition_id | season_id | competition_name | season_name |
|---|---|---|---|---|
| 1 | 9 | 27 | 1. Bundesliga | 2015/2016 |
| 6 | 16 | 27 | Champions League | 2015/2016 |
| 43 | 11 | 27 | La Liga | 2015/2016 |
| 60 | 7 | 27 | Ligue 1 | 2015/2016 |
| 64 | 2 | 27 | Premier League | 2015/2016 |
| 66 | 12 | 27 | Serie A | 2015/2016 |


In [7]:
# Select Big 5 leagues and count matches
big5 = ["Premier League", "La Liga", "Serie A", "1. Bundesliga", "Ligue 1"]

competitions_big5_1516 = competitions_1516[
    competitions_1516["competition_name"].isin(big5)
].copy()

# Count matches for each competition
match_counts = []
for _, row in competitions_big5_1516.iterrows():

    # Retrieve competition id and season id
    comp_id = row["competition_id"]
    season_id = row["season_id"]

    # Retrieve the matches for each competition-season
    matches = sb.matches(competition_id=comp_id, season_id=season_id)

    # Count the number of matches and store it
    n_matches = matches.shape[0]
    match_counts.append(n_matches)

# Add matches column to the dataframe
competitions_big5_1516["num_matches"] = match_counts

# Display the results
print("Big 5 competitions in 2015/2016 with match counts:")
display(competitions_big5_1516[["competition_id", "season_id","competition_name", "season_name", "num_matches"]])

# Total
total_matches = competitions_big5_1516["num_matches"].sum()
print(f"Total matches in Big 5 competitions (2015/2016): {total_matches}")


Big 5 competitions in 2015/2016 with match counts:


| | competition_id | season_id | competition_name | season_name | num_matches |
|---|---|---|---|---|---|
| 1 | 9 | 27 | 1. Bundesliga | 2015/2016 | 306 |
| 43 | 11 | 27 | La Liga | 2015/2016 | 380 |
| 60 | 7 | 27 | Ligue 1 | 2015/2016 | 377 |
| 64 | 2 | 27 | Premier League | 2015/2016 | 380 |
| 66 | 12 | 27 | Serie A | 2015/2016 | 380 |


Total matches in Big 5 competitions (2015/2016): 1823


> **NOTE**: For the 2015/2016 season, the StatsBomb open data provides the full set of matches for all Big 5 leagues except Ligue 1.  
> In Ligue 1, only 377 matches are available instead of the expected 380, due to a few games not being released in the public dataset.  
> This minor discrepancy (less than 1% of the total league games) is not considered problematic, as it does not significantly affect aggregated player or team statistics.
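
The expected totals follow from the double round-robin format, in which a league with n teams plays n(n-1) matches. The sketch below reproduces that arithmetic. It assumes 20 teams in every league except the 18-team Bundesliga, and the helper name `expected_matches` is purely illustrative (it is not part of `statsbombpy`).

```python
def expected_matches(n_teams: int) -> int:
    """Matches in a double round-robin season: every team hosts every other team once."""
    return n_teams * (n_teams - 1)

# Team counts assumed for the 2015/2016 season
teams_per_league = {
    "1. Bundesliga": 18,
    "La Liga": 20,
    "Ligue 1": 20,
    "Premier League": 20,
    "Serie A": 20,
}

# Match counts reported by sb.matches() in the table above
available = {
    "1. Bundesliga": 306,
    "La Liga": 380,
    "Ligue 1": 377,
    "Premier League": 380,
    "Serie A": 380,
}

# Missing matches per league = expected - available
missing = {
    league: expected_matches(n) - available[league]
    for league, n in teams_per_league.items()
}
missing_share_ligue1 = missing["Ligue 1"] / expected_matches(20)

print(missing)
print(f"Share of Ligue 1 matches missing: {missing_share_ligue1:.2%}")
```

Under these assumptions only Ligue 1 shows a gap, which is consistent with the note above.
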

### Identifying Missing Matches in Ligue 1 (2015/2016)

Ligue 1 should contain 380 matches in the 2015/2016 season, but only 377 are available in the StatsBomb open data. The cell below detects the match weeks with missing games and identifies the teams involved by comparing the teams that played in each round with the complete set of Ligue 1 participants.

In [8]:
# Load Ligue 1 2015/16 matches
# Competition "Ligue 1" id: 7
# Season "2015/2016" id: 27
matches_ligue1 = sb.matches(competition_id=7, season_id=27)

# Group by match week and count matches
# .groupby("match_week") groups the DataFrame by each round of the season
# .size() counts the number of rows (i.e., matches) per group
matches_per_week = matches_ligue1.groupby("match_week").size()

# Identify the match weeks with fewer than 10 matches in that round
incomplete_weeks = matches_per_week[matches_per_week < 10]

print("Match weeks with missing games:\n")
print(incomplete_weeks)

# Retrieve the full set of teams that appear across the season
all_teams = set(matches_ligue1["home_team"]).union(set(matches_ligue1["away_team"]))

# Loop through each incomplete week to identify missing teams
for week in incomplete_weeks.index:
    print(f"\nMatch Week {week}")
    
    # Extract all matches for that week
    week_matches = matches_ligue1[matches_ligue1["match_week"] == week]
    
    # Collect all teams that played (both home and away) during that week
    played_teams = set(week_matches["home_team"]).union(set(week_matches["away_team"]))
    
    # Identify the teams that did not play in that week
    missing_teams = all_teams - played_teams

    # Print the missing teams that should form the missing match
    if missing_teams:
        print(f"Missing match: {list(missing_teams)} did not play")


Match weeks with missing games:

match_week
14    9
23    9
36    9
dtype: int64

Match Week 14
Missing match: ['Bastia', 'Gazélec Ajaccio'] did not play

Match Week 23
Missing match: ['Saint-Étienne', 'Paris Saint-Germain'] did not play

Match Week 36
Missing match: ['Troyes', 'Bordeaux'] did not play


The identification of three missing matches in the Ligue 1 dataset for the 2015/2016 season does not pose a significant issue for the analysis. Most of the teams involved did not have players realistically competing for the Ballon d’Or. The only notable exception is *Paris Saint-Germain*; however, given the substantial number of their matches still available, the absence of this single fixture is unlikely to materially affect the aggregated player statistics considered in the study.

## Building Player and Team Statistics from Events and Lineups

### Event Categorization for Ballon d’Or Player Evaluation

In [None]:
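def list_event_types(competition_id: int, season_id: int, limit_matches=None):
    """
    Print and return all unique event types in a competition/season.
    
    Args:
        competition_id (int): StatsBomb competition ID 
        season_id (int): StatsBomb season ID 
        limit_matches (int, optional): limit number of matches to speed up. Default None (all matches).

    Returns:
        set: Unique event type names found in the processed matches
    """
    # Load matches (optionally keeping only the first `limit_matches` rows)
    matches = sb.matches(competition_id=competition_id, season_id=season_id)
    if limit_matches is not None:
        matches = matches.head(limit_matches)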
    
    event_types = set()
    
    for _, match in tqdm(matches.iterrows(), total=matches.shape[0]):
        match_id = match["match_id"]
        events = sb.events(match_id=match_id)
        event_types.update(events["type"].unique())
    
    print(f"Unique event types in competition {competition_id}, season {season_id}:")
    for etype in sorted(event_types):
        print("-", etype)
    
    return event_types

# Example: Premier League 2015/16 (competition_id=2, season_id=27)
event_types = list_event_types(competition_id=2, season_id=27)
print(f"Total unique event types found: {len(event_types)}")


100%|██████████| 380/380 [03:31<00:00,  1.80it/s]

Unique event types in competition 2, season 27:
- 50/50
- Bad Behaviour
- Ball Receipt*
- Ball Recovery
- Block
- Carry
- Clearance
- Dispossessed
- Dribble
- Dribbled Past
- Duel
- Error
- Foul Committed
- Foul Won
- Goal Keeper
- Half End
- Half Start
- Injury Stoppage
- Interception
- Miscontrol
- Offside
- Own Goal Against
- Own Goal For
- Pass
- Player Off
- Player On
- Pressure
- Referee Ball-Drop
- Shield
- Shot
- Starting XI
- Substitution
- Tactical Shift
Total unique event types found: 33





After the event analysis, it was decided to keep the categories that provide clear insights into individual player performance, while excluding events that are either marginal or redundant for the evaluation process.

**1. Offensive & Possession Actions**

Includes all events directly related to attacking play, chance creation, and ball progression:

- *Shot*  

- *Pass*  

- *Carry* 

- *Dribble*

**2. Defensive Actions**

Covers events that measure defensive contribution and ball recovery:

- *Duel* 

- *Dribbled Past*

- *Interception*  

- *Block*  

- *Clearance* 

- *Ball Recovery*  

- *Pressure* 

- *Dispossessed*  

**3. Goalkeeping**

Dedicated to goalkeeper-specific actions:

- *Goal Keeper*  

**4. Discipline & Fouls**

Captures events related to discipline, fouls, and negative contributions:

- *Foul Committed*  

- *Foul Won*  

- *Bad Behaviour* (yellow/red cards)  
- *Own Goal For / Against*  

**5. Context & Playing Time**

Provides the necessary information for player availability, minutes played, and role information:

- *Starting XI*  

- *Substitution*  

**Excluded Events**

The following events were excluded from further analysis, as they provide limited or redundant information for player evaluation:

- *Tactical Shift* --> Indicates formation or role changes. It was decided not to consider it, for simplicity

- *Player On / Player Off* --> Redundant; substitutions and lineups already provide the necessary information for minutes played  

- *Half Start / Half End* --> Structural markers of match timeline; no performance relevance

- *Injury Stoppage* --> Contextual interruption; does not measure performance

- *Referee Ball-Drop* --> Administrative event; no player performance information

- *Shield* --> Difficult to quantify in terms of individual performance

- *Error* --> Broad and ambiguous; overlaps with dispossession or miscontrol events already considered

- *Miscontrol* --> Already covered under *Dispossessed*, avoiding redundancy

- *Offside* --> Primarily a team-level outcome, offering limited individual insight

- *Ball Receipt\** --> Redundant; every completed pass already implies a ball reception, making this event unnecessary to consider 

- *50/50* --> Redundant; it is already considered inside the event *Duel*
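
The grouping above can also be written down as a lookup table, which makes it easy to check that every one of the 33 event types found earlier has been assigned somewhere. This is only a sketch: the dictionary keys and the `categorize` helper are hypothetical names chosen for illustration, not part of `statsbombpy`.

```python
# Categories kept for player evaluation (names of the keys are illustrative)
EVENT_CATEGORIES = {
    "offensive_possession": ["Shot", "Pass", "Carry", "Dribble"],
    "defensive": [
        "Duel", "Dribbled Past", "Interception", "Block",
        "Clearance", "Ball Recovery", "Pressure", "Dispossessed",
    ],
    "goalkeeping": ["Goal Keeper"],
    "discipline_fouls": [
        "Foul Committed", "Foul Won", "Bad Behaviour",
        "Own Goal For", "Own Goal Against",
    ],
    "context_playing_time": ["Starting XI", "Substitution"],
}

# Event types left out of the analysis
EXCLUDED_EVENTS = [
    "Tactical Shift", "Player On", "Player Off", "Half Start", "Half End",
    "Injury Stoppage", "Referee Ball-Drop", "Shield", "Error",
    "Miscontrol", "Offside", "Ball Receipt*", "50/50",
]

# Reverse lookup: event type -> category
EVENT_TO_CATEGORY = {
    etype: category
    for category, types in EVENT_CATEGORIES.items()
    for etype in types
}

def categorize(event_type: str) -> str:
    """Return the category of an event type; anything not listed counts as 'excluded'."""
    return EVENT_TO_CATEGORY.get(event_type, "excluded")

print(len(EVENT_TO_CATEGORY) + len(EXCLUDED_EVENTS))  # total event types covered
```

A column such as `events["type"].map(categorize)` could then be used to filter or group events by category.
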


#### Note on Event Columns in StatsBomb Data

The StatsBomb event dataset contains a mixture of **shared attributes** (present in all events) and **event-specific attributes** (only relevant for certain event types). When these events are flattened into a DataFrame, only the columns that actually appear in that match are created. As a result, **the number of columns in the events DataFrame can vary from match to match**, depending on the types of actions recorded.  

What remains consistent are the shared fields, while event-specific fields appear only when relevant for that particular match.
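
Because event-specific columns may be absent in a given match, direct access such as `df['pass_cross']` can raise a `KeyError`. The sketch below shows two possible ways to cope with this, using toy DataFrames rather than real StatsBomb data: a guarded accessor (the name `get_column` is hypothetical) and `pd.concat`, which takes the union of columns when stacking matches.

```python
import numpy as np
import pandas as pd

def get_column(df: pd.DataFrame, name: str, default=np.nan) -> pd.Series:
    """Return df[name] if present, otherwise a Series of `default` aligned to df's index."""
    if name in df.columns:
        return df[name]
    return pd.Series(default, index=df.index, name=name)

# Two toy "matches" with different event-specific columns
match_a = pd.DataFrame({"type": ["Pass", "Shot"], "pass_cross": [True, np.nan]})
match_b = pd.DataFrame({"type": ["Pass", "Pass"], "pass_switch": [np.nan, True]})

# pd.concat builds the union of columns and fills the gaps with NaN
combined = pd.concat([match_a, match_b], ignore_index=True, sort=False)

# Guarded access: no KeyError even though match_b has no 'pass_cross' column
crosses_in_a = int(get_column(match_a, "pass_cross", False).eq(True).sum())
crosses_in_b = int(get_column(match_b, "pass_cross", False).eq(True).sum())

print(list(combined.columns))
print(crosses_in_a, crosses_in_b)
```

Counting with `.eq(True)` treats both `NaN` and a missing column as "flag not set", so the same expression works for every match.
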


In [None]:
# Load matches from Premier League 2015/16 (comp_id=2, season_id=27)
matches = sb.matches(competition_id=2, season_id=27)

# Pick the first 10 matches
sample_matches = matches.head(10)

print("Number of columns in events DataFrame for 10 matches:\n")

# enumerate() numbers the matches independently of the DataFrame index
for i, (_, row) in enumerate(sample_matches.iterrows(), start=1):
    match_id = row['match_id']
    events = sb.events(match_id=match_id)
    print(f"Match {i}: {row['home_team']} vs {row['away_team']} -> {events.shape[1]} columns")

Number of columns in events DataFrame for 10 matches:

Match 1: Leicester City vs AFC Bournemouth -> 90 columns
Match 2: West Bromwich Albion vs Sunderland -> 92 columns
Match 3: Newcastle United vs Aston Villa -> 89 columns
Match 4: Everton vs AFC Bournemouth -> 88 columns
Match 5: Crystal Palace vs Watford -> 95 columns
Match 6: Arsenal vs Aston Villa -> 95 columns
Match 7: West Bromwich Albion vs Liverpool -> 93 columns
Match 8: Tottenham Hotspur vs AFC Bournemouth -> 89 columns
Match 9: Leicester City vs Manchester City -> 88 columns
Match 10: Crystal Palace vs Everton -> 90 columns


#### Example Match Extraction for Function Testing

In [155]:
# Load matches for Premier League 2015/16 (comp_id=2, season_id=27)
matches = sb.matches(competition_id=2, season_id=27)

# Select the match at index 0
first_match = matches.iloc[0]
match_id = first_match['match_id']

# Print summary information about the selected match
print("EXAMPLE MATCH SELECTED")
print(f"Competition : Premier League")
print(f"Season      : 2015/16")
print(f"Matchweek   : {first_match['match_week']}")
print(f"Date        : {first_match['match_date']}")
print(f"Home Team   : {first_match['home_team']}")
print(f"Away Team   : {first_match['away_team']}")
print(f"Final Score : {first_match['home_score']} - {first_match['away_score']}")
print(f"Match ID    : {match_id}")

EXAMPLE MATCH SELECTED
Competition : Premier League
Season      : 2015/16
Matchweek   : 20
Date        : 2016-01-02
Home Team   : Leicester City
Away Team   : AFC Bournemouth
Final Score : 0 - 0
Match ID    : 3754058


#### 1. Offensive and Possession Actions

In [147]:
def extract_offensive_stats(events_df, pitch_length=120):
    """
    Extract offensive statistics from match/player events
    Processes StatsBomb event types: Shot, Pass, Carry, Dribble.
    
    Args:
        events_df (pd.DataFrame): StatsBomb events for a single match or player
        pitch_length (float): Pitch length in StatsBomb pitch units (default 120; the StatsBomb pitch is 120 x 80)
    
    Returns:
        dict: Dictionary with aggregated offensive metrics
    """

    stats = {}

    # SHOTS EVENTS
    shots = events_df[events_df['type'] == 'Shot']

    # Total number of shots attempted
    stats['shots_attempted'] = len(shots)

    # Goals scored (shot_outcome == 'Goal')
    stats['goals'] = (shots['shot_outcome'] == 'Goal').sum()

    # Shots on target (goal, saved by the goalkeeper, or saved onto the post)
    stats['shots_on_target'] = shots['shot_outcome'].isin(
        ['Goal', 'Saved', 'Saved To Post']
    ).sum()

    # Expected Goals (sum of StatsBomb xG values)
    stats['xg_total'] = shots['shot_statsbomb_xg'].sum(skipna=True)

    # Average xG per shot (quality of average shooting chance)
    stats['xg_avg'] = shots['shot_statsbomb_xg'].mean(skipna=True)

    # Penalties attempted (shot_type == 'Penalty')
    stats['penalties'] = (shots['shot_type'] == 'Penalty').sum()

    # Headers attempted (body part == Head)
    stats['headers'] = (shots['shot_body_part'] == 'Head').sum()



    # PASSES EVENTS
    passes = events_df[events_df['type'] == 'Pass']

    # Total passes attempted
    stats['passes_attempted'] = len(passes)

    # Completed passes (StatsBomb: pass_outcome is NaN if successful)
    stats['passes_completed'] = passes['pass_outcome'].isna().sum()

    # Passing accuracy
    stats['pass_accuracy'] = (
        stats['passes_completed'] / stats['passes_attempted']
        if stats['passes_attempted'] > 0 else np.nan
    )

    # Assists (passes flagged by StatsBomb as leading directly to a goal).
    # The flag sits on the passer's own event, so it also works when events_df
    # is filtered to a single player. The column is absent in matches without
    # assisted goals, hence the guard.
    if 'pass_goal_assist' in passes.columns:
        stats['assists'] = int(passes['pass_goal_assist'].fillna(False).sum())
    else:
        stats['assists'] = 0

    # Key passes (passes leading directly to a shot)
    stats['key_passes'] = passes['pass_shot_assist'].fillna(False).sum()

    # Progressive passes (forward passes advancing ≥15 StatsBomb pitch units along the x-axis)
    progressive_passes = 0
    for _, row in passes.iterrows():
        start = row.get('location', None)
        end = row.get('pass_end_location', None)
        if isinstance(start, list) and isinstance(end, list):
            if (end[0] - start[0]) >= 15:
                progressive_passes += 1
    stats['progressive_passes'] = progressive_passes

    # Crosses attempted
    stats['crosses'] = passes['pass_cross'].fillna(False).sum()

    # Switches of play
    stats['switches'] = passes['pass_switch'].fillna(False).sum()

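    # Average pass angle (measure of verticality vs lateral passing)
    # StatsBomb stores pass_angle in radians: values near 0 indicate passes straight towards the opponent's goal,
    # values near ±π/2 indicate lateral passes. Left and right angles have opposite signs and partly cancel in the mean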
    stats['avg_pass_angle'] = passes['pass_angle'].mean(skipna=True)

    # Average pass length (directness, tendency to play long vs short)
    stats['avg_pass_length'] = passes['pass_length'].mean(skipna=True)




    # CARRIES EVENTS
    carries = events_df[events_df['type'] == 'Carry']

    # Total carries (times player moved the ball by running with it)
    stats['carries_total'] = len(carries)

    # Total distance carried (sum of carry lengths)
    total_carry_distance = 0
    for _, row in carries.iterrows():
        start = row.get('location', None)
        end = row.get('carry_end_location', None)
        if isinstance(start, list) and isinstance(end, list):
            dist = np.linalg.norm(np.array(end) - np.array(start))
            total_carry_distance += dist
    stats['carry_distance_total'] = total_carry_distance

    # Progressive carries (advancing ≥10 StatsBomb pitch units towards goal)
    progressive_carries = 0
    for _, row in carries.iterrows():
        start = row.get('location', None)
        end = row.get('carry_end_location', None)
        if isinstance(start, list) and isinstance(end, list):
            if (end[0] - start[0]) >= 10:
                progressive_carries += 1
    stats['progressive_carries'] = progressive_carries

    # Carries ending inside the penalty area (runs with the ball into the box)
    carries_to_box = 0
    for loc in carries['carry_end_location']:
        if isinstance(loc, list):
            if loc[0] >= (pitch_length - 18) and 18 <= loc[1] <= 62:
                carries_to_box += 1
    stats['carries_to_penalty_area'] = carries_to_box



    # DRIBBLES EVENTS
    dribbles = events_df[events_df['type'] == 'Dribble']

    # Total dribbles attempted
    stats['dribbles_attempted'] = len(dribbles)

    # Successful dribbles (outcome == 'Complete')
    stats['dribbles_completed'] = (dribbles['dribble_outcome'] == 'Complete').sum()

    # Dribble success rate (success %)
    stats['dribble_success_rate'] = (
        stats['dribbles_completed'] / stats['dribbles_attempted']
        if stats['dribbles_attempted'] > 0 else np.nan
    )

    # Dribble overruns (failed dribble due to losing control of the ball)
    stats['dribble_overruns'] = dribbles['dribble_overrun'].fillna(False).sum()

    # Round only selected float stats
    for key in ['xg_total', 'xg_avg', 'pass_accuracy', 
                'avg_pass_angle', 'avg_pass_length', 
                'carry_distance_total', 'dribble_success_rate']:
        if key in stats and isinstance(stats[key], (float, np.floating)):
            stats[key] = round(stats[key], 2)


    return stats


In [156]:
# TEST ON A SINGLE PLAYER

# Load events for that match 
events_df = sb.events(match_id=match_id)

# Extract unique players from events (skip NaNs)
players_in_match = events_df[['player_id', 'player', 'team']].dropna().drop_duplicates()

# Pick one player at random
player_row = players_in_match.iloc[randint(0, len(players_in_match)-1)]
player_id = player_row['player_id']
player_name = player_row['player']
team_name = player_row['team']

# Filter events for that player
player_events = events_df[events_df['player_id'] == player_id]

print("\nEXAMPLE PLAYER SELECTED")
print(f"Player : {player_name}")
print(f"Team   : {team_name}")
print(f"Total events for player in match: {len(player_events)}")

# Extract offensive stats
player_stats = extract_offensive_stats(player_events)

# Print summary
print("Offensive Stats for Player:")
print(f"Player : {player_name}")
print(f"Team   : {team_name}")
print(f"Match  : {first_match['home_team']} vs {first_match['away_team']} (ID {match_id})\n")

print("Extracted offensive stats:\n")
for k, v in player_stats.items():
    print(f"{k}: {v}")



EXAMPLE PLAYER SELECTED
Player : Sylvain Distin
Team   : AFC Bournemouth
Total events for player in match: 14
Offensive Stats for Player:
Player : Sylvain Distin
Team   : AFC Bournemouth
Match  : Leicester City vs AFC Bournemouth (ID 3754058)

Extracted offensive stats:

shots_attempted: 0
goals: 0
shots_on_target: 0
xg_total: 0.0
xg_avg: nan
penalties: 0
headers: 0
passes_attempted: 3
passes_completed: 1
pass_accuracy: 0.33
assists: 0
key_passes: 0
progressive_passes: 2
crosses: 0
switches: 0
avg_pass_angle: -0.44
avg_pass_length: 25.04
carries_total: 1
carry_distance_total: 0.2
progressive_carries: 0
carries_to_penalty_area: 0
dribbles_attempted: 0
dribbles_completed: 0
dribble_success_rate: nan
dribble_overruns: 0
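
The `avg_pass_angle` value above is expressed in radians, the unit StatsBomb uses for `pass_angle`, with 0 pointing straight ahead and opposite signs for the two sides of the pitch. A plain mean therefore lets passes to the left and to the right cancel each other out. The sketch below shows one possible alternative, the mean absolute angle converted to degrees. The function name `mean_abs_pass_angle_deg` is hypothetical, and this is not the metric computed by `extract_offensive_stats`.

```python
import math

def mean_abs_pass_angle_deg(angles_rad):
    """Mean absolute pass angle in degrees, ignoring None/NaN; NaN if nothing is valid."""
    valid = [abs(a) for a in angles_rad if a is not None and not math.isnan(a)]
    if not valid:
        return math.nan
    return math.degrees(sum(valid) / len(valid))

# One pass 45 degrees to one side and one pass 45 degrees to the other
angles = [math.pi / 4, -math.pi / 4]

signed_mean = sum(angles) / len(angles)          # the two passes cancel out
abs_mean_deg = mean_abs_pass_angle_deg(angles)   # direction-independent measure

print(signed_mean)
print(round(abs_mean_deg, 1))
```

With the absolute value, results near 0° indicate predominantly vertical passing and results near 90° predominantly lateral passing, regardless of side.
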


#### 2. Defensive Actions

In [193]:
def extract_defensive_stats(events_df):
    """
    Extract defensive statistics from match/player events
    Processes StatsBomb event types: Duel, Interception,
    Block, Clearance, Ball Recovery, Pressure, Dispossessed
    
    Args:
        events_df (pd.DataFrame): StatsBomb events for a single match or player
    
    Returns:
        dict: Dictionary with aggregated defensive metrics
    """

    stats = {}

    # DUELS EVENTS
    duels = events_df[events_df['type'] == 'Duel']
    stats['duels_attempted'] = len(duels)
    stats['duels_won'] = (duels['duel_outcome'] == 'Won').sum() if 'duel_outcome' in duels else 0
    stats['duels_lost'] = stats['duels_attempted'] - stats['duels_won']
    stats['duels_ratio'] = (
        stats['duels_won'] / stats['duels_attempted']
        if stats['duels_attempted'] > 0 else 0
    )

    # INTERCEPTIONS EVENTS
    interceptions = events_df[events_df['type'] == 'Interception']
    stats['interceptions'] = len(interceptions)
    if 'interception_outcome' in interceptions:
        stats['interceptions_won'] = (interceptions['interception_outcome'] == 'Won').sum()
        stats['interceptions_lost'] = stats['interceptions'] - stats['interceptions_won']
    else:
        stats['interceptions_won'] = stats['interceptions']
        stats['interceptions_lost'] = 0

    stats['interceptions_ratio'] = (
        stats['interceptions_won'] / stats['interceptions']
        if stats['interceptions'] > 0 else 0
    )

    # BLOCKS EVENTS
    blocks = events_df[events_df['type'] == 'Block']
    stats['blocks'] = len(blocks)

    # CLEARANCES EVENTS
    clearances = events_df[events_df['type'] == 'Clearance']
    stats['clearances'] = len(clearances)

    # BALL RECOVERIES EVENTS
    recoveries = events_df[events_df['type'] == 'Ball Recovery']
    stats['ball_recoveries'] = len(recoveries)

    # PRESSURES EVENTS
    pressures = events_df[events_df['type'] == 'Pressure']
    stats['pressures'] = len(pressures)

    # DISPOSSESSED EVENTS
    dispossessed = events_df[events_df['type'] == 'Dispossessed']
    stats['times_dispossessed'] = len(dispossessed)

    # Round only selected float stats
    for key in ['duels_ratio', 'interceptions_ratio']:
        if key in stats and isinstance(stats[key], (float, np.floating)):
            stats[key] = round(stats[key], 2)

    return stats


In [195]:
# TEST ON A SINGLE PLAYER

# Load events for that match 
events_df = sb.events(match_id=match_id)

# Extract unique players from events (skip NaNs)
players_in_match = events_df[['player_id', 'player', 'team']].dropna().drop_duplicates()

# Pick one player at random
player_row = players_in_match.iloc[randint(0, len(players_in_match)-1)]
player_id = player_row['player_id']
player_name = player_row['player']
team_name = player_row['team']

# Filter events for that player
player_events = events_df[events_df['player_id'] == player_id]

print("\nEXAMPLE PLAYER SELECTED")
print(f"Player : {player_name}")
print(f"Team   : {team_name}")
print(f"Total events for player in match: {len(player_events)}")

# Extract defensive stats
player_stats = extract_defensive_stats(player_events)

# Print summary
print("Defensive Stats for Player:")
print(f"Player : {player_name}")
print(f"Team   : {team_name}")
print(f"Match  : {first_match['home_team']} vs {first_match['away_team']} (ID {match_id})\n")

print("Extracted Defensive stats:\n")
for k, v in player_stats.items():
    print(f"{k}: {v}")



EXAMPLE PLAYER SELECTED
Player : Steve Cook
Team   : AFC Bournemouth
Total events for player in match: 163
Defensive Stats for Player:
Player : Steve Cook
Team   : AFC Bournemouth
Match  : Leicester City vs AFC Bournemouth (ID 3754058)

Extracted Defensive stats:

duels_attempted: 2
duels_won: 0
duels_lost: 2
duels_ratio: 0.0
interceptions: 3
interceptions_won: 1
interceptions_lost: 2
interceptions_ratio: 0.33
blocks: 1
clearances: 12
ball_recoveries: 2
pressures: 7
times_dispossessed: 0
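
The two extractors treat empty denominators differently: `extract_offensive_stats` returns `NaN` when there are no attempts, while `extract_defensive_stats` returns `0`, which makes "no duels" indistinguishable from "lost every duel". One possible way to unify the behaviour is sketched below. The helper name `safe_ratio` is hypothetical and is not used elsewhere in this notebook.

```python
import math

def safe_ratio(successes: int, attempts: int, ndigits: int = 2) -> float:
    """Success ratio rounded to `ndigits`; NaN when there were no attempts."""
    if attempts <= 0:
        return math.nan
    return round(successes / attempts, ndigits)

# Values taken from the example outputs above
print(safe_ratio(1, 3))   # 1 interception won out of 3
print(safe_ratio(0, 2))   # 0 duels won out of 2
print(safe_ratio(0, 0))   # no attempts at all
```

Keeping `NaN` for the empty case lets later aggregations (for example a mean across matches) skip matches in which the player had no attempts, instead of counting them as a 0% success rate.
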


#### 3. Goalkeeping

#### 4. Discipline and Fouls

#### 5. Context and Playing Time