# Data Validation

## Overview
This notebook validates the cleaned dataset (`partizan_2022_cleaned.csv`) to ensure its accuracy and reliability for further analysis. It performs a series of checks on the 2022-2023 EuroLeague season data for Partizan Mozzart Bet Belgrade, including:
- Comparing key metrics (e.g., points, assists) between summed player stats and team totals to confirm they match where expected.
- Investigating discrepancies in stats like rebounds and turnovers, which may differ due to basketball-specific recording practices.
- Verifying internal consistency, such as ensuring total rebounds equal the sum of offensive and defensive rebounds.
- Validating complex calculations, including field goal points and the Performance Index Rating (PIR).

These steps ensure the dataset is trustworthy for subsequent analyses, such as efficiency metrics or trend evaluations.

In [12]:
import pandas as pd

file_path = "../data/partizan_2022_cleaned.csv"

df = pd.read_csv(file_path)

Display a summary of the DataFrame to confirm its structure and ensure the data loaded correctly. 

In [13]:
# df.info() prints the summary itself and returns None, so no print() is needed
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 27 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   game                    500 non-null    object
 1   round                   500 non-null    int64 
 2   phase                   500 non-null    object
 3   is_starter              500 non-null    bool  
 4   is_playing              500 non-null    bool  
 5   player                  500 non-null    object
 6   minutes                 500 non-null    object
 7   points                  500 non-null    int64 
 8   two_points_made         500 non-null    int64 
 9   two_points_attempted    500 non-null    int64 
 10  three_points_made       500 non-null    int64 
 11  three_points_attempted  500 non-null    int64 
 12  free_throws_made        500 non-null    int64 
 13  free_throws_attempted   500 non-null    int64 
 14  offensive_rebounds      500 non-null    int64 
 15  defens

Separate the dataset into team totals and individual player statistics to enable comparison between the two.

In [14]:
# split the rows into team totals and individual player stats
partizan_team = df[df['player'] == 'PARTIZAN MOZZART BET BELGRADE'].copy()
partizan_players = df[df['player'] != 'PARTIZAN MOZZART BET BELGRADE'].copy()
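The split above implicitly assumes that every game contributes exactly one team-total row. A minimal sketch of that sanity check follows, run on a small made-up frame; the `toy` data and the `TEAM_LABEL` constant are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

TEAM_LABEL = "PARTIZAN MOZZART BET BELGRADE"

# Hypothetical miniature dataset: two games, each with one team-total row.
toy = pd.DataFrame({
    "game": ["G1", "G1", "G1", "G2", "G2", "G2"],
    "player": ["Player A", "Player B", TEAM_LABEL,
               "Player A", "Player B", TEAM_LABEL],
    "points": [10, 12, 22, 8, 15, 23],
})

# Count team-total rows per game (True counts as 1 when summed).
is_team_row = toy["player"] == TEAM_LABEL
team_rows_per_game = is_team_row.groupby(toy["game"]).sum()

# Any game listed here has zero or several team rows and would skew the comparison.
games_with_bad_team_rows = team_rows_per_game[team_rows_per_game != 1]
print(games_with_bad_team_rows.empty)
```

On the real data, the same check could be run with `df` in place of `toy`.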

Confirm that metrics whose player sums should exactly equal the team totals (e.g., points, assists, free throws, fouls) are consistent across the dataset.

In [15]:
# cross-check team totals against the sum of individual player stats for accuracy
# (partizan_team and partizan_players were already created in the previous cell)


# focus on metrics that should match perfectly
# total_seconds
partizan_players_total_seconds = partizan_players['total_seconds'].sum()
partizan_team_total_seconds = partizan_team['total_seconds'].sum() 

# points
partizan_players_total_points = partizan_players['points'].sum()
partizan_team_total_points = partizan_team['points'].sum()

# assists
partizan_players_total_assists = partizan_players['assists'].sum()
partizan_team_total_assists = partizan_team['assists'].sum()

# free throws made 
partizan_players_ft_made = partizan_players['free_throws_made'].sum()
partizan_team_ft_made = partizan_team['free_throws_made'].sum()

# free throws attempted
partizan_players_ft_attempted = partizan_players['free_throws_attempted'].sum()
partizan_team_ft_attempted = partizan_team['free_throws_attempted'].sum()

# fouls committed 
partizan_players_fouls_committed = partizan_players['fouls_committed'].sum()
partizan_team_fouls_committed = partizan_team['fouls_committed'].sum()

# fouls received
partizan_players_fouls_received = partizan_players['fouls_received'].sum()
partizan_team_fouls_received = partizan_team['fouls_received'].sum()



def check_metrics():
    if (partizan_players_total_seconds == partizan_team_total_seconds and
        partizan_players_total_points == partizan_team_total_points and
        partizan_players_total_assists == partizan_team_total_assists and
        partizan_players_ft_made == partizan_team_ft_made and
        partizan_players_ft_attempted == partizan_team_ft_attempted and
        partizan_players_fouls_committed == partizan_team_fouls_committed and
        partizan_players_fouls_received == partizan_team_fouls_received):
        print("All metrics that should match perfectly are matching perfectly")
    else:
        print("At least one metric that should match perfectly does not match")
        

check_metrics()

All metrics that should match perfectly are matching perfectly
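The check above collapses everything into a single pass/fail message, so a failure would not say which metric is off. One possible variant reports the difference per metric; this is a sketch with hypothetical season totals, where `player_sums` and `team_sums` stand in for `partizan_players[metrics].sum()` and `partizan_team[metrics].sum()`.

```python
import pandas as pd

# Hypothetical totals for illustration only; not the notebook's real values.
metrics = ["points", "assists", "free_throws_made"]
player_sums = pd.Series([2500, 600, 450], index=metrics)
team_sums = pd.Series([2500, 600, 450], index=metrics)

# Team total minus player sum, one entry per metric.
differences = team_sums - player_sums
mismatched = differences[differences != 0]

if mismatched.empty:
    print("All metrics that should match perfectly are matching perfectly")
else:
    print("Mismatched metrics (team total minus player sum):")
    print(mismatched)
```

Keeping the metric names in a list also makes it cheap to add another column to the check later.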


Examine metrics (e.g., rebounds, steals, turnovers) where discrepancies between player sums and team totals are expected due to basketball recording practices, and explain any differences.

In [16]:
# discrepancies in other stats may reflect inherent differences in how team vs. player stats are recorded.
partizan_players_total_rebounds = partizan_players['total_rebounds'].sum()
partizan_team_total_rebounds = partizan_team['total_rebounds'].sum()

partizan_players_total_steals = partizan_players['steals'].sum()
partizan_team_total_steals = partizan_team['steals'].sum()

partizan_players_total_turnovers = partizan_players['turnovers'].sum()
partizan_team_total_turnovers = partizan_team['turnovers'].sum()



def check_and_explain_metrics():
    if (partizan_players_total_rebounds == partizan_team_total_rebounds and
        partizan_players_total_steals == partizan_team_total_steals and
        partizan_players_total_turnovers == partizan_team_total_turnovers):
        print("All metrics are matching perfectly")
    if (partizan_players_total_rebounds != partizan_team_total_rebounds):
        print("Total rebounds are not matching perfectly, Reason: Team rebounds (e.g., dead-ball rebounds not attributed to any player) are included in the team total but not in individual stats.")
    if (partizan_players_total_steals != partizan_team_total_steals):
        print('Total steals are not matching perfectly, Reason: Discrepancy likely due to "team steals" (e.g., deflections not credited to a specific player) in the team total.')
    if (partizan_players_total_turnovers != partizan_team_total_turnovers):
        print("Total turnovers are not matching perfectly, Reason: Team turnovers (e.g., shot-clock violations, 8-second violations) are counted in the team total but not assigned to individual players.")

check_and_explain_metrics()



Total rebounds are not matching perfectly, Reason: Team rebounds (e.g., dead-ball rebounds not attributed to any player) are included in the team total but not in individual stats.
Total turnovers are not matching perfectly, Reason: Team turnovers (e.g., shot-clock violations, 8-second violations) are counted in the team total but not assigned to individual players.


Ensure that the total rebounds for each row in the dataset equal the sum of offensive and defensive rebounds, checking internal data consistency.

In [17]:
# check if total_rebounds = offensive + defensive rebounds
# (computed as a standalone Series so df itself is never modified)
calculated_total_rebounds = df['offensive_rebounds'] + df['defensive_rebounds']
inconsistent_rebounds = df[df['total_rebounds'] != calculated_total_rebounds]
if not inconsistent_rebounds.empty:
    print("Rebound totals inconsistent with offensive/defensive splits!")
else:
    print("Rebound totals consistent with offensive/defensive splits.")

Rebound totals consistent with offensive/defensive splits.


Quantify the differences between team totals and player sums for rebounds and turnovers to measure "team" contributions not attributed to individual players.

In [18]:
# calculate the difference to see how many "team rebounds" are included in the team total but not in individual stats
rebound_difference = partizan_team_total_rebounds - partizan_players_total_rebounds
print(f"Unassigned team rebounds: {rebound_difference}")

# calculate the difference to see how many "team turnovers" are counted in the team total but not assigned to individual players
turnover_difference = partizan_team_total_turnovers - partizan_players_total_turnovers
print(f"Unassigned team turnovers: {turnover_difference}")


Unassigned team rebounds: 135
Unassigned team turnovers: 30
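The season-level differences can also be broken down per game, which helps spot a single game with an implausible value. The sketch below uses a made-up two-game frame; `toy` and `TEAM_LABEL` are illustrative assumptions, and the column names mirror those used in the notebook.

```python
import pandas as pd

TEAM_LABEL = "PARTIZAN MOZZART BET BELGRADE"

# Hypothetical miniature dataset with team rows that exceed the player sums.
toy = pd.DataFrame({
    "game": ["G1", "G1", "G1", "G2", "G2", "G2"],
    "player": ["Player A", "Player B", TEAM_LABEL,
               "Player A", "Player B", TEAM_LABEL],
    "total_rebounds": [5, 7, 15, 6, 4, 12],
    "turnovers": [2, 3, 6, 1, 2, 3],
})

cols = ["total_rebounds", "turnovers"]
is_team_row = toy["player"] == TEAM_LABEL

# Per-game team totals and per-game player sums, indexed by game.
team_per_game = toy[is_team_row].groupby("game")[cols].sum()
players_per_game = toy[~is_team_row].groupby("game")[cols].sum()

# Stats credited to the team but to no individual player, per game.
unassigned_per_game = team_per_game - players_per_game
print(unassigned_per_game)
```

A negative entry in such a table would point to a data error rather than a team stat, since the team total should never be lower than the sum of its players.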


Verify that points from field goals (2-pointers and 3-pointers) align with total points minus free throws, ensuring accuracy for both players and the team.

In [19]:
def validate_field_goals():
    """Validates field goal points consistency for players and team."""
    # player field goal points vs. points minus free throws
    field_goal_points = partizan_players['two_points_made'] * 2 + partizan_players['three_points_made'] * 3
    points_minus_ft = partizan_players['points'] - partizan_players['free_throws_made']
    player_errors = partizan_players[field_goal_points != points_minus_ft]

    # aggregate check: field goal points summed over all players vs. team points minus the players' free throws
    team_field_goal_points = (partizan_players['two_points_made'].sum() * 2) + (partizan_players['three_points_made'].sum() * 3)
    team_points_minus_ft = partizan_team['points'].sum() - partizan_players['free_throws_made'].sum()
    team_discrepancy = team_field_goal_points - team_points_minus_ft

    if player_errors.empty and team_discrepancy == 0:
        print("Field goals are accurate for all players and team totals.")
    else:
        if not player_errors.empty:
            print("Player field goal discrepancies found:")
            print(player_errors[['player', 'points', 'two_points_made', 'three_points_made', 'free_throws_made']])
        if team_discrepancy != 0:
            print(f"Team field goals mismatch: Expected {team_field_goal_points}, got {team_points_minus_ft} (difference: {team_discrepancy}).")
            
validate_field_goals()

Field goals are accurate for all players and team totals.
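The identity behind this check is `points - free_throws_made == 2 * two_points_made + 3 * three_points_made`. As a small illustration, the sketch below applies it to two hypothetical stat lines, one consistent and one deliberately off by a point; the `toy` rows are invented for the example.

```python
import pandas as pd

# Hypothetical stat lines: Player A is consistent, Player B is off by one point.
toy = pd.DataFrame({
    "player": ["Player A", "Player B"],
    "points": [17, 11],
    "two_points_made": [4, 3],
    "three_points_made": [2, 1],
    "free_throws_made": [3, 1],
})

# Player A: 4*2 + 2*3 = 14 and 17 - 3 = 14, so the row passes.
# Player B: 3*2 + 1*3 = 9 but 11 - 1 = 10, so the row is flagged.
field_goal_points = toy["two_points_made"] * 2 + toy["three_points_made"] * 3
points_minus_ft = toy["points"] - toy["free_throws_made"]
flagged = toy[field_goal_points != points_minus_ft]
print(flagged["player"].tolist())
```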


Ensure that for each game, the sum of player stats matches the team totals for points, assists, and total seconds, validating consistency on a game-by-game basis.

In [20]:
# for each game, check team totals vs. player sums
games = df['game'].unique()
all_metrics_match = True

for game in games:
    game_data = df[df['game'] == game]
    team_row = game_data[game_data['player'] == 'PARTIZAN MOZZART BET BELGRADE']
    players_in_game = game_data[game_data['player'] != 'PARTIZAN MOZZART BET BELGRADE']
    
    # check points
    if team_row['points'].sum() != players_in_game['points'].sum():
        print(f"Points mismatch in game {game}")
        all_metrics_match = False
    
    # check assists
    if team_row['assists'].sum() != players_in_game['assists'].sum():
        print(f"Assists mismatch in game {game}")
        all_metrics_match = False
        
    # check seconds
    if team_row['total_seconds'].sum() != players_in_game['total_seconds'].sum():
        print(f"Seconds mismatch in game {game}")
        all_metrics_match = False
    


if all_metrics_match:
    print("Team totals match player sums for all metrics in all games.")

Team totals match player sums for all metrics in all games.
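The per-game loop can also be expressed with `groupby`, which compares all games and metrics at once. The following is a sketch on made-up data, not a replacement for the cell above; `toy` and `TEAM_LABEL` are illustrative, and game `G2` carries a deliberate assists mismatch so the output is not empty.

```python
import pandas as pd

TEAM_LABEL = "PARTIZAN MOZZART BET BELGRADE"

# Hypothetical miniature dataset; G2's team assists (8) differ from the player sum (7).
toy = pd.DataFrame({
    "game": ["G1", "G1", "G1", "G2", "G2", "G2"],
    "player": ["Player A", "Player B", TEAM_LABEL,
               "Player A", "Player B", TEAM_LABEL],
    "points": [10, 12, 22, 8, 15, 23],
    "assists": [3, 4, 7, 2, 5, 8],
    "total_seconds": [6000, 6000, 12000, 6000, 6000, 12000],
})

cols = ["points", "assists", "total_seconds"]
is_team_row = toy["player"] == TEAM_LABEL

team_per_game = toy[is_team_row].groupby("game")[cols].sum()
players_per_game = toy[~is_team_row].groupby("game")[cols].sum()

# True marks a metric whose team total differs from the player sum in that game.
mismatches = team_per_game.ne(players_per_game)
games_with_mismatch = mismatches[mismatches.any(axis=1)]
print(games_with_mismatch)
```

The resulting table has one row per problematic game and one boolean column per metric, so it shows both where and what failed.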


Confirm that the Performance Index Rating (PIR), a composite performance metric, is calculated correctly for both players and the team using the EuroLeague formula.

In [21]:
def validate_valuation():
    """Validates PIR (Performance Index Rating) for players and the team.

    PIR = (points + rebounds + assists + steals + blocks + fouls drawn)
        - (missed field goals + missed free throws + turnovers + shots rejected + fouls committed)
    """
    
    # calculate missed shots for players
    partizan_players.loc[:, 'missed_field_goals'] = (
        (partizan_players['two_points_attempted'] - partizan_players['two_points_made']) +
        (partizan_players['three_points_attempted'] - partizan_players['three_points_made'])
    )
    partizan_players.loc[:, 'missed_free_throws'] = (
        partizan_players['free_throws_attempted'] - partizan_players['free_throws_made']
    )

    # calculate missed shots for team
    partizan_team.loc[:, 'missed_field_goals'] = (
        (partizan_team['two_points_attempted'] - partizan_team['two_points_made']) +
        (partizan_team['three_points_attempted'] - partizan_team['three_points_made'])
    )
    partizan_team.loc[:, 'missed_free_throws'] = (
        partizan_team['free_throws_attempted'] - partizan_team['free_throws_made']
    )

    # calculate expected valuation for players
    expected_player_valuation = (
        partizan_players['points'] +
        partizan_players['total_rebounds'] + 
        partizan_players['assists'] + 
        partizan_players['steals'] + 
        partizan_players['blocks_favour'] + 
        partizan_players['fouls_received'] -
        (
            partizan_players['missed_field_goals'] +
            partizan_players['missed_free_throws'] + 
            partizan_players['turnovers'] + 
            partizan_players['blocks_against'] + 
            partizan_players['fouls_committed']
        )
    )

    # calculate expected valuation for team
    expected_team_valuation = (
        partizan_team['points'] +
        partizan_team['total_rebounds'] + 
        partizan_team['assists'] + 
        partizan_team['steals'] + 
        partizan_team['blocks_favour'] + 
        partizan_team['fouls_received'] -
        (
            partizan_team['missed_field_goals'] +
            partizan_team['missed_free_throws'] + 
            partizan_team['turnovers'] + 
            partizan_team['blocks_against'] + 
            partizan_team['fouls_committed']
        )
    )

    # validate players and team valuation
    player_validation = (partizan_players['valuation'] == expected_player_valuation).all()
    team_validation = (partizan_team['valuation'] == expected_team_valuation).all()
    
    
    
    if player_validation:
        print("Valuation is accurate for all players.")
    else:
        print("Player valuation mismatch detected!")
        
    if team_validation:
        print("Valuation is accurate for the team.")
    else:
        print("Team valuation mismatch detected!")
        
    
validate_valuation()

Valuation is accurate for all players.
Valuation is accurate for the team.
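To make the formula concrete, here is a worked example on a single hypothetical stat line. The helper `performance_index_rating` is an illustrative function written for this example and is not part of the notebook; its parameters follow the column names used above.

```python
def performance_index_rating(points, rebounds, assists, steals, blocks_favour,
                             fouls_received, missed_field_goals,
                             missed_free_throws, turnovers, blocks_against,
                             fouls_committed):
    """Return PIR as positive contributions minus negative contributions."""
    positive = (points + rebounds + assists + steals
                + blocks_favour + fouls_received)
    negative = (missed_field_goals + missed_free_throws + turnovers
                + blocks_against + fouls_committed)
    return positive - negative


# Hypothetical stat line: positives sum to 35, negatives to 12.
pir = performance_index_rating(
    points=17, rebounds=6, assists=4, steals=2, blocks_favour=1,
    fouls_received=5, missed_field_goals=5, missed_free_throws=1,
    turnovers=3, blocks_against=1, fouls_committed=2,
)
print(pir)  # 35 - 12 = 23
```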


## Conclusion

### The validation process demonstrates that:

- Metrics expected to match (e.g., points, assists) are consistent between player sums and team totals.
- Discrepancies in rebounds (135 unassigned) and turnovers (30 unassigned) are explained by basketball recording practices.
- Internal checks, like rebound consistency, pass without issues.
- Complex metrics, including field goal points and PIR, are accurately calculated.
- Per-game validations show no inconsistencies.

The dataset `partizan_2022_cleaned.csv` is now validated and ready for advanced analyses.