# FIFA World Cup Players Data — Cleaning Notebook

## About This Notebook

`WorldCupPlayers.csv` contains a record for every player's participation in every World Cup match — 37784 rows covering squad selections, positions, shirt numbers, events (goals, cards), and coaching assignments.

**What this notebook does:**

- Verifies that 736 duplicates are complete row-level duplicates before removing them (not just key matches with different values)
- Fills null `Position` values with `'Outfield'` — a null here means the player wasn't a goalkeeper or captain, so the fill is domain-informed rather than arbitrary
- Fills null `Event` values with an empty string — no event recorded is a valid state, not missing data
- Standardises position notation: `GKC` → `GK, C`, `C` → `Outfield, C`
- Converts ID columns to `uint32` and `Shirt Number` to `uint8`
- Investigates shirt number 0: all 3069 affected records come from early tournaments (1930s–50s). They're kept as-is with a note rather than imputed
- Validates coach match counts (groupby + nunique) to check for anomalies
- Checks Line-up distribution to verify squad sizes are reasonable

**Tools:** pandas · numpy


Importing the numpy and pandas libraries

In [None]:
import numpy as np
import pandas as pd

Reading WorldCupPlayers csv

In [None]:
worldcup_players = pd.read_csv('../data/WorldCupPlayers.csv', encoding='utf-8')

Exploring WorldCupPlayers table

In [None]:
worldcup_players.head(10)

Having a glimpse of the columns info

In [None]:
worldcup_players.info()

No null values in any columns except for Position and Event

Checking min, max, count stats about numerical-valued columns

In [None]:
worldcup_players.describe()

The minimum Shirt Number is 0, which is suspicious and needs further investigation

In [None]:
worldcup_players.describe(include='object')

Further investigation is needed to check that the number of matches per coach is reasonable, and to review the unique Position values

Checking duplicates for a player in the same match

In [None]:
worldcup_players.duplicated(subset=['Player Name', 'Shirt Number', 'Team Initials', 'MatchID']).sum()

736 duplicated entries: the same player name with the same shirt number, playing for the same team in the same MatchID

Having a glimpse of the duplicated entries

In [None]:
worldcup_players.loc[worldcup_players.duplicated(subset=['Player Name', 'Shirt Number', 'Team Initials', 'MatchID']), :].head(10)

In [None]:
duplicates = worldcup_players.duplicated(subset=['Player Name', 'Shirt Number', 'Team Initials', 'MatchID'], keep=False)
worldcup_players.loc[duplicates & (worldcup_players['Player Name']=='JULIO CESAR'), :]

Checking that whenever a duplicate is detected on 'Player Name', 'Shirt Number', 'Team Initials', 'MatchID', the whole entry is duplicated

In [None]:
players_dup_no = worldcup_players.duplicated(subset=['Player Name', 'Shirt Number', 'Team Initials', 'MatchID']).sum()
entries_dup_no = worldcup_players.duplicated().sum()
if players_dup_no == entries_dup_no:
    print("Key-level duplicate count matches full-row duplicate count")
else:
    print("Mismatch: some key-level duplicates differ in other columns")

The duplicate entries appear to be exact copies with no differing values, so they are dropped

In [None]:
worldcup_players = worldcup_players.drop_duplicates()

Checking number of duplicates after cleaning

In [None]:
worldcup_players.duplicated(subset=['Player Name', 'Shirt Number', 'Team Initials', 'MatchID']).sum()

Duplicates dropped successfully
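The count comparison above says whether key-level and full-row duplicates agree, but not which rows would differ if they did not. A minimal sketch of a mask-based variant follows, assuming a small made-up frame (`toy`) in place of the real data; the `Event` values are invented for illustration.

```python
import pandas as pd

# Toy data standing in for WorldCupPlayers.csv (all values are made up).
toy = pd.DataFrame({
    'Player Name': ['A', 'A', 'B', 'B', 'C'],
    'Shirt Number': [1, 1, 9, 9, 7],
    'Team Initials': ['AAA', 'AAA', 'BBB', 'BBB', 'CCC'],
    'MatchID': [10, 10, 10, 10, 11],
    'Event': ['G40', 'G40', '', 'Y12', ''],
})

key = ['Player Name', 'Shirt Number', 'Team Initials', 'MatchID']

# keep=False marks every member of a duplicate group, not only the repeats.
key_dup = toy.duplicated(subset=key, keep=False)
row_dup = toy.duplicated(keep=False)

# Rows that share a key with another row but are not exact copies of it.
conflicting = toy.loc[key_dup & ~row_dup]
print(conflicting)
```

An empty `conflicting` frame would mean every key-level duplicate is also a full-row duplicate, so `drop_duplicates()` loses no information.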

Checking Position unique values

In [None]:
worldcup_players['Position'].unique()

Position column here specifies whether this player is a goalkeeper, captain of the team, or both at the same time

If a Position cell is null, the player is an outfield player (neither a GK nor a captain), so it is filled with 'Outfield'

In [None]:
worldcup_players['Position'] = worldcup_players['Position'].fillna('Outfield')

Checking filling Position null cells with 'Outfield'

In [None]:
worldcup_players['Position'].unique()

Filling null done successfully

For consistency, the Position column should hold one of these options for each player: ['GK', 'GK, C', 'Outfield', 'Outfield, C']

In [None]:
mask_gkc = worldcup_players['Position']=='GKC'
mask_c = worldcup_players['Position']=='C'

# The masks select exact matches, so the new labels can be assigned directly
worldcup_players.loc[mask_gkc, 'Position'] = 'GK, C'
worldcup_players.loc[mask_c, 'Position'] = 'Outfield, C'

Checking Position updates

In [None]:
worldcup_players['Position'].unique()
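As an alternative to the mask-based updates, the fill and the relabelling can be expressed as a single mapping and then checked against the allowed categories. This is a sketch on a made-up Series (`raw`); the names `position_map` and `allowed` are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy Position values, including a null (made up for illustration).
raw = pd.Series(['GK', 'GKC', 'C', np.nan, 'GK'], name='Position')

# Series.replace with a dict matches whole values, so 'GKC' is not
# affected by the 'C' rule.
position_map = {'GKC': 'GK, C', 'C': 'Outfield, C'}
clean = raw.fillna('Outfield').replace(position_map)

# Anything outside the four agreed categories would show up here.
allowed = {'GK', 'GK, C', 'Outfield', 'Outfield, C'}
unexpected = set(clean.unique()) - allowed

print(clean.tolist())
print(unexpected)
```

Checking `unexpected` makes the "four categories" rule explicit instead of relying on reading the output of `unique()`.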

Checking Event unique values

In [None]:
worldcup_players['Event'].unique()

If an Event cell is null, the player had no recorded event in that match, so it is filled with an empty string

In [None]:
worldcup_players['Event'] = worldcup_players['Event'].fillna('')

Checking null cells count in table

In [None]:
worldcup_players.info()

In [None]:
worldcup_players.isnull().sum()

No null cells left

Checking unique values in RoundID

In [None]:
worldcup_players['RoundID'].unique()

Checking unique values in MatchID

In [None]:
worldcup_players['MatchID'].unique()

Updating the data types of the numerical columns: they hold only integer values, so they don't need to be float. The bit width of each type is also chosen to fit the range of values, to reduce memory usage

In [None]:
worldcup_players['RoundID'] = worldcup_players['RoundID'].astype(np.uint32)
worldcup_players['MatchID'] = worldcup_players['MatchID'].astype(np.uint32)
worldcup_players['Shirt Number'] = worldcup_players['Shirt Number'].astype(np.uint8)

Checking dtype of columns after modification

In [None]:
worldcup_players.dtypes
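Casting to a narrow unsigned type is only safe when every value is a whole number inside the target range. The helper below, `fits_dtype`, is a hypothetical sketch of such a pre-check, run on made-up values rather than the real columns.

```python
import numpy as np
import pandas as pd


def fits_dtype(series: pd.Series, dtype) -> bool:
    """Return True if every value is a whole number within the range of `dtype`."""
    info = np.iinfo(dtype)
    # NaN fails this comparison, so columns with nulls are rejected too.
    is_whole = (series % 1 == 0).all()
    in_range = series.min() >= info.min and series.max() <= info.max
    return bool(is_whole and in_range)


# Toy float columns (values are made up for illustration).
toy = pd.DataFrame({
    'MatchID': [1001.0, 300000000.0],
    'Shirt Number': [0.0, 23.0],
})

print(fits_dtype(toy['MatchID'], np.uint32))      # large IDs still fit in 32 bits
print(fits_dtype(toy['MatchID'], np.uint8))       # but not in 8 bits
print(fits_dtype(toy['Shirt Number'], np.uint8))

# Cast only after the check passes.
if fits_dtype(toy['Shirt Number'], np.uint8):
    toy['Shirt Number'] = toy['Shirt Number'].astype(np.uint8)
```

Running a check like this before `astype` guards against out-of-range values being altered by the cast without any obvious error.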

Checking Shirt Number unique values

In [None]:
worldcup_players['Shirt Number'].unique()

Shirt number 0 is suspicious

Checking which players had the shirt number 0

In [None]:
worldcup_players.loc[worldcup_players['Shirt Number']==0, :]

It looks like the first 3069 player records have shirt number 0.

That suggests these are unknown shirt numbers, which makes sense: those matches were played long ago, when shirt numbers may not have been recorded properly

Hence, this dataset is treated under the assumption that a shirt number of 0 means the number is unknown
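One way to probe the "unknown number" assumption is to check whether zeros cover whole matches rather than scattered players. The sketch below uses a made-up frame (`toy`); `zero_share` and `mixed` are illustrative names.

```python
import pandas as pd

# Toy data: match 1 has no recorded numbers, match 2 has real ones (made up).
toy = pd.DataFrame({
    'MatchID': [1, 1, 1, 2, 2, 2],
    'Shirt Number': [0, 0, 0, 1, 7, 10],
})

# Share of zero shirt numbers within each match (mean of a boolean flag).
zero_share = (toy['Shirt Number'] == 0).groupby(toy['MatchID']).mean()

# Matches where only some players have 0 would weaken the assumption.
mixed = zero_share[(zero_share > 0) & (zero_share < 1)]

print(zero_share)
print(mixed)
```

If `mixed` is empty on the real data, zeros behave like a per-match "not recorded" flag, which is consistent with keeping them as-is.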

Checking Team Initials unique values

In [None]:
worldcup_players['Team Initials'].unique()

Checking that line-up sizes and the number of matches coached by each coach are reasonable

Checking starting lineup players count in each match

In [None]:
starting_mask = worldcup_players['Line-up'] == 'S'
starters_per_match = worldcup_players.loc[starting_mask, :].groupby('MatchID').size()

# Find any matches that don't have 22
anomalies = starters_per_match[starters_per_match != 22]

if len(anomalies) == 0:
    print("All matches have exactly 22 starters")
else:
    print(f"Found {len(anomalies)} matches with unusual starter counts:")
    print(anomalies)

Every match has 22 starting players across both teams, as expected
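A total of 22 starters per match would also be reached by a 12 and 10 split, so a per-team count is a slightly stricter check. This is a sketch on a made-up single match (`toy`), not the notebook's own validation.

```python
import pandas as pd

# One toy match: 11 starters per team plus a few non-starters (made up).
toy = pd.DataFrame({
    'MatchID': [1] * 25,
    'Team Initials': ['AAA'] * 11 + ['BBB'] * 11 + ['AAA'] * 2 + ['BBB'],
    'Line-up': ['S'] * 22 + ['N'] * 3,
})

starters = toy[toy['Line-up'] == 'S']

# Count starters for each team within each match.
per_team = starters.groupby(['MatchID', 'Team Initials']).size()

# Any (match, team) pair without exactly 11 starters is an anomaly.
bad = per_team[per_team != 11]

print(per_team)
print(bad)
```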

In [None]:
nonstarting_mask = worldcup_players['Line-up'] == 'N'
worldcup_players.loc[nonstarting_mask, :].groupby('MatchID').size().mean()

On average, 22.3 non-starting players are listed per match, which looks plausible

In [None]:
matches_per_coach = worldcup_players.groupby('Coach Name')['MatchID'].nunique()
matches_per_coach

In [None]:
matches_per_coach.max()

In [None]:
matches_per_coach.mean()

The number of matches per coach looks reasonable
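Beyond the maximum and mean, a structural check is that each team has a single coach within a match. The sketch below assumes that rule and uses made-up names (`toy`, `COACH X`, and so on) purely for illustration.

```python
import pandas as pd

# Toy data: team BBB is listed with two different coaches in match 1 (made up).
toy = pd.DataFrame({
    'MatchID': [1, 1, 1, 1, 2, 2],
    'Team Initials': ['AAA', 'AAA', 'BBB', 'BBB', 'AAA', 'AAA'],
    'Coach Name': ['COACH X', 'COACH X', 'COACH Y', 'COACH Z', 'COACH X', 'COACH X'],
})

# Number of distinct coaches recorded for each team in each match.
coaches_per_team_match = toy.groupby(['MatchID', 'Team Initials'])['Coach Name'].nunique()
conflicts = coaches_per_team_match[coaches_per_team_match > 1]

# Distinct matches per coach, largest first, to spot outliers at a glance.
matches_per_coach = (
    toy.groupby('Coach Name')['MatchID'].nunique().sort_values(ascending=False)
)

print(conflicts)
print(matches_per_coach)
```

Conflicts of this kind can point to spelling variants of a coach's name, which would also distort the per-coach match counts.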

Checking Coach Names unique values

In [None]:
worldcup_players['Coach Name'].unique()

Checking Player Names unique values

In [None]:
worldcup_players['Player Name'].unique()

Printing a cleaning summary, then exporting the clean WorldCupPlayers CSV to the generated directory

In [None]:
print("DATA CLEANING SUMMARY - WorldCupPlayers")

print(f"\n Dataset Overview:")
print(f"  Total player records: {len(worldcup_players):,}")
print(f"  Unique players: {worldcup_players['Player Name'].nunique():,}")
print(f"  Unique coaches: {worldcup_players['Coach Name'].nunique():,}")
print(f"  Unique matches: {worldcup_players['MatchID'].nunique():,}")
print(f"  Teams represented: {worldcup_players['Team Initials'].nunique()}")

print(f"\n Cleaning Actions Performed:")
print("  Duplicate rows removed: 736")
print("  Position null values filled: 33,641")
print("  Event null values filled: 28,715")
print("  Position values standardized: 4 categories")
print("  Shirt Number 0 records: 3,069 (kept as 'unknown')")

print(f"\n Data Quality Verification:")
print(f"  Null values remaining: {worldcup_players.isnull().sum().sum()}")
print(f"  Duplicate records: {worldcup_players.duplicated().sum()}")
print(f"  Position categories: {worldcup_players['Position'].unique()}")

print(f"\n Player Statistics:")
print(f"  Goalkeeper records: {(worldcup_players['Position'].str.contains('GK')).sum():,}")
print(f"  Captain records: {(worldcup_players['Position'].str.contains('C')).sum():,}")
print(f"  Players with events: {(worldcup_players['Event'] != '').sum():,}")

print(f"\n Memory Optimization:")
print(f"  Memory usage: {worldcup_players.memory_usage(deep=True).sum() / 1024 / 1024:.2f} MB")
print(f"  Avg per record: {worldcup_players.memory_usage(deep=True).sum() / len(worldcup_players):.0f} bytes")

In [None]:
worldcup_players.to_csv('../data/generated/WorldCupsPlayers_Clean.csv', index=False)