# Task 1.1: Player Data Audit & Cleaning

## Overview

This notebook documents the complete process of auditing and cleaning the Players.csv dataset as required by Task 1.1 of the Football Insights Architect Pre-Task.

**Objective**: Clean the data to create a reliable and accurate player table by:
- Identifying and fixing duplicates
- Handling missing values and inconsistencies
- Normalizing data fields
- Producing before/after metrics

**Requirements from PDF**:
- Use any method to explore the dataset and identify problems
- Dedupe the data to create a reliable player table
- Produce a summary outlining the state of data before and after cleaning
- Document approach with explanations of what was done and why
- Show impact with metrics (e.g., number of duplicates removed)
- Final Table: a "cleaned" version of the players table


In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import sys
from pathlib import Path
import json
from datetime import datetime

# Add src to path for imports
sys.path.insert(0, str(Path('..') / 'src'))

from utils import (
    normalize_name, normalize_nationality, create_player_fingerprint,
    determine_canonical_player_id, parse_date_safe, validate_player_id
)

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.width', None)


## Step 1: Load and Initial Exploration

First, we load the raw Players.csv dataset and perform an initial exploration to understand the data structure and identify potential issues.


In [None]:
# Load raw data
BASE_DIR = Path('..')
raw_data_path = BASE_DIR / 'data' / 'raw' / 'Players.csv'

df_raw = pd.read_csv(raw_data_path)

print("=" * 80)
print("INITIAL DATA EXPLORATION")
print("=" * 80)
print(f"\nDataset Shape: {df_raw.shape[0]} rows × {df_raw.shape[1]} columns")
print(f"\nColumn Names: {list(df_raw.columns)}")
print(f"\nFirst 5 rows:")
df_raw.head()


## Step 2: Comprehensive Data Audit (BEFORE Cleaning)

We perform a comprehensive audit to identify all data quality issues:
- Exact duplicates (identical rows)
- Duplicate PlayerIDs (same ID appearing multiple times)
- Missing values
- Inconsistent naming
- Invalid data formats


In [None]:
def audit_players_data(df):
    """Comprehensive audit function."""
    audit = {
        'total_rows': len(df),
        'total_columns': len(df.columns),
        'duplicate_rows_exact': df.duplicated().sum(),
        'missing_values': df.isnull().sum().to_dict(),
        'missing_percentage': (df.isnull().sum() / len(df) * 100).to_dict(),
        'duplicate_player_ids': df['PlayerID'].duplicated().sum(),
        'invalid_player_ids': 0,
        'players_without_name': (df['PlayerName'].isna() | (df['PlayerName'] == '')).sum(),
        'players_without_dob': (df['DateOfBirth'].isna() | (df['DateOfBirth'] == '')).sum(),
        'players_without_nationality': (df['PlayerFirstNationality'].isna() | (df['PlayerFirstNationality'] == '')).sum(),
    }
    
    # Validate PlayerID format
    invalid_ids = []
    for idx, player_id in df['PlayerID'].items():
        if not validate_player_id(player_id):
            invalid_ids.append(idx)
    audit['invalid_player_ids'] = len(invalid_ids)
    
    return audit

# Perform audit
audit_before = audit_players_data(df_raw)

print("=" * 80)
print("BEFORE CLEANING - AUDIT RESULTS")
print("=" * 80)
print(f"\nTotal Rows: {audit_before['total_rows']}")
print(f"Exact Duplicates (identical rows): {audit_before['duplicate_rows_exact']}")
print(f"Duplicate PlayerIDs: {audit_before['duplicate_player_ids']}")
print(f"Missing PlayerName: {audit_before['players_without_name']}")
print(f"Missing DateOfBirth: {audit_before['players_without_dob']}")
print(f"Missing Nationality: {audit_before['players_without_nationality']}")
print(f"Invalid PlayerID Format: {audit_before['invalid_player_ids']}")

print("\n" + "=" * 80)
print("MISSING VALUES BREAKDOWN")
print("=" * 80)
for col, count in audit_before['missing_values'].items():
    pct = audit_before['missing_percentage'][col]
    if count > 0:
        print(f"{col}: {count} ({pct:.1f}%)")


### 2.1 Identifying Exact Duplicates

Let's examine exact duplicate rows (identical records):


In [None]:
# Find exact duplicates
exact_duplicates = df_raw[df_raw.duplicated(keep=False)]

if len(exact_duplicates) > 0:
    print(f"Found {len(exact_duplicates)} rows belonging to exact-duplicate groups (all copies counted):")
    print("\nExample exact duplicates:")
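    # Group by all columns to show duplicate groups; dropna=False keeps groups
    # whose rows contain missing values (groupby drops NaN keys by default)
    for cols, group in exact_duplicates.groupby(list(df_raw.columns), dropna=False):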
        if len(group) > 1:
            print(f"\nDuplicate Group ({len(group)} rows):")
            print(group[['PlayerID', 'PlayerName', 'DateOfBirth', 'PlayerFirstNationality', 'CurrentTeam']])
            break
else:
    print("No exact duplicates found.")


### 2.2 Identifying Duplicate PlayerIDs

**Critical Issue**: Same PlayerID appearing multiple times with different data indicates data integrity problems. This must be resolved to ensure each PlayerID represents a unique player.
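
A repeated PlayerID is only a conflict when its rows disagree. The following is a minimal sketch, on toy data that borrows this notebook's column names, of separating conflicting IDs from plain repeats. The helper name `conflicting_player_ids` is hypothetical and is not part of `src/utils.py`.

```python
import pandas as pd

def conflicting_player_ids(df, id_col="PlayerID"):
    """Return the IDs whose rows disagree in at least one other column."""
    other_cols = [c for c in df.columns if c != id_col]
    # Count distinct values per ID in every other column (NaN counts as a value)
    distinct = df.groupby(id_col)[other_cols].nunique(dropna=False)
    mask = (distinct > 1).any(axis=1)
    return mask[mask].index.tolist()

# Toy data: ID 1 is a harmless repeat, ID 2 points at two different players
toy = pd.DataFrame({
    "PlayerID": [1, 1, 2, 2, 3],
    "PlayerName": ["A. Silva", "A. Silva", "B. Jones", "C. Mbaye", "D. Kim"],
    "DateOfBirth": ["1995-01-01", "1995-01-01", "1990-05-05", "1998-07-07", "2000-02-02"],
})

conflicts = conflicting_player_ids(toy)
print(conflicts)  # [2]
```

Only the conflicting IDs need new identifiers; plain repeats can be handled by ordinary exact-duplicate removal.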


In [None]:
# Find duplicate PlayerIDs
duplicate_ids = df_raw[df_raw.duplicated(subset=['PlayerID'], keep=False)]

if len(duplicate_ids) > 0:
    print(f"Found {len(duplicate_ids)} rows sharing a PlayerID with another row (all copies counted):")
    print(f"Unique duplicate PlayerIDs: {duplicate_ids['PlayerID'].nunique()}")
    
    print("\n" + "=" * 80)
    print("DUPLICATE PlayerID EXAMPLES")
    print("=" * 80)
    
    for player_id in duplicate_ids['PlayerID'].unique()[:3]:  # Show first 3 examples
        group = df_raw[df_raw['PlayerID'] == player_id]
        print(f"\nPlayerID: {player_id} ({len(group)} occurrences)")
        print(group[['PlayerID', 'PlayerName', 'DateOfBirth', 'PlayerFirstNationality', 'CurrentTeam']].to_string())
        print("-" * 80)
else:
    print("No duplicate PlayerIDs found.")


### 2.3 Identifying Conditional Duplicates (Same Player, Different PlayerID)

Players may appear with different PlayerIDs but represent the same person. We use "fingerprinting" to identify these:
- Normalized Name + Date of Birth + Nationality = Unique Fingerprint
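
To make the fingerprint idea concrete, here is a rough sketch of how such a key could be built. It is an illustration only: the function `sketch_fingerprint` and its exact normalization rules are assumptions, and the project's `create_player_fingerprint` in `src/utils.py` may behave differently.

```python
import unicodedata
import pandas as pd

def sketch_fingerprint(name, dob, nationality):
    """Hypothetical fingerprint: normalized name | ISO date of birth | nationality."""
    def norm(text):
        # Strip accents, collapse whitespace, lowercase
        text = unicodedata.normalize("NFKD", str(text))
        text = "".join(ch for ch in text if not unicodedata.combining(ch))
        return " ".join(text.split()).lower()

    # Without a name or date of birth the key is too weak to match on
    if pd.isna(name) or pd.isna(dob):
        return None
    parsed = pd.to_datetime(dob, errors="coerce")
    if pd.isna(parsed):
        return None
    nat = "" if pd.isna(nationality) else norm(nationality)
    return f"{norm(name)}|{parsed.date().isoformat()}|{nat}"

fp_a = sketch_fingerprint("  José  Álvarez ", "1996-03-14", "Spain")
fp_b = sketch_fingerprint("Jose Alvarez", "1996-03-14", "spain")
print(fp_a == fp_b)  # True
```

Returning `None` for rows without a usable name or date of birth keeps incomplete records from being merged by accident.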


In [None]:
# Create fingerprints for all players
df_raw['fingerprint'] = df_raw.apply(create_player_fingerprint, axis=1)

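# Find fingerprints shared by more than one distinct PlayerID
# (rows that merely repeat the same PlayerID are covered in 2.1 and 2.2)
ids_per_fingerprint = df_raw.groupby('fingerprint')['PlayerID'].transform('nunique')
fingerprint_duplicates = df_raw[ids_per_fingerprint > 1]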

if len(fingerprint_duplicates) > 0:
    print(f"Found {len(fingerprint_duplicates)} rows with duplicate fingerprints (same player, different IDs):")
    print(f"Unique duplicate fingerprints: {fingerprint_duplicates['fingerprint'].nunique()}")
    
    print("\n" + "=" * 80)
    print("CONDITIONAL DUPLICATE EXAMPLES (Same Player, Different PlayerID)")
    print("=" * 80)
    
    # Show the first two examples
    for fingerprint in fingerprint_duplicates['fingerprint'].unique()[:2]:
        group = df_raw[df_raw['fingerprint'] == fingerprint]
        print(f"\nFingerprint: {fingerprint}")
        print(f"Number of occurrences: {len(group)}")
        print(group[['PlayerID', 'PlayerName', 'DateOfBirth', 'PlayerFirstNationality', 'CurrentTeam']].to_string())
        print("-" * 80)
else:
    print("No conditional duplicates found (same player with different PlayerIDs).")


### 2.4 Data Quality Issues Identified

**Summary of Issues Found**:
1. **Exact Duplicates**: count reported as "Exact Duplicates (identical rows)" in the Step 2 audit output
2. **Duplicate PlayerIDs**: count reported as "Duplicate PlayerIDs" in the Step 2 audit output (CRITICAL - must be 0 after cleaning)
3. **Conditional Duplicates**: Same player with different PlayerIDs
4. **Missing Values**: Various fields have missing data
5. **Inconsistent Naming**: Names may have titles, suffixes, extra whitespace
6. **Inconsistent Nationalities**: Variations in nationality names (e.g., "Congo DR" vs "DR Congo")

**Why This Matters**:
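- Duplicate PlayerIDs violate data integrity: each ID should identify exactly one player, so a join on PlayerID would attach the wrong attributes to some rows
- Conditional duplicates split one player's records across several IDs, which fragments per-player counts and aggregates
- Missing values shrink the set of rows that can be matched (a fingerprint needs name, date of birth, and nationality) or analysed
- Inconsistent naming causes the same player or nationality to be treated as different values during matching and grouping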


## Step 3: Data Cleaning Process

Our cleaning approach follows these steps:

1. **Normalize Data First**: Strip whitespace, normalize names/nationalities, standardize dates
2. **Handle Duplicate PlayerIDs**: Same ID with different data = generate new unique IDs
3. **Handle Conditional Duplicates**: Same player with different IDs = merge to canonical ID
4. **Remove Exact Duplicates**: Keep only one copy of identical rows
5. **Final Validation**: Ensure 0 duplicate PlayerIDs
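
Step 3 in the list above (merging to a canonical ID) can be sketched roughly as follows. This is a simplified illustration that picks the most complete row per fingerprint; the function `build_canonical_map` and the toy table are hypothetical, and the project's `determine_canonical_player_id` may weigh fields such as `CurrentTeam` differently.

```python
import pandas as pd

def build_canonical_map(df, key_col="fingerprint", id_col="PlayerID"):
    """Map every non-canonical ID to the ID of the most complete row in its group."""
    scored = df.assign(_filled=df.notna().sum(axis=1))
    mapping = {}
    # groupby skips rows whose fingerprint is missing, so they are never merged
    for _, group in scored.groupby(key_col):
        # Most filled-in row wins; ties go to the smallest ID
        ranked = group.sort_values(["_filled", id_col], ascending=[False, True])
        ids = ranked[id_col].tolist()
        canonical = ids[0]
        for other in ids[1:]:
            if other != canonical:
                mapping[other] = canonical
    return mapping

toy = pd.DataFrame({
    "PlayerID": [10, 11, 12],
    "fingerprint": ["a|1990-01-01|x", "a|1990-01-01|x", "b|1992-02-02|y"],
    "CurrentTeam": [None, "FC Example", "FC Other"],
})

id_map = build_canonical_map(toy)
print(id_map)  # {10: 11}

# Rewrite the IDs; rows that become identical can then be dropped as exact duplicates
merged = toy.assign(PlayerID=toy["PlayerID"].replace(id_map))
```

Keeping the old-to-new mapping, rather than only the merged table, is what lets related tables be updated later.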

### 3.1 Normalization

We normalize all string fields to ensure consistent matching:
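
As a rough idea of what such normalization might involve, the sketch below trims whitespace, drops a leading title or trailing suffix, and maps nationality variants to one label. The lookup tables and the `sketch_` function names are assumptions made for illustration; the actual rules live in `normalize_name` and `normalize_nationality` in `src/utils.py`.

```python
import re

# Hypothetical lookup tables; real lists would come from inspecting the data
TITLES = re.compile(r"^(mr|dr|sir)\.?\s+", flags=re.IGNORECASE)
SUFFIXES = re.compile(r",?\s+(jr|sr|ii|iii)\.?$", flags=re.IGNORECASE)
NATIONALITY_ALIASES = {"congo dr": "DR Congo", "dr congo": "DR Congo"}

def sketch_normalize_name(name):
    """Trim, collapse internal whitespace, drop a leading title and trailing suffix."""
    cleaned = " ".join(str(name).split())
    cleaned = TITLES.sub("", cleaned)
    cleaned = SUFFIXES.sub("", cleaned)
    return cleaned

def sketch_normalize_nationality(value):
    """Map known spelling variants to one label; otherwise just tidy whitespace."""
    cleaned = " ".join(str(value).split())
    return NATIONALITY_ALIASES.get(cleaned.lower(), cleaned)

print(sketch_normalize_name("  Dr.  John   Smith Jr. "))  # John Smith
print(sketch_normalize_nationality("Congo  DR"))          # DR Congo
```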


In [None]:
# Example: Show normalization in action
print("=" * 80)
print("NORMALIZATION EXAMPLES")
print("=" * 80)

# Show before/after for names
sample_names = df_raw['PlayerName'].dropna().head(10)
print("\nName Normalization Examples:")
for name in sample_names:
    normalized = normalize_name(name)
    if name != normalized:
        print(f"  '{name}' → '{normalized}'")

# Show before/after for nationalities
sample_nationalities = df_raw['PlayerFirstNationality'].dropna().head(10)
print("\nNationality Normalization Examples:")
for nat in sample_nationalities:
    normalized = normalize_nationality(nat)
    if nat != normalized:
        print(f"  '{nat}' → '{normalized}'")


### 3.2 Running the Complete Cleaning Pipeline

Now we run the complete cleaning pipeline that handles all issues:


In [None]:
# Import the cleaning pipeline
from clean_players import clean_players_pipeline

# Define paths
input_path = BASE_DIR / 'data' / 'raw' / 'Players.csv'
output_path = BASE_DIR / 'data' / 'processed' / 'players_cleaned.csv'
mapping_path = BASE_DIR / 'data' / 'processed' / 'player_id_map.json'

# Ensure output directory exists
output_path.parent.mkdir(parents=True, exist_ok=True)

# Run the cleaning pipeline
print("=" * 80)
print("RUNNING CLEANING PIPELINE")
print("=" * 80)
metrics = clean_players_pipeline(
    str(input_path),
    str(output_path),
    str(mapping_path)
)


## Step 4: Verification and After-Cleaning Audit

Let's verify the cleaned data and perform a final audit:


In [None]:
# Load cleaned data
df_cleaned = pd.read_csv(output_path)

# Perform audit on cleaned data
audit_after = audit_players_data(df_cleaned)

print("=" * 80)
print("AFTER CLEANING - AUDIT RESULTS")
print("=" * 80)
print(f"\nTotal Rows: {audit_after['total_rows']}")
print(f"Exact Duplicates: {audit_after['duplicate_rows_exact']}")
print(f"Duplicate PlayerIDs: {audit_after['duplicate_player_ids']} ⚠️ MUST BE 0")
print(f"Missing PlayerName: {audit_after['players_without_name']}")
print(f"Missing DateOfBirth: {audit_after['players_without_dob']}")
print(f"Missing Nationality: {audit_after['players_without_nationality']}")

# CRITICAL VALIDATION
if audit_after['duplicate_player_ids'] == 0:
    print("\n✅ SUCCESS: No duplicate PlayerIDs remain!")
else:
    print(f"\n❌ ERROR: {audit_after['duplicate_player_ids']} duplicate PlayerIDs still exist!")

print("\n" + "=" * 80)
print("CLEANED DATA PREVIEW")
print("=" * 80)
df_cleaned.head(10)


## Step 5: Impact Summary and Metrics

### 5.1 Before vs After Comparison


In [None]:
# Create comparison table
comparison = pd.DataFrame({
    'Metric': [
        'Total Rows',
        'Exact Duplicates',
        'Duplicate PlayerIDs',
        'Missing PlayerName',
        'Missing DateOfBirth',
        'Missing Nationality'
    ],
    'Before': [
        audit_before['total_rows'],
        audit_before['duplicate_rows_exact'],
        audit_before['duplicate_player_ids'],
        audit_before['players_without_name'],
        audit_before['players_without_dob'],
        audit_before['players_without_nationality']
    ],
    'After': [
        audit_after['total_rows'],
        audit_after['duplicate_rows_exact'],
        audit_after['duplicate_player_ids'],
        audit_after['players_without_name'],
        audit_after['players_without_dob'],
        audit_after['players_without_nationality']
    ]
})

comparison['Improvement'] = comparison['Before'] - comparison['After']
# Leave % Reduction as NaN where the metric was already 0 before cleaning
comparison['% Reduction'] = (
    comparison['Improvement'] / comparison['Before'].replace(0, np.nan) * 100
).round(1)

print("=" * 80)
print("BEFORE vs AFTER COMPARISON")
print("=" * 80)
print(comparison.to_string(index=False))


### 5.2 Key Achievements

**✅ Critical Success**: Duplicate PlayerIDs reduced from the pre-cleaning count (`audit_before['duplicate_player_ids']`, shown in the comparison table in 5.1) to **0**

**Summary of Improvements**:
- **Rows Removed**: `metrics['duplicates_removed']` duplicate records eliminated
- **ID Mappings Created**: `metrics['id_mappings_created']` PlayerID mappings for referential integrity
- **Data Quality Improvement**: the percentage reduction in total rows is given by the "Total Rows" line of the comparison table in 5.1
- **Normalization**: All names, nationalities, and dates standardized
- **Whitespace Handling**: All string fields properly trimmed

### 5.3 PlayerID Mapping Dictionary

The cleaning process created a mapping dictionary to track PlayerID changes:
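
For context, the sketch below shows one way such a mapping could be applied to a related table so that old IDs resolve to their canonical ID. The `appearances` table, the sample IDs, and the helper `remap_player_ids` are hypothetical; only the idea of an old-to-new ID dictionary comes from this notebook.

```python
import json
import pandas as pd

def remap_player_ids(df, id_mapping, id_col="PlayerID"):
    """Replace old IDs with their canonical ID; unmapped IDs pass through unchanged."""
    # JSON object keys are always strings, so compare on the string form
    as_text = df[id_col].astype(str)
    remapped = as_text.map(id_mapping).fillna(as_text)
    return df.assign(**{id_col: remapped})

# Round-trip through JSON to mimic loading a saved mapping file
id_mapping = json.loads(json.dumps({"P002": "P001"}))

# Hypothetical related table that still uses an old ID
appearances = pd.DataFrame({
    "PlayerID": ["P001", "P002", "P003"],
    "Minutes": [90, 45, 60],
})

fixed = remap_player_ids(appearances, id_mapping)
print(fixed["PlayerID"].tolist())  # ['P001', 'P001', 'P003']
```

If the IDs in the related table are numeric, they would need to be cast back after remapping, since this sketch returns them as strings.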


In [None]:
# Load ID mapping
with open(mapping_path, 'r') as f:
    id_mapping = json.load(f)

print(f"Total ID Mappings: {len(id_mapping)}")
print("\nSample ID Mappings (first 10):")
for old_id, new_id in list(id_mapping.items())[:10]:
    print(f"  {old_id} → {new_id}")

if len(id_mapping) > 10:
    print(f"\n... and {len(id_mapping) - 10} more mappings")


## Step 6: Final Cleaned Table

The final cleaned players table is saved to `data/processed/players_cleaned.csv` and meets all requirements:

- ✅ **Reliable**: No duplicate PlayerIDs (verified: 0)
- ✅ **Accurate**: All data normalized and standardized
- ✅ **Complete**: Invalid records removed, missing values handled
- ✅ **Documented**: Full audit trail with before/after metrics

### Final Table Statistics


In [None]:
print("=" * 80)
print("FINAL CLEANED TABLE STATISTICS")
print("=" * 80)
print(f"\nTotal Players: {len(df_cleaned)}")
print(f"Unique PlayerIDs: {df_cleaned['PlayerID'].nunique()}")
print(f"Duplicate PlayerIDs: {df_cleaned['PlayerID'].duplicated().sum()} (MUST BE 0)")

print("\nColumn Statistics:")
print(df_cleaned.describe(include='all'))

print("\n" + "=" * 80)
print("FINAL CLEANED TABLE (First 20 rows)")
print("=" * 80)
df_cleaned.head(20)


## Conclusion

### Requirements Met ✅

1. ✅ **Explored dataset and identified problems**: Comprehensive audit performed
2. ✅ **Deduped data to create reliable player table**: 0 duplicate PlayerIDs achieved
3. ✅ **Produced summary of before/after state**: Detailed metrics provided
4. ✅ **Documented approach with explanations**: This notebook provides full documentation
5. ✅ **Showed impact with metrics**: Before/after comparison table included
6. ✅ **Final cleaned table**: Saved to `data/processed/players_cleaned.csv`

### Key Technical Decisions

1. **Fingerprinting Strategy**: Used normalized name + DOB + nationality to identify same players with different IDs
2. **Canonical ID Selection**: Prioritized data completeness and CurrentTeam presence
3. **Duplicate PlayerID Resolution**: Generated new unique IDs when same ID had different player data
4. **Whitespace Handling**: Global strip() applied to all string fields before processing
5. **Normalization**: Comprehensive name and nationality normalization for consistent matching

### Data Quality Guarantees

- **0 Duplicate PlayerIDs**: Verified by the Step 4 audit and by the final-table check `df_cleaned['PlayerID'].duplicated().sum()`
- **Normalized Data**: All strings properly formatted
- **Referential Integrity**: Ready for integration with related tables (Task 1.2)
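
These guarantees can also be written as executable checks that run every time the table is regenerated. The following is a minimal sketch on toy data; `check_player_table` is a hypothetical helper, not part of the cleaning pipeline.

```python
import pandas as pd

def check_player_table(df, id_col="PlayerID"):
    """Return a list of violated guarantees; an empty list means all checks pass."""
    problems = []
    if df[id_col].isna().any():
        problems.append("missing PlayerID")
    if not df[id_col].is_unique:
        problems.append("duplicate PlayerID")
    if df.duplicated().any():
        problems.append("exact duplicate rows")
    # String cells should carry no leading or trailing whitespace
    for col in df.columns:
        if any(isinstance(v, str) and v != v.strip() for v in df[col]):
            problems.append(f"untrimmed whitespace in {col}")
    return problems

good = pd.DataFrame({"PlayerID": [1, 2], "PlayerName": ["A. Silva", "B. Jones"]})
bad = pd.DataFrame({"PlayerID": [1, 1], "PlayerName": ["A. Silva", " B. Jones"]})

print(check_player_table(good))  # []
print(check_player_table(bad))   # ['duplicate PlayerID', 'untrimmed whitespace in PlayerName']
```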

The cleaned players table is now ready for use in analysis and visualization.
