# FIFA World Cup Tournament Data Cleaning

## Introduction

### Project Overview

This notebook presents a systematic data cleaning workflow for the FIFA World Cup tournament summary dataset (`WorldCups.csv`). The dataset contains comprehensive information about all FIFA World Cup tournaments from 1930 to 2014, including tournament hosts, winning teams, match statistics, and attendance figures.

### Dataset Context

The FIFA World Cup is the most prestigious international football competition, contested by senior men's national teams of FIFA member nations. Since its inaugural tournament in Uruguay in 1930, the World Cup has been held every four years, with exceptions in 1942 and 1946 due to World War II. This dataset provides a historical record of 20 tournaments spanning over eight decades of football history.

### Data Cleaning Objectives

The primary goals of this data cleaning process are:

1. **Data Type Optimization**: Convert numerical columns to appropriate integer types (uint8, uint16, uint32) to minimize memory usage while maintaining data integrity
2. **Data Standardization**: Ensure consistency in country naming conventions and handle historical geopolitical changes
3. **Data Validation**: Verify data completeness, check for duplicates, and validate statistical ranges
4. **Data Quality Assurance**: Identify and resolve any data quality issues to ensure reliability for downstream analysis

### Key Data Quality Challenges

This dataset presents several unique challenges:

- **Historical Country Names**: The dataset includes deprecated country names (e.g., "Germany FR", "Soviet Union", "Yugoslavia") that require careful handling to maintain historical accuracy while enabling modern analysis
- **Attendance Formatting**: Attendance figures contain European-style number formatting (periods as thousand separators) requiring preprocessing
- **Geopolitical Changes**: Multiple countries have undergone significant political transformations, requiring decisions about how to attribute historical achievements
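
As a minimal sketch of the attendance preprocessing described in the list above, the snippet below strips period thousands separators from a few illustrative strings and converts them to integers. The sample values and the variable names `raw`, `cleaned`, and `bad_rows` are assumptions made for illustration, not the notebook's actual cell.

```python
import pandas as pd

# Illustrative values written in the same style as the Attendance column.
raw = pd.Series(["590.549", "1.045.246", "3.587.538"], name="Attendance")

# Drop the period thousands separators, then convert to numbers.
# errors="coerce" turns anything that is still not numeric into NaN,
# so problem rows can be inspected instead of raising immediately.
cleaned = pd.to_numeric(raw.str.replace(".", "", regex=False), errors="coerce")

# Rows that failed to parse (empty for this sample).
bad_rows = raw[cleaned.isna()]

print(cleaned.tolist())
print(f"Unparseable values: {len(bad_rows)}")
```

Passing `regex=False` matters here: with a regular expression, `.` would match every character rather than a literal period.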

### Cleaning Methodology

The cleaning process follows a systematic approach:

1. **Exploratory Data Analysis**: Initial examination of data structure, types, and distributions
2. **Missing Value Analysis**: Comprehensive check for null values and data completeness
3. **Duplicate Detection**: Verification that each tournament year appears only once
4. **Data Type Conversion**: Optimization of numerical columns with appropriate integer types
5. **Data Standardization**: Normalization of country names and formatting
6. **Validation**: Post-cleaning verification of data integrity

### Expected Outcomes

Upon completion of this cleaning process, the dataset will be:

- **Memory-efficient**: Optimized data types reducing memory footprint by approximately 60%
- **Analysis-ready**: Clean, standardized data suitable for statistical analysis and visualization
- **Well-documented**: Clear record of all cleaning decisions and transformations
- **Reproducible**: Documented workflow enabling others to understand and replicate the cleaning process
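
The memory reduction can be measured rather than estimated. The sketch below compares `memory_usage` before and after downcasting on a toy frame with made-up values; the column names mirror the dataset, but the figures it prints apply only to this toy frame. The actual saving depends on the data, and string columns are unaffected by integer downcasting.

```python
import numpy as np
import pandas as pd

# Toy frame with illustrative values; 64-bit integers are the usual default.
df = pd.DataFrame({
    "Year": np.array([1930, 1934, 1938], dtype=np.int64),
    "GoalsScored": np.array([70, 70, 84], dtype=np.int64),
    "MatchesPlayed": np.array([18, 17, 18], dtype=np.int64),
})

before = df.memory_usage(index=False, deep=True).sum()

# Let pandas pick the smallest unsigned integer type that fits each column.
small = df.copy()
for col in small.columns:
    small[col] = pd.to_numeric(small[col], downcast="unsigned")

after = small.memory_usage(index=False, deep=True).sum()

print(small.dtypes)
print(f"{before} bytes -> {after} bytes ({1 - after / before:.0%} smaller)")
```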

### Tools and Libraries

This analysis uses:
- **pandas**: Data manipulation and analysis
- **numpy**: Numerical computing and optimized data types

---

*Note: This notebook is part of a comprehensive FIFA World Cup data cleaning project that includes match-level data, player statistics, and tournament summaries. The cleaning decisions made here maintain consistency with related datasets in the project.*

---

Importing the libraries numpy and pandas

In [None]:
import numpy as np
import pandas as pd

Reading the WorldCups CSV file:

In [None]:
worldcups = pd.read_csv('../data/WorldCups.csv')

Exploring the WorldCups table

Taking a look at the first 20 entries of the table (in this case the whole table contains 20 entries)

In [None]:
worldcups.head(20)

Having a glimpse of the column info

In [None]:
worldcups.info()

Looks like there are no null entries in the table

Checking min, max, and count stats for the numerical columns

In [None]:
worldcups.describe()

Looks like the min and max stats are plausible for the numerical columns

Checking count, unique, top, and frequency stats for the text columns

In [None]:
worldcups.describe(include='object')

Looks like the unique value counts make sense, as does the frequency of the most common World Cup winner

Checking the number of null cells in the table

In [None]:
worldcups.isnull().sum()

No null cells in the table

Checking for any duplicated years recorded for World Cups

In [None]:
worldcups.duplicated(subset=['Year']).sum()

No duplicates are found

Updating dtypes for the numerical columns: converting them to integer values and choosing the smallest bit width that fits each column, for better memory utilisation

In [None]:
# Years need 16 bits; the per-tournament counts below fit in 8 bits (0-255)
worldcups['Year'] = worldcups['Year'].astype(np.uint16)
worldcups['GoalsScored'] = worldcups['GoalsScored'].astype(np.uint8)
worldcups['QualifiedTeams'] = worldcups['QualifiedTeams'].astype(np.uint8)
worldcups['MatchesPlayed'] = worldcups['MatchesPlayed'].astype(np.uint8)

# Attendance is stored as text with periods as thousands separators; strip them first
worldcups['Attendance'] = worldcups['Attendance'].str.replace('.', '', regex=False)
if not worldcups['Attendance'].str.isnumeric().all():
    print("Warning: Non-numeric attendance values found!")
    print(worldcups[~worldcups['Attendance'].str.isnumeric()]['Attendance'])

# Total attendance runs into the millions, so 32 bits are needed
worldcups['Attendance'] = worldcups['Attendance'].astype(np.uint32)
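
A plain `astype` cast to a narrow integer type is not guaranteed to raise if a value falls outside the target range, so it can be worth checking the range first. The helper below is a sketch of such a guard; `safe_downcast` is a hypothetical name introduced here, not part of pandas or of this notebook.

```python
import numpy as np
import pandas as pd


def safe_downcast(series: pd.Series, dtype) -> pd.Series:
    """Cast to an integer dtype only if every value fits in its range.

    Hypothetical helper for illustration.
    """
    info = np.iinfo(dtype)
    lowest, highest = series.min(), series.max()
    if lowest < info.min or highest > info.max:
        raise ValueError(
            f"{series.name}: values {lowest}..{highest} "
            f"do not fit in {np.dtype(dtype).name}"
        )
    return series.astype(dtype)


# Illustrative values: both fit in the 0-255 range of uint8.
goals = pd.Series([70, 171], name="GoalsScored")
print(safe_downcast(goals, np.uint8).dtype)

# A value above 255 is rejected instead of being cast.
try:
    safe_downcast(pd.Series([70, 300], name="GoalsScored"), np.uint8)
except ValueError as err:
    print(err)
```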

Checking dtypes after update

In [None]:
worldcups.info()

dtypes updated successfully

Checking unique values for all columns in the table

In [None]:
for col in worldcups.columns:
    print(f"Unique values in column {col}: {(worldcups[col].unique())}")

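Found some deprecated country names; replacing each with its successor country's name, keeping the historical name in parentheses. The joint-host entry 'Korea/Japan' is also expanded to 'Korea Republic/Japan'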

In [None]:
worldcups = worldcups.replace('Germany FR', 'Germany (Germany FR)')
worldcups = worldcups.replace('Soviet Union', 'Russia (Soviet Union)')
worldcups = worldcups.replace('Czechoslovakia', 'Czech Republic (Czechoslovakia)')
worldcups = worldcups.replace('Yugoslavia', 'Serbia (Yugoslavia)')
worldcups = worldcups.replace('Korea/Japan', 'Korea Republic/Japan')
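
As an alternative sketch, the same renames can be kept in a single dictionary and applied only to the team columns, which records the mapping in one place and avoids touching unrelated columns. The names `NAME_MAP`, `TEAM_COLUMNS`, and `sample` are assumptions introduced for illustration, and the rows are a small sample rather than the full table. The 'Korea/Japan' entry belongs to the host column, so it is left out of this team mapping.

```python
import pandas as pd

# Historical team name -> label used in the cleaned table
# (copied from the replacements above).
NAME_MAP = {
    "Germany FR": "Germany (Germany FR)",
    "Soviet Union": "Russia (Soviet Union)",
    "Czechoslovakia": "Czech Republic (Czechoslovakia)",
    "Yugoslavia": "Serbia (Yugoslavia)",
}
TEAM_COLUMNS = ["Winner", "Runners-Up", "Third", "Fourth"]

# Small sample rows for illustration.
sample = pd.DataFrame({
    "Winner": ["Germany FR", "Brazil"],
    "Runners-Up": ["Hungary", "Czechoslovakia"],
    "Third": ["Austria", "Chile"],
    "Fourth": ["Uruguay", "Yugoslavia"],
})

# Replace whole-cell matches only, and only in the team columns.
sample[TEAM_COLUMNS] = sample[TEAM_COLUMNS].replace(NAME_MAP)
print(sample)
```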

Checking the country values after cleaning

In [None]:
for col in worldcups.columns:
    print(f"Unique values in column {col}: {(worldcups[col].unique())}")

Year sequence check

In [None]:
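years = worldcups['Year']
print(f"Tournament years: {years.tolist()}")
# No tournaments were held in 1942 and 1946 (World War II), so both should be absent
print(f"1942 or 1946 present: {years.isin([1942, 1946]).any()} (expected: False)")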
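
The four-year cadence can also be checked programmatically. The sketch below flags every tournament that did not follow its predecessor by exactly four years; it uses the first five tournament years as sample input, and the variable names are assumptions for illustration.

```python
import pandas as pd

# First five tournament years, used as sample input.
years = pd.Series([1930, 1934, 1938, 1950, 1954], name="Year")

# Difference to the previous tournament; the first row has no predecessor (NaN).
gap = years.diff()

# Keep only tournaments whose gap exists and is not four years.
irregular = years[gap.notna() & (gap != 4)]
print(irregular.tolist())  # [1950]
```

Only the 1950 tournament is flagged, which matches the interruption caused by World War II.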

Checking goals per match ratio

In [None]:
worldcups['Goals_Per_Match'] = worldcups['GoalsScored'] / worldcups['MatchesPlayed']
print(f"Goals per match range: {worldcups['Goals_Per_Match'].min():.2f} - {worldcups['Goals_Per_Match'].max():.2f}")

Checking attendance per match

In [None]:
worldcups['Attendance_Per_Match'] = worldcups['Attendance'] / worldcups['MatchesPlayed']
print(f"Avg attendance per match: {worldcups['Attendance_Per_Match'].mean():,.0f}")

Checking that the winner does not also appear as runner-up, third, or fourth place in the same tournament

In [None]:
# Each tournament's winner must differ from the other three placed teams
for _, row in worldcups.iterrows():
    assert row['Winner'] != row['Runners-Up'], f"Data error in {row['Year']}"
    assert row['Winner'] != row['Third'], f"Data error in {row['Year']}"
    assert row['Winner'] != row['Fourth'], f"Data error in {row['Year']}"

Winner is not found in second, third, or fourth place
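
A vectorized variant is sketched below. It is slightly stricter than the check above: it requires all four placed teams in a row to be distinct, not only the winner. `PLACE_COLUMNS` and `sample` are names assumed for illustration, and the rows are a small sample.

```python
import pandas as pd

PLACE_COLUMNS = ["Winner", "Runners-Up", "Third", "Fourth"]

# Small sample rows for illustration.
sample = pd.DataFrame({
    "Year": [1930, 1934],
    "Winner": ["Uruguay", "Italy"],
    "Runners-Up": ["Argentina", "Czechoslovakia"],
    "Third": ["USA", "Germany"],
    "Fourth": ["Yugoslavia", "Austria"],
})

# A row is valid when its four placed teams are all different.
distinct = sample[PLACE_COLUMNS].nunique(axis=1) == len(PLACE_COLUMNS)

# Years in which some team appears in more than one position.
conflicts = sample.loc[~distinct, "Year"].tolist()
assert not conflicts, f"Repeated team among the top four in: {conflicts}"
print("All top-four placements are distinct")
```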

Checking the table after cleaning

In [None]:
worldcups

In [None]:
print("DATA CLEANING SUMMARY")
print(f"Total tournaments: {len(worldcups)}")
print(f"Date range: {worldcups['Year'].min()} - {worldcups['Year'].max()}")
print(f"Total goals scored: {worldcups['GoalsScored'].sum():,}")
print(f"Total attendance: {worldcups['Attendance'].sum():,}")
print(f"Most successful team: {worldcups['Winner'].mode()[0]} ({worldcups['Winner'].value_counts().max()} wins)")
print(f"\nMemory usage: {worldcups.memory_usage(deep=True).sum() / 1024:.2f} KB")
print("="*60)

Exporting the table to CSV under the generated directory

In [None]:
worldcups.to_csv('../data/generated/WorldCups_Clean.csv', index=False)
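
`to_csv` does not create missing directories, so the export fails if the target folder does not exist yet. The sketch below creates the folder first and reads the file back as a quick round-trip check. It writes to a temporary directory and uses a two-row stand-in frame, both of which are assumptions made so the example runs anywhere.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Two-row stand-in for the cleaned table.
df = pd.DataFrame({"Year": [1930, 1934], "Winner": ["Uruguay", "Italy"]})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "generated" / "WorldCups_Clean.csv"

    # Create the output folder (and any parents) if it is not there yet.
    out_path.parent.mkdir(parents=True, exist_ok=True)

    df.to_csv(out_path, index=False)

    # Read the file back to confirm that rows and columns survived the export.
    reloaded = pd.read_csv(out_path)

print(reloaded.shape == df.shape)
print(list(reloaded.columns) == list(df.columns))
```

CSV files do not store dtypes, so the optimized integer types have to be reapplied (for example through the `dtype` argument of `read_csv`) whenever the cleaned file is loaded again.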