# Data Cleaning for Eredivisie Match Prediction

This section focuses on the data cleaning process for our Eredivisie match prediction project. The dataset used in this project contains match results from the Eredivisie, the top-tier football league in the Netherlands, spanning from 1956 to the present. 

Effective data cleaning is a crucial step in ensuring the quality and reliability of the predictions. The raw dataset may contain inconsistencies, missing values, or irrelevant information that could negatively impact the performance of the predictive models. 

In this notebook, we will:
- Explore the structure and content of the dataset.
- Handle missing or inconsistent data.
- Transform and preprocess the data into a format suitable for analysis and modeling.

By the end of this process, we aim to have a clean and well-structured dataset that serves as a solid foundation for building accurate and insightful predictive models.

**Future:** N/A

**Version:** 1.0

In [1]:
# Import packages
import pandas as pd
import numpy as np 

In [6]:
def data_cleaning_eredivisie_results(df: pd.DataFrame) -> pd.DataFrame:
    """
    Clean the input DataFrame by setting the correct naming conventions (English and snake_case), ... 
    
    Returns:
        pd.DataFrame: The cleaned DataFrame.
    """
    # Rename headers
    header_mapper = {
        'Seizoen': 'season',       
        'Datum': 'date',
        'Thuisclub': 'home_team',
        'Uitclub': 'away_team',
        'Thuisscore': 'home_score',
        'Uitscore': 'away_score'
    } 

    df = df.rename(columns=header_mapper)

    # Rename teams to create consistent naming
    teams_to_rename = {
        'ADO': 'ADO Den Haag',
        'AZ `67': 'AZ',
        'Alkmaar': 'AZ',        # Fusie naar AZ in '67
        'BVV': 'BVV Den Bosch',
        'DWS/A': 'DWS',
        'Dordrecht `90': 'FC Dordrecht',
        'FC Twente `65': 'FC Twente',
        'FC VVV': 'VVV-Venlo',
        'VVV': 'VVV-Venlo',
        'RKC': 'RKC Waalwijk',
        'Rapid JC': 'Roda JC',
        'Fortuna `54': 'Fortuna Sittard',
        'Feijenoord': 'Feyenoord',
        'Go Ahead': 'Go Ahead Eagles',
        'SC Heracles': 'Heracles Almelo',
        'Sparta': 'Sparta Rotterdam',	
        'DS `79': 'FC Doordrecht',
        'SVV/Dordrecht `90': 'FC Dordrecht',
        'Xerxes/DHC `66': 'Xerxes',
        'Volendam': 'FC Volendam',
        'NAC': 'NAC Breda'
    }

    df['home_team'] = df['home_team'].replace(teams_to_rename)
    df['away_team'] = df['away_team'].replace(teams_to_rename)

    # Convert date to datetime format
    df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d', errors='coerce')
    
    return df

In [3]:
# Start by reading the data and printing the dimensions
df_eredivisie_results = pd.read_csv('../Files/eredivisie_results.csv')

print(f"Dataframe shape: {df_eredivisie_results.shape}")

Dataframe shape: (20722, 6)


In [7]:
df_eredivisie_results = data_cleaning_eredivisie_results(df_eredivisie_results)

In [8]:
# Let's start by checking for null values
print(f"Null values in the dataset: \n{df_eredivisie_results.isnull().sum()}")

Null values in the dataset: 
season        0
date          0
home_team     0
away_team     0
home_score    0
away_score    0
dtype: int64


In [9]:
# Check for naming mistakes in the teams (e.g. 'PSV Eindhoven' vs 'PSV')
np.unique(df_eredivisie_results[['home_team', 'away_team']].values)

array(['ADO Den Haag', 'AZ', 'Ajax', 'Almere City FC', 'BVC Amsterdam',
       'BVV Den Bosch', 'Blauw Wit', 'Cambuur Leeuwarden', 'DOS', 'DWS',
       'De Graafschap', 'De Volewijckers', 'Eindhoven', 'Elinkwijk',
       'Excelsior', 'FC Amsterdam', 'FC Den Bosch', 'FC Den Haag',
       'FC Doordrecht', 'FC Dordrecht', 'FC Emmen', 'FC Groningen',
       'FC Twente', 'FC Utrecht', 'FC Volendam', 'FC Wageningen',
       'FC Zwolle', 'FSC', 'Feyenoord', 'Fortuna Sittard', 'GVAV',
       'Go Ahead Eagles', 'Haarlem', 'Helmond Sport', 'Heracles',
       'Heracles Almelo', 'Holland Sport', 'MVV', 'NAC Breda', 'NEC',
       'NOAD', 'PEC Zwolle', 'PSV', 'RBC Roosendaal', 'RKC Waalwijk',
       'Roda JC', 'SC Cambuur', 'SC Enschede', 'SC Heerenveen', 'SHS',
       'SVV', 'Sittardia', 'Sparta Rotterdam', 'Telstar', 'VVV-Venlo',
       'Veendam', 'Vitesse', 'Willem II', 'Xerxes'], dtype=object)

In [10]:
# Save the cleaned dataset
df_eredivisie_results.to_csv('../Files/eredivisie_results_cleaned.csv', index=False)

In [11]:
df_eredivisie_results.head(10)

Unnamed: 0,season,date,home_team,away_team,home_score,away_score
0,1956-1957,1956-09-02,Ajax,NAC Breda,1,0
1,1956-1957,1956-09-02,BVV Den Bosch,Elinkwijk,1,2
2,1956-1957,1956-09-02,DOS,Sparta Rotterdam,2,3
3,1956-1957,1956-09-02,Fortuna Sittard,Eindhoven,4,1
4,1956-1957,1956-09-02,NOAD,BVC Amsterdam,1,3
5,1956-1957,1956-09-02,PSV,MVV,1,3
6,1956-1957,1956-09-02,SC Enschede,Roda JC,5,2
7,1956-1957,1956-09-02,VVV-Venlo,GVAV,1,0
8,1956-1957,1956-09-02,Willem II,Feyenoord,3,3
9,1956-1957,1956-09-09,BVC Amsterdam,Willem II,0,6
