# Exploratory Data Analysis (EDA) for Football Match Outcome Prediction
In this notebook, we will explore the football match datasets to uncover patterns, relationships, and potential features for our predictive model. We will investigate:
- Missing data
- Feature distributions
- Correlations between features
- Class imbalance
- Insights from betting odds and Elo ratings


2. Loading the Data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# For better visualization
%matplotlib inline
sns.set(style="whitegrid")

# Load a single league's CSV for this EDA
df = pd.read_csv("../data/SpanishLaliga.csv")

# Alternatively, if you want to load all leagues:
# from src.data_loader import load_all_leagues
# df = load_all_leagues('../data/')


3. Basic Data Overview

In [None]:
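# Display the first few rows
# (display() is used because a cell renders only its last expression automatically)
display(df.head())

# Check basic info and structure of the dataset (info() prints its output directly)
df.info()

# Describe numeric columns
display(df.describe())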

# Check for unique values in categorical columns (e.g., Home Team, Away Team)
print(df['Home Team'].unique())
print(df['Away Team'].unique())


4. Missing Values Analysis

In [None]:
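# Count missing values per column, keeping only the columns that have any
missing_values = df.isnull().sum()
missing_values = missing_values[missing_values > 0].sort_values(ascending=False)

# Plot missing values (skipped when nothing is missing, since there is nothing to draw)
if missing_values.empty:
    print("No missing values found.")
else:
    plt.figure(figsize=(10, 6))
    sns.barplot(x=missing_values.index, y=missing_values.values)
    plt.title("Missing Values per Feature")
    plt.xticks(rotation=45)
    plt.show()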
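
Raw counts are hard to judge without the size of the dataset, so a percentage view can help decide whether a column is worth imputing or dropping. The helper below is a minimal sketch; `missing_summary` and the toy frame are hypothetical names introduced for illustration, not part of the project code.

```python
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column, worst first."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": 100 * counts / len(frame),
    })
    # Keep only columns that actually contain missing values
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Toy frame for illustration only; in the notebook, pass df instead
toy_missing = pd.DataFrame({
    "HG": [1, 2, None, 0],
    "AG": [0, 1, 1, 2],
    "1": [None, None, 2.1, 1.8],
})
print(missing_summary(toy_missing))
```

In the notebook this would be called as `missing_summary(df)`.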


5. Distribution of the Target Variable

In [None]:
# Plot the distribution of the target variable 'Result'
plt.figure(figsize=(8, 5))
sns.countplot(x='Result', data=df, palette='Set2')
plt.title("Distribution of Match Outcomes (Target Variable)")
plt.show()

# Check for class imbalance
df['Result'].value_counts(normalize=True)


6. Exploring Feature Distributions

In [None]:
# Plot distributions of numerical features: goals (HG, AG) and betting odds ('1', 'X', '2')
numerical_features = ['HG', 'AG', '1', 'X', '2']
df[numerical_features].hist(bins=15, figsize=(15, 10), layout=(3, 2))
plt.suptitle('Distribution of Numerical Features')
plt.show()


7. Feature Correlation Analysis

In [None]:
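# Calculate correlations between numerical features
# (non-numeric columns such as team names are excluded explicitly;
# pandas 2.0+ raises an error if they are passed to corr())
corr_matrix = df.select_dtypes(include='number').corr()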

# Plot a heatmap of the correlation matrix
plt.figure(figsize=(12, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title("Correlation Heatmap of Numerical Features")
plt.show()

# If 'Result' is stored as text labels, it does not appear in corr_matrix.
# As a rough proxy for the outcome, inspect correlations with home goals (HG)
hg_corr = corr_matrix['HG'].sort_values(ascending=False)
print(hg_corr)
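
To relate features to the target itself, one option is to one-hot encode the outcome and correlate each numeric column with each outcome indicator. This is a sketch under assumptions: the function name `target_correlations` is hypothetical, and the toy data assumes outcome labels `'H'`, `'D'`, `'A'`, which may not match how `Result` is encoded in the real dataset.

```python
import pandas as pd

def target_correlations(frame: pd.DataFrame, target: str = "Result") -> pd.DataFrame:
    """Correlate every numeric column with one-hot indicators of the target."""
    indicators = pd.get_dummies(frame[target], prefix=target).astype(float)
    numeric = frame.select_dtypes(include="number")
    # One row per numeric feature, one column per outcome class
    return pd.DataFrame(
        {col: numeric.corrwith(indicators[col]) for col in indicators.columns}
    )

# Toy frame for illustration only (labels 'H', 'D', 'A' are an assumption)
toy_matches = pd.DataFrame({
    "HG": [2, 1, 0, 3],
    "AG": [0, 1, 2, 1],
    "Result": ["H", "D", "A", "H"],
})
print(target_correlations(toy_matches))
```

Goals will correlate with the outcome by construction, so this view is most informative for pre-match features such as odds or Elo ratings.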


8. Visualizing Match Outcomes

In [None]:
# Visualize home vs away goals for each outcome
sns.scatterplot(x='HG', y='AG', hue='Result', data=df, palette='Set1')
plt.title("Home Goals vs Away Goals by Match Outcome")
plt.xlabel("Home Goals")
plt.ylabel("Away Goals")
plt.show()


9. Betting Odds Insights

In [None]:
# Visualize distribution of betting odds (columns '1', 'X', '2')
plt.figure(figsize=(12, 6))
sns.kdeplot(df['1'], label="Home Win Odds", fill=True)
sns.kdeplot(df['X'], label="Draw Odds", fill=True)
sns.kdeplot(df['2'], label="Away Win Odds", fill=True)
plt.title("Distribution of Betting Odds")
plt.legend()
plt.show()

# Explore how odds relate to actual results
sns.boxplot(x='Result', y='1', data=df)
plt.title("Home Win Odds by Match Outcome")
plt.show()
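
Odds are easier to compare with outcome frequencies once they are converted to implied probabilities. The sketch below assumes that columns `'1'`, `'X'`, `'2'` hold decimal odds; the function name and the output column names (`p_home`, `p_draw`, `p_away`, `overround`) are hypothetical. It uses simple proportional normalization to remove the bookmaker margin, which is one common choice among several.

```python
import pandas as pd

def implied_probabilities(frame: pd.DataFrame, odds_cols=("1", "X", "2")) -> pd.DataFrame:
    """Convert decimal odds to probabilities that sum to 1 for each match."""
    raw = 1.0 / frame[list(odds_cols)]      # inverse odds still include the bookmaker margin
    overround = raw.sum(axis=1)             # row sum of inverse odds, usually a little above 1
    probs = raw.div(overround, axis=0)      # proportional normalization removes the margin
    probs.columns = ["p_home", "p_draw", "p_away"]
    return probs.assign(overround=overround)

# Toy odds for illustration only; in the notebook, pass df instead
toy_odds = pd.DataFrame({"1": [2.0, 1.5], "X": [4.0, 4.0], "2": [4.0, 6.0]})
print(implied_probabilities(toy_odds))
```

Comparing the mean of each probability column with `df['Result'].value_counts(normalize=True)` gives a quick check of how well the odds are calibrated on this dataset.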


10. Elo Ratings Analysis

In [None]:
# Assuming you've already calculated Elo ratings during feature engineering
# (e.g., 'Home_Offensive_Elo', 'Away_Defensive_Elo')

# Compare Elo ratings for home vs away teams
sns.scatterplot(x='Home_Offensive_Elo', y='Away_Defensive_Elo', hue='Result', data=df, palette='Set2')
plt.title("Home Offensive Elo vs Away Defensive Elo by Match Outcome")
plt.show()

# Distribution of Elo ratings
df[['Home_Offensive_Elo', 'Away_Offensive_Elo']].hist(bins=15, figsize=(12, 5))
plt.suptitle("Distribution of Elo Ratings")
plt.show()
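
For reference when reading the plots above, the standard Elo model turns a rating difference into an expected score. The sketch below shows that textbook formula with a single rating per team; it is not the offensive/defensive Elo variant assumed in this notebook, whose exact definition lives in the feature engineering step.

```python
def elo_expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score of A against B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))

# Equal ratings give an even expectation; a higher rating gives more than 0.5
print(elo_expected_score(1500, 1500))
print(elo_expected_score(1600, 1400))
```

The two expectations for a pair of teams always sum to 1, so a rating difference can be summarized by a single number.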


11. Rolling Statistics

In [None]:
# Rolling window statistics (already calculated in feature engineering)
# Example: rolling average of goals for home and away teams
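# Parse dates so the x-axis is ordered chronologically
# (pass dayfirst=True if the dates are stored as DD/MM/YYYY)
df['Date'] = pd.to_datetime(df['Date'])
plt.figure(figsize=(12, 6))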
sns.lineplot(data=df, x='Date', y='home_rolling_goals', label="Home Rolling Avg Goals")
sns.lineplot(data=df, x='Date', y='away_rolling_goals', label="Away Rolling Avg Goals")
plt.title("Rolling Average Goals for Home and Away Teams")
plt.xticks(rotation=45)
plt.legend()
plt.show()
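
If `home_rolling_goals` is not yet available, it could be computed along the following lines. This is a sketch under assumptions, not the project's feature engineering code: the function name and the window of 5 are hypothetical, and it only averages a team's previous home matches. The `shift(1)` keeps the current match out of its own average, which avoids leaking the result into the feature.

```python
import pandas as pd

def add_home_rolling_goals(frame: pd.DataFrame, window: int = 5) -> pd.DataFrame:
    """Add the mean of each home team's goals over its previous home matches."""
    out = frame.sort_values("Date").copy()
    out["home_rolling_goals"] = (
        out.groupby("Home Team")["HG"]
           # shift(1) excludes the current match; min_periods=1 allows short histories
           .transform(lambda s: s.shift(1).rolling(window, min_periods=1).mean())
    )
    return out

# Toy matches for illustration only, deliberately out of date order
toy_games = pd.DataFrame({
    "Date": pd.to_datetime(["2023-01-15", "2023-01-01", "2023-01-22", "2023-01-08"]),
    "Home Team": ["A", "A", "A", "B"],
    "HG": [3, 1, 2, 0],
})
print(add_home_rolling_goals(toy_games, window=2))
```

A team's first home match has no history, so its value is `NaN` and needs to be handled before modeling.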


12. Class Imbalance

In [None]:
# Check if there's a class imbalance
class_counts = df['Result'].value_counts()

plt.figure(figsize=(8, 5))
sns.barplot(x=class_counts.index, y=class_counts.values, palette='Set1')
plt.title("Class Imbalance in Match Outcomes")
plt.show()

# If imbalance is severe, consider techniques such as SMOTE in later stages
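
A lighter alternative to resampling is to weight classes inversely to their frequency, which many classifiers accept through a class-weight parameter. The sketch below uses the common "balanced" formula, `n_samples / (n_classes * class_count)`; this is a different technique from SMOTE, and the function name and toy labels are hypothetical.

```python
import pandas as pd

def balanced_class_weights(labels) -> dict:
    """Return inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = pd.Series(labels).value_counts()
    return (len(labels) / (len(counts) * counts)).to_dict()

# Toy labels for illustration only; in the notebook, pass df['Result'] instead
toy_labels = ["H"] * 5 + ["D"] * 3 + ["A"] * 2
print(balanced_class_weights(toy_labels))
```

Rarer classes receive larger weights, so errors on them count more during training.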


# Conclusions and Next Steps
1. The dataset has no significant missing values.
2. There seems to be some level of class imbalance in the 'Result' target variable.
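3. Some features show strong correlations with match outcomes, such as Elo ratings and goals scored. Goals (HG, AG) determine the result directly, so they are usable as predictors only when aggregated from past matches.
4. Betting odds could provide useful insights for prediction, since they reflect pre-match expectations of each outcome.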

**Next Steps**:
- Proceed with feature engineering based on Elo ratings, rolling statistics, and form points.
- Apply class balancing techniques such as SMOTE during model training, resampling the training split only so that evaluation data keeps the original class distribution.
- Test different machine learning models using the engineered features.
