Football Match Data Enhancement for Machine Learning Models
Purpose of Feature Engineering
This script derives additional features from raw football match statistics for use in machine learning models. It adds outcome labels, home-versus-away differences, efficiency ratios, and team-quality flags, which express relationships between the two teams that the raw per-team counts do not show directly.
Key Feature Categories
1. Match Outcome Features

goal_diff: Goal difference between home and away teams
result: Match result coded as 'H' (home win), 'D' (draw), or 'A' (away win)
total_goals: Total number of goals scored in the match
Binary indicators: home_win, draw, away_win for easy classification tasks
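
The three binary indicators are a one-hot encoding of result, so exactly one of them is 1 for every match. A minimal sketch on made-up scores (the toy DataFrame is an assumption for illustration, not data from the match file):

```python
import numpy as np
import pandas as pd

# Hypothetical scores: a home win, a draw, and an away win
toy = pd.DataFrame({"home_goals": [2, 1, 0], "away_goals": [0, 1, 3]})

toy["goal_diff"] = toy["home_goals"] - toy["away_goals"]

# The sign of the goal difference maps directly onto H / D / A
toy["result"] = np.sign(toy["goal_diff"]).map({1: "H", 0: "D", -1: "A"})

# One-hot indicators derived from the coded result
toy["home_win"] = (toy["result"] == "H").astype(int)
toy["draw"] = (toy["result"] == "D").astype(int)
toy["away_win"] = (toy["result"] == "A").astype(int)

# Sanity check: each match has exactly one outcome flag set
one_hot_ok = bool((toy[["home_win", "draw", "away_win"]].sum(axis=1) == 1).all())

print(toy)
print("Indicators are one-hot:", one_hot_ok)
```

For a multi-class classifier, result alone is usually enough as the target; the binary columns are convenient when a single outcome (for example, home win or not) is modelled on its own.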

2. Card and Disciplinary Features

home_total_cards, away_total_cards: Sum of yellow and red cards for each team
total_cards: Total cards shown in the match
card_diff: Difference in cards between home and away teams

3. Performance Comparison Metrics

possession_diff: Difference in possession percentage
pass_accuracy_diff: Difference in passing accuracy
shot_diff: Difference in total shots
shots_on_target_diff: Difference in shots on target
passes_diff: Difference in total passes

4. Efficiency Metrics

home_shot_efficiency, away_shot_efficiency: Goals scored per shot
shot_efficiency_diff: Difference in shot efficiency between teams
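
Shot efficiency is a ratio, so a team that takes no shots needs explicit handling. The sketch below assigns 0 in that case, which is a modelling choice rather than the only reasonable one. The helper name shot_efficiency and the toy numbers are assumptions for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical match rows; the second home team takes no shots at all
toy = pd.DataFrame({
    "home_goals": [2, 0], "home_shots_total": [10, 0],
    "away_goals": [1, 1], "away_shots_total": [4, 5],
})


def shot_efficiency(goals: pd.Series, shots: pd.Series) -> pd.Series:
    """Goals per shot, with 0.0 where no shots were taken."""
    # Turn zero shot counts into NaN so the division cannot produce inf,
    # then fill the resulting NaN values with 0.0
    return (goals / shots.replace(0, np.nan)).fillna(0.0)


toy["home_shot_efficiency"] = shot_efficiency(toy["home_goals"], toy["home_shots_total"])
toy["away_shot_efficiency"] = shot_efficiency(toy["away_goals"], toy["away_shots_total"])
toy["shot_efficiency_diff"] = toy["home_shot_efficiency"] - toy["away_shot_efficiency"]

print(toy[["home_shot_efficiency", "away_shot_efficiency", "shot_efficiency_diff"]])
```

Leaving the value as NaN instead of 0 is an alternative when the downstream model can handle missing values, since "no shots" and "shots but no goals" then stay distinguishable.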

5. Team Quality Indicators

big_team_home, big_team_away: Flag whether the home or away team appears on a predefined list of top clubs
big_team_match: Flags high-profile matches between two big teams

Applications in Machine Learning
These features enable several types of predictive modeling:

Match Outcome Prediction: Using team performance metrics to predict if a match will end in a home win, draw, or away win.
Goal Scoring Forecasting: Predicting the number of goals that will be scored in a match.
Team Performance Analysis: Understanding which factors most strongly influence a team's success.
Betting Odds Optimization: Creating models that can identify value bets by comparing predicted outcomes with market odds.
Player Impact Assessment: When combined with player data, these features can help quantify individual player contributions to team success.
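
To make the value-bet idea concrete: a bet has positive expected value when the model's probability for an outcome exceeds the probability implied by the market odds. The sketch below assumes decimal odds and uses made-up probabilities and odds; the function names are hypothetical and bookmaker margin is ignored.

```python
def implied_probability(decimal_odds: float) -> float:
    """Probability implied by decimal odds, ignoring the bookmaker's margin."""
    return 1.0 / decimal_odds


def expected_value(prob: float, decimal_odds: float) -> float:
    """Expected profit per unit stake: win (odds - 1) with prob, lose 1 otherwise."""
    return prob * (decimal_odds - 1.0) - (1.0 - prob)


# Hypothetical model output and market odds for a single match
model_probs = {"H": 0.50, "D": 0.25, "A": 0.25}
market_odds = {"H": 2.20, "D": 3.40, "A": 3.60}

# Keep only outcomes whose expected value is positive
value_bets = {}
for outcome, prob in model_probs.items():
    ev = expected_value(prob, market_odds[outcome])
    if ev > 0:
        value_bets[outcome] = ev

for outcome, ev in value_bets.items():
    implied = implied_probability(market_odds[outcome])
    print(f"{outcome}: model {model_probs[outcome]:.2f} vs implied {implied:.2f}, EV {ev:+.3f}")
```

A comparison like this is only as good as the model's probability calibration, so it fits best after the evaluation step rather than before it.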

Advantage Over Raw Data
The enriched dataset provides several advantages over the original data:

Relative Measures: By calculating differences between teams, we capture the relative advantage one team has over another.
Derived Insights: Features like shot efficiency reveal information about team quality that raw shot counts don't capture.
Categorical Context: Team quality indicators provide important context about match importance and expected performance levels.
Balanced Information: The features combine attacking metrics (goals, shots, shots on target) with possession, passing, and disciplinary metrics (cards).

Next Steps
This enhanced dataset is now ready for:

Exploratory data analysis to identify key patterns and correlations
Feature selection to determine which variables have the most predictive power
Model training using algorithms like Random Forest, Gradient Boosting, or Neural Networks
Hyperparameter tuning to optimize model performance
Model evaluation using appropriate metrics (accuracy, F1-score, log loss)
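
As an illustration of the evaluation step, accuracy and multi-class log loss can be computed directly from predicted probabilities. This is a minimal NumPy sketch with made-up labels and probabilities; the class order in CLASSES and the function names are assumptions, and a library implementation would normally be used in practice.

```python
import numpy as np

CLASSES = ["H", "D", "A"]  # assumed column order of the probability matrix


def accuracy(y_true, probs) -> float:
    """Share of matches where the most probable class equals the true result."""
    predicted = np.asarray(CLASSES)[np.argmax(probs, axis=1)]
    return float(np.mean(predicted == np.asarray(y_true)))


def multiclass_log_loss(y_true, probs, eps: float = 1e-15) -> float:
    """Mean negative log of the probability assigned to the true class."""
    clipped = np.clip(np.asarray(probs, dtype=float), eps, 1.0)  # avoid log(0)
    true_idx = np.array([CLASSES.index(label) for label in y_true])
    return float(-np.mean(np.log(clipped[np.arange(len(true_idx)), true_idx])))


# Hypothetical true results and predicted probabilities for four matches
y_true = ["H", "D", "A", "H"]
probs = np.array([
    [0.60, 0.25, 0.15],
    [0.30, 0.40, 0.30],
    [0.50, 0.30, 0.20],  # confident in the wrong outcome
    [0.70, 0.20, 0.10],
])

print(f"Accuracy: {accuracy(y_true, probs):.2f}")
print(f"Log loss: {multiclass_log_loss(y_true, probs):.3f}")
```

Log loss penalises confident wrong predictions more heavily than accuracy does, which makes it the more informative metric when the predicted probabilities themselves are used downstream.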

By transforming raw match statistics into these engineered features, we've created a much stronger foundation for predictive modeling in football analytics.

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the clean match data
print("Loading the clean match statistics data...")
df = pd.read_csv('clean_matches_stats_only.csv')
print(f"Loaded data with {df.shape[0]} rows and {df.shape[1]} columns")

"""
Feature Engineering for Football Match Prediction

This script takes raw football match data and transforms it into a dataset
suitable for machine learning models. We'll create various features to help
predict match outcomes and analyze performance patterns.
"""

print("\nStarting feature engineering process...")

# Data cleaning and preparation
print("Step 1: Data cleaning and type conversion")

# Fix percentage columns by converting them to numeric values
percentage_columns = ['home_possession', 'away_possession',
                      'home_pass_accuracy', 'away_pass_accuracy']

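for col in percentage_columns:
    # Strip '%' only if the column was read as text (e.g. '55%');
    # columns that are already numeric pass through unchanged
    if not pd.api.types.is_numeric_dtype(df[col]):
        df[col] = df[col].str.replace('%', '', regex=False).astype(float)
    print(f"  Converted {col} to numeric values")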

# Create basic match outcome features
print("\nStep 2: Creating match outcome features")

# Goal difference and match result
df['goal_diff'] = df['home_goals'] - df['away_goals']

# Encode the result as 'H' (home win), 'D' (draw), or 'A' (away win)
df['result'] = np.where(df['goal_diff'] > 0, 'H',
                np.where(df['goal_diff'] == 0, 'D', 'A'))

print("  Created 'goal_diff' and 'result' (H/D/A) features")

# Total goals in the match
df['total_goals'] = df['home_goals'] + df['away_goals']
print("  Created 'total_goals' feature")

# Binary outcome indicators
df['home_win'] = (df['goal_diff'] > 0).astype(int)
df['draw'] = (df['goal_diff'] == 0).astype(int)
df['away_win'] = (df['goal_diff'] < 0).astype(int)
print("  Created binary outcome indicators (home_win, draw, away_win)")

# Create card-related features
print("\nStep 3: Creating card-related features")

# Calculate total cards for each team
df['home_total_cards'] = df['home_yellow_cards'] + df['home_red_cards']
df['away_total_cards'] = df['away_yellow_cards'] + df['away_red_cards']
df['total_cards'] = df['home_total_cards'] + df['away_total_cards']
df['card_diff'] = df['home_total_cards'] - df['away_total_cards']
print("  Created total card counts and card difference")

# Create performance comparison features
print("\nStep 4: Creating performance comparison metrics")

# Calculate differences in key performance indicators
df['possession_diff'] = df['home_possession'] - df['away_possession']
df['pass_accuracy_diff'] = df['home_pass_accuracy'] - df['away_pass_accuracy']
df['shot_diff'] = df['home_shots_total'] - df['away_shots_total']
df['shots_on_target_diff'] = df['home_shots_on_target'] - df['away_shots_on_target']
df['passes_diff'] = df['home_passes'] - df['away_passes']
print("  Created difference metrics for possession, passing, and shooting")

# Calculate shot efficiency (goals per shot)
df['home_shot_efficiency'] = np.where(df['home_shots_total'] > 0,
                                     df['home_goals'] / df['home_shots_total'], 0)
df['away_shot_efficiency'] = np.where(df['away_shots_total'] > 0,
                                     df['away_goals'] / df['away_shots_total'], 0)
df['shot_efficiency_diff'] = df['home_shot_efficiency'] - df['away_shot_efficiency']
print("  Created shot efficiency metrics")

# Create team quality indicators
print("\nStep 5: Creating team quality indicators")

# Define big teams in each league
big_teams = {
    'premier league': ['manchester united', 'manchester city', 'liverpool', 'chelsea', 'arsenal', 'tottenham'],
    'serie a': ['juventus', 'ac milan', 'inter milan', 'napoli', 'roma', 'lazio'],
    'la liga': ['real madrid', 'barcelona', 'atletico madrid', 'sevilla', 'valencia'],
    'bundesliga': ['bayern munich', 'borussia dortmund', 'rb leipzig', 'bayer leverkusen'],
    'ligue 1': ['paris saint-germain', 'olympique lyonnais', 'olympique marseille', 'lille']
}

# Flatten the list of big teams
all_big_teams = [team for teams in big_teams.values() for team in teams]

# Create big team indicators
df['big_team_home'] = df['home_team'].str.lower().isin(all_big_teams).astype(int)
df['big_team_away'] = df['away_team'].str.lower().isin(all_big_teams).astype(int)
df['big_team_match'] = ((df['big_team_home'] == 1) & (df['big_team_away'] == 1)).astype(int)
print("  Created big team indicators based on predefined list of top teams")

# Save the enriched dataset
output_filename = 'football_matches_ml_ready.csv'
df.to_csv(output_filename, index=False)
print(f"\nEnhanced dataset saved to '{output_filename}' with {df.shape[1]} features")

# Display summary of new features
print("\nSummary of newly created features:")
original_columns = ['league', 'season', 'fixture_id', 'date', 'venue', 'referee',
                   'home_team', 'away_team', 'home_goals', 'away_goals', 'status',
                   'home_shots_on_target', 'away_shots_on_target', 'home_shots_total',
                   'away_shots_total', 'home_possession', 'away_possession',
                   'home_yellow_cards', 'away_yellow_cards', 'home_red_cards',
                   'away_red_cards', 'home_passes', 'away_passes', 'home_pass_accuracy',
                   'away_pass_accuracy']

new_features = [col for col in df.columns if col not in original_columns]
print(f"Added {len(new_features)} new features:")
for feature in new_features:
    print(f"  - {feature}")

print("\nFeature engineering complete! The dataset is now ready for machine learning applications.")

# For Google Colab, add this code to download the file
try:
    from google.colab import files
    files.download(output_filename)
    print(f"\nDownload initiated for {output_filename}")
except ImportError:  # google.colab is only importable inside Colab
    print("\nIf you're in Google Colab and want to download the file,")
    print("please add the following code in a new cell and run it:")
    print("from google.colab import files")
    print(f"files.download('{output_filename}')")

Loading the clean match statistics data...
Loaded data with 1010 rows and 25 columns

Starting feature engineering process...
Step 1: Data cleaning and type conversion
  Converted home_possession to numeric values
  Converted away_possession to numeric values
  Converted home_pass_accuracy to numeric values
  Converted away_pass_accuracy to numeric values

Step 2: Creating match outcome features
  Created 'goal_diff' and 'result' (H/D/A) features
  Created 'total_goals' feature
  Created binary outcome indicators (home_win, draw, away_win)

Step 3: Creating card-related features
  Created total card counts and card difference

Step 4: Creating performance comparison metrics
  Created difference metrics for possession, passing, and shooting
  Created shot efficiency metrics

Step 5: Creating team quality indicators
  Created big team indicators based on predefined list of top teams

Enhanced dataset saved to 'football_matches_ml_ready.csv' with 46 features

Summary of newly created feat

Download initiated for football_matches_ml_ready.csv
