## Rule-Based Model Design

The **rule-based model** evaluates players across four **macro roles** (GK, DEF, MID, FWD) by combining role-specific performance metrics.

Football performance is assessed in two complementary dimensions:

- **Per-90 evaluation:** how productive a player is on average when on the pitch, regardless of total minutes played

- **Seasonal evaluation:** how much the player actually contributed across the full season, weighting performance by playing time and team context

### Core Principles

- **Per-90 normalization**: ensures fair comparisons across players with different playing times

- **Season totals**: capture overall impact, durability, and consistency across the year

- **Role-specific indices**: tailor the evaluation to macro roles, highlighting what matters most for each position

- **Negative factors** (yellow/red cards, own goals, goals conceded): penalize costly mistakes

- **Impact factor on team league position**: rewards players who played many minutes for successful teams

- **Finishing delta**: difference between goals and xG, highlighting players who consistently over- or underperform their expected finishing

### Metrics for Per-90 Evaluation

**Scoring & Shooting**

- `goals_per90`, `xg_total_per90`, `shots_on_target_per90`: finishing efficiency

- `goal_contribution_per90`: combined offensive output (goals + assists)

- `finishing_delta_per90 = goals_per90 - xg_total_per90`: finishing quality relative to chances

**Passing & Creativity**

- `assists_per90`, `key_passes_per90`: chance creation

- `passes_attempted_per90`, `passes_completed_per90`, `pass_accuracy`: volume and efficiency

- `progressive_passes_per90`, `crosses_per90`, `switches`: progression, verticality, and distribution quality

**Carrying & Dribbling**

- `progressive_carries`, `carries_to_penalty_area_per90`, `carry_distance_total_per90`: ball progression on the run

- `dribbles_completed_per90`, `dribbles_success_rate`: 1v1 ability

**Defensive Actions**

- `duels_won_per90`, `duels_success_rate`, `interceptions_won_per90`, `interceptions_ratio`, `blocks_per90`, `clearances_per90`, `ball_recoveries_per90`, `pressures_per90`: defensive efficiency and anticipation.  

**Goalkeeper Metrics**

- `gk_save_ratio`, `gk_saves_per90`: shot-stopping quality

- `gk_goals_conceded_per90` (negative): penalizes frequent conceding

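> **NOTE**: Ratio/percentage metrics (e.g., `pass_accuracy`, `dribbles_success_rate`) have no per-90 version because they are already normalized. `switches` and `progressive_carries` also appear without a `_per90` suffix: the dataset's column list shows no per-90 column for them, so they enter the per-90 index as season totals.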
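
As a rough illustration of the per-90 idea, the sketch below rescales a season total to a 90-minute rate and derives the per-90 finishing delta. The helper `per90` is hypothetical and not part of the notebook, and it assumes the dataset's `_per90` columns were derived as total × 90 / minutes, which the notebook does not show. The inputs are Luis Suárez's figures (40 goals, 27.66 xG, 3299 minutes) from the output tables further down.

```python
def per90(total: float, minutes: float) -> float:
    """Rescale a season total to a rate per 90 minutes played."""
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    return total * 90.0 / minutes


# Luis Suárez, values taken from the notebook's output tables
goals_p90 = per90(40, 3299)
xg_p90 = per90(27.66, 3299)

# Per-90 finishing delta: goals per 90 minus xG per 90
finishing_delta_p90 = round(goals_p90 - xg_p90, 2)

print(round(goals_p90, 2), round(xg_p90, 2), finishing_delta_p90)
```

The rounded delta of 0.34 agrees with the `finishing_delta_per90` value the notebook reports for the same player.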

### Metrics for Seasonal Evaluation

**General**

- `minutes_played`: overall availability

- `team_league_position`: context adjustment (bonus for higher-ranked teams)

- Combined into an **impact factor**

**Scoring & Shooting**

- `goals`, `xg_total`, `shots_on_target`: season-long scoring output

- `goal_contribution`: total goals + assists

- `finishing_delta = goals - xg_total`: over/under-performance relative to xG

**Passing & Creativity**

- `assists`, `key_passes`: total creative output

- `passes_attempted`, `passes_completed`, `pass_accuracy`: build-up contribution

- `progressive_passes`, `crosses`, `switches`: progression and verticality across the season

**Carrying & Dribbling**

- `progressive_carries`, `carries_to_penalty_area`, `carry_distance_total`: total ball progression

- `dribbles_completed`, `dribbles_success_rate`: successful take-ons across the year

**Defensive Actions**

- `duels_won`, `duels_success_rate`, `interceptions_won`, `interceptions_ratio`, `blocks`, `clearances`, `ball_recoveries`, `pressures`: defensive volume and effectiveness over the season  

**Goalkeeper Metrics**

- `gk_saves`, `gk_penalties_saved`, `gk_clean_sheet`: total contributions to preventing goals

- `gk_save_ratio`: efficiency of shot-stopping

- `gk_goals_conceded` (negative): total goals conceded

**Discipline (Negative Impact)**

- `yellow_cards`, `red_cards`, `own_goals`: season-long negative contributions

**Fouls**

- `fouls_won`, `fouls_balance`: ability to win fouls and generate advantageous set-pieces

### Features Not Used

- **Metadata**: `season`, `competitions`, `teams`, `main_role`: identifiers only

- **Redundant playing-time stats**: `presences`, `matches_started`, `full_matches`, `substitutions_in/out`: already captured by `minutes_played` and per-90 scaling

- **Low-discriminative metrics**:  

  - `shots_attempted`: captured better by xG and shots on target

  - `duels_attempted`, `interceptions_attempted`: success rates are more informative

  - `dispossessed`: reflected in dribble success rate.  

### Final Considerations

By combining **per-90 efficiency metrics** with **season totals and contextual impact factors**, the model captures both *quality* and *quantity*:  

- A player with high efficiency but limited minutes will score high in per-90 evaluation but lower in seasonal impact.  
- A consistent starter with thousands of minutes and steady production will shine in seasonal evaluation, even if per-90 efficiency is modest.  

The final Ballon d’Or ranking is built from two complementary perspectives, which are then averaged with equal weight into a single final score:

1. **Best per-90 performers (efficiency)** → how good a player is when on the pitch

2. **Best seasonal contributors (impact)** → who truly shaped the season at scale
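
The final-ranking cell at the end of the notebook blends the two scores with equal weights. A minimal sketch of that blend is shown here; `blend` and its `w_seasonal` parameter are hypothetical names, and the inputs are the seasonal and per-90 points the notebook reports for two players.

```python
def blend(seasonal_points: float, per90_points: float, w_seasonal: float = 0.5) -> float:
    """Weighted average of the seasonal and per-90 scores."""
    return w_seasonal * seasonal_points + (1.0 - w_seasonal) * per90_points


# (seasonal_points, per90_points) as reported in the notebook's output tables
scores = {
    "Neymar": (10.993652, 9.12903),
    "Lionel Messi": (9.733786, 9.169529),
}

final = {name: blend(s, p) for name, (s, p) in scores.items()}
for name, value in final.items():
    print(name, round(value, 6))
```

With equal weights, Neymar stays ahead of Messi because his seasonal lead outweighs Messi's narrow per-90 lead.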

## Imports

In [489]:
import pandas as pd
import numpy as np
import ast

from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler


import warnings
warnings.filterwarnings("ignore")

## Dataset Loading

In [490]:
# Load the dataset
df_final = pd.read_csv("../task2_ballon_dor/data/df_final.csv")

# Check shape
print(f"Dataset shape: {df_final.shape[0]} rows, {df_final.shape[1]} columns")

# Preview
display(df_final.head())

# List of columns
print("Available columns:")
for col in df_final.columns:
    print("-", col)


Dataset shape: 2632 rows, 87 columns


|  | season | player_id | player_name | presences | matches_started | full_matches | minutes_played | substitutions_in | substitutions_out | yellow_cards | ... | gk_penalties_saved | gk_save_ratio | gk_clean_sheet | teams | competitions | main_role | team_league_position | macro_role | goal_contribution | goal_contribution_per90 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2015/2016 | 2936 | Christophe Kerbrat | 29 | 29 | 28 | 2755 | 0 | 1 | 7 | ... | 0 | 0.0 | 0 | ['Guingamp'] | ['France - Ligue 1'] | Right Center Back | 16 | DEF | 0 | 0.0 |
| 1 | 2015/2016 | 2943 | Lucas Deaux | 16 | 13 | 10 | 1266 | 3 | 3 | 3 | ... | 0 | 0.0 | 0 | ['Nantes'] | ['France - Ligue 1'] | Right Defensive Midfield | 13 | MID | 0 | 0.0 |
| 2 | 2015/2016 | 2944 | Benjamin Corgnet | 16 | 5 | 1 | 613 | 11 | 4 | 0 | ... | 0 | 0.0 | 0 | ['Saint-Étienne'] | ['France - Ligue 1'] | Center Attacking Midfield | 6 | MID | 2 | 0.29 |
| 3 | 2015/2016 | 2946 | Frédéric Guilbert | 29 | 28 | 26 | 2574 | 1 | 2 | 3 | ... | 0 | 0.0 | 0 | ['Bordeaux'] | ['France - Ligue 1'] | Right Center Back | 14 | DEF | 0 | 0.0 |
| 4 | 2015/2016 | 2947 | Anthony Lopes | 37 | 37 | 37 | 3548 | 0 | 0 | 2 | ... | 0 | 0.492958 | 13 | ['Lyon'] | ['France - Ligue 1'] | Goalkeeper | 2 | GK | 0 | 0.0 |


Available columns:
- season
- player_id
- player_name
- presences
- matches_started
- full_matches
- minutes_played
- substitutions_in
- substitutions_out
- yellow_cards
- red_cards
- shots_attempted
- shots_attempted_per90
- shots_on_target
- shots_on_target_per90
- goals
- goals_per90
- xg_total
- xg_total_per90
- assists
- assists_per90
- key_passes
- key_passes_per90
- passes_attempted
- passes_attempted_per90
- passes_completed
- passes_completed_per90
- pass_accuracy
- progressive_passes
- progressive_passes_per90
- crosses
- crosses_per90
- switches
- carries_attempted
- carries_attempted_per90
- carry_distance_total
- carry_distance_total_per90
- progressive_carries
- carries_to_penalty_area
- carries_to_penalty_area_per90
- dribbles_attempted
- dribbles_attempted_per90
- dribbles_completed
- dribbles_completed_per90
- dribbles_success_rate
- duels_attempted
- duels_attempted_per90
- duels_won
- duels_won_per90
- duels_lost
- interceptions_attempted
- interceptions_attempted_pe

## Initial Filtering

Before analyzing the dataset, we exclude players with **very low playing time**. The idea is to avoid misleading results caused by small-sample performances:

- **Threshold:** players with fewer than **900 minutes played** (≈ 10 full matches) are removed

Such players might show inflated per-90 statistics due to limited appearances (e.g., a substitute scoring 1 goal in 90 minutes would appear as 1 goal/90, which is not representative). By applying this filter, we ensure that only players with a **substantial involvement across the season** are considered in the evaluation.

In [491]:
# Filter players with at least 900 minutes played
df_filtered = df_final[df_final["minutes_played"] >= 900].copy()

print(f"Initial dataset size: {df_final.shape[0]} players")
print(f"Remaining players after filtering (≥900 minutes): {df_filtered.shape[0]}")
print(f"Players filtered out: {df_final.shape[0] - df_filtered.shape[0]}")

Initial dataset size: 2632 players
Remaining players after filtering (≥900 minutes): 1658
Players filtered out: 974


## Add `finishing_delta_per90` and `finishing_delta` to the dataset

In [492]:
# Compute finishing delta per90 and add it to the DataFrame
df_filtered["finishing_delta_per90"] = (
    df_filtered["goals_per90"].fillna(0) - df_filtered["xg_total_per90"].fillna(0)
).round(2)

# Compute finishing delta and add it to the DataFrame
df_filtered["finishing_delta"] = (
    df_filtered["goals"].fillna(0) - df_filtered["xg_total"].fillna(0)
).round(2)

# Display the result for a few well-known forwards
cols = ["player_name", "goals", "xg_total", "finishing_delta_per90", "finishing_delta"]
for name in ["Harry Kane", "Luis Suárez", "Gonzalo Higuaín", "Robert Lewandowski", "Zlatan Ibrahimović"]:
    display(df_filtered[df_filtered["player_name"] == name][cols])

|  | player_name | goals | xg_total | finishing_delta_per90 | finishing_delta |
|---|---|---|---|---|---|
| 1697 | Harry Kane | 25 | 21.63 | 0.08 | 3.37 |

|  | player_name | goals | xg_total | finishing_delta_per90 | finishing_delta |
|---|---|---|---|---|---|
| 651 | Luis Suárez | 40 | 27.66 | 0.34 | 12.34 |

|  | player_name | goals | xg_total | finishing_delta_per90 | finishing_delta |
|---|---|---|---|---|---|
| 680 | Gonzalo Higuaín | 36 | 25.01 | 0.32 | 10.99 |

|  | player_name | goals | xg_total | finishing_delta_per90 | finishing_delta |
|---|---|---|---|---|---|
| 743 | Robert Lewandowski | 30 | 24.52 | 0.18 | 5.48 |

|  | player_name | goals | xg_total | finishing_delta_per90 | finishing_delta |
|---|---|---|---|---|---|
| 411 | Zlatan Ibrahimović | 36 | 22.7 | 0.46 | 13.3 |


## Team Impact

A new feature is introduced to capture the **impact of each player on team success**.  
This metric balances individual playing time with the league position of the team(s) a player represented during the season:

- **League position (`team_league_position`)** is used as a proxy for **team strength and season success**  
  - Lower values correspond to stronger teams (e.g., 1 = champion).  
  - To avoid extreme penalization of players from weaker teams, positions are normalized into a **team strength score** between 0 and 1:
      
    $
    \mathrm{team\_strength} = \frac{\max(\mathrm{position}) - \mathrm{avg\_position}}{\max(\mathrm{position}) - 1}
    $

- **Minutes played (`minutes_played`)** reflect how much the player actually contributed on the pitch.  
  - Minutes are normalized using min-max scaling, with the 900-minute filter threshold as the minimum, to keep values in a 0–1 range:

      $
      \mathrm{norm\_minutes} = \frac{\mathrm{minutes\_played} - 900}{\max(\mathrm{minutes\_played}) - 900}
      $

- For players with appearances in **multiple teams**, the **average league position** is considered.

The final **team impact** score is then computed as:

$
\mathrm{team\_impact} = \mathrm{norm\_minutes} \times \mathrm{team\_strength}
$


- Players with **many minutes** in **successful teams** are rewarded, reflecting sustained contribution in a high-performing context

- Players with **few minutes** or in **low-ranked teams** receive a smaller value, which avoids overestimating substitutes or players with limited influence

- Because minutes are rescaled from the 900-minute threshold, the score is not proportional to raw minutes: with the season maximum of 3690 minutes implied by the outputs below, 3000 minutes maps to about 0.75 and 1000 minutes to about 0.04
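
The formulas above can be checked by hand with a small plain-Python sketch. The function names are hypothetical, and the constants are assumptions read off the notebook's outputs: a maximum of 3690 minutes and a worst league position of 20. The second call uses a made-up player with two teams to show the position averaging.

```python
import ast
from statistics import mean

MIN_MINUTES = 900    # filter threshold used in the notebook
MAX_MINUTES = 3690   # assumption: season maximum implied by the displayed outputs
MAX_POSITION = 20    # assumption: worst league position in the displayed outputs


def average_position(raw) -> float:
    """Parse a league position given as a number or a list-like string."""
    value = ast.literal_eval(str(raw))
    if isinstance(value, (list, tuple)):
        # Several teams in one season: average their league positions
        return float(mean(value))
    return float(value)


def team_impact(minutes: float, raw_position) -> float:
    """norm_minutes * team_strength, as defined in the text above."""
    norm_minutes = (minutes - MIN_MINUTES) / (MAX_MINUTES - MIN_MINUTES)
    team_strength = (MAX_POSITION - average_position(raw_position)) / (MAX_POSITION - 1)
    return norm_minutes * team_strength


# Harry Kane: 3563 minutes, team finished 3rd (values from the notebook output)
kane = team_impact(3563, 3)

# Hypothetical player who split 2295 minutes between teams finishing 3rd and 7th
two_teams = team_impact(2295, "[3, 7]")

print(round(kane, 6), round(two_teams, 6))
```

Because the two factors are multiplied, a player on the last-placed team gets a team impact of 0 regardless of minutes played.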

In [493]:
# Create a copy of the filtered DataFrame
df = df_filtered.copy()

# Min and max minutes
min_minutes = 900
max_minutes = df["minutes_played"].max()

# Normalized minutes (0–1 scale)
df["norm_minutes"] = (df["minutes_played"] - min_minutes) / (max_minutes - min_minutes)

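# Parse team_league_position (a single value, or a list for multi-team players)
def parse_team_position(x):
    try:
        val = ast.literal_eval(str(x))
        if isinstance(val, (list, tuple)):
            # Several teams in the season: use the average league position
            return float(np.mean(val))
        return float(val)
    except (ValueError, SyntaxError, TypeError):
        # Unparseable or missing values become NaN
        return np.nan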

df["avg_team_position"] = df["team_league_position"].apply(parse_team_position)

# Normalize team position into team strength (1 = champion, 0 = bottom)
max_position = int(df["avg_team_position"].max())
df["team_strength"] = (max_position - df["avg_team_position"]) / (max_position - 1)

# Compute team impact
df["team_impact"] = df["norm_minutes"] * df["team_strength"]

# Display some known players
for name in ["Harry Kane", "Luis Suárez", "Gonzalo Higuaín", "Robert Lewandowski", "Zlatan Ibrahimović"]:
    display(df[df["player_name"] == name][["player_name", "minutes_played", "avg_team_position", "team_strength", "team_impact"]])


|  | player_name | minutes_played | avg_team_position | team_strength | team_impact |
|---|---|---|---|---|---|
| 1697 | Harry Kane | 3563 | 3.0 | 0.894737 | 0.854009 |

|  | player_name | minutes_played | avg_team_position | team_strength | team_impact |
|---|---|---|---|---|---|
| 651 | Luis Suárez | 3299 | 1.0 | 1.0 | 0.859857 |

|  | player_name | minutes_played | avg_team_position | team_strength | team_impact |
|---|---|---|---|---|---|
| 680 | Gonzalo Higuaín | 3093 | 2.0 | 0.947368 | 0.744652 |

|  | player_name | minutes_played | avg_team_position | team_strength | team_impact |
|---|---|---|---|---|---|
| 743 | Robert Lewandowski | 2736 | 1.0 | 1.0 | 0.658065 |

|  | player_name | minutes_played | avg_team_position | team_strength | team_impact |
|---|---|---|---|---|---|
| 411 | Zlatan Ibrahimović | 2568 | 1.0 | 1.0 | 0.597849 |


In [494]:
# Display min and max
min_player = df.loc[df["team_impact"].idxmin(), ["player_name", "macro_role", "minutes_played", "avg_team_position", "team_impact"]]
max_player = df.loc[df["team_impact"].idxmax(), ["player_name", "macro_role", "minutes_played", "avg_team_position", "team_impact"]]

print("Lowest team_impact:")
display(min_player.to_frame().T)

print("\nHighest team_impact:")
display(max_player.to_frame().T)

Lowest team_impact:


|  | player_name | macro_role | minutes_played | avg_team_position | team_impact |
|---|---|---|---|---|---|
| 31 | Stéphane Darbion | MID | 1654 | 20.0 | 0.0 |



Highest team_impact:


|  | player_name | macro_role | minutes_played | avg_team_position | team_impact |
|---|---|---|---|---|---|
| 444 | Wes Morgan | DEF | 3687 | 1.0 | 0.998925 |


## Per-90 Evaluation

Role-specific **per-90 indices** are computed to measure efficiency. This step identifies which players perform best on average when on the pitch, regardless of total playing time. Players are then ranked by their **per-90 score**, and the top performers across all roles are displayed.
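
A role-specific index of this kind is a weighted sum of normalized metrics, with a separate weight set per macro role. The sketch below uses plain dictionaries; the function name, the weights, and the two players are all made up for illustration and are not the notebook's values.

```python
def role_index(player: dict, weights_by_role: dict) -> float:
    """Weighted sum of a player's metrics using the weights of their macro role."""
    weights = weights_by_role.get(player["macro_role"], {})
    return sum(w * player[m] for m, w in weights.items() if m in player)


# Illustrative weights only (not the notebook's correlation-based weights)
weights_by_role = {
    "FWD": {"goals_per90": 1.0, "key_passes_per90": 0.5},
    "DEF": {"clearances_per90": 1.0, "goals_per90": 0.2},
}

# Metrics are assumed to be already scaled to the 0-1 range
striker = {"macro_role": "FWD", "goals_per90": 0.8, "key_passes_per90": 0.4, "clearances_per90": 0.1}
defender = {"macro_role": "DEF", "goals_per90": 0.1, "key_passes_per90": 0.2, "clearances_per90": 0.9}

print(role_index(striker, weights_by_role), role_index(defender, weights_by_role))
```

Because each role has its own weights, the same stat line can score differently depending on the player's macro role.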

In [495]:
metrics_per90 = [
    # Scoring & Shooting
    "goals_per90", "xg_total_per90", "shots_on_target_per90",
    "goal_contribution_per90", "finishing_delta_per90",

    # Passing & Creativity
    "assists_per90", "key_passes_per90",
    "passes_attempted_per90", "passes_completed_per90", "pass_accuracy",
    "progressive_passes_per90", "crosses_per90", "switches",

    # Carrying & Dribbling
    "progressive_carries", "carries_to_penalty_area_per90", "carry_distance_total_per90",
    "dribbles_completed_per90", "dribbles_success_rate",

    # Defensive Actions
    "duels_won_per90", "duels_success_rate",
    "interceptions_won_per90", "interceptions_ratio",
    "blocks_per90", "clearances_per90", "ball_recoveries_per90", "pressures_per90",

    # Goalkeeper
    "gk_save_ratio", "gk_saves_per90", "gk_goals_conceded_per90",

    # Contextual impact
    "team_impact"
]


In [496]:
# Create a copy of the DataFrame to avoid changing the original data
df_per90 = df.copy()

#### Normalization
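
The scaling applied in the next cell is min-max rescaling per column, x' = (x - min) / (max - min). A minimal plain-Python sketch follows; `min_max_scale` is a hypothetical helper, not part of the notebook, and the input values are made up.

```python
def min_max_scale(values):
    """Rescale a list of numbers to the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption for this sketch: a constant column is mapped to 0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Made-up per-90 passing volumes for four players
passes_attempted_per90 = [20.0, 40.0, 60.0, 100.0]
scaled = min_max_scale(passes_attempted_per90)
print(scaled)
```

In the notebook the scaler is fitted on all players at once, so each metric's minimum and maximum are taken across roles rather than within a role.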

In [497]:
# Initialize the scaler
scaler = MinMaxScaler()

# Normalize the values between 0 and 1
df_per90[metrics_per90] = scaler.fit_transform(df_per90[metrics_per90])

# Print min and max values for each metric
print("Min and max values after normalization:")
print(df_per90[metrics_per90].agg(['min', 'max']))

Min and max values after normalization:
     goals_per90  xg_total_per90  shots_on_target_per90  \
min          0.0             0.0                    0.0   
max          1.0             1.0                    1.0   

     goal_contribution_per90  finishing_delta_per90  assists_per90  \
min                      0.0                    0.0            0.0   
max                      1.0                    1.0            1.0   

     key_passes_per90  passes_attempted_per90  passes_completed_per90  \
min               0.0                     0.0                     0.0   
max               1.0                     1.0                     1.0   

     pass_accuracy  ...  interceptions_won_per90  interceptions_ratio  \
min            0.0  ...                      0.0                  0.0   
max            1.0  ...                      1.0                  1.0   

     blocks_per90  clearances_per90  ball_recoveries_per90  pressures_per90  \
min           0.0               0.0                 

#### Weights Definition
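
Read from the two cells that follow, the weighting scheme can be summarized as below. The symbols $w_m^{(r)}$, $\rho_{mj}^{(r)}$ and $M$ are notation introduced here for illustration, not names used in the notebook. For role $r$, each metric's raw weight is its mean absolute Pearson correlation with the $M$ metrics that have a defined correlation (itself included), and the raw weights are then min-max rescaled within the role:

$$
w_m^{(r)} = \frac{1}{M} \sum_{j=1}^{M} \left| \rho_{mj}^{(r)} \right|,
\qquad
\tilde{w}_m^{(r)} = \frac{w_m^{(r)} - \min_k w_k^{(r)}}{\max_k w_k^{(r)} - \min_k w_k^{(r)}}
$$

Metrics whose correlation is undefined within a role (for example, a column that is constant for that role) receive a raw weight of 0 in the code.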

In [498]:
# Compute feature weights per role from the mean absolute correlation between metrics
def assign_weights_by_correlation(df, metrics):
    # Dictionary holding one weight Series per role
    weights = {}

    # Roles present in the dataset
    roles = df['macro_role'].unique()

    for role in roles:
        # Keep only this role's rows and the selected metrics
        role_data = df[df['macro_role'] == role][metrics]

        # Drop rows with a missing value in any selected metric
        role_data = role_data.dropna(subset=metrics)

        # Absolute correlation matrix, so negative correlations also count as association
        corr_matrix = role_data.corr().abs()

        # Mean absolute correlation of each metric with all metrics (the diagonal is included)
        mean_corr = corr_matrix.mean(axis=0)

        # Store the weights sorted from highest to lowest mean correlation
        weights[role] = mean_corr.sort_values(ascending=False)

        # Metrics with undefined correlation (e.g. constant within the role) get weight 0
        weights[role] = weights[role].fillna(0)

    return weights

# Calculate the weights based on correlation for each role
role_weights = assign_weights_by_correlation(df_per90, metrics_per90)

# Print the weights for each role
for role, weight in role_weights.items():
    print(f"Weights for the role {role}:")
    print(weight)
    print("\n")


Weights for the role DEF:
carries_to_penalty_area_per90    0.338346
crosses_per90                    0.337780
carry_distance_total_per90       0.334754
clearances_per90                 0.328157
key_passes_per90                 0.320600
dribbles_completed_per90         0.320511
passes_attempted_per90           0.311125
ball_recoveries_per90            0.291084
progressive_carries              0.290834
goal_contribution_per90          0.282167
assists_per90                    0.273782
passes_completed_per90           0.270665
pressures_per90                  0.265089
duels_won_per90                  0.231413
interceptions_ratio              0.228061
duels_success_rate               0.219914
pass_accuracy                    0.214666
shots_on_target_per90            0.212440
switches                         0.195477
goals_per90                      0.179515
interceptions_won_per90          0.177353
team_impact                      0.170919
xg_total_per90                   0.158757
blocks_p

In [499]:
# Min-max normalize the weights within each role
def normalize_weights_per_role(weights):
    normalized_weights = {}
    for role, weight in weights.items():
        # Rescale this role's weights to the 0-1 range
        min_weight = weight.min()
        max_weight = weight.max()

        # If all weights are equal, skip rescaling to avoid division by zero
        if max_weight != min_weight:
            normalized_weights[role] = (weight - min_weight) / (max_weight - min_weight)
        else:
            normalized_weights[role] = weight  # Keep the weights unchanged when max and min are equal
    return normalized_weights

# Normalize the weights per role
normalized_role_weights = normalize_weights_per_role(role_weights)

# Print the normalized weights for each role
for role, weight in normalized_role_weights.items():
    print(f"Normalized Weights for the role {role}:")
    print(weight)
    print("\n")


Normalized Weights for the role DEF:
carries_to_penalty_area_per90    1.000000
crosses_per90                    0.998328
carry_distance_total_per90       0.989384
clearances_per90                 0.969887
key_passes_per90                 0.947551
dribbles_completed_per90         0.947288
passes_attempted_per90           0.919547
ball_recoveries_per90            0.860315
progressive_carries              0.859576
goal_contribution_per90          0.833959
assists_per90                    0.809177
passes_completed_per90           0.799966
pressures_per90                  0.783485
duels_won_per90                  0.683954
interceptions_ratio              0.674047
duels_success_rate               0.649967
pass_accuracy                    0.634458
shots_on_target_per90            0.627878
switches                         0.577744
goals_per90                      0.530566
interceptions_won_per90          0.524177
team_impact                      0.505161
xg_total_per90                   0.4692

#### Evaluation

In [500]:

# Compute the per-90 index as the weighted sum of a player's normalized metrics,
# using the correlation-based weights of the player's macro role
def compute_role_index(row, weights):
    role = row["macro_role"]
    score = 0
    if role in weights:
        for metric, w in weights[role].items():
            if metric in row.index:
                score += w * row[metric]
    return score

# Compute per-90 points with the normalized correlation-based weights
df_per90["per90_points"] = df_per90.apply(lambda r: compute_role_index(r, normalized_role_weights), axis=1)

# Display the Top 30 players based on per90_points
top30_per90 = df_per90.sort_values("per90_points", ascending=False).head(30)

print("Top 30 players by per-90 points:")
display(top30_per90[["player_name", "macro_role", "per90_points"]])


Top 30 players by per-90 points:


|  | player_name | macro_role | per90_points |
|---|---|---|---|
| 681 | Lionel Messi | FWD | 9.169529 |
| 512 | Neymar | FWD | 9.12903 |
| 33 | Ángel Di María | FWD | 9.096837 |
| 379 | Sofiane Boufal | FWD | 8.261512 |
| 760 | James Rodríguez | FWD | 8.070939 |
| 411 | Zlatan Ibrahimović | FWD | 7.827096 |
| 237 | Alexis Sánchez | FWD | 7.767099 |
| 771 | Paulo Dybala | FWD | 7.764264 |
| 171 | Henrikh Mkhitaryan | FWD | 7.739743 |
| 474 | Jesé | FWD | 7.725483 |


## Seasonal Evaluation

In [501]:
# Metrics used for the seasonal evaluation

metrics = [
    # Scoring & Shooting
    "goals", "xg_total", "shots_on_target",
    "goal_contribution", "finishing_delta",

    # Passing & Creativity
    "assists", "key_passes",
    "passes_attempted", "passes_completed", "pass_accuracy",
    "progressive_passes", "crosses", "switches",

    # Carrying & Dribbling
    "progressive_carries", "carries_to_penalty_area", "carry_distance_total",
    "dribbles_completed", "dribbles_success_rate",

    # Defensive Actions
    "duels_won", "duels_success_rate",
    "interceptions_won", "interceptions_ratio",
    "blocks", "clearances", "ball_recoveries", "pressures",

    # Goalkeeper
    "gk_save_ratio", "gk_saves", "gk_goals_conceded", 
    "gk_penalties_saved", "gk_clean_sheet",

    # Discipline
    "yellow_cards", "red_cards", "own_goals",

    # Fouls
    "fouls_won", "fouls_balance",

    # Contextual impact
    "team_impact"
]


In [502]:
# Create a copy of the DataFrame to avoid changing the original data
df_seasonal = df.copy()

#### Normalization

In [503]:
# Initialize the scaler
scaler = MinMaxScaler()

# Normalize the values between 0 and 1
df_seasonal[metrics] = scaler.fit_transform(df_seasonal[metrics])

# Print min and max values for each metric
print("Min and max values after normalization:")
print(df_seasonal[metrics].agg(['min', 'max']))

Min and max values after normalization:
     goals  xg_total  shots_on_target  goal_contribution  finishing_delta  \
min    0.0       0.0              0.0                0.0              0.0   
max    1.0       1.0              1.0                1.0              1.0   

     assists  key_passes  passes_attempted  passes_completed  pass_accuracy  \
min      0.0         0.0               0.0               0.0            0.0   
max      1.0         1.0               1.0               1.0            1.0   

     ...  gk_saves  gk_goals_conceded  gk_penalties_saved  gk_clean_sheet  \
min  ...       0.0                0.0                 0.0             0.0   
max  ...       1.0                1.0                 1.0             1.0   

     yellow_cards  red_cards  own_goals  fouls_won  fouls_balance  team_impact  
min           0.0        0.0        0.0        0.0            0.0          0.0  
max           1.0        1.0        1.0        1.0            1.0          1.0  

[2 rows x 37 c

#### Weights Definition

In [504]:
# assign_weights_by_correlation is defined in the per-90 Weights Definition cell above and is reused here

# Calculate the weights based on correlation for each role
role_weights = assign_weights_by_correlation(df_seasonal, metrics)

# Print the weights for each role
for role, weight in role_weights.items():
    print(f"Weights for the role {role}:")
    print(weight)
    print("\n")

Weights for the role DEF:
carry_distance_total       0.426192
ball_recoveries            0.423461
passes_attempted           0.423072
progressive_carries        0.403074
passes_completed           0.395629
pressures                  0.380687
progressive_passes         0.365238
blocks                     0.362443
key_passes                 0.360728
dribbles_completed         0.353046
carries_to_penalty_area    0.346929
duels_won                  0.343476
goal_contribution          0.341349
crosses                    0.339888
team_impact                0.330308
fouls_won                  0.320859
interceptions_won          0.318912
shots_on_target            0.310012
assists                    0.300077
switches                   0.299459
clearances                 0.254084
xg_total                   0.251677
goals                      0.241231
yellow_cards               0.196057
interceptions_ratio        0.193535
duels_success_rate         0.175033
pass_accuracy              0.154602
fo

In [505]:
# normalize_weights_per_role is defined in the per-90 Weights Definition cells above and is reused here

# Normalize the weights per role
normalized_role_weights = normalize_weights_per_role(role_weights)

# Print the normalized weights for each role
for role, weight in normalized_role_weights.items():
    print(f"Normalized Weights for the role {role}:")
    print(weight)
    print("\n")


Normalized Weights for the role DEF:
carry_distance_total       1.000000
ball_recoveries            0.993593
passes_attempted           0.992680
progressive_carries        0.945756
passes_completed           0.928289
pressures                  0.893228
progressive_passes         0.856980
blocks                     0.850422
key_passes                 0.846398
dribbles_completed         0.828373
carries_to_penalty_area    0.814020
duels_won                  0.805918
goal_contribution          0.800928
crosses                    0.797500
team_impact                0.775023
fouls_won                  0.752851
interceptions_won          0.748282
shots_on_target            0.727401
assists                    0.704089
switches                   0.702638
clearances                 0.596173
xg_total                   0.590525
goals                      0.566016
yellow_cards               0.460022
interceptions_ratio        0.454103
duels_success_rate         0.410691
pass_accuracy              

In [506]:
# Seasonal disciplinary metrics
discipline_metrics = ["yellow_cards", "red_cards", "own_goals"]

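# Invert the sign of the correlation-based weights so that disciplinary metrics act as penalties
# Note: only these three metrics are inverted; gk_goals_conceded keeps its positive weight in this cell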
for metric in discipline_metrics:
    for role in normalized_role_weights:
        if metric in normalized_role_weights[role]:
            normalized_role_weights[role][metric] = -normalized_role_weights[role][metric]


#### Evaluation

In [507]:
# Function to compute the seasonal points with correlation-based weights and disciplinary penalties
def compute_role_index_seasonal(row, weights):
    role = row["macro_role"]
    score = 0
    if role in weights:
        for metric, w in weights[role].items():
            if metric in row.index and not pd.isna(row[metric]):
                score += w * row[metric]
    return score

# Apply the correlation-based weights to compute seasonal points
df_seasonal["seasonal_points"] = df_seasonal.apply(lambda r: compute_role_index_seasonal(r, normalized_role_weights), axis=1)

# Display the Top 30 players based on seasonal points
top30_seasonal = df_seasonal.sort_values("seasonal_points", ascending=False).head(30)

print("Top 30 players by seasonal points:")
display(top30_seasonal[["player_name","teams", "macro_role", "seasonal_points"]])


Top 30 players by seasonal points:


|  | player_name | teams | macro_role | seasonal_points |
|---|---|---|---|---|
| 512 | Neymar | ['Barcelona'] | FWD | 10.993652 |
| 914 | Franco Vázquez | ['Palermo'] | MID | 9.806291 |
| 681 | Lionel Messi | ['Barcelona'] | FWD | 9.733786 |
| 445 | Riyad Mahrez | ['Leicester City'] | MID | 9.559806 |
| 1084 | Marek Hamšík | ['Napoli'] | MID | 9.306021 |
| 303 | Mesut Özil | ['Arsenal'] | MID | 9.119919 |
| 878 | Ryad Boudebouz | ['Montpellier'] | MID | 8.930151 |
| 1924 | Paul Pogba | ['Juventus'] | MID | 8.893151 |
| 1063 | Borja Valero | ['Fiorentina'] | MID | 8.797432 |
| 57 | Christian Eriksen | ['Tottenham Hotspur'] | FWD | 8.628883 |


## Final Ranking Ballon d'Or 2015/16

In [509]:
# Merge the two score tables on player_id and season (names alone may not be unique)
ranking = pd.merge(
    df_seasonal[["player_id", "season", "player_name", "teams", "macro_role", "seasonal_points"]],
    df_per90[["player_id", "season", "per90_points"]],
    on=["player_id", "season"], how="inner",
)

# Combine the seasonal and per-90 points into a final score (equal weights)
ranking["final_points"] = 0.5 * ranking["seasonal_points"] + 0.5 * ranking["per90_points"]

# Rank players based on final points
top30_final = ranking.sort_values("final_points", ascending=False).head(30)

# Display the top 30 players with their final points
print("Top 30 players by final points (weighted per-90 and seasonal points):")
display(top30_final[["player_name", "macro_role", "teams", "final_points", "seasonal_points", "per90_points"]])

Top 30 players by final points (weighted per-90 and seasonal points):


|  | player_name | macro_role | teams | final_points | seasonal_points | per90_points |
|---|---|---|---|---|---|---|
| 399 | Neymar | FWD | ['Barcelona'] | 10.061341 | 10.993652 | 9.12903 |
| 509 | Lionel Messi | FWD | ['Barcelona'] | 9.451657 | 9.733786 | 9.169529 |
| 352 | Riyad Mahrez | MID | ['Leicester City'] | 8.521604 | 9.559806 | 7.483402 |
| 706 | Franco Vázquez | MID | ['Palermo'] | 8.424915 | 9.806291 | 7.043539 |
| 24 | Ángel Di María | FWD | ['Paris Saint-Germain'] | 8.3812 | 7.665562 | 9.096837 |
| 247 | Mesut Özil | MID | ['Arsenal'] | 8.227111 | 9.119919 | 7.334303 |
| 848 | Marek Hamšík | MID | ['Napoli'] | 8.221763 | 9.306021 | 7.137505 |
| 44 | Christian Eriksen | FWD | ['Tottenham Hotspur'] | 8.116075 | 8.628883 | 7.603267 |
| 1377 | Paul Pogba | MID | ['Juventus'] | 8.06011 | 8.893151 | 7.22707 |
| 307 | Sofiane Boufal | FWD | ['Lille'] | 7.861834 | 7.462157 | 8.261512 |
