
# ⚽ Ligue 1 Club Performance Analysis (2014–2024)

This notebook provides an in-depth exploratory analysis of Ligue 1 club performance over nearly a decade.  
We focus on club statistics at the **seasonal level**, drawing insights into success factors, performance dynamics, and statistical profiles using Python and machine learning.

---


## Loading the libraries
            

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, plot_tree

## 1. Data Loading and Initial Overview

The dataset is loaded from a CSV file containing team-level seasonal performance data in Ligue 1.
Variables include basic match stats (`GF`, `GA`, `Pts`), advanced metrics (`Poss`, `CS%`), and outcomes (`LgRank`).
We will normalize some of these values per match to allow fair comparisons between teams and across seasons.
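
Per-match normalization matters because Ligue 1 seasons differ in length: the 2019-20 season was stopped after 28 matchdays, and since 2023-24 the league has had 18 clubs playing 34 matches each. A minimal sketch on made-up numbers (the `toy` frame and its values are illustrative and not taken from the dataset):

```python
import pandas as pd

# Hypothetical rows: different raw goal totals, but the same scoring rate
toy = pd.DataFrame({
    "team": ["A", "B", "C"],
    "mp": [38, 28, 34],   # matches played in seasons of different length
    "gf": [57, 42, 51],   # goals scored
})

# Dividing by matches played puts every row on the same per-match scale
toy["gf/mp"] = toy["gf"] / toy["mp"]
print(toy)
```

Ranked on raw `gf`, team A would look clearly ahead; on `gf/mp` the three rows are equivalent.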
            

In [None]:
df = pd.read_csv("ligue-1-stat-15-24.csv", skiprows=1, sep=";")

In [None]:
df.head()

In [None]:
print(df.columns)

df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')

df.info()
df.isna().sum()

In [None]:
df = df.drop(columns = ['comp', 'w.1', 'min', 'rk'])

In [None]:
df['gf/mp'] = df['gf']/df['mp']
df['ga/mp'] = df['ga']/df['mp']
df['gd/mp'] = df['gd']/df['mp']

In [None]:
saisons = df["season"].unique()

for saison in saisons:
    plt.figure(figsize=(12,6))
    
    data_saison = df[df["season"] == saison]
    
    ax = sns.barplot(
            data=data_saison,
            x="team",
            y="gf",
            order=data_saison.sort_values("gf", ascending=False)["team"]
        )
    
    plt.title(f"Goals scored by team - Season {saison}")
    plt.xlabel("Club")
    plt.ylabel("Goals")
    plt.xticks(rotation=45)
    
    for bar in ax.patches:
        height = bar.get_height()
        ax.text(
            bar.get_x() + bar.get_width() / 2,
            height + 1,
            f'{int(height)}',
            ha='center',
            va='bottom',
            fontsize=9
        )
        
    plt.tight_layout()
    plt.show()

In [None]:
saisons = df["season"].unique()

for saison in saisons:
    plt.figure(figsize=(12,6))
    
    data_saison = df[df["season"] == saison]
    
    ax = sns.barplot(
            data=data_saison,
            x="team",
            y="gd",
            order=data_saison.sort_values("gd", ascending=False)["team"]
        )
    
    plt.title(f"Goal difference by team - Season {saison}")
    plt.xlabel("Club")
    plt.ylabel("Goal difference")
    plt.xticks(rotation=45)
    
    for bar in ax.patches:
        height = bar.get_height()
        # Put the label above positive bars and below negative ones
        offset = 1 if height >= 0 else -1
        ax.text(
            bar.get_x() + bar.get_width() / 2,
            height + offset,
            f'{int(height)}',
            ha='center',
            va='bottom' if height >= 0 else 'top',
            fontsize=9
        )
        
    plt.tight_layout()
    plt.show()

In [None]:
saisons = df["season"].unique()
print("Seasons available:", saisons)

clubs_par_saison = {saison: set(df[df["season"] == saison]["team"]) for saison in saisons}

clubs_toujours_present = set.intersection(*clubs_par_saison.values())

print("Teams that have never been relegated :", sorted(clubs_toujours_present))

In [None]:
df["presence_toutes_saisons"] = df["team"].apply(lambda club: 1 if club in clubs_toujours_present else 0)

## 3. Long-Term Teams vs. Promoted/Relegated Clubs

This section isolates clubs that were present in Ligue 1 every season from 2015 to 2024.
We compare their average statistics with those of less stable clubs (promoted/relegated).
This reveals structural differences in possession, points, and goal statistics.
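
The grouping logic can be sketched on a toy table: collect the set of teams for each season, intersect those sets, then compare group means. The frame `toy`, its team names and its values are hypothetical and only illustrate the mechanics.

```python
import pandas as pd

# Hypothetical data: A and B play both seasons, C and D only one each
toy = pd.DataFrame({
    "season": ["2021-22", "2021-22", "2021-22", "2022-23", "2022-23", "2022-23"],
    "team":   ["A", "B", "C", "A", "B", "D"],
    "pts/mp": [2.0, 1.5, 0.9, 2.2, 1.3, 1.0],
})

# One set of teams per season, then the intersection across all seasons
teams_by_season = toy.groupby("season")["team"].apply(set)
ever_present = set.intersection(*teams_by_season)

# Label each row and compare the average points per match of the two groups
toy["group"] = toy["team"].isin(ever_present).map(
    {True: "ever-present", False: "occasional"}
)
group_means = toy.groupby("group")["pts/mp"].mean()
print(ever_present)
print(group_means)
```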
            

In [None]:
statistiques = ["pts/mp", "gf/mp", "ga/mp", "gd", "poss", "cs%", "w", "d", "l", "pk", "pkatt"]

# Clubs present in every season
df_clubs_toujours_present = df[df["team"].isin(clubs_toujours_present)]

# Per-team mean of every numeric column (same result as averaging team by team in a loop)
moyennes_equipe_df = df_clubs_toujours_present.groupby("team").mean(numeric_only=True)

for stat in statistiques:
    plt.figure(figsize=(12, 6))
    moyennes_equipe_df[stat].sort_values(ascending=False).plot(kind='bar', figsize=(12, 6))
    plt.title(f"Mean of {stat} by team over the period")
    plt.ylabel(f"Mean of {stat}")
    plt.xticks(rotation=90)
    plt.tight_layout()
    plt.show()


In [None]:
statistiques = ["pts/mp", "gf/mp", "ga/mp", "gd", "poss", "cs%", "w", "d", "l", "pk", "pkatt"]

# Clubs that were NOT present in every season (promoted or relegated at some point)
df_clubs_ponctuels = df[~df["team"].isin(clubs_toujours_present)]

# Per-team mean of every numeric column
moyennes_equipe_df = df_clubs_ponctuels.groupby("team").mean(numeric_only=True)

for stat in statistiques:
    plt.figure(figsize=(12, 6))
    moyennes_equipe_df[stat].sort_values(ascending=False).plot(kind='bar', figsize=(12, 6))
    plt.title(f"Mean of {stat} by team over the period")
    plt.ylabel(f"Mean of {stat}")
    plt.xticks(rotation=90)
    plt.tight_layout()
    plt.show()

In [None]:
moyennes_saison = []

for saison in df["season"].unique():
    df_saison = df[df["season"] == saison]
    
    clubs_toujours = df_saison[df_saison["team"].isin(clubs_toujours_present)]
    clubs_ponctuels = df_saison[~df_saison["team"].isin(clubs_toujours_present)]
    
    moyenne_toujours = clubs_toujours["pts/mp"].mean()
    moyenne_ponctuels = clubs_ponctuels["pts/mp"].mean()
    
    moyennes_saison.append({
        "saison": saison,
        "clubs_toujours": moyenne_toujours,
        "clubs_ponctuels": moyenne_ponctuels
    })

df_moyennes = pd.DataFrame(moyennes_saison).sort_values("saison")

plt.figure(figsize=(10,6))
plt.plot(df_moyennes['saison'].values, df_moyennes['clubs_toujours'].values, label="Ever-present clubs", marker='o')
plt.plot(df_moyennes['saison'].values, df_moyennes['clubs_ponctuels'].values, label="Occasional clubs", marker='s')
plt.title("Evolution of points per game by group of teams")
plt.xlabel("Season")
plt.ylabel("Points per game")
plt.xticks(rotation=45)
plt.legend()
plt.tight_layout()
plt.show()

## 5. Evolution of Ligue 1 Over Time

Here we explore the evolution of global performance indicators (e.g., possession, points, goals) season by season.
It highlights trends in tactical style or league competitiveness over time.
            

In [None]:
stats = ["pts/mp", "gf/mp", "poss"]

evolution_generale = df.groupby("season")[stats].mean().reset_index()

for stat in stats:
    plt.figure(figsize=(10, 5))
    plt.plot(evolution_generale["season"].values, evolution_generale[stat].values, marker='o')
    plt.title(f"Evolution of mean {stat} by Ligue 1 season")
    plt.xlabel("Season")
    plt.ylabel(f"Mean of {stat}")
    plt.xticks(rotation=45)
    plt.grid(True)
    plt.tight_layout()
    plt.show()

## 4. Visualization: Ranking Teams by Stat

We generate bar plots for each numerical statistic, sorted by descending mean value.
This allows visual comparison between clubs for metrics like possession or goals per match.
            

In [None]:
clubs_suivis = ["Paris S-G", "Lyon", "Monaco", "Marseille"]

stats = ["pts/mp", "gf/mp", "poss"]

for stat in stats:
    plt.figure(figsize=(10, 5))
    
    for club in clubs_suivis:
        data_club = df[df["team"] == club].sort_values("season")
        plt.plot(data_club["season"].values, data_club[stat].values, label=club, marker='o')
    
    plt.title(f"Evolution of {stat} by season for the teams selected")
    plt.xlabel("Season")
    plt.ylabel(stat)
    plt.xticks(rotation=45)
    plt.legend()
    plt.grid(True)
    plt.tight_layout()
    plt.show()

## 2. Average Performance by Team (All Seasons)

We compute the average statistics for each team across all seasons.
This helps to identify the most dominant or consistent clubs over time.
Metrics like `gf/mp`, `cs%`, and `poss` are useful in comparing offensive and defensive profiles.
            

In [None]:
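# expand=False returns a Series (not a one-column DataFrame) holding the leading digits of the rank
df["LgRank_num"] = df["lgrank"].str.extract(r'(\d+)', expand=False).astype(int)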

rang_stats = df.groupby("team").agg(
    moyenne_rang=("LgRank_num", "mean"),
    ecart_type_rang=("LgRank_num", "std"),
    saisons_jouees=("season", "count")
).reset_index()

rang_stats_filtre = rang_stats[rang_stats["saisons_jouees"] >= 5]

plt.figure(figsize=(10, 6))
plt.scatter(rang_stats_filtre["moyenne_rang"], rang_stats_filtre["ecart_type_rang"])

for i, row in rang_stats_filtre.iterrows():
    plt.text(row["moyenne_rang"], row["ecart_type_rang"], row["team"], fontsize=8)

plt.title("Stability of clubs in Ligue 1 (average rank vs. standard deviation)")
plt.xlabel("Average rank (lower = better)")
plt.ylabel("Standard deviation of rank (lower = more stable)")
plt.grid(True)
plt.tight_layout()
plt.show()


In [None]:
stats_classement = df.groupby("team")[["poss", "cs%", "gf/mp"]].mean().sort_values("gf/mp", ascending=False)

for stat in stats_classement.columns:
    plt.figure(figsize=(12, 6))
    stats_classement[stat].sort_values(ascending=False).plot(kind="bar")
    plt.title(f"Club ranking by mean of {stat}")
    plt.ylabel(stat)
    plt.xticks(rotation=90)
    plt.tight_layout()
    plt.show()


In [None]:
stats_radar = ["gf/mp", "ga/mp", "poss", "cs%"]

df_radar = df.groupby("team")[stats_radar].mean()

# Min-max scale each statistic to [0, 1] across teams
df_radar_norm = (df_radar - df_radar.min()) / (df_radar.max() - df_radar.min())

# Invert goals conceded exactly once, so that a larger value always means "better"
# (inverting both before and after scaling would cancel out)
df_radar_norm["ga/mp"] = 1 - df_radar_norm["ga/mp"]

top_clubs = df_radar.sort_values("gf/mp", ascending=False).head(5)

def plot_radar(data, club_name):
    values = data.loc[club_name].values
    labels = data.columns
    num_vars = len(labels)
    
    angles = np.linspace(0, 2 * np.pi, num_vars, endpoint=False).tolist()
    values = np.concatenate((values, [values[0]]))
    angles += angles[:1]

    fig, ax = plt.subplots(figsize=(6, 6), subplot_kw=dict(polar=True))
    ax.plot(angles, values, label=club_name)
    ax.fill(angles, values, alpha=0.25)
    ax.set_title(club_name, size=14)
    ax.set_xticks(angles[:-1])
    ax.set_xticklabels(labels)
    plt.tight_layout()
    plt.show()

for club in top_clubs.index:
    plot_radar(df_radar_norm, club)


In [None]:
clubs_comparaison = ["Paris S-G", "Lyon", "Monaco", "Lille", "Rennes"]

labels = stats_radar
num_vars = len(labels)
angles = np.linspace(0, 2 * np.pi, num_vars, endpoint=False).tolist()
angles += angles[:1]

fig, ax = plt.subplots(figsize=(8, 8), subplot_kw=dict(polar=True))

for club in clubs_comparaison:
    values = df_radar_norm.loc[club].tolist()
    values += values[:1]  
    ax.plot(angles, values, label=club)
    ax.fill(angles, values, alpha=0.1)

ax.set_xticks(angles[:-1])
ax.set_xticklabels(labels)
ax.set_title("Multi-club comparison - Normalized performance profile", size=14)
ax.legend(loc='upper right', bbox_to_anchor=(1.2, 1.1))
plt.tight_layout()
plt.show()

In [None]:
sns.lmplot(x="poss", y="pts/mp", data=df)
plt.title("Correlation: Possession vs. Points per Game")
plt.tight_layout()
plt.show()

sns.lmplot(x="gf/mp", y="pts/mp", data=df)
plt.title("Correlation: Goals scored per game vs. Points")
plt.tight_layout()
plt.show()

sns.lmplot(x="cs%", y="LgRank_num", data=df)
plt.title("Correlation: Clean Sheets vs. Ranking (inverse)")
plt.tight_layout()
plt.show()


## 6. Line Plots: Evolution by Metric

We plot the evolution of key indicators over time: `pts/mp`, `gf/mp`, and `poss`.
These graphs show whether the league is becoming more offensive, balanced, or controlled.
            

In [None]:
df["Top3"] = df["LgRank_num"] <= 3

stats_compare = ["gf/mp", "ga/mp", "poss", "cs%", "pts/mp"]

moyennes = df.groupby("Top3")[stats_compare].mean().T


for stat in stats_compare:
    plt.figure(figsize=(6, 4))
    df.groupby("Top3")[stat].mean().plot(kind="bar")
    plt.title(f"{stat} - Average Top 3 vs Others")
    plt.ylabel(stat)
    plt.xticks(rotation=0)
    plt.tight_layout()
    plt.show()


## 12. Logistic Regression to Predict Top 3 Teams

We use a logistic regression model to estimate the probability of a team finishing in the Top 3.
Features include offensive and defensive stats. This helps identify the most influential factors in elite performance.
            

In [None]:
df['pk%'] = df['pk']/df['pkatt']

features = ["gf/mp", "ga/mp", "poss", "cs%", "pk%"]
X = df[features]
y = df["Top3"]

imputer = SimpleImputer(strategy="mean")
X_imputed = imputer.fit_transform(X)

X = pd.DataFrame(X_imputed, columns=features)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

print("🔎 Evaluation on the test set:")
print(classification_report(y_test, y_pred))

coefficients = pd.DataFrame({
    "Variable": features,
    "Coefficient": model.coef_[0]
}).sort_values("Coefficient", ascending=False)

print("\n📊 Influence of variables on the probability of being in the Top 3:")
print(coefficients)
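
Raw logistic-regression coefficients depend on each feature's units, so a coefficient on `poss` (tens of percentage points) is not directly comparable with one on `gf/mp` (roughly 1 to 2). One possible way to make the ranking of variables more meaningful is to standardize the features before fitting. The sketch below uses synthetic data (`demo_X` and `demo_y` are made up for illustration) and is not the model fitted above:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-ins for two features on very different scales
demo_X = pd.DataFrame({
    "gf/mp": rng.normal(1.3, 0.3, n),
    "poss": rng.normal(50, 5, n),
})
# Synthetic target driven mostly by goals per match (illustrative rule only)
demo_y = (demo_X["gf/mp"] + 0.01 * demo_X["poss"] + rng.normal(0, 0.1, n)) > 1.9

# Scaling inside the pipeline means the scaler is fitted only on the data passed to fit()
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(demo_X, demo_y)

# Coefficients now express the effect of a one-standard-deviation change
std_coefs = pd.Series(pipe[-1].coef_[0], index=demo_X.columns)
print(std_coefs.sort_values(ascending=False))
```

Putting the scaler (and, in the notebook's case, the imputer) inside a pipeline also keeps test-set information out of the preprocessing step.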


## 13. Decision Tree Classification

An interpretable decision tree is trained to classify teams as Top 3 or not based on their seasonal stats.
This method reveals key thresholds (e.g., `gf/mp` > 2.0) that define elite teams.
            

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

tree_model = DecisionTreeClassifier(max_depth=3, random_state=42)
tree_model.fit(X_train, y_train)

y_pred_tree = tree_model.predict(X_test)

print("🔍 Evaluation of the tree:")
print(classification_report(y_test, y_pred_tree))

plt.figure(figsize=(15, 8))
plot_tree(tree_model, feature_names=features, class_names=["Not Top 3", "Top 3"], filled=True)
plt.title("Decision Tree - Predicting a Top 3 Place")
plt.show()
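
The thresholds learned by a tree can also be read as plain text with scikit-learn's `export_text`, which is sometimes easier to quote than the figure. A minimal sketch on synthetic data, where the "elite" rule is planted by hand (`demo_X`, `demo_y` and the 2.0 cut-off are assumptions for illustration, not results from the dataset):

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(1)

# Synthetic features in plausible per-match ranges
demo_X = pd.DataFrame({
    "gf/mp": rng.uniform(0.8, 2.6, 150),
    "ga/mp": rng.uniform(0.6, 1.8, 150),
})
# Planted rule: a team is "elite" when it scores more than 2 goals per match
demo_y = demo_X["gf/mp"] > 2.0

demo_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(demo_X, demo_y)

# One line per split or leaf, with the feature name and the learned threshold
rules = export_text(demo_tree, feature_names=list(demo_X.columns))
print(rules)
```

On real data the printed thresholds depend on the training split, so they are better read as indicative cut-offs than as fixed rules.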

## 14. Random Forest: Variable Importance

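A Random Forest model is trained to identify the most predictive variables.
Averaging many shallow trees reduces the variance of a single decision tree; the impurity-based importance scores it reports are computed on the training data, so they are best read as a ranking rather than as effect sizes.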
            

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

rf_model = RandomForestClassifier(n_estimators=100, max_depth=4, random_state=42)
rf_model.fit(X_train, y_train)

y_pred_rf = rf_model.predict(X_test)
print("🔍 Random Forest evaluation:")
print(classification_report(y_test, y_pred_rf))

importances = rf_model.feature_importances_
features_importance = pd.DataFrame({
    "Variable": X.columns,
    "Importance": importances
}).sort_values("Importance", ascending=False)

plt.figure(figsize=(10, 5))
plt.barh(features_importance["Variable"], features_importance["Importance"])
plt.title("Variable importance (Random Forest)")
plt.xlabel("Importance")
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()

print("\n📋 Most important variables according to Random Forest:")
print(features_importance)
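
The `feature_importances_` attribute used above is impurity-based and derived from the training data. A possible cross-check is permutation importance on held-out rows, which measures how much the score drops when one column is shuffled. The sketch below runs on synthetic data with one informative and one pure-noise feature (all names and values are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
n = 300

# Synthetic data: only "signal" determines the target
demo_X = pd.DataFrame({
    "signal": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
demo_y = demo_X["signal"] > 0.5

X_tr, X_te, y_tr, y_te = train_test_split(
    demo_X, demo_y, test_size=0.3, random_state=0, stratify=demo_y
)
forest = RandomForestClassifier(n_estimators=100, max_depth=4, random_state=0)
forest.fit(X_tr, y_tr)

# Shuffle each column of the held-out set 10 times and record the mean accuracy drop
result = permutation_importance(forest, X_te, y_te, n_repeats=10, random_state=0)
perm_importance = pd.Series(result.importances_mean, index=demo_X.columns)
print(perm_importance.sort_values(ascending=False))
```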


In [None]:
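# Note: X includes the rows the forest was trained on, so most of these probabilities are in-sample
df["proba_Top3"] = rf_model.predict_proba(X)[:, 1]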

df_top_preds = df[["season", "team", "proba_Top3", "Top3", "LgRank_num", "pts/mp"]]
df_top_preds_sorted = df_top_preds.sort_values("proba_Top3", ascending=False)

print("🔝 Ranking of teams according to the probability of being Top 3 (RF model):")
print(df_top_preds_sorted.head(20))

In [None]:
df_sans_psg = df_top_preds[df_top_preds["team"] != "Paris S-G"]

classement_sans_psg = df_sans_psg.sort_values("proba_Top3", ascending=False)

print("🔝 Ranking of teams according to the probability of being Top 3 (excluding PSG):")
print(classement_sans_psg.head(10))

## 16. Surprise Index: Over- and Underperformers

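We compute the gap between the observed outcome (1 if the club finished in the Top 3, 0 otherwise) and the probability predicted by the Random Forest.
A large positive gap marks a club that reached the Top 3 despite a low predicted probability (e.g., Monaco 16-17, Lille 20-21); a large negative gap marks a club that missed it despite a high one.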
            

In [None]:
df["Top3_gap"] = df["Top3"].astype(int) - df["proba_Top3"]

overperformers = df[df["Top3"]].sort_values("Top3_gap", ascending=False)

underperformers = df[~df["Top3"]].sort_values("Top3_gap", ascending=True)

cols_display = ["season", "team", "Top3", "LgRank_num", "proba_Top3", "Top3_gap", "pts/mp"]

print("Teams that surprised the most (overperformers):")
print(overperformers[cols_display].head(5))

print("Teams that disappointed the most according to the model (underperformers):")
print(underperformers[cols_display].head(5))
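
Because `proba_Top3` is predicted on rows that the forest partly saw during training, the gap can be understated for those rows. One possible alternative is to use out-of-fold probabilities, so that every row is scored by a model that never saw it. A minimal sketch on synthetic data (`demo_X`, `demo_y` and the fold count are assumptions for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_predict

rng = np.random.default_rng(3)
n = 200

# Synthetic per-match features and a noisy "Top 3" style target
demo_X = pd.DataFrame({
    "gf/mp": rng.normal(1.3, 0.3, n),
    "ga/mp": rng.normal(1.3, 0.3, n),
})
demo_y = (demo_X["gf/mp"] - demo_X["ga/mp"] + rng.normal(0, 0.2, n)) > 0.4

# Stratified folds keep the share of positive rows similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
model = RandomForestClassifier(n_estimators=100, max_depth=4, random_state=0)

# Each row's probability comes from the fold in which it was held out
oof_proba = cross_val_predict(model, demo_X, demo_y, cv=cv, method="predict_proba")[:, 1]

# Same definition as the gap above: observed outcome minus predicted probability
gap = demo_y.astype(int) - oof_proba
print(gap.describe())
```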


## 17. Offensive and Defensive Efficiency Scores

Custom metrics are created to synthesize offensive and defensive output:

- Offensive = (GF - PK) / Possession
- Defensive = CS% / GA_per_MP

These are then normalized and combined for a total effectiveness score.
            

In [None]:
df_eff = df.copy()

df_eff["effic_off"] = (df_eff["gf"] - df_eff["pk"]) / df_eff["poss"]

df_eff["effic_def"] = df_eff["cs"] / df_eff["ga"]


In [None]:
for col in ["effic_off", "effic_def"]:
    min_val = df_eff[col].min()
    max_val = df_eff[col].max()
    df_eff[col + "_norm"] = (df_eff[col] - min_val) / (max_val - min_val)


In [None]:
df_eff["score_total"] = df_eff["effic_off_norm"] + df_eff["effic_def_norm"]

classement_eff = df_eff.sort_values("score_total", ascending=False)

cols_to_show = ["season", "team", "LgRank_num", "pts/mp", 
                "effic_off", "effic_def", 
                "effic_off_norm", "effic_def_norm", "score_total"]

print("Cross-ranking attack + defense:")
print(classement_eff[cols_to_show].head(10))
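
The code divides clean sheets by goals conceded (`cs / ga`), while the text writes the defensive score as CS% / GA_per_MP. Assuming `cs%` is defined as 100 × `cs` / `mp`, the two differ only by a constant factor of 100, which min-max normalization removes. A small check on made-up numbers (the `toy` values are hypothetical):

```python
import pandas as pd

# Hypothetical season rows
toy = pd.DataFrame({
    "mp": [38, 38, 34],   # matches played
    "cs": [19, 10, 8],    # clean sheets
    "ga": [28, 45, 50],   # goals against
})
toy["cs%"] = 100 * toy["cs"] / toy["mp"]   # assumed definition of the percentage
toy["ga/mp"] = toy["ga"] / toy["mp"]

def min_max(s):
    """Rescale a Series linearly to the [0, 1] range."""
    return (s - s.min()) / (s.max() - s.min())

from_counts = min_max(toy["cs"] / toy["ga"])      # what the code computes
from_rates = min_max(toy["cs%"] / toy["ga/mp"])   # what the text describes

print(pd.DataFrame({"from_counts": from_counts, "from_rates": from_rates}))
```

Both columns match, so the normalized defensive score is the same under either formulation.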



In [None]:
df_eff = df.groupby("team")[["gf/mp", "ga/mp", "cs%"]].mean().reset_index()

df_eff["cs%"] = pd.to_numeric(df_eff["cs%"], errors="coerce")

df_eff["Off_Eff"] = df_eff["gf/mp"]

cs_min, cs_max = df_eff["cs%"].min(), df_eff["cs%"].max()
ga_min, ga_max = df_eff["ga/mp"].min(), df_eff["ga/mp"].max()

df_eff["CS_norm"] = (df_eff["cs%"] - cs_min) / (cs_max - cs_min)
df_eff["GA_norm"] = (df_eff["ga/mp"] - ga_min) / (ga_max - ga_min)

df_eff["GA_eff"] = 1 - df_eff["GA_norm"]

df_eff["Def_Eff"] = (df_eff["CS_norm"] + df_eff["GA_eff"]) / 2

print("Indicators by team:")
print(df_eff[["team", "Off_Eff", "Def_Eff"]].sort_values("Off_Eff", ascending=False))


In [None]:
plt.figure(figsize=(10, 8))
plt.scatter(df_eff["Off_Eff"], df_eff["Def_Eff"], s=80, alpha=0.7)

for i, row in df_eff.iterrows():
    plt.text(row["Off_Eff"] + 0.01, row["Def_Eff"] + 0.01, row["team"], fontsize=9)

plt.xlabel("Offensive Efficiency (GF per MP)")
plt.ylabel("Defensive effectiveness (composite)")
plt.title("Cross-team ranking: Offensive vs. Defensive efficiency")
plt.grid(True)
plt.tight_layout()
plt.show()


In [None]:
penalty_stats = df.groupby("team")[["pk", "pkatt", "gf"]].mean().reset_index()

penalty_stats["PK_success_rate"] = penalty_stats["pk"] / penalty_stats["pkatt"]
penalty_stats["PK_dependency"] = penalty_stats["pk"] / penalty_stats["gf"]

penalty_stats = penalty_stats.dropna()
penalty_stats = penalty_stats[penalty_stats["pkatt"] > 0]


## 18. Penalty Analysis

We examine penalty frequency, conversion rate, and dependency (`PK / GF`) to identify teams that rely heavily on penalty goals.
            

In [None]:
top_dependency = penalty_stats.sort_values("PK_dependency", ascending=False)

print("Teams most dependent on penalties (PK / GF):")
print(top_dependency[["team", "pk", "pkatt", "PK_success_rate", "PK_dependency"]].head(10))



In [None]:
top10 = top_dependency.head(10)

plt.figure(figsize=(10, 6))
plt.barh(top10["team"], top10["PK_dependency"] * 100)
plt.xlabel("Percentage of goals scored from penalties (%)")
plt.title("Top 10 teams most dependent on penalties")
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()