# Classifying the Top Premier League Players (2024–2025)
### 👥 Authors: [Your name and your partner's name]
### 🎓 Course: Machine Learning - Unit 2
### 📁 Dataset: epl_player_stats_24_25.csv

---

## 1. Dataset Selection

This project classifies Premier League players as **Top Players** (standout attacking output) or **Non-Top Players**, using combined **Goals + Assists** as the criterion.

- The dataset has more than 8 features (30+ remain after cleaning).
- The target variable is **`TopPlayer`**: `1` if the player's goals + assists total is at or above the 75th percentile (roughly the top 25%), and `0` otherwise.
- It is a realistic, relevant problem that is well suited to classification algorithms.
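
The labeling rule described above can be sketched on a small made-up table. The helper name `label_top_players` and the toy values are assumptions for illustration only; they are not part of the real dataset.

```python
import pandas as pd


def label_top_players(df: pd.DataFrame, q: float = 0.75) -> pd.Series:
    """Return 1 for rows whose Goals + Assists is at or above the q-th quantile."""
    total = df["Goals"] + df["Assists"]
    threshold = total.quantile(q)  # pandas interpolates linearly by default
    return (total >= threshold).astype(int)


# Toy data (made up for illustration, not taken from the real dataset)
toy = pd.DataFrame({
    "Goals":   [0, 0, 0, 1, 1, 2, 3, 5],
    "Assists": [0, 0, 0, 0, 1, 1, 2, 3],
})
toy["TopPlayer"] = label_top_players(toy)
print(toy)
```

Because the comparison is `>=`, ties at the threshold all receive the positive label, so the positive class can end up larger than 25% when many players share the same total.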


In [None]:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, ConfusionMatrixDisplay
)

# Load the dataset
df = pd.read_csv("epl_player_stats_24_25.csv")

# Build Total_Goals_Assists and the binary target from its 75th percentile
df["Total_Goals_Assists"] = df["Goals"] + df["Assists"]
threshold = df["Total_Goals_Assists"].quantile(0.75)
df["TopPlayer"] = (df["Total_Goals_Assists"] >= threshold).astype(int)


In [None]:

# Drop columns that are irrelevant or goalkeeper-specific
cols_to_drop = [
    "Red Cards", "Saves", "Saves %", "Penalties Saved",
    "Clearances Off Line", "Punches", "High Claims", "Goals Prevented",
]

# drop() returns a new DataFrame, so the raw df stays intact for re-runs
df_clean = df.drop(columns=cols_to_drop)


In [None]:

# Columns used for EDA: eight numeric features plus the target (kept last)
features = [
    "Minutes", "Goals", "Assists", "Shots",
    "Shots On Target", "Passes", "Touches", "Big Chances Missed", "TopPlayer"
]


In [None]:

df_clean[features].describe()


In [None]:

df_clean[features].isnull().sum()


In [None]:

df_clean[features].dtypes


In [None]:

fig, axs = plt.subplots(4, 2, figsize=(16, 18))
fig.suptitle("Exploratory Data Analysis (8 Visualizations)", fontsize=18)

# 1
sns.histplot(df_clean["Goals"], bins=20, kde=True, ax=axs[0, 0], color="blue")
axs[0, 0].set_title("Distribution of Goals")
# 2
sns.histplot(df_clean["Assists"], bins=20, kde=True, ax=axs[0, 1], color="green")
axs[0, 1].set_title("Distribution of Assists")
# 3
sns.boxplot(data=df_clean, x="TopPlayer", y="Minutes", ax=axs[1, 0], palette="Set2")
axs[1, 0].set_title("Minutes Played by Class")
# 4
sns.scatterplot(data=df_clean, x="Shots", y="Goals", hue="TopPlayer", ax=axs[1, 1], palette="coolwarm")
axs[1, 1].set_title("Shots vs Goals (Top vs Non-Top)")
# 5
sns.countplot(data=df_clean, x="TopPlayer", ax=axs[2, 0], palette="Set1")
axs[2, 0].set_title("Class Distribution: TopPlayer")
# 6
corr = df_clean[features].corr()
sns.heatmap(corr, annot=True, cmap="coolwarm", fmt=".2f", ax=axs[2, 1])
axs[2, 1].set_title("Correlation Matrix")
# 7
sns.boxplot(data=df_clean, x="TopPlayer", y="Shots On Target", ax=axs[3, 0], palette="Set3")
axs[3, 0].set_title("Shots On Target by Class")
# 8
sns.histplot(df_clean["Touches"], bins=20, kde=True, ax=axs[3, 1], color="orange")
axs[3, 1].set_title("Distribution of Touches")

plt.tight_layout(rect=[0, 0.03, 1, 0.97])
plt.show()


In [None]:

fig, axs = plt.subplots(4, 2, figsize=(16, 16))
fig.suptitle("Outlier Detection by Variable", fontsize=18)

variables = features[:-1]  # every column except the target
for i, var in enumerate(variables):
    row, col = divmod(i, 2)  # map the flat index onto the 4x2 grid
    sns.boxplot(data=df_clean, y=var, ax=axs[row, col], color="lightblue")
    axs[row, col].set_title(f"Outliers in: {var}")

plt.tight_layout(rect=[0, 0.03, 1, 0.95])
plt.show()
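
The points drawn beyond the boxplot whiskers follow the usual 1.5 × IQR rule, so the same rule can give a numeric count per variable. The following is a minimal sketch on made-up values; the helper name `count_iqr_outliers` is hypothetical and the numbers are not from the real dataset.

```python
import pandas as pd


def count_iqr_outliers(values: pd.Series, whis: float = 1.5) -> int:
    """Count values lying more than whis * IQR outside the quartiles."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - whis * iqr, q3 + whis * iqr
    return int(((values < lower) | (values > upper)).sum())


# Toy series (made up): nine ordinary values and one extreme value
toy_values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
n_outliers = count_iqr_outliers(toy_values)
print(n_outliers)
```

On the notebook's data, the same helper could be applied column by column, for example `df_clean[variables].apply(count_iqr_outliers)`, assuming those columns are numeric.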


In [None]:

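# Note: Goals and Assists together determine TopPlayer by construction,
# so scores obtained with these features will look optimistic.
X = df_clean[[
    "Minutes", "Goals", "Assists", "Shots",
    "Shots On Target", "Passes", "Touches", "Big Chances Missed"
]]
y = df_clean["TopPlayer"]


In [None]:

# Split first so that the scaler only ever sees training data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fit the scaler on the training split, then apply it to both splits
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)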
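
One way to keep scaling tied to the training split is scikit-learn's `make_pipeline`, which refits the scaler whenever the pipeline is fit. The sketch below uses synthetic data from `make_classification` as a stand-in for the player features; the `*_demo`, `X_tr`, and `X_te` names are assumptions for illustration, not variables from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the player features (8 columns, about 25% positives)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=4,
    weights=[0.75, 0.25], random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The scaler inside the pipeline is fit on X_tr only when pipe.fit is called
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)

demo_accuracy = pipe.score(X_te, y_te)
print(f"Demo accuracy: {demo_accuracy:.3f}")
```

The same pipeline object can be passed to `cross_val_score`, in which case the scaler is refit inside every fold.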


In [None]:

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000, random_state=42),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42)
}

# Train each model and collect its test-set metrics
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    # Precision, recall and F1 are reported for the positive class (TopPlayer = 1)
    results[name] = {
        "Accuracy": accuracy_score(y_test, y_pred),
        "Precision": precision_score(y_test, y_pred),
        "Recall": recall_score(y_test, y_pred),
        "F1-Score": f1_score(y_test, y_pred),
        "Confusion Matrix": confusion_matrix(y_test, y_pred)
    }

results
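
A nested dictionary of metrics is easier to compare as a table. The following sketch shows one possible layout using synthetic data, so the printed numbers say nothing about the real players; `demo_models`, `rows`, and `summary` are hypothetical names.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the EPL dataset)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=4,
    weights=[0.75, 0.25], random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

demo_models = {
    "Logistic Regression": LogisticRegression(max_iter=1000, random_state=42),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
}

rows = {}
for name, model in demo_models.items():
    y_hat = model.fit(X_tr, y_tr).predict(X_te)
    rows[name] = {
        "Accuracy": accuracy_score(y_te, y_hat),
        "Precision": precision_score(y_te, y_hat, zero_division=0),
        "Recall": recall_score(y_te, y_hat, zero_division=0),
        "F1-Score": f1_score(y_te, y_hat, zero_division=0),
    }

# One row per model, best F1 first
summary = pd.DataFrame.from_dict(rows, orient="index").sort_values(
    "F1-Score", ascending=False
)
print(summary.round(3))
```

With the notebook's own `results` dictionary, the scalar metrics could be tabulated the same way after leaving out the "Confusion Matrix" entries, which are arrays rather than single numbers.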
