# KMeans for player clustering

In this notebook, we will train an unsupervised machine learning algorithm called KMeans.

KMeans discovers structure in unlabeled data by partitioning the points into k groups, assigning each point to the group whose center (the mean of its members) is closest.
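To make the idea concrete, here is a minimal sketch of the usual KMeans loop (Lloyd's algorithm) on made-up 2D points. It is an illustration under stated assumptions, not a library implementation: the function name `kmeans_sketch`, the toy points, and the fixed iteration count are all hypothetical.

```python
import numpy as np

def kmeans_sketch(points, k, n_iter=10, seed=0):
    """Illustrative Lloyd's algorithm: assign points, then recompute centroids."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every point to every centroid, shape (n_points, k)
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its points (keep it if it has none)
        centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
    return labels, centroids

# Two obvious groups of made-up points
toy = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
                [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]])
labels, centroids = kmeans_sketch(toy, k=2)
print(labels)
print(centroids)
```

The cluster numbers themselves are arbitrary; what matters is which points end up sharing a label.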

We will use a dataset that I scraped from FBref myself.

This dataset contains midfielders from the Argentine league. We want to cluster players who share similar characteristics, for example more defensive midfielders, more offensive midfielders, playmakers, etc.

First we import the necessary packages.

In [92]:
import pandas as pd
import numpy as np

Let's start by reading the dataset using pandas and examining what we have at our disposal:


In [93]:
players_df = pd.read_csv('FBRef_2024_CopaDeLaLiga_MidfieldersAnalysis.csv')
players_df.head()

Unnamed: 0.1,Unnamed: 0,Rk,Player,Nation,Pos,Squad,Age,Born,Matches Played,Starts,...,Interceptions,Passes Blocked,Shots Blocked,Tackles,Tackles Att 3rd,Tackles Def 3rd,Tackles Mid 3rd,Tackles Won,Tackles+Interceptions,Total Blocks
0,0,1,Matías Abaldo,uy URU,"MF,FW",Gimnasia ELP,20-025,2004,7,5,...,2,4,0,4,1,3,0,2,6,4
1,1,2,Luciano Abecasis,ar ARG,"MF,DF",Independiente Rivadavia,33-329,1990,12,10,...,2,10,1,19,0,4,15,11,21,11
2,2,3,Ramón Ábila,ar ARG,FW,Barracas Central,34-197,1989,9,0,...,0,0,0,0,0,0,0,0,0,0
3,3,4,Jonás Acevedo,ar ARG,"MF,FW",Instituto,27-082,1997,9,6,...,3,6,1,13,4,5,4,7,16,7
4,4,5,Guillermo Acosta,ar ARG,MF,Tucumán,35-180,1988,10,10,...,12,3,1,17,0,5,12,9,29,4


Next, we check the column names, non-null counts, and types.

In [94]:
players_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 797 entries, 0 to 796
Data columns (total 88 columns):
 #   Column                        Non-Null Count  Dtype 
---  ------                        --------------  ----- 
 0   Unnamed: 0                    797 non-null    int64 
 1   Rk                            797 non-null    object
 2   Player                        797 non-null    object
 3   Nation                        788 non-null    object
 4   Pos                           796 non-null    object
 5   Squad                         797 non-null    object
 6   Age                           788 non-null    object
 7   Born                          788 non-null    object
 8   Matches Played                797 non-null    object
 9   Starts                        797 non-null    object
 10  Min                           797 non-null    object
 11  90s                           797 non-null    object
 12  Gls                           797 non-null    object
 13  Ast                 

Which metrics will help our algorithm separate the different types of midfielders?
We exclude metrics that are not informative and would hinder the algorithm.
We exclude the per-90 metrics, because we will convert all the totals to a per-90 scale ourselves later.
We also exclude accuracy metrics; for example, how accurate a player's passing is will not help us distinguish the different types of midfielders.

In [95]:
metrics = ['Player', "Min", 'Gls', 'Ast', 'xG', 'xAG',
       'Progressive Carries', 'Progressive Passes', 'Crosses into Penalty Area', 'Key Passes',
       'Long Passes Attempted', 'Medium Passes Attempted',
       'Passes into Final Third', 'Passes into Penalty Area',
       'Progressive Passing Distance', 'Short Passes Attenpted', 'Total Passing Distance', 'xA',
       'Passes Completed', 'Switches', 'Through Balls',
       'Clearances',
       'Interceptions', 'Passes Blocked', 'Shots Blocked', 'Tackles', 'Total Blocks']

players_df = players_df[metrics]
players_df.columns

Index(['Player', 'Min', 'Gls', 'Ast', 'xG', 'xAG', 'Progressive Carries',
       'Progressive Passes', 'Crosses into Penalty Area', 'Key Passes',
       'Long Passes Attempted', 'Medium Passes Attempted',
       'Passes into Final Third', 'Passes into Penalty Area',
       'Progressive Passing Distance', 'Short Passes Attenpted',
       'Total Passing Distance', 'xA', 'Passes Completed', 'Switches',
       'Through Balls', 'Clearances', 'Interceptions', 'Passes Blocked',
       'Shots Blocked', 'Tackles', 'Total Blocks'],
      dtype='object')
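Selecting columns with a list raises a `KeyError` if any name does not match exactly, and scraped headers can contain surprises (the 'Short Passes Attenpted' spelling above is the one the file uses). A small check like the following can list mismatches before selecting. This is only a sketch: the toy dataframe and the helper name `missing_columns` are hypothetical.

```python
import pandas as pd

def missing_columns(df, wanted):
    """Return the names in `wanted` that are not columns of `df`, in order."""
    return [name for name in wanted if name not in df.columns]

# Hypothetical stand-in for the scraped table
toy_df = pd.DataFrame({
    "Player": ["A", "B"],
    "Min": [450, 900],
    "Short Passes Attenpted": [30, 80],  # header spelled as in the file
})

# The correctly spelled name does not match the header, so it is reported
wanted = ["Player", "Min", "Short Passes Attempted"]
print(missing_columns(toy_df, wanted))
```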

We convert all columns to numeric, except the "Player" column. With errors='coerce', values that cannot be parsed become NaN.

In [96]:
players_df.iloc[:, 1:] = players_df.iloc[:, 1:].apply(pd.to_numeric, errors='coerce')
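Note that the `info()` output further down still reports `object` for every column, which suggests the positional `.iloc` assignment writes the converted values into the existing object columns instead of replacing them. If real numeric dtypes are wanted, one option is to reassign each column by label. This is a sketch on a made-up frame, not a change to the cells in this notebook; `toy_df` and `numeric_cols` are hypothetical names.

```python
import pandas as pd

# Made-up frame with numbers stored as strings, as in a messy CSV
toy_df = pd.DataFrame({
    "Player": ["A", "B", "C"],
    "Min": ["427", "860", "n/a"],
    "Gls": ["0", "1", "2"],
})

# Every column except the player name
numeric_cols = toy_df.columns.drop("Player")

# Reassigning by label replaces the column, so the new dtype is kept
for col in numeric_cols:
    toy_df[col] = pd.to_numeric(toy_df[col], errors="coerce")

print(toy_df.dtypes)
```

The unparseable "n/a" becomes NaN, so that column ends up as a float column.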

Players who did not play enough minutes will have statistics that may not represent them accurately. We keep only players with more than 400 minutes played that season.

In [97]:
# Filter the DataFrame for rows where 'Min' > 400
players_df = players_df[players_df['Min'] > 400]

players_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 402 entries, 0 to 796
Data columns (total 27 columns):
 #   Column                        Non-Null Count  Dtype 
---  ------                        --------------  ----- 
 0   Player                        402 non-null    object
 1   Min                           402 non-null    object
 2   Gls                           402 non-null    object
 3   Ast                           402 non-null    object
 4   xG                            380 non-null    object
 5   xAG                           380 non-null    object
 6   Progressive Carries           380 non-null    object
 7   Progressive Passes            380 non-null    object
 8   Crosses into Penalty Area     380 non-null    object
 9   Key Passes                    380 non-null    object
 10  Long Passes Attempted         380 non-null    object
 11  Medium Passes Attempted       380 non-null    object
 12  Passes into Final Third       380 non-null    object
 13  Passes into Penalty Area 
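The output above lists 402 rows but only 380 non-null values for several metrics, so some players have missing data. Clustering implementations generally cannot work with NaN, so these rows need a decision before training. One possible approach, sketched here on a made-up frame, is to count the gaps and drop the incomplete rows; imputing values is another option. The names `toy_df` and `complete_df` are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up frame where one player lacks an xG value
toy_df = pd.DataFrame({
    "Player": ["A", "B", "C"],
    "Min": [427.0, 860.0, 555.0],
    "xG": [0.5, np.nan, 0.9],
})

# Number of missing values per column
print(toy_df.isna().sum())

# Keep only the rows that have a value in every column
complete_df = toy_df.dropna()
print(len(complete_df))
```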

We want to convert all the metrics to a per-90 scale. First, we copy the dataframe into another variable.

In [106]:
players_90 = players_df.copy(deep=True)

We create a new column with the number of 90-minute periods each player has played, which is the player's total minutes divided by 90.

In [107]:
players_90["Min_per_90"] = players_90["Min"] / 90

players_90.head()

Unnamed: 0,Player,Min,Gls,Ast,xG,xAG,Progressive Carries,Progressive Passes,Crosses into Penalty Area,Key Passes,...,Passes Completed,Switches,Through Balls,Clearances,Interceptions,Passes Blocked,Shots Blocked,Tackles,Total Blocks,Min_per_90
0,Matías Abaldo,427.0,0.0,1.0,0.5,0.7,16.0,14.0,2.0,5.0,...,75.0,1.0,1.0,2.0,2.0,4.0,0.0,4.0,4.0,4.744444
1,Luciano Abecasis,860.0,1.0,1.0,0.1,1.3,12.0,43.0,8.0,16.0,...,289.0,6.0,0.0,10.0,2.0,10.0,1.0,19.0,11.0,9.555556
3,Jonás Acevedo,555.0,0.0,2.0,0.9,2.0,17.0,27.0,2.0,14.0,...,163.0,3.0,1.0,1.0,3.0,6.0,1.0,13.0,7.0,6.166667
4,Guillermo Acosta,782.0,1.0,0.0,0.4,0.6,12.0,55.0,1.0,9.0,...,281.0,8.0,1.0,15.0,12.0,3.0,1.0,17.0,4.0,8.688889
5,Lucas Acosta,1260.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,...,224.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,14.0


In [108]:
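# Divide every column from the third one ('Gls') onward by the number of 90s played.
# 'Min_per_90' is part of the slice, so it is divided by itself and becomes 1.0.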
players_90.iloc[:, 2:] = players_90.iloc[:, 2:].div(players_90["Min_per_90"], axis=0)

players_90.head()

Unnamed: 0,Player,Min,Gls,Ast,xG,xAG,Progressive Carries,Progressive Passes,Crosses into Penalty Area,Key Passes,...,Passes Completed,Switches,Through Balls,Clearances,Interceptions,Passes Blocked,Shots Blocked,Tackles,Total Blocks,Min_per_90
0,Matías Abaldo,427.0,0.0,0.210773,0.105386,0.147541,3.372365,2.95082,0.421546,1.053864,...,15.807963,0.210773,0.210773,0.421546,0.421546,0.843091,0.0,0.843091,0.843091,1.0
1,Luciano Abecasis,860.0,0.104651,0.104651,0.010465,0.136047,1.255814,4.5,0.837209,1.674419,...,30.244186,0.627907,0.0,1.046512,0.209302,1.046512,0.104651,1.988372,1.151163,1.0
3,Jonás Acevedo,555.0,0.0,0.324324,0.145946,0.324324,2.756757,4.378378,0.324324,2.27027,...,26.432432,0.486486,0.162162,0.162162,0.486486,0.972973,0.162162,2.108108,1.135135,1.0
4,Guillermo Acosta,782.0,0.11509,0.0,0.046036,0.069054,1.381074,6.329923,0.11509,1.035806,...,32.340153,0.920716,0.11509,1.726343,1.381074,0.345269,0.11509,1.956522,0.460358,1.0
5,Lucas Acosta,1260.0,0.0,0.071429,0.0,0.0,0.0,0.0,0.0,0.071429,...,16.0,0.0,0.0,0.285714,0.0,0.0,0.0,0.0,0.0,1.0
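As a check on the arithmetic: Matías Abaldo played 427 minutes, which is 427 / 90 ≈ 4.744 nineties, so his 1 assist becomes 1 / 4.744 ≈ 0.2108 per 90, matching the table. A label-based variant that touches only the stat columns, and so leaves the minutes untouched, could look like the sketch below. The toy frame and the names `nineties`, `stat_cols`, and `per_90` are hypothetical.

```python
import pandas as pd

# Two rows copied from the table above, with a reduced set of columns
toy_df = pd.DataFrame({
    "Player": ["Matías Abaldo", "Luciano Abecasis"],
    "Min": [427.0, 860.0],
    "Ast": [1.0, 1.0],
    "Progressive Carries": [16.0, 12.0],
})

# Number of 90-minute periods each player has played
nineties = toy_df["Min"] / 90

# Only the stat columns are rescaled; 'Player' and 'Min' stay as they are
stat_cols = toy_df.columns.drop(["Player", "Min"])

per_90 = toy_df.copy()
per_90[stat_cols] = toy_df[stat_cols].div(nineties, axis=0)

print(per_90.round(6))
```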


Now we drop the two minutes columns, which are no longer needed.

In [109]:
players_90.drop(columns=["Min_per_90", "Min"], inplace=True)