# Exploring player groupings using K-means for the 2017/18 season

## Purpose

The purpose of this notebook is to:

- Create a complete dataset of all stats for all players in the Premier League in the 2017/18 season
	- Potentially create a process for combining all datasets from the same source into a complete dataset for each year
- Investigate whether any logical groupings/clusters are present based solely on the data (excluding assigned position and team)
- Investigate the similarities between the players in each cluster

## Data sources

- [Standard Stats](https://fbref.com/en/comps/9/2017-2018/stats/2017-2018-Premier-League-Stats)
- [Goalkeeping Stats](https://fbref.com/en/comps/9/2017-2018/keepers/2017-2018-Premier-League-Stats)
- [Shooting Stats](https://fbref.com/en/comps/9/2017-2018/shooting/2017-2018-Premier-League-Stats)
- [Passing Stats](https://fbref.com/en/comps/9/2017-2018/passing/2017-2018-Premier-League-Stats)
- [Defensive Stats](https://fbref.com/en/comps/9/2017-2018/defense/2017-2018-Premier-League-Stats)
- [Possession Stats](https://fbref.com/en/comps/9/2017-2018/possession/2017-2018-Premier-League-Stats)

## Method

### Import relevant packages

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

### Import relevant datasets

In [2]:
goalkeeping = pd.read_csv("../historic_player_stats/2017-2018/goalkeeping.csv")
defensive_actions = pd.read_csv("../historic_player_stats/2017-2018/defensive_actions.csv")
passing = pd.read_csv("../historic_player_stats/2017-2018/passing.csv")
possession = pd.read_csv("../historic_player_stats/2017-2018/possession.csv")
shooting = pd.read_csv("../historic_player_stats/2017-2018/shooting.csv")
standard = pd.read_csv("../historic_player_stats/2017-2018/standard.csv")

In [3]:
standard.head()

Unnamed: 0,player,position,team,matches_played,starts,minutes_played,90s,goals,assists,goals+assists,...,penalties_attempted,yellow_cards,red_cards,expected_goals,non_penalty_expected_goals,expected_assisted_goals,non_penalty_expected_goals+expected_assisted_goals,progressive_carries,progressive_passes,progressive_passes_recieved
0,Patrick van Aanholt,DF,Crystal Palace,28,25,2184,24.3,5,1,6,...,0,7,0,3.1,3.1,2.1,5.2,46.0,92.0,86.0
1,Rolando Aarons,"MF,FW",Newcastle Utd,4,1,139,1.5,0,0,0,...,0,0,0,0.1,0.1,0.0,0.1,7.0,3.0,4.0
2,Tammy Abraham,FW,Swansea City,31,15,1726,19.2,5,1,6,...,0,0,0,6.8,6.8,1.6,8.4,19.0,20.0,104.0
3,Charlie Adam,MF,Stoke City,11,5,411,4.6,0,0,0,...,1,2,1,1.6,0.9,1.2,2.1,6.0,30.0,9.0
4,Adrián,GK,West Ham,19,19,1710,19.0,0,0,0,...,0,2,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Join select dataframes into one (all bar goalkeeping)

In [4]:
complete = standard.merge(passing, on=["player", "position", "team"], how='inner')
complete = complete.merge(shooting, on=["player", "position", "team"], how='inner')
complete = complete.merge(possession, on=["player", "position", "team"], how='inner')
complete = complete.merge(defensive_actions, on=["player", "position", "team"], how='inner')

print("New dataframe dimensions: " + str(complete.shape))

New dataframe dimensions: (529, 79)
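
The repeated merges above could be folded into a small helper, which might also serve as a starting point for the per-season process mentioned under Purpose. The following is a minimal sketch rather than the notebook's actual code: `merge_stat_tables` is a hypothetical helper, the toy frames stand in for the CSV files, and `validate="one_to_one"` is an added safeguard that assumes each (player, position, team) combination appears only once per table.

```python
from functools import reduce

import pandas as pd

KEYS = ["player", "position", "team"]


def merge_stat_tables(tables, keys=KEYS):
    """Inner-join a list of stat tables on the shared identifying columns."""
    return reduce(
        lambda left, right: left.merge(
            right, on=keys, how="inner", validate="one_to_one"
        ),
        tables,
    )


# Toy stand-ins for the real CSV files (all values are made up)
ids = {"player": ["A", "B"], "position": ["DF", "FW"], "team": ["X", "Y"]}
standard_demo = pd.DataFrame({**ids, "goals": [1, 7]})
passing_demo = pd.DataFrame({**ids, "key_passes": [12, 30]})
shooting_demo = pd.DataFrame({**ids, "goals": [1, 7], "shots": [9, 40]})

merged_demo = merge_stat_tables([standard_demo, passing_demo, shooting_demo])

# Non-key columns that appear in more than one table get _x / _y suffixes
print(merged_demo.columns.tolist())
```

Columns shared by more than one table (here `goals`) come back with `_x` and `_y` suffixes, which is why duplicated columns need to be checked after the join.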


In [5]:
complete.head()

Unnamed: 0,player,position,team,matches_played,starts,minutes_played,90s,goals_x,assists,goals+assists,...,defensive_third_tackles,middle_third_tackles,attacking_third_tackles,dribblers_tackled,dribbler_tackles_attempted,shots_blocked,passes_blocked,interceptions,clearances,errors_leading_to_shot
0,Patrick van Aanholt,DF,Crystal Palace,28,25,2184,24.3,5,1,6,...,29.0,15.0,3.0,16.0,34.0,5.0,19.0,47.0,64.0,2.0
1,Rolando Aarons,"MF,FW",Newcastle Utd,4,1,139,1.5,0,0,0,...,3.0,1.0,0.0,4.0,6.0,0.0,3.0,1.0,0.0,0.0
2,Tammy Abraham,FW,Swansea City,31,15,1726,19.2,5,1,6,...,1.0,7.0,2.0,1.0,13.0,1.0,7.0,1.0,11.0,0.0
3,Charlie Adam,MF,Stoke City,11,5,411,4.6,0,0,0,...,2.0,5.0,2.0,5.0,17.0,0.0,2.0,9.0,11.0,2.0
4,Adrián,GK,West Ham,19,19,1710,19.0,0,0,0,...,2.0,0.0,0.0,1.0,4.0,0.0,1.0,0.0,15.0,1.0


### Remove columns with operations

This step reduces the number of features in the table. Because these columns are calculated from data already available in the table, they are not required initially and can be recreated later if needed.

In [6]:
operation_characters = set("/-+*")
columns_to_remove = [col for col in complete.columns if any(char in col for char in operation_characters)]

complete_without_calculated_columns = complete.drop(columns=columns_to_remove)

print(f"Number of Features reduced from {len(complete.columns)} to {len(complete_without_calculated_columns.columns)}")
print("\nFeatures removed:\n")
print("\n".join(columns_to_remove))


Number of Features reduced from 79 to 71

Features removed:

goals+assists
non_penalty_expected_goals+expected_assisted_goals
assists-expected_assisted_goals
goals/shots
goals/shots_on_target
non_penalty_expected_goals/shots
goals-expected_goals
non_penalty_goals-non_penalty_expected_goals


## EDA

In [7]:
complete_without_calculated_columns.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 529 entries, 0 to 528
Data columns (total 71 columns):
 #   Column                              Non-Null Count  Dtype  
---  ------                              --------------  -----  
 0   player                              529 non-null    object 
 1   position                            529 non-null    object 
 2   team                                529 non-null    object 
 3   matches_played                      529 non-null    int64  
 4   starts                              529 non-null    int64  
 5   minutes_played                      529 non-null    int64  
 6   90s                                 529 non-null    float64
 7   goals_x                             529 non-null    int64  
 8   assists                             529 non-null    int64  
 9   non_penalty_goals                   529 non-null    int64  
 10  penalties_scored                    529 non-null    int64  
 11  penalties_attempted                 529 non-n

### Remove duplicated columns

There are two goals columns, so check that they are the same before dropping one

In [8]:
complete_without_calculated_columns[["goals_x", "goals_y"]][(complete_without_calculated_columns["goals_x"] != complete_without_calculated_columns["goals_y"])]

Unnamed: 0,goals_x,goals_y


In [9]:
complete_without_calculated_columns[["expected_assisted_goals_x", "expected_assisted_goals_y"]][(complete_without_calculated_columns["expected_assisted_goals_x"] != complete_without_calculated_columns["expected_assisted_goals_y"])]

Unnamed: 0,expected_assisted_goals_x,expected_assisted_goals_y
363,,


In [10]:
complete_without_calculated_columns[["progressive_passes_x", "progressive_passes_y"]][(complete_without_calculated_columns["progressive_passes_x"] != complete_without_calculated_columns["progressive_passes_y"])]

Unnamed: 0,progressive_passes_x,progressive_passes_y
363,,


The first comparison returns no rows, so `goals_x` and `goals_y` are identical. The other two comparisons return only row 363, where both values are missing; NaN never compares equal to NaN, so this row is flagged even though the columns do not actually disagree. In each pair, one column can therefore be dropped and the other renamed to remove the suffix
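
Row 363 appears in the second and third comparisons even though both columns show empty values there. This is consistent with standard NaN semantics, where a missing value never compares equal to another missing value, so `!=` flags the row. A minimal sketch on made-up data, treating two missing values in the same position as a match:

```python
import numpy as np
import pandas as pd

# Made-up values; the last row is missing in both columns, like row 363
demo = pd.DataFrame({
    "expected_assisted_goals_x": [2.1, 0.0, np.nan],
    "expected_assisted_goals_y": [2.1, 0.0, np.nan],
})
left = demo["expected_assisted_goals_x"]
right = demo["expected_assisted_goals_y"]

# Element-wise != treats NaN as different from NaN, so the last row is flagged
naive_mismatch = left != right

# Ignore rows where both sides are missing
both_missing = left.isna() & right.isna()
true_mismatch = naive_mismatch & ~both_missing

print(naive_mismatch.sum(), true_mismatch.sum())

# Series.equals treats NaNs in the same location as equal
print(left.equals(right))
```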

In [11]:
complete_without_calculated_columns = complete_without_calculated_columns.drop(columns=["goals_y", "expected_assisted_goals_y", "progressive_passes_y"])
complete_without_calculated_columns = complete_without_calculated_columns.rename(columns={"goals_x": "goals", "expected_assisted_goals_x": "expected_assisted_goals", "progressive_passes_x": "progressive_passes"})
complete_without_calculated_columns.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 529 entries, 0 to 528
Data columns (total 68 columns):
 #   Column                              Non-Null Count  Dtype  
---  ------                              --------------  -----  
 0   player                              529 non-null    object 
 1   position                            529 non-null    object 
 2   team                                529 non-null    object 
 3   matches_played                      529 non-null    int64  
 4   starts                              529 non-null    int64  
 5   minutes_played                      529 non-null    int64  
 6   90s                                 529 non-null    float64
 7   goals                               529 non-null    int64  
 8   assists                             529 non-null    int64  
 9   non_penalty_goals                   529 non-null    int64  
 10  penalties_scored                    529 non-null    int64  
 11  penalties_attempted                 529 non-n

### Missing values

In [12]:
complete_without_calculated_columns.isnull().sum()[complete_without_calculated_columns.isnull().sum() > 0]

expected_goals                         1
non_penalty_expected_goals             1
expected_assisted_goals                1
progressive_carries                    1
progressive_passes                     1
progressive_passes_recieved            1
total_passing_distance                 1
total_progressive_passing_distance     1
short_passes_completed                 1
short_passes_attempted                 1
medium_passes_completed                1
medium_passes_attempted                1
long_passes_completed                  1
long_passes_attempted                  1
expected_assists                       1
key_passes                             1
passes_into_final_third                1
passes_into_penalty_area               1
crosses_into_penalty_area              1
shots                                  1
average_shot_distance                 88
shots_from_free_kicks                  1
touches                                1
touches_in_defensive_penalty_area      1
touches_in_defen

`average_shot_distance` accounts for the majority of the null values in the table (88); each of the other columns shown above has a single null value

In [13]:
pd.set_option("display.max_columns", None)
complete_without_calculated_columns[complete_without_calculated_columns.isnull().any(axis=1)]

Unnamed: 0,player,position,team,matches_played,starts,minutes_played,90s,goals,assists,non_penalty_goals,penalties_scored,penalties_attempted,yellow_cards,red_cards,expected_goals,non_penalty_expected_goals,expected_assisted_goals,progressive_carries,progressive_passes,progressive_passes_recieved,total_passing_distance,total_progressive_passing_distance,short_passes_completed,short_passes_attempted,medium_passes_completed,medium_passes_attempted,long_passes_completed,long_passes_attempted,expected_assists,key_passes,passes_into_final_third,passes_into_penalty_area,crosses_into_penalty_area,shots,shots_on_target,average_shot_distance,shots_from_free_kicks,shots_from_penalties,touches,touches_in_defensive_penalty_area,touches_in_defensive_third,touches_in_middle_third,touches_in_attacking_third,touches_in_attacking_penalty_area,live_ball_touches,take_ons_attempted,take_ons_succeeded,times_tackled_during_take_on,carries,total_carrying_distance,progressive_carrying_distance,carries_into_final_third,carries_into_penalty_area,miscontrols,dispossessed,passes_received,tackles,tackles_won,defensive_third_tackles,middle_third_tackles,attacking_third_tackles,dribblers_tackled,dribbler_tackles_attempted,shots_blocked,passes_blocked,interceptions,clearances,errors_leading_to_shot
4,Adrián,GK,West Ham,19,19,1710,19.0,0,0,0,0,0,2,0,0.0,0.0,0.0,0.0,0.0,0.0,11798.0,10438.0,12.0,12.0,51.0,52.0,203.0,503.0,0.1,0.0,13.0,1.0,0.0,0.0,0,,0.0,0,641.0,541.0,633.0,9.0,0.0,0.0,641.0,0.0,0.0,0.0,284.0,2193.0,1505.0,0.0,0.0,0.0,0.0,166.0,2.0,1.0,2.0,0.0,0.0,1.0,4.0,0.0,1.0,0.0,15.0,1.0
5,Ibrahim Afellay,MF,Stoke City,6,1,166,1.8,0,0,0,0,0,1,0,0.0,0.0,0.0,0.0,4.0,1.0,1326.0,256.0,29.0,32.0,35.0,39.0,6.0,8.0,0.0,0.0,6.0,0.0,0.0,0.0,0,,0.0,0,89.0,0.0,18.0,63.0,12.0,0.0,89.0,1.0,1.0,0.0,61.0,214.0,50.0,0.0,0.0,1.0,3.0,64.0,3.0,3.0,0.0,3.0,0.0,1.0,2.0,0.0,1.0,1.0,1.0,0.0
15,Daniel Amartey,DF,Leicester City,8,6,487,5.4,0,0,0,0,0,1,2,0.0,0.0,0.0,5.0,14.0,4.0,3859.0,1555.0,129.0,142.0,83.0,109.0,20.0,45.0,0.1,0.0,14.0,0.0,0.0,0.0,0,,0.0,0,362.0,24.0,129.0,189.0,46.0,0.0,362.0,5.0,4.0,1.0,157.0,858.0,432.0,6.0,0.0,3.0,4.0,192.0,8.0,4.0,7.0,1.0,0.0,3.0,7.0,3.0,4.0,7.0,21.0,1.0
16,Ethan Ampadu,DF,Chelsea,1,0,11,0.1,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,1.0,0.0,391.0,149.0,5.0,5.0,13.0,13.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0,,0.0,0,22.0,2.0,9.0,13.0,0.0,0.0,22.0,0.0,0.0,0.0,16.0,62.0,33.0,0.0,0.0,0.0,0.0,19.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0
33,Sam Baldock,FW,Brighton,2,0,32,0.4,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,5.0,41.0,7.0,3.0,5.0,1.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,,0.0,0,10.0,0.0,0.0,3.0,7.0,3.0,10.0,1.0,0.0,1.0,6.0,11.0,1.0,0.0,0.0,2.0,1.0,10.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
495,Kyle Walker-Peters,DF,Tottenham,3,2,189,2.1,0,2,0,0,0,0,0,0.0,0.0,1.2,8.0,8.0,21.0,1608.0,479.0,56.0,63.0,35.0,50.0,9.0,13.0,1.0,3.0,5.0,4.0,2.0,0.0,0,,0.0,0,157.0,7.0,43.0,62.0,55.0,4.0,157.0,8.0,4.0,4.0,84.0,433.0,207.0,5.0,1.0,3.0,5.0,99.0,5.0,4.0,4.0,1.0,0.0,2.0,4.0,0.0,2.0,3.0,8.0,0.0
497,Aaron Wan-Bissaka,DF,Crystal Palace,7,7,627,7.0,0,0,0,0,0,1,0,0.0,0.0,0.2,22.0,23.0,14.0,2676.0,1448.0,125.0,150.0,56.0,89.0,8.0,33.0,0.4,3.0,12.0,4.0,3.0,0.0,0,,0.0,0,411.0,45.0,169.0,154.0,93.0,1.0,411.0,15.0,13.0,2.0,175.0,997.0,623.0,22.0,1.0,9.0,6.0,168.0,30.0,19.0,20.0,7.0,3.0,15.0,19.0,7.0,11.0,15.0,45.0,0.0
506,Dean Whitehead,MF,Huddersfield,4,0,71,0.8,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,1.0,0.0,808.0,221.0,15.0,17.0,25.0,30.0,3.0,4.0,0.0,0.0,3.0,0.0,0.0,0.0,0,,0.0,0,54.0,1.0,19.0,34.0,1.0,0.0,54.0,0.0,0.0,0.0,37.0,163.0,60.0,0.0,0.0,0.0,0.0,41.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,2.0,0.0,0.0
517,Ben Woodburn,FW,Liverpool,1,0,7,0.1,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,2.0,92.0,11.0,4.0,4.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0,,0.0,0,8.0,0.0,0.0,3.0,5.0,1.0,8.0,0.0,0.0,0.0,6.0,15.0,5.0,2.0,0.0,0.0,0.0,8.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [14]:
complete_without_calculated_columns[["shots", "shots_on_target", "average_shot_distance"]][(complete_without_calculated_columns["shots"] != 0) & (complete_without_calculated_columns["average_shot_distance"].isnull())]

Unnamed: 0,shots,shots_on_target,average_shot_distance
363,,0,


In [15]:
complete_without_calculated_columns[["shots", "shots_on_target", "average_shot_distance"]][(complete_without_calculated_columns["shots"] == 0) & (complete_without_calculated_columns["average_shot_distance"].isnull())]

Unnamed: 0,shots,shots_on_target,average_shot_distance
4,0.0,0,
5,0.0,0,
15,0.0,0,
16,0.0,0,
33,0.0,0,
...,...,...,...
495,0.0,0,
497,0.0,0,
506,0.0,0,
517,0.0,0,


On closer inspection, this occurs when the number of shots taken by a player is 0, so a missing value may be the only way the source can display it. Setting the value to 0 would imply that these players take shots from 0 yards away from the goal, which is not accurate, while an arbitrarily high number may also adversely affect the conclusions drawn from the data.

Setting it to 0, however, may be the only viable option in this situation

In [16]:
complete_without_calculated_columns["average_shot_distance"] = complete_without_calculated_columns["average_shot_distance"].fillna(0)
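
One possible way to soften the drawback described above is to keep a flag recording which players took no shots, so a zero-filled distance can be told apart from a real measurement. This is a sketch on made-up data rather than a step the notebook takes; the `took_shot` column name is hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up values; the middle player took no shots
demo = pd.DataFrame({
    "shots": [12.0, 0.0, 3.0],
    "average_shot_distance": [17.5, np.nan, 22.0],
})

# Record whether a distance was actually observed before overwriting the gaps
demo["took_shot"] = demo["average_shot_distance"].notna().astype(int)

# Same zero fill as in the notebook
demo["average_shot_distance"] = demo["average_shot_distance"].fillna(0)

print(demo)
```

Whether such a flag helps is likely to depend on how the features are scaled before clustering.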

In [22]:
complete_without_calculated_columns[complete_without_calculated_columns.isnull().any(axis=1)]

Unnamed: 0,player,position,team,matches_played,starts,minutes_played,90s,goals,assists,non_penalty_goals,penalties_scored,penalties_attempted,yellow_cards,red_cards,expected_goals,non_penalty_expected_goals,expected_assisted_goals,progressive_carries,progressive_passes,progressive_passes_recieved,total_passing_distance,total_progressive_passing_distance,short_passes_completed,short_passes_attempted,medium_passes_completed,medium_passes_attempted,long_passes_completed,long_passes_attempted,expected_assists,key_passes,passes_into_final_third,passes_into_penalty_area,crosses_into_penalty_area,shots,shots_on_target,average_shot_distance,shots_from_free_kicks,shots_from_penalties,touches,touches_in_defensive_penalty_area,touches_in_defensive_third,touches_in_middle_third,touches_in_attacking_third,touches_in_attacking_penalty_area,live_ball_touches,take_ons_attempted,take_ons_succeeded,times_tackled_during_take_on,carries,total_carrying_distance,progressive_carrying_distance,carries_into_final_third,carries_into_penalty_area,miscontrols,dispossessed,passes_received,tackles,tackles_won,defensive_third_tackles,middle_third_tackles,attacking_third_tackles,dribblers_tackled,dribbler_tackles_attempted,shots_blocked,passes_blocked,interceptions,clearances,errors_leading_to_shot
363,Aiden O'Neill,MF,Burnley,1,0,1,0.0,0,0,0,0,0,0,0,,,,,,,,,,,,,,,,,,,,,0,0.0,,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


The remaining null values all come from the entry above for **Aiden O'Neill**, who played one minute in the whole season and therefore has no entries for many of the columns in the table. In this case, and only this one, the row will be dropped: it does not contain any useful information, and it could be said that this player had no effect on the Burnley team over the season. It would also be very hard to group this player with others, as there is not enough data on him to come to any conclusions

In [23]:
complete_no_nulls = complete_without_calculated_columns[complete_without_calculated_columns["player"] != "Aiden O'Neill"]
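
Filtering by name works here because only one row is affected. As an alternative sketch (not the notebook's approach), `dropna` removes any row that still contains a missing value, which avoids hard-coding the player name and cannot remove a different player who happens to share it. The toy frame below uses made-up values.

```python
import numpy as np
import pandas as pd

# Made-up values; "Player B" stands in for the one-minute appearance
demo = pd.DataFrame({
    "player": ["Player A", "Player B", "Player C"],
    "minutes_played": [2184, 1, 411],
    "touches": [641.0, np.nan, 89.0],
})

# Drop every row with at least one missing value, then renumber the index
demo_no_nulls = demo.dropna(axis=0, how="any").reset_index(drop=True)

print(demo_no_nulls)
```

Resetting the index is optional; without it, the original row labels are kept and a gap is left where the dropped row was.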

In [24]:
complete_no_nulls.info()

<class 'pandas.core.frame.DataFrame'>
Index: 528 entries, 0 to 528
Data columns (total 68 columns):
 #   Column                              Non-Null Count  Dtype  
---  ------                              --------------  -----  
 0   player                              528 non-null    object 
 1   position                            528 non-null    object 
 2   team                                528 non-null    object 
 3   matches_played                      528 non-null    int64  
 4   starts                              528 non-null    int64  
 5   minutes_played                      528 non-null    int64  
 6   90s                                 528 non-null    float64
 7   goals                               528 non-null    int64  
 8   assists                             528 non-null    int64  
 9   non_penalty_goals                   528 non-null    int64  
 10  penalties_scored                    528 non-null    int64  
 11  penalties_attempted                 528 non-null  

### Removing unnecessary columns

For the objective of this notebook, which is to help identify key clusters of players within the data