# Clustering

In [1]:
# import main libraries
import pandas as pd
import numpy as np

In [2]:
dataset = pd.read_csv('readyformodel_V2.1.csv')
dataset.head()

Unnamed: 0,won,pim,powerPlayGoals,faceOffWinPercentage,shots,goals,takeaways,hits,blockedShots,giveaways,...,shortHandedTimeOnIce,powerPlayTimeOnIce,hoa_away,hoa_home,settledIn_OT,settledIn_REG,startRinkSide_left,startRinkSide_right,goalieReplacement_No,goalieReplacement_Yes
0,0,2.0,1.0,50.9,8.0,0.0,1.0,14.0,3.0,6.0,...,18.52,31.3,1,0,0,1,1,0,0,1
1,1,2.67,1.0,49.1,8.0,3.0,3.0,5.0,3.0,7.0,...,25.04,23.15,0,1,0,1,1,0,1,0
2,1,2.0,0.0,43.8,11.0,0.0,0.0,4.0,6.0,2.0,...,9.48,31.39,1,0,1,0,0,1,1,0
3,0,2.67,0.0,56.2,12.0,1.0,2.0,4.0,8.0,0.0,...,25.11,11.85,0,1,1,0,0,1,1,0
4,1,3.0,0.0,45.7,9.0,0.0,3.0,4.0,7.0,7.0,...,17.78,29.54,1,0,0,1,1,0,1,0


In [4]:
dataset.shape

(51557, 24)

In [3]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51557 entries, 0 to 51556
Data columns (total 24 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   won                    51557 non-null  int64  
 1   pim                    51557 non-null  float64
 2   powerPlayGoals         51557 non-null  float64
 3   faceOffWinPercentage   51557 non-null  float64
 4   shots                  51557 non-null  float64
 5   goals                  51557 non-null  float64
 6   takeaways              51557 non-null  float64
 7   hits                   51557 non-null  float64
 8   blockedShots           51557 non-null  float64
 9   giveaways              51557 non-null  float64
 10  missedShots            51557 non-null  float64
 11  penalties              51557 non-null  float64
 12  timeOnIce              51557 non-null  float64
 13  evenTimeOnIce          51557 non-null  float64
 14  shortHandedTimeOnIce   51557 non-null  float64
 15  po

## KMeans
We use the KMeans clustering algorithm to look for clusters in our data. We first look for the ideal number of clusters, then plot these clusters and derive insights from the analysis.

In [46]:
# import additional libraries
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler
import plotly.graph_objects as go
import plotly.express as px
from sklearn.metrics import silhouette_score

In [41]:
# data subset: columns used for clustering
X = dataset[['won', 'pim', 'powerPlayGoals', 'faceOffWinPercentage', 'penalties', 'timeOnIce', 'evenTimeOnIce', 'shortHandedTimeOnIce', 'powerPlayTimeOnIce', 'hoa_home', 'startRinkSide_right', 'goalieReplacement_Yes']]
# rescale every feature to the [0, 1] range so that no column dominates the distance computation
scaler = MinMaxScaler()
scaler.fit(X)
X_std = scaler.transform(X)
X_std = pd.DataFrame(X_std, columns=X.columns)
X_std.head()

Unnamed: 0,won,pim,powerPlayGoals,faceOffWinPercentage,penalties,timeOnIce,evenTimeOnIce,shortHandedTimeOnIce,powerPlayTimeOnIce,hoa_home,startRinkSide_right,goalieReplacement_Yes
0,0.0,0.028169,0.5,0.642677,0.0625,0.20103,0.229131,0.160374,0.213944,0.0,0.0,1.0
1,1.0,0.037606,0.5,0.619949,0.0625,0.19593,0.226686,0.216834,0.158237,1.0,0.0,0.0
2,1.0,0.028169,0.0,0.55303,0.0625,0.207126,0.249489,0.082092,0.214559,0.0,1.0,0.0
3,0.0,0.037606,0.0,0.709596,0.1875,0.205518,0.254139,0.21744,0.080998,1.0,1.0,0.0
4,1.0,0.042254,0.0,0.57702,0.125,0.198794,0.230904,0.153966,0.201914,0.0,0.0,0.0


In [42]:
# hyper-parameter tuning: fit KMeans for 2 to 6 clusters and record the inertia of each fit
inertia = []
for i in range(2, 7):
    model_km = KMeans(n_clusters=i, random_state = 5)
    model_km.fit(X_std)
    inertia.append(model_km.inertia_)

# graph inertia
fig = go.Figure(data = go.Scatter(x=np.arange(2, 7), y=inertia))
fig.update_layout(title = "Inertia vs Cluster Number", 
                  xaxis = dict(range=[0, 7], title="Cluster Number"),
                  yaxis = {'title': 'Inertia'}, 
                  annotations = [dict(x = 4, y=inertia[2], 
                                      xref="x", 
                                      yref="y", 
                                      text="Elbow?", 
                                      showarrow=True, 
                                      arrowhead=7, 
                                      ax=20, 
                                      ay=-40)])

### Insights
Looking at this graph, the elbow method does not point to a clear number of clusters. Fortunately, the aim of this analysis is to understand the characteristics of a winning team versus a losing team, so we can use 2 clusters: one for the teams that won their game and another for the teams that lost.
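When the inertia curve is ambiguous, one common complement is to compare the silhouette score across the same range of cluster counts. The sketch below is only an illustration under stated assumptions: `X_demo` is synthetic data from `make_blobs` standing in for `X_std`, and `sil_by_k` and `best_k` are names introduced here, not part of the notebook.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for X_std (assumption): 600 rows, 4 features, 3 blobs.
X_demo, _ = make_blobs(n_samples=600, n_features=4, centers=3,
                       cluster_std=1.0, random_state=5)
X_demo = MinMaxScaler().fit_transform(X_demo)

# Fit KMeans for each candidate cluster count and record the silhouette score.
sil_by_k = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=5)
    labels = km.fit_predict(X_demo)
    sil_by_k[k] = silhouette_score(X_demo, labels)

# The cluster count with the highest silhouette is one candidate, not a verdict.
best_k = max(sil_by_k, key=sil_by_k.get)
for k, score in sil_by_k.items():
    print(f"k={k}: silhouette={score:.3f}")
print("k with highest silhouette:", best_k)
```

On a dataset of this notebook's size (51,557 rows), the pairwise distances behind the silhouette score can be slow to compute. `silhouette_score` accepts a `sample_size` argument (with `random_state`) to score a random subset instead.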

In [51]:
# fit the final model with 2 clusters
model_km = KMeans(n_clusters=2, random_state=5)
model_km.fit(X_std)

# copy X_std so that adding the label column does not modify the scaled data
clusters = X_std.copy()
clusters['label'] = model_km.labels_

# average each feature per cluster and reshape to long format for the polar chart
polar = clusters.groupby("label").mean().reset_index()
polar = pd.melt(polar, id_vars=["label"])
fig2 = px.line_polar(polar, r="value", theta="variable", color="label",
                     title="Winning vs. Losing Team - Cluster Visualization",
                     line_close=True, height=800, width=900)
fig2.show()

### Insights
Label 0 (blue) represents the winning teams and label 1 (red) the losing teams. Some of the key differences between the two groups after the first period are:
- On the winning teams, the power play scores more with less time on ice on average.
- The winning teams are more often at home.
- The winning teams have far fewer goalie replacements than the losing teams.

Other aspects of the game, such as average time on ice, penalties, and starting rink side, are the same for both groups, whether they win or lose.
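Reading a cluster as "winning" or "losing" assumes that the cluster labels line up with the `won` column. A cross-tabulation of label against outcome is one way to check that. The sketch below is a minimal illustration, not the notebook's result: `demo` is a small synthetic frame standing in for `X_std`, and `purity` is a name introduced here.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(5)
n = 400

# Synthetic stand-in for X_std (assumption): a binary outcome plus two
# features that are already on a 0-1 scale.
demo = pd.DataFrame({
    "won": rng.integers(0, 2, size=n).astype(float),
    "pim": rng.uniform(0.0, 0.1, size=n),
    "faceOffWinPercentage": rng.uniform(0.4, 0.7, size=n),
})

km = KMeans(n_clusters=2, n_init=10, random_state=5)
labels = pd.Series(km.fit_predict(demo), name="label")

# Share of wins and losses inside each cluster (each row sums to 1).
purity = pd.crosstab(labels, demo["won"], normalize="index")
print(purity)
```

If each row is close to 0 or 1, the clusters follow the outcome. Rows near 0.5 would mean the clusters are driven by other columns and the "winning vs. losing" reading does not hold.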

In [47]:
# look at the silhouette score
labels = model_km.labels_
sil_score = silhouette_score(X_std, labels)
print("Silhouette score is", sil_score)

Silhouette score is 0.4240966372122099


### Insights
The silhouette score of about 0.42 indicates that the clusters are not well defined. There might be other characteristics of the game to consider when determining the outcome, or there might be fewer characteristics to consider.
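One way to explore whether fewer characteristics would give better-defined clusters is to refit KMeans on different column subsets and compare the silhouette scores. The sketch below uses synthetic data, not the hockey dataset: the column names `signal`, `noise_a`, and `noise_b` and the helper `silhouette_for` are hypothetical and exist only for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(5)
n = 300

# Synthetic stand-in (assumption): "signal" separates two groups,
# while the two "noise_*" columns carry no group information.
group = rng.integers(0, 2, size=n)
demo = pd.DataFrame({
    "signal": group + rng.normal(0.0, 0.1, size=n),
    "noise_a": rng.uniform(0.0, 1.0, size=n),
    "noise_b": rng.uniform(0.0, 1.0, size=n),
})

def silhouette_for(columns):
    """Fit a 2-cluster KMeans on the given columns and return the silhouette score."""
    subset = demo[columns]
    labels = KMeans(n_clusters=2, n_init=10, random_state=5).fit_predict(subset)
    return silhouette_score(subset, labels)

scores = {
    "all features": silhouette_for(["signal", "noise_a", "noise_b"]),
    "signal only": silhouette_for(["signal"]),
}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Scores computed on different feature sets are not strictly comparable, since the distances are measured in different spaces. A higher score after dropping columns is a hint worth investigating, not proof that those columns are irrelevant to the outcome.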