# Analysis of the Clustering Results

In this notebook, we interpret the clusters based on each player's hierarchical cluster assignment.

*K-Means clustering* is used to double-check the cluster that each player was placed in by *Hierarchical Agglomerative Clustering*.
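
One way to make this cross-check concrete is a contingency table of the two label columns. The sketch below runs on a small made-up frame (the player names and labels are illustrative, not taken from the dataset) and assumes the `hCLUSTER` and `kCLUSTER` column names used later in this notebook. Cluster IDs produced by two different algorithms are arbitrary, so agreement shows up as each row having one dominant column rather than as matching numbers.

```python
import pandas as pd

# Toy labels for illustration only; the real values come from the two CSV files.
toy = pd.DataFrame({
    'PLAYER':   ['A', 'B', 'C', 'D', 'E', 'F'],
    'hCLUSTER': [1, 1, 2, 2, 3, 3],
    'kCLUSTER': [2, 2, 0, 0, 1, 0],
})

# Rows: hierarchical cluster, columns: K-Means cluster, cells: player counts.
agreement = pd.crosstab(toy['hCLUSTER'], toy['kCLUSTER'])

# Share of each hierarchical cluster that lands in its most common K-Means cluster.
purity = agreement.max(axis=1) / agreement.sum(axis=1)

print(agreement)
print(purity)
```

A purity close to 1 for a hierarchical cluster suggests K-Means groups the same players together; a low value flags a cluster whose interpretation deserves more caution.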

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.preprocessing import StandardScaler
from scipy.interpolate import make_interp_spline

In [None]:
hClusterData = pd.read_csv('./Data/hierarchicalClustering.csv')
kClusterData = pd.read_csv('./Data/kmeansClustering.csv')
AllData = pd.read_csv('./Data/Cleaned/AllData.csv')
AllData = AllData.drop(["Unnamed: 0"], axis=1)
AllData = AllData.rename(columns={'Player':'PLAYER','Team':'TEAM'})

clusterDataframes = [kClusterData,hClusterData,AllData] 

clusterData = pd.DataFrame()
clusterData['PLAYER'] = hClusterData['PLAYER']
for x in clusterDataframes:
    clusterData = clusterData.merge(x,on="PLAYER",how="outer",suffixes=("","_delme"))
clusterData = clusterData[[c for c in clusterData.columns if not c.endswith('_delme')]]
#clusterData = clusterData[['PLAYER','TEAM','hCLUSTER','kCLUSTER','PC1','PC2','PC3','PC4','PC5']]
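
Because the merge uses `how="outer"`, a player who appears in only one of the input files is kept with missing values in the other columns. The sketch below, on made-up frames, shows how pandas' `indicator=True` option could be used to list such players. The frame names `hToy` and `kToy` are hypothetical stand-ins, not objects from this notebook.

```python
import pandas as pd

# Toy stand-ins for two of the input tables (names and values are made up).
hToy = pd.DataFrame({'PLAYER': ['A', 'B', 'C'], 'hCLUSTER': [1, 2, 3]})
kToy = pd.DataFrame({'PLAYER': ['A', 'B', 'D'], 'kCLUSTER': [0, 1, 1]})

# indicator=True adds a `_merge` column recording where each row came from.
checked = hToy.merge(kToy, on='PLAYER', how='outer', indicator=True)

# Players that appear in only one of the two tables.
unmatched = checked.loc[checked['_merge'] != 'both', 'PLAYER'].tolist()

print(sorted(unmatched))
```

An empty list would mean every player received both a hierarchical and a K-Means label.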

In [None]:
clusterData.head()

In [None]:
clusterData[clusterData['hCLUSTER']==1].head(10)

## Cluster 1 - Elite Scoring Players
From hCLUSTER 1, we can see that the players generally have a high PC2 value and a low PC3 value.  
This means that these players are offensive-minded stars who do not focus on assists and can shoot the ball well.
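
Reading signs off the first ten rows of a cluster can be misleading, so a per-cluster mean of each principal component is a useful complement. The sketch below uses made-up values and assumes the `PC2` and `PC3` column names that appear in the commented-out column list above.

```python
import pandas as pd

# Toy principal-component scores for illustration only.
toy = pd.DataFrame({
    'hCLUSTER': [1, 1, 2, 2],
    'PC2': [3.0, 5.0, -1.0, -3.0],
    'PC3': [-2.0, -4.0, 1.0, 3.0],
})

# Average score of each component within each hierarchical cluster.
pcMeans = toy.groupby('hCLUSTER')[['PC2', 'PC3']].mean()

print(pcMeans)
```

On the real data the same call, extended to all five components, would give one row per cluster to compare against the descriptions in this notebook.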

In [None]:
clusterData[clusterData['hCLUSTER']==2].head(10)

## Cluster 2 - Assist Maker Guards that operate around the paint
From hCLUSTER 2, we can see that the players generally have slightly negative PC1, PC2 and PC4 values and slightly positive PC3 and PC5 values.  
This implies that these players are not centers but guards who operate around the paint, passing and scoring mainly from there.


In [None]:
clusterData[clusterData['hCLUSTER']==3].head(10)

## Cluster 3 - Offensive Minded Guards
From hCLUSTER 3, we can see that the players have negative PC1 values and positive PC2 values.
This implies that the players are offensive-minded guards.

In [None]:
clusterData[clusterData['hCLUSTER']==4].head(10)

## Cluster 4 - Offensive Minded Forwards
From hCLUSTER 4, we can see that the PC2 values are positive and the PC3 values are negative.
This implies that the players are scoring-focused forwards who do not operate mainly around the paint.


In [None]:
clusterData[clusterData['hCLUSTER']==5].head(10)

## Cluster 5 - Shooting Big Men
From hCLUSTER 5, we can see that PC2 is positive while PC3 and PC5 are negative.
Coupled with the stats, we can tell that the players in this cluster are big men who can shoot the ball well.

In [None]:
clusterData[clusterData['hCLUSTER']==6].head(10)

## Cluster 6 - Effective Shooters
From hCLUSTER 6, we can see that PC2 and PC3 are negative.
Coupled with the stats, we can tell that the players in this cluster are effective shooters. They are good at pull-ups and catch-and-shoot attempts.

In [None]:
clusterData[clusterData['hCLUSTER']==7].head(10)

## Cluster 7 - Scoring Shooters 
From hCLUSTER 7, we can see that PC1 and PC4 are negative while PC2 is positive.
Coupled with the stats, we can tell that the players in this cluster focus on scoring and do not score mainly from the paint.

In [None]:
clusterData[clusterData['hCLUSTER']==8].head(10)

## Cluster 8 - Traditional Big Men 
From hCLUSTER 8, we can see that PC1 and PC2 are positive while the remaining PCs are mixed.
Coupled with the stats, we can tell that these players operate primarily in the paint and score most of their points there.


In [None]:
clusterData[clusterData['hCLUSTER']==9].head(10)

In [None]:
teamStandings = pd.read_excel('./Data/TeamRecords.xlsx')
teamStandings

In [None]:
teamCompositionDf = pd.DataFrame()
teamCompositionDf["Team"] = np.nan
teamCompositionDf["Cluster1"] = np.nan
teamCompositionDf["Cluster2"] = np.nan
teamCompositionDf["Cluster3"] = np.nan
teamCompositionDf["Cluster4"] = np.nan
teamCompositionDf["Cluster5"] = np.nan
teamCompositionDf["Cluster6"] = np.nan
teamCompositionDf["Cluster7"] = np.nan
teamCompositionDf["Cluster8"] = np.nan
teamCompositionDf["Cluster9"] = np.nan

for team in clusterData.TEAM.unique():
    clusArr = [team] + [0] * 9

    teamData = clusterData.loc[clusterData['TEAM'] == team]

    for index, row in teamData.iterrows():
        clusArr[row['hCLUSTER']] += 1

    teamCompositionDf.loc[len(teamCompositionDf.index)] = clusArr

teamCompositionDf
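
The counting loop above can also be expressed as a cross-tabulation. The following is a minimal sketch on made-up data with three clusters instead of nine. It is an alternative formulation, not the code that produced the results in this notebook, and the `reindex` step is there so that a cluster with no players on any team still gets a column.

```python
import pandas as pd

# Toy roster for illustration only.
toy = pd.DataFrame({
    'PLAYER': ['A', 'B', 'C', 'D', 'E'],
    'TEAM': ['X', 'X', 'X', 'Y', 'Y'],
    'hCLUSTER': [1, 1, 3, 2, 3],
})

# Count players per (team, cluster) pair.
counts = pd.crosstab(toy['TEAM'], toy['hCLUSTER'])

# Keep a column for every cluster ID (1..3 here, 1..9 in the notebook).
counts = counts.reindex(columns=range(1, 4), fill_value=0)
counts.columns = ['Cluster' + str(c) for c in counts.columns]

# Turn the team index into a regular `Team` column.
counts = counts.rename_axis('Team').reset_index()

print(counts)
```

Players with a missing `TEAM` or `hCLUSTER` value are silently dropped by `pd.crosstab`, which may or may not be the desired behaviour after an outer merge.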

In [None]:
mergeDF = [teamCompositionDf,teamStandings] 

teamComparison = pd.DataFrame()
teamComparison['Team'] = teamStandings['Team']
for x in mergeDF:
    teamComparison = teamComparison.merge(x,on="Team",how="outer",suffixes=("","_delme"))
teamComparison = teamComparison[[c for c in teamComparison.columns if not c.endswith('_delme')]]

In [None]:
teamComparison

In [None]:
teamComparisonTop10 = teamComparison.iloc[0:10, 0:10]
teamComparisonMid10 = teamComparison.iloc[10:20, 0:10]
teamComparisonBot10 = teamComparison.iloc[20:30, 0:10]

In [None]:
def TeamComparison(teamData, isTeam):
    # Work on a copy so the caller's counts are left untouched.
    teamComparisonProportion = teamData.copy(deep=True)
    clusterCols = teamComparisonProportion.columns.to_list()[1:10]

    # Convert each row's cluster counts (columns 1-9) into proportions of the row total.
    for index, row in teamComparisonProportion.iterrows():
        numPlayers = row[1:10].sum()

        newData = list(teamComparisonProportion.loc[index])

        for i in range(1, 10):
            newData[i] /= numPlayers

        teamComparisonProportion.loc[index, clusterCols] = newData[1:10]

    # The only difference between the two modes is the label column and the title.
    if (isTeam):
        teamComparisonProportion.plot(x='Team', kind='bar', stacked=True,
            title='Stacked Bar Graph by Teams',figsize=(20,10))
    else:
        teamComparisonProportion.plot(x='Tier', kind='bar', stacked=True,
            title='Stacked Bar Graph by Tiers',figsize=(20,10))
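
The row-by-row normalisation in `TeamComparison` can also be written as a single vectorised division. This is a sketch on a made-up two-cluster table. The name `clusterCols` is introduced here for illustration.

```python
import pandas as pd

# Toy cluster counts for two teams (values are made up).
toy = pd.DataFrame({
    'Team': ['X', 'Y'],
    'Cluster1': [2, 1],
    'Cluster2': [2, 3],
})

clusterCols = ['Cluster1', 'Cluster2']

# Divide every count by its row total so each row sums to 1.
proportions = toy.copy()
proportions[clusterCols] = toy[clusterCols].div(toy[clusterCols].sum(axis=1), axis=0)

print(proportions)
```

Plotting `proportions` with `kind='bar', stacked=True` would then give bars of equal height, which makes teams with different roster sizes comparable.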

In [None]:
TeamComparison(teamComparisonTop10, True)

In [None]:
TeamComparison(teamComparisonMid10, True)

In [None]:
TeamComparison(teamComparisonBot10, True)

From our initial plots, we do see some differences in team composition between the 3 tiers of teams. The lower 2 tiers seem to contain many more cluster 9 and cluster 4 players than the top tier. And although rare in general, cluster 1 players seem to appear much more often in top-tier teams. We combine the data for the 3 tiers to confirm our suspicion.

In [None]:
tierDf = pd.DataFrame()
tierDf["Tier"] = np.nan
tierDf["Cluster1"] = np.nan
tierDf["Cluster2"] = np.nan
tierDf["Cluster3"] = np.nan
tierDf["Cluster4"] = np.nan
tierDf["Cluster5"] = np.nan
tierDf["Cluster6"] = np.nan
tierDf["Cluster7"] = np.nan
tierDf["Cluster8"] = np.nan
tierDf["Cluster9"] = np.nan

tiers = [["Top10", teamComparison.iloc[0:10, 0:10]], ["Mid10", teamComparison.iloc[10:20, 0:10]], ["Bot10", teamComparison.iloc[20:30, 0:10]]]

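for tier in tiers:

    clusArr = [tier[0]] + [0] * 9

    # Sum each cluster's player count over the teams in this tier.
    for index, row in tier[1].iterrows():
        for i in range(1, 10):
            clusArr[i] += row.iloc[i]

    # Append one row per tier instead of reusing the inner loop's last index.
    tierDf.loc[len(tierDf.index)] = clusArr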

tierDf
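
The tier totals can also be computed by labelling each team with its tier and grouping. The sketch below assumes, as the slicing above does, that the teams are already sorted from best to worst record. It uses six made-up teams with two per tier instead of thirty with ten per tier.

```python
import numpy as np
import pandas as pd

# Six toy teams, already sorted from best to worst record (values are made up).
toy = pd.DataFrame({
    'Team': ['T1', 'T2', 'T3', 'T4', 'T5', 'T6'],
    'Cluster1': [2, 1, 1, 0, 0, 0],
    'Cluster4': [1, 1, 2, 2, 3, 3],
})

# Label teams by position: first two are Top, next two Mid, last two Bot.
toy['Tier'] = np.repeat(['Top', 'Mid', 'Bot'], 2)

# sort=False keeps the tiers in order of appearance rather than alphabetical order.
tierTotals = toy.groupby('Tier', sort=False)[['Cluster1', 'Cluster4']].sum().reset_index()

print(tierTotals)
```

If the standings table were not sorted, the positional labels would be wrong, so it may be worth sorting by the win column before assigning tiers.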

In [None]:
TeamComparison(tierDf, False)

Our suspicions are confirmed: top-tier teams generally have far fewer cluster 4 and 9 players than bottom-tier teams. They also have many more cluster 5, 1 and 2 players. Clusters 3, 6, 7 and 8 appear to be more evenly distributed across the 3 tiers and are less important in distinguishing better-performing teams. To make the trends of the 9 clusters clearer, we attempt simple line plots with a common y-axis range below.
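
To put a number on how much the tiers differ, one option is to compare each cluster's share of players in the top tier with its share in the bottom tier. The sketch below uses made-up counts for two clusters. The tier labels match the ones used in this notebook, but the values do not come from the data.

```python
import pandas as pd

# Toy tier totals for illustration only.
toy = pd.DataFrame({
    'Tier': ['Top10', 'Mid10', 'Bot10'],
    'Cluster1': [6, 3, 1],
    'Cluster4': [4, 7, 9],
}).set_index('Tier')

# Share of each tier's players that belongs to each cluster.
shares = toy.div(toy.sum(axis=1), axis=0)

# Positive gap: cluster is more common in the top tier; negative: more common at the bottom.
gap = (shares.loc['Top10'] - shares.loc['Bot10']).sort_values()

print(gap)
```

Clusters at either end of the sorted `gap` series are the ones that separate the tiers most, while values near zero correspond to evenly distributed clusters.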

In [None]:
fig, ax = plt.subplots(5, 2, figsize=(25,25))

# One line style per cluster, in the same order as the panels.
lineStyles = ['ko-', 'ro-', 'mo-', 'bo-', 'go-', 'yo-', 'g-', 'y-', 'r-']

for i, style in enumerate(lineStyles):
    clusterName = 'Cluster' + str(i + 1)
    panel = ax[i // 2, i % 2]  # fill the grid row by row
    panel.plot(tierDf.Tier, tierDf[clusterName], style, label='line' + str(i + 1))
    panel.set_title(clusterName)
    panel.set_ylim([0, 40])  # common y-axis range across all panels
    panel.set_ylabel('Number of players')
    panel.set_xlabel('Tier')


Again, the line plots largely confirm the trends we suspected earlier.

- cluster 1: elite scoring players
- cluster 2: assist maker guards that operate around the paint
- cluster 3: offensive minded guards
- cluster 4: offensive minded forwards
- cluster 5: shooting big men
- cluster 6: effective shooters
- cluster 7: scoring shooters
- cluster 8: traditional big men

insight 1: there is no need for defense; a good offense is the best defense
insight 2: 