# Clustering Lab

 
Based on the amazing work you did in the movie industry, you've been recruited to the NBA! You are working as the VP of Analytics, supporting a head scout, Mr. Rooney, for the worst team in the NBA (probably the Wizards). Mr. Rooney just heard about data science and thinks it can solve all the team's problems! He wants you to find players who are high performing but not highly paid, players you can steal to get the team to the playoffs!

In this document you will work through a process similar to the one we did in class, using the NBA data files provided in the Canvas assignment and merging them together.

Details: 

- Determine a way to use clustering to estimate, based on performance, whether 
players are generally underpaid or overpaid. 

- Then select players you believe would be best for your team and explain why. Do so in three categories: 
    * Examples that are not good choices (3 or 4) 
    * Several options that are good choices (3 or 4)
    * Several options that could work, assuming you can't get the players in the good category (3 or 4)

- You will decide the cutoffs for each category, so you should be able to explain why you chose them.

- Provide a well-commented and clean report of your findings in a separate notebook that can be presented to Mr. Rooney, keeping in mind he doesn't understand...anything. Include a rationale for the variables you included in the model, details on your approach, and an overview of the results with supporting visualizations. 


Hints:

- Salary is the variable you are trying to understand 
- When interpreting you might want to use graphs that include variables that are the most correlated with Salary
- You'll need to scale the variables before performing the clustering
- Be specific about why you selected the players that you did, more detail is better
- Use good coding practices: comment heavily, indent, don't use for loops unless totally necessary, and create modular sections that align with some outcome. If necessary, create more than one script. List/load libraries at the top and don't include libraries that aren't used. 
- Watch out for non-traditional characters in the players' names; certain graphs won't work when these characters are included.
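
The last hint, about non-traditional characters in names, can be handled in several ways. One possible approach, sketched below with only the standard library, is to strip accents so each name is plain ASCII. The helper name `to_ascii` and the sample names are illustrative and do not come from the lab data. Note that characters with no ASCII decomposition (for example "ø") are dropped entirely by this method, so check the results by eye.

```python
import unicodedata


def to_ascii(name: str) -> str:
    """Return the name with diacritics removed (hypothetical helper)."""
    # NFKD splits an accented letter into a base letter plus a combining mark
    decomposed = unicodedata.normalize("NFKD", name)
    # Encoding to ASCII with errors="ignore" drops the combining marks
    return decomposed.encode("ascii", "ignore").decode("ascii")


# Illustrative names only; apply the helper to the "Player" column in practice
names = ["Nikola Jokić", "Luka Dončić", "Kevin Durant"]
clean_names = [to_ascii(n) for n in names]
print(clean_names)
```

On a DataFrame this could be applied with something like `df["Player"].map(to_ascii)`, ideally to both files before merging so the keys still match.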


In [2]:
# Import packages (only the libraries this notebook actually uses)
import pandas as pd
import numpy as np
import plotly.express as px
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import MinMaxScaler

In [3]:
# Import data
salary_data = pd.read_csv("2025_salaries.csv", header=1)
stats = pd.read_csv("nba_2025.txt", sep=",")

In [4]:
# Merge datasets to work with both at the same time
df = pd.merge(salary_data, stats, on="Player")
# Dropping rows without salary data because we cannot derive information from them without salary information
df = df.dropna(subset='2025-26')
df

Unnamed: 0,Player,Tm,2025-26,Rk,Age,Team,Pos,G,GS,MP,...,TRB,AST,STL,BLK,TOV,PF,PTS,Trp-Dbl,Awards,Player-additional
0,Garrison Mathews,IND,"$131,970",398.0,29.0,IND,SG,15.0,1.0,196.0,...,17.0,10.0,6.0,3.0,3.0,19.0,78.0,0.0,,mathega01
1,Garrison Mathews,IND,"$131,970",398.0,29.0,IND,SG,15.0,1.0,196.0,...,17.0,10.0,6.0,3.0,3.0,19.0,78.0,0.0,,mathega01
2,Mac McClung,IND,"$164,060",459.0,27.0,2TM,SG,4.0,0.0,47.0,...,5.0,2.0,5.0,2.0,3.0,8.0,23.0,0.0,,mccluma01
3,Mac McClung,IND,"$164,060",459.0,27.0,IND,SG,3.0,0.0,34.0,...,4.0,1.0,5.0,1.0,2.0,6.0,19.0,0.0,,mccluma01
4,Mac McClung,IND,"$164,060",459.0,27.0,CHI,SG,1.0,0.0,13.0,...,1.0,1.0,0.0,1.0,1.0,2.0,4.0,0.0,,mccluma01
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
509,Kevin Durant,HOU,"$54,708,609",8.0,37.0,HOU,SF,50.0,50.0,1835.0,...,267.0,222.0,42.0,44.0,161.0,101.0,1291.0,0.0,,duranke01
510,Joel Embiid,PHI,"$55,224,526",56.0,31.0,PHI,C,31.0,31.0,972.0,...,232.0,121.0,20.0,34.0,92.0,67.0,825.0,1.0,,embiijo01
511,Bradley Beal,LAC,"$59,020,270",426.0,32.0,LAC,SG,6.0,6.0,121.0,...,5.0,10.0,3.0,0.0,9.0,14.0,49.0,0.0,,bealbr01
512,Bradley Beal,PHO,"$59,020,270",426.0,32.0,LAC,SG,6.0,6.0,121.0,...,5.0,10.0,3.0,0.0,9.0,14.0,49.0,0.0,,bealbr01
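
The `2025-26` salary column in the output above is stored as text such as `"$131,970"`, so it cannot be correlated or plotted as a number until it is converted. A minimal sketch of one way to do that is shown below on a toy frame whose two salary strings are copied from the output; the new column name `Salary` is an assumption made for illustration.

```python
import pandas as pd

# Toy frame mimicking the merged data; values copied from the output above
toy = pd.DataFrame({
    "Player": ["Garrison Mathews", "Kevin Durant"],
    "2025-26": ["$131,970", "$54,708,609"],
})

# Strip "$" and "," and convert to float; "Salary" is a hypothetical column name
toy["Salary"] = (
    toy["2025-26"]
    .str.replace(r"[$,]", "", regex=True)
    .astype(float)
)
print(toy[["Player", "Salary"]])
```

With a numeric salary in place, it becomes possible to check which statistics are most correlated with salary, as the hints suggest.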


In [5]:
# Checking how many duplicates there are
# df.duplicated(...) returns a boolean Series (True for duplicated players);
# indexing df with it keeps only the True rows. keep=False flags every
# occurrence of a duplicated player, not just the repeats.
duplicates = df[df.duplicated(subset="Player", keep=False)]
duplicates

Unnamed: 0,Player,Tm,2025-26,Rk,Age,Team,Pos,G,GS,MP,...,TRB,AST,STL,BLK,TOV,PF,PTS,Trp-Dbl,Awards,Player-additional
0,Garrison Mathews,IND,"$131,970",398.0,29.0,IND,SG,15.0,1.0,196.0,...,17.0,10.0,6.0,3.0,3.0,19.0,78.0,0.0,,mathega01
1,Garrison Mathews,IND,"$131,970",398.0,29.0,IND,SG,15.0,1.0,196.0,...,17.0,10.0,6.0,3.0,3.0,19.0,78.0,0.0,,mathega01
2,Mac McClung,IND,"$164,060",459.0,27.0,2TM,SG,4.0,0.0,47.0,...,5.0,2.0,5.0,2.0,3.0,8.0,23.0,0.0,,mccluma01
3,Mac McClung,IND,"$164,060",459.0,27.0,IND,SG,3.0,0.0,34.0,...,4.0,1.0,5.0,1.0,2.0,6.0,19.0,0.0,,mccluma01
4,Mac McClung,IND,"$164,060",459.0,27.0,CHI,SG,1.0,0.0,13.0,...,1.0,1.0,0.0,1.0,1.0,2.0,4.0,0.0,,mccluma01
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
483,James Harden,CLE,"$39,182,693",13.0,36.0,CLE,PG,3.0,3.0,97.0,...,16.0,26.0,1.0,4.0,11.0,7.0,58.0,0.0,,hardeja01
489,Deandre Ayton,LAL,"$43,654,814",111.0,27.0,LAL,C,46.0,46.0,1299.0,...,389.0,41.0,29.0,44.0,63.0,107.0,609.0,0.0,,aytonde01
490,Deandre Ayton,POR,"$43,654,814",111.0,27.0,LAL,C,46.0,46.0,1299.0,...,389.0,41.0,29.0,44.0,63.0,107.0,609.0,0.0,,aytonde01
511,Bradley Beal,LAC,"$59,020,270",426.0,32.0,LAC,SG,6.0,6.0,121.0,...,5.0,10.0,3.0,0.0,9.0,14.0,49.0,0.0,,bealbr01


In [6]:
# Keep one row per player: the row with the highest MP (minutes played).
# For a traded player this appears to be the combined multi-team row (e.g. "2TM"),
# since it covers the most minutes.
def fix_dups(df):
    df = df.loc[df.groupby('Player')['MP'].idxmax()].reset_index(drop=True)
    return df
fixed_df = fix_dups(df)

In [7]:
# This dataframe now has no more duplicate players
fixed_df.head()

Unnamed: 0,Player,Tm,2025-26,Rk,Age,Team,Pos,G,GS,MP,...,TRB,AST,STL,BLK,TOV,PF,PTS,Trp-Dbl,Awards,Player-additional
0,A.J. Green,MIL,"$2,301,587",142.0,26.0,MIL,SG,49.0,49.0,1480.0,...,128.0,99.0,27.0,5.0,41.0,115.0,522.0,0.0,,greenaj01
1,AJ Johnson,DAL,"$3,090,480",404.0,21.0,2TM,SG,28.0,0.0,225.0,...,32.0,23.0,7.0,0.0,18.0,15.0,70.0,0.0,,johnsaj01
2,Aaron Gordon,DEN,"$22,841,455",193.0,30.0,DEN,PF,23.0,20.0,642.0,...,142.0,58.0,16.0,4.0,24.0,38.0,406.0,0.0,,gordoaa01
3,Aaron Holiday,HOU,"$2,296,274",304.0,29.0,HOU,PG,35.0,1.0,465.0,...,30.0,34.0,16.0,4.0,21.0,47.0,200.0,0.0,,holidaa01
4,Aaron Nesmith,IND,"$11,000,000",174.0,26.0,IND,SF,32.0,29.0,977.0,...,152.0,68.0,21.0,18.0,46.0,86.0,434.0,0.0,,nesmiaa01
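
To make the de-duplication step concrete, here is a small sketch of the same `groupby(...).idxmax()` pattern on a toy frame. The Mac McClung and Kevin Durant rows reuse the `Team` and `MP` values shown in the earlier output; the frame itself is only an illustration, not the lab data.

```python
import pandas as pd

# Toy data: one traded player with three rows, one player with a single row
toy = pd.DataFrame({
    "Player": ["Mac McClung", "Mac McClung", "Mac McClung", "Kevin Durant"],
    "Team": ["2TM", "IND", "CHI", "HOU"],
    "MP": [47.0, 34.0, 13.0, 1835.0],
})

# idxmax returns, for each player, the index label of the row with the largest MP
keep_idx = toy.groupby("Player")["MP"].idxmax()

# Selecting those labels leaves exactly one row per player
deduped = toy.loc[keep_idx].reset_index(drop=True)
print(deduped)
```

Here the traded player's combined `2TM` row survives because it has the most minutes, which is usually the row you want for season totals.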


In [8]:
# Normalizing all numeric columns so they are comparable
numeric = list(fixed_df.select_dtypes('number'))
fixed_df[numeric] = MinMaxScaler().fit_transform(fixed_df[numeric])
fixed_df



Unnamed: 0,Player,Tm,2025-26,Rk,Age,Team,Pos,G,GS,MP,...,TRB,AST,STL,BLK,TOV,PF,PTS,Trp-Dbl,Awards,Player-additional
0,A.J. Green,MIL,"$2,301,587",0.268061,0.318182,MIL,SG,0.870370,0.875000,0.736264,...,0.210873,0.219027,0.247706,0.047170,0.230337,0.629834,0.335045,0.0,,greenaj01
1,AJ Johnson,DAL,"$3,090,480",0.766160,0.090909,2TM,SG,0.481481,0.000000,0.109391,...,0.052718,0.050885,0.064220,0.000000,0.101124,0.077348,0.044929,0.0,,johnsaj01
2,Aaron Gordon,DEN,"$22,841,455",0.365019,0.500000,DEN,PF,0.388889,0.357143,0.317682,...,0.233937,0.128319,0.146789,0.037736,0.134831,0.204420,0.260591,0.0,,gordoaa01
3,Aaron Holiday,HOU,"$2,296,274",0.576046,0.454545,HOU,PG,0.611111,0.017857,0.229271,...,0.049423,0.075221,0.146789,0.037736,0.117978,0.254144,0.128370,0.0,,holidaa01
4,Aaron Nesmith,IND,"$11,000,000",0.328897,0.318182,IND,SF,0.555556,0.517857,0.485015,...,0.250412,0.150442,0.192661,0.169811,0.258427,0.469613,0.278562,0.0,,nesmiaa01
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
406,Zach Edey,MEM,"$6,045,000",0.659696,0.181818,MEM,C,0.166667,0.196429,0.138861,...,0.200988,0.026549,0.064220,0.198113,0.146067,0.198895,0.096277,0.0,,edeyza01
407,Zach LaVine,SAC,"$47,499,660",0.140684,0.500000,SAC,SG,0.685185,0.660714,0.608392,...,0.177924,0.196903,0.256881,0.094340,0.415730,0.441989,0.480103,0.0,,lavinza01
408,Zeke Nnaji,DEN,"$8,177,778",0.633080,0.272727,DEN,PF,0.685185,0.035714,0.246753,...,0.177924,0.046460,0.119266,0.188679,0.117978,0.281768,0.105263,0.0,,nnajize01
409,Ziaire Williams,BRK,"$6,250,000",0.395437,0.227273,BRK,SF,0.666667,0.089286,0.433067,...,0.163097,0.068584,0.422018,0.179245,0.224719,0.425414,0.237484,0.0,,willizi02
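
As a quick check on what the scaling cell does, the sketch below compares `MinMaxScaler` with the formula it applies per column, `(x - min) / (max - min)`. The small array is made up for illustration. In the real notebook it may also be worth scaling only the columns used for clustering, so that ID-like columns such as `Unnamed: 0` are left alone; that is a suggestion, not something the lab requires.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy matrix: two columns on very different scales (e.g. minutes and points)
X = np.array([
    [13.0, 4.0],
    [196.0, 78.0],
    [1835.0, 1291.0],
])

# Library version: each column is rescaled to the range [0, 1]
scaled = MinMaxScaler().fit_transform(X)

# Manual version of the same formula, applied column by column
manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(np.allclose(scaled, manual))
```

Because k-means relies on Euclidean distance, putting every feature on the same 0 to 1 range keeps a large-valued column like `MP` from dominating the clusters.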


In [17]:
#Run the clustering algo with your best guess for K
# Using rank (Rk), minutes played (MP), and points (PTS) as the clustering features
kmeans_obj = KMeans(n_clusters=2, random_state=1).fit(fixed_df.loc[:,['Rk','MP','PTS']])

In [10]:
#View the results
# cluster_centers_: The coordinates of the 2 cluster centers in 3D space
#                   (one center for each cluster)
print(kmeans_obj.cluster_centers_)
# labels_: Which cluster (0 or 1) each player was assigned to
print(kmeans_obj.labels_)
# inertia_: Within-cluster sum of squares (lower is better)
#           Measures how tight/compact the clusters are
print(kmeans_obj.inertia_)

[[0.64767021 0.22446657 0.10691017]
 [0.19917076 0.65618301 0.44576864]]
[1 0 0 0 1 1 1 0 0 1 0 0 1 1 0 0 0 1 1 1 1 0 1 0 0 0 1 1 1 1 0 0 0 1 0 0 0
 1 0 0 0 1 0 1 1 1 1 0 0 1 1 0 1 1 0 0 1 0 0 1 0 0 1 1 0 1 0 0 0 0 1 0 0 1
 0 1 1 0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 1 0 1 1 0 1 0 1 1 1 0 1 1 0 0 1 1 1 0
 0 0 1 0 0 0 1 0 0 1 1 0 0 1 1 0 0 0 0 0 0 1 1 0 0 0 1 0 0 1 0 0 1 0 1 1 0
 0 1 1 1 0 0 0 1 0 0 1 0 1 1 0 1 1 0 1 0 1 1 1 0 1 1 1 0 0 1 0 1 0 1 0 1 0
 1 1 0 1 0 1 0 1 1 0 0 0 0 1 0 1 0 1 1 0 0 0 0 0 0 1 1 0 0 0 0 1 1 0 1 0 0
 1 1 0 0 1 1 0 0 1 1 0 1 1 0 1 0 1 0 0 1 1 0 1 0 1 0 0 1 0 0 1 1 0 0 1 1 0
 1 1 0 0 1 1 0 0 0 0 1 0 1 0 0 0 1 0 1 1 1 0 1 0 1 0 1 0 1 1 0 1 1 0 0 0 0
 1 0 1 0 1 1 1 1 0 0 1 0 0 1 0 1 0 1 1 0 0 0 1 1 0 1 0 1 1 0 1 1 0 1 1 1 1
 1 1 1 0 0 1 0 0 1 1 1 1 1 0 1 1 1 1 1 1 1 0 1 0 1 1 1 1 1 1 0 1 0 0 0 0 1
 0 0 0 1 1 0 1 0 0 1 1 0 0 1 1 0 0 0 0 1 0 0 1 1 0 1 0 0 1 0 0 0 0 0 1 0 0
 1 0 0 1]
23.136082779114975
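
The last number printed above is `inertia_`, the within-cluster sum of squares. The sketch below recomputes it by hand on synthetic data to show what it measures: the summed squared distance from each point to its assigned cluster center. The random data and the `n_init=10` setting are assumptions for the example, not values from the lab.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: two tight groups of 20 points in 3 dimensions
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.2, 0.05, size=(20, 3)),
    rng.normal(0.8, 0.05, size=(20, 3)),
])

model = KMeans(n_clusters=2, n_init=10, random_state=1).fit(X)

# Look up the center assigned to every point, then sum the squared distances
assigned_centers = model.cluster_centers_[model.labels_]
manual_inertia = ((X - assigned_centers) ** 2).sum()

print(manual_inertia, model.inertia_)
```

The two printed values should agree up to floating-point rounding.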


In [11]:
#Create a visualization of the results with 2 or 3 variables that you think will best
#differentiate the clusters
fig = px.scatter_3d(
    fixed_df, x="Rk", y="MP", z="PTS",
    color=kmeans_obj.labels_.astype(str),  # strings give discrete cluster colors
    title="Rank vs. Minutes Played vs. Points for NBA Players")
fig.show(renderer="browser")

In [12]:
#Evaluate the quality of the clustering using total variance explained and silhouette scores
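# Total sum of squares: squared distance of every point from the per-column (feature) means
features = fixed_df.loc[:, ["Rk", "MP", "PTS"]]
# axis=0 makes the column-wise mean explicit, independent of the pandas/NumPy version
total = ((features - features.mean(axis=0)) ** 2).to_numpy().sum()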
print(f"Total Sum of Squares (TSS): {total}")
# Between cluster sum of squares
between_SSE = (total - kmeans_obj.inertia_)
print(f"Between-Cluster Sum of Squares (BSS): {between_SSE}")
# Variance explained
var_explained = between_SSE / total
print(f"Variance Explained: {var_explained:.4f} or {var_explained*100:.2f}%")

Total Sum of Squares (TSS): 81.80790206428642
Between-Cluster Sum of Squares (BSS): 58.67181928517145
Variance Explained: 0.7172 or 71.72%
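
The variance-explained figure rests on the identity TSS = WSS + BSS, where WSS is the within-cluster sum of squares (the inertia) and BSS is the between-cluster sum of squares. The sketch below checks that identity on synthetic data by computing all three pieces separately. The data are random and the column names simply mirror the ones used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in for the scaled features (not the real NBA data)
rng = np.random.default_rng(0)
X = pd.DataFrame(
    np.vstack([
        rng.normal(0.2, 0.05, size=(20, 3)),
        rng.normal(0.8, 0.05, size=(20, 3)),
    ]),
    columns=["Rk", "MP", "PTS"],
)

labels = KMeans(n_clusters=2, n_init=10, random_state=1).fit_predict(X)

# Total: squared distance of every point from the overall column means
grand_mean = X.mean(axis=0)
tss = ((X - grand_mean) ** 2).to_numpy().sum()

# Within: squared distance of every point from its own cluster's mean
cluster_means = X.groupby(labels).transform("mean")
wss = ((X - cluster_means) ** 2).to_numpy().sum()

# Between: squared distance of each point's cluster mean from the overall mean
bss = ((cluster_means - grand_mean) ** 2).to_numpy().sum()

print(f"TSS={tss:.4f}  WSS+BSS={wss + bss:.4f}  explained={bss / tss:.4f}")
```

If TSS and WSS + BSS disagree in your own notebook, the most likely cause is that the mean was taken over the whole table instead of per column.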


In [13]:
# Looking at silhouette scores
from sklearn.metrics import silhouette_score
# Calculate silhouette score for k = 2 through 10
# Note: Silhouette requires at least 2 clusters, so we start at k=2
silhouette_scores = []
for k in range(2, 11):
    kmeans_obj = KMeans(n_clusters=k, algorithm="lloyd", random_state=1)
    kmeans_obj.fit(fixed_df.loc[:,["Rk","MP","PTS"]])
    # Calculate average silhouette score across all points
    silhouette_scores.append(
        silhouette_score(fixed_df.loc[:,["Rk","MP","PTS"]], kmeans_obj.labels_))
# Finding the best silhouette score
best_nc = silhouette_scores.index(max(silhouette_scores)) + 2
print(f"Optimal number of clusters by Silhouette Score: {best_nc}")

Optimal number of clusters by Silhouette Score: 2
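
For comparison, the sketch below runs the same silhouette search on synthetic data with three clearly separated groups, using a dictionary comprehension so each score stays paired with its `k`. The blob centers, spread, and `n_init=10` are assumptions chosen for the example. The silhouette score ranges from -1 to 1, and higher means points sit closer to their own cluster than to the nearest other cluster.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data: three well-separated blobs in 2 dimensions
rng = np.random.default_rng(0)
centers = np.array([[0.2, 0.2], [0.5, 0.5], [0.8, 0.8]])
X = np.vstack([rng.normal(c, 0.03, size=(30, 2)) for c in centers])

# Map each candidate k to its average silhouette score
scores = {
    k: silhouette_score(
        X, KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
    )
    for k in range(2, 7)
}

# The k with the highest score is the silhouette-preferred number of clusters
best_k = max(scores, key=scores.get)
print(best_k, scores)
```

Keeping the scores in a dictionary also makes it easy to plot score against `k` alongside the elbow curve.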


In [None]:
#Determine the ideal number of clusters using the elbow method and the silhouette coefficient
# Calculating within cluster sum of squares
wcss = []
for i in range(1, 11):
    kmeans_obj = KMeans(n_clusters=i, random_state=1).fit(fixed_df.loc[:,["Rk",'MP','PTS']])
    wcss.append(kmeans_obj.inertia_)
# Plot the Elbow Curve
# Look for the "elbow" - where adding more clusters doesn't help much
# The elbow indicates optimal k (balance between fit and complexity)
# After the elbow, WCSS decreases slowly, suggesting diminishing returns
elbow_data = pd.DataFrame({"k": range(1, 11), "wcss": wcss})
fig = px.line(elbow_data, x="k", y="wcss", title="Elbow Method")
fig.show()
# The best choice of clusters is still 2

In [18]:
#Use the recommended number of clusters (assuming it's different) to retrain your model and visualize the results
kmeans_obj = KMeans(n_clusters=2, random_state=1).fit(fixed_df.loc[:,['Rk','MP','PTS']])

In [None]:
#Use the model to select players for Mr. Rooney to consider
fig = px.scatter_3d(
    fixed_df, x="Rk", y="MP", z="PTS",
    color=kmeans_obj.labels_.astype(str),  # strings give discrete cluster colors
    title="Rank vs. Minutes Played vs. Points for NBA Players",
    hover_data=["Player", "2025-26"])
fig.show()
# The players I would pick would be separated into Bad, Good, and Could Work
# Good would be Keyonte George ($4.2 million), Toumani Camara ($2.2 million), and Payton Pritchard ($7.2 million)
# Bad would be Zach Collins ($18.0 million), Scoot Henderson ($10.7 million), and Brandon Clarke ($12.5 million)
# Could Work would be Chet Holmgren ($13.7 million), Brandin Podziemski ($3.7 million), and Dyson Daniels ($7.7 million)
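
One way to make the category cutoffs explicit, rather than picking players by eye from the plot, is to compare each player's salary with the median salary of his performance cluster. The sketch below is only an illustration of that idea: the players, cluster labels, salaries, and the 0.5 and 1.5 ratio cutoffs are all made up, and it assumes salary has already been converted to a number.

```python
import pandas as pd

# Hypothetical data: cluster 1 = higher performers, cluster 0 = lower performers
toy = pd.DataFrame({
    "Player": ["A", "B", "C", "D", "E", "F"],
    "cluster": [1, 1, 1, 0, 0, 0],
    "Salary": [4_000_000, 30_000_000, 45_000_000,
               2_000_000, 3_000_000, 18_000_000],
})

# Median salary of each player's own cluster, aligned row by row
cluster_median = toy.groupby("cluster")["Salary"].transform("median")

# Ratio below 1 means the player is paid less than a typical cluster peer
toy["salary_ratio"] = toy["Salary"] / cluster_median

# Assumed cutoffs: at most 0.5x the median is "underpaid", above 1.5x is "overpaid"
toy["flag"] = pd.cut(
    toy["salary_ratio"],
    bins=[0, 0.5, 1.5, float("inf")],
    labels=["underpaid", "typical", "overpaid"],
)
print(toy)
```

Under this scheme, good targets would be players flagged "underpaid" in the high-performing cluster, and players flagged "overpaid" in the low-performing cluster would be ones to avoid. The cutoffs are a judgment call and should be justified in the report.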


In [None]:
#Write up the results in a separate notebook with supporting visualizations and 
#an overview of how and why you made the choices you did. This should be at least 
#500 words and should be written for a non-technical audience.