# Clustering Lab

 
Based on the amazing work you did in the movie industry, you've been recruited to the NBA! You are the VP of Analytics, supporting a head scout, Mr. Rooney, for the worst team in the NBA (probably the Wizards). Mr. Rooney just heard about data science and thinks it can solve all the team's problems! He wants you to find players who are high performing but not highly paid, so the team can steal them and make the playoffs!

In this document you will work through a process similar to the one we did in class, using the NBA data files from the Canvas assignment and merging them together.

Details: 

- Determine a way to use clustering to estimate, based on performance, whether players are generally underpaid or overpaid.

- Then select players you believe would be best for your team and explain why. Do so in three categories: 
    * Examples that are not good choices (3 or 4) 
    * Several options that are good choices (3 or 4)
    * Several options that could work, assuming you can't get the players in the good category (3 or 4)

- You will decide the cutoffs for each category, so you should be able to explain why you chose them.

- Provide a well-commented, clean report of your findings in a separate notebook that can be presented to Mr. Rooney, keeping in mind he doesn't understand...anything. Include a rationale for the variables you included in the model, details on your approach, and an overview of the results with supporting visualizations.


Hints:

- Salary is the variable you are trying to understand 
- When interpreting the results, you might want to use graphs that include the variables most correlated with Salary
- You'll need to scale the variables before performing the clustering
- Be specific about why you selected the players that you did, more detail is better
- Use good coding practices: comment heavily, indent, don't use for loops unless totally necessary, and create modular sections that each align with some outcome. If necessary, create more than one script. List/load libraries at the top and don't include libraries that aren't used.
- Be careful with non-traditional characters in the players' names; certain graphs won't work when these characters are included.
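
One possible way to handle the last hint is to fold accented letters to plain ASCII before plotting. The sketch below uses only the standard-library `unicodedata` module; the helper name `to_ascii` and the sample names are illustrative assumptions, not part of the assignment data.

```python
import unicodedata

import pandas as pd


def to_ascii(name: str) -> str:
    """Return `name` with accents removed, keeping only ASCII characters."""
    # NFKD splits an accented letter into a base letter plus a combining mark
    decomposed = unicodedata.normalize("NFKD", name)
    # encoding to ASCII with errors="ignore" drops the combining marks
    return decomposed.encode("ascii", "ignore").decode("ascii")


# hypothetical sample names, for illustration only
names = pd.Series(["Nikola Jokić", "Luka Dončić", "Stephen Curry"])
clean_names = names.map(to_ascii)
```

Characters that have no ASCII decomposition are dropped entirely by this approach, so it is worth spot-checking the cleaned names before relying on them.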


In [55]:
# load libraries
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler
import matplotlib.pyplot as plt

In [56]:
# loading data 
# header=1 tells pandas to take the column names from the second line of the salary file

salary_data = pd.read_csv("2025_salaries.csv" , header=1)

# there are several odd characters, latin-1 encoding was needed
# needed to parse txt file with comma separation 

stats = pd.read_csv("nba_2025.txt" , sep = "," , encoding="latin-1")

In [57]:
# need to merge data frames
# will be doing an inner merge on the player column
merged_data = pd.merge(salary_data, stats, on = "Player")

In [58]:
# Drop variables that will not be needed or are duplicates
# there are duplicates of players when they play for multiple teams
# if a player played for multiple teams they have a row for each respective team
# they have a third row with a "2TM" value in the team column
# that row has their combined stats for that season
# I will use that row and drop the other two duplicates

# this pulls every row whose Player value appears more than once
# (keep=False includes the first occurrence) and puts those rows into a new DF

duplicates = merged_data[merged_data.duplicated(subset="Player", keep=False)]

# this removes all rows where Team is 2TM
# leaving us with a df of all unwanted rows

unwanted_duplicates = duplicates[duplicates["Team"] != "2TM" ] 

# this drops the unwanted rows from merged_data by their index labels
# we are able to do this because the index has not been reset since the merge,
# so the labels in unwanted_duplicates still match those in merged_data

unique_data = merged_data.drop(unwanted_duplicates.index)

# now that we have removed all duplicates we can reset the index
unique_data = unique_data.reset_index()

In [59]:
# the current salary column is called "2025-26"
# let's rename that "Salary"

unique_data = unique_data.rename(columns={"2025-26" : "Salary"})

In [60]:
# check for missing values 
unique_data.isna().sum()

# Since Salary is our target variable, we will drop rows where NA is present in this column
# The other columns with missing values [3P%, 2P%, FT%, Awards] will be dropped altogether,
# so there is no need to drop individual rows because of missing values in them.

index                  0
Player                 0
Tm                     0
Salary                 4
Rk                     0
Age                    0
Team                   0
Pos                    0
G                      0
GS                     0
MP                     0
FG                     0
FGA                    0
FG%                    0
3P                     0
3PA                    0
3P%                   19
2P                     0
2PA                    0
2P%                    1
eFG%                   0
FT                     0
FTA                    0
FT%                    3
ORB                    0
DRB                    0
TRB                    0
AST                    0
STL                    0
BLK                    0
TOV                    0
PF                     0
PTS                    0
Trp-Dbl                0
Awards               416
Player-additional      0
dtype: int64

In [61]:
# dropping rows where salary is NA

unique_data = unique_data.dropna(subset=["Salary"])

In [62]:
# looking at all column names to determine which columns I want to keep
# Columns kept should correlate to productivity and value

COLS = unique_data.columns
COLS

Index(['index', 'Player', 'Tm', 'Salary', 'Rk', 'Age', 'Team', 'Pos', 'G',
       'GS', 'MP', 'FG', 'FGA', 'FG%', '3P', '3PA', '3P%', '2P', '2PA', '2P%',
       'eFG%', 'FT', 'FTA', 'FT%', 'ORB', 'DRB', 'TRB', 'AST', 'STL', 'BLK',
       'TOV', 'PF', 'PTS', 'Trp-Dbl', 'Awards', 'Player-additional'],
      dtype='str')

In [63]:
# Columns kept will be ["Player" , "Salary" , "FG" , "TRB" , "AST" , "STL" , "BLK" ]
# because they are columns that track offensive and defensive production on the court during a game

keep_cols = ["Player" , "Salary" , "FG" , "TRB" , "AST" , "STL" , "BLK" ]
unique_data = unique_data[keep_cols]

In [80]:
# Let's check the dtypes of each column 
unique_data.info()


<class 'pandas.DataFrame'>
RangeIndex: 412 entries, 0 to 411
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Player  412 non-null    str    
 1   Salary  412 non-null    str    
 2   FG      412 non-null    float64
 3   TRB     412 non-null    float64
 4   AST     412 non-null    float64
 5   STL     412 non-null    float64
 6   BLK     412 non-null    float64
dtypes: float64(5), str(2)
memory usage: 22.7 KB


In [81]:
# All columns are of the correct dtype except for Salary 
# It should be a float but is a string
# let's turn it into a float so we can scale the data

# first we need to strip the values of the $ and comma

unique_data["Salary"] = (
    unique_data["Salary"]
    .str.replace("$" , "", regex = False)
    .str.replace("," , "" , regex = False)
)

# then we need to convert it from a string to a float

unique_data["Salary"] = unique_data["Salary"].astype(float)

In [None]:
# now we can scale all necessary columns with MinMaxScaler

cols_to_scale = ["FG" , "TRB", "AST", "STL", "BLK", "Salary"]

# create the scaler instance (MinMaxScaler rescales each column to the [0, 1] range)
scaler = MinMaxScaler()

unique_data[cols_to_scale] = scaler.fit_transform(unique_data[cols_to_scale])


In [64]:
#Run the clustering algo with your best guess for K
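
A minimal sketch of this step, assuming the clustering is run on the scaled performance columns only, so that Salary can be compared against the clusters afterwards. The data here is synthetic, and `demo`, `k = 3`, and `random_state=42` are illustrative choices rather than values taken from the lab.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# synthetic stand-in for the scaled performance columns (values in [0, 1])
rng = np.random.default_rng(0)
features = ["FG", "TRB", "AST", "STL", "BLK"]
demo = pd.DataFrame(rng.random((60, len(features))), columns=features)

# k = 3 is only a first guess; it gets revisited with the elbow method later
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)

# fit the model and store each row's cluster label next to its stats
demo["cluster"] = kmeans.fit_predict(demo[features])

# number of rows assigned to each cluster
cluster_sizes = demo["cluster"].value_counts().sort_index()
```

With the real data, `demo[features]` would be replaced by the scaled columns of `unique_data`.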

In [65]:
#View the results

In [66]:
#Create a visualization of the results with 2 or 3 variables that you think will best
#differentiate the clusters

In [67]:
#Evaluate the quality of the clustering using total variance explained and silhouette scores
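
As a rough sketch of both metrics: total variance explained can be computed as one minus the ratio of within-cluster sum of squares (`inertia_`) to the total sum of squares, and the silhouette score comes from `sklearn.metrics.silhouette_score`. The three synthetic blobs below are an assumption made so the example runs on its own.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# synthetic data: three well-separated blobs in two dimensions
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + rng.normal(scale=0.3, size=(30, 2)) for c in centers])

model = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

# total sum of squares: squared distance of every point to the overall mean
total_ss = ((X - X.mean(axis=0)) ** 2).sum()
# within-cluster sum of squares: squared distance of every point to its centroid
within_ss = model.inertia_
# share of the total variance accounted for by the cluster assignment
variance_explained = 1 - within_ss / total_ss

# mean silhouette coefficient, between -1 (poor) and 1 (well separated)
sil = silhouette_score(X, model.labels_)
```

Both numbers are close to 1 here only because the blobs were built to be well separated; real player data will usually score lower.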

In [68]:
#Determine the ideal number of clusters using the elbow method and the silhouette coefficient
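
One way to approach this, sketched on synthetic data: fit a model for each candidate k, record the inertia (for the elbow) and the silhouette score, then pick the k with the highest silhouette. The range 2 to 7 and the three-blob data are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# synthetic data: three well-separated blobs in two dimensions
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + rng.normal(scale=0.3, size=(30, 2)) for c in centers])

# candidate cluster counts (silhouette needs at least 2 clusters)
k_values = range(2, 8)

# one fitted model per k, built with a comprehension instead of an explicit loop
models = {k: KMeans(n_clusters=k, n_init=10, random_state=42).fit(X) for k in k_values}

# inertia per k: the values to look at for the elbow
inertia = {k: m.inertia_ for k, m in models.items()}
# mean silhouette per k: higher is better
silhouette = {k: silhouette_score(X, m.labels_) for k, m in models.items()}

# k with the highest silhouette score
best_k = max(silhouette, key=silhouette.get)
```

The `inertia` values can be passed to `plt.plot` against `k_values` to draw the elbow curve.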

In [69]:
#Visualize the results of the elbow method

In [70]:
#Use the recommended number of clusters (assuming it's different) to retrain your model and visualize the results

In [71]:
#Once again evaluate the quality of the clustering using total variance explained and silhouette scores

In [72]:
#Use the model to select players for Mr. Rooney to consider
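
One possible selection rule, offered as a sketch rather than the required method: compare each player's salary with the median salary of their performance cluster and rank by the gap. The players, cluster labels, and salaries below are made up for illustration, and `salary_gap` is a hypothetical column name.

```python
import pandas as pd

# hypothetical players, cluster labels and salaries (in millions), for illustration only
demo = pd.DataFrame({
    "Player": ["A", "B", "C", "D", "E", "F"],
    "cluster": [0, 0, 0, 1, 1, 1],
    "Salary": [30.0, 28.0, 5.0, 4.0, 3.0, 12.0],
})

# median salary of the cluster each player belongs to
demo["cluster_median"] = demo.groupby("cluster")["Salary"].transform("median")

# negative gap: the player is paid less than similar performers
demo["salary_gap"] = demo["Salary"] - demo["cluster_median"]

# the two players with the most negative gap
underpaid = demo.sort_values("salary_gap").head(2)["Player"].tolist()
```

In practice the ranking would probably be restricted to the cluster or clusters with the strongest performance profile, since being cheap within a low-performing cluster does not make a player a good target.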

In [73]:
#Write up the results in a separate notebook with supporting visualizations and 
#an overview of how and why you made the choices you did. This should be at least 
#500 words and should be written for a non-technical audience.