# Positionless Basketball

-----------

This project has several goals:

* Identify features to feed into PCA
* Decompose the data with PCA
* Identify similar observations with K-Means clustering
* Measure differences between groups (in both low- and high-dimensional space)

In [None]:
# Imports
import warnings
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
warnings.filterwarnings('ignore')

# Define custom palette to use for visualization
sns.set_style('white')

# Cluster colors
my_palette = ['#26547C', '#EF476F', '#FFD166', '#06D6A0', '#2D1E2F']

# Position colors
position_palette = ['#292F36', '#4ECDC4', '#E6E6E6', '#FF6B6B', '#FFE66D']

--------

## Data Cleaning

In this section we'll read in the NBA data we scraped, clean it up, and identify any irregularities or multicollinearity issues.

In [None]:
# Read in data from 2020-2021 NBA season
nba = pd.read_csv('./nba-stats-2021.csv').iloc[:, 2:]

# Isolate categorical variables (Name, Team, Position)
positions_only = nba.iloc[:, :4]

# Create an aggregate row per player (for players that were traded in-season)
# numeric_only=True keeps recent pandas versions from raising on text columns
nba = nba.groupby('Player').mean(numeric_only=True).sort_values(by='PTS', ascending=False).reset_index()

nba.head()
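
To make the aggregation step concrete, here is a minimal sketch on a made-up table (the player names and numbers are invented for illustration, not taken from the dataset). It assumes a traded player appears once per team plus a `TOT` summary row, as the `Tm != 'TOT'` filter in the next cell suggests. Note that `.mean()` here is an unweighted average across all of a player's rows.

```python
import pandas as pd

# Hypothetical rows: one traded player (a 'TOT' row plus two team rows)
# and one player who stayed put all season
toy = pd.DataFrame({
    'Player': ['A. Guard', 'A. Guard', 'A. Guard', 'B. Center'],
    'Tm':     ['TOT', 'HOU', 'BRK', 'UTA'],
    'PTS':    [20.0, 24.0, 16.0, 12.0],
    'MP':     [34.0, 36.0, 32.0, 28.0],
})

# numeric_only=True skips text columns such as 'Tm' when averaging
per_player = (toy.groupby('Player')
                 .mean(numeric_only=True)
                 .sort_values(by='PTS', ascending=False)
                 .reset_index())

print(per_player)
```

Each player ends up with exactly one row, which is what the clustering step later relies on.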

In [None]:
# Clean up extra rows in positions_only dataframe
po = positions_only[positions_only['Tm'] != 'TOT'].groupby('Player').first().reset_index()

# Merge with nba dataframe
nba = nba.merge(po, on='Player', how='left').drop(columns=['Age_x_y'])

# Use list comprehension to clean up extra characters
nba.columns = [x.split('_')[0] for x in nba.columns]

nba.head()

We have a clean dataset with tidy columns!

Let's reduce some potential noise by filtering out players who don't see the floor very often.

In [None]:
# Define 10th percentile for minutes played
cutoff = np.quantile(nba['MP'], 0.10)

# Remove players below the 10th percentile of minutes played
# (about 6.58 minutes for this dataset) to reduce noise
nba = nba[nba['MP'] >= cutoff].reset_index(drop=True)
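
As a sketch of how the percentile filter behaves, the toy example below (hypothetical minutes values) computes the cutoff with `np.quantile` and reuses that computed value in the filter, so the threshold follows the data rather than a fixed number.

```python
import numpy as np
import pandas as pd

# Hypothetical minutes-per-game values for eleven players
toy = pd.DataFrame({'MP': [2.0, 5.0, 8.0, 12.0, 15.0, 20.0,
                           24.0, 28.0, 30.0, 33.0, 36.0]})

# 10th percentile of minutes played (linear interpolation by default)
cutoff = np.quantile(toy['MP'], 0.10)

# Keep only players at or above the cutoff
kept = toy[toy['MP'] >= cutoff].reset_index(drop=True)

print(f'cutoff = {cutoff:.2f}, kept {len(kept)} of {len(toy)} players')
```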

In [None]:
# Plot missing data
plt.figure(figsize=(15,10))
sns.heatmap(nba.isnull())
plt.title('NBA Stats Missing Data')
plt.show()

With no missing data to report, we can keep moving forward.

Next, we want to spot any variables that are highly collinear so we can remove them before clustering, which should boost our signal a bit. First, we'll isolate the quantitative variables in this dataset.

In [None]:
# Isolate quantitative features
quant_only = nba.select_dtypes(include=np.number)

quant_only.head()

In [None]:
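# Custom heatmap plotting function
def plot_heatmap(df):
    """Plot the lower triangle of the correlation matrix for df."""
    corr = df.corr()

    # Boolean mask hides the redundant upper triangle (and the diagonal)
    mask = np.triu(np.ones_like(corr, dtype=bool))

    plt.figure(figsize=(15,10))
    sns.heatmap(corr, mask=mask)
    plt.title('NBA Stats Correlation Matrix')
    plt.show()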

In [None]:
plot_heatmap(quant_only)

Predictably, a few variables are very collinear (e.g., field goals made and field goals attempted). We can drop these without issue.

In [None]:
# Highly correlated columns
raw_stats_to_drop = ['FG', 'FGA', '2P', '2PA', '3P', '3PA', 'FT', 'FTA', 'AST']

# Drop factors above
quant_only = quant_only.drop(columns=raw_stats_to_drop)

# Plot a new correlation matrix
plot_heatmap(quant_only)
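
Reading collinear pairs off a heatmap by eye can be error-prone, so here is a minimal sketch of listing them programmatically. The helper name `highly_correlated_pairs`, the 0.9 threshold, and the synthetic data are all assumptions made for illustration; they are not part of the original analysis.

```python
import numpy as np
import pandas as pd

def highly_correlated_pairs(df, threshold=0.9):
    """Return (col_a, col_b, r) for every column pair with |r| > threshold."""
    corr = df.corr()
    # Keep only the strict upper triangle so each pair appears once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    pairs = pairs[pairs.abs() > threshold]
    return [(a, b, round(r, 3)) for (a, b), r in pairs.items()]

# Synthetic stats: 'FG' closely tracks 'FGA', while 'BLK' is unrelated
rng = np.random.default_rng(0)
fga = rng.uniform(2, 20, size=200)
toy = pd.DataFrame({
    'FGA': fga,
    'FG': fga * 0.45 + rng.normal(0, 0.3, size=200),
    'BLK': rng.uniform(0, 3, size=200),
})

pairs = highly_correlated_pairs(toy)
print(pairs)
```

Only the attempts/makes pair is flagged here; which member of a flagged pair to drop remains a judgment call.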

-----------

## Decomposition

In this section, we'll scale all of our quantitative data, decompose it using PCA, and identify similar observations (i.e., players).

In [None]:
# Imports
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

quant_only.head()

In [None]:
# Instantiate and fit StandardScaler object to data
scaler = StandardScaler()
scaled_data = scaler.fit_transform(quant_only)

scaled_data
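
Scaling matters here because PCA is driven by variance: a stat measured in tens (such as points) would otherwise dominate one measured in fractions (such as a shooting percentage). The sketch below, on made-up numbers, checks that `StandardScaler` matches the manual z-score formula using the population standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical two-feature data on very different scales
X = np.array([[10.0, 0.40],
              [20.0, 0.50],
              [30.0, 0.60]])

scaled = StandardScaler().fit_transform(X)

# Manual z-score: subtract the column mean, divide by the column std (ddof=0)
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```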

In [None]:
# Instantiate PCA object with two components (for visualization purposes)
pca = PCA(n_components = 2)

# Fit to data
low_dimensional_data = pca.fit_transform(scaled_data)

# Check shape
low_dimensional_data.shape

In [None]:
plt.figure(figsize=(12, 10))

# Recent seaborn versions require x and y as keyword arguments
sns.scatterplot(x=low_dimensional_data[:, 0], y=low_dimensional_data[:, 1],
                alpha=0.75, color='#002642', s=75)

plt.title('Low-Dimensional NBA Stats')
plt.xlabel('Component #1')
plt.ylabel('Component #2')
plt.show()

We see spread on both axes, which suggests that each component is explaining some of the total variance.
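
The visual impression can be backed up with scikit-learn's `explained_variance_ratio_` attribute, which reports the share of total variance captured by each component. This is a minimal sketch on synthetic data (six invented features built from two latent factors), so the numbers it prints say nothing about the NBA dataset itself.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: two latent factors, each observed through three noisy features
rng = np.random.default_rng(42)
base = rng.normal(size=(300, 2))
X = np.hstack([
    base[:, [0]] + 0.1 * rng.normal(size=(300, 3)),
    base[:, [1]] + 0.1 * rng.normal(size=(300, 3)),
])

pca = PCA(n_components=2).fit(StandardScaler().fit_transform(X))
ratios = pca.explained_variance_ratio_

for i, ratio in enumerate(ratios, start=1):
    print(f'Component #{i}: {ratio:.1%} of total variance')
print(f'Combined: {ratios.sum():.1%}')
```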

Next, we'll identify the optimal number of clusters for our dataset using the elbow method.

In [None]:
test_range = range(1,20)
sum_of_squares = []

for val in test_range:
    
    # Instantiate and fit a KMeans object with `val` clusters
    # (fixed random_state keeps the elbow plot reproducible)
    temp = KMeans(n_clusters=val, random_state=101)
    temp.fit(low_dimensional_data)
    
    # Add inertia to container
    sum_of_squares.append(temp.inertia_)
    
# Plot results
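plt.figure(figsize=(12,8))
# Recent seaborn versions require x and y as keyword arguments
sns.scatterplot(x=list(test_range), y=sum_of_squares, s=75)
plt.xticks(test_range)
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia (within-cluster sum of squares)')
plt.show()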

We see a change in model fit around **k = 5**, so we'll opt for five clusters in this analysis.
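
One way to make the elbow less subjective is to print how much inertia falls with each added cluster. The sketch below uses synthetic data (three well-separated blobs, invented for illustration), so its elbow sits at k = 3 rather than at the value chosen for the NBA data.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2-D data: three well-separated blobs of 100 points each
rng = np.random.default_rng(101)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
points = np.vstack([c + rng.normal(size=(100, 2)) for c in centers])

k_values = list(range(1, 7))
inertias = []
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=101).fit(points)
    inertias.append(km.inertia_)

# Relative improvement from k-1 to k; the elbow is where this drops sharply
for k, prev, curr in zip(k_values[1:], inertias[:-1], inertias[1:]):
    print(f'k={k}: inertia falls by {(prev - curr) / prev:.1%}')
```

Large drops up to the true number of groups, followed by small ones, are the pattern the elbow plot shows graphically.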

In [None]:
# Instantiate and fit KMeans object to data
model = KMeans(n_clusters=5, random_state=101)
model.fit(low_dimensional_data)

# Predict based on the KMeans object
predicted_values = model.predict(low_dimensional_data)

# Plot results
plt.figure(figsize=(12, 10))
# Recent seaborn versions require x and y as keyword arguments
sns.scatterplot(x=low_dimensional_data[:, 0], y=low_dimensional_data[:, 1],
                hue=model.labels_, alpha=0.75, s=75, palette=my_palette)
plt.title('K-Means Clusters in Low-Dimensional Space')
plt.xlabel('Component #1')
plt.ylabel('Component #2')
plt.show()

-------

## On the Court

In this section, we'll project our predicted cluster values up to the high-dimensional dataset to see how clusters relate to performance on the court.

In [None]:
# Project cluster labels to the high-dimensional data
nba['cluster'] = predicted_values

nba

In [None]:
plt.figure(figsize=(15, 8))

sns.histplot(data=nba, x='PTS', hue='cluster', palette=my_palette,
            edgecolor=".1", linewidth=1)

plt.xlabel('Points Per Game')
plt.title('Distribution of Points Per Cluster')
plt.show()

In [None]:
plt.figure(figsize=(12,10))
sns.scatterplot(data=nba, x="VORP", y="PER", 
                hue="cluster", palette=my_palette, s=75, alpha=0.75)
plt.show()

**Cluster 2** appears to contain the best players in the league: players in this cluster score more points, are more efficient (higher PER), and are harder to replace (higher VORP). Let's see how each cluster breaks down in terms of its positional makeup (the crux of this project, in fact!).

In [None]:
for cluster in range(0,5):
    
    temp = nba[nba['cluster'] == cluster].reset_index()
    
    sns.catplot(data=temp, x="Pos", kind="count", 
                palette=position_palette, order=['PG', 'SG', 'SF', 'PF', 'C'], height=8)
    plt.title(f"Cluster {cluster} Position Distribution")
    plt.show()
    print('\n\n')

Cluster 2, our best players, has a decent spread across positions. In fact, almost every cluster has a moderate spread of positions. This speaks to our larger point: basketball greatness and listed position have little relation to one another.
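
Because the clusters differ in size, raw counts can be hard to compare across plots. A row-normalized crosstab gives each cluster's positional makeup as shares that sum to 1. The table below is a sketch on invented labels, not the actual cluster assignments.

```python
import pandas as pd

# Hypothetical cluster labels and listed positions for eight players
toy = pd.DataFrame({
    'cluster': [0, 0, 0, 1, 1, 2, 2, 2],
    'Pos': ['PG', 'C', 'SF', 'PG', 'PG', 'C', 'PF', 'SG'],
})

# normalize='index' converts each row of counts into shares of that cluster
share = pd.crosstab(toy['cluster'], toy['Pos'], normalize='index')

# Order the columns the same way as the position plots
share = share.reindex(columns=['PG', 'SG', 'SF', 'PF', 'C'], fill_value=0.0)

print(share.round(2))
```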

Lastly, we'll dig into a few stats to see how they differ between clusters.

In [None]:
for cluster in range(0,5):
    
    temp = nba[nba['cluster'] == cluster].reset_index()
    
    print(f'---- Cluster {cluster}\n')
    
    print(f'Points:\t\t\t{temp["PTS"].mean()}')
    print(f'Effective FG%:\t\t{temp["eFG%"].mean()}')
    print(f'Steals:\t\t\t{temp["STL"].mean()}')
    print(f'Blocks:\t\t\t{temp["BLK"].mean()}')
    print(f'Assists:\t\t{temp["AST"].mean()}')
    print(f'Total rebounds:\t\t{temp["TRB"].mean()}')
    print('\n\n')
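
The same per-cluster averages can also be collected into a single table with `groupby`, which makes clusters easier to compare side by side. This is a sketch on invented numbers; the column names mirror the ones printed above.

```python
import pandas as pd

# Hypothetical per-player stats with cluster labels attached
toy = pd.DataFrame({
    'cluster': [0, 0, 1, 1, 2],
    'PTS': [12.0, 14.0, 4.0, 6.0, 27.0],
    'AST': [3.0, 5.0, 1.0, 1.0, 8.0],
    'eFG%': [0.54, 0.52, 0.48, 0.50, 0.58],
})

# One row per cluster, one column per stat
summary = toy.groupby('cluster')[['PTS', 'AST', 'eFG%']].mean().round(3)

print(summary)
```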

While we've already identified Cluster 2 as our best players, Cluster 0 players seem to be the best role players. They score consistently, dish out assists, and post a respectable 53.6% effective field goal percentage.

----------

## Playoffs or Bust

In this final section, we'll see how playoff and non-playoff teams are constructed differently.

In [None]:
# Top and bottom two teams from each conference
playoff_teams = ['UTA', 'PHO', 'PHI', 'BRK']
lottery_teams = ['HOU', 'OKC', 'DET', 'ORL']

all_teams = playoff_teams + lottery_teams

In [None]:
# Isolate players from the above teams
playoff_or_bust = nba[nba['Tm'].isin(all_teams)].reset_index(drop=True)

playoff_or_bust.head()

In [None]:
# Binarize playoff status (1 = playoff team, 0 = lottery team)
playoff_or_bust['playoff-status'] = playoff_or_bust['Tm'].isin(playoff_teams).astype(int)

# Cast cluster labels to strings so they plot as categories
playoff_or_bust['cluster'] = playoff_or_bust['cluster'].astype(str)

In [None]:
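# Cluster counts for playoff teams (cluster colors, not position colors)
sns.catplot(data=playoff_or_bust[playoff_or_bust['playoff-status'] == 1],
            x='cluster', kind='count', height=8, order=['0','1','2','3','4'],
            palette=my_palette)
plt.title('Cluster Counts: Playoff Teams')
plt.show()

In [None]:
# Cluster counts for lottery teams
sns.catplot(data=playoff_or_bust[playoff_or_bust['playoff-status'] == 0],
            x='cluster', kind='count', height=8, order=['0','1','2','3','4'],
            palette=my_palette)
plt.title('Cluster Counts: Lottery Teams')
plt.show()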

Predictably, teams that didn't make the playoffs had noticeably fewer Cluster 2 players than teams that did. Meanwhile, the total number of Cluster 0 players (our steadfast #2 players) is greater in the non-playoff pool. Perhaps teams that miss the playoffs have a #2 player slotted in as a #1.

---------