In [None]:
import numpy as np
import random
# Set seed for reproducibility
np.random.seed(42)  # Set seed for NumPy
random.seed(42) # Set seed for random module

## Introduction

In this week's tutorial we start with unsupervised learning. Until now, our goal was always to approximate the relationship between an output variable and multiple inputs, where the output as well as the inputs could be numerical or categorical.  

In unsupervised learning there is no direct target variable, meaning we are not approximating a prediction function but are rather searching for interesting patterns within the given data.  
One such unsupervised learning method is __Clustering__. In clustering we want to group (cluster) homogeneous instances while maximizing the heterogeneity of different clusters.

For demonstration we will use a data set of [Basketball player statistics](https://www.kaggle.com/jacobbaruch/basketball-players-stats-per-season-49-leagues) which includes information about basketball players from all around the world.

In [None]:
import pandas as pd
# Loading the data from a csv file
data = pd.read_csv("https://raw.githubusercontent.com/kbrennig/MODS_WS24_25/refs/heads/main/data/players_stats_by_season_full_details.csv")

## Explore Data
First of all, let's have a look at the raw data. It appears to be sorted by league and season, and for each player there is basic information such as birthday and height. Additionally, there are a number of basketball-specific metrics, which are briefly defined below:  

- 'GP': Games Played
- 'MIN': Minutes Played
- 'FGM': Field Goals Made
- 'FGA': Field Goal Attempts
- '3PM': Three-Pointers Made
- '3PA': Three-Point Attempts
- 'FTM': Free Throws Made
- 'FTA': Free Throw Attempts
- 'TOV': Turnovers
- 'PF': Personal Fouls
- 'ORB': Offensive Rebounds
- 'DRB': Defensive Rebounds
- 'REB': Rebounds
- 'AST': Assists
- 'STL': Steals
- 'BLK': Blocks
- 'PTS': Points

*Run the code below.*

In [None]:
print(data.describe())

## Transform Data

### Filter the data and standardize columns

When we look at the data, we quickly see that it contains multiple records for one player. This most likely results from the fact that the data contains multiple seasons. It could be interesting to compare a player to earlier versions of himself, but for clustering this is a disadvantage, so we filter the data for a single season. Additionally, we restrict the league to the NBA.  
The resulting data therefore contains information about players in the NBA in the 2018 - 2019 season.
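
The effect of this filter can be illustrated on a tiny example. The sketch below uses a hypothetical toy table (the player names and numbers are made up and are not taken from the data set) and only assumes the column names `Player`, `League` and `Season` used in this tutorial.

```python
import pandas as pd

# Hypothetical toy data: "Player A" appears in two seasons, "Player B" in one
toy = pd.DataFrame({
    "Player": ["Player A", "Player A", "Player B"],
    "League": ["NBA", "NBA", "NBA"],
    "Season": ["2017 - 2018", "2018 - 2019", "2018 - 2019"],
    "PTS": [1200, 1500, 900],
})

# Count the rows per player before filtering
rows_per_player = toy["Player"].value_counts()

# Keep one league and one season with a boolean mask
one_season = toy[(toy["League"] == "NBA") & (toy["Season"] == "2018 - 2019")]
rows_per_player_after = one_season["Player"].value_counts()

print(rows_per_player.max(), rows_per_player_after.max())
```

The same `value_counts()` check could be run on the real data after filtering to see whether any player still appears more than once.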

We also drop some columns since we will focus on a few key performance indicators.  

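Lastly, we standardize the remaining columns to zero mean and unit variance. This is necessary for clustering when the variables have different units, because the clustering relies on distances and variables with large values would otherwise dominate.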
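
As a minimal sketch of what standardization does, the example below computes z-scores by hand and compares them with `StandardScaler`. The toy numbers are hypothetical and only chosen so that the two columns have very different scales.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: first column on a large scale, second on a small scale
X = np.array([[1500.0, 10.0],
              [900.0, 40.0],
              [300.0, 25.0]])

# z-score by hand: subtract the column mean, divide by the column standard deviation
z_manual = (X - X.mean(axis=0)) / X.std(axis=0)

# StandardScaler performs the same transformation
z_sklearn = StandardScaler().fit_transform(X)

print(np.allclose(z_manual, z_sklearn))
print(z_sklearn.mean(axis=0).round(6), z_sklearn.std(axis=0).round(6))
```

Note that `StandardScaler` divides by the population standard deviation, while pandas' `describe()` reports the sample standard deviation, so the `std` row of a summary of scaled data can be slightly above 1.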

*Run the code below.*


In [None]:
# Filter the data for NBA 2018-2019 season
dataset_for_clustering = data[(data['League'] == 'NBA') & (data['Season'] == '2018 - 2019')]

# Drop irrelevant columns
dataset_for_clustering = dataset_for_clustering.drop(columns=['high_school', 'nationality', 'weight', 'height', 'height_cm',
                                                              'weight_kg', 'birth_date', 'birth_month', 'birth_year',
                                                              'League', 'Season', 'Stage', 'FGA', '3PA', 'FTA', 'ORB',
                                                              'DRB', 'MIN'])

# Separate players and teams
players_and_teams = dataset_for_clustering[['Player', 'Team']]
dataset_for_clustering = dataset_for_clustering.drop(columns=['Player', 'Team'])

# Standardize the data
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
dataset_for_clustering_scaled = pd.DataFrame(scaler.fit_transform(dataset_for_clustering), columns=dataset_for_clustering.columns)

# Show summary of transformed data
print(dataset_for_clustering_scaled.describe())

## K-means Clustering

In the lecture two clustering approaches were explained. We will start with __K-means clustering__.
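
To recall the idea, here is a minimal NumPy sketch of the basic K-means loop (random initial centroids, assignment step, update step) on hypothetical toy data. The function `kmeans_sketch` is only an illustration and not the implementation used by scikit-learn, whose `KMeans` uses the k-means++ initialization by default.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Illustrative K-means loop; not scikit-learn's implementation."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its points
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids no longer move
        centroids = new_centroids
    return labels, centroids

# Hypothetical toy data: two well-separated groups of 20 points each
rng = np.random.default_rng(1)
X_toy = np.vstack([rng.normal(0, 0.5, size=(20, 2)),
                   rng.normal(5, 0.5, size=(20, 2))])
labels, centroids = kmeans_sketch(X_toy, k=2)
print(labels)
```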




### Perform clustering

*Run the code below.*

In [None]:
from sklearn.cluster import KMeans

# Perform K-means clustering with 4 clusters
kmeans = KMeans(n_clusters=4, n_init='auto', max_iter=300, random_state=42)
kmeans_cluster_model = kmeans.fit(dataset_for_clustering_scaled)

### Extract results

*Run the code below.*

In [None]:
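# Extract the cluster label assigned to each player
km_clusters = kmeans.labels_
# Extract centroids (transposed: one row per feature, one column per cluster)
centroids = kmeans.cluster_centers_.T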

### Calculate silhouette scores

Evaluating the performance of an unsupervised learning method differs from evaluating a supervised method.  
In supervised learning we try to learn a function, and we are given the actual outputs of this (unknown) function. We can therefore simply compare the outputs of our learned function to the actual outputs and assess the model's performance fairly intuitively.

In clustering the goal is to cluster similar instances together and maximize the distance between clusters. One way to evaluate this kind of procedure is to look at the __distances__ and __compare__ for example the __distances__ an instance has to other instances in its __own cluster__ and to instances of __other clusters__.

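The __silhouette score__ does exactly that: for each instance it compares the mean distance to the instances in its own cluster with the mean distance to the instances in the nearest other cluster. It ranges from -1 to 1, and higher values indicate more compact, better separated clusters.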
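
As a minimal sketch, the silhouette value of an instance can be computed by hand as `(b - a) / max(a, b)`, where `a` is the mean distance to the other instances in its own cluster and `b` is the mean distance to the instances of the nearest other cluster. The helper `silhouette_by_hand` and the four toy points are hypothetical and only serve to check the idea against scikit-learn.

```python
import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

# Hypothetical toy data: two clusters with two points each
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.0]])
labels = np.array([0, 0, 1, 1])

def silhouette_by_hand(X, labels, i):
    """Silhouette value of instance i (illustrative helper)."""
    dists = np.linalg.norm(X - X[i], axis=1)
    same = labels == labels[i]
    same[i] = False  # exclude the instance itself
    a = dists[same].mean()  # mean distance within the own cluster
    # mean distance to the nearest other cluster
    b = min(dists[labels == c].mean() for c in np.unique(labels) if c != labels[i])
    return (b - a) / max(a, b)

manual = np.array([silhouette_by_hand(X, labels, i) for i in range(len(X))])
print(manual)
print(np.allclose(manual, silhouette_samples(X, labels)))
```

The mean of these per-instance values is what `silhouette_score` reports.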

*Run the code below.*

In [None]:
from sklearn.metrics import silhouette_score

# Calculate silhouette score
silhouette_score_avg_kmeans = silhouette_score(dataset_for_clustering_scaled, km_clusters)
print(f"Mean Silhouette Score: {silhouette_score_avg_kmeans}")

### Find the optimal number of clusters

The number of clusters is a __hyperparameter__ that has to be explored. For example, we can simply repeat the clustering for different numbers of clusters and compare the respective silhouette scores to determine which number yields the best clustering.
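
One possible way to make this comparison explicit is to store the scores in a dictionary and pick the number of clusters with the highest score. The sketch below does this on synthetic data generated with `make_blobs` (an assumption for illustration, not the basketball data).

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic toy data with three compact groups
X_toy, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.5, random_state=42)

# Silhouette score for each candidate number of clusters
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_toy)
    scores[k] = silhouette_score(X_toy, labels)

# Number of clusters with the highest silhouette score
best_k = max(scores, key=scores.get)
print(scores)
print(best_k)
```

The highest silhouette score is only one criterion. The resulting clusters should still be checked for whether they can be interpreted.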

*Run the code below.*

In [None]:
# Iterate over different numbers of clusters and calculate silhouette score
for i in range(2, 16):
    kmeans = KMeans(n_clusters=i, n_init='auto', max_iter=300, random_state=42)
    kmeans_cluster_model = kmeans.fit(dataset_for_clustering_scaled)
    km_clusters = kmeans.labels_

    silhouette_score_avg_kmeans = silhouette_score(dataset_for_clustering_scaled, km_clusters)
    print(f"Number of clusters: {i} - Silhouette score: {silhouette_score_avg_kmeans}")

### Visualize Clustering Results
Let's visualize the clustering results using a bar chart of the cluster centroids.

The code below depicts the resulting cluster centroids for two clusters (optimal number of clusters determined in previous step).
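
Because the clustering runs on standardized data, the centroids are expressed in standard deviations from the mean. For interpretation it can help to convert them back to the original units with `inverse_transform`. The sketch below shows this on a hypothetical toy table. The names `toy_scaler` and `kmeans_toy` are chosen here so they do not overwrite the notebook's variables.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical toy statistics for four players
toy = pd.DataFrame({"PTS": [1500, 1400, 300, 250],
                    "AST": [400, 380, 60, 50]})

# Standardize, then cluster on the standardized values
toy_scaler = StandardScaler()
toy_scaled = toy_scaler.fit_transform(toy)
kmeans_toy = KMeans(n_clusters=2, n_init=10, random_state=42).fit(toy_scaled)

# Convert the centroids back to the original units
centroids_original = pd.DataFrame(
    toy_scaler.inverse_transform(kmeans_toy.cluster_centers_),
    columns=toy.columns,
)
print(centroids_original)
```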

+ *How would you interpret the clustering results?*

*Run the code below.*

In [None]:
import plotly.graph_objects as go

# recalculate the model for two clusters
kmeans = KMeans(n_clusters=2, n_init='auto', max_iter=300, random_state=42)
kmeans_cluster_model = kmeans.fit(dataset_for_clustering_scaled)
km_clusters = kmeans.labels_
centroids = kmeans.cluster_centers_.T

# Plot the centroids for the clusters
centroids_df = pd.DataFrame(centroids.T, columns=dataset_for_clustering_scaled.columns)
clusters = [1, 2]

fig = go.Figure()
for column in centroids_df.columns:
    fig.add_trace(go.Bar(x=clusters, y=centroids_df[column], name=column))

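# The centroids are computed on standardized data, so the y-axis shows z-scores, not counts
fig.update_layout(xaxis_title='Cluster', yaxis_title='Standardized value (z-score)', barmode='group', title='K-means Clustering Results')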
fig.show()

## Hierarchical Clustering
The second method presented in the lecture is **Hierarchical clustering**.
In **K-means clustering** we start from randomly chosen centroids and try to cluster our data around these centroids.

Hierarchical clustering works differently. Here we start by computing the **pairwise distances** (and later **intercluster distances**) for all instances and then iteratively merge the instances (and later clusters) that are closest to each other into one cluster.
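
The merge order can be read directly from the linkage matrix that SciPy returns. The sketch below uses four hypothetical one-dimensional points, so that the merges are easy to follow by hand.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

# Hypothetical toy data: four points on a line
points = np.array([[0.0], [1.0], [5.0], [7.0]])

# Pairwise distances, then agglomerative clustering with complete linkage
Z = linkage(pdist(points, metric="euclidean"), method="complete")

# Each row: [cluster a, cluster b, merge distance, size of the new cluster]
print(Z)
# Merge distances: 1 (points 0 and 1), 2 (points 2 and 3), 7 (the two pairs)
```

With complete linkage the distance between two clusters is the largest distance between any two of their members, which is why the last merge happens at 7 (the distance between 0 and 7).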

### Perform clustering

*Run the code below.*

In [None]:
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage

# Calculate pairwise distances using Euclidean distance
distances = pdist(dataset_for_clustering_scaled, metric='euclidean')

# Perform hierarchical clustering using complete linkage
hcluster_model = linkage(distances, method='complete')

### Plot Dendrogram
Apart from the clustering itself, hierarchical clustering also produces a **dendrogram** as a by-product. The dendrogram visualizes the order in which clusters are merged. Although it can be cluttered for many instances, its top part often reveals interesting insights.
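
If the full dendrogram is too cluttered, one option is to show only its top part with the `truncate_mode` parameter of SciPy's `dendrogram`. The sketch below uses random toy data and `no_plot=True`, so it only computes the layout and compares the number of leaves. Removing `no_plot=True` would draw the plot.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

# Hypothetical toy data: 50 random instances with 3 features
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 3))

# linkage also accepts raw observations (Euclidean distance by default)
Z = linkage(X_toy, method="complete")

# Full dendrogram: one leaf per instance
full = dendrogram(Z, no_plot=True)

# Truncated dendrogram: lower merges are contracted into single leaves
top = dendrogram(Z, truncate_mode="lastp", p=10, no_plot=True)

print(len(full["ivl"]), len(top["ivl"]))
```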

*Run the code below.*


In [None]:
from scipy.cluster.hierarchy import dendrogram
import matplotlib.pyplot as plt

# Plot dendrogram
plt.figure(figsize=(20, 7))
dendrogram(
    hcluster_model, 
    labels=players_and_teams['Player'].values, 
    leaf_rotation=90, 
    leaf_font_size=5
)
plt.title('Hierarchical Clustering Dendrogram')
plt.show()

### Cut Dendrogram
Once the hierarchical clustering is finished, we can extract clusterings for different numbers of clusters by simply cutting the dendrogram at the appropriate height.
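
As a side note, SciPy offers a second function for extracting flat clusters, `fcluster`. The sketch below, on hypothetical toy data, suggests that `cut_tree` and `fcluster` with `criterion="maxclust"` produce the same grouping here. Only the label numbering differs (starting at 0 and at 1, respectively).

```python
import numpy as np
from scipy.cluster.hierarchy import cut_tree, fcluster, linkage
from sklearn.metrics import adjusted_rand_score

# Hypothetical toy data: two well-separated groups of 15 points each
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(0, 1, size=(15, 2)),
                   rng.normal(8, 1, size=(15, 2))])
Z = linkage(X_toy, method="complete")

# Two ways to extract two flat clusters from the same tree
labels_cut = cut_tree(Z, n_clusters=2).flatten()    # labels start at 0
labels_fc = fcluster(Z, t=2, criterion="maxclust")  # labels start at 1

# An adjusted Rand index of 1.0 means the two groupings are identical
print(adjusted_rand_score(labels_cut, labels_fc))
```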

*Run the code below.*

In [None]:
from scipy.cluster.hierarchy import cut_tree

# Cut dendrogram to form clusters
h_clusters = cut_tree(hcluster_model, n_clusters=2).flatten()


### Plot Dendrogram for two clusters

To visualize the clustering, we add a horizontal line at the threshold where the dendrogram is cut. This line highlights the level at which the clusters are formed, with branches below it automatically colored to indicate the distinct clusters.
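
The relation between the cut height and the number of clusters can be checked on a small example: every merge reduces the number of clusters by one, so undoing the top `k - 1` merges leaves `k` clusters. The sketch below uses five hypothetical one-dimensional points and assumes that all merge heights are distinct.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Hypothetical toy data: five points on a line
points = np.array([[0.0], [1.0], [5.0], [7.0], [20.0]])
Z = linkage(points, method="complete")
heights = Z[:, 2]  # merge heights in increasing order

clusters_at_cut = {}
for k in (2, 3):
    # Height of the (k - 1)-th merge counted from the top
    threshold = heights[-(k - 1)]
    # Merges at or above the threshold are undone, each adds one cluster
    clusters_at_cut[k] = int(np.sum(heights >= threshold)) + 1

print(heights)
print(clusters_at_cut)
```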

*Run the code below.*

In [None]:
# Number of clusters to highlight in the dendrogram
num_clusters = 2

# Find the distance threshold for the desired number of clusters:
# undoing the top (num_clusters - 1) merges leaves num_clusters clusters
color_threshold = hcluster_model[-(num_clusters - 1), 2]  # For 2 clusters: the height of the final merge, which joins the last two clusters

# Plot the dendrogram with the color threshold
plt.figure(figsize=(20, 7))
dendrogram(
    hcluster_model,
    labels=players_and_teams['Player'].values,
    leaf_rotation=90,
    leaf_font_size=5,
    color_threshold=color_threshold
)
plt.title('Hierarchical Clustering Dendrogram (Colored by 2 Clusters)')
plt.axhline(y=color_threshold, c='black', linestyle='--', lw=1)  # Add a line for the cut-off
plt.show()

### Calculate Silhouette Score
We'll evaluate the hierarchical clustering using the silhouette score.

*Run the code below.*

In [None]:
# Calculate silhouette score for hierarchical clustering
silhouette_score_avg_hc = silhouette_score(dataset_for_clustering_scaled, h_clusters)
print(f"Mean Silhouette Score for Hierarchical Clustering: {silhouette_score_avg_hc}")


### Find the Optimal Number of Clusters for Hierarchical Clustering
Similar to K-means, we can experiment with different cluster numbers to find the optimal configuration.

*Run the code below.*

In [None]:
# Iterate over different numbers of clusters and calculate silhouette score for hierarchical clustering
for i in range(2, 16):
    h_clusters = cut_tree(hcluster_model, n_clusters=i).flatten()
    silhouette_score_avg_hc = silhouette_score(dataset_for_clustering_scaled, h_clusters)
    print(f"Number of clusters: {i} - Mean Silhouette score: {silhouette_score_avg_hc}")


## Summary
In this tutorial, we covered:
1. Performing K-means clustering and evaluating it using silhouette scores.
2. Visualizing clusters by plotting centroids.
3. Performing hierarchical clustering and visualizing it with a dendrogram.


*You can use the cell below to build and evaluate different clusterings.*

In [None]:
# Enter your Code here!