![Data Dunkers Banner](https://github.com/PS43Foundation/data-dunkers/blob/main/docs/top-banner.jpg?raw=true)

<a href="https://hub.callysto.ca/jupyter/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fdata-dunkers%2Fdata-dunkers-modules&branch=main&subPath=AI/visualization.ipynb&depth=1" target="_parent"><img src="https://raw.githubusercontent.com/callysto/curriculum-notebooks/master/open-in-callysto-button.svg?sanitize=true" width="123" height="24" alt="Open in Callysto"/></a><a href="https://colab.research.google.com/github/data-dunkers/data-dunkers-modules/blob/main/AI/visualization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg?sanitize=true" width="123" height="24" alt="Open in Colab"/></a>

# Machine Learning - Visualization Techniques

## Objectives

Students will be able to:

- use the [plotly](https://plotly.com/python/) library to create different kinds of visualizations
- identify potential sources of error in datasets, and fix them before visualizing data
- explain "K-means clustering" and apply it as a modelling technique

## Introduction

In this notebook, you will see different ways of visualizing data, including one technique known as **"K-means clustering"**.

As a simplified explanation, k-means clustering is a way to group data points into clusters, where each cluster contains data points that are similar to each other. It works by finding the center of each cluster and then assigning every data point to the nearest center.
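To make the two steps concrete, here is a minimal sketch of the idea using only NumPy. The function name `simple_kmeans` and the sample points are made up for illustration. This is not how scikit-learn implements it internally, only a toy version of "assign each point to the nearest center, then move each center".

```python
import numpy as np

def simple_kmeans(points, k, n_iters=10, seed=0):
    """Toy k-means for illustration: returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    # Start with k randomly chosen data points as the initial centers.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 1: measure the distance from every point to every center,
        # then assign each point to its nearest center.
        distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 2: move each center to the average of the points assigned to it.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels, centers

# Made-up data: two obvious groups of points
points = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 1.5],
                   [8.0, 8.0], [9.0, 8.5], [8.5, 9.0]])
labels, centers = simple_kmeans(points, k=2)
print(labels)   # the first three points share one label, the last three share the other
print(centers)  # one center near each group
```

Which group is called 0 and which is called 1 depends on the random starting centers, so the label numbers themselves carry no meaning.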

As we go through this notebook, you'll also learn different visualization methods and how we can learn more about our data using these methods. In this particular notebook, we'll be using data from the 2023 [WNBA](https://en.wikipedia.org/wiki/Women%27s_National_Basketball_Association) regular season.

Note: you can use your own dataset instead of the one in this notebook.

Let's start by importing the libraries we need for this notebook.

## Import Libraries

In [None]:
import pandas as pd
from sklearn.cluster import KMeans
import plotly.express as px
from sklearn.decomposition import PCA

Let's take a look at the dataset we'll be using in this notebook.

In [None]:
wnba_players = pd.read_csv('https://raw.githubusercontent.com/Data-Dunkers/data-dunkers-modules/main/data/wnba_player_stats_2023.csv')
wnba_players

We see that we have the names of WNBA players along with their stats for the 2023 WNBA regular season.

Let's take a look at all the different columns in our dataset. 

In [None]:
wnba_players.columns

In the context of a WNBA game, each column in our dataset represents a specific statistic about a player's performance. For example, `PTS` is the player's average points per game, `Name` is the player's name, `GP` is the total number of games they played, and so on.

We'll be using these numerical columns in different visualizations throughout this notebook and looking for possible correlations between them.

## Cleaning Data

Before we get started with any data visualization, we need to clean our data.

**Data cleaning** is the process of fixing or removing incorrect, corrupted, or irrelevant data from a dataset. This ensures that the data is accurate and ready for analysis or machine learning tasks.
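As a small illustration of what "incorrect or corrupted" data can look like, the sketch below builds a made-up table with three common problems and cleans it. The player names and values are invented, and it assumes `FG%` is stored on a 0 to 100 scale. The WNBA dataset in this notebook may not contain any of these problems.

```python
import pandas as pd

# Made-up stats with a duplicated row, a missing value, and an impossible value
raw = pd.DataFrame({
    'Name': ['Player A', 'Player B', 'Player B', 'Player C', 'Player D'],
    'PTS':  [12.5, 8.0, 8.0, None, 10.2],
    'FG%':  [45.0, 38.5, 38.5, 41.0, 140.0],  # assumed 0-100 scale, so 140 is an error
})

print(raw.isna().sum())        # number of missing values in each column
print(raw.duplicated().sum())  # number of fully duplicated rows

cleaned = raw.drop_duplicates()            # remove the repeated row
cleaned = cleaned.dropna()                 # remove rows with missing values
cleaned = cleaned[cleaned['FG%'] <= 100]   # keep only plausible percentages
cleaned = cleaned.reset_index(drop=True)
print(cleaned)
```

Whether to drop a bad row or repair it (for example, by looking up the correct value) depends on the dataset. Dropping is just the simplest option to show here.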

In our particular case, we need to get rid of any numerical columns that are irrelevant for "k-means clustering" and other visualization methods.

In [None]:
wnba_players = wnba_players.drop(columns=['Year', 'GP', 'MIN'])
wnba_players

Now that we've removed these columns, let's start with some basic analysis.

Let's find the average value for each statistical measure in the WNBA.

In [None]:
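# Select the columns that contain numbers
numeric_columns = wnba_players.select_dtypes(include=['number']).columns

# Name and POS are labels rather than statistics, so make sure they are left out
columns_to_exclude = ['Name', 'POS']
numeric_columns = [col for col in numeric_columns if col not in columns_to_exclude]

# Average of each statistic across all players
mean_values = wnba_players[numeric_columns].mean()
mean_values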

By finding the average value of each numerical column in our dataset, we can get a sense of the typical or central values, which helps us understand the overall trend of the data. This is important because it allows us to quickly grasp the general patterns and make informed decisions or observations based on the data's tendencies.

What insights can you draw from the average value of each statistical metric in the WNBA?

## Visualization Comparisons

First, let's create a scatter-plot matrix comparing each player's points, assists, and rebounds in the WNBA. We'll start our visualizations with these three metrics because they are general metrics that most basketball players will recognize.

In this plot we can observe whether players who score more points also tend to have more assists or rebounds and vice-versa. We can potentially gain insights into the overall performance and playing style of different players, helping us identify trends and patterns within the data.

In [None]:
scatter_matrix = px.scatter_matrix(wnba_players[['AST', 'PTS', 'REB']], dimensions=['AST', 'PTS', 'REB'],
                                   title="Plot of AST, PTS, and REB")

scatter_matrix.show()

Looking at our comparison plots, it appears difficult to see any major correlations between points, assists, and rebounds. 

Let's take a look at a [heat-map](https://en.wikipedia.org/wiki/Heat_map) and get specific numerical analysis on trends between these three metrics.

In [None]:
corr = wnba_players[['AST', 'PTS', 'REB']].corr()
heatmap = px.imshow(corr, color_continuous_scale='Viridis', text_auto=True,
                    title='Heatmap of AST, PTS, and REB')

heatmap.show()

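In our heatmap, we see that our strongest correlation is between `PTS` and `REB` (a correlation coefficient of 0.62), and our weakest correlation is between `AST` and `REB` (0.26). A coefficient close to 1 means the two statistics tend to rise together, while a coefficient close to 0 means there is little linear relationship between them.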
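By default, `.corr()` in pandas computes the Pearson correlation coefficient. To show where a number like this comes from, here is a minimal sketch that calculates it by hand for two short columns. The values are made up and are not taken from the WNBA dataset.

```python
import numpy as np

# Made-up values for five imaginary players
pts = np.array([5.0, 10.0, 15.0, 20.0, 25.0])
reb = np.array([2.0, 3.0, 5.0, 4.0, 6.0])

# Subtract each column's average so we measure how far each player is from typical
pts_centered = pts - pts.mean()
reb_centered = reb - reb.mean()

# Pearson's r: how much the two columns vary together,
# divided by how much each varies on its own
r = (pts_centered * reb_centered).sum() / np.sqrt(
    (pts_centered ** 2).sum() * (reb_centered ** 2).sum()
)
print(round(r, 2))
```

For these made-up numbers the result is 0.9, which matches what `np.corrcoef(pts, reb)` reports for the same two arrays.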

Why do you think there is a stronger correlation between points (`PTS`) and rebounds (`REB`) compared to the weaker correlation between assists (`AST`) and rebounds (`REB`)? Think of what roles in the WNBA would typically get higher rebounds, and then compare the role of those players to players who get higher amounts of assists.

## K-Means Clustering

Now that we have a general grasp of visualization techniques to compare different columns in our dataset, let's get into **K-means clustering**. As mentioned before, this technique helps us group similar data points together, making it easier to identify patterns or trends within the data.

In [None]:
kmeans_model = KMeans(n_clusters=5, random_state=1)
numeric_cols = wnba_players.select_dtypes(include='number')  # keep only the numeric columns
kmeans_model.fit(numeric_cols)
labels = kmeans_model.labels_ # get the cluster labels
labels

Let's break down what the code above is doing and simplify each step:

```python
kmeans_model = KMeans(n_clusters=5, random_state=1)
```

In the code above, we start by creating a K-means model that will divide our data into 5 clusters (`n_clusters=5`). Setting `random_state=1` fixes the random starting points, so the model produces the same clusters every time the cell is run.

```python
numeric_cols = wnba_players.select_dtypes(include='number')
```

In this portion of code, we're simply fetching the numerical columns in our data. These columns are: `PTS`, `FGM`, `FGA`, `FG%`, `3PM`, `3PA`, `3P%`, `FTM`, `FTA`, `FT%`, `REB`, `AST`, `STL`, `BLK`, and `TO`.

```python
kmeans_model.fit(numeric_cols)
labels = kmeans_model.labels_ # get the cluster labels
```

Once the model is trained on our data, it assigns a cluster label to each data point, indicating which of the 5 groups it belongs to. These labels can then be used to analyze how players with similar characteristics are grouped together.

```python
array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 1, 3, 2,
       2, 2, 2, 2, 2, 2, 2, 0, 3, 2, 1, 0, 2, 0, 2, 3, 3, 2, 3, 0, 1, 0,
       3, 0, 3, 3, 3, 3, 3, 0, 0, 3, 3, 1, 0, 0, 3, 3, 3, 0, 3, 3, 0, 3,
       3, 3, 0, 3, 0, 3, 3, 0, 3, 3, 1, 3, 1, 0, 3, 1, 3, 3, 0, 1, 1, 0,
       0, 3, 3, 0, 1, 3, 3, 0, 1, 1, 0, 4, 3, 1, 1, 0, 1, 0, 0, 1])
```

This final output simply shows the cluster label that each data point (each player) in our `wnba_players` dataframe was assigned.

Afterwards, we're going to use a technique called **principal component analysis (PCA)**.

In simple terms, this technique simplifies data by reducing the number of variables (or dimensions). It essentially takes our original dataframe with all our numerical columns and transforms it into a new set of fewer columns that still capture the main patterns in our data. 

In your own coding projects, this step can be skipped if it is too difficult to implement, for example by plotting two of the original columns instead.
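To get a feel for what PCA does, here is a small sketch on made-up data with three columns, where the third column is exactly the sum of the first two. Because that third column adds no new information, two components should be enough to describe the data. The numbers are invented for illustration and are unrelated to the WNBA dataset.

```python
import numpy as np
from sklearn.decomposition import PCA

# Made-up stats for six imaginary players
col_a = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
col_b = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
col_c = col_a + col_b  # redundant: fully determined by the other two columns
stats = np.column_stack([col_a, col_b, col_c])

pca = PCA(n_components=2)
reduced = pca.fit_transform(stats)

print(stats.shape)    # 6 rows, 3 columns before PCA
print(reduced.shape)  # 6 rows, 2 columns after PCA
# Share of the original variation that each new column keeps
print(pca.explained_variance_ratio_)
```

The values in `explained_variance_ratio_` add up to roughly 1 here because of how the made-up data was built. With real data such as player statistics, two components usually keep only part of the variation, so checking this attribute is a quick way to see how much detail a 2D plot leaves out.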

In [None]:
pca_2 = PCA(n_components=2)  # reduce the data to 2 principal components
plot_columns = pca_2.fit_transform(numeric_cols)

Now we can visualize our K-means clusters!

In [None]:
temp = pd.DataFrame(plot_columns, columns=['PC1', 'PC2'])
temp['Cluster'] = labels.astype(str)  # treat cluster labels as categories, not numbers
temp['Name'] = wnba_players['Name']

k_means = px.scatter(
    temp,
    x='PC1',
    y='PC2',
    color='Cluster',
    hover_data={'PC1': True, 'PC2': True, 'Cluster': True, 'Name': True},
    title='K-Means Clustering of WNBA Players',
    color_discrete_sequence=px.colors.qualitative.Plotly,  # one distinct color per cluster
    category_orders={'Cluster': [str(i) for i in range(5)]}  # list clusters 0-4 in order in the legend
)

k_means.show()

Looking at our visualization, we see that our WNBA players are now sorted into 5 different clusters. 

Generally, since K-means clustering is an algorithm used on unlabeled data, we need to go further in our analysis to identify and understand any correlations or patterns within these clusters. This might involve examining the characteristics of each cluster to see what common traits or statistics are shared among players in the same group, which can provide valuable insights into player performance and grouping trends.
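One simple way to examine the characteristics of each cluster is to average the statistics of the players inside it. The sketch below does this on a small made-up table. The player names, values, and cluster labels are invented and stand in for the real dataframe and the `labels` array.

```python
import pandas as pd

# Made-up stats and cluster labels for six imaginary players
players = pd.DataFrame({
    'Name': ['Player A', 'Player B', 'Player C', 'Player D', 'Player E', 'Player F'],
    'PTS': [20.0, 18.0, 6.0, 5.0, 11.0, 9.0],
    'REB': [8.0, 9.0, 2.0, 3.0, 4.0, 4.0],
    'AST': [3.0, 2.0, 1.0, 1.5, 6.0, 7.0],
    'Cluster': [0, 0, 1, 1, 2, 2],
})

# Average of each statistic within each cluster
cluster_profile = players.groupby('Cluster')[['PTS', 'REB', 'AST']].mean()
print(cluster_profile)

# How many players ended up in each cluster
print(players['Cluster'].value_counts().sort_index())
```

In this made-up example, cluster 0 stands out for points and rebounds while cluster 2 stands out for assists. A similar table for the real data could be built with something like `wnba_players.assign(Cluster=labels).groupby('Cluster').mean(numeric_only=True)`, though what the clusters mean there is something you would need to interpret yourself.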

## Conclusion

In this notebook, we explored various data visualization techniques, including scatter plots and heatmaps, to understand player statistics in the WNBA. We then applied K-means clustering to group players into distinct clusters based on their performance metrics. By analyzing these clusters and using Principal Component Analysis (PCA) to simplify the data, we gained deeper insights into player performance and identified patterns in our data.

For your own projects, look for datasets with features that are useful for clustering. Many datasets can be found on [Kaggle](https://www.kaggle.com/), a platform that offers a wide range of data for analysis and experimentation.