# **Data Analysis with Clustering**
- **Coding Dojo**
- **Course 3 - Advanced Machine Learning**
- **Week 1 - Lecture 1**

# **Code Along Exercise**


### **Data**

#### **Mall Customer Segmentation Data**




This is **a dataset for mall customers**, and it's originally from **[Kaggle](https://www.kaggle.com/vjchoudhary7/customer-segmentation-tutorial-in-python)**.

This dataset is about customers and spending habits.  

- The problem we are solving is how to group similar customers together and understand the different groups.  This is a common clustering problem called Customer Segmentation.

- Our challenge is to provide a meaningful analysis of customer groups based on the data. This is a business analyst task that can be improved with unsupervised learning.



### **Import Libraries**

In [None]:
## Numpy
import numpy as np
## Pandas
import pandas as pd
## MatPlotlib
import matplotlib.pyplot as plt
## Seaborn
import seaborn as sns

## Warnings
import warnings

## Import Preprocessing Standard Scaler Transformer
from sklearn.preprocessing import StandardScaler

# new libraries
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

### **Notebook Defaults and Runtime Configurations**

##### **Warnings**

In [None]:
## Set filter warnings to ignore
warnings.filterwarnings('ignore')

##### **Pandas Display Configurations**

In [None]:
## Display all columns
pd.set_option('display.max_columns', None)

## Display all rows
pd.set_option('display.max_rows', None)

##### **MatPlotLib rcParams**

- **Customizing Matplotlib with style sheets and rcParams**

 - https://matplotlib.org/stable/tutorials/introductory/customizing.html

In [None]:
## Set MatPlotLib default parameters
plt.rcParams.update({'figure.facecolor': 'white',
                          'font.weight': 'bold',
                      'patch.linewidth': 1.25,
                       'axes.facecolor': 'white',
                       'axes.edgecolor': 'black',
                       'axes.linewidth': 2,
                       'axes.titlesize': 14,
                     'axes.titleweight': 'bold',
                       'axes.labelsize': 12,
                     'axes.labelweight': 'bold',
                      'xtick.labelsize': 10,
                      'ytick.labelsize': 10,
                            'axes.grid': True,
                       'axes.grid.axis': 'y',
                           'grid.color': 'black',
                       'grid.linewidth': .5,
                           'grid.alpha': .25,
                   'scatter.edgecolors': 'black'})

## **Load and inspect the data**

### **Load the Data**

In [None]:
## Define the file address as a string
file_url = 'https://docs.google.com/spreadsheets/d/e/2PACX-1vQGG95zRf7Hmos7Gx7VqpJmksOos3bgxr73KYfmc8soEnvk_L4rVcNPcUHDpmNMDnRyof6UPlm-DTEp/pub?gid=1011669702&single=true&output=csv'

In [None]:
## Define a dataframe df
## from the file stored at the location file_url
df = pd.read_csv(file_url)

### **Inspect the Data**

#### **.head()**

In [None]:
## Display the first (5) rows of the dataframe
df.head()

- The data appears to have loaded correctly.

#### **.shape**

In [None]:
## Display the number of rows and columns for the dataframe
df.shape

In [None]:
## Display the number of rows and columns for the dataframe
## using a print() statement and an F-string
## 'There are x rows and x columns in the dataframe'
print(f'There are {df.shape[0]} rows and {df.shape[1]} columns in the dataframe.')

#### **.dtypes**

In [None]:
## Display the column names and datatypes for each column
## Columns with mixed datatypes are identified as an object datatype
df.dtypes

#### **.info()**

In [None]:
## Display the column names, count of non-null values, and their datatypes
df.info()

## **Clean the Data**

- This step has been omitted for Lecture Purposes only.

## **EDA**

- This step has been omitted for Lecture Purposes only.

## **Model Validation Split**

- **This is not required for Unsupervised Learning.**
  - There is no predicted target, and model validation is not required.

## **Preprocess the Data**

### **Select Features to Analyze**

- Initially, we will only utilize **'Annual Income (k$)'** and **'Spending Score (1-100)'** columns to analyse the clusters.


In [None]:
# define the columns you want to use (X is fine, but remember there isn't an X and y)


In [None]:
# check head of new data


In [None]:
## Display the column names, count of non-null values, and their datatypes


### **Transform the Data**

In [None]:
# scale the data using standard scaler


**Why do we scale data**?

- Clustering algorithms are looking for points that are 'close' together.  However, if the features are on different scales, for instance if one is on the order of 10s and another is on the order of 100000s, then the feature with the larger variance, the one on the scale of 100000s, will have an outsized effect on how the algorithm determines 'closeness'.

- By scaling the features, each feature will be considered equally in determining how close or far apart data points are.
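
To make the effect of scaling concrete, here is a minimal sketch on made-up numbers (the `toy` array is hypothetical and is not the mall data). It compares the distance between two rows before and after `StandardScaler`, and checks that each scaled column ends up with mean 0 and standard deviation 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: column 0 is on the order of 10s,
# column 1 is on the order of 100000s
toy = np.array([[10.0, 100_000.0],
                [20.0, 300_000.0],
                [30.0, 200_000.0],
                [40.0, 400_000.0]])

# Unscaled Euclidean distance between rows 0 and 1
# is almost entirely determined by column 1
raw_distance = np.linalg.norm(toy[0] - toy[1])

# After scaling, every column has mean 0 and standard deviation 1
toy_scaled = StandardScaler().fit_transform(toy)
scaled_distance = np.linalg.norm(toy_scaled[0] - toy_scaled[1])

print(f'Raw distance:    {raw_distance:.1f}')
print(f'Scaled distance: {scaled_distance:.3f}')
print('Column means after scaling:', toy_scaled.mean(axis=0).round(6))
print('Column stds after scaling: ', toy_scaled.std(axis=0).round(6))
```

In the raw data, the difference of 10 in the first column barely changes the distance; after scaling, both columns contribute on a comparable footing.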

## **Model the Data**

### **Choosing Number of Clusters**



KMeans does not choose the number of clusters to group the data into. That's our job!

There are many ways to do this:
1. Subject Matter Expertise
2. Try several different numbers and explore the clusters for each to see if they make sense.
3. Inertia method
4. Silhouette Score method


#### **Inertia**

- Inertia is the sum of squared distances from each sample to the centroid of its cluster (the centroid is just the center)

- The closer each point is to the center of its cluster, the tighter the cluster

- A lower inertia indicates better clustering

- The more clusters you have, the lower the inertia will be

- But too many clusters is not useful (imagine the extreme case where every data point was its own cluster---inertia would be minimized, but there are no useful groups)

- **The elbow method enables you to visualize the *tradeoff* between inertia and the number of clusters.**
  - The point where the graph starts to level off indicates a good tradeoff between the inertia and the number of clusters.
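
As a rough illustration of the points above, the sketch below recomputes inertia by hand on a small made-up array (`points` is hypothetical, not the mall data) and compares it with the `inertia_` attribute of a fitted `KMeans` model. It also fits the extreme case with one cluster per point, where inertia drops to zero.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: two small groups of 2D points
points = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 1.5],
                   [8.0, 8.0], [8.5, 9.0], [9.0, 8.0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(points)

# Look up the centroid assigned to each point via its cluster label
assigned_centroids = kmeans.cluster_centers_[kmeans.labels_]

# Inertia: sum of squared distances from each point to its own centroid
manual_inertia = ((points - assigned_centroids) ** 2).sum()

print(f'Manual inertia:  {manual_inertia:.4f}')
print(f'kmeans.inertia_: {kmeans.inertia_:.4f}')

# Extreme case: one cluster per point, so every point is its own centroid
extreme = KMeans(n_clusters=len(points), n_init=10, random_state=42).fit(points)
print(f'Inertia with k = n: {extreme.inertia_:.4f}')
```

The last line shows why the lowest possible inertia is not the goal on its own: it is reached by a clustering that groups nothing.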

##### **Inertia Elbow Plot**

In [None]:
## Create an empty list of inertias


## Loop through k values between 2 and 10
## and store the scores in the list

## Visualize the scores


#### **Silhouette**

- Silhouette score is a measure of how dense each cluster is and how well separated they are from each other.

- The metric is similar to inertia in its overall goal, but it is calculated and interpreted differently.

- Rather than basing the calculations on the centroid, the calculation is based on the distance between points in the same cluster vs distance between points in different clusters

- The silhouette score is computed on every datapoint in every cluster

- The range of Silhouette Scores is -1 to 1 with a **higher score being better**

- **If a graph shows silhouette scores on the y-axis, you would select the highest value.**
  - The elbow method **does not apply** to silhouette scores.
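
For a single point, the silhouette value is (b - a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the mean distance to the points in the nearest other cluster. The sketch below, on a small made-up array (`points` is hypothetical), uses `silhouette_samples` to get the per-point values and shows that `silhouette_score` is their mean.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

# Hypothetical toy data: two small groups of 2D points
points = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 1.5],
                   [8.0, 8.0], [8.5, 9.0], [9.0, 8.0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(points)

# One silhouette value per data point, each between -1 and 1
per_point = silhouette_samples(points, kmeans.labels_)

# The overall score is the mean of the per-point values
overall = silhouette_score(points, kmeans.labels_)

print('Per-point silhouette values:', per_point.round(3))
print(f'Mean of per-point values:    {per_point.mean():.4f}')
print(f'silhouette_score:            {overall:.4f}')
```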



##### **Silhouette Score Plot**

In [None]:
## Create an empty list for silhouette scores

## Loop through k values between 2 and 10
## and store the scores in the list

## Visualize the scores


#### **Inertia Elbow and Silhouette Score Plots Combined**

In [None]:
## Create empty lists for scores
inertias = []
silhouette_scores = []

## Loop through k values between 2 and 10
## and store the scores in the list
for k in range(2,11):
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(processed_ml_df)
    inertias.append(kmeans.inertia_)
    silhouette_scores.append(silhouette_score(processed_ml_df, kmeans.labels_))

## Visualize the scores
fig, ax1 = plt.subplots(figsize=(6,4))
plt.title('K Means Clustering', fontsize = 16, weight='bold')

ax1.plot(range(2, 11), inertias, marker = '.')
ax1.set_ylabel('Inertia Elbow Score', color='blue', fontsize = 12, weight='bold')
plt.xlabel('Number of Clusters', fontsize = 12, weight='bold')
plt.xticks(fontsize = 10, weight='bold')
plt.yticks(fontsize = 10, weight='bold')
ax1.tick_params(axis='y', labelcolor='b')
ax1.xaxis.set_ticks(np.arange(0, 12, 2))

plt.xlim([1, 11])

ax2 = ax1.twinx()
ax2.plot(range(2, 11), silhouette_scores, color='r', marker = '.')
ax2.set_ylabel('Silhouette Score', color='r', fontsize = 12, weight='bold')
plt.yticks(fontsize = 10, weight='bold')
ax2.tick_params(axis='y', labelcolor='r')

## Color the spines to match each y-axis
ax2.spines['left'].set_color('blue')
ax2.spines['right'].set_color('r')

plt.tight_layout()
## Call plt.show() (with parentheses) to render the figure
plt.show()

**K = 5** would be the value I would choose to optimize both the Inertia Score and the Silhouette Score.
- **Inertia Score** - The best clustering is a balance between the lowest number of clusters and the lowest inertia.
- **Silhouette Score** -  The best score is 1 (signaling well-defined & well-separated clusters) and the worst is -1.

**Note**: Sometimes the Inertia Elbow Method and the Silhouette Score disagree!  There is no exact science for choosing clusters.
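
One way to read the silhouette side of this choice programmatically is to pick the k with the highest score. The sketch below does that on synthetic data built for illustration (three tight, well-separated blobs; the `blobs` array, the seed, and the range of k are all assumptions, not the mall data). On real data the picture is rarely this clean, so treat the result as one input alongside the elbow plot and subject matter knowledge.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical synthetic data: three tight, well-separated blobs of 50 points each
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
blobs = np.vstack([center + rng.normal(scale=0.5, size=(50, 2))
                   for center in centers])

# Score each candidate k
k_values = list(range(2, 7))
scores = []
for k in k_values:
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(blobs)
    scores.append(silhouette_score(blobs, labels))

# Select the k with the highest silhouette score
best_k = k_values[int(np.argmax(scores))]

for k, score in zip(k_values, scores):
    print(f'k = {k}: silhouette = {score:.3f}')
print(f'Highest silhouette score at k = {best_k}')
```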

### **Instantiate and Fit the Model**

In [None]:
# instantiate a KMeans model with the value for k based on elbow plot method
# and silhouette score


## **Analyze the Clusters**

In [None]:
# add a column to the dataframe to add the cluster label as you fit and predict x


#### **Cluster Statistics**

In [None]:
## Display the descriptive statistics for the column 'Age'


In [None]:
## Display the descriptive statistics for the column 'Annual Income (k$)'


In [None]:
## Display the descriptive statistics for the column 'Spending Score (1-100)'


#### **Visualize Cluster Mean Values**

In [None]:
## Define a variable for a dataframe
## grouped by cluster


In [None]:
## Define a dataframe
## indexed by cluster
## with the feature means for each cluster


In [None]:
## Display the dataframe


In [None]:
## Visualize the mean values of each column


##### **Interpret and explain the Visualizations**

###### **'Age'**



**This visualization plots the mean 'age' of each cluster.**

- Clusters 0, 1, and 2 all have mean ages above 40.

- Cluster 3 has a mean age in the mid-20s.

- Cluster 4 has a mean age in the low 30s.

###### **'Annual Income (k$)'**

**This visualization plots the mean 'Annual Income (k$)' of each cluster.**

- Clusters 1 and 4 have high annual incomes, greater than $80k.

- Clusters 2 and 3 have low annual incomes, less than $30k.

- Cluster 0 has a moderate annual income, approximately $50k.

###### **'Spending Score (1-100)'**

**This visualization plots the mean 'Spending Score (1-100)' of each cluster.**

- Clusters 1 and 2 have low spending scores.

- Cluster 0 has a moderate spending score.

- Clusters 3 and 4 have high spending scores.

#### **Describe the Clusters**

- **0. The Moderates**:
   - Older moderate earners and moderate spenders
- **1. The Savers**:
   - Older and successful but frugal
- **2. The Humble Elders**:
   - Low income and low spending
- **3. The Irresponsible Youth**:
   - Low income, big spenders
- **4. The 'Work Hard, Play Hard' Crowd**:
   - Big earners, Big spenders
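
Descriptions like these rest on the per-cluster feature means. A minimal sketch of how such a table might be built with `groupby` is shown below; the `toy_df` rows and their cluster labels are made up for illustration and are not taken from the mall data.

```python
import pandas as pd

# Hypothetical toy rows; the values and cluster labels are made up
toy_df = pd.DataFrame({
    'Age': [45, 50, 23, 27],
    'Annual Income (k$)': [25, 30, 20, 28],
    'Spending Score (1-100)': [15, 25, 80, 90],
    'cluster': [2, 2, 3, 3],
})

# One row per cluster, one column per feature, each cell a mean
cluster_means = toy_df.groupby('cluster').mean()
print(cluster_means)
```

Each row of `cluster_means` can then be read as a profile of that cluster, for example an older, low-spending group next to a younger, high-spending one.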



### **Recommendations**

- How might mall stores market to each group (cluster) differently?

#### **Bonus: 3D Plotting**

- As a challenge make a 3D scatterplot!  See [link here](https://www.geeksforgeeks.org/3d-scatter-plotting-in-python-using-matplotlib/) for info on 3D scatterplot

In [None]:
import plotly.express as px
px.scatter_3d(df,
              x='Annual Income (k$)',
              y='Spending Score (1-100)',
              z='Age',
              color='cluster')

# **Challenge Exercise**


- A stakeholder SME (Subject Matter Expert) has recommended that you include the **'Age'** column, along with **'Annual Income (k$)'** and **'Spending Score (1-100)'** columns to determine the clusters.
- What differences can you see?

### **Load the Data**

In [None]:
## Define a dataframe df
## from the file stored at the location file_url
df = pd.read_csv(file_url)

## **Preprocess the Data**

### **Select Features to Analyze**

In [None]:
# define the columns you want to use (X is fine, but remember there isn't an X and y)
ml_df = df[['Age', 'Annual Income (k$)','Spending Score (1-100)']]

In [None]:
# check head of new data
ml_df.head()

In [None]:
## Display the column names, count of non-null values, and their datatypes
ml_df.info()

### **Transform the Data**

In [None]:
# scale the data using standard scaler
scaler = StandardScaler()
processed_ml_df = scaler.fit_transform(ml_df)

## **Model the Data**

#### **Inertia**

- Inertia is the sum of squared distances from each sample to the centroid of its cluster (the centroid is just the center)

- The closer each point is to the center of its cluster, the tighter the cluster

- A lower inertia indicates better clustering

- The more clusters you have, the lower the inertia will be

- But too many clusters is not useful (imagine the extreme case where every data point was its own cluster---inertia would be minimized, but there are no useful groups)

- **The elbow method enables you to visualize the *tradeoff* between inertia and the number of clusters.**
  - The point where the graph starts to level off indicates a good tradeoff between the inertia and the number of clusters.

##### **Inertia Elbow Plot**

In [None]:
## Create an empty list of inertias


## Loop through k values between 2 and 10
## and store the scores in the list





## Visualize the scores




#### **Silhouette**

- Silhouette score is a measure of how dense each cluster is and how well separated they are from each other.

- The metric is similar to inertia in its overall goal, but it is calculated and interpreted differently.

- Rather than basing the calculations on the centroid, the calculation is based on the distance between points in the same cluster vs distance between points in different clusters

- The silhouette score is computed on every datapoint in every cluster

- The range of Silhouette Scores is -1 to 1 with a **higher score being better**

- **If a graph shows silhouette scores on the y-axis, you would select the highest value.**
  - The elbow method **does not apply** to silhouette scores.



##### **Silhouette Score Plot**

In [None]:
## Create an empty list for silhouette scores


## Loop through k values between 2 and 10
## and store the scores in the list





## Visualize the scores




#### **Inertia Elbow and Silhouette Score Plots Combined**

In [None]:
## Create empty lists for scores
inertias = []
silhouette_scores = []

## Loop through k values between 2 and 10
## and store the scores in the list
for k in range(2,11):
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(processed_ml_df)
    inertias.append(kmeans.inertia_)
    silhouette_scores.append(silhouette_score(processed_ml_df, kmeans.labels_))

## Visualize the scores
fig, ax1 = plt.subplots(figsize=(6,4))
plt.title('K Means Clustering', fontsize = 16, weight='bold')

ax1.plot(range(2, 11), inertias, marker = '.')
ax1.set_ylabel('Inertia Elbow Score', color='blue', fontsize = 12, weight='bold')
plt.xlabel('Number of Clusters', fontsize = 12, weight='bold')
plt.xticks(fontsize = 10, weight='bold')
plt.yticks(fontsize = 10, weight='bold')
ax1.tick_params(axis='y', labelcolor='b')
ax1.xaxis.set_ticks(np.arange(0, 12, 2))

plt.xlim([1, 11])

ax2 = ax1.twinx()
ax2.plot(range(2, 11), silhouette_scores, color='r', marker = '.')
ax2.set_ylabel('Silhouette Score', color='r', fontsize = 12, weight='bold')
plt.yticks(fontsize = 10, weight='bold')
ax2.tick_params(axis='y', labelcolor='r')

## Color the spines to match each y-axis
ax2.spines['left'].set_color('blue')
ax2.spines['right'].set_color('r')

plt.tight_layout()
## Call plt.show() (with parentheses) to render the figure
plt.show()

**K = ?** would be the value I would choose to optimize both the Inertia Score and the Silhouette Score.
- **Inertia Score** - The best clustering is a balance between the lowest number of clusters and the lowest inertia.
- **Silhouette Score** -  The best score is 1 (signaling well-defined & well-separated clusters) and the worst is -1.

**Note**: Sometimes the Inertia Elbow Method and the Silhouette Score disagree!  There is no exact science for choosing clusters.

### **Instantiate and Fit the Model**

In [None]:
# instantiate a KMeans model with the value for k based on elbow plot method
# and silhouette score



## **Analyze the Clusters**

In [None]:
# add a column to the dataframe to add the cluster label as you fit and predict x


#### **Cluster Statistics**

In [None]:
## Display the descriptive statistics for the column 'Age'


In [None]:
## Display the descriptive statistics for the column 'Annual Income (k$)'


In [None]:
## Display the descriptive statistics for the column 'Spending Score (1-100)'


#### **Visualize Cluster Mean Values**

In [None]:
## Define a variable for a dataframe
## grouped by cluster


In [None]:
## Define a dataframe
## indexed by cluster
## with the feature means for each cluster


In [None]:
## Display the dataframe


In [None]:
## Visualize the mean values of each column






##### **Interpret and explain the Visualizations**

###### **'Age'**



-


###### **'Annual Income (k$)'**

-

###### **'Spending Score (1-100)'**

-

#### **Describe the Clusters**

-

### **Recommendations**

- How might mall stores market to each group (cluster) differently?

### **Bonus: 3D Plotting**

- As a challenge make a 3D scatterplot!  See [link here](https://www.geeksforgeeks.org/3d-scatter-plotting-in-python-using-matplotlib/) for info on 3D scatterplot

In [None]:
import plotly.express as px
px.scatter_3d(df,
              x='Annual Income (k$)',
              y='Spending Score (1-100)',
              z='Age',
              color='cluster')