# Market Segmentation with Clustering - Lab

## Introduction

In this lab, you'll use your knowledge of clustering to perform market segmentation on a real-world dataset!

## Objectives

In this lab you will: 

- Use clustering to create and interpret market segmentation on real-world data 

## Getting Started

In this lab, you're going to work with the [Wholesale customers dataset](https://archive.ics.uci.edu/ml/datasets/wholesale+customers) from the UCI Machine Learning datasets repository. This dataset contains data on wholesale purchasing information from real businesses. These businesses range from small cafes and hotels to grocery stores and other retailers. 

Here's the data dictionary for this dataset:

|      Column      |                                               Description                                              |
|:----------------:|:------------------------------------------------------------------------------------------------------:|
|       FRESH      |                    Annual spending on fresh products, such as fruits and vegetables                    |
|       MILK       |                               Annual spending on milk and dairy products                               |
|      GROCERY     |                                   Annual spending on grocery products                                  |
|      FROZEN      |                                   Annual spending on frozen products                                   |
| DETERGENTS_PAPER |                  Annual spending on detergents, cleaning supplies, and paper products                  |
|   DELICATESSEN   |                           Annual spending on meats and delicatessen products                           |
|      CHANNEL     | Type of customer.  1=Hotel/Restaurant/Cafe, 2=Retailer. (This is what we'll use clustering to predict) |
|      REGION      |            Region of Portugal that the customer is located in. (This column will be dropped)           |



One benefit of practicing segmentation on this dataset is that we have ground-truth labels for the market segment each customer belongs to. For this reason, we'll borrow some methodology from supervised learning and store these labels separately, so that we can use them afterward to check how well our clustering segmentation performed. 

Let's get started by importing everything we'll need.

In the cell below:

* Import `pandas`, `numpy`, and `matplotlib.pyplot`, and set the standard alias for each. 
* Use `numpy` to set a random seed of `0`.
* Set all matplotlib visualizations to appear inline.

In [26]:
# Import all the tools we'll need
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt

from sklearn.cluster import KMeans # k-means clustering
from sklearn.metrics import calinski_harabasz_score, adjusted_rand_score # metrics for rating the clusters
from sklearn.preprocessing import StandardScaler # standardizes each column to zero mean and unit variance
from sklearn.decomposition import PCA # projects the columns onto fewer components

# Seed NumPy's random number generator for reproducibility
np.random.seed(0)

# Make plots show inside the notebook
%matplotlib inline

Now, let's load our data and inspect it. You'll find the data stored in `'wholesale_customers_data.csv'`. 

In the cell below, load the data into a DataFrame and then display the first five rows to ensure everything loaded correctly.

In [3]:
raw_df = pd.read_csv("wholesale_customers_data.csv")
# first five rows 
raw_df.head()

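|   | Channel | Region | Fresh | Milk | Grocery | Frozen | Detergents_Paper | Delicassen |
|:-:|:-------:|:------:|:-----:|:----:|:-------:|:------:|:----------------:|:----------:|
| 0 |    2    |    3   | 12669 | 9656 |   7561  |   214  |       2674       |    1338    |
| 1 |    2    |    3   |  7057 | 9810 |   9568  |  1762  |       3293       |    1776    |
| 2 |    2    |    3   |  6353 | 8808 |   7684  |  2405  |       3516       |    7844    |
| 3 |    1    |    3   | 13265 | 1196 |   4221  |  6404  |        507       |    1788    |
| 4 |    2    |    3   | 22615 | 5410 |   7198  |  3915  |       1777       |    5185    |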


Now, let's go ahead and store the `'Channel'` column in a separate variable and then drop both the `'Channel'` and `'Region'` columns. Then, display the first five rows of the new DataFrame to ensure everything worked correctly. 

In [5]:
channels = raw_df["Channel"] # save the real labels
df = raw_df.drop(columns=["Channel", "Region"]) # `columns=` already targets columns, so `axis` is not needed
# Show the first five rows to confirm
df.head()

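|   | Fresh | Milk | Grocery | Frozen | Detergents_Paper | Delicassen |
|:-:|:-----:|:----:|:-------:|:------:|:----------------:|:----------:|
| 0 | 12669 | 9656 |   7561  |   214  |       2674       |    1338    |
| 1 |  7057 | 9810 |   9568  |  1762  |       3293       |    1776    |
| 2 |  6353 | 8808 |   7684  |  2405  |       3516       |    7844    |
| 3 | 13265 | 1196 |   4221  |  6404  |        507       |    1788    |
| 4 | 22615 | 5410 |   7198  |  3915  |       1777       |    5185    |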


Now, let's get right down to it and begin our clustering analysis. 

In the cell below:

* Import `KMeans` from `sklearn.cluster`, and then create an instance of it. Set the number of clusters to `2`
* Fit it to the data (`df`) 
* Get the predictions from the clustering algorithm and store them in `cluster_preds` 

In [6]:
# Already imported

In [8]:
# Use 2 clusters because there are 2 channels
k_means = KMeans(n_clusters=2, random_state=0)
k_means.fit(df)

# Cluster labels assigned to each row during fitting
cluster_preds = k_means.labels_
cluster_preds[:10] # show the first 10 labels

array([0, 0, 0, 0, 1, 0, 0, 0, 0, 0])

Now, use some of the metrics to check the performance. You'll use `calinski_harabasz_score()` and `adjusted_rand_score()`, which can both be found inside [`sklearn.metrics`](https://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation). 

In the cell below, import these scoring functions. 

Now, start with CH score to get the variance ratio. 

In [13]:
# The Calinski-Harabasz score rates how well separated the clusters are (higher = better)
ch_score = calinski_harabasz_score(df, cluster_preds)
print("CH score:",ch_score)

CH score: 171.68461633384186


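On its own, this number is hard to interpret: the Calinski-Harabasz score has no fixed scale, so it is mainly useful for comparing different clusterings of the same data. With nothing to compare it to yet, you need another way to judge whether these clusters are any good. 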
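To make the "variance ratio" concrete, here is a minimal sketch that computes the Calinski-Harabasz score by hand and compares it with scikit-learn's result. The helper name `variance_ratio` and the synthetic `make_blobs` data are assumptions made for illustration; they are not part of the lab's dataset or solution.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score


def variance_ratio(X, labels):
    """Between-cluster dispersion over within-cluster dispersion (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n_samples = X.shape[0]
    cluster_ids = np.unique(labels)
    k = len(cluster_ids)
    overall_mean = X.mean(axis=0)

    between = 0.0
    within = 0.0
    for cid in cluster_ids:
        members = X[labels == cid]
        centroid = members.mean(axis=0)
        # Squared distance of the centroid from the overall mean, weighted by cluster size
        between += len(members) * np.sum((centroid - overall_mean) ** 2)
        # Squared distances of the members from their own centroid
        within += np.sum((members - centroid) ** 2)

    return (between / (k - 1)) / (within / (n_samples - k))


# Synthetic data, used only for illustration
X_demo, _ = make_blobs(n_samples=200, centers=2, random_state=0)
demo_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_demo)

manual_score = variance_ratio(X_demo, demo_labels)
sklearn_score = calinski_harabasz_score(X_demo, demo_labels)
print(manual_score, sklearn_score)
```

Because the score grows with the spread between centroids and shrinks with the spread inside clusters, its magnitude depends on the data's units, which is one reason a single value is hard to judge in isolation.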

Since you have ground-truth labels in this case, you can use `adjusted_rand_score()` to check how well the clustering performed. Adjusted Rand score compares two clusterings, and the ground-truth labels can be treated as one of them. This will tell us how similar the predicted clusters are to the actual channels. 

Adjusted Rand score has an upper bound of 1, which is reached when the two clusterings are identical up to a renaming of the labels. A score close to 0 means that the predictions are essentially random, while a negative score means that the predictions agree with the labels less than random chance would (scikit-learn documents a lower bound of -0.5). 
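A small sketch of how the score behaves, using made-up toy labels rather than the lab's data. It shows that the label names themselves do not matter, which is why `Channel` values of 1 and 2 can be compared directly with k-means labels of 0 and 1.

```python
from sklearn.metrics import adjusted_rand_score

# Toy labels, invented for illustration
true_labels = [1, 1, 1, 2, 2, 2]

# Same grouping under different label names: counts as a perfect match
same_grouping = [0, 0, 0, 1, 1, 1]
perfect = adjusted_rand_score(true_labels, same_grouping)

# A grouping that splits every true group across both clusters
mixed_grouping = [0, 1, 0, 1, 0, 1]
poor = adjusted_rand_score(true_labels, mixed_grouping)

print(perfect, poor)
```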

In the cell below, call `adjusted_rand_score()` and pass in `channels` and `cluster_preds` to see how well your first iteration of clustering performed. 

In [14]:
# Use adjusted_rand_score (ARS) to compare our clusters to the real labels
# near 1 = very good, near 0 = random, below 0 = worse than random
ars = adjusted_rand_score(channels, cluster_preds)
print("Adjusted Rand (Score Without Scaling): ", ars)

Adjusted Rand (Score Without Scaling):  -0.03060891241109425


According to these results, the clustering was essentially no better than random chance. Let's see if you can improve this. 

### Scaling our dataset

Recall that k-means clustering is heavily affected by scaling. Since the clustering algorithm is distance-based, this makes sense. Let's use `StandardScaler` to scale our dataset and then try our clustering again and see if the results are different. 
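The effect of scale on distances can be sketched with a tiny example. The `toy` array and the `euclidean` helper below are hypothetical and chosen only to make the point visible: one column is in the thousands, the other in single digits.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: column 0 is in the thousands, column 1 in single digits
toy = np.array([
    [12000.0, 1.0],
    [12500.0, 9.0],
    [3000.0, 1.5],
    [3500.0, 8.5],
])


def euclidean(a, b):
    """Plain Euclidean distance between two rows."""
    return float(np.sqrt(np.sum((a - b) ** 2)))


# Before scaling, distances are driven almost entirely by column 0
raw_01 = euclidean(toy[0], toy[1])
raw_02 = euclidean(toy[0], toy[2])

# After scaling, each column has mean 0 and standard deviation 1,
# so both columns contribute to distances on a comparable footing
toy_scaled = StandardScaler().fit_transform(toy)
scaled_01 = euclidean(toy_scaled[0], toy_scaled[1])
scaled_02 = euclidean(toy_scaled[0], toy_scaled[2])

print(raw_01, raw_02)
print(scaled_01, scaled_02)
```

In the raw data, rows 0 and 1 look like near neighbors even though they differ greatly in column 1. After scaling, that difference carries as much weight as the gap in column 0.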

In the cells below:

* Import and instantiate [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) and use it to transform the dataset  
* Instantiate and fit k-means to this scaled data, and then use it to predict clusters 
* Calculate the adjusted Rand score for these new predictions 

In [19]:
# Scale the columns so that each one contributes comparably to distances
# Some columns (like Fresh) hold much larger numbers than others
# Scaling gives every column zero mean and unit variance
scaler = StandardScaler()
scaled_df = scaler.fit_transform(df)

In [23]:
# Redo k-means on the scaled data
scaled_k_means = KMeans(n_clusters=2, random_state=0)
scaled_k_means.fit(scaled_df)

scaled_preds = scaled_k_means.labels_

In [24]:
# Check ARS again
ars_scaled = adjusted_rand_score(channels,scaled_preds)
print("Adjusted Rand Score(with scaling): ",ars_scaled)

Adjusted Rand Score(with scaling):  0.23664708510864038


That's a big improvement! Although it's not perfect, we can see that scaling our data had a significant effect on the quality of our clusters. 

## Incorporating PCA

Since clustering algorithms are distance-based, dimensionality has a definite effect on their performance. The greater the dimensionality of the dataset, the larger the volume of space in which our clusters could exist. Let's try using Principal Component Analysis to transform our data and see if this affects the performance of our clustering algorithm. 

Since you've already seen PCA in a previous section, we will let you figure this out by yourself. 

In the cells below:

* Import [PCA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html) from the appropriate module in sklearn 
* Create a `PCA` instance and use it to transform our scaled data  
* Investigate the explained variance ratio for each Principal Component. Consider dropping certain components to reduce dimensionality if you feel it is worth the loss of information 
* Create a new `KMeans` object, fit it to our PCA-transformed data, and check the adjusted Rand score of the predictions it makes. 

**_NOTE:_** Your overall goal here is to get the highest possible adjusted Rand score. Don't be afraid to change parameters and rerun things to see how it changes. 
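One way to decide how many components to keep is to look at the cumulative explained variance. The sketch below uses synthetic data as a stand-in for the scaled dataset, and the 90% threshold is an arbitrary assumption for illustration, not a recommendation from the lab.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled data: six columns built from
# two underlying factors plus a little noise
rng = np.random.default_rng(0)
factors = rng.normal(size=(300, 2))
mixing = rng.normal(size=(2, 6))
X_demo = factors @ mixing + 0.1 * rng.normal(size=(300, 6))
X_demo = StandardScaler().fit_transform(X_demo)

# Fit PCA with all components and accumulate the explained variance ratios
full_pca = PCA().fit(X_demo)
cumulative = np.cumsum(full_pca.explained_variance_ratio_)

# Smallest number of components that retains at least 90% of the variance
# (the 0.90 threshold is an assumption)
n_keep = int(np.searchsorted(cumulative, 0.90) + 1)
print(cumulative.round(3), n_keep)
```

For clustering, the component count that retains the most variance is not necessarily the one with the best adjusted Rand score, so it is still worth scoring several values.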

In [30]:
# Use PCA to project the scaled columns onto fewer components
# while keeping as much of the variance as possible

# Instantiate PCA with 2 components as a first attempt
pca = PCA(n_components=2)
# Fit PCA to the scaled data and transform it
pca_df = pca.fit_transform(scaled_df)

# Share of the total variance explained by each component
print("Explained Variance ratio: ", pca.explained_variance_ratio_)

Explained Variance ratio:  [0.44082893 0.283764  ]


### Insights
44.08% + 28.38% = 72.46%, so the first 2 components capture about 72% of the variance in the scaled data, with PC1 explaining the largest share.

The remaining 27.54% of the variance (about 28%) is lost.

In [31]:
# Run k-means again on the PCA-transformed data
pca_kmeans =KMeans(n_clusters = 2, random_state=0)
pca_kmeans.fit(pca_df)

KMeans(n_clusters=2, random_state=0)

In [32]:
# Cluster labels assigned during fitting
pca_preds=pca_kmeans.labels_

In [33]:
# Check the score
ars_pca = adjusted_rand_score(channels, pca_preds)
print("Adjusted Rand Score (with PCA): ", ars_pca)

Adjusted Rand Score (with PCA):  0.23084287036169227


In [35]:
df.shape[1]

6

In [37]:
# Try every possible number of PCA components to see which gives the best score
for n in range(1, df.shape[1]+1): # n runs from 1 up to the number of columns (6)
    pca = PCA(n_components=n) # keep n components on this iteration
    pca_df = pca.fit_transform(scaled_df) # note: overwrites the earlier pca_df
    
    # note: random_state=2 here, unlike the random_state=0 used in earlier cells
    pca_kmeans = KMeans(n_clusters=2, random_state=2)
    pca_kmeans.fit(pca_df)
    
    preds = pca_kmeans.labels_
    ars = adjusted_rand_score(channels, preds)
    
    print(f"PCA with {n} components -> ARS:{ars:.4f} ") # show ARS to 4 decimal places

PCA with 1 components -> ARS:0.1921 
PCA with 2 components -> ARS:0.2308 
PCA with 3 components -> ARS:0.2308 
PCA with 4 components -> ARS:0.2366 
PCA with 5 components -> ARS:0.1996 
PCA with 6 components -> ARS:0.1921 


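### Insights
Every score is above 0, so the clustering is better than random guessing, but all of them are much closer to 0 than to 1, so agreement with the true channels is weak.

The best result (0.2366 with 4 components) matches the score from the scaled data without PCA, so PCA did not improve on scaling alone. The loop also uses `random_state=2` rather than `0`, so part of the variation between runs may come from k-means initialization rather than from PCA.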

**_Question_**:  What was the Highest Adjusted Rand Score you achieved? Interpret this score and determine the overall quality of the clustering. Did PCA affect the performance overall?  How many principal components resulted in the best overall clustering performance? Why do you think this is?

Write your answer below this line:
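**The highest ARS is 0.2366.**

**It is above 0, so the clustering is better than randomly assigning customers to clusters.**

**It is still much closer to 0 than to 1, so the overall quality of the clustering is weak.**

**PCA did not improve performance overall: the best PCA score equals the score from the scaled data without PCA.**

**4 principal components achieved the highest score.**

**This is likely because 4 components retain most of the information in the data while discarding the lowest-variance directions.**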
_______________________________________________________________________________________________________________________________

## Optional (Level up) 

### Hierarchical Agglomerative Clustering

Now that we've tried doing market segmentation with k-means clustering, let's end this lab by trying with HAC!

In the cells below, use [Agglomerative clustering](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html) to make cluster predictions on the datasets we've created and see how HAC's performance compares to k-means' performance. 

**_NOTE_**: Don't just try HAC on the PCA-transformed dataset -- also compare algorithm performance on the scaled and unscaled datasets, as well! 

In [38]:
from sklearn.cluster import AgglomerativeClustering

# HAC on unscaled data
hac_unscaled = AgglomerativeClustering(n_clusters=2)
hac_unscaled_preds = hac_unscaled.fit_predict(df)
print("HAC ARS (unscaled):", adjusted_rand_score(channels, hac_unscaled_preds))

#  HAC on scaled data
hac_scaled = AgglomerativeClustering(n_clusters=2)
hac_scaled_preds = hac_scaled.fit_predict(scaled_df)
print("HAC ARS (scaled):", adjusted_rand_score(channels, hac_scaled_preds))

# HAC on PCA data
# note: pca_df is left over from the last loop iteration above, so it holds all 6 components
hac_pca = AgglomerativeClustering(n_clusters=2)
hac_pca_preds = hac_pca.fit_predict(pca_df)
print("HAC ARS (PCA):", adjusted_rand_score(channels, hac_pca_preds))


HAC ARS (unscaled): -0.01923156414375716
HAC ARS (scaled): 0.022565317001188977
HAC ARS (PCA): 0.022565317001188977


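### Summary (Answer)
- Highest HAC ARS: 0.0226, on both the scaled and the PCA data. This is close to 0 and well below the best k-means score of 0.2366.
- Scaling effect: the unscaled score is slightly negative (-0.0192), so scaling helps only marginally.
- PCA effect: none on the scaled data. `pca_df` here keeps all 6 components, so nothing was dropped before clustering.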
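The identical HAC scores for the scaled and PCA inputs have a plausible explanation: `pca_df` at that point comes from the last loop iteration, which keeps all 6 components. A PCA that keeps every component only centers and rotates the data, so pairwise Euclidean distances do not change. The sketch below checks this on random data (an assumption for illustration, not the wholesale dataset).

```python
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.cluster import AgglomerativeClustering
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score

# Random 6-column data, used only for illustration
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 6))

# Keeping every component only centers and rotates the data
X_rotated = PCA(n_components=6).fit_transform(X_demo)

# Pairwise Euclidean distances are unchanged by that rotation
distances_match = np.allclose(pdist(X_demo), pdist(X_rotated))

# HAC depends on those distances, so the two runs would be expected to agree
labels_original = AgglomerativeClustering(n_clusters=2).fit_predict(X_demo)
labels_rotated = AgglomerativeClustering(n_clusters=2).fit_predict(X_rotated)
agreement = adjusted_rand_score(labels_original, labels_rotated)

print(distances_match, agreement)
```

To test whether dimensionality reduction helps HAC, the PCA step would need to be refit with fewer than 6 components before clustering.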

## Summary

In this lab, you used your knowledge of clustering to perform a market segmentation on a real-world dataset. You started with a cluster analysis with poor performance, and then implemented some changes to iteratively improve the performance of the clustering analysis!