# Lab 8: Introduction to Customer segmentation using Hierarchical Clustering and Dendrograms

# Practice goals:
In this practice, we will take a deep dive into how Visual Analytics sits at the core of one of the most **interesting and necessary applications** in business: customer segmentation.
Through this real use case, we will:
- Use **dendrograms** to identify the number of clusters
- Apply **hierarchical clustering** to segment a customer base
- Generate **business insights** from clustering results


Recall the Analytics workflow we learned in the lecture, which is recommended for any Data Science or MLOps project:



![Figure 1](https://drive.google.com/uc?export=view&id=1Bvv1rKSn61iRdz3M1WcsCwtR7K5n0hTB)


<center> Figure 1</center>

### Due date: during the lab session. Late submissions are not allowed
### Submission procedure: via Moodle.
### Complete with your Name: Luca Franceschi
### Complete with your NIA: 253885



# Context:  Visual Analytics for Customer segmentation

In any sector or industry, knowing and understanding customer behavior is key. As data scientists, we should be ready to answer questions such as: “Which of our current customers should we target for our new product?” or "What are the customers who buy product A, B or D like?". In other words, segmenting the customer base is essential so that an organization can tailor and build targeted strategies.

As explained in the lecture, customer segmentation can be considered an **unsupervised learning** case with no **unique** solution, because the result depends on design choices such as the **similarity** measure and the **number of clusters**. This makes it a challenging use case for a data scientist.

During this lab, our aim is to create a segmentation based on a clustering process that groups customers who are "similar" to each other. We will use hierarchical clustering for this use case. However, it can also be solved with other clustering techniques such as K-Means or Gaussian mixture models.


## Dataset

For this practice, we will use the customer base of an e-commerce business. In particular, the dataset contains the following variables:

- `CustomerID`: identifier of the customer

- `Genre`: takes the values male or female

- `Age`: age of the customer

- `Annual_income`: customer's annual income (in k€)

- `Spending_score`: score (from 1 to 100) assigned by the shop to each customer, based on purchase frequency and other conditions


## Imports

In [None]:
import pandas as pd
import seaborn as sns
from collections import Counter
import matplotlib.pyplot as plt
import numpy as np
import scipy.cluster.hierarchy as sch

# Data understanding and preparation

Once the **Challenge/Problem definition** and **Data gathering** stages are covered (see Figure 1), the next step is the **Exploratory Data Analysis and Transformation**. In this stage, Visual Analytics plays a key role in exploring and understanding the features and the relationships between them. Thanks to this process, a data scientist has more context to decide which algorithm(s) to apply to the data according to the **Challenge/Problem definition**.

Transformation, which also includes **data cleaning, structuring and enriching**, is also known as **data wrangling**.


## Read the data

Let's open the CSV file with separator "," and assign it to a dataframe variable (use `read_csv` from the Pandas library). Let's see the first 5 rows.

In [None]:
# define the dataset location
route="data"
filename = '/Mall_Customers.csv'
sep=","
encoding="utf-8"
total_route=route+filename


# Set Pandas to show all the columns
pd.set_option('display.max_columns', None)
# Read the data as a dataframe, using the separator and encoding defined above
data = pd.read_csv(total_route, sep=sep, encoding=encoding)

In [None]:
data.head()


## Dataset Exploratory Data Analysis (EDA)

[**EX1**] Analyze the main characteristics (type of variables, number of records, nulls, etc.) of the variables of the dataset and answer the questions below:


Tip: [.info()](https://www.geeksforgeeks.org/python-pandas-dataframe-info/) is a function that reports the main characteristics of a dataframe.

Tip: [Counter()](https://docs.python.org/3/library/collections.html) is a class from the **collections** library that counts the number of categories and the number of samples per category in a variable.




In [None]:
# Identify the main characteristics
data.info()
# Summarize the class distribution of the categorical variable
counter = Counter(data['Genre'])
for category, count in counter.items():
    ratio = count / len(data) * 100
    print('Class=%s, Count=%d, Percentage=%.3f%%' % (category, count, ratio))

[**Solution**]

- Which are the types of the variables (integer, float, char...)? All integer, except one object variable (i.e. one categorical variable)
- What is the size (number of records) of the dataset and the file? 200 records
- Which variables have the most nulls? None
- Is there any categorical variable? Yes. Genre
- Which is the ratio between Male and Female in the dataset? Male: 44% and Female: 56%
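
As a complement to `Counter`, the class ratio can also be obtained directly with pandas. The snippet below is a minimal sketch on a small made-up dataframe (the name `toy` and its values are hypothetical, not the lab data); `value_counts` is a standard pandas method.

```python
import pandas as pd

# Hypothetical toy data standing in for the `Genre` column of the lab dataset
toy = pd.DataFrame({"Genre": ["Male", "Female", "Female", "Male", "Female"]})

# Absolute number of samples per category
counts = toy["Genre"].value_counts()

# Relative frequencies, expressed as percentages
percentages = toy["Genre"].value_counts(normalize=True) * 100

# Both Series share the same index (the categories), so they align by label
summary = pd.DataFrame({"count": counts, "percentage": percentages.round(1)})
print(summary)
```

On the lab dataframe, the same two calls on `data["Genre"]` should give the counts and percentages printed by the loop above.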

[**EX2**] Plot the distribution (histogram) of each variable (except `CustomerID`) using the **seaborn** library and answer the following questions:
- Which is the distribution of `Age`, `Annual_income` and `Spending_score`?
- What insights do you obtain from the distribution of these 3 variables?

In [None]:
#With Seaborn
def plot_distribution(data, attr):
    f, axes = plt.subplots(1, 1, figsize=(3, 3), sharex=True)
    ax1= sns.histplot(data, x=attr , color="skyblue", ax=axes, bins=30, kde=True)
    sns.despine(top=True, right=True, left=True)
    f.suptitle(f'{attr} Histogram', fontsize=14)

# Plot the distributions
data_aux=data.iloc[:,2:]
column_names=data_aux.columns

for column in column_names:
    plot_distribution(data_aux, column)

**[Solution]**

Seemingly there is not a clear distribution pattern of the data.

[**EX3**] Use the **Seaborn** library to plot, for each of the `Age`, `Annual_income` and `Spending_score` features, a histogram per class (i.e. male and female) and one plot combining both classes' distributions (i.e. `male` vs `female`).

- Which are the variables with the most differentiated distributions between both genres?
- Which features could be the most interesting to distinguish male vs female? Justify your answer.

[**Solution**]

In [None]:
def plot_distribution_by_genre(data, attr):
    f, axes = plt.subplots(1, 3, figsize=(20, 5), sharex=True)

    ax1 = sns.histplot(data[data['Genre'] == 'Male'], x=attr, bins=30, kde=True, ax=axes[0])
    ax1.set_title(f'{attr}: Male Distribution')

    ax2 = sns.histplot(data[data['Genre'] == 'Female'], x=attr, bins=30, kde=True, ax=axes[1])
    ax2.set_title(f'{attr}: Female Distribution')

    ax3 = sns.histplot(data, x=attr, bins=30, kde=True, ax=axes[2], hue='Genre', multiple="stack")
    ax3.set_title(f'{attr}: Combined Distribution')

    sns.despine(top=True, right=True, left=True)
    f.suptitle(f'{attr} Distribution', fontsize=14)

    plt.show()

data_aux = data.iloc[:, 2:]
column_names = data_aux.columns

for column in column_names:
    plot_distribution_by_genre(data, column)

They seem to follow similar distributions. 

[**EX4**] Using **Seaborn** libraries, draw a box plot for `Age`, `Annual_income` and `Spending_score` against male and female categories. Answer the following questions:
- Which variable presents more outliers?
- What are Q1, Q2 (the median) and Q3 of these variables for the `female` records?

[**Solution**]

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(20, 5))

sns.boxplot(x='Genre', y='Age', data=data, hue='Genre', ax=axes[0])
axes[0].set_title('Age Distribution by Gender')

sns.boxplot(x='Genre', y='Annual_income', data=data, hue='Genre', ax=axes[1])
axes[1].set_title('Annual Income Distribution by Gender')

sns.boxplot(x='Genre', y='Spending_score', data=data, hue='Genre', ax=axes[2])
axes[2].set_title('Spending Score Distribution by Gender')

plt.tight_layout()
plt.show()

`Annual_income` presents the most outliers. For example, in the first plot, Q1 is around 30, the median is around 35, and Q3 is around 48.

[**EX5**] To understand the relationship between features, the correlation matrix is a key tool. Calculate the correlation matrix for the original dataset and plot it as a **heatmap** showing the **correlation value** for each pair of variables.

[**Solution**]

In [None]:
corr_matrix = data.corr(numeric_only=True)
sns.set_theme(rc = {'figure.figsize':(15, 10)})
sns.heatmap(corr_matrix, annot=True)
plt.show()

[**EX6**] Execute this 3D scatter plot between `Age`, `Annual_income` and `Spending_score` using the **plotly** library. Answer the following questions:
- Do you think a decision boundary to classify the dataset between `male` and `female` is feasible? Justify your answer.
- Do you identify any group or cluster in the data? Justify your answer.

**Tip**: Visit [plotly's 3D Scatter Plot with go.Scatter3d](https://plotly.com/python/3d-scatter-plots/) for further information

In [None]:
#Import the library
import plotly as py
import plotly.graph_objs as go
from plotly.offline import init_notebook_mode, iplot

male=data[data["Genre"]=="Male"]
female=data[data["Genre"]=="Female"]

# Female
female_scatter = go.Scatter3d(
                        x = female.Age,
                        y = female.Annual_income,
                        z = female.Spending_score,
                        mode = 'markers',
                        opacity = 0.7,
                        name = "female",
                        marker = dict(size = 3)
)

# Male
male_scatter= go.Scatter3d(
                        x = male.Age,
                        y = male.Annual_income,
                        z = male.Spending_score,
                        mode = 'markers',
                        opacity = 0.7,
                        name = "male",
                        marker = dict(size = 3)
)


list_3d = [female_scatter, male_scatter]

fig_3d = go.Figure(data = list_3d)
iplot(fig_3d)

[**Solution**]

I do not think it is feasible with this set of variables, since the data points do not seem to follow a different pattern depending on `Genre`.
The data is definitely clustered in small chunks (forming a roughly X-shaped scatter plot).

## Data wrangling: Data normalization

### Values normalization
Before applying hierarchical clustering, we should consider whether the data needs to be normalized so that all variables are on the same scale.
Why is this important? Clustering relies on distances between samples. If the variables are not on the same scale, the distances (and therefore the model) become biased towards the variables with a higher magnitude.
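
The effect of scale on distances can be illustrated with a small numeric sketch. The three customers below are made up for illustration (they are not taken from the lab dataset), and income is deliberately expressed in euros rather than k€ to exaggerate the difference in magnitude.

```python
import numpy as np

# Hypothetical customers: [age in years, annual income in euros]
a = np.array([25.0, 40_000.0])
b = np.array([60.0, 41_000.0])   # very different age, similar income
c = np.array([26.0, 41_500.0])   # similar age, somewhat different income

def euclidean(u, v):
    """Euclidean distance between two feature vectors."""
    return float(np.sqrt(np.sum((u - v) ** 2)))

# On the raw scale, the income difference dominates the distance
raw_ab = euclidean(a, b)
raw_ac = euclidean(a, c)

# Standardize each feature (column) to zero mean and unit variance
X = np.vstack([a, b, c])
Z = (X - X.mean(axis=0)) / X.std(axis=0)
std_ab = euclidean(Z[0], Z[1])
std_ac = euclidean(Z[0], Z[2])

print(f"raw:          d(a,b)={raw_ab:.1f}  d(a,c)={raw_ac:.1f}")
print(f"standardized: d(a,b)={std_ab:.2f}  d(a,c)={std_ac:.2f}")
```

On the raw values, `a` looks closer to `b` even though their ages differ by 35 years, because the age difference is negligible next to the income difference. After standardization both features contribute, and `a` becomes closer to `c`.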


[**EX7**] Based on the outcome of the previous section, i.e. the EDA (Exploratory Data Analysis), do you consider that **data normalization** is required? If so, normalize the data and re-plot a histogram per class (i.e. `male` vs `female`) and one plot combining both classes' distributions.

[**Solution**]

We should normalize the data, since some features (e.g. income) can take much larger values than the others.

[**EX8**] Execute the following code to normalize data and plot the same charts as **EX3**. Is there any significant difference? Justify your reason.

In [None]:
# If we decide to normalize the data, we could use sklearn's normalize(),
# which by default rescales each row (sample) to unit L2 norm:
from sklearn.preprocessing import normalize
data_scaled = normalize(data.iloc[:,2:])
data_scaled = pd.DataFrame(data_scaled, columns=data.columns[2:])
data_scaled['Genre']=data['Genre']
data_scaled = data_scaled[["Genre", "Age", "Annual_income", "Spending_score"]]
data_scaled.head()

[**Solution**]

In [None]:
data_aux = data_scaled.iloc[:,1:]
column_names = data_aux.columns

for column in column_names:
    plot_distribution_by_genre(data_scaled, column)

I would say there is no significant difference.
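
As a side note, `sklearn.preprocessing.normalize` works row by row: by default it rescales each sample to unit L2 norm, which is different from putting every variable on the same scale. The sketch below contrasts it with `StandardScaler`, which works column by column. The array `X` holds made-up rows for illustration, and the comparison is offered as an optional check rather than as a replacement for the lab's code.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, normalize

# Hypothetical rows: [Age, Annual_income, Spending_score]
X = np.array([[19.0, 15.0, 39.0],
              [35.0, 60.0, 50.0],
              [50.0, 120.0, 10.0]])

# normalize(): rescales each ROW (sample) to unit Euclidean (L2) norm by default
X_rows = normalize(X)
row_norms = np.linalg.norm(X_rows, axis=1)
print("row norms after normalize():", row_norms)

# StandardScaler: rescales each COLUMN (feature) to zero mean and unit variance
X_cols = StandardScaler().fit_transform(X)
print("column means after StandardScaler:", X_cols.mean(axis=0).round(6))
print("column stds after StandardScaler: ", X_cols.std(axis=0).round(6))
```

Which of the two is appropriate depends on the goal: equalizing the scale of the variables, as motivated in this section, corresponds to the column-wise option.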

# Clustering modelling

## Hierarchical clustering review

This section is based on the following web pages: https://www.analyticsvidhya.com/blog/2019/05/beginners-guide-hierarchical-clustering/ and https://medium.com/@daython3/segmenting-customer-groups-a-comparative-study-of-hierarchical-and-k-means-clustering-methods-for-f35946d52810


Agglomerative hierarchical clustering is a bottom-up approach, where each data point starts as a separate cluster and then clusters are successively merged based on their similarity until a single cluster containing all data points is formed.

K-means clustering, in contrast, is a partitional approach, where the number of clusters is fixed in advance and the algorithm assigns each data point to the nearest cluster centre, known as a centroid. The algorithm then updates the centroids based on the mean of the data points assigned to each cluster and repeats the process until the centroids converge.

Both hierarchical and k-means clustering have their own advantages and disadvantages, and the choice of approach depends on the data and the goals of the analysis. Understanding the differences between these two types of clustering algorithms is crucial for choosing the right approach for a given task.



### Setting up the Example

Suppose a teacher wants to divide her students into different groups. She has the marks scored by each student in an assignment and based on these marks, she wants to segment them into groups. There’s no fixed target here as to how many groups to have. Since the teacher does not know what type of students should be assigned to which group, it cannot be solved as a supervised learning problem. So, we will try to apply hierarchical clustering here and segment the students into different groups.

Let’s take a sample of 5 students:

![Figure 2](https://drive.google.com/uc?export=view&id=1ouNaYCv0a1EurdKwV5c2iLEMSW_H2Ap2)




### Creating Proximity Matrix

First, we will create a proximity matrix which will tell us the distance between each of these points. Since we are calculating the distance of each point from each of the other points, we will get a square matrix of shape n X n (where n is the number of observations).

Let’s make the 5 x 5 proximity matrix for our example:


![Figure 3](https://drive.google.com/uc?export=view&id=1gRMOj6yV3AUPFulmiw30EyD-FdJn-Ppv)




The diagonal elements of this matrix will always be 0 as the distance of a point with itself is always 0. We will use the Euclidean distance formula to calculate the rest of the distances. So, let’s say we want to calculate the distance between point 1 and 2:

$$\sqrt{(10-7)^2} = \sqrt{9} = 3$$

Similarly, we can calculate all the distances and fill the proximity matrix.
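
The proximity matrix can also be computed programmatically. The following is a minimal sketch using SciPy's `pdist` and `squareform`. Only the first two marks (10 and 7) appear in the text; the other three values are assumed for illustration.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Marks of 5 students. Only the first two (10 and 7) come from the text;
# the remaining three are assumed values for illustration.
marks = np.array([10, 7, 28, 20, 35], dtype=float).reshape(-1, 1)

# pdist returns the condensed vector of pairwise Euclidean distances;
# squareform expands it into the full symmetric n x n matrix.
proximity = squareform(pdist(marks, metric="euclidean"))
print(proximity)
```

The diagonal is zero, the matrix is symmetric, and the entry for points 1 and 2 (`proximity[0, 1]`) matches the distance of 3 computed by hand.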

### Steps to Perform Hierarchical Clustering

- **Step 1:** First, we assign all the points to an individual cluster:

![Figure 4](https://drive.google.com/uc?export=view&id=1VMW8Z1BDqZZVPcV-16_l6TI5B8HY6mJ2)



Different colors here represent different clusters. You can see that we have 5 different clusters for the 5 points in our data.

- **Step 2:** Next, we will look at the smallest distance in the proximity matrix and merge the points with the smallest distance. We then update the proximity matrix:

![Figure 5](https://drive.google.com/uc?export=view&id=1ceoOiUKxXSQHoDRZDxMhYqlfKfE2sgbN)



Here, the smallest distance is 3 and hence we will merge point 1 and 2:


![Figure 6](https://drive.google.com/uc?export=view&id=1sAw0Zh97az9EeIyiZ4061EFUIxcU8xXV)



Let’s look at the updated clusters and accordingly update the proximity matrix:


![Figure 7](https://drive.google.com/uc?export=view&id=1Wr7lFzU3uwdLSn551kBy07WJhUhPb-VP)



Here, we have taken the maximum of the two marks (7, 10) to replace the marks for this cluster. Instead of the maximum, we can also take the minimum value or the average values as well. Now, we will again calculate the proximity matrix for these clusters:

![Figure 8](https://drive.google.com/uc?export=view&id=1HAWuMjHWXAyXl2GuKlY16P8DCsYPmoAD)


- **Step 3:** We will repeat step 2 until only a single cluster is left.

So, we will first look at the minimum distance in the proximity matrix and then merge the closest pair of clusters. We will get the merged clusters as shown below after repeating these steps:

![Figure 9](https://drive.google.com/uc?export=view&id=1LbP4QAKB6erJR3OK8JNxW8G431KsjHIT)



We started with 5 clusters and finally have a single cluster. This is how agglomerative hierarchical clustering works. But the burning question still remains – how do we decide the number of clusters? Let’s understand that in the next section.
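
The sequence of merges can be inspected with `scipy.cluster.hierarchy.linkage`. The sketch below is an illustration under two assumptions: the marks beyond the first two are made up, and SciPy's `complete` linkage (cluster distance = largest pairwise distance) is used as the merge rule, so the merge distances need not match the numbers shown in the figures.

```python
import numpy as np
import scipy.cluster.hierarchy as sch

# Marks of 5 students; only 10 and 7 come from the text, the rest are assumed
marks = np.array([10, 7, 28, 20, 35], dtype=float).reshape(-1, 1)

# Complete linkage: the distance between two clusters is the largest
# pairwise distance between their members (an assumption for this sketch)
Z = sch.linkage(marks, method="complete")

# Each row of Z describes one merge: [cluster_a, cluster_b, distance, new_size].
# Original points have ids 0..n-1; the cluster created by merge i gets id n + i.
for step, (a, b, dist, size) in enumerate(Z, start=1):
    print(f"Step {step}: merge {int(a)} and {int(b)} "
          f"at distance {dist:.1f} -> cluster of size {int(size)}")
```

With n points there are always n - 1 merges, and the last row of `Z` corresponds to the single cluster containing every point.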



### How should we Choose the Number of Clusters in Hierarchical Clustering?

Ready to finally answer this question that’s been hanging around since we started learning? To get the number of clusters for hierarchical clustering, we make use of an awesome concept called a Dendrogram.

Let’s get back to our teacher-student example. Whenever we merge two clusters, a dendrogram will record the distance between these clusters and represent it in graph form. Let’s see what a dendrogram looks like:


![Figure 10](https://drive.google.com/uc?export=view&id=1wfegqu3s6LbWpas5aIbXzvgKY1FjViwK)


We have the samples of the dataset on the x-axis and the distance on the y-axis. Whenever two clusters are merged, we will join them in this dendrogram and the height of the join will be the distance between these points. Let’s build the dendrogram for our example:


![Figure 9](https://drive.google.com/uc?export=view&id=1LbP4QAKB6erJR3OK8JNxW8G431KsjHIT)


Take a moment to process the above image. We started by merging sample 1 and 2 and the distance between these two samples was 3 (refer to the first proximity matrix in the previous section). Let’s plot this in the dendrogram:


![Figure 11](https://drive.google.com/uc?export=view&id=1O_KRhAMWN8WnbodvRod7d7KPSVeNAq9T)



Here, we can see that we have merged sample 1 and 2. The vertical line represents the distance between these samples. Similarly, we plot all the steps where we merged the clusters and finally, we get a dendrogram like this:

![Figure 12](https://drive.google.com/uc?export=view&id=1xayyhM-iFItJyMfP7WdoXcQhJE5CiR0V)


We can clearly visualize the steps of hierarchical clustering. The longer the vertical lines in the dendrogram, the greater the distance between the clusters they join.

Now, we can set a threshold distance and draw a horizontal line (Generally, we try to set the threshold in such a way that it cuts the tallest vertical line). Let’s set this threshold as 12 and draw a horizontal line:



![Figure 13](https://drive.google.com/uc?export=view&id=1Xlx2ZuzaiiNH7OsAN80pJV-MAGg2KVp8)


The number of clusters will be the number of vertical lines intersected by the threshold line. In the above example, since the red line intersects 2 vertical lines, we will have 2 clusters. One cluster will contain samples (1, 2, 4) and the other will contain samples (3, 5). Pretty straightforward, right?

This is how we can decide the number of clusters using a dendrogram in hierarchical clustering. In the next section, we will implement hierarchical clustering, which will help you understand the concepts covered in this review.
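
The threshold rule can be sketched in code with SciPy's `fcluster`. The example below uses six made-up 2-D points (not the student marks) and picks the threshold in the middle of the largest gap between consecutive merge distances, which is one possible way to automate the "tallest vertical line" heuristic.

```python
import numpy as np
import scipy.cluster.hierarchy as sch

# Hypothetical 2-D data: two well-separated groups of three points each
X = np.array([[1.0, 1.0], [1.5, 1.0], [1.0, 1.5],
              [8.0, 8.0], [8.5, 8.0], [8.0, 8.5]])

Z = sch.linkage(X, method="ward")
heights = Z[:, 2]          # merge distances, in increasing order

# The largest gap between consecutive merge distances corresponds to the
# tallest vertical stretch of the dendrogram that no merge interrupts
gaps = np.diff(heights)
i = int(np.argmax(gaps))
threshold = (heights[i] + heights[i + 1]) / 2

# Cut the tree at the threshold: only merges below it are kept
labels = sch.fcluster(Z, t=threshold, criterion="distance")
n_clusters = len(np.unique(labels))
print(f"threshold={threshold:.2f}, clusters={n_clusters}, labels={labels}")
```

Here the cut separates the two groups of points. On real data the largest gap is only a starting point, and the final number of clusters should also make sense for the use case.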

## Modelling

Let's create a new dataframe based on `Age`, `Annual_income`and `Spending_score`.

In [None]:
data.head()

### Looking for the most descriptive variables: first clustering model

One of the critical aspects of the clustering process is selecting the variables from which the clustering model can identify groups that are meaningful from a real use case perspective. This process is usually iterative: we start with the set of features that is most significant from a business perspective and apply a clustering model. If the results are not good enough in terms of clustering performance, we repeat the process with another set of features until the outcome performs well from an analytics perspective and is also understandable from a real use case perspective.

[**EX9**] Execute the following piece of code, which creates an `X` matrix formed by the `Annual_income` and `Spending_score` variables, then builds and plots a dendrogram using the **Ward methodology**.

Answer the following questions:

- Which is the longest blue line?
- How many clusters do you choose? Justify your answer

Tip: Use the [cluster.hierarchy](https://docs.scipy.org/doc/scipy/reference/cluster.hierarchy.html) library from scipy to create a dendrogram

In [None]:
X = data.iloc[:, 3:].values
hier_clust=sch.linkage(X, method = 'ward')
dendrogram = sch.dendrogram(hier_clust)

plt.title('Dendrogram')
plt.xlabel('Customers')
plt.ylabel('Euclidean distance')
plt.show()

[**Solution**]

The largest blue line is the one between clusters 2 and 3 (largest distance).
I would say that there are 3 main clusters, with possibility of dividing into 5 depending on the use case, though it would require a more in-depth analysis.

[**EX10**] Execute the following piece of code, which plots a truncated version of the same dendrogram (built with the **Ward methodology** on the `Annual_income` and `Spending_score` variables). What are the main differences in terms of visualization with respect to **EX9**?

In [None]:
dendrogram = sch.dendrogram(hier_clust, truncate_mode = 'level',
           p = 5,
           show_leaf_counts=True,
           no_labels= True)

plt.title('Dendrogram')
plt.xlabel('Customers')
plt.ylabel('Euclidean distance')
plt.show()

[**Solution**]

This dendrogram is truncated to its top levels (`truncate_mode = 'level'` with `p = 5`), so it looks cleaner at the leaves.

[**EX10**] Execute the following Agglomerative Clustering using the sklearn library with the **number of clusters** set to 5. Let's verify that each sample has been assigned a cluster label within the range given by the **number of clusters** (i.e. 0 to 4).

In [None]:
from sklearn.cluster import AgglomerativeClustering
hc = AgglomerativeClustering(n_clusters = 5,
                    metric = 'euclidean',
                    linkage = 'ward')

y_hc = hc.fit_predict(X)

In [None]:
y_hc
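
One way to run this verification is to count the samples per label with `np.unique`. The sketch below uses 50 random 2-D points as a hypothetical stand-in for the lab's `X` matrix (the names `X_toy` and `labels` are illustrative).

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Hypothetical stand-in for the lab's X matrix: 50 random 2-D points
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 100, size=(50, 2))

# Ward linkage uses Euclidean distances, which is the default metric
labels = AgglomerativeClustering(n_clusters=5, linkage="ward").fit_predict(X_toy)

# One label per sample, and the labels are the integers 0..4
cluster_ids, sizes = np.unique(labels, return_counts=True)
print(dict(zip(cluster_ids.tolist(), sizes.tolist())))
```

Applied to the lab's `y_hc`, the same `np.unique(y_hc, return_counts=True)` call lists the cluster labels and how many customers fall in each one.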

[**EX11**] Execute the following scatter plot between the `Annual_income` and `Spending_score` features. Describe in your own words what type of customers are in cluster 1 and cluster 4.

In [None]:
plt.scatter(X[y_hc == 0, 0], X[y_hc == 0, 1], s = 100, c = 'red', label = 'Cluster 1')
plt.scatter(X[y_hc == 1, 0], X[y_hc == 1, 1], s = 100, c = 'blue', label = 'Cluster 2')
plt.scatter(X[y_hc == 2, 0], X[y_hc == 2, 1], s = 100, c = 'green', label = 'Cluster 3')
plt.scatter(X[y_hc == 3, 0], X[y_hc == 3, 1], s = 100, c = 'cyan', label = 'Cluster 4')
plt.scatter(X[y_hc == 4, 0], X[y_hc == 4, 1], s = 100, c = 'magenta', label = 'Cluster 5')
plt.title('Clusters of customers')
plt.xlabel('Annual Income (k€)')
plt.ylabel('Spending Score (1-100)')
plt.legend()
plt.show()

[**Solution**]

Cluster 1 contains people with high income and low spending, while cluster 4 contains people with low income and high spending (bad idea).

[**EX12**] Repeat **EX9**, **EX10** and **EX11** including the `Age` variable. Answer the following questions:
- Which are the differences in the dendrogram visualization?
- Represent the `Spending_score` vs `Annual_income`, `Annual_income` vs `Age` and `Spending_score` vs `Age` scatter plots

[**Solution**]

In [None]:
X = data.iloc[:, 2:].values
hier_clust=sch.linkage(X, method = 'ward')
dendrogram = sch.dendrogram(hier_clust, truncate_mode = 'level',
           p = 5,
           show_leaf_counts=True,
           no_labels= True)

plt.title('Dendrogram')
plt.xlabel('Customers')
plt.ylabel('Euclidean distance')
plt.show()

In [None]:
# Refit the model on the three features so that the labels match the new X
hc = AgglomerativeClustering(n_clusters = 5,
                    metric = 'euclidean',
                    linkage = 'ward')
y_hc = hc.fit_predict(X)

# Columns of X: 0 = Age, 1 = Annual_income, 2 = Spending_score
colors = ['red', 'blue', 'green', 'cyan', 'magenta']
# (x column, y column, x label, y label) for each scatter plot
plots = [(1, 2, 'Annual Income (k€)', 'Spending Score (1-100)'),
         (1, 0, 'Annual Income (k€)', 'Age'),
         (2, 0, 'Spending Score (1-100)', 'Age')]

plt.figure(figsize=(20, 5))

for i, (col_x, col_y, label_x, label_y) in enumerate(plots, start=1):
    plt.subplot(1, 3, i)
    # One scatter call per cluster so that each cluster gets its own color
    for k, color in enumerate(colors):
        plt.scatter(X[y_hc == k, col_x], X[y_hc == k, col_y],
                    s = 100, c = color, label = f'Cluster {k + 1}')
    plt.title('Clusters of customers')
    plt.xlabel(label_x)
    plt.ylabel(label_y)

plt.legend()
plt.show()

[**Solution**]

# Business insights

[**EX13**] Our marketing department would like to group customers into 5 types based on a hierarchical clustering with the `Annual_income` and `Spending_score` features. Describe each cluster from a statistical point of view (**number of samples, max, min, mean and standard deviation**) in terms of the `Annual_income` and `Spending_score` features.

[**Solution**]

In [None]:
X = data.iloc[:, 3:].values
hier_clust=sch.linkage(X, method = 'ward')
dendrogram = sch.dendrogram(hier_clust, truncate_mode = 'level',
           p = 5,
           show_leaf_counts=True,
           no_labels= True)

plt.title('Dendrogram')
plt.xlabel('Customers')
plt.ylabel('Euclidean distance')
plt.show()

hc = AgglomerativeClustering(n_clusters = 5,
                    metric = 'euclidean',
                    linkage = 'ward')

y_hc = hc.fit_predict(X)

plt.scatter(X[y_hc == 0, 0], X[y_hc == 0, 1], s = 100, c = 'red', label = 'Cluster 0')
plt.scatter(X[y_hc == 1, 0], X[y_hc == 1, 1], s = 100, c = 'blue', label = 'Cluster 1')
plt.scatter(X[y_hc == 2, 0], X[y_hc == 2, 1], s = 100, c = 'green', label = 'Cluster 2')
plt.scatter(X[y_hc == 3, 0], X[y_hc == 3, 1], s = 100, c = 'cyan', label = 'Cluster 3')
plt.scatter(X[y_hc == 4, 0], X[y_hc == 4, 1], s = 100, c = 'magenta', label = 'Cluster 4')
plt.title('Clusters of customers')
plt.xlabel('Annual Income (k€)')
plt.ylabel('Spending Score (1-100)')

plt.legend()
plt.show()

In [None]:
data['cluster'] = y_hc

In [None]:
data[data['cluster'] == 0].describe()

In [None]:
data[data['cluster'] == 1].describe()

In [None]:
data[data['cluster'] == 2].describe()

In [None]:
data[data['cluster'] == 3].describe()

In [None]:
data[data['cluster'] == 4].describe()
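
The five `describe()` calls can also be condensed into a single table with `groupby` and `agg`. This is a minimal sketch on a made-up dataframe (the name `toy` and its values are hypothetical); the column names mirror those of the lab dataset.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the lab dataframe
toy = pd.DataFrame({
    "Annual_income":  [15, 16, 17, 78, 79, 81],
    "Spending_score": [39, 81, 6, 17, 93, 5],
    "cluster":        [0, 0, 0, 1, 1, 1],
})

# One row per cluster, one column per (feature, statistic) pair
stats = (toy.groupby("cluster")[["Annual_income", "Spending_score"]]
            .agg(["count", "max", "min", "mean", "std"]))
print(stats)
```

On the lab data, replacing `toy` with `data` should give the requested statistics for all five clusters side by side.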

[**EX14**] If our sales team wants to launch a new campaign, answer the following questions:

- Which cluster(s) would you recommend for a retention campaign aimed at high-value customers?
- What are the customers with the most potential to improve their score like?
- We have a very limited budget for a new commercial campaign. Which group would you select? Justify your answer.

[**Solution**]

- I would recommend clusters 2 and 3, for instance, since they have a high spending score.
- Cluster 0 is the one with the most potential to improve its score.
- I would choose clusters 0 and 2, since they already have some spending score (they are already somewhat convinced) and they have room for improvement because they have the highest incomes.