<div style="background-color:#bf80ff; color:#636363;">
    <h1><center>Introduction</center></h1>
</div>

When you perform customer segmentation, you look for similar characteristics in customers' behaviour and needs. Customers who share those characteristics are then grouped together so that each group's demands can be met with its own strategy. Those strategies can feed into:

1. Targeted marketing activities to specific groups
2. Launch of features aligning with the customer demand
3. Development of the product roadmap

In this notebook, I'll be performing Unsupervised Machine Learning with Python to give everyone a basic understanding of how we can segment data into particular groups and find valuable insights from it.

I'll be using 3 algorithms to understand how each algorithm segments the data using different techniques.

The 3 algorithms will be:

1. K-Means Clustering
2. Hierarchical (Agglomerative) Clustering
3. DBSCAN (Density Based Spatial Clustering of Applications with Noise)

<div class="alert alert-block alert-info">
<h4>If you like this notebook, feel free to upvote it! Leave your suggestions and opinions in the comments; I learn a lot from them and they're very valuable to me. Thank you! :D</h4>
</div>

<div style="background-color:#B4DBE9;">
    <center><img src="https://raw.githubusercontent.com/jaykumar1607/Customer-Segmentation/main/source.gif"></center>
</div>

<div style="background-color:#bf80ff; color:#636363;">
    <h1><center>Importing Libraries</center></h1>
</div>

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import seaborn as sns
import plotly.express as px
from plotly.offline import init_notebook_mode
init_notebook_mode(connected=True)
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples,silhouette_score
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram
from sklearn.cluster import DBSCAN
from collections import Counter
from sklearn.decomposition import PCA

<div style="background-color:#bf80ff; color:#636363;">
    <h1><center>Loading the Dataset</center></h1>
</div>

In [None]:
df = pd.read_csv('../input/customer-segmentation-tutorial-in-python/Mall_Customers.csv')

In [None]:
df.head()

<div style="background-color:#bf80ff; color:#636363;">
    <h1><center>Colors</center></h1>
</div>

In [None]:
colors_dark = ["#1F1F1F", "#313131", '#636363', '#AEAEAE', '#DADADA']
colors_mix = ["#17869E", '#264D58', '#179E66', '#D35151', '#E9DAB4', '#E9B4B4', '#D3B651', '#6351D3']

sns.palplot(colors_dark)
sns.palplot(colors_mix)

<div style="background-color:#bf80ff; color:#636363;">
    <h1><center>Visualizations</center></h1>
</div>

In [None]:
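# Build an explicit Gender/Count table so each slice label comes from the data
# rather than from a hardcoded order (value_counts sorts by frequency).
d = df['Gender'].value_counts().rename_axis('Gender').reset_index(name='Count')
fig = px.pie(d,values='Count',names='Gender',hole=0.4,opacity=0.7,
            color_discrete_sequence=[colors_mix[7],colors_mix[2]])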

fig.add_annotation(text='Gender',
                   x=0.5,y=0.5,showarrow=False,font_size=18,opacity=0.7,font_family='monospace')

fig.update_layout(
    font_family='monospace',
    title=dict(text='Gender Ratio',x=0.5,y=0.98,
               font=dict(color=colors_dark[2],size=20)),
    legend=dict(x=0.37,y=-0.05,orientation='h',traceorder='reversed'),
    hoverlabel=dict(bgcolor='white'))

fig.update_traces(textposition='outside', textinfo='percent+label')

fig.show()

In [None]:
fig = px.histogram(df,x='Age',template='plotly_white',opacity=0.7,nbins=25,
                   color_discrete_sequence=[colors_mix[7]])

fig.update_layout(
    font_family='monospace',
    title=dict(text='Distribution Of Age',x=0.5,y=0.95,
               font=dict(color=colors_dark[2],size=20)),
    xaxis_title_text='Age',
    yaxis_title_text='Count',
    legend=dict(x=1,y=0.96,bordercolor=colors_dark[4],borderwidth=0,tracegroupgap=5),
    bargap=0.3,
)
fig.show()

In [None]:
fig = px.histogram(df,x='Annual Income (k$)',template='plotly_white',opacity=0.7,nbins=20,
                   color_discrete_sequence=[colors_mix[7]])

fig.update_layout(
    font_family='monospace',
    title=dict(text='Distribution Of Annual Income',x=0.5,y=0.95,
               font=dict(color=colors_dark[2],size=20)),
    xaxis_title_text='Annual Income (k$)',
    yaxis_title_text='Count',
    legend=dict(x=1,y=0.96,bordercolor=colors_dark[4],borderwidth=0,tracegroupgap=5),
    bargap=0.3,
)
fig.show()

In [None]:
fig = px.histogram(df,x='Spending Score (1-100)',template='plotly_white',opacity=0.7,nbins=20,
                   color_discrete_sequence=[colors_mix[7]])

fig.update_layout(
    font_family='monospace',
    title=dict(text='Distribution Of Spending Score',x=0.5,y=0.95,
               font=dict(color=colors_dark[2],size=20)),
    xaxis_title_text='Spending Score',
    yaxis_title_text='Count',
    legend=dict(x=1,y=0.96,bordercolor=colors_dark[4],borderwidth=0,tracegroupgap=5),
    bargap=0.3,
)
fig.show()

<div style="background-color:#bf80ff; color:#636363;">
    <h1><center>Data Preprocessing</center></h1>
</div>

CustomerID is just an identifier and carries no information for our model, so we'll drop that column.

In [None]:
df.drop('CustomerID',axis=1,inplace=True)

In [None]:
df.columns

Now, we'll encode the labels from the **Gender** feature into numerical values.

In [None]:
df['Gender'] = df['Gender'].apply(lambda x: 0 if x=='Male' else 1)

In [None]:
df['Gender']
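
One thing worth knowing about the lambda used for the encoding: it maps every value other than `'Male'` to 1, including typos or missing entries. As a minimal sketch on a made-up Series (the `gender` column below is hypothetical, not the mall data), an explicit mapping leaves unexpected values as NaN so they can be spotted:

```python
import pandas as pd

# Hypothetical toy column; the last entry is a deliberate typo.
gender = pd.Series(['Male', 'Female', 'Female', 'male'])

# The lambda approach silently encodes the typo as 1.
encoded_lambda = gender.apply(lambda x: 0 if x == 'Male' else 1)

# An explicit mapping turns anything unexpected into NaN instead.
encoded_map = gender.map({'Male': 0, 'Female': 1})

print(encoded_lambda.tolist())
print(encoded_map.isna().sum())
```

If the column is known to contain only the two expected labels, both approaches give the same encoding.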

We'll scale the data because algorithms like K-Means Clustering compute Euclidean (or similar) distances between data points, which makes them sensitive to the scale of each feature: a feature with a larger numeric range would otherwise dominate the distance.
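
To see why feature scale matters for distance-based clustering, here is a small sketch on four made-up customers (the `toy` array is an assumption for illustration, not taken from the dataset). Before scaling, income differences swamp age differences; after standardizing, both features contribute on a comparable footing and the nearest neighbour changes.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up customers: [age in years, annual income in dollars]
toy = np.array([[20, 50_000],
                [60, 50_500],
                [21, 58_000],
                [40, 42_000]], dtype=float)

def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return np.linalg.norm(a - b)

# Raw data: customer 0 looks closer to customer 1 (40 years older)
# than to customer 2 (1 year older), because income dominates.
raw_01, raw_02 = dist(toy[0], toy[1]), dist(toy[0], toy[2])

# Standardized data: each column has mean 0 and unit variance.
scaled = StandardScaler().fit_transform(toy)
sc_01, sc_02 = dist(scaled[0], scaled[1]), dist(scaled[0], scaled[2])

print(raw_01 < raw_02, sc_01 < sc_02)
```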

In [None]:
scaler = StandardScaler()
scaler.fit(df)
X = scaler.transform(df)

<div style="background-color:#bf80ff; color:#636363;">
    <h1><center>Modelling</center></h1>
</div>

We'll look at a few models that we can use to form clusters and segment our customers.

### 1. K-Means Clustering

We'll first start off by trying the elbow method to find out the best value for the **n_clusters** parameter.
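
The elbow method plots the within-cluster sum of squares, which scikit-learn exposes as `inertia_`. As a rough sketch of what that number measures, the snippet below recomputes it by hand on synthetic data (`X_demo` from `make_blobs` is a stand-in, not the customer data) and compares it with the fitted model's value.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data with three blobs.
X_demo, _ = make_blobs(n_samples=150, centers=3, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_demo)

# Look up, for every point, the centre of the cluster it was assigned to,
# then sum the squared distances to those centres.
assigned_centers = km.cluster_centers_[km.labels_]
wcss_by_hand = ((X_demo - assigned_centers) ** 2).sum()

print(wcss_by_hand, km.inertia_)
```

Inertia can only shrink as clusters are added, which is why we look for a bend in the curve rather than its minimum.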

In [None]:
wcss = []   # within-cluster sum of squares (inertia)
ss = []     # silhouette score
for i in range(2,11):
    model = KMeans(n_clusters=i, random_state=10)
    labels = model.fit_predict(X)   # fit and label the scaled data, not the raw df
    wcss.append(model.inertia_)
    ss.append(silhouette_score(X, labels))

In [None]:
fig,ax1 = plt.subplots(figsize=(14,8))

ax1.plot(range(2, 11), wcss , '--', color=colors_mix[2], linewidth=2)
ax1.legend(['Inertia'],bbox_to_anchor=(0.9365,1),frameon=False)
ax1.plot(range(2, 11), wcss , 'o', color=colors_mix[7],alpha=0.7)
ax1.set_ylabel('Inertia')

ax2 = ax1.twinx()
ax2.plot(range(2, 11), ss, '-', color=colors_mix[7], linewidth=2)
ax2.legend(['Silhouette Score'],bbox_to_anchor=(1,0.95),frameon=False)
ax2.plot(range(2, 11), ss, 'o', color=colors_mix[2], alpha=0.7)
ax2.set_ylabel('Silhouette Score')

plt.xlabel('Number of clusters')
plt.show()

<div class="alert alert-block alert-danger">
The inertia curve is smooth, with no abrupt bend, so the elbow method alone can't give us a value of n_clusters that we can be confident about.
</div>

To validate the number of clusters we ask the model for, we'll use **silhouette analysis**.

<div class="alert alert-block alert-info">
<b>3 things to keep in mind when we look at the graph below are:</b><br>
 <ol>
  <li>The difference between the widths of the clusters should not be large.</li>
  <li>Parts of a cluster with a silhouette score &lt; 0 indicate points that are probably assigned to the wrong cluster, so we want as few of those as possible.</li>
  <li>We also need to look at the overall (average) silhouette score of the model for a particular number of clusters.</li>
</ol> 
</div>
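
For reference, the silhouette coefficient of a single point is `(b - a) / max(a, b)`, where `a` is its mean distance to the other points in its own cluster and `b` is its mean distance to the points of the nearest other cluster. The sketch below computes this by hand on synthetic data and checks it against `silhouette_samples`; the helper `silhouette_by_hand` and the `X_demo` blobs are illustrative names, not part of the notebook.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances, silhouette_samples

def silhouette_by_hand(X, labels):
    """Per-sample silhouette coefficient using Euclidean distance."""
    D = pairwise_distances(X)
    unique = np.unique(labels)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own == 1:
            continue  # singleton clusters get a score of 0
        # D[i, i] is 0, so dividing by n_own - 1 excludes the point itself.
        a = D[i, own].sum() / (n_own - 1)
        b = min(D[i, labels == k].mean() for k in unique if k != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores

# Synthetic stand-in data: three blobs, using the blob ids as labels.
X_demo, y_demo = make_blobs(n_samples=60, centers=[[0, 0], [5, 5], [-5, 5]],
                            cluster_std=1.0, random_state=0)

by_hand = silhouette_by_hand(X_demo, y_demo)
print(by_hand.mean())
```

A negative value means the point is, on average, closer to another cluster than to its own.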

In [None]:
range_n_clusters =[2,4,6,8,10]

for n_clusters in range_n_clusters:
    # Create a subplot with 1 row and 2 columns
    fig, (ax1, ax2) = plt.subplots(1, 2)
    fig.set_size_inches(18, 7)

    # The 1st subplot is the silhouette plot
    # The silhouette coefficient can range from -1, 1 but in this example all
    # lie within [-0.1, 1]
    ax1.set_xlim([-0.1, 1])
    # The (n_clusters+1)*10 is for inserting blank space between silhouette
    # plots of individual clusters, to demarcate them clearly.
    ax1.set_ylim([0, len(X) + (n_clusters + 1) * 10])

    # Initialize the clusterer with n_clusters value and a random generator
    # seed of 10 for reproducibility.
    clusterer = KMeans(n_clusters=n_clusters, random_state=10)
    cluster_labels = clusterer.fit_predict(X)

    # The silhouette_score gives the average value for all the samples.
    # This gives a perspective into the density and separation of the formed
    # clusters
    silhouette_avg = silhouette_score(X, cluster_labels)
    print("For n_clusters =", n_clusters,
          "The average silhouette_score is :", silhouette_avg)

    # Compute the silhouette scores for each sample
    sample_silhouette_values = silhouette_samples(X, cluster_labels)

    y_lower = 10
    for i in range(n_clusters):
        # Aggregate the silhouette scores for samples belonging to
        # cluster i, and sort them
        ith_cluster_silhouette_values = \
            sample_silhouette_values[cluster_labels == i]

        ith_cluster_silhouette_values.sort()

        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i

        color = cm.nipy_spectral(float(i) / n_clusters)
        ax1.fill_betweenx(np.arange(y_lower, y_upper),
                          0, ith_cluster_silhouette_values,
                          facecolor=color, edgecolor=color, alpha=0.7)

        # Label the silhouette plots with their cluster numbers at the middle
        ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))

        # Compute the new y_lower for next plot
        y_lower = y_upper + 10  # 10 for the 0 samples

    ax1.set_title("The silhouette plot for the various clusters.")
    ax1.set_xlabel("The silhouette coefficient values")
    ax1.set_ylabel("Cluster label")

    # The vertical line for average silhouette score of all the values
    ax1.axvline(x=silhouette_avg, color="red", linestyle="--")

    ax1.set_yticks([])  # Clear the yaxis labels / ticks
    ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])

    # 2nd Plot showing the actual clusters formed
    colors = cm.nipy_spectral(cluster_labels.astype(float) / n_clusters)
    ax2.scatter(X[:, 0], X[:, 1], marker='.', s=30, lw=0, alpha=0.7,
                c=colors, edgecolor='k')

    # Labeling the clusters
    centers = clusterer.cluster_centers_
    # Draw white circles at cluster centers
    ax2.scatter(centers[:, 0], centers[:, 1], marker='o',
                c="white", alpha=1, s=200, edgecolor='k')

    for i, c in enumerate(centers):
        ax2.scatter(c[0], c[1], marker='$%d$' % i, alpha=1,
                    s=50, edgecolor='k')

    ax2.set_title("The visualization of the clustered data.")
    ax2.set_xlabel("Feature space for the 1st feature")
    ax2.set_ylabel("Feature space for the 2nd feature")

    plt.suptitle(("Silhouette analysis for KMeans clustering on sample data "
                  "with n_clusters = %d" % n_clusters),
                 fontsize=14, fontweight='bold')

plt.show()

<div class="alert alert-block alert-success">
<b>Result: </b>After looking at the Silhouette Analysis graphs, I've chosen n_clusters = 2 for my KMeans model.
</div>
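
Alongside reading the plots, the candidate values can also be ranked by their average silhouette score. This is only a sketch on synthetic data with three well-separated blobs (`X_demo` and the range of `k` are assumptions for illustration); on real data the plots above should still be inspected, since the highest average score is not always the most useful segmentation.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data: three clearly separated blobs.
X_demo, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 10], [-10, 10]],
                       cluster_std=1.0, random_state=0)

# Average silhouette score for each candidate number of clusters.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```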

In [None]:
model = KMeans(n_clusters=2)
predictions = model.fit_predict(X)

In [None]:
# Attach the cluster label of each row as a new column
new_df = df.assign(Cluster=predictions)
new_df.head()

In [None]:
fig = px.histogram(new_df,x='Age',color='Cluster',template='plotly_white',
                  marginal='box',opacity=0.7,nbins=50,color_discrete_sequence=[colors_mix[2],colors_mix[7]],
                  barmode='group',histfunc='count')

fig.update_layout(
    font_family='monospace',
    title=dict(text='Distribution Of Age After Clustering',x=0.5,y=0.95,
               font=dict(color=colors_dark[2],size=20)),
    xaxis_title_text='Age',
    yaxis_title_text='Count',
    legend=dict(x=1,y=0.96,bordercolor=colors_dark[4],borderwidth=0,tracegroupgap=5),
    bargap=0.3,
)
fig.show()

In [None]:
fig = px.histogram(new_df,x='Annual Income (k$)',color='Cluster',template='plotly_white',
                  marginal='box',opacity=0.7,nbins=50,color_discrete_sequence=[colors_mix[2],colors_mix[7]],
                  barmode='group',histfunc='count')

fig.update_layout(
    font_family='monospace',
    title=dict(text='Distribution Of Annual Income After Clustering',x=0.5,y=0.95,
               font=dict(color=colors_dark[2],size=20)),
    xaxis_title_text='Annual Income (k$)',
    yaxis_title_text='Count',
    legend=dict(x=1,y=0.96,bordercolor=colors_dark[4],borderwidth=0,tracegroupgap=5),
    bargap=0.3,
)
fig.show()

In [None]:
fig = px.histogram(new_df,x='Spending Score (1-100)',color='Cluster',template='plotly_white',
                  marginal='box',opacity=0.7,nbins=25,color_discrete_sequence=[colors_mix[2],colors_mix[7]],
                  barmode='group',histfunc='count')

fig.update_layout(
    font_family='monospace',
    title=dict(text='Distribution Of Spending Score After Clustering',x=0.5,y=0.95,
               font=dict(color=colors_dark[2],size=20)),
    xaxis_title_text='Spending Score (1-100)',
    yaxis_title_text='Count',
    legend=dict(x=1,y=0.96,bordercolor=colors_dark[4],borderwidth=0,tracegroupgap=5),
    bargap=0.3,
)
fig.show()

We can use **PCA** to visualize the clusters and observe how our model performed.

In [None]:
pca = PCA(2)
pca.fit(X)
X_PCA = pca.transform(X)
plt.figure(figsize=(15,8))
sns.scatterplot(x=X_PCA[:, 0], y=X_PCA[:, 1], 
                hue=predictions, palette=[colors_mix[7],colors_mix[2]], s=50)
plt.title('Cluster of Customers in 2D', size=15, pad=10)
sns.despine()
plt.legend(loc=0, bbox_to_anchor=[1,1])
plt.show()
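
A 2D PCA plot only shows part of the data's variance, so it helps to check how much the two components retain before reading too much into the picture. A minimal sketch, using random four-feature data as a stand-in for the scaled customer matrix (`X_demo` is an assumption):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Random stand-in for a scaled feature matrix with four columns.
rng = np.random.default_rng(0)
X_demo = StandardScaler().fit_transform(rng.normal(size=(200, 4)))

pca_demo = PCA(n_components=2).fit(X_demo)

# Fraction of the total variance captured by each component.
ratios = pca_demo.explained_variance_ratio_
print(ratios, ratios.sum())
```

If the sum is low, clusters that overlap in the 2D plot may still be well separated in the full feature space.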

We can also use PCA to visualize the clusters in 3D.

In [None]:
pca = PCA(3)
pca.fit(X)
X_PCA = pca.transform(X)


fig = px.scatter_3d(x=X_PCA[:,0], y=X_PCA[:,1], z=X_PCA[:,2],
                    color=predictions,opacity=0.8,color_continuous_scale=[colors_mix[2],colors_mix[7]],
                   width=800,height=800)

fig.update_layout(font_family='monospace',
    title=dict(text='Customer Clusters in 3D',x=0.5,y=0.95,
    font=dict(color=colors_dark[2],size=20)),
    coloraxis_showscale=False)
fig.show()

### 2. Hierarchical Clustering

Agglomerative Clustering is a technique that forms clusters based on the similarity between points. It is a pairwise process: the most similar units are found and merged into a small cluster. It takes a bottom-up approach, starting with every point as its own cluster and repeatedly merging the closest clusters until, at the end of the computation, a single large cluster contains all the sub-clusters and points.
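
The bottom-up merging can be traced on a tiny example. The sketch below uses SciPy's `linkage` with Ward linkage (chosen here because it is the default linkage of scikit-learn's `AgglomerativeClustering`) on four made-up 1D points; each row of the result records one merge.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Four made-up points: two close pairs far apart from each other.
points = np.array([[0.0], [1.0], [10.0], [11.0]])

# Each row of Z is one merge:
# [cluster index a, cluster index b, merge distance, size of the new cluster]
Z = linkage(points, method='ward')
print(Z)

# Cut the tree so that at most two flat clusters remain.
flat = fcluster(Z, t=2, criterion='maxclust')
print(flat)
```

The first merges join the close pairs, and the last row joins everything into one cluster of size 4.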

In [None]:
model = AgglomerativeClustering(distance_threshold=0,n_clusters=None)
model.fit(X)

Code to plot dendrograms:

In [None]:
def plot_dendrogram(model, **kwargs):
    # Create linkage matrix and then plot the dendrogram

    # create the counts of samples under each node
    counts = np.zeros(model.children_.shape[0])
    n_samples = len(model.labels_)
    for i, merge in enumerate(model.children_):
        current_count = 0
        for child_idx in merge:
            if child_idx < n_samples:
                current_count += 1  # leaf node
            else:
                current_count += counts[child_idx - n_samples]
        counts[i] = current_count

    linkage_matrix = np.column_stack([model.children_, model.distances_,
                                      counts]).astype(float)

    # Plot the corresponding dendrogram
    dendrogram(linkage_matrix, **kwargs)

In [None]:
plt.figure(figsize=(14,8))
plot_dendrogram(model,truncate_mode = 'level',p=3)
plt.xlabel("Number of points in node (or index of point if no parenthesis).")
plt.show()

With the help of the dendrogram, we can see 2 distinct clusters being formed by this model.
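
Once a number of clusters has been read off the dendrogram, the model can be refit with that number to obtain a label for every point. A sketch on synthetic two-blob data (`X_demo` is a stand-in for the scaled customer matrix):

```python
from collections import Counter

from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Synthetic stand-in data: two well-separated blobs of 50 points each.
X_demo, _ = make_blobs(n_samples=100, centers=[[0, 0], [8, 8]],
                       cluster_std=1.0, random_state=0)

# Ask for a fixed number of clusters instead of building the full tree.
agg = AgglomerativeClustering(n_clusters=2)
agg_labels = agg.fit_predict(X_demo)

# Number of points assigned to each cluster.
print(Counter(agg_labels))
```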

### 3. DBSCAN

DBSCAN uses 2 major parameters to make clusters:

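1. **eps** - Maximum distance between two points for one to be counted as in the neighbourhood of the other.
2. **min_samples** - Minimum number of points (including the point itself) that must lie within **eps** of a point for it to be considered a core point.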
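
How the two parameters interact is easiest to see on a handful of points. In this sketch (made-up coordinates, with `min_samples=3` chosen only to suit the tiny example), two tight groups become clusters and an isolated point is labelled as noise:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two tight groups of four points each, plus one isolated point.
pts = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
                [5.0, 5.0], [5.1, 5.0], [5.0, 5.1], [5.1, 5.1],
                [20.0, 20.0]])

db = DBSCAN(eps=1, min_samples=3).fit(pts)

# Cluster label per point; -1 marks noise.
print(db.labels_)
# Indices of the points that have at least min_samples neighbours within eps.
print(db.core_sample_indices_)
```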

In [None]:
model = DBSCAN(eps=1,min_samples=5)
cluster_labels = model.fit_predict(X)

In [None]:
silhouette_score(X,cluster_labels)
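
Note that `silhouette_score` treats the noise label -1 like any other cluster, which can distort the result. One option, sketched here on synthetic data with a few hand-placed outliers (`X_demo` and `outliers` are assumptions for illustration), is to also compute the score on the clustered points only and compare the two:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Two synthetic blobs plus three far-away points that should end up as noise.
X_demo, _ = make_blobs(n_samples=200, centers=[[0, 0], [6, 6]],
                       cluster_std=0.6, random_state=0)
outliers = np.array([[3.0, -6.0], [-6.0, 3.0], [12.0, 0.0]])
X_demo = np.vstack([X_demo, outliers])

labels = DBSCAN(eps=1, min_samples=5).fit_predict(X_demo)
mask = labels != -1  # True for points that belong to a cluster

score_all = silhouette_score(X_demo, labels)                    # noise counted as a cluster
score_clustered = silhouette_score(X_demo[mask], labels[mask])  # noise excluded

print(score_all, score_clustered)
```

The second score only makes sense when at least two clusters remain after removing the noise points.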

With the help of **PCA**, we can observe the clusters that the model formed.

<div class="alert alert-block alert-info">
<b>Note: </b>
The label -1 is assigned to points that DBSCAN treats as noise (<b>outliers</b>), since these points aren't part of any cluster.
</div>

In [None]:
pca = PCA(2)
pca.fit(X)
X_PCA = pca.transform(X)
plt.figure(figsize=(15,8))
sns.scatterplot(x=X_PCA[:, 0], y=X_PCA[:, 1], 
                hue=cluster_labels, palette=[colors_mix[3],colors_mix[2],colors_mix[7]], s=50)
plt.title('Cluster of Customers in 2D', size=15, pad=10)
sns.despine()
plt.legend(loc=0, bbox_to_anchor=[1,1])
plt.show()

The distinction between clusters doesn't seem too clear with the 2D representation, so let's make a 3D representation of it to understand how these clusters are distinct from one another.

In [None]:
pca = PCA(3)
pca.fit(X)
X_PCA = pca.transform(X)


fig = px.scatter_3d(x=X_PCA[:,0], y=X_PCA[:,1], z=X_PCA[:,2],
                    color=cluster_labels,opacity=0.8,
                    color_continuous_scale=[colors_mix[3],colors_mix[7],colors_mix[2]],
                    width=800,height=800)

fig.update_layout(font_family='monospace',
    title=dict(text='Customer Clusters in 3D',x=0.5,y=0.95,
    font=dict(color=colors_dark[2],size=20)),
    coloraxis_showscale=False)
fig.show()

Now we can clearly see how these clusters are formed and how these are distinct from one another.

<div style="background-color:#bf80ff; color:#636363;">
    <h1><center>Conclusion</center></h1>
</div>

1. We used **Silhouette Analysis** to choose the value of n_clusters.
2. With K-Means, 2 clusters divided the customers based on **Age** and **Spending Score** quite distinctly.
3. With Hierarchical Clustering, 2 clusters were formed and were visualized using **dendrograms**.
4. With DBSCAN, the clusters formed were different from those made by K-Means, and we also identified the points that were not assigned to any cluster and were termed **outliers**.

<div style="background-color:#bf80ff; color:#636363;">
    <h1><center>Thank You!</center></h1>
</div>