<a href="https://www.kaggle.com/code/martinab/customer-segmentation-k-means-clustering?scriptVersionId=113557522" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Customer Segmentation (K-Means Clustering)

#### Introduction 

Data for customer segmentation was obtained through the membership cards of a supermarket mall. We have some basic data about customers, such as customer ID, age, gender, annual income and spending score.

Spending Score is something we assign to the customer based on our defined parameters like customer behavior and purchasing data.

In this project we would like to identify the potential customer base for selling a new product. To do so, we will use an unsupervised learning technique called KMeans clustering. 

#### Clustering 

Clustering is the task of partitioning the dataset into groups, called clusters. The goal is to split the data in such a way that points within a single cluster are very similar and points in different clusters are different. Clustering algorithms assign a number to each data point, indicating which cluster a particular point belongs to. 

KMeans clustering is one of the simplest and most commonly used clustering algorithms. It tries to find cluster centers that are representative of certain regions of the data. 

The algorithm alternates between two steps:
1. assigning each data point to the closest cluster center,
2. setting each cluster center as the mean of the data points that are assigned to it.

The algorithm is finished when the assignment of instances to clusters no longer changes.
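
The two alternating steps can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the implementation used later in this notebook: the function name `kmeans_sketch`, the choice to start from the first k points, and the toy data are all assumptions made for the example. scikit-learn's `KMeans` differs, for example in how it initialises the centers.

```python
import numpy as np

def kmeans_sketch(points, k, n_iter=100):
    """Minimal KMeans loop: assign points to the nearest center, then recompute means."""
    # Illustrative initialisation: use the first k data points as starting centers.
    centers = points[:k].astype(float)
    labels = None
    for _ in range(n_iter):
        # Step 1: assign each point to the closest cluster center.
        distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        # Finished when the assignment of points to clusters no longer changes.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 2: set each center to the mean of the points assigned to it.
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    return labels, centers

# Two well-separated toy blobs (made-up data, not the mall dataset).
toy = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])
toy_labels, toy_centers = kmeans_sketch(toy, k=2)
print(toy_labels)
print(toy_centers)
```

Even though both starting centers lie in the same blob here, the assignments settle after a few iterations with one center per blob.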

#### Targets

By the end of this project, we will:
 -  Know how to achieve customer segmentation in Python in the simplest way, using a machine learning algorithm (KMeans Clustering).
 -  Identify our target customers, with whom we can most easily start a marketing strategy.

#### Acknowledgements

The data for this project comes from Kaggle's [Mall Customer Segmentation Data](https://www.kaggle.com/vjchoudhary7/customer-segmentation-tutorial-in-python/kernels) dataset.

Literature:
 - A. Géron: Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow. O'Reilly, 2019, ISBN: 978-1-492-03264-9.
 - A. C. Müller & S. Guido: Introduction to Machine Learning with Python. O'Reilly, 2017, ISBN: 978-1-449-36941-5.

## Importing Libraries 

In [1]:
# Importing libraries:

# Importing numpy, pandas, matplotlib and seaborn:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Imports for plotly:
import plotly.graph_objs as go
import plotly.express as px
import plotly.figure_factory as ff
from plotly.subplots import make_subplots


# To keep graphs within the notebook:
%matplotlib inline

# To hide warnings
import warnings
warnings.filterwarnings('ignore')

### Loading the Data

In [2]:
# Read data from saved csv file:
df = pd.read_csv('../input/customer-segmentation-tutorial-in-python/Mall_Customers.csv')

## Exploratory Data Analysis  

In [3]:
# Display first five rows of dataframe:
header = ff.create_table(df.head())

header.show()

In [4]:
# Function to describe variables
def desc(df):
    d = pd.DataFrame(df.dtypes,columns=['Data_Types'])
    d = d.reset_index()
    d['Columns'] = d['index']
    d = d[['Columns','Data_Types']]
    d['Missing'] = df.isnull().sum().values    
    d['Uniques'] = df.nunique().values
    return d


descr = ff.create_table(desc(df))

descr.show()

In [5]:
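# Explore dataframe's statistics (numerical values only).
# index=True keeps the statistic names (count, mean, std, ...) as the first column;
# the table gets its own name so it does not overwrite the desc() function above.
stats_table = ff.create_table(df.describe(), index=True)

stats_table.show()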

From the description, customers' ages range from 18 to 70, with an average of just under 39 years. The dataset contains 200 samples and no missing values.

In [6]:
# Count customers per gender (despite its name, age_df holds gender counts):
age_df = pd.DataFrame(df.groupby(['Gender'])['Gender'].count())

In [7]:
#Gender distribution of shoppers:


data=go.Bar(x = age_df.index
           , y = age_df.Gender
           ,  marker=dict( color=['#FF0000', '#0000FF'])
           )



layout = go.Layout(title = 'Number of Customers split by Gender'
                   , xaxis = dict(title = 'Gender')
                   , yaxis = dict(title = 'Volume')
                  )

fig = go.Figure(data,layout)
fig.show()

In our dataset we have 112 female and 88 male samples, so female representation is higher.

In [8]:
# Box plot for Annual Income by Gender:

fig = px.box(df
             , x='Gender'
             , y='Annual Income (k$)'
             , points='all'
             , color='Gender'
             , title='Box plot of Annual Income by Gender'
             #, width = 950
            )

fig.show()

In this sample, males have a higher annual income on average than females. For comparison, the highest male annual income is 137k, while the highest female annual income is 126k, which is 11k lower. 

In [9]:
# Box Plot for Spending Score split by Gender:

fig = px.box(df
             , x='Gender'
             , y='Spending Score (1-100)'
             , points="all"
             , color='Gender'
             , title='Box plot of Spending Score by Gender'
            )

fig.show()

In terms of Spending Score, both genders have a median of 50. The lower quartile (Q1) for males is 23, which is 12 lower than Q1 for females. Overall, females' Spending Scores appear slightly higher. 

Looking at the scatter points for both genders, we can see some gaps. We could divide customers into three groups by Spending Score: low (0-40), medium (40-60) and high (>60).
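
One way to turn these bands into a column is `pd.cut`. The sketch below uses a handful of made-up scores rather than the mall data, and it assumes the boundary values 40 and 60 belong to the lower band, which the text above leaves open.

```python
import pandas as pd

# Made-up scores for illustration; in this notebook the input would be
# df['Spending Score (1-100)'].
scores = pd.Series([5, 39, 40, 55, 60, 61, 99], name='Spending Score (1-100)')

# Bin edges follow the bands above. Intervals are right-closed by default,
# so 40 falls into 'low' and 60 into 'medium' (an assumption about the boundaries).
groups = pd.cut(scores, bins=[0, 40, 60, 100], labels=['low', 'medium', 'high'])

# Number of customers per band, in a fixed low/medium/high order.
counts = groups.value_counts().reindex(['low', 'medium', 'high'])
print(counts)
```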

In [10]:
# Scatters for Age vs. Spending Score split by Gender:

fig = px.scatter(df
                 , x='Age'
                 , y='Spending Score (1-100)'
                 , color = 'Gender'
                 , facet_col='Gender'
                 , color_discrete_map={'Female': '#FF0000', 'Male': '#0000FF'}   # Gender is categorical, so map colours per category
                 , render_mode="webgl"
                # , width = 950
                )

fig.show()

In [11]:
# Scatter for Annual Income vs. Spending Score split by Gender:

fig = px.scatter(df
                 , x='Annual Income (k$)'
                 , y='Spending Score (1-100)'
                 , color = 'Gender'
                 , facet_col='Gender'
                 , color_discrete_map={'Female': '#FF0000', 'Male': '#0000FF'}   # Gender is categorical, so map colours per category
                 , render_mode="webgl"
                )

fig.show()

In [12]:
# Histograms, Distribution of Annual Income, Age and Spending Score:

fig = make_subplots(rows=1
                    , cols=3
                    ,subplot_titles=('Annual Income', 'Age', 'Spending Score'))


trace0 = go.Histogram(x=df['Annual Income (k$)']
                      , xbins=dict(start=15
                                   , end=140
                                   , size= 5)
                      , autobinx=False
                      , opacity=0.7
                     )
trace1 = go.Histogram(x=df['Age']
                      , xbins=dict(start=18
                                   , end=98
                                   , size= 5)
                      , autobinx=False
                      , opacity=0.7
                     )
trace2 = go.Histogram(x=df['Spending Score (1-100)']
                      , xbins=dict(start=1
                                   , end=100
                                   , size= 2)
                      , autobinx=False
                      , opacity=0.7
                     )

fig.add_trace(trace0, row=1, col=1)
fig.add_trace(trace1, row=1, col=2)
fig.add_trace(trace2, row=1, col=3)

# Update xaxis properties
fig.update_xaxes(title_text='Annual Income (k$)', row=1, col=1)
fig.update_xaxes(title_text='Age', row=1, col=2)
fig.update_xaxes(title_text='Spending Score (1-100)',  row=1, col=3)


# Update yaxis properties
fig.update_yaxes(title_text='count', row=1, col=1)

# Update title and height
fig.update_layout(title_text='Distributions of Annual Income, Age and Spending Score', height=600)


fig.show()

In [13]:
# Scatter graph for Annual Income vs Spending Score by Gender:

fig = px.scatter(df
                 , x='Annual Income (k$)'
                 , y='Spending Score (1-100)'
                 , color= 'Gender'
                 , marginal_y='rug'
                 , marginal_x='histogram'
                )
fig.show()

In [14]:
# Scatter graph for Annual Income vs. Spending Score by Age:

fig = px.scatter(df
                 , x='Annual Income (k$)'
                 , y='Spending Score (1-100)'
                 , color= 'Age'
                 , marginal_y='box'
                 , marginal_x='histogram'
                )
fig.show()

In [15]:
# Scatter graph for Age vs. Spending Score by Annual Income:

fig = px.scatter(df
                 , x='Age'
                 , y= 'Spending Score (1-100)'
                 , color= 'Annual Income (k$)'
                 , marginal_y='box'
                 , marginal_x='histogram'
                )
fig.show()

In [16]:
# 3D Scatter graph for Annual Income, Spending Score and Age:

fig = px.scatter_3d(df
                    , x='Annual Income (k$)'
                    , y='Spending Score (1-100)'
                    , z='Age'
                    , color='Annual Income (k$)'
                    , size='Spending Score (1-100)'
                   )

fig.show()

In [17]:
# Correlation matrix for Mall dataset features:

corr = df[['Age', 'Annual Income (k$)', 'Spending Score (1-100)']].corr()

# Take the axis labels from the correlation matrix itself so that every cell
# is labelled with the pair of features it actually belongs to.
fig = go.Figure(data=go.Heatmap(
                   z=corr.values
                 , x=list(corr.columns)
                 , y=list(corr.index)
                 , hoverongaps = False))

fig.update_layout(title='Correlation for Features of Mall data')


fig.show()

## KMeans Clustering 

The KMeans algorithm clusters data by trying to separate samples into n groups of equal variance, minimizing a criterion known as the inertia or within-cluster sum-of-squares. **This algorithm requires the number of clusters to be specified.** It scales well to a large number of samples and has been used across a wide range of application areas in many different fields.

The k-means algorithm divides a set of samples X into disjoint clusters, each described by the mean of the samples in the cluster. The means are commonly called the cluster “centroids”; note that they are not, in general, points from X, although they live in the same space.


### What is the optimal number of clusters? 

The default number of clusters in sklearn is 8. Deciding on the optimal number of clusters k is not an easy task, and the result can be quite bad if k is set to the wrong number. 

#### Selecting the number of clusters using inertia

The K-means algorithm aims to choose centroids that minimise the inertia, or within-cluster sum-of-squares criterion:


$$\sum\limits_{i=0}^{n}\min_{\mu_{j}\in C}(\lVert x_i - \mu_j \rVert^2)$$
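
To make the formula concrete, the sketch below recomputes the inertia by hand and compares it with the `inertia_` attribute of a fitted `KMeans` model. The points are made up for illustration, and `n_init=10` and `random_state=0` are arbitrary choices for this example only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up 2-D points for illustration (not the mall dataset).
points = np.array([[1.0, 2.0], [1.5, 1.8], [1.0, 0.6],
                   [8.0, 8.0], [9.0, 9.5], [8.5, 9.0]])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

# Squared distance from every point to every centroid, shape (n_points, n_clusters).
sq_dist = ((points[:, None, :] - model.cluster_centers_[None, :, :]) ** 2).sum(axis=2)

# Inertia: for each point keep the squared distance to its closest centroid, then sum.
manual_inertia = sq_dist.min(axis=1).sum()

print(manual_inertia, model.inertia_)
```

The two printed values should agree up to floating-point rounding.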
 


In [18]:
# Calculate inertia for k-clusters:

from sklearn.cluster import KMeans

X = df[['Age', 'Annual Income (k$)', 'Spending Score (1-100)']]

l = []
for i in range(2, 11):
    kmeans = KMeans(n_clusters = i, random_state = 123)
    kmeans.fit(X)
    l.append(kmeans.inertia_)

df_1= pd.DataFrame(l, columns=['Inertia'])
df_1['k'] = df_1.index+2

In [19]:
# Line graph for Inertia vs. k-Clusters

fig = go.Figure()
fig.add_trace(go.Scatter(x=df_1.k
                         , y=df_1.Inertia
                         , mode='lines'
                         , name='inertia lines'
                        )
             )

fig.add_trace(go.Scatter(x=df_1.k, y=df_1.Inertia,
                    mode='markers', name='inertia point'))

fig.update_layout(title='The Total Sum of Squares Method'
                  , xaxis_title='k-clusters'
                  , yaxis_title='Inertia'
                 )

fig.show()

Inertia on its own is not a good performance metric for choosing k, because it keeps getting lower as we increase k. 
As we can see, the inertia drops very quickly as we increase k up to 5, but then decreases much more slowly as k grows further. The curve has roughly the shape of an arm, and we can consider the point at k=5 to be the 'elbow'. It is not entirely clear where the 'elbow' is, though; it could also be at 4 or 6.

So 4, 5 or 6 would be a good choice for the number of clusters. Lower values would leave the inertia dramatically higher, and higher values would not help much, as we might be splitting perfectly good clusters in half for no good reason.

#### Selecting the number of clusters using silhouette score

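The silhouette score is considered a more precise approach than the inertia 'elbow'. It is the mean silhouette coefficient over all the instances. An instance's silhouette coefficient is equal to:

$$\frac{b-a}{\max(a,b)}$$

where a is the mean distance to the other instances in the same cluster and b is the mean nearest-cluster distance, i.e. the mean distance to the instances of the closest cluster that the instance does not belong to. The coefficient ranges from -1 to +1; values close to +1 mean the instance sits well inside its own cluster.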
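
As a sanity check on the definition, the sketch below computes the coefficient by hand for a few made-up points with hand-assigned labels and compares it with scikit-learn's `silhouette_samples` and `silhouette_score`. The helper name `silhouette_of` is hypothetical and exists only for this example.

```python
import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

# Made-up points with hand-assigned cluster labels (illustration only).
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
labels = np.array([0, 0, 0, 1, 1, 1])

def silhouette_of(i):
    """Silhouette coefficient (b - a) / max(a, b) for the point at index i."""
    dist = np.linalg.norm(points - points[i], axis=1)
    same = (labels == labels[i])
    same[i] = False  # exclude the point itself from its own cluster mean
    a = dist[same].mean()
    # b: smallest mean distance to the points of any other cluster.
    b = min(dist[labels == c].mean() for c in np.unique(labels) if c != labels[i])
    return (b - a) / max(a, b)

manual = np.array([silhouette_of(i) for i in range(len(points))])

print(manual)
print(silhouette_samples(points, labels))
print(manual.mean(), silhouette_score(points, labels))
```

The mean of the per-point coefficients is the silhouette score that is plotted against k.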

In [20]:
# Calculate silhouette score for k-clusters:

from sklearn.metrics import silhouette_score

m = []

for i in range(2,11):
    kmeans = KMeans(n_clusters = i, random_state = 123)
    kmeans.fit(X)
    # Mean silhouette coefficient over all samples for this k:
    sil_coeff = silhouette_score(X, kmeans.labels_, metric='euclidean')
    m.append(sil_coeff)


df_2= pd.DataFrame(m, columns=['Score'])
df_2['k'] = df_2.index+2
#print(df_2)


In [21]:
# Line graph for silhouette score vs. k-clusters

fig = go.Figure()
fig.add_trace(go.Scatter(x=df_2.k
                         , y=df_2.Score
                         , mode='lines'
                         , name='score lines'
                        )
             )

fig.add_trace(go.Scatter(x=df_2.k
                         , y=df_2.Score
                         , mode='markers'
                         , name='score point'))

fig.update_layout(title='The Silhouette Score Method'
                  , xaxis_title='k-clusters'
                  , yaxis_title='Score'
                 )

fig.show()

As we can see, this visualisation is much richer than the elbow method. It confirms that k=6 is a good choice for the number of clusters. It also shows that k=5 is quite a good choice too, and much better than k=4 or k=7.

We will choose k = 5 (if we later see that we would benefit from a higher or lower number of clusters, k can be replaced by 6 or 4).

### Clustering 

Now that we have decided on a candidate number of clusters, let's move on to the actual KMeans clustering.

In [22]:
# Fit data to KMeans clustering with 5 clusters, assign labels and centroids: 
# Note: no random_state is set here, so the cluster numbering can differ between runs.

kmeans= KMeans(n_clusters = 5)
kmeans.fit(X)

labels = kmeans.labels_
centroids = kmeans.cluster_centers_


In [23]:
print('Cluster membership: \n{}'.format(labels))

Cluster membership: 
[3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3
 2 3 2 3 2 3 4 3 2 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
 4 4 4 4 4 4 4 4 4 4 4 4 0 1 0 4 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0
 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0]


In [24]:
print('Centroids: \n{}'.format(centroids))

Centroids: 
[[32.69230769 86.53846154 82.12820513]
 [40.32432432 87.43243243 18.18918919]
 [25.27272727 25.72727273 79.36363636]
 [45.2173913  26.30434783 20.91304348]
 [43.12658228 54.82278481 49.83544304]]


In [25]:
# Work on a copy so that adding columns does not modify a slice of df:
X = X.copy()
X['clusters'] = labels.tolist()
X['Id'] = df['CustomerID']

In [26]:
# Display first 5 rows of dataset X:

head = ff.create_table(X.head())
head.show()

In [27]:
# Scatter graph of clusters, centroids are displayed as black markers:

X['clusters'] = X['clusters'].astype(str)

fig = px.scatter(X
                 , x='Annual Income (k$)'
                 , y='Spending Score (1-100)'
                 , color='clusters'
                 , title='Customer Segmentation (k=5)'
                )

fig.add_trace(go.Scatter(x=centroids[:,1]
                         , y=centroids[:,2]
                         , mode='markers'
                         , name='centroids'
                         , marker=dict(color='black')
                        )
             )
              
              
fig.show()

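When introducing a new product, we would like to target (send marketing materials to) customers in the two clusters with a high Spending Score. Cluster numbers are assigned arbitrarily and can change between runs, so the numbers below refer to the centroids printed above, where these are clusters 0 and 2.

Cluster 2 has a low annual income similar to cluster 3 (between 15k - 40k). However, cluster 2 has a high Spending Score.
Cluster 0 has a high annual income similar to cluster 1, but again a much higher Spending Score than cluster 1.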
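
Because KMeans numbers its clusters arbitrarily and the fit above sets no `random_state`, hard-coded cluster numbers can go stale between runs. A more robust option is to pick target clusters from the centroid values themselves. The sketch below uses the centroids printed earlier (rounded to two decimals); the names `centroid_df` and `target_clusters` and the cut-off of 60 are assumptions for illustration, the latter borrowed from the high Spending Score band (>60) suggested in the exploratory analysis.

```python
import pandas as pd

# Centroids copied from the printed output above, rounded to two decimals
# (columns: Age, Annual Income, Spending Score).
centroid_df = pd.DataFrame(
    [[32.69, 86.54, 82.13],
     [40.32, 87.43, 18.19],
     [25.27, 25.73, 79.36],
     [45.22, 26.30, 20.91],
     [43.13, 54.82, 49.84]],
    columns=['Age', 'Annual Income (k$)', 'Spending Score (1-100)'],
)
centroid_df.index.name = 'cluster'

# Assumed cut-off: a cluster counts as "high spending" if its mean score exceeds 60.
high_spending = centroid_df['Spending Score (1-100)'] > 60
target_clusters = centroid_df.index[high_spending].tolist()

print(centroid_df[high_spending])
print(target_clusters)
```

In a live notebook, the same table could be built from `kmeans.cluster_centers_` so the selection follows whatever numbering the current run produced.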