# Clustering the Mall Customers

In this project we work with a small dataset containing information on customers visiting a mall, and use the k-means algorithm to look for interesting patterns among them.

In [1]:
# Importing the required libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
# Importing the dataset

data=pd.read_csv("Mall_Customers.csv")

In [3]:
#Exploring the data
data.head()

Unnamed: 0,CustomerID,Gender,Age,Annual Income (k$),Spending Score (1-100)
0,1,Male,19,15,39
1,2,Male,21,15,81
2,3,Female,20,16,6
3,4,Female,23,16,77
4,5,Female,31,17,40


# Data Preprocessing and Preparation

We have the following columns:
1. CustomerID - Unique identification of every customer
2. Gender - The gender of every customer
3. Age - The Age of every customer
4. Annual Income - The annual income of every customer, in thousands of dollars (k$)
5. Spending Score - The spending score in the range 1-100. The higher the score, the better the spending power.

The CustomerID column is not useful for clustering since it is only an identification number. Hence we will not use that column in our analysis.

In [4]:
# Drop the customer id column
data_clean=data.drop('CustomerID',axis=1)

In [5]:
# Check for missing values
data_clean.isnull().sum()

Gender                    0
Age                       0
Annual Income (k$)        0
Spending Score (1-100)    0
dtype: int64

In [6]:
data_clean.columns

Index(['Gender', 'Age', 'Annual Income (k$)', 'Spending Score (1-100)'], dtype='object')

In [7]:
#Change the column names
data_clean.columns=['Gender','Age','Annual_Income','Spending_Score']

The missing-value check above shows that the data has no missing values.

In [8]:
data_clean.describe(include='all')

Unnamed: 0,Gender,Age,Annual_Income,Spending_Score
count,200,200.0,200.0,200.0
unique,2,,,
top,Female,,,
freq,112,,,
mean,,38.85,60.56,50.2
std,,13.969007,26.264721,25.823522
min,,18.0,15.0,1.0
25%,,28.75,41.5,34.75
50%,,36.0,61.5,50.0
75%,,49.0,78.0,73.0


In [9]:
data_clean.dtypes

Gender            object
Age                int64
Annual_Income      int64
Spending_Score     int64
dtype: object

As we can see above, the Gender column is a categorical variable. Since k-means is a distance-based algorithm, we can pass only numeric values to it. Hence we need to convert this column to numeric.
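As a minimal sketch of this conversion, assuming a tiny made-up frame (the `toy` data is hypothetical and not taken from the dataset), `pd.get_dummies` with `drop_first=True` turns a two-level column into a single 0/1 indicator and leaves numeric columns untouched:

```python
import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "Age": [19, 23, 31, 45],
})

# drop_first=True keeps n-1 indicator columns per categorical variable;
# dtype=int gives 0/1 integers instead of booleans
toy_encoded = pd.get_dummies(toy, drop_first=True, dtype=int)
print(toy_encoded)
```

The single `Gender_Male` column carries the full information of the original two-level variable, which is why one of the two dummies can be dropped.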

In [10]:
data_clean=pd.get_dummies(data_clean,drop_first=True,dtype=int)
# Converts all categorical columns to dummies by default
# drop_first=True drops the first category so as to get n-1 columns
# dtype=int gives 0/1 integers rather than booleans
data_clean.dtypes

In [11]:
# For comparison only: OneHotEncoder applied to the whole frame treats every
# column, including the numeric ones, as categorical. The result is not used below.
from sklearn.preprocessing import OneHotEncoder
enc=OneHotEncoder(drop='first')
a=enc.fit_transform(data_clean).toarray()
a.shape



(200, 197)

Now that our data is cleaned, we first need to scale the variables. This is required because k-means uses a distance metric and is therefore affected by the scale of each variable.

In [12]:
# Scaling the variables
from sklearn.preprocessing import StandardScaler
sc=StandardScaler()
data_scaled=sc.fit_transform(data_clean)
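To illustrate why scale matters for a distance-based method, here is a small sketch on three hypothetical customers (the numbers are made up, and income is expressed in plain dollars to exaggerate the effect). Before scaling, the income column dominates the Euclidean distance. After `StandardScaler`, a large age gap and a large income gap count about equally:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical customers: columns are [Age, Annual income in dollars]
X = np.array([
    [20.0, 30000.0],
    [60.0, 31000.0],
    [21.0, 90000.0],
])

def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(a - b))

# Unscaled: income dominates, so the 40-year age gap between
# customers 0 and 1 contributes almost nothing to the distance
raw_01 = dist(X[0], X[1])
raw_02 = dist(X[0], X[2])

# Scaled: each column has mean 0 and unit variance
X_scaled = StandardScaler().fit_transform(X)
scaled_01 = dist(X_scaled[0], X_scaled[1])
scaled_02 = dist(X_scaled[0], X_scaled[2])

print(f"raw:    d(0,1)={raw_01:.1f}, d(0,2)={raw_02:.1f}")
print(f"scaled: d(0,1)={scaled_01:.2f}, d(0,2)={scaled_02:.2f}")
```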

# Identifying number of clusters

In [None]:
# Using the elbow method to identify the number of clusters
from sklearn.cluster import KMeans
wcss=[]  # within-cluster sum of squares (observations to their centroid) for each number of clusters

for i in range(1,11):
    # Try values from 1 to 10
    # k-means++ avoids the random initialization trap
    kmeans=KMeans(n_clusters=i,init='k-means++',random_state=4)
    kmeans.fit(data_scaled)
    # inertia_: sum of squared distances of samples to their closest cluster center
    wcss.append(kmeans.inertia_)

In [None]:
plt.plot(range(1,11),wcss)
plt.title('Elbow method to determine number of clusters')
plt.xlabel('Number of Clusters')
plt.ylabel('WCSS')
plt.show()

From the above graph we can see that the WCSS initially decreases rapidly, but after 5 clusters it decreases very slowly. Hence we take 5 as the number of clusters.
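Reading an elbow by eye is somewhat subjective. One common cross-check, which is not part of the original analysis, is the silhouette score. The sketch below uses synthetic blobs with three known groups (an assumption made purely for illustration) and reports the k with the highest mean silhouette:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups, only for illustration
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)

scores = {}
for k in range(2, 8):  # the silhouette score needs at least two clusters
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=4)
    labels = km.fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores)
print("k with the highest silhouette:", best_k)
```

On the mall data the same loop could be run on `data_scaled`. If the elbow and the silhouette disagree, that is a hint that the number of clusters is not clear-cut.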

# Fitting the Model

In [None]:
#Fitting the model
kmeans=KMeans(n_clusters=5,init='k-means++',random_state=4)
cluster_preds=kmeans.fit_predict(data_scaled)

In [None]:
kmeans.cluster_centers_

In [None]:
kmeans.labels_  # identical to cluster_preds
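As a sanity check of how these fitted attributes relate, the sketch below uses synthetic data (an assumption for illustration, not the mall dataset). It confirms that `fit_predict` returns the assignments stored in `labels_`, and recomputes the within-cluster sum of squares by hand from `cluster_centers_` and `labels_` to compare it with `inertia_`:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data, only for illustration
X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=4)
preds = km.fit_predict(X)

# fit_predict returns the same assignments that are stored in labels_
same_labels = np.array_equal(preds, km.labels_)

# WCSS by hand: squared distance from each point to its assigned centre
assigned_centres = km.cluster_centers_[km.labels_]
wcss_manual = float(((X - assigned_centres) ** 2).sum())

print(same_labels, wcss_manual, km.inertia_)
```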

In [None]:
# Assigning the cluster labels to a new column
data_clean['Cluster_Labels']=pd.Series(cluster_preds,index=data_clean.index)

# Analyzing the clusters

In [None]:
# Create a column for females to count their numbers for every cluster
data_clean['Gender_Female']=np.where(data_clean['Gender_Male']==0,1,0)

In [None]:
pd.pivot_table(data_clean,index=['Cluster_Labels'],values=['Gender_Male','Age','Annual_Income','Spending_Score','Gender_Female']
              ,aggfunc={'Gender_Male':'sum',  # number of males per cluster
                       'Age':'mean',
                       'Annual_Income':'mean',
                       'Spending_Score':'mean',
                       'Gender_Female':'sum'}  # number of females per cluster
              )


From the above clusters we can make the following conclusions:
1. Cluster 0: average age of 55, a decent annual income of about 52 k$, made up of males, with spending slightly on the lower side.
2. Cluster 1: average age of 36, very high annual income, an almost equal proportion of males and females, and very low spending.
3. Cluster 2: average age of 24, lower annual income, a higher proportion of females than males, and spending on the higher side.
4. Cluster 3: average age of 46, lower annual income, mainly males, and spending on the lower side.
5. Cluster 4: average age of 32, very high annual income, an equal proportion of males and females, and spending on the higher side.

Overall the clusters make sense. We have a group of people with high income but a low spending score, and a group of people with low income but a very high spending score. The spending score also relates to age in an intuitive way: younger people tend to spend slightly more than older ones.


# Visualizing the clusters

In [None]:
# Visualizing the clusters
from sklearn.decomposition import PCA
import seaborn as sns
# Perform PCA to reduce the variables to 2 components so that they can be visualized on a graph
pca=PCA(n_components=2,random_state=1)
# Pass the scaled data to PCA
pca_data=pd.DataFrame(pca.fit_transform(data_scaled),columns=['PC1','PC2'])
#Assigning the cluster labels to the PCA data
pca_data['cluster']=pd.Series(cluster_preds,index=pca_data.index)
# Visualizing the clusters
plt.figure(figsize=(8,5))
sns.scatterplot(x='PC1',y='PC2',hue='cluster',data=pca_data,palette=('tab10'))
plt.show()

As we can see above, the blue, green and red clusters (clusters 0, 2 and 3 respectively) are fairly well separated. Clusters 1 and 4 overlap, so we could consider combining them going forward.
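A 2-D PCA plot shows only part of the variation in the scaled data, so apparent overlap between clusters can partly be a projection effect. As a hedged sketch, using random standardized data rather than the mall dataset, `explained_variance_ratio_` tells how much of the total variance the two plotted components retain:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 200 rows, 4 features, only for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[:, 1] += 0.8 * X[:, 0]  # introduce some correlation between two features

X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2, random_state=1)
components = pca.fit_transform(X_scaled)

# Fraction of the total variance captured by each retained component
ratios = pca.explained_variance_ratio_
retained = float(ratios.sum())
print(ratios, retained)
```

If the retained fraction is low, clusters that overlap in the plot may still be separated along the discarded components, so it is worth checking before merging them.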