# **Mall Customers Segmentation Project**

Customer Relationship Management (CRM) seeks to build relationships with the most profitable clients by segmenting customers and designing appropriate marketing tools. This matters particularly in competitive retail markets, where the sociodemographic diversity of consumers and the specialisation of sellers and buyers force companies to manage their clients dynamically in order to achieve higher profits and gain a larger share of the market than their competitors.



# **1. Import Libraries**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set_style('whitegrid')

# **2. Load Data**

In [None]:
df = pd.read_csv('../input/customer-segmentation-tutorial-in-python/Mall_Customers.csv')

In [None]:
df.head(2)

In [None]:
df.shape

In [None]:
df.info()

In [None]:
df.describe()

# **3. Exploratory Data Analysis**

Check the percentage of null values in each column.

In [None]:
100*df.isnull().sum()/df.shape[0]

**Customer Gender Visualization**

Create a count plot of the gender distribution across the mall customer dataset.

In [None]:
# Pass the column by keyword; recent seaborn versions no longer accept it positionally.
sns.countplot(x='Gender', data=df)
plt.xlabel('Gender')
plt.ylabel('Number of Customers')
plt.title('Customer Distribution by Gender')

**Customer Distribution by Age**

Create a histogram of the customer ages.

In [None]:
# histplot replaces the deprecated distplot; it draws no KDE by default.
sns.histplot(df['Age'], bins=30)
plt.ylabel('Frequency')
plt.title('Customer Distribution by Age')

In [None]:
# histplot replaces the deprecated distplot; it draws no KDE by default.
sns.histplot(df[df['Gender'] == 'Male']['Age'], bins=30, color='blue', label='Male')
sns.histplot(df[df['Gender'] == 'Female']['Age'], bins=30, color='red', label='Female')
plt.title('Customer Distribution by Age and Gender')
plt.ylabel('Number of Customers')
plt.legend()

In [None]:
sns.boxplot(x=df['Age'])

**Annual Income Analysis**

In [None]:
# histplot replaces the deprecated distplot; it draws no KDE by default.
sns.histplot(df['Annual Income (k$)'], bins=25)
plt.title('Customer Distribution by Annual Income (k$)')
plt.ylabel('Frequency')

In [None]:
# histplot replaces the deprecated distplot; it draws no KDE by default.
sns.histplot(df[df['Gender'] == 'Male']['Annual Income (k$)'], bins=30, color='blue', label='Male')
sns.histplot(df[df['Gender'] == 'Female']['Annual Income (k$)'], bins=30, color='red', label='Female')
plt.title('Customer Distribution by Annual Income (k$) and Gender')
plt.ylabel('Number of Customers')
plt.legend()

In [None]:
sns.boxplot(x=df['Annual Income (k$)'])

In [None]:
# histplot replaces the deprecated distplot; kde=True keeps the density curve.
sns.histplot(df['Spending Score (1-100)'], kde=True)

In [None]:
# histplot replaces the deprecated distplot; it draws no KDE by default.
sns.histplot(df[df['Gender'] == 'Male']['Spending Score (1-100)'], color='blue', label='Male')
sns.histplot(df[df['Gender'] == 'Female']['Spending Score (1-100)'], color='red', label='Female')
plt.legend()

In [None]:
sns.boxplot(x=df['Spending Score (1-100)'])

**Create a pairplot of the Mall Customer Data.**

In [None]:
sns.pairplot(data=df,hue='Gender',palette='coolwarm')

**Create a heatmap of the Mall Customer Data.**

In [None]:
plt.figure(figsize=(12,6))
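# Gender is still a string column at this point, so correlate the numeric columns only.
sns.heatmap(df.select_dtypes('number').corr(),cmap='viridis',annot=True)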

There seems to be a strong positive correlation between CustomerID and Annual Income (k$). Let us investigate this further by plotting an lmplot.

In [None]:
sns.lmplot(x='Annual Income (k$)',y='CustomerID',data=df)
plt.title('CustomerID vs. Annual Income (k$)')

# **4. Feature Engineering**

Check the data types of all the columns of the dataframe.

In [None]:
df.dtypes

In [None]:
df['Gender'].unique()

In [None]:
df.Gender.value_counts()

In [None]:
dmap = {'Male':1,'Female':0}
df['Gender'] = df['Gender'].map(dmap)

In [None]:
df.head(5)

# **5. Customer Segmentation via K-Means Clustering**

k-means clustering is a method of vector quantization that aims to partition ***n*** observations into ***k*** clusters, in which each observation belongs to the cluster with the nearest mean (the cluster center, or centroid), which serves as a prototype of the cluster. The less variation there is within a cluster, the more homogeneous (similar) its data points are.

The k-means algorithm can be summarised as follows:

1. Specify the number of clusters to be created.
2. Initialize the centroids by shuffling the dataset and then randomly selecting ***k*** data points without replacement.
3. Keep iterating until the centroids stop changing, i.e. until the assignment of data points to clusters no longer changes:
   - Compute the squared distance between each data point and every centroid.
   - Assign each data point to the closest cluster (centroid).
   - Recompute each centroid as the average of all the data points that belong to its cluster.
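
The steps above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `kmeans_sketch`, its parameters, and the toy data are made up for this example, and it is not how scikit-learn implements `KMeans`.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Plain k-means loop; illustrative only (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # Step 2: pick k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        # Assign each point to its closest centroid.
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points
        # (keep the old centroid if a cluster ends up empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 3: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two toy blobs (made-up data, not the mall dataset).
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels, centroids = kmeans_sketch(X, k=2)
print(centroids)
```

Random initialization like this can settle in a poor local optimum, which is one reason the cells further down use `init='k-means++'`.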

**Determining Optimal Clusters**

Determining the optimal number of clusters is a fundamental step in partitioning methods such as k-means, which require the user to specify the number of clusters ***k*** to be generated.

The methods used to determine the optimal number of clusters can be categorised into ***Direct Methods*** and ***Statistical Testing Methods***.

1. Direct Methods optimise a criterion, such as the within-cluster sum of squares or the average silhouette. The *elbow* and *silhouette* methods are in this category.
2. Statistical Testing Methods compare the evidence against a null hypothesis. The *gap statistic* is one example.



* **Elbow method**

The well-known elbow method runs k-means for a range of values of ***k*** and records the inertia, or total intra-cluster variation, for each. Inertia always decreases as clusters are added, so there is a trade-off between the inertia and the number of clusters: the chosen ***k*** is the point where the curve bends and adding further clusters yields only small reductions.

* **Silhouette method**

The silhouette score measures how well an observation is clustered by comparing its average distance to the points in its own cluster with its average distance to the points in the nearest other cluster. The method looks for the number of clusters that divides the dataset into dense blocks that are well separated from each other.

The value lies between -1 and 1. A value near 1 indicates a well-clustered observation, a value near 0 indicates overlapping clusters, and negative values generally indicate that an observation has been assigned to the wrong cluster.
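
As a rough sketch of how the silhouette method could be applied, the snippet below fits k-means for several values of k and keeps the one with the highest mean silhouette score. For one observation the score is (b - a) / max(a, b), where a is the mean distance to points in its own cluster and b is the mean distance to points in the nearest other cluster. The synthetic `make_blobs` data is an assumption standing in for the scaled mall data.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data (an assumption; swap in the scaled customer data).
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=42)

scores = {}
for k in range(2, 9):  # the silhouette score needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    # Mean silhouette over all observations for this k.
    scores[k] = silhouette_score(X, labels)

# Pick the k with the highest mean silhouette.
best_k = max(scores, key=scores.get)
print(scores)
print('best k by silhouette:', best_k)
```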



* **Gap Statistic Method**

The gap statistic compares the total within-cluster variation for different values of ***k*** with its expected value under a null reference distribution of the data. The optimal choice of ***k*** is the value that maximizes the gap, meaning that the clustering structure is far from a random uniform distribution of points.
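
scikit-learn has no built-in gap statistic, so the following is a minimal hand-rolled sketch under stated assumptions. The helpers `log_wk` and `gap_statistic` are hypothetical names, the reference distribution is taken to be uniform over the bounding box of the data, and k is chosen as the plain maximum of the gap as described above, without any standard-error adjustment. The data is synthetic.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def log_wk(X, k, seed=0):
    """Log of the within-cluster sum of squares for a k-means fit (hypothetical helper)."""
    return np.log(KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X).inertia_)

def gap_statistic(X, k_values, n_refs=5, seed=0):
    """Gap(k) = mean(log W_k on reference data) - log W_k on the real data."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    gaps = {}
    for k in k_values:
        # Reference data: uniform samples over the bounding box of X (an assumption).
        ref_logs = [log_wk(rng.uniform(lo, hi, size=X.shape), k, seed)
                    for _ in range(n_refs)]
        gaps[k] = float(np.mean(ref_logs) - log_wk(X, k, seed))
    return gaps

# Synthetic stand-in data (not the mall dataset).
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.5, random_state=42)
gaps = gap_statistic(X, k_values=range(1, 6))
best_k = max(gaps, key=gaps.get)
print(gaps)
print('k with the largest gap:', best_k)
```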




**Normalizing the Data**


In [None]:
from sklearn.preprocessing import MinMaxScaler

In [None]:
scaler = MinMaxScaler()

In [None]:
data_scaled = scaler.fit_transform(df)
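
With its default settings, `MinMaxScaler` rescales each column to the range [0, 1] using x' = (x - min) / (max - min). The small check below, on made-up values rather than the mall data, compares that formula with the scaler's output.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy values (made up): one column of ages, one of incomes in k$.
X = np.array([[20.0, 15.0],
              [40.0, 60.0],
              [60.0, 135.0]])

# Column-wise min-max formula applied by hand.
manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# The same transformation through scikit-learn.
scaled = MinMaxScaler().fit_transform(X)
print(scaled)
```

Because k-means relies on Euclidean distance, putting every column on the same scale keeps a wide-ranging feature such as income from dominating the distance.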

**Elbow Method**

In [None]:
from sklearn.cluster import KMeans
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters = i, init = 'k-means++', random_state = 42)
    kmeans.fit(data_scaled)
    wcss.append(kmeans.inertia_)
plt.plot(range(1, 11), wcss,marker='*')
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()

From the above plot, we can conclude that the optimal number of clusters is 5, since that is where the elbow plot bends.

**Training the K-Means model on the dataset**

In [None]:
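# random_state fixes the initialization so the cluster labels are reproducible across runs.
kmeans = KMeans(n_clusters=5,init='k-means++',max_iter=300,tol=0.0001,random_state=42).fit(data_scaled)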
kmeans

In [None]:
kmeans.labels_

In [None]:
kmeans.cluster_centers_

In [None]:
# Reuse the original column names (the dataframe has 'Gender', not 'Genre').
clusters_data = pd.DataFrame(data_scaled,columns=df.columns)
clusters_data['Cluster'] = kmeans.labels_
clusters_data.head()

In [None]:
clusters_data.groupby('Cluster').count()['Spending Score (1-100)']

# **6. Evaluation**

In [None]:
# There are no ground-truth labels here, and clusters_data['Cluster'] is kmeans.labels_,
# so a confusion matrix between the two is trivially perfect. Use an internal metric instead.
from sklearn.metrics import silhouette_score
print(silhouette_score(data_scaled, kmeans.labels_))