# Customer Segmentation

####  Problem Statement:
A key challenge for e-commerce businesses is analyzing market trends in order to increase their sales. Trends are easier to observe if a company can group its customers based on their activity on the e-commerce site. This grouping can use different criteria, such as previous orders, most-searched brands, and so on. Machine learning clustering algorithms provide an analytical method to cluster customers with similar interests.

## Data

Input variables:

1) **Cust_ID:** Unique identifier for each customer

2) **Gender:** Gender of the customer


3) **Orders:** Number of orders placed by each customer in the past


The remaining 35 features each contain the number of times the customer has searched for the corresponding brand

<a id='import_packages'></a>
## 1. Import Packages

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
# Set default setting of seaborn
sns.set()
from warnings import filterwarnings
filterwarnings('ignore')

<a id='Read_Data'></a>
## 2. Read the Data

In [None]:
# read the data
cust_data = pd.read_excel('cust_data.xlsx', index_col=0)

# print the first five rows of the data
cust_data.head()

<a id='data_preparation'></a>
## 3. Understand and Prepare the Data



##  Data Types and Dimensions

In [None]:
# check the data types for variables
cust_data.info()

In [None]:
# get the shape
print(cust_data.shape)

In [None]:
cust_data.dtypes

**We see the dataframe has 37 columns and 30000 observations**


## Distribution of Variables


**Distribution of orders placed by customers**

Check the distribution for the number of orders placed by the customers in the past

In [None]:
# 'countplot' to plot barplot for orders
sns.countplot(data = cust_data, x = 'Orders')
plt.title('Distribution of Orders', fontsize = 15)
plt.xlabel('No. of Orders', fontsize = 15)
plt.ylabel('Count', fontsize = 15)
plt.show()

Most of the customers have no past orders.

**Distribution of gender of the customer**



In [None]:
# use 'countplot' to plot the number of customers by gender
sns.countplot(data= cust_data, x = 'Gender')
plt.title('Distribution of Gender', fontsize = 15)
plt.xlabel('Gender', fontsize = 15)
plt.ylabel('No. of Customers', fontsize = 15)

# use below code to print the values in the graph
# 'x' and 'y' gives the position of the text
# 's' is the text 
# 'value_counts' is indexed by the labels 'F'/'M', so use '.iloc' for positional access
gender_counts = cust_data.Gender.value_counts()
plt.text(x = -0.1, y = gender_counts.iloc[1] + 20, s = str(round(gender_counts.iloc[1]*100/len(cust_data.Gender),2)) + '%')
plt.text(x = 0.9, y = gender_counts.iloc[0] + 20, s = str(round(gender_counts.iloc[0]*100/len(cust_data.Gender),2)) + '%')
plt.show()

**There are more female customers than male customers in the data.
The variable 'Gender' has fewer observations than the total number of observations (the two percentages add up to only 90.92%). This gap is due to missing values**
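The share of missing values can also be read off directly by counting them as their own category. The following is a minimal sketch on a small made-up column (the `gender` series is hypothetical and only stands in for `cust_data['Gender']`):

```python
import numpy as np
import pandas as pd

# Hypothetical toy column standing in for cust_data['Gender']
gender = pd.Series(['F', 'F', 'M', np.nan, 'F', 'M', np.nan, 'F'], name='Gender')

# dropna=False keeps missing values as a separate category;
# normalize=True returns shares instead of counts
gender_share = gender.value_counts(dropna=False, normalize=True) * 100
print(gender_share.round(2))

# share of observations whose gender is known
known_share = gender.notna().mean() * 100
print(known_share)
```

On the real data, the same two calls would show the known and missing shares without reading them off the plot.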

In [None]:
cust_data.describe()

The above output shows the summary statistics of the numeric variables.
The customers have placed 4 orders on average, with a minimum of zero orders and a maximum of 12.
From the summary output, it can be seen that the dataset is sparse,
since for all the brand-search variables 75% of the observations are 0.

In [None]:
# summary of the categorical variables
cust_data.describe(include = object)

**The summary contains the total number of observations, the number of unique classes, the most frequent class, and its frequency.
It can be seen that the mode of the variable 'Gender' is F, with 22054 observations**


## Treating Missing values:
If the missing values are not handled properly, we may end up drawing inaccurate inferences about the data. With improper handling, the results obtained can differ from those we would get if the values were not missing.


In [None]:
# sorting variables based on null values
# 'ascending = False' sorts values in the descending order
Total = cust_data.isnull().sum().sort_values(ascending=False)          

# percentage of missing values
Percent = (cust_data.isnull().sum()/cust_data.isnull().count()*100).sort_values(ascending=False)   

# create a dataframe using 'concat' function 
# 'keys' is the list of column names
# 'axis = 1' concats along the columns
missing_data = pd.concat([Total, Percent], axis=1, keys=['Total', 'Percent'])    
missing_data

Only the variable 'Gender' has missing values (about 9% of its observations)

In [None]:
# plot heatmap to check null values
# 'cbar = False' does not show the color axis 
sns.heatmap(cust_data.isnull(),yticklabels=False,cbar=False)
plt.title('Heatmap for Missing Values', fontsize = 15)
plt.xlabel('Variables', fontsize = 15)
plt.ylabel('Cust_ID', fontsize = 15)

plt.show()

#### Replace missing values in 'Gender'

'Gender' is a categorical variable with two categories, 'M' and 'F'. There are 2724 customers whose gender is not known to us. To deal with this, we perform dummy encoding for the variable.

In [None]:
# create dummies against 'gender'
data = pd.get_dummies(cust_data,columns=['Gender'])     

# head() to display top five rows
data.head()

In [None]:
# check the dimensions after dummy encoding
data.shape

A customer for whom both dummy columns have the value '0' is one whose gender is not known
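To make this behaviour concrete, here is a minimal sketch on a made-up three-row frame (the `toy` data is hypothetical). It also shows `dummy_na=True`, an alternative that gives missing values an explicit indicator column; the notebook itself does not use it:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: customer 3 has an unknown gender
toy = pd.DataFrame({'Gender': ['F', 'M', np.nan], 'Orders': [0, 3, 1]},
                   index=pd.Index([1, 2, 3], name='Cust_ID'))

# default behaviour: NaN gets no column of its own, so the row is all zeros
default_dummies = pd.get_dummies(toy, columns=['Gender'], dtype=int)
print(default_dummies)

# alternative: dummy_na=True adds an explicit 'Gender_nan' indicator column
explicit_dummies = pd.get_dummies(toy, columns=['Gender'], dummy_na=True, dtype=int)
print(explicit_dummies)
```

Either representation leaves no nulls behind; the explicit indicator is only easier to read when inspecting the rows later.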

In [None]:
# recheck the null values
data.isnull().sum()


##  Visualization


In [None]:
fig = data.hist(figsize = (18,18))



# K-means Clustering


Centroid-based clustering algorithms partition the data into non-hierarchical clusters. Such algorithms are efficient but sensitive to initial conditions and outliers. K-means is the most widely used centroid-based clustering algorithm.


##  Prepare the Data

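Feature scaling transforms all the variables to a comparable range. If the variables are on different scales, the variables with larger values can dominate the distance computations and hence the final clusters.

The two most discussed scaling methods are normalization and standardization. Here we use standardization (via 'StandardScaler'), which rescales each variable to zero mean and unit variance.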
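The difference between the two methods can be seen on a small example. This is a sketch on a made-up array `X` (not the customer data), using scikit-learn's `MinMaxScaler` for normalization and `StandardScaler` for standardization:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical toy data: two features on very different scales
X = np.array([[0.0, 10.0],
              [1.0, 20.0],
              [2.0, 60.0],
              [5.0, 30.0]])

# normalization: rescale each column to the range [0, 1]
X_norm = MinMaxScaler().fit_transform(X)

# standardization: rescale each column to zero mean and unit variance
X_std = StandardScaler().fit_transform(X)

print(X_norm.min(axis=0), X_norm.max(axis=0))
print(X_std.mean(axis=0), X_std.std(axis=0))
```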



We consider only the brand names to segment the customers. Thus, drop the variables 'Orders', 'Gender_F', 'Gender_M' and scale the remaining variables

In [None]:
# 'features' contains only the brand names
features = data.drop(['Orders', 'Gender_F', 'Gender_M'], axis=1)

# head() to display top five rows
features.head()

**Scaling the data**

In [None]:
from sklearn.preprocessing import StandardScaler

scale = StandardScaler().fit(features)       

# scale the 'features' data
features = scale.transform(features)                

In [None]:
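# create a dataframe of the scaled features
# take the column names from 'data' after dropping the non-brand variables,
# so that each scaled column keeps its own brand label
features_scaled = pd.DataFrame(features, columns=data.columns.drop(['Orders', 'Gender_F', 'Gender_M']))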

# head() to display top five rows
features_scaled.head()

<a id='model_k'></a>
## Build a Model with Multiple K


**We choose the number of clusters using the silhouette score, a measure of how consistent the clusters are: it compares how close each observation is to its own cluster with how close it is to the nearest other cluster**

**We do not know how many clusters give the most useful results. So, we create the clusters varying K, from 4 to 8 and then decide the optimum number of clusters (K) with the help of the silhouette score**
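For a single observation, the silhouette value is (b - a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the mean distance to the points of the nearest other cluster; the score is the average over all observations and lies between -1 and 1. The sketch below illustrates the selection procedure on synthetic data with three well-separated groups (generated with `make_blobs`; the centers, sample size and range of K are assumptions chosen for illustration, not the customer data):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups (not the customer data)
X, _ = make_blobs(n_samples=300, centers=[(-10, 0), (10, 0), (0, 15)],
                  cluster_std=0.5, random_state=10)

# compute the silhouette score for several candidate values of K
scores = {}
for k in [2, 3, 4, 5]:
    labels = KMeans(n_clusters=k, n_init=10, random_state=10).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# the candidate K with the highest score is taken as the best choice
best_k = max(scores, key=scores.get)
print(scores)
print("best K:", best_k)
```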

In [None]:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
# create a list for different values of K
n_clusters = [4, 5, 6, 7, 8]

# use 'for' loop to build the clusters
# 'random_state' fixes the centroid initialization so the results are reproducible
# fit and predict on the scaled data
# 'silhouette_score' function computes the silhouette score for each K
for K in n_clusters:
    cluster = KMeans(n_clusters=K, random_state=10)
    predict = cluster.fit_predict(features_scaled)

    score = silhouette_score(features_scaled, predict, random_state=10)
    print("For n_clusters = {}, silhouette score is {}".format(K, score))

**The optimum value for K is the one with the highest silhouette score. From the above output it can be seen that the silhouette score is highest for K = 4. Thus, we build the clusters with K = 4**

### Elbow Method

In [None]:
from sklearn.cluster import KMeans
wcss = []
for i in range(1,11):
    kmeans = KMeans(n_clusters=i, init="k-means++", random_state = 42)
    # fit on the same scaled brand features that are used for the silhouette analysis
    kmeans.fit(features_scaled)
    wcss.append(kmeans.inertia_)
plt.grid()
sns.lineplot(x=range(1,11), y=wcss, color="red", marker ="8")
plt.xlabel("K-Value")
plt.xticks(np.arange(1, 11))
plt.ylabel("WCSS")
plt.title("Elbow Graph")
plt.show()
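The quantity plotted in the elbow graph, WCSS (within-cluster sum of squares), is what scikit-learn exposes as `inertia_`: the sum of squared distances from each observation to the center of its assigned cluster. A minimal sketch that recomputes it by hand on random toy data (the array `X` and the choice of 3 clusters are assumptions for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: 200 random points in 3 dimensions
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))

km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

# look up the center assigned to each point, then sum the squared distances
assigned_centers = km.cluster_centers_[km.labels_]
wcss_manual = ((X - assigned_centers) ** 2).sum()

print(wcss_manual, km.inertia_)
```

WCSS always decreases as K grows, which is why the elbow method looks for the point where the decrease levels off rather than for a minimum.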

In [None]:
# building a K-Means model for K = 4
model = KMeans(n_clusters= 4, random_state= 10)

# fit the model
model.fit(features_scaled)

**Now, explore these 4 clusters to gain some insights about the clusters**


# Retrieve the Clusters



**Now that we have built the 4 clusters, we want to know which customers belong to which cluster. 'model.labels_' gives the cluster number to which each customer belongs**

In [None]:
data_output = data.copy(deep = True)
# add a column 'Cluster' in the data giving cluster number corresponding to each observation
data_output['Cluster'] = model.labels_

# head() to display top five rows
data_output.head()

**We have added a column 'Cluster' to the dataframe, giving the cluster number for each observation**

##### Check the size of each cluster

In [None]:
# 'return_counts = True' gives the number of observations in each cluster
np.unique(model.labels_, return_counts=True)                

In [None]:
# use 'seaborn' library to plot a barplot for cluster size
sns.countplot(data= data_output, x = 'Cluster')

# set the axes and plot labels
# set the font size using 'fontsize'
plt.title('Cluster Sizes', fontsize = 15)
plt.xlabel('Clusters', fontsize = 15)
plt.ylabel('No. of Customers', fontsize = 15)

# add values in the graph
# 'x' and 'y' assigns the position to the text
# 's' represents the text on the plot
plt.text(x = -0.18, y =2000, s = np.unique(model.labels_, return_counts=True)[1][0])
plt.text(x = 0.9, y =2000, s = np.unique(model.labels_, return_counts=True)[1][1])
plt.text(x = 1.85, y =2000, s = np.unique(model.labels_, return_counts=True)[1][2])
plt.text(x = 2.85, y =2000, s = np.unique(model.labels_, return_counts=True)[1][3])

plt.show()

#### Cluster Centers

Each cluster center is the mean of the scaled features over the customers in that cluster, so the brands with the highest center values indicate what characterizes the cluster
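Because the model is fitted on standardized features, the centers are in standardized units. If the centers are wanted in the original units (average search counts), one option is to apply the scaler's `inverse_transform`. The sketch below uses made-up data; `toy`, `brand_a` and `brand_b` are hypothetical names:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical toy search counts for two brands
toy = pd.DataFrame({'brand_a': [0, 1, 0, 9, 10, 8],
                    'brand_b': [5, 6, 4, 0, 1, 0]})

scaler = StandardScaler().fit(toy)
scaled = scaler.transform(toy)
km = KMeans(n_clusters=2, n_init=10, random_state=10).fit(scaled)

# map the centers from standardized units back to the original units
centers_original = pd.DataFrame(scaler.inverse_transform(km.cluster_centers_),
                                columns=toy.columns)
print(centers_original)

# the back-transformed centers match the per-cluster means of the raw data
cluster_means = toy.groupby(km.labels_).mean()
print(cluster_means)
```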


In [None]:
# form a dataframe containing cluster centers
# 'cluster_centers_' returns the co-ordinates of a cluster center 
centers = pd.DataFrame(model.cluster_centers_, columns=  data_output.columns[1:36])      

In [None]:
# head() to display top five rows
centers.head()

**Now, extract the variables in each of the clusters and try to name each cluster based on the variables**


# Clusters Analysis



##  Analysis of Cluster_1

1. Check the size of a cluster
2. Sort the variables belonging to the cluster
3. Compute the statistical summary for observations in the cluster

Sort all the variables based on value for the cluster center (i.e., the variable with the highest value of the cluster center will be on top of the sorted list) and store the first ten variables as a list
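The same ranking can be expressed with pandas alone. This is a sketch on a made-up centers table (`centers_toy` and its brand names are hypothetical), using `Series.nlargest` as an alternative to sorting zipped lists:

```python
import pandas as pd

# Hypothetical cluster centers: one row per cluster, one column per brand
centers_toy = pd.DataFrame(
    [[1.2, -0.3, 0.8, 0.1],
     [-0.5, 2.0, 0.0, 0.4]],
    columns=['brand_a', 'brand_b', 'brand_c', 'brand_d'])

# top 2 variables of the first cluster, highest center value first
top_brands = centers_toy.iloc[0].nlargest(2)
print(top_brands)

# the same for every cluster at once
top_per_cluster = {i: row.nlargest(2).index.tolist()
                   for i, row in centers_toy.iterrows()}
print(top_per_cluster)
```

With the real centers, `nlargest(10)` would give the ten variables used in the cluster analysis.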

In [None]:
# sort the variables based on cluster centers
cluster_1 = sorted(zip(list(centers.iloc[0,:]), list(centers.columns)), reverse = True)[:10]     

**1. Check the size of a cluster**

In [None]:
# size of cluster_1
np.unique(model.labels_, return_counts=True)[1][0]

**2. Sort variables belonging to the cluster**

In [None]:
# retrieve the top 10 variables present in the cluster
cluster1_var = pd.DataFrame(cluster_1)[1]
cluster1_var

**3. Compute the statistical summary for observations in the cluster**

In [None]:
# get summary for observations in the cluster
# consider the number of orders and customer gender for cluster analysis
data_output[['Orders', 'Gender_F', 'Gender_M', 'Cluster']][data_output.Cluster == 0].describe()

The proportions of male and female customers in this cluster are similar to the overall gender proportions in the dataset
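To compare the clusters side by side rather than one at a time, a grouped summary is one option. The sketch below uses a small made-up frame with the same column names as 'data_output' (the values in `toy_output` are hypothetical):

```python
import pandas as pd

# Hypothetical toy frame with the same column names as data_output
toy_output = pd.DataFrame({
    'Orders':   [0, 2, 4, 1, 3, 5],
    'Gender_F': [1, 0, 1, 1, 0, 0],
    'Gender_M': [0, 1, 0, 0, 0, 1],
    'Cluster':  [0, 0, 1, 1, 2, 2],
})

# mean orders and gender shares per cluster
summary = toy_output.groupby('Cluster')[['Orders', 'Gender_F', 'Gender_M']].mean()

# number of customers per cluster
summary['Size'] = toy_output.groupby('Cluster').size()
print(summary)
```

Since the gender columns are 0/1 indicators, their means are the shares of female and male customers in each cluster.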

 
## Analysis of Cluster_2

In [None]:
# sort the variables based on cluster centers
cluster_2 = sorted(zip(list(centers.iloc[1,:]), list(centers.columns)), reverse = True)[:10]     

**1. Check the size of a cluster**

In [None]:
# size of cluster_2
np.unique(model.labels_, return_counts=True)[1][1]

561 customers belong to cluster_2. This is the smallest cluster

**2. Sort variables belonging to the cluster**

In [None]:
# retrieve the top 10 variables present in the cluster
cluster2_var = pd.DataFrame(cluster_2)[1]
cluster2_var        

**3. Compute the statistical summary for observations in the cluster**

In [None]:
# get summary for observations in the cluster
# consider the number of orders and customer gender for cluster analysis
data_output[['Orders', 'Gender_F', 'Gender_M', 'Cluster']][data_output.Cluster == 1].describe()

This cluster has the highest share of male customers among all the clusters. However, the standard deviation is high for both gender indicators

<a id='cluster_3'></a>
## Analysis of Cluster_3

In [None]:
# sort the variables based on cluster centers
cluster_3 = sorted(zip(list(centers.iloc[2,:]), list(centers.columns)), reverse = True)[:10]   

**1. Check the size of a cluster**

In [None]:
# size of cluster_3
np.unique(model.labels_, return_counts=True)[1][2]

**2. Sort variables belonging to the cluster**

In [None]:
# retrieve the top 10 variables present in the cluster
cluster3_var = pd.DataFrame(cluster_3)[1]
cluster3_var             

**3. Compute the statistical summary for observations in the cluster**

In [None]:
# get summary for observations in the cluster
# consider the number of orders and customer gender for cluster analysis
data_output[['Orders', 'Gender_F', 'Gender_M', 'Cluster']][data_output.Cluster == 2].describe()


## Analysis of Cluster_4

In [None]:
# sort the variables based on cluster centers
cluster_4 = sorted(zip(list(centers.iloc[3,:]), list(centers.columns)), reverse=True)[:10]   

**1. Check the size of a cluster**

In [None]:
# size of cluster_4
np.unique(model.labels_, return_counts=True)[1][3]

**2. Sort variables belonging to the cluster**

In [None]:
# retrieve the top 10 variables present in the cluster
cluster4_var = pd.DataFrame(cluster_4)[1]
cluster4_var             

**3. Compute the statistical summary for observations in the cluster**

In [None]:
# get summary for observations in the cluster
# consider the number of orders and customer gender for cluster analysis
data_output[['Orders', 'Gender_F', 'Gender_M', 'Cluster']][data_output.Cluster==3].describe()


## Conclusion

**In this case study, we have grouped the customers in the dataset into 4 clusters based on the brands they have searched for on the e-commerce site. We used the silhouette score to find the optimum number of clusters and chose K = 4 after analyzing the scores.**

**After applying the K-means algorithm with this number of clusters, we segment the customers into the 'Grocery', 'Apparels', 'Electronics', and 'Basket class' categories. These clusters give information about the customers' interest in the different brands. This type of segmentation can help e-commerce companies understand their customers' choices and provide more accurate recommendations to them**