# Customer segmentation

This notebook provides example code for segmenting customers with the k-means clustering algorithm. The data was prepared in MS Excel (pivot tables), and the clustering results were saved back to Excel for further analysis (building the final database with VLOOKUP and analysing it with pivot tables and conditional formatting).

In [1]:
# import pandas library for working with dataframes
import pandas as pd

In [2]:
# We need the 3rd sheet of the Excel file
# pandas numbers sheets from 0, so the 3rd sheet has index 2
# (the keyword is sheet_name; the older spelling sheetname was removed from pandas)
data = pd.read_excel("segment.xlsx", sheet_name=2)

In [3]:
data.head()

Unnamed: 0,Row Labels,1,2,3,4,5,6,7,8,9,...,23,24,25,26,27,28,29,30,31,32
0,Adams,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,1,0,0
1,Allen,0,0,0,0,0,0,0,0,1,...,0,0,0,0,1,0,0,0,0,0
2,Anderson,0,0,0,0,0,0,0,0,0,...,0,1,0,1,0,0,0,0,0,0
3,Bailey,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,1,0,0
4,Baker,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,1,0


In [4]:
# scikit-learn (sklearn) is a widely used machine learning library
# it is large, so we import only the class we need (KMeans)
from sklearn.cluster import KMeans

In [5]:
# create a KMeans estimator, passing the number of clusters (5)
cluster = KMeans(n_clusters=5)
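
k-means starts from randomly chosen centroids, so the cluster numbers it assigns can differ from one run to the next. The following is a minimal sketch of how fixing `random_state` makes the labels repeatable. It uses a small invented binary matrix rather than the real data, and the values `random_state=42` and `n_init=10` are illustrative assumptions, not settings taken from this notebook.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: 6 customers x 4 offers (1 = purchased); not the real dataset
toy = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
])

# The same seed gives the same initial centroids, hence the same labels on every run
labels_a = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(toy)
labels_b = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(toy)

print(labels_a)
print((labels_a == labels_b).all())
```

Without a fixed seed the grouping is usually similar, but the numeric label attached to each group is arbitrary, which matters when results are compared across runs in Excel.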

In [7]:
# prepare the dataset for the algorithm (i.e. drop the first column, which holds the customer names)
inputs = data.iloc[:,1:]
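
Positional slicing depends on the name column staying first. As a hedged alternative, the sketch below selects the same inputs by column name on an invented frame that mirrors the layout shown above. The column name `Row Labels` comes from the data preview; the toy values are made up.

```python
import pandas as pd

# Hypothetical frame with the same layout: a name column followed by 0/1 offer columns
toy = pd.DataFrame({
    "Row Labels": ["Adams", "Allen", "Anderson"],
    1: [0, 0, 1],
    2: [1, 0, 0],
    3: [0, 1, 0],
})

by_position = toy.iloc[:, 1:]             # what the notebook does
by_name = toy.drop(columns="Row Labels")  # same result, independent of column order

print(by_name.equals(by_position))
```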

In [8]:
inputs.head()

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,...,23,24,25,26,27,28,29,30,31,32
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,1,0,0
1,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,1,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,1,0,1,0,0,0,0,0,0
3,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,1,0,0
4,0,0,0,0,0,0,1,0,0,1,...,0,0,0,0,0,0,0,0,1,0


In [9]:
# fit the model, predict a cluster for each customer, and store it as a new last column
data["cluster"] = cluster.fit_predict(inputs)

In [10]:
data.head()

Unnamed: 0,Row Labels,1,2,3,4,5,6,7,8,9,...,24,25,26,27,28,29,30,31,32,cluster
0,Adams,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,1,0,0,1
1,Allen,0,0,0,0,0,0,0,0,1,...,0,0,0,1,0,0,0,0,0,4
2,Anderson,0,0,0,0,0,0,0,0,0,...,1,0,1,0,0,0,0,0,0,2
3,Bailey,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,1,0,0,1
4,Baker,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,1,0,4


In [11]:
# quickly count the number of customers in each cluster
data["cluster"].value_counts()

4    34
1    24
3    21
2    13
0     8
Name: cluster, dtype: int64
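
Cluster sizes alone do not show what distinguishes the groups. The sketch below is one possible pandas counterpart to the pivot-table analysis done later in Excel: averaging each 0/1 column per cluster gives the share of that cluster's members who have a 1 in that column. The frame is invented for illustration and only mimics the column layout of `data`.

```python
import pandas as pd

# Hypothetical data: names, three 0/1 columns, and an already assigned cluster label
toy = pd.DataFrame({
    "Row Labels": ["Adams", "Allen", "Anderson", "Bailey"],
    1: [1, 1, 0, 0],
    2: [0, 1, 0, 0],
    3: [0, 0, 1, 1],
    "cluster": [0, 0, 1, 1],
})

# Mean of a 0/1 column within a cluster = fraction of members with a 1 in that column
profile = toy.drop(columns="Row Labels").groupby("cluster").mean()
print(profile)
```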

In [12]:
# keep only the customer names and their clusters
new_data = data.iloc[:,[0,-1]]

In [13]:
new_data.head()

Unnamed: 0,Row Labels,cluster
0,Adams,1
1,Allen,4
2,Anderson,2
3,Bailey,1
4,Baker,4


In [14]:
# save it to an Excel file for further analysis
new_data.to_excel('cluster.xlsx')