I downloaded the wholesale customers dataset from the UCI Machine Learning Repository. It describes the clients of a wholesale distributor and records their annual spending, in monetary units, across several product categories.

My goal today is to use various clustering techniques to segment customers. Clustering is an unsupervised learning technique that groups data points by similarity. There is no outcome to predict; the algorithm only looks for patterns in the data.
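To make "grouping by similarity" concrete, here is a minimal sketch on made-up data. The six points, the variable names, and the `n_init` and `random_state` settings are my own assumptions for illustration; none of them come from the wholesale dataset.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: two well-separated groups of 2-D points.
points = np.array([
    [1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
    [8.0, 8.0], [8.2, 7.9], [7.9, 8.1],
])

# n_init and random_state are fixed only so the sketch is repeatable.
toy_model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = toy_model.fit_predict(points)

# Points in the same group share a label; which group is called 0 or 1 is arbitrary.
print(labels)
```

No target column is passed anywhere: the labels come purely from how close the points are to each other.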

In [1]:
import pandas as pd
import matplotlib.pyplot as plt 
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

In [4]:
#load in the data and perform train_test_split
df = pd.read_csv("WCD.csv")
train, test = train_test_split(df, test_size=0.2)

In [5]:
#Shapes
print("Training Data: ", train.shape)
print("Testing Data: ", test.shape)

Training Data:  (352, 8)
Testing Data:  (88, 8)


In [6]:
#Model object initialisation and fitting
model = KMeans()
model.fit(train)

KMeans()

In [8]:
#Default number of clusters and their centers
print("Default count of clusters: ", model.n_clusters) 
print("Cluster Centers: ", model.cluster_centers_)

Default count of clusters:  8
Cluster Centers:  [[1.07482993e+00 2.44897959e+00 5.93576871e+03 2.87615646e+03
  3.20838095e+03 3.13613605e+03 8.83632653e+02 9.12231293e+02]
 [2.00000000e+00 2.50000000e+00 9.03883333e+03 2.40307778e+04
  2.97280556e+04 2.01988889e+03 1.37903889e+04 3.20355556e+03]
 [1.23863636e+00 2.52272727e+00 1.94172727e+04 3.68285227e+03
  5.04531818e+03 3.12479545e+03 1.15784091e+03 1.68278409e+03]
 [2.00000000e+00 2.66666667e+00 2.02080000e+04 2.98486667e+04
  7.32253333e+04 1.47033333e+03 3.52100000e+04 2.05866667e+03]
 [1.83098592e+00 2.60563380e+00 4.44359155e+03 8.89511268e+03
  1.38744789e+04 1.30697183e+03 6.15484507e+03 1.33391549e+03]
 [1.09523810e+00 2.52380952e+00 4.05739048e+04 3.86733333e+03
  4.31419048e+03 3.99595238e+03 8.92714286e+02 2.10133333e+03]
 [1.00000000e+00 3.00000000e+00 9.41940000e+04 1.65500000e+04
  1.26250000e+04 1.66415000e+04 2.86300000e+03 4.73400000e+03]
 [1.00000000e+00 2.50000000e+00 3.47820000e+04 3.03670000e+04
  1.68980000e+0
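The raw array of centers is hard to read because the values are not labelled. One option, sketched here on made-up data, is to wrap `cluster_centers_` in a DataFrame that reuses the column names of the data the model was fitted on. The frame `demo`, its placeholder column names, and the choice of four clusters are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical stand-in for `train`; the column names are placeholders.
rng = np.random.default_rng(0)
demo = pd.DataFrame(
    rng.uniform(0, 10000, size=(60, 3)),
    columns=["col_a", "col_b", "col_c"],
)

demo_model = KMeans(n_clusters=4, n_init=10, random_state=0).fit(demo)

# One row per cluster, one column per feature, in the same order as `demo`.
centers = pd.DataFrame(demo_model.cluster_centers_, columns=demo.columns)
print(centers.round(1))
```

The same pattern should carry over to the notebook's objects as `pd.DataFrame(model.cluster_centers_, columns=train.columns)`.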

In [11]:
#Prediction of clusters on train dataset
print("Clusters on train data: ", model.predict(train))

Clusters on train data:  [0 2 2 4 0 0 0 4 0 0 4 4 1 2 0 0 0 4 4 0 2 4 0 0 0 0 0 0 2 2 4 5 2 0 0 2 5
 0 0 2 2 2 6 2 2 0 0 0 2 0 0 4 2 0 5 0 2 4 4 4 4 0 0 2 1 0 4 0 0 0 0 2 0 0
 4 0 4 0 0 4 2 1 5 4 0 0 0 4 5 2 0 5 1 4 4 0 0 2 0 4 4 0 0 2 4 2 0 2 4 1 0
 0 2 4 0 0 3 2 5 2 2 4 2 1 0 0 0 4 2 0 0 0 2 2 0 5 0 2 0 0 4 5 4 0 4 2 2 5
 0 2 2 4 0 2 2 0 4 0 4 2 0 0 4 4 2 2 0 0 0 2 1 0 4 2 4 7 0 2 0 0 4 4 2 0 0
 2 0 5 4 0 4 4 2 0 0 4 2 0 2 4 0 5 2 0 1 2 0 3 0 2 2 0 5 0 4 0 1 0 1 4 5 2
 7 4 0 0 5 2 2 2 0 1 5 0 2 2 0 1 0 4 0 0 2 0 0 2 2 0 0 0 0 0 0 4 4 0 5 0 2
 2 2 0 2 0 2 0 4 0 1 4 2 4 2 2 2 2 0 4 0 0 4 4 4 0 0 1 2 0 4 0 6 0 2 2 2 0
 2 5 1 2 0 0 2 4 0 0 0 0 4 1 4 0 4 0 0 0 1 1 5 2 0 2 4 3 0 2 5 0 2 0 2 4 5
 4 2 4 2 0 0 0 4 4 0 0 2 4 0 0 0 0 0 4]


In [12]:
#Prediction of clusters on test dataset
print("Clusters on test data: ", model.predict(test))

Clusters on test data:  [4 0 3 2 0 6 0 4 0 0 0 2 4 0 4 0 0 2 2 5 0 5 0 5 4 4 0 0 2 4 0 5 0 2 4 2 0
 2 4 2 2 2 2 2 4 0 0 0 4 0 4 0 1 2 4 0 0 0 0 0 0 0 2 0 0 5 2 0 4 0 4 0 2 0
 0 2 0 4 5 4 4 1 4 4 4 2 4 0]
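A long list of labels is easier to digest as a count per cluster. Here is a small sketch using `np.unique`; the array `example_labels` holds just the first ten labels from the test output above, typed in by hand, so the counts are for illustration and not a summary of the full data.

```python
import numpy as np

# First ten labels copied from the test-set output above (illustration only).
example_labels = np.array([4, 0, 3, 2, 0, 6, 0, 4, 0, 0])

# return_counts=True pairs each cluster id that occurs with how often it occurs.
cluster_ids, sizes = np.unique(example_labels, return_counts=True)
for cluster_id, size in zip(cluster_ids, sizes):
    print(f"cluster {cluster_id}: {size} customers")
```

Clusters that never appear in the array are simply absent from `cluster_ids`, which is worth remembering when a small sample is compared against the fitted model's `n_clusters`.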


In [15]:
#Now let's try to play around a bit.
#We'll change the count of clusters.
model1 = KMeans(n_clusters=3)
model1.fit(train)
print("Number of clusters: ", model1.n_clusters)
print("Cluster Centers: ", model1.cluster_centers_)

Number of clusters:  3
Cluster Centers:  [[1.26119403e+00 2.51119403e+00 8.50286940e+03 3.92690672e+03
  5.24915672e+03 2.64588060e+03 1.79889179e+03 1.12884328e+03]
 [1.97368421e+00 2.50000000e+00 7.85026316e+03 1.92139474e+04
  2.88961842e+04 1.87563158e+03 1.34792895e+04 2.27468421e+03]
 [1.15217391e+00 2.52173913e+00 3.59352391e+04 6.37056522e+03
  6.52834783e+03 6.60410870e+03 1.18050000e+03 3.45458696e+03]]
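Rather than picking the cluster count by hand, one common heuristic is to fit several values of `n_clusters` and compare the `inertia_` attribute (the within-cluster sum of squared distances). This is a sketch on synthetic blobs, not on the wholesale data; the data, the range of `k`, and the `n_init` and `random_state` settings are all assumptions of mine.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical stand-in data: three blobs of 50 points each.
rng = np.random.default_rng(0)
blobs = np.vstack([
    rng.normal(loc=center, scale=0.5, size=(50, 2))
    for center in ([0, 0], [10, 0], [0, 10])
])

# Fit one model per candidate k and record its inertia.
inertias = {}
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(blobs)
    inertias[k] = km.inertia_

for k, value in inertias.items():
    print(k, round(value, 1))
```

Inertia generally keeps falling as `k` grows, so the thing to look for is the point where the drop flattens out, not the minimum.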


In [16]:
#Prediction of clusters on train dataset
print("Clusters on train: ", model1.predict(train))

Clusters on train:  [0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 2 0 0 0 0 2
 0 0 0 0 0 2 2 2 0 0 0 0 0 0 0 0 0 2 0 2 0 1 0 0 0 0 2 1 0 0 0 0 0 0 2 0 0
 1 0 0 0 0 0 0 1 2 0 0 0 0 0 2 0 0 2 1 1 0 0 0 0 0 0 0 0 0 2 0 0 0 2 0 1 0
 0 2 0 0 0 1 0 2 0 0 1 2 1 0 0 0 0 0 0 0 0 0 0 0 2 0 2 0 0 0 2 1 0 0 0 0 2
 0 2 2 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 2 0 0 0 0 0 0 0 0 0
 0 0 2 0 0 0 0 0 0 0 0 0 0 0 1 0 2 0 0 1 0 0 1 0 2 2 0 2 0 0 0 1 0 1 0 2 0
 2 1 0 0 2 0 0 0 0 1 2 0 0 2 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 2 0 0
 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 2 0 2 0 2 0
 2 2 1 0 0 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0 1 1 2 2 0 0 0 1 0 2 2 0 0 0 0 1 2
 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0]


In [17]:
#Prediction of clusters on test dataset
print("Clusters on test: ", model1.predict(test))

Clusters on test:  [0 0 1 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 2 0 2 0 0 0 0 2 0 0 2 0 0 0 0 0
 2 0 2 2 0 0 0 0 0 0 0 1 0 0 0 1 2 1 0 0 0 0 0 0 0 2 0 0 2 0 0 1 0 0 0 0 0
 0 0 0 1 2 0 1 1 0 0 1 0 0 0]
