Rebecca Black

## Customer Segmentation for a Wholesale Grocery Distributor 

This analysis is based on a dataset describing clients of a wholesale grocery distributor. It is a subset of a larger database that was analyzed in the following master's thesis:

*Abreu, N. (2011). Analise do perfil do cliente Recheio e desenvolvimento de um sistema promocional. Mestrado em Marketing, ISCTE-IUL, Lisbon.*

The backstory is that the grocery distributor implemented a change to their delivery system, changing both the frequency and time of delivery to their clients. Subsequently many customers expressed dissatisfaction with the change and took their business to other wholesalers. It was discovered later that the dissatisfied customers consisted primarily of small family-run shops.

To prevent future customer dissatisfaction, the wholesaler wished to learn more about their customer segments so as to more mindfully implement changes in the future.

In order to discover this information, I will perform a cluster analysis on the data, followed by a Principal Components Analysis (PCA).

Each attribute in this dataset is continuous and records a client's annual spending (in monetary units) on one product category.

This analysis was written in Python.

### Load the libraries needed for this analysis

In [37]:
import numpy as np
import pandas as pd
from sklearn import cluster
from sklearn.cluster import KMeans
from sklearn import metrics
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

### Read in the data as a pandas dataframe

In [20]:
# read_csv already returns a DataFrame, so no separate conversion is needed
cust = pd.read_csv("wholesale_customers.csv")

### Print the initial structure and variables

In [21]:
cust.head()

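| | Fresh | Milk | Grocery | Frozen | Detergents_Paper | Delicatessen |
|---|---|---|---|---|---|---|
| 0 | 12669 | 9656 | 7561 | 214 | 2674 | 1338 |
| 1 | 7057 | 9810 | 9568 | 1762 | 3293 | 1776 |
| 2 | 6353 | 8808 | 7684 | 2405 | 3516 | 7844 |
| 3 | 13265 | 1196 | 4221 | 6404 | 507 | 1788 |
| 4 | 22615 | 5410 | 7198 | 3915 | 1777 | 5185 |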


In [22]:
cust.index

Int64Index([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,
            ...
            430, 431, 432, 433, 434, 435, 436, 437, 438, 439],
           dtype='int64', length=440)

In [23]:
cust.values

array([[12669,  9656,  7561,   214,  2674,  1338],
       [ 7057,  9810,  9568,  1762,  3293,  1776],
       [ 6353,  8808,  7684,  2405,  3516,  7844],
       ..., 
       [14531, 15488, 30243,   437, 14841,  1867],
       [10290,  1981,  2232,  1038,   168,  2125],
       [ 2787,  1698,  2510,    65,   477,    52]])

### Print some summary statistics

In [24]:
cust.describe()

| | Fresh | Milk | Grocery | Frozen | Detergents_Paper | Delicatessen |
|---|---|---|---|---|---|---|
| count | 440.0 | 440.0 | 440.0 | 440.0 | 440.0 | 440.0 |
| mean | 12000.297727 | 5796.265909 | 7951.277273 | 3071.931818 | 2881.493182 | 1524.870455 |
| std | 12647.328865 | 7380.377175 | 9503.162829 | 4854.673333 | 4767.854448 | 2820.105937 |
| min | 3.0 | 55.0 | 3.0 | 25.0 | 3.0 | 3.0 |
| 25% | 3127.75 | 1533.0 | 2153.0 | 742.25 | 256.75 | 408.25 |
| 50% | 8504.0 | 3627.0 | 4755.5 | 1526.0 | 816.5 | 965.5 |
| 75% | 16933.75 | 7190.25 | 10655.75 | 3554.25 | 3922.0 | 1820.25 |
| max | 112151.0 | 73498.0 | 92780.0 | 60869.0 | 40827.0 | 47943.0 |


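So we have six features, all continuous, and 440 observations, each recording one customer's expenditures over the past year. On average, customers spend the most on fresh food (about 12,000 monetary units), followed by grocery (about 7,950) and milk (about 5,800), which is what one might expect from a wholesaler whose clients serve typical consumer grocery demand. Note also that every standard deviation exceeds its mean and every maximum lies far above the 75th percentile, so spending is strongly right-skewed: a few very large customers sit alongside many small ones.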

### Now for some questions I want to answer with this analysis

#### Question 1:
How do these data cluster together? Are there logical clusters that represent differing customer behaviors and needs? Recall that the issue at hand is the delivery time and mode: the customers who had a problem with bulk evening deliveries might logically need more frequent deliveries during morning hours (for example, because they need fresh food to be available).

To answer this question, I'll use the K-means clustering algorithm as implemented in scikit-learn.

In [25]:
k = 2
kmeans = cluster.KMeans(n_clusters=k)
kmeans.fit(cust)

KMeans(copy_x=True, init='k-means++', max_iter=300, n_clusters=2, n_init=10,
    n_jobs=1, precompute_distances='auto', random_state=None, tol=0.0001,
    verbose=0)

In [26]:
labels = kmeans.labels_
centroids = kmeans.cluster_centers_

In [27]:
print(centroids.T)

[[  7944.112       35401.36923077]
 [  5151.81866667   9514.23076923]
 [  7536.128       10346.36923077]
 [  2484.13066667   6463.09230769]
 [  2872.55733333   2933.04615385]
 [  1214.26133333   3316.84615385]]


So above we see the centroids of each cluster. To better visualize what this means for the wholesaler in a practical sense, we can examine a dataframe of the mean value for each grocery category alongside each centroid element corresponding to that category.

In [28]:
data_means = cust.mean().values  # per-category means (same values as the "mean" row of describe())
interpretation=[data_means,centroids[0],centroids[1]]
interpretation=pd.DataFrame(interpretation)
interpretation=interpretation.T
interpretation.columns = ['Mean Value', 'Centroid 1', 'Centroid 2']
interpretation.index = ['Fresh', 'Milk', 'Grocery', 'Frozen', 'Detergents_Paper', "Delicatessen"]
interpretation

| | Mean Value | Centroid 1 | Centroid 2 |
|---|---|---|---|
| Fresh | 12000.297727 | 7944.112 | 35401.369231 |
| Milk | 5796.265909 | 5151.818667 | 9514.230769 |
| Grocery | 7951.277273 | 7536.128 | 10346.369231 |
| Frozen | 3071.931818 | 2484.130667 | 6463.092308 |
| Detergents_Paper | 2881.493182 | 2872.557333 | 2933.046154 |
| Delicatessen | 1524.870455 | 1214.261333 | 3316.846154 |
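To make the link between an individual customer and these centroids concrete, the sketch below assigns the first customer from `cust.head()` to whichever centroid is closer in Euclidean distance, which is the rule K-means uses to label points. The numbers are copied by hand from the printed outputs rather than read from the fitted model, and the variable names are only illustrative.

```python
import numpy as np

# Values copied from the printed outputs (not recomputed from the CSV).
customer_0 = np.array([12669, 9656, 7561, 214, 2674, 1338], dtype=float)
centroids = np.array([
    [7944.112, 5151.818667, 7536.128, 2484.130667, 2872.557333, 1214.261333],
    [35401.369231, 9514.230769, 10346.369231, 6463.092308, 2933.046154, 3316.846154],
])

# Euclidean distance from the customer to each centroid.
distances = np.linalg.norm(centroids - customer_0, axis=1)

# Index 0 corresponds to "Centroid 1", index 1 to "Centroid 2".
nearest = int(np.argmin(distances))
print(distances, nearest)
```

For this customer the first centroid is much closer, mainly because of the large gap in fresh food spending relative to the second centroid.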


Customers in Cluster 1 (described by the centroid of that cluster, here called "Centroid 1") spend the most on fresh food and grocery, and their spending is below the overall mean in every category. In the detergents and paper category the gap is negligible (2,873 versus a mean of 2,881).

Customers in Cluster 2 (described by the centroid of that cluster, here called "Centroid 2") have a very high expenditure on fresh food (35,401 monetary units annually), roughly three times the overall mean. Their spending is above the overall mean in every category, although only marginally so for detergents and paper (2,933 versus 2,881).
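The analysis fixes `k = 2` without testing other values. One common way to check such a choice is the silhouette score, which is available in the `metrics` module imported at the top. The sketch below is only an illustration under stated assumptions: it runs on a small synthetic two-column dataset standing in for `cust` (the real CSV is not loaded here), and the range of `k` values tried is arbitrary.

```python
import numpy as np
from sklearn import metrics
from sklearn.cluster import KMeans

# Synthetic stand-in for `cust`: two well-separated groups of 50 points each.
rng = np.random.RandomState(0)
small = rng.normal(loc=[5000.0, 2000.0], scale=500.0, size=(50, 2))
large = rng.normal(loc=[30000.0, 9000.0], scale=500.0, size=(50, 2))
X = np.vstack([small, large])

scores = {}
for k in range(2, 6):
    # n_init and random_state are set explicitly so the run is repeatable.
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # Silhouette ranges from -1 to 1; higher means tighter, better-separated clusters.
    scores[k] = metrics.silhouette_score(X, model.labels_)

best_k = max(scores, key=scores.get)
print(scores, best_k)
```

Replacing `X` with `cust` would apply the same check to the real data; whether it would also favor two clusters there is not something this sketch establishes.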

#### Question 2:
Can these data be described with fewer features? If the data can be reduced to a few key factors, future decision making would be more efficient, since the wholesaler could focus on the concerns most important to each customer.

To answer this question, I'll use Principal Components Analysis to see if the dataset can be represented in a simpler way.

In [38]:
pca = PCA(n_components=6)
pca.fit(cust)

PCA(copy=True, n_components=6, whiten=False)

In [39]:
print(pca.components_)

[[-0.97653685 -0.12118407 -0.06154039 -0.15236462  0.00705417 -0.06810471]
 [-0.11061386  0.51580216  0.76460638 -0.01872345  0.36535076  0.05707921]
 [-0.17855726  0.50988675 -0.27578088  0.71420037 -0.20440987  0.28321747]
 [-0.04187648 -0.64564047  0.37546049  0.64629232  0.14938013 -0.02039579]
 [ 0.015986    0.20323566 -0.1602915   0.22018612  0.20793016 -0.91707659]
 [-0.01576316  0.03349187  0.41093894 -0.01328898 -0.87128428 -0.26541687]]


In [40]:
print(pca.explained_variance_ratio_)

[ 0.45961362  0.40517227  0.07003008  0.04402344  0.01502212  0.00613848]


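The first two components explain about 46% and 41% of the variance, while each of the remaining four explains 7% or less, so it looks like I may be able to reduce the dimensionality of the dataset to 2 Principal Components. Let's see how that looks.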
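A quick way to read the explained-variance ratios is as a running total. The sketch below copies the six ratios from the output above and accumulates them. The 0.85 threshold is a hypothetical cut-off chosen only for illustration; it is not part of the original analysis.

```python
import numpy as np

# Explained-variance ratios copied from the output above.
ratios = np.array([0.45961362, 0.40517227, 0.07003008,
                   0.04402344, 0.01502212, 0.00613848])

# Running total of variance explained as components are added.
cumulative = np.cumsum(ratios)
print(np.round(cumulative, 4))

# Smallest number of components whose cumulative share reaches the threshold.
threshold = 0.85  # hypothetical cut-off, for illustration only
n_needed = int(np.searchsorted(cumulative, threshold) + 1)
print(n_needed)
```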

In [41]:
pca = PCA(n_components=2)
pca.fit(cust)

PCA(copy=True, n_components=2, whiten=False)

In [42]:
print(pca.components_)

[[-0.97653685 -0.12118407 -0.06154039 -0.15236462  0.00705417 -0.06810471]
 [-0.11061386  0.51580216  0.76460638 -0.01872345  0.36535076  0.05707921]]


In [43]:
print(pca.explained_variance_ratio_)

[ 0.45961362  0.40517227]


Together these two Principal Components represent 86% of the variance in these data. Let's go ahead and use the two components to recalculate and visualize these clusters.

In [44]:
reduced_data = PCA(n_components=2).fit_transform(cust)

Representing each customer by the first two principal components results in a two-feature dataset in which the first feature is the customer's score on the first principal component (the projection of that customer's mean-centered spending onto the component) and the second feature is the score on the second principal component. The first 5 rows of the reduced dataset are printed below.

In [36]:
print(reduced_data[:5])

[[  -650.02212207   1585.51909007]
 [  4426.80497937   4042.45150884]
 [  4841.9987068    2578.762176  ]
 [  -990.34643689  -6279.80599663]
 [-10657.99873116  -2159.72581518]]


As an example, the first customer has a score of about -650 on the first principal component and about 1,586 on the second. The larger magnitude on the second component suggests that this customer is characterized mainly by that component, which loads most heavily (0.765) on the grocery category.
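The first customer's two values can be checked by hand. With `whiten=False`, scikit-learn's PCA subtracts each feature's mean and then projects the result onto each component. The sketch below repeats that arithmetic with numbers copied from the printed outputs (the first row of `cust.head()`, the `mean` row of `cust.describe()`, and `pca.components_`); nothing is refitted, and the variable names are illustrative.

```python
import numpy as np

# All values are copied from outputs printed earlier in this analysis.
customer_0 = np.array([12669, 9656, 7561, 214, 2674, 1338], dtype=float)
feature_means = np.array([12000.297727, 5796.265909, 7951.277273,
                          3071.931818, 2881.493182, 1524.870455])
components = np.array([
    [-0.97653685, -0.12118407, -0.06154039, -0.15236462, 0.00705417, -0.06810471],
    [-0.11061386, 0.51580216, 0.76460638, -0.01872345, 0.36535076, 0.05707921],
])

# Center the customer on the feature means, then project onto each component.
scores = components.dot(customer_0 - feature_means)
print(np.round(scores, 2))
```

Up to rounding in the copied values, the result agrees with the first row of `reduced_data` (about -650.02 and 1585.52).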

For the fifth customer, the score on the first principal component (about -10,658) is approximately five times the magnitude of the score on the second (about -2,160). The first component has a very large negative loading (-0.976) on the fresh food category, so a large negative score indicates fresh food spending well above the average; this customer does in fact spend 22,615 on fresh food, compared with a mean of about 12,000. We can conclude that fresh food is the dominant category for this customer's purchases. This may well imply that frequent morning deliveries would suit this customer, so that they can in turn have fresh food available for *their* customers.

The main takeaway from this analysis is that there do indeed seem to be two reasonably distinct categories of customer. By carefully considering the factors represented by the clusters from the cluster analysis, or alternatively the principal components from the PCA, the grocery wholesaler can derive a profile of their customer segments that will inform future service changes and help them strike a balance between cost savings and customer satisfaction.