# Customer segmentation at Instacart

## Miguel Ángel Canela, IESE Business School

******

### Introduction

**Instacart** (`instacart.com`) is a grocery ordering and delivery app. After you select products through the Instacart app, personal shoppers review your order and do the in-store shopping and delivery for you.

The data for this example come from the public release of an anonymized database containing a sample of over 3 million grocery orders from more than 200,000 Instacart users. The users with fewer than 10 orders were discarded, leaving us with about one half of the users. For the users retained, only the last 10 orders were taken, so the data set became balanced.

The database covers 49,688 **products**, distributed across 134 **aisles**. A table has been derived from the database, in which the rows are the users, the columns are the aisles and every entry is the number of product units purchased by a particular user from a particular aisle. In this example, I use these data to present two cases of **customer segmentation**.

### Importing the data

Since the original data set was too big to be posted on GitHub in CSV format, it has been partitioned into two subsets. I import these subsets separately, putting them together with the Pandas function `concat`. The argument `axis=0` means that they are concatenated vertically.

In [1]:
import pandas as pd
folder = 'https://raw.githubusercontent.com/mcanela-iese/ML_Course/master/Data/'
df1 = pd.read_csv(folder + 'instacart1.csv')
df2 = pd.read_csv(folder + 'instacart2.csv')
df = pd.concat([df1, df2], axis=0)
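
As a small illustration of what `axis=0` does, the following sketch stacks two toy data frames. The names `part1` and `part2` and their contents are made up for this example; they are not taken from the Instacart files. Note that `concat` keeps the original row labels of each part unless `ignore_index=True` is specified.

```python
import pandas as pd

# Two toy data frames with the same columns (hypothetical data)
part1 = pd.DataFrame({'tea': [1, 0], 'yogurt': [3, 2]})
part2 = pd.DataFrame({'tea': [0, 4], 'yogurt': [5, 0]})

# axis=0 stacks the rows; the row labels of each part are kept
stacked = pd.concat([part1, part2], axis=0)
print(stacked.shape)
print(list(stacked.index))

# ignore_index=True renumbers the rows from 0 to n-1
renumbered = pd.concat([part1, part2], axis=0, ignore_index=True)
print(list(renumbered.index))
```

The repeated row labels are harmless in this example, since the analysis below never selects rows by label.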

I check that the shape and the content of the data set are as expected.

In [2]:
df.shape

(107438, 134)

In [3]:
df.head()

Unnamed: 0,air fresheners candles,asian foods,baby accessories,baby bath body care,baby food formula,bakery desserts,baking ingredients,baking supplies decor,beauty,beers coolers,...,spreads,tea,tofu meat alternatives,tortillas flat bread,trail mix snack mix,trash bags liners,vitamins supplements,water seltzer sparkling water,white wines,yogurt
0,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,1
1,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,27
2,0,0,0,0,0,0,0,0,0,0,...,4,1,0,0,0,0,0,1,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,4
4,0,0,0,0,0,0,2,0,0,0,...,0,2,0,7,0,0,0,0,0,1


To see which aisles are the top-10 sellers, I calculate the mean of every column, that is, the average number of product units purchased by these customers, separately for every aisle. Note that the default of `apply` is `axis=0`, meaning columnwise.

In [4]:
import numpy as np
df.apply(np.mean).sort_values(ascending=False)[:10]

fresh fruits                     11.035034
fresh vegetables                 10.559160
packaged vegetables fruits        5.455937
yogurt                            4.554459
packaged cheese                   3.097172
milk                              2.656267
water seltzer sparkling water     2.555949
chips pretzels                    2.303263
soy lactosefree                   1.948659
refrigerated                      1.825565
dtype: float64
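
To make the role of the `axis` argument more concrete, here is a minimal sketch on a toy data frame (the frame `toy` and its numbers are invented for this illustration). It also shows that, for column means, the Pandas method `mean` gives the same result as `apply(np.mean)`.

```python
import numpy as np
import pandas as pd

# Toy purchase counts: three customers, three aisles (hypothetical data)
toy = pd.DataFrame({'milk': [2, 0, 4], 'tea': [1, 1, 1], 'yogurt': [0, 9, 3]})

col_means = toy.apply(np.mean)          # column by column (axis=0 is the default)
row_means = toy.apply(np.mean, axis=1)  # row by row

# The built-in method gives the same column means
same = np.allclose(col_means, toy.mean())
print(same)

print(col_means.sort_values(ascending=False))
```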

### 4-cluster analysis

I perform a **cluster analysis** to get the customer segments, using the scikit-learn module `cluster`. The **k-means** algorithm is provided by the class `KMeans`. I use the default arguments, with two exceptions: I set the number of clusters at 4 and I specify `random_state` to make the clustering process reproducible (so you can get the same segments that I am reporting here). This is a common approach for book authors when a random step is involved in an algorithm.

*Note*. I am not implying that the results are better with `random_state=0`. Any other choice will give you a similar structure, but the order of the clusters (so, the labels) can change. 

In [5]:
from sklearn import cluster
clus = cluster.KMeans(n_clusters=4, random_state=0)
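
The effect of `random_state` can be checked on a small scale. The sketch below uses made-up random points instead of the Instacart data, and sets `n_init=10` explicitly (the value shown in the output of this example) so that it does not depend on the default of a particular scikit-learn version. Two runs with the same seed are expected to return identical labels.

```python
import numpy as np
from sklearn import cluster

# Toy data: 200 points in two dimensions (made up for this example)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))

# Same seed, same initial centers, hence the same labels
labels_a = cluster.KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
labels_b = cluster.KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
print((labels_a == labels_b).all())
```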

The method `fit` groups the instances into the specified number of clusters. Note that `fit` takes a single argument, the feature matrix, because this is **unsupervised learning** and there is no target vector. This step can take a while on a slow computer.

In [6]:
clus.fit(df)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=4, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=0, tol=0.0001, verbose=0)

The method `predict` creates a vector containing a label for every customer, coding the segment to which the customer has been assigned. I convert this vector to a Pandas series, to be able to use Pandas functions in the analysis below.

In [7]:
segment = pd.Series(clus.predict(df))

Now, I can calculate the **segment sizes** with the method `value_counts`. Note that the cluster sizes are quite different. This is the rule, not the exception.

In [8]:
segment.value_counts()

2    54841
3    25498
0    17353
1     9746
dtype: int64
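
The sizes are easier to compare as shares of the sample. The following sketch recomputes them from the counts reported above, so it runs without the data set; on the actual series, `segment.value_counts(normalize=True)` should give the same proportions directly.

```python
import pandas as pd

# Segment sizes as reported in the output above
sizes = pd.Series({2: 54841, 3: 25498, 0: 17353, 1: 9746})

# Share of the sample in each segment, in percent
shares = (100 * sizes / sizes.sum()).round(1)
print(shares)
```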

The attribute `cluster_centers_` is a matrix with one row for each cluster. The entries are the coordinates of the **cluster center** in a 134-dimensional space in which every customer can be plotted as a point. The center of the cluster is the average of the points belonging to that cluster. It can be used to **profile** the cluster, that is, to describe the purchasing habits of its members in a user-friendly way.

In [9]:
centers = clus.cluster_centers_

In [10]:
centers.shape

(4, 134)
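
The statement that the center is the average of the points of the cluster can be verified on a toy example. This is only a sketch under stated assumptions: the data are two artificial, well-separated groups of points (not the Instacart table), and the names `toy` and `km` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn import cluster

# Toy data: two well-separated groups of points (made up for this example)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(50, 3)), rng.normal(10, 1, size=(50, 3))])
toy = pd.DataFrame(X, columns=['a', 'b', 'c'])

km = cluster.KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy)

# Average of the points assigned to each cluster, one row per label
group_means = toy.groupby(km.labels_).mean()

# Rows of cluster_centers_ follow the label order 0, 1
match = np.allclose(group_means.to_numpy(), km.cluster_centers_)
print(match)
```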

I convert each of the four rows of the matrix `centers` to a Pandas series, so I can examine them with ease.

In [11]:
center0 = pd.Series(data = centers[0, :], index=df.columns)
center0.sort_values(ascending=False).head(10).round(1)

fresh fruits                     21.8
yogurt                           11.8
fresh vegetables                  8.9
packaged vegetables fruits        8.4
packaged cheese                   5.0
milk                              4.6
chips pretzels                    3.6
baby food formula                 3.2
water seltzer sparkling water     3.1
energy granola bars               3.0
dtype: float64

In [12]:
center1 = pd.Series(data = centers[1, :], index=df.columns)
center1.sort_values(ascending=False).head(10).round(1)

fresh vegetables              37.1
fresh fruits                  25.7
packaged vegetables fruits    12.3
yogurt                         6.8
packaged cheese                5.3
fresh herbs                    3.8
milk                           3.6
soy lactosefree                3.5
frozen produce                 3.5
chips pretzels                 2.8
dtype: float64

In [13]:
center2 = pd.Series(data = centers[2, :], index=df.columns)
center2.sort_values(ascending=False).head(10).round(1)

fresh fruits                     4.9
fresh vegetables                 2.9
packaged vegetables fruits       2.6
water seltzer sparkling water    2.6
yogurt                           2.4
milk                             2.0
packaged cheese                  1.9
chips pretzels                   1.9
ice cream ice                    1.6
soft drinks                      1.5
dtype: float64

In [14]:
center3 = pd.Series(data = centers[3, :], index=df.columns)
center3.sort_values(ascending=False).head(10).round(1)

fresh vegetables                 18.0
fresh fruits                     11.3
packaged vegetables fruits        6.9
packaged cheese                   3.4
yogurt                            3.3
milk                              2.4
soy lactosefree                   2.3
water seltzer sparkling water     2.1
frozen produce                    2.1
fresh herbs                       2.1
dtype: float64

Although the average baskets of the customers of each segment are a bit different, the main difference between clusters is in the purchasing volume. 

* In cluster 1, we have the big spenders, who buy more vegetables than fruits. This is the smallest group, with about 9% of the sample.

* In cluster 0, the customers buy fewer vegetables and more yogurt.

* In cluster 3, the basket is similar to that of cluster 1, but at roughly half the volume.

* In cluster 2, which contains about one half of the sample, we have the smallest baskets. Apart from sparkling water, which ranks higher than in the other clusters, the basket has no distinctive features.
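
The comparison between clusters 1 and 3 can be checked with the center coordinates reported above. The sketch below copies the three largest coordinates by hand, so it covers only those aisles; with the full series, the same ratio could be computed as `center3 / center1`.

```python
import pandas as pd

# Center coordinates copied from the outputs above (top three aisles only)
aisles = ['fresh vegetables', 'fresh fruits', 'packaged vegetables fruits']
center1_top = pd.Series([37.1, 25.7, 12.3], index=aisles)
center3_top = pd.Series([18.0, 11.3, 6.9], index=aisles)

# Cluster 3 relative to cluster 1, aisle by aisle
ratio = (center3_top / center1_top).round(2)
print(ratio)
```

The ratios come out between 0.44 and 0.56, that is, around one half.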

### 20-cluster analysis

It is often said in statistics courses that the output of a cluster analysis is useful only if we can "understand" the clusters, which usually means that we can provide a short description. This implies, in most cases, a small number of clusters. Although this may be true in many cases, it is not so in e-commerce, where the software can manage a high number of clusters and use them for recommendation purposes.

I repeat the analysis, setting now `n_clusters=20`. This may take a few minutes on your computer.

In [15]:
clus = cluster.KMeans(n_clusters=20, random_state=0)
clus.fit(df)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=20, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=0, tol=0.0001, verbose=0)

In [16]:
segment = pd.Series(clus.predict(df))
segment.value_counts()

3     25053
1     13085
5     10981
4      7753
18     7221
9      5982
11     5408
12     4930
0      4533
2      4127
8      2829
7      2814
13     2754
14     2560
6      1646
16     1593
15     1493
17     1368
10      666
19      642
dtype: int64

I profile all the clusters, focusing on the top five aisles of each.

In [17]:
centers = clus.cluster_centers_
center0 = pd.Series(data = centers[0, :], index=df.columns)
center0.sort_values(ascending=False).head(5).round(1)

fresh vegetables              38.1
fresh fruits                  17.2
packaged vegetables fruits     9.8
yogurt                         4.4
fresh herbs                    4.1
dtype: float64

In [18]:
center1 = pd.Series(data = centers[1, :], index=df.columns)
center1.sort_values(ascending=False).head(5).round(1)

fresh vegetables              11.1
fresh fruits                   5.9
packaged vegetables fruits     4.2
packaged cheese                2.6
yogurt                         2.3
dtype: float64

In [22]:
center2 = pd.Series(data = centers[2, :], index=df.columns)
center2.sort_values(ascending=False).head(5).round(1)

fresh fruits                  29.8
packaged vegetables fruits     7.9
fresh vegetables               7.5
yogurt                         5.7
milk                           4.3
dtype: float64

In [19]:
center3 = pd.Series(data = centers[3, :], index=df.columns)
center3.sort_values(ascending=False).head(5).round(1)

fresh fruits                     2.4
water seltzer sparkling water    1.8
milk                             1.5
packaged vegetables fruits       1.5
soft drinks                      1.5
dtype: float64

In [20]:
center4 = pd.Series(data = centers[4, :], index=df.columns)
center4.sort_values(ascending=False).head(5).round(1)

fresh vegetables              23.2
fresh fruits                   9.0
packaged vegetables fruits     6.0
packaged cheese                3.3
yogurt                         3.2
dtype: float64

In [21]:
center5 = pd.Series(data = centers[5, :], index=df.columns)
center5.sort_values(ascending=False).head(5).round(1)

fresh fruits                  14.3
fresh vegetables               4.6
packaged vegetables fruits     4.1
yogurt                         2.9
milk                           2.8
dtype: float64

In [22]:
center6 = pd.Series(data = centers[6, :], index=df.columns)
center6.sort_values(ascending=False).head(5).round(1)

refrigerated                  19.3
fresh fruits                   9.3
fresh vegetables               6.1
packaged vegetables fruits     4.8
yogurt                         4.7
dtype: float64

In [27]:
center7 = pd.Series(data = centers[7, :], index=df.columns)
center7.sort_values(ascending=False).head(5).round(1)

packaged produce              15.8
fresh fruits                   9.4
packaged vegetables fruits     3.4
fresh vegetables               2.7
milk                           1.4
dtype: float64

In [23]:
center8 = pd.Series(data = centers[8, :], index=df.columns)
center8.sort_values(ascending=False).head(5).round(1)

fresh fruits                  37.2
fresh vegetables              28.5
packaged vegetables fruits    15.7
yogurt                         9.0
packaged cheese                7.2
dtype: float64

In [24]:
center9 = pd.Series(data = centers[9, :], index=df.columns)
center9.sort_values(ascending=False).head(5).round(1)

chips pretzels     7.0
packaged cheese    5.3
fresh fruits       5.3
ice cream ice      4.7
soft drinks        3.8
dtype: float64

In [25]:
center10 = pd.Series(data = centers[10, :], index=df.columns)
center10.sort_values(ascending=False).head(5).round(1)

yogurt                        48.2
fresh fruits                  14.6
energy granola bars            5.5
packaged vegetables fruits     5.5
milk                           5.1
dtype: float64

In [26]:
center11 = pd.Series(data = centers[11, :], index=df.columns)
center11.sort_values(ascending=False).head(5).round(1)

packaged vegetables fruits    14.7
fresh fruits                  13.4
fresh vegetables              11.6
packaged cheese                5.1
yogurt                         4.5
dtype: float64

In [27]:
center12 = pd.Series(data = centers[12, :], index=df.columns)
center12.sort_values(ascending=False).head(5).round(1)

yogurt                        17.1
fresh fruits                   8.8
fresh vegetables               5.1
packaged vegetables fruits     4.3
milk                           3.9
dtype: float64

In [28]:
center13 = pd.Series(data = centers[13, :], index=df.columns)
center13.sort_values(ascending=False).head(5).round(1)

fresh fruits                  22.5
yogurt                        21.5
fresh vegetables              17.3
packaged vegetables fruits    11.1
packaged cheese                7.8
dtype: float64

In [29]:
center15 = pd.Series(data = centers[15, :], index=df.columns)
center15.sort_values(ascending=False).head(5).round(1)

baby food formula             35.3
fresh fruits                  16.9
fresh vegetables              12.4
yogurt                         8.9
packaged vegetables fruits     7.7
dtype: float64

In [30]:
center16 = pd.Series(data = centers[16, :], index=df.columns)
center16.sort_values(ascending=False).head(5).round(1)

frozen meals                  23.1
fresh fruits                   7.2
fresh vegetables               5.1
yogurt                         4.8
packaged vegetables fruits     4.4
dtype: float64

In [31]:
center17 = pd.Series(data = centers[17, :], index=df.columns)
center17.sort_values(ascending=False).head(5).round(1)

fresh vegetables              57.6
fresh fruits                  31.1
packaged vegetables fruits    15.1
yogurt                         7.1
fresh herbs                    6.0
dtype: float64

In [32]:
center18 = pd.Series(data = centers[18, :], index=df.columns)
center18.sort_values(ascending=False).head(5).round(1)

fresh fruits                  21.1
fresh vegetables              20.6
packaged vegetables fruits     8.0
yogurt                         4.2
packaged cheese                3.8
dtype: float64

In [33]:
center19 = pd.Series(data = centers[19, :], index=df.columns)
center19.sort_values(ascending=False).head(5).round(1)

energy granola bars              31.8
fresh fruits                     13.5
yogurt                            8.6
chips pretzels                    7.2
water seltzer sparkling water     5.8
dtype: float64
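
With 20 clusters, repeating the same two lines for every center becomes tedious. A possible shortcut is sketched below. The function `top_aisles` is a hypothetical helper, not part of Pandas or scikit-learn, and it is demonstrated on a made-up matrix of centers so that it can be run without the data set.

```python
import numpy as np
import pandas as pd

def top_aisles(centers, columns, n=5):
    """Return one row per cluster with the names of its n largest coordinates."""
    frame = pd.DataFrame(centers, columns=columns)
    rows = {k: list(frame.loc[k].nlargest(n).index) for k in frame.index}
    return pd.DataFrame.from_dict(rows, orient='index')

# Toy centers for two clusters and four aisles (made-up numbers)
toy_centers = np.array([[5.0, 1.0, 0.5, 2.0],
                        [0.2, 7.0, 3.0, 0.1]])
toy_columns = ['fresh fruits', 'yogurt', 'milk', 'tea']

result = top_aisles(toy_centers, toy_columns, n=2)
print(result)
```

With the objects of this example, the call would be `top_aisles(clus.cluster_centers_, df.columns)`, which returns a table with 20 rows and 5 columns.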

To keep it short, I do not comment on the results cluster by cluster. Instead, I point to clusters 6, 7, 9, 16 and 19. Here we have **market niches**.

### Source of the data

The Instacart Online Grocery Shopping Dataset 2017.