## Introduction

Instacart is a US service which allows customers to order goods online and have them picked up/delivered by a personal shopper. I'm not personally familiar with the company, but I think "Uber for grocery shopping" is a good way of putting it.

A few years ago, Instacart posted a [competition on Kaggle](https://www.kaggle.com/c/instacart-market-basket-analysis), which included a large dataset of anonymised orders. The intent of the competition was to predict repeat product orders; however, I'm going to use this data to explore customer segmentation using unsupervised machine learning, specifically K-means clustering.

Customer segmentation is a good way to better understand a company's user base. It enables targeted campaigns, informs long-term development, and generally helps a business move away from thinking of 'the customer' as one monolithic entity.

A blunt approach to customer segmentation would be to decide intuitively what customer types may be out there, and set parameters to define which segment any given customer falls into. You might say 'if the customer uses the service on average over twice a month then they are "regular"', or 'if the customer spends over $50 on average then they are "high value"'. In my experience, business stakeholders quite like this approach because it's easy to understand and they have a lot of control over it. (It even allows them to massage the numbers later if, for example, the number of "high value" customers is inconveniently low - just move that threshold down by a few dollars!) The downside of the blunt approach is that the thresholds are essentially arbitrary and rarely align with the actual boundaries that exist between segments in the data.
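
To make the blunt approach concrete, here is a minimal sketch of threshold-based segmentation. The `customers` table, its column names and its values are all made up for illustration (the Instacart data has no spend information); only the two thresholds come from the examples above.

```python
import pandas as pd

# Hypothetical per-customer summary; columns and values are illustrative only.
customers = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "orders_per_month": [0.5, 2.5, 3.0, 1.0],
    "mean_spend": [20.0, 45.0, 80.0, 51.0],
})

# Hand-picked thresholds, taken from the examples in the text.
REGULAR_MIN_ORDERS_PER_MONTH = 2
HIGH_VALUE_MIN_SPEND = 50

# Each segment is just a boolean rule applied to one column.
customers["is_regular"] = customers["orders_per_month"] > REGULAR_MIN_ORDERS_PER_MONTH
customers["is_high_value"] = customers["mean_spend"] > HIGH_VALUE_MIN_SPEND

print(customers)
```

Note how customers 2 and 4 sit only a few dollars apart yet land on opposite sides of the "high value" line; nudging `HIGH_VALUE_MIN_SPEND` changes their labels without anything changing in the data.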

Bringing machine learning to this problem allows us to use the data itself to inform the two main parameters of the segmentation, and thus do away with a lot of the bias or misjudgement that comes from the blunt approach:
- How many segments should we be looking for?
- What are the defining features of each of those segments?

## Approach

I'm going to keep this exercise fairly simple and high-level. My aim is to identify characteristics that may help to cluster customers together, and then I will iterate through different combinations of those features (and numbers of clusters) to find an optimum model - i.e. the model with the most useful segmentation in terms of distinct clusters.

For convenience and performance I will be using the scikit-learn library's K-means clustering algorithm.
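
As a rough sketch of what iterating over numbers of clusters can look like with scikit-learn, the snippet below fits `KMeans` for several values of `k` and compares them. Two things here are assumptions rather than part of this analysis: the data is synthetic (generated with `make_blobs`, not taken from Instacart), and the silhouette score is just one possible way of measuring how distinct the clusters are.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a customer feature matrix: three well-separated blobs.
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)

# K-means is distance-based, so put the features on a comparable scale first.
X_scaled = StandardScaler().fit_transform(X)

# Fit one model per candidate k and record its silhouette score
# (ranges from -1 to 1; higher means tighter, better-separated clusters).
scores = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0)
    labels = model.fit_predict(X_scaled)
    scores[k] = silhouette_score(X_scaled, labels)

best_k = max(scores, key=scores.get)
print(scores, best_k)
```

On real customer data the clusters will be far less clean than these blobs, so the score is better treated as a guide than as a verdict.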

As mentioned, the data that I will be using was originally intended for a different purpose, and as such it contains a few things that I'm not interested in and shall ignore. The train/test split is irrelevant here. It is likely also missing a few things that would be useful for customer segmentation - a glaring omission is any data on customer spending - so we'll have to use our imagination a little.

In [1]:
import pandas as pd
import numpy as np
import plotly.express as px
from sklearn.cluster import KMeans

pd.set_option('display.max_columns', None)

order_products__train = pd.read_csv("order_products__train.csv")
order_products__prior = pd.read_csv("order_products__prior.csv")
order_products = pd.concat([order_products__train, order_products__prior])

orders = pd.read_csv("orders.csv")
products = pd.read_csv("products.csv")
aisles = pd.read_csv("aisles.csv")
departments = pd.read_csv("departments.csv")

### Explore the data

Having imported the data, it is worth establishing a few basics about it and about Instacart itself. Chiefly: how often do the customers in this data use the service, and what products are being bought? That is, is the service used exclusively for groceries, or are there likely to be contingents of customers using it for non-perishables and the like?

In [17]:
# Aggregate orders by customer and show order count/lag distribution

orders_by_customer = orders.groupby("user_id").agg(
    count_of_orders=("order_id", "count"),
    mean_days_since_prior_order=("days_since_prior_order", "mean")
).reset_index()

orders_by_customer["mean_days_since_prior_order"] = orders_by_customer["mean_days_since_prior_order"].round()

customers_by_count_of_orders = orders_by_customer.groupby("count_of_orders").user_id.count().reset_index()
customers_by_count_of_orders.rename(columns={"user_id":"user_count"}, inplace=True)

customers_by_mean_days_since_prior_order = orders_by_customer.groupby("mean_days_since_prior_order").user_id.count().reset_index()
customers_by_mean_days_since_prior_order.rename(columns={"user_id":"user_count"}, inplace=True)

fig_count_of_orders = px.bar(
    customers_by_count_of_orders, 
    x="count_of_orders", 
    y="user_count", 
    title="Customers by Total Orders"
)
fig_count_of_orders.show()

fig_mean_days_since_prior_order = px.bar(
    customers_by_mean_days_since_prior_order, 
    x="mean_days_since_prior_order", 
    y="user_count",
    title="Customers by Mean Days Since Prior Order"
)
fig_mean_days_since_prior_order.show()


orders_by_customer[["count_of_orders", "mean_days_since_prior_order"]].describe()

| | count_of_orders | mean_days_since_prior_order |
|---|---|---|
| count | 206209.0 | 206209.0 |
| mean | 16.590367 | 15.448749 |
| std | 16.654774 | 6.923002 |
| min | 4.0 | 0.0 |
| 25% | 6.0 | 10.0 |
| 50% | 10.0 | 15.0 |
| 75% | 20.0 | 21.0 |
| max | 100.0 | 30.0 |


In [22]:
# Aggregate products ordered by department and show distribution

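# Join on explicit keys rather than relying on pandas to infer the shared columns
order_products_product = pd.merge(order_products, products, on="product_id")
order_products_department = pd.merge(order_products_product, departments, on="department_id")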

orders_by_department = order_products_department.groupby("department").order_id.count().reset_index()
orders_by_department.rename(columns={"order_id":"count_of_orders"}, inplace=True)
orders_by_department.sort_values("count_of_orders", inplace=True)

fig_orders_by_department = px.bar(
    orders_by_department, 
    x="count_of_orders", 
    y="department", 
    orientation="h",
    title="Orders by Department"
)
fig_orders_by_department.show()

What have we learned here?

- The data that Instacart have shared only includes customers who have used the service four times or more. Therefore we must bear in mind that this analysis is not all-encompassing - we are segmenting Instacart's regular customers.
- Instacart have hard-limited order counts and days between orders at 100 and 30 respectively, for an unknown reason. This isn't a huge problem, but again it's worth bearing in mind in case we see something strange later.
- The service is very heavily used for grocery shopping, rather than anything else. It's nothing like a service such as Amazon where we would likely be able to segment heavily based on products ordered - the overwhelming customer mission here is grocery shopping, but it's possible we may see slight variations in the departments frequented. Moving the analysis down to aisle- or product-level is probably going to be too granular.
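
One quick way to gauge how much the caps matter is to measure what share of values sit exactly at the maximum. The sketch below does this for the gap between orders, using a tiny invented table (`orders_toy`) rather than the real `orders` data. It also assumes that a customer's first order has no prior gap and shows up as a missing value.

```python
import pandas as pd

# Toy stand-in for orders.csv; values are invented for illustration.
orders_toy = pd.DataFrame({
    "order_id": [1, 2, 3, 4, 5, 6],
    "user_id": [1, 1, 1, 2, 2, 2],
    "days_since_prior_order": [None, 7.0, 30.0, None, 30.0, 30.0],
})

# The cap is taken to be the largest value observed in the column.
cap = orders_toy["days_since_prior_order"].max()

# Ignore first orders (missing gap), then measure the share sitting at the cap.
gaps = orders_toy["days_since_prior_order"].dropna()
share_at_cap = (gaps == cap).mean()

print(cap, share_at_cap)
```

A large share at the cap would suggest that "30 days" really means "30 days or more", which is worth remembering when interpreting any cluster defined by long gaps between orders.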