# Customer Segmentation in Python

This notebook explains how to perform Association Analysis from customer purchase history data. We are using [pandas](https://pandas.pydata.org) (for data manipulation) and [mlxtend](https://github.com/rasbt/mlxtend) (for apriori and association rules algorithnms). The data we're using comes from John Foreman's book [Data Smart](http://www.john-foreman.com/data-smart-book.html).

## Understanding the data

The dataset contains both information on marketing newsletters/e-mail campaigns (e-mail offers sent)...

In [4]:
import pandas as pd

df_offers = pd.read_excel("data/WineKMC.xlsx", sheet_name=0)
df_offers.columns = ["offer_id", "campaign", "varietal", "min_qty", "discount", "origin", "past_peak"]
df_offers.head()

Unnamed: 0,offer_id,campaign,varietal,min_qty,discount,origin,past_peak
0,1,January,Malbec,72,56,France,False
1,2,January,Pinot Noir,72,17,France,False
2,3,February,Espumante,144,32,Oregon,True
3,4,February,Champagne,72,48,France,True
4,5,February,Cabernet Sauvignon,144,44,New Zealand,True


and transaction level data from customers (which offer customers responded to and what they bought).

In [5]:
df_transactions = pd.read_excel("data/WineKMC.xlsx", sheet_name=1)
df_transactions.columns = ["customer_name", "offer_id"]
df_transactions['n'] = 1
df_transactions.head()

Unnamed: 0,customer_name,offer_id,n
0,Smith,2,1
1,Smith,24,1
2,Johnson,17,1
3,Johnson,24,1
4,Johnson,26,1


## Clustering the data

First we are going to combine both data sets

In [10]:
# join the offers and transactions table
df = pd.merge(df_offers, df_transactions)
# create a "pivot table" which will give us the number of times each customer responded to a given offer
matrix = df.pivot_table(index=['customer_name'], columns=['offer_id'], values='n')
# a little tidying up. fill NA values with 0 and make the index into a column
matrix = matrix.fillna(0).reset_index()
# save a list of the 0/1 columns. we'll use these a bit later
x_cols = matrix.columns[1:]

We need to propose a proper number of clusters. Let's use the silhouette score

In [21]:
from sklearn.metrics import silhouette_score

range_n_clusters = [2, 3, 4, 5, 6, 7, 8, 9, 10]            # clusters range you want to select
dataToFit = matrix[matrix.columns[2:]] 
best_clusters = 0                       # best cluster number which you will get
previous_silh_avg = 0.0

for n_clusters in range_n_clusters:
    clusterer = KMeans(n_clusters=n_clusters)
    cluster_labels = clusterer.fit_predict(dataToFit)
    silhouette_avg = silhouette_score(dataToFit, cluster_labels)
    print("For n_clusters =", n_clusters,
          "The average silhouette_score is :", silhouette_avg)
    if silhouette_avg > previous_silh_avg:
        previous_silh_avg = silhouette_avg
        best_clusters = n_clusters

# Final Kmeans for best_clusters
kmeans = KMeans(n_clusters=best_clusters, random_state=0).fit(dataToFit)

For n_clusters = 2 The average silhouette_score is : 0.09467039888175721
For n_clusters = 3 The average silhouette_score is : 0.12478535792491957
For n_clusters = 4 The average silhouette_score is : 0.13592449651185978
For n_clusters = 5 The average silhouette_score is : 0.13655997960045999
For n_clusters = 6 The average silhouette_score is : 0.11271906335295924
For n_clusters = 7 The average silhouette_score is : 0.13264633025262193
For n_clusters = 8 The average silhouette_score is : 0.11229778764053028
For n_clusters = 9 The average silhouette_score is : 0.13447090205346915
For n_clusters = 10 The average silhouette_score is : 0.11276541602269813


According to this analysis 5 is a good candidate. Let's determine the number of customers per cluster.

In [22]:
cluster = KMeans(n_clusters=5)
# slice matrix so we only include the 0/1 indicator columns in the clustering
matrix['cluster'] = cluster.fit_predict(matrix[x_cols])
matrix.cluster.value_counts()

1    40
3    24
2    14
4    13
0     9
Name: cluster, dtype: int64