<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction</a></span><ul class="toc-item"><li><span><a href="#Notebook-Introduction" data-toc-modified-id="Notebook-Introduction-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Notebook Introduction</a></span></li></ul></li><li><span><a href="#Setup" data-toc-modified-id="Setup-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Setup</a></span><ul class="toc-item"><li><span><a href="#Importing-the-libraries" data-toc-modified-id="Importing-the-libraries-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Importing the libraries</a></span></li><li><span><a href="#Read-in-the-dataset" data-toc-modified-id="Read-in-the-dataset-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Read in the dataset</a></span></li></ul></li><li><span><a href="#Customer-Clustering" data-toc-modified-id="Customer-Clustering-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Customer Clustering</a></span><ul class="toc-item"><li><span><a href="#Model-functions" data-toc-modified-id="Model-functions-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Model functions</a></span></li><li><span><a href="#Choosing-the-right-columns" data-toc-modified-id="Choosing-the-right-columns-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Choosing the right columns</a></span></li><li><span><a href="#Scaling" data-toc-modified-id="Scaling-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Scaling</a></span></li><li><span><a href="#Run-the-elbow-method" data-toc-modified-id="Run-the-elbow-method-3.4"><span class="toc-item-num">3.4&nbsp;&nbsp;</span>Run the elbow method</a></span></li><li><span><a href="#Run-KMeans" data-toc-modified-id="Run-KMeans-3.5"><span class="toc-item-num">3.5&nbsp;&nbsp;</span>Run KMeans</a></span></li></ul></li><li><span><a href="#Cluster-Analysis" data-toc-modified-id="Cluster-Analysis-4"><span 
class="toc-item-num">4&nbsp;&nbsp;</span>Cluster Analysis</a></span><ul class="toc-item"><li><span><a href="#Cluster-Breakdown" data-toc-modified-id="Cluster-Breakdown-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Cluster Breakdown</a></span></li><li><span><a href="#Improving-our-Results" data-toc-modified-id="Improving-our-Results-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>Improving our Results</a></span></li><li><span><a href="#Executive-View-and-Marketing-Strategy" data-toc-modified-id="Executive-View-and-Marketing-Strategy-4.3"><span class="toc-item-num">4.3&nbsp;&nbsp;</span>Executive View and Marketing Strategy</a></span><ul class="toc-item"><li><span><a href="#Cluster-Proportions" data-toc-modified-id="Cluster-Proportions-4.3.1"><span class="toc-item-num">4.3.1&nbsp;&nbsp;</span>Cluster Proportions</a></span></li><li><span><a href="#Building-a-radial-view" data-toc-modified-id="Building-a-radial-view-4.3.2"><span class="toc-item-num">4.3.2&nbsp;&nbsp;</span>Building a radial view</a></span></li></ul></li><li><span><a href="#Marketing-Strategies" data-toc-modified-id="Marketing-Strategies-4.4"><span class="toc-item-num">4.4&nbsp;&nbsp;</span>Marketing Strategies</a></span></li></ul></li><li><span><a href="#Customer-Segmentation-Conclusions" data-toc-modified-id="Customer-Segmentation-Conclusions-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Customer Segmentation Conclusions</a></span></li></ul></div>

# E-Commerce 

**Notebook 3 - Customer Segmentation**

This project will explore an E-commerce dataset of transactions from a UK registered online store. The dataset covers the period of 01/12/2010 - 09/12/2011. To access the dataset and read more about it, please refer to its [UCI repo](http://archive.ics.uci.edu/ml/datasets/Online+Retail).

![alt text](imgs/ecom_back.png "Title")

## Introduction

This project will go through the following stages using this data. There is a separate notebook for each process.

- NB1: Data loading & Data Cleaning
- NB2: Exploratory Data Analysis (EDA)
- NB3: Customer Segmentation
- NB4: Attrition Prevention Strategies 
- NB5: Product Recommendation (WIP)

This project is using the cookiecutter [data science template](https://github.com/drivendata/cookiecutter-data-science). More about this can be found [in this article](https://medium.com/@rrfd/cookiecutter-data-science-organize-your-projects-atom-and-jupyter-2be7862f487e).

### Notebook Introduction

This notebook will use the findings from the Customer Analysis of the EDA notebook to cluster our customers into categories. These categories can then be used to decide on targeted offers, product recommendations and loyalty schemes.

The notebook is split into 3 main sections:

1. Build and optimise a KMeans clustering algorithm using the properties of each customer to identify customer clusters.


2. Analyze the cluster results to understand the properties of each customer in each group.


3. Decide on recommendations and future strategies based on these clusters

## Setup

This section will set up our notebook by importing the right libraries, setting paths and reading the data.

### Importing the libraries

The following libraries and paths will be used throughout the project.

In [1]:
# This allows us to synchronise our IDE with
# the notebook for efficient function storage.
%load_ext autoreload
%autoreload 2

In [14]:
# Generic libraries
import os
import sys
from pathlib import Path
import warnings
from tqdm import tqdm 
from datetime import datetime
from collections import defaultdict

# Data manipulation
import numpy as np
import pandas as pd

# Visualisation
import plotly.express as px
import matplotlib.pyplot as plt
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)
%matplotlib inline

# Modeling
from sklearn.cluster import KMeans

# Import our helpers module
import src
from src.data import utils
from src.features import build_features
from src.visualization import visualize
from src.models import modeling

# Ensure that we are operating from our base dir
os.chdir(Path(src.__file__).resolve().parents[1])

# Define the paths of the data folders
# used throughout the project
data_folder = "data"
raw_path = os.path.join(data_folder, "raw")
int_path = os.path.join(data_folder, "interim")
processed_path = os.path.join(data_folder, "processed")

You should be able to see your project directory if you run the below command (e.g. `C:\Users\username\Desktop\ecom_project`)

In [3]:
# print(os.getcwd())

### Read in the dataset

Recall that we are using the dataset from [this website](http://archive.ics.uci.edu/ml/datasets/Online+Retail). In the first notebook we have prepared the data and created five distinct datasets:

- Transaction Data (This was the original dataset)
- Customer Data
- Product Data
- Invoice Data
- Main Agg Data

We then carried out some feature engineering in the EDA notebook on the customer data. We will use the latest version of that data. Recall that we had two different datasets:

1. Customer data which is likely to be corporate organisations
2. Customer data which is likely to be individuals

For this notebook we will focus only on the individuals, as there are fewer corporate accounts and they all generally have high quantities and high revenues.

In [6]:
# Define the paths of the processed dataframes
ind_fn = os.path.join(processed_path, "customer_data_process_ind.csv")

df_ind = pd.read_csv(ind_fn)

Let's recall what the data looks like.

In [7]:
utils.quick_summary(df_ind, "Process Customer Data", row_num=10, show_summary=False)



PROCESS CUSTOMER DATA
----------------------


Number of rows: 4026 	 Number of Columns: 20


Unnamed: 0,customer_id,country,orders,first_purchase,last_purchase,quantity,unq_products,total_spend,cancel_rate,total_loss,min,median,max,std,lifetime,period_perc,time_inactive,ord_spend_rate,quant_spend_rate,quant_rate
0,13468,United Kingdom,37,2010-12-01 15:08:00,2011-12-08 10:39:00,2581,184,5656.75,0.003311,48.22,0.0,10.3565,25.976,6.00105,372.9,1.0,1.1,152.885135,2.191689,69.756757
1,13534,United Kingdom,25,2010-12-16 15:23:00,2011-12-07 14:52:00,2879,119,5643.06,0.013514,227.02,0.0,12.303,54.02,14.928685,357.9,0.96,1.9,225.7224,1.960076,115.16
2,18118,United Kingdom,26,2010-12-05 12:13:00,2011-11-29 11:32:00,2877,424,5595.77,0.003127,58.05,2.967,10.04,56.068,12.317778,369.0,0.989,10.1,215.221923,1.945002,110.653846
3,17049,United Kingdom,9,2011-03-09 08:33:00,2011-12-07 10:48:00,2669,201,5594.78,0.009434,65.2,0.001,32.503,61.029,19.639907,275.2,0.738,2.1,621.642222,2.096208,296.555556
4,12481,Germany,10,2010-12-09 10:13:00,2011-11-17 08:29:00,3290,150,5590.86,0.0,5.46,0.001,25.101,113.954,38.630907,365.1,0.979,22.2,559.086,1.69935,329.0
5,17757,United Kingdom,30,2010-12-02 17:17:00,2011-12-08 15:31:00,3316,275,5585.49,0.0,19.8,1.979,11.887,26.932,7.481823,371.8,0.997,0.9,186.183,1.684406,110.533333
6,12539,Spain,4,2011-01-10 09:11:00,2011-11-17 13:30:00,2067,157,5568.35,0.0,0.0,17.079,54.205,239.896,119.378183,333.2,0.893,22.0,1392.0875,2.693928,516.75
7,17716,United Kingdom,11,2011-03-17 10:38:00,2011-11-17 12:00:00,3016,171,5550.79,0.032653,105.2,0.001,14.5625,72.033,24.978878,267.1,0.716,22.0,504.617273,1.840448,274.181818
8,16161,United Kingdom,19,2010-12-06 10:03:00,2011-12-08 12:10:00,2573,286,5487.57,0.010549,87.99,0.88,16.038,52.899,14.700812,368.1,0.987,1.0,288.819474,2.132752,135.421053
9,16985,United Kingdom,10,2010-12-16 10:38:00,2011-11-22 13:33:00,1830,62,5461.62,0.0,2.5,0.001,45.063,70.959,29.791103,358.1,0.96,17.0,546.162,2.984492,183.0


As we can see we have 4026 unique customers with around 15 features that we can explore.

## Customer Clustering

For this section we will build a KMeans clustering algorithm that will enable us to cluster "similar" customers into groups. We will then be able to analyze these groups and come up with targeted actions against each of them. Below I have included some additional details about some of the steps you will see in this section.

**KMeans Model:** The KMeans algorithm is one of the most commonly used unsupervised algorithms for clustering datasets of medium to large sizes. In simple terms, the algorithm works with a set of centroids, one for each required cluster. The steps are:

1. The centroids are initialized randomly around the dataset space. 
2. The points closest (using Euclidean distance metric) to each centroid (cluster) are assigned to that cluster
3. The centroids are recalculated by taking the average across all dimensions for all members of that cluster
4. Steps 2 and 3 are repeated with the updated centroids
5. The loop stops after X iterations, or earlier if the assignments no longer change

The final output contains the individual clusters. The illustration below helps demonstrate the above steps.

![title](https://miro.medium.com/max/832/1*O6_nsE3nLwPw1thqaGmTYA.gif)
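
The five steps above can be sketched in a few lines of NumPy. This is only an illustration and not the implementation used in this notebook (we rely on sklearn's `KMeans` later on). The function name `kmeans_sketch`, the synthetic data and the choice to initialise the centroids from randomly picked data points are all assumptions made for the example.

```python
import numpy as np


def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Toy KMeans loop that follows the five steps listed above."""
    rng = np.random.default_rng(seed)
    # Step 1: initialise the centroids (assumption: pick k distinct data points)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each point to its closest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its members
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Steps 4-5: repeat until the centroids stop moving or n_iter is reached
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids


# Two well-separated blobs of synthetic points
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.5, size=(50, 2)),
               rng.normal(5, 0.5, size=(50, 2))])
labels, centroids = kmeans_sketch(X, k=2)
```

On data this cleanly separated, each blob should end up in its own cluster. Real customer data is rarely that tidy, which is why the number of clusters needs more thought.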


**How to choose our clusters:** As this is an unsupervised model, we don't have a pre-defined number of clusters (unless we have some prior knowledge of that particular data). We therefore need to work out a way to find the optimum number of clusters. In practice there are two approaches:

1. *Use Intuition / Case Specific :* Based on your dataset you can estimate the number of clusters that you expect. You also need to think about practicality. For our examples, we can't have 30 clusters of customers as it will be nearly impossible to come up with a specific campaign for each one. We also need the clusters to be fairly generic.


2. *Elbow Method:* In addition to the above method, which is more business-case specific, there is a relatively simple way to get a feel for how different cluster numbers affect the result. This is called the "Elbow Method". Recall that KMeans works by taking the Euclidean distance between the individual points and the surrounding centroids. If you sum the squared distances between every point and the centroid of the cluster it is assigned to, you get a metric called "inertia". The more clusters you have, the lower the "inertia", as each point ends up closer to its centroid. An easy way to think about this is that at the extreme, where we define as many centroids / clusters as data points, the "inertia" is zero. Therefore you would expect the "inertia" to decrease naturally as the number of clusters increases. Usually there is a certain number of clusters beyond which the "inertia" decreases much more slowly, causing a "kink" in the curve. This is where the term "Elbow Method" comes from, as the line resembles a human elbow. You then choose that number as your number of clusters. You can read more about the elbow method [here](https://www.geeksforgeeks.org/elbow-method-for-optimal-value-of-k-in-kmeans/).

As you will see later on, for this project we will use a combination of both methods to decide on the number of clusters. In a real life project it is likely that the business has an "ideal" number or range of clusters that they would like to use as it fits their wider strategy. 
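
To make the "inertia" definition concrete, the sketch below computes it by hand and keeps sklearn's own `inertia_` value next to it for comparison. The data is synthetic (two blobs), and `n_init=10` and `random_state=0` are choices made only for this illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: two blobs, so we expect the "elbow" at 2 clusters
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(100, 2)),
               rng.normal(6, 1, size=(100, 2))])

inertias = {}
manual_inertias = {}
for k in range(1, 6):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # Inertia by hand: squared distance from each point
    # to the centroid of its own cluster, summed over all points
    own_centroid = model.cluster_centers_[model.labels_]
    manual_inertias[k] = ((X - own_centroid) ** 2).sum()
    inertias[k] = model.inertia_

# How much inertia we remove by going from k-1 to k clusters
drops = {k: inertias[k - 1] - inertias[k] for k in range(2, 6)}
print(drops)
```

The largest entry in `drops` should be the step from 1 to 2 clusters, with much smaller gains afterwards. That flattening is the "kink" the elbow method looks for.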

**Feature Scaling:** Scaling, as the name suggests, refers to different techniques used in machine learning to (usually) shorten the range of your values. Most of the time this is done to reduce processing times, but for some models it can also have a large effect on the final result. KMeans uses Euclidean distances, so when the values have very different ranges across dimensions, the features with the largest ranges dominate the distance calculation. The resulting clusters can be unclear and the model can be slower to converge.

The scaling you use depends on the type of data you are using. The most common scaling methods are standard and min/max scaling, both of which are available in the *sklearn* library. If you recall from the EDA part of this study, most of our customer properties are right skewed, with long right tails of extreme values. According to [this article](https://towardsdatascience.com/top-3-methods-for-handling-skewed-data-1334e0debf45), one of the most efficient ways of scaling heavily right skewed distributions is simply taking the log of the values. This transformation is very common in the field of statistics too. Therefore, this is the approach we will use.
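
As a quick illustration of why the log transform suits right skewed data, the sketch below compares the skewness of a synthetic log-normal sample before and after `np.log1p`. The sample only stands in for a column such as `total_spend`; it is not the project data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Synthetic right skewed values (log-normal), standing in for a spend column
spend = pd.Series(rng.lognormal(mean=6, sigma=1, size=5000))

raw_skew = spend.skew()
log_skew = np.log1p(spend).skew()
print(f"skew before: {raw_skew:.2f}, after log1p: {log_skew:.2f}")

# log1p(0) is 0, so zero-valued features (e.g. a cancel rate of 0) stay finite,
# whereas a plain log would return -inf
zero_ok = np.log1p(0.0)
```

The skewness after the transform should be close to zero, meaning the long right tail no longer stretches the range of the feature.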

### Model functions

We start by creating all our functions we will use for this part of the project. Below we have the following functions:

- Scaling using the `np.log1p()` method
- Creating a KMeans model using sklearn
- Running the "Elbow Method" and plotting the elbow method

All these functions will be stored in `models\modeling.py`. 

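```python
import numpy as np
import pandas as pd
import plotly.express as px
from collections import defaultdict
from sklearn.cluster import KMeans
from tqdm import tqdm


def log_scale_dataset(df):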
    
    """
    Takes in a dataframe of
    numerical columns and 
    transforms it using the
    np.log1p function.
    
    Parameters:
    -----------
    
    df : dataframe
    
    Dataframe with numerical 
    values to be scaled
    
    Returns:
    --------
    
    df_out : dataframe
    
    Dataframe with scaled values
    
    """
    
    df_scaled = df.copy()
    all_cols = df_scaled.columns
    
    # We use np.log1p instead of log
    # as it copes a lot better
    # with zero and very small values
    # https://numpy.org/doc/stable/reference/generated/numpy.log1p.html
    for col in all_cols:
        df_scaled[col] = np.log1p(df_scaled[col])
        
    return df_scaled

def run_kmeans(df, cluster_num, fit_only=False, iter_num=1000):
    
    """
    Runs a kmeans algorithm using
    the sklearn library. 
    
    Parameters:
    -----------
    
    df : dataframe
    
    Dataframe to cluster
    
    cluster_num : int
    
    The number of clusters to use
    for the algorithm
    
    fit_only : bool (default = False)
    
    If True it only fits the model
    and returns the model rather than
    getting the results. If False it 
    returns both the clusters and the
    model class.
    
    iter_num : int
    
    The number of iterations to run
    the model for.
    
    Returns:
    --------
    
    model : sklearn class
    
    The model class with all 
    its artifacts
    
    results : numpy array
    
    An array of all cluster labels
    predicted by the model. Only if
    fit_only is False
    
    """
    
    # Define the model. We use "k-means++"
    # as the initialisation, which spreads
    # out the initial centroids
    # https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
    
    model = KMeans( n_clusters = cluster_num, 
                    init='k-means++', 
                    max_iter=iter_num)
    
    # If we want to only return the
    # model then we use the fit only
    # method. This is useful when you
    # are running the "Elbow Method"
    if fit_only:
        model.fit(df)
        return model
    
    else:
        results = model.fit_predict(df)
        return model, results


def create_elbow_fig(df):
    
    """
    Takes in a dataframe
    of inertia and cluster
    number and plots the
    elbow method.
    
    Parameters:
    -----------
    
    df : dataframe
    
    Dataframe with inertia 
    values and cluster numbers
    
    Returns:
    --------
    
    fig : plotly figure
    
    The elbow method figure
    
    """
    
    # Copy the dataframe and build a line chart to 
    # represent the elbow method
    df_elbow = df.copy()
    
    fig = px.line(df_elbow,
                  x="cluster_num",
                  y="intertia",
                  title="Elbow Method Chart Inertia vs Number of Clusters",
                  template="ggplot2")

    fig.update_traces(line_width=3)

    fig.update_layout(title_font_size=14,
                      font_size=10)
    
    fig.update_yaxes(title="Inertia")
    fig.update_xaxes(title="Number of Clusters",
                     dtick=1)
    
    return fig
    
    
def run_elbow_method(df, max_clusters, iter_num=1000):

    """
    Takes in a dataframe and 
    runs the elbow method 
    for up to a max number of
    clusters. It then plots
    the elbow chart.
    
    Parameters:
    -----------
    
    df : dataframe
    
    The dataframe to cluster
    
    max_clusters : int
    
    The maximum number of clusters
    to test for.
    
    iter_num : int
    
    The number of iterations to run
    the model for.
    
    Returns:
    --------
    
    fig : plotly figure
    
    The elbow method figure
    
    df_elbow : dataframe
    
    The dataframe with the inertia
    scores and the number of
    clusters
    
    """

    # Initialize all variables
    score_dict = defaultdict(list)

    # For each cluster number create a
    # model and get the inertia value, then
    # add the inertia and the cluster
    # number to the dictionary
    for cluster in tqdm(range(1, max_clusters + 1)):

        model = run_kmeans(df=df, cluster_num=cluster, fit_only=True, iter_num=iter_num)

        inertia = model.inertia_
        score_dict["cluster_num"].append(cluster)
        score_dict["intertia"].append(inertia)

    # Create a dataframe and visualise it
    df_elbow = pd.DataFrame(score_dict)
    fig = create_elbow_fig(df=df_elbow)

    return fig, df_elbow

```

### Choosing the right columns

Thinking about the analysis we saw on our customers, we want groups that differentiate customers in the following ways:

- Total amount they spent on the website (total_spend)
- The frequency they make purchases (median)
- The cancellation rate (cancel_rate)
- Lifetime / How long they have been customers (lifetime)
- Inactivity / Time since last shop (time_inactive)
- Total number of products they buy (quantity)
- Total orders (orders)
- Number of products per order (quant_rate)
- Number of unique products they bought (unq_products)

We first copy the dataframe to avoid losing the original dataset.

In [9]:
df_model = df_ind.copy()

We then define the columns we want to use as above.

In [10]:
# We define the columns we want to use
cols_to_use = ['total_spend',
               'median',
               'cancel_rate',
               'lifetime',
               'time_inactive',
               'quantity',
               'orders',
               'quant_rate',
               'unq_products']

# Filter for those columns
df_model = df_model[cols_to_use]

# Print a preview
utils.quick_summary(df_model, "Model input dataframe before scaling", row_num=10, show_summary=False)



MODEL INPUT DATAFRAME BEFORE SCALING
-------------------------------------


Number of rows: 4026 	 Number of Columns: 9


Unnamed: 0,total_spend,median,cancel_rate,lifetime,time_inactive,quantity,orders,quant_rate,unq_products
0,5656.75,10.3565,0.003311,372.9,1.1,2581,37,69.756757,184
1,5643.06,12.303,0.013514,357.9,1.9,2879,25,115.16,119
2,5595.77,10.04,0.003127,369.0,10.1,2877,26,110.653846,424
3,5594.78,32.503,0.009434,275.2,2.1,2669,9,296.555556,201
4,5590.86,25.101,0.0,365.1,22.2,3290,10,329.0,150
5,5585.49,11.887,0.0,371.8,0.9,3316,30,110.533333,275
6,5568.35,54.205,0.0,333.2,22.0,2067,4,516.75,157
7,5550.79,14.5625,0.032653,267.1,22.0,3016,11,274.181818,171
8,5487.57,16.038,0.010549,368.1,1.0,2573,19,135.421053,286
9,5461.62,45.063,0.0,358.1,17.0,1830,10,183.0,62


As we can see, the ranges of the values vary greatly, hence the need to scale.

### Scaling

This section will scale the dataframe using the logarithmic method.

In [11]:
df_scaled = modeling.log_scale_dataset(df_model)

utils.quick_summary(df_scaled,
                    "Model input dataframe with scaling",
                    row_num=10,
                    show_summary=False)



MODEL INPUT DATAFRAME WITH SCALING
-----------------------------------


Number of rows: 4026 	 Number of Columns: 9


Unnamed: 0,total_spend,median,cancel_rate,lifetime,time_inactive,quantity,orders,quant_rate,unq_products
0,8.640782,2.42979,0.003306,5.923988,0.741937,7.85632,3.637586,4.259248,5.220356
1,8.638359,2.58799,0.013423,5.883044,1.064711,7.965546,3.258097,4.754969,4.787492
2,8.629945,2.401525,0.003123,5.913503,2.406945,7.964851,3.295837,4.715403,6.052089
3,8.629768,3.511635,0.00939,5.621125,1.131402,7.889834,2.302585,5.695601,5.308268
4,8.629067,3.261974,0.0,5.902907,3.144152,8.098947,2.397895,5.799093,5.01728
5,8.628106,2.556219,0.0,5.921042,0.641854,8.106816,3.433987,4.714323,5.620401
6,8.625034,4.011054,0.0,5.81174,3.135494,7.634337,1.609438,6.249493,5.062595
7,8.621876,2.744864,0.032131,5.59136,3.135494,8.012018,2.484907,5.617432,5.147494
8,8.610423,2.835446,0.010493,5.911068,0.693147,7.853216,2.995732,4.915746,5.659482
9,8.605684,3.83001,0.0,5.883601,2.890372,7.512618,2.397895,5.214936,4.143135


We can immediately see the difference in the values which will help the model converge faster.

### Run the elbow method

The next step is to use this dataset to run the elbow method. As explained above this will help us gauge the number of clusters we need.

In [15]:
fig, df_elbow = modeling.run_elbow_method(df=df_scaled,
                                          max_clusters=20,
                                          iter_num=1000)
iplot(fig)

100%|████████████████████████████████████████████████████████████████████████████████| 20/20 [00:06<00:00,  3.02it/s]


**Comments**

The above chart indicates that there is a sharp drop in the rate of change of the curve between 2-5 clusters. We will go with 4 clusters, as 2 is too few and 5 seems too many. Once we analyze each cluster we might find that two clusters are similar enough that we can bundle them together.
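
Reading the elbow by eye can be subjective, so one rough way to back it up is to look at the percentage drop in inertia for each extra cluster. The sketch below uses the same column names as `df_elbow` (including the `intertia` spelling used in the code), but the numbers are hypothetical and are not the values from the run above.

```python
import pandas as pd

# Hypothetical inertia values, NOT the ones produced by the run above
df_demo = pd.DataFrame({
    "cluster_num": [1, 2, 3, 4, 5, 6],
    "intertia": [1000.0, 600.0, 420.0, 330.0, 300.0, 280.0],
})

# Percentage of the previous inertia removed by adding one more cluster
df_demo["pct_drop"] = -df_demo["intertia"].pct_change() * 100
print(df_demo.round(1))
```

In this made-up example the gain falls below 10% after 4 clusters, which is the kind of flattening we are looking for in the chart.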

### Run KMeans

The next step is to run the KMeans algorithm using the chosen number of clusters.

In [17]:
# Run the KMeans algorithm using the 
# number of clusters chosen
num_of_clusters = 4

model, results = modeling.run_kmeans(df=df_scaled,
                           cluster_num = num_of_clusters,
                           fit_only=False,
                           iter_num=1000)

In [18]:
# Assign the clusters to the scaled and 
# non-scaled dataset
df_model['cluster'] = results
df_scaled['cluster'] = results

## Cluster Analysis

This section will look at the results of the KMeans algorithm and understand the customer properties of each cluster. Below we iterate through all columns we used and plot the histogram of each cluster. For ease of reading we will use the scaled dataset.

### Cluster Breakdown

In [19]:
for col in cols_to_use:
    
    fig = visualize.make_histogram(df_scaled.sort_values(by="cluster"), col_name=col, color="cluster")
    fig.update_layout(height=450)
    iplot(fig)

**Comments**

The above charts show the following:

1. With 4 clusters there is significant overlap between them, making it very difficult to come up with logical explanations of what they might mean.


2. Columns "median" and "cancel_rate" are not giving us much information. This is mainly because a large portion of our customers have "zero" or very low values for those two. These are customers that have shopped recently and have never canceled an order. Due to the strong dominance of these low values they are probably "overshadowing" other features.


3. Quantity rate doesn't show much difference between clusters. This is probably because overall all customers tend to make fairly large purchases as we saw in the EDA sections.


4. Order number, being a discrete variable, doesn't show well in a histogram. We need to use a bar chart for that.


Based on the above we will rerun the KMeans algorithm using only 3 clusters and excluding columns "cancel_rate", "median" and "quant_rate".

### Improving our Results

As discussed above we are rerunning our algorithm using simpler settings. For simplicity I have combined all steps into one cell. 

In [21]:
# We keep focusing on the individual customers only
df_model = df_ind.copy()

# We define the columns we want to use
cols_to_use = ['total_spend',
               'lifetime',
               'time_inactive',
               'quantity',
               'unq_products']

# Filter for those columns
df_model = df_model[cols_to_use]

df_scaled = modeling.log_scale_dataset(df_model)


# Run the KMeans algorithm using the 
# number of clusters chosen
num_of_clusters = 3

model, results = modeling.run_kmeans(df=df_scaled,
                           cluster_num = num_of_clusters,
                           fit_only=False,
                           iter_num=1000)

# Assign the clusters to the scaled and 
# non-scaled dataset
df_model['cluster'] = results
df_scaled['cluster'] = results

In [23]:
for col in cols_to_use:
    
    if col == "orders":
        
        df_count = df_model.groupby(["cluster", "orders"]).cluster.count().reset_index(name="order_count")
        df_count['cluster'] = df_count.cluster.astype(str)
        
        fig = px.bar(df_count.sort_values(by="cluster"), x=col, y="order_count", color="cluster", template="ggplot2")
        
        fig.update_traces(
            textposition="outside",
            marker_line_color="rgb(45, 46, 45)",
            marker_line_width=1,
            opacity=0.9,
        )
        
    else:
        fig = visualize.make_histogram(df_model.sort_values(by="cluster"), col_name=col, color="cluster")
    fig.update_layout(height=450)
    iplot(fig)

**Comments**

After rerunning the algorithm with the revised settings, the above charts show the following:

*cluster 0:* These are customers that made their first purchase more than 3 months ago (100+ days) and have been inactive for a while. They have spent only an average amount of money in their lifetime and have bought only a small number of unique items (5-20). These are likely to be either seasonal customers or customers that shop very infrequently. This group could also include customers that are at risk of not coming back. We will call these **Occasional Shoppers**.

*cluster 1:* At first, Cluster 1 might seem very similar to the *Occasional Shoppers*; however, there is a fundamental difference. These customers made their first purchase (lifetime) only recently. This means they are new customers, hence the low values on other metrics. We will call these the **Newcomers**.


*cluster 2:* These customers have also been on our website for a while but are highly active, have spent the most money and have bought the most items (when compared to other groups). These are clearly our regular customers, who shop frequently and buy a lot. We will call these **Regular Shoppers**.

### Executive View and Marketing Strategy

We have successfully clustered our customers, but so what? How can we take action on this? The first thing we need to understand is that although the charts above are very informative, no executive or marketing director will want to see them. We need to come up with an elegant way to summarize our findings.

This will inevitably lose some of the information but will keep the same message. An easy to use and very dynamic chart is Plotly's [radial chart](https://plotly.com/python/radar-chart/). It is very intuitive for any audience and can effectively visualize up to 6-7 dimensions. Before we do anything we will rename our clusters according to the comments above.

In [37]:
# Define a dictionary with the category
# names we decided and update it
cluster_dict = {0 : "Occasional Shoppers",
                1 : "The Newcommers",
                2 : "Our Regulars"}

df_model['cluster_name'] = df_model['cluster'].apply(lambda x: cluster_dict[x])
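
One thing to keep in mind is that the label numbers KMeans assigns (0, 1, 2) are arbitrary and can change between runs unless the seed is fixed, so a hard-coded dictionary may attach a name to the wrong cluster after a rerun. A possible way around this, sketched below on synthetic data, is to rank the clusters by one of their properties and name them from that ranking. The data, the `random_state=0` setting and the names used here are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic stand-in for scaled features: three well-separated groups
df_toy = pd.DataFrame({
    "total_spend": np.concatenate([rng.normal(m, 0.2, 100) for m in (3, 5, 8)]),
    "lifetime": np.concatenate([rng.normal(m, 0.2, 100) for m in (5, 2, 6)]),
})

model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(df_toy)
df_toy["cluster"] = model.labels_

# Rank the clusters by median spend so that the name mapping
# does not depend on which label number KMeans happened to assign
order = df_toy.groupby("cluster")["total_spend"].median().sort_values().index
names = ["low spend", "mid spend", "high spend"]  # hypothetical names
name_map = dict(zip(order, names))
df_toy["cluster_name"] = df_toy["cluster"].map(name_map)

print(df_toy.groupby("cluster_name")[["total_spend", "lifetime"]].median())
```

The printed medians also double as a compact numeric profile of each cluster, which is a useful check before settling on the names.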

#### Cluster Proportions

Before we get an understanding of what our customers' properties are, we can look at the proportion of each cluster. Although pie charts are not popular in data science, they are perfect for simple proportion tasks with fewer than 5 groups. They are also really "familiar" and intuitive for most audiences.

In [39]:
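# Share of customers in each cluster. Naming the index and the values
# explicitly keeps the column names the same across pandas versions.
df_prop = (df_model["cluster_name"]
           .value_counts(normalize=True)
           .rename_axis("cluster_name")
           .reset_index(name="proportion"))

fig = px.pie(df_prop,
       names="cluster_name",
       values="proportion",
       title="Customer Cluster Proportion",
       template="ggplot2")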

colors = ['#8ab0ed', '#ebd663', '#eb7363']

fig.update_traces(hoverinfo='label+percent', textfont_size=15,
                  marker=dict(colors=colors, line=dict(color='#000000', width=2)))

iplot(fig)

**Comments**

The above chart shows that our Regulars are the largest group, at roughly 42% of customers (the remainder after the 22.7% of Newcomers and the 34.9% of Occasional Shoppers). This is always a good thing for an E-commerce business. In terms of long term strategy, the company can focus on converting the 22.7% of Newcomers into regulars, or on reducing the 34.9% of Occasional Shoppers by converting them into regulars. In the next section we discuss how we might go about doing something like that.

#### Building a radial view

The first thing we need to do for a radial chart is narrow down all values into a single metric per cluster. We can test different metrics, but a simple one to use is the median. Using the median ensures that we are not affected too much by the extreme values. It also helps summarize the general idea of each cluster.

In [40]:
# We group by each cluster name and divide by the maximum value
# to ensure we have values between 0 and 1. This will make the
# chart much more readable.
df_radial = df_model.sort_values(by="cluster").drop("cluster", axis=1).copy()
df_radial = (df_radial.groupby("cluster_name").median() / df_radial.groupby("cluster_name").median().max()).reset_index()
df_radial = df_radial.melt(id_vars="cluster_name")

utils.quick_summary(df_radial, "Dataframe summarising all clusters", show_summary=False)



DATAFRAME SUMMARISING ALL CLUSTERS
-----------------------------------


Number of rows: 15 	 Number of Columns: 3


Unnamed: 0,cluster_name,variable,value
0,Occasional Shoppers,total_spend,0.203693
1,Our Regulars,total_spend,1.0
2,The Newcommers,total_spend,0.216849
3,Occasional Shoppers,lifetime,0.922778
4,Our Regulars,lifetime,1.0


We are now ready to build the radial chart. Again, plotly makes it really easy for us.

In [52]:
fig = px.line_polar(df_radial.sort_values(by="cluster_name", ascending=True),
              r='value',
              theta='variable',
              color="cluster_name",
              template="plotly_dark",
              title="Customer Cluster Properties",
              line_close=True)

fig.update_traces(fill='toself', opacity=.7)
fig.update_layout(legend= dict(title=None,     
                               orientation="h",
                               yanchor="bottom",
                               y=-0.17,
                               xanchor="right",
                               x=.78))
iplot(fig)

**Comments**

The above chart shows the same properties we talked about for each cluster but in a nice and elegant summary chart. This is the kind of chart you can include in presentations and summary reports. 

### Marketing Strategies

Now that we have clustered our customers, how can we use them to improve our website and the organisation's performance? Here are a few recommendations:

1. Now that we clearly know who our **Regular customers** are, we should ensure that we don't lose this group. We know they like variety (lots of unique products) and are willing to spend a lot on every order. That means they could benefit from various offers such as "Spend £50 more and get 10% off", or we can offer them a loyalty scheme which will allow them to earn rewards as they make more purchases.


2. The **Occasional Shoppers** are a tricky group of customers that we need to understand a bit more before we can take action. We know that they have been inactive for a while, so a simple "Here is 30% off for your return" offer or something like "Shop within the next 2 weeks to get a bonus" could get them to come back.


3. Finally, we have the **Newcomers**. These are our new customers, who will eventually end up in one of the two groups above. Our goal should be to convert as many as we can into Regular customers. One thing we can do is keep them interested by sending them offers on products that they will potentially like. This will ensure they get a view of the full range of products we have. We can also create a referral system where they can invite their friends and get rewards. It is likely that when someone buys something from a website for the first time they will want to tell their friends about it.

## Customer Segmentation Conclusions

In this notebook we looked at the KMeans algorithm and how it can be used to create some powerful insights by clustering our customers into 3 distinct groups. Overall:

- We started by using the Elbow method to determine the right number of clusters to create. We had a go at using 4 clusters and a lot of our columns, but that proved to be counterproductive as some columns were much more dominant than others, failing to create a clean split. We also felt that 4 clusters were a bit too many, so we went down to 3.


- On the second run we got a much clearer allocation, using the settings above. We ended up with three types of customer groups, the occasional shoppers, our regulars and the newcomers. We discovered how we can summarize their properties in an intuitive and easy to follow radial chart.


- Finally, we made some recommendations based on the information we obtained about our customers.

The next step of this project is to understand the frequency or time between orders for the average customer. This will help us pace our customers and keep track of customers that are at risk of leaving. We look at this in Notebook 4.