# 5. Customer Segmentation 
The vast majority of my articles are written with the intention of highlighting some mathematical or computer science concept and tying it to a real-world example. In other words, we operate as follows:

$$\text{Determine concept to teach} \longrightarrow \text{Find problem to demonstrate concept}$$

For instance, in the case of Bayesian classifiers I _decided_ that I wanted to cover that particular concept and all of its inner workings, and I then found problems (that I encountered working as a data scientist) to demonstrate how a Bayes classifier would work in the real world.

However, this is not how things work in the real world! TODO: Mention Taleb's ideas about trying to fit the real world into a Platonic framework (when it may not really be that way). In reality, a data scientist is often given a vague problem to solve using data. If lucky, they will be given a standalone data set (e.g. a csv), but often they may not even have that, leaving them to wrangle the necessary data from databases, APIs, and other sources.

I want to take some time to write an article that demonstrates how to handle being confronted with a data set and simply being tasked to "explore" it and find something useful (known as **data mining**). This post will specifically be based on the exploration of the [online retail data set](http://archive.ics.uci.edu/ml/datasets/Online+Retail). 

### 5.1 Background and Context
Assume that you are a data scientist for an online retail store. Management comes to you saying that they have a set of customer transaction data covering roughly one year. They want you to simply explore and see if there is anything interesting: are there **patterns** in the data that could be relevant to marketing, **trends** that could be useful in customer prediction, and so on? We are starting with a _very_ blank canvas; where do we begin?

Most data scientists follow the same general exploration framework. It almost always starts with **data preparation** and **basic exploration**. Data preparation consists of _gathering_ relevant data into a single location. For example, if you are an energy utility whose customers' billing and energy usage data are held in different databases, this phase may consist of querying the different databases and exporting the desired tables (potentially for specific time intervals) to a specific format (csv). Once the relevant data is in a desirable format, the next phase is basic exploration, which consists of things such as:
* Determine size and shape of data set
* Inspect variables of interest
* Gather metrics about columns
* Potentially perform different groupings
* Feature generation 
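
The steps in the list above can be sketched on a tiny, made-up dataframe. This is only an illustration: the `toy` frame, its column names, and the derived `line_total` feature are hypothetical and are not part of the retail data set used later.

```python
import pandas

# Hypothetical toy data standing in for a real export; names are illustrative only.
toy = pandas.DataFrame({
    "customer": ["a", "a", "b", None],
    "quantity": [2, 5, 1, 3],
    "unit_price": [1.50, 0.75, 10.00, 2.00],
})

# Determine size and shape of the data set
shape = toy.shape

# Inspect variables of interest and gather metrics about columns
dtypes = toy.dtypes
null_counts = toy.isnull().sum()
summary = toy.describe()

# Feature generation: derive a new column from existing ones
toy["line_total"] = toy["quantity"] * toy["unit_price"]

# Grouping: total spend per customer (rows with a null key are dropped by default)
spend_per_customer = toy.groupby("customer")["line_total"].sum()

print(shape)
print(null_counts)
print(spend_per_customer)
```

Each of these calls reappears, in some form, on the real data in the sections that follow.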

For this post we are starting at the tail end of the data preparation phase (thankfully a csv has already been created for us). We will be using the Python scientific computing stack: pandas, numpy, scipy, matplotlib, etc. So, without further ado, let's begin!

### 5.2 Data Preparation
To start, we can load our necessary libraries, and then load our data set:

In [2]:
import numpy as np
import pandas
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib import rc, animation
from IPython.display import display, HTML
from mpl_toolkits.mplot3d import Axes3D
from matplotlib.ticker import LinearLocator, FormatStrFormatter

from _plotly_future_ import v4_subplots
import cufflinks
import plotly.plotly as py
import plotly
import plotly.graph_objs as go
from plotly.offline import iplot
import plotly.figure_factory as ff

from util import get_csv_from_s3, get_obj_s3

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

sns.set(style="white", palette="husl")
sns.set_context("talk")
sns.set_style("ticks")

In [5]:
df_obj = get_obj_s3("data_customer_segmentation.csv")
df = pandas.read_csv(
    df_obj, 
    encoding="ISO-8859-1",
    dtype={
        "CustomerID": str,
        "InvoiceNo": str,
    }
)
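
As a small aside on the `dtype` argument used above, the sketch below shows what forcing identifier columns to `str` buys us. It reads a hypothetical two-row CSV from memory (the values are invented for illustration) rather than the real file, so it does not depend on the `get_obj_s3` helper.

```python
import io
import pandas

# Hypothetical two-row CSV; the second row has no CustomerID.
text = "InvoiceNo,CustomerID,Quantity\n536365,17850,6\nC536379,,-1\n"

# Default parsing: a numeric column with a missing value becomes float64
default = pandas.read_csv(io.StringIO(text))

# Forcing str keeps identifiers as text, so "17850" does not turn into 17850.0
as_text = pandas.read_csv(
    io.StringIO(text),
    dtype={"InvoiceNo": str, "CustomerID": str},
)

print(default["CustomerID"].tolist())
print(as_text["CustomerID"].tolist())
```

Identifiers are labels rather than quantities, so keeping them as text avoids accidental arithmetic and float formatting on them.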

A good rule of thumb when just getting started is to simply utilize `head()` to get a glimpse of the data set:



In [38]:
display(df.head())

| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 12/1/2010 8:26 | 2.55 | 17850 | United Kingdom |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 12/1/2010 8:26 | 3.39 | 17850 | United Kingdom |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 12/1/2010 8:26 | 2.75 | 17850 | United Kingdom |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 12/1/2010 8:26 | 3.39 | 17850 | United Kingdom |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 12/1/2010 8:26 | 3.39 | 17850 | United Kingdom |


We can then determine the shape of the data set and get an idea of the columns present (including their data types and number of missing values):

In [11]:
display(f"Dataframe Shape: {df.shape}")

'Dataframe Shape: (541909, 8)'

In [39]:
def df_info(df):
    """Summarize each column's dtype plus its count and percentage of null values."""
    col_types = pandas.DataFrame(df.dtypes).T.rename(index={0: "Column Type"})
    null_count = pandas.DataFrame(df.isnull().sum()).T.rename(index={0: "# Null values"})
    null_percent = pandas.DataFrame(
        (df.isnull().sum() * 100) / df.shape[0]
    ).T.rename(index={0: "% Null values"})

    # DataFrame.append was removed in pandas 2.0; concat stacks the rows instead
    return pandas.concat([col_types, null_count, null_percent])

display(df_info(df))

| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|
| Column Type | object | object | object | int64 | object | float64 | object | object |
| # Null values | 0 | 0 | 1454 | 0 | 0 | 0 | 135080 | 0 |
| % Null values | 0 | 0 | 0.268311 | 0 | 0 | 0 | 24.9267 | 0 |


We immediately get an idea of the variables available to us for analysis, as well as where we may have particular gaps (null values). Right from the get-go we see that `~25%` of rows do not have an associated `CustomerID`. Based on the data that is readily available, it is impossible to impute these values. Because the vast majority of inferences that we will try to make tie back to a particular customer, these rows can be dropped:

In [43]:
df = df.dropna(axis=0, subset=["CustomerID"])

display(f"Dataframe shape: {df.shape}")
display(df_info(df))

'Dataframe shape: (406829, 8)'

| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|
| Column Type | object | object | object | int64 | object | float64 | object | object |
| # Null values | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| % Null values | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
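
To make the role of `subset` in the `dropna` call concrete, here is a minimal sketch on invented rows. Only the column names mirror the real data set; the values and the `dropped_share` variable are hypothetical.

```python
import pandas

# Hypothetical rows; only the column names mirror the real data set.
toy = pandas.DataFrame({
    "InvoiceNo": ["1", "2", "3", "4"],
    "Description": ["lamp", None, "mug", "clock"],
    "CustomerID": ["17850", "17850", None, "13047"],
})

# subset restricts the null check to CustomerID, so invoice "2"
# (null Description, valid CustomerID) is kept
kept = toy.dropna(axis=0, subset=["CustomerID"])

# Share of rows lost, handy for sanity-checking the drop
dropped_share = 1 - len(kept) / len(toy)

print(kept)
print(f"Dropped {dropped_share:.0%} of rows")
```

Without `subset`, `dropna` would discard any row containing a null in any column, which is usually more aggressive than intended.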


We also want to check for duplicate entries:

In [48]:
display(f"Number of duplicate entries: {df.duplicated().sum()}")

'Number of duplicate entries: 5225'

And drop them:

In [49]:
df = df.drop_duplicates()
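
For reference, this is roughly how `duplicated()` and `drop_duplicates()` behave on a small made-up frame (the rows are invented for illustration). A row counts as a duplicate only when every column matches an earlier row.

```python
import pandas

# Hypothetical rows: the second row repeats the first exactly.
toy = pandas.DataFrame({
    "InvoiceNo": ["1", "1", "1", "2"],
    "StockCode": ["A", "A", "B", "A"],
    "Quantity": [6, 6, 6, 6],
})

# True only where an identical row has already appeared above
mask = toy.duplicated()
n_duplicates = int(mask.sum())

# Keeps the first occurrence of each distinct row
deduped = toy.drop_duplicates()

print(mask.tolist())
print(n_duplicates, len(deduped))
```

Note that the last row is not a duplicate: it shares a stock code and quantity with the first row but belongs to a different invoice.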

We have now accounted for null values and duplicate entries, meaning we are ready for basic data exploration.

### 5.3 Data Exploration
Our dataframe contains 8 different features, defined as:

* `InvoiceNo`: Invoice number. Nominal, a 6-digit integer number uniquely assigned to each transaction. If this code starts with the letter `C`, it indicates a cancellation.
* `StockCode`: Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product.
* `Description`: Product (item) name. Nominal.
* `Quantity`: The quantities of each product (item) per transaction. Numeric.
* `InvoiceDate`: Invoice Date and time. Numeric, the day and time when each transaction was generated.
* `UnitPrice`: Unit price. Numeric, Product price per unit in sterling.
* `CustomerID`: Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer.
* `Country`: Country name. Nominal, the name of the country where each customer resides.
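
The cancellation convention on `InvoiceNo` described in the list above can be turned into a boolean flag. The sketch below uses invented rows, and `is_cancellation` is a hypothetical helper column rather than something present in the data set.

```python
import pandas

# Hypothetical invoices; one carries the "C" prefix.
toy = pandas.DataFrame({
    "InvoiceNo": ["536365", "C536379", "536366"],
    "Quantity": [6, -1, 2],
})

# is_cancellation is a hypothetical helper column, not part of the data set
toy["is_cancellation"] = toy["InvoiceNo"].str.startswith("C")

# Number of distinct cancelled invoices
n_cancelled = toy.loc[toy["is_cancellation"], "InvoiceNo"].nunique()

print(toy)
print(n_cancelled)
```

This only works if `InvoiceNo` was read as a string; on a numeric column the `.str` accessor is not available.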

Let's start by exploring the countries from which the orders were made. It would be useful to know how many different countries orders have originated from.

In [83]:
unique_countries = df["Country"].nunique()

display(f"Number of countries from which transactions occurred: {unique_countries}")

'Number of countries from which transactions occurred: 37'

We may also want to know the _number of orders per country_. To do this, we need to account for the fact that an order can have many items in it; this means that a single order may contain multiple rows. We can see this below:

In [81]:
display(df[df.InvoiceNo == "536365"])

| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 12/1/2010 8:26 | 2.55 | 17850 | United Kingdom |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 12/1/2010 8:26 | 3.39 | 17850 | United Kingdom |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 12/1/2010 8:26 | 2.75 | 17850 | United Kingdom |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 12/1/2010 8:26 | 3.39 | 17850 | United Kingdom |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 12/1/2010 8:26 | 3.39 | 17850 | United Kingdom |
| 5 | 536365 | 22752 | SET 7 BABUSHKA NESTING BOXES | 2 | 12/1/2010 8:26 | 7.65 | 17850 | United Kingdom |
| 6 | 536365 | 21730 | GLASS STAR FROSTED T-LIGHT HOLDER | 6 | 12/1/2010 8:26 | 4.25 | 17850 | United Kingdom |


We can see above that a single order, `InvoiceNo = 536365`, contained 7 different items, and hence occupies 7 rows in our dataframe. However, we just want to count the number of orders per country, so those rows should be counted as a _single_ order.

In [112]:
countries = df[
    ["InvoiceNo", "Country"]
].groupby(
    ["InvoiceNo", "Country"]
).count().reset_index()["Country"].value_counts()
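
The idea of collapsing line items to one entry per order before counting can also be written in two shorter ways. The sketch below is an alternative phrasing on invented rows, under the assumption that every invoice belongs to exactly one country; it is not the cell used above.

```python
import pandas

# Hypothetical line items: invoice "1" spans three rows.
toy = pandas.DataFrame({
    "InvoiceNo": ["1", "1", "1", "2", "3"],
    "Country": [
        "United Kingdom", "United Kingdom", "United Kingdom",
        "France", "United Kingdom",
    ],
})

# Route 1: keep one row per (invoice, country) pair, then count rows per country
via_dedup = (
    toy[["InvoiceNo", "Country"]]
    .drop_duplicates()["Country"]
    .value_counts()
)

# Route 2: count distinct invoice numbers within each country
via_nunique = toy.groupby("Country")["InvoiceNo"].nunique()

print(via_dedup)
print(via_nunique)
```

Both routes report two orders for the United Kingdom, even though it accounts for four of the five rows.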

We can visualize these results nicely with a choropleth map:

In [128]:
data = dict(
    type="choropleth",
    locations=countries.index,
    locationmode="country names", 
    z=countries,
    text=countries.index, 
    colorbar={"title": "Order #"},
    colorscale=[
        [0, 'rgb(224,255,255)'],
        [0.01, 'rgb(166,206,227)'], 
        [0.02, 'rgb(31,120,180)'],
        [0.03, 'rgb(178,223,138)'], 
        [0.05, 'rgb(51,160,44)'],
        [0.10, 'rgb(251,154,153)'], 
        [0.20, 'rgb(255,255,0)'],
        [1, 'rgb(227,26,28)']
    ],    
    reversescale=False
)

layout = dict(
    title="Number of orders per country",
    width=800,
    height=500,
)

choromap = go.Figure(data = [data], layout = layout)
iplot(choromap, validate=False)

We can see that the data is heavily dominated by orders placed from the UK. Another thing that we may want to explore further is the number of customers and the number of products. 

In [144]:
temp = pandas.DataFrame(
    [{
        "Products": df.StockCode.nunique(),
        "Transactions": df.InvoiceNo.nunique(),
        "Customers": df.CustomerID.nunique()
    }],
    index=["Quantity"]
)

display(temp)

| | Customers | Products | Transactions |
|---|---|---|---|
| Quantity | 4372 | 3684 | 22190 |


We can see that our data set consists of 4,372 `customers` who have bought 3,684 different `products` over the course of 22,190 total `transactions`. Let's try to get an idea of the distribution of the number of products purchased per transaction:

In [156]:
temp = df.groupby(["CustomerID", "InvoiceNo"], as_index=False)["StockCode"].count()
temp = temp.rename(
    columns={"StockCode": "Number of products"}
).sort_values(
    "Number of products", 
    ascending=False
)

display(temp[:5])
display(temp[-5:])

| | CustomerID | InvoiceNo | Number of products |
|---|---|---|---|
| 6810 | 14096 | 576339 | 542 |
| 6812 | 14096 | 579196 | 533 |
| 6813 | 14096 | 580727 | 529 |
| 6811 | 14096 | 578270 | 442 |
| 6808 | 14096 | 573576 | 435 |


| | CustomerID | InvoiceNo | Number of products |
|---|---|---|---|
| 13192 | 15738 | 553847 | 1 |
| 6996 | 14145 | 559110 | 1 |
| 13200 | 15738 | C538066 | 1 |
| 13201 | 15738 | C543637 | 1 |
| 0 | 12346 | 541431 | 1 |


It looks like at one end of the spectrum we have orders made up of more than 500 products (many coming from `CustomerID = 14096`), and at the other we have orders that contain a single product. A histogram will help us visualize this distribution:

In [163]:
trace1 = go.Histogram(
    x=temp["Number of products"],
    nbinsx=200,
    name="Number of products",
    marker_color='blue',
)

data = [trace1]

layout = go.Layout(
    width=750,
    height=450,
    title="Distribution of products purchased per order",
    xaxis=dict(title="Number of products"),
    yaxis=dict(title="Number of Orders"),
    barmode='stack'
)

fig = go.Figure(data=data, layout=layout)
fig.update_traces(opacity=0.75)

plotly.offline.iplot(fig)

This is easily identified as a long-tail distribution: we have a few big spenders (customers who buy a large number of products in each order), as well as customers who buy only a small number of products. Note that a prefix of `C` on an invoice number means that it was a cancellation (we will look into that shortly).
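
A long tail can also be checked numerically by comparing the mean with the median and an upper quantile. The sketch below uses invented per-order counts that are skewed on purpose; the numbers are not taken from the retail data.

```python
import pandas

# Hypothetical per-order product counts, skewed on purpose.
counts = pandas.Series(
    [1, 1, 2, 2, 3, 4, 5, 8, 40, 500],
    name="Number of products",
)

median = counts.median()
mean = counts.mean()
p90 = counts.quantile(0.90)

# In a long-tailed sample a few huge orders pull the mean far above the median
print(f"median={median}, mean={mean:.1f}, 90th percentile={p90:.1f}")
```

When the mean sits far above the median, a handful of very large orders is dominating the average, which is the pattern the histogram suggests.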

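Another thing worth understanding is how often each customer buys. As a proxy, we count the distinct `InvoiceDate` values per customer; note that `InvoiceDate` includes the time of day, so this counts distinct timestamps rather than calendar days. We can take a look at that distribution below: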

In [192]:
num_unique_days_with_purchase_per_customer = df.groupby(["CustomerID"])["InvoiceDate"].nunique()

In [197]:
trace1 = go.Histogram(
    x=num_unique_days_with_purchase_per_customer.values,
    nbinsx=200,
    name="Number of unique days with purchase made (per customer)",
    marker_color='blue',
)

data = [trace1]

layout = go.Layout(
    width=750,
    height=450,
    title="Distribution of Number of unique days with purchase made (per customer)",
    xaxis=dict(title="Number of unique days with purchase made (per customer)"),
    yaxis=dict(title="Number of Customers"),
    barmode='stack'
)

fig = go.Figure(data=data, layout=layout)
fig.update_traces(opacity=0.75)

plotly.offline.iplot(fig)
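
Because `InvoiceDate` is stored as a string that includes the time, counting its distinct values counts timestamps, not days. A minimal sketch of counting calendar days instead is shown below on invented rows. It assumes the month/day/year format seen in the `head()` output, and `InvoiceDay` is a hypothetical helper column.

```python
import pandas

# Hypothetical rows: customer 17850 buys twice on one day and once the next day.
toy = pandas.DataFrame({
    "CustomerID": ["17850", "17850", "17850", "13047"],
    "InvoiceDate": [
        "12/1/2010 8:26", "12/1/2010 9:02", "12/2/2010 8:26", "12/1/2010 8:34",
    ],
})

# Counting the raw strings counts distinct timestamps (minute resolution)
unique_timestamps = toy.groupby("CustomerID")["InvoiceDate"].nunique()

# Parse (month/day/year is an assumption), truncate to midnight, count distinct days
parsed = pandas.to_datetime(toy["InvoiceDate"], format="%m/%d/%Y %H:%M")
toy["InvoiceDay"] = parsed.dt.normalize()
unique_days = toy.groupby("CustomerID")["InvoiceDay"].nunique()

print(unique_timestamps)
print(unique_days)
```

For customer `17850` the two counts differ (three timestamps, two days), which is the gap between the variable name in the cell above and what it actually measures.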

In [201]:
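# Leaving off: two related but different counts.
# Line items per order: one row per (customer, invoice), counting the rows on that invoice.
temp = df.groupby(["CustomerID", "InvoiceNo"], as_index=False)["StockCode"].count()

# Distinct products per customer: unique stock codes across all of a customer's invoices.
# I am tempted to think that this one is what we want, but it needs further investigation.
df.groupby("CustomerID")["StockCode"].nunique().sort_values()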

CustomerID
12346       1
15668       1
17616       1
15753       1
15802       1
12823       1
14119       1
18113       1
14679       1
14090       1
14682       1
15118       1
12875       1
13452       1
15100       1
14705       1
16061       1
18017       1
16078       1
16093       1
17448       1
15070       1
17443       1
12943       1
16738       1
16138       1
16144       1
18141       1
16737       1
15657       1
         ... 
15555     404
16686     406
17920     407
15039     408
15529     415
17337     421
18118     424
17611     440
15547     441
12415     444
16549     445
16931     448
13263     450
17511     467
14159     501
13081     511
16033     539
14505     542
15311     571
14456     580
13089     636
14646     703
14156     716
14769     718
14606     832
14298     884
14096    1121
17841    1331
12748    1769
14911    1794
Name: StockCode, Length: 4372, dtype: int64
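
The difference between the two groupings in the last cell is easier to see on a handful of invented rows. This sketch is illustrative only; the `toy` frame is hypothetical.

```python
import pandas

# Hypothetical purchases: customer "1" buys product "A" on two separate invoices.
toy = pandas.DataFrame({
    "CustomerID": ["1", "1", "1", "1", "2"],
    "InvoiceNo": ["10", "10", "11", "11", "12"],
    "StockCode": ["A", "B", "A", "C", "A"],
})

# Distinct products a customer ever bought, across all invoices
products_per_customer = toy.groupby("CustomerID")["StockCode"].nunique()

# Rows (line items) on each invoice; a repeated stock code counts every time
lines_per_invoice = toy.groupby(
    ["CustomerID", "InvoiceNo"], as_index=False
)["StockCode"].count()

print(products_per_customer)
print(lines_per_invoice)
```

Customer `1` has three distinct products overall but two line items on each invoice, so the first grouping describes customers while the second describes individual orders.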