
# Table of Contents

* [Marketing Goals](#marketing-goals)
* [RFM Clustering](#rfm-clustering)
  - [Recency](#recency)
  - [Frequency](#frequency)
  - [Monetary Value](#monetary-value)
* [RFM Normalization](#rfm-normalization)
* [Customers Segmentation](#customers-segmentation)
* [Graphic Representations](#graphic-representations)
* [Prepare for Production](#production)

<a id="marketing-goals"></a>
## Marketing Goals

**RFM** stands for **Recency - Frequency - Monetary Value**. 

In theory, we expect segments like the following:

- Low Value: Customers who are less active than others, who rarely buy or visit, and who generate very low, zero, or even negative revenue.

- Mid Value: Customers in the middle of everything. They use our platform often (but not as much as our High Value customers), buy fairly frequently, and generate moderate revenue.

- High Value: The group we don’t want to lose. High revenue, high frequency, and low inactivity.


Importing the relevant libraries

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

# install the plotting dependencies before importing them
!pip install plotly
!pip install -U kaleido

import math
import warnings
from time import process_time

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
from matplotlib import colors
%matplotlib inline
import seaborn as sns
import scipy.cluster.hierarchy as shc
import plotly.io as pio
import plotly.offline as pyoff
import plotly.graph_objs as go
import plotly.express as px
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score, adjusted_rand_score
from sklearn.cluster import (AgglomerativeClustering, KMeans, Birch,
                             SpectralClustering, DBSCAN, MiniBatchKMeans)
from yellowbrick.cluster import KElbowVisualizer, SilhouetteVisualizer
from yellowbrick.contrib.scatter import ScatterVisualizer

warnings.filterwarnings("ignore") # ignore the warnings about file size
pio.renderers.default = 'iframe'


# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

Importing each CSV file into different dataframes.

In [None]:
geolocation = pd.read_csv('../input/olist-clients-segmentation/geolocation_p4.csv',sep='\t', index_col=[0], low_memory=False)
df = pd.read_csv('../input/olist-clients-segmentation/database_p4.csv',sep='\t', index_col=[0], low_memory=False)
df
# important note: passing index_col=[0] avoids getting an extra "Unnamed: 0" column

Option to display all columns.

In [None]:
pd.set_option("display.max_columns", None)
df.head()

<a id="rfm-clustering"></a>
## RFM Clustering

We'll use a well-known marketing method for customer segmentation called RFM, for Recency, Frequency, Monetary Value.

It relies on creating clusters for each variable.
For the recency, we compute the number of days since each customer's last purchase.
For the frequency, we count the number of orders of each customer.
For the monetary value, we sum the prices of each customer's purchases.

Then, we use the Elbow method on each variable to find the best value of the hyperparameter K (the number of clusters) for our segmentation.
We get a value of K=4 most of the time, and we use KMeans clustering because it is the most commonly used method for RFM segmentation.


<a id="recency"></a>
### Recency

Converting the order date from string to datetime.

In [None]:
#convert the string date field to datetime

df['order_purchase_timestamp (string)'] = pd.to_datetime(df['order_purchase_timestamp'])

We build a customer dataframe keyed on *customer_unique_id*, which we will enrich with the purchase dates.

In [None]:
# create a dataframe holding the customer ids (duplicates are kept at this stage)

df_user = pd.DataFrame(df['customer_unique_id'])
df_user.columns = ['CustomerUniqueID']
df_user

Add the max purchase date by customer into a new dataframe

In [None]:
max_purchase = df.groupby('customer_unique_id')['order_purchase_timestamp (string)'].max().reset_index()
max_purchase.columns = ['CustomerUniqueID','MaxPurchaseDate']
max_purchase

Create a **Recency** column: for each customer, we subtract the date of their last purchase from the most recent purchase date in the whole dataset.
We express the result in days, so Recency is the number of days between a customer's last order and the latest order in the data.

In [None]:
#we take our observation point as the max order date in our dataset using the order_purchase_timestamp variable 

max_purchase['Recency'] = (max_purchase['MaxPurchaseDate'].max() - max_purchase['MaxPurchaseDate']).dt.days
max_purchase.sort_values(by=['Recency'])
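
To make the Recency computation concrete, here is a minimal sketch on a toy table. The column names mirror the notebook, but the `orders` dataframe and its values are made up for illustration.

```python
import pandas as pd

# Toy orders table; the values are invented for this example.
orders = pd.DataFrame({
    "customer_unique_id": ["a", "a", "b", "c"],
    "order_purchase_timestamp": pd.to_datetime(
        ["2018-01-01", "2018-03-01", "2018-02-15", "2018-03-11"]
    ),
})

# Last purchase date of each customer.
last_purchase = (
    orders.groupby("customer_unique_id")["order_purchase_timestamp"]
    .max()
    .reset_index(name="MaxPurchaseDate")
)

# Observation point: the most recent purchase in the whole table.
reference_date = last_purchase["MaxPurchaseDate"].max()
last_purchase["Recency"] = (reference_date - last_purchase["MaxPurchaseDate"]).dt.days
print(last_purchase)
```

Customer `c` placed the latest order, so its Recency is 0; the other customers get the number of days separating their last order from that date.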

We merge the Recency variable into *df_user*, the dataframe that will collect the results of our RFM clustering.

In [None]:
df_user = pd.merge(df_user, max_purchase[['CustomerUniqueID', 'MaxPurchaseDate', 'Recency']], on='CustomerUniqueID')
df_user

Plot a histogram of the Recency variable, expressed in days.

In [None]:
# Set notebook mode to work in offline

pyoff.init_notebook_mode()

#plot a recency histogram

plot_data = [go.Histogram(x=df_user['Recency'])]

plot_layout = go.Layout(title='Users Recency')

fig = go.Figure(data=plot_data, layout=plot_layout)

fig.update_xaxes(title="Days since last purchase")

fig.update_yaxes(title="Number of unique customers")

pyoff.iplot(fig)

fig.write_image("recency.png")

#### Define the best clustering method by using the elbow method 

We iterate over the values of k from 2 to 8 and use the Yellowbrick library to draw the elbow curve of the WCSS for each value of k in this range.
The optimal number of clusters is the point where the curve bends like an elbow.

Note:
WCSS stands for Within-Cluster Sum of Squares.
WCSS is the sum of the squared distances between each point and the centroid of its cluster.
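
As an illustration of this definition, the sketch below computes the WCSS by hand with NumPy. The helper name `wcss` and the toy points are hypothetical and not part of the notebook.

```python
import numpy as np

def wcss(points, labels):
    """Sum of squared distances between each point and its cluster centroid."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    total = 0.0
    for label in np.unique(labels):
        members = points[labels == label]
        centroid = members.mean(axis=0)
        total += ((members - centroid) ** 2).sum()
    return total

# Toy one-dimensional data split into two clusters (illustrative values).
toy_points = np.array([[1.0], [3.0], [10.0], [14.0]])
toy_labels = np.array([0, 0, 1, 1])
print(wcss(toy_points, toy_labels))
```

With two clusters the centroids are 2 and 12, so the WCSS is 2 + 8 = 10. Putting all four points in a single cluster gives a much larger value, which is the drop the elbow curve visualizes.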

Create a list of values from 2 to 8 to test the best number of clusters for our models.

In [None]:
n_clusters = [2,3,4,5,6,7,8]

We create a dataframe with the Recency values and draw a sample of n = 10000 rows on which to apply the elbow method.

In [None]:
df_recency = df_user[['Recency']].sample(n=10000)

The main goal of KMeans is to find groups in the data, where the number of groups is given by K, the number of clusters. It is an iterative procedure in which each data point is assigned to one of the K groups based on feature similarity. To choose K, we can use, for instance, the elbow method or the silhouette method.

We visualize the Elbow method for KMeans clustering method.

In [None]:
# Instantiate the clustering model and visualizer

kmeans_recency = KMeans(n_clusters = n_clusters)

recency_visualizer = KElbowVisualizer(kmeans_recency, k=n_clusters, size=(600, 600))

recency_visualizer.fit(df_recency)    # Fit the data to the visualizer
fig = recency_visualizer.poof()    # Draw/show/poof the data
#recency_visualizer.show(outpath="kelbow_kmeans.png")
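
For readers who want to see the numbers behind an elbow curve, here is a rough sketch that fits KMeans for each k and records `inertia_`, which scikit-learn defines as the sum of squared distances to the closest centroid. The synthetic `toy_recency` data and the `random_state` and `n_init` settings are assumptions made for this example.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic one-dimensional "recency" values around three centers (made-up data).
toy_recency = np.concatenate([
    rng.normal(30, 5, 200),
    rng.normal(200, 10, 200),
    rng.normal(400, 10, 200),
]).reshape(-1, 1)

inertias = {}
for k in range(2, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(toy_recency)
    inertias[k] = model.inertia_  # WCSS for this value of k

for k, value in inertias.items():
    print(k, round(value, 1))
```

On data like this, the inertia drops sharply up to the true number of groups and then flattens; the elbow is the k where that change of slope happens.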

**Agglomerative clustering** uses a bottom-up approach, wherein each data point starts in its own cluster. These clusters are then joined greedily, by taking the two most similar clusters together and merging them. We usually draw a dendrogram to visualize the best K number of clusters to choose.

We visualize the Elbow method for Agglomerative clustering method.

In [None]:
agg_recency = AgglomerativeClustering(n_clusters = n_clusters)
 
recency_visualizer = KElbowVisualizer(agg_recency, k=n_clusters,  size=(600, 600))

recency_visualizer.fit(df_recency)    # Fit the data to the visualizer
fig = recency_visualizer.poof()    # Draw/show/poof the data
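
The bottom-up merging can also be inspected directly with SciPy, which the notebook imports as `shc`. This is a minimal sketch on made-up values; it uses Ward linkage, which is also the default linkage of scikit-learn's `AgglomerativeClustering`.

```python
import numpy as np
import scipy.cluster.hierarchy as shc

# Five toy one-dimensional points (illustrative values).
toy = np.array([[1.0], [2.0], [10.0], [11.0], [50.0]])

# Each row of `merges` records one merge: the two clusters joined,
# the distance between them, and the size of the new cluster.
merges = shc.linkage(toy, method="ward")

# Cut the tree so that at most three clusters remain.
labels = shc.fcluster(merges, t=3, criterion="maxclust")
print(labels)

# shc.dendrogram(merges) would draw the corresponding tree with matplotlib.
```

The two close pairs end up in their own clusters and the isolated point stays alone, which is what a dendrogram of these points would show.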

**Birch** stands for Balanced Iterative Reducing and Clustering using Hierarchies. It's a clustering algorithm that can handle large datasets by first generating a small and compact summary of the large dataset that retains as much information as possible. This smaller summary is then clustered instead of the larger dataset.

We visualize the Elbow method for Birch clustering method.

In [None]:
birch_recency = Birch(n_clusters = n_clusters)

recency_visualizer = KElbowVisualizer(birch_recency, k=n_clusters,  size=(600, 600))

recency_visualizer.fit(df_recency)    # Fit the data to the visualizer
fig = recency_visualizer.poof()    # Draw/show/poof the data

The **Mini-batch K-means** clustering algorithm is a version of the standard K-means algorithm in machine learning. It uses small, random, fixed-size batches of data to store in memory, and then with each iteration, a random sample of the data is collected and used to update the clusters.

We visualize the Elbow method for MiniBatchKMeans clustering method.

In [None]:
minibatch_recency = MiniBatchKMeans(n_clusters = n_clusters)

recency_visualizer = KElbowVisualizer(minibatch_recency, k=n_clusters, size=(600, 600))

recency_visualizer.fit(df_recency)    # Fit the data to the visualizer
fig = recency_visualizer.poof()    # Draw/show/poof the data

We notice that k = 4 seems to be the most suitable number of clusters.
We decide to pick the KMeans method as this is the most commonly used method for RFM segmentation. 

Fit the model on the Recency column and predict the cluster of each row. We create a column called R whose values range from 0 to 3.

In [None]:
#build 4 clusters for recency and add it to dataframe

kmeans = KMeans(n_clusters=4)
kmeans.fit(df_user[['Recency']])
df_user['R'] = kmeans.predict(df_user[['Recency']])
df_user

Describe each cluster for the Recency variable 

In [None]:
#show details of the dataframe
df_user.groupby('R')['Recency'].describe()
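
KMeans numbers its clusters arbitrarily, so label 3 is not necessarily the "best" recency group. If ordered labels are wanted, one option is to relabel the clusters by their mean. The helper `order_clusters` below is a hypothetical sketch, not something the notebook defines.

```python
import pandas as pd

def order_clusters(frame, cluster_col, value_col, ascending=True):
    """Relabel clusters 0..k-1 so that the labels follow the cluster means of value_col."""
    means = frame.groupby(cluster_col)[value_col].mean().sort_values(ascending=ascending)
    mapping = {old: new for new, old in enumerate(means.index)}
    out = frame.copy()
    out[cluster_col] = out[cluster_col].map(mapping)
    return out

# Toy data: the raw labels carry no order (illustrative values).
toy = pd.DataFrame({"Recency": [5, 8, 300, 310, 120], "R": [0, 0, 2, 2, 1]})

# For Recency, a lower value is better, so we sort the means in descending order:
# the most recent buyers receive the highest label.
ordered = order_clusters(toy, "R", "Recency", ascending=False)
print(ordered)
```

For Frequency and Monetary Value, where higher is better, the same helper would be called with `ascending=True`.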

<a id="frequency"></a>
### Frequency

We count how many times each customer_unique_id appears and store the result in a dataframe with the columns CustomerUniqueID and Frequency, which gives the order frequency per customer.

In [None]:
# get the order counts for each customer and create a dataframe with them
# (rename_axis + reset_index(name=...) gives the same column names on pandas 1.x and 2.x)

df_frequency = (df['customer_unique_id'].value_counts()
                .rename_axis('CustomerUniqueID')
                .reset_index(name='Frequency'))
df_frequency

We merge this dataframe to our RFM dataframe previously containing the Recency clustering. 

In [None]:
#add this data to our main dataframe

df_user = pd.merge(df_user, df_frequency, on='CustomerUniqueID')
df_user

Plot a histogram of the Frequency variable among the customers.

In [None]:
# Set notebook mode to work in offline

pyoff.init_notebook_mode()

#plot the histogram

plot_data = [go.Histogram(x=df_user['Frequency'])]

plot_layout = go.Layout(title='Users Frequency')

fig = go.Figure(data=plot_data, layout=plot_layout)

fig.update_xaxes(title="Number of purchases")

fig.update_yaxes(title="Number of unique customers")

pyoff.iplot(fig)


We create a dataframe with the Frequency values and draw a sample of n = 10000 rows on which to apply the elbow method.
We keep testing values of k from 2 to 8 clusters.

In [None]:
df_frequency = df_user[['Frequency']].sample(n=10000)

We visualize the Elbow method for KMeans clustering method.

In [None]:
# Instantiate the clustering model and visualizer

kmeans_frequency = KMeans(n_clusters = n_clusters)

frequency_visualizer = KElbowVisualizer(kmeans_frequency, k=n_clusters,  size=(720, 720))

frequency_visualizer.fit(df_frequency)    # Fit the data to the visualizer
fig = frequency_visualizer.poof()    # Draw/show/poof the data

We visualize the Elbow method for Agglomerative clustering method.

In [None]:
# Instantiate the clustering model and visualizer

agg_frequency = AgglomerativeClustering(n_clusters = n_clusters)

frequency_visualizer = KElbowVisualizer(agg_frequency, k=n_clusters,  size=(720, 720))

frequency_visualizer.fit(df_frequency)    # Fit the data to the visualizer
fig = frequency_visualizer.poof()    # Draw/show/poof the data

We visualize the Elbow method for Birch clustering method.

In [None]:
# Instantiate the clustering model and visualizer

birch_frequency = Birch(n_clusters = n_clusters)

frequency_visualizer = KElbowVisualizer(birch_frequency, k=n_clusters, size=(720, 720))

frequency_visualizer.fit(df_frequency)    # Fit the data to the visualizer
fig = frequency_visualizer.poof()    # Draw/show/poof the data

Again, K=4 seems to be the most suitable number of clusters, and we keep the KMeans clustering method to stay consistent with the method used for Recency.

In [None]:
kmeans_frequency = KMeans(n_clusters=4)
kmeans_frequency.fit(df_user[['Frequency']])
df_user['F'] = kmeans_frequency.predict(df_user[['Frequency']])
df_user

Describe each cluster for the Frequency variable 

In [None]:
#show details of the dataframe
df_user.groupby('F')['Frequency'].describe()

<a id="monetary-value"></a>
### Monetary Value

Create a dictionary to rename the columns of the new dataframe, which will include Monetary Value as a new variable.

In [None]:
d = {'price': 'Monetary Value','customer_unique_id': 'CustomerUniqueID'}

We sum the prices per customer_unique_id and create a dataframe holding the monetary value per customer.

In [None]:
# calculate revenue for each customer

df_revenue = df.groupby('customer_unique_id').price.sum().reset_index().rename(columns=d)
df_revenue

We merge this dataframe to our RFM dataframe previously containing the Recency and Frequency clusterings. 

In [None]:
#add this data to our main dataframe

df_user = pd.merge(df_user, df_revenue, on='CustomerUniqueID')
df_user

Plot a histogram of the Monetary Value variable among the customers.

In [None]:
# Set notebook mode to work in offline

pyoff.init_notebook_mode()

#plot the histogram

plot_data = [go.Histogram(x=df_user['Monetary Value'])]

plot_layout = go.Layout(title='Users Monetary Value')

fig = go.Figure(data=plot_data, layout=plot_layout)

fig.update_xaxes(title="Monetary value")

fig.update_yaxes(title="Number of unique customers")

pyoff.iplot(fig)

We create a dataframe with the Monetary Value values and draw a sample of n = 10000 rows on which to apply the elbow method.
We keep testing values of k from 2 to 8 clusters.

In [None]:
df_revenue = df_user[['Monetary Value']].sample(n=10000)

We visualize the Elbow method for KMeans clustering method.

In [None]:
# Instantiate the clustering model and visualizer

kmeans_revenue = KMeans(n_clusters = n_clusters)

revenue_visualizer = KElbowVisualizer(kmeans_revenue, k=n_clusters,  size=(720, 720))

revenue_visualizer.fit(df_revenue)    # Fit the data to the visualizer
fig = revenue_visualizer.poof()    # Draw/show/poof the data

We visualize the Elbow method for Agglomerative clustering method.

In [None]:
# Instantiate the clustering model and visualizer

agg_revenue = AgglomerativeClustering(n_clusters = n_clusters)

revenue_visualizer = KElbowVisualizer(agg_revenue, k=n_clusters,  size=(720, 720))

revenue_visualizer.fit(df_revenue)    # Fit the data to the visualizer
fig = revenue_visualizer.poof()    # Draw/show/poof the data

We visualize the Elbow method for Birch clustering method.

In [None]:
# Instantiate the clustering model and visualizer

birch_revenue = Birch(n_clusters = n_clusters)

revenue_visualizer = KElbowVisualizer(birch_revenue, k=n_clusters,  size=(720, 720))

revenue_visualizer.fit(df_revenue)    # Fit the data to the visualizer
fig = revenue_visualizer.poof()    # Draw/show/poof the data

We visualize the Elbow method for MiniBatchKMeans clustering method.

In [None]:
# Instantiate the clustering model and visualizer

minibatch_revenue = MiniBatchKMeans(n_clusters = n_clusters)

revenue_visualizer = KElbowVisualizer(minibatch_revenue, k=n_clusters,  size=(720, 720))

revenue_visualizer.fit(df_revenue)    # Fit the data to the visualizer
fig = revenue_visualizer.poof()    # Draw/show/poof the data

Again, K=4 seems to be the most suitable number of clusters, and we keep the KMeans clustering method to stay consistent with the method used for the other variables.

In [None]:
kmeans_revenue = KMeans(n_clusters=4)
kmeans_revenue.fit(df_user[['Monetary Value']])
df_user['M'] = kmeans_revenue.predict(df_user[['Monetary Value']])
df_user

Describe each cluster for the Monetary Value variable 

In [None]:
#show details of the dataframe
df_user.groupby('M')['Monetary Value'].describe()

<a id="other-variables"></a>
## Other Variables

We consider adding other variables to our RFM segmentation.

In [None]:
df.head()

<a id="review-score"></a>
### Review Score

In [None]:
d = {'review_score': 'Review Score','customer_unique_id': 'CustomerUniqueID'}

In [None]:
# select the review score of each row, sorted from the highest to the lowest score

df_reviews = df[['customer_unique_id','review_score']].sort_values(by='review_score', ascending=False).rename(columns=d)
df_reviews

In [None]:
#add this data to our main dataframe

df_user = pd.merge(df_user, df_reviews, on='CustomerUniqueID')
df_user
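
Note that a merge on a key that repeats in both dataframes returns one row per matching pair, so the result can grow quickly. If a single row per customer is wanted, one alternative (an assumption here, not what the notebook does) is to aggregate the review scores per customer before merging. The toy frames below are made up for illustration.

```python
import pandas as pd

# Toy data: customer "a" left two reviews (illustrative values).
toy_orders = pd.DataFrame({
    "customer_unique_id": ["a", "a", "b"],
    "review_score": [5, 3, 4],
})
toy_users = pd.DataFrame({"CustomerUniqueID": ["a", "b"]})

# Average review score per customer, with the notebook's column naming.
mean_reviews = (
    toy_orders.groupby("customer_unique_id")["review_score"]
    .mean()
    .rename_axis("CustomerUniqueID")
    .reset_index(name="Review Score")
)

# One row per customer is preserved after the merge.
merged = toy_users.merge(mean_reviews, on="CustomerUniqueID", how="left")
print(merged)
```

The mean is only one possible aggregate; the last or the lowest score per customer could be used in the same way.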

In [None]:
# Set notebook mode to work in offline

pyoff.init_notebook_mode()

#plot the histogram

plot_data = [go.Histogram(x=df_user['Review Score'])]

plot_layout = go.Layout(title='Users Review Score')

fig = go.Figure(data=plot_data, layout=plot_layout)

fig.update_xaxes(title="Review Score")

fig.update_yaxes(title="Number of unique customers")

pyoff.iplot(fig)

<a id="freight-value"></a>
### Freight Value

In [None]:
d = {'freight_value': 'Freight Value','customer_unique_id': 'CustomerUniqueID'}

In [None]:
# select the freight value of each row, sorted from the highest to the lowest value

df_freight = df[['customer_unique_id','freight_value']].sort_values(by='freight_value', ascending=False).rename(columns=d)
df_freight

In [None]:
#add this data to our main dataframe

df_user = pd.merge(df_user, df_freight, on='CustomerUniqueID')
df_user

In [None]:
# Set notebook mode to work in offline

pyoff.init_notebook_mode()

#plot the histogram

plot_data = [go.Histogram(x=df_user['Freight Value'])]

plot_layout = go.Layout(title='Users Freight Value')

fig = go.Figure(data=plot_data, layout=plot_layout)

fig.update_xaxes(title="Freight Value")

fig.update_yaxes(title="Number of unique customers")

pyoff.iplot(fig)

<a id="delay-date"></a>
### Delay Date

<a id="rfm-normalization"></a>
## RFM Normalization
We want an overview of each variable using boxplots. We first normalize our data by transforming the variables with StandardScaler() from the scikit-learn library, so that the values are on comparable scales.

Standardizing the data keeps the clustering from being biased toward variables with larger scales. For instance, the variable Monetary Value takes larger values than Frequency and Recency.

In [None]:
# gather the five variables in a dataframe to apply the transformation
rfm = df_user[['Recency','Frequency','Monetary Value', 'Review Score', 'Freight Value']]
rfm

We transform the variables to a logarithmic scale.

In [None]:
# apply logarithmic transformation to get the same scale
rfm_log = np.log(rfm)

The logarithm of 0 is infinite, so we replace the infinite values produced by the transformation with 0.

In [None]:
rfm_log = rfm_log.replace([np.inf, -np.inf], 0)
rfm_log = pd.DataFrame(data = rfm_log, 
                            index = df_user.index, 
                            ) 
rfm_log
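
The sketch below reproduces this replacement on a toy dataframe and compares it with `np.log1p`, which computes log(1 + x) and is a common alternative when zeros are present. The `log1p` variant is an assumption for illustration and is not what the notebook uses.

```python
import numpy as np
import pandas as pd

# Toy values containing a zero (illustrative data).
toy = pd.DataFrame({"Recency": [0, 1, 99], "Frequency": [1, 2, 10]})

# Notebook approach: plain log, then replace the infinities with 0.
with np.errstate(divide="ignore"):
    logged = np.log(toy)
logged = logged.replace([np.inf, -np.inf], 0)

# Alternative: log(1 + x) is defined at 0 and keeps 0 and 1 distinct.
alt = np.log1p(toy)

print(logged)
print(alt)
```

With the replacement approach, a Recency of 0 and a Recency of 1 both end up at 0, because log(1) is also 0. With `log1p` the two values stay distinct.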

We use StandardScaler() to standardize our data before applying a clustering method to the variables in order to get our final clusters of customers.

In [None]:
# standardize the variables so that none of them biases the clustering method
scaler = StandardScaler()
rfm_standard = scaler.fit_transform(rfm_log)
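
As a reminder of what standardization computes, here is a small NumPy sketch on made-up values: each column is centered on its mean and divided by its standard deviation. StandardScaler uses the population standard deviation, which is also NumPy's default (`ddof=0`).

```python
import numpy as np

# Two toy columns on very different scales (illustrative values).
values = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])

# z = (x - mean) / std, computed column by column.
standardized = (values - values.mean(axis=0)) / values.std(axis=0)

print(standardized.mean(axis=0))  # close to 0 for every column
print(standardized.std(axis=0))   # close to 1 for every column
```

After this step, both columns have mean 0 and unit variance, so neither dominates the Euclidean distances used by KMeans.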

Create a dataframe ready for clustering

In [None]:
#turn the processed data back into a dataframe
rfm_standard = pd.DataFrame(data = rfm_standard, 
                            index = df_user.index, 
                            ) 
rfm_standard.columns = ['Recency_standard', 'Frequency_standard', 'Monetary_value_standard', 'Review_score_standard', 'Freight_value_standard']
rfm_standard

Observe the variables before clustering.
We start with the Recency variable, after the logarithmic transformation and standardization.

In [None]:
# Recency

# boxplot 

fig = plt.figure(figsize =([8, 8])) 
sns.set_style('darkgrid')
plt.style.use('ggplot')
sns.boxplot(x=rfm_standard['Recency_standard'],color="blue", orient="h")
plt.xlabel("Days of last purchase", size=16)
plt.title("Distribution of the Recency variable (log)", size=18, y=1.03)
plt.show()

Frequency variable observation at the logarithmic scale.

In [None]:
# Frequency

# boxplot

fig = plt.figure(figsize =([8, 8])) 
sns.set_style('darkgrid')
plt.style.use('ggplot')
sns.boxplot(x=rfm_standard['Frequency_standard'],color="blue", orient="h")
plt.xlabel("Number of orders", size=16)
plt.title("Distribution of the Frequency variable (log)", size=18, y=1.03)
plt.show()

Monetary value variable observation at the logarithmic scale.

In [None]:
# Revenue

# boxplot

fig = plt.figure(figsize =([8, 8])) 
sns.set_style('darkgrid')
plt.style.use('ggplot')
sns.boxplot(x=rfm_standard['Monetary_value_standard'],color="blue", orient="h")
plt.xlabel("Monetary value", size=16)
plt.title("Distribution of the Monetary Value variable (log)", size=18, y=1.03)
plt.show()

In [None]:
# Reviews

# boxplot

fig = plt.figure(figsize =([8, 8])) 
sns.set_style('darkgrid')
plt.style.use('ggplot')
sns.boxplot(x=rfm_standard['Review_score_standard'],color="blue", orient="h")
plt.xlabel("Review score", size=16)
plt.title("Distribution of the Review Score variable (log)", size=18, y=1.03)
plt.show()

In [None]:
# Freight

# boxplot

fig = plt.figure(figsize =([8, 8])) 
sns.set_style('darkgrid')
plt.style.use('ggplot')
sns.boxplot(x=rfm_standard['Freight_value_standard'],color="blue", orient="h")
plt.xlabel("Freight Value", size=16)
plt.title("Distribution of the Freight Value variable (log)", size=18, y=1.03)
plt.show()

<a id="customers-segmentation"></a>
## Customers Segmentation

Create a sample of the transformed RFM variables to test the Elbow method on different clustering methods. 

In [None]:
rfm_sample = rfm_standard.sample(n=10000)

We visualize the Elbow method for KMeans clustering method.

In [None]:
# Instantiate the clustering model and visualizer

kmeans_cluster = KMeans(n_clusters = n_clusters)

visualizer = KElbowVisualizer(kmeans_cluster, k=n_clusters,  size=(720, 720))

visualizer.fit(rfm_sample)    # Fit the data to the visualizer
fig = visualizer.poof()    # Draw/show/poof the data

We visualize the Elbow method for Agglomerative clustering method.

In [None]:
# Instantiate the clustering model and visualizer

agg_cluster = AgglomerativeClustering(n_clusters = n_clusters)

visualizer = KElbowVisualizer(agg_cluster, k=n_clusters,  size=(720, 720))

visualizer.fit(rfm_sample)    # Fit the data to the visualizer
fig = visualizer.poof()    # Draw/show/poof the data

We visualize the Elbow method for Birch clustering method.

In [None]:
# Instantiate the clustering model and visualizer

birch_cluster = Birch(n_clusters = n_clusters)

visualizer = KElbowVisualizer(birch_cluster, k=n_clusters,  size=(720, 720))

visualizer.fit(rfm_sample)    # Fit the data to the visualizer
fig = visualizer.poof()    # Draw/show/poof the data

We visualize the Elbow method for MiniBatchKMeans clustering method.

In [None]:
# Instantiate the clustering model and visualizer

minibatch_cluster = MiniBatchKMeans(n_clusters = n_clusters)

visualizer = KElbowVisualizer(minibatch_cluster, k=n_clusters,  size=(720, 720))

visualizer.fit(rfm_sample)    # Fit the data to the visualizer
fig = visualizer.poof()    # Draw/show/poof the data

We notice that the elbow is at K=4 or K=5, depending on the clustering method. We decide to go with KMeans again, with K=4, but we could have gone with Agglomerative clustering.

In [None]:
kmeans_cluster = KMeans(n_clusters=4)
kmeans_cluster.fit(rfm_standard)
rfm_standard['RFM Clusters'] = kmeans_cluster.predict(rfm_standard)
rfm_standard

We attach the standardized variables and the RFM clusters to each entry of our main dataframe df_user.

In [None]:
df_user = pd.concat([df_user, rfm_standard], axis=1)
df_user

We create a dataframe that contains the mean of each variable per RFM cluster, ordered from the best customers to the worst, so we can build a segmentation.

In [None]:
# segmentation using mean() to see details
# (selecting several columns after groupby requires a list, i.e. double brackets)

customer_segments = df_user.groupby('RFM Clusters')[['Recency','Frequency','Monetary Value','Review Score','Freight Value']].mean()
customer_segments.sort_values(by=['Monetary Value', 'Frequency'], ascending=False)


An overall score for each customer can be read as the sum of its three cluster labels (R, F and M).

Here, the KMeans clustering gives us four groups of typical customers based on recency, frequency and monetary value.

From now on, we can create customer segments from these groups.

According to our scope of work, Olist is looking for the customers who bring the best monetary value and order frequency, so we order our segments with this in mind.

From the best group of customers to the worst, we name the segments as follows:

- **Diamond**
- **Gold**
- **Silver**
- **Bronze**

In [None]:
df_user.loc[df_user['RFM Clusters']== 0,'Segment'] = 'Bronze'
df_user.loc[df_user['RFM Clusters']== 2,'Segment'] = 'Silver' 
df_user.loc[df_user['RFM Clusters']== 3,'Segment'] = 'Gold' 
df_user.loc[df_user['RFM Clusters']== 1,'Segment'] = 'Diamond'
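
Because KMeans labels can change from one run to the next, a hard-coded mapping from cluster number to segment name may need to be updated after each fit. As a hedged alternative, the sketch below derives the mapping from the cluster means. The toy values and the choice of ranking on Monetary Value alone are assumptions made for this example.

```python
import pandas as pd

# Toy clustered customers (illustrative values).
toy = pd.DataFrame({
    "RFM Clusters": [0, 0, 1, 2, 3, 3],
    "Monetary Value": [20.0, 30.0, 900.0, 120.0, 300.0, 340.0],
})

# Segment names, from the best group to the worst.
names = ["Diamond", "Gold", "Silver", "Bronze"]

# Rank the clusters by mean monetary value, highest first.
ranked = (
    toy.groupby("RFM Clusters")["Monetary Value"]
    .mean()
    .sort_values(ascending=False)
)
segment_of = dict(zip(ranked.index, names))

toy["Segment"] = toy["RFM Clusters"].map(segment_of)
print(toy)
```

On this toy data the derived mapping matches the one written by hand above (1 is Diamond, 3 is Gold, 2 is Silver, 0 is Bronze), but it would adapt automatically if the labels were permuted.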

We create a dataframe to store the count of each segment of customers.

In [None]:
segments_counts = df_user['Segment'].value_counts().sort_values(ascending=True)
segments_counts

We notice that the best customers of Olist, **Diamond (12.85%) + Gold (35.19%)**, represent **48.04%** of the customer base.
The least interesting customers, **Bronze**, represent **34.91%**.

In [None]:
plt.style.use('ggplot')
sns.set_style('darkgrid')

fig, ax = plt.subplots(figsize=(10, 10))
bars = ax.barh(range(len(segments_counts)), segments_counts, color='purple')
plt.title("Customers repartition per segment", size=22, y=1.03)
plt.xlabel("Customers base", size=18)
plt.ylabel("% Customers per segment", size=18)
ax.tick_params(left=False, bottom=False, labelbottom=False)
ax.set_yticks(range(len(segments_counts)))
ax.set_yticklabels(segments_counts.index,  fontsize = 14)

for i, bar in enumerate(bars):
        value = bar.get_width()
        ax.text(value,bar.get_y() + bar.get_height()/2,'{:,} ({:,.2f}%)'.format(int(value),
                float(value*100/segments_counts.sum())),
                va='center',
                ha='left'
               )

plt.savefig('count_segmentation.png', dpi=300, format='png', bbox_inches='tight') # don't crop the legend while saving the figure
plt.show()

<a id="graphic-representations"></a>
## Graphic Representations

We'll use the Plotly library to plot our segments in 2D and 3D, using the standardized variables for a more readable representation.

In [None]:
# Set notebook mode to work in offline

pyoff.init_notebook_mode()

fig = px.scatter_3d(
    df_user, x="Recency_standard", y="Frequency_standard", z="Monetary_value_standard", color='Segment',
    title='3D representation for RFM segments KMeans-based clustering', opacity = 0.7)
fig.show()

In [None]:
# Set notebook mode to work in offline

pyoff.init_notebook_mode()

fig = px.scatter(
    df_user, x="Recency_standard", y="Frequency_standard", color='Segment',
    title='Recency VS Frequency for RFM segments KMeans-based clustering',
)
fig.show()

In [None]:
# Set notebook mode to work in offline

pyoff.init_notebook_mode()

fig = px.scatter(
    df_user, x="Recency_standard", y="Monetary_value_standard", color='Segment',
    title='Recency VS Monetary Value for RFM segments KMeans-based clustering',
)
fig.show()

In [None]:
# Set notebook mode to work in offline

pyoff.init_notebook_mode()

fig = px.scatter(
    df_user, x="Frequency_standard", y="Monetary_value_standard", color='Segment',
    title='Frequency VS Monetary Value for RFM segments KMeans-based clustering',
)
fig.show()

<a id="production"></a>
## Prepare for Production

In [None]:
os.chdir(r'./')
df_user.to_csv('df_segmentation.csv',sep = '\t',index = True)
from IPython.display import FileLink
FileLink(r'df_segmentation.csv')