<a href="https://www.kaggle.com/mickaelnarboni/clients-segmentation-rfm-maintenance?scriptVersionId=88739643" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Table of Contents

* [Dataframes frequency](#dataframes-frequency)
* [Clusters stability](#clusters-stability)
* [Adjusted Rand Index](#ari)
* [Final thoughts](#final-thoughts)


Import the relevant libraries for the notebook.

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import warnings
warnings.filterwarnings("ignore") # ignore the warnings about file size
import matplotlib.pyplot as plt
from matplotlib import colors
%matplotlib inline
import seaborn as sns
from time import process_time
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
import math  # the wildcard "from math import *" is dropped: it shadows built-ins and nothing below needs it
from sklearn.metrics import adjusted_rand_score
import plotly.io as pio
pio.renderers.default = 'iframe'
import scipy.cluster.hierarchy as shc
from sklearn.cluster import AgglomerativeClustering
from sklearn.cluster import KMeans
from sklearn.cluster import Birch
from sklearn.cluster import SpectralClustering
from sklearn.cluster import DBSCAN
from sklearn.cluster import MiniBatchKMeans
from yellowbrick.cluster import KElbowVisualizer, SilhouetteVisualizer
from yellowbrick.contrib.scatter import ScatterVisualizer
import plotly.offline as pyoff
import plotly.graph_objs as go
import plotly.express as px


# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/olist-clients-segmentation/database_p4.csv
/kaggle/input/olist-clients-segmentation/geolocation_p4.csv
/kaggle/input/olist-clients-segmentation/rfm_clustering.csv


Import our dataframe. 

In [3]:
df = pd.read_csv('../input/olist-clients-segmentation/df_segmentation.csv',sep='\t', index_col=[0], low_memory=False)
df

Unnamed: 0,CustomerUniqueID,MaxPurchaseDate,Recency,R,Frequency,F,Monetary Value,M,Recency_standard,Frequency_standard,Monetary_value_standard,RFM Clusters,Segment
0,861eff4711a542e4b93843c6dd7febb0,2017-05-16 15:05:35,474,2,1,0,124.99,0,1.078731,-0.574149,0.163933,0,Bronze
1,290c77bc529b7ac935b93aa66c333dc3,2018-01-12 20:48:24,233,0,1,0,289.00,0,0.283691,-0.574149,0.981946,0,Bronze
2,060e732b5b29e8181a18229c7b0b2b5e,2018-05-19 16:07:45,106,3,1,0,139.94,0,-0.598033,-0.574149,0.274193,0,Bronze
3,259dac757896d24d7702b9acbbff3f3c,2018-03-13 16:06:38,173,0,1,0,149.94,0,-0.049639,-0.574149,0.341553,0,Bronze
4,345ecd01c38d18a9036ed96c73b8d066,2018-07-29 09:51:30,35,3,1,0,230.00,0,-1.838550,-0.574149,0.759096,3,Gold
...,...,...,...,...,...,...,...,...,...,...,...,...,...
115604,1a29b476fee25c95fbafc67c5ac95cf8,2018-04-07 15:48:17,148,0,1,0,74.90,0,-0.224371,-0.574149,-0.335818,1,Diamond
115605,d52a67c98be1cf6a5c84435bd38d095d,2018-04-04 08:20:22,152,0,1,0,114.90,0,-0.194516,-0.574149,0.081788,0,Bronze
115606,e9f50caf99f032f0bf3c55141f019d99,2018-04-08 20:11:50,147,0,1,0,37.00,0,-0.231961,-0.574149,-1.024074,1,Diamond
115607,73c2643a0a458b49f58cea58833b192e,2017-11-03 21:08:33,303,1,1,0,689.00,3,0.577779,-0.574149,1.829843,0,Bronze


Return the shape of the dataframe. 

In [None]:
df.shape

We sort the entries by Recency in descending order, which makes it easier to cut regular recency intervals and split our customers.
Note that the oldest customer order was fulfilled 728 days (almost 2 years) before the Olist data was downloaded.

In [None]:
df.sort_values(by=['Recency'], ascending=False, inplace=True)
df

<a id="dataframes-frequency"></a>
## Dataframes frequency

We split the dataframe into recency windows of n = 90 days (one business quarter), going from the oldest to the most recent customers, to find out at what point the model becomes obsolete and needs maintenance work by Olist.

In [None]:
n = 90

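Create the B0 subdataframe, which holds the oldest customers (recency above 692 days, up to the maximum of 728 days).

In [None]:
# Compare the Recency column with scalar bounds; comparing it with a dataframe slice fails
B0 = df[(df['Recency'] <= 728) & (df['Recency'] > 692)]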
B0

Create B1 subdataframe based on n condition.

In [None]:
B1 = df[(df['Recency'] < B0['Recency'].iloc[-1]) & (df['Recency'] > B0['Recency'].iloc[-1] - n)]
B1

Concat the two last subdataframes to create a cumulative dataframe.

In [None]:
B1 = pd.concat([B0, B1], axis=0)
B1.sort_values(by=['Recency'], ascending=False, inplace=True)
B1

Create B2 subdataframe based on n condition.

In [None]:
B2 = df[(df['Recency'] < B1['Recency'].iloc[-1]) & (df['Recency'] > B1['Recency'].iloc[-1] - n)]
B2

Concat the two last subdataframes to create a cumulative dataframe.

In [None]:
B2 = pd.concat([B2, B1], axis=0)
B2.sort_values(by=['Recency'], ascending=False, inplace=True)
B2

Create B3 subdataframe based on n condition.

In [None]:
B3 = df[(df['Recency'] < B2['Recency'].iloc[-1]) & (df['Recency'] > B2['Recency'].iloc[-1] - n)]
B3

Concat the two last subdataframes to create a cumulative dataframe.

In [None]:
B3 = pd.concat([B3, B2], axis=0)
B3.sort_values(by=['Recency'], ascending=False, inplace=True)
B3

Create B4 subdataframe based on n condition.

In [None]:
B4 = df[(df['Recency'] < B3['Recency'].iloc[-1]) & (df['Recency'] > B3['Recency'].iloc[-1] - n)]
B4

Concat the two last subdataframes to create a cumulative dataframe.

In [None]:
B4 = pd.concat([B4, B3], axis=0)
B4.sort_values(by=['Recency'], ascending=False, inplace=True)
B4

Create B5 subdataframe based on n condition.

In [None]:
B5 = df[(df['Recency'] < B4['Recency'].iloc[-1]) & (df['Recency'] > B4['Recency'].iloc[-1] - n)]
B5

Concat the two last subdataframes to create a cumulative dataframe.

In [None]:
B5 = pd.concat([B5, B4], axis=0)
B5.sort_values(by=['Recency'], ascending=False, inplace=True)
B5

Create B6 subdataframe based on n condition.

In [None]:
B6 = df[(df['Recency'] < B5['Recency'].iloc[-1]) & (df['Recency'] > B5['Recency'].iloc[-1] - n)]
B6

Concat the two last subdataframes to create a cumulative dataframe.

In [None]:
B6 = pd.concat([B6, B5], axis=0)
B6.sort_values(by=['Recency'], ascending=False, inplace=True)
B6

Create B7 subdataframe based on n condition.

In [None]:
B7 = df[(df['Recency'] < B6['Recency'].iloc[-1]) & (df['Recency'] > B6['Recency'].iloc[-1] - n)]
B7

Concat the two last subdataframes to create a cumulative dataframe.

In [None]:
B7 = pd.concat([B7, B6], axis=0)
B7.sort_values(by=['Recency'], ascending=False, inplace=True)
B7
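The repeated split-then-concatenate cells above can be condensed into a loop. The following is only a sketch on a toy dataframe: the helper name `cumulative_recency_splits` and the toy values are assumptions, and it uses a uniform step for every window, whereas the cells above give B0 a shorter first window.

```python
import pandas as pd

# Toy stand-in for the notebook's dataframe (values are illustrative only)
toy = pd.DataFrame({"Recency": [728, 700, 650, 610, 540, 400, 300, 120, 30, 5]})


def cumulative_recency_splits(frame, step, col="Recency"):
    """Return cumulative subdataframes, oldest customers first.

    Each split keeps every row whose recency is above a lower bound that
    moves down by `step` days, so each split contains the previous one.
    """
    ordered = frame.sort_values(by=col, ascending=False)
    lower = ordered[col].max() - step
    splits = []
    while True:
        splits.append(ordered[ordered[col] > lower])
        if lower < ordered[col].min():
            break  # the last split already contains every row
        lower -= step
    return splits


splits = cumulative_recency_splits(toy, step=90)
sizes = [len(s) for s in splits]
print(sizes)
```

Because every split is a superset of the previous one, no separate `pd.concat` step is needed.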

The B8 split and its cumulative dataframe follow the same pattern.

For the stability analysis below, we redefine the cumulative subdataframes as slices of the dataframe sorted by Recency (oldest customers first), each one adding 10,000 rows to the previous one.

In [None]:
B0 = df.iloc[0:10000]
B1 = df.iloc[0:20000]
B2 = df.iloc[0:30000]
B3 = df.iloc[0:40000]
B4 = df.iloc[0:50000]
B5 = df.iloc[0:60000]
B6 = df.iloc[0:70000]
B7 = df.iloc[0:80000]
B8 = df.iloc[0:90000]
B9 = df.iloc[0:100000]
B10 = df.iloc[0:110000]

We create a dataframe for each subdataframe that will contain our three RFM variables after transformation.

In [None]:
rfm_B0 = B0[['Recency_standard','Frequency_standard','Monetary_value_standard']]
rfm_B1 = B1[['Recency_standard','Frequency_standard','Monetary_value_standard']]
rfm_B2 = B2[['Recency_standard','Frequency_standard','Monetary_value_standard']]
rfm_B3 = B3[['Recency_standard','Frequency_standard','Monetary_value_standard']]
rfm_B4 = B4[['Recency_standard','Frequency_standard','Monetary_value_standard']]
rfm_B5 = B5[['Recency_standard','Frequency_standard','Monetary_value_standard']]
rfm_B6 = B6[['Recency_standard','Frequency_standard','Monetary_value_standard']]
rfm_B7 = B7[['Recency_standard','Frequency_standard','Monetary_value_standard']]
rfm_B8 = B8[['Recency_standard','Frequency_standard','Monetary_value_standard']]
rfm_B9 = B9[['Recency_standard','Frequency_standard','Monetary_value_standard']]
rfm_B10 = B10[['Recency_standard','Frequency_standard','Monetary_value_standard']]
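The fixed-size cumulative slices and their RFM feature frames can also be built with list comprehensions. This is a sketch under assumptions: the helper `cumulative_row_batches` is hypothetical and the toy dataframe holds random numbers, not Olist data.

```python
import numpy as np
import pandas as pd

FEATURES = ["Recency_standard", "Frequency_standard", "Monetary_value_standard"]

# Toy stand-in for the sorted dataframe (random values, illustrative only)
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(55, 3)), columns=FEATURES)


def cumulative_row_batches(frame, batch_size, n_batches):
    """Return slices frame[0:batch_size], frame[0:2*batch_size], and so on."""
    return [frame.iloc[: batch_size * (i + 1)] for i in range(n_batches)]


batches = cumulative_row_batches(toy, batch_size=10, n_batches=5)
rfm_batches = [b[FEATURES] for b in batches]
print([len(b) for b in rfm_batches])
```

With `batch_size=10000` and `n_batches=11`, the same call would produce slices with the same bounds as B0 to B10.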

<a id="clusters-stability"></a>
## Clusters stability

Create a KMeans method with the same hyperparameters tested in the modeling notebook.

In [None]:
# Use the hyperparameters from our model K=4 using KMeans clustering method
kmeans_cluster = KMeans(n_clusters=4)

Create a model M(i) to fit each subdataframe B(i).

In [None]:
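from sklearn.base import clone

# fit() returns the estimator itself, so each model needs its own copy;
# otherwise M0..M10 would all be the same object, last fitted on B10
M0 = clone(kmeans_cluster).fit(rfm_B0)
M1 = clone(kmeans_cluster).fit(rfm_B1)
M2 = clone(kmeans_cluster).fit(rfm_B2)
M3 = clone(kmeans_cluster).fit(rfm_B3)
M4 = clone(kmeans_cluster).fit(rfm_B4)
M5 = clone(kmeans_cluster).fit(rfm_B5)
M6 = clone(kmeans_cluster).fit(rfm_B6)
M7 = clone(kmeans_cluster).fit(rfm_B7)
M8 = clone(kmeans_cluster).fit(rfm_B8)
M9 = clone(kmeans_cluster).fit(rfm_B9)
M10 = clone(kmeans_cluster).fit(rfm_B10)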
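When several models are fitted from one configured estimator, it helps to remember that scikit-learn's `fit` returns the estimator itself rather than a new object. The sketch below, on random toy data (an assumption, not the Olist features), shows the aliasing and how `sklearn.base.clone` avoids it.

```python
import numpy as np
from sklearn.base import clone
from sklearn.cluster import KMeans

# Random toy data standing in for two subdataframes of different sizes
rng = np.random.default_rng(0)
X_small = rng.normal(size=(40, 3))
X_large = rng.normal(size=(80, 3))

shared = KMeans(n_clusters=4, n_init=10, random_state=0)

# Both names point to the same estimator, which ends up fitted on X_large only
a = shared.fit(X_small)
b = shared.fit(X_large)
same_object = a is b

# clone() copies the hyperparameters into a fresh, unfitted estimator
m_small = clone(shared).fit(X_small)
m_large = clone(shared).fit(X_large)
independent = m_small is not m_large

print(same_object, independent, a.labels_.shape, m_small.labels_.shape)
```

Keeping one estimator per subdataframe is what makes the reference model fitted on the first batch stay fixed while the later ones are trained.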

Create the label arrays by using model M0, fitted on B0, to predict the clusters of each subdataframe B(i).

In [None]:
label_00 = M0.predict(rfm_B0)
label_01 = M0.predict(rfm_B1)
label_02 = M0.predict(rfm_B2)
label_03 = M0.predict(rfm_B3)
label_04 = M0.predict(rfm_B4)
label_05 = M0.predict(rfm_B5)
label_06 = M0.predict(rfm_B6)
label_07 = M0.predict(rfm_B7)
label_08 = M0.predict(rfm_B8)
label_09 = M0.predict(rfm_B9)
label_010 = M0.predict(rfm_B10)

Create the label arrays obtained by fitting each model M(i) on its own subdataframe B(i) and predicting its clusters.

In [None]:
label_11 = M1.fit_predict(rfm_B1)
label_22 = M2.fit_predict(rfm_B2)
label_33 = M3.fit_predict(rfm_B3)
label_44 = M4.fit_predict(rfm_B4)
label_55 = M5.fit_predict(rfm_B5)
label_66 = M6.fit_predict(rfm_B6)
label_77 = M7.fit_predict(rfm_B7)
label_88 = M8.fit_predict(rfm_B8)
label_99 = M9.fit_predict(rfm_B9)
label_1010 = M10.fit_predict(rfm_B10)

Now we compare the labels that model M0 predicts on each B(i) with the labels from the model M(i) that was fitted on that same B(i).

In [None]:
ari0 = adjusted_rand_score(label_00, label_00)
ari1 = adjusted_rand_score(label_01, label_11)
ari2 = adjusted_rand_score(label_02, label_22)
ari3 = adjusted_rand_score(label_03, label_33)
ari4 = adjusted_rand_score(label_04, label_44)
ari5 = adjusted_rand_score(label_05, label_55)
ari6 = adjusted_rand_score(label_06, label_66)
ari7 = adjusted_rand_score(label_07, label_77)
ari8 = adjusted_rand_score(label_08, label_88)
ari9 = adjusted_rand_score(label_09, label_99)
ari10 = adjusted_rand_score(label_010, label_1010)
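The fit, predict, and compare steps above can be written as a single loop. The following is a minimal sketch on synthetic blobs, assuming a fixed `random_state` and an explicit `n_init` (neither is set in the notebook) so that the run is repeatable.

```python
from sklearn.base import clone
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic stand-in for the standardized RFM features
X, _ = make_blobs(n_samples=600, centers=4, n_features=3, random_state=0)

# Cumulative batches: first 100 rows, first 200 rows, and so on
batches = [X[: 100 * (i + 1)] for i in range(6)]

base = KMeans(n_clusters=4, n_init=10, random_state=0)
reference = clone(base).fit(batches[0])  # plays the role of M0

ari_scores = []
for batch in batches:
    reference_labels = reference.predict(batch)     # M0 applied to B(i)
    refit_labels = clone(base).fit_predict(batch)   # M(i) fitted on B(i)
    ari_scores.append(adjusted_rand_score(reference_labels, refit_labels))

print(ari_scores)
```

The first score compares the reference model with an identical refit on the same data, so it serves as a sanity check for the loop.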

<a id="ari"></a>
## Adjusted Rand Index

The Adjusted Rand Index (ARI) scores the stability of our model: it measures whether the initial model M0 and a model refitted on newer data group customers into the same clusters. The closer the score is to 1, the better.
We compute it on the three variables built in the modeling notebook: Recency, Frequency, and Monetary Value.
We apply the models to each subdataframe created above to track how stability evolves over time.
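A small example may help to read the score. ARI compares partitions, not cluster numbers, so two labelings that group the points identically score 1 even when the cluster ids differ. The label lists below are made up for illustration.

```python
from sklearn.metrics import adjusted_rand_score

labels_a = [0, 0, 1, 1, 2, 2]
labels_b = [2, 2, 0, 0, 1, 1]  # same groups as labels_a, cluster ids renamed
labels_c = [0, 1, 0, 1, 0, 1]  # splits every group of labels_a apart

same_partition = adjusted_rand_score(labels_a, labels_b)
different_partition = adjusted_rand_score(labels_a, labels_c)

print(same_partition, different_partition)
```

This is why the labels of M0 and M(i) can be compared directly, without first matching cluster numbers between the two models.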

We create a table that contains all our ARI scores. 

In [None]:
list1 = [ari0, ari1, ari2, ari3, ari4, ari5, ari6, ari7, ari8, ari9, ari10]
# Column names are passed as a list; recent pandas versions reject a set here
ari_table = pd.DataFrame(data=list1, columns=['ARI'])
ari_table

Plot the ARI curve across the subdataframes.
Each point adds 10,000 more orders, taken in order of recency from the oldest to the most recent.

In [None]:
pd.options.plotting.backend = "plotly"
fig = ari_table.plot(
    title="RFM Predictions for clients segmentation with KMeans using ARI",
    labels=dict(index="10k Orders Intervals", value="ARI Score", variable=""),
)
fig.show()
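To turn the curve into a maintenance signal, one option is to look for the first interval where the ARI falls below a chosen threshold. The sketch below is purely illustrative: the ARI values are made up (they are not results from this notebook), and both the 0.8 threshold and the helper `first_batch_below` are assumptions.

```python
import pandas as pd

# Made-up ARI values for illustration only; NOT results from this notebook
example_ari = pd.Series([1.0, 0.95, 0.91, 0.84, 0.78, 0.71], name="ARI")


def first_batch_below(scores, threshold):
    """Return the index of the first score under the threshold, or None."""
    below = scores[scores < threshold]
    return None if below.empty else int(below.index[0])


refit_at = first_batch_below(example_ari, threshold=0.8)  # hypothetical threshold
print(refit_at)
```

Applied to the real `ari_table['ARI']` column, the returned index would correspond to a 10,000-order interval on the plot's x-axis.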

<a id="final-thoughts"></a>
## Final thoughts