# <font color= 'blue'> Project: High Value Customer Identification (Insiders)</font>


**Business Challenge**

Create a customer loyalty program to increase purchase frequency.

**Business Planning (IOT)**

**<font color= 'green'>Input</font>**

**1. Business Problem**
- Select the most valuable customers to join a loyalty program.

**2. Dataset**
    
<u>One year e-commerce sales.</u>
    
   - Invoice No: Invoice number (a 6-digit integer uniquely assigned to each transaction)

   - Stock Code: Product (item) code

   - Description: Product (item) name

   - Quantity: The quantities of each product (item) per transaction

   - Invoice Date: The day when each transaction was generated

   - Unit Price: Unit price (Product price per unit)

   - Customer ID: Customer number (Unique ID assigned to each customer)
    
   - Country: Country name (The name of the country where each customer resides)
    
**<font color= 'green'>Output</font>**
- **1.** <u>Indicate customers who will be part of a loyalty program called Insiders.</u>
     - List: client_id | is_insider
             10323 |   yes
             32413 |   no
- **2.** <u>A report answering the business questions.</u>
    - Who are the customers eligible to join the program?
    - How many customers will be part of this group?
    - What are the main characteristics of these customers?
    - What is the contribution percentage revenue from Insiders?
    - What is the group's revenue expectation for the coming months?
    - What are the conditions for selecting customers to join Insiders?
    - What are the conditions for removing customers from Insiders?
    - What is the guarantee that the Insiders program is better than the rest of the base?
    - What actions can the marketing team take to increase revenue?
    
**<font color= 'green'>Tasks</font>**
- <u>**1.** Who are the customers eligible to join the program?</u>
  - What does it mean to be eligible? What is a high-value customer?
  - Revenue: ticket, basket size, high LTV (Lifetime Value), churn probability, high TVC forecast, purchasing propensity.
  - Cost: lower return rate.
  - Purchase experience: high average evaluation rate.
     
- <u>**2.** How many customers will be part of this group?</u>
  - Total number of customers.
  - % Insiders group.
        
- <u>**3.** What are the main characteristics of these customers?</u>
  - Age
  - Location
  - Other characteristics.
  - Ticket, basket size, high LTV, churn probability, high TVC forecast, purchasing propensity.
       
        
- <u>**4.** What is the contribution percentage revenue from Insiders?</u>
   - Total revenue for the year.
   - Insiders group revenue.
         
- <u>**5.** What is the group's revenue expectation for the coming months?</u>
  - LTV of the Insiders group.
  - Cohort analysis.
        
- <u>**6.** What are the conditions for selecting customers to join Insiders?</u>
  - Define the periodicity
  - The person needs to have similar characteristics with someone in the group.
    
- <u>**7.** What are the conditions for removing customers from Insiders?</u>
  - Define the periodicity
  - The customer no longer has characteristics similar to anyone in the group.
        
- <u>**8.** What is the guarantee that the Insiders program is better than the rest of the base?</u>
  - A/B test
  - Bayesian A/B test
  - Hypothesis test
        
- <u>**9.** What actions can the marketing team take to increase revenue?</u>
  - Discount
  - Purchase preference
  - Purchase shipping
  - Company visit  
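
Task 8 lists a hypothesis test as one way to check whether Insiders outperform the rest of the base. The sketch below shows one possible form of such a test, a one-sided permutation test on mean revenue. The function name `permutation_test`, its parameters, and the revenue figures are all hypothetical; this is an illustration under those assumptions, not the method chosen for the project.

```python
import numpy as np

def permutation_test(group_a, group_b, n_permutations=5000, seed=42):
    """One-sided permutation test for mean(group_a) > mean(group_b)."""
    rng = np.random.default_rng(seed)
    group_a = np.asarray(group_a, dtype=float)
    group_b = np.asarray(group_b, dtype=float)
    observed = group_a.mean() - group_b.mean()
    pooled = np.concatenate([group_a, group_b])
    count = 0
    for _ in range(n_permutations):
        # Shuffle the pooled values and split them into two pseudo-groups.
        shuffled = rng.permutation(pooled)
        diff = shuffled[:len(group_a)].mean() - shuffled[len(group_a):].mean()
        if diff >= observed:
            count += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return observed, (count + 1) / (n_permutations + 1)

# Toy revenue per customer (made-up numbers, not from the dataset).
insiders_revenue = [820.0, 910.0, 780.0, 1050.0, 990.0, 870.0]
others_revenue = [210.0, 340.0, 150.0, 400.0, 280.0, 190.0]

diff, p_value = permutation_test(insiders_revenue, others_revenue)
print('Mean difference: {:.2f} | p-value: {:.4f}'.format(diff, p_value))
```

A small p-value only says the observed gap is unlikely under random group assignment. It does not replace a controlled A/B test of the program itself.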
        
**<font color= 'green'>Benchmark Solutions</font>**

- **Desk Research**
   - RFM model (recency, frequency, monetary): rank customers on each dimension to build an RFM score.
   - Recency: How recently a customer has made a purchase
   - Frequency: How often a customer makes a purchase
   - Monetary Value: How much money a customer spends on purchases
     
- <u>example project:</u> https://guillaume-martin.github.io/rfm-segmentation-with-python.html
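
As a rough illustration of the RFM idea above, the sketch below scores each dimension in quartiles (1 to 4) and sums the three scores. The toy table, the `quartile_score` helper, and the choice of quartiles are assumptions made for this example; the linked project may score customers differently.

```python
import pandas as pd

# Hypothetical per-customer RFM table (toy values, not from the project dataset).
rfm = pd.DataFrame({
    'customer_id':   [1, 2, 3, 4, 5, 6, 7, 8],
    'recency_days':  [5, 40, 200, 15, 90, 300, 2, 60],
    'frequency':     [12, 3, 1, 8, 2, 1, 20, 4],
    'gross_revenue': [900.0, 150.0, 30.0, 600.0, 80.0, 20.0, 2500.0, 210.0],
})

def quartile_score(series, higher_is_better=True):
    """Rank the values, cut the ranks into 4 equal bins and score them 1..4."""
    ranks = series.rank(method='first', ascending=higher_is_better)
    return pd.qcut(ranks, q=4, labels=[1, 2, 3, 4]).astype(int)

# Recent buyers (few days since the last purchase) get the highest R score.
rfm['r_score'] = quartile_score(rfm['recency_days'], higher_is_better=False)
rfm['f_score'] = quartile_score(rfm['frequency'])
rfm['m_score'] = quartile_score(rfm['gross_revenue'])
rfm['rfm_score'] = rfm[['r_score', 'f_score', 'm_score']].sum(axis=1)

print(rfm.sort_values('rfm_score', ascending=False))
```

Ranking before binning avoids `qcut` failing on duplicated bin edges when many customers share the same value.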

In [None]:
import numpy      as np
import pandas     as pd
import seaborn    as sns
import umap.umap_ as umap

from sklearn.manifold import TSNE
from sklearn          import cluster       as c
from sklearn          import metrics       as m
from sklearn          import mixture       as mx
from sklearn          import ensemble      as en
from sklearn          import preprocessing as pp
from sklearn          import decomposition as dd
from plotly           import express       as px
from matplotlib       import pyplot        as plt
from scipy.cluster    import hierarchy     as hc

from datetime            import datetime
from pandas_profiling    import ProfileReport
from IPython.display     import Image, HTML
from yellowbrick.cluster import KElbowVisualizer, SilhouetteVisualizer
from sklearn.neighbors import NearestNeighbors

# 0.0. Imports

In [None]:
import re
import numpy   as np
import pandas  as pd
import seaborn as sns

import umap.umap_ as umap

from matplotlib import pyplot as plt

from sklearn import cluster       as c
from sklearn import metrics       as m
from sklearn import ensemble      as en
from sklearn import preprocessing as pp
from sklearn import decomposition as dd
from sklearn import manifold      as mn
from sklearn import mixture       as mx
from IPython.display import Image, HTML
from plotly import express as px
from scipy.cluster import hierarchy as hc
from sqlalchemy import create_engine

## 0.1. Helper Functions

In [None]:
def jupyter_settings():
    # render figures inline; the deprecated %pylab magic is dropped because
    # numpy and matplotlib are already imported explicitly
    %matplotlib inline

    plt.style.use( 'bmh' )
    plt.rcParams['figure.figsize'] = [25, 12]
    plt.rcParams['font.size'] = 24

    display( HTML( '<style>.container { width:100% !important; }</style>') )
    pd.options.display.max_columns = None
    pd.options.display.max_rows = None
    pd.set_option( 'display.expand_frame_repr', False )

    sns.set()

jupyter_settings()

## 0.2. Load dataset

In [None]:
# load data
df_raw = pd.read_csv('data/Ecommerce.csv')

# drop extra column
df_raw = df_raw.drop(columns=['Unnamed: 8'])

# 1.0. Data Description

In [None]:
df1 = df_raw.copy()

## 1.1. Rename columns

In [None]:
cols_new = ['invoice_no', 'stock_code', 'description', 'quantity', 'invoice_date',
       'unit_price', 'customer_id', 'country']
df1.columns = cols_new

## 1.2. Data dimensions

In [None]:
print('Number of Rows: {}'.format(df1.shape[0]))
print('Number of Columns: {}'.format(df1.shape[1]))

## 1.3. Data types

In [None]:
df1.dtypes

## 1.4. Check NA

In [None]:
df1.isna().sum()

## 1.5. Replace NA

In [None]:
df_missing = df1.loc[df1['customer_id'].isna(),:]
df_not_missing = df1.loc[~df1['customer_id'].isna(),:]

In [None]:
# create reference
df_backup = pd.DataFrame(df_missing['invoice_no'].drop_duplicates())
df_backup['customer_id'] = np.arange(19000, 19000+len(df_backup),1)

# merge original with reference dataframe
df1 = pd.merge(df1, df_backup, on = 'invoice_no', how= 'left')

# coalesce
df1['customer_id'] = df1['customer_id_x'].combine_first(df1['customer_id_y'])

# drop extra columns
df1 = df1.drop(columns=['customer_id_x','customer_id_y'], axis=1)
   
df1.head()    


In [None]:
# No NaN in 'customer_id'
df1.isna().sum()
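
The cells above give every invoice without a `customer_id` its own surrogate ID instead of dropping the rows. The sketch below replays that idea on a toy table. The column values are made up, and deriving the starting ID from the data (rather than the fixed 19000 used above) is an assumption of this example.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: invoices 'X1' and 'X2' have no customer_id.
sales = pd.DataFrame({
    'invoice_no':  ['A1', 'X1', 'X1', 'X2'],
    'customer_id': [17850.0, np.nan, np.nan, np.nan],
})

missing_invoices = sales.loc[sales['customer_id'].isna(), 'invoice_no'].drop_duplicates()

# Surrogate IDs start above the largest real ID so they cannot collide with it.
start = int(sales['customer_id'].max()) + 1
surrogate = pd.Series(np.arange(start, start + len(missing_invoices)),
                      index=missing_invoices.to_numpy())

# Keep real IDs; fill the gaps with the surrogate ID of the row's invoice.
sales['customer_id'] = (sales['customer_id']
                        .fillna(sales['invoice_no'].map(surrogate))
                        .astype(int))
print(sales)
```

Each such invoice is then treated as a separate one-time customer, which keeps its revenue in the analysis but may overstate the number of single-purchase customers.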

## 1.6. Change dtypes

In [None]:
# invoice_date
df1["invoice_date"] = pd.to_datetime(df1["invoice_date"],infer_datetime_format=True)

# customer_id
df1['customer_id'] = df1['customer_id'].astype(int)
df1.dtypes

## 1.7. Descriptive Statistics

In [None]:
num_attributes = df1.select_dtypes(include =['int64', 'float64'])
cat_attributes = df1.select_dtypes(exclude = ['int64', 'float64', 'datetime64[ns]'])

In [None]:
# central tendency - mean, median
ct1 = pd.DataFrame( num_attributes.apply( np.mean ) ).T
ct2 = pd.DataFrame( num_attributes.apply( np.median ) ).T

# dispersion - standard deviation, minimum, maximum, range, skew, kurtosis
d1 = pd.DataFrame( num_attributes.apply( np.std ) ).T
d2 = pd.DataFrame( num_attributes.apply( np.min ) ).T
d3 = pd.DataFrame( num_attributes.apply( np.max ) ).T
d4 = pd.DataFrame( num_attributes.apply( lambda x: x.max() - x.min() ) ).T
d5 = pd.DataFrame( num_attributes.apply( lambda x: x.skew() ) ).T
d6 = pd.DataFrame( num_attributes.apply( lambda x: x.kurtosis() ) ).T

# concatenate
m1 = pd.concat( [d2, d3, d4, ct1, ct2, d1, d5, d6] ).T.reset_index()
m1.columns = ['attributes', 'min', 'max', 'range', 'mean', 'median', 'std', 'skew', 'kurtosis']
m1

#### 1.7.1.1. Numerical Attributes - Investigating

1. Negative quantities: could they be returns?
2. Unit price = 0: could these be promotional items?


### 1.7.2. Categorical Attributes

#### Invoice number

In [None]:
# cat_attributes['invoice_no'].astype(int) fails: some invoice numbers contain letters.
df_letter_invoices = df1.loc[df1['invoice_no'].apply(lambda x: bool(re.search( '[^0-9]+', x ))), :]

print('Rows with letters in invoice_no: {}'.format(len(df_letter_invoices)))
print('Of which with negative quantity: {}'.format(len(df_letter_invoices[df_letter_invoices['quantity'] < 0])))

#### Stock Code

In [None]:
# check stock codes only characters
df1.loc[df1['stock_code'].apply( lambda x: bool( re.search( '^[a-zA-Z]+$', x ) ) ), 'stock_code'].unique()

# Action:
## 1. Remove stock_code in ['POST', 'D', 'M', 'PADS', 'DOT', 'CRUK']

#### Description

In [None]:
# Action: Delete description

#### Country

In [None]:
len(df1['country'].unique())

In [None]:
df1['country'].value_counts(normalize = True).head()

In [None]:
df1[['customer_id', 'country']].drop_duplicates().groupby('country').count().reset_index().sort_values('customer_id', ascending = False).head()

# 2.0. Data Filtering

In [None]:
df2 = df1.copy()

In [None]:
df2.dtypes

In [None]:
# === Numerical attributes ====
df2 = df2.loc[df2['unit_price'] >= 0.04, :]

# === Categorical attributes ====
df2 = df2[~df2['stock_code'].isin( ['POST', 'D', 'DOT', 'M', 'S', 'AMAZONFEE', 'm', 'DCGSSBOY', 'DCGSSGIRL', 'PADS', 'B', 'CRUK'] ) ]

# description
df2 = df2.drop( columns='description' )

# country
df2 = df2[~df2['country'].isin( ['European Community', 'Unspecified' ] ) ]

# bad users - outlier
df2 = df2[~df2['customer_id'].isin( [16446] )]

# quantity: split returns (negative) from purchases, masking on df2 itself
df2_returns = df2.loc[df2['quantity'] < 0, :].copy()
df2_purchase = df2.loc[df2['quantity'] >= 0, :].copy()

# 3.0. Feature Engineering

In [None]:
df3 = df2.copy()

In [None]:
# Feature Ideas:
## 1) Moving Average - 7d, 14d, 30d
## 2) Purchase quantity by month, before the 15th and after the 15th.
## 3) Average Financial

## 3.1. Feature Creation

In [None]:
# data reference
df_ref = df3.drop(['invoice_no', 'stock_code', 'quantity', 'invoice_date', 'unit_price', 'country'],
                   axis =1 ).drop_duplicates( ignore_index = True)

In [None]:
df_ref.shape

### 3.1.1. Gross Revenue

In [None]:
# Gross Revenue
df2_purchase.loc[:,'gross_revenue'] = df2_purchase.loc[:,'quantity'] * df2_purchase.loc[:,'unit_price']

# Monetary
df_monetary = df2_purchase.loc[:,['customer_id', 'gross_revenue']].groupby('customer_id').sum().reset_index()
df_ref = pd.merge(df_ref, df_monetary, on = 'customer_id', how = 'left')
df_ref.isna().sum()

### 3.1.2. Recency - Days since last purchase

In [None]:
# Recency - days since the last purchase
df_recency = df2_purchase.loc[:, ['customer_id', 'invoice_date']].groupby( 'customer_id' ).max().reset_index()
df_recency['recency_days'] = ( df2['invoice_date'].max() - df_recency['invoice_date'] ).dt.days
df_recency = df_recency[['customer_id', 'recency_days']].copy()
df_ref = pd.merge( df_ref, df_recency, on='customer_id', how='left' )
df_ref.isna().sum()

### 3.1.5. Quantity of products purchased

In [None]:
df_freq = (df2_purchase.loc[:,['customer_id', 'stock_code']].groupby('customer_id')
                                                        .count()
                                                        .reset_index()
                                                        .rename(columns={'stock_code': 'quantity_products'}))
                                                        
                                                        
df_ref = pd.merge( df_ref, df_freq, on = 'customer_id', how ='left')
df_ref.isna().sum()

### 3.1.8. Frequency Purchase

In [None]:
df_aux = ( df2_purchase[['customer_id', 'invoice_no', 'invoice_date']].drop_duplicates()
                                                             .groupby( 'customer_id')
                                                             .agg( max_ = ( 'invoice_date', 'max' ), 
                                                                   min_ = ( 'invoice_date', 'min' ),
                                                                   days_= ( 'invoice_date', lambda x: ( ( x.max() - x.min() ).days ) + 1 ),
                                                                   buy_ = ( 'invoice_no', 'count' ) ) ).reset_index()
# Frequency
df_aux['frequency'] = df_aux[['buy_', 'days_']].apply( lambda x: x['buy_'] / x['days_'] if  x['days_'] != 0 else 0, axis=1 )

# Merge
df_ref = pd.merge( df_ref, df_aux[['customer_id', 'frequency']], on='customer_id', how='left' )

df_ref.isna().sum()
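
To make the frequency definition concrete, the sketch below repeats the calculation on a toy table: the number of distinct invoices divided by the length, in days, of the window between the first and last purchase. All values and the intermediate column names are hypothetical.

```python
import pandas as pd

# Hypothetical purchase lines: one row per product line, several lines per invoice.
purchases = pd.DataFrame({
    'customer_id':  [1, 1, 1, 1, 2],
    'invoice_no':   ['A1', 'A1', 'A2', 'A3', 'B1'],
    'invoice_date': pd.to_datetime(['2017-01-01', '2017-01-01', '2017-01-05',
                                    '2017-01-10', '2017-03-01']),
})

# One row per invoice, so repeated product lines do not inflate the count.
invoices = purchases.drop_duplicates(subset=['customer_id', 'invoice_no'])

freq = invoices.groupby('customer_id').agg(
    n_invoices=('invoice_no', 'count'),
    first_buy=('invoice_date', 'min'),
    last_buy=('invoice_date', 'max'),
).reset_index()

# Active window in days, counting both end dates, so it is never zero.
freq['active_days'] = (freq['last_buy'] - freq['first_buy']).dt.days + 1
freq['frequency'] = freq['n_invoices'] / freq['active_days']

print(freq[['customer_id', 'n_invoices', 'active_days', 'frequency']])
```

With this definition a customer with a single purchase gets a frequency of 1.0, the same as someone who buys every day, which is worth keeping in mind when reading the cluster profiles.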

### 3.1.9. Number of Returns

In [None]:
df2_returns.head()

In [None]:
# Number of Returns
df_returns = df2_returns[['customer_id', 'quantity']].groupby( 'customer_id' ).sum().reset_index().rename( columns={'quantity':'quantity_returns'} )
df_returns['quantity_returns'] = df_returns['quantity_returns'] * -1

df_ref = pd.merge( df_ref, df_returns, how='left', on='customer_id' )
df_ref.loc[df_ref['quantity_returns'].isna(), 'quantity_returns'] = 0

df_ref.isna().sum()

# 4.0. Exploratory Data Analysis (EDA)

In [None]:
df4 = df_ref.dropna()
df4.isna().sum()

## 4.1. Univariate Analysis

**Notes.01**

1) What do we look for in a clustering problem?
- Cohesive, well-separated clusters.
- Variability
    - Metrics:
        - Min, max, range (dispersion).
        - Mean and median.
        - Standard deviation (std) and variance.
        - Coefficient of variation (CV) = std/mean

Note: The appropriate kind of cluster depends on the business problem.
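
The variability metrics listed in the note can be computed in a single pass with pandas. The sketch below uses a made-up two-column table; the column values are illustrative only.

```python
import pandas as pd

# Hypothetical feature table (toy values, not from the project dataset).
features = pd.DataFrame({
    'gross_revenue': [100.0, 200.0, 300.0, 400.0],
    'recency_days':  [10.0, 10.0, 10.0, 10.0],
})

# One row per feature, one column per statistic.
summary = features.agg(['min', 'max', 'mean', 'median', 'std', 'var']).T
summary['range'] = summary['max'] - summary['min']

# Coefficient of variation: dispersion relative to the mean (unit-free).
summary['cv'] = summary['std'] / summary['mean']
print(summary)
```

A feature with a CV near zero, like `recency_days` in this toy table, barely varies across customers and so contributes little to separating clusters.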

**Explore/Delete**

**1.** Gross Revenue - ok


In [None]:
#profile = ProfileReport(df4)
#profile.to_file('output_v2.html')

# to visualize: file:///Users/anaotavio/Documents/repos/insiders_clustering/output/output_v2.html

### 4.1.1. Gross Revenue/Quantity of Items

In [None]:
# outlier?
df4[df4['customer_id']==14646]

In [None]:
df3[df3['customer_id']==14646].sort_values('quantity', ascending=True).head()

### 4.1.2. Quantity of Products

In [None]:
# outlier?
df4[df4['quantity_products']==7838]

In [None]:
df3[df3['customer_id']==17841].sort_values('quantity', ascending=True).head()

### 4.1.4. Frequency

In [None]:
df4[df4['frequency']==17]

In [None]:
df3[df3['customer_id']==17850].sort_values('quantity', ascending=True).head()

## 4.2. Bivariate Analysis

In [None]:
cols = ['customer_id']
df42 = df4.drop(cols, axis=1)

In [None]:
df42.columns

## 4.3 Space Study

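1. PCA: linear projection onto the directions of greatest variance.
2. UMAP: nonlinear embedding built from a nearest-neighbor graph of the data.
3. t-SNE: nonlinear embedding that tries to preserve local neighborhoods.
4. Tree-based embedding: leaf indices of a random forest, reduced to 2D with UMAP.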

In [None]:
# original dataset
#df43 = df4.drop(columns = ['customer_id'], axis=1).copy()

# selected dataset
cols_selected = ['customer_id', 'gross_revenue', 'recency_days', 'quantity_products', 'frequency', 'quantity_returns']
df43 = df4[ cols_selected ].drop( columns='customer_id' )

In [None]:
df43.shape

In [None]:
# MinMax Scaler
# from sklearn import preprocessing as pp
mm = pp.MinMaxScaler()

df43['gross_revenue']     = mm.fit_transform( df43[['gross_revenue']] )
df43['recency_days']      = mm.fit_transform( df43[['recency_days']] )
df43['quantity_products'] = mm.fit_transform( df43[['quantity_products']])
df43['frequency']         = mm.fit_transform( df43[['frequency']])
df43['quantity_returns']  = mm.fit_transform( df43[['quantity_returns']])

X = df43.copy()
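
For reference, min-max scaling maps each value to `(x - min) / (max - min)`. The sketch below reproduces the formula with NumPy on made-up numbers. The helper name `min_max_scale` and the handling of constant columns are choices made for this example.

```python
import numpy as np

def min_max_scale(values):
    """Rescale a 1-D array to [0, 1] via (x - min) / (max - min)."""
    values = np.asarray(values, dtype=float)
    span = values.max() - values.min()
    if span == 0:
        # Constant column: nothing to spread out, so map everything to 0.
        return np.zeros_like(values)
    return (values - values.min()) / span

# Toy revenue values with one large outlier.
revenue = np.array([10.0, 20.0, 30.0, 1000.0])
scaled = min_max_scale(revenue)
print(scaled)
```

The outlier takes the value 1 and squeezes the other three points close to 0, which is why extreme customers can dominate distance-based methods after this kind of rescaling.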

### 4.3.1 PCA

In [None]:
#from sklearn import decomposition as dd

In [None]:
pca = dd.PCA( n_components=X.shape[1] )

principal_components = pca.fit_transform( X )

# plot explained variance ratio per component
features = range( pca.n_components_ )

plt.bar( features, pca.explained_variance_ratio_, color='black' )

# pca component
df_pca = pd.DataFrame( principal_components )

In [None]:
sns.scatterplot(x= 0, y=1, data=df_pca)

### 4.3.2. UMAP

In [None]:
#!pip install llvmlite==0.37.0rc2 --ignore-installed
#!pip install umap-learn
#import umap.umap_ as umap

# UMAP: 2D embedding of the high-dimensional feature space

reducer = umap.UMAP(random_state=42)
embedding = reducer.fit_transform(X)

#embedding
df_umap = pd.DataFrame()
df_umap['embedding_x'] = embedding[:, 0]
df_umap['embedding_y'] = embedding[:, 1]

# plot UMAP
sns.scatterplot(x='embedding_x',
                y='embedding_y',
                data= df_umap)

### 4.3.3. t-SNE

In [None]:
#from sklearn.manifold import TSNE
reducer = mn.TSNE(n_components=2,n_jobs=-1,random_state=42)
embedding = reducer.fit_transform(X)

#embedding
df_tsne = pd.DataFrame()
df_tsne['embedding_x'] = embedding[:, 0]
df_tsne['embedding_y'] = embedding[:, 1]

# plot t-SNE
sns.scatterplot(x='embedding_x',
                y='embedding_y',
                data= df_tsne)

### 4.3.4. Tree-Based Embedding

In [None]:
# training dataset
X = df43.drop( columns=['gross_revenue'] )
y = df43['gross_revenue']

# model definition
rf_model = en.RandomForestRegressor( n_estimators=100, random_state=42 )

# model training
rf_model.fit( X, y )

# Leaf 
df_leaf = pd.DataFrame( rf_model.apply( X ) )

In [None]:
# Reduce dimensionality
reducer = umap.UMAP( random_state=42 )
embedding = reducer.fit_transform( df_leaf )

# embedding
df_tree = pd.DataFrame()
df_tree['embedding_x'] = embedding[:, 0]
df_tree['embedding_y'] = embedding[:, 1]

# plot UMAP
sns.scatterplot( x='embedding_x', 
                 y='embedding_y', 
                 data=df_tree )

# 5.0. Data Preparation

In [None]:
# Tree-Based Embedding
df5 = df_tree.copy()
df5.to_csv('tree_based_embedding.csv')

# UMAP Embedding
#df5 = df_umap.copy()

# TSNE Embedding
#df5 = df_tsne.copy()

In [None]:
# Standard Scaler
#from sklearn import preprocessing as pp
#mm = pp.MinMaxScaler()
#ss = pp.StandardScaler()
#rs = pp.RobustScaler()

#df5['gross_revenue']          = mm.fit_transform(df5[['gross_revenue']])
#df5['recency_days']           = mm.fit_transform(df5[['recency_days']])
# df5['quantity_invoices']      = mm.fit_transform(df5[['quantity_invoices']])
# df5['quantity_items']         = mm.fit_transform(df5[['quantity_items']])
#df5['quantity_products']      = mm.fit_transform(df5[['quantity_products']])
# df5['avg_ticket']             = mm.fit_transform(df5[['avg_ticket']])
# df5['avg_recency_days']       = mm.fit_transform(df5[['avg_recency_days']])
#df5['frequency']              = mm.fit_transform(df5[['frequency']])
#df5['quantity_returns']       = mm.fit_transform(df5[['quantity_returns']])
# df5['avg_basket_size']        = mm.fit_transform(df5[['avg_basket_size']])
# df5['avg_unique_basket_size'] = mm.fit_transform(df5[['avg_unique_basket_size']])
    
#variable = 'avg_unique_basket_size'

In [None]:
# Data as is
#print('Min:{} - Max:{}'.format(df5_aux[variable].min(), df5_aux[variable].max()))
#sns.displot(df5_aux[variable]);

In [None]:
# Data Standardization/Rescale
#print('Min:{} - Max:{}'.format(df5[variable].min(), df5[variable].max()))
#sns.displot(df5[variable]);

In [None]:
# BoxPlot
#sns.boxplot(df5_aux[variable]);

# 6.0. Feature Selection

In [None]:
#cols_selected = ['customer_id', 'gross_revenue', 'recency_days', 'quantity_products', 'frequency', 'quantity_returns']
#df6 = df5[ cols_selected ].copy()
#df6 = df_tree.copy()

# 7.0. Hyperparameter Fine-Tuning

In [None]:
X = df5.copy()

In [None]:
X.head()

In [None]:
#clusters = [2, 3, 4, 5, 6, 7, 8, 9]
clusters = np.arange( 2, 25, 1)
clusters

## 7.1. K-Means

In [None]:
from sklearn import metrics as m

In [None]:
kmeans_list = []
for k in clusters:
    # model definition
    kmeans_model = c.KMeans( n_clusters=k, n_init=100, random_state=42 )

    # model training
    kmeans_model.fit( X )

    # model predict
    labels = kmeans_model.predict( X )

    # model performance
    sil = m.silhouette_score( X, labels, metric='euclidean' )
    kmeans_list.append( sil )

In [None]:
plt.plot(clusters, kmeans_list, linestyle='--', marker='o', color='b')
plt.xlabel('K');
plt.ylabel('Silhouette Score');
plt.title('Silhouette Score x K')
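
The silhouette score used above compares, for each point, the mean distance to its own cluster (`a`) with the mean distance to the nearest other cluster (`b`), as `(b - a) / max(a, b)`. The sketch below computes it by hand with NumPy on four toy points. The function `silhouette_score_manual` is written for this example and is not part of scikit-learn.

```python
import numpy as np

def silhouette_score_manual(X, labels):
    """Mean silhouette coefficient with Euclidean distance (O(n^2) memory)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all points.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue  # singleton cluster: its silhouette is taken as 0
        # a: mean distance to the other members of the point's own cluster.
        a = dist[i, same].sum() / (n_same - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(dist[i, labels == other].mean()
                for other in np.unique(labels) if other != labels[i])
        denom = max(a, b)
        scores[i] = (b - a) / denom if denom > 0 else 0.0
    return scores.mean()

# Two tight pairs of points far apart from each other.
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
good = silhouette_score_manual(X_toy, [0, 0, 1, 1])  # pairs kept together
bad = silhouette_score_manual(X_toy, [0, 1, 0, 1])   # pairs split apart
print(good, bad)
```

Values close to 1 mean points sit much nearer to their own cluster than to any other. Negative values, as in the second labeling, mean points are on average closer to a different cluster.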

## 7.2. GMM

In [None]:
gmm_list = []
for k in clusters:
    # model definition
    gmm_model = mx.GaussianMixture( n_components=k, n_init=10, random_state=42 )

    # model training
    gmm_model.fit( X )

    # model predict
    labels = gmm_model.predict( X )

    # model performance
    sil = m.silhouette_score( X, labels, metric='euclidean' )
    gmm_list.append( sil )

In [None]:
plt.plot(clusters, gmm_list, linestyle='--', marker='o', color='b')
plt.xlabel('K');
plt.ylabel('Silhouette Score');
plt.title('Silhouette Score x K')

## 7.3. Hierarchical Clustering

In [None]:
# from scipy.cluster import hierarchy as hc

# model definition and training
hc_model = hc.linkage(X, 'ward')

In [None]:
#hc.dendrogram(
    #hc_model,
    #leaf_rotation = 90,
    #leaf_font_size=8
#)

#plt.plot()

In [None]:
#hc.dendrogram(
    #hc_model,
    #truncate_mode='lastp',
    #p=12,
    #leaf_rotation = 90,
   # leaf_font_size=8,
    #show_contracted=True
#)

#plt.plot()

### 7.3.1. HClustering Silhouette Score

In [None]:
# model definition & training: the linkage does not depend on k, so build it once
hc_model = hc.linkage( X, 'ward' )

hc_list = []
for k in clusters:
    # model predict: cut the tree into at most k clusters
    labels = hc.fcluster( hc_model, k, criterion='maxclust' )

    # metrics
    sil = m.silhouette_score( X, labels, metric='euclidean' )
    hc_list.append( sil )

In [None]:
plt.plot(clusters, hc_list, linestyle='--', marker='o', color='b')

## 7.5. Results

In [None]:
# Results - Tree-Based Embedding (df5 = df_tree)

df_results = pd.DataFrame( 
    {'KMeans': kmeans_list, 
     'GMM': gmm_list, 
     'HC': hc_list}
).T

df_results.columns = clusters
df_results.style.highlight_max( color='lightgreen', axis=1 )

In [None]:
# Results - UMAP Embedding (re-run sections 5.0 to 7.3 with df5 = df_umap first)

df_results = pd.DataFrame( 
    {'KMeans': kmeans_list, 
     'GMM': gmm_list, 
     'HC': hc_list}
).T

df_results.columns = clusters
df_results.style.highlight_max( color='lightgreen', axis=1 )

In [None]:
# Results - t-SNE Embedding (re-run sections 5.0 to 7.3 with df5 = df_tsne first)

df_results = pd.DataFrame( 
    {'KMeans': kmeans_list, 
     'GMM': gmm_list, 
     'HC': hc_list}
).T

df_results.columns = clusters
df_results.style.highlight_max( color='lightgreen', axis=1 )

# 8.0. Model Training

## 8.1. K-Means - GMM

In [None]:
# model definition
k = 9
kmeans = c.KMeans( init='random', n_clusters=k, random_state=42 )

# model training
kmeans.fit( X )

# clustering
labels = kmeans.labels_

In [None]:
# model definition
#gmm_model = mx.GaussianMixture( n_components=k,n_init=300, random_state=42 )

# model training
#gmm_model.fit( X )

# model predict
#labels = gmm_model.predict( X )

## 8.2. Cluster Validation

In [None]:
# WSS (within-cluster sum of squares)
#print('WSS value: {}'.format( kmeans.inertia_))

# SS (Silhouette Score)
print('SS value: {}'.format (m.silhouette_score(X, labels, metric='euclidean')))

# 9.0. Cluster Analysis

In [None]:
X.head()

In [None]:
df9 = X.copy()
df9['cluster'] = labels

## 9.1. Visual Inspection

In [None]:
#from plotly import express as px
#fig = px.scatter_3d(df9, x='recency_days', y='invoice_no', z='gross_revenue', color='cluster')
#fig.show()

#visualizer = SilhouetteVisualizer(kmeans, colors='yellowbrick')
#visualizer.fit(X)
#visualizer.finalize()

sns.scatterplot( x='embedding_x', y='embedding_y', hue='cluster', data=df9, palette='deep')

## 9.2. 2d plot

In [None]:
#df_viz = df9.drop( columns = 'customer_id', axis=1)
#sns.pairplot(df_viz, hue='cluster')

## 9.3. UMAP 


In [None]:
#!pip install llvmlite==0.37.0rc2 --ignore-installed
#!pip install umap-learn
#import umap.umap_ as umap

# UMAP: cluster designed with high dimensionality

#reducer = umap.UMAP(n_neighbors=90, random_state=42)
#embedding = reducer.fit_transform(X)

#embedding
#df_viz['embedding_x'] = embedding[:, 0]
#df_viz['embedding_y'] = embedding[:, 1]

# plot UMAP
#sns.scatterplot(x='embedding_x',
                #y='embedding_y',
                #hue='cluster',
                #palette=sns.color_palette('hls',
                                          #n_colors=len(
                                              #df_viz['cluster'].unique())),
                #data=df_viz)

## 9.4. Cluster Profile

In [None]:
df92 = df4[cols_selected].copy()
df92['cluster'] = labels
df92.head()

In [None]:
# Number of customer
df_cluster = df92[['customer_id', 'cluster']].groupby('cluster').count().reset_index()
df_cluster['perc_customer'] = 100*(df_cluster['customer_id']/df_cluster['customer_id'].sum())

# Avg gross revenue
df_avg_gross_revenue = df92[['gross_revenue', 'cluster']].groupby('cluster').mean().reset_index()
df_cluster = pd.merge(df_cluster, df_avg_gross_revenue, how = 'inner', on ='cluster')

# Avg recency days
df_avg_recency_days = df92[['recency_days', 'cluster']].groupby('cluster').mean().reset_index()
df_cluster = pd.merge(df_cluster, df_avg_recency_days, how = 'inner', on ='cluster')

# Quantity Products
df_avg_quantity_products = df92[['quantity_products', 'cluster']].groupby('cluster').mean().reset_index()
df_cluster = pd.merge(df_cluster, df_avg_quantity_products, how = 'inner', on ='cluster')

# Frequency
df_avg_frequency = df92[['frequency', 'cluster']].groupby('cluster').mean().reset_index()
df_cluster = pd.merge(df_cluster, df_avg_frequency, how = 'inner', on ='cluster')

# Quantity Returns
df_avg_quantity_returns = df92[['quantity_returns', 'cluster']].groupby('cluster').mean().reset_index()
df_cluster = pd.merge(df_cluster, df_avg_quantity_returns, how = 'inner', on ='cluster')

df_cluster
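
The same kind of profile can also be built with a single named aggregation instead of a chain of merges. The sketch below shows the pattern on a made-up table with two clusters; the output column names are illustrative.

```python
import pandas as pd

# Hypothetical labelled customers (toy values, not from the project dataset).
customers = pd.DataFrame({
    'customer_id':   [1, 2, 3, 4],
    'gross_revenue': [100.0, 300.0, 50.0, 70.0],
    'recency_days':  [10, 20, 100, 200],
    'cluster':       [0, 0, 1, 1],
})

# One groupby, one output column per (source column, statistic) pair.
profile = customers.groupby('cluster').agg(
    n_customers=('customer_id', 'count'),
    avg_gross_revenue=('gross_revenue', 'mean'),
    avg_recency_days=('recency_days', 'mean'),
).reset_index()

profile['perc_customer'] = 100 * profile['n_customers'] / profile['n_customers'].sum()
print(profile)
```

Adding another feature to the profile then takes one extra line in the `agg` call.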

**Cluster 1: Insiders**

**Cluster 0: More Products**

**Cluster 5: Spend Money**

**Cluster 2: Even More Products**

**Cluster 4: Spend More Money**

**Cluster 8: Less Days**

**Cluster 3: Less 1k**

**Cluster 6: Stop Returners**

**Cluster 7: More Buy**


# 10.0. EDA

### 10.1. Hypothesis Mind Map

In [None]:
df10 = df92.copy()
df10.head()

In [None]:
Image('/Users/anaotavio/Documents/repos/insiders_clustering/img/mindmap.png')

### 10.2. Business Hypothesis

#### Purchase Hypotheses

**1.** Insiders customers use a credit card for 80% of their purchases.

This hypothesis cannot be tested: the dataset has no information about the payment type.

**2.** Insiders customers have an average ticket 10% above that of the More Products cluster.

This hypothesis cannot be tested: df92 has no average-ticket feature.

**3.** Insiders customers have a basket size above 5 products.

This hypothesis cannot be tested: df92 has no basket-size feature.

**4. Insiders cluster has a volume of product purchases above 10% of total purchases.**

**5. Insiders cluster has a volume of gross revenue above 10% of total purchases.**

**6. Insiders cluster has an average number of returns below the average of total customers.**

**7. The median revenue of the Insiders cluster is 10% above the overall median revenue.**

**8. The GMV of cluster Insiders is concentrated in the 3rd quartile.**

### 10.3. Final Hypothesis list

**4. Insiders cluster has a volume of product purchases above 10% of total purchases.**

**5. Insiders cluster has a volume of gross revenue above 10% of total purchases.**

**6. Insiders cluster has an average number of returns below the average of total customers.**

**7. The median revenue of the Insiders cluster is 10% above the overall median revenue.**

**8. The GMV of cluster Insiders is concentrated in the 3rd quartile.**

### 10.4. Hypotheses Validation

**H1:** Insiders cluster has a volume of product purchases above 10% of total purchases.

**True:** Insiders cluster has a product purchase volume of 38%

In [None]:
# sum of Insiders quantity products
df_sales_insiders = df10.loc[df10['cluster']==1, 'quantity_products'].sum()

# sum of total quantity products
df_sales_total = df10.loc[:, 'quantity_products'].sum()

print('% Sales Insiders: {:.2f}%'.format(100*df_sales_insiders / df_sales_total))

**H2:** Insiders cluster has a volume of gross revenue above 10% of total purchases.

**True:** Insiders cluster has a GMV volume of 41%.

In [None]:
# GMV: gross merchandise value (here, the sum of gross_revenue)
# sum of Insiders gross revenue
df_gmv_insiders = df10.loc[df10['cluster']==1, 'gross_revenue'].sum()

# sum of total gross revenue
df_gmv_total = df10.loc[:, 'gross_revenue'].sum()

print('% GMV: {:.2f}%'.format(100*df_gmv_insiders / df_gmv_total))
    

**H3:** Insider cluster has an average number of returns below the average of total customers.
    
**False:** Insider cluster has an average number of returns above the average of total customers.

In [None]:
# average returns Insiders
df_avg_return_insiders = df10.loc[df10['cluster']==1, 'quantity_returns'].mean()

# average total returns
df_average_return_all = df10['quantity_returns'].mean()

print('Avg Return Insiders:{} vs Avg Return All:{}'.format(np.round(df_avg_return_insiders, 0), np.round(df_average_return_all, 0) ))

**H4:** The median revenue of the Insiders cluster is 10% above the overall median revenue.

**True:** The Insiders' median revenue is 496% above the overall median.

In [None]:
# GMV Insiders median
df_median_gmv_insiders = df10.loc[df10['cluster']==1, 'gross_revenue'].median()

# GMV Total median
df_median_gmv_total = df10['gross_revenue'].median()

gmv_diff = (df_median_gmv_insiders - df_median_gmv_total)/df_median_gmv_total
print('Median Diff: {:.2f}%'.format(100*gmv_diff))

**H5:** The GMV of cluster Insiders is concentrated in the 3rd quartile.

**False:** The GMV of cluster Insiders is concentrated in the 1st quartile.


In [None]:
np.percentile(df10.loc[df10['cluster']==1, 'gross_revenue'], q=10)  # 10th percentile; q is on a 0-100 scale

In [None]:
np.percentile(df10.loc[df10['cluster']==1, 'gross_revenue'], q=90)  # 90th percentile; q is on a 0-100 scale
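
A note on scales: `np.percentile` expects `q` between 0 and 100, while `np.quantile` expects `q` between 0 and 1. The sketch below shows the difference on toy values from 1 to 100.

```python
import numpy as np

values = np.arange(1, 101)  # toy data: 1, 2, ..., 100

p10 = np.percentile(values, q=10)    # 10th percentile
q10 = np.quantile(values, q=0.10)    # same quantity on the 0-1 scale
tiny = np.percentile(values, q=0.1)  # 0.1th percentile, almost the minimum

print(p10, q10, tiny)
```

Passing a fraction such as 0.1 to `np.percentile` therefore returns a value very close to the minimum rather than the first decile.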

In [None]:
df_aux = df10.loc[(df10['cluster'] == 1) & (df10['gross_revenue'] < 50000 ), 'gross_revenue'];
sns.violinplot( x=df_aux )

### 10.5. Answers Framework

#### Business Questions

**1. Who are the customers eligible to join the program?**

In [None]:
insiders = df10.loc[df10['cluster']==1, 'customer_id']

**2. How many customers will be part of this group?**

In [None]:
insiders.size
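
The first deliverable in the Output section is a `client_id | is_insider` list. The sketch below shows one way to derive it from cluster labels, using a made-up table. `INSIDERS_CLUSTER` is a hypothetical constant, and the label assigned to the Insiders cluster may change whenever the model is retrained.

```python
import numpy as np
import pandas as pd

# Hypothetical labelled customers; cluster 1 plays the role of Insiders here.
labelled = pd.DataFrame({
    'customer_id': [10323, 32413, 12345],
    'cluster':     [1, 0, 1],
})

# Assumption: the Insiders label must be re-checked after every retraining.
INSIDERS_CLUSTER = 1

output = pd.DataFrame({
    'client_id':  labelled['customer_id'],
    'is_insider': np.where(labelled['cluster'] == INSIDERS_CLUSTER, 'yes', 'no'),
})
print(output.to_string(index=False))
```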

**3. What are the main characteristics of these customers?**

**Cluster Insiders**
- Number of customers: 495 (25.26% of customers)
- Average recency: 77 days
- Average quantity of products purchased: 499
- Average revenue: 8311.14 dollars
- Frequency of purchases: 0.37 purchases/day
- Average quantity of returns: 290

**4. What is the contribution percentage revenue from Insiders?**

In [None]:
# GMV: gross merchandise value (here, the sum of gross_revenue)
df_gmv_insiders = df10.loc[df10['cluster']==1, 'gross_revenue'].sum()
df_gmv_total = df10.loc[:, 'gross_revenue'].sum()

print('% GMV from Insiders: {:.2f}%'.format(100*df_gmv_insiders / df_gmv_total))

**5. What is the group's revenue expectation for the coming months?**

**6. What are the conditions for selecting customers to join Insiders?**

**7. What are the conditions for removing customers from Insiders?**

**8. What is the guarantee that the Insiders program is better than the rest of the base?**

**9. What actions can the marketing team take to increase revenue?**

# 11.0. Deploy to Production