# <font color= 'blue'> Project: High Value Customer Identification (Insiders)</font>


## Business Challenge

Design a customer loyalty program to increase purchase frequency.

## Business Planning (IOT)

### Input 
    1. Business Problem
        - Select the most valuable customers to join a loyalty program.
    
    2. Dataset
        - One year of e-commerce sales, with the following columns:
            - Invoice No: invoice number (a 6-digit integer uniquely assigned to each transaction)
            - Stock Code: product (item) code
            - Description: product (item) name
            - Quantity: quantity of each product (item) per transaction
            - Invoice Date: the day when each transaction was generated
            - Unit Price: product price per unit
            - Customer ID: customer number (a unique ID assigned to each customer)
            - Country: name of the country where each customer resides
    
### Output
    1. Indicate customers who will be part of a loyalty program called Insiders.
            - List: client_id | is_insider
                        10323 |   yes
                        32413 |   no
    2. A report answering the business questions:
        - Who are the customers eligible to join the program?
        - How many customers will be part of this group?
        - What are the main characteristics of these customers?
        - What is the contribution percentage revenue from Insiders?
        - What is the group's revenue expectation for the coming months?
        - What are the conditions for selecting customers to join Insiders?
        - What are the conditions for removing customers from Insiders?
        - What is the guarantee that the Insiders program is better than the rest of the base?
        - What actions can the marketing team take to increase revenue?
    
### Tasks
    1. Who are the customers eligible to join the program?
        - What does it mean to be eligible? What does "high-value customer" mean?
             - Revenue: high average ticket, large basket size, high LTV (Lifetime Value), low churn probability, high TVC forecast, high purchasing propensity.
             - Cost: lower return rate.
             - Purchase experience: high average evaluation rate.
     
    2. How many customers will be part of this group?
        - Total number of customers.
        - % Insiders group.
        
    3. What are the main characteristics of these customers?
        - Age
        - Location
        - Other characteristics.
        - Ticket, basket size, high LTV, churn probability, high TVC forecast, purchasing propensity.
        
    4. What is the contribution percentage revenue from Insiders?
         - Total revenue for the year.
         - Insiders group revenue.
         
    5. What is the group's revenue expectation for the coming months?
        - LTV of the Insiders group.
        - Cohort analysis.
        
    6. What are the conditions for selecting customers to join Insiders?
        - Define the periodicity of the review.
        - The customer must have characteristics similar to those of customers already in the group.
    
    7. What are the conditions for removing customers from Insiders?
        - Define the periodicity of the review.
        - The customer no longer has characteristics similar to those of customers in the group.
        
    8. What is the guarantee that the Insiders program is better than the rest of the base?
        - A/B test
        - A/B bayesian test
        - Hypothesis test
        
    9. What actions can the marketing team take to increase revenue?
        - Discount
        - Purchase preference
        - Purchase shipping
        - Company visit  
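
Task 8 above lists hypothesis tests as a way to check whether Insiders outperform the rest of the base. One way this could look is a permutation test on revenue per customer. The sketch below is only an illustration: the revenue figures are made up, and `permutation_test` is a hypothetical helper, not part of this project.

```python
import random

def mean(values):
    return sum(values) / len(values)

def permutation_test(treated, control, n_permutations=10_000, seed=42):
    """One-sided p-value for the hypothesis mean(treated) > mean(control)."""
    rng = random.Random(seed)
    observed = mean(treated) - mean(control)
    pooled = list(treated) + list(control)
    n_treated = len(treated)
    hits = 0
    for _ in range(n_permutations):
        # Reassign group labels at random and recompute the difference in means
        rng.shuffle(pooled)
        diff = mean(pooled[:n_treated]) - mean(pooled[n_treated:])
        if diff >= observed:
            hits += 1
    # Add-one smoothing avoids reporting an impossible p-value of exactly 0
    return observed, (hits + 1) / (n_permutations + 1)

# Made-up monthly revenue per customer, for illustration only
insiders = [120.0, 135.5, 150.0, 142.0, 160.0, 138.0]
others = [100.0, 110.0, 95.0, 105.0, 120.0, 98.0]

observed_diff, p_value = permutation_test(insiders, others)
print('Difference in means: {:.2f}, p-value: {:.4f}'.format(observed_diff, p_value))
```

A permutation test makes no normality assumption, which may matter here because revenue per customer is usually heavily skewed.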
        
## Benchmark Solutions

###  Desk Research
 - RFM model (recency, frequency, monetary): customers are scored on each dimension and sorted by the resulting RFM score.
 
     Recency: How recently a customer has made a purchase
     
     Frequency: How often a customer makes a purchase
     
     Monetary Value: How much money a customer spends on purchases
     

 - example project: https://guillaume-martin.github.io/rfm-segmentation-with-python.html
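
As a rough illustration of the RFM idea described above, the sketch below scores each customer from 1 to 4 per dimension using quartiles and sums the three scores. The data, the column names and the `quartile_score` helper are assumptions made for this example; the quartile-based scoring is one common variant, not necessarily the one used in the linked project.

```python
import pandas as pd

# Hypothetical per-customer table; values and column names are illustrative
rfm = pd.DataFrame({
    'customer_id':  [1, 2, 3, 4, 5, 6, 7, 8],
    'recency_days': [2, 40, 7, 120, 15, 300, 1, 60],
    'frequency':    [30, 4, 12, 1, 8, 1, 50, 3],
    'monetary':     [5000.0, 300.0, 1200.0, 40.0, 700.0, 25.0, 9000.0, 150.0],
})

def quartile_score(series, higher_is_better=True):
    """Score 1-4 by quartile of the rank; 4 is always the best quartile."""
    # Ranking first (ties broken by order) guarantees unique bin edges for qcut
    ranks = series.rank(method='first', ascending=higher_is_better)
    return pd.qcut(ranks, q=4, labels=[1, 2, 3, 4]).astype(int)

# Recency is the only dimension where a lower value is better
rfm['r_score'] = quartile_score(rfm['recency_days'], higher_is_better=False)
rfm['f_score'] = quartile_score(rfm['frequency'])
rfm['m_score'] = quartile_score(rfm['monetary'])
rfm['rfm_score'] = rfm[['r_score', 'f_score', 'm_score']].sum(axis=1)

rfm = rfm.sort_values('rfm_score', ascending=False)
print(rfm[['customer_id', 'r_score', 'f_score', 'm_score', 'rfm_score']])
```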

# 0.0. Imports

In [None]:
import re
import pandas as pd
import numpy as np
import seaborn as sns
import umap.umap_ as umap

from pandas_profiling import ProfileReport

from matplotlib import pyplot as plt
from IPython.display import Image, HTML
from datetime import datetime
from sklearn import cluster as c
from sklearn import metrics as m
from plotly import express as px
from yellowbrick.cluster import KElbowVisualizer, SilhouetteVisualizer

## 0.1. Helper Functions

In [None]:
def jupyter_settings():
    %matplotlib inline
    %pylab inline
    
    plt.style.use( 'bmh' )
    plt.rcParams['figure.figsize'] = [25, 12]
    plt.rcParams['font.size'] = 24
    
    display( HTML( '<style>.container { width:100% !important; }</style>') )
    pd.options.display.max_columns = None
    pd.options.display.max_rows = None
    pd.set_option( 'display.expand_frame_repr', False )
    
    sns.set()
    
jupyter_settings() 
    

## 0.2. Load dataset

In [None]:
# load data
df_raw = pd.read_csv('data/Ecommerce.csv')

# drop extra column
df_raw = df_raw.drop(columns=['Unnamed: 8'])

# 1.0. Data Description

In [None]:
df1 = df_raw.copy()

## 1.1. Rename columns

In [None]:
cols_new = ['invoice_no', 'stock_code', 'description', 'quantity', 'invoice_date',
       'unit_price', 'customer_id', 'country']
df1.columns = cols_new

## 1.2. Data dimensions

In [None]:
print('Number of Rows: {}'.format(df1.shape[0]))
print('Number of Columns: {}'.format(df1.shape[1]))

## 1.3. Data types

In [None]:
df1.dtypes

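In [None]:
# invoice_no and stock_code contain letters as well as digits (see 1.7.2),
# so casting them to int raises a ValueError; both are kept as strings.
# df1['invoice_no'] = df1['invoice_no'].astype(int)
# df1['stock_code'] = df1['stock_code'].astype(int)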

## 1.4. Check NA

In [None]:
df1.isna().sum()

## 1.5. Replace NA

In [None]:
df_missing = df1.loc[df1['customer_id'].isna(),:]
df_not_missing = df1.loc[~df1['customer_id'].isna(),:]

In [None]:
# create reference
df_backup = pd.DataFrame(df_missing['invoice_no'].drop_duplicates())
df_backup['customer_id'] = np.arange(19000, 19000+len(df_backup),1)

# merge original with reference dataframe
df1 = pd.merge(df1, df_backup, on = 'invoice_no', how= 'left')

# coalesce
df1['customer_id'] = df1['customer_id_x'].combine_first(df1['customer_id_y'])

# drop extra columns
df1 = df1.drop(columns=['customer_id_x','customer_id_y'], axis=1)
   
df1.head()    


In [None]:
# No NaN in 'customer_id'
df1.isna().sum()
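
The replacement strategy above gives every invoice without a customer its own synthetic ID instead of dropping the rows. The toy example below sketches the same idea in a more compact form, using `map` and `fillna` rather than a merge. The data is made up, and the starting value 19000 is taken from the cell above on the assumption that it lies above every real customer ID.

```python
import numpy as np
import pandas as pd

# Made-up rows: two invoices (B2, C3) have no customer attached
toy = pd.DataFrame({
    'invoice_no':  ['A1', 'A1', 'B2', 'C3', 'C3'],
    'customer_id': [17850.0, 17850.0, np.nan, np.nan, np.nan],
})

# One synthetic ID per invoice that lacks a customer
missing_invoices = toy.loc[toy['customer_id'].isna(), 'invoice_no'].drop_duplicates()
synthetic_ids = pd.Series(np.arange(19000, 19000 + len(missing_invoices)),
                          index=missing_invoices.values)

# Keep real IDs where present, otherwise look up the synthetic ID by invoice
toy['customer_id'] = (toy['customer_id']
                      .fillna(toy['invoice_no'].map(synthetic_ids))
                      .astype(int))
print(toy)
```

All rows of the same invoice receive the same synthetic ID, so invoice-level features remain consistent.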

## 1.6. Change dtypes

In [None]:
# invoice_date
df1["invoice_date"] = pd.to_datetime(df1["invoice_date"],infer_datetime_format=True)

# customer_id
df1['customer_id'] = df1['customer_id'].astype(int)
df1.dtypes

## 1.7. Descriptive Statistics

In [None]:
num_attributes = df1.select_dtypes(include =['int64', 'float64'])
cat_attributes = df1.select_dtypes(exclude = ['int64', 'float64', 'datetime64[ns]'])

In [None]:
# central tendency - mean, median
ct1 = pd.DataFrame( num_attributes.apply( np.mean ) ).T
ct2 = pd.DataFrame( num_attributes.apply( np.median ) ).T

# dispersion - standard deviation, minimum, maximum, range, skew, kurtosis
d1 = pd.DataFrame( num_attributes.apply( np.std ) ).T
d2 = pd.DataFrame( num_attributes.apply( np.min ) ).T
d3 = pd.DataFrame( num_attributes.apply( np.max ) ).T
d4 = pd.DataFrame( num_attributes.apply( lambda x: x.max() - x.min() ) ).T
d5 = pd.DataFrame( num_attributes.apply( lambda x: x.skew() ) ).T
d6 = pd.DataFrame( num_attributes.apply( lambda x: x.kurtosis() ) ).T

# concatenate (named df_stats so it does not shadow the sklearn metrics alias `m`)
df_stats = pd.concat( [d2, d3, d4, ct1, ct2, d1, d5, d6] ).T.reset_index()
df_stats.columns = ['attributes', 'min', 'max', 'range', 'mean', 'median', 'std', 'skew', 'kurtosis']
df_stats

#### 1.7.1.1. Numerical Attributes - Investigating

1. Negative quantities: could these be returns?
2. Unit price = 0: could these be promotions (sales)?
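
A quick way to investigate both questions is to count the suspicious rows and check whether negative quantities coincide with invoice numbers that contain letters. The sketch below uses made-up rows with the same column names as `df1`; the specific invoice numbers are illustrative only.

```python
import pandas as pd

# Made-up rows, for illustration only
toy = pd.DataFrame({
    'invoice_no': ['536365', 'C536379', '536414', '536544'],
    'quantity':   [6, -1, 56, 1],
    'unit_price': [2.55, 27.50, 0.00, 0.00],
})

is_negative = toy['quantity'] < 0
has_letter = toy['invoice_no'].str.contains(r'[^0-9]')

summary = {
    'negative_quantity': int(is_negative.sum()),
    'zero_unit_price': int((toy['unit_price'] == 0).sum()),
    # Rows that are negative AND sit on an invoice whose number contains a letter
    'negative_quantity_on_lettered_invoice': int((is_negative & has_letter).sum()),
}
print(summary)
```

If almost all negative quantities sit on lettered invoices, treating them as returns is a reasonable working hypothesis.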


### 1.7.2. Categorical Attributes

#### Invoice number

In [None]:
# cat_attributes['invoice_no'].astype(int) fails: some invoice numbers contain letters as well as digits.
df_letter_invoices = df1.loc[df1['invoice_no'].apply(lambda x: bool(re.search( '[^0-9]+', x ))), :]

print('Total number of rows on invoices with letters: {}'.format(len(df_letter_invoices)))
print('Total number of negative quantities: {}'.format(len(df_letter_invoices[df_letter_invoices['quantity']< 0])))

#### Stock Code

In [None]:
# check stock codes only characters
df1.loc[df1['stock_code'].apply( lambda x: bool( re.search( '^[a-zA-Z]+$', x ) ) ), 'stock_code'].unique()

# Action:
## 1. Remove stock_code in ['POST', 'D', 'M', 'PADS', 'DOT', 'CRUK']

#### Description

In [None]:
# Action: Delete description

#### Country

In [None]:
len(df1['country'].unique())

In [None]:
df1['country'].value_counts(normalize = True).head()

In [None]:
df1[['customer_id', 'country']].drop_duplicates().groupby('country').count().reset_index().sort_values('customer_id', ascending = False).head()

# 2.0. Data Filtering

In [None]:
df2 = df1.copy()

In [None]:
df2.dtypes

In [None]:
# === Numerical attributes ====
df2 = df2.loc[df2['unit_price'] >= 0.04, :]

# === Categorical attributes ====
df2 = df2[~df2['stock_code'].isin( ['POST', 'D', 'DOT', 'M', 'S', 'AMAZONFEE', 'm', 'DCGSSBOY', 'DCGSSGIRL', 'PADS', 'B', 'CRUK'] ) ]

# description
df2 = df2.drop( columns='description' )

# country
df2 = df2[~df2['country'].isin( ['European Community', 'Unspecified' ] ) ]

# bad users
#df2 = df2[~df2['customer_id'].isin( [16446] )]

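# quantity: split returns (negative quantities) from purchases.
# The mask is built from df2 itself, and .copy() avoids SettingWithCopyWarning later on.
df2_returns = df2.loc[df2['quantity'] < 0, :].copy()
df2_purchase = df2.loc[df2['quantity'] >= 0, :].copy()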

# 3.0. Feature Engineering

In [None]:
df3 = df2.copy()

In [None]:
# Feature Ideas:
## 1) Moving Average - 7d, 14d, 30d
## 2) Purchase quantity by month, before the 15th and after the 15th.
## 3) Average Financial

## 3.1. Feature Creation

In [None]:
# data reference
df_ref = df3.drop(['invoice_no', 'stock_code', 'quantity', 'invoice_date', 'unit_price', 'country'],
                   axis =1 ).drop_duplicates( ignore_index = True)

### 3.1.1. Gross Revenue

In [None]:
# Gross Revenue
df2_purchase.loc[:,'gross_revenue'] = df2_purchase.loc[:,'quantity'] * df2_purchase.loc[:,'unit_price']

# Monetary
df_monetary = df2_purchase.loc[:,['customer_id', 'gross_revenue']].groupby('customer_id').sum().reset_index()
df_ref = pd.merge(df_ref, df_monetary, on = 'customer_id', how = 'left')
df_ref.isna().sum()

### 3.1.2. Recency - Days since last purchase

In [None]:
# Recency - days since the last purchase
df_recency = df2_purchase.loc[:, ['customer_id', 'invoice_date']].groupby( 'customer_id' ).max().reset_index()
df_recency['recency_days'] = ( df2['invoice_date'].max() - df_recency['invoice_date'] ).dt.days
df_recency = df_recency[['customer_id', 'recency_days']].copy()
df_ref = pd.merge( df_ref, df_recency, on='customer_id', how='left' )
df_ref.isna().sum()

### 3.1.3. Quantity of purchases

In [None]:
# Number of invoices per customer - depends on how product returns are handled
df_freq =(df2_purchase.loc[:,['customer_id', 'invoice_no']].drop_duplicates()
                                                          .groupby('customer_id')
                                                          .count()
                                                          .reset_index()
                                                          .rename(columns={'invoice_no': 'quantity_invoices'}))
df_ref = pd.merge( df_ref, df_freq, on = 'customer_id', how ='left')
df_ref.isna().sum()

### 3.1.4. Quantity of items purchased

In [None]:
df_freq = (df2_purchase.loc[:,['customer_id', 'quantity']].groupby('customer_id')
                                                        .sum()
                                                        .reset_index()
                                                        .rename(columns={'quantity': 'quantity_items'}))
                                                        
                                                        
df_ref = pd.merge( df_ref, df_freq, on = 'customer_id', how ='left')
df_ref.isna().sum()

### 3.1.5. Quantity of products purchased

In [None]:
df_freq = (df2_purchase.loc[:,['customer_id', 'stock_code']].groupby('customer_id')
                                                        .count()
                                                        .reset_index()
                                                        .rename(columns={'stock_code': 'quantity_products'}))
                                                        
                                                        
df_ref = pd.merge( df_ref, df_freq, on = 'customer_id', how ='left')
df_ref.isna().sum()

### 3.1.6. Average Ticket Value

In [None]:
# Average Ticket
df_avg_ticket = df2_purchase[['customer_id', 'gross_revenue']].groupby('customer_id').mean().reset_index().rename(columns={'gross_revenue': 'avg_ticket'})
df_ref = pd.merge( df_ref, df_avg_ticket, on = 'customer_id', how = 'left')
df_ref.isna().sum()

### 3.1.7. Average Recency Days

In [None]:
# Average recency days: mean gap, in days, between consecutive purchase dates of the same customer
df_aux = df2[['customer_id', 'invoice_date']].drop_duplicates().sort_values(['customer_id', 'invoice_date'], ascending=[True, True])
df_aux['previous_customer_id'] = df_aux['customer_id'].shift() # customer on the previous row
df_aux['previous_date'] = df_aux['invoice_date'].shift() # invoice date on the previous row

df_aux['avg_recency_days'] = df_aux.apply(lambda x: (x['invoice_date'] - x['previous_date']).days if x['customer_id'] == x['previous_customer_id'] else np.nan, axis=1 )

df_aux = df_aux.drop( ['invoice_date', 'previous_customer_id', 'previous_date'], axis=1 ).dropna()

# average recency
df_avg_recency_days = df_aux.groupby( 'customer_id' ).mean().reset_index()

# merge
df_ref = pd.merge(df_ref, df_avg_recency_days, on='customer_id', how='left')
df_ref.isna().sum()
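
The gap between consecutive purchases can also be computed without shifting columns by hand. The sketch below is an alternative formulation on made-up data, using `groupby(...).diff()`, which restarts the difference for every customer. It is meant as an illustration, not as a replacement for the cell above.

```python
import pandas as pd

# Made-up purchase dates, for illustration only
toy = pd.DataFrame({
    'customer_id':  [1, 1, 1, 2, 2, 3],
    'invoice_date': pd.to_datetime(['2017-01-01', '2017-01-11', '2017-01-31',
                                    '2017-02-01', '2017-02-05', '2017-03-01']),
})

# Days between each purchase date and the previous one of the same customer
gaps = (toy.drop_duplicates()
           .sort_values(['customer_id', 'invoice_date'])
           .assign(gap_days=lambda d: d.groupby('customer_id')['invoice_date'].diff().dt.days))

# Customers with a single purchase date have no gap and drop out here
avg_gap = (gaps.dropna(subset=['gap_days'])
               .groupby('customer_id')['gap_days']
               .mean()
               .rename('avg_recency_days')
               .reset_index())
print(avg_gap)
```

Customers with only one purchase date get no value, which is why the left merge into `df_ref` produces missing values for this feature.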

### 3.1.8. Frequency Purchase

In [None]:
df_aux = ( df2_purchase[['customer_id', 'invoice_no', 'invoice_date']].drop_duplicates()
                                                             .groupby( 'customer_id')
                                                             .agg( max_ = ( 'invoice_date', 'max' ), 
                                                                   min_ = ( 'invoice_date', 'min' ),
                                                                   days_= ( 'invoice_date', lambda x: ( ( x.max() - x.min() ).days ) + 1 ),
                                                                   buy_ = ( 'invoice_no', 'count' ) ) ).reset_index()
# Frequency
df_aux['frequency'] = df_aux[['buy_', 'days_']].apply( lambda x: x['buy_'] / x['days_'] if  x['days_'] != 0 else 0, axis=1 )

# Merge
df_ref = pd.merge( df_ref, df_aux[['customer_id', 'frequency']], on='customer_id', how='left' )

df_ref.isna().sum()
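
The frequency feature above is the number of invoices divided by the number of days between the first and last purchase, plus one. A small worked example on made-up data shows how the formula behaves, including the edge case of a customer with a single purchase.

```python
import pandas as pd

# Made-up invoices, for illustration only
toy = pd.DataFrame({
    'customer_id':  [1, 1, 1, 2],
    'invoice_no':   ['A', 'B', 'C', 'D'],
    'invoice_date': pd.to_datetime(['2017-01-01', '2017-01-05',
                                    '2017-01-10', '2017-03-01']),
})

freq = (toy.groupby('customer_id')
           .agg(first_buy=('invoice_date', 'min'),
                last_buy=('invoice_date', 'max'),
                n_invoices=('invoice_no', 'nunique'))
           .reset_index())

# The +1 keeps the denominator positive when first and last purchase coincide
freq['active_days'] = (freq['last_buy'] - freq['first_buy']).dt.days + 1
freq['frequency'] = freq['n_invoices'] / freq['active_days']
print(freq[['customer_id', 'n_invoices', 'active_days', 'frequency']])
```

Customer 1 buys 3 times over 10 days (0.3 purchases per day), while customer 2, with a single purchase, gets a frequency of 1.0. Under this formula one-time buyers score highest, which is worth keeping in mind when interpreting the clusters.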

### 3.1.9. Number of Returns

In [None]:
df2_returns.head()

In [None]:
# Number of Returns
df_returns = df2_returns[['customer_id', 'quantity']].groupby( 'customer_id' ).sum().reset_index().rename( columns={'quantity':'quantity_returns'} )


In [None]:
df_returns['quantity_returns'] = df_returns['quantity_returns'] * -1

df_ref = pd.merge( df_ref, df_returns, how='left', on='customer_id' )
df_ref.loc[df_ref['quantity_returns'].isna(), 'quantity_returns'] = 0

df_ref.isna().sum()

### 3.2.0. Basket Size - Quantity items per basket

- Invoice No = purchase
- Stock Code = product
- Quantity = Item

In [None]:
df_aux = (df2_purchase.loc[:,['customer_id', 'invoice_no', 'quantity']].groupby('customer_id')
                                                           .agg(n_purchase=('invoice_no', 'nunique'),
                                                                n_products=('quantity', 'sum'))
                                                            .reset_index())
# calculation
df_aux['avg_basket_size'] = df_aux['n_products']/df_aux['n_purchase']

#merge
df_ref = pd.merge(df_ref, df_aux[['customer_id', 'avg_basket_size']], how='left', on='customer_id')
df_ref.isna().sum()

### 3.2.1. Unique Basket Size - Quantity of different products per purchase

- Invoice No = purchase
- Stock Code = product
- Quantity = Item

In [None]:
df_aux = (df2_purchase.loc[:,['customer_id', 'invoice_no', 'stock_code']].groupby('customer_id')
                                                           .agg(n_purchase=('invoice_no', 'nunique'),
                                                                n_products=('stock_code', 'count'))
                                                            .reset_index())
# calculation
df_aux['avg_unique_basket_size'] = df_aux['n_products']/df_aux['n_purchase']

# merge
df_ref = pd.merge(df_ref, df_aux[['customer_id', 'avg_unique_basket_size']], how='left', on='customer_id')
df_ref.isna().sum()
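
The two basket features differ only in what is counted per purchase: items (sum of quantity) or product lines (rows with a stock code). The toy example below contrasts them and adds a third variant, the number of distinct products per invoice, which differs from the row count whenever the same product appears twice on an invoice. The data and the names `avg_lines_per_purchase` and `avg_distinct_products` are assumptions made for this sketch.

```python
import pandas as pd

# One customer, two invoices; product X appears twice on invoice A
toy = pd.DataFrame({
    'customer_id': [1, 1, 1, 1],
    'invoice_no':  ['A', 'A', 'A', 'B'],
    'stock_code':  ['X', 'Y', 'X', 'X'],
    'quantity':    [2, 3, 1, 4],
})

per_customer = (toy.groupby('customer_id')
                   .agg(n_purchases=('invoice_no', 'nunique'),
                        n_items=('quantity', 'sum'),
                        n_lines=('stock_code', 'count'))
                   .reset_index())
per_customer['avg_basket_size'] = per_customer['n_items'] / per_customer['n_purchases']
per_customer['avg_lines_per_purchase'] = per_customer['n_lines'] / per_customer['n_purchases']

# Distinct products per invoice, then averaged per customer
avg_distinct = (toy.groupby(['customer_id', 'invoice_no'])['stock_code']
                   .nunique()
                   .groupby('customer_id')
                   .mean()
                   .rename('avg_distinct_products'))
print(per_customer)
print(avg_distinct)
```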

In [None]:
df_ref.head()

# 4.0. Exploratory Data Analysis (EDA)

In [None]:
df4 = df_ref.dropna()
df4.isna().sum()

## 4.1. Univariate Analysis

In [None]:
#from pandas_profiling import ProfileReport
profile = ProfileReport(df4)
profile.to_file('output.html')

# to visualize the report, open output.html

In [None]:
df4.sort_values('gross_revenue', ascending=False).drop_duplicates().head(10)

## 4.2. Bivariate Analysis

# 5.0. Data Preparation

In [None]:
df5 = df4.copy()

In [None]:
from sklearn import preprocessing as pp

In [None]:
ss = pp.StandardScaler()

df5['gross_revenue'] = ss.fit_transform(df5[['gross_revenue']])
df5['recency_days'] = ss.fit_transform(df5[['recency_days']])
# the per-customer invoice count is stored in 'quantity_invoices' (see 3.1.3); df5 has no 'invoice_no' column
df5['quantity_invoices'] = ss.fit_transform(df5[['quantity_invoices']])
df5['avg_ticket'] = ss.fit_transform(df5[['avg_ticket']])
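
For reference, standardization subtracts the mean and divides by the population standard deviation (`ddof=0`, which is also what `StandardScaler` uses). The sketch below applies it to made-up values with one extreme outlier, with and without a log transform first, to show how an outlier compresses the remaining points. The numbers are illustrative only.

```python
import numpy as np

# Made-up revenue values with one extreme outlier
values = np.array([10.0, 20.0, 30.0, 40.0, 1000.0])

# z-score with the population standard deviation (np.std defaults to ddof=0)
z = (values - values.mean()) / values.std()

# Same scaling applied after a log transform
log_values = np.log1p(values)
z_log = (log_values - log_values.mean()) / log_values.std()

print('spread of the four regular points, raw: {:.3f}'.format(z[3] - z[0]))
print('spread of the four regular points, log: {:.3f}'.format(z_log[3] - z_log[0]))
```

On the raw scale the four regular points end up almost on top of each other, so a distance-based method such as K-Means sees mostly the outlier. A log transform or a rescaling method that is less sensitive to outliers may be worth testing for the skewed features.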


In [None]:
df5.head()

# 6.0. Feature Selection

In [None]:
df6 = df5.copy()

In [None]:
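# plot the unscaled revenue (df4): the standardized values in df5 can be negative, and their log is undefined
sns.histplot(np.log(df4['gross_revenue']), kde=True)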

# 7.0. Hyperparameter Fine-Tuning

In [None]:
X = df6.drop(columns = ['customer_id'])

In [None]:
X.head()

In [None]:
clusters = [2, 3, 4, 5, 6, 7]

## 7.1. Within-Cluster Sum of Squares (WSS)

In [None]:
# WSS: Within-Cluster Sum of Squares -> compactness
wss = []
for k in clusters:
    # model definition
    kmeans = c.KMeans(init='random', n_clusters=k, n_init=10, max_iter=300, random_state=42)
    
    # model training
    kmeans.fit(X)
    
    # validation
    wss.append(kmeans.inertia_)
    

plt.subplot(2, 1, 1 )
# Plot wss - Elbow Method
plt.plot(clusters, wss, linestyle='--', marker = 'o', color='b')
plt.xlabel('K');
plt.ylabel('Within-Cluster Sum of Squares');
plt.title('WSS vs k')

plt.subplot(2, 1, 2 )
#from yellowbrick.cluster import KElbowVisualizer
kmeans = KElbowVisualizer( c.KMeans(), k=clusters, timings=False)
kmeans.fit( X)
kmeans.show()

plt.show()
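
To make the metric concrete, WSS can be computed by hand as the sum of squared Euclidean distances from every point to the centroid of its cluster. The function below is a minimal sketch with a hypothetical name; when the centroids are the cluster means it corresponds to what `KMeans.inertia_` reports.

```python
import numpy as np

def within_cluster_sum_of_squares(X, labels):
    """Sum of squared Euclidean distances from each point to its cluster centroid."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    total = 0.0
    for label in np.unique(labels):
        members = X[labels == label]
        centroid = members.mean(axis=0)
        total += ((members - centroid) ** 2).sum()
    return total

# Four made-up points forming two tight pairs
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])

wss_one = within_cluster_sum_of_squares(points, [0, 0, 0, 0])
wss_two = within_cluster_sum_of_squares(points, [0, 0, 1, 1])
print('k=1: {}, k=2: {}'.format(wss_one, wss_two))
```

WSS never increases as k grows, which is why the elbow method looks for the point where the decrease flattens rather than for the minimum.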

## 7.2. Silhouette Score

In [None]:
# Silhouette score: computed for each sample from the mean intra-cluster distance (a) and the mean nearest-cluster distance (b), as (b - a) / max(a, b)
# It measures separation as well as compactness
# Unlike WSS, it takes the distance to the nearest other cluster into account

kmeans = KElbowVisualizer( c.KMeans(), k=clusters, metric='silhouette', timings=False)
kmeans.fit( X)
kmeans.show()
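
The silhouette definition can be checked on a tiny example. The function below is a naive sketch for illustration (quadratic in the number of samples, hypothetical name, at least two clusters assumed); in the notebook itself `sklearn.metrics.silhouette_score` does this work.

```python
import numpy as np

def silhouette_samples_naive(X, labels):
    """Per-sample silhouette (b - a) / max(a, b) with Euclidean distances."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Full pairwise distance matrix
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        if same.sum() == 1:
            continue  # a point alone in its cluster is given a score of 0
        # a: mean distance to the other members of the same cluster
        a = dist[i, same].sum() / (same.sum() - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(dist[i, labels == other].mean()
                for other in np.unique(labels) if other != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores

# Four made-up points forming two tight, well separated pairs
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
scores = silhouette_samples_naive(points, np.array([0, 0, 1, 1]))
mean_score = scores.mean()
print('Mean silhouette: {:.3f}'.format(mean_score))
```

Values close to 1 indicate compact, well separated clusters; values near 0 indicate overlapping clusters, and negative values suggest points assigned to the wrong cluster.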


## 7.3. Silhouette Analysis

In [None]:
# from yellowbrick.cluster import KElbowVisualizer, SilhouetteVisualizer
fig, ax = plt.subplots(3,2, figsize=(25,18))
for k in clusters:
    km = c.KMeans( n_clusters=k, init='random', n_init=10, max_iter=100, random_state=42)
    q, mod = divmod(k,2)
    visualizer = SilhouetteVisualizer(km, colors='yellowbrick', ax=ax[q-1][mod])
    visualizer.fit(X)
    visualizer.finalize()

# 8.0. Model Training

## 8.1. K-Means

In [None]:
# model definition
k = 3
kmeans = c.KMeans(init='random', n_clusters=k, n_init=10, max_iter=300, random_state=42)

# model training
kmeans.fit(X)

# clustering
labels = kmeans.labels_

## 8.2. Cluster Validation

In [None]:
# WSS (Within- cluster sum of square)
print('WSS value: {}'.format( kmeans.inertia_))

from sklearn import metrics as m
# SS (Silhouette Score)
print('SS value: {}'.format (m.silhouette_score(X, labels, metric='euclidean')))

# 9.0. Cluster Analysis

In [None]:
df9 = df6.copy()
df9['cluster'] = labels
df9.head()

## 9.1. Visual Inspection

In [None]:
#from plotly import express as px
#fig = px.scatter_3d(df9, x='recency_days', y='invoice_no', z='gross_revenue', color='cluster')
#fig.show()

visualizer = SilhouetteVisualizer(kmeans, colors='yellowbrick')
visualizer.fit(X)
visualizer.finalize()

## 9.2. 2d plot

In [None]:
df_viz = df9.drop( columns = 'customer_id', axis=1)
sns.pairplot(df_viz, hue='cluster')

## 9.4. UMAP


In [None]:
#!pip install llvmlite==0.37.0rc2 --ignore-installed
#!pip install umap-learn
#import umap.umap_ as umap

# UMAP: projects the high-dimensional feature space onto 2D so the clusters can be inspected visually

reducer = umap.UMAP(n_neighbors=90, random_state=42)
embedding = reducer.fit_transform(X)

#embedding
df_viz['embedding_x'] = embedding[:, 0]
df_viz['embedding_y'] = embedding[:, 1]

# plot UMAP
sns.scatterplot(x='embedding_x',
                y='embedding_y',
                hue='cluster',
                palette=sns.color_palette('hls',
                                          n_colors=len(
                                              df_viz['cluster'].unique())),
                data=df_viz)

## 9.5. Cluster Profile

In [None]:
# Number of customer
df_cluster = df9[['customer_id', 'cluster']].groupby('cluster').count().reset_index()
df_cluster['perc_customer'] = 100*(df_cluster['customer_id']/df_cluster['customer_id'].sum())

# Avg gross revenue
df_avg_gross_revenue = df9[['gross_revenue', 'cluster']].groupby('cluster').mean().reset_index()
df_cluster = pd.merge(df_cluster, df_avg_gross_revenue, how = 'inner', on ='cluster')

# Avg recency days
df_avg_recency_days = df9[['recency_days', 'cluster']].groupby('cluster').mean().reset_index()
df_cluster = pd.merge(df_cluster, df_avg_recency_days, how = 'inner', on ='cluster')

# Avg number of invoices (stored in 'quantity_invoices'; df9 has no 'invoice_no' column)
df_avg_invoice_no = df9[['quantity_invoices', 'cluster']].groupby('cluster').mean().reset_index()
df_cluster = pd.merge(df_cluster, df_avg_invoice_no, how = 'inner', on ='cluster')

# Avg ticket
df_avg_ticket = df9[['avg_ticket', 'cluster']].groupby('cluster').mean().reset_index()
df_cluster = pd.merge(df_cluster, df_avg_ticket, how = 'inner', on ='cluster')

df_cluster

**Cluster 1: Insider Candidate**
- Number of customers: 6 (0.14% of customers)
- Average recency: 7 days
- Average number of purchases: 89
- Average revenue: $182,182.00
- Average ticket: $253.62

**Cluster 0:**
- Number of customers: 28 (0.64% of customers)
- Average recency: 6 days
- Average number of purchases: 57
- Average revenue: $42,614.38
- Average ticket: $162.86


**Cluster 3:**
- Number of customers: 269 (6.15% of customers)
- Average recency: 20 days
- Average number of purchases: 19
- Average revenue: $944.95
- Average ticket: $62.47


**Cluster 2:**
- Number of customers: 4069 (93.06% of customers)
- Average recency: 92 days
- Average number of purchases: 4
- Average revenue: $1,372.57
- Average ticket: $25.36
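
Once a cluster has been chosen as the Insiders group, the `client_id | is_insider` list described in the Output section and the revenue contribution can be derived directly from the cluster labels. The sketch below uses made-up data and assumes, for illustration only, that the insider cluster is the one with the highest average gross revenue; the actual selection rule is a business decision.

```python
import pandas as pd

# Made-up customers with cluster labels, for illustration only
toy = pd.DataFrame({
    'customer_id':   [10323, 32413, 20001, 20002],
    'gross_revenue': [182000.0, 900.0, 1500.0, 175000.0],
    'cluster':       [1, 2, 2, 1],
})

# Cluster profile in a single aggregation
profile = (toy.groupby('cluster')
              .agg(n_customers=('customer_id', 'count'),
                   avg_gross_revenue=('gross_revenue', 'mean'))
              .reset_index())
profile['perc_customer'] = 100 * profile['n_customers'] / profile['n_customers'].sum()

# Assumption: the insider cluster is the one with the highest average revenue
insider_cluster = profile.loc[profile['avg_gross_revenue'].idxmax(), 'cluster']
is_insider = toy['cluster'] == insider_cluster

# Output list: client_id | is_insider
output = pd.DataFrame({
    'client_id': toy['customer_id'],
    'is_insider': is_insider.map({True: 'yes', False: 'no'}),
})

# Share of total revenue that comes from the insider cluster
insider_revenue_share = 100 * toy.loc[is_insider, 'gross_revenue'].sum() / toy['gross_revenue'].sum()
print(output)
print('Insiders revenue share: {:.2f}%'.format(insider_revenue_share))
```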

# 10.0. Deploy to Production