## Project Title: Customer Segmentation
#### Business understanding: 
The online sales data comes from a retailer and is available for analysis at the [UCI website](https://archive.ics.uci.edu/dataset/352/online+retail). The business expects to use the RFM methodology, data mining techniques, and machine learning algorithms to derive meaningful customer segmentation, better understand customer purchase behavior, and identify the characteristics of customers in each segment. 

#### Business Goal 
The business goal is to organize customers into groups of similar customers, better understand the customers in each cluster, and identify the customers at risk. This gives the business a customer-centric approach for targeting individual customers. 

#### Data Understanding: 
The sales data is downloaded from the [UCI website](https://archive.ics.uci.edu/dataset/352/online+retail). It is a transactional data set that contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based, registered non-store online retailer. The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers.

The original data attributes are:

|Feature Name | Description                                | Feature Type  |
|-------------|--------------------------------------------|---------------|
|InvoiceNo    | A unique number for a transaction          | Categorical   |
|StockCode    | Product number for an Item                 | Categorical   |                          
|Description  | Product Name                               | Categorical   |
|Quantity     | Product quantity in each transaction       | Integer       |
|InvoiceDate  | Day & Time when transaction was generated  | Date          |
|UnitPrice    | Price of unit product                      | Continuous    |
|CustomerID   | ID of customer                             | Categorical   |
|Country      | Country where customer resides             | Categorical   |

The data attributes added as part of Analysis & Feature Engineering are:

|Feature Name 	  | Description                                                               | Feature Type  |
|-----------------|---------------------------------------------------------------------------|---------------|
| Recency      	  | Define how recently the customer made a purchase                          | Integer       |    
| Frequency    	  | Define how often customers make purchases                                 | Integer       |
| Monetary     	  | Define the amount the customer has spent                                  | Float         |    
| Recency Score	  | Quantile-based discretization on a scale of 1-5 based on Recency value    | Integer       |
| Frequency Score | Quantile-based discretization on a scale of 1-5 based on Frequency value  | Integer       |
| Monetary Score  | Quantile-based discretization on a scale of 1-5 based on Monetary value   | Integer       |
| RFM Segment     | Concatenation of the Recency, Frequency, and Monetary scores (e.g. 555)   | Object        |
| Customer Type   | Customer category derived from the RFM Segment (e.g. CHAMPIONS, AT_RISK)  | Object        |



### Table of contents
***

1. [Import Libraries](#10-import-libraries)

2. [Data Preparation & Feature Engineering](#20-data-preperation-and-feature-engineering) 
    - 2.1 -  [Read Data](#21-upload-the-dataset)
    - 2.2 -  [Evaluate Feature Discrepancy & Missing Values](#22-evaluate-discrepancy-in-feature-values)
    - 2.3 -  [Add New Features](#23-introduce-additional-features-using-rfm-technique)
    - 2.4 -  [Trim Dataset](#24-trim-dataset-with-random-sampling)
    - 2.5 -  [Encode & Scale Data](#25-encode-and-scale-the-dataset)
    - 2.6 -  [Correlation Matrix - Feature Analysis](#26-correaltion-matrix--feature-analysis)
    - 2.7 -  [Outlier Analysis](#27-outlier-analysis---box-plot)
    - 2.8 -  [Outlier Treatment](#28-outlier-treatment)
    - 2.9 -  [Model Dataset](#29-create-target-dataset-for-model)
    - 2.10 - [Pair Plot - Analysis Of RFM Features](#210-pair-plot-to-analyse-rfm-features)

3. [Unsupervised Learning Algorithm ](#30-execute-unsupervised-ml-alogrithm)
    - 3.1 - [Kmeans Algorithm](#31-kmeans-algorithm)
        - 3.1.1 - [Elbow Method - Identify Optimum Clusters](#311-elbow-method---identify-the-optimum-number-of-customer-clusters)
        - 3.1.2 - [Elbow Method Analysis](#312-elbow-method-analysis)
        - 3.1.3 - [Execute & Plot Kmeans](#313-execute-and-plot-kmeans-alogrithm)
        - 3.1.4 - [Kmeans Cluster Analysis](#314-kmeans-cluster-analysis)
        - 3.1.5 - [RFM Score per Cluster](#315-rfm-scores-per-cluster)
    - 3.2 - [Silhouette Score](#32-silhouette-score--analysis)
        - 3.2.1 - [Silhouette Plot](#321-generate-silhouette-score-and-plot-the-feature-dependency)
        - 3.2.2 - [Silhouette Cluster Analysis](#322-analysis-of-silhouette-cofficient-and-clusters)
    - 3.3 - [DBSCAN Algorithm](#33-dbscan-alogorithm)
        - 3.3.1 - [Nearest Neighbor](#331-nearestneighbor---find-optimal-eps-parameter-for-dbscan)
        - 3.3.2 - [DBSCAN Execution ](#332---execute-dbscan)
        - 3.3.3 - [DBSCAN Result Plot](#333-plot-dbscan-result)
        - 3.3.4 - [DBSCAN cluster Analysis](#334-analyze-dbscan-result)
4. [Customer Cluster Analysis](#40-customer-cluster-analysis)
    - 4.1 - [RFM Score per Cluster](#41-rfm-score-per-cluster)
    - 4.2 - [Customer Count per Cluster](#42-customer-count-per-cluster)
    - 4.3 - [Peak Purchase Time](#43-peak-purchase-time)
    - 4.4 - [Top Selling Months per Cluster](#44-cluster-wise-top-selling-months)
    - 4.5 - [Top Selling Item per Cluster](#45-top-selling-items-per-cluster-and-overall)
    - 4.6 - [Customer Segment TreeMap](#46-tree-map---customer-segement-based-on-rfm-score)
        - 4.6.1 - [Dashboard - Customer Score Base ](#461-rfm-score-based-segment-tree-map)
        - 4.6.2 - [Feature Creation- Customer Type based on RFM](#462-create-customer-type-column-based-on-rfm-score) 
        - 4.6.3 - [Dashboard - Customer Segment based on RFM](#463-tree-map-for-customer-segement)
5. [Conclusion](#50-conclusion)

### 1.0 Import Libraries

In [42]:
# type: ignore
## Import the required libraries for the project

import warnings

import numpy as np
import pandas as pd
import seaborn
import seaborn as sns
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
from category_encoders import OneHotEncoder, TargetEncoder
from IPython.display import display, HTML
from mpl_toolkits.mplot3d import Axes3D
from sklearn.cluster import KMeans
from sklearn.compose import ColumnTransformer
from sklearn.metrics import silhouette_samples, silhouette_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
# StandardScaler lives in sklearn.preprocessing (not sklearn.discriminant_analysis)
from sklearn.preprocessing import MinMaxScaler, StandardScaler
%matplotlib inline

warnings.filterwarnings('ignore')

### 2.0 Data preperation and Feature engineering

##### 2.1 Upload the dataset 

In [None]:
## Upload the dataset 
df = pd.read_csv('OnlineRetail.csv')

# Executing df info to get the information on dataframe
df.info()


##### 2.2 Evaluate discrepancy in feature values

In [None]:
## Get the rows having null values and evaluating the column values. 
#df.isnull().sum()

## Evaluate any discrepancy in data with Quantity, UnitPrice, Country and Invoice date column 

for column in df.columns:
    if column == 'Quantity':
        print('Quantity with zero values:', df.query('Quantity == 0')[['Quantity']].count())
    if column == 'InvoiceDate':
        df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])
        print('Min TimeStamp: ', df[['InvoiceDate']].min())
        print('Max TimeStamp: ', df[['InvoiceDate']].max())
    if column == 'UnitPrice':
        print('Rows have zero unit price:',df.query('UnitPrice == 0')[['UnitPrice']].count())

## Based on the analysis, drop the rows without a CustomerID and the rows with zero unit price. Decisions on the other columns are taken at a later stage.

## Drop rows with null values (this removes the rows without a CustomerID)
df.dropna(inplace=True) 

## Drop rows with UnitPrice == 0
df.drop(df.query('UnitPrice == 0').index, inplace=True, errors='ignore')

df.isna().sum()
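
The checks in this section can also be collected into one small helper. The sketch below is only an illustration on a hypothetical toy frame: the function name `quality_summary` and the toy values are assumptions, not part of the notebook.

```python
import pandas as pd


def quality_summary(frame: pd.DataFrame) -> pd.Series:
    """Count the data-quality issues examined in this section (hypothetical helper)."""
    return pd.Series({
        "null_customer_id": int(frame["CustomerID"].isna().sum()),
        "zero_quantity": int((frame["Quantity"] == 0).sum()),
        "zero_unit_price": int((frame["UnitPrice"] == 0).sum()),
    })


# Toy data standing in for the retail dataset (values are made up)
toy = pd.DataFrame({
    "CustomerID": [1.0, None, 3.0, 4.0],
    "Quantity": [2, 1, 0, 5],
    "UnitPrice": [1.5, 0.0, 2.0, 0.0],
})

summary = quality_summary(toy)

# Same cleaning policy as the cell above: drop missing customers, then zero prices
cleaned = toy.dropna(subset=["CustomerID"])
cleaned = cleaned[cleaned["UnitPrice"] != 0]
print(summary)
```

Running the summary before and after cleaning makes it easy to confirm that each filter removed what it was meant to remove.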

##### 2.3 Introduce additional features using RFM technique

In [4]:
## Calculate additional features for the dataframe

## There are invoices with returns, so create a TransactionType column to identify such invoices.
## Returns are marked with TransactionType 'Return' 
df['TransactionType'] = np.where(df['InvoiceNo'].str.startswith('C'), 'Return','Transaction')  

## Introduce the Recency, Frequency & Monetary columns to evaluate the purchase behaviour of the customer.
# Recency — How recently did the customer purchase?
# Frequency — How often do they purchase?
# Monetary Value — How much do they spend?

# Calculate the amount and create a column by multiplying quantity purchased by unit price  
df['Amount'] = df['Quantity'] * df['UnitPrice'] 

## Separate the date and time to identify the highest & least selling days, as well as peak selling hours

# Create date and time column in format specified using strftime function
df['Date'] = df['InvoiceDate'].dt.strftime("%Y%m%d")
df['Time'] = df['InvoiceDate'].dt.strftime("%H")

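# Calculate Recency: days between each transaction and the latest transaction in the dataset
max_date = df['InvoiceDate'].max()
df['Recency'] = (max_date - df['InvoiceDate']).dt.days

# Calculate customer shopping Frequency (number of transaction rows per customer)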
df_cust = df.groupby('CustomerID')['Date'].count().reset_index()
df_cust.columns = ['CustomerID', 'Frequency']
df = df.merge(df_cust, on='CustomerID')

# Calculate customer Monetary
df_monetary = df.groupby('CustomerID')['Amount'].sum().reset_index() 
df_monetary.columns = ['CustomerID', 'Monetary']
df = df.merge(df_monetary, on='CustomerID',copy=True)

# Calculate the scores
for scores in ['Recency', 'Frequency','Monetary']:
    if scores == 'Recency':
        df[f"{scores}_score"] = pd.qcut(df[scores], 5, labels=list(range(5,0,-1)))
    else:
        df[f"{scores}_score"] = pd.qcut(df[scores], 5, labels=list(range(1,6,1)))
    
    df[f"{scores}_score"] = df[f"{scores}_score"].astype(int)
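
The cell above computes Recency per transaction row and Frequency as a row count. A common alternative is to aggregate RFM once per customer. The sketch below shows that variant on a hypothetical toy frame; it is not the notebook's method, and counting distinct invoices for Frequency is an assumption made for illustration.

```python
import pandas as pd

# Toy transactions (values are made up)
toy = pd.DataFrame({
    "CustomerID": [1, 1, 2, 2, 2, 3],
    "InvoiceNo": ["A1", "A2", "B1", "B1", "B2", "C1"],
    "InvoiceDate": pd.to_datetime([
        "2011-12-01", "2011-12-09", "2011-10-01",
        "2011-10-01", "2011-11-09", "2011-01-09",
    ]),
    "Amount": [10.0, 5.0, 20.0, 2.5, 7.5, 100.0],
})

# Reference date: the latest transaction in the data
snapshot = toy["InvoiceDate"].max()

# One row per customer
rfm = toy.groupby("CustomerID").agg(
    Recency=("InvoiceDate", lambda d: (snapshot - d.max()).days),  # days since last purchase
    Frequency=("InvoiceNo", "nunique"),                            # distinct invoices
    Monetary=("Amount", "sum"),                                    # total spend
)
print(rfm)
```

With this layout each customer appears exactly once, so quantile scores and cluster sizes describe customers rather than transaction rows.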

##### 2.4 Trim dataset with random sampling

In [None]:
## Use train_test_split to draw a 25% random sample of the dataset

_,X = train_test_split(df,test_size=0.25,random_state=42)
X.reset_index(inplace=True)
X_copy = X.copy()
X.drop(['Description', 'InvoiceNo', 'StockCode', 'InvoiceDate','CustomerID','index',],axis=1,inplace=True,errors='ignore')

# Define a dummy target y (the target encoder in the next step requires one)
y = X['Recency']

# Describe the dataset 
X.describe()
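
Since only a random subset is needed here, `DataFrame.sample` is a possible alternative to `train_test_split`. The sketch below uses a hypothetical toy frame and is not guaranteed to select the same rows as the cell above.

```python
import pandas as pd

# Toy frame standing in for the full dataset
toy = pd.DataFrame({"CustomerID": range(100), "Amount": range(100)})

# Draw a reproducible 25% sample and renumber the rows from zero
sample = toy.sample(frac=0.25, random_state=42).reset_index(drop=True)
print(sample.shape)
```

Passing `drop=True` to `reset_index` avoids creating the extra `index` column that the cell above has to drop afterwards.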

##### 2.5 Encode and Scale the dataset

In [None]:
## Define pipeline to encode and scale the dataset 

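# Set the column names in the order the ColumnTransformer emits them:
# the encoded columns come first, followed by the passthrough columns
encoded_cols = ["TransactionType", "Country"]
col = encoded_cols + [c for c in X.columns if c not in encoded_cols]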

# Column transformer to Encode the data 
pre_process = ColumnTransformer([
                                ("target_encoder", TargetEncoder(), ["TransactionType", "Country"])]
                                , remainder="passthrough")

## Pipeline JOB to execute the data encoder and scale the data. 
data = Pipeline([
                   ('preprocess', pre_process), 
                   ('scaler', StandardScaler()),
        ])

# Run the pipeline with original dataset
scaled_df = data.fit_transform(X,y)
scaled_df = pd.DataFrame(scaled_df, columns=col)

# Print pipeline step 
data

##### 2.6 Correaltion matrix & feature analysis

In [None]:
## Generate the correlation matrix 

# Call the correlation function
correlation_matrix = scaled_df.corr()
# Set the graph size
plt.figure(figsize=(10,8))  # Adjust the figure size if needed
# Execute the heatmap function
seaborn.heatmap(correlation_matrix, annot=True, cmap="coolwarm", fmt=".2f", linewidths=0.5)
#Set the map title
plt.title("Correlation Matrix Heatmap")
# Draw the graph
plt.show()

##### 2.7 Outlier Analysis - Box plot

In [None]:
## Outlier Analysis of the new features  
fig = go.Figure()

# Draw the Box plot for outlier analysis 
fig.add_trace(go.Box(x=scaled_df['Recency'],   name='Recency Plot'))
fig.add_trace(go.Box(x=scaled_df['Frequency'], name='Frequency Plot'))
fig.add_trace(go.Box(x=scaled_df['Monetary'],  name='Monetary Plot'))
# Update the title 
fig.update_layout(title_text="Box plot for Monetary, Frequency, Recency")
# Draw the plot
fig.show()

##### 2.8 Outlier Treatment

In [None]:
## Outlier treatment of new features 

# calculate the first quartile
Q1 = scaled_df.quantile(0.25)

# calculate the third quartile
Q3 = scaled_df.quantile(0.75)

# The Interquartile Range (IQR) is defined as the difference between the third and first quartile
# calculate IQR for each numeric variable
IQR = Q3 - Q1

# retrieve the dataframe without the outliers
# '~' returns the values that do not satisfy the given conditions 
# i.e. it returns values between the range [Q1-1.5*IQR, Q3+1.5*IQR]
# '|' is used as 'OR' operator on multiple conditions   
# 'any(axis=1)' checks the entire row for at least one 'True' entry (those rows represent outliers in the data)
scaled_df = scaled_df[~((scaled_df < (Q1 - 1.5 * IQR)) | (scaled_df > (Q3 + 1.5 * IQR))).any(axis=1)]

# check the shape of the data
scaled_df.shape
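
The same IQR rule can be wrapped in a function and checked on a small example. The sketch below uses a hypothetical helper name, `drop_iqr_outliers`, and made-up values.

```python
import pandas as pd


def drop_iqr_outliers(frame: pd.DataFrame, factor: float = 1.5) -> pd.DataFrame:
    """Keep rows whose values all lie within [Q1 - factor*IQR, Q3 + factor*IQR]."""
    q1 = frame.quantile(0.25)
    q3 = frame.quantile(0.75)
    iqr = q3 - q1
    is_outlier = (frame < q1 - factor * iqr) | (frame > q3 + factor * iqr)
    # A row is dropped if any of its columns is flagged
    return frame[~is_outlier.any(axis=1)]


# Toy data: the last Monetary value is an obvious outlier
toy = pd.DataFrame({
    "Recency": [1, 2, 3, 4, 5, 6, 7, 8],
    "Monetary": [10, 11, 12, 13, 14, 15, 16, 500],
})
kept = drop_iqr_outliers(toy)
print(kept.shape)
```

Because a row is removed when any single column is out of range, adding more columns to the frame generally removes more rows.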

##### 2.9 Create Target Dataset for Model 

In [None]:
## Generate the target dataset for Model 

# Select the data based on index from scaled dataframe 
X = X.loc[scaled_df.index]
# Align X_copy with the same rows; it will be used at a later stage for analysis
X_copy = X_copy.loc[scaled_df.index]

# Generate the target dataset with the RFM features; this dataset is primarily used for the models   
Target_Data = scaled_df[['Recency', 'Frequency','Monetary']]
Target_Data.head()

##### 2.10 Pair plot to analyse RFM features 

In [None]:
## Pair plot to visualize the trend and dependency between RFM features 

# Run pairplot with scaled dataset 
pair = sns.pairplot(scaled_df[['Recency', 'Frequency','Monetary']], diag_kws={'color':'orange'}, plot_kws={'color':'blue'})
pair.fig.suptitle("Pair plot for RFM features", y=1.08)
plt.show()


### 3.0 Execute Unsupervised ML alogrithm

#### 3.1 Kmeans Algorithm

##### 3.1.1 Elbow Method - Identify the optimum number of customer clusters

In [None]:
## Use Elbow method to Identify the optimum number of customer clusters 

# Initialize inertia 
inertia = []

# Execute KMeans algorithm for 1-10 clusters, and capture the inertia from every execution 
for i in range(1, 11, 1):
    kmeans = KMeans(n_clusters=i, random_state=42, n_init="auto").fit(Target_Data)
    inertia.append(kmeans.inertia_)

# Plot the elbow graph
plt.plot(range(1, 11), inertia, marker='o')
# Mark the chosen number of clusters (k = 5) with a vertical dashed line
plt.axvline(x=5, color='tab:orange', linestyle='--')
plt.title('Elbow Method for Optimal k')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia')
plt.show()

##### 3.1.2 Elbow method analysis

The plot of inertia against the number of clusters shows bends at k = 2 and k = 5. We pick 5 clusters to distinguish the customer groups more clearly, and we also evaluate the customer clusters with other techniques. After removing the outliers and scaling the data, the data appears to be more homogeneous.
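
Reading the elbow visually can be complemented with a simple numeric heuristic. The sketch below picks the k where the inertia curve bends most sharply (largest second difference). This heuristic and the synthetic data are assumptions for illustration, not the method used in this notebook.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: three well-separated groups of 50 points each
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# Inertia for k = 1..6
ks = list(range(1, 7))
inertias = [
    KMeans(n_clusters=k, n_init=10, random_state=42).fit(points).inertia_
    for k in ks
]

# second_diff[i] = inertia[i] - 2*inertia[i+1] + inertia[i+2], centred on ks[i+1]
second_diff = np.diff(inertias, n=2)
elbow_k = ks[int(np.argmax(second_diff)) + 1]
print(elbow_k)
```

On real data the curve is usually smoother than on this toy example, so the heuristic is best treated as a cross-check on the plot rather than a replacement for it.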

##### 3.1.3 Execute and plot KMeans alogrithm

In [None]:
## Execute Kmeans Algorithm

# Initialize the algorithm and set the number of clusters to 5 based on the selection above.
model = KMeans(init='k-means++', n_clusters=5, random_state=42, n_init="auto")

# Fit the algorithm on the target dataset once and reuse the fitted labels
kmean = model.fit(Target_Data)
cluster_lib = kmean.labels_

# Add the cluster segment to X & X_copy for analysing clusters at a later stage
X['Cluster'] = cluster_lib
X_copy['Cluster'] = cluster_lib

# Plot 3D scatter graph for RFM features  
# Set size
fig = plt.figure(figsize=(10,8))
# set the data
X1 = Target_Data

# select X,Y, & Z columns
xs = X1.iloc[:, 0]
ys = X1.iloc[:, 1]
zs = X1.iloc[:, 2]

# Add a single set of 3D axes to the figure
ax = fig.add_subplot(projection='3d')
ax.scatter3D(xs, ys, zs,c=kmean.labels_) 

# Define label
ax.set_xlabel('Recency')
ax.set_ylabel('Frequency')
ax.set_zlabel('Monetary')

# Plot graph
plt.show()

##### 3.1.4 Kmeans Cluster Analysis 

The elbow method suggested five clusters for the customers, and the K-means plot shows this segregation clearly. The groups are formed by how recently the customers purchased, how much they spent, and how frequently they purchased.

In [None]:
## Pairwise scatter plots of the RFM features, colored by K-means cluster
fig = px.scatter(scaled_df, y="Recency", x="Frequency", color=kmean.labels_, height=400, width=600)
fig.show()

fig = px.scatter(scaled_df, x="Recency", y="Monetary", color=kmean.labels_, height=400, width=600)
fig.show()

fig = px.scatter(scaled_df, x="Monetary", y="Frequency", color=kmean.labels_, height=400, width=600)
fig.show()



##### 3.1.5 RFM Scores per cluster  

In [None]:
fig = px.histogram(X_copy, x="Recency_score", color="Cluster", height=300, width=500)
fig.show()
fig = px.histogram(X_copy, x="Monetary_score", color="Cluster", height=300, width=500)
fig.show()
fig = px.histogram(X_copy, x="Frequency_score", color="Cluster", height=300, width=500)
fig.show()


#### 3.2 Silhouette Score & Analysis

##### 3.2.1 Generate silhouette score and plot the feature dependency

In [None]:
range_n_clusters = [2, 3, 4, 5, 6]

for n_clusters in range_n_clusters:
    # Create a subplot with 1 row and 3 columns
    fig, (ax1, ax2, ax3) = plt.subplots(1, 3)
    fig.set_size_inches(18, 7)

    # The 1st subplot is the silhouette plot
    # The silhouette coefficient can range from -1, 1 but in this example all
    # lie within [-0.1, 1]
    ax1.set_xlim([-0.1, 1])
    # The (n_clusters+1)*10 is for inserting blank space between silhouette
    # plots of individual clusters, to demarcate them clearly.
    ax1.set_ylim([0, len(Target_Data) + (n_clusters + 1) * 10])

    # Initialize the clusterer with n_clusters value and a random generator
    # seed of 10 for reproducibility.
    clusterer = KMeans(n_clusters=n_clusters, random_state=10)
    cluster_labels = clusterer.fit_predict(Target_Data)

    # The silhouette_score gives the average value for all the samples.
    # This gives a perspective into the density and separation of the formed
    # clusters
    silhouette_avg = silhouette_score(Target_Data, cluster_labels)
    print(
        "For n_clusters =",
        n_clusters,
        "The average silhouette_score is :",
        silhouette_avg,
    )

    # Compute the silhouette scores for each sample
    sample_silhouette_values = silhouette_samples(Target_Data, cluster_labels)

    y_lower = 10
    for i in range(n_clusters):
        # Aggregate the silhouette scores for samples belonging to
        # cluster i, and sort them
        ith_cluster_silhouette_values = sample_silhouette_values[cluster_labels == i]

        ith_cluster_silhouette_values.sort()

        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i

        color = cm.nipy_spectral(float(i) / n_clusters)
        ax1.fill_betweenx(
            np.arange(y_lower, y_upper),
            0,
            ith_cluster_silhouette_values,
            facecolor=color,
            edgecolor=color,
            alpha=0.7,
        )

        # Label the silhouette plots with their cluster numbers at the middle
        ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))

        # Compute the new y_lower for next plot
        y_lower = y_upper + 10  # 10 for the 0 samples

    ax1.set_title("The silhouette plot for the various clusters.")
    ax1.set_xlabel("The silhouette coefficient values")
    ax1.set_ylabel("Cluster label")

    # The vertical line for average silhouette score of all the values
    ax1.axvline(x=silhouette_avg, color="red", linestyle="--")

    ax1.set_yticks([])  # Clear the yaxis labels / ticks
    ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])

    # 2nd Plot showing the actual clusters formed
    colors = cm.nipy_spectral(cluster_labels.astype(float) / n_clusters)
    ax2.scatter(
        Target_Data.iloc[:, 0], Target_Data.iloc[:, 1], marker=".", s=30, lw=0, alpha=0.7, c=colors, edgecolor="k"
    )

    # Labeling the clusters
    centers = clusterer.cluster_centers_
    # Draw white circles at cluster centers
    ax2.scatter(
        centers[:, 0],
        centers[:, 1],
        marker="o",
        c="white",
        alpha=1,
        s=200,
        edgecolor="k",
    )

    for i, c in enumerate(centers):
        ax2.scatter(c[0], c[1], marker="$%d$" % i, alpha=1, s=50, edgecolor="k")

    ax2.set_title("The visualization of the clustered data.")
    ax2.set_xlabel("Feature space for the Recency")
    ax2.set_ylabel("Feature space for the Frequency")

    plt.suptitle(
        "Silhouette analysis for KMeans clustering on sample data with n_clusters = %d"
        % n_clusters,
        fontsize=14,
        fontweight="bold",
    )

    # 3rd Plot showing the actual clusters formed
    colors = cm.nipy_spectral(cluster_labels.astype(float) / n_clusters)
    ax3.scatter(
        Target_Data.iloc[:, 1], Target_Data.iloc[:, 2], marker=".", s=30, lw=0, alpha=0.7, c=colors, edgecolor="k"
    )

    # Labeling the clusters
    centers = clusterer.cluster_centers_
    # Draw white circles at cluster centers
    ax3.scatter(
        centers[:, 1],
        centers[:, 2],
        marker="o",
        c="white",
        alpha=1,
        s=200,
        edgecolor="k",
    )

    for i, c in enumerate(centers):
        ax3.scatter(c[1], c[2], marker="$%d$" % i, alpha=1, s=50, edgecolor="k")

    ax3.set_title("The visualization of the clustered data.")
    ax3.set_xlabel("Feature space for the Frequency")
    ax3.set_ylabel("Feature space for the Monetary")


plt.show()

##### 3.2.2 Analysis of Silhouette coefficient and clusters

The silhouette coefficient decreases as the number of clusters increases. However, outliers start to appear from the fifth cluster onward, so 5 clusters appears to be a reasonable choice.
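
The average silhouette score can also be compared across candidate values of k in a few lines. The sketch below uses synthetic data with three well-separated groups, so its outcome says nothing about the retail data; it only illustrates the comparison.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data: three well-separated groups of 50 points each
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# Average silhouette score for each candidate k
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=10).fit_predict(points)
    scores[k] = silhouette_score(points, labels)

# The k with the highest average score
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

The highest average score is one input among several. As in the analysis above, cluster sizes and interpretability can justify a different k.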

#### 3.3 DBSCAN Alogorithm

##### 3.3.1 NearestNeighbor - Find optimal EPS parameter for DBScan

In [None]:
## Select the optimal value of eps parameter for DBSCAN by executing Nearest neighbor algorithm. 
# 
# Import the nearest neighbor from Sklearn 
from sklearn.neighbors import NearestNeighbors # type: ignore

# Initialize and fit the algorithm
near_n = NearestNeighbors().fit(scaled_df)

# Capture the distance and Index
distances, indices = near_n.kneighbors(scaled_df)

#Plot the distance and index. 
distances = np.sort(distances, axis=0)
distances = distances[:,1]
plt.figure(figsize=(6,4))
plt.plot(distances)
plt.title("Nearest neighbor - distance/ indices")

##### 3.3.2 - Execute DBSCAN 

In [None]:
# type: ignore
# Import required libraries 
import numpy as np
from sklearn import metrics
from sklearn.cluster import DBSCAN 

# Execute DBSCAN algorithm
db = DBSCAN(eps=0.118, min_samples=10).fit(Target_Data)
labels = db.labels_

# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0) 
n_noise_ = list(labels).count(-1) 

# Print estimated numbers of clusters
print("Estimated number of clusters: %d" % n_clusters_)
print("Estimated number of noise points: %d" % n_noise_)
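
When choosing `eps`, it helps to compute the neighbor distances on the same feature matrix that DBSCAN is fitted on, with the number of neighbors tied to `min_samples`. The sketch below shows this pairing on synthetic data. The `eps=1.0` value fits only this toy example and is not a recommendation for the retail data.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

# Synthetic data: three tight, well-separated groups of 50 points each
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
points = np.vstack([c + rng.normal(scale=0.3, size=(50, 2)) for c in centers])

min_samples = 10

# kneighbors(points) on the fitted data returns each point as its own nearest
# neighbor (distance 0), which matches DBSCAN counting the point itself
nn = NearestNeighbors(n_neighbors=min_samples).fit(points)
distances, _ = nn.kneighbors(points)

# Sorted distance to the k-th neighbor; the "knee" of this curve suggests eps
k_distances = np.sort(distances[:, -1])

# Fit DBSCAN on the same matrix used for the distance curve
labels = DBSCAN(eps=1.0, min_samples=min_samples).fit(points).labels_
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(n_clusters, k_distances[-1])
```

Plotting `k_distances` gives the same kind of curve as in section 3.3.1, computed on the feature matrix that the clustering step uses.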

##### 3.3.3 Plot DBSCAN result

In [None]:
unique_labels = set(labels)
core_samples_mask = np.zeros_like(labels, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True

colors = [plt.cm.Spectral(each) for each in np.linspace(0, 1, len(unique_labels))]
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = [0, 0, 0, 1]

    class_member_mask = labels == k

    xy = Target_Data[class_member_mask & core_samples_mask]
    plt.plot(
        xy.iloc[:, 1],
        xy.iloc[:, 2],
        "o",
        markerfacecolor=tuple(col),
        markeredgecolor="k",
#        markersize=14,
    )

    xy = Target_Data[class_member_mask & ~core_samples_mask]
    plt.plot(
        xy.iloc[:, 1],
        xy.iloc[:, 2],
        "o",
        markerfacecolor=tuple(col),
        markeredgecolor="k",
#        markersize=6,
    )

plt.title(f"Estimated number of clusters: {n_clusters_}")
plt.show()

##### 3.3.4 Analyze DBSCAN result

With the eps parameter chosen from the nearest-neighbor distance plot, DBSCAN yields 5 customer clusters, which is consistent with the K-means and silhouette analyses.

### 4.0 Customer Cluster Analysis

#### 4.1 RFM score per Cluster

In [None]:
# Plot the sum of each RFM score per customer cluster
score_cols = ['Recency_score', 'Frequency_score', 'Monetary_score']
fig, axes = plt.subplots(1, 3)
fig.set_size_inches(20,6)

for ax, col in zip(axes, score_cols):
    # Sum the score within each cluster
    grp = X.groupby('Cluster')[col].sum()
    # Draw the bar chart and set the axis labels
    ax.bar(grp.index.values, grp.values)
    ax.set_title(f"The {col} for customer clusters.")
    ax.set_xlabel("Cluster Number")
    ax.set_ylabel(f"Sum of {col}")
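
Summed scores grow with cluster size, so a large cluster can dominate the chart even when its members score low. The sketch below contrasts the sum with the mean per cluster on hypothetical toy values; using the mean is a suggestion, not what the cell above plots.

```python
import pandas as pd

# Toy data: cluster 0 is large with low scores, cluster 1 is small with high scores
toy = pd.DataFrame({
    "Cluster": [0] * 6 + [1] * 2,
    "Recency_score": [2] * 6 + [5, 5],
    "Frequency_score": [3] * 6 + [4, 2],
    "Monetary_score": [2] * 6 + [5, 5],
})
score_cols = ["Recency_score", "Frequency_score", "Monetary_score"]

totals = toy.groupby("Cluster")[score_cols].sum()   # depends on cluster size
means = toy.groupby("Cluster")[score_cols].mean()   # comparable across clusters
print(totals)
print(means)
```

In this toy example the sum ranks cluster 0 higher on recency while the mean ranks cluster 1 higher, which shows how the two views can disagree.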

#### 4.2 Customer count per cluster

In [None]:
# Count the records per cluster (count() tallies transaction rows; nunique() would give distinct customers)
grp = pd.DataFrame(X_copy.groupby('Cluster')['CustomerID'].count())
grp['Clusters']  = grp.index.values
grp.columns = ['Aggregation', 'Clusters']

fig = px.bar(grp, x='Clusters', y='Aggregation',height=400, width=600, text_auto=True, title='Count of customers per cluster')
fig.show()
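
Because the data holds one row per transaction line, `count()` and `nunique()` answer different questions. The sketch below shows the difference on a hypothetical toy frame with made-up customer IDs.

```python
import pandas as pd

# Toy data: customers appear on several transaction rows
toy = pd.DataFrame({
    "Cluster": [0, 0, 0, 1, 1],
    "CustomerID": [17850, 17850, 13047, 12583, 12583],
})

rows_per_cluster = toy.groupby("Cluster")["CustomerID"].count()         # transaction rows
customers_per_cluster = toy.groupby("Cluster")["CustomerID"].nunique()  # distinct customers
print(rows_per_cluster)
print(customers_per_cluster)
```

Which of the two belongs on the chart depends on whether the business question is about activity volume or about the number of distinct customers.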

#### 4.3 Peak purchase time

In [None]:
## Identify peak selling time for Retailer grouped by cluster
fig = px.histogram(X_copy, x="Time", color="Cluster", height=400, width=600, title='Purchase time by cluster').update_xaxes(categoryorder='total descending')
fig.show()

#### 4.4 Cluster wise Top selling Months

In [None]:
# type: ignore
## Plot top selling month 
from pandas import to_datetime # type: ignore
from plotly.subplots import make_subplots
import plotly.graph_objects as go

# Create the date dataframe with feature as Cluster, Date & Quantity
# (copy to avoid modifying a view of X_copy)
df_date = X_copy[['Cluster', 'Date', 'Quantity']].copy()

# Find the Month from date column 
df_date['Month'] = to_datetime(df_date['Date'], format='%Y%m%d').dt.strftime('%B')

# Create subplot
fig = make_subplots(rows=3, cols=2,
                    subplot_titles=("Customer Cluster 0","Customer Cluster 1", "Customer Cluster 2",
                                    "Customer Cluster 3","Customer Cluster 4", "All Clusters"))

# Initialize the cluster variable 
clus = 0

# Loop to generate the line plots in a 3x2 grid. 
for i in range(1, 4):
    for j in range(1, 3):
        if i == 3 and j == 2:
            # Last panel: all clusters combined
            df_date1 = df_date.groupby(['Month'])[['Quantity']].sum().sort_values(by='Quantity')
        else:
            df_date1 = df_date.query(f"Cluster == {clus}").groupby(['Month'])[['Quantity']].sum().sort_values(by='Quantity')
            clus = clus + 1
        # go.Scatter replaces the deprecated go.Line
        fig.add_trace(go.Scatter(x=df_date1.index.values, y=df_date1['Quantity']), row=i, col=j)

fig.update_layout(height=800, width=1000, title_text="Cluster Wise Top Selling Month")
fig.show()
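
The plots above order months by quantity sold. If a calendar-ordered view is also wanted, the month names can be stored as an ordered categorical. The sketch below shows this on hypothetical toy values; it is an optional variation, not part of the notebook.

```python
import pandas as pd

# Toy data with dates in the same YYYYMMDD string format as the Date column
toy = pd.DataFrame({
    "Date": ["20110315", "20110120", "20111201", "20110105"],
    "Quantity": [5, 3, 7, 2],
})

# Month names in calendar order, January to December
month_order = [pd.Timestamp(2011, m, 1).month_name() for m in range(1, 13)]

months = pd.to_datetime(toy["Date"], format="%Y%m%d").dt.month_name()
toy["Month"] = pd.Categorical(months, categories=month_order, ordered=True)

# Grouping on an ordered categorical sorts the result by calendar position
monthly = toy.groupby("Month", observed=True)["Quantity"].sum()
print(monthly)
```

Sorting the same result with `sort_values()` recovers the ranking view used in the plots above.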


#### 4.5 Top Selling items per Cluster and Overall

In [None]:
def display_side_by_side(*args):
    html_str = ''
    for df in args:
        html_str += '<div style="display:inline-block; margin-right: 25px;">' 
        html_str += df.to_html()
        html_str += '</div>'
    display(HTML(html_str))

def top_items(frame, caption):
    # Ten best-selling items by total quantity, returned as a captioned Styler
    top = frame.groupby('Description')['Quantity'].sum().sort_values(ascending=False).head(10)
    return top.to_frame().style.set_caption(caption)

# One table per cluster, followed by the overall table
tables = [top_items(X_copy.query(f"Cluster == {cls}"), f"Segment{cls}") for cls in range(5)]
tables.append(top_items(X_copy, "OverAll"))

print('Top Selling Item in each Segment & over all')
for left, right in zip(tables[0::2], tables[1::2]):
    display_side_by_side(left, right)

In [None]:
## Plot the top-selling StockCodes per cluster and overall

# Set the plot frame
fig, [(ax1, ax2), (ax3, ax4), (ax5, ax6)] = plt.subplots(3, 2)
fig.set_size_inches(40,20)

# Function to get the aggregated quantity per StockCode for the given customer cluster number(s)
def conv_df(cls):
    gls = X_copy.query("Cluster == "f"{cls}")[['StockCode','Quantity']].groupby('StockCode')['Quantity'].sum().sort_values(ascending=False).head(20)
    gls = gls.to_frame()
    gls['sc'] = gls.index.values
    gls.columns = ['sc', 'scode']
    return gls

# Loop to plot the top-selling StockCodes for each customer segment
segment_axes = [ax1, ax2, ax3, ax4, ax5]
segment_colors = ['b', 'g', 'c', 'y', 'k']
for cluster, (ax, color) in enumerate(zip(segment_axes, segment_colors)):
    dfm = conv_df(cluster)
    ax.set_title(f"Top sold Items across Customer Segment-{cluster}")
    ax.set_xlabel("Product Number")
    ax.set_ylabel("Product sold count")
    ax.bar(dfm['scode'], dfm['sc'], color=color)

# Last panel: top-selling StockCodes across all customer segments
# (query() treats '== [list]' as a membership test)
dfm = conv_df(list(range(0,6,1)))
ax6.set_title("Top sold Items across all customer segments")
ax6.set_xlabel("Product Number")
ax6.set_ylabel("Product sold count")
ax6.bar(dfm['scode'], dfm['sc'], color='m')

    

#### 4.6 Tree Map - Customer Segement Based On RFM Score

##### 4.6.1 RFM Score based Segment Tree-map

In [None]:
## Generate score-based segment TreeMap

X_copy['RFM_Segment'] = X_copy['Recency_score'].astype(str) + X_copy['Frequency_score'].astype(str) + X_copy['Monetary_score'].astype(str)
segment_counts = X_copy['RFM_Segment'].value_counts().reset_index()
segment_counts.columns = ['RFM_Segment', 'Count']

# Plot the segment MAP
fig = px.treemap(segment_counts, path=['RFM_Segment'], values='Count', title='RFM Segmentation for Customer(Customer ID)')
fig.show()

##### 4.6.2 Create Customer Type Column Based On RFM Score

In [32]:
## Define the customer type based on the RFM score. The overall RFM score is the concatenation of the individual R, F, and M scores as strings.
# The reference for this mapping is taken from blogs/the internet; however, defining customer types from RFM scores is a business decision
# and can vary between organizations.

# Define customers on RFM score
#  
CHAMPIONS = ['555','554','544','545','454','455','445']
LOYAL = ['543','444','435','355','354','345','344','335']
POTENTIAL_LOYALIST = ['553','551','552','541','542','533','532','531','452','451','442','441','431','453','433','432','423','353','352','351','342','341','333','323']
RECENT_CUSTOMERS = ['512','511','422','421','412','411','311']
PROMISING = ['525','524','523','522','521','515','514','513','425','424','413','414','415','315','314','313']
NEED_ATTENTION = ['535','534','443','434','343','334','325','324']
ABOUT_TO_SLEEP = ['331','321','312','221','213','231','241','251']
AT_RISK = ['255','254','245','244','253','252','243','242','235','234','225','224','153','152','145','143','142','135','134','133','125','124']
CANNOT_LOSE = ['155','154','144','214','215','115','114','113']
HIBERNATING = ['332','322','231','241','251','233','232','223','222','132','123','122','212','211']
LOST = ['111','112','121','131','141','151']


# Add new column CustType based on individual score of customer 
X_copy['CustType'] = X_copy['RFM_Segment'].apply(
    lambda i: 'CHAMPIONS' if i in CHAMPIONS else
              'LOYAL' if i in LOYAL else
              'POTENTIAL_LOYALIST' if i in POTENTIAL_LOYALIST else
              'RECENT_CUSTOMERS' if i in RECENT_CUSTOMERS else
              'PROMISING' if i in PROMISING else
              'NEED_ATTENTION' if i in NEED_ATTENTION else
              'ABOUT_TO_SLEEP' if i in ABOUT_TO_SLEEP else
              'AT_RISK' if i in AT_RISK else
              'CANNOT_LOSE' if i in CANNOT_LOSE else
              'HIBERNATING' if i in HIBERNATING else
              'LOST' if i in LOST else
              'ERR'
)
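
Hand-written code lists like these are easy to get subtly wrong, for example through a missing comma or a code listed under two segments. The sketch below shows one way to validate such a mapping and turn it into a lookup. The helper names and the three-segment subset are hypothetical and deliberately contain a duplicate.

```python
from collections import Counter

# Illustrative subset of a segment mapping ('231' is duplicated on purpose)
SEGMENTS = {
    "CHAMPIONS": ["555", "554", "544"],
    "ABOUT_TO_SLEEP": ["331", "231"],
    "HIBERNATING": ["332", "231"],
}


def validate(segments):
    """Return (malformed codes, codes listed under more than one segment)."""
    all_codes = [code for codes in segments.values() for code in codes]
    malformed = [
        code for code in all_codes
        if len(code) != 3 or any(ch not in "12345" for ch in code)
    ]
    duplicated = sorted(code for code, n in Counter(all_codes).items() if n > 1)
    return malformed, duplicated


# Build a code -> segment lookup; the first listed segment wins,
# which mirrors the order of a chained conditional
code_to_segment = {}
for name, codes in SEGMENTS.items():
    for code in codes:
        code_to_segment.setdefault(code, name)


def customer_type(code):
    # Unknown codes fall back to 'ERR', as in the cell above
    return code_to_segment.get(code, "ERR")


print(validate(SEGMENTS))
print(customer_type("231"))
```

A dictionary lookup can then be applied with `Series.map`, and the validation step flags codes that would otherwise end up as `ERR` or in an unintended segment.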

##### 4.6.3 Tree MAP for Customer Segement

In [None]:
## Draw tree Map

# Calculate the customer segment count
cust_segment_counts = X_copy['CustType'].value_counts().reset_index()
cust_segment_counts.columns = ['RFM_Cust_Segment', 'Count']

# Draw the segment MAP
fig = px.treemap(cust_segment_counts, path=['RFM_Cust_Segment'], values='Count', title='RFM Segmentation for Customer')
fig.show()

## Note: we can also draw the segments for an individual cluster to focus on the customers in that cluster.

### 5.0 Conclusion

After trying multiple models for customer clustering, almost all of them converge to similar results with the same optimal number of customer clusters. Using the RFM features, customers are further grouped into focus areas so that they can be served based on their individual behavior. The analysis also identified the peak selling times for the customers, along with the top-selling items for each customer segment.

Cluster 1 has the highest number of customers; customers in this segment are the most frequent visitors to the website and contribute the most in monetary value. More business analysis is required on the customers in clusters 2 and 3: they were frequent visitors and contributed fairly in monetary value, but for some reason they have not visited the website recently, as their recency score is much lower than that of the other segments.

Lastly, the customers are categorized by RFM score into multiple segments such as potential loyalists, need attention, lost customers, and so on. The business can target individual segments to generate more sales.