<a href="https://colab.research.google.com/github/Ranadeep-DS/Unsupervised_ML_Online_Retail_Customer_Segmentation/blob/main/Unsupervised_ML_Online_Retail_Customer_Segmentation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Online Retail Customer Segmentation



##### **Project Type**    - Unsupervised Machine Learning
##### **Contribution**    - Individual


# **Project Summary -**

In this project, the task is to identify the major customer segments in a transactional data set containing all transactions that occurred between 01/12/2010 and 09/12/2011 for a UK-based, registered, non-store online retailer. The company mainly sells unique all-occasion gifts, and many of its customers are wholesalers.

# **Dataset Link-**

https://docs.google.com/spreadsheets/d/1_W3Jfp1bTWpPFmqyGgGXYJGd0rHIV8dD/edit?usp=sharing&ouid=111621533088265830720&rtpof=true&sd=true

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


We have transaction details for the customers of an online retailer, from which we need to identify different types of customers based on their purchasing behaviour, measured by Recency, Frequency and Monetary value (RFM).
We segment these customers into High Value, Medium Value and Low Value Customers.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ML algorithms for model creation. Make sure that the following format is answered for each and every algorithm.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import math

import matplotlib.pyplot as plt
import seaborn as sns

from datetime import datetime
import datetime as dt

from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score
from sklearn import metrics


### Dataset Loading

In [None]:
# Load Dataset
cust_df = pd.read_csv('Online Retail.csv')


### Dataset First View

In [None]:
# Dataset First Look
cust_df.head()


In [None]:
cust_df.tail()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
len(cust_df)

In [None]:
cust_df.shape

In [None]:
cust_df.columns


### Dataset Information

In [None]:
# Dataset Info
cust_df.info()

In [None]:
cust_df.describe()

In [None]:
# There are null values in the CustomerID and Description columns.
# There are negative values in the Quantity and UnitPrice columns; these may be due to cancelled orders.


#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
cust_df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
cust_df.isnull().sum()

In [None]:
# There are 5268 duplicate rows and some null values in the data set.
# Both are dropped in the cells below.

In [None]:
cust_df.drop_duplicates(inplace = True)

In [None]:
cust_df.dropna(inplace=True)

In [None]:
cust_df.duplicated().sum()

In [None]:
cust_df.isnull().sum()

In [None]:
# Now the dataset has zero duplicates and zero null values.

In [None]:
cust_df.shape

### What did you know about your dataset?

The data set contained duplicate rows and missing values. After dropping those rows, the dataset has a total of 401604 rows and 8 columns.
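
The clean-up described above can be sketched on a small stand-in frame. The toy rows and the helper name `drop_duplicates_and_nulls` are assumptions made for illustration; they are not part of the notebook or the real data.

```python
import pandas as pd

# Toy stand-in for the retail data; all values are made up for illustration.
toy = pd.DataFrame({
    "InvoiceNo": ["536365", "536365", "536366", "536367"],
    "Description": ["MUG", "MUG", None, "LAMP"],
    "CustomerID": [17850.0, 17850.0, 13047.0, None],
})

def drop_duplicates_and_nulls(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: return a copy without duplicate rows or rows with nulls."""
    return df.drop_duplicates().dropna().reset_index(drop=True)

clean = drop_duplicates_and_nulls(toy)
print(f"{len(toy)} rows before, {len(clean)} rows after")
```

Returning a new frame instead of using `inplace=True` keeps the raw data available for comparison, which may be convenient when checking how many rows each step removes.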

## ***2. Understanding Your Variables***

In [None]:
# Our dataset contains some cancelled orders; we handle them first.
cust_df['InvoiceNo'] = cust_df['InvoiceNo'].astype('str')

In [None]:
# All invoice numbers of cancelled orders start with 'C'.

In [None]:
cust_df[cust_df['InvoiceNo'].str.startswith('C')]

In [None]:
# There are a total of 8872 cancelled orders in the dataset.
# We will drop these rows.

In [None]:
cust_df = cust_df[~cust_df['InvoiceNo'].str.startswith('C')]

In [None]:
cust_df.describe()

In [None]:
# UnitPrice still contains zero values, which indicate items that were given away for free.
# These rows are dropped as well.

In [None]:
cust_df = cust_df[cust_df['UnitPrice']>0]
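
As a minimal sketch of the two filters above, the following applies them together on toy data (the rows are invented for illustration). Building named boolean masks first makes it easy to report how many rows each rule removes.

```python
import pandas as pd

# Toy rows, made up for illustration: one cancellation and one zero-priced item.
toy = pd.DataFrame({
    "InvoiceNo": ["536365", "C536379", "536380", 536381],
    "Quantity": [6, -1, 2, 3],
    "UnitPrice": [2.55, 27.50, 0.0, 1.25],
})

# Invoice numbers may be read as mixed types, so cast to str before matching.
is_cancelled = toy["InvoiceNo"].astype(str).str.startswith("C")
is_free = toy["UnitPrice"] <= 0

kept = toy[~is_cancelled & ~is_free]
print(f"cancelled: {is_cancelled.sum()}, zero-priced: {is_free.sum()}, kept: {len(kept)}")
```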

In [None]:
cust_df.describe()

In [None]:
cust_df.shape

In [None]:
cust_df.head()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

cust_df['InvoiceDate'] = pd.to_datetime(cust_df['InvoiceDate'], format = "%m/%d/%y %H:%M")


In [None]:
# create some new features from invoice date using the vectorized .dt accessor
cust_df['day'] = cust_df['InvoiceDate'].dt.day_name()
cust_df['year'] = cust_df['InvoiceDate'].dt.year
cust_df['month_num'] = cust_df['InvoiceDate'].dt.month
cust_df['day_num'] = cust_df['InvoiceDate'].dt.day
cust_df['hour'] = cust_df['InvoiceDate'].dt.hour
cust_df['minute'] = cust_df['InvoiceDate'].dt.minute
cust_df['month'] = cust_df['InvoiceDate'].dt.month_name()


In [None]:
# create TotalAmount from Quantity and UnitPrice
cust_df['TotalAmount'] = cust_df['Quantity'] * cust_df['UnitPrice']


In [None]:
cust_df.head()

### What all manipulations have you done and insights you found?

Here I have created new features (day name, month name, day, month, year, hour and minute) from the existing InvoiceDate column.
I have also created a TotalAmount column from the Quantity and UnitPrice columns.
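
The feature derivation described above can be illustrated on two invented rows. This is only a sketch: the dates and prices are toy values, and only a subset of the derived columns is shown.

```python
import pandas as pd

# Two made-up transactions for illustration.
toy = pd.DataFrame({
    "InvoiceDate": pd.to_datetime(["2010-12-01 08:26", "2011-11-24 15:05"]),
    "Quantity": [6, 3],
    "UnitPrice": [2.55, 4.00],
})

# Calendar features come from the .dt accessor of a datetime column.
toy["day"] = toy["InvoiceDate"].dt.day_name()
toy["month"] = toy["InvoiceDate"].dt.month_name()
toy["hour"] = toy["InvoiceDate"].dt.hour

# Line total is quantity times unit price.
toy["TotalAmount"] = toy["Quantity"] * toy["UnitPrice"]
print(toy)
```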

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code

desc_df = cust_df['Description'].value_counts().reset_index()

In [None]:
desc_df.head()

In [None]:
plt.rcParams['figure.figsize'] = [18,7]
sns.set(rc={'figure.figsize':(18,7)})

In [None]:
sns.barplot(data=desc_df[:5], x ='Description', y='count')
plt.title('Top 5 most sold products')
plt.show()

##### 1. Why did you pick the specific chart?

I picked a bar chart because it makes it easy to compare the counts of the top 5 sold products.

##### 2. What is/are the insight(s) found from the chart?

White hanging heart T-Light Holder is the top-selling product (by number of invoice lines), followed by Regency cake stand, Jumbo bag, Assorted colour bird ornament and Party bunting.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

All of the above-mentioned products are selling well.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
desc_df.tail()

In [None]:
sns.barplot(data= desc_df[-5:], x='Description', y='count')
plt.title('Bottom 5 selling products')
plt.show()

##### 1. Why did you pick the specific chart?

I have plotted a bar chart of the bottom 5 selling products.

##### 2. What is/are the insight(s) found from the chart?

Green with metal bag charm, White with metal bag charm, Blue/Nat shell necklace, Pink easter hens and Paper craft little birdie are the least-selling products.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

More focus should go on these products.


#### Chart - 3

In [None]:
# Chart - 3 visualization code
country_df = cust_df['Country'].value_counts().reset_index()

In [None]:
country_df.head()

In [None]:
sns.barplot(data=country_df[:5], x = 'Country', y='count')
plt.title('Top 5 countries based on number of transaction records')
plt.show()

##### 1. Why did you pick the specific chart?

To find the top 5 countries based on the number of transaction records (rows per country).


##### 2. What is/are the insight(s) found from the chart?

The UK has the highest number of transaction records, followed by Germany, France, EIRE and Spain.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

These are the countries from which the most business comes.
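
Note that `value_counts()` on `Country` counts rows (invoice lines), not distinct customers. If a ranking by customers is wanted, one possible approach is to count unique `CustomerID` values per country. The sketch below uses invented toy data to show how the two rankings can differ.

```python
import pandas as pd

# Made-up data: one UK customer places many order lines.
toy = pd.DataFrame({
    "Country": ["United Kingdom"] * 4 + ["Germany"] * 3,
    "CustomerID": [17850, 17850, 17850, 13047, 12662, 12471, 12472],
})

# Rows per country (what value_counts() measures).
rows_per_country = toy["Country"].value_counts()

# Distinct customers per country.
customers_per_country = (
    toy.groupby("Country")["CustomerID"].nunique().sort_values(ascending=False)
)

print(rows_per_country)
print(customers_per_country)
```

In this toy example the UK leads by rows while Germany leads by distinct customers, so the choice of measure should match the claim made in the chart title.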

#### Chart - 4

In [None]:
# Chart - 4 visualization code
country_df.tail()


In [None]:
sns.barplot(data=country_df[-5:], x='Country',y='count')
plt.title("Bottom 5 countries based on number of transaction records")
plt.show()

##### 1. Why did you pick the specific chart?

To plot the bottom 5 countries based on the number of transaction records (rows per country).

##### 2. What is/are the insight(s) found from the chart?

Saudi Arabia is the bottom-most country, followed by Bahrain, Czech Republic, Brazil and Lithuania.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

These are the countries where the least business is happening, so more focus should go on improving business in these countries.

#### Chart - 5

In [None]:
# Total unique customers
len(cust_df['CustomerID'].unique())

In [None]:
# There are total 4338 unique customers.

In [None]:
share_df = (cust_df['CustomerID'].value_counts()/sum(cust_df['CustomerID'].value_counts()) * 100).reset_index()
share_df = share_df.rename(columns = {'count':'ordershare'})
share_df.head(10)

In [None]:
share_df['cummulative_ordershare']=share_df['ordershare'].cumsum()

In [None]:
sns.barplot(data=share_df[:10], x='CustomerID', y='cummulative_ordershare')
plt.title('Top 10 customers based on order share')
plt.show()

##### 1. Why did you pick the specific chart?

Here I am plotting the cumulative order share of the top 10 customers.

##### 2. What is/are the insight(s) found from the chart?

The top 10 customers alone account for 9% of the order share, out of 4338 customers.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

These 10 customers are likely wholesalers.
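
The cumulative order share can be sketched on a tiny invented series of customer IDs. Here `value_counts(normalize=True)` is used as a shortcut for dividing the counts by their sum; the IDs and the `top_n` variable are assumptions for illustration.

```python
import pandas as pd

# Made-up order lines: customer 1 appears 4 times, customer 2 three times, and so on.
customer_ids = pd.Series([1, 1, 1, 1, 2, 2, 2, 3, 3, 4], name="CustomerID")

# Share of order lines per customer in percent, sorted from largest to smallest.
share = customer_ids.value_counts(normalize=True).mul(100)

# Running total of the share: entry n-1 is the share held by the top n customers.
cumulative = share.cumsum()

top_n = 2
print(f"Top {top_n} customers hold {cumulative.iloc[top_n - 1]:.0f}% of order lines")
```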

#### Chart - 6

In [None]:
 # plot the distribution of the numerical features
num_features = ['Quantity', 'UnitPrice', 'TotalAmount']
plt.figure(figsize=(20,13))
for count, feature in enumerate(num_features, start=1):
  plt.subplot(2,2,count)
  # histplot replaces distplot, which is deprecated in recent seaborn releases
  sns.histplot(cust_df[feature], kde=True, stat='density')
  plt.title(f"Distribution of the variable {feature}", fontsize=16)
  plt.xlabel(f"{feature}")
  plt.ylabel("Density")
plt.show()

In [None]:
# Here the data is skewed so we apply log transformation.

In [None]:
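plt.figure(figsize=(20,13))
for count, feature in enumerate(num_features, start=1):
  plt.subplot(2,2,count)
  # histplot replaces distplot, which is deprecated in recent seaborn releases
  sns.histplot(np.log1p(cust_df[feature]), kde=True, stat='density')
  plt.title(f"Distribution of the log-transformed variable {feature}", fontsize=16)
  plt.xlabel(f"log1p({feature})")
  plt.ylabel("Density")
plt.show()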
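
The effect of the log transformation on skewed data can also be checked numerically rather than only by eye. The sketch below uses synthetic log-normal values (an assumption standing in for amounts such as `TotalAmount`) and compares the sample skewness before and after `np.log1p`.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed data; the real columns are not used here.
rng = np.random.default_rng(0)
amounts = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=5000))

# Sample skewness: values far above 0 indicate a long right tail.
skew_raw = amounts.skew()
skew_log = np.log1p(amounts).skew()

print(f"skew before: {skew_raw:.2f}, after log1p: {skew_log:.2f}")
```

A skewness close to zero after the transformation suggests the distribution has become roughly symmetric, which tends to suit distance-based methods such as KMeans better.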

#### Chart - 7

In [None]:
# Chart - 7 visualization code
day_df = cust_df['day'].value_counts().reset_index()
day_df

In [None]:
sns.barplot(data=day_df, x='day',y='count')
plt.title('Purchases as per day')
plt.show()

##### 1. Why did you pick the specific chart?

Here we are plotting the total purchases per day of the week.

##### 2. What is/are the insight(s) found from the chart?

Thursday has the highest number of purchases, followed by Wednesday, Tuesday, Monday, Sunday and Friday.
There are no purchases on Saturday.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

More focus should go on improving sales on Friday, Saturday and Sunday, when sales are lowest.

#### Chart - 8

In [None]:
# Chart - 8 visualization code
month_df = cust_df['month'].value_counts().reset_index()
month_df

In [None]:
sns.barplot(x='month', y='count', data=month_df)
plt.title('Purchases made as per month')
plt.show()

##### 1. Why did you pick the specific chart?

Here we are plotting purchases made as per month.

##### 2. What is/are the insight(s) found from the chart?

Purchases are highest in November, followed by October and December, and lowest in January and February.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

More focus should go on January and February to improve sales in those months.

#### Chart - 9

In [None]:
# Chart - 9 visualization code
hour_df = cust_df['hour'].value_counts().reset_index()
hour_df

In [None]:
sns.barplot(x='hour', y='count', data=hour_df)
plt.title('Purchases made as per hour of a day')
plt.show()

In [None]:
def time_day(time):
  if (time >= 6 and time <= 11):
    return 'Morning'
  elif (time >= 12 and time <= 17):
    return 'Afternoon'
  else:
    return 'Evening'


cust_df['time_day'] = cust_df['hour'].apply(time_day)
cust_df.head()

In [None]:
sns.countplot(x='time_day', data=cust_df)
plt.title('Purchases made during the time of the day')
plt.show()

##### 1. Why did you pick the specific chart?

Plotting purchases made on different times of a day

##### 2. What is/are the insight(s) found from the chart?

Most purchases are made in the afternoon, followed by the morning.
The fewest purchases are made in the evening.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

More focus should go on increasing purchases in the evening.
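
The hour-to-period mapping used for this chart (6 to 11 is Morning, 12 to 17 is Afternoon, everything else is Evening) can also be written in vectorized form. This is a sketch of an alternative to the row-wise `apply`, using `np.select` on invented hour values.

```python
import numpy as np
import pandas as pd

# Made-up hours covering each bucket, including the overnight wrap-around.
hours = pd.Series([7, 11, 12, 17, 18, 23, 2], name="hour")

# Series.between is inclusive on both ends by default.
conditions = [hours.between(6, 11), hours.between(12, 17)]
labels = ["Morning", "Afternoon"]

# Hours that match neither condition fall back to "Evening".
time_of_day = pd.Series(
    np.select(conditions, labels, default="Evening"), index=hours.index
)
print(time_of_day.tolist())
```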

## ***5. ML Model Implementation***

In [None]:
# I am using only UK data for model building
cust_df = cust_df[cust_df['Country'] == 'United Kingdom']

cust_df.shape

In [None]:
# calculating RFM scores
# set latest date to '2011-12-10' as the last invoice date was '2011-12-09'
latest_date = dt.datetime(2011,12,10)

# create rfm modeling scores for each customer
rfm_df = cust_df.groupby('CustomerID').agg({'InvoiceDate': lambda x: (latest_date - x.max()).days, 'InvoiceNo': lambda x: len(x),
                                            'TotalAmount': lambda x: x.sum()})

# convert invoice date into type int
rfm_df['InvoiceDate'] = rfm_df['InvoiceDate'].astype(int)

# rename columns to frequency, recency, monetary
rfm_df.rename(columns={'InvoiceDate': 'Recency', 'InvoiceNo': 'Frequency', 'TotalAmount': 'Monetary'}, inplace=True)

rfm_df.reset_index().head()
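
To make the RFM aggregation concrete, here is a minimal sketch on four invented rows. The extra `Invoices` column is an assumption added for comparison: the notebook's Frequency counts invoice lines, whereas counting distinct invoice numbers is another common definition.

```python
import pandas as pd

# Made-up transactions: customer 1 has two invoices (three lines), customer 2 has one.
toy = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2],
    "InvoiceNo": ["A1", "A1", "A2", "B1"],
    "InvoiceDate": pd.to_datetime(
        ["2011-12-01", "2011-12-01", "2011-12-08", "2011-06-10"]
    ),
    "TotalAmount": [10.0, 5.0, 20.0, 7.5],
})
snapshot = pd.Timestamp("2011-12-10")  # one day after the last invoice date

rfm = toy.groupby("CustomerID").agg(
    Recency=("InvoiceDate", lambda d: (snapshot - d.max()).days),  # days since last purchase
    Frequency=("InvoiceNo", "count"),     # invoice lines, as in the notebook
    Invoices=("InvoiceNo", "nunique"),    # distinct invoices (alternative definition)
    Monetary=("TotalAmount", "sum"),      # total spend
)
print(rfm)
```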

In [None]:
# plot the distribution of the RFM values
plt.figure(figsize=(20,13))
for count, feature in enumerate(rfm_df, start=1):
  plt.subplot(2,2,count)
  # histplot replaces distplot, which is deprecated in recent seaborn releases
  sns.histplot(rfm_df[feature], kde=True, stat='density')
  plt.title(f"Distribution of the variable {feature}", fontsize=16)
  plt.xlabel(f"{feature}")
  plt.ylabel("Density")
plt.show()

In [None]:
# treat the negative and zero values to handle infinite numbers during log transformation
def handle_negative(num):
  if num <= 0:
    return 1
  else:
    return num

# apply the function to recency and monetary columns
rfm_df['Recency'] = [handle_negative(x) for x in rfm_df['Recency']]
rfm_df['Monetary'] = [handle_negative(x) for x in rfm_df['Monetary']]

# apply log transfomation to RFM values
log_df = rfm_df[['Recency', 'Frequency', 'Monetary']].apply(np.log, axis=1).round(3)

In [None]:
# plot the log transformed distribution
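plt.figure(figsize=(20,13))
for count, feature in enumerate(log_df, start=1):
  plt.subplot(2,2,count)
  # histplot replaces distplot, which is deprecated in recent seaborn releases
  sns.histplot(log_df[feature], kde=True, stat='density')
  plt.title(f"Distribution of the log-transformed variable {feature}", fontsize=16)
  plt.xlabel(f"log({feature})")
  plt.ylabel("Density")
plt.show()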

In [None]:
# These log-transformed distributions are approximately normal.
# Now we apply log transformations to all 3 columns to proceed further.

In [None]:
# use the vectorized np.log; numpy.math is no longer available in recent NumPy releases
rfm_df['Recency_log'] = np.log(rfm_df['Recency'])
rfm_df['Frequency_log'] = np.log(rfm_df['Frequency'])
rfm_df['Monetary_log'] = np.log(rfm_df['Monetary'])
# display the rfm_df
rfm_df.head()

In [None]:
features = ['Recency_log', 'Frequency_log', 'Monetary_log']

# scaling our data
X_features = rfm_df[features].values
scaler = StandardScaler()
X = scaler.fit_transform(X_features)
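
For reference, standardization subtracts each column's mean and divides by its population standard deviation (`ddof=0`), which is what `StandardScaler` computes with its default settings. The NumPy sketch below reproduces that arithmetic on a small invented matrix.

```python
import numpy as np

# Made-up log-RFM values: 3 customers x 3 features.
X_log = np.array([[0.0, 1.0, 4.0],
                  [2.0, 3.0, 6.0],
                  [4.0, 5.0, 11.0]])

mean = X_log.mean(axis=0)
std = X_log.std(axis=0)  # population standard deviation (ddof=0)

# Each column now has mean 0 and standard deviation 1.
X_scaled = (X_log - mean) / std
print(X_scaled.round(3))
```

Scaling matters here because KMeans relies on Euclidean distance, so a feature with a larger numeric range would otherwise dominate the clustering.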

In [None]:
# We are applying elbow method to find optimum number of clusters.

In [None]:
from yellowbrick.cluster import KElbowVisualizer
# record the sum of squared errors (inertia) for each k
SSE = {}
for k in range(1,15):
  km = KMeans(n_clusters = k, init = 'k-means++', max_iter = 1000)
  km = km.fit(X)
  SSE[k] = km.inertia_

# plot the distortion score against the number of clusters
visualizer = KElbowVisualizer(km, k=(1,15), metric='distortion', timings=False)
visualizer.fit(X)
visualizer.show()  # show() replaces the deprecated poof()

In [None]:
# Here the optimum number of clusters is 3
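
The elbow can also be read directly from `KMeans.inertia_` without a plotting helper. The sketch below uses synthetic, well-separated blobs (an assumption, not the notebook's `X`) and looks at how much the inertia drops with each additional cluster; the drop typically shrinks sharply after the elbow.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: three well-separated 2-D blobs of 50 points each.
rng = np.random.default_rng(42)
centers = np.array([[0, 0], [6, 6], [0, 6]])
X_demo = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# Within-cluster sum of squares for each candidate k.
inertia = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_demo)
    inertia[k] = model.inertia_

# Improvement gained by going from k-1 to k clusters.
drops = {k: inertia[k - 1] - inertia[k] for k in range(2, 7)}
for k, drop in drops.items():
    print(f"k={k}: inertia drops by {drop:.1f}")
```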

In [None]:
# fit KMeans with the 3 clusters suggested by the elbow method
kmeans = KMeans(n_clusters=3)
kmeans.fit(X)
y_km = kmeans.predict(X)

# Plot the clusters on the first two scaled features
plt.figure(figsize=(10, 6))
plt.title('Customer Segmentation based on Recency and Frequency')
plt.scatter(X[:,0], X[:,1], c=y_km, s=50, cmap='Set1', label='Clusters')

# Plot and annotate the centers
centers = kmeans.cluster_centers_
plt.scatter(centers[:,0], centers[:,1], c='black', s=200, alpha=0.5, marker='x')
for i, center in enumerate(centers):
    plt.annotate(f'Cluster {i}', (center[0], center[1]), textcoords="offset points", xytext=(0,10), ha='center')

# the axes show log-transformed, standardized values
plt.xlabel('Recency (log, standardized)')
plt.ylabel('Frequency (log, standardized)')
plt.legend()
plt.show()
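
The notebook imports `silhouette_score` but does not use it. As a hedged sketch of how the choice of k could be cross-checked, the following computes the average silhouette score for several values of k on synthetic blobs (again an assumption standing in for the scaled RFM matrix).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data: three well-separated 2-D blobs of 50 points each.
rng = np.random.default_rng(42)
centers = np.array([[0, 0], [6, 6], [0, 6]])
X_demo = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# The silhouette score needs at least 2 clusters; it ranges from -1 to 1.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

best_k = max(scores, key=scores.get)
print({k: round(s, 3) for k, s in scores.items()}, "best k:", best_k)
```

On real RFM data the peak is usually less pronounced than on this toy example, so the silhouette score is best read alongside the elbow plot rather than as a replacement for it.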

In [None]:
quantiles = rfm_df.quantile(q=[0.25,0.5,0.75])
quantiles = quantiles.to_dict()

In [None]:
def RScore(x,p,d):
  if x <= d[p][0.25]:
    return 1
  elif x <= d[p][0.5]:
    return 2
  elif x <= d[p][0.75]:
    return 3
  else:
    return 4

def FnMScore(x,p,d):
  if x <= d[p][0.25]:
    return 4
  elif x <= d[p][0.5]:
    return 3
  elif x <= d[p][0.75]:
    return 2
  else:
    return 1

In [None]:
rfm_df['R'] = rfm_df['Recency'].apply(RScore, args=('Recency', quantiles, ))
rfm_df['F'] = rfm_df['Frequency'].apply(FnMScore, args=('Frequency', quantiles, ))
rfm_df['M'] = rfm_df['Monetary'].apply(FnMScore, args=('Monetary', quantiles, ))
rfm_df.reset_index().head()

In [None]:
# add RFM group column
rfm_df['RFMGroup'] = rfm_df['R'].map(str) + rfm_df['F'].map(str) + rfm_df['M'].map(str)

# calculate RFM score as the sum of the R, F and M scores (lower is better with this scoring)
rfm_df['RFMScore'] = rfm_df[['R', 'F', 'M']].sum(axis=1)
rfm_df.reset_index().head()
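
The quartile scoring above can also be expressed with `pd.qcut`, which bins values by quantile. This is a sketch on invented values, not a drop-in replacement: `qcut` raises an error when quantile edges coincide, which can happen with heavily tied real data. The label order follows the notebook's convention, where 1 is the best score.

```python
import pandas as pd

# Made-up recency (days) and monetary values for 8 customers.
toy = pd.DataFrame({
    "Recency": [2, 10, 40, 90, 200, 300, 5, 60],
    "Monetary": [5000, 900, 400, 150, 80, 20, 2500, 300],
})

# Recency: the most recent quartile gets 1, the least recent gets 4.
toy["R"] = pd.qcut(toy["Recency"], q=4, labels=[1, 2, 3, 4]).astype(int)

# Monetary: the highest-spending quartile gets 1, the lowest gets 4.
toy["M"] = pd.qcut(toy["Monetary"], q=4, labels=[4, 3, 2, 1]).astype(int)

# Lower combined score means a more valuable customer.
toy["Score"] = toy["R"] + toy["M"]
print(toy)
```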

In [None]:
rfm_df['Cluster'] = kmeans.labels_
rfm_df.head()

In [None]:
rfm_df.tail()

**Cluster 1:**

*   Recency: High (average around 165 days)
*   Frequency: Low (average around 15 transactions)
*   Monetary: Low (average around £286)
*   Interpretation: Customers in this cluster are likely to be 'At-Risk' or 'Lapsed' customers. They haven't made purchases recently, and when they did, they didn't do so very frequently and didn't spend much. These customers might have been one-time buyers or occasional shoppers. Engaging them with reactivation campaigns or exploring why they haven't returned can be a strategic move.


**Cluster 2:**

*   Recency: Very Low (average around 11 days)
*   Frequency: Very High (average around 259 transactions)
*   Monetary: Very High (average around £5933)
*   Interpretation: This cluster represents the 'Champions' or 'Loyal' customers. They shop frequently, have purchased recently, and spend the most. They are the most valuable segment, likely to respond positively to new offers and to up-sell and cross-sell opportunities. Maintaining their high engagement level is crucial, and they can also be targeted for feedback or as brand ambassadors.


**Cluster 0:**

*   Recency: Moderate (average around 68 days)
*   Frequency: Moderate (average around 69 transactions)
*   Monetary: Moderate (average around £1200)
*   Interpretation: Customers in this cluster can be seen as 'Potential Loyalists' or 'Promising' customers. They have a balanced score in all three RFM metrics. These customers have the potential to become more valuable if properly engaged. Tailored marketing strategies, loyalty programs, and incentives to increase their purchase frequency and value can be effective.
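
Per-cluster averages like the ones quoted above can be obtained by grouping the RFM table on the cluster label. The sketch below uses invented values chosen only to resemble the three profiles; it does not reproduce the notebook's actual numbers.

```python
import pandas as pd

# Made-up RFM rows with cluster labels, two customers per cluster.
toy = pd.DataFrame({
    "Recency": [5, 12, 70, 60, 160, 170],
    "Frequency": [250, 270, 70, 68, 14, 16],
    "Monetary": [6000.0, 5800.0, 1250.0, 1150.0, 300.0, 270.0],
    "Cluster": [2, 2, 0, 0, 1, 1],
})

# Mean of each RFM metric per cluster.
profile = (
    toy.groupby("Cluster")[["Recency", "Frequency", "Monetary"]]
    .mean()
    .round(1)
)

# Cluster sizes, aligned on the cluster label.
profile["Customers"] = toy["Cluster"].value_counts()
print(profile)
```

Because KMeans assigns label numbers arbitrarily unless `random_state` is fixed, it may be safer to name the segments from this profile table after each run rather than relying on a specific cluster number.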

# **Conclusion**

**Cluster 1** ('At-Risk/Lapsed'): Focus on re-engagement strategies. Understand their needs and reasons for not returning. Offer incentives or feedback surveys to encourage them to revisit and make purchases.

**Cluster 2** ('Champions/Loyal'): Prioritize maintaining their high level of engagement. Offer exclusive deals, loyalty programs, and early access to new products. They can also be engaged in referral programs.

**Cluster 0** ('Potential Loyalists/Promising'): Encourage them to visit and buy more often. Personalized communication, recommending products based on past purchases, and loyalty rewards can be effective.

Thus, KMeans clustering on the log-transformed, standardized Recency, Frequency and Monetary data, with 3 clusters as suggested by the elbow method, separates the customers into three interpretable segments. We can use this model to cluster our data into 3 segments and develop better marketing strategies.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***