<a href="https://colab.research.google.com/github/DATA3750/WeeklyDemo/blob/main/Wk9_customer_segmentation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Customer Segmentation

## Problem Statement

Although the superstore's business is doing well, management does not have a clear picture of its customers: who the core and loyal customers are, which region or country they come from, which range of products each type of customer buys frequently, and so on.


>Performing customer segmentation (and other relevant business analysis) can provide more detailed information about the existing customers.

## Import Libraries/Packages


In [None]:
#For Data Processing
import pandas as pd
import numpy as np

#For Data Visualisation
import matplotlib.pyplot as plt
import seaborn as sns
#Apply the default seaborn visualisation theme
sns.set()

#For Datetime Manipulation
import datetime as dt

## Importing of Dataset
Import the dataset from a CSV file into a DataFrame.

In [None]:
# Import csv using Pandas and store into a dataframe

dataset = pd.read_csv('https://raw.githubusercontent.com/Dong2Yo/Dataset/main/superstore.csv')

#Get an overview of the dataset to check on the Dtype and any missing values
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51290 entries, 0 to 51289
Data columns (total 27 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Category        51290 non-null  object 
 1   City            51290 non-null  object 
 2   Country         51290 non-null  object 
 3   Customer.ID     51290 non-null  object 
 4   Customer.Name   51290 non-null  object 
 5   Discount        51290 non-null  float64
 6   Market          51290 non-null  object 
 7   记录数             51290 non-null  int64  
 8   Order.Date      51290 non-null  object 
 9   Order.ID        51290 non-null  object 
 10  Order.Priority  51290 non-null  object 
 11  Product.ID      51290 non-null  object 
 12  Product.Name    51290 non-null  object 
 13  Profit          51290 non-null  float64
 14  Quantity        51290 non-null  int64  
 15  Region          51290 non-null  object 
 16  Row.ID          51290 non-null  int64  
 17  Sales           51290 non-null 

>The dataset is successfully imported into the DataFrame. It contains 51,290 entries and 27 columns, with no missing values in any column. All columns have the correct dtype except "Order.Date" and "Ship.Date", which should be datetime. They will be converted in the following data cleaning process.
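
The checks described above can also be made explicit rather than read off `dataset.info()`. The following is a minimal sketch on a tiny made-up frame (the values and the name `toy` are illustrative assumptions, not rows from the Superstore file): it counts missing values per column and lists the date columns that still need converting.

```python
import pandas as pd

# Toy stand-in for the Superstore data; values are illustrative only.
toy = pd.DataFrame({
    "Order.Date": ["2011-01-07 00:00:00.000", "2011-01-21 00:00:00.000"],
    "Ship.Date": ["2011-01-09 00:00:00.000", "2011-01-26 00:00:00.000"],
    "Sales": [19.44, 1325.85],
})

# Count missing values per column; all zeros means nothing is missing.
missing_per_column = toy.isna().sum()

# Find the date columns that are not yet stored with a datetime dtype.
date_columns = ["Order.Date", "Ship.Date"]
needs_conversion = [
    column for column in date_columns
    if not pd.api.types.is_datetime64_any_dtype(toy[column])
]

# Convert them using the same format string as the cleaning step.
for column in needs_conversion:
    toy[column] = pd.to_datetime(toy[column], format="%Y-%m-%d %H:%M:%S.%f")

print(missing_per_column.to_dict())
print(needs_conversion)
print(toy.dtypes)
```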

## Data Cleaning

In [None]:
#Change Order.Date & Ship.Date to datetime dtype.
dataset['Order.Date'] = pd.to_datetime(dataset['Order.Date'], format='%Y-%m-%d %H:%M:%S.%f')
dataset['Ship.Date'] = pd.to_datetime(dataset['Ship.Date'], format='%Y-%m-%d %H:%M:%S.%f')
#Drop '记录数' ("number of records") and 'Row.ID', which are not used in the analysis
dataset = dataset.drop(columns=['记录数', 'Row.ID'])
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51290 entries, 0 to 51289
Data columns (total 25 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   Category        51290 non-null  object        
 1   City            51290 non-null  object        
 2   Country         51290 non-null  object        
 3   Customer.ID     51290 non-null  object        
 4   Customer.Name   51290 non-null  object        
 5   Discount        51290 non-null  float64       
 6   Market          51290 non-null  object        
 7   Order.Date      51290 non-null  datetime64[ns]
 8   Order.ID        51290 non-null  object        
 9   Order.Priority  51290 non-null  object        
 10  Product.ID      51290 non-null  object        
 11  Product.Name    51290 non-null  object        
 12  Profit          51290 non-null  float64       
 13  Quantity        51290 non-null  int64         
 14  Region          51290 non-null  object        
 15  Sa

## **Recency, Frequency, Monetary (RFM) Analysis**
Now that we have identified where the customers who account for 80% of sales are located, we will classify them into groups. One way to classify customers is RFM analysis.

We want to group each customer by how recently they last bought from Global Superstore (Recency), how often they bought during 2011-2014 (Frequency), and how much they spent at Global Superstore (Monetary). The higher the RFM score, the more valuable the customer is to Global Superstore.

![LRFMV](customer_segmentation.PNG)
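
To make the three quantities concrete, here is a minimal sketch that builds a customer-level RFM table from a handful of made-up order lines. The toy rows, the use of `Customer.ID` as the key, counting distinct `Order.ID` values for Frequency, and summing `Sales` for Monetary are all assumptions for illustration; the notebook itself keys on `Customer.Name` and uses `Profit`.

```python
import pandas as pd

# Toy order lines; one row per product line, values are illustrative only.
orders = pd.DataFrame({
    "Customer.ID": ["A", "A", "A", "B", "B", "C"],
    "Order.ID": ["o1", "o1", "o2", "o3", "o4", "o5"],
    "Order.Date": pd.to_datetime([
        "2014-12-01", "2014-12-01", "2013-06-15",
        "2012-03-10", "2014-12-31", "2011-01-07",
    ]),
    "Sales": [100.0, 50.0, 25.0, 10.0, 40.0, 300.0],
})

# Anchor date from which recency is measured.
anchor_date = pd.Timestamp("2015-01-01")

# One row per customer: last order date, number of distinct orders, total sales.
rfm = orders.groupby("Customer.ID").agg(
    last_order=("Order.Date", "max"),
    Frequency=("Order.ID", "nunique"),
    Monetary=("Sales", "sum"),
)

# Recency is the number of days between the last order and the anchor date.
rfm["Recency"] = (anchor_date - rfm["last_order"]).dt.days
rfm = rfm.drop(columns="last_order")

print(rfm)
```

Counting distinct orders instead of rows keeps a customer who bought five products in one order from looking like a five-time buyer.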


### Assign RFM scores & classify customers' grouping to each sub dataset
I attempted this project at the end of 2023, but the dataset covers 2011-2014. To keep the analysis relevant, I assume that Global Superstore runs this project in January 2015, so the anchor date for Recency is 1 January 2015. I first give each of Recency, Frequency and Monetary a score of 1-5, then add them up to get the RFM score for each customer. Once every customer has an RFM score, I group them into four groups: Champions, Loyalists, Potential Churn and Churn customers.
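
The scoring plan above can be sketched end to end on a small made-up customer table. The helper `quintile_score`, the toy values, and in particular the cut-offs used to split the summed score into the four groups are assumptions for illustration, not thresholds taken from this notebook. Ranking with `method="first"` before `pd.qcut` is one way to avoid the "bin edges must be unique" error when many customers share the same value.

```python
import pandas as pd

# Toy customer-level table; values are illustrative only.
rfm = pd.DataFrame(
    {
        "Recency": [5, 40, 200, 400, 900, 30, 700, 90, 10, 1300],
        "Frequency": [12, 8, 3, 2, 1, 9, 1, 5, 15, 1],
        "Monetary": [5000, 3200, 800, 400, 90, 2700, 150, 1200, 7000, 60],
    },
    index=[f"cust_{i}" for i in range(10)],
)


def quintile_score(values, higher_is_better=True):
    """Hypothetical helper: score values 1-5 by quintile, breaking ties by order."""
    ranks = values.rank(method="first")
    labels = [1, 2, 3, 4, 5] if higher_is_better else [5, 4, 3, 2, 1]
    return pd.qcut(ranks, q=5, labels=labels).astype(int)


# Fewer days since the last purchase is better, so Recency labels are reversed.
rfm["R"] = quintile_score(rfm["Recency"], higher_is_better=False)
rfm["F"] = quintile_score(rfm["Frequency"])
rfm["M"] = quintile_score(rfm["Monetary"])

# The RFM score is the sum of the three scores, so it ranges from 3 to 15.
rfm["RFM.Score"] = rfm["R"] + rfm["F"] + rfm["M"]

# Assumed cut-offs: 3-6, 7-9, 10-12 and 13-15.
rfm["Group"] = pd.cut(
    rfm["RFM.Score"],
    bins=[2, 6, 9, 12, 15],
    labels=["Churn", "Potential Churn", "Loyalists", "Champions"],
)

print(rfm[["R", "F", "M", "RFM.Score", "Group"]])
```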

In [None]:
# Convert 'order date' to datetime
dataset['Order.Date'] = pd.to_datetime(dataset['Order.Date'])

# Calculate recency (days since each order) relative to the anchor date of 1 Jan 2015
current_date = pd.to_datetime('2015-1-1')
dataset['Recency'] = (current_date - dataset['Order.Date']).dt.days

# Assign recency scores; labels are reversed so that the most recent orders score 5
dataset['Recency.Score'] = pd.qcut(dataset['Recency'], q=5, labels=range(5, 0, -1))

# Print the DataFrame with recency and recency scores
print(dataset)

              Category         City        Country Customer.ID  \
0      Office Supplies  Los Angeles  United States   LS-172304   
1      Office Supplies  Los Angeles  United States   MV-174854   
2      Office Supplies  Los Angeles  United States   CS-121304   
3      Office Supplies  Los Angeles  United States   CS-121304   
4      Office Supplies  Los Angeles  United States   AP-109154   
...                ...          ...            ...         ...   
51285  Office Supplies  Los Angeles  United States   AM-103604   
51286  Office Supplies  Los Angeles  United States   AM-103604   
51287  Office Supplies  Los Angeles  United States   HR-147704   
51288  Office Supplies  Los Angeles  United States   RM-196754   
51289  Office Supplies  Los Angeles  United States   FH-143654   

          Customer.Name  Discount Market Order.Date        Order.ID  \
0      Lycoris Saunders       0.0     US 2011-01-07  CA-2011-130813   
1         Mark Van Huff       0.0     US 2011-01-21  CA-2011-1486
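
Note that the recency computed in the cell above is per order line, so a customer's older orders carry a larger recency than their latest one. If a single customer-level recency is wanted on every row, one possible approach (a sketch on made-up rows, with `Customer.Recency` as a hypothetical column name) is to broadcast each customer's most recent order date with `groupby(...).transform("max")`.

```python
import pandas as pd

# Toy order lines; values are illustrative only.
orders = pd.DataFrame({
    "Customer.Name": ["Ann", "Ann", "Bob"],
    "Order.Date": pd.to_datetime(["2011-01-07", "2014-12-25", "2013-01-01"]),
})
anchor_date = pd.Timestamp("2015-01-01")

# Per-order recency: days between each individual order and the anchor date.
orders["Recency"] = (anchor_date - orders["Order.Date"]).dt.days

# Customer-level recency: days since the customer's most recent order,
# repeated on every row that belongs to that customer.
last_order = orders.groupby("Customer.Name")["Order.Date"].transform("max")
orders["Customer.Recency"] = (anchor_date - last_order).dt.days

print(orders)
```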

In [None]:
# Calculate frequency (number of order lines) for each customer
frequency = dataset.groupby('Customer.Name').size().reset_index(name='Frequency')

# Assign frequency scores
frequency['Frequency.Score'] = pd.qcut(frequency['Frequency'], q=5, labels=range(1, 6))

# Merge frequency scores back to the original dataset
dataset = pd.merge(dataset, frequency, on='Customer.Name', how='left')

# Print the DataFrame with frequency and frequency scores
print(dataset)



In [None]:
# Remove "Value." prefix from column names
## dataset.columns = dataset.columns.str.replace('Value.', '')

In [None]:

# Assuming the 'profit' column represents the monetary value for each order
# Calculate total monetary value for each customer
monetary = dataset.groupby('Customer.Name')['Profit'].sum().reset_index()
monetary.rename(columns={'Profit': 'Monetary'}, inplace=True)

# Assign monetary scores
monetary['Monetary.Score'] = pd.qcut(monetary['Monetary'], q=5, labels=range(1, 6))

# Merge monetary scores back to the original dataset
dataset = pd.merge(dataset, monetary, on='Customer.Name', how='left')

# Print the DataFrame with monetary and monetary scores
print(dataset)



In [None]:
import pandas as pd


# Convert the 'order date' to datetime
dataset['Order.Date'] = pd.to_datetime(dataset['Order.Date'])

# Find the first and last visit for each customer
first_visit = dataset.groupby('Customer.Name')['Order.Date'].min().reset_index()
last_visit = dataset.groupby('Customer.Name')['Order.Date'].max().reset_index()

# Calculate the length (number of days between first and last visit) for each customer
length = pd.merge(first_visit, last_visit, on='Customer.Name', suffixes=('_first', '_last'))
length['Length'] = (length['Order.Date_last'] - length['Order.Date_first']).dt.days

# Assign length scores
length['Length.Score'] = pd.qcut(length['Length'], q=5, labels=range(1, 6))

# Merge length back to the original dataset
dataset = pd.merge(dataset, length[['Customer.Name', 'Length', 'Length.Score']], on='Customer.Name', how='left')

# Print the DataFrame with Length column
print(dataset)



### LRFM Score
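
Before the full computation, a small worked example may help show what averaging, rounding and min-max rescaling do to the four scores. The rows below are made-up score combinations. Two details are worth noting: `round()` in pandas rounds halves to the nearest even integer (1.5 becomes 2, 2.5 also becomes 2), and the rescaling is relative to the smallest and largest values present, so it stretches whatever range occurs onto 1-5.

```python
import pandas as pd

# Made-up score combinations; each score is on the 1-5 scale.
scores = pd.DataFrame({
    "Length.Score": [5, 5, 3, 4, 1],
    "Recency.Score": [1, 1, 1, 1, 2],
    "Frequency.Score": [5, 2, 1, 5, 3],
    "Monetary.Score": [2, 3, 1, 5, 4],
})

# Mean of the four scores per row: 3.25, 2.75, 1.5, 3.75, 2.5
mean_score = scores.mean(axis=1)

# Rounds to the nearest integer, with halves going to the even neighbour.
rounded = mean_score.round()

# Min-max rescale the rounded values onto the range 1 to 5.
low, high = rounded.min(), rounded.max()
rescaled = ((rounded - low) / (high - low) * 4 + 1).round().astype(int)

print(pd.DataFrame({"mean": mean_score, "rounded": rounded, "rescaled": rescaled}))
```

Here the rounded values span only 2 to 4, so after rescaling a rounded 2 becomes 1 and a rounded 4 becomes 5.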

In [None]:
import pandas as pd


# Convert relevant columns to numeric data type
dataset['Length.Score'] = pd.to_numeric(dataset['Length.Score'], errors='coerce')
dataset['Recency.Score'] = pd.to_numeric(dataset['Recency.Score'], errors='coerce')
dataset['Frequency.Score'] = pd.to_numeric(dataset['Frequency.Score'], errors='coerce')
dataset['Monetary.Score'] = pd.to_numeric(dataset['Monetary.Score'], errors='coerce')

# Calculate LRFM.Score as the mean of the individual scores, rounded to the nearest integer
dataset['LRFM.Score'] = ((dataset['Length.Score'] + dataset['Recency.Score'] +
                          dataset['Frequency.Score'] + dataset['Monetary.Score']) / 4).round()

# Scale LRFM.Score to a range of 1 to 5
min_score = dataset['LRFM.Score'].min()
max_score = dataset['LRFM.Score'].max()
dataset['LRFM.Score'] = ((dataset['LRFM.Score'] - min_score) / (max_score - min_score) * 4) + 1

# Round to the nearest integer
dataset['LRFM.Score'] = dataset['LRFM.Score'].round().astype(int)

# Print the DataFrame with the new 'LRFM.Score' column
print(dataset)





In [None]:
import pandas as pd


# Overwrite the numeric 'LRFM.Score' with the four individual scores concatenated as a string (e.g. '5152')
dataset['LRFM.Score'] = dataset['Length.Score'].astype(str) + dataset['Recency.Score'].astype(str) + \
                        dataset['Frequency.Score'].astype(str) + dataset['Monetary.Score'].astype(str)

# Print the DataFrame with the new 'LRFM.Score' column
print(dataset)



### Assign Customer Category

### Segment Descriptions
* **Champions:** Bought recently, buy often and spend the most.
* **Loyal Customers:** Buy on a regular basis. Responsive to promotions.
* **Potential Loyalists:** Recent customers with average frequency.
* **Recent Customers:** Bought most recently, but not often.
* **Promising Customers:** Recent shoppers, but haven’t spent much.
* **Customers Needing Attention:** Above average recency, frequency and monetary values. May not have bought very recently though.
* **About To Sleep:** Below average recency and frequency. Will lose them if not reactivated.
* **At Risk:** Purchased often but a long time ago. Need to bring them back!
* **Can’t Lose Them:** Used to purchase frequently but haven’t returned for a long time.
* **Hibernating:** Last purchase was long back and low number of orders.
* **Lost:** Purchased long time ago and never came back.

In [None]:
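# Create a segment map of 11 segments keyed on a two-digit code:
# the Recency score ('r') followed by a combined Frequency/Monetary score ('fm')

segment_map = {
    r'22': 'hibernating',
    r'[1-2][1-2]': 'lost',
    r'15': 'can\'t lose',
    r'[1-2][3-5]': 'at risk',
    r'3[1-2]': 'about to sleep',
    r'33': 'need attention',
    r'55': 'champions',
    r'[3-5][4-5]': 'loyal customers',
    r'41': 'promising',
    r'51': 'new customers',
    r'[4-5][2-3]': 'potential loyalists'
}

# LRFM.Score is a four-digit string at this point, so it cannot be matched by the
# two-digit patterns above. Build the 'fm' digit instead; taking the floor of the
# mean of the Frequency and Monetary scores is an assumption.
fm_score = (dataset['Frequency.Score'].astype(int) + dataset['Monetary.Score'].astype(int)) // 2

dataset['Segment'] = dataset['Recency.Score'].astype(int).astype(str) + fm_score.astype(str)
dataset['Segment'] = dataset['Segment'].replace(segment_map, regex=True)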

dataset.head()

Unnamed: 0,Category,City,Country,Customer.ID,Customer.Name,Discount,Market,Order.Date,Order.ID,Order.Priority,...,weeknum,Recency,Recency.Score,Frequency,Frequency.Score,Monetary,Monetary.Score,Length,Length.Score,LRFM.Score
0,Office Supplies,Los Angeles,United States,LS-172304,Lycoris Saunders,0.0,US,2011-01-07,CA-2011-130813,High,...,2,1455,1,79,5,1377.40468,2,1450,5,5152
1,Office Supplies,Los Angeles,United States,MV-174854,Mark Van Huff,0.0,US,2011-01-21,CA-2011-148614,Medium,...,4,1441,1,57,2,1810.6391,3,1424,5,5123
2,Office Supplies,Los Angeles,United States,CS-121304,Chad Sievert,0.0,US,2011-08-05,CA-2011-118962,Medium,...,32,1245,1,48,1,-4.0671,1,1373,3,3111
3,Office Supplies,Los Angeles,United States,CS-121304,Chad Sievert,0.0,US,2011-08-05,CA-2011-118962,Medium,...,32,1245,1,48,1,-4.0671,1,1373,3,3111
4,Office Supplies,Los Angeles,United States,AP-109154,Arthur Prichep,0.0,US,2011-09-29,CA-2011-146969,High,...,40,1190,1,87,5,3137.6156,5,1381,4,4155
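
The patterns in `segment_map` are written for two-character codes, one digit for recency followed by one digit for the second score. The sketch below restates the same rules as an ordered list and checks them with `re.fullmatch`, which only accepts a pattern that matches the whole code. The names `SEGMENT_RULES` and `lookup_segment` and the `'unclassified'` fallback are hypothetical, introduced here for illustration.

```python
import re

# Same patterns as segment_map, kept in order: the first full match wins,
# so specific codes such as '22' or '55' come before the broader ranges.
SEGMENT_RULES = [
    (r"22", "hibernating"),
    (r"[1-2][1-2]", "lost"),
    (r"15", "can't lose"),
    (r"[1-2][3-5]", "at risk"),
    (r"3[1-2]", "about to sleep"),
    (r"33", "need attention"),
    (r"55", "champions"),
    (r"[3-5][4-5]", "loyal customers"),
    (r"41", "promising"),
    (r"51", "new customers"),
    (r"[4-5][2-3]", "potential loyalists"),
]


def lookup_segment(code):
    """Return the segment name for a two-digit code, or 'unclassified'."""
    for pattern, name in SEGMENT_RULES:
        if re.fullmatch(pattern, code):
            return name
    return "unclassified"


# Every combination of two scores in 1-5 should land in exactly one segment.
all_codes = [f"{r}{f}" for r in range(1, 6) for f in range(1, 6)]
segments = {code: lookup_segment(code) for code in all_codes}

for code, name in segments.items():
    print(code, name)
```

A code that is not exactly two digits, such as a recency digit followed by a four-digit string, is reported as `'unclassified'` instead of being partially rewritten.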


In [None]:
import pandas as pd


# Define the segment function based on recency, frequency, and monetary scores.
# The thresholds are one reading of the segment descriptions above; more specific
# rules come first so that every branch can be reached.
def segment_customers(row):
    r, f, m = row['Recency.Score'], row['Frequency.Score'], row['Monetary.Score']
    if r >= 4 and f >= 4 and m >= 4:
        return 'Champions'
    elif r == 1 and f >= 4:
        return 'Can’t Lose Them'
    elif r <= 2 and f >= 3:
        return 'At Risk'
    elif r == 3 and f >= 3 and m >= 3:
        return 'Customers Needing Attention'
    elif r >= 3 and f >= 3:
        return 'Loyal Customers'
    elif r >= 4 and f < 2:
        return 'Recent Customers'
    elif r >= 4 and m <= 2:
        return 'Promising Customers'
    elif r >= 3 and f >= 2:
        return 'Potential Loyalists'
    elif r == 3:
        return 'About To Sleep'
    elif r == 2:
        return 'Hibernating'
    else:
        return 'Lost'

# Apply segment function to each row in the dataset
dataset['Segment'] = dataset.apply(segment_customers, axis=1)

# Print the DataFrame with the new 'Segment' column
print(dataset)



### Cluster Analysis

In [None]:
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt


# Select the columns used for clustering (copy to avoid modifying a view of dataset)
cluster_data = dataset[['Sales', 'Quantity', 'Profit']].copy()


# Convert categorical variables to dummy variables if any
# (if you have categorical variables, they need to be preprocessed appropriately)

# Scale the features
scaler = StandardScaler()
scaled_data = scaler.fit_transform(cluster_data)

# Perform K-means clustering
kmeans = KMeans(n_clusters=5, random_state=42)
kmeans.fit(scaled_data)

# Add cluster labels to the dataset
cluster_data['Cluster'] = kmeans.labels_

# Visualize the clusters
plt.figure(figsize=(10, 6))
plt.scatter(cluster_data['Sales'], cluster_data['Profit'], c=cluster_data['Cluster'], cmap='viridis', alpha=0.5)
plt.xlabel('Sales')
plt.ylabel('Profit')
plt.title('Cluster Analysis of Global Superstore Data')
plt.colorbar(label='Cluster')
plt.grid(True)
plt.show()
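
K-means groups points by Euclidean distance, so a column with a large numeric range would dominate the result if left unscaled. `StandardScaler` addresses this by rescaling each column to zero mean and unit variance. As a minimal sketch of what that step does, the same z-scores can be computed by hand with pandas on a few made-up rows (the values are illustrative only); the population standard deviation (`ddof=0`) is used to match the scaler.

```python
import pandas as pd

# Toy feature table; values are illustrative only.
features = pd.DataFrame({
    "Sales": [20.0, 1300.0, 450.0, 75.0],
    "Quantity": [2, 5, 3, 1],
    "Profit": [5.0, -200.0, 120.0, 12.0],
})

# z-score per column: subtract the column mean, divide by the population std.
scaled = (features - features.mean()) / features.std(ddof=0)

print(scaled.round(3))
print(scaled.mean().round(6))
print(scaled.std(ddof=0).round(6))
```

After scaling, every column has mean 0 and standard deviation 1, so Sales, Quantity and Profit contribute on a comparable footing.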


In [None]:
import pandas as pd
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Work on a copy of the RFM columns so the full dataset is not overwritten
rfm_data = dataset[['Recency', 'Frequency', 'Monetary']].copy()
# Perform K-means clustering
kmeans = KMeans(n_clusters=4, random_state=42)
kmeans.fit(rfm_data)

# Add cluster labels to the RFM data
rfm_data['Cluster'] = kmeans.labels_

# Visualize the clusters
plt.figure(figsize=(10, 6))
plt.scatter(rfm_data['Frequency'], rfm_data['Monetary'], c=rfm_data['Cluster'], cmap='viridis', alpha=0.5)
plt.xlabel('Frequency')
plt.ylabel('Monetary')
plt.title('Customer Cluster Analysis based on RFM Values')
plt.colorbar(label='Cluster')
plt.grid(True)
plt.show()
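
Because the merged dataset holds one row per order line, a customer with many lines is counted many times when the clustering runs on it directly. One possible refinement, sketched here on made-up rows, is to collapse the table to one row per customer first. Taking the minimum recency (the most recent order) and the first value of the already customer-level Frequency and Monetary columns is an assumption about how the collapse should be done.

```python
import pandas as pd

# Toy merged rows: Frequency and Monetary repeat for each customer,
# Recency differs per order line. Values are illustrative only.
rows = pd.DataFrame({
    "Customer.Name": ["Ann", "Ann", "Bob", "Bob", "Bob", "Cy"],
    "Recency": [1455, 7, 730, 400, 90, 30],
    "Frequency": [2, 2, 3, 3, 3, 1],
    "Monetary": [120.5, 120.5, -4.0, -4.0, -4.0, 900.0],
})

# Collapse to one row per customer before clustering.
customers = rows.groupby("Customer.Name").agg(
    Recency=("Recency", "min"),
    Frequency=("Frequency", "first"),
    Monetary=("Monetary", "first"),
)

print(customers)
```

The resulting `customers` table could then be scaled and passed to `KMeans` in place of the row-level columns.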
