# CUSTOMER SEGMENTATION

### The aim is to group customers by purchasing behavior to support tasks such as marketing and personalization.
### I will conduct an RFM analysis:
### R - Recency - How recently a customer made a purchase
### F - Frequency - How often a customer makes purchases
### M - Monetary - How much a customer spends
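
As a minimal sketch of the three RFM measures, the snippet below computes them on a tiny made-up transaction table. The values, the `Amount` column and the `tx` / `rfm_toy` names are illustrative assumptions, not part of the retail dataset.

```python
import pandas as pd

# Toy transactions (made-up values, for illustration only)
tx = pd.DataFrame({
    "CustomerID": [1, 1, 2, 2, 2, 3],
    "InvoiceNo": ["A1", "A2", "B1", "B2", "B3", "C1"],
    "InvoiceDate": pd.to_datetime([
        "2011-01-05", "2011-03-01", "2011-02-10",
        "2011-02-20", "2011-03-09", "2010-12-15",
    ]),
    "Amount": [20.0, 35.0, 10.0, 15.0, 5.0, 100.0],
})

# Reference date: one day after the latest transaction in the toy data
reference_date = tx["InvoiceDate"].max() + pd.Timedelta(days=1)

rfm_toy = tx.groupby("CustomerID").agg(
    LastPurchase=("InvoiceDate", "max"),   # used to derive Recency
    Frequency=("InvoiceNo", "nunique"),    # number of distinct invoices
    Monetary=("Amount", "sum"),            # total spend
)

# Recency: whole days between the reference date and the last purchase
rfm_toy["Recency"] = (reference_date - rfm_toy["LastPurchase"]).dt.days

print(rfm_toy[["Recency", "Frequency", "Monetary"]])
```

A lower Recency means a more recent purchase, so customer 2 in this toy table is both the most recent and the most frequent buyer, while customer 3 has spent the most.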

In [1]:
# Imports

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.ticker import LogLocator
from datetime import datetime
import joblib

In [2]:
# Creating dataframe from first dataset

df1 = pd.read_excel("online_retail.xlsx")
df1.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom


In [3]:
df1.tail()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
541904,581587,22613,PACK OF 20 SPACEBOY NAPKINS,12,2011-12-09 12:50:00,0.85,12680.0,France
541905,581587,22899,CHILDREN'S APRON DOLLY GIRL,6,2011-12-09 12:50:00,2.1,12680.0,France
541906,581587,23254,CHILDRENS CUTLERY DOLLY GIRL,4,2011-12-09 12:50:00,4.15,12680.0,France
541907,581587,23255,CHILDRENS CUTLERY CIRCUS PARADE,4,2011-12-09 12:50:00,4.15,12680.0,France
541908,581587,22138,BAKING SET 9 PIECE RETROSPOT,3,2011-12-09 12:50:00,4.95,12680.0,France


In [None]:
# Creating dataframe from second dataset

df2 = pd.read_excel("online_retail_II.xlsx")
df2.head()

In [None]:
df2.tail()

In [None]:
df1.info()

In [None]:
df2.info()

#### Both dataframes share the same features with similar datatypes. However, three column names differ between them: `InvoiceNo`, `UnitPrice` and `CustomerID` in df1 appear as `Invoice`, `Price` and `Customer ID` in df2.
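
One way to spot such mismatches without reading the `info()` output by eye is a set difference on the column names. The sketch below uses empty stand-in frames (`left`, `right`) carrying the column names described above, so it runs without loading the Excel files. The stand-in names are assumptions for illustration.

```python
import pandas as pd

# Empty stand-ins with the column names of df1 and df2 (no data loaded)
cols_df1 = ["InvoiceNo", "StockCode", "Description", "Quantity",
            "InvoiceDate", "UnitPrice", "CustomerID", "Country"]
cols_df2 = ["Invoice", "StockCode", "Description", "Quantity",
            "InvoiceDate", "Price", "Customer ID", "Country"]
left = pd.DataFrame(columns=cols_df1)
right = pd.DataFrame(columns=cols_df2)

# Columns present in one frame but not in the other
only_in_left = sorted(set(left.columns) - set(right.columns))
only_in_right = sorted(set(right.columns) - set(left.columns))

print("Only in df1:", only_in_left)
print("Only in df2:", only_in_right)
```

With the real data, the same two lines can be applied to `df1.columns` and `df2.columns` directly.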

### Renaming features/column names

In [None]:
# Renaming column names in df2

df2.rename(columns={
    "Invoice":"InvoiceNo",
    "Price":"UnitPrice",
    "Customer ID":"CustomerID"
}, inplace=True)

In [None]:
# Checking to see if column names have changed

df2.columns

In [None]:
# Concatenate the two dataframes into one dataframe

df = pd.concat([df1,df2], ignore_index=True)
df.shape

In [None]:
# Checking for missing values

df.isna().sum()

In [None]:
df.info()

In [None]:
# Inspect rows where CustomerID is missing
null_df = df[df["CustomerID"].isna()]
null_df.head(10)

In [None]:
unique_years = df['InvoiceDate'].dt.year.unique()
print(sorted(unique_years))

## Data Cleaning

### Handle Missing Values:

#### -> CustomerID has missing values, which are essential for customer-based segmentation. I'll drop rows where CustomerID is missing.
#### -> Description also has missing values, but it is not critical for segmentation and can be ignored for now.

### Remove Negative Quantities:

#### -> Negative quantities likely indicate returns. We'll remove them for now.

### Create New Features:

#### -> Total revenue per row: TotalPrice = Quantity * UnitPrice

### Dropping Columns based on Domain knowledge

#### -> The Description column does not provide customer-level behavior insights directly relevant for segmentation.
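
Before applying these rules to the full data, it can help to try them on a small made-up table and count what each rule would remove. The sketch below does that. The `toy` frame and its values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy rows (made-up values): one missing CustomerID, one negative quantity
toy = pd.DataFrame({
    "CustomerID": [17850.0, np.nan, 12680.0, 12680.0],
    "Quantity": [6, 2, -1, 4],
    "UnitPrice": [2.55, 1.00, 4.15, 4.95],
})

# Boolean masks for each cleaning rule
missing_customer = toy["CustomerID"].isna()
negative_quantity = toy["Quantity"] < 0

print("Rows with missing CustomerID:", int(missing_customer.sum()))
print("Rows with negative Quantity:", int(negative_quantity.sum()))

# Keep rows that pass both rules; .copy() gives an independent frame
kept = toy[~missing_customer & ~negative_quantity].copy()

# New feature: revenue per row
kept["TotalPrice"] = kept["Quantity"] * kept["UnitPrice"]
print(kept)
```

Counting the affected rows first makes it easier to judge how much data each rule discards before committing to it.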

In [None]:
# Checking for True Duplicates

len(df[df.duplicated(keep=False)])
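
Note that `duplicated(keep=False)` flags every member of a duplicate group, so the count above is larger than the number of rows `drop_duplicates()` will later remove. A small sketch on made-up rows (the `toy_dupes` frame is an assumption for illustration) shows the difference:

```python
import pandas as pd

# Three identical rows plus one distinct row (made-up values)
toy_dupes = pd.DataFrame({
    "InvoiceNo": ["A1", "A1", "A1", "B1"],
    "StockCode": ["X", "X", "X", "Y"],
    "Quantity": [1, 1, 1, 2],
})

# keep=False marks every row that has at least one identical twin
all_members = int(toy_dupes.duplicated(keep=False).sum())

# The default (keep="first") marks only the extra copies
extra_copies = int(toy_dupes.duplicated().sum())

# drop_duplicates() removes exactly the extra copies
rows_after = len(toy_dupes.drop_duplicates())

print(all_members, extra_copies, rows_after)
```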

In [None]:
def wrangle(df):
    # 1. Sort by InvoiceDate
    df = df.sort_values("InvoiceDate").reset_index(drop=True)

    # 2. Drop rows with null values in CustomerID
    df = df.dropna(subset=["CustomerID"])

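    # 3. Keep only rows with positive quantities (drops returns and zero-quantity rows);
    #    .copy() avoids a SettingWithCopyWarning when columns are assigned below
    df = df[df["Quantity"] > 0].copy()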

    # 4. Convert CustomerID to int for clarity
    df["CustomerID"] = df["CustomerID"].astype(int)

    # 5. Create a new feature for total revenue per row
    df["TotalPrice"] = df["UnitPrice"] * df["Quantity"]

    # 6. Drop the Description column
    df = df.drop(columns=["Description"])

    # 7. Remove duplicate rows
    df = df.drop_duplicates()

    return df


In [None]:
df = wrangle(df)
df.isna().sum()

In [None]:
df.info()

In [None]:
df.head()

### Group Transactions By Customer

In [None]:
customer_data = df.groupby("CustomerID").agg({
    'InvoiceNo':'nunique', # Frequency: number of transactions
    'TotalPrice':'sum',    # Monetary: total money spent
    'InvoiceDate':'max'    # Recency: Last Purchase Date
}).reset_index()

customer_data.rename(columns={
    'InvoiceNo':'Frequency',
    'TotalPrice':'Monetary',
    'InvoiceDate':'LastPurchaseDate'
}, inplace=True)

customer_data.head()

## Exploratory Data Analysis (EDA)

### 1. Frequency of Transactions

In [None]:
sns.histplot(customer_data['Frequency'], kde=False, bins=30)
plt.yscale('log')  # Apply log scale to the y-axis

plt.title('Customer Transaction Frequency (Log Scale)', fontsize=14)
plt.xlabel('Number of Transactions', fontsize=12)
plt.ylabel('Number of Customers (Log Scale)', fontsize=12)

# Adding grid for readability
plt.grid(visible=True, which='both', linestyle='--', linewidth=0.5)

# Adjust figure size for reports
plt.gcf().set_size_inches(8, 6)

plt.show();


### 2. Monetary Distribution:

In [None]:
sns.histplot(customer_data['Monetary'], kde=True, bins=100)
plt.title('Customer Monetary Value Distribution')
plt.xlabel('Total Spend')
plt.ylabel('Number of Customers')
plt.show();

### 3. Recency Analysis:

In [None]:
# 1. Ensure LastPurchaseDate is a datetime column
customer_data['LastPurchaseDate'] = pd.to_datetime(customer_data['LastPurchaseDate'])

# 2. Define the reference point as one day after the most recent purchase in the dataset
most_recent_date = customer_data['LastPurchaseDate'].max() + pd.Timedelta(days=1)

# 3. Calculate recency (days since last purchase) for each customer
customer_data['Recency'] = (most_recent_date - customer_data['LastPurchaseDate']).dt.days

# 4. Plot the recency distribution
sns.histplot(customer_data['Recency'], kde=False, bins=30, color='skyblue')
plt.title('Customer Recency Distribution')
plt.xlabel('Days Since Last Purchase')
plt.ylabel('Number of Customers')
plt.grid(visible=True, which="both", linestyle='--', linewidth=0.5)
plt.gcf().set_size_inches(8, 6)
plt.show()