# 🎯 Customer Segmentation Using K-Means (Unsupervised Learning)

**Industry-Grade Customer Behavior Analysis with Transactional Data**

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/RansiluRanasinghe/Customer-Segmentation-K-Means/blob/main/notebook.ipynb)

---

## 📌 Project Overview

This notebook implements an **end-to-end, production-aligned unsupervised machine learning pipeline** for customer segmentation using **K-Means clustering**.

The project simulates how machine learning engineers and data science teams analyze large-scale retail transaction data to discover meaningful customer segments **without labeled data**. The focus is on **data validation, feature engineering, model robustness, and business interpretability**, rather than on predictive accuracy.

---

## 🎯 What This Notebook Does

Using real-world online retail transaction data, the pipeline:

1. ✅ **Cleans and validates** raw transactional records
2. ✅ **Aggregates data** at the customer level
3. ✅ **Engineers RFM** (Recency, Frequency, Monetary) behavioral features
4. ✅ **Applies K-Means clustering** to identify natural customer groups
5. ✅ **Evaluates clusters** using internal validation metrics (silhouette score and Davies-Bouldin index)
6. ✅ **Interprets results** from a business and operational perspective
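The steps above can be sketched end to end on a handful of made-up transactions. This is a minimal illustration, not the notebook's actual implementation: the toy values, the choice of two clusters, and the use of `StandardScaler` alone (without a power transform) are all assumptions made for brevity.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Toy transactions (illustrative values, not taken from the dataset).
toy = pd.DataFrame({
    "CustomerID": [1, 1, 2, 2, 2, 3, 4, 4, 5, 6],
    "InvoiceNO": ["A1", "A2", "B1", "B2", "B3", "C1", "D1", "D2", "E1", "F1"],
    "InvoiceDate": pd.to_datetime([
        "2011-01-05", "2011-11-20", "2011-10-01", "2011-11-15", "2011-12-01",
        "2010-12-10", "2011-06-01", "2011-12-05", "2011-02-14", "2011-12-08",
    ]),
    "Quantity": [10, 5, 100, 80, 120, 2, 20, 25, 1, 60],
    "UnitPrice": [2.5, 3.0, 1.2, 1.5, 1.1, 9.9, 4.0, 4.2, 15.0, 2.0],
})
toy["Revenue"] = toy["Quantity"] * toy["UnitPrice"]

# Reference date: one day after the last transaction, so recency is at least 1.
snapshot = toy["InvoiceDate"].max() + pd.Timedelta(days=1)

# One row per customer: days since last purchase, invoice count, total spend.
rfm = toy.groupby("CustomerID").agg(
    Recency=("InvoiceDate", lambda s: (snapshot - s.max()).days),
    Frequency=("InvoiceNO", "nunique"),
    Monetary=("Revenue", "sum"),
)

# Scale the features so no single one dominates the Euclidean distance.
X = StandardScaler().fit_transform(rfm)

model = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = model.fit_predict(X)
score = silhouette_score(X, labels)

rfm["Cluster"] = labels
print(rfm)
print(f"Silhouette score: {score:.3f}")
```

On real data the number of clusters would be chosen by comparing metrics across several values of `n_clusters` rather than fixed in advance.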

---

## 💡 Design Philosophy

This notebook reflects **industry best practices** for unsupervised modeling and is designed to be:
- 📊 **Reproducible**: Clear, sequential workflow
- 🔍 **Interpretable**: Business-focused insights
- 🔧 **Extendable**: Ready for real-world customer analytics use cases

---

**Author:** Ransilu Ranasinghe  
**GitHub:** [RansiluRanasinghe](https://github.com/RansiluRanasinghe)  
**LinkedIn:** [ransilu-ranasinghe](https://www.linkedin.com/in/ransilu-ranasinghe-a596792ba)

---

In [2]:
import numpy as np
import pandas as pd
from datetime import datetime
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler, PowerTransformer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score

In [3]:
plt.style.use("seaborn-v0_8-darkgrid")

import warnings
warnings.filterwarnings("ignore")

In [4]:
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
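Besides seeding NumPy's global state, the seed can also be passed directly to estimators through their `random_state` parameter, which keeps the result independent of whatever else has consumed random numbers. The sketch below uses synthetic points (an assumption for illustration) to show two K-Means fits with the same `random_state` producing the same labels.

```python
import numpy as np
from sklearn.cluster import KMeans

RANDOM_SEED = 42

# Synthetic 2-D points, used only to demonstrate repeatability.
rng = np.random.default_rng(RANDOM_SEED)
points = rng.normal(size=(60, 2))

# Two independent fits with the same explicit seed.
labels_a = KMeans(n_clusters=3, n_init=10, random_state=RANDOM_SEED).fit_predict(points)
labels_b = KMeans(n_clusters=3, n_init=10, random_state=RANDOM_SEED).fit_predict(points)

same = np.array_equal(labels_a, labels_b)
print("Identical labels across runs:", same)
```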

#### Loading the data

In [5]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("mashlyn/online-retail-ii-uci")

print("Path to dataset files:", path)

Downloading from https://www.kaggle.com/api/v1/datasets/download/mashlyn/online-retail-ii-uci?dataset_version_number=3...
100%|██████████| 14.5M/14.5M [00:00<00:00, 191MB/s]
Extracting files...
Path to dataset files: /root/.cache/kagglehub/datasets/mashlyn/online-retail-ii-uci/versions/3

In [6]:
import os

# Reuse the directory returned by kagglehub.dataset_download instead of hard-coding it.
DATASET_PATH = path

os.listdir(DATASET_PATH)

['online_retail_II.csv']

In [9]:
df = pd.read_csv(os.path.join(DATASET_PATH, "online_retail_II.csv"), encoding="ISO-8859-1")

display(df.head(5))

|   | Invoice | StockCode | Description | Quantity | InvoiceDate | Price | Customer ID | Country |
|---|---|---|---|---|---|---|---|---|
| 0 | 489434 | 85048 | 15CM CHRISTMAS GLASS BALL 20 LIGHTS | 12 | 2009-12-01 07:45:00 | 6.95 | 13085.0 | United Kingdom |
| 1 | 489434 | 79323P | PINK CHERRY LIGHTS | 12 | 2009-12-01 07:45:00 | 6.75 | 13085.0 | United Kingdom |
| 2 | 489434 | 79323W | WHITE CHERRY LIGHTS | 12 | 2009-12-01 07:45:00 | 6.75 | 13085.0 | United Kingdom |
| 3 | 489434 | 22041 | RECORD FRAME 7" SINGLE SIZE | 48 | 2009-12-01 07:45:00 | 2.1 | 13085.0 | United Kingdom |
| 4 | 489434 | 21232 | STRAWBERRY CERAMIC TRINKET BOX | 24 | 2009-12-01 07:45:00 | 1.25 | 13085.0 | United Kingdom |


In [12]:
print("Dataset shape:", df.shape)
print("Column names:", list(df.columns))
print("\nDatatypes: \n\n", df.dtypes)

Dataset shape: (1067371, 8)
Column names: ['Invoice', 'StockCode', 'Description', 'Quantity', 'InvoiceDate', 'Price', 'Customer ID', 'Country']

Datatypes: 

 Invoice         object
StockCode       object
Description     object
Quantity         int64
InvoiceDate     object
Price          float64
Customer ID    float64
Country         object
dtype: object
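The dtypes above show `InvoiceDate` stored as `object` (plain strings) and `Customer ID` as `float64`, which is how pandas represents an integer column that contains missing values. One possible way to normalize both is sketched below on a two-row stand-in frame; the sample values are illustrative, and whether the notebook performs this exact conversion later is not shown in this section.

```python
import pandas as pd

# Tiny stand-in frame mirroring the two problematic dtypes (illustrative values).
sample = pd.DataFrame({
    "InvoiceDate": ["2009-12-01 07:45:00", "2009-12-01 07:46:00"],
    "Customer ID": [13085.0, None],
})

# Parse timestamps; errors="coerce" turns unparseable strings into NaT.
sample["InvoiceDate"] = pd.to_datetime(sample["InvoiceDate"], errors="coerce")

# The nullable integer dtype keeps missing IDs as <NA> instead of forcing floats.
sample["Customer ID"] = sample["Customer ID"].astype("Int64")

print(sample.dtypes)
```

A proper datetime column is what makes recency calculations (date differences) possible in the RFM step.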


#### Dataset Analysis

In [17]:
column_mapping = {
    "Invoice" : "InvoiceNO",
    "Price" : "UnitPrice",
    "Customer ID" : "CustomerID"
}

df = df.rename(columns=column_mapping)

display(df.head(5))

|   | InvoiceNO | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|
| 0 | 489434 | 85048 | 15CM CHRISTMAS GLASS BALL 20 LIGHTS | 12 | 2009-12-01 07:45:00 | 6.95 | 13085.0 | United Kingdom |
| 1 | 489434 | 79323P | PINK CHERRY LIGHTS | 12 | 2009-12-01 07:45:00 | 6.75 | 13085.0 | United Kingdom |
| 2 | 489434 | 79323W | WHITE CHERRY LIGHTS | 12 | 2009-12-01 07:45:00 | 6.75 | 13085.0 | United Kingdom |
| 3 | 489434 | 22041 | RECORD FRAME 7" SINGLE SIZE | 48 | 2009-12-01 07:45:00 | 2.1 | 13085.0 | United Kingdom |
| 4 | 489434 | 21232 | STRAWBERRY CERAMIC TRINKET BOX | 24 | 2009-12-01 07:45:00 | 1.25 | 13085.0 | United Kingdom |


In [18]:
required_columns = ["InvoiceNO", "Quantity", "UnitPrice", "InvoiceDate", "CustomerID"]
missing_columns = [col for col in required_columns if col not in df.columns]

if missing_columns:
    raise ValueError(f"Missing required columns: {missing_columns}")
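If the same check is needed in more than one place, it could be wrapped in a small helper. The function name `validate_columns` and the demo frame are hypothetical, introduced here only for illustration.

```python
import pandas as pd


def validate_columns(frame, required):
    """Raise ValueError listing every required column absent from `frame`."""
    missing = [col for col in required if col not in frame.columns]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")


# Demo frame that deliberately lacks "UnitPrice".
demo = pd.DataFrame({"InvoiceNO": ["489434"], "Quantity": [12]})

try:
    validate_columns(demo, ["InvoiceNO", "Quantity", "UnitPrice"])
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

Failing fast here means a renamed or missing column surfaces as one clear error instead of a `KeyError` deep inside the feature engineering code.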

In [19]:
print("Record count: ", len(df))
print("\n Missing values per column: ")
print(df.isna().sum())

Record count:  1067371

 Missing values per column: 
InvoiceNO           0
StockCode           0
Description      4382
Quantity            0
InvoiceDate         0
UnitPrice           0
CustomerID     243007
Country             0
dtype: int64


In [21]:
print("Checking business rule violations:")

# Quantity and price must be strictly positive, so zero values are counted too.
print("Non-positive quantities:", (df["Quantity"] <= 0).sum())
print("Non-positive prices:", (df["UnitPrice"] <= 0).sum())

# Rows with no customer ID cannot be attributed to any customer.
print("Missing customer IDs:", df["CustomerID"].isna().sum())

Checking business rule violations:
Non-positive quantities: 22950
Non-positive prices: 6207
Missing customer IDs: 243007
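Once the violations are counted, a cleaning step would typically keep only rows that pass every rule. The sketch below shows one plausible filter on a four-row toy frame; the sample rows are made up, and the notebook's own cleaning logic is not shown in this section, so treat this as an assumption about how it might look.

```python
import numpy as np
import pandas as pd

# Toy rows: one valid, one cancellation, one zero price, one missing customer.
raw = pd.DataFrame({
    "InvoiceNO": ["489434", "C489449", "489450", "489451"],
    "Quantity": [12, -3, 5, 4],
    "UnitPrice": [6.95, 2.10, 0.0, 1.25],
    "CustomerID": [13085.0, 13085.0, 14000.0, np.nan],
})

# A row is kept only if it satisfies all three business rules.
valid = (
    (raw["Quantity"] > 0)
    & (raw["UnitPrice"] > 0)
    & raw["CustomerID"].notna()
)
clean = raw.loc[valid].copy()

print(f"Kept {len(clean)} of {len(raw)} rows")
```

Building a single boolean mask makes it easy to report how many rows each rule removes before committing to the filter.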


In [22]:
cancellations = df[df["InvoiceNO"].astype(str).str.startswith("C")]
print("Cancellations: ", len(cancellations))

Cancellations:  19494
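The cancellation count (19494) is lower than the number of non-positive quantities reported earlier (22950), so at least 3456 rows with a non-positive quantity do not belong to a cancellation invoice. A quick way to separate the two groups is sketched below on made-up rows; the toy values are assumptions for illustration only.

```python
import pandas as pd

# Toy rows: a normal sale, two cancellations, and a negative quantity
# on an invoice that is not flagged as cancelled.
tx = pd.DataFrame({
    "InvoiceNO": ["489434", "C489449", "489460", "C489470"],
    "Quantity": [12, -3, -40, -1],
})

is_cancel = tx["InvoiceNO"].astype(str).str.startswith("C")
is_nonpositive = tx["Quantity"] <= 0

# Non-positive quantities explained by a cancellation invoice.
cancelled_nonpositive = int((is_cancel & is_nonpositive).sum())
# Non-positive quantities with no cancellation flag; these need a separate look.
unexplained_nonpositive = int((~is_cancel & is_nonpositive).sum())

print("Cancelled, qty <= 0:    ", cancelled_nonpositive)
print("Not cancelled, qty <= 0:", unexplained_nonpositive)
```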


In [23]:
print("Unique Products: ", df["StockCode"].nunique())
print("Unique Customers: ", df["CustomerID"].nunique())
print("Countries represented: ", df["Country"].nunique())

Unique Products:  5305
Unique Customers:  5942
Countries represented:  43
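Note that `nunique()` skips missing values by default, so the customer count above covers only rows with a known `CustomerID`. The small example below, using made-up IDs, shows the difference between the default behavior and `dropna=False`.

```python
import numpy as np
import pandas as pd

# Illustrative IDs: two distinct customers plus two rows with no ID.
ids = pd.Series([13085.0, 13085.0, 14000.0, np.nan, np.nan], name="CustomerID")

known_customers = ids.nunique()            # NaN is excluded by default
with_missing = ids.nunique(dropna=False)   # NaN counts as one extra value
missing_rows = int(ids.isna().sum())       # rows that carry no customer ID

print("Known customers:", known_customers)
print("Distinct values incl. NaN:", with_missing)
print("Rows without an ID:", missing_rows)
```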
