# Customer Segmentation in Retail Using RFM and K-Means Clustering

Understanding how customers buy helps a retail business decide whom to target and how. This notebook segments customers in two ways: with K-Means clustering on RFM (Recency, Frequency, Monetary) features, and with rule-based RFM scores.

## Data Preparation

### Loading Data

In [None]:
import pandas as pd
import numpy as np

# Load the dataset, then add a cancellation status and a per-line total amount
file_path = 'C:/Users/5060916/Desktop/onlinesales.csv'
df = pd.read_csv(file_path, quotechar='"', delimiter=',')
# Invoice numbers that start with 'C' mark cancellations; cast to str so numeric IDs do not break the check
df['status'] = np.where(df['InvoiceNo'].astype(str).str.startswith('C'), 'canceled', 'active')
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'], format='%m/%d/%y %H:%M')
df['TotalAmount'] = df['Quantity'] * df['UnitPrice']

### Filtering Active Invoices

In [None]:
# We only keep invoices that are active (not canceled)
active_invoices = df[df['status'] != 'canceled']

## Data Cleaning and Analysis

### Null CustomerID Analysis

**Null Ratio:** 16% of invoices have null CustomerIDs.

**Country Analysis:** Missing CustomerIDs show no significant correlation with country.
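The figures above are reported without the code behind them. The sketch below shows one way such a null ratio and a per-country breakdown could be computed. The toy frame and the names `invoices`, `null_ratio`, and `null_by_country` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for the invoice data (all values are made up for illustration)
toy = pd.DataFrame({
    'InvoiceNo': ['1', '1', '2', '3', '4'],
    'CustomerID': [101.0, 101.0, None, 102.0, None],
    'Country': ['UK', 'UK', 'UK', 'France', 'France'],
})

# Keep one row per invoice so the ratio counts invoices rather than line items
invoices = toy.drop_duplicates(subset='InvoiceNo')

# Share of invoices that have no CustomerID
null_ratio = invoices['CustomerID'].isna().mean()

# Share of invoices without a CustomerID within each country
null_by_country = invoices['CustomerID'].isna().groupby(invoices['Country']).mean()

print(f"Null ratio: {null_ratio:.0%}")
print(null_by_country)
```

If the per-country shares are all close to the overall ratio, as in this toy data, country does not explain the missing IDs.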

### Handling Negative and Zero Amounts

In [None]:
# We remove transactions with zero or negative amounts and null CustomerIDs
active_invoices = active_invoices[active_invoices['TotalAmount'] > 0]
active_invoices = active_invoices[~active_invoices['CustomerID'].isnull()]

### Removing Non-Item Stock Codes

In [None]:
# We exclude non-item codes like postage and bank charges
active_invoices = active_invoices[~active_invoices['StockCode'].isin(['POST','DOT','BANK CHARGES'])]

## RFM Analysis

### Calculating RFM Metrics

In [None]:
# Recency, Frequency, and Monetary calculations for each customer
# Recency is measured in days from the most recent invoice in the data
analysis_date = active_invoices['InvoiceDate'].max()
rfm = active_invoices.groupby('CustomerID').agg({
    'InvoiceDate': lambda x: (analysis_date - x.max()).days,  # Recency: days since last purchase
    'InvoiceNo': 'nunique',  # Frequency: number of distinct invoices
    'TotalAmount': 'sum'  # Monetary: total spend (Quantity * UnitPrice, computed at load time)
}).reset_index()
rfm.columns = ['CustomerID', 'Recency', 'Frequency', 'Monetary']
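To make the three metrics concrete, here is a small worked example on made-up transactions. It is a sketch rather than the notebook's own cell: the `toy` data is hypothetical, and recency is computed by taking each customer's last purchase date first and subtracting afterwards, which is a vectorized alternative to the lambda.

```python
import pandas as pd

# Hypothetical transactions: customer 1 has two invoices, customer 2 has one
toy = pd.DataFrame({
    'CustomerID': [1, 1, 1, 2],
    'InvoiceNo': ['A1', 'A1', 'A2', 'B1'],
    'InvoiceDate': pd.to_datetime(
        ['2011-01-01', '2011-01-01', '2011-01-08', '2011-01-10']),
    'TotalAmount': [10.0, 5.0, 20.0, 7.5],
})

# Reference date: the most recent invoice in the data
analysis_date = toy['InvoiceDate'].max()

toy_rfm = toy.groupby('CustomerID').agg(
    LastPurchase=('InvoiceDate', 'max'),
    Frequency=('InvoiceNo', 'nunique'),  # distinct invoices, not line items
    Monetary=('TotalAmount', 'sum'),     # total spend
).reset_index()

# Recency: whole days between the reference date and the last purchase
toy_rfm['Recency'] = (analysis_date - toy_rfm['LastPurchase']).dt.days
toy_rfm = toy_rfm[['CustomerID', 'Recency', 'Frequency', 'Monetary']]

print(toy_rfm)
```

Customer 1 last bought two days before the reference date, on two distinct invoices worth 35.0 in total. Customer 2 bought on the reference date itself, so their recency is 0.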

## Clustering

### Data Normalization and Clustering

`StandardScaler` standardizes each feature by subtracting its mean and dividing by its standard deviation, giving every feature zero mean and unit variance. This matters for K-Means clustering for two reasons:

**Equal Weight to Features:** K-Means relies on the Euclidean distance between points. Without scaling, features with larger ranges (e.g., Monetary) would dominate the distance, while features with smaller ranges (e.g., Recency or Frequency) would barely affect the result. Scaling puts all three features on a comparable footing.

**Improved Convergence:** K-Means often converges in fewer iterations when the features are on the same scale.
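The effect of scaling on Euclidean distance can be seen on three made-up customers. This is a sketch using plain NumPy. The values in `X` are hypothetical, and the z-score line reproduces the transform `StandardScaler` applies (mean removal, then division by the population standard deviation).

```python
import numpy as np

# Hypothetical customers; columns are Recency (days), Frequency, Monetary
X = np.array([
    [10.0, 2.0, 500.0],    # customer 0
    [300.0, 1.0, 520.0],   # customer 1: differs from 0 mainly in Recency
    [12.0, 3.0, 5000.0],   # customer 2: differs from 0 mainly in Monetary
])

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(a - b))

# Distances on the raw features
raw_d01 = euclidean(X[0], X[1])
raw_d02 = euclidean(X[0], X[2])

# z-score standardization per column (population std, ddof=0)
Z = (X - X.mean(axis=0)) / X.std(axis=0)
z_d01 = euclidean(Z[0], Z[1])
z_d02 = euclidean(Z[0], Z[2])

print(f"raw:    d(0,1)={raw_d01:.1f}  d(0,2)={raw_d02:.1f}")
print(f"scaled: d(0,1)={z_d01:.2f}  d(0,2)={z_d02:.2f}")
```

On the raw values the Monetary gap makes customer 2 look far more distant than customer 1. After standardization the two distances are of similar size, so a large Recency gap counts about as much as a large Monetary gap.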

In [None]:
# Normalizing the RFM data for clustering
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

scaler = StandardScaler()
rfm_normalized = scaler.fit_transform(rfm[['Recency', 'Frequency', 'Monetary']])

# Applying K-Means clustering to segment customers
kmeans = KMeans(n_clusters=4, init='k-means++', max_iter=300, n_init=10, random_state=0)
rfm['Cluster'] = kmeans.fit_predict(rfm_normalized)

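**Cluster Summary:** K-Means cluster numbers carry no inherent meaning, so the mapping below holds only for this run and should be rechecked against each cluster's RFM profile if the data or parameters change.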
- Cluster 0: Core Customers
- Cluster 1: New Customers
- Cluster 2: Loyal Customers
- Cluster 3: Low Priority Customers
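Names like these are usually chosen by looking at the average RFM values per cluster. The sketch below shows one way to build such a profile table. The toy data, its cluster labels, and the `profile` name are assumptions made for illustration and do not reproduce the notebook's actual clusters.

```python
import pandas as pd

# Toy RFM table with cluster labels already assigned (all values are made up)
toy = pd.DataFrame({
    'Recency':   [5, 8, 200, 250, 30, 40],
    'Frequency': [20, 25, 1, 2, 6, 5],
    'Monetary':  [5000.0, 7000.0, 100.0, 150.0, 900.0, 1100.0],
    'Cluster':   [0, 0, 1, 1, 2, 2],
})

# Average RFM values and customer count per cluster
profile = toy.groupby('Cluster').agg(
    Recency=('Recency', 'mean'),
    Frequency=('Frequency', 'mean'),
    Monetary=('Monetary', 'mean'),
    Customers=('Recency', 'size'),
)

print(profile)
```

In this toy table, cluster 0 (recent, frequent, high spend) would be a natural candidate for a "core" label, while cluster 1 (long inactive, low spend) looks like low priority.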

## Segmenting Using RFM Scores

### Calculating RFM Scores

In [None]:
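# Splitting each RFM metric into four bins at quantile cut points and assigning scores 1-4 (higher is better)
# Recency and Monetary use quartiles; Frequency uses the 40th, 60th, and 75th percentiles instead
# Note: pd.cut raises an error if any two bin edges are equal
r_quarters = rfm['Recency'].quantile(q=[0.0, 0.25, 0.5, 0.75, 1]).to_list()
f_quarters = rfm['Frequency'].quantile(q=[0.0, 0.4, 0.6, 0.75, 1]).to_list()
m_quarters = rfm['Monetary'].quantile(q=[0.0, 0.25, 0.5, 0.75, 1]).to_list()

# Recency labels are reversed because a smaller recency (a more recent purchase) is better
rfm['r_score'] = pd.cut(rfm['Recency'], bins=r_quarters, labels=[4, 3, 2, 1], include_lowest=True)
rfm['f_score'] = pd.cut(rfm['Frequency'], bins=f_quarters, labels=[1, 2, 3, 4], include_lowest=True)
rfm['m_score'] = pd.cut(rfm['Monetary'], bins=m_quarters, labels=[1, 2, 3, 4], include_lowest=True)
# Concatenate the three scores into a three-character code such as '444'
rfm['rfm_score'] = rfm['r_score'].astype(str) + rfm['f_score'].astype(str) + rfm['m_score'].astype(str)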
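The scoring step can be checked on a tiny example. The sketch below uses made-up values. The first half shows how quantile edges and reversed labels turn recency into a score. The second half shows that heavily tied values can produce repeated quartile edges, which `pd.cut` rejects. That may be one reason to pick different cut points for a tied metric such as Frequency, though the notebook does not state its reason.

```python
import pandas as pd

# Hypothetical recency values in days (smaller means more recent)
recency = pd.Series([1, 10, 20, 30, 40, 50, 60, 70])

# Quartile edges: min, 25%, 50%, 75%, max
edges = recency.quantile(q=[0.0, 0.25, 0.5, 0.75, 1]).to_list()

# Reversed labels: the most recent quarter of customers gets score 4
r_score = pd.cut(recency, bins=edges, labels=[4, 3, 2, 1], include_lowest=True)
print(pd.DataFrame({'Recency': recency, 'r_score': r_score}))

# Hypothetical frequency values with many ties at 1
tied = pd.Series([1, 1, 1, 1, 2, 3, 5, 9])
tied_edges = tied.quantile(q=[0.0, 0.25, 0.5, 0.75, 1]).to_list()

# The minimum and the 25% quantile are both 1, so the edges are not unique
try:
    pd.cut(tied, bins=tied_edges, labels=[1, 2, 3, 4], include_lowest=True)
    duplicate_edges_rejected = False
except ValueError:
    duplicate_edges_rejected = True

print("Duplicate edges rejected:", duplicate_edges_rejected)
```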

### RFM Segment Mapping

In [None]:
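# Mapping the RFM scores to meaningful segments for better interpretation
# Patterns are applied in order, so the specific '444' entry must come before the broader [3-4][3-4][3-4] one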
agg_map = {
    r'444': '01_core',
    r'[3-4][3-4][3-4]': '02_loyal_large',
    r'[3-4][3-4][1-2]': '02_loyal_small',
    r'[3-4][1-2][3-4]': '03_new_large',
    r'[3-4][1-2][1-2]': '03_new_small',
    r'[1-2][3-4][3-4]': '04_lost_loyal_large',
    r'[1-2][3-4][1-2]': '04_lost_loyal_small',
    r'[1-2][1-2][3-4]': '05_promising',
    r'[1-2][1-2][1-2]': '06_low_priority',
}

rfm['RFM_segment'] = rfm['rfm_score'].replace(agg_map, regex=True)
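The pattern table reads as "first matching pattern wins". The sketch below expresses that logic with the standard `re` module so it can be tried on individual scores. `segment_for` is a hypothetical helper written for this illustration. It is not part of the notebook and does not replace the pandas call.

```python
import re
from itertools import product

# Same ordered pattern table as in the notebook
agg_map = {
    r'444': '01_core',
    r'[3-4][3-4][3-4]': '02_loyal_large',
    r'[3-4][3-4][1-2]': '02_loyal_small',
    r'[3-4][1-2][3-4]': '03_new_large',
    r'[3-4][1-2][1-2]': '03_new_small',
    r'[1-2][3-4][3-4]': '04_lost_loyal_large',
    r'[1-2][3-4][1-2]': '04_lost_loyal_small',
    r'[1-2][1-2][3-4]': '05_promising',
    r'[1-2][1-2][1-2]': '06_low_priority',
}

def segment_for(score: str) -> str:
    """Return the segment of the first pattern that matches the whole score."""
    for pattern, segment in agg_map.items():
        if re.fullmatch(pattern, score):
            return segment
    raise ValueError(f"No segment pattern matches {score!r}")

# A few sample scores (recency, frequency, monetary digits)
for score in ['444', '434', '411', '143', '112']:
    print(score, '->', segment_for(score))

# Every combination of scores 1-4 falls into exactly one segment
all_scores = [''.join(digits) for digits in product('1234', repeat=3)]
segments = {score: segment_for(score) for score in all_scores}
```

Only `444` reaches `01_core`. Any other score whose digits are all 3 or 4 falls through to `02_loyal_large`.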

## Final Note

K-Means clustering and rule-based RFM scoring give two complementary views of customer behavior. Either set of segments can be used to target marketing at specific customer groups.