## RFM Segmentation


### 1. Connect to PostgreSQL

In [1]:
pip install sqlalchemy psycopg2-binary

Note: you may need to restart the kernel to use updated packages.


In [2]:
from sqlalchemy import create_engine
from sqlalchemy import text
import pandas as pd

In [3]:
engine = create_engine('postgresql://postgres:postgres123@localhost:5432/db_retail')

The notebook is now connected to the PostgreSQL database **db_retail** through SQLAlchemy, so we can query the `online_retail_data` table, which holds the data loaded from the CSV file **Online_Retail_Clean_Data.csv**.
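The connection string above embeds the password directly in the notebook. As a minimal sketch of an alternative, the URL could be assembled from environment variables instead. The helper name `build_pg_url` and the variable names `PGUSER` and `PGPASSWORD` are assumptions for illustration, not part of the original notebook.

```python
import os
from urllib.parse import quote_plus


def build_pg_url(user, password, host="localhost", port=5432, database="db_retail"):
    """Return a PostgreSQL URL with the user and password percent-encoded."""
    return (
        f"postgresql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


# PGUSER and PGPASSWORD are assumed environment variable names
url = build_pg_url(
    user=os.environ.get("PGUSER", "postgres"),
    password=os.environ.get("PGPASSWORD", ""),
)
# The resulting string can then be passed on: engine = create_engine(url)
```

Percent-encoding matters when the password contains characters such as `@` or `/`, which would otherwise be read as URL delimiters.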

The following SQL query prepares the data for **RFM analysis**, which helps businesses group customers by their buying behavior:

**R = Recency → How recently a customer made a purchase**

**F = Frequency → How often they made purchases**

**M = Monetary → How much money they spent in total**

In [4]:
pd.read_sql("""
WITH max_date AS (
    SELECT MAX(CAST(invoicedate AS DATE)) AS max_invoice_date
    FROM online_retail_data
),
rfm_raw AS (
    SELECT 
        customer_id,
        -- Days since last purchase (recency)
        (SELECT max_invoice_date FROM max_date) - MAX(CAST(invoicedate AS DATE)) AS recency,
        
        -- Number of unique orders (frequency)
        COUNT(DISTINCT invoice) AS frequency,
        
        -- Total spend (monetary)
        ROUND(SUM(revenue)::numeric, 0) AS monetary
        
    FROM online_retail_data
    WHERE iscancelled = false 
      AND customer_id != 0 
    GROUP BY customer_id
)
SELECT *
FROM rfm_raw
ORDER BY recency ASC;
""", engine)

Unnamed: 0,customer_id,recency,frequency,monetary
0,13777,0,61,56478.0
1,14422,0,19,10272.0
2,12713,0,1,849.0
3,17754,0,13,4578.0
4,16446,0,2,168473.0
...,...,...,...,...
5876,14654,738,1,247.0
5877,17056,738,1,129.0
5878,12636,738,1,141.0
5879,17592,738,1,148.0


This query calculates RFM metrics for each customer:

- **Recency**: Days between the customer's last purchase and the most recent invoice date in the dataset.
- **Frequency**: Total number of unique orders.
- **Monetary**: Total amount spent.

Only non-cancelled transactions and valid customer IDs are included. Results are sorted by recency to show the most recently active customers at the top.
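To make the three metrics concrete, here is a small pure-Python sketch of the same calculation on a handful of made-up rows. The sample data and the function name `rfm` are hypothetical; the filtering and grouping follow the SQL above.

```python
from datetime import date

# Hypothetical rows: (customer_id, invoice, invoice_date, revenue, is_cancelled)
transactions = [
    (1, "A1", date(2011, 12, 1), 20.0, False),
    (1, "A1", date(2011, 12, 1), 5.0, False),   # second line of the same invoice
    (1, "A2", date(2011, 12, 9), 10.0, False),
    (2, "B1", date(2011, 11, 9), 50.0, False),
    (2, "B2", date(2011, 12, 5), 99.0, True),   # cancelled: ignored for R, F and M
]


def rfm(rows):
    # As in the SQL, the reference date is the latest date over ALL rows
    max_date = max(row[2] for row in rows)
    per_customer = {}
    for customer_id, invoice, day, revenue, cancelled in rows:
        if cancelled or customer_id == 0:
            continue
        c = per_customer.setdefault(
            customer_id, {"last": day, "invoices": set(), "monetary": 0.0}
        )
        c["last"] = max(c["last"], day)
        c["invoices"].add(invoice)       # distinct invoices, like COUNT(DISTINCT)
        c["monetary"] += revenue
    return {
        cid: {
            "recency": (max_date - c["last"]).days,
            "frequency": len(c["invoices"]),
            # Python's round() rounds halves to even, unlike PostgreSQL's ROUND on numeric
            "monetary": round(c["monetary"]),
        }
        for cid, c in per_customer.items()
    }


result = rfm(transactions)
```

Customer 1 has two invoice lines under `A1`, which count as a single order, so frequency counts invoices rather than rows.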


### 2. Creating View for RFM Segmentation

In [5]:
with engine.begin() as conn:  # begin() commits the CREATE VIEW when the block exits
    conn.execute(text("""
        CREATE OR REPLACE VIEW rfm_seg AS
        WITH max_date AS (
            SELECT MAX(CAST(invoicedate AS DATE)) AS max_invoice_date
            FROM online_retail_data
        ),
        rfm_initial_calc AS (
            SELECT
                customer_id,
                ROUND(SUM(revenue)::numeric, 0) AS monetary_value,
                COUNT(DISTINCT invoice) AS frequency,
                (SELECT max_invoice_date FROM max_date) - MAX(CAST(invoicedate AS DATE)) AS recency
            FROM online_retail_data
            WHERE iscancelled = false 
              AND customer_id IS NOT NULL 
              AND customer_id != 0
            GROUP BY customer_id
        ),
        rfm_score_calc AS (
            SELECT
                *,
                NTILE(4) OVER (ORDER BY recency DESC) AS recency_score,
                NTILE(4) OVER (ORDER BY frequency ASC) AS frequency_score,
                NTILE(4) OVER (ORDER BY monetary_value ASC) AS monetary_score
            FROM rfm_initial_calc
        )
        SELECT
            customer_id,
            recency, 
            frequency,
            monetary_value,
            recency_score,
            frequency_score,
            monetary_score,
            (recency_score + frequency_score + monetary_score) AS total_rfm_score,
            CONCAT(recency_score, frequency_score, monetary_score) AS rfm_category_combination
        FROM rfm_score_calc;
    """))

In [6]:
pd.read_sql("SELECT * FROM rfm_seg ORDER BY total_rfm_score DESC LIMIT 20;", engine)

Unnamed: 0,customer_id,recency,frequency,monetary_value,recency_score,frequency_score,monetary_score,total_rfm_score,rfm_category_combination
0,18259,24,7,4318.0,4,4,4,12,444
1,13632,24,8,3648.0,4,4,4,12,444
2,12428,25,9,7956.0,4,4,4,12,444
3,15981,24,21,9363.0,4,4,4,12,444
4,12676,24,11,2843.0,4,4,4,12,444
5,14640,24,10,3667.0,4,4,4,12,444
6,15681,25,14,2990.0,4,4,4,12,444
7,17663,25,17,5176.0,4,4,4,12,444
8,15532,25,10,3772.0,4,4,4,12,444
9,15291,25,40,13847.0,4,4,4,12,444


We created a **PostgreSQL view** named `rfm_seg` to perform full RFM segmentation:

- **Recency**: Days since the customer's last purchase.
- **Frequency**: Number of unique purchases.
- **Monetary**: Total amount spent by the customer.

Each metric is scored into **quartiles (1–4)** using `NTILE(4)`, where a higher score always indicates better performance (recent activity, frequent orders, or high spending).

A **lower recency** value is better, so recency is ranked in descending order (`ORDER BY recency DESC`), which places the most recent buyers in quartile 4. **Higher frequency and monetary** values are better, so those two are ranked in ascending order (`ASC`).

A total RFM score (sum of the 3) and a category combination (e.g. `444`) are calculated for each customer.

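The final query retrieves **20 customers** from the top of the ranking by total RFM score, representing the business’s most valuable customers. Customers that tie on the total score are returned in no guaranteed order, since the query has no additional sort key.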
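The quartile scoring can be sketched in plain Python to show why the sort direction matters. The function below mimics the bucket sizes of SQL `NTILE(n)` (earlier buckets receive the extra rows when the count does not divide evenly). The name `ntile` and the sample values are hypothetical, and the tie order here follows list position, whereas the database gives no guarantee for ties.

```python
def ntile(values, n, descending=False):
    """Return a bucket number (1..n) for each value, like NTILE(n) OVER (ORDER BY value)."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=descending)
    base, extra = divmod(len(values), n)
    buckets = [0] * len(values)
    pos = 0
    for b in range(1, n + 1):
        size = base + (1 if b <= extra else 0)  # first `extra` buckets get one more row
        for i in order[pos:pos + size]:
            buckets[i] = b
        pos += size
    return buckets


# Hypothetical values for eight customers
recency = [0, 3, 10, 40, 90, 200, 400, 738]   # days since last purchase
frequency = [1, 1, 1, 1, 1, 2, 3, 9]          # number of distinct invoices

recency_score = ntile(recency, 4, descending=True)   # most recent -> score 4
frequency_score = ntile(frequency, 4)                # most frequent -> score 4
```

Note that `NTILE` cuts by row position rather than by value: in the `frequency` example, customers with an identical frequency of 1 land in three different score buckets.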


In [7]:
pd.read_sql("""
SELECT DISTINCT rfm_category_combination
    FROM rfm_seg
ORDER BY rfm_category_combination;
""", engine)

Unnamed: 0,rfm_category_combination
0,111
1,112
2,113
3,114
4,121
...,...
57,433
58,434
59,442
60,443


This query lists all the unique **RFM category combinations** found in the dataset.

Each combination (e.g., `111`, `324`, `444`) is a 3-digit code:
- **1st digit** = Recency score (1 = least recent, 4 = most recent)
- **2nd digit** = Frequency score (1 = fewest orders, 4 = most orders)
- **3rd digit** = Monetary score (1 = lowest spend, 4 = highest spend)

For example:
- `444` = Best customers (very recent, frequent, and high spenders)
- `111` = Least engaged customers (old purchase, rare orders, low spend)
- `344` = Recently active, frequent, high spenders

A total of **62 unique combinations** are present in the data, which allows for targeted customer segmentation strategies.
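As a small illustration of how the 3-digit code is read, the sketch below splits a code back into its three scores and enumerates every code that four quartiles per metric can produce. `decode_rfm` is a hypothetical helper name, not part of the notebook.

```python
from itertools import product


def decode_rfm(code):
    """Split a code such as '344' into its three scores and their sum."""
    r, f, m = (int(ch) for ch in code)
    return {
        "recency_score": r,
        "frequency_score": f,
        "monetary_score": m,
        "total_rfm_score": r + f + m,
    }


# Every code that NTILE(4) on three metrics can yield: 4 * 4 * 4 combinations
all_codes = ["".join(map(str, scores)) for scores in product(range(1, 5), repeat=3)]

example = decode_rfm("344")
```

Comparing `len(all_codes)` with the number of combinations returned by the query shows how many theoretically possible codes never occur in this dataset.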


In [8]:
pd.read_sql("""
SELECT customer_id,
    CASE
        WHEN CONCAT(recency_score, frequency_score, monetary_score) IN ('111', '112', '121', '123', '132', '211', '212', '114', '141') THEN 'CHURNED CUSTOMER'
        WHEN CONCAT(recency_score, frequency_score, monetary_score) IN ('133', '134', '143', '244', '334', '343', '344', '144') THEN 'SLIPPING AWAY, CANNOT LOSE'
        WHEN CONCAT(recency_score, frequency_score, monetary_score) IN ('311', '411', '331') THEN 'NEW CUSTOMERS'
        WHEN CONCAT(recency_score, frequency_score, monetary_score) IN ('222', '231', '221', '223', '233', '322') THEN 'POTENTIAL CHURNERS'
        WHEN CONCAT(recency_score, frequency_score, monetary_score) IN ('323', '333', '321', '341', '422', '332', '432') THEN 'ACTIVE'
        WHEN CONCAT(recency_score, frequency_score, monetary_score) IN ('433', '434', '443', '444') THEN 'LOYAL'
        ELSE 'CANNOT BE DEFINED'
    END AS customer_segment
FROM rfm_seg;
""", engine)

Unnamed: 0,customer_id,customer_segment
0,17056,CHURNED CUSTOMER
1,12636,CHURNED CUSTOMER
2,17592,CHURNED CUSTOMER
3,13526,CHURNED CUSTOMER
4,14654,CHURNED CUSTOMER
...,...,...
5876,16626,LOYAL
5877,15311,LOYAL
5878,17490,LOYAL
5879,12518,LOYAL


Here we mapped each customer to a descriptive **RFM segment** based on their recency, frequency, and monetary scores.

The segments are defined using combinations of RFM scores:
- `'CHURNED CUSTOMER'`: Low recency score with mostly low frequency and spend (e.g. `111`, `112`)
- `'SLIPPING AWAY, CANNOT LOSE'`: High frequency and spend (scores 3–4) but a recency score below the top quartile (e.g. `133`, `244`, `344`)
- `'NEW CUSTOMERS'`: Recently active but low spend, usually with few orders (e.g. `311`, `331`)
- `'POTENTIAL CHURNERS'`: Moderate activity and spend (e.g. `222`, `231`)
- `'ACTIVE'`: Engaged and purchasing regularly (e.g. `323`, `341`)
- `'LOYAL'`: Top recency score with high frequency and spend (e.g. `433`, `444`)
- `'CANNOT BE DEFINED'`: Any score combination not listed in the rules above

This categorization helps with **targeted marketing**, **retention strategies**, and understanding customer value.
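The same rules can also be kept as a lookup table, which makes it easy to check that no code is assigned twice and to see which codes fall through to the default. This is a sketch only: the code lists are copied from the `CASE` expression, while the names `SEGMENT_RULES`, `SEGMENT_BY_CODE` and `segment_for` are hypothetical.

```python
# Rules copied from the CASE expression, in the same order
SEGMENT_RULES = [
    ("CHURNED CUSTOMER", ["111", "112", "121", "123", "132", "211", "212", "114", "141"]),
    ("SLIPPING AWAY, CANNOT LOSE", ["133", "134", "143", "244", "334", "343", "344", "144"]),
    ("NEW CUSTOMERS", ["311", "411", "331"]),
    ("POTENTIAL CHURNERS", ["222", "231", "221", "223", "233", "322"]),
    ("ACTIVE", ["323", "333", "321", "341", "422", "332", "432"]),
    ("LOYAL", ["433", "434", "443", "444"]),
]

SEGMENT_BY_CODE = {}
for label, codes in SEGMENT_RULES:
    for code in codes:
        # setdefault keeps the first label, matching CASE's first-match behavior
        SEGMENT_BY_CODE.setdefault(code, label)


def segment_for(code):
    """Return the segment label for a 3-digit RFM code."""
    return SEGMENT_BY_CODE.get(code, "CANNOT BE DEFINED")


labels = [segment_for(code) for code in ("444", "111", "113")]
```

Assuming the view's output is loaded into a DataFrame `df`, something like `df["rfm_category_combination"].map(segment_for)` would apply the same labels on the pandas side.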


In [9]:
pd.read_sql("""
SELECT 
    CASE
        WHEN CONCAT(recency_score, frequency_score, monetary_score) IN ('111', '112', '121', '123', '132', '211', '212', '114', '141') THEN 'CHURNED CUSTOMER'
        WHEN CONCAT(recency_score, frequency_score, monetary_score) IN ('133', '134', '143', '244', '334', '343', '344', '144') THEN 'SLIPPING AWAY, CANNOT LOSE'
        WHEN CONCAT(recency_score, frequency_score, monetary_score) IN ('311', '411', '331') THEN 'NEW CUSTOMERS'
        WHEN CONCAT(recency_score, frequency_score, monetary_score) IN ('222', '231', '221', '223', '233', '322') THEN 'POTENTIAL CHURNERS'
        WHEN CONCAT(recency_score, frequency_score, monetary_score) IN ('323', '333', '321', '341', '422', '332', '432') THEN 'ACTIVE'
        WHEN CONCAT(recency_score, frequency_score, monetary_score) IN ('433', '434', '443', '444') THEN 'LOYAL'
        ELSE 'CANNOT BE DEFINED'
    END AS customer_segment,
    COUNT(*) AS customer_count
FROM rfm_seg
GROUP BY customer_segment
ORDER BY customer_count DESC;
""", engine)

Unnamed: 0,customer_segment,customer_count
0,CHURNED CUSTOMER,1353
1,LOYAL,1034
2,POTENTIAL CHURNERS,877
3,"SLIPPING AWAY, CANNOT LOSE",861
4,CANNOT BE DEFINED,826
5,ACTIVE,692
6,NEW CUSTOMERS,238


Finally, we counted the number of customers in each **RFM segment**.

The counts per segment are:
- **CHURNED CUSTOMER**: The largest segment, with 1,353 customers who have not purchased recently and show low engagement.
- **LOYAL**: 1,034 highly engaged and valuable customers.
- **POTENTIAL CHURNERS** (877) and **SLIPPING AWAY, CANNOT LOSE** (861): Customers showing signs of declining activity.
- **CANNOT BE DEFINED**: 826 customers whose score combination is not covered by any rule in the `CASE` expression.
- **ACTIVE**: 692 customers who are currently engaged but not yet top-tier.
- **NEW CUSTOMERS**: 238 recent customers with low spend so far.

This breakdown helps prioritize actions for customer retention, reactivation, or rewards.
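Raw counts are easier to compare as shares of the customer base. The sketch below uses the counts from the query output above; the variable names are illustrative.

```python
# Counts taken from the query output above
segment_counts = {
    "CHURNED CUSTOMER": 1353,
    "LOYAL": 1034,
    "POTENTIAL CHURNERS": 877,
    "SLIPPING AWAY, CANNOT LOSE": 861,
    "CANNOT BE DEFINED": 826,
    "ACTIVE": 692,
    "NEW CUSTOMERS": 238,
}

total = sum(segment_counts.values())

# Percentage of all scored customers in each segment, rounded to one decimal
shares = {label: round(100 * n / total, 1) for label, n in segment_counts.items()}

for label, pct in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{label:<28}{pct:>5.1f}%")
```

The share of `CANNOT BE DEFINED` is a quick indicator of how much of the customer base the current rules leave unlabeled.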
