# Customer Segmentation with RFM and K-Means Clustering

This project applies RFM (Recency, Frequency, Monetary) analysis and K-Means clustering to segment customers in an e-commerce context.  
The goal is to group customers based on their purchasing behavior to support marketing and business decisions.


**Dataset**: Olist E-Commerce Public Dataset (Brazil)  
**Author**: Paulo Castro  
**Date**: July 2025  
**Tools**: Python (Pandas for data manipulation, Scikit-learn for clustering, Matplotlib/Seaborn for visualization)

---

## 1. Loading and Preparing the Data

In this section, we load the necessary tables from the Olist Brazilian E-Commerce dataset.  
We aim to construct a unified view of customer transactions by merging the tables that contain order, customer, item, and payment details.

**Files used:**
- `olist_orders_dataset.csv`
- `olist_customers_dataset.csv`
- `olist_order_items_dataset.csv`
- `olist_order_payments_dataset.csv`

In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.utils import resample

orders = pd.read_csv('/content/olist_orders_dataset.csv')
customers = pd.read_csv('/content/olist_customers_dataset.csv')
order_items = pd.read_csv('/content/olist_order_items_dataset.csv')
payments = pd.read_csv('/content/olist_order_payments_dataset.csv')

In [None]:
print("--- Initial Dataframes Head ---")
print("\nOrders Head:\n", orders.head())
print("\nCustomers Head:\n", customers.head())
print("\nOrder Items Head:\n", order_items.head())
print("\nPayments Head:\n", payments.head())

In [None]:
# Merge 1: Orders + Customers (on customer_id)
# Combines order details with unique customer IDs
df_1 = pd.merge(orders, customers, how='left', on='customer_id')

# Merge 2: Add Order Items (on order_id)
# Adds product and price details for each order
df_2 = pd.merge(df_1, order_items, how='left', on='order_id')

# Merge 3: Add Payments (on order_id)
# Includes payment information associated with each order
df_final = pd.merge(df_2, payments, how='left', on='order_id')
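
One thing worth checking after these merges: `order_items` and `payments` can each hold several rows per `order_id`, so joining both onto the same order multiplies rows, and a later sum over `payment_value` may count the same payment more than once. The toy data below is hypothetical and only illustrates the effect. Aggregating payments to one row per order before merging is one possible way to avoid it, not necessarily the approach this notebook should take.

```python
import pandas as pd

# Hypothetical toy data: one order with two items, paid in two installments
toy_items = pd.DataFrame({
    "order_id": ["o1", "o1"],
    "order_item_id": [1, 2],
    "price": [30.0, 20.0],
})
toy_payments = pd.DataFrame({
    "order_id": ["o1", "o1"],
    "payment_sequential": [1, 2],
    "payment_value": [40.0, 10.0],
})

# Joining both tables on order_id yields 2 x 2 = 4 rows for this single order
joined = toy_items.merge(toy_payments, on="order_id", how="left")
inflated_total = joined["payment_value"].sum()

# Collapsing payments to one row per order first keeps the total intact
payments_per_order = toy_payments.groupby("order_id", as_index=False)["payment_value"].sum()
true_total = payments_per_order["payment_value"].sum()

print(f"Rows after join: {len(joined)}")
print(f"Summed payment_value after join: {inflated_total:.2f}")
print(f"Summed payment_value per order:  {true_total:.2f}")
```

On the real tables, comparing `df_final['payment_value'].sum()` with `payments['payment_value'].sum()` should indicate whether the merged total is inflated.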


## 2. Data Cleaning and Type Adjustments

Before performing the RFM analysis, we need to ensure that the dataset is clean and that all relevant columns are in the correct format, especially date fields.

We will check for missing values, confirm that each column has the appropriate data type, and ensure the dataset does not contain duplicated records, outliers, or invalid entries that could distort the analysis.

We now inspect the data types to confirm whether datetime and numerical columns are properly formatted.

In [None]:
df_final.dtypes

We convert `order_purchase_timestamp` to a datetime format and `payment_value` to float.
This ensures proper time-based operations and accurate numerical aggregations during the RFM analysis.

In [None]:
# Ensure correct data types for analysis
df_final['order_purchase_timestamp'] = pd.to_datetime(df_final['order_purchase_timestamp'])
df_final['payment_value'] = df_final['payment_value'].astype(float)
print(df_final[['order_purchase_timestamp', 'payment_value']].dtypes)

We check for and remove any duplicated rows to avoid skewed results in the customer segmentation process.

In [None]:
# Remove duplicated rows
print("Rows before removing duplicates:", df_final.shape[0])
print("Number of duplicate rows found:", df_final.duplicated().sum())
df_final.drop_duplicates(inplace=True)
print("Rows after removing duplicates:", df_final.shape[0])

We remove rows that contain null values in critical columns for RFM analysis:  
`customer_unique_id`, `order_purchase_timestamp`, `order_id`, and `payment_value`.

These fields are required for identifying customers, tracking orders over time, and measuring transaction values.

In [None]:
# Check for and drop rows with missing values in key RFM-related columns
columns_rfm = ['customer_unique_id', 'order_purchase_timestamp', 'order_id', 'payment_value']
missing = df_final[columns_rfm].isnull().sum()
print("Missing values before removal:", missing)

df_final.dropna(subset=columns_rfm, inplace=True)
print("Remaining rows:", df_final.shape[0])

We remove the top 1% of `payment_value` entries to reduce the impact of extreme outliers on the clustering results.

In [None]:
# Remove extreme outliers above the 99th percentile of payment_value

high_limit = df_final['payment_value'].quantile(0.99)
print(f"Outlier threshold (99th percentile): {high_limit:.2f}")
df_final = df_final[df_final['payment_value'] <= high_limit]
print("Remaining rows after outlier removal:", df_final.shape[0])
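
As a quick illustration of how the percentile cut behaves, the sketch below applies the same filter to a hypothetical series of 100 payments. Note that the filter acts on individual payment rows, not on per-customer totals.

```python
import pandas as pd

# Hypothetical payments: 1.00, 2.00, ..., 100.00
toy_payments = pd.Series(range(1, 101), dtype=float, name="payment_value")

# 99th percentile, using pandas' default linear interpolation
toy_limit = toy_payments.quantile(0.99)

# Keep only the rows at or below the threshold
toy_kept = toy_payments[toy_payments <= toy_limit]

print(f"Threshold: {toy_limit:.2f}")
print(f"Rows kept: {len(toy_kept)} of {len(toy_payments)}")
```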

We remove transactions with payment values equal to or below zero, as they do not represent valid purchases.
After this, we review the statistical distribution of the cleaned `payment_value` column to ensure consistency.

In [None]:
# Remove non-positive payment values
df_final = df_final[df_final['payment_value'] > 0]

# Review the cleaned payment_value distribution
df_final['payment_value'].describe()

## 3. RFM Calculation

The RFM model evaluates customer behavior using three key metrics:

- **Recency**: How many days have passed since the customer's last purchase.
- **Frequency**: How many unique purchases the customer has made.
- **Monetary**: How much the customer has spent in total.

We'll calculate these metrics based on the cleaned dataset.

**Note**: We use the `customer_unique_id` column instead of `customer_id`.  
This is because the `customer_id` is unique per order (a customer might have a different ID for each purchase),  
while the `customer_unique_id` is stable and consistent across all purchases from the same person.
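
A small sketch of why the choice of key matters, using hypothetical toy rows in which one person (`u1`) placed two orders under two different `customer_id` values:

```python
import pandas as pd

# Hypothetical toy data: person u1 appears under two order-level customer_ids
toy = pd.DataFrame({
    "customer_id": ["c1", "c2", "c3"],
    "customer_unique_id": ["u1", "u1", "u2"],
    "order_id": ["o1", "o2", "o3"],
})

# Grouping by customer_id makes every customer look like a one-time buyer
freq_by_customer_id = toy.groupby("customer_id")["order_id"].nunique()

# Grouping by customer_unique_id recovers the repeat purchase
freq_by_unique_id = toy.groupby("customer_unique_id")["order_id"].nunique()

print(freq_by_customer_id)
print(freq_by_unique_id)
```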

We calculate **Recency** as the number of days between each customer's last purchase and the most recent purchase date in the dataset.

In [None]:
# Define the reference date (last purchase date in the dataset)
reference_date = df_final['order_purchase_timestamp'].max()

# Calculate Recency: days since the last purchase per customer
last_purchase = df_final.groupby('customer_unique_id')['order_purchase_timestamp'].max().reset_index()
last_purchase.columns = ['customer_unique_id', 'LastPurchaseDate']
last_purchase['Recency'] = (reference_date - last_purchase['LastPurchaseDate']).dt.days


print("Reference date:", reference_date)
print(last_purchase.head())

We calculate **Frequency** as the number of unique purchase orders (`order_id`) per customer.

In [None]:
# Calculate Frequency: number of unique purchases per customer
frequency = df_final.groupby('customer_unique_id')['order_id'].nunique().reset_index()
frequency.columns = ['customer_unique_id', 'Frequency']

# Check distribution of Frequency values
frequency['Frequency'].value_counts().sort_index()

We calculate **Monetary** as the total amount paid by each customer across all their purchases.

In [None]:
# Calculate Monetary: total spending per customer
monetary = df_final.groupby('customer_unique_id')['payment_value'].sum().reset_index()
monetary.columns = ['customer_unique_id', 'Monetary']

We merge the Recency, Frequency, and Monetary metrics into a single `rfm` dataframe,
ready for normalization and clustering. We also remove the `LastPurchaseDate` column, which is no longer needed.

In [None]:
# Merge all RFM metrics into a single dataframe
rfm = pd.merge(last_purchase, frequency, on='customer_unique_id')
rfm = pd.merge(rfm, monetary, on='customer_unique_id')

In [None]:
# Remove LastPurchaseDate column

rfm.drop('LastPurchaseDate', axis=1, inplace=True)

In [None]:
# Quick overview of RFM distribution

rfm.describe()
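
As an alternative to three separate group-bys and two merges, the same metrics can be computed in one pass with pandas named aggregation. The sketch below runs on hypothetical toy transactions (one row per order) and is meant only to show the shape of the computation.

```python
import pandas as pd

# Hypothetical toy transactions, one row per order
toy = pd.DataFrame({
    "customer_unique_id": ["u1", "u1", "u2"],
    "order_id": ["o1", "o2", "o3"],
    "order_purchase_timestamp": pd.to_datetime(["2018-01-01", "2018-03-01", "2018-02-15"]),
    "payment_value": [50.0, 30.0, 20.0],
})

# Reference date: the most recent purchase in the data
toy_reference_date = toy["order_purchase_timestamp"].max()

# One group-by producing the inputs for all three metrics
rfm_toy = toy.groupby("customer_unique_id").agg(
    LastPurchaseDate=("order_purchase_timestamp", "max"),
    Frequency=("order_id", "nunique"),
    Monetary=("payment_value", "sum"),
)

# Recency in whole days relative to the reference date
rfm_toy["Recency"] = (toy_reference_date - rfm_toy["LastPurchaseDate"]).dt.days
rfm_toy = rfm_toy.drop(columns="LastPurchaseDate").reset_index()

print(rfm_toy)
```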

## 4. Exploratory Data Analysis (EDA)

This section explores the distribution and relationships of the three RFM variables — **Recency**, **Frequency**, and **Monetary** — in order to better understand the customer base and to prepare the data for clustering.

### Summary Statistics of RFM Variables

Overview of central tendency and spread before transformation and scaling. Useful to understand the magnitude and dispersion of each feature.

In [None]:
rfm.describe().T

### Recency Distribution and Outliers

Recency is already measured in days and typically doesn't require transformation. However, outliers are visible in the boxplot and can affect clustering.

In [None]:
# Distribution of Recency
sns.histplot(rfm['Recency'], bins=20, kde=True)
plt.title('Recency Distribution')
plt.xlabel('Recency (days)')
plt.ylabel('Count')
plt.show()

In [None]:
# Boxplot to detect outliers in Recency
sns.boxplot(x=rfm['Recency'])
plt.title('Boxplot of Recency')
plt.show()

**Interpretation**: The histogram for Recency shows a **right-skewed distribution**, indicating that a large portion of our customers made a purchase relatively recently (lower number of days since last purchase). This is a positive sign for customer engagement. The boxplot confirms this skewness and highlights the presence of **outliers** on the higher end of the scale. These represent customers who have not purchased for a considerable amount of time. While these long-inactive customers are outliers, they constitute a real segment of the customer base. Their inclusion is important for potential re-engagement strategies and they will be normalized with the rest of the data during the scaling phase for clustering.

### Frequency Distribution and Outliers (Log Transformed)

Most customers purchase only once, resulting in a highly right-skewed distribution. A log transformation helps reduce skewness and improve the effectiveness of clustering.

In [None]:
# Distribution of log-transformed Frequency
sns.histplot(np.log1p(rfm['Frequency']), bins=30, kde=True)
plt.title('Distribution of Frequency (log-transformed)')
plt.show()

In [None]:
# Boxplot to detect outliers after transformation
sns.boxplot(x=np.log1p(rfm['Frequency']))
plt.title('Boxplot of Frequency (log-transformed)')
plt.show()

**Interpretation**: The histogram clearly shows that the original Frequency distribution was highly **right-skewed**, with a large number of customers having very low purchase frequency (many buying only once). While the `log1p` transformation does not result in a normal distribution, it is effective in **compressing the extreme values and significantly reducing the severity of the skewness**. This makes the data more suitable for distance-based clustering algorithms like K-Means, as it mitigates the disproportionate influence of a few customers with an exceptionally high number of unique orders on the cluster centroids. The boxplot of the log-transformed data indicates that despite the transformation, there may still be some mild outliers, particularly on the higher end, representing **customers with an unusually high number of unique orders**.
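
To make the effect of `log1p` concrete, the sketch below compares sample skewness before and after the transformation on a hypothetical set of purchase counts. It also shows that `np.expm1` recovers the original values, which is useful when reporting results on the original scale.

```python
import numpy as np
import pandas as pd

# Hypothetical purchase counts: mostly one-time buyers plus one heavy buyer
toy_frequency = pd.Series([1, 1, 1, 1, 2, 3, 15])

# log1p(x) = log(1 + x), which stays defined at x = 0
toy_log = np.log1p(toy_frequency)

skew_before = toy_frequency.skew()
skew_after = toy_log.skew()

# expm1 is the inverse of log1p
recovered = np.expm1(toy_log)

print(f"Skewness before: {skew_before:.2f}")
print(f"Skewness after:  {skew_after:.2f}")
```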

### Monetary Distribution and Outliers (Log Transformed)

Spending behavior also shows strong skewness, with a few customers spending significantly more. Log transformation helps stabilize the variance and makes the data more usable for clustering.

In [None]:
# Distribution of log-transformed Monetary
sns.histplot(np.log1p(rfm['Monetary']), bins=20, kde=True)
plt.title('Distribution of Monetary (log-transformed)')
plt.show()

In [None]:
# Boxplot to check for extreme spenders
sns.boxplot(x=np.log1p(rfm['Monetary']))
plt.title('Boxplot of Monetary (log-transformed)')
plt.show()

**Interpretation**: Similar to Frequency, the original Monetary distribution was also **highly right-skewed**, indicating that a few customers contribute significantly more to the total revenue. The `log1p` transformation effectively **compresses these extreme spending values**. We can observe an **approximation towards a normal distribution**, although it remains **right-skewed**. This transformation makes the data more suitable for distance-based clustering algorithms by normalizing the influence of high-value spenders. The boxplot of the log-transformed data indicates that while the transformation improved the distribution, there may still be some mild outliers, particularly on the higher end, representing customers with exceptionally high total spending.

### Correlation Between RFM Variables

This correlation matrix shows how Recency, Frequency, and Monetary relate to one another.
- Recency usually has a **negative** correlation with Frequency and Monetary (recent buyers tend to buy more and spend more).
- Frequency and Monetary are often **positively** correlated.

In [None]:
# Compute correlation matrix
rfm_numeric = rfm[['Recency', 'Frequency', 'Monetary']]
rfm_corr = rfm_numeric.corr()

# Heatmap of correlation
sns.heatmap(rfm_corr, annot=True, cmap='coolwarm', center=0)
plt.title('Correlation between RFM variables')
plt.show()

**Interpretation**: The correlation heatmap reveals the relationships between the RFM variables. While the **direction** of the correlations is as generally hypothesized – Recency shows a negative relationship with Frequency (-0.025) and Monetary (-0.0084), and Frequency and Monetary show a positive relationship (0.14) – it is important to note that the **magnitudes of these correlations are very low**.

This indicates that, for this specific dataset, the linear relationships between Recency, Frequency, and Monetary are **weak or negligible**. This suggests that these RFM dimensions are relatively independent of each other, meaning they capture distinct aspects of customer behavior, which can be beneficial for clustering as each dimension provides unique information to segment customers.
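
Because `DataFrame.corr()` defaults to Pearson correlation, it only captures linear relationships. As a complementary check (an addition, not part of the analysis above), a rank-based Spearman correlation can be requested with `method="spearman"`. The hypothetical toy data below shows a perfectly monotonic but non-linear relationship where the two measures differ.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: y grows monotonically but non-linearly with x
toy = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0]})
toy["y"] = np.exp(toy["x"])

# Pearson measures linear association
pearson = toy.corr(method="pearson").loc["x", "y"]

# Spearman measures monotonic association (correlation of ranks)
spearman = toy.corr(method="spearman").loc["x", "y"]

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```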

## 5. Data Normalization

Before clustering, it's crucial to put the features on a comparable scale so that all of them contribute equally to the model. Here, we apply a `log1p` transformation to Frequency and Monetary and then **Standard Scaling** to all three RFM features.

In [None]:
# Ensure data is in the correct format for scaling and clustering

# 1. Select the RFM features for transformation and scaling
# Create a copy to avoid SettingWithCopyWarning
rfm_for_clustering = rfm[['Recency', 'Frequency', 'Monetary']].copy()

# 2. Apply log transformation to Frequency and Monetary
# Recency is typically not log-transformed as its distribution is often less severely skewed and 0 is meaningful.
rfm_for_clustering['Frequency'] = np.log1p(rfm_for_clustering['Frequency'])
rfm_for_clustering['Monetary'] = np.log1p(rfm_for_clustering['Monetary'])

# 3. Standardize the RFM values (mean = 0, std = 1)
scaler = StandardScaler()
rfm_scaled = scaler.fit_transform(rfm_for_clustering)

# 4. Store the scaled data in a new DataFrame
rfm_scaled_df = pd.DataFrame(rfm_scaled, columns=rfm_for_clustering.columns)

# 5. Inspect the standardized values
print("Descriptive statistics of scaled RFM features:")
print(rfm_scaled_df.describe())
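
For reference, standard scaling computes `z = (x - mean) / std` per feature. `StandardScaler` uses the population standard deviation (`ddof=0`), whereas pandas' `describe()` and `Series.std()` report the sample standard deviation (`ddof=1`), so the `std` row printed above can be marginally larger than 1. The sketch below reproduces this with NumPy on a hypothetical four-value feature.

```python
import numpy as np
import pandas as pd

# Hypothetical feature values
x = np.array([1.0, 2.0, 3.0, 4.0])

# Standard scaling by hand; np.std defaults to ddof=0 (population std)
z = (x - x.mean()) / x.std()

population_std = z.std()          # ddof=0
sample_std = pd.Series(z).std()   # ddof=1, as reported by describe()

print(f"Mean of z:            {z.mean():.4f}")
print(f"Population std of z:  {population_std:.4f}")
print(f"Sample std of z:      {sample_std:.4f}")
```

With many rows the two standard deviations are nearly identical; the gap is only noticeable on small samples like this one.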

## Choosing the Optimal Number of Clusters

We use two methods:
- **Elbow Method**: evaluates within-cluster variance (inertia).
- **Silhouette Score**: measures how well each object lies within its cluster.
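
A minimal sketch of the silhouette comparison on synthetic data, assuming two well-separated hypothetical blobs. It uses the `sample_size` and `random_state` parameters of `silhouette_score`, which subsample without a separate resampling step.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical data: two tight, well-separated 2D blobs
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=0.0, scale=0.3, size=(100, 2))
blob_b = rng.normal(loc=5.0, scale=0.3, size=(100, 2))
X = np.vstack([blob_a, blob_b])

# Score each candidate K on a random subsample of 100 points
scores = {}
for k in range(2, 5):
    km = KMeans(n_clusters=k, random_state=42, n_init=10).fit(X)
    scores[k] = silhouette_score(X, km.labels_, sample_size=100, random_state=42)

# The K with the highest silhouette score
best_k = max(scores, key=scores.get)

print({k: round(float(v), 3) for k, v in scores.items()})
print("Best K by silhouette:", best_k)
```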

In [None]:
# Lists to store metrics
inertias = []
silhouette_scores = []

# Test different values of K (from 2 to 10)
for k in range(2, 11):
    model = KMeans(n_clusters=k, random_state=42, n_init=10, max_iter=100)
    model.fit(rfm_scaled)
    inertias.append(model.inertia_)

    # Use subsampling (without replacement, to avoid duplicated points) to speed up silhouette score calculation
    labels = model.labels_
    sample_data, sample_labels = resample(rfm_scaled, labels, replace=False, n_samples=min(10000, len(rfm_scaled)), random_state=42)
    score = silhouette_score(sample_data, sample_labels)
    silhouette_scores.append(score)


print("Inertias:", inertias)
print("Silhouette Scores:", silhouette_scores)

In [None]:
# Plotting the Elbow Method
plt.figure(figsize=(8,5))
plt.plot(range(2, 11), inertias, marker='o')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Inertia')
plt.title('Elbow Method')
plt.grid(True)
plt.show()

In [None]:
# Plotting the Silhouette Scores
plt.figure(figsize=(8,5))
plt.plot(range(2, 11), silhouette_scores, marker='o', color='orange')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Score by K (with subsampling)')
plt.grid(True)
plt.show()

**Interpretation of Elbow and Silhouette Methods**:

The Elbow Method plot shows a noticeable "bend" or "elbow" around K=3 or K=4, indicating that adding more clusters beyond this point provides diminishing returns in terms of reducing within-cluster variance.

The Silhouette Score plot, which measures the compactness and separation of clusters, shows a rising score from K=3 to K=5. A higher silhouette score indicates denser and more well-separated clusters.

Considering both methods, **K=4** is chosen as the optimal number of clusters, offering a good balance between minimizing within-cluster variance and maximizing cluster distinctiveness, while also providing a manageable number of segments for business interpretation.

## 6. Applying K-Means Clustering (K = 4)

Based on the Elbow Method and the Silhouette Score, we choose **K = 4** for segmentation.

In [None]:
# Fit final KMeans model
model = KMeans(n_clusters=4, random_state=42, n_init=10, max_iter=100)
model.fit(rfm_scaled)

# Assign cluster labels to original RFM data
rfm['Cluster'] = model.labels_
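
A note on label assignment: assigning a NumPy array such as `model.labels_` is positional, while assigning a pandas Series aligns on the index. Both give the same result here as long as the dataframes share the same default index, but the hypothetical sketch below shows what can happen when they do not.

```python
import pandas as pd

# Hypothetical frame with the default RangeIndex 0..2
features = pd.DataFrame({"value": [1.0, 2.0, 3.0]})

# Hypothetical labels stored in a Series with a different index
labels = pd.Series([0, 1, 1], index=[10, 11, 12], name="Cluster")

# Series assignment aligns on index: no labels match, so all values are NaN
aligned = features.copy()
aligned["Cluster"] = labels

# Array assignment is positional and keeps the labels
positional = features.copy()
positional["Cluster"] = labels.to_numpy()

print(aligned)
print(positional)
```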

## 7. Cluster Summary (Mean Values per Group)

Here we inspect the **average RFM values per cluster** to begin interpretation of each customer segment.

In [None]:
# Compute average RFM values by cluster
cluster_summary = rfm.groupby('Cluster')[['Recency', 'Frequency', 'Monetary']].mean().round(2)
cluster_summary
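
Means can be pulled upward by a few large values in skewed data, so it may help to look at cluster sizes and medians alongside them. The sketch below shows one way to do this with named aggregation on a hypothetical toy table; the column names `Customers`, `RecencyMedian`, and `MonetaryMean` are illustrative only.

```python
import pandas as pd

# Hypothetical clustered RFM rows
toy_rfm = pd.DataFrame({
    "Cluster": [0, 0, 1, 1, 1],
    "Recency": [10, 30, 200, 220, 240],
    "Monetary": [100.0, 300.0, 20.0, 30.0, 40.0],
})

# Size, median, and mean per cluster in a single aggregation
toy_summary = toy_rfm.groupby("Cluster").agg(
    Customers=("Recency", "size"),
    RecencyMedian=("Recency", "median"),
    MonetaryMean=("Monetary", "mean"),
)

print(toy_summary)
```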

**Interpretation of Cluster Summary Table**:

Based on the average RFM values, we can identify distinct customer segments:

* **Cluster 0 - "Recent Big Spenders, Infrequent"**:
    * **Recency**: Relatively low (`233.63` days) - Second most recent segment, although the last purchase was still almost eight months ago on average.
    * **Frequency**: Low (`1` purchase) - Purchased only once.
    * **Monetary**: High (`407.03` R$) - Spent a significant amount.
    * *Profile*: These customers spent a good amount in a single purchase and are comparatively recent, but they have not bought again. They are valuable yet at risk of becoming inactive, so re-engagement strategies should focus on reactivating their high spending potential.

* **Cluster 1 - "High-Value Frequent Champions"**:
    * **Recency**: Relatively low (`268.17` days) - Purchased some time ago, but far more recently than the inactive segment.
    * **Frequency**: High (`2.12` purchases) - Purchases most frequently among all segments.
    * **Monetary**: High (`418.24` R$) - Has the highest total spending.
    * *Profile*: Our most valuable and frequent customers. They spend the most and buy often. Focus on retention, loyalty programs, and exclusive offers to maintain their engagement.

* **Cluster 2 - "Long-Term Inactive / Lost Customers"**:
    * **Recency**: Very High (`474.91` days) - Customers who haven't purchased for a very long time (most inactive segment).
    * **Frequency**: Low (`1` purchase) - Purchased only once.
    * **Monetary**: Relatively low (`123.66` R$) - Low to medium initial spending.
    * *Profile*: These customers are largely inactive or have left. They represent a low-value segment that hasn't purchased in roughly 1.3 years. Re-engagement efforts here might have low ROI; if attempted, they should focus on win-back strategies with significant incentives.

* **Cluster 3 - "New/Recent Low Spenders"**:
    * **Recency**: Low (`193.25` days) - Purchased most recently among all segments.
    * **Frequency**: Low (`1` purchase) - Purchased only once.
    * **Monetary**: Low (`76.25` R$) - Has the lowest total spending.
    * *Profile*: These are the most recent customers, who have only made a single, low-value purchase. They are trialists. Focus on nurturing, encouraging a second purchase, and cross-selling to increase their lifetime value.

### Boxplots by Cluster

Visualize the distribution of each RFM variable by cluster to detect patterns and variability within each group.

In [None]:
# Copy rfm_for_clustering and add the Cluster column to it
rfm_plot_data = rfm_for_clustering.copy()
rfm_plot_data['Cluster'] = rfm['Cluster']

# Recency
plt.figure(figsize=(10, 6))
sns.boxplot(x='Cluster', y='Recency', data=rfm_plot_data)
plt.title('Recency Distribution by Cluster')
plt.savefig('boxplot.png', bbox_inches='tight', dpi=300)
plt.show()

# Frequency (log scale due to skewness)
plt.figure(figsize=(10, 6))
sns.boxplot(x='Cluster', y='Frequency', data=rfm_plot_data)
plt.title('Frequency Distribution by Cluster (log1p transformed)')
plt.show()

# Monetary (log scale due to skewness)
plt.figure(figsize=(10, 6))
sns.boxplot(x='Cluster', y='Monetary', data=rfm_plot_data)
plt.title('Monetary Distribution by Cluster (log1p transformed)')
plt.show()

**Visual Analysis of Clusters (Boxplots)**:

The boxplots for each RFM variable across the clusters visually reinforce and expand upon the segment profiles identified in the summary table, providing insights into the variability within each group.

- **Recency**: The boxplot shows distinct ranges across clusters, with interquartile ranges (Q1 to Q3) spanning roughly 100 to 200 days for various clusters. This indicates varying degrees of recent activity and a discernible spread within each group, allowing for clear differentiation based on customer recency.
- **Frequency (log scale)**: As observed, for most clusters (e.g., Clusters 0, 2, and 3), the Frequency boxplot largely appears as a line at a specific value (likely corresponding to a single purchase in the original scale). This highlights the **minimal internal variability in purchase frequency for these segments**, indicating that the vast majority of customers within them have made a consistent, low number of purchases. In contrast, Cluster 1 is likely the exception, showing a more discernible box and potentially outliers, reflecting greater variability and higher average frequency.
- **Monetary (log scale)**: The Monetary boxplot (log-scaled) tends to be more compact, which is expected due to the logarithmic transformation applied to handle the skewness of the data. Despite this compactness, these boxplots effectively highlight differences in spending power across segments, showing distinct median values and interquartile ranges, even if narrower, indicating varying levels of monetary contribution within and across clusters.

These visualizations emphasize not only the average behavior but also the **homogeneity or heterogeneity** of customers within each segment based on their RFM attributes. When a boxplot is a line, it strongly indicates that the cluster is very pure or uniform on that specific RFM dimension.


### Heatmap of Normalized Cluster Averages

Normalize mean values to compare relative RFM strength across clusters.

In [None]:
# Normalize means for heatmap visualization
cluster_means = rfm.groupby('Cluster')[['Recency', 'Frequency', 'Monetary']].mean()
cluster_means_norm = (cluster_means - cluster_means.min()) / (cluster_means.max() - cluster_means.min())


# Visual heatmap
plt.figure(figsize=(8, 5))
sns.heatmap(cluster_means_norm, annot=True, cmap='YlGnBu')
plt.title('Normalized Mean RFM Values per Cluster')
plt.savefig('heatmap.png', bbox_inches='tight', dpi=300)
plt.show()

The heatmap provides a clear visual summary of each cluster's profile, displaying their normalized RFM scores (from 0 to 1). It's crucial to understand the normalization context for each metric:

* **Recency**: A lower normalized value (closer to 0) indicates a more recent purchase (fewer days since last purchase). A higher normalized value (closer to 1) indicates a less recent purchase (more days since last purchase).
* **Frequency & Monetary**: For these, a higher normalized value (closer to 1) indicates higher frequency/spending, and a lower value (closer to 0) indicates lower frequency/spending.

Based on this, here are the cluster profiles:

* **Cluster 0**: Shows relatively low normalized Recency (meaning they are **quite recent** buyers, given 0.14 is closer to 0), the lowest Frequency (0), but a very high Monetary value (0.97). This segment likely represents **recent big spenders, but infrequent buyers**.
* **Cluster 1**: Exhibits relatively low normalized Recency (meaning they are **quite recent** buyers, given 0.27 is closer to 0), and the highest scores in both Frequency (1) and Monetary (1). This is the segment of **highly valuable, very frequent, and relatively recent "Champion" customers**.
* **Cluster 2**: Displays the highest normalized Recency (meaning these are the **least recent / most inactive** customers, given 1.00 is closer to 1), combined with the lowest Frequency (0) and a relatively low Monetary value (0.14). This segment represents **long-term inactive or "lost" customers** with low overall value.
* **Cluster 3**: Shows the lowest normalized scores across all three dimensions: lowest Recency (meaning they are **quite recent** buyers, given 0.00 is at the minimum), lowest Frequency (0), and lowest Monetary value (0). This segment likely consists of **very recent, but low-value and infrequent buyers**, potentially new customers or trialists who have only made a single, small purchase.
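
The normalized scores quoted above follow from min-max scaling of the cluster means, `(x - min) / (max - min)` per column. As a check, the sketch below applies that formula to the rounded means reported in the cluster summary; it takes the Frequency of Clusters 0, 2, and 3 as exactly 1, which is an assumption based on the rounded figures.

```python
import pandas as pd

# Rounded cluster means as reported in the cluster summary
reported_means = pd.DataFrame(
    {
        "Recency": [233.63, 268.17, 474.91, 193.25],
        "Frequency": [1.0, 2.12, 1.0, 1.0],  # assumed exactly 1 where the text says 1
        "Monetary": [407.03, 418.24, 123.66, 76.25],
    },
    index=pd.Index([0, 1, 2, 3], name="Cluster"),
)

# Min-max scaling per column: 0 = lowest cluster mean, 1 = highest
col_min = reported_means.min()
col_max = reported_means.max()
reported_norm = (reported_means - col_min) / (col_max - col_min)

print(reported_norm.round(2))
```

Because the scaling is relative to the four cluster means only, a score of 0 means "lowest among these clusters", not "zero in absolute terms".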

## 8. Cluster Interpretation

Based on normalized values, the customer clusters are interpreted as follows:

| Cluster | Recency ↑ | Frequency ↑ | Monetary ↑ | Profile                                                                           |
|---------|-----------|-------------|-------------|---------------------------------------------------------------------------------------|
| 0       | Low       | Very Low    | Very High   | **Recent Big Spenders, Infrequent** – Recently purchased and spent a lot, but do so rarely. |
| 1       | Low       | Very High   | Very High   | **High-Value Frequent Champions** – Your most valuable customers, purchasing frequently and spending a lot. |
| 2       | High      | Very Low    | Low         | **Long-Term Inactive / Lost Customers** – Have not purchased in a very long time, with low frequency and value. |
| 3       | Very Low  | Very Low    | Very Low    | **New/Recent Low Spenders** – Very recent buyers with low frequency and low value. Potentially "trialists." |

## 9. 2D Cluster Visualization: Monetary vs Recency

We visualize the clusters using a scatterplot, where:
- **Y-axis = standardized Recency**
- **X-axis = standardized, log-transformed Monetary**

This helps intuitively understand how customer groups separate in two key dimensions.

In [None]:
# Add necessary columns to the scaled DataFrame
rfm_scaled_df['Cluster'] = rfm['Cluster']

# 2D Scatterplot of Clusters
plt.figure(figsize=(10, 6))
sns.scatterplot(
    data=rfm_scaled_df,
    x='Monetary',
    y='Recency',
    hue='Cluster',
    palette='Set2',
    alpha=0.6,
    s=20
)
plt.title('Customer Clusters: Monetary vs Recency')
plt.xlabel('Monetary (Standardized, log-transformed)')
plt.ylabel('Recency (Standardized)')
plt.legend(title='Cluster')
plt.grid(True)
plt.savefig('scatter.png', bbox_inches='tight', dpi=300)
plt.show()

## 10. Final Insights and Export

This final section adds complementary visualizations, summarises the main findings, and exports the segmented customer dataset for further use.

### Cluster Size and Revenue Contribution

To better understand the business value of each segment, we visualise:

- The number of customers in each cluster
- The total monetary value generated by each cluster

In [None]:
# Count of customers per cluster
plt.figure(figsize=(10, 6))
ax = sns.countplot(data=rfm, x='Cluster')
plt.title('Number of Customers per Cluster')
plt.xlabel('Cluster')
plt.ylabel('Number of Customers')

# Add data labels (bar heights are floats, so cast to int for clean counts)
for p in ax.patches:
    height = p.get_height()
    ax.annotate(f'{int(height)}', xy=(p.get_x() + p.get_width() / 2, height), xytext=(0, 5), textcoords='offset points', ha='center', va='bottom')

plt.show()

In [None]:
# Total Monetary value per cluster
monetary_sum = rfm.groupby('Cluster')['Monetary'].sum().reset_index()


plt.figure(figsize=(10, 6))
ax = sns.barplot(x='Cluster', y='Monetary', data=monetary_sum)
plt.title('Total Monetary Value per Cluster')
plt.xlabel('Cluster')
plt.ylabel('Total Monetary Value')

# Add data labels (thousands separated by a space)
for p in ax.patches:
    height = p.get_height()
    formatted = f'{height:,.2f}'.replace(',', ' ')
    ax.annotate(formatted, xy=(p.get_x() + p.get_width() / 2, height), xytext=(0, 5), textcoords='offset points', ha='center', va='bottom')

plt.show()
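
Absolute totals are easier to compare when expressed as shares. The sketch below, on a hypothetical toy table, computes each cluster's share of customers and of total monetary value; the column names `Customers`, `Revenue`, `CustomerShare`, and `RevenueShare` are illustrative.

```python
import pandas as pd

# Hypothetical clustered customers with their total spending
toy_rfm = pd.DataFrame({
    "Cluster": [0, 0, 1, 2, 2, 2],
    "Monetary": [400.0, 200.0, 300.0, 40.0, 30.0, 30.0],
})

# Customers and revenue per cluster
per_cluster = toy_rfm.groupby("Cluster").agg(
    Customers=("Monetary", "size"),
    Revenue=("Monetary", "sum"),
)

# Shares of the overall totals
per_cluster["CustomerShare"] = per_cluster["Customers"] / per_cluster["Customers"].sum()
per_cluster["RevenueShare"] = per_cluster["Revenue"] / per_cluster["Revenue"].sum()

print(per_cluster.round(2))
```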

### Conclusions and Business Recommendations

This analysis segmented customers into four behavioural clusters using RFM metrics and K-Means clustering. The clusters differ mainly in purchase recency and monetary value, with only Cluster 1 standing out on frequency.

**Key Insights:**

* **Cluster 0 - "Recent Big Spenders, Infrequent"**: Low Recency, Very Low Frequency, Very High Monetary. Consider strategies to encourage repeat purchases and leverage their high spending potential.
* **Cluster 1 - "High-Value Frequent Champions"**: Low Recency, Very High Frequency, Very High Monetary. These are your most valuable customers; focus on retention, loyalty programs, and exclusive offers.
* **Cluster 2 - "Long-Term Inactive / Lost Customers"**: High Recency, Very Low Frequency, Low Monetary. Evaluate the ROI of win-back campaigns, as resources might be better allocated to more promising segments.
* **Cluster 3 - "New/Recent Low Spenders"**: Very Low Recency, Very Low Frequency, Very Low Monetary. Focus on nurturing, onboarding, and encouraging a second purchase to increase their engagement and lifetime value.



### Export Clustered Data

The final dataset is exported to a CSV file containing:

- The customer identifier (`customer_unique_id`)
- The original RFM metrics
- The assigned cluster label (`Cluster`)

The file is ready for integration with business dashboards or CRM tools.

In [None]:
# Export final RFM table with cluster labels
rfm.to_csv('rfm_clusters_segments.csv', index=False)