### **Introduction**

The purpose of this project is to apply unsupervised machine learning to identify meaningful customer segments for CRISA Bath Soaps.
Customer segmentation helps inform marketing, pricing, and promotional strategies by grouping consumers based on similar purchasing patterns and motivations.

We use K-Means clustering to analyze transaction-level behavior (how customers buy) and basis-for-purchase attributes (why they buy), evaluate different cluster solutions using silhouette scores, and translate the results into actionable strategic recommendations for CRISA.


We begin by importing libraries, mounting Google Drive, and loading BathSoap.xlsx. We then standardize column names for easier referencing.

The original file was provided as .xls, a legacy Excel format that pandas reads through the optional xlrd engine and that did not load reliably in our Colab environment.
We therefore converted the dataset to .xlsx to ensure compatibility and smooth loading.
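
The column-name standardization mentioned above can be illustrated with a minimal sketch. The raw headers below are hypothetical examples written in the style of the sheet, not the actual header list; the cleaning expression is the same one the notebook applies.

```python
import pandas as pd

# Hypothetical raw headers, loosely modeled on the sheet's naming style
raw_columns = pd.Index(["Member id", "No. of Brands", "Avg. Price ", "Pur Vol No Promo - %"])

# Lowercase, then collapse every run of non-alphanumeric characters into "_"
cleaned = raw_columns.str.lower().str.replace(r"[^0-9a-z]+", "_", regex=True)

print(list(cleaned))
```

Trailing punctuation or spaces in a header become a trailing underscore, which is why names such as `avg_price_` appear later in the notebook.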


In [None]:

# --- 1. Import Libraries ---
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt
import seaborn as sns

sns.set(style="whitegrid")

# --- 2. Mount Google Drive ---
from google.colab import drive
drive.mount('/content/drive')

# Folder Path
DRIVE_PATH = "/content/drive/My Drive/Colab Notebooks/"

# --- 3. Load Excel Sheets ---
file_path = DRIVE_PATH + "BathSoap.xlsx"

purchase_df = pd.read_excel(file_path, sheet_name="a.purchase behavior")
basis_df = pd.read_excel(file_path, sheet_name="b.basis-for-purchase")

print("Sheets Loaded Successfully\n")

# --- 4. Clean Column Names ---
purchase_df.columns = purchase_df.columns.str.lower().str.replace(r'[^0-9a-z]+', '_', regex=True)
basis_df.columns = basis_df.columns.str.lower().str.replace(r'[^0-9a-z]+', '_', regex=True)

# --- 5. Merge Datasets ---
df = purchase_df.merge(basis_df, on='member_id', how='inner')
print(f"Merged dataset: {len(df)} households\n")

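# Derive max_to_one_brand, which does not exist in the dataset, as 1 - share_to_other_brands.
# Note: this is an exact linear transform, so both columns carry the same information.
df["max_to_one_brand"] = 1 - df["share_to_other_brands"]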



Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Sheets Loaded Successfully

Merged dataset: 602 households
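
Because the inner merge silently keeps only households present in both sheets, it can be worth auditing the join. The following is a minimal sketch on made-up toy frames (not the BathSoap data), using the `validate` and `indicator` options of `DataFrame.merge`.

```python
import pandas as pd

# Toy stand-ins for the two sheets (values are made up)
purchase = pd.DataFrame({"member_id": [1, 2, 3, 4], "value": [10.0, 20.0, 30.0, 40.0]})
basis = pd.DataFrame({"member_id": [2, 3, 4, 5], "pr_cat_1": [0.1, 0.5, 0.2, 0.9]})

# validate= raises MergeError if member_id is duplicated on either side
inner = purchase.merge(basis, on="member_id", how="inner", validate="one_to_one")

# An outer merge with indicator=True shows which households the inner join dropped
audit = purchase.merge(basis, on="member_id", how="outer", indicator=True)
dropped = audit.loc[audit["_merge"] != "both", ["member_id", "_merge"]]

print(len(inner))
print(dropped)
```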



## **Variable Selection & Feature Definition**

In this step, we define the variables used for clustering.
Purchase behavior variables represent how customers buy (loyalty, volume, spend, price tier).

Basis-for-purchase variables represent why they buy (promotion use, price tier preference, proposition affinity).

Separating these drivers allows us to build both behavioral and motivational segmentation layers, improving interpretability and marketing actionability.
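
One point worth checking when defining loyalty features: a variable derived as one minus another is perfectly collinear with it. The sketch below uses synthetic values (not the BathSoap data) to show that, after standardization, such a pair is the same signal with the sign flipped, so including both would effectively count that signal twice in a Euclidean distance.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
share_other = rng.uniform(0, 1, size=200)   # synthetic stand-in for share_to_other_brands
max_to_one = 1 - share_other                # same derivation as max_to_one_brand

X = np.column_stack([share_other, max_to_one])
Z = StandardScaler().fit_transform(X)

# Correlation between the two columns
corr = np.corrcoef(share_other, max_to_one)[0, 1]
print(corr)

# Standardized columns are mirror images of each other
print(np.allclose(Z[:, 0], -Z[:, 1]))
```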

In [None]:
# ============================================
# PART 1(a): VARIABLE SELECTION
# ============================================

print("="*60)
print("PART 1(a): VARIABLE SELECTION & JUSTIFICATION")
print("="*60)

# Purchase Behavior Variables
purchase_vars = [
    "no_of_brands",           # Brand loyalty: fewer brands = more loyal
    "brand_runs",             # Switching: more runs = more switching
    "total_volume",           # Volume: total purchase quantity
    "no_of_trans",            # Frequency: transaction count
    "value",                  # Monetary: total spend
    "avg_price_",             # Price tier: average price paid
    "share_to_other_brands",  # Share of purchases going to other brands (lower = more loyal)
    "max_to_one_brand"        # Derived loyalty metric: 1 - share_to_other_brands
]

print("\n PURCHASE BEHAVIOR VARIABLES (8 variables):")
print("  1. no_of_brands: Measures brand loyalty (fewer = more loyal)")
print("  2. brand_runs: Measures switching behavior (more runs = less loyal)")
print("  3. total_volume: Total quantity purchased (volume)")
print("  4. no_of_trans: Purchase frequency")
print("  5. value: Total monetary value spent")
print("  6. avg_price_: Average price point preference")
print("  7. share_to_other_brands: Loyalty to top brand (lower = more loyal)")
print("  8. max_to_one_brand: loyalty metric")
print("\n  EXCLUDED: others_999 (not a meaningful loyalty/behavior metric)")

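# Log-transform right-skewed variables.
# NOTE: this modifies purchase_df after it was merged into df, so the clustering
# inputs taken from df below are NOT transformed (the reported results use raw values).
# To cluster on log-scaled values, apply the transform to df[col] instead.
skewed = ["total_volume","value","avg_price_"]
for col in skewed:
    purchase_df[col] = np.log1p(purchase_df[col].clip(lower=0))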


# Basis-for-Purchase Variables
basis_vars = [
    # Promotional Sensitivity (3 vars)
    "pur_vol_no_promo_",
    "pur_vol_promo_6_",
    "pur_vol_other_promo_",
    # Price Categories (4 vars)
    "pr_cat_1", "pr_cat_2", "pr_cat_3", "pr_cat_4",
    # Selling Propositions (11 vars)
    "propcat_5","propcat_6","propcat_7","propcat_8","propcat_9",
    "propcat_10","propcat_11","propcat_12","propcat_13","propcat_14","propcat_15"
]


print("\n BASIS-FOR-PURCHASE VARIABLES (18 variables):")
print("  Promotional Sensitivity (3 vars):")
print("    - pur_vol_no_promo_: % purchased without promotion")
print("    - pur_vol_promo_6_: % purchased on Promo Code 6")
print("    - pur_vol_other_promo_: % purchased on other promotions")
print("\n  Price Categories (4 vars):")
print("    - pr_cat_1 to pr_cat_4: Distribution across price tiers")
print("\n  Selling Propositions (11 vars):")
print("    - propcat_5 to propcat_15: Product attributes driving purchase")

print("\n NOTES:")
print("  • All selling proposition categories should be used to capture")
print("    the full range of purchase motivations")
print("  • These variables help identify WHAT drives purchase decisions")

combined_vars = purchase_vars + basis_vars

comb_data = df[combined_vars].dropna()



PART 1(a): VARIABLE SELECTION & JUSTIFICATION

 PURCHASE BEHAVIOR VARIABLES (8 variables):
  1. no_of_brands: Measures brand loyalty (fewer = more loyal)
  2. brand_runs: Measures switching behavior (more runs = less loyal)
  3. total_volume: Total quantity purchased (volume)
  4. no_of_trans: Purchase frequency
  5. value: Total monetary value spent
  6. avg_price_: Average price point preference
  7. share_to_other_brands: Loyalty to top brand (lower = more loyal)
  8. max_to_one_brand: loyalty metric

  EXCLUDED: others_999 (not a meaningful loyalty/behavior metric)

 BASIS-FOR-PURCHASE VARIABLES (18 variables):
  Promotional Sensitivity (3 vars):
    - pur_vol_no_promo_: % purchased without promotion
    - pur_vol_promo_6_: % purchased on Promo Code 6
    - pur_vol_other_promo_: % purchased on other promotions

  Price Categories (4 vars):
    - pr_cat_1 to pr_cat_4: Distribution across price tiers

  Selling Propositions (11 vars):
    - propcat_5 to propcat_15: Product attributes 

# **Helper Functions for Clustering Evaluation**

We define reusable helper functions to run K-Means and compute key metrics:

Inertia → cluster compactness

Silhouette score → quality of cluster separation

All variables are standardized prior to clustering so that each feature contributes proportionally to the Euclidean distance calculations. This avoids bias toward variables with larger numeric scales and improves clustering reliability.

These functions allow us to systematically evaluate multiple k values and select an optimal number of clusters based on both statistical performance and business interpretability.
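
To illustrate why standardization matters for Euclidean distance, here is a minimal sketch with three hypothetical households described by two features on very different scales. The numbers are invented for illustration only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Three hypothetical households: [no_of_brands, total_volume]
X = np.array([
    [1.0,  8000.0],
    [6.0,  8100.0],   # very different brand count, similar volume
    [1.0, 12000.0],   # same brand count, different volume
])

def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(a - b))

# Raw scale: volume dominates, so household 0 looks far closer to 1 than to 2
raw_01, raw_02 = dist(X[0], X[1]), dist(X[0], X[2])
print(raw_01, raw_02)

# Standardized: both features contribute on a comparable scale
Z = StandardScaler().fit_transform(X)
std_01, std_02 = dist(Z[0], Z[1]), dist(Z[0], Z[2])
print(std_01, std_02)
```

On the raw scale the brand-count difference is almost invisible next to the volume difference; after scaling, the two pairwise distances are of similar size.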


In [None]:
 # ============================================
# CLUSTERING FUNCTIONS
# ============================================

def run_kmeans(data, k):
    """Run K-means clustering and return inertia, silhouette score, and labels"""
    scaler = StandardScaler()
    X = scaler.fit_transform(data)
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = km.fit_predict(X)
    return km.inertia_, silhouette_score(X, labels), labels

def evaluate_k_range(data, ks=range(2,6), title=""):
    """Evaluate multiple k values"""
    results = []
    for k in ks:
        inertia, sil, _ = run_kmeans(data, k)
        results.append({'k': k, 'inertia': inertia, 'silhouette': sil})

    results_df = pd.DataFrame(results)
    print(f"\n{title}")
    print(results_df.to_string(index=False))
    return results_df

## **Purchase Behavior Clustering**

We apply K-Means on purchase-behavior features to group households by loyalty, purchase frequency, spend, and price sensitivity.

Silhouette scores for k = 2–5 guide the optimal choice, balancing cluster accuracy and interpretability.
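
As a reminder of what the silhouette score measures, the sketch below computes it by hand on a tiny made-up dataset and compares the result with `sklearn.metrics.silhouette_score`. For each point, `a` is the mean distance to the other points in its own cluster, `b` is the mean distance to the nearest other cluster, and the point's score is `(b - a) / max(a, b)`.

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Tiny 1-D toy dataset with two obvious groups (values are made up)
X = np.array([[0.0], [1.0], [10.0], [11.0]])
labels = np.array([0, 0, 1, 1])

def silhouette_by_hand(X, labels):
    scores = []
    for i in range(len(X)):
        d = np.linalg.norm(X - X[i], axis=1)     # distances from point i to all points
        same = (labels == labels[i])
        same[i] = False                          # exclude the point itself
        a = d[same].mean()                       # mean distance within own cluster
        b = min(d[labels == c].mean()            # mean distance to nearest other cluster
                for c in np.unique(labels) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

manual = silhouette_by_hand(X, labels)
reference = silhouette_score(X, labels)
print(manual, reference)
```

Scores near 1 indicate compact, well-separated clusters; scores near 0 indicate overlapping clusters.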

In [None]:
# ============================================
# PART 1: PURCHASE BEHAVIOR CLUSTERING
# ============================================

print("\n" + "="*60)
print("PART 1(a): PURCHASE BEHAVIOR CLUSTERING")
print("="*60)

pb_data = df[purchase_vars].dropna()
print(f"\nClustering {len(pb_data)} households on purchase behavior...")

# Evaluate k=2 to k=5
pb_results = evaluate_k_range(pb_data, title="Purchase Behavior - K Evaluation:")

# Run final clustering for k=3 and k=4
inertia3, sil3, labels3 = run_kmeans(pb_data, 3)
inertia4, sil4, labels4 = run_kmeans(pb_data, 4)

df.loc[pb_data.index, 'pb_k3'] = labels3
df.loc[pb_data.index, 'pb_k4'] = labels4

print("\n RECOMMENDATION: k=3 is optimal")
print(f"   • Silhouette score (k=3): {sil3:.3f}")
print(f"   • Silhouette score (k=4): {sil4:.3f}")
print("   • Minimal improvement with k=4, k=3 provides better interpretability")

print("\n" + "-"*60)
print("CLUSTER PROFILES (k=3) - Purchase Behavior")
print("-"*60)
pb_profiles = df.groupby('pb_k3')[purchase_vars].mean()
display(pb_profiles.round(2))



PART 1(a): PURCHASE BEHAVIOR CLUSTERING

Clustering 600 households on purchase behavior...

Purchase Behavior - K Evaluation:
 k     inertia  silhouette
 2 3545.026332    0.248141
 3 2828.773855    0.249916
 4 2324.843691    0.259839
 5 2078.324385    0.234045

 RECOMMENDATION: k=3 is optimal
   • Silhouette score (k=3): 0.250
   • Silhouette score (k=4): 0.260
   • Minimal improvement with k=4, k=3 provides better interpretability

------------------------------------------------------------
CLUSTER PROFILES (k=3) - Purchase Behavior
------------------------------------------------------------


| pb_k3 | no_of_brands | brand_runs | total_volume | no_of_trans | value | avg_price_ | share_to_other_brands | max_to_one_brand |
|---|---|---|---|---|---|---|---|---|
| 0.0 | 3.42 | 14.56 | 8035.97 | 25.71 | 963.06 | 276540.33 | 0.21 | 0.79 |
| 1.0 | 2.81 | 7.93 | 10919.64 | 22.69 | 1040.5 | 432768.32 | 0.76 | 0.24 |
| 2.0 | 4.83 | 25.63 | 19307.87 | 48.69 | 2254.17 | 496154.23 | 0.26 | 0.74 |


# **Behavioral Segment Profiles**

We now interpret each cluster and assign marketing-friendly labels using purchase patterns such as volume, switching behavior, and spending levels.
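
The labeling rule used for these profiles compares each cluster mean against the 33rd and 67th percentiles of the cluster means themselves. A minimal sketch of that rule for `no_of_brands`, using the three k=3 cluster means reported in the profile output:

```python
import pandas as pd

# Cluster means for no_of_brands, taken from the k=3 profile output
brands = pd.Series({0: 3.42, 1: 2.81, 2: 4.83})

# Thresholds are computed across the three cluster means, not across households
lo, hi = brands.quantile(0.33), brands.quantile(0.67)

def loyalty_label(x):
    if x <= lo:
        return "HIGHLY LOYAL"
    elif x <= hi:
        return "MODERATELY LOYAL"
    return "LOW LOYALTY"

labels = brands.apply(loyalty_label)
print(round(lo, 2), round(hi, 2))
print(labels.to_dict())
```

Because the thresholds come from only three values, the labels are relative ranks among the clusters rather than absolute judgments: the cluster with the fewest brands is always "HIGHLY LOYAL", whatever its actual level.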

In [None]:
 # Interpret clusters
print("\n CLUSTER INTERPRETATION (Purchase Behavior):")
pb_profiles_sorted = pb_profiles.copy()

for idx, row in pb_profiles_sorted.iterrows():
    cluster_num = int(idx)
    print(f"\n  Cluster {cluster_num}:")

    # Determine cluster characteristics
    if row['no_of_brands'] <= pb_profiles['no_of_brands'].quantile(0.33):
        loyalty = "HIGHLY LOYAL"
    elif row['no_of_brands'] <= pb_profiles['no_of_brands'].quantile(0.67):
        loyalty = "MODERATELY LOYAL"
    else:
        loyalty = "LOW LOYALTY"

    if row['total_volume'] >= pb_profiles['total_volume'].quantile(0.67):
        volume = "HIGH VOLUME"
    elif row['total_volume'] >= pb_profiles['total_volume'].quantile(0.33):
        volume = "MEDIUM VOLUME"
    else:
        volume = "LOW VOLUME"

    if row['avg_price_'] >= pb_profiles['avg_price_'].quantile(0.67):
        price = "PREMIUM"
    elif row['avg_price_'] >= pb_profiles['avg_price_'].quantile(0.33):
        price = "MID-RANGE"
    else:
        price = "VALUE"

    print(f"    • Brand Loyalty: {loyalty} ({row['no_of_brands']:.1f} brands)")
    print(f"    • Max to One Brand Share: {row['max_to_one_brand']:.2f}")
    print(f"    • Purchase Volume: {volume} ({row['total_volume']:.0f} units)")
    print(f"    • Price Segment: {price} (₹{row['avg_price_']:.0f} avg)")
    print(f"    • Frequency: {row['no_of_trans']:.1f} transactions")
    print(f"    • Total Value: ₹{row['value']:.0f}")



 CLUSTER INTERPRETATION (Purchase Behavior):

  Cluster 0:
    • Brand Loyalty: MODERATELY LOYAL (3.4 brands)
    • Max to One Brand Share: 0.79
    • Purchase Volume: LOW VOLUME (8036 units)
    • Price Segment: VALUE (₹276540 avg)
    • Frequency: 25.7 transactions
    • Total Value: ₹963

  Cluster 1:
    • Brand Loyalty: HIGHLY LOYAL (2.8 brands)
    • Max to One Brand Share: 0.24
    • Purchase Volume: MEDIUM VOLUME (10920 units)
    • Price Segment: MID-RANGE (₹432768 avg)
    • Frequency: 22.7 transactions
    • Total Value: ₹1040

  Cluster 2:
    • Brand Loyalty: LOW LOYALTY (4.8 brands)
    • Max to One Brand Share: 0.74
    • Purchase Volume: HIGH VOLUME (19308 units)
    • Price Segment: PREMIUM (₹496154 avg)
    • Frequency: 48.7 transactions
    • Total Value: ₹2254


# **Basis-for-Purchase Clustering**

Here we cluster households based on purchase motivations (promo use, price tier mix, and proposition affinity) to understand why they buy.

In [None]:
# ============================================
# PART 1(b): BASIS-FOR-PURCHASE CLUSTERING
# ============================================

# Cluster on the basis-for-purchase variables, mirroring the purchase-behavior cell.
# sil3_b, sil4_b, basis_k4 and basis_profiles are referenced elsewhere in the notebook;
# basis_k3 is added by analogy, and profiling at k=4 is an assumption based on the
# basis_k4 cross-tab used in PART 2(b).
basis_data = df[basis_vars].dropna()
inertia3_b, sil3_b, labels3_b = run_kmeans(basis_data, 3)
inertia4_b, sil4_b, labels4_b = run_kmeans(basis_data, 4)

df.loc[basis_data.index, 'basis_k3'] = labels3_b
df.loc[basis_data.index, 'basis_k4'] = labels4_b
basis_profiles = df.groupby('basis_k4')[basis_vars].mean()

# NOTE: sil3c and sil4c come from the combined-clustering cell (PART 1(c));
# run that cell before the comparison below.
print("\n" + "="*60)
print("PART 2(a): BEST SEGMENTATION APPROACH")
print("="*60)

print("\n SILHOUETTE SCORE COMPARISON:")
comparison = pd.DataFrame({
    'Approach': ['Purchase Behavior', 'Basis-for-Purchase', 'Combined'],
    'k=3': [sil3, sil3_b, sil3c],
    'k=4': [sil4, sil4_b, sil4c]
})
display(comparison.round(3))

print("\n RECOMMENDED SEGMENTATION: Purchase Behavior with k=3")
print("\nJUSTIFICATION:")
print(f"  1. Purchase Behavior has the highest silhouette at k=3 ({sil3:.3f}),")
print(f"     ahead of Basis-for-Purchase ({sil3_b:.3f}) and Combined ({sil3c:.3f})")
print("  2. Purchase behavior directly aligns with business levers:")
print("     • Volume, frequency, and spend drive revenue & retention")
print("     • Brand loyalty is trackable and targetable")
print("  3. k=3 provides optimal balance:")
print("     • Clear segments without over-fragmentation")
print("     • Easy to operationalize and personalize marketing strategy")
print("  4. Combined approach has weaker scores due to:")
print("     • Curse of dimensionality (26 variables)")
print("     • Mixed behavioral and attitudinal signals")
print("     • Weaker cluster separation (~0.13)")

print("\n WHEN TO USE BASIS-FOR-PURCHASE:")
print("  • Use within purchase-behavior segments to refine messaging & product positioning")
print("  • Explains WHY users behave differently")
print("  • Informs targeting, creative, and promotion strategy")

# Display top selling propositions
print("\n SELLING PROPOSITION PREFERENCES (Top 5 per cluster):")
prop_cats = [c for c in basis_vars if c.startswith('propcat_')]
for idx, row in basis_profiles.iterrows():
    cluster_num = int(idx)
    top_props = row[prop_cats].nlargest(5)
    print(f"\n  Cluster {cluster_num}:")
    for prop, val in top_props.items():
        if val > 0.05:  # Only show if >5%
            print(f"    • {prop}: {val:.2%}")



PART 2(a): BEST SEGMENTATION APPROACH

 SILHOUETTE SCORE COMPARISON:


| Approach | k=3 | k=4 |
|---|---|---|
| Purchase Behavior | 0.25 | 0.26 |
| Basis-for-Purchase | 0.19 | 0.207 |
| Combined | 0.131 | 0.122 |



 RECOMMENDED SEGMENTATION: Purchase Behavior with k=3

JUSTIFICATION:
  1. Purchase Behavior has the highest silhouette at k=3 (0.250),
     ahead of Basis-for-Purchase (0.190) and Combined (0.131)
  2. Purchase behavior directly aligns with business levers:
     • Volume, frequency, and spend drive revenue & retention
     • Brand loyalty is trackable and targetable
  3. k=3 provides optimal balance:
     • Clear segments without over-fragmentation
     • Easy to operationalize and personalize marketing strategy
  4. Combined approach has weaker scores due to:
     • Curse of dimensionality (26 variables)
     • Mixed behavioral and attitudinal signals
     • Weaker cluster separation (~0.13)

 WHEN TO USE BASIS-FOR-PURCHASE:
  • Use within purchase-behavior segments to refine messaging & product positioning
  • Explains WHY users behave differently
  • Informs targeting, creative, and promotion strategy

 SELLING PROPOSITION PREFERENCES

# **Combined Feature Clustering**

We cluster using both variable sets.

The combined model is used for comparison but is less interpretable due to high dimensionality.
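
The effect of dimensionality on cluster separation can be illustrated with a small synthetic experiment. This is a sketch on generated data, not the BathSoap variables: three clearly separated groups are clustered once on their two informative features and once after appending 20 noise columns that carry no cluster structure.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Three well-separated synthetic groups in 2 informative dimensions
centers = np.array([[-5.0, -5.0], [0.0, 5.0], [5.0, -5.0]])
informative = np.vstack([c + rng.normal(size=(100, 2)) for c in centers])

# Append 20 pure-noise columns that carry no cluster structure
noise = rng.normal(size=(300, 20))
with_noise = np.hstack([informative, noise])

def sil_k3(data):
    """Standardize, fit K-Means with k=3, and return the silhouette score."""
    X = StandardScaler().fit_transform(data)
    labels = KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(X)
    return silhouette_score(X, labels)

sil_clean = sil_k3(informative)
sil_noisy = sil_k3(with_noise)
print(f"2 features: {sil_clean:.3f}, 22 features: {sil_noisy:.3f}")
```

Uninformative dimensions add roughly equal distance between every pair of points, which shrinks the gap between within-cluster and between-cluster distances and pulls the silhouette toward zero.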

In [None]:
# ============================================
# PART 1(c): COMBINED CLUSTERING
# ============================================

print("\n" + "="*60)
print("PART 1(c): COMBINED CLUSTERING (Both Dimensions)")
print("="*60)

combined_vars = purchase_vars + basis_vars
combined_data = df[combined_vars].dropna()
print(f"\nClustering {len(combined_data)} households on combined variables...")

# Evaluate k=2 to k=5
combined_results = evaluate_k_range(combined_data, title="Combined - K Evaluation:")

# Run final clustering
inertia3c, sil3c, labels3c = run_kmeans(combined_data, 3)
inertia4c, sil4c, labels4c = run_kmeans(combined_data, 4)

df.loc[combined_data.index, 'comb_k3'] = labels3c
df.loc[combined_data.index, 'comb_k4'] = labels4c

print("\n Insight: Combined segmentation has low separation.")
print("   • We use purchase-behavior k=3 as the primary model.")
print(f"   • Silhouette score (k=3): {sil3c:.3f}")
print(f"   • Silhouette score (k=4): {sil4c:.3f}")


PART 1(c): COMBINED CLUSTERING (Both Dimensions)

Clustering 600 households on combined variables...

Combined - K Evaluation:
 k      inertia  silhouette
 2 13787.609207    0.182679
 3 12626.068909    0.130566
 4 11802.522187    0.122231
 5 11065.351142    0.147312

 Insight: Combined segmentation has low separation.
   • We use purchase-behavior k=3 as the primary model.
   • Silhouette score (k=3): 0.131
   • Silhouette score (k=4): 0.122


# **Best Segmentation Model Selection**

In this step, we compare segmentation performance across the three approaches: Purchase Behavior clustering, Basis-for-Purchase clustering, and the Combined model (behavior + basis).

We evaluate each method using silhouette scores at k = 3 and k = 4. The silhouette score indicates how well-separated and compact the clusters are.

The results show that Purchase Behavior (k = 3) offers the best balance of cluster separation and interpretability.

In [None]:
# ============================================
# PART 2: MODEL SELECTION & JUSTIFICATION
# ============================================

print("\n" + "="*60)
print("PART 2(a): BEST SEGMENTATION APPROACH")
print("="*60)

print("\n SILHOUETTE SCORE COMPARISON:")
comparison = pd.DataFrame({
    'Approach': ['Purchase Behavior', 'Basis-for-Purchase', 'Combined'],
    'k=3': [sil3, sil3_b, sil3c],
    'k=4': [sil4, sil4_b, sil4c]
})
display(comparison.round(3))

print("\n RECOMMENDED SEGMENTATION: Purchase Behavior with k=3")

print("\nJUSTIFICATION:")
print(f"  1. Purchase Behavior has the highest silhouette at k=3 ({sil3:.3f}),")
print(f"     ahead of Basis-for-Purchase ({sil3_b:.3f}) and Combined ({sil3c:.3f})")
print("  2. Purchase behavior directly relates to business objectives:")
print("     • Volume, frequency, and spend are actionable levers")
print("     • Brand loyalty is measurable and trackable")
print("  3. k=3 provides optimal balance:")
print("     • Enough segments for targeted marketing")
print("     • Interpretable and operationalizable")
print("     • Avoids needlessly granular clusters")
print("  4. Combined approach has lower scores due to:")
print("     • Curse of dimensionality across 26 variables")
print("     • Mixed signals between behavioral and attitudinal data")
print("     • Weaker cluster separation (~0.13)")
print("\n WHEN TO USE BASIS-FOR-PURCHASE:")
print("  • Use as secondary segmentation WITHIN purchase behavior segments")
print("  • Helps understand WHY customers behave differently")
print("  • Informs messaging and product development")



PART 2(a): BEST SEGMENTATION APPROACH

 SILHOUETTE SCORE COMPARISON:


| Approach | k=3 | k=4 |
|---|---|---|
| Purchase Behavior | 0.25 | 0.26 |
| Basis-for-Purchase | 0.19 | 0.207 |
| Combined | 0.131 | 0.122 |



 RECOMMENDED SEGMENTATION: Purchase Behavior with k=3

JUSTIFICATION:
  1. Purchase Behavior has the highest silhouette at k=3 (0.250),
     ahead of Basis-for-Purchase (0.190) and Combined (0.131)
  2. Purchase behavior directly relates to business objectives:
     • Volume, frequency, and spend are actionable levers
     • Brand loyalty is measurable and trackable
  3. k=3 provides optimal balance:
     • Enough segments for targeted marketing
     • Interpretable and operationalizable
     • Avoids needlessly granular clusters
  4. Combined approach has lower scores due to:
     • Curse of dimensionality across 26 variables
     • Mixed signals between behavioral and attitudinal data
     • Weaker cluster separation (~0.13)

 WHEN TO USE BASIS-FOR-PURCHASE:
  • Use as secondary segmentation WITHIN purchase behavior segments
  • Helps understand WHY customers behave differently
  • Informs messaging and product development


# **Linking Basis & Behavior (Hybrid Interpretation)**

Here we integrate the two segmentation layers by identifying which basis-cluster is most dominant within each purchase-behavior segment.
This creates a hybrid segmentation view where behavior defines the primary structure (how customers buy) and basis-variables refine it with motivation insights (why customers buy).

The result is a clear and actionable targeting framework that informs pricing, promotion, communication and portfolio strategy.
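
The linking step can be sketched on toy labels as follows. The eight rows below are made up for illustration; only the column names `pb_k3` and `basis_k4` follow the notebook.

```python
import pandas as pd

# Toy cluster labels for eight households (values are made up)
toy = pd.DataFrame({
    "pb_k3":    [0, 0, 0, 1, 1, 1, 2, 2],
    "basis_k4": [3, 3, 1, 0, 0, 0, 2, 2],
})

# Rows: behavior segments, columns: basis segments, cells: household counts
cross_tab = pd.crosstab(toy["pb_k3"], toy["basis_k4"])

# Row-wise idxmax picks the basis cluster with the most households
dominant = cross_tab.idxmax(axis=1)

# Row-normalized shares show how dominant that basis cluster really is
shares = pd.crosstab(toy["pb_k3"], toy["basis_k4"], normalize="index")

print(dominant.to_dict())
print(shares.round(2))
```

Looking at the row shares alongside the dominant label helps avoid over-reading a "dominant" basis cluster that holds only a slim plurality of a behavior segment.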

In [None]:
# ============================================
# PART 2(b): DETAILED CLUSTER CHARACTERISTICS
# ============================================

print("\n" + "="*60)
print("PART 2(b): DETAILED CLUSTER CHARACTERISTICS")
print("="*60)

# Purchase-behavior profiles
pb_profiles = df.groupby('pb_k3')[purchase_vars].mean()
cluster_sizes = df['pb_k3'].value_counts().sort_index()

# Cross-tab once, outside loop
if 'basis_k4' in df.columns:
    cross_tab = pd.crosstab(df['pb_k3'], df['basis_k4'])

# Name clusters using structured rule logic
cluster_names = {}
for idx, row in pb_profiles.iterrows():
    cluster_num = int(idx)

    # Segment naming logic
    if row['total_volume'] >= pb_profiles['total_volume'].quantile(0.67):
        if row['no_of_brands'] <= 2:
            name = "High-Value Loyalists"
        else:
            name = "High-Volume Explorers"
    elif row['avg_price_'] >= pb_profiles['avg_price_'].quantile(0.67):
        name = "Premium Shoppers"
    elif row['no_of_brands'] <= 2:
        name = "Loyal Budget Buyers"
    else:
        name = "Value-Conscious Switchers"

    cluster_names[cluster_num] = name

# Print cluster descriptions
print("\n" + "="*60)
for cluster_num, name in cluster_names.items():
    row = pb_profiles.loc[cluster_num]
    size = cluster_sizes.loc[cluster_num]
    pct = size / cluster_sizes.sum() * 100

    print(f"\n CLUSTER {cluster_num}: {name}")
    print(f"   Size: {size} households ({pct:.1f}% of sample)")
    print(f"\n   Purchase Behavior Profile:")
    print(f"   • Brand Loyalty: {row['no_of_brands']:.1f} brands")
    print(f"   • Switching: {row['brand_runs']:.1f} brand runs")
    print(f"   • Volume: {row['total_volume']:.0f} units")
    print(f"   • Frequency: {row['no_of_trans']:.1f} transactions")
    print(f"   • Spend: ₹{row['value']:.0f}")
    print(f"   • Avg Price: ₹{row['avg_price_']:.0f} per unit")
    print(f"   • Share to Other Brands: {row['share_to_other_brands']:.1%}")

    # Link to basis drivers if available
    if 'basis_k4' in df.columns:
        if cluster_num in cross_tab.index:
            dominant_basis = cross_tab.loc[cluster_num].idxmax()
            basis_summary = df.groupby('basis_k4')[basis_vars].mean().loc[dominant_basis]

            print(f"\n   Key Purchase Drivers:")
            print(f"   • % No Promo: {basis_summary['pur_vol_no_promo_']:.1%}")
            print(f"   • % Promo Code 6: {basis_summary['pur_vol_promo_6_']:.1%}")
            print(f"   • % Other Promos: {basis_summary['pur_vol_other_promo_']:.1%}")

    print("   " + "-"*56)



PART 2(b): DETAILED CLUSTER CHARACTERISTICS


 CLUSTER 0: Value-Conscious Switchers
   Size: 270 households (45.0% of sample)

   Purchase Behavior Profile:
   • Brand Loyalty: 3.4 brands
   • Switching: 14.6 brand runs
   • Volume: 8036 units
   • Frequency: 25.7 transactions
   • Spend: ₹963
   • Avg Price: ₹276540 per unit
   • Share to Other Brands: 20.6%

   Key Purchase Drivers:
   • % No Promo: 95.3%
   • % Promo Code 6: 2.9%
   • % Other Promos: 1.8%
   --------------------------------------------------------

 CLUSTER 1: Value-Conscious Switchers
   Size: 166 households (27.7% of sample)

   Purchase Behavior Profile:
   • Brand Loyalty: 2.8 brands
   • Switching: 7.9 brand runs
   • Volume: 10920 units
   • Frequency: 22.7 transactions
   • Spend: ₹1040
   • Avg Price: ₹432768 per unit
   • Share to Other Brands: 75.5%

   Key Purchase Drivers:
   • % No Promo: 95.3%
   • % Promo Code 6: 2.9%
   • % Other Promos: 1.8%
   ------

In this final step, we translate the identified customer segments into actionable marketing strategies.

For each purchase-behavior segment, we recommend tailored approaches across advertising channels and messaging, promotional tactics and cadence, pricing strategy, and key performance metrics.

This ensures the segmentation output is not only statistically valid, but also commercially executable.

The goal is to enable CRISA to design targeted campaigns that optimize acquisition, retention, loyalty, and value growth for each segment.

In [None]:
# ============================================
# MARKETING RECOMMENDATIONS
# ============================================

print("\n" + "="*60)
print("PART 2(b): MARKETING & PROMOTIONAL RECOMMENDATIONS")
print("="*60)

for cluster_num, name in cluster_names.items():
    row = pb_profiles.loc[cluster_num]

    print(f"\n CLUSTER {cluster_num}: {name}")
    print("="*60)

    # Branch on the segment names assigned in cluster_names
    if "High-Value Loyal" in name:
        print("\n ADVERTISING STRATEGY:")
        print("  • Channel: Premium media + brand-owned digital channels")
        print("  • Message: Superior quality, heritage, trust")
        print("  • Frequency: Consistent, not aggressive")
        print("  • Creative: Testimonials, ingredient superiority, brand legacy")

        print("\n PROMOTIONAL STRATEGY:")
        print("  • Loyalty rewards, VIP tiers, early access")
        print("  • Premium gift bundles, seasonal exclusives")
        print("  • Limited-time prestige offers (not frequent discounts)")

        print("\n PRICING STRATEGY:")
        print("  • Maintain premium pricing")
        print("  • Larger premium packs for value without discounting")

    elif "Explorer" in name or "Switch" in name:
        print("\n ADVERTISING STRATEGY:")
        print("  • Channel: Mass digital + TV + influencer campaigns")
        print("  • Message: Newness, product variety, competitive edge")
        print("  • Frequency: High")
        print("  • Creative: Comparison ads, innovation messaging")

        print("\n PROMOTIONAL STRATEGY:")
        print("  • Frequent promo cycles: BOGO, trial packs, coupons")
        print("  • Cross-category bundles to build loyalty")
        print("  • Sampling & trial programs to reduce switching")

        print("\n PRICING STRATEGY:")
        print("  • Competitive pricing")
        print("  • Value packs + periodic discounts")

    elif "Loyal Budget" in name:
        print("\n ADVERTISING STRATEGY:")
        print("  • Channel: Targeted social + in-store POS displays")
        print("  • Message: Consistency, trust, family value")
        print("  • Frequency: Moderate, evergreen messaging")
        print("  • Creative: Quality + affordability reassurance")

        print("\n PROMOTIONAL STRATEGY:")
        print("  • Loyalty points, referral rewards")
        print("  • Multi-pack value offers and subscription deals")
        print("  • Periodic promotions tied to volume rewards")

        print("\n PRICING STRATEGY:")
        print("  • Stable pricing with occasional value packs")
        print("  • Good-better-best structure to move them upward")

    else:  # Value-Conscious Switchers
        print("\n ADVERTISING STRATEGY:")
        print("  • Channel: Price-focused digital, local flyers, WhatsApp/SMS")
        print("  • Message: Best price, everyday savings")
        print("  • Frequency: High during promo periods")
        print("  • Creative: Deal banners, big-savings messaging")

        print("\n PROMOTIONAL STRATEGY:")
        print("  • Deep discounts (20–30% off) + multi-buy offers")
        print("  • Coupons, retailer loyalty tie-ins, festival offers")
        print("  • Payday & end-of-month campaigns")

        print("\n PRICING STRATEGY:")
        print("  • Economy packs, entry tier SKUs")
        print("  • Price-match positioning")

    print("\n KEY METRICS TO TRACK:")
    print("  • Segment share growth")
    print("  • Repeat purchase & churn")
    print("  • Promo lift & ROI")
    print("  • Average basket value & frequency")

print("\n" + "="*60)
print("ANALYSIS COMPLETE ")
print("="*60)

# Save results
df.to_csv(DRIVE_PATH + "segmented_customers.csv", index=False)
print("\n💾 Results saved to 'segmented_customers.csv'")


PART 2(b): MARKETING & PROMOTIONAL RECOMMENDATIONS

 CLUSTER 0: Value-Conscious Switchers

 ADVERTISING STRATEGY:
  • Channel: Mass digital + TV + influencer campaigns
  • Message: Newness, product variety, competitive edge
  • Frequency: High
  • Creative: Comparison ads, innovation messaging

 PROMOTIONAL STRATEGY:
  • Frequent promo cycles: BOGO, trial packs, coupons
  • Cross-category bundles to build loyalty
  • Sampling & trial programs to reduce switching

 PRICING STRATEGY:
  • Competitive pricing
  • Value packs + periodic discounts

 KEY METRICS TO TRACK:
  • Segment share growth
  • Repeat purchase & churn
  • Promo lift & ROI
  • Average basket value & frequency

 CLUSTER 1: Value-Conscious Switchers

 ADVERTISING STRATEGY:
  • Channel: Mass digital + TV + influencer campaigns
  • Message: Newness, product variety, competitive edge
  • Frequency: High
  • Creative: Comparison ads, innovation messaging

 PROMOTIONAL STRATEGY:
  • Frequ