# Notebook 05: Products per segment analysis

 "Through RFM analysis, companies can identify valuable customers who bring the most profit per average purchase, develop individual marketing campaigns for different segments, and optimize marketing costs by reducing them for those who buy infrequently and bring low income." (Kobets & Yashyna, 2025, p. 41)

**Input Files:**
- `rfm_customer_segments.csv` (from Notebook 03)
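- `orders_optimized.csv` (processed version of the raw `orders.csv`)
- `orders_prior_optimized.csv` (processed version of the raw `order_products__prior.csv`)
- `products_optimized.csv` (processed version of the raw `products.csv`)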

**Output Files:**
- `products_per_segment.csv` (top 10 products per segment)

### The Problem

Without knowing which products each segment prefers:
- Promotions are generic (one-size-fits-all)
- Bundles don't match segment preferences
- Inventory decisions lack customer insight
- Marketing budget is wasted on irrelevant products

### The Solution

**Link customers → orders → products → segments** to identify:
- Which products are most popular per segment
- Which products should be bundled together for each segment
- Which products to promote to which customers
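
The linking chain above can be sketched on a few made-up rows. This is a minimal illustration, not the notebook's actual data: the `toy_*` tables and their values are assumptions invented for the example, and only the column names mirror the real files.

```python
import pandas as pd

# Toy stand-ins for the four input tables (all values are invented)
toy_rfm = pd.DataFrame({"user_id": [1, 2], "segment": ["Loyal", "Frugal"]})
toy_orders = pd.DataFrame({"order_id": [10, 11, 12], "user_id": [1, 1, 2]})
toy_order_products = pd.DataFrame({
    "order_id": [10, 10, 11, 12],
    "product_id": [100, 101, 100, 101],
})
toy_products = pd.DataFrame({
    "product_id": [100, 101],
    "product_name": ["Banana", "Soda"],
})

# segment -> customer -> order -> product, one row per purchased item
toy_linked = (
    toy_orders.merge(toy_rfm, on="user_id")
    .merge(toy_order_products, on="order_id")
    .merge(toy_products, on="product_id")
)

# Count how often each product was bought within each segment
toy_counts = (
    toy_linked.groupby(["segment", "product_name"])
    .size()
    .reset_index(name="purchase_count")
)
print(toy_counts)
```

Each row of `toy_linked` is one purchased item tagged with the buyer's segment, so popularity per segment reduces to a simple group-and-count.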


In [13]:
# STEP 1: IMPORT LIBRARIES
import pandas as pd
import gc  # Garbage collector: used to free RAM
import os
import json
from datetime import datetime

print("="*70)
print("NOTEBOOK 05: PRODUCTS PER SEGMENT ANALYSIS")
print("="*70)
print(f"Execution started: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

NOTEBOOK 05: PRODUCTS PER SEGMENT ANALYSIS
Execution started: 2026-02-20 19:37:47


In [14]:
# STEP 2: DEFINE PATHS

# Input files (from previous notebooks)
FILE_RFM = '../data/processed/rfm_customer_segments.csv'
FILE_ORDERS = '../data/processed/orders_optimized.csv'
FILE_ORDER_PRODUCTS = '../data/processed/orders_prior_optimized.csv'
FILE_PRODUCTS = '../data/processed/products_optimized.csv'

# Output file
FILE_OUTPUT = '../data/processed/products_per_segment.csv'

# Ensure output directory exists
os.makedirs('../data/processed', exist_ok=True)

print(f"\n✓ Input files configured")
print(f"✓ Output directory: {os.path.dirname(FILE_OUTPUT)}")


✓ Input files configured
✓ Output directory: ../data/processed


In [15]:
# STEP 3: LOAD REQUIRED DATASETS

print("\n" + "="*70)
print("SECTION 3: LOAD DATASETS")
print("="*70)

print("\n[1/4] Loading RFM segments...")
rfm = pd.read_csv(FILE_RFM, usecols=['user_id', 'segment'])
print(f"  ✓ Loaded {len(rfm):,} customers")

print("\n[2/4] Loading orders...")
orders = pd.read_csv(FILE_ORDERS, usecols=['order_id', 'user_id'])
print(f"  ✓ Loaded {len(orders):,} orders")

print("\n[3/4] Loading order-products...")
order_products = pd.read_csv(FILE_ORDER_PRODUCTS, usecols=['order_id', 'product_id'])
print(f"  ✓ Loaded {len(order_products):,} order-product pairs")

print("\n[4/4] Loading products...")
products = pd.read_csv(FILE_PRODUCTS, usecols=['product_id', 'product_name'])
print(f"  ✓ Loaded {len(products):,} products")


SECTION 3: LOAD DATASETS

[1/4] Loading RFM segments...
  ✓ Loaded 206,209 customers

[2/4] Loading orders...
  ✓ Loaded 3,421,083 orders

[3/4] Loading order-products...
  ✓ Loaded 32,434,489 order-product pairs

[4/4] Loading products...
  ✓ Loaded 49,688 products
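
With more than 32 million order-product rows, memory use can become a concern. One possible way to reduce it, sketched here on an in-memory toy CSV, is to pass narrower integer dtypes to `pd.read_csv`. This is an assumption-laden illustration: it presumes every ID fits in `int32`, which should be verified against the real files, and `toy_csv` is invented for the example.

```python
import gc
import io

import pandas as pd

# Invented three-row CSV standing in for an orders file
toy_csv = "order_id,user_id\n1,10\n2,10\n3,11\n"

# Default parsing stores integer columns as int64
default_df = pd.read_csv(io.StringIO(toy_csv))

# Assumption: all IDs fit in int32, which halves the bytes per value
compact_df = pd.read_csv(
    io.StringIO(toy_csv),
    dtype={"order_id": "int32", "user_id": "int32"},
)

default_bytes = default_df.memory_usage(index=False, deep=True).sum()
compact_bytes = compact_df.memory_usage(index=False, deep=True).sum()
print(f"int64 columns: {default_bytes} bytes, int32 columns: {compact_bytes} bytes")

# Drop the reference to a frame that is no longer needed, then ask the
# garbage collector to reclaim unreachable objects
del default_df
gc.collect()
```

The same `del` plus `gc.collect()` pattern could be applied to the source tables once they have been merged, although whether it helps in practice depends on what else still references the data.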


In [16]:

# STEP 4: MERGE DATASETS (LINK CUSTOMERS → PRODUCTS)
# Source: products_segmentation.ipynb - Merge strategy


print("\n" + "="*70)
print("SECTION 4: MERGE DATASETS")
print("="*70)

print("\n[1/3] Linking Customers → Orders (via user_id)...")
df = orders.merge(rfm, on='user_id', how='inner')
print(f"  ✓ Merged: {len(df):,} rows")

print("\n[2/3] Linking Orders → Products (via order_id)...")
df = df.merge(order_products, on='order_id', how='inner')
print(f"  ✓ Merged: {len(df):,} rows")

# Drop columns we no longer need
df.drop(columns=['order_id', 'user_id'], inplace=True)

print("\n[3/3] Linking Products → Product Names (via product_id)...")
df = df.merge(products, on='product_id', how='inner')
print(f"  ✓ Merged: {len(df):,} rows")

# Drop product_id - we only need segment and product_name
df.drop(columns=['product_id'], inplace=True)

print(f"\n✓ Final columns: {list(df.columns)}")
print(f"\nSample data:")
print(df.head())


SECTION 4: MERGE DATASETS

[1/3] Linking Customers → Orders (via user_id)...
  ✓ Merged: 3,421,083 rows

[2/3] Linking Orders → Products (via order_id)...
  ✓ Merged: 32,434,489 rows

[3/3] Linking Products → Product Names (via product_id)...
  ✓ Merged: 32,434,489 rows

✓ Final columns: ['segment', 'product_name']

Sample data:
    segment                             product_name
0  Sleeping                                     Soda
1  Sleeping  Organic Unsweetened Vanilla Almond Milk
2  Sleeping                      Original Beef Jerky
3  Sleeping               Aged White Cheddar Popcorn
4  Sleeping         XL Pick-A-Size Paper Towel Rolls
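
The row counts above stay constant across the merges, which suggests that no rows were dropped or duplicated. A more explicit check is possible with the `validate` and `indicator` parameters of `DataFrame.merge`. The sketch below uses invented `toy_*` tables, and the unmatched user is deliberately planted to show what the check reports.

```python
import pandas as pd

# Invented data: user 3 has an order but no RFM segment
toy_orders = pd.DataFrame({"order_id": [10, 11, 12], "user_id": [1, 1, 3]})
toy_rfm = pd.DataFrame({"user_id": [1, 2], "segment": ["Loyal", "Frugal"]})

checked = toy_orders.merge(
    toy_rfm,
    on="user_id",
    how="left",               # keep every order so unmatched ones stay visible
    validate="many_to_one",   # raises MergeError if a user_id repeats in toy_rfm
    indicator=True,           # adds a `_merge` column describing each row's match
)

# Orders whose user has no segment would be silently dropped by an inner join
unmatched = checked[checked["_merge"] == "left_only"]
print(f"{len(unmatched)} order(s) have no matching segment")
```

An inner join hides such rows, so running a left join with `indicator=True` once is a cheap way to confirm that the inner joins in this step lose nothing.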


In [18]:
# STEP 5: COUNT PRODUCTS PER SEGMENT
# Source: products_segmentation.ipynb

print("\n" + "="*70)
print("SECTION 5: COUNT PRODUCTS PER SEGMENT")
print("="*70)

print("\n[1/2] Counting product purchases by segment...")

# Count how many times each product appears in each segment
counts = df.groupby(['segment', 'product_name']).size().reset_index(name='purchase_count')

print(f"  ✓ Counted {len(counts):,} segment-product combinations")
print(f"  ✓ Columns: {list(counts.columns)}")

print(f"\nSample product counts:")
print(counts.head(10))


SECTION 5: COUNT PRODUCTS PER SEGMENT

[1/2] Counting product purchases by segment...
  ✓ Counted 306,229 segment-product combinations
  ✓ Columns: ['segment', 'product_name', 'purchase_count']

Sample product counts:
  segment                                     product_name  purchase_count
0  Frugal                                #2 Coffee Filters               5
1  Frugal                  #4 Natural Brown Coffee Filters               3
2  Frugal           & Go! Hazelnut Spread + Pretzel Sticks               5
3  Frugal     +Energy Black Cherry Vegetable & Fruit Juice               7
4  Frugal                             .5\" Waterproof Tape               2
5  Frugal         0 Calorie Fuji Apple Pear Water Beverage              22
6  Frugal  0 Calorie Strawberry Dragonfruit Water Beverage               1
7  Frugal                    0% Fat Blueberry Greek Yogurt              22
8  Frugal                         0% Fat Free Organic Milk             131
9  Frugal              0% Fat O

In [19]:
# STEP 6: EXTRACT TOP 10 PRODUCTS PER SEGMENT
# Source: products_segmentation.ipynb


print("\n" + "="*70)
print("SECTION 6: EXTRACT TOP 10 PRODUCTS PER SEGMENT")
print("="*70)

print("\n[1/2] Sorting and extracting top 10...")

# Sort by segment (alphabetical) then by purchase_count (descending)
counts = counts.sort_values(by=['segment', 'purchase_count'], ascending=[True, False])

# Group by segment and keep only the first 10 rows (the top 10, given the sort above)
top_10_df = counts.groupby('segment').head(10)

print(f"  ✓ Extracted top 10 products for {top_10_df['segment'].nunique()} segments")
print(f"  ✓ Total rows: {len(top_10_df)}")

print(f"\n[2/2] Formatting for export...")

# For each segment, join product names with ", "
final_df = top_10_df.groupby('segment')['product_name'].apply(lambda x: ', '.join(x)).reset_index()

# Rename column
final_df.rename(columns={'product_name': 'products'}, inplace=True)

print(f"  ✓ Formatted: {len(final_df)} segments")
print(f"  ✓ Columns: {list(final_df.columns)}")

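# Show results
print("\nTop 10 Products Per Segment:")
print("="*70)
# Iterate over top_10_df instead of splitting the joined string: product names
# containing ", " stay intact and the `products` DataFrame is not overwritten
n_segments = top_10_df['segment'].nunique()
for idx, (segment, names) in enumerate(top_10_df.groupby('segment')['product_name'], 1):
    print(f"\n{segment.upper()} ({idx}/{n_segments}):")
    for i, name in enumerate(names, 1):
        print(f"  {i}. {name}")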


SECTION 6: EXTRACT TOP 10 PRODUCTS PER SEGMENT

[1/2] Sorting and extracting top 10...
  ✓ Extracted top 10 products for 8 segments
  ✓ Total rows: 80

[2/2] Formatting for export...
  ✓ Formatted: 8 segments
  ✓ Columns: ['segment', 'products']

Top 10 Products Per Segment:

FRUGAL (1/8):
  1. Banana
  2. Bag of Organic Bananas
  3. Soda
  4. Spring Water
  5. Sparkling Mineral Water
  6. Sparkling Natural Mineral Water
  7. Organic Half & Half
  8. Hass Avocados
  9. Natural Spring Water
  10. Clementines

HIGH_CHECK (2/8):
  1. Banana
  2. Bag of Organic Bananas
  3. Organic Baby Spinach
  4. Organic Strawberries
  5. Organic Avocado
  6. Strawberries
  7. Organic Hass Avocado
  8. Large Lemon
  9. Limes
  10. Organic Garlic

LOST (3/8):
  1. Banana
  2. Bag of Organic Bananas
  3. Organic Baby Spinach
  4. Organic Avocado
  5. Organic Strawberries
  6. Large Lemon
  7. Strawberries
  8. Organic Hass Avocado
  9. Limes
  10. Cucumber Kirby

LOYAL (4/8):
  1. Banana
  2. Bag of Orga

In [20]:
# STEP 7: SAVE RESULTS
# Source: products_segmentation.ipynb


print("\n" + "="*70)
print("SECTION 7: SAVE RESULTS")
print("="*70)

print(f"\nSaving to: {FILE_OUTPUT}...")

# Save to CSV
final_df.to_csv(FILE_OUTPUT, index=False)

print(f"  ✓ Saved: {FILE_OUTPUT}")

# Verify file was created
if os.path.exists(FILE_OUTPUT):
    file_size = os.path.getsize(FILE_OUTPUT) / 1024  # KB
    print(f"  ✓ File size: {file_size:.2f} KB")




SECTION 7: SAVE RESULTS

Saving to: ../data/processed/products_per_segment.csv...
  ✓ Saved: ../data/processed/products_per_segment.csv
  ✓ File size: 1.42 KB
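
The exported file stores each segment's products as one comma-joined string, which cannot be split back reliably if a product name itself contains a comma. A possible alternative, sketched here on invented data, is a long format with one row per segment-product pair and an explicit rank. The names `toy_counts`, `top_n`, and `long_df` are hypothetical and not part of the notebook.

```python
import pandas as pd

# Invented counts; "Peas, Frozen" deliberately contains a comma
toy_counts = pd.DataFrame({
    "segment": ["Frugal", "Frugal", "Frugal", "Loyal", "Loyal"],
    "product_name": ["Soda", "Banana", "Peas, Frozen", "Banana", "Limes"],
    "purchase_count": [5, 9, 7, 12, 3],
})

top_n = 2  # the notebook uses 10; 2 keeps the toy output short

# Sort so the most purchased product comes first within each segment
ranked = toy_counts.sort_values(
    ["segment", "purchase_count"], ascending=[True, False]
)

# cumcount numbers rows within each group in their current (sorted) order
ranked["rank"] = ranked.groupby("segment").cumcount() + 1

# Keep one row per (segment, product) with its rank and count
long_df = ranked[ranked["rank"] <= top_n].reset_index(drop=True)
print(long_df)
```

Writing `long_df` with `to_csv` would let pandas quote names that contain commas, and the purchase counts would survive the export alongside the names.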
