<p align="center">
  <img src="img/Group80_Background.png" alt="Group80 Background" style="width:80%;" />
</p>

# <a class='anchor' id='0'></a>
<br>

<div style="background: linear-gradient(to right, #00411E, #00622D, #00823C, #45AF28, #82BA72); 
            padding: 10px; color: white; text-align: center;  max-width: 97%;">
    <center><h1 style="margin-top: 10px; margin-bottom: 4px; color: white;
                       font-size: 32px; font-family: 'Roboto', sans-serif;">
        <b>0. Introduction and Review of EDA</b></h1></center>
</div>

## **Introduction**

Amazing International Airlines Inc. (AIAI) faces the challenge of designing personalized services and marketing strategies for a diverse customer base. In today’s highly competitive airline industry, leveraging data-driven approaches to understand customer segments is crucial for improving satisfaction, increasing retention, and maximizing revenue potential.

In this project, we act as consultants for AIAI. We analyze loyalty membership data and flight activity collected over a three-year period (2019-2021) to develop a data-driven customer segmentation strategy through clustering analysis.

## **Metadata**

### **Customer Database** 

| Variable | Description |
| --- | --- |
| **Loyalty#** | Unique customer identifier for loyalty program members |
| **Country** | Customer's country of residence |
| **Province or State** | Customer's province or state |
| **City** | Customer's city of residence |
| **Postal code** | Customer's postal/ZIP code |
| **Gender** | Customer's gender |
| **Education** | Customer's highest education level (e.g., Bachelor, College) |
| **Location Code** | Urban/Suburban/Rural classification of customer residence |
| **Income** | Customer's annual income |
| **Marital Status** | Customer's marital status (Married, Single, Divorced) |
| **LoyaltyStatus** | Current tier status in loyalty program (Star > Nova > Aurora) |
| **EnrollmentDateOpening** | Date when customer joined the loyalty program |
| **CancellationDate** | Date when customer left the program |
| **Customer Lifetime Value** | Total calculated monetary value of customer relationship |
| **EnrollmentType** | Method of joining loyalty program |

### **Flight Activity Database**

| Variable | Description |
| --- | --- |
| **Loyalty#** | Unique customer identifier linking to CustomerDB |
| **Year** | Year of flight activity record |
| **Month** | Month of flight activity record (1-12) |
| **YearMonthDate** | First day of the month for the activity period |
| **NumFlights** | Total number of flights taken by customer in the month |
| **NumFlightsWithCompanions** | Number of flights where customer traveled with companions |
| **DistanceKM** | Total distance traveled in kilometers for the month |
| **PointsAccumulated** | Loyalty points earned by customer during the month |
| **PointsRedeemed** | Loyalty points spent/redeemed during the month |
| **DollarCostPointsRedeemed** | Dollar value of points redeemed during the month |

## **Review of EDA (to be completed)**

### TODO: also state here that we revisit the outliers in depth and have a strategy for them

Key findings from the Exploratory Data Analysis:

- **Missing Values:** Income, CancellationDate, and Customer Lifetime Value have missing values; Income and CLTV are perfectly correlated
- **Duplicates:** Found duplicate Loyalty# entries in CustomerDB (will be removed)
- **Incoherences:** 
  - Distance/km with zero flights: Months exist where DistanceKM > 0 despite NumFlights == 0
  - Physically impossible leg lengths: Some customers show DistanceKM beyond realistic bounds (~14,000 km per flight)
  - Fractional counts in flight metrics for 2019 data
  - CLTV-positive customers with zero recorded flights
- **Outliers:** Zero-income customers with CLTV > 0 (mostly "College" education)
- **Feature Engineering:** Created recency features (is_active_6m, is_active_12m), rejoined status, and aggregated flight metrics


To add:

New features that we add (to be written)

---

# <a class='anchor' id='1'></a>
<br>

<div style="background: linear-gradient(to right, #00411E, #00622D, #00823C, #45AF28, #6BCF5D); 
            padding: 10px; color: white; text-align: center;  max-width: 97%;">
    <center><h1 style="margin-top: 10px; margin-bottom: 4px; color: white;
                       font-size: 32px; font-family: 'Roboto', sans-serif;">
        <b>1. Import Libraries and Load Data</b></h1></center>
</div>

In [None]:
# For data manipulation
import pandas as pd
import numpy as np
import os

# For visualization
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.colors import LinearSegmentedColormap
import matplotlib.ticker as mtick
import matplotlib as mpl
from mpl_toolkits.axes_grid1 import make_axes_locatable
from matplotlib import cm, colorbar
from matplotlib.colors import Normalize
import matplotlib.colors as mpl_colors
from matplotlib.scale import FuncScale
from matplotlib.ticker import MaxNLocator
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.colors as mcolors
from sklearn.manifold import TSNE
import umap

# For preprocessing
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.impute import KNNImputer
from scipy.stats import mstats


# For clustering
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN, MeanShift, estimate_bandwidth
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture
from sklearn.neighbors import NearestNeighbors
from scipy.cluster.hierarchy import dendrogram, linkage, cophenet
from minisom import MiniSom
from matplotlib.patches import RegularPolygon
from scipy.spatial.distance import pdist


# For model evaluation
from sklearn.metrics import pairwise_distances
from sklearn.metrics import silhouette_samples
from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score

# Disable warnings
import warnings
warnings.filterwarnings('ignore', category=UserWarning, module='seaborn')
warnings.simplefilter("ignore", FutureWarning)
warnings.simplefilter("ignore", DeprecationWarning)

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', lambda x: '%.2f' % x)

# For better resolution plots
%config InlineBackend.figure_format = 'retina'

# Setting seaborn style
sns.set_style('white')

In [None]:
# Custom color palette for Group 80
CUSTOM_HEX = [
    "#00411E", "#00622D", "#00823C", "#45AF28", "#6BCF5D", "#D5E6D0","#212121", "#313131", "#595959", "#909090"
]

# Apply globally (Seaborn + Matplotlib)
sns.set_theme(style="whitegrid")
sns.set_palette(CUSTOM_HEX)
mpl.rcParams["axes.prop_cycle"] = mpl.cycler(color=CUSTOM_HEX)

# Continuous colormap (for numeric hues, heatmaps, etc.)
GROUP80_palette_continuous = LinearSegmentedColormap.from_list(
    "green_white_gray_black",
    ["#00411E", "#00823C", "#82BA72", "#D5E6D0", "#FFFFFF", "#909090", "#595959", "#313131", "#212121"]
)

# Colors for specific uses
colors = ["#00411E", "#00622D", "#00823C", "#45AF28", "#6BCF5D"]

In [None]:
# Load the datasets
df_Customer = pd.read_csv('data/input_data/DM_AIAI_CustomerDB.csv')
df_Flights = pd.read_csv('data/input_data/DM_AIAI_FlightsDB.csv')

print(df_Customer.head())
print(df_Flights.head())

---

# <a class='anchor' id='2'></a>
<br>

<div style="background: linear-gradient(to right, #00411E, #00622D, #00823C, #45AF28, #82BA72); 
            padding: 10px; color: white; text-align: center;   max-width: 97%;">
    <center><h1 style="margin-top: 10px; margin-bottom: 4px; color: white;
                       font-size: 32px; font-family: 'Roboto', sans-serif;">
        <b>2. Data Preparation</b></h1></center>
</div>

## **2.1 Data Type Conversion**

In [None]:
# Convert date columns in CustomerDB
df_Customer['EnrollmentDateOpening'] = pd.to_datetime(
    df_Customer['EnrollmentDateOpening'],
    format='%m/%d/%Y',
    errors='coerce'
)
df_Customer['CancellationDate'] = pd.to_datetime(
    df_Customer['CancellationDate'],
    format='%m/%d/%Y',
    errors='coerce'
)

# Convert suitable object columns into pandas Categorical dtype
categorical_cols = [
    'Loyalty#', 'Country', 'Province or State', 'City', 'Postal code',
    'Gender', 'Education', 'Location Code', 'Marital Status',
    'LoyaltyStatus', 'EnrollmentType'
]
df_Customer[categorical_cols] = df_Customer[categorical_cols].apply(lambda s: s.astype('category'))

# Convert date column in FlightsDB
df_Flights["YearMonthDate"] = (
    pd.to_datetime(df_Flights["YearMonthDate"], format="%m/%d/%Y", errors="coerce")
      .dt.to_period("M")
      .astype(str)
)
df_Flights['Loyalty#'] = df_Flights['Loyalty#'].astype('category')

<div style="background-color: #e8f3e5ff; padding: 15px; margin-right: 20px; border-left: 5px solid; border-image: linear-gradient(to bottom, #00411E, #00622D, #00823C, #45AF28, #82BA72) 1; border-radius: 5px;">
    <h3 style="margin-top: 0; margin-bottom: 10px; color: #00411E; font-weight: bold;">Summary of Actions Taken</h3>
    <ul style="margin: 10px 0; padding-left: 20px; color: #000;">
        <li><strong>Date conversions:</strong> Parsed EnrollmentDateOpening and CancellationDate as datetime; values that cannot be parsed become NaT</li>
        <li><strong>Categorical conversions:</strong> Converted 11 columns (Loyalty#, demographic variables, LoyaltyStatus, EnrollmentType) to categorical dtype for memory efficiency</li>
        <li><strong>Period format:</strong> Parsed YearMonthDate and stored it as a monthly period string (YYYY-MM) for time-series analysis</li>
    </ul>
</div>
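
The effect of `errors='coerce'` and the monthly period conversion can be checked on a few toy values. This is a minimal sketch with made-up dates, not the project data; counting `NaT` right after parsing shows how many rows failed to match the expected `%m/%d/%Y` format.

```python
import pandas as pd

# Toy values (hypothetical); the last one is deliberately malformed.
raw = pd.Series(["3/1/2019", "12/1/2021", "not a date"])

# errors="coerce" turns unparseable strings into NaT instead of raising.
parsed = pd.to_datetime(raw, format="%m/%d/%Y", errors="coerce")
n_failed = parsed.isna().sum()

# Monthly period rendered as a string, e.g. "2019-03".
as_month = parsed.dt.to_period("M").astype(str)

print(f"Unparseable values: {n_failed}")
print(as_month.iloc[:2].tolist())
```

Checking `n_failed` before the string conversion is a cheap way to confirm that the format string matches the raw file.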

## **2.2 Duplicates**

In [None]:
# Check for duplicate Loyalty# in CustomerDB
duplicate_loyalty_count = df_Customer['Loyalty#'].duplicated().sum()
print(f'Duplicate Loyalty# in CustomerDB: {duplicate_loyalty_count}')

# Identify every Loyalty# that appears more than once (keep=False flags all copies)
duplicate_loyalty_numbers = df_Customer[df_Customer.duplicated(subset='Loyalty#', keep=False)]['Loyalty#'].unique()
print(f'Number of Loyalty# with duplicates: {len(duplicate_loyalty_numbers)}')

# Remove all rows belonging to those IDs from both datasets
df_Customer = df_Customer[~df_Customer['Loyalty#'].isin(duplicate_loyalty_numbers)]
df_Flights = df_Flights[~df_Flights['Loyalty#'].isin(duplicate_loyalty_numbers)]

<div style="background-color: #e8f3e5ff; padding: 15px; margin-right: 20px; border-left: 5px solid; border-image: linear-gradient(to bottom, #00411E, #00622D, #00823C, #45AF28, #82BA72) 1; border-radius: 5px;">
    <h3 style="margin-top: 0; margin-bottom: 10px; color: #00411E; font-weight: bold;">Summary of Actions Taken</h3>
    <ul style="margin: 10px 0; padding-left: 20px; color: #000;">
        <li><strong>Removed all duplicate Loyalty# entries</strong> from both CustomerDB and FlightsDB (because we cannot determine which duplicate is correct)</li>
    </ul>
</div>
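
To illustrate why `keep=False` matters for this rule, the following sketch uses two small hypothetical tables. With `keep=False`, every row of a duplicated ID is flagged, so the ID disappears entirely from both tables rather than surviving through its first occurrence.

```python
import pandas as pd

# Hypothetical toy tables; ID 102 appears twice in the customer table.
customers = pd.DataFrame({
    "Loyalty#": [101, 102, 102, 103],
    "City": ["A", "B", "C", "D"],
})
flights = pd.DataFrame({
    "Loyalty#": [101, 102, 103, 103],
    "NumFlights": [2, 5, 1, 3],
})

# keep=False marks every row of a duplicated ID, not just the later copies.
dup_mask = customers.duplicated(subset="Loyalty#", keep=False)
dup_ids = customers.loc[dup_mask, "Loyalty#"].unique()

customers_clean = customers[~customers["Loyalty#"].isin(dup_ids)]
flights_clean = flights[~flights["Loyalty#"].isin(dup_ids)]

print(customers_clean["Loyalty#"].tolist())  # [101, 103]
print(flights_clean["Loyalty#"].tolist())    # [101, 103, 103]
```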

## **2.3 Incoherences**

In [None]:
# IDs in FlightsDB but not in CustomerDB
missing_flights = df_Flights.loc[~df_Flights['Loyalty#'].isin(df_Customer['Loyalty#']), 'Loyalty#']

# IDs in CustomerDB but not in FlightsDB
missing_customer = df_Customer.loc[~df_Customer['Loyalty#'].isin(df_Flights['Loyalty#']), 'Loyalty#']

print(f"FlightsDB IDs missing in CustomerDB: {missing_flights.nunique()}")
print(f"CustomerDB IDs missing in FlightsDB: {missing_customer.nunique()}")

# Remove those rows
df_Flights = df_Flights[~df_Flights['Loyalty#'].isin(missing_flights)]
df_Customer = df_Customer[~df_Customer['Loyalty#'].isin(missing_customer)]


In [None]:
# Fix: Set DistanceKM, PointsAccumulated and NumFlightsWithCompanions to 0 where NumFlights == 0
incoherent_rows = (df_Flights['NumFlights'] == 0) & (
    (df_Flights['DistanceKM'] > 0) | 
    (df_Flights['PointsAccumulated'] > 0) | 
    (df_Flights['NumFlightsWithCompanions'] > 0)
)
print(f'Rows with NumFlights=0 but other metrics>0: {incoherent_rows.sum()}')

# Set those values to 0 where NumFlights == 0 but other metrics > 0
df_Flights.loc[df_Flights['NumFlights'] == 0, ['DistanceKM', 'PointsAccumulated', 'NumFlightsWithCompanions']] = 0

In [None]:
# Fix: Check for illogical flight distances (> 14,000 km per flight)
max_flight_distance = 14000
avg_distance = df_Flights['DistanceKM'] / df_Flights['NumFlights'].replace(0, np.nan)
illogical_mask = avg_distance > max_flight_distance

print(f'Rows with illogical distances (>{max_flight_distance}km per flight): {illogical_mask.sum()}')

# Set to 0 for illogical distances
df_Flights.loc[illogical_mask, ['NumFlights', 'NumFlightsWithCompanions', 'DistanceKM', 'PointsAccumulated']] = 0

In [None]:
# Remove customers with no flights in 2019-2021
# Customers without any flight activity cannot be meaningfully segmented
# They have zero values for all flight-related metrics (distance, companions, points, etc.)
# This filtering ensures we only cluster customers with actual travel behavior to analyze
print(f"Customers before removing non-flyers: {len(df_Customer):,}")

# Sum total flights per customer across all 36 months
total_flights_per_customer = df_Flights.groupby('Loyalty#')['NumFlights'].sum()
customers_with_flights = total_flights_per_customer[total_flights_per_customer > 0].index

df_Customer = df_Customer[df_Customer['Loyalty#'].isin(customers_with_flights)].copy()
df_Flights = df_Flights[df_Flights['Loyalty#'].isin(customers_with_flights)].copy()

print(f"Customers after removing non-flyers: {len(df_Customer):,}")


In [None]:
# Remove customers with inconsistent enrollment data: enrolled before 2021 but with '2021 Promotion' type
print(f"Customers before removing inconsistent enrollments: {len(df_Customer):,}")

inconsistent_enrollment = (
    (df_Customer['EnrollmentDateOpening'] < pd.to_datetime('2021-01-01')) & 
    (df_Customer['EnrollmentType'] == '2021 Promotion')
)
inconsistent_ids = df_Customer[inconsistent_enrollment]['Loyalty#']

df_Customer = df_Customer[~inconsistent_enrollment].copy()
df_Flights = df_Flights[~df_Flights['Loyalty#'].isin(inconsistent_ids)].copy()

print(f"Customers after removing inconsistent enrollments: {len(df_Customer):,}")

In [None]:
# Remove customers with an impossible redemption rate: enrolled after 2019-01-01 but rate > 1
# They cannot have accumulated points before our data window, so a rate > 1 indicates a data error
print(f"Customers before removing impossible redemption rates: {len(df_Customer):,}")

# Temporarily calculate redemption_rate
points_temp = df_Flights.groupby('Loyalty#').agg({
    'PointsAccumulated': 'sum',
    'PointsRedeemed': 'sum'
})
points_temp['redemption_rate_temp'] = np.where(
    points_temp['PointsAccumulated'] > 0,
    points_temp['PointsRedeemed'] / points_temp['PointsAccumulated'],
    0
)

# Temporary merge
df_Customer = df_Customer.merge(points_temp[['redemption_rate_temp']], on='Loyalty#', how='left')
df_Customer['redemption_rate_temp'] = df_Customer['redemption_rate_temp'].fillna(0)

impossible_redemption = (
    (df_Customer['redemption_rate_temp'] > 1) & 
    (df_Customer['EnrollmentDateOpening'] > pd.to_datetime('2019-01-01'))
)
impossible_ids = df_Customer[impossible_redemption]['Loyalty#']

df_Customer = df_Customer[~impossible_redemption].copy()
df_Flights = df_Flights[~df_Flights['Loyalty#'].isin(impossible_ids)].copy()

# Drop temporary column
df_Customer = df_Customer.drop(columns=['redemption_rate_temp'])

print(f"Customers after removing impossible redemption rates: {len(df_Customer):,}")


In [None]:
# Spot-check the monthly flight records of a single customer
df_Flights[df_Flights['Loyalty#'] == 748810]

In [None]:
# Remove customers with CancellationDate before EnrollmentDateOpening (data inconsistency)
print(f"Customers before removing invalid cancellation dates: {len(df_Customer):,}")

invalid_cancellation = (
    df_Customer['CancellationDate'].notna() & 
    df_Customer['EnrollmentDateOpening'].notna() &
    (df_Customer['CancellationDate'] < df_Customer['EnrollmentDateOpening'])
)
invalid_ids = df_Customer[invalid_cancellation]['Loyalty#']

df_Customer = df_Customer[~invalid_cancellation].copy()
df_Flights = df_Flights[~df_Flights['Loyalty#'].isin(invalid_ids)].copy()

print(f"Customers after removing invalid cancellation dates: {len(df_Customer):,}")

In [None]:
# Show customers who redeemed points before their enrollment date
first_redemption = df_Flights[df_Flights['PointsRedeemed'] > 0].groupby('Loyalty#')['YearMonthDate'].min().rename('first_redemption_date')

redemption_check = df_Customer[['Loyalty#', 'EnrollmentDateOpening', 'EnrollmentType']].merge(first_redemption, on='Loyalty#', how='inner')

invalid_redemptions = redemption_check[
    (redemption_check['first_redemption_date'] < redemption_check['EnrollmentDateOpening'])]

print(f"Customers with redemption before enrollment: {len(invalid_redemptions)}")
display(invalid_redemptions)


<div style="background-color: #fff9e6ff; padding: 15px; margin-right: 20px; border-left: 5px solid; border-image: linear-gradient(to bottom, #8B8000, #B8A000, #D4C000, #E8D800, #F0E68C) 1; border-radius: 5px;">
    <h3 style="margin-top: 0; margin-bottom: 10px; color: #8B8000; font-weight: bold;">Key Insight for Strategy: Loyalty Program Design Flaw</h3>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        <strong>Critical Finding:</strong> 1,967 customers were able to accumulate and redeem points <strong>before their official enrollment date</strong>. This represents a significant program design issue.
    </p>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #8B8000; font-weight: bold;">Why This Matters:</h4>
    <ul style="margin: 10px 0; padding-left: 20px; color: #000; margin-right: 40px;">
        <li><strong>Reduces enrollment incentive:</strong> If customers can earn and use points without formally joining the loyalty program, there is no motivation to enroll. The enrollment barrier becomes meaningless.</li>
        <li><strong>Lost engagement opportunity:</strong> Non-enrolled customers cannot be targeted with personalized communications, tier-based promotions, or retention campaigns.</li>
        <li><strong>Data quality impact:</strong> Creates inconsistent customer records where flight activity precedes membership, complicating customer journey analysis.</li>
    </ul>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #8B8000; font-weight: bold;">Our Approach:</h4>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        <strong>Data retained:</strong> We keep these 1,967 customers in the dataset. Removing them would significantly reduce our sample size and these customers still exhibit valid flight behavior. We assume the system allowed this and treat their data as valid for clustering purposes.
    </p>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #8B8000; font-weight: bold;">Recommended Action:</h4>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        <strong>Fix the system:</strong> Points accumulation and redemption should be strictly gated behind enrollment. Customers flying without enrollment should see "You could have earned X points - enroll now!" messaging to drive program signup. This converts passive flyers into engaged loyalty members.
    </p>
</div>
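
One caveat when reproducing this check: `YearMonthDate` is stored at monthly granularity, while `EnrollmentDateOpening` is a daily date. If a monthly value is interpreted as the first day of the month, a redemption in the enrollment month can look as if it happened before enrollment. The sketch below, on hypothetical rows, contrasts a day-level comparison with a month-level one. It is an illustration of the granularity issue under these assumptions, not a statement of how the count above was obtained.

```python
import pandas as pd

# Hypothetical rows: customer 1 enrolls mid-May and redeems in May.
check = pd.DataFrame({
    "Loyalty#": [1, 2, 3],
    "EnrollmentDateOpening": pd.to_datetime(["2019-05-15", "2020-02-10", "2019-01-01"]),
    "first_redemption_date": ["2019-05", "2019-11", "2019-03"],  # "YYYY-MM" strings
})

redeem_ts = pd.to_datetime(check["first_redemption_date"], format="%Y-%m")

# Day-level: "2019-05" becomes 2019-05-01, which is earlier than 2019-05-15.
day_level = redeem_ts < check["EnrollmentDateOpening"]

# Month-level: both sides are reduced to monthly periods before comparing.
month_level = redeem_ts.dt.to_period("M") < check["EnrollmentDateOpening"].dt.to_period("M")

print(day_level.tolist())    # [True, True, False]
print(month_level.tolist())  # [False, True, False]
```

Only customer 2 redeemed in a month strictly before the enrollment month; customer 1 is flagged only by the day-level comparison.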


In [None]:
# Remove customers who redeemed points before accumulating any points
# Exception: customers enrolled before 2019 (they could have accumulated points before our data window)
print(f"Customers before removing invalid redemption sequence: {len(df_Customer):,}")

# Find first redemption date and first accumulation date per customer
first_redemption = df_Flights[df_Flights['PointsRedeemed'] > 0].groupby('Loyalty#')['YearMonthDate'].min().rename('first_redemption_date')
first_accumulation = df_Flights[df_Flights['PointsAccumulated'] > 0].groupby('Loyalty#')['YearMonthDate'].min().rename('first_accumulation_date')

# Merge with enrollment date
sequence_check = df_Customer[['Loyalty#', 'EnrollmentDateOpening']].merge(first_redemption, on='Loyalty#', how='inner')
sequence_check = sequence_check.merge(first_accumulation, on='Loyalty#', how='left')

# Find customers who redeemed before accumulating and enrolled on or after 2019-01-01
# (enrolled before 2019 = could have accumulated points before our data window)
invalid_sequence_ids = sequence_check[
    (sequence_check['first_redemption_date'] < sequence_check['first_accumulation_date']) &
    (sequence_check['EnrollmentDateOpening'] >= pd.to_datetime('2019-01-01'))
]['Loyalty#']

print(f"Customers with invalid redemption sequence (enrolled after 2019): {len(invalid_sequence_ids)}")

df_Customer = df_Customer[~df_Customer['Loyalty#'].isin(invalid_sequence_ids)].copy()
df_Flights = df_Flights[~df_Flights['Loyalty#'].isin(invalid_sequence_ids)].copy()

print(f"Customers after removing invalid redemption sequence: {len(df_Customer):,}")


<div style="background-color: #e8f3e5ff; padding: 15px; margin-right: 20px; border-left: 5px solid; border-image: linear-gradient(to bottom, #00411E, #00622D, #00823C, #45AF28, #82BA72) 1; border-radius: 5px;">
    <h3 style="margin-top: 0; margin-bottom: 10px; color: #00411E; font-weight: bold;">Summary of Actions Taken</h3>
    <ul style="margin: 10px 0; padding-left: 20px; color: #000;">
        <li><strong>Removed mismatched IDs:</strong> Deleted records with Loyalty# present in one database but missing in the other (20 customers)</li>
        <li><strong>Fixed zero-flight distances:</strong> Set DistanceKM, PointsAccumulated, and NumFlightsWithCompanions to 0 where NumFlights == 0</li>
        <li><strong>Zeroed impossible distances:</strong> Set NumFlights, NumFlightsWithCompanions, DistanceKM, and PointsAccumulated to 0 for records showing >14,000 km per flight (physically impossible)</li>
        <li><strong>Removed non-flyers:</strong> Deleted 1,494 customers with no recorded flights in 2019-2021</li>
        <li><strong>Removed inconsistent enrollments:</strong> Deleted 168 customers enrolled before 2021 but with '2021 Promotion' enrollment type (logically inconsistent)</li>
        <li><strong>Removed impossible redemption rates:</strong> Deleted 462 customers enrolled after 1 January 2019 with redemption_rate > 1 (they cannot have accumulated points before the data window, so this indicates a data error)</li>
        <li><strong>Removed invalid cancellation dates:</strong> Deleted 155 customers with CancellationDate before EnrollmentDateOpening (impossible timeline)</li>
        <li><strong>Removed invalid redemption sequences:</strong> Deleted 29 customers who redeemed points before accumulating any (enrolled on or after 1 January 2019, so no prior accumulation possible)</li>
    </ul>
</div>
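
The two row-level corrections (zero-flight months and implausible per-flight distances) can be traced on a toy table. This is a minimal sketch with hypothetical values and a reduced set of columns; the 14,000 km threshold is the one used above.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly records.
flights = pd.DataFrame({
    "NumFlights": [0, 2, 1],
    "DistanceKM": [500.0, 6000.0, 20000.0],
    "PointsAccumulated": [50.0, 600.0, 2000.0],
})

# Rule 1: a month without flights cannot carry distance or points.
flights.loc[flights["NumFlights"] == 0, ["DistanceKM", "PointsAccumulated"]] = 0

# Rule 2: average distance per flight; zero flights become NaN to avoid division by zero.
per_flight = flights["DistanceKM"] / flights["NumFlights"].replace(0, np.nan)

# NaN > 14_000 evaluates to False, so zero-flight rows are never flagged here.
too_long = per_flight > 14_000
flights.loc[too_long, ["NumFlights", "DistanceKM", "PointsAccumulated"]] = 0

print(flights)
```

After both rules, only the second row keeps its values: the first row had no flights, and the third implies a single 20,000 km flight.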


<div style="background-color: #fce8e8ff; padding: 15px; margin-right: 20px; border-left: 5px solid; border-image: linear-gradient(to bottom, #8B0000, #A52A2A, #CD5C5C, #F08080) 1; border-radius: 5px;">
    <h3 style="margin-top: 0; margin-bottom: 10px; color: #8B0000; font-weight: bold;">Critical: Data Quality Cleaning</h3>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        <strong>2,328 customers (14.0%) were removed</strong> from the dataset through sequential cleaning steps to ensure data quality and logical consistency.
    </p>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #A52A2A; font-weight: bold;">Why This Matters:</h4>
    <ul style="margin: 10px 0; padding-left: 20px; color: #000; margin-right: 40px;">
        <li><strong>Behavioral clustering requires behavior:</strong> Customers with no flights have zero values for all behavioral features (avg_distance, distance_variability, companion_ratio, points, etc.), making them impossible to segment meaningfully.</li>
        <li><strong>Data integrity:</strong> Customers with impossible timelines (cancellation before enrollment, redemption before accumulation) indicate data errors that would introduce noise.</li>
        <li><strong>Business relevance:</strong> Non-flyers and data anomalies represent different strategic challenges outside the scope of behavioral clustering.</li>
    </ul>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #A52A2A; font-weight: bold;">Dataset Impact:</h4>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        <strong>Original:</strong> 16,594 | <strong>After ID mismatch:</strong> 16,574 (-20) | <strong>After non-flyers:</strong> 15,080 (-1,494) | <strong>After inconsistent enrollment:</strong> 14,912 (-168) | <strong>After impossible redemption rate:</strong> 14,450 (-462) | <strong>After invalid cancellation:</strong> 14,295 (-155) | <strong>After invalid redemption sequence:</strong> 14,266 (-29) | <strong>Final:</strong> 14,266 (86.0% retained)
    </p>
</div>
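
The figures in the box above can be verified with a short running tally. The step names and counts are copied from the Dataset Impact line; the code only re-does the arithmetic.

```python
# Removal counts per cleaning step, taken from the Dataset Impact summary.
steps = [
    ("ID mismatch", 20),
    ("Non-flyers", 1_494),
    ("Inconsistent enrollment", 168),
    ("Impossible redemption rate", 462),
    ("Invalid cancellation", 155),
    ("Invalid redemption sequence", 29),
]

original = 16_594
remaining = original
for name, removed in steps:
    remaining -= removed
    print(f"{name:<30} -{removed:>5,} -> {remaining:,}")

total_removed = original - remaining
print(f"Removed {total_removed:,} ({total_removed / original:.1%}), "
      f"retained {remaining:,} ({remaining / original:.1%})")
```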


## **2.4 Missingness**

In [None]:
# Check for missing values in FlightsDB
df_Flights.isna().sum()

In [None]:
# Check for missing values in CustomerDB
missing_customer = pd.DataFrame({
    'n_missing': df_Customer.isnull().sum(),
    '%_missing': round(df_Customer.isnull().mean() * 100, 2)
})
missing_customer = missing_customer[missing_customer['n_missing'] > 0]
print('Missing values in CustomerDB:')
print(missing_customer)

<div style="background-color: #e8f3e5ff; padding: 15px; margin-right: 20px; border-left: 5px solid; border-image: linear-gradient(to bottom, #00411E, #00622D, #00823C, #45AF28, #82BA72) 1; border-radius: 5px;">
    <h3 style="margin-top: 0; margin-bottom: 10px; color: #00411E; font-weight: bold;">Summary of Actions Taken</h3>
    <ul style="margin: 10px 0; padding-left: 20px; color: #000;">
        <li><strong>No imputation for CancellationDate</strong> - missing values are logical</li>
    </ul>
</div>

<div style="background-color: #e8f3e5ff; padding: 15px; margin-right: 20px; border-left: 5px solid; border-image: linear-gradient(to bottom, #00411E, #00622D, #00823C, #45AF28, #82BA72) 1; border-radius: 5px;">
    <h3 style="margin-top: 0; margin-bottom: 10px; color: #00411E; font-weight: bold;">Key Actionable</h3>
    <p style="margin: 0; color: #000; margin-right: 40px; margin-bottom: 10px;">
        Because the missing values in the column <strong>CancellationDate</strong> are <strong>logical</strong> (the customer has not cancelled), we will <strong>not impute</strong> them. Instead, we will later create a <strong>new feature</strong>, <strong>"is_loyal"</strong>, that distinguishes loyal from non-loyal customers.
    </p>
</div>
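
A flag of this kind could be derived directly from the missingness pattern. The sketch below is an assumption about how `is_loyal` might be defined (1 while no cancellation date is recorded, 0 otherwise) and uses hypothetical rows; the actual definition belongs to the feature engineering section.

```python
import pandas as pd

# Hypothetical customers; only the first one has cancelled.
customers = pd.DataFrame({
    "Loyalty#": [1, 2, 3],
    "CancellationDate": pd.to_datetime(["2020-06-01", None, None]),
})

# Assumption: a customer counts as loyal while CancellationDate is missing.
customers["is_loyal"] = customers["CancellationDate"].isna().astype(int)

print(customers)
```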

## **2.5 Outlier Handling Strategy**

<div style="background-color: #e1e1e1ff; padding: 15px; margin-right: 20px; border-left: 5px solid; border-image: linear-gradient(to bottom, #212121, #313131, #595959, #909090) 1; border-radius: 5px;">
    <h3 style="margin-top: 0; margin-bottom: 10px; color: #000000ff; font-weight: bold;">Pending revision: Outlier and Scaling Strategy</h3>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #313131; font-weight: bold;">The Challenge:</h4>
    <p style="margin: 10px 0; color: #000; margin-right: 40px;">
       Outlier detection in customer segmentation requires distinguishing between three fundamentally different cases: data errors, extreme but valid univariate values, and unusual multivariate combinations. Section 2.3 already removed obvious data errors (impossible distances, inconsistent timelines), but statistical outliers present a more nuanced challenge. Extreme values may represent either remaining data quality issues or legitimate high-value customers that clustering algorithms need to handle carefully.
    </p>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #313131; font-weight: bold;">Two-Perspective Strategy:</h4>
    <p style="margin: 10px 0; color: #000; margin-right: 40px;">
        <strong>Demographic Features:</strong> Primarily discrete and ordinal variables (Income bins 0-4, Education levels 0-2, Location codes 0-2) plus frequency-encoded geography (Province, City, FSA). Feature engineering through binning and collapsing already addressed univariate outliers by converting continuous extremes into discrete categories.
    </p>
    <p style="margin: 10px 0; color: #000; margin-right: 40px;">
        <strong>Behavioral Features:</strong> Entirely continuous variables (flights 0-200, distance 0-300k km, points 0-3M) that retain full numeric ranges and require explicit outlier detection and handling.
    </p>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #313131; font-weight: bold;">Three-Stage Approach:</h4>
    <ul style="margin: 10px 0; padding-left: 20px; color: #000; margin-right: 40px;">
        <li style="margin-bottom: 10px;">
            <strong>Stage 1: Data Errors (Section 2.3, Completed)</strong>
            <ul style="margin-top: 6px; padding-left: 20px;">
                <li style="margin-bottom: 4px;"><em>Scope:</em> Both demographic and behavioral features</li>
                <li style="margin-bottom: 4px;"><em>Examples:</em> DistanceKM > 14,000 km per flight, CancellationDate before EnrollmentDateOpening, redemption before any accumulation, other logically inconsistent combinations</li>
                <li style="margin-bottom: 4px;"><em>Method:</em> Domain knowledge and business rules</li>
                <li style="margin-bottom: 4px;"><em>Action:</em> Delete (removed 2,328 customers, 14.0% of the original dataset)</li>
            </ul>
        </li>
        <li style="margin-bottom: 10px;">
            <strong>Stage 2: Feature Classification and Differential Scaling (Section 2.5, Section 5)</strong>
            <ul style="margin-top: 6px; padding-left: 20px;">
                <li style="margin-bottom: 4px;"><em>Scope:</em> All features after data cleaning, analyzed individually</li>
                <li style="margin-bottom: 4px;"><em>Method:</em> Distribution analysis (skewness, IQR outlier detection, boxplots) to classify each feature into categories</li>
                <li style="margin-bottom: 4px;"><em>Categories:</em>
                    <ul style="padding-left: 20px; margin-top: 4px;">
                        <li><strong>Category 1 (Normal/Moderate):</strong> |Skewness| < 1.0, outliers < 5% → StandardScaler</li>
                        <li><strong>Category 2 (Heavily Skewed):</strong> |Skewness| ≥ 1.0, outliers ≥ 5% → RobustScaler</li>
                        <li><strong>Category 3 (Binned/Discrete):</strong> Features that are binned during feature engineering (Section 3) → StandardScaler</li>
                    </ul>
                </li>
                <li style="margin-bottom: 4px;"><em>Action:</em> Assign appropriate scaler per feature, apply differential scaling in Section 5</li>
                <li style="margin-bottom: 4px;"><em>Rationale:</em> Features with clean distributions use StandardScaler (interpretable Z-scores), while heavily skewed features use RobustScaler (outlier-resistant median/IQR)</li>
            </ul>
        </li>
        <li style="margin-bottom: 10px;">
            <strong>Stage 3: Multivariate Outlier Detection (Section 8.0, Behavioral Only)</strong>
            <ul style="margin-top: 6px; padding-left: 20px;">
                <li style="margin-bottom: 4px;"><em>Scope:</em> Behavioral features only (continuous multidimensional space)</li>
                <li style="margin-bottom: 4px;"><em>Examples:</em> High flights + low distance (short-haul commuter), low flights + extreme distance (single long trip), high points accumulated + zero redemption (hoarder)</li>
                <li style="margin-bottom: 4px;"><em>Method:</em> DBSCAN density-based clustering identifies low-density regions in behavioral feature space</li>
                <li style="margin-bottom: 4px;"><em>Action:</em> Separate into df_behavioral_outliers for dedicated profiling (typically 2-3%, ~300-450 customers)</li>
                <li style="margin-bottom: 4px;"><em>Rationale:</em> Unusual behavioral combinations pull K-Means centroids, create spurious hierarchical merges, and distort SOM topology. Separation ensures core algorithms work on typical patterns while preserving outliers for niche analysis</li>
            </ul>
        </li>
    </ul>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #313131; font-weight: bold;">Why No DBSCAN for Demographics:</h4>
    <p style="margin: 10px 0; color: #000; margin-right: 40px;">
        Demographic features are predominantly discrete (Income bins, Education levels, Location codes) with approximately 270 theoretical combinations across 14,266 customers (on average about 53 customers per combination). Unusual combinations like "High Income + Rural + Low Education" represent potentially valuable niche segments (entrepreneurs, remote workers, farmers) rather than statistical noise. Feature engineering through binning already converted continuous extremes into discrete categories, addressing univariate outliers. DBSCAN on a discrete feature space would fragment meaningful segments rather than isolate distortive outliers. Therefore, demographic clustering (Sections 7.1-7.5) proceeds directly on all customers without multivariate outlier separation.
    </p>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #313131; font-weight: bold;">Why DBSCAN for Behavioral:</h4>
    <p style="margin: 10px 0; color: #000; margin-right: 40px;">
        Behavioral features are entirely continuous (flights 0-200, distance 0-300k km, points 0-3M), creating a true multidimensional continuous space where multivariate outliers represent unusual behavioral patterns. Unlike demographics, where unusual combinations might be valid market segments, behavioral outliers such as "2 flights with 100k km total" or "200 flights with 5k km total" are statistically rare patterns that severely distort distance-based clustering. DBSCAN separation ensures core behavioral clustering captures typical travel patterns while preserving unusual behaviors for dedicated strategic analysis.
    </p>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #313131; font-weight: bold;">Outcome:</h4>
    <ul style="margin: 10px 0; padding-left: 20px; color: #000; margin-right: 40px;">
        <li style="margin-bottom: 6px;"><strong>Demographic clustering (Sections 7.1-7.5):</strong> Operates on all active customers with differential scaling based on feature characteristics, no multivariate outlier removal</li>
        <li style="margin-bottom: 6px;"><strong>Behavioral clustering (Sections 8.1-8.x):</strong> Operates on df_behavioral_core (~97% of customers) after DBSCAN separation, with multivariate outliers profiled separately in Section 9</li>
        <li style="margin-bottom: 6px;"><strong>Transparency:</strong> Full documentation of sample sizes at each stage, scaler assignments per feature, and outlier characteristics</li>
        <li style="margin-bottom: 6px;"><strong>No deletion of legitimate customers:</strong> Only data errors removed, all statistical outliers retained either in core clustering or separate outlier segments</li>
    </ul>
</div>
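
As a rough illustration of the behavioral outlier separation described above, the sketch below flags low-density points with DBSCAN on standardized toy data. The two feature columns, the `eps` and `min_samples` values, and the use of `StandardScaler` are assumptions chosen for this example, not the notebook's tuned settings; only the frame names `df_behavioral_core` and `df_behavioral_outliers` come from the strategy text.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Toy behavioral space (hypothetical features): 200 typical customers
# around 20 flights / 30k km, plus 3 unusual combinations.
typical = rng.normal(loc=[20.0, 30.0], scale=[3.0, 4.0], size=(200, 2))
unusual = np.array([
    [2.0, 100.0],    # few flights, extreme distance
    [200.0, 5.0],    # many flights, very short distance
    [150.0, 150.0],  # extreme on both axes
])
df_behavioral = pd.DataFrame(
    np.vstack([typical, unusual]),
    columns=["total_flights", "total_distance_k_km"],
)

# DBSCAN is distance-based, so the features are put on a common scale first.
X = StandardScaler().fit_transform(df_behavioral)

# Points that are neither core nor border points receive the label -1 (noise).
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
is_outlier = labels == -1

df_behavioral_core = df_behavioral[~is_outlier]
df_behavioral_outliers = df_behavioral[is_outlier]

print(f"Core: {len(df_behavioral_core)}, outliers: {len(df_behavioral_outliers)}")
```

In practice `eps` would need tuning (for example with a k-distance plot) so that the outlier share lands in the intended range rather than being fixed in advance.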

### Helper functions

In [None]:
def plot_numeric_distribution(
    df,
    column,
    color=None,
    show_kde=True,
    y_scale="linear",
    bins=None,
    integer_ticks=False,
    show_boxplot=True,
    show_pct_labels=True,
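):
    """
    Plot a histogram (optionally with a KDE curve) and, if requested, a boxplot
    for one numeric column.

    Parameters
    ----------
    df : pandas.DataFrame
        Source data.
    column : str
        Numeric column to plot (missing values are dropped).
    color : str, optional
        Bar and box color; defaults to CUSTOM_HEX[1].
    y_scale : {"linear", "log", "sqrt"}
        Scale of the histogram's count axis.
    bins : int or sequence, optional
        Passed to seaborn's histplot when given.
    integer_ticks : bool
        Force integer tick locations on the x-axis.
    show_pct_labels : bool
        Annotate each bar with its share of all non-missing values.
    """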
    series = df[column].dropna()
    total_count = len(series)
    sns.set_style("white")
    if color is None:
        color = CUSTOM_HEX[1]

    ncols = 2 if show_boxplot else 1
    fig, axes = plt.subplots(1, ncols, figsize=(16, 4) if show_boxplot else (8, 4))
    if show_boxplot:
        hist_ax, box_ax = axes
    else:
        hist_ax = axes

    hist_kwargs = dict(
        x=series,
        ax=hist_ax,
        kde=show_kde,
        color=color,
        edgecolor=color,
        linewidth=1,
        stat="count",
    )
    if bins is not None:
        hist_kwargs["bins"] = bins

    sns.histplot(**hist_kwargs)

    # Add percentage labels on top of bars
    if show_pct_labels:
        for patch in hist_ax.patches:
            height = patch.get_height()
            if height > 0:
                pct = height / total_count * 100
                hist_ax.annotate(
                    f'{pct:.1f}%',
                    xy=(patch.get_x() + patch.get_width() / 2, height),
                    ha='center',
                    va='bottom',
                    fontsize=12,
                    fontweight='bold'
                )

    if integer_ticks:
        hist_ax.xaxis.set_major_locator(MaxNLocator(integer=True))
        if show_boxplot:
            box_ax.xaxis.set_major_locator(MaxNLocator(integer=True))

    if y_scale == "log":
        hist_ax.set_yscale("log")
        y_label = "Count (log scale)"
    elif y_scale == "sqrt":
        hist_ax.set_yscale(FuncScale(hist_ax, functions=(np.sqrt, lambda y: y**2)))
        y_label = "Count (sqrt scale)"
    else:
        y_label = "Count"

    hist_ax.set_title(f"{column} Distribution")
    hist_ax.set_xlabel(column)
    hist_ax.set_ylabel(y_label)
    hist_ax.grid(False)

    if show_boxplot:
        sns.boxplot(
            x=series,
            ax=box_ax,
            color=color,
            saturation=1,
            linewidth=1,
            flierprops=dict(marker="o", markersize=4, markerfacecolor="white", markeredgecolor="black"),
            boxprops=dict(alpha=0.9),
            whiskerprops=dict(color="0.3", linewidth=1),
            capprops=dict(color="0.3", linewidth=1),
            medianprops=dict(color="0.2", linewidth=1.5),
        )
        box_ax.set_title(f"{column} Boxplot")
        box_ax.set_xlabel(column)
        box_ax.set_yticks([])
        box_ax.grid(False)
        box_ax.spines["left"].set_visible(False)
        sns.despine(ax=box_ax, left=True)

    sns.despine(ax=hist_ax)
    plt.tight_layout()
    plt.show()




def plot_categorical_distribution(df, column, top_n=None):
    """
    Plot category counts as a horizontal bar chart.

    Parameters
    ----------
    df : pandas.DataFrame
        Source data.
    column : str
        Categorical column to summarise.
    top_n : int, optional
        Keep only the N most frequent categories.
    """
    series = df[column].dropna().astype(str)
    counts = series.value_counts()
    total = counts.sum()

    # Sort by frequency (descending)
    counts = counts.sort_values(ascending=False)
    
    # Keep only top N if specified
    if top_n is not None:
        counts = counts.iloc[:top_n]

    sns.set_style("white")
    color = CUSTOM_HEX[1]

    fig, ax = plt.subplots(figsize=(8, 4))
    sns.barplot(x=counts.values, y=counts.index, color=color, edgecolor="0.3", ax=ax)
    ax.set_title(f"{column} Distribution")
    ax.set_xlabel("Count")
    ax.set_ylabel(column)
    ax.grid(False)
    
    # Always annotate with percentages
    if len(counts) > 0:
        offset = counts.values.max() * 0.02
        ax.set_xlim(0, counts.values.max() * 1.1)
        for val, idx in zip(counts.values, range(len(counts))):
            percentage = (val / total) * 100
            ax.text(val + offset, idx, f"{percentage:.1f}%", va="center", ha="left", fontsize=9)

    sns.despine(ax=ax, left=True)
    plt.tight_layout()
    plt.show()

# Systematic identification of statistical outliers using IQR method

IQR_THRESHOLD = 1.5

def detect_univariate_outliers(series, feature_name, threshold=IQR_THRESHOLD):
    """
    Detect outliers using Interquartile Range (IQR) method.
    """
    Q1 = series.quantile(0.25)
    Q3 = series.quantile(0.75)
    IQR = Q3 - Q1
    
    lower_bound = Q1 - threshold * IQR
    upper_bound = Q3 + threshold * IQR
    
    outliers_mask = (series < lower_bound) | (series > upper_bound)
    n_outliers = outliers_mask.sum()
    pct_outliers = (n_outliers / len(series)) * 100
    
    lower_outliers = (series < lower_bound).sum()
    upper_outliers = (series > upper_bound).sum()
    
    return {
        'Feature': feature_name,
        'N_Analyzed': len(series),
        'Q1': Q1,
        'Q3': Q3,
        'IQR': IQR,
        'Lower_Bound': lower_bound,
        'Upper_Bound': upper_bound,
        'Total_Outliers': n_outliers,
        'Outlier_Pct': pct_outliers,
        'Lower_Outliers': lower_outliers,
        'Upper_Outliers': upper_outliers,
        'Min_Value': series.min(),
        'Max_Value': series.max()
    }

def analyze_outliers(df, features, caption=None, exclude_zeros=False):
    """
    Analyze outliers for multiple features and return a summary table.
    """
    outlier_results = []
    
    for feature in features:
        if exclude_zeros:
            series = df[df[feature] > 0][feature]
        else:
            series = df[feature].dropna()
        
        if len(series) > 0:
            result = detect_univariate_outliers(series, feature, threshold=IQR_THRESHOLD)
            if exclude_zeros:
                result['Zero_Count'] = (df[feature] == 0).sum()
                result['Zero_Pct'] = (df[feature] == 0).sum() / len(df) * 100
            outlier_results.append(result)

    columns = ['Feature', 'N_Analyzed', 'Total_Outliers', 'Outlier_Pct', 'Upper_Outliers', 'Upper_Bound', 'Lower_Outliers', 'Lower_Bound', 'Min_Value', 'Max_Value']
    if exclude_zeros:
        columns.insert(2, 'Zero_Pct')
    
    # Passing columns= keeps this from raising a KeyError when no feature had data
    summary = pd.DataFrame(outlier_results, columns=columns)
    
    if caption:
        print(caption)
    display(summary)
    
    return summary


def analyze_skewness(df, features, skew_threshold=1.0):
    """
    Analyze skewness of features to determine if log transformation is needed.
    
    Parameters
    ----------
    df : pd.DataFrame
        DataFrame containing the features
    features : list
        List of column names to analyze
    skew_threshold : float, default=1.0
        Threshold above which a feature is considered highly skewed
        |skewness| > threshold → recommend log transformation
        
    Returns
    -------
    pd.DataFrame : Summary with skewness values and transformation recommendations
    """
    results = []
    
    for feature in features:
        series = df[feature].dropna()
        skewness = series.skew()
        
        # Determine recommendation
        if abs(skewness) > skew_threshold:
            recommendation = "Log Transform"
        else:
            recommendation = "No Transform"
        
        results.append({
            'Feature': feature,
            'N_Analyzed': len(series),
            'Skewness': skewness,
            'Recommendation': recommendation
        })
    
    summary = pd.DataFrame(results)
    
    # Print features that need transformation
    skewed_features = summary[summary['Recommendation'] == "Log Transform"]['Feature'].tolist()
    if skewed_features:
        print(f"Features requiring log transformation (|skewness| > {skew_threshold}):")
        for f in skewed_features:
            skew_val = summary[summary['Feature'] == f]['Skewness'].values[0]
            print(f"  - {f}: skewness = {skew_val:.2f}")
    else:
        print(f"No features exceed skewness threshold of {skew_threshold}")
    
    return summary
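
To make the IQR rule used by `detect_univariate_outliers` concrete, here is a small worked example on made-up numbers. The toy series is an assumption for illustration only; the fences follow the same `Q1 - 1.5 * IQR` and `Q3 + 1.5 * IQR` formula as the helper.

```python
import pandas as pd

# Toy series (hypothetical values): nine ordinary observations and one extreme value
series = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])

q1 = series.quantile(0.25)   # 3.25 (linear interpolation between 3 and 4)
q3 = series.quantile(0.75)   # 7.75 (linear interpolation between 7 and 8)
iqr = q3 - q1                # 4.5

lower_bound = q1 - 1.5 * iqr  # -3.5
upper_bound = q3 + 1.5 * iqr  # 14.5

outliers = series[(series < lower_bound) | (series > upper_bound)]
print(outliers.tolist())  # [50]
```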




### **Part 0: Business Related Exclusions**

In [None]:
# Remove customers with first flight after July 2021
# Customers with only 1-2 months of flight data become artificial outliers in behavioral features calculated on monthly aggregates (e.g. distance_variability, companion_flight_ratio, seasonal_concentration). Requiring the first flight by July 2021 ensures at least 6 months of data for meaningful patterns.

print(f"Customers before removing late-starters: {len(df_Customer):,}")

# Calculate first flight date per customer
first_flight = (
    df_Flights[df_Flights['NumFlights'] > 0]
    .groupby('Loyalty#')['YearMonthDate']
    .min()
)

# Convert to datetime for proper comparison
first_flight_dt = pd.to_datetime(first_flight + "-01")
cutoff_date = pd.to_datetime("2021-07-01")

customers_with_enough_history = first_flight_dt[first_flight_dt <= cutoff_date].index

df_Customer = df_Customer[df_Customer['Loyalty#'].isin(customers_with_enough_history)].copy()
df_Flights = df_Flights[df_Flights['Loyalty#'].isin(customers_with_enough_history)].copy()

print(f"Customers after removing late-starters: {len(df_Customer):,}")

<div style="background-color: #fce8e8ff; padding: 15px; margin-right: 20px; border-left: 5px solid; border-image: linear-gradient(to bottom, #8B0000, #A52A2A, #CD5C5C, #F08080) 1; border-radius: 5px;">
    <h3 style="margin-top: 0; margin-bottom: 10px; color: #8B0000; font-weight: bold;">Critical: Late-Starting Customers Removed</h3>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        <strong>731 customers were removed</strong> because their first flight occurred after July 2021.
    </p>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #A52A2A; font-weight: bold;">Why This Matters:</h4>
    <ul style="margin: 10px 0; padding-left: 20px; color: #000; margin-right: 40px;">
        <li><strong>Insufficient data for behavioral features:</strong> Customers with fewer than 6 months of flight history produce unreliable coefficients of variation (distance_variability), ratios (companion_flight_ratio), and seasonal patterns, since these metrics require multiple observations across time.</li>
        <li><strong>Seasonal concentration requires multiple seasons:</strong> The seasonal_concentration feature (Gini coefficient across Winter/Spring/Summer/Fall) is meaningless for customers who have data in only 1-2 seasons, because the missing seasons artificially inflate their concentration score.</li>
        <li><strong>Minimum threshold:</strong> Requiring first flight by July 2021 ensures at least 6 months of potential flight activity (Jul-Dec). Because Winter starts in December in our definition, this window covers at least 3 seasons, which is enough for meaningful seasonal pattern detection.</li>
    </ul>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #A52A2A; font-weight: bold;">Dataset Impact:</h4>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        <strong>Before:</strong> 14,266 customers | <strong>After:</strong> 13,535 customers
    </p>
</div>
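
The season-coverage argument above can be checked with a few lines of plain Python. Only "Winter starts in December" is taken from the text; the remaining month boundaries below follow the usual meteorological convention and are an assumption, as is the helper name `seasons_covered`.

```python
# Month -> season lookup. Winter starting in December comes from the text;
# the other boundaries are assumed (meteorological seasons).
SEASON_BY_MONTH = {
    12: "Winter", 1: "Winter", 2: "Winter",
    3: "Spring", 4: "Spring", 5: "Spring",
    6: "Summer", 7: "Summer", 8: "Summer",
    9: "Fall", 10: "Fall", 11: "Fall",
}


def seasons_covered(first_flight_month, last_month=12):
    """Seasons a customer could have flown in, from first-flight month to year end."""
    return {SEASON_BY_MONTH[m] for m in range(first_flight_month, last_month + 1)}


for month in (7, 10):
    covered = seasons_covered(month)
    months_available = 12 - month + 1
    print(f"First flight in month {month}: {months_available} months, seasons {sorted(covered)}")
```

Under these assumptions a July 2021 start spans Summer, Fall, and Winter, while an October start spans only Fall and Winter.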


### **Part 1: df_Customer Features**

In [None]:
customer_features = ['Income', 'Customer Lifetime Value']

# Plots
for feature in customer_features:
    plot_numeric_distribution(
        df_Customer, 
        feature,
        show_pct_labels=False
    )


# Outlier Summary
customer_summary = analyze_outliers(
    df_Customer, 
    features=customer_features,
    caption='df_Customer Features (Income, Customer Lifetime Value) Outlier Summary (IQR Method)'
)

<div style="background-color: #e8f3e5ff; padding: 15px; margin-right: 20px; border-left: 5px solid; border-image: linear-gradient(to bottom, #00411E, #00622D, #00823C, #45AF28, #82BA72) 1; border-radius: 5px;">
    <h3 style="margin-top: 0; margin-bottom: 10px; color: #00411E; font-weight: bold;">Decision</h3>
    <p style="margin: 0 40px 10px 0; color: #000;">
        <strong>Note:</strong> Skewness analysis is skipped for these features. Income will be binned in Section 3 (rendering skewness irrelevant), and Customer Lifetime Value is not used in clustering. For behavioral features, we apply formal skewness checks to determine log transformation needs.
    </p>
    <ul style="margin: 0; color: #000; margin-right: 40px; padding-left: 20px;">
        <li style="margin-bottom: 8px;">
            <strong>Income:</strong> Right-skewed distribution visible in the histogram, with a high concentration at low values. Equal-width binning would create imbalanced categories. Instead, we apply <strong>custom bins based on domain knowledge</strong> in Section 3 to create meaningful income segments. The binned feature (Income_Bin_Num) will use <strong>StandardScaler</strong>.
        </li>
        <li style="margin-bottom: 8px;">
            <strong>Customer Lifetime Value:</strong> Heavily right-skewed, with a significant outlier percentage visible in the boxplot. However, this feature is <strong>not used in clustering</strong>, as value-based segmentation is derived from FM features (Frequency, Monetary) instead. No scaling or transformation is required.
        </li>
    </ul>
</div>
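
A minimal sketch of what custom income binning could look like with `pd.cut`. The bin edges and the toy incomes are placeholders invented for this example, not the domain-knowledge bins applied in Section 3; only the target column name `Income_Bin_Num` comes from the text above.

```python
import numpy as np
import pandas as pd

# Toy incomes (hypothetical), right-skewed like the real distribution
df = pd.DataFrame({"Income": [0, 25_000, 45_000, 80_000, 250_000]})

# Placeholder edges; intervals are right-closed, e.g. (30000, 60000]
income_edges = [-np.inf, 30_000, 60_000, 100_000, np.inf]
income_codes = [0, 1, 2, 3]

df["Income_Bin_Num"] = pd.cut(
    df["Income"], bins=income_edges, labels=income_codes
).astype(int)

print(df)
```

Open-ended outer edges (`-np.inf`, `np.inf`) keep extreme incomes inside the lowest or highest bin instead of producing missing values.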


### **Part 2: df_Flights Features**

In [None]:
# As mentioned in the EDA Part: "Fractional counts in flight metrics: The KPIs (NumFlights, NumFlightsWithCompanions, DistanceKM, PointsAccumulated, PointsRedeemed, DollarCostPointsRedeemed) contain decimal values in 2019. These likely stem from multi-year aggregation"

# Summary table: Float analysis per column and year (excluding zeros)
summary = []
for col in df_Flights.select_dtypes("float").columns:
    row = {'Column': col}
    for year in sorted(df_Flights['Year'].unique()):
        non_zero = df_Flights[(df_Flights['Year'] == year) & (df_Flights[col] != 0)]
        floats = (non_zero[col] % 1 != 0).sum()
        total = len(non_zero)
        row[f'{year} (n)'] = floats
        row[f'{year} (%)'] = round(floats / total * 100, 2) if total > 0 else 0
    summary.append(row)

display(pd.DataFrame(summary).set_index('Column'))

# Remove 2019 data due to quality issues (90-99% fractional values)
df_Flights = df_Flights[df_Flights['Year'] != 2019].copy()

<div style="background-color: #fce8e8ff; padding: 15px; margin-right: 20px; border-left: 5px solid; border-image: linear-gradient(to bottom, #8B0000, #A52A2A, #CD5C5C, #F08080) 1; border-radius: 5px;">
    <h3 style="margin-top: 0; margin-bottom: 10px; color: #8B0000; font-weight: bold;">Critical: 2019 Data Excluded Due to Quality Issues</h3>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        <strong>All 2019 flight records are excluded</strong> from behavioral feature engineering due to pervasive fractional values indicating data quality problems.
    </p>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #A52A2A; font-weight: bold;">Why This Matters:</h4>
    <ul style="margin: 10px 0; padding-left: 20px; color: #000; margin-right: 40px;">
        <li><strong>Fractional values in count metrics:</strong> 90% to 99% of non-zero values in 2019 are floats for NumFlights, NumFlightsWithCompanions, PointsAccumulated, PointsRedeemed, and DollarCostPointsRedeemed. These metrics should be integers (e.g., 3 flights, not 2.67 flights).</li>
        <li><strong>Likely cause:</strong> Multi-year aggregation or data migration artifacts from legacy systems created fractional counts that do not represent actual customer behavior.</li>
        <li><strong>2020 and 2021 are clean:</strong> Float ratios drop to 0% in 2020 and 2021, confirming data quality is resolved for these years.</li>
    </ul>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #A52A2A; font-weight: bold;">Decision:</h4>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        <strong>Analysis uses only 2020 and 2021 data.</strong> While this shortens the observation window for seasonal features, it ensures reliable behavioral metrics. Because we focus on customers who were active in 2021 and may also have flown in 2020, two full years provide sufficient data for meaningful segmentation.
    </p>
</div>

In [None]:
# Part 2: df_Flights Features (non zero values only, as most records are zero)
flight_features = [
    'NumFlights',
    'NumFlightsWithCompanions', 
    'DistanceKM',
    'PointsAccumulated',
    'PointsRedeemed'
]

for feature in flight_features:
    plot_numeric_distribution(
        df_Flights[df_Flights[feature] > 0], 
        feature,
        show_pct_labels=False
    )

# Outlier Summary
flight_summary = analyze_outliers(
    df_Flights, 
    features=flight_features,
    caption='df_Flights Features (non zero values only, as most records are zero) Outlier Summary (IQR Method)',
    exclude_zeros=True
)

<div style="background-color: #e8f3e5ff; padding: 15px; margin-right: 20px; border-left: 5px solid; border-image: linear-gradient(to bottom, #00411E, #00622D, #00823C, #45AF28, #82BA72) 1; border-radius: 5px;">
    <h3 style="margin-top: 0; margin-bottom: 10px; color: #00411E; font-weight: bold;">Decision</h3>
    <p style="margin: 0; color: #000; margin-right: 40px; margin-bottom: 10px;">
        <strong>Note:</strong> Skewness analysis is skipped for these raw features. They exist at monthly granularity and will be aggregated per customer in Section 3, fundamentally changing their distributions. Skewness and scaling decisions will be made after feature engineering.
    </p>
    <p style="margin: 0; color: #000; margin-right: 40px; margin-bottom: 10px;">
        <strong>Outlier Summary:</strong> At monthly level, outlier percentages are minimal:
    </p>
    <ul style="margin: 0 40px 10px 20px; color: #000; padding-left: 20px;">
        <li><strong>NumFlights:</strong> 0.27% outliers (504 of 188,245 non-zero records)</li>
        <li><strong>NumFlightsWithCompanions:</strong> 1.49% outliers (1,466 of 98,208)</li>
        <li><strong>DistanceKM:</strong> 0% outliers</li>
        <li><strong>PointsAccumulated:</strong> 0% outliers</li>
        <li><strong>PointsRedeemed:</strong> 0.48% outliers (106 of 22,234)</li>
    </ul>
    <p style="margin: 0; color: #000; margin-right: 40px; margin-bottom: 10px;">
        These raw features will <strong>not be used directly</strong> for clustering. In Section 3 (Feature Engineering), they will be <strong>aggregated per customer</strong> through transformations such as sum (total flights, total distance), mean (average monthly activity), or derived ratios (companion flight ratio, points redemption rate). This aggregation fundamentally changes the distributions as customer level totals will show different patterns than monthly records.
    </p>
    <p style="margin: 0; color: #000; margin-right: 40px; margin-bottom: 10px;">
        Therefore, the <strong>final scaling decision</strong> for behavioral features will be made in Section 3 after feature engineering is complete. The newly created features will be re-evaluated for skewness and outlier percentage to determine whether StandardScaler or RobustScaler is appropriate. Additionally, as outlined in the strategy, <strong>multivariate outliers</strong> in the behavioral feature space will be identified using DBSCAN in Section 8 before core clustering.
    </p>
</div>
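
The per-customer aggregation described above might be sketched as follows. The toy records and the exact ratio definitions (companion flights divided by total flights, total distance divided by total flights) are assumptions for illustration; the notebook's own definitions in Section 3 take precedence.

```python
import numpy as np
import pandas as pd

# Toy monthly records (hypothetical values) for two customers
df_flights = pd.DataFrame({
    "Loyalty#": [1, 1, 1, 2, 2],
    "NumFlights": [2, 0, 4, 1, 1],
    "NumFlightsWithCompanions": [1, 0, 1, 0, 0],
    "DistanceKM": [3000, 0, 9000, 500, 700],
})

# Collapse monthly rows into one row per customer
per_customer = df_flights.groupby("Loyalty#").agg(
    total_flights=("NumFlights", "sum"),
    companion_flights=("NumFlightsWithCompanions", "sum"),
    total_distance=("DistanceKM", "sum"),
    avg_monthly_flights=("NumFlights", "mean"),
)

# Guard against division by zero for customers without any flights
flights = per_customer["total_flights"].replace(0, np.nan)
per_customer["companion_flight_ratio"] = per_customer["companion_flights"] / flights
per_customer["avg_distance_per_flight"] = per_customer["total_distance"] / flights

print(per_customer)
```

Because sums and ratios are computed over all of a customer's months, their distributions can look quite different from the monthly records, which is why skewness and scaling are re-evaluated after this step.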

# <a class='anchor' id='3'> </a>
<br>

<div style="background: linear-gradient(to right, #00411E, #00622D, #00823C, #45AF28, #82BA72); 
            padding: 10px; color: white; text-align: center;  max-width: 97%;">
    <center><h1 style="margin-top: 10px; margin-bottom: 4px; color: white;
                       font-size: 32px; font-family: 'Roboto', sans-serif;">
        <b>3. Feature Engineering</b></h1></center>
</div>

<div style="background-color: #e1e1e1ff; padding: 15px; margin-right: 20px; border-left: 5px solid; border-image: linear-gradient(to bottom, #212121, #313131, #595959, #909090) 1; border-radius: 5px;">
    <h3 style="margin-top: 0; margin-bottom: 10px; color: #000000ff; font-weight: bold;">Feature Engineering Overview</h3>
    <ul style="margin: 10px 0; padding-left: 20px; color: #000; margin-right: 40px;">
        <li style="margin-bottom: 10px;">
            <strong>Loyalty Lifecycle & Focus Groups:</strong>
            <ul style="margin-top: 6px; padding-left: 20px;">
                <li>"Which segment do they belong to?"</li>
                <li>Metrics: is_current_loyalty_member, is_active, Focus_Group (1/2)</li>
                <li>Focus: Customer filtering and stratification for targeted clustering</li>
            </ul>
        </li>
        <li style="margin-bottom: 10px;">
            <strong>Value-Based:</strong>
            <ul style="margin-top: 6px; padding-left: 20px;">
                <li>"How much value do they generate?"</li>
                <li>Metrics: Frequency, Monetary, Recency</li>
                <li>Focus: Economic contribution</li>
            </ul>
        </li>
        <li style="margin-bottom: 10px;">
            <strong>Demographic:</strong>
            <ul style="margin-top: 6px; padding-left: 20px;">
                <li>"Who are they?"</li>
                <li>Metrics: Income, Education, Location, Gender, Marital Status, Province, City, FSA</li>
                <li>Focus: Socioeconomic segments</li>
            </ul>
        </li>
        <li style="margin-bottom: 10px;">
            <strong>Behavioral:</strong>
            <ul style="margin-top: 6px; padding-left: 20px;">
                <li>"How do they fly?" (patterns, style, preferences)</li>
                <li>Metrics: avg_distance_per_flight, distance_variability, companion_flight_ratio, flight_regularity, seasonal_concentration, peak_season_sin/cos, redemption_rate, redemption_frequency</li>
                <li>Focus: Travel style & engagement patterns</li>
            </ul>
        </li>
    </ul>
</div>

## **3.1 Loyalty Lifecycle & Focus Group Features**

### is_active and is_current_loyalty_member


In [None]:
# Create 'is_current_loyalty_member' feature
# Customer is a current member if they have an enrollment date and no CancellationDate
df_Customer['is_current_loyalty_member'] = (
    df_Customer['EnrollmentDateOpening'].notna() & 
    df_Customer['CancellationDate'].isna()
)

# Create 'is_active' feature: customer had flights in 2021
flights_2021_activity = (
    df_Flights[df_Flights['Year'] == 2021]
    .groupby('Loyalty#')['NumFlights']
    .sum()
    .rename('total_flights_2021_check')
)

df_Customer = df_Customer.merge(flights_2021_activity, on='Loyalty#', how='left')
df_Customer['is_active'] = (df_Customer['total_flights_2021_check'].fillna(0) > 0)
df_Customer.drop('total_flights_2021_check', axis=1, inplace=True)

# Summary statistics
crosstab = pd.crosstab(
    df_Customer['is_current_loyalty_member'],
    df_Customer['is_active'],
    margins=True
)

# Rename and reorder
crosstab.index = crosstab.index.map({True: 'Loyalty', False: 'Non-Loyalty', 'All': 'Total'})
crosstab.columns = crosstab.columns.map({True: 'Active', False: 'Inactive', 'All': 'Total'})
crosstab = crosstab.loc[['Loyalty', 'Non-Loyalty', 'Total'], ['Active', 'Inactive', 'Total']]
crosstab.index.name = crosstab.columns.name = None
print(crosstab)

<div style="background-color: #e8f3e5ff; padding: 15px; margin-right: 30px; border-left: 5px solid; border-image: linear-gradient(to bottom, #00411E, #00622D, #00823C, #45AF28, #82BA72) 1; border-radius: 5px; max-width: 95%;">
    <h3 style="margin-top: 0; margin-bottom: 10px; color: #00411E; font-weight: bold;">Activity & Loyalty Labels Summary</h3>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        We created <strong>binary labels</strong> to identify customer loyalty status and 2021 flight activity, enabling targeted segmentation for clustering analysis.
    </p>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Goal of the Features (Why):</h4>
    <p style="margin: 10px 40px 10px 20px; color: #000;">
        These labels define the <strong>scope of our clustering analysis</strong> by filtering customers into relevant focus groups. Active customers (those with 2021 flights) are our primary target for segmentation, while loyalty status determines whether we analyze current members (retention strategies) or ex-members (win-back strategies). This filtering ensures clustering focuses on actionable, engaged customer segments rather than inactive or churned populations.
    </p>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Feature Definitions (What):</h4>
    <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
        <li style="margin-right: 20px;"><strong>is_current_loyalty_member:</strong> True if customer is currently enrolled in the loyalty program (EnrollmentDateOpening is present and CancellationDate is NaN). Separates current program members from ex-members.</li>
        <li style="margin-right: 20px;"><strong>is_active:</strong> True if customer had at least one flight in 2021 (total_flights_2021 > 0). Identifies customers with recent engagement versus inactive/dormant customers.</li>
    </ul>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Implementation Details (How):</h4>
    <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
        <li style="margin-right: 20px;"><strong>Why focus on 2021 activity?</strong> 2021 represents the most recent complete year of data, ensuring we cluster customers based on current behavior rather than historical patterns that may no longer be relevant.</li>
    </ul>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Interpretation:</h4>
    <p style="margin: 10px 40px 10px 20px; color: #000;">
        The crosstab reveals a <strong>highly engaged customer base</strong>: 13,038 of 13,535 customers (96.3%) were active in 2021, with only 497 inactive. Key insights:
    </p>
    <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
        <li style="margin-right: 20px;"><strong>Loyalty Members:</strong> 12,510 customers (92.4%), of which 12,470 (99.7%) were active in 2021. Extremely high engagement among current members.</li>
        <li style="margin-right: 20px;"><strong>Non-Loyalty (Ex-Members):</strong> 1,025 customers (7.6%), with 568 (55.4%) still active despite cancelling membership. These represent win-back opportunities.</li>
        <li style="margin-right: 20px;"><strong>Inactive Customers:</strong> Only 497 total (3.7%), predominantly ex-members (457). Minimal inactivity among current loyalty members (40).</li>
    </ul>
    <p style="margin: 10px 40px 10px 20px; color: #000;">
        <strong>Implication:</strong> The dataset is dominated by active loyalty members, making it ideal for behavioral segmentation.
    </p>
</div>
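
The percentages quoted in the interpretation can be reproduced from the four cell counts with a row-normalized crosstab. The sketch below rebuilds a frame from those counts purely for illustration; the column names `membership` and `activity` are hypothetical stand-ins for the renamed boolean flags.

```python
import numpy as np
import pandas as pd

# Cell counts taken from the interpretation above
cells = {
    ("Loyalty", "Active"): 12_470,
    ("Loyalty", "Inactive"): 40,
    ("Non-Loyalty", "Active"): 568,
    ("Non-Loyalty", "Inactive"): 457,
}

# Expand the counts back into one row per customer
membership = np.concatenate([np.repeat(m, n) for (m, _), n in cells.items()])
activity = np.concatenate([np.repeat(a, n) for (_, a), n in cells.items()])
df = pd.DataFrame({"membership": membership, "activity": activity})

# normalize="index" turns each row into shares that sum to 1
row_pct = pd.crosstab(df["membership"], df["activity"], normalize="index") * 100
overall_active_pct = (df["activity"] == "Active").mean() * 100

print(row_pct.round(1))
print(f"Overall active: {overall_active_pct:.1f}%")
```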


### Is_Focus_Group and Focus_Group

Define focus groups for targeted analysis.

In [None]:
# Create 'Is_Focus_Group' feature: True if customer belongs to a focus group (Active customers)
df_Customer['Is_Focus_Group'] = df_Customer['is_active']

# Create 'Focus_Group' feature: 
# 1 = Loyalty Members & Active (Focus Group 1)
# 2 = Ex-Loyalty Members & Active (Focus Group 2)
# NaN = Not in any focus group (inactive customers)
df_Customer['Focus_Group'] = None
df_Customer.loc[
    df_Customer['is_current_loyalty_member'] & df_Customer['is_active'], 
    'Focus_Group'
] = 1
df_Customer.loc[
    ~df_Customer['is_current_loyalty_member'] & df_Customer['is_active'], 
    'Focus_Group'
] = 2

# Display summary
focus_group_summary = df_Customer['Focus_Group'].value_counts(dropna=False).sort_index()
focus_group_summary

<div style="background-color: #e8f3e5ff; padding: 15px; margin-right: 30px; border-left: 5px solid; border-image: linear-gradient(to bottom, #00411E, #00622D, #00823C, #45AF28, #82BA72) 1; border-radius: 5px; max-width: 95%;">
    <h3 style="margin-top: 0; margin-bottom: 10px; color: #00411E; font-weight: bold;">Focus Group Assignment Summary</h3>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        We created <strong>focus group identifiers</strong> to separate active customers into two strategically important segments based on loyalty membership status.
    </p>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Goal of the Features (Why):</h4>
    <p style="margin: 10px 40px 10px 20px; color: #000;">
        Focus groups enable <strong>parallel clustering strategies</strong> for different customer populations with distinct business objectives. Focus Group 1 (current loyalty & active members) requires retention and upsell strategies, while Focus Group 2 (ex-loyalty but active members) presents win-back opportunities. By clustering these groups separately, we can identify personas with actionable, group-specific marketing interventions rather than generic one-size-fits-all segments.
    </p>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Feature Definitions (What):</h4>
    <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
        <li style="margin-right: 20px;"><strong>Is_Focus_Group:</strong> Boolean flag (True/False) indicating whether a customer belongs to any focus group. Equivalent to <strong>is_active</strong>. All active customers (with 2021 flights) are included in focus groups; inactive customers are excluded from clustering analysis.</li>
        <li style="margin-right: 20px;"><strong>Focus_Group:</strong> Categorical label with three possible values:
            <ul style="padding-left: 20px; margin-top: 5px;">
                <li><strong>1</strong> = Loyalty Members & Active (Focus Group 1): Current program members with recent flight activity</li>
                <li><strong>2</strong> = Ex-Loyalty Members & Active (Focus Group 2): Former program members who cancelled but still fly with the airline</li>
                <li><strong>NaN</strong> = Inactive customers excluded from clustering (no 2021 flights)</li>
            </ul>
        </li>
    </ul>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Implementation Details (How):</h4>
    <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
        <li style="margin-right: 20px;"><strong>Why separate FG1 and FG2?</strong> These groups exhibit fundamentally different relationships with the brand. FG1 customers actively participate in the loyalty program (earning/redeeming points), while FG2 customers fly without program benefits. Their demographics, behaviors, and value profiles may differ substantially, justifying separate clustering analyses.</li>
        <li style="margin-right: 20px;"><strong>Why exclude inactive customers (NaN)?</strong> Clustering algorithms work best on homogeneous populations. Including dormant customers (no 2021 activity) would create noise and dilute meaningful patterns among active, engaged customers who represent the airline's current revenue base.</li>
        <li style="margin-right: 20px;"><strong>Strategic implication:</strong> This structure allows us to build 2-4 personas per focus group (e.g., "Champions", "Frequent Flyers", "Premium Occasional", "At Risk"), then combine insights across groups to understand the full active customer landscape.</li>
    </ul>
</div>
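
The group assignment described above can be sketched on toy data as follows. This is only an illustration: the column names `is_active` and `is_current_member` are hypothetical stand-ins for whatever activity and membership flags the notebook actually uses.

```python
import numpy as np
import pandas as pd

# Toy customers; `is_active` and `is_current_member` are hypothetical flags.
toy = pd.DataFrame({
    "Loyalty#": [1, 2, 3, 4],
    "is_active": [True, True, False, False],          # flew in 2021
    "is_current_member": [True, False, True, False],  # has not cancelled
})

# Every active customer belongs to a focus group.
toy["Is_Focus_Group"] = toy["is_active"]

# 1 = active member, 2 = active ex-member, NaN = inactive (excluded).
toy["Focus_Group"] = np.select(
    [
        toy["is_active"] & toy["is_current_member"],
        toy["is_active"] & ~toy["is_current_member"],
    ],
    [1.0, 2.0],
    default=np.nan,
)
```

Filtering on `toy["Focus_Group"] == 1` or `== 2` would then give the two populations that are clustered separately.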

## **3.2 Value-Based Feature Engineering**

FM segments customers by **Frequency** and **Monetary** value.

In [None]:
# Preparing Step 1: Aggregate 2021 flight data per customer
flights_2021 = (
    df_Flights[df_Flights["Year"] == 2021]
    .groupby("Loyalty#", as_index=False)
    .agg(
        total_flights_2021=("NumFlights", "sum"),
        total_distance_2021=("DistanceKM", "sum"),
    )
)

df_Customer = df_Customer.merge(flights_2021, on="Loyalty#", how="left")
df_Customer[["total_flights_2021", "total_distance_2021"]] = (
    df_Customer[["total_flights_2021", "total_distance_2021"]].fillna(0)
)


# Preparing Step 2: First flight date (earliest month with NumFlights > 0)
first_flight = (
    df_Flights[df_Flights["NumFlights"] > 0]
    .groupby("Loyalty#", as_index=False)["YearMonthDate"]
    .min()
    .rename(columns={"YearMonthDate": "first_flight_date"})
)

df_Customer = df_Customer.merge(first_flight, on="Loyalty#", how="left")


# Preparing Step 3: Active months in 2021 based on first flight
year_start = pd.to_datetime("2021-01-01")
year_end = pd.to_datetime("2021-12-31")

df_Customer["first_flight_date"] = pd.to_datetime(df_Customer["first_flight_date"])
# Only calculate for customers with flights; NaN for customers without flights
start_dates = df_Customer["first_flight_date"].apply(lambda x: max(year_start, x) if pd.notna(x) else pd.NaT)
df_Customer["months_active_2021"] = (
    ((year_end.year - start_dates.dt.year) * 12 + (year_end.month - start_dates.dt.month) + 1)
    .where(start_dates.notna())
    .clip(lower=0.1)
)


# Preparing Step 4: Last flight date over all available years
last_flight = (
    df_Flights[df_Flights["NumFlights"] > 0]
    .groupby("Loyalty#", as_index=False)["YearMonthDate"]
    .max()
    .rename(columns={"YearMonthDate": "last_flight_date"})
)

df_Customer = df_Customer.merge(last_flight, on="Loyalty#", how="left")

# F = Frequency (flights per active month in 2021)
df_Customer["Frequency"] = (
    df_Customer["total_flights_2021"] / df_Customer["months_active_2021"]
).fillna(0)

# M = Monetary (distance per active month in 2021)
df_Customer["Monetary"] = (
    df_Customer["total_distance_2021"] / df_Customer["months_active_2021"]
).fillna(0)
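
To make the month arithmetic in Step 3 concrete, the following plain-Python sketch reproduces the same formula for a single customer. The helper name `months_active_2021` is introduced here for illustration only and is not part of the notebook.

```python
from datetime import date


def months_active_2021(first_flight):
    """Months from the first flight (floored at Jan 2021) through Dec 2021."""
    if first_flight is None:
        return None  # customer never flew
    start = max(date(2021, 1, 1), first_flight)
    return (2021 - start.year) * 12 + (12 - start.month) + 1


# 12 flights spread over the full year vs. 12 flights since October 2021.
full_year = months_active_2021(date(2019, 5, 1))     # 12 months
late_joiner = months_active_2021(date(2021, 10, 1))  # 3 months
print(12 / full_year, 12 / late_joiner)  # 1.0 4.0
```

The same 12 flights therefore yield a Frequency of 1 flight per month for the long-standing customer and 4 flights per month for the customer who started flying in October.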


In [None]:
# Visualize FM distributions
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
fig.suptitle('FM Features Distribution', fontsize=16, fontweight='bold', y=1.02)

# Frequency
ax = axes[0]
freq_plot = df_Customer['Frequency']
ax.hist(freq_plot, bins=50, color=colors[1], edgecolor='white', alpha=0.8)
ax.set_title('F - Frequency (Flights per Active Month)', fontsize=12, fontweight='bold')
ax.set_xlabel('Flights/Month', fontsize=10)
ax.set_ylabel('Customers', fontsize=10)
ax.axvline(df_Customer['Frequency'].median(), color=colors[3], linestyle='--', 
           linewidth=2, label=f'Median: {df_Customer["Frequency"].median():.2f}')
ax.legend()
ax.grid(False)


# Monetary
ax = axes[1]
mon_plot = df_Customer['Monetary']
ax.hist(mon_plot, bins=50, color=colors[2], edgecolor='white', alpha=0.8)
ax.set_title('M - Monetary (Distance per Active Month)', fontsize=12, fontweight='bold')
ax.set_xlabel('Distance (km)/Month', fontsize=10)
ax.set_ylabel('Customers', fontsize=10)
ax.axvline(df_Customer['Monetary'].median(), color=colors[3], linestyle='--', 
           linewidth=2, label=f'Median: {df_Customer["Monetary"].median():.4f}')
ax.legend()
ax.grid(False)

plt.tight_layout()
plt.show()

<div style="background-color: #e8f3e5ff; padding: 15px; margin-right: 30px; border-left: 5px solid; border-image: linear-gradient(to bottom, #00411E, #00622D, #00823C, #45AF28, #82BA72) 1; border-radius: 5px; max-width: 95%;">
    <h3 style="margin-top: 0; margin-bottom: 10px; color: #00411E; font-weight: bold;">Value-Based Features (FM) Summary</h3>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        We implemented <strong>FM-based features</strong> (Frequency, Monetary) to quantify customer engagement and value, focusing exclusively on 2021 flight activity.
    </p>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Goal of the Features (Why):</h4>
    <p style="margin: 10px 40px 10px 20px; color: #000;">
        FM features serve as the foundation for <strong>value-based customer segmentation</strong>. These features enable us to identify high-value Champions, Frequent Flyers, Premium Occasional travelers, and At-Risk customers using a simple 2D median-split approach that is interpretable for business stakeholders. By normalizing flight activity and distance by active months, we create fair comparisons between customers with different engagement durations throughout 2021.
    </p>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Why 2021 as Reference Year:</h4>
    <p style="margin: 10px 40px 10px 20px; color: #000;">
        <strong>2021 measures current behavior, not historical value.</strong> A customer who flew frequently in 2019-2020 but not in 2021 is churned and no longer valuable for retention strategies. Conversely, a customer who started flying in 2021 and shows high engagement is a hot lead worth prioritizing. Historical flight data would mislead clustering by treating inactive high-value customers as still relevant.
    </p>
    <p style="margin: 10px 40px 10px 20px; color: #000;">
        Customers without 2021 flights are considered <strong>churned/inactive</strong> (497 customers). Reactivating churned customers requires significantly more effort and resources than retaining active ones. Our clustering therefore focuses on the <strong>active customer base</strong>:
    </p>
    <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; color: #000;">
        <li><strong>Focus Group 1:</strong> Active + Loyalty Members (12,470 customers) → retention and upsell strategies</li>
        <li><strong>Focus Group 2:</strong> Active + Ex-Loyalty Members (568 customers) → win-back to loyalty program</li>
        <li><strong>Excluded:</strong> Inactive/Churned (497 customers) → separate reactivation campaigns outside clustering scope</li>
    </ul>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Feature Definitions (What):</h4>
    <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
        <li style="margin-right: 20px;"><strong>Frequency:</strong> Average flights per active month in 2021. Calculated as total_flights_2021 / months_active_2021. Measures flight engagement intensity normalized by customer's actual participation period.</li>
        <li style="margin-right: 20px;"><strong>Monetary:</strong> Average distance traveled per active month in 2021. Calculated as total_distance_2021 / months_active_2021. Serves as a proxy for customer value (longer distances typically indicate higher-value routes and ticket prices).</li>
    </ul>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Implementation Details (How):</h4>
    <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
        <li style="margin-right: 20px;"><strong>months_active_2021:</strong> Number of months from first flight date until December 2021, bounded by the 2021 calendar year. Clipped to minimum 0.1 to avoid division by zero.</li>
        <li style="margin-right: 20px;"><strong>Why normalize by active months?</strong> A customer flying 12 times over 12 months (1/month) has different engagement than one flying 12 times in 3 months (4/month). Monthly averaging captures engagement intensity rather than absolute volume.</li>
    </ul>
</div>
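
The 2D median split mentioned above can be sketched as follows. The values are toy data, and the mapping from quadrants to persona names is an assumption made for illustration; the notebook may assign the labels differently.

```python
import pandas as pd

# Toy Frequency/Monetary values (illustrative only).
fm = pd.DataFrame({
    "Frequency": [0.5, 3.0, 0.4, 2.5],
    "Monetary": [400.0, 6000.0, 5000.0, 900.0],
})

# Split each axis at its median.
high_f = fm["Frequency"] > fm["Frequency"].median()
high_m = fm["Monetary"] > fm["Monetary"].median()

# Assumed quadrant-to-label mapping (high Frequency, high Monetary).
labels = {
    (True, True): "Champions",
    (True, False): "Frequent Flyers",
    (False, True): "Premium Occasional",
    (False, False): "At Risk",
}
fm["FM_Segment"] = [labels[(bool(f), bool(m))] for f, m in zip(high_f, high_m)]
```

Because each axis is cut at its own median, every quadrant is defined relative to the population being split, so the thresholds would differ if computed separately per focus group.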


## **3.3 Demographic-Based Feature Engineering**

<div style="background-color: #e8f3e5ff; padding: 15px; margin-right: 30px; border-left: 5px solid; border-image: linear-gradient(to bottom, #00411E, #00622D, #00823C, #45AF28, #82BA72) 1; border-radius: 5px; max-width: 95%;">
    <h3 style="margin-top: 0; margin-bottom: 10px; color: #00411E; font-weight: bold;">Demographic Features Overview</h3>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        We engineer <strong>demographic features</strong> to enable customer profiling through clustering while reducing dimensionality and transforming categorical variables into clustering-friendly numeric formats. These features answer the question: <strong>"Who is this customer?"</strong>
    </p>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Goal of the Features:</h4>
    <p style="margin: 10px 40px 10px 20px; color: #000;">
        Demographic features provide the foundation for <strong>customer persona development</strong>. By understanding who customers are (education, income, location, gender, marital status), we can create actionable segments like "Urban High-Earners", "Suburban Families", or "Budget-Conscious Travelers". These personas complement behavioral segmentation by adding human context for targeted marketing messages and channel strategies.
    </p>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Feature Encoding:</h4>
    <table style="margin: 10px 0; border-collapse: collapse; width: 100%;">
        <tr style="background-color: #00411E; color: white;">
            <th style="padding: 8px; border: 1px solid #00411E;">Feature</th>
            <th style="padding: 8px; border: 1px solid #00411E;">Original</th>
            <th style="padding: 8px; border: 1px solid #00411E;">Encoding</th>
        </tr>
        <tr><td style="padding: 8px; border: 1px solid #ccc; color: #000;">Education_Level_Num</td><td style="padding: 8px; border: 1px solid #ccc; color: #000;">Education (5 categories)</td><td style="padding: 8px; border: 1px solid #ccc; color: #000;">Ordinal (Low=0, Mid=1, High=2)</td></tr>
        <tr><td style="padding: 8px; border: 1px solid #ccc; color: #000;">Income_Bin_Num</td><td style="padding: 8px; border: 1px solid #ccc; color: #000;">Income (continuous)</td><td style="padding: 8px; border: 1px solid #ccc; color: #000;">Binned (Low≤20k=0, Mid≤70k=1, High=2)</td></tr>
        <tr><td style="padding: 8px; border: 1px solid #ccc; color: #000;">Location_Code_Num</td><td style="padding: 8px; border: 1px solid #ccc; color: #000;">Location Code</td><td style="padding: 8px; border: 1px solid #ccc; color: #000;">Ordinal (Rural=0, Suburban=1, Urban=2)</td></tr>
        <tr><td style="padding: 8px; border: 1px solid #ccc; color: #000;">Province_Encoded</td><td style="padding: 8px; border: 1px solid #ccc; color: #000;">Province or State</td><td style="padding: 8px; border: 1px solid #ccc; color: #000;">Frequency Encoding</td></tr>
        <tr><td style="padding: 8px; border: 1px solid #ccc; color: #000;">City_Encoded</td><td style="padding: 8px; border: 1px solid #ccc; color: #000;">City</td><td style="padding: 8px; border: 1px solid #ccc; color: #000;">Frequency Encoding</td></tr>
        <tr><td style="padding: 8px; border: 1px solid #ccc; color: #000;">FSA_Encoded</td><td style="padding: 8px; border: 1px solid #ccc; color: #000;">Postal Code (first 3 chars)</td><td style="padding: 8px; border: 1px solid #ccc; color: #000;">Frequency Encoding</td></tr>
        <tr><td style="padding: 8px; border: 1px solid #ccc; color: #000;">Gender_Encoded</td><td style="padding: 8px; border: 1px solid #ccc; color: #000;">Gender</td><td style="padding: 8px; border: 1px solid #ccc; color: #000;">Binary (male=1, female=0)</td></tr>
        <tr><td style="padding: 8px; border: 1px solid #ccc; color: #000;">Marital_*</td><td style="padding: 8px; border: 1px solid #ccc; color: #000;">Marital Status</td><td style="padding: 8px; border: 1px solid #ccc; color: #000;">One-Hot (Divorced, Married, Single)</td></tr>
    </table>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Encoding Rationale:</h4>
    <ul style="margin: 5px 40px 10px 20px; padding-left: 20px; color: #000;">
        <li><strong>Education:</strong> Collapsed from 5 to 3 ordinal levels to reduce noise while preserving hierarchy</li>
        <li><strong>Income:</strong> Binned into 3 levels based on domain knowledge (Low ≤20k, Mid ≤70k, High >70k)</li>
        <li><strong>Location Code:</strong> Ordinal encoding reflecting urbanization level (Rural -> Suburban -> Urban)</li>
        <li><strong>Province/City/FSA:</strong> Frequency encoding captures population density without high-cardinality OHE explosion</li>
        <li><strong>Gender:</strong> Binary encoding for 2 categories</li>
        <li><strong>Marital Status:</strong> One-Hot encoding as no natural order exists between categories</li>
    </ul>
</div>


In [None]:
# Education: Collapse 5 categories into 3 ordinal levels
def map_education_level(x):
    if x in ["High School or Below", "College"]:
        return "Low"
    if x in ["Bachelor"]:
        return "Mid"
    if x in ["Master", "Doctor"]:
        return "High"
    return pd.NA

df_Customer["Education_Level"] = df_Customer["Education"].apply(map_education_level)

edu_mapping = {"Low": 0, "Mid": 1, "High": 2}
df_Customer["Education_Level_Num"] = df_Customer["Education_Level"].map(edu_mapping)

In [None]:
plot_numeric_distribution(df_Customer, 'Education_Level_Num')

<div style="background-color: #e8f3e5ff; padding: 15px; margin-right: 30px; border-left: 5px solid; border-image: linear-gradient(to bottom, #00411E, #00622D, #00823C, #45AF28, #82BA72) 1; border-radius: 5px; max-width: 95%;">
    <h3 style="margin-top: 0; margin-bottom: 10px; color: #00411E; font-weight: bold;">Education_Level_Num</h3>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Goal of the Feature:</h4>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        As we saw in the EDA, Education is dominated by Bachelor (roughly 62%), followed by College (roughly 25%); the remaining three categories share the rest roughly evenly. Bachelor-educated customers also have the highest average income and customer lifetime value. Collapsing five categories into three ordinal levels reduces dimensionality while preserving the meaningful distinction between education tiers that correlate with customer value.
    </p>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        <strong>Clustering benefit:</strong> One hot encoding five education categories would create five binary features that dominate distance calculations and inflate the feature space. Ordinal encoding into three levels (0, 1, 2) produces a single numeric feature where distances reflect the natural ordering of education (Low → Mid → High). This allows clustering algorithms to group customers with similar education levels together while treating the difference between Low and High as larger than Low and Mid.
    </p>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Feature Definition and Implementation:</h4>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        Ordinal numeric encoding (0, 1, 2) representing education tier. Bachelor is separated as its own category (Mid) because EDA showed this group has distinct income and value characteristics. College is grouped with High School (Low) rather than Bachelor because their income and CLTV distributions are more similar:
    </p>
    <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; color: #000;">
        <li><strong>Low (0):</strong> High School or Below + College (30.3%)</li>
        <li><strong>Mid (1):</strong> Bachelor (62.3%, highest value segment)</li>
        <li><strong>High (2):</strong> Master + Doctor (7.3%)</li>
    </ul>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Scaling Decision:</h4>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        <strong>StandardScaler.</strong> Discrete ordinal values (0, 1, 2) have no outliers.
    </p>
</div>
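
The distance argument can be checked with a small sketch. For brevity it compares the three collapsed levels under ordinal and one-hot encoding (rather than the original five categories); the point about equal pairwise distances carries over unchanged.

```python
import math

levels = ["Low", "Mid", "High"]
ordinal = {"Low": [0], "Mid": [1], "High": [2]}
one_hot = {"Low": [1, 0, 0], "Mid": [0, 1, 0], "High": [0, 0, 1]}


def pairwise(encoding):
    """Euclidean distance for each pair of education levels."""
    return {
        (a, b): math.dist(encoding[a], encoding[b])
        for i, a in enumerate(levels)
        for b in levels[i + 1:]
    }


ordinal_dist = pairwise(ordinal)  # Low-High is twice as far as Low-Mid
one_hot_dist = pairwise(one_hot)  # every pair is sqrt(2) apart
```

Under one-hot encoding all levels are equally far apart, so the ordering Low → Mid → High is lost; the ordinal encoding keeps it in a single column.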


In [None]:
# Income: Custom bins based on domain knowledge
def bin_income(income):
    if income <= 20000:
        return 0  # Low Income
    elif income <= 70000:
        return 1  # Middle Income
    else:
        return 2  # High Income

df_Customer["Income_Bin_Num"] = df_Customer["Income"].apply(bin_income)

In [None]:
plot_numeric_distribution(df_Customer, 'Income_Bin_Num')

<div style="background-color: #e8f3e5ff; padding: 15px; margin-right: 30px; border-left: 5px solid; border-image: linear-gradient(to bottom, #00411E, #00622D, #00823C, #45AF28, #82BA72) 1; border-radius: 5px; max-width: 95%;">
    <h3 style="margin-top: 0; margin-bottom: 10px; color: #00411E; font-weight: bold;">Income_Bin_Num</h3>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Goal of the Feature (Why):</h4>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        As identified in Section 2.5, Income shows a right skewed distribution. Custom domain knowledge bins transform the continuous skewed feature into a discrete ordinal variable with more balanced category sizes, preventing extreme income values from dominating distance calculations.
    </p>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        <strong>Why 3 bins?</strong> Three categories (Low/Middle/High) capture the essential income distinctions relevant for customer segmentation: budget-conscious travelers, mainstream customers, and premium spenders. This granularity aligns with Education_Level_Num (also 0-2), ensuring consistent ordinal scaling across demographic features and avoiding unnecessary dimensionality from fine-grained income splits that offer diminishing business value.
    </p>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Feature Definition and Implementation (What and How):</h4>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        Ordinal numeric encoding (0-2) based on domain knowledge income thresholds:
    </p>
    <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; color: #000;">
        <li><strong>0 (Low Income):</strong> ≤ 20,000 (31.2%)</li>
        <li><strong>1 (Middle Income):</strong> 20,001 - 70,000 (49.8%)</li>
        <li><strong>2 (High Income):</strong> > 70,000 (19%)</li>
    </ul>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Scaling Decision:</h4>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        <strong>StandardScaler.</strong> Binning already addressed the original skewness. Discrete ordinal values (0-2) have no outliers.
    </p>
</div>
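
As an alternative sketch, the same thresholds can be expressed with `pd.cut`. One difference worth noting: in `bin_income`, a missing income fails both comparisons and falls through to the `else` branch (code 2), whereas `pd.cut` leaves it as missing. Whether this matters depends on whether Income was already imputed earlier in the notebook, which is assumed but not shown here.

```python
import numpy as np
import pandas as pd

# Toy incomes around the bin edges, plus one missing value.
income = pd.Series([0, 20000, 20001, 70000, 70001, np.nan])

# Right-closed bins: (-inf, 20k], (20k, 70k], (70k, inf)
income_bin = pd.cut(
    income,
    bins=[-np.inf, 20000, 70000, np.inf],
    labels=[0, 1, 2],
)
```

The result is a categorical Series; converting it to a numeric dtype is left out here because it depends on how missing values should be handled downstream.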


In [None]:
# FSA (Forward Sortation Area): Frequency encoding for geographic feature
df_Customer["FSA"] = df_Customer["Postal code"].str[:3]

fsa_counts = df_Customer["FSA"].value_counts()

df_Customer["FSA_Encoded"] = df_Customer["FSA"].map(fsa_counts).astype(int)

In [None]:
plot_numeric_distribution(df_Customer, 'FSA_Encoded')

In [None]:
# Check skewness
analyze_skewness(df_Customer, features=['FSA_Encoded'], skew_threshold=1.0)

<div style="background-color: #e8f3e5ff; padding: 15px; margin-right: 30px; border-left: 5px solid; border-image: linear-gradient(to bottom, #00411E, #00622D, #00823C, #45AF28, #82BA72) 1; border-radius: 5px; max-width: 95%;">
    <h3 style="margin-top: 0; margin-bottom: 10px; color: #00411E; font-weight: bold;">FSA_Encoded</h3>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Goal of the Feature (Why):</h4>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        Forward Sortation Area (first 3 characters of postal code) has too many unique categories for one hot encoding. Frequency encoding captures geographic density:
        customers from densely populated FSAs get higher values, customers from sparse areas get lower values. This creates a continuous proxy for urban density at the postal code level.
    </p>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Feature Definition and Implementation (What and How):</h4>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        Each FSA is replaced with the count of customers sharing that FSA. Range: 0-1200. Higher values indicate more common postal areas (urban centers),
        lower values indicate rare postal areas (rural/remote).
    </p>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Scaling Decision:</h4>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        <strong>StandardScaler (no log transform).</strong> Although the skewness of 1.6 exceeds our threshold of 1.0, the distribution is inherently <strong>multimodal</strong>: high values reflect genuinely dense urban FSAs, while low values represent rural/remote areas. The skewness results from this natural urban/rural split, not from a problematic long tail.
        A log transform would compress the upper range and reduce the meaningful separation this feature is designed to carry.
        <br><br>
        We also do not treat values around 1200 as data errors; they represent legitimate high-density FSAs. However, because our clustering methods are distance-based (and K-Means optimizes squared Euclidean distances),
        scaling is still required to prevent raw magnitude from dominating the feature space.
        <br><br>
        We therefore prefer StandardScaler here: it normalizes the feature using mean and standard deviation, both of which take the legitimate high urban values into account, so this density proxy contributes to the overall distance computation on a scale comparable to the other features.
        RobustScaler estimates location and scale from the median and IQR, which ignore the tails. Like StandardScaler, it is a linear transform and does not compress extreme but valid values, so very high urban FSAs can end up many IQRs from the median and become disproportionately influential in distance-based clustering.
    </p>
</div>
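
The scaler comparison can be illustrated with a small standard-library sketch. The values are invented toy counts, not the real FSA distribution, and the two formulas are written out by hand to mirror what StandardScaler (mean, population standard deviation) and RobustScaler (median, IQR) compute with their default settings.

```python
import statistics

# Toy frequency-encoded values: many sparse FSAs plus two dense ones.
values = [5, 8, 10, 12, 15, 20, 25, 30, 1100, 1200]

mean = statistics.fmean(values)
std = statistics.pstdev(values)  # population standard deviation
median = statistics.median(values)
q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")

# Both transforms are linear; only the centre and scale estimates differ.
standard_scaled = [(v - mean) / std for v in values]
robust_scaled = [(v - median) / (q3 - q1) for v in values]

print(round(max(standard_scaled), 2), round(max(robust_scaled), 2))
```

On this toy data the largest robust-scaled value is far larger than the largest standard-scaled value, because the IQR is computed from the sparse FSAs only. This is the effect the scaling decision above tries to avoid.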


In [None]:
# Province: Frequency encoding (number of customers in that province)
province_counts = df_Customer["Province or State"].value_counts()
df_Customer["Province_Encoded"] = df_Customer["Province or State"].map(province_counts).astype(int)

In [None]:
plot_numeric_distribution(df_Customer, 'Province_Encoded')

In [None]:
# Check skewness
analyze_skewness(df_Customer, features=['Province_Encoded'], skew_threshold=1.0)

<div style="background-color: #e8f3e5ff; padding: 15px; margin-right: 30px; border-left: 5px solid; border-image: linear-gradient(to bottom, #00411E, #00622D, #00823C, #45AF28, #82BA72) 1; border-radius: 5px; max-width: 95%;">
    <h3 style="margin-top: 0; margin-bottom: 10px; color: #00411E; font-weight: bold;">Province_Encoded</h3>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Goal of the Feature (Why):</h4>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        Province has multiple categories but no natural ordering. One hot encoding would create many sparse features. Frequency encoding captures population distribution: customers from populous provinces get higher values, customers from smaller provinces get lower values.
    </p>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Feature Definition and Implementation (What and How):</h4>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        Each province is replaced with the count of customers in that province. Range: 0-5000. Distribution shows a clear gap between small provinces (0-1000, about 22% of customers) and large provinces (2700-5000, about 78% of customers), reflecting the actual population concentration in a few major Canadian provinces.
    </p>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Scaling Decision:</h4>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        <strong>StandardScaler (no log transform).</strong> The skewness of -0.78 is below our threshold of 1.0 in absolute value. The left-skewed multimodal distribution reflects real geographic concentration in Canada, with most customers in a few large provinces. Log transformation would compress this meaningful distinction between small and large provinces.
    </p>
</div>
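
The implementation of `analyze_skewness` is not shown in this section. A minimal sketch of the kind of check it is assumed to perform, comparing the absolute skewness against the threshold so that left-skewed features are judged by magnitude, could look like this. The helper name `exceeds_skew_threshold` and the toy counts are hypothetical.

```python
import pandas as pd


def exceeds_skew_threshold(series, skew_threshold=1.0):
    """Return (skewness, flag); the flag compares the absolute skewness."""
    skewness = series.skew()
    return skewness, abs(skewness) > skew_threshold


# Toy left-skewed counts: most customers sit in a few large provinces.
toy = pd.Series([5000] * 6 + [3000] * 3 + [500])
skewness, needs_transform = exceeds_skew_threshold(toy)
```

A negative `skewness` indicates a tail towards small values, which is the expected shape when most customers come from a few populous provinces.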


In [None]:
# City: Frequency encoding (number of customers in that city)
city_counts = df_Customer["City"].value_counts()
df_Customer["City_Encoded"] = df_Customer["City"].map(city_counts).astype(int)

In [None]:
plot_numeric_distribution(df_Customer, 'City_Encoded')

In [None]:
# Check skewness
analyze_skewness(df_Customer, features=['City_Encoded'], skew_threshold=1.0)

<div style="background-color: #e8f3e5ff; padding: 15px; margin-right: 30px; border-left: 5px solid; border-image: linear-gradient(to bottom, #00411E, #00622D, #00823C, #45AF28, #82BA72) 1; border-radius: 5px; max-width: 95%;">
    <h3 style="margin-top: 0; margin-bottom: 10px; color: #00411E; font-weight: bold;">City_Encoded</h3>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Goal of the Feature (Why):</h4>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        City has many unique values making one hot encoding impractical. Frequency encoding captures city size: customers from large cities get higher values, customers from small towns get lower values. This provides a granular urban/rural signal between Province and FSA level.
    </p>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Feature Definition and Implementation (What and How):</h4>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        Each city is replaced with the count of customers in that city. Range: 0-3000. Distribution shows two clusters: smaller cities (0-500, about 52% of customers) and larger cities (1500-3000, about 48% of customers), with a gap in between reflecting the typical urban hierarchy.
    </p>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Scaling Decision:</h4>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        <strong>StandardScaler (no log transform).</strong> Skewness of 0.34 is well below our threshold of 1.0, indicating a balanced distribution. The bimodal pattern reflects natural city size variation in Canada. No transformation needed.
    </p>
</div>


In [None]:
# Gender: Binary encode (male = 1, female = 0)
df_Customer["Gender_Encoded"] = df_Customer["Gender"].map({"male": 1, "female": 0}).astype(int)

In [None]:
plot_categorical_distribution(df_Customer, 'Gender_Encoded')

<div style="background-color: #e8f3e5ff; padding: 15px; margin-right: 30px; border-left: 5px solid; border-image: linear-gradient(to bottom, #00411E, #00622D, #00823C, #45AF28, #82BA72) 1; border-radius: 5px; max-width: 95%;">
    <h3 style="margin-top: 0; margin-bottom: 10px; color: #00411E; font-weight: bold;">Gender_Encoded</h3>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Goal and Implementation:</h4>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        Gender has only 2 categories (Male/Female) with no ordinal relationship. Binary encoding (Male=1, Female=0) is the most efficient representation, avoiding unnecessary one hot expansion while maintaining interpretability. Distribution is nearly balanced (50.1% Female, 49.9% Male).
    </p>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Scaling Decision:</h4>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        <strong>StandardScaler.</strong> Binary values (0, 1) with balanced distribution.
    </p>
</div>


In [None]:
# Location Code: Map to ordered values (Rural = 0, Suburban = 1, Urban = 2)
location_mapping = {"Rural": 0, "Suburban": 1, "Urban": 2}
df_Customer["Location_Code_Num"] = df_Customer["Location Code"].map(location_mapping).astype(int)

In [None]:
plot_categorical_distribution(df_Customer, 'Location_Code_Num')

<div style="background-color: #e8f3e5ff; padding: 15px; margin-right: 30px; border-left: 5px solid; border-image: linear-gradient(to bottom, #00411E, #00622D, #00823C, #45AF28, #82BA72) 1; border-radius: 5px; max-width: 95%;">
    <h3 style="margin-top: 0; margin-bottom: 10px; color: #00411E; font-weight: bold;">Location_Code_Num</h3>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Goal of the Feature (Why):</h4>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        Location Code (Rural/Suburban/Urban) has a natural ordered progression from low to high population density. Ordinal encoding preserves this meaningful ranking, which helps clustering algorithms detect urbanization patterns that correlate with travel behavior.
    </p>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Feature Definition and Implementation (What and How):</h4>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        Ordinal numeric encoding (0, 1, 2) representing population density:
    </p>
    <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; color: #000;">
        <li><strong>Rural (0):</strong> 33.3%</li>
        <li><strong>Suburban (1):</strong> 33.9%</li>
        <li><strong>Urban (2):</strong> 32.7%</li>
    </ul>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Scaling Decision:</h4>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        <strong>StandardScaler.</strong> Discrete ordinal values (0, 1, 2) with balanced distribution.
    </p>
</div>


In [None]:
# Marital Status: One-hot encode
marital_cols = [col for col in df_Customer.columns if col.startswith('Marital_')]
if not marital_cols:
    marital_dummies = pd.get_dummies(df_Customer["Marital Status"], prefix="Marital", drop_first=False)
    df_Customer = pd.concat([df_Customer, marital_dummies], axis=1)


In [None]:
df_Customer['Marital Status'].value_counts(normalize=True).mul(100).round(1)

<div style="background-color: #e8f3e5ff; padding: 15px; margin-right: 30px; border-left: 5px solid; border-image: linear-gradient(to bottom, #00411E, #00622D, #00823C, #45AF28, #82BA72) 1; border-radius: 5px; max-width: 95%;">
    <h3 style="margin-top: 0; margin-bottom: 10px; color: #00411E; font-weight: bold;">Marital Status (One-Hot Encoded)</h3>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Goal of the Feature (Why):</h4>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        Marital Status (Divorced/Married/Single) has no natural ordering. Being divorced is not "between" single and married. One-hot encoding treats all categories as equally distinct, preventing false ordinal assumptions. The feature is important for family-oriented segmentation (e.g., married customers may travel with companions more frequently).
    </p>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Feature Definition and Implementation (What and How):</h4>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        One-hot encoding creates three binary columns (Marital_Divorced, Marital_Married, Marital_Single). In Section 4 (Feature Selection), we drop Marital_Single as it is fully determined by the other two columns (if Divorced=0 and Married=0, then Single=1). This avoids multicollinearity while retaining full information.
    </p>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Scaling Decision:</h4>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        <strong>No scaling.</strong> One-hot encoded features are already binary (0/1) and should not be scaled. Scaling would distort the categorical interpretation and provide no benefit since the values are already bounded.
    </p>
</div>
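The redundancy argument above can be checked on a toy example. The sketch below uses plain Python and hypothetical sample values (not the real `df_Customer` data) to show that the dropped `Marital_Single` indicator can be rebuilt from the other two.

```python
# Hypothetical sample values; the real data lives in df_Customer["Marital Status"].
statuses = ["Married", "Single", "Divorced", "Married", "Single"]
categories = sorted(set(statuses))  # ['Divorced', 'Married', 'Single']

# One-hot encode: one 0/1 indicator per category, exactly one of them set per row.
one_hot = [
    {f"Marital_{cat}": int(status == cat) for cat in categories}
    for status in statuses
]

# Rebuild Marital_Single from the two indicators that are kept.
reconstructed_single = [
    1 - row["Marital_Divorced"] - row["Marital_Married"] for row in one_hot
]
original_single = [row["Marital_Single"] for row in one_hot]

print(reconstructed_single == original_single)  # the dropped column carries no extra information
```

Because each row sums to 1 across the three indicators, any one of them is a linear combination of the other two, which is the multicollinearity the text refers to.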


## **3.4 Behavioral-Based Feature Engineering**

<div style="background-color: #e8f3e5ff; padding: 15px; margin-right: 30px; border-left: 5px solid; border-image: linear-gradient(to bottom, #00411E, #00622D, #00823C, #45AF28, #82BA72) 1; border-radius: 5px; max-width: 95%;">
    <h3 style="margin-top: 0; margin-bottom: 10px; color: #00411E; font-weight: bold;">Behavioral Features Overview</h3>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        We engineer <strong>behavioral features</strong> to capture how customers interact with the airline - their travel patterns, temporal preferences, and loyalty program engagement. These features answer the question: <strong>"How does this customer fly?"</strong>
    </p>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Goal of the Features:</h4>
    <p style="margin: 10px 40px 10px 20px; color: #000;">
        Behavioral features enable <strong>actionable customer segmentation</strong> based on actual travel behavior rather than demographics alone. By understanding how customers fly (distance, regularity, companions), when they fly (seasonality, peak season), and how they engage with the loyalty program (redemption patterns), we can create segments like "Business Commuters", "Seasonal Vacationers", or "Points Hoarders" - each requiring different marketing strategies, service offerings, and retention approaches.
    </p>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Feature Categories:</h4>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;"><strong>Travel Pattern (4 features):</strong></p>
    <ul style="margin: 5px 40px 10px 20px; padding-left: 20px; color: #000;">
        <li><strong>avg_distance_per_flight:</strong> Short-haul vs long-haul traveler (km)</li>
        <li><strong>distance_variability:</strong> Consistent routes vs diverse destinations (CV)</li>
        <li><strong>companion_flight_ratio:</strong> Solo vs group traveler (0-1)</li>
        <li><strong>flight_regularity:</strong> Routine vs sporadic traveler (0-1)</li>
    </ul>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;"><strong>Temporal Pattern (3 features):</strong></p>
    <ul style="margin: 5px 40px 10px 20px; padding-left: 20px; color: #000;">
        <li><strong>seasonal_concentration:</strong> Year-round vs seasonal traveler (Gini, 0 to 0.75 with four seasons)</li>
        <li><strong>peak_season_sin/cos:</strong> Which season is peak (cyclical encoding -1 to +1)</li>
    </ul>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;"><strong>Loyalty Engagement (2 features):</strong></p>
    <ul style="margin: 5px 40px 10px 20px; padding-left: 20px; color: #000;">
        <li><strong>redemption_rate:</strong> How much of earned points redeemed (0-1)</li>
        <li><strong>redemption_frequency:</strong> How often points are redeemed (0-1)</li>
    </ul>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Scaling Strategy:</h4>
    <table style="margin: 10px 0; border-collapse: collapse; width: 90%;">
        <tr style="background-color: #00411E; color: white;">
            <th style="padding: 8px; border: 1px solid #00411E;">Feature</th>
            <th style="padding: 8px; border: 1px solid #00411E;">Transformation</th>
            <th style="padding: 8px; border: 1px solid #00411E;">Scaling</th>
        </tr>
        <tr><td style="padding: 8px; border: 1px solid #ccc; color: #000;">avg_distance_per_flight</td><td style="padding: 8px; border: 1px solid #ccc; color: #000;">Winsorized (0.25%/0.5%)</td><td style="padding: 8px; border: 1px solid #ccc; color: #000;">StandardScaler</td></tr>
        <tr><td style="padding: 8px; border: 1px solid #ccc; color: #000;">distance_variability</td><td style="padding: 8px; border: 1px solid #ccc; color: #000;">None</td><td style="padding: 8px; border: 1px solid #ccc; color: #000;">StandardScaler</td></tr>
        <tr><td style="padding: 8px; border: 1px solid #ccc; color: #000;">companion_flight_ratio</td><td style="padding: 8px; border: 1px solid #ccc; color: #000;">None</td><td style="padding: 8px; border: 1px solid #ccc; color: #000;">StandardScaler</td></tr>
        <tr><td style="padding: 8px; border: 1px solid #ccc; color: #000;">flight_regularity</td><td style="padding: 8px; border: 1px solid #ccc; color: #000;">None</td><td style="padding: 8px; border: 1px solid #ccc; color: #000;">StandardScaler</td></tr>
        <tr><td style="padding: 8px; border: 1px solid #ccc; color: #000;">seasonal_concentration</td><td style="padding: 8px; border: 1px solid #ccc; color: #000;">None</td><td style="padding: 8px; border: 1px solid #ccc; color: #000;">StandardScaler</td></tr>
        <tr><td style="padding: 8px; border: 1px solid #ccc; color: #000;">peak_season_sin/cos</td><td style="padding: 8px; border: 1px solid #ccc; color: #000;">Cyclical encoding</td><td style="padding: 8px; border: 1px solid #ccc; color: #000;">None</td></tr>
        <tr><td style="padding: 8px; border: 1px solid #ccc; color: #000;">redemption_rate</td><td style="padding: 8px; border: 1px solid #ccc; color: #000;">Capped at 1.0</td><td style="padding: 8px; border: 1px solid #ccc; color: #000;">StandardScaler</td></tr>
        <tr><td style="padding: 8px; border: 1px solid #ccc; color: #000;">redemption_frequency</td><td style="padding: 8px; border: 1px solid #ccc; color: #000;">None</td><td style="padding: 8px; border: 1px solid #ccc; color: #000;">StandardScaler</td></tr>
    </table>
</div>
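To make the scaling column concrete, here is a minimal sketch of the z-score that a standard scaler computes, written in plain Python. It assumes the notebook uses scikit-learn's `StandardScaler`, which divides by the population standard deviation; the sample values and the helper name `standardize` are hypothetical.

```python
from statistics import fmean, pstdev

def standardize(values):
    """Z-score each value: subtract the mean, divide by the population std."""
    mean = fmean(values)
    std = pstdev(values)
    if std == 0:
        # A constant feature carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]

# Hypothetical features on very different scales (km vs. a 0-1 ratio).
avg_distance_km = [800.0, 1500.0, 2200.0, 3500.0]
companion_ratio = [0.0, 0.1, 0.3, 0.6]

z_distance = standardize(avg_distance_km)
z_ratio = standardize(companion_ratio)

# After scaling, both features have mean 0 and unit variance,
# so neither dominates a Euclidean distance used for clustering.
print([round(z, 2) for z in z_distance])
print([round(z, 2) for z in z_ratio])
```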


### Group 1: Travel Pattern

In [None]:
# Travel Pattern Feature: Average distance per flight
# Aggregates all flights per customer, then calculates avg distance

flight_totals = (
    df_Flights
    .groupby("Loyalty#", as_index=False)
    .agg(
        total_distance=("DistanceKM", "sum"),
        total_flights=("NumFlights", "sum")
    )
)

# Calculate average distance per flight (avoid division by zero)
flight_totals["avg_distance_per_flight"] = (
    flight_totals["total_distance"] / flight_totals["total_flights"].replace(0, np.nan)
).fillna(0)

# Merge to customer dataframe
df_Customer = df_Customer.merge(
    flight_totals[["Loyalty#", "avg_distance_per_flight"]], 
    on="Loyalty#", 
    how="left"
)
df_Customer["avg_distance_per_flight"] = df_Customer["avg_distance_per_flight"].fillna(0)


In [None]:
plot_numeric_distribution(df_Customer, "avg_distance_per_flight", show_pct_labels=False)

In [None]:
# Outlier Summary
avg_distance = analyze_outliers(
    df_Customer, 
    features=["avg_distance_per_flight"],
    caption='avg_distance_per_flight Outlier Summary (IQR Method)'
)

In [None]:
# Winsorize avg_distance_per_flight to cap extreme outliers
# Original IQR analysis: 2.64% outliers (300 upper above 2969 km, 58 lower below 860 km, max 13,852 km)
# 
# Why 0.25% lower, 0.5% upper (conservative thresholds)?
# - Lower tail: Few outliers (58), values around 860 km are plausible short-haul averages
# - Upper tail: More extreme values (max 13,852 km is close to the longest commercial nonstop routes, so implausible as a per-flight average)
# - After plotting: 0.5% upper caps at a reasonable upper bound for average flight distance (allows long-haul but caps implausible extremes)
# - Higher percentiles (e.g., 1%) would cap too many long-haul travelers

df_Customer['avg_distance_per_flight'] = mstats.winsorize(
    df_Customer['avg_distance_per_flight'], 
    limits=[0.0025, 0.005]  # 0.25% lower, 0.5% upper
)

In [None]:
plot_numeric_distribution(df_Customer, "avg_distance_per_flight", show_pct_labels=False)

<div style="background-color: #e8f3e5ff; padding: 15px; margin-right: 30px; border-left: 5px solid; border-image: linear-gradient(to bottom, #00411E, #00622D, #00823C, #45AF28, #82BA72) 1; border-radius: 5px; max-width: 95%;">
    <h3 style="margin-top: 0; margin-bottom: 10px; color: #00411E; font-weight: bold;">Feature 1: avg_distance_per_flight</h3>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Goal of the Feature (Why):</h4>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        Average flight distance is a primary discriminator between business and leisure travel patterns. Short-haul frequent flyers (500-1500 km) typically represent business commuters on regional routes, while long-haul travelers (4000+ km) indicate leisure or international business travel. This feature captures fundamental differences in trip purpose, route preferences, and service expectations.
    </p>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        <strong>Clustering benefit:</strong> Creates clear separation between customer segments with different travel styles. Short-haul customers require different marketing strategies (frequent flyer bonuses, lounge access) compared to long-haul customers (premium upgrades, international partnerships). Distance patterns correlate with value, loyalty tier progression, and lifetime value trajectories.
    </p>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Feature Definition and Implementation (What and How):</h4>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        Continuous ratio feature calculated as total kilometers flown divided by total number of flights over the 2019-2021 period.
    </p>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Outlier Handling & Scaling:</h4>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        The original distribution was right-skewed, with extreme outliers extending to 13,852 km (close to the longest commercial nonstop routes, and therefore implausible as a per-flight average). IQR analysis identified 2.8% outliers (383 total), predominantly in the upper tail (330 customers above 2,984 km, 53 below 833 km).
    </p>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        <strong>Winsorizing applied</strong> instead of log transformation to preserve the km interpretation. Conservative thresholds (0.25% lower, 0.5% upper) cap values at roughly 3,500 km, a reasonable upper bound that keeps legitimate long-haul travelers in the data while capping implausible averages (winsorizing replaces extreme values with the cap; it does not drop customers). Log transformation was considered but rejected because the main distribution (1000-3000 km) is already well-behaved; only the extreme tail needed treatment.
    </p>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        <strong>StandardScaler</strong> applied after winsorizing. The capped distribution is compact and suitable for mean/std scaling.
    </p>
</div>
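The capping step can be illustrated without the notebook's data. The sketch below is a plain-Python approximation of rank-based winsorizing with the same limits (0.25% lower, 0.5% upper); the function name `winsorize` and the synthetic distances are assumptions for illustration, and the notebook itself relies on `mstats.winsorize`.

```python
def winsorize(values, lower=0.0025, upper=0.005):
    """Cap the lowest/highest fractions of values at the nearest retained value."""
    n = len(values)
    ordered = sorted(values)
    k_low = int(lower * n)    # how many values get capped from below
    k_high = int(upper * n)   # how many values get capped from above
    low_cap = ordered[k_low]
    high_cap = ordered[n - k_high - 1]
    return [min(max(v, low_cap), high_cap) for v in values]

# Hypothetical data: 1,000 customer averages between roughly 1,000 and 3,000 km,
# with one implausibly short and one implausibly long average injected.
distances = [1000.0 + 2 * i for i in range(1000)]
distances[0] = 50.0
distances[-1] = 13852.0

capped = winsorize(distances)

changed = sum(1 for before, after in zip(distances, capped) if before != after)
print(min(capped), max(capped), changed)
```

Every customer stays in the data; only the values in the extreme ranks are pulled in to the caps, so the km scale remains interpretable.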


In [None]:
# Travel Pattern Feature: distance_variability
# Coefficient of variation (CV) measures how consistent a customer's flight distances are

def calculate_cv(values):
    """Calculate coefficient of variation for a list of values."""
    if len(values) < 2:
        return 0
    mean_val = np.mean(values)
    return np.std(values) / mean_val if mean_val > 0 else 0

# Filter months with flights and calculate avg distance per flight per month
flights_with_activity = df_Flights[df_Flights['NumFlights'] > 0].copy()
flights_with_activity['avg_dist_month'] = flights_with_activity['DistanceKM'] / flights_with_activity['NumFlights']

# Calculate CV of monthly avg distances per customer
distance_variability = (
    flights_with_activity.groupby('Loyalty#')['avg_dist_month']
    .apply(list)
    .apply(calculate_cv)
    .rename('distance_variability')
)

# Merge into df_Customer
df_Customer = df_Customer.merge(distance_variability, on='Loyalty#', how='left')
df_Customer['distance_variability'] = df_Customer['distance_variability'].fillna(0)

In [None]:
plot_numeric_distribution(df_Customer, "distance_variability", show_pct_labels=False)

In [None]:
# Check skewness
analyze_skewness(df_Customer, features=['distance_variability'], skew_threshold=1.0)

In [None]:
# Outlier Summary
distance_variability_outliers = analyze_outliers(
    df_Customer, 
    features=["distance_variability"],
    caption='distance_variability Outlier Summary (IQR Method)'
)



<div style="background-color: #e8f3e5ff; padding: 15px; margin-right: 30px; border-left: 5px solid; border-image: linear-gradient(to bottom, #00411E, #00622D, #00823C, #45AF28, #82BA72) 1; border-radius: 5px; max-width: 95%;">
    <h3 style="margin-top: 0; margin-bottom: 10px; color: #00411E; font-weight: bold;">Feature 2: distance_variability</h3>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Goal of the Feature (Why):</h4>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        Route consistency reveals travel purpose and planning behavior. Low variability indicates customers who repeatedly fly similar distances (business commuters, regular visitors), while high variability suggests diverse travel patterns (leisure explorers, varied business destinations). This consistency dimension is orthogonal to average distance.
    </p>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        <strong>Clustering benefit:</strong> Separates "routine travelers" from "explorers" within the same distance category. A customer with avg_distance=2000km could be flying Toronto-Vancouver repeatedly (low CV ~0.3) or mixing Toronto-New York, Toronto-Mexico, Toronto-Calgary trips (high CV ~1.2). These patterns require different marketing approaches and predict different ancillary revenue opportunities.
    </p>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Feature Definition and Implementation (What and How):</h4>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        Coefficient of variation (CV = standard deviation / mean) calculated across monthly average flight distances per customer. CV is a normalized ratio where 0 = identical distances every month, 1 = standard deviation equals mean.
    </p>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Skewness, Transformation & Scaling:</h4>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        A skewness of -0.32 is well within the ±1.0 threshold, indicating a roughly symmetric distribution centered around 0.85. <strong>No log transformation needed.</strong> IQR analysis identifies only 1.82% outliers (246 total): 174 lower outliers (CV near 0, very consistent travelers) and 72 upper outliers. The lower outliers represent customers with minimal route variation. These are legitimate behavioral patterns we want to retain, not data errors. <strong>StandardScaler</strong> applied without winsorizing or RobustScaler.
    </p>
</div>
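A small worked example may help show why the CV adds information beyond the average. The sketch below mirrors the logic of `calculate_cv` in plain Python (population standard deviation, as `np.std` uses by default); the two monthly series are hypothetical.

```python
from statistics import fmean, pstdev

def coefficient_of_variation(values):
    """Population std divided by mean; 0 for fewer than two values or a non-positive mean."""
    if len(values) < 2:
        return 0.0
    mean = fmean(values)
    return pstdev(values) / mean if mean > 0 else 0.0

# Hypothetical monthly average distances (km); both customers average 2,000 km.
routine = [1900.0, 2000.0, 2100.0, 2000.0]
explorer = [500.0, 3500.0, 800.0, 3200.0]

cv_routine = coefficient_of_variation(routine)
cv_explorer = coefficient_of_variation(explorer)

print(round(cv_routine, 3), round(cv_explorer, 3))
```

Both customers would look identical on avg_distance_per_flight, yet their CVs differ by more than an order of magnitude.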


In [None]:
# Travel Pattern Feature: companion_flight_ratio
# Ratio of flights with companions vs total flights (business alone vs leisure with family/friends)

# Filter months with flights
flights_with_activity = df_Flights[df_Flights['NumFlights'] > 0]

# Aggregate companion flights per customer
customer_companions = flights_with_activity.groupby('Loyalty#').agg({
    'NumFlightsWithCompanions': 'sum',
    'NumFlights': 'sum'
}).reset_index()

# Calculate ratio
customer_companions['companion_flight_ratio'] = (
    customer_companions['NumFlightsWithCompanions'] / 
    customer_companions['NumFlights']
)

# Merge into df_Customer
df_Customer = df_Customer.merge(
    customer_companions[['Loyalty#', 'companion_flight_ratio']], 
    on='Loyalty#', 
    how='left'
)
df_Customer['companion_flight_ratio'] = df_Customer['companion_flight_ratio'].fillna(0)


In [None]:
plot_numeric_distribution(df_Customer, "companion_flight_ratio", show_pct_labels=False)

In [None]:
# Check skewness
analyze_skewness(df_Customer, features=['companion_flight_ratio'], skew_threshold=1.0)

In [None]:
# Outlier Summary
companion_flight_ratio_outliers = analyze_outliers(
    df_Customer, 
    features=["companion_flight_ratio"],
    caption='Companion Flight Ratio Outlier Summary (IQR Method)'
)

In [None]:
# Check customers with companion_flight_ratio >= 0.9 (almost always fly with companions)
# After examining the outliers we found these are mostly inactive customers who flew in 2020 but not in 2021. Since our clustering approach focuses only on active customers anyway, we check what proportion of these extreme values are inactive.

companion_always = df_Customer[df_Customer['companion_flight_ratio'] >= 0.9]
companion_always_inactive = companion_always[~companion_always['is_active']]

print(f'Customers with companion_flight_ratio >= 0.9: {len(companion_always)}')
print(f'Of those, inactive: {len(companion_always_inactive)} ({len(companion_always_inactive)/len(companion_always)*100:.1f}%)')


<div style="background-color: #e8f3e5ff; padding: 15px; margin-right: 30px; border-left: 5px solid; border-image: linear-gradient(to bottom, #00411E, #00622D, #00823C, #45AF28, #82BA72) 1; border-radius: 5px; max-width: 95%;">
    <h3 style="margin-top: 0; margin-bottom: 10px; color: #00411E; font-weight: bold;">Feature 3: companion_flight_ratio</h3>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Goal of the Feature (Why):</h4>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        Companion behavior is a strong proxy for trip purpose and customer lifecycle stage. Solo travelers predominantly represent business trips, while customers flying with companions indicate leisure travel, family vacations, or couple trips. This dimension directly impacts ancillary revenue (extra seats, baggage, meals) and appropriate service offerings.
    </p>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        <strong>Clustering benefit:</strong> Enables clear separation between business-focused solo travelers and leisure-oriented group travelers. High companion ratio customers respond to family packages, group discounts, and destination-based marketing, while low ratio customers require business amenities and flexible booking policies. Critical for tailoring service delivery and revenue optimization strategies.
    </p>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Feature Definition and Implementation (What and How):</h4>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        Continuous ratio feature (0-1) calculated as proportion of flights taken with one or more companions out of total flights. Aggregated across all flights from 2019-2021.
    </p>
    <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; color: #000;">
        <li><strong>0.0:</strong> Pure solo traveler - likely business or independent leisure</li>
        <li><strong>0.3-0.6:</strong> Mixed traveler - combination of solo and group trips</li>
        <li><strong>0.6-1.0:</strong> Predominantly group traveler - family/leisure focus</li>
    </ul>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Skewness, Transformation & Scaling:</h4>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        Skewness of 0.80 is below our threshold (1.0), so <strong>no transformation required</strong>. Log transformation would be inappropriate anyway for bounded ratio features (0-1): it's undefined at 0 (where we have a concentration of solo travelers), and the "right tail" up to 1.0 is not a magnitude problem but simply customers who always fly with companions.
    </p>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        <strong>Outlier Analysis:</strong> IQR analysis detected 306 outliers (2.26%): 206 above the upper bound of 0.48 and 100 below the lower bound. Notably, all 12 customers with companion_flight_ratio >= 0.9 are inactive (100%), meaning these extreme values won't affect our clustering of active customers. The remaining outliers represent legitimate behavioral patterns valuable for clustering insights.
    </p>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        <strong>Decision:</strong> <strong>StandardScaler</strong> applied without transformation or winsorizing. The bounded nature (0-1) prevents extreme outliers, and retaining the original values preserves important behavioral distinctions between solo and group travelers that are critical for clustering.
    </p>
</div>
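The outlier counts quoted in this section come from the notebook's `analyze_outliers` helper, whose implementation is not shown here. As a rough sketch, assuming it applies the usual Tukey rule (1.5 × IQR beyond the quartiles), the bounds can be computed as follows; the function name `iqr_bounds` and the sample ratios are hypothetical.

```python
from statistics import quantiles

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) Tukey fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical companion ratios: mostly low values plus one near-always-accompanied customer.
ratios = [0.0, 0.05, 0.1, 0.15, 0.2, 0.2, 0.25, 0.3, 0.35, 0.95]

lower, upper = iqr_bounds(ratios)
outliers = [r for r in ratios if r < lower or r > upper]

print(round(lower, 4), round(upper, 4), outliers)
```

For a ratio bounded at 0, the lower fence can fall below zero, in which case no lower outliers are possible.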


In [None]:
# Travel Pattern Feature: flight_regularity
# Measures how consistently a customer flies across months (from first flight onwards)
# High regularity = business traveler with consistent schedule, Low = sporadic leisure traveler

# Get monthly flight counts per customer
monthly_flight_counts = df_Flights.groupby(['Loyalty#', 'YearMonthDate'])['NumFlights'].sum().unstack(fill_value=0)

# Sort columns chronologically
monthly_flight_counts = monthly_flight_counts.reindex(sorted(monthly_flight_counts.columns), axis=1)

# Find first flight month index per customer
first_flight = df_Flights[df_Flights['NumFlights'] > 0].groupby('Loyalty#')['YearMonthDate'].min()

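# Calculate regularity: 1 / (1 + CV) where CV = std/mean, only for months >= first flight
# Note: pandas .std() is the sample std (ddof=1); a single remaining month gives NaN, which fillna(0) below maps to 0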
def calc_regularity(row):
    loyalty_id = row.name
    if loyalty_id not in first_flight.index:
        return 0
    first_date = first_flight[loyalty_id]
    # Get column positions from first flight onwards
    cols = monthly_flight_counts.columns.tolist()
    if first_date not in cols:
        return 0
    start_idx = cols.index(first_date)
    values = row.iloc[start_idx:]
    mean_val = values.mean()
    if mean_val < 0.01:
        return 0
    return 1 / (1 + values.std() / mean_val)

flight_regularity = monthly_flight_counts.apply(calc_regularity, axis=1).rename('flight_regularity')

# Merge into df_Customer
df_Customer = df_Customer.merge(flight_regularity, on='Loyalty#', how='left')
df_Customer['flight_regularity'] = df_Customer['flight_regularity'].fillna(0)

In [None]:
plot_numeric_distribution(df_Customer, "flight_regularity", show_pct_labels=False)

In [None]:
# Check skewness
analyze_skewness(df_Customer, features=['flight_regularity'], skew_threshold=1.0)

In [None]:
# Outlier Summary
flight_regularity_outliers = analyze_outliers(
    df_Customer,
    features=["flight_regularity"],
    caption='Flight Regularity Outlier Summary (IQR Method)'
)

In [None]:
# After examining the outliers, we found these are mostly inactive customers who flew in 2020 but not in 2021.
# Since our clustering approach focuses only on active customers anyway, we check what proportion
# of these extreme values are inactive.

# Check how many lower-bound outliers are inactive customers
lower_bound_outliers = df_Customer[df_Customer['flight_regularity'] <= 0.37]
inactive_outliers = lower_bound_outliers[~lower_bound_outliers['is_active']]

print(f"Total outliers below lower bound: {len(lower_bound_outliers)}")
print(f"Inactive customers among outliers: {len(inactive_outliers)} ({len(inactive_outliers)/len(lower_bound_outliers)*100:.1f}%)")


In [None]:
# 77.8% of lower-bound outliers (441 of 567) are inactive customers
# Plotting only active customers to show the distribution relevant for clustering
plot_numeric_distribution(df_Customer[df_Customer['is_active']], "flight_regularity", show_pct_labels=False)

<div style="background-color: #e8f3e5ff; padding: 15px; margin-right: 30px; border-left: 5px solid; border-image: linear-gradient(to bottom, #00411E, #00622D, #00823C, #45AF28, #82BA72) 1; border-radius: 5px; max-width: 95%;">
    <h3 style="margin-top: 0; margin-bottom: 10px; color: #00411E; font-weight: bold;">Feature 4: flight_regularity</h3>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Goal of the Feature (Why):</h4>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        Flight regularity measures <strong>month-to-month consistency</strong> of travel activity. While seasonal_concentration captures season-level patterns (4 seasons), this feature uses monthly data points to distinguish <strong>routine travelers</strong> (consistent monthly flights) from <strong>sporadic travelers</strong> (unpredictable bursts). Two customers with identical total flights can have completely different rhythms.
    </p>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        <strong>Clustering benefit:</strong> Separates predictable business commuters (fly every week/month) from event-driven travelers (project-based, vacation-only). This distinction is critical for campaign timing, service design, and capacity planning. High regularity customers respond to subscription offers and routine-based perks, while low regularity customers need flexible policies and triggered campaigns.
    </p>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Feature Definition and Implementation (What and How):</h4>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        Continuous feature (0-1) calculated as a decreasing function of the coefficient of variation (CV) of monthly flights: <strong>regularity = 1 / (1 + CV)</strong>, where CV = std / mean. A CV of 0 (identical flights every month) gives 1, and the score falls toward 0 as the CV grows. Only months from the customer's first flight onwards are considered to avoid penalizing late joiners.
    </p>
    <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; color: #000;">
        <li><strong>0.8-1.0:</strong> Very regular - similar flights each month (business commuter)</li>
        <li><strong>0.5-0.8:</strong> Moderate - some variation but consistent presence</li>
        <li><strong>0.3-0.5:</strong> Irregular - noticeable gaps between activity periods</li>
        <li><strong>0.0-0.3:</strong> Very irregular - sporadic bursts with long gaps</li>
    </ul>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Skewness, Outliers & Scaling:</h4>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        Initial skewness of -1.39 indicated left-skew exceeding the threshold (1.0). IQR detected 677 outliers (5.0%), with 569 below the lower bound (0.37) and 108 above the upper bound (0.64).
    </p>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        <strong>Root cause investigation:</strong> Analysis revealed that <strong>77.8% of low-regularity customers (441 of the 567 with flight_regularity at or below the 0.37 cutoff) are inactive</strong>. These customers flew in 2020 but stopped flying in 2021, creating long stretches of zero-flight months in their observation window. The regularity formula (based on the CV of monthly flights) naturally produces low scores when most months have zero flights with occasional activity bursts. This is not a data quality issue but rather the feature correctly identifying churned or dormant customers.
    </p>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        <strong>Decision:</strong> <strong>No transformation or capping applied.</strong> The left-skew disappears when filtering to active customers only, as shown in the distribution plot above. Since behavioral clustering in Section 8 focuses exclusively on active customers, these low-regularity inactive customers will be naturally excluded from the analysis. The remaining active customers show a well-behaved distribution suitable for <strong>StandardScaler</strong>.
    </p>
</div>
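The effect of dormant months on the score is easy to reproduce. The sketch below implements regularity = 1 / (1 + CV) in plain Python with the sample standard deviation (matching the pandas default used in `calc_regularity`); the three monthly series are hypothetical.

```python
from statistics import fmean, stdev

def regularity(monthly_flights):
    """1 / (1 + CV) of monthly flight counts, using the sample std (ddof=1)."""
    if len(monthly_flights) < 2:
        return 0.0  # a single month has no defined sample std
    mean = fmean(monthly_flights)
    if mean < 0.01:
        return 0.0
    return 1 / (1 + stdev(monthly_flights) / mean)

# Hypothetical 12-month histories, starting at each customer's first flight.
commuter = [2] * 12                   # same activity every month
vacationer = [0] * 5 + [6, 6] + [0] * 5  # two busy months, otherwise idle
dormant = [3, 2, 3] + [0] * 9         # active early, then stopped flying

scores = {
    "commuter": regularity(commuter),
    "vacationer": regularity(vacationer),
    "dormant": regularity(dormant),
}
print({name: round(score, 3) for name, score in scores.items()})
```

The dormant customer lands below the 0.37 cutoff purely because of the trailing zero months, which matches the root-cause explanation above.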

### Group 2: Temporal Pattern

In [None]:
# Temporal Pattern Feature: seasonal_concentration
# Measures how concentrated a customer's flights are across seasons
# Low = year-round traveler (business), High = seasonal traveler (leisure/vacation)

# Add Season mapping (Winter: Dec, Jan, Feb | Spring: Mar, Apr, May | Summer: Jun, Jul, Aug | Fall: Sep, Oct, Nov)
df_Flights['Season'] = df_Flights['Month'].map({
    12: 'Winter', 1: 'Winter', 2: 'Winter',
    3: 'Spring', 4: 'Spring', 5: 'Spring',
    6: 'Summer', 7: 'Summer', 8: 'Summer',
    9: 'Fall', 10: 'Fall', 11: 'Fall'
})

# Aggregate flights per customer per season (across all years)
seasonal_flights = df_Flights.groupby(['Loyalty#', 'Season'])['NumFlights'].sum().unstack(fill_value=0)

# Ensure all seasons exist
for season in ['Winter', 'Spring', 'Summer', 'Fall']:
    if season not in seasonal_flights.columns:
        seasonal_flights[season] = 0

# Calculate Gini coefficient for seasonal concentration
def calculate_gini(values):
    values = np.sort(values)
    n = len(values)
    total = values.sum()
    if total == 0:
        return 0
    cumsum = np.cumsum(values)
    return (n + 1 - 2 * cumsum.sum() / total) / n

seasonal_concentration = seasonal_flights.apply(lambda row: calculate_gini(row.values), axis=1).rename('seasonal_concentration')

# Merge into df_Customer
df_Customer = df_Customer.merge(seasonal_concentration, on='Loyalty#', how='left')
df_Customer['seasonal_concentration'] = df_Customer['seasonal_concentration'].fillna(0)



In [None]:
plot_numeric_distribution(df_Customer, "seasonal_concentration", show_pct_labels=False)

In [None]:
# Check skewness
analyze_skewness(df_Customer, features=['seasonal_concentration'], skew_threshold=1.0)

In [None]:
# Check customers with seasonal_concentration >= 0.7 (highly concentrated seasonal flying)
# After examining the outliers we found these are mostly inactive customers who flew in 2020 but not in 2021. Since our clustering approach focuses only on active customers anyway, we check what proportion of these extreme values are inactive.

seasonal_extreme = df_Customer[df_Customer['seasonal_concentration'] >= 0.7]
seasonal_extreme_inactive = seasonal_extreme[~seasonal_extreme['is_active']]

print(f'Customers with seasonal_concentration >= 0.7: {len(seasonal_extreme)}')
print(f'Of those, inactive: {len(seasonal_extreme_inactive)} ({len(seasonal_extreme_inactive)/len(seasonal_extreme)*100:.1f}%)')


<div style="background-color: #e8f3e5ff; padding: 15px; margin-right: 30px; border-left: 5px solid; border-image: linear-gradient(to bottom, #00411E, #00622D, #00823C, #45AF28, #82BA72) 1; border-radius: 5px; max-width: 95%;">
    <h3 style="margin-top: 0; margin-bottom: 10px; color: #00411E; font-weight: bold;">Feature 5: seasonal_concentration</h3>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Goal of the Feature (Why):</h4>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        Seasonal concentration measures temporal distribution of travel activity throughout the year. Year-round travelers (low concentration) typically represent business customers with consistent travel needs, while highly seasonal travelers (high concentration) indicate vacation-focused leisure customers who concentrate trips in specific seasons. This pattern fundamentally affects campaign timing, inventory allocation, and revenue predictability.
    </p>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        <strong>Clustering benefit:</strong> Enables strategic differentiation in marketing timing and resource allocation. Low concentration customers receive continuous engagement throughout the year, while high concentration customers require focused pre-season campaigns (e.g., promoting summer destinations in early spring). This feature also predicts revenue volatility and helps identify customers suitable for off-season promotional targeting to smooth demand.
    </p>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Feature Definition and Implementation (What and How):</h4>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        Continuous feature calculated using the Gini coefficient across seasonal flight totals (Winter: Dec-Feb, Spring: Mar-May, Summer: Jun-Aug, Fall: Sep-Nov). Gini measures inequality in distribution: perfect equality (flights evenly distributed across seasons) yields 0, while maximum inequality (all flights in one season) yields (n - 1) / n = 0.75 for n = 4 seasons, because the formula used here is not rescaled to reach 1.
    </p>
    <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; color: #000;">
        <li><strong>Low (0-0.3):</strong> Year-round traveler - consistent across seasons (business pattern)</li>
        <li><strong>Medium (0.3-0.6):</strong> Moderate seasonality - some preference but active year-round</li>
        <li><strong>High (0.6-0.75):</strong> Highly seasonal - concentrated in 1-2 seasons (vacation pattern)</li>
    </ul>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Skewness, Transformation & Scaling:</h4>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        Skewness of 1.48 is above our threshold (1.0), indicating right-skew from customers with highly concentrated seasonal patterns. However, <strong>no log transformation applied</strong> because the Gini coefficient is bounded (0-1) and log is undefined at 0. The right tail represents legitimate behavioral patterns of vacation-focused travelers.
    </p>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        <strong>Outlier Analysis:</strong> Of 121 customers with seasonal_concentration >= 0.7, 106 (88%) are inactive. These extreme values predominantly represent customers who flew only in 2019/2020 but not in 2021, naturally resulting in concentrated seasonal patterns. Since our clustering focuses on active customers, most extreme values are automatically excluded.
    </p>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        <strong>Decision:</strong> <strong>StandardScaler</strong> applied without transformation. The bounded nature (0-1) limits outlier impact, and the remaining active customers with high concentration represent genuine vacation-focused travelers valuable for cluster differentiation.
    </p>
</div>
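
The cell that computes seasonal_concentration is not part of this section, so the following is only a minimal sketch of a Gini coefficient over four seasonal totals, using the standard sorted-rank formula. The function name `gini`, the `normalize` flag, and the sample totals are illustrative assumptions. With four categories the unnormalized coefficient tops out at (n-1)/n = 0.75 when all flights fall in one season; it only reaches 1 with the small-sample correction n/(n-1). Which variant the notebook uses is not visible here.

```python
def gini(values, normalize=False):
    """Gini coefficient of non-negative values (0 = perfectly even)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n < 2 or total == 0:
        return 0.0  # assumption: customers without flights get 0
    # Rank-weighted sum over ascending values, ranks start at 1
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    g = 2 * weighted / (n * total) - (n + 1) / n
    # Optional small-sample correction so the maximum becomes exactly 1
    return g * n / (n - 1) if normalize else g

# Hypothetical seasonal totals in the order Winter, Spring, Summer, Fall
even = gini([5, 5, 5, 5])
two_seasons = gini([0, 6, 6, 0])
one_season = gini([0, 0, 12, 0])
one_season_normalized = gini([0, 0, 12, 0], normalize=True)
```

Under these assumptions an even traveler scores 0, a two-season traveler 0.5, and a single-season traveler 0.75 (or 1.0 with the correction), so the bucket boundaries above should be read against whichever variant was actually applied.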


In [None]:
# Temporal Pattern Feature: peak_season_sin & peak_season_cos
# Identifies WHICH season has the customer's peak travel activity
# Cyclical encoding ensures Winter and Fall are mathematically close (adjacent seasons)

# Identify peak season for each customer (reuses seasonal_flights from above)
peak_season = seasonal_flights.idxmax(axis=1)

# Map seasons to numeric values for cyclical encoding (0-3, following calendar cycle)
season_to_numeric = {
    'Winter': 0,   # Dec, Jan, Feb
    'Spring': 1,   # Mar, Apr, May
    'Summer': 2,   # Jun, Jul, Aug
    'Fall': 3      # Sep, Oct, Nov
}
peak_season_numeric = peak_season.map(season_to_numeric)

# Cyclical encoding: angle = 2π * season / 4 (each season = 90° on unit circle)
peak_season_sin = np.sin(2 * np.pi * peak_season_numeric / 4).rename('peak_season_sin')
peak_season_cos = np.cos(2 * np.pi * peak_season_numeric / 4).rename('peak_season_cos')

# Merge into df_Customer
df_Customer = df_Customer.merge(peak_season_sin, on='Loyalty#', how='left')
df_Customer = df_Customer.merge(peak_season_cos, on='Loyalty#', how='left')
df_Customer['peak_season_sin'] = df_Customer['peak_season_sin'].fillna(0)
df_Customer['peak_season_cos'] = df_Customer['peak_season_cos'].fillna(0)


In [None]:
# Peak Season Distribution
df_Customer['peak_season'] = df_Customer['Loyalty#'].map(peak_season)
plot_categorical_distribution(df_Customer, 'peak_season')

<div style="background-color: #e8f3e5ff; padding: 15px; margin-right: 30px; border-left: 5px solid; border-image: linear-gradient(to bottom, #00411E, #00622D, #00823C, #45AF28, #82BA72) 1; border-radius: 5px; max-width: 95%;">
    <h3 style="margin-top: 0; margin-bottom: 10px; color: #00411E; font-weight: bold;">Feature 6 & 7: peak_season_sin & peak_season_cos</h3>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Goal of the Feature (Why):</h4>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        While <strong>seasonal_concentration</strong> measures <strong>how concentrated</strong> travel is across seasons, it doesn't reveal <strong>when</strong> customers prefer to fly. Two customers with identical concentration (e.g., 0.75) could be a Winter ski traveler vs. a Summer beach traveler - requiring completely different marketing campaigns at different times.
    </p>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        <strong>Clustering benefit:</strong> Enables precise campaign timing based on individual peak season. Winter-peak customers receive ski resort promotions in September-November, while Summer-peak customers receive beach destination promotions in March-May. Combined with <strong>seasonal_concentration</strong>, this creates actionable segments for targeted marketing with optimal timing.
    </p>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Feature Definition and Implementation (What and How):</h4>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        Two continuous features (-1 to +1) representing the customer's peak travel season using cyclical sine/cosine encoding. The peak season is determined by which season (Winter, Spring, Summer, Fall) has the highest flight count, then encoded as a position on the unit circle.
    </p>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        <strong>Formula:</strong> angle = 2π × season_numeric / 4 (each season = 90° on unit circle)
    </p>
    <table style="margin: 10px 0; border-collapse: collapse; width: 80%;">
        <tr style="background-color: #00411E; color: white;">
            <th style="padding: 8px; border: 1px solid #00411E;">Season</th>
            <th style="padding: 8px; border: 1px solid #00411E;">Angle</th>
            <th style="padding: 8px; border: 1px solid #00411E;">sin</th>
            <th style="padding: 8px; border: 1px solid #00411E;">cos</th>
        </tr>
        <tr><td style="padding: 8px; border: 1px solid #ccc; color: #000;">Winter (Dec-Feb)</td><td style="padding: 8px; border: 1px solid #ccc; color: #000;">0°</td><td style="padding: 8px; border: 1px solid #ccc; color: #000;">0.00</td><td style="padding: 8px; border: 1px solid #ccc; color: #000;">+1.00</td></tr>
        <tr><td style="padding: 8px; border: 1px solid #ccc; color: #000;">Spring (Mar-May)</td><td style="padding: 8px; border: 1px solid #ccc; color: #000;">90°</td><td style="padding: 8px; border: 1px solid #ccc; color: #000;">+1.00</td><td style="padding: 8px; border: 1px solid #ccc; color: #000;">0.00</td></tr>
        <tr><td style="padding: 8px; border: 1px solid #ccc; color: #000;">Summer (Jun-Aug)</td><td style="padding: 8px; border: 1px solid #ccc; color: #000;">180°</td><td style="padding: 8px; border: 1px solid #ccc; color: #000;">0.00</td><td style="padding: 8px; border: 1px solid #ccc; color: #000;">-1.00</td></tr>
        <tr><td style="padding: 8px; border: 1px solid #ccc; color: #000;">Fall (Sep-Nov)</td><td style="padding: 8px; border: 1px solid #ccc; color: #000;">270°</td><td style="padding: 8px; border: 1px solid #ccc; color: #000;">-1.00</td><td style="padding: 8px; border: 1px solid #ccc; color: #000;">0.00</td></tr>
    </table>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Distribution:</h4>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        <strong>Summer dominates</strong> with 51.8% of customers having their peak travel in summer months, a pattern typical of leisure-focused travel. The remaining customers are spread across Spring (18.8%), Fall (17.4%), and Winter (12.0%); these smaller groups plausibly include business travelers, ski enthusiasts, and holiday-season travelers.
    </p>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Why Cyclical Encoding (Not Ordinal or One-Hot)?</h4>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        <strong>Ordinal encoding (0,1,2,3)</strong> incorrectly treats Winter-Fall as maximally distant (|0-3|=3) when they are adjacent seasons. <strong>One-Hot encoding</strong> requires 4 features and treats all seasons as equally different, losing the adjacent/opposite relationship. <strong>Sin/Cos encoding</strong> preserves cyclical nature: adjacent seasons (Winter-Fall, Winter-Spring) have distance 1.41, while opposite seasons (Winter-Summer) have maximum distance 2.0.
    </p>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Scaling Decision:</h4>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        <strong>No scaling applied.</strong> Sin/cos values are already bounded between -1 and +1, a range similar in magnitude to standardized features (mean 0, unit variance). Additionally, rescaling each component independently would distort the unit circle geometry that makes this encoding mathematically correct for distance-based clustering. Outliers are not possible because of the bounded range.
    </p>
</div>
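
The distances quoted above can be checked with a few lines of standard-library Python. This is a sketch that reuses the season-to-number mapping from the encoding cell; the helper names `encode` and `distance` are illustrative and not part of the notebook.

```python
import math
from itertools import combinations

SEASONS = {"Winter": 0, "Spring": 1, "Summer": 2, "Fall": 3}

def encode(season):
    """Return (sin, cos) of the season's angle on the unit circle."""
    angle = 2 * math.pi * SEASONS[season] / 4
    return math.sin(angle), math.cos(angle)

def distance(a, b):
    """Euclidean distance between two encoded seasons."""
    return math.dist(encode(a), encode(b))

# Pairwise distances, rounded to two decimals
distances = {(a, b): round(distance(a, b), 2) for a, b in combinations(SEASONS, 2)}

# Customers without flight data are filled with (0, 0), the circle's center
center_to_winter = math.dist((0.0, 0.0), encode("Winter"))
```

Adjacent pairs such as Winter-Fall come out at 1.41 and opposite pairs such as Winter-Summer at 2.0. The (0, 0) fill value used for customers without flight data sits at distance 1 from every season, so it does not favor any of them.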


### Group 3: Loyalty Engagement

In [None]:
# Loyalty Engagement Feature: redemption_rate
# Measures what percentage of accumulated points has been redeemed
# Low = hoarder or disengaged, High = active program user

# Calculate total points accumulated and redeemed per customer (across all months)
points_agg = df_Flights.groupby('Loyalty#').agg({
    'PointsAccumulated': 'sum',
    'PointsRedeemed': 'sum'
}).rename(columns={
    'PointsAccumulated': 'total_points_accumulated',
    'PointsRedeemed': 'total_points_redeemed'
})

# Calculate redemption rate (handle division by zero for customers with no accumulation)
points_agg['redemption_rate'] = np.where(
    points_agg['total_points_accumulated'] > 0,
    points_agg['total_points_redeemed'] / points_agg['total_points_accumulated'],
    0
)

# Merge into df_Customer
df_Customer = df_Customer.merge(points_agg[['redemption_rate']], on='Loyalty#', how='left')
df_Customer['redemption_rate'] = df_Customer['redemption_rate'].fillna(0)


In [None]:
plot_numeric_distribution(df_Customer, "redemption_rate", show_pct_labels=False)

In [None]:
# Outlier Summary
redemption_outliers = analyze_outliers(
    df_Customer, 
    features=["redemption_rate"],
    caption='Redemption Rate Outlier Summary (IQR Method)'
)

In [None]:
# Check: most customers with redemption_rate > 1 enrolled before 2019
over_1 = df_Customer[df_Customer['redemption_rate'] > 1]
print(f"Customers with redemption_rate > 1: {len(over_1)}")
print(f"Enrollment date range: {over_1['EnrollmentDateOpening'].min()} to {over_1['EnrollmentDateOpening'].max()}")

# Breakdown by enrollment year
enrolled_before_2019 = over_1[over_1['EnrollmentDateOpening'] < pd.to_datetime('2019-01-01')]
enrolled_2019_or_later = over_1[over_1['EnrollmentDateOpening'] >= pd.to_datetime('2019-01-01')]
print(f"\nEnrolled before 2019: {len(enrolled_before_2019)} ({len(enrolled_before_2019)/len(over_1)*100:.1f}%)")
print(f"Enrolled 2019 or later: {len(enrolled_2019_or_later)} ({len(enrolled_2019_or_later)/len(over_1)*100:.1f}%)")

In [None]:
# This is valid behavior, not a data error - cap at 1.0 to normalize the ratio
df_Customer['redemption_rate'] = df_Customer['redemption_rate'].clip(0, 1)

In [None]:
# Plot final distribution after capping
plot_numeric_distribution(df_Customer, "redemption_rate", show_pct_labels=False)

<div style="background-color: #e8f3e5ff; padding: 15px; margin-right: 30px; border-left: 5px solid; border-image: linear-gradient(to bottom, #00411E, #00622D, #00823C, #45AF28, #82BA72) 1; border-radius: 5px; max-width: 95%;">
    <h3 style="margin-top: 0; margin-bottom: 10px; color: #00411E; font-weight: bold;">Feature 8: redemption_rate</h3>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Goal of the Feature (Why):</h4>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        Redemption rate measures overall loyalty program engagement as the proportion of accumulated points that have been redeemed. This is a fundamental behavioral indicator: active redeemers (high rate) demonstrate program understanding and value perception, while hoarders (low rate) either save for aspirational rewards or lack engagement with redemption options. The rate directly correlates with program satisfaction and stickiness.
    </p>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        <strong>Clustering benefit:</strong> Separates engaged users from passive accumulators. High redemption rate customers are active program participants requiring diverse redemption options and frequent reward refreshes. Low rate customers may need education on redemption value, special promotions to trigger engagement, or warnings about point expiration. From EDA, short-haul frequent flyers often show low redemption (hoarding for big rewards), making this feature critical for understanding strategic vs opportunistic program usage.
    </p>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Feature Definition and Implementation (What and How):</h4>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        Continuous ratio feature (0-1) calculated as total points redeemed divided by total points accumulated over 2020-2021. Represents overall utilization of earned loyalty currency.
    </p>
    <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; color: #000;">
        <li><strong>0.0:</strong> Never redeemed - either saving strategically or disengaged from program</li>
        <li><strong>0.0-0.2:</strong> Hoarder - accumulates without redeeming (saving or disengaged)</li>
        <li><strong>0.2-0.6:</strong> Moderate user - balanced accumulation and redemption</li>
        <li><strong>0.6-1.0:</strong> Active redeemer - regularly uses points (engaged, values program)</li>
    </ul>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Data Quality and Capping:</h4>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        <strong>Values > 1:</strong> 158 customers showed redemption_rate > 1. Of these, 143 (90.5%) enrolled before 2019 and accumulated points before our data window, while 15 (9.5%) enrolled in 2019 or later. This is valid behavior: they redeemed more points than they accumulated within 2020-2021 because they had prior balances. These are <strong>capped at 1.0</strong> to normalize the ratio while retaining these engaged customers.
    </p>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        <strong>Values = 0:</strong> Customers with zero redemption are valid - they simply never redeemed any points during 2020-2021 (strategic savers or disengaged). No capping needed at lower bound.
    </p>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Skewness, Outliers & Scaling:</h4>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        After capping, distribution shows right-skew with concentration at 0 (non-redeemers) and spread across 0.1-1.0 (active users).
    </p>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        <strong>Decision:</strong> <strong>StandardScaler</strong> applied after capping. The bounded nature (0-1 after capping) limits extreme outliers, and the concentration at zero represents meaningful behavioral distinction (non-redeemers) rather than problematic outliers.
    </p>
</div>
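
As a worked illustration of the ratio and the cap, the sketch below applies the same rules as the cells above (zero when nothing was accumulated, clip to the 0-1 range) to a few made-up point totals. The function name and the numbers are hypothetical.

```python
def redemption_rate(points_accumulated, points_redeemed):
    """Share of accumulated points that was redeemed, capped to [0, 1]."""
    if points_accumulated <= 0:
        return 0.0  # mirrors the np.where fallback for zero accumulation
    raw = points_redeemed / points_accumulated
    return min(max(raw, 0.0), 1.0)  # same effect as .clip(0, 1)

# Hypothetical customers: (accumulated, redeemed)
never_redeemed = redemption_rate(10_000, 0)
moderate_user = redemption_rate(10_000, 2_500)
prior_balance = redemption_rate(4_000, 6_000)   # raw ratio 1.5, capped
no_accumulation = redemption_rate(0, 500)
```

The third customer redeems more than was earned in the window and ends up at exactly 1.0, while the fourth, who redeemed without accumulating anything in the window, is assigned 0 by the fallback rule.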

In [None]:
# Loyalty Engagement Feature: redemption_frequency
# Measures how often a customer redeems points relative to their time in the program
# Calculated from first points accumulation (not enrollment) to end of data window

# Find the first month with points accumulated per customer.
# Sorting chronologically first ensures that .first() returns the Year and Month
# of the same (earliest) row; aggregating Year and Month separately would not.
first_acc_data = (
    df_Flights[df_Flights['PointsAccumulated'] > 0]
    .sort_values('YearMonthDate')
    .groupby('Loyalty#')
    .first()[['Year', 'Month']]
)

# Calculate months available for redemption (from first accumulation to Dec 2021)
# End date: December 2021 = Year 2021, Month 12
first_acc_data['months_available'] = (2021 - first_acc_data['Year']) * 12 + (12 - first_acc_data['Month']) + 1

# Count months with any redemption activity per customer
monthly_redemptions = df_Flights.groupby(['Loyalty#', 'Year', 'Month'])['PointsRedeemed'].sum()
months_with_redemptions = (monthly_redemptions > 0).groupby('Loyalty#').sum().rename('months_redeemed')

# Calculate frequency (proportion of available months with redemptions)
redemption_freq = first_acc_data[['months_available']].merge(months_with_redemptions, on='Loyalty#', how='left')
redemption_freq['months_redeemed'] = redemption_freq['months_redeemed'].fillna(0)
redemption_freq['redemption_frequency'] = redemption_freq['months_redeemed'] / redemption_freq['months_available']
redemption_freq['redemption_frequency'] = redemption_freq['redemption_frequency'].clip(0, 1)

# Merge into df_Customer
df_Customer = df_Customer.merge(redemption_freq[['redemption_frequency']], on='Loyalty#', how='left')
df_Customer['redemption_frequency'] = df_Customer['redemption_frequency'].fillna(0)


In [None]:
plot_numeric_distribution(df_Customer, "redemption_frequency", show_pct_labels=False)

In [None]:
# Check skewness
analyze_skewness(df_Customer, features=['redemption_frequency'], skew_threshold=1.0)

In [None]:
# Outlier Summary
redemption_frequency_outliers = analyze_outliers(
    df_Customer, 
    features=["redemption_frequency"],
    caption='Redemption Frequency Outlier Summary (IQR Method)'
)

<div style="background-color: #e8f3e5ff; padding: 15px; margin-right: 30px; border-left: 5px solid; border-image: linear-gradient(to bottom, #00411E, #00622D, #00823C, #45AF28, #82BA72) 1; border-radius: 5px; max-width: 95%;">
    <h3 style="margin-top: 0; margin-bottom: 10px; color: #00411E; font-weight: bold;">Feature 9: redemption_frequency</h3>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Goal of the Feature (Why):</h4>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        While redemption_rate measures <strong>how much</strong> is redeemed (total percentage), redemption_frequency captures <strong>how often</strong> redemptions occur. This temporal dimension distinguishes "cash-like users" who redeem small amounts frequently from "big savers" who redeem large amounts rarely. Frequency indicates redemption strategy: continuous small redemptions suggest opportunistic point usage for upgrades and minor rewards, while infrequent redemptions indicate strategic saving for major rewards.
    </p>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        <strong>Clustering benefit:</strong> Complements redemption_rate and, together with it, spans four behavioral quadrants. High rate + high frequency = active continuous users. High rate + low frequency = strategic savers. Low rate + high frequency = frequent small redemptions that use only a minor share of the balance. Low rate + low frequency = disengaged hoarders. Marketing strategies differ dramatically: frequent redeemers need diverse small-reward options, while infrequent redeemers need aspirational big-ticket promotions.
    </p>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Feature Definition and Implementation (What and How):</h4>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        Continuous ratio feature (0-1) calculated as the proportion of available months (from first points accumulation to Dec 2021) where customer had any redemption activity. This normalization ensures fair comparison between customers who joined at different times.
    </p>
    <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; color: #000;">
        <li><strong>0.0:</strong> Never redeemed - either strategic saver or disengaged</li>
        <li><strong>0.0-0.1:</strong> Very infrequent - redeems once or twice per year</li>
        <li><strong>0.1-0.2:</strong> Occasional redeemer - redeems every few months</li>
        <li><strong>0.2+:</strong> Frequent redeemer - redeems regularly (cash-like usage)</li>
    </ul>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Skewness, Outliers & Scaling:</h4>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        Skewness of 0.7 is below threshold (1.0), so <strong>no transformation required</strong>. Distribution shows concentration at 0 (non-redeemers) with decreasing frequency toward higher values, max at 0.43.
    </p>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        <strong>Outlier Analysis:</strong> IQR detected 75 outliers (0.55%) in upper tail above 0.25. These represent customers who redeem very frequently - legitimate behavioral pattern valuable for clustering.
    </p>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        <strong>Decision:</strong> <strong>StandardScaler</strong> applied without transformation. The bounded nature (0-1) and moderate skewness make standard scaling appropriate. Outliers retained as they represent genuine frequent redeemers.
    </p>
</div>
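
The month arithmetic behind the denominator can be illustrated with plain Python. This sketch restates the formula from the code cell above; the function names and example dates are illustrative assumptions.

```python
def months_available(first_year, first_month, end_year=2021, end_month=12):
    """Months from the first accumulation month through the end month, inclusive."""
    return (end_year - first_year) * 12 + (end_month - first_month) + 1

def redemption_frequency(months_redeemed, first_year, first_month):
    """Share of available months with any redemption, capped to [0, 1]."""
    available = months_available(first_year, first_month)
    return min(max(months_redeemed / available, 0.0), 1.0)

full_window = months_available(2019, 1)     # Jan 2019 through Dec 2021
late_starter = months_available(2020, 3)    # Mar 2020 through Dec 2021
last_month = months_available(2021, 12)     # Dec 2021 only

# Hypothetical customer: 4 redemption months since March 2020
freq = redemption_frequency(4, 2020, 3)
```

A customer who first accumulated points in March 2020 has 22 available months, so four redemption months give a frequency of about 0.18, in the "occasional redeemer" band.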


# <a class='anchor' id='4'> </a>
<br>

<div style="background: linear-gradient(to right, #00411E, #00622D, #00823C, #45AF28, #82BA72); 
            padding: 10px; color: white; text-align: center;  max-width: 97%;">
    <center><h1 style="margin-top: 10px; margin-bottom: 4px; color: white;
                       font-size: 32px; font-family: 'Roboto', sans-serif;">
        <b>4. Feature Selection</b></h1></center>
</div>

<div style="background-color: #e1e1e1ff; padding: 15px; margin-right: 20px; border-left: 5px solid; border-image: linear-gradient(to bottom, #212121, #313131, #595959, #909090) 1; border-radius: 5px;">
    <h3 style="margin-top: 0; margin-bottom: 10px; color: #000000ff; font-weight: bold;">To be added: Goal and Reference to next step, and also very short outline (Relevance/Redundancy and why only active?) </h3>
</div>

## **4.1 Select Value based Features**

### Focus Group 1: Loyalty Members | Active

In [None]:
# Select FM features for value based clustering
fm_features = df_Customer[['Frequency', 'Monetary']]

# Filter to Focus Group 1: Loyalty Members | Active
fm_features_l_a = fm_features[df_Customer['Focus_Group'] == 1]

fm_features_l_a

### Focus Group 2: Ex Loyalty Members | Active

In [None]:
# Filter to Focus Group 2: Ex Loyalty Members | Active
fm_features_non_l_a = fm_features[df_Customer['Focus_Group'] == 2]

fm_features_non_l_a

### Combined Focus Groups: All Active Customers

In [None]:
# Filter to all active customers (Focus Groups 1 & 2 combined)
fm_features_a = fm_features[df_Customer['Is_Focus_Group'] == True]

fm_features_a

## **4.2 Select Demographic Features**

In [None]:
# Select demographic features
demographic_features = df_Customer[['Province_Encoded', 'City_Encoded', 'FSA_Encoded', 'Gender_Encoded', 'Education_Level_Num', 'Location_Code_Num', 'Income_Bin_Num', 'Marital_Divorced', 'Marital_Married', 'Marital_Single']]

# Filter to all active customers (Focus Groups 1 & 2 combined)
demographic_features_a = demographic_features[df_Customer['Is_Focus_Group'] == True]

### Remove redundant features

In [None]:
# Correlation heatmap for demographic features (excluding one-hot encoded categoricals)
# Remove one-hot encoded marital status columns as they are categorical
numeric_demographic_features = demographic_features_a.drop(columns=['Marital_Divorced', 'Marital_Married', 'Marital_Single'])

plt.figure(figsize=(10, 7))
corr = numeric_demographic_features.corr()
corr = corr.round(2)

sns.heatmap(
    corr,
    center=0,
    cmap=GROUP80_palette_continuous.reversed(),
    annot=np.where(np.absolute(corr) >= 0.5, corr.values, np.full(corr.values.shape, "")),
    fmt='s',
    square=True,
    linewidths=.5,
    cbar_kws={'label': 'Pearson Correlation'}
)
plt.title('Correlation Heatmap: Numeric Demographic Features\n(One-Hot Encoded Variables Excluded)', 
          fontsize=12, fontweight='bold', pad=15)
plt.tight_layout()
plt.show()

### Evaluate Feature Relevance

Variance analysis before scaling is not meaningful here since the demographic features operate on different scales (e.g., frequency-encoded cities vs. ordinal education levels). After StandardScaler normalization, all features will have unit variance by definition. Feature relevance will instead be assessed through correlation analysis (see heatmap above) and cluster-specific feature importance during the clustering phase.
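
The unit-variance point can be seen in a small sketch. It standardizes two made-up columns by hand with the population standard deviation, which is the convention StandardScaler follows; the column values and the `standardize` helper are hypothetical.

```python
from statistics import fmean, pstdev

def standardize(values):
    """Subtract the mean and divide by the population standard deviation."""
    mu = fmean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

# Hypothetical columns on very different scales
city_freq = [0.31, 0.02, 0.11, 0.02, 0.25]   # frequency-encoded cities
education = [1, 3, 2, 4, 2]                  # ordinal education levels

raw_variances = tuple(pstdev(col) ** 2 for col in (city_freq, education))
scaled_variances = tuple(pstdev(standardize(col)) ** 2 for col in (city_freq, education))
```

The raw variances differ by orders of magnitude purely because of the units, while both scaled variances equal 1, so neither view says anything about how informative a feature is.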


In [None]:
# We remove "Marital_Single" because it is fully determined by the others
demographic_final = demographic_features_a.drop(columns=["Marital_Single"])

In [None]:
# Final demographic feature df
df_demographic_a = demographic_final.copy()
df_demographic_a.head()

<div style="background-color: #e8f3e5ff; padding: 15px; margin-right: 30px; border-left: 5px solid; border-image: linear-gradient(to bottom, #00411E, #00622D, #00823C, #45AF28, #82BA72) 1; border-radius: 5px; max-width: 95%;">
    <h3 style="margin-top: 0; margin-bottom: 10px; color: #00411E; font-weight: bold;">Demographic Feature Selection Summary</h3>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        We validated the selected demographic features to ensure they are suitable for clustering analysis.
    </p>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Goal of Selection:</h4>
    <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
        <li style="margin-right: 20px;">Ensure that no highly correlated features introduce multicollinearity</li>
        <li style="margin-right: 20px;">Identify potential redundancies between demographic metrics</li>
    </ul>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Findings:</h4>
    <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
        <li style="margin-right: 20px;"><strong>Moderate correlations acceptable:</strong> Province-City (r=0.54) and Education-Income (r=0.55) correlations are expected, since cities belong to provinces and education influences income, and both remain below the critical threshold of |r| = 0.7</li>
        <li style="margin-right: 20px;"><strong>All other correlations acceptable:</strong> Remaining feature pairs show |r| < 0.5, indicating independent information without redundancy</li>
        <li style="margin-right: 20px;"><strong>Removed one-hot encoded redundancy:</strong> Excluded Marital_Single from the final feature set as it is fully determined by Marital_Married and Marital_Divorced (redundant one-hot encoded category)</li>
    </ul>
</div>
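
Why dropping one dummy column loses no information can be checked directly. The sketch below uses made-up rows and assumes that every customer has exactly one of the three marital statuses (no missing values), which is an assumption about the encoding rather than something shown in this section.

```python
# Hypothetical one-hot rows: (Marital_Divorced, Marital_Married, Marital_Single)
rows = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (0, 1, 0)]

# Rebuild Marital_Single from the two retained columns
reconstructed_single = [1 - divorced - married for divorced, married, _ in rows]
original_single = [single for _, _, single in rows]
```

Because the reconstructed column matches the original one, keeping all three dummies would only add a perfectly collinear feature.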

---

## **4.3 Select Behavioral Features**

In [None]:
# Select behavioral features
behavioral_features = df_Customer[['avg_distance_per_flight', 'distance_variability', 'companion_flight_ratio', 'flight_regularity', 'seasonal_concentration', 'peak_season_sin', 'peak_season_cos', 'redemption_rate', 'redemption_frequency']]

# Filter to all active customers (Focus Groups 1 & 2 combined)
behavioral_features_a = behavioral_features[df_Customer['Is_Focus_Group'] == True]


### Remove redundant features

In [None]:
# Correlation heatmap for behavioral features
plt.figure(figsize=(10, 7))
corr = behavioral_features_a.corr()
corr = corr.round(2)

sns.heatmap(
    corr,
    center=0,
    cmap=GROUP80_palette_continuous.reversed(),
    annot=np.where(np.absolute(corr) >= 0.5, corr.values, np.full(corr.values.shape, "")),
    fmt='s',
    square=True,
    linewidths=.5,
    cbar_kws={'label': 'Pearson Correlation'}
)
plt.title('Correlation Heatmap: Behavioral Features', 
          fontsize=12, fontweight='bold', pad=15)
plt.tight_layout()
plt.show()


### Evaluate Feature Relevance

Variance analysis before scaling is not meaningful here since the behavioral features operate on vastly different scales (e.g., avg_distance_per_flight in km vs. ratios bounded 0-1). After StandardScaler normalization, all features will have unit variance by definition. Feature relevance will instead be assessed through correlation analysis (see heatmap above) and cluster-specific feature importance during the clustering phase.

In [None]:
# Final behavioral feature df - filtered to 4 selected clustering features
behavioral_feats = ['distance_variability', 'companion_flight_ratio', 'flight_regularity', 'redemption_frequency']
df_behavioral_a = behavioral_features_a[behavioral_feats].copy()
df_behavioral_a.head()

<div style="background-color: #fce8e8ff; padding: 15px; margin-right: 20px; border-left: 5px solid; border-image: linear-gradient(to bottom, #8B0000, #A52A2A, #CD5C5C, #F08080) 1; border-radius: 5px;">
    <h3 style="margin-top: 0; margin-bottom: 10px; color: #8B0000; font-weight: bold;">Critical: Feature Selection from 9 to 4 Behavioral Features</h3>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #A52A2A; font-weight: bold;">Initial Problem:</h4>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        With all 9 behavioral features, clustering produced <strong>Silhouette scores below 0.10</strong> and poor cluster separation. Extensive grid searches over feature combinations revealed problematic correlations: <strong>redemption_rate</strong> and <strong>redemption_frequency</strong> dominated together, <strong>peak_season_sin</strong> and <strong>peak_season_cos</strong> created redundancy, while other features showed little to no discriminative power.
    </p>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #A52A2A; font-weight: bold;">Key Insight: Clustering Features vs. Targeting Attributes</h4>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        True behavioral clustering requires features that capture <strong>behavioral patterns</strong> (who the customer is), not <strong>states</strong> or <strong>targeting attributes</strong> (what/when to offer). The distinction:
    </p>
    <ul style="margin: 10px 0; padding-left: 20px; color: #000; margin-right: 40px;">
        <li><strong>Clustering Features:</strong> Define customer segments through repeated behavioral patterns over time</li>
        <li><strong>Targeting Attributes:</strong> Enable post-hoc personalization within segments (can be filtered directly without clustering)</li>
    </ul>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #A52A2A; font-weight: bold;">Excluded Features and Rationale:</h4>
    <table style="width: 100%; border-collapse: collapse; margin: 10px 0; color: #000;">
        <tr style="background-color: rgba(139, 0, 0, 0.1);">
            <th style="padding: 8px; text-align: left; border-bottom: 1px solid #8B0000;">Feature</th>
            <th style="padding: 8px; text-align: left; border-bottom: 1px solid #8B0000;">Reason for Exclusion</th>
        </tr>
        <tr>
            <td style="padding: 8px; border-bottom: 1px solid rgba(139, 0, 0, 0.3);"><strong>avg_distance_per_flight</strong></td>
            <td style="padding: 8px; border-bottom: 1px solid rgba(139, 0, 0, 0.3);">Targeting attribute. Direct filtering possible (IF distance > 3000km -> long-haul offer). Value Preselection already uses total_distance and avg_flights on axes, making avg_distance derivable.</td>
        </tr>
        <tr>
            <td style="padding: 8px; border-bottom: 1px solid rgba(139, 0, 0, 0.3);"><strong>seasonal_concentration</strong></td>
            <td style="padding: 8px; border-bottom: 1px solid rgba(139, 0, 0, 0.3);">Redundant with flight_regularity. Both measure temporal consistency, but flight_regularity (CV of monthly flights) is more interpretable.</td>
        </tr>
        <tr>
            <td style="padding: 8px; border-bottom: 1px solid rgba(139, 0, 0, 0.3);"><strong>peak_season_sin / peak_season_cos</strong></td>
            <td style="padding: 8px; border-bottom: 1px solid rgba(139, 0, 0, 0.3);">Timing attributes for campaign scheduling. Direct filtering possible (IF peak_season_cos > 0.5 -> winter campaign). Not a behavioral pattern.</td>
        </tr>
        <tr>
            <td style="padding: 8px; border-bottom: 1px solid rgba(139, 0, 0, 0.3);"><strong>redemption_rate</strong></td>
            <td style="padding: 8px; border-bottom: 1px solid rgba(139, 0, 0, 0.3);">State, not pattern. Measures cumulative ratio (points redeemed / points earned) at a single point in time. One large redemption event changes the entire metric.</td>
        </tr>
    </table>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #A52A2A; font-weight: bold;">Final 4 Features Selected:</h4>
    <table style="width: 100%; border-collapse: collapse; margin: 10px 0; color: #000;">
        <tr style="background-color: rgba(139, 0, 0, 0.1);">
            <th style="padding: 8px; text-align: left; border-bottom: 1px solid #8B0000;">Feature</th>
            <th style="padding: 8px; text-align: left; border-bottom: 1px solid #8B0000;">Pattern Type</th>
            <th style="padding: 8px; text-align: left; border-bottom: 1px solid #8B0000;">Strategic Interpretation</th>
        </tr>
        <tr>
            <td style="padding: 8px; border-bottom: 1px solid rgba(139, 0, 0, 0.3);"><strong>distance_variability</strong></td>
            <td style="padding: 8px; border-bottom: 1px solid rgba(139, 0, 0, 0.3);">Travel Pattern</td>
            <td style="padding: 8px; border-bottom: 1px solid rgba(139, 0, 0, 0.3);">Routinized (same routes) vs. Explorer (variable destinations)</td>
        </tr>
        <tr>
            <td style="padding: 8px; border-bottom: 1px solid rgba(139, 0, 0, 0.3);"><strong>companion_flight_ratio</strong></td>
            <td style="padding: 8px; border-bottom: 1px solid rgba(139, 0, 0, 0.3);">Social Pattern</td>
            <td style="padding: 8px; border-bottom: 1px solid rgba(139, 0, 0, 0.3);">Solo traveler vs. Group/Family traveler (average across all flights)</td>
        </tr>
        <tr>
            <td style="padding: 8px; border-bottom: 1px solid rgba(139, 0, 0, 0.3);"><strong>flight_regularity</strong></td>
            <td style="padding: 8px; border-bottom: 1px solid rgba(139, 0, 0, 0.3);">Temporal Pattern</td>
            <td style="padding: 8px; border-bottom: 1px solid rgba(139, 0, 0, 0.3);">Regular flyer (consistent monthly) vs. Sporadic flyer (irregular bursts)</td>
        </tr>
        <tr>
            <td style="padding: 8px; border-bottom: 1px solid rgba(139, 0, 0, 0.3);"><strong>redemption_frequency</strong></td>
            <td style="padding: 8px; border-bottom: 1px solid rgba(139, 0, 0, 0.3);">Engagement Pattern</td>
            <td style="padding: 8px; border-bottom: 1px solid rgba(139, 0, 0, 0.3);">Active redeemer (redeems in many months) vs. Passive accumulator (rarely redeems)</td>
        </tr>
    </table>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #A52A2A; font-weight: bold;">Why redemption_frequency over redemption_rate:</h4>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        Consider two customers with identical redemption_rate = 0.8:
    </p>
    <ul style="margin: 10px 0; padding-left: 20px; color: #000; margin-right: 40px;">
        <li><strong>Customer A:</strong> Redeemed once (large amount) -> frequency = 0.1</li>
        <li><strong>Customer B:</strong> Redeemed every month (small amounts) -> frequency = 0.9</li>
    </ul>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        Same rate, completely different behavior. <strong>Frequency captures the habit</strong>, rate captures the cumulative state.
    </p>
     <h4 style="margin-top: 15px; margin-bottom: 8px; color: #A52A2A; font-weight: bold;">Post-Hoc Targeting Strategy:</h4>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        Excluded features remain valuable for <strong>within-cluster personalization</strong>. After identifying behavioral segments, marketers can apply additional filters using avg_distance (short vs. long-haul offers) or peak_season_cos (winter vs. summer campaigns) to tailor specific promotions within each segment. This separates the question "Who behaves similarly?" (clustering) from "What should we offer them?" (targeting).
    </p>
</div>
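
The two-customer comparison above can be reproduced with a small sketch. The monthly ledgers are hypothetical toy data, and the definition of `redemption_frequency` as "share of observed months with at least one redemption" is an assumption based on the description in the table, not the notebook's actual feature engineering code.

```python
# Hypothetical monthly ledgers: (points_earned, points_redeemed) per month.
customer_a = [(90, 0)] * 9 + [(90, 720)]   # one large redemption in 10 months
customer_b = [(90, 80)] * 9 + [(90, 0)]    # small redemption in 9 of 10 months


def redemption_rate(ledger):
    """Cumulative ratio: total points redeemed / total points earned."""
    earned = sum(e for e, _ in ledger)
    redeemed = sum(r for _, r in ledger)
    return redeemed / earned if earned else 0.0


def redemption_frequency(ledger):
    """Share of observed months with at least one redemption (assumed definition)."""
    if not ledger:
        return 0.0
    return sum(1 for _, r in ledger if r > 0) / len(ledger)


for name, ledger in [("A", customer_a), ("B", customer_b)]:
    print(name, round(redemption_rate(ledger), 2), round(redemption_frequency(ledger), 2))
```

Both customers end up with the same cumulative rate, while the frequency separates the one-off redeemer from the habitual one.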


# <a class='anchor' id='5'></a>
<br>

<div style="background: linear-gradient(to right, #00411E, #00622D, #00823C, #45AF28, #82BA72); 
            padding: 10px; color: white; text-align: center;   max-width: 97%;">
    <center><h1 style="margin-top: 10px; margin-bottom: 4px; color: white;
                       font-size: 32px; font-family: 'Roboto', sans-serif;">
        <b>5. Feature Scaling</b></h1></center>
</div>

<div style="background-color: #e1e1e1ff; padding: 15px; margin-right: 20px; border-left: 5px solid; border-image: linear-gradient(to bottom, #212121, #313131, #595959, #909090) 1; border-radius: 5px;">
    <h3 style="margin-top: 0; margin-bottom: 10px; color: #000000ff; font-weight: bold;">To be added: Reference on 2.5 and Goal and Reference to next step, and also very short outline (Before/After Scaling) </h3>
</div>

## **5.1 Scale Demographic Features**

In [None]:
# Initialize and fit StandardScaler (excluding one-hot encoded marital status features)
scaler_demo = StandardScaler()
cols_to_scale = [col for col in df_demographic_a.columns if not col.startswith('Marital_')]
df_demographic_a_scaled = df_demographic_a.copy()
df_demographic_a_scaled[cols_to_scale] = scaler_demo.fit_transform(df_demographic_a[cols_to_scale])

# Ensure one-hot encoded marital status columns remain as float type
df_demographic_a_scaled['Marital_Divorced'] = df_demographic_a_scaled['Marital_Divorced'].astype(float)
df_demographic_a_scaled['Marital_Married'] = df_demographic_a_scaled['Marital_Married'].astype(float)

# Verify scaling: mean ~ 0 and std ~ 1 for scaled features
# (describe() reports the sample std with ddof=1, so values sit marginally above 1)
print(df_demographic_a_scaled[cols_to_scale].describe().loc[['mean', 'std']])

In [None]:
# Create side-by-side comparison: Before (left) and After (right)
# Only show scaled features (exclude one-hot encoded marital status)
fig, axes = plt.subplots(1, 2, figsize=(16, 8))

# Left plot: Before scaling (only scaled features)
df_before_long = df_demographic_a[cols_to_scale].melt(var_name='Feature', value_name='Value')
sns.boxplot(
    x='Value',
    y='Feature',
    data=df_before_long,
    ax=axes[0],
    color=CUSTOM_HEX[0],
    showfliers=False,
    orient='h'
)
axes[0].set_title('Before Standard Scaling', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Original Feature Values', fontsize=11)
axes[0].set_ylabel('Feature', fontsize=11)


# Right plot: After scaling (only scaled features)
df_after_long = df_demographic_a_scaled[cols_to_scale].melt(var_name='Feature', value_name='Value')
sns.boxplot(
    x='Value',
    y='Feature',
    data=df_after_long,
    ax=axes[1],
    color=CUSTOM_HEX[2],
    showfliers=False,
    orient='h'
)
axes[1].set_title('After Standard Scaling', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Standardized Values (Z-Score)', fontsize=11)
axes[1].set_ylabel('')
axes[1].set_yticklabels([])

plt.tight_layout()
plt.show()

<div style="background-color: #e8f3e5ff; padding: 15px; margin-right: 30px; border-left: 5px solid; border-image: linear-gradient(to bottom, #00411E, #00622D, #00823C, #45AF28, #82BA72) 1; border-radius: 5px; max-width: 95%;">
    <h3 style="margin-top: 0; margin-bottom: 10px; color: #00411E; font-weight: bold;">Standard Scaling (Z-Score Normalization)</h3>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        Distance-based clustering algorithms (K-Means, Hierarchical Clustering) compute Euclidean distances between observations. Features with large numerical ranges (e.g., <strong>Province_Encoded</strong>) dominate distance calculations over features with small ranges (e.g., <strong>Gender_Encoded</strong>: 0-1), leading to biased clustering results. Standard Scaling equalizes feature importance by transforming all features to the same scale (mean = 0, standard deviation = 1).
    </p>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Goal of Scaling:</h4>
    <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
        <li style="margin-right: 20px;">Standardize features by removing the mean (μ) and scaling to unit variance (σ = 1)</li>
        <li style="margin-right: 20px;">Ensure all features contribute equally to distance calculations regardless of their original scale</li>
        <li style="margin-right: 20px;">Prevent frequency-encoded geographic features from dominating the clustering algorithm</li>
    </ul>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Implementation Details:</h4>
    <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
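        <li style="margin-right: 20px;"><strong>Why StandardScaler over MinMaxScaler?</strong> StandardScaler is less sensitive to single extreme values than MinMaxScaler because it rescales by mean and standard deviation rather than by the observed min and max, and it is a common default for distance-based clustering. It is not strictly outlier-robust, since both mean and standard deviation are themselves affected by outliers. MinMaxScaler compresses data to [0,1], so one extreme value in a frequency-encoded feature squeezes all other observations into a narrow band.</li>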
        <li style="margin-right: 20px;"><strong>One-hot encoded features excluded:</strong> Marital status features (Marital_Divorced, Marital_Married) are not scaled as they are binary indicators that should remain in their original 0/1 form to preserve interpretability.</li>
        <li style="margin-right: 20px;"><strong>Critical for frequency-encoded features:</strong> Geographic features (<strong>Province_Encoded</strong>, <strong>City_Encoded</strong>, <strong>FSA_Encoded</strong>) use customer counts (large range), which have variance 100-1000x larger than ordinal/binary features. Without scaling, clustering would be entirely geography-driven.</li>
    </ul>
</div>
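
The effect described above can be illustrated with a dependency-free sketch. The two columns are hypothetical toy values standing in for a frequency-encoded count and a binary flag, and `z_scores` and `share_of_first_feature` are illustrative helpers rather than notebook functions. The sketch assumes the population standard deviation (ddof=0), which is what `StandardScaler` uses.

```python
import math
import statistics


def z_scores(values):
    """Standardize with the population standard deviation (ddof=0)."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]


# Hypothetical toy data: a frequency-encoded count and a binary flag.
province_count = [120.0, 4500.0, 980.0, 3100.0, 250.0]
gender_flag = [0.0, 1.0, 1.0, 0.0, 1.0]


def share_of_first_feature(f1, f2, i, j):
    """Fraction of the squared Euclidean distance between rows i and j due to f1."""
    d1 = (f1[i] - f1[j]) ** 2
    d2 = (f2[i] - f2[j]) ** 2
    return d1 / (d1 + d2)


raw_share = share_of_first_feature(province_count, gender_flag, 0, 1)
scaled_share = share_of_first_feature(
    z_scores(province_count), z_scores(gender_flag), 0, 1
)
print(f"raw: {raw_share:.6f}, scaled: {scaled_share:.3f}")

# The population std of a standardized column is 1; the sample std (ddof=1)
# is larger by a factor of sqrt(n / (n - 1)).
z = z_scores(province_count)
print(statistics.pstdev(z), statistics.stdev(z), math.sqrt(5 / 4))
```

Before scaling, the count column accounts for almost the entire distance between the two rows; after scaling, both columns contribute on a comparable order. The last line also shows why a verification based on the sample standard deviation reads slightly above 1 on small samples.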


## **5.2 Scale Behavioral Features**

In [None]:
# Initialize and fit StandardScaler (excluding peak_season_sin/cos which are already bounded -1 to +1)
scaler_behav = StandardScaler()
cols_to_scale = [col for col in df_behavioral_a.columns if col not in ['peak_season_sin', 'peak_season_cos']]
df_behavioral_a_scaled = df_behavioral_a.copy()
df_behavioral_a_scaled[cols_to_scale] = scaler_behav.fit_transform(df_behavioral_a[cols_to_scale])

# Verify scaling: mean ~ 0 and std ~ 1 for scaled features
# (describe() reports the sample std with ddof=1, so values sit marginally above 1)
print(df_behavioral_a_scaled[cols_to_scale].describe().loc[['mean', 'std']])

In [None]:
# Create side-by-side comparison: Before (left) and After (right)
# Only show scaled features (exclude peak_season_sin/cos)
fig, axes = plt.subplots(1, 2, figsize=(16, 8))

# Left plot: Before scaling (only scaled features)
df_before_long = df_behavioral_a[cols_to_scale].melt(var_name='Feature', value_name='Value')
sns.boxplot(
    x='Value',
    y='Feature',
    data=df_before_long,
    ax=axes[0],
    color=CUSTOM_HEX[0],
    showfliers=False,
    orient='h'
)
axes[0].set_title('Before Standard Scaling', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Original Feature Values', fontsize=11)
axes[0].set_ylabel('Feature', fontsize=11)


# Right plot: After scaling (only scaled features)
df_after_long = df_behavioral_a_scaled[cols_to_scale].melt(var_name='Feature', value_name='Value')
sns.boxplot(
    x='Value',
    y='Feature',
    data=df_after_long,
    ax=axes[1],
    color=CUSTOM_HEX[2],
    showfliers=False,
    orient='h'
)
axes[1].set_title('After Standard Scaling', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Standardized Values (Z-Score)', fontsize=11)
axes[1].set_ylabel('')
axes[1].set_yticklabels([])

plt.tight_layout()
plt.show()

<div style="background-color: #e8f3e5ff; padding: 15px; margin-right: 30px; border-left: 5px solid; border-image: linear-gradient(to bottom, #00411E, #00622D, #00823C, #45AF28, #82BA72) 1; border-radius: 5px; max-width: 95%;">
    <h3 style="margin-top: 0; margin-bottom: 10px; color: #00411E; font-weight: bold;">Standard Scaling (Z-Score Normalization)</h3>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        Same scaling approach as for demographic features (see above). Key difference for behavioral features:
    </p>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Behavioral-Specific Implementation:</h4>
    <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
        <li style="margin-right: 20px;"><strong>Cyclical features excluded:</strong> peak_season_sin and peak_season_cos are not scaled as they are already bounded between -1 and +1 by their trigonometric encoding, preserving their cyclical nature.</li>
    </ul>
</div>
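
A short sketch shows why the trigonometric encoding is already bounded and why it preserves the cyclical structure. It assumes the peak month is encoded as an angle of 2π · month / 12; the notebook's actual encoding step is not shown in this section, so `encode_month` and `circular_gap` are hypothetical helpers.

```python
import math


def encode_month(month, period=12):
    """Map a 1-based month to a point on the unit circle (assumed encoding)."""
    angle = 2 * math.pi * month / period
    return math.sin(angle), math.cos(angle)


def circular_gap(m1, m2):
    """Euclidean distance between two months in (sin, cos) space."""
    s1, c1 = encode_month(m1)
    s2, c2 = encode_month(m2)
    return math.hypot(s1 - s2, c1 - c2)


encoded = {m: encode_month(m) for m in range(1, 13)}

# December and January are neighbors on the circle, unlike on the raw 1-12 scale.
print(round(circular_gap(12, 1), 3))  # adjacent months across the year boundary
print(round(circular_gap(1, 2), 3))   # adjacent months within the year
print(round(circular_gap(1, 7), 3))   # opposite months
```

Every encoded value lies in [-1, +1], and adjacent months are equally far apart regardless of the year boundary. Standardizing the two columns independently would stretch them by different factors whenever their variances differ, which distorts that circular geometry.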


# <a class='anchor' id='6'></a>
<br>

<div style="background: linear-gradient(to right, #00411E, #00622D, #00823C, #45AF28, #82BA72); 
            padding: 10px; color: white; text-align: center;  max-width: 97%;">
    <center><h1 style="margin-top: 10px; margin-bottom: 4px; color: white;
                       font-size: 32px; font-family: 'Roboto', sans-serif;">
        <b>6. Value Based Preselection</b></h1></center>
</div>

<div style="background-color: #e1e1e1ff; padding: 15px; margin-right: 20px; border-left: 5px solid; border-image: linear-gradient(to bottom, #212121, #313131, #595959, #909090) 1; border-radius: 5px;">
    <h3 style="margin-top: 0; margin-bottom: 10px; color: #000000ff; font-weight: bold;">Goal of Section 6: Value-Based Preselection</h3>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        This section performs <strong>FM segmentation</strong> using a <strong>rule-based median-split approach</strong> rather than advanced clustering algorithms. This value-based preselection creates interpretable customer segments that provide a foundation for understanding customer value before demographic and behavioral clustering.
    </p>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #000000ff; font-weight: bold;">Why Rule-Based Segmentation Instead of Advanced Clustering?</h4>
    <ul style="margin: 10px 0; padding-left: 20px; color: #000; margin-right: 40px;">
        <li style="margin-bottom: 6px;"><strong>Limited Feature Space:</strong> Only 2 features (Frequency and Monetary) - insufficient dimensionality for complex clustering algorithms to provide meaningful advantages</li>
        <li style="margin-bottom: 6px;"><strong>Business Interpretability:</strong> Median-split creates clearly defined, explainable segments (High/Low F × High/Low M) that stakeholders can easily understand and act upon</li>
        <li style="margin-bottom: 6px;"><strong>Benchmark Standardization:</strong> FM matrix is an established framework in customer analytics, enabling comparison with industry standards and best practices</li>
        <li style="margin-bottom: 6px;"><strong>Reserve Complexity for Rich Feature Sets:</strong> Advanced clustering (Hierarchical, K-Means, etc.) will be applied to the demographic (6+ features) and behavioral (8+ features) datasets, where it can capture complex multi-dimensional patterns</li>
        <li style="margin-bottom: 6px;"><strong>Computational Efficiency:</strong> Simple rule-based segmentation is fast, deterministic, and doesn't require hyperparameter tuning or convergence iterations</li>
    </ul>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #000000ff; font-weight: bold;">Specific Objectives:</h4>
    <ul style="margin: 10px 0; padding-left: 20px; color: #000; margin-right: 40px;">
        <li style="margin-bottom: 5px;"><strong>Create 4 FM Segments:</strong> Champions (High F & High M), Frequent Flyer (High F & Low M), Premium Occasional (Low F & High M), At Risk (Low F & Low M)</li>
        <li style="margin-bottom: 5px;"><strong>Identify Elite Tier:</strong> Top 10% customers in both Frequency and Monetary dimensions within Champions segment</li>
        <li style="margin-bottom: 5px;"><strong>Separate Analysis by Focus Group:</strong> Apply segmentation independently to Focus Group 1 (Loyalty Members | Active) and Focus Group 2 (Ex-Loyalty Members | Active)</li>
        <li style="margin-bottom: 5px;"><strong>Combined View Analysis:</strong> Segment ALL active customers together using unified thresholds to identify win-back opportunities for Focus Group 2 customers</li>
        <li style="margin-bottom: 5px;"><strong>Segment Migration Analysis:</strong> Track how Focus Group 2 customers' segments change when evaluated against combined thresholds (e.g., Elite → Champions, Champions → At Risk)</li>
        <li style="margin-bottom: 5px;"><strong>Reference for Integration:</strong> Store FM segments as categorical features to be combined with clustering results in Chapter 9</li>
    </ul>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #000000ff; font-weight: bold;">Business Value:</h4>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        The Combined View demonstrates the <strong>opportunity of re-enrolling Focus Group 2 customers</strong> into the loyalty program. By showing how ex-loyalty members perform when benchmarked against current loyalty members, we can identify high-value win-back targets and quantify the potential revenue uplift from successful re-enrollment campaigns.
    </p>
</div>
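
The segment migration idea listed in the objectives above can be sketched as follows. The (Frequency, Monetary) pairs are hypothetical and not taken from the AIAI data, and `fm_segment` and `thresholds` are illustrative helpers. The sketch applies the same median-split rule twice to Focus Group 2, once with its own medians and once with medians over all active customers, and counts the transitions.

```python
from collections import Counter
from statistics import median


def fm_segment(freq, mon, freq_med, mon_med):
    """2x2 median-split rule: a value >= median counts as 'High' on that axis."""
    high_f, high_m = freq >= freq_med, mon >= mon_med
    if high_f and high_m:
        return "Champions"
    if high_f:
        return "Frequent Flyer"
    if high_m:
        return "Premium Occasional"
    return "At Risk"


# Hypothetical (Frequency, Monetary) pairs for the two focus groups.
fg1 = [(4.0, 9000), (3.5, 7000), (3.0, 8000), (2.5, 5000), (2.0, 6000)]
fg2 = [(2.2, 5500), (1.5, 3000), (1.0, 4000), (0.8, 2000)]


def thresholds(rows):
    """Return (median Frequency, median Monetary) for a list of pairs."""
    return median(f for f, _ in rows), median(m for _, m in rows)


own_f, own_m = thresholds(fg2)          # FG2-only medians
all_f, all_m = thresholds(fg1 + fg2)    # unified medians over all active customers

# Count (segment under own thresholds, segment under unified thresholds) pairs.
migration = Counter(
    (fm_segment(f, m, own_f, own_m), fm_segment(f, m, all_f, all_m))
    for f, m in fg2
)
for (before, after), n in sorted(migration.items()):
    print(f"{before} -> {after}: {n}")
```

In this toy setup the unified medians are higher than the FG2-only medians, so most FG2 customers move to a lower segment. Customers who stay in Champions under the stricter thresholds are the natural win-back candidates.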


## **6.1 Categorize Customers in FM Segments**

We create a **2x2 FM Matrix** using the median as the cutoff for both the Frequency and Monetary dimensions. Additionally, we identify **Elite customers** (top 10% in both dimensions) within the Champions segment for premium targeting.


### **Focus Group 1: Loyalty Members | Active**

In [None]:
# Use pre-filtered DataFrame from Section 3.1
fg1_customers = df_Customer.loc[fm_features_l_a.index].copy()

# Calculate thresholds
freq_median_fg1 = fg1_customers['Frequency'].median()
mon_median_fg1 = fg1_customers['Monetary'].median()
freq_p90_fg1 = fg1_customers['Frequency'].quantile(0.90)
mon_p90_fg1 = fg1_customers['Monetary'].quantile(0.90)

# Segmentation functions
def assign_fm_segment(row, freq_med, mon_med):
    if row['Frequency'] >= freq_med and row['Monetary'] >= mon_med:
        return 'Champions'
    elif row['Frequency'] >= freq_med and row['Monetary'] < mon_med:
        return 'Frequent Flyer'
    elif row['Frequency'] < freq_med and row['Monetary'] >= mon_med:
        return 'Premium Occasional'
    else:
        return 'At Risk'

def assign_fm_tier(row, freq_med, mon_med, freq_p90, mon_p90):
    if row['Frequency'] >= freq_p90 and row['Monetary'] >= mon_p90:
        return 'Elite'
    elif row['Frequency'] >= freq_med and row['Monetary'] >= mon_med:
        return 'High'
    elif row['Frequency'] >= freq_med or row['Monetary'] >= mon_med:
        return 'Medium'
    else:
        return 'Low'

# Apply segmentation
fg1_customers['fm_segment_fg1'] = fg1_customers.apply(
    lambda row: assign_fm_segment(row, freq_median_fg1, mon_median_fg1), axis=1
)
fg1_customers['fm_tier_fg1'] = fg1_customers.apply(
    lambda row: assign_fm_tier(row, freq_median_fg1, mon_median_fg1, freq_p90_fg1, mon_p90_fg1), axis=1
)

# Summary DataFrame
summary_fg1 = pd.DataFrame({
    'Segment': ['Champions', 'Frequent Flyer', 'Premium Occasional', 'At Risk'],
    'Count': [(fg1_customers['fm_segment_fg1'] == seg).sum() for seg in ['Champions', 'Frequent Flyer', 'Premium Occasional', 'At Risk']],
    'Pct': [(fg1_customers['fm_segment_fg1'] == seg).sum() / len(fg1_customers) * 100 for seg in ['Champions', 'Frequent Flyer', 'Premium Occasional', 'At Risk']]
})
summary_fg1['Elite_Count'] = [
    (fg1_customers['fm_tier_fg1'] == 'Elite').sum() if seg == 'Champions' else 0 
    for seg in summary_fg1['Segment']
]

# Merge back
df_Customer = df_Customer.merge(
    fg1_customers[['Loyalty#', 'fm_segment_fg1', 'fm_tier_fg1']], 
    on='Loyalty#', 
    how='left'
)

# Display summary
summary_fg1.style.format({'Pct': '{:.1f}%'})

### **Focus Group 2: Ex-Loyalty Members | Active**

In [None]:
# Use pre-filtered DataFrame from Section 3.1
fg2_customers = df_Customer.loc[fm_features_non_l_a.index].copy()

# Calculate thresholds
freq_median_fg2 = fg2_customers['Frequency'].median()
mon_median_fg2 = fg2_customers['Monetary'].median()
freq_p90_fg2 = fg2_customers['Frequency'].quantile(0.90)
mon_p90_fg2 = fg2_customers['Monetary'].quantile(0.90)

# Apply segmentation
fg2_customers['fm_segment_fg2'] = fg2_customers.apply(
    lambda row: assign_fm_segment(row, freq_median_fg2, mon_median_fg2), axis=1
)
fg2_customers['fm_tier_fg2'] = fg2_customers.apply(
    lambda row: assign_fm_tier(row, freq_median_fg2, mon_median_fg2, freq_p90_fg2, mon_p90_fg2), axis=1
)

# Summary DataFrame
summary_fg2 = pd.DataFrame({
    'Segment': ['Champions', 'Frequent Flyer', 'Premium Occasional', 'At Risk'],
    'Count': [(fg2_customers['fm_segment_fg2'] == seg).sum() for seg in ['Champions', 'Frequent Flyer', 'Premium Occasional', 'At Risk']],
    'Pct': [(fg2_customers['fm_segment_fg2'] == seg).sum() / len(fg2_customers) * 100 for seg in ['Champions', 'Frequent Flyer', 'Premium Occasional', 'At Risk']]
})
summary_fg2['Elite_Count'] = [
    (fg2_customers['fm_tier_fg2'] == 'Elite').sum() if seg == 'Champions' else 0 
    for seg in summary_fg2['Segment']
]

# Merge back
df_Customer = df_Customer.merge(
    fg2_customers[['Loyalty#', 'fm_segment_fg2', 'fm_tier_fg2']], 
    on='Loyalty#', 
    how='left'
)

# Display summary
summary_fg2.style.format({'Pct': '{:.1f}%'})

## **6.2 FM Matrix Visualization**

Visualize the 2x2 FM matrix for both focus groups, highlighting the **Elite segment** (Top 10% in both dimensions) within the Champions quadrant.

In [None]:
# Create 2x2 FM Matrix Visualization for both Focus Groups
fig, axes = plt.subplots(1, 2, figsize=(18, 7))
fig.suptitle('FM Matrix: 2x2 Customer Segmentation with Elite Tier', 
             fontsize=16, fontweight='bold', y=0.98)

# Define colors
segment_colors = {
    'Champions': colors[3],
    'Frequent Flyer': colors[1],
    'Premium Occasional': colors[2],
    'At Risk': colors[4]
}
elite_color = colors[0]
label_color = "#00411E"  # Color for all quadrant labels

# --- Focus Group 1: Loyalty Members | Active ---
ax = axes[0]

# Plot reference lines
ax.axvline(freq_median_fg1, color='#313131', linestyle='--', linewidth=2, alpha=0.7, zorder=2)
ax.axhline(mon_median_fg1, color='#313131', linestyle='--', linewidth=2, alpha=0.7, zorder=2)

# Shade the high-Frequency and high-Monetary bands (their overlap is the Champions quadrant)
max_freq_fg1 = fg1_customers['Frequency'].max() * 1.05
max_mon_fg1 = fg1_customers['Monetary'].max() * 1.05
ax.axvspan(freq_median_fg1, max_freq_fg1, alpha=0.08, color=colors[3], zorder=0)
ax.axhspan(mon_median_fg1, max_mon_fg1, alpha=0.08, color=colors[3], zorder=0)

# Highlight Elite zone (Top 10% in both)
ax.axvspan(freq_p90_fg1, max_freq_fg1, alpha=0.15, color=elite_color, zorder=1)
ax.axhspan(mon_p90_fg1, max_mon_fg1, alpha=0.15, color=elite_color, zorder=1)

# Scatter plot by segment
for segment in ['At Risk', 'Premium Occasional', 'Frequent Flyer', 'Champions']:
    segment_data = fg1_customers[fg1_customers['fm_segment_fg1'] == segment]
    
    # Separate Elite from Champions
    if segment == 'Champions':
        elite_data = segment_data[segment_data['fm_tier_fg1'] == 'Elite']
        non_elite_data = segment_data[segment_data['fm_tier_fg1'] != 'Elite']
        
        # Plot non-Elite Champions
        ax.scatter(non_elite_data['Frequency'], non_elite_data['Monetary'],
                  c=segment_colors[segment], s=50, alpha=0.6, 
                  edgecolor='white', linewidth=0.3, zorder=3,
                  label=f'Champions (n={len(non_elite_data):,})')
        
        # Plot Elite with distinct marker
        ax.scatter(elite_data['Frequency'], elite_data['Monetary'],
                  c=elite_color, s=80, alpha=0.8, marker='D',
                  edgecolor='white', linewidth=0.5, zorder=4,
                  label=f'Elite (n={len(elite_data):,})')
    else:
        ax.scatter(segment_data['Frequency'], segment_data['Monetary'],
                  c=segment_colors[segment], s=50, alpha=0.6,
                  edgecolor='white', linewidth=0.3, zorder=3,
                  label=f'{segment} (n={len(segment_data):,})')

# Add quadrant labels (with high zorder to stay on top)
label_style = {'fontsize': 14, 'fontweight': 'bold', 'zorder': 10, 'color': label_color}
ax.text(freq_median_fg1 * 1.4, mon_median_fg1 * 1.4, 'Champions', 
        ha='center', va='center', **label_style)
ax.text(freq_median_fg1 * 0.3, mon_median_fg1 * 1.4, 'Premium\nOccasional', 
        ha='center', va='center', **label_style)
ax.text(freq_median_fg1 * 1.4, mon_median_fg1 * 0.3, 'Frequent\nFlyer', 
        ha='center', va='center', **label_style)
ax.text(freq_median_fg1 * 0.3, mon_median_fg1 * 0.3, 'At Risk', 
        ha='center', va='center', **label_style)
# Add Elite label in center of Elite zone with highest z-order
elite_label_style = {'fontsize': 13, 'fontweight': 'bold', 'zorder': 100, 'color': label_color, 'style': 'italic'}
elite_center_x = (freq_p90_fg1 + max_freq_fg1) / 2
elite_center_y = (mon_p90_fg1 + max_mon_fg1) / 2
ax.text(elite_center_x, elite_center_y, 'Elite', 
        ha='center', va='center', **elite_label_style)

ax.set_xlabel("Frequency (Flights per Active Month)", fontsize=11, fontweight='bold')
ax.set_ylabel("Monetary (Distance per Active Month)", fontsize=11, fontweight='bold')
ax.set_title(f"Focus Group 1: Loyalty Members | Active\n(n={len(fg1_customers):,} customers)", 
             fontsize=12, fontweight='bold', pad=10)
ax.legend(loc='lower right', fontsize=8, frameon=True, shadow=True)
ax.grid(True, alpha=0.2, linestyle=':', zorder=0)
ax.set_axisbelow(True)

# --- Focus Group 2: Ex-Loyalty Members | Active ---
ax = axes[1]

# Plot reference lines
ax.axvline(freq_median_fg2, color='#313131', linestyle='--', linewidth=2, alpha=0.7, zorder=2)
ax.axhline(mon_median_fg2, color='#313131', linestyle='--', linewidth=2, alpha=0.7, zorder=2)

# Shade the high-Frequency and high-Monetary bands (their overlap is the Champions quadrant)
max_freq_fg2 = fg2_customers['Frequency'].max() * 1.05
max_mon_fg2 = fg2_customers['Monetary'].max() * 1.05
ax.axvspan(freq_median_fg2, max_freq_fg2, alpha=0.08, color=colors[3], zorder=0)
ax.axhspan(mon_median_fg2, max_mon_fg2, alpha=0.08, color=colors[3], zorder=0)

# Highlight Elite zone (Top 10% in both)
ax.axvspan(freq_p90_fg2, max_freq_fg2, alpha=0.15, color=elite_color, zorder=1)
ax.axhspan(mon_p90_fg2, max_mon_fg2, alpha=0.15, color=elite_color, zorder=1)

# Scatter plot by segment
for segment in ['At Risk', 'Premium Occasional', 'Frequent Flyer', 'Champions']:
    segment_data = fg2_customers[fg2_customers['fm_segment_fg2'] == segment]
    
    # Separate Elite from Champions
    if segment == 'Champions':
        elite_data = segment_data[segment_data['fm_tier_fg2'] == 'Elite']
        non_elite_data = segment_data[segment_data['fm_tier_fg2'] != 'Elite']
        
        # Plot non-Elite Champions
        ax.scatter(non_elite_data['Frequency'], non_elite_data['Monetary'],
                  c=segment_colors[segment], s=50, alpha=0.6, 
                  edgecolor='white', linewidth=0.3, zorder=3,
                  label=f'Champions (n={len(non_elite_data):,})')
        
        # Plot Elite with distinct marker
        ax.scatter(elite_data['Frequency'], elite_data['Monetary'],
                  c=elite_color, s=80, alpha=0.8, marker='D',
                  edgecolor='white', linewidth=0.5, zorder=4,
                  label=f'Elite (n={len(elite_data):,})')
    else:
        ax.scatter(segment_data['Frequency'], segment_data['Monetary'],
                  c=segment_colors[segment], s=50, alpha=0.6,
                  edgecolor='white', linewidth=0.3, zorder=3,
                  label=f'{segment} (n={len(segment_data):,})')

# Add quadrant labels
ax.text(freq_median_fg2 * 1.4, mon_median_fg2 * 1.4, 'Champions', 
        ha='center', va='center', **label_style)
ax.text(freq_median_fg2 * 0.3, mon_median_fg2 * 1.4, 'Premium\nOccasional', 
        ha='center', va='center', **label_style)
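ax.text(freq_median_fg2 * 1.4, mon_median_fg2 * 0.3, 'Frequent\nFlyer', 
        ha='center', va='center', **label_style)
ax.text(freq_median_fg2 * 0.3, mon_median_fg2 * 0.3, 'At Risk', 
        ha='center', va='center', **label_style)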
# Add Elite label in center of Elite zone with highest z-order
elite_center_x = (freq_p90_fg2 + max_freq_fg2) / 2
elite_center_y = (mon_p90_fg2 + max_mon_fg2) / 2
ax.text(elite_center_x, elite_center_y, 'Elite', 
        ha='center', va='center', **elite_label_style)

ax.set_xlabel("Frequency (Flights per Active Month)", fontsize=11, fontweight='bold')
ax.set_ylabel("Monetary (Distance per Active Month)", fontsize=11, fontweight='bold')
ax.set_title(f"Focus Group 2: Ex-Loyalty Members | Active\n(n={len(fg2_customers):,} customers)", 
             fontsize=12, fontweight='bold', pad=10)
ax.legend(loc='lower right', fontsize=8, frameon=True, shadow=True)
ax.grid(True, alpha=0.2, linestyle=':', zorder=0)
ax.set_axisbelow(True)

plt.tight_layout()


plt.show()


## **6.3 Segment Profiling**

Analyze the characteristics of each FM segment across both focus groups.

In [None]:
# Profile Focus Group 1: Loyalty Members | Active
profile_fg1 = fg1_customers.groupby('fm_segment_fg1').agg({
    'Frequency': ['mean', 'median'],
    'Monetary': ['mean', 'median'],
    'Loyalty#': 'count'
}).round(2)
profile_fg1.columns = ['_'.join(col).strip() for col in profile_fg1.columns.values]
profile_fg1.rename(columns={'Loyalty#_count': 'Count'}, inplace=True)

# Profile Focus Group 2: Ex-Loyalty Members | Active
profile_fg2 = fg2_customers.groupby('fm_segment_fg2').agg({
    'Frequency': ['mean', 'median'],
    'Monetary': ['mean', 'median'],
    'Loyalty#': 'count'
}).round(2)
profile_fg2.columns = ['_'.join(col).strip() for col in profile_fg2.columns.values]
profile_fg2.rename(columns={'Loyalty#_count': 'Count'}, inplace=True)

# Elite Tier Summary
elite_summary = pd.DataFrame({
    'Focus Group': ['Loyalty Members | Active', 'Ex-Loyalty Members | Active'],
    'Elite Count': [
        (fg1_customers['fm_tier_fg1'] == 'Elite').sum(),
        (fg2_customers['fm_tier_fg2'] == 'Elite').sum()
    ],
    'Elite %': [
        (fg1_customers['fm_tier_fg1'] == 'Elite').sum() / len(fg1_customers) * 100,
        (fg2_customers['fm_tier_fg2'] == 'Elite').sum() / len(fg2_customers) * 100
    ],
    'Avg Frequency': [
        fg1_customers[fg1_customers['fm_tier_fg1'] == 'Elite']['Frequency'].mean(),
        fg2_customers[fg2_customers['fm_tier_fg2'] == 'Elite']['Frequency'].mean()
    ],
    'Avg Monetary': [
        fg1_customers[fg1_customers['fm_tier_fg1'] == 'Elite']['Monetary'].mean(),
        fg2_customers[fg2_customers['fm_tier_fg2'] == 'Elite']['Monetary'].mean()
    ]
})

# Display all profiles
print("FOCUS GROUP 1: Loyalty Members | Active - Segment Profile")
display(profile_fg1)
print("\nFOCUS GROUP 2: Ex-Loyalty Members | Active - Segment Profile")
display(profile_fg2)
print("\nELITE TIER SUMMARY (Top 10% in both F & M)")
display(elite_summary.style.format({'Elite %': '{:.1f}%', 'Avg Frequency': '{:.2f}', 'Avg Monetary': '{:.2f}'}))

## **6.4 Combined View: All Active Customers**

Segment **all active customers** together (Focus Group 1 + Focus Group 2) using unified thresholds to benchmark ex-loyalty members against current loyalty members.


In [None]:
# Use pre-filtered DataFrame from Section 3.1 (Combined: All Active Customers)
combined_customers = df_Customer.loc[fm_features_a.index].copy()

# Calculate COMBINED thresholds (across ALL active customers)
freq_median_combined = combined_customers['Frequency'].median()
mon_median_combined = combined_customers['Monetary'].median()
freq_p90_combined = combined_customers['Frequency'].quantile(0.90)
mon_p90_combined = combined_customers['Monetary'].quantile(0.90)

# Apply segmentation to ALL active customers
combined_customers['fm_segment_combined'] = combined_customers.apply(
    lambda row: assign_fm_segment(row, freq_median_combined, mon_median_combined), axis=1
)
combined_customers['fm_tier_combined'] = combined_customers.apply(
    lambda row: assign_fm_tier(row, freq_median_combined, mon_median_combined, freq_p90_combined, mon_p90_combined), axis=1
)

# Add Focus Group identifier
combined_customers['focus_group'] = combined_customers['is_current_loyalty_member'].map({
    True: 'FG1: Loyalty | Active',
    False: 'FG2: Ex-Loyalty | Active'
})

# Summary by Focus Group
summary_combined = combined_customers.groupby(['focus_group', 'fm_segment_combined']).size().unstack(fill_value=0)
summary_combined['Total'] = summary_combined.sum(axis=1)
summary_combined.loc['Total'] = summary_combined.sum()

# Percentage distribution
summary_combined_pct = summary_combined.div(summary_combined['Total'], axis=0) * 100

# Merge back to main dataframe
df_Customer = df_Customer.merge(
    combined_customers[['Loyalty#', 'fm_segment_combined', 'fm_tier_combined']], 
    on='Loyalty#', 
    how='left'
)

print("Combined: Segment Distribution by Focus Group")

display(summary_combined)
print("\nPercentage Distribution:")
display(summary_combined_pct.round(1))


In [None]:
# Visualize Combined View with FG2 highlighted in RED
fig, ax = plt.subplots(1, 1, figsize=(16, 9))
fig.suptitle('Combined FM Matrix: All Active Customers\nFocus Group 2 (Ex-Loyalty) Highlighted in Red', 
             fontsize=16, fontweight='bold', y=0.98)

# Plot reference lines
ax.axvline(freq_median_combined, color='#313131', linestyle='--', linewidth=2, alpha=0.7, zorder=2)
ax.axhline(mon_median_combined, color='#313131', linestyle='--', linewidth=2, alpha=0.7, zorder=2)

# Shade the high-Frequency and high-Monetary bands; their overlap (darkest area) is the Champions quadrant
max_freq_combined = combined_customers['Frequency'].max() * 1.05
max_mon_combined = combined_customers['Monetary'].max() * 1.05
ax.axvspan(freq_median_combined, max_freq_combined, alpha=0.08, color=colors[3], zorder=0)
ax.axhspan(mon_median_combined, max_mon_combined, alpha=0.08, color=colors[3], zorder=0)

# Shade the top-10% Frequency and Monetary bands; their overlap is the Elite zone
ax.axvspan(freq_p90_combined, max_freq_combined, alpha=0.15, color=elite_color, zorder=1)
ax.axhspan(mon_p90_combined, max_mon_combined, alpha=0.15, color=elite_color, zorder=1)

# Separate FG1 and FG2
fg1_data = combined_customers[combined_customers['focus_group'] == 'FG1: Loyalty | Active']
fg2_data = combined_customers[combined_customers['focus_group'] == 'FG2: Ex-Loyalty | Active']

# Plot FG1 (Loyalty Members) - Green/Gray tones
ax.scatter(fg1_data['Frequency'], fg1_data['Monetary'],
          c=colors[2], s=30, alpha=0.4, 
          edgecolor='white', linewidth=0.2, zorder=3,
          label=f'FG1: Loyalty Members (n={len(fg1_data):,})')

# Plot FG2 (Ex-Loyalty Members) - RED with stronger visibility
ax.scatter(fg2_data['Frequency'], fg2_data['Monetary'],
          c='#D32F2F', s=50, alpha=0.7, marker='o',
          edgecolor='darkred', linewidth=0.5, zorder=5,
          label=f'FG2: Ex-Loyalty Members (n={len(fg2_data):,})')

# Add quadrant labels
label_style = {'fontsize': 14, 'fontweight': 'bold', 'zorder': 10, 'color': '#00411E'}
ax.text(freq_median_combined * 1.4, mon_median_combined * 1.4, 'Champions', 
        ha='center', va='center', **label_style)
ax.text(freq_median_combined * 0.3, mon_median_combined * 1.4, 'Premium\nOccasional', 
        ha='center', va='center', **label_style)
ax.text(freq_median_combined * 1.4, mon_median_combined * 0.3, 'Frequent\nFlyer', 
        ha='center', va='center', **label_style)
ax.text(freq_median_combined * 0.3, mon_median_combined * 0.3, 'At Risk', 
        ha='center', va='center', **label_style)

# Add Elite label
elite_label_style = {'fontsize': 13, 'fontweight': 'bold', 'zorder': 100, 'color': '#00411E', 'style': 'italic'}
elite_center_x = (freq_p90_combined + max_freq_combined) / 2
elite_center_y = (mon_p90_combined + max_mon_combined) / 2
ax.text(elite_center_x, elite_center_y, 'Elite', 
        ha='center', va='center', **elite_label_style)

ax.set_xlabel("Frequency (Flights per Active Month)", fontsize=11, fontweight='bold')
ax.set_ylabel("Monetary (Distance per Active Month)", fontsize=11, fontweight='bold')
ax.set_title(f"Combined Segmentation | Total Active Customers: n={len(combined_customers):,}", 
             fontsize=12, fontweight='bold', pad=10)
ax.legend(loc='lower right', fontsize=10, frameon=True, shadow=True)
ax.grid(True, alpha=0.2, linestyle=':', zorder=0)
ax.set_axisbelow(True)

plt.tight_layout()
plt.show()


Compare how **Focus Group 2 customers** are distributed across segments in their isolated view versus the combined view.


In [None]:
# Create simple comparison table for FG2: segment counts under isolated vs. combined thresholds
fg2_isolated_counts = fg2_customers['fm_segment_fg2'].value_counts().rename('FG2 Isolated View')
fg2_combined_counts = fg2_data['fm_segment_combined'].value_counts().rename('Combined View')

# Align both views on the segment name; a segment missing from one view gets a count of 0
# (an outer merge on differently named key columns would leave a NaN segment label instead)
segment_comparison = (
    pd.concat([fg2_isolated_counts, fg2_combined_counts], axis=1)
    .fillna(0)
    .astype(int)
)
segment_comparison.index.name = 'Segment'

# Ensure correct order
segment_order = ['Champions', 'Frequent Flyer', 'Premium Occasional', 'At Risk']
segment_comparison = segment_comparison.reindex([s for s in segment_order if s in segment_comparison.index])

# Add percentages
segment_comparison['FG2 Isolated (%)'] = (segment_comparison['FG2 Isolated View'] / segment_comparison['FG2 Isolated View'].sum() * 100).round(1)
segment_comparison['Combined (%)'] = (segment_comparison['Combined View'] / segment_comparison['Combined View'].sum() * 100).round(1)

# Reorder columns
segment_comparison = segment_comparison[['FG2 Isolated View', 'FG2 Isolated (%)', 'Combined View', 'Combined (%)']]
print("Focus Group 2: Segment Distribution Comparison")
print(f"Total FG2 Customers: {len(fg2_data):,}\n")
display(segment_comparison)


<div style="background-color: #e8f3e5ff; padding: 15px; margin-right: 30px; border-left: 5px solid; border-image: linear-gradient(to bottom, #00411E, #00622D, #00823C, #45AF28, #82BA72) 1; border-radius: 5px;">
    <h3 style="margin-top: 0; margin-bottom: 10px; color: #00411E; font-weight: bold;">Value-Based Segmentation Summary</h3>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        We performed a comprehensive <strong>FM-based customer segmentation</strong> using a rule-based median-split approach to segment active customers by flight frequency and monetary value. Three distinct analyses were conducted: Focus Group 1 (Loyalty Members), Focus Group 2 (Ex-Loyalty Members), and a Combined View to benchmark performance across all active customers.
    </p>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">What We Did:</h4>
    <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
        <li style="margin-right: 20px;"><strong>Created FM Features:</strong> Engineered Frequency (flights/active month) and Monetary (distance/active month) metrics based on 2021 flight activity</li>
        <li style="margin-right: 20px;"><strong>Separate Group Analysis:</strong> Applied median-split segmentation independently to FG1 (current loyalty members) and FG2 (ex-loyalty members) to understand each group's characteristics</li>
        <li style="margin-right: 20px;"><strong>Combined Benchmarking:</strong> Segmented ALL active customers using unified thresholds to directly compare FG2 performance against FG1 standards</li>
        <li style="margin-right: 20px;"><strong>Elite Tier Identification:</strong> Flagged Top 10% performers in both dimensions within the Champions segment for premium targeting</li>
        <li style="margin-right: 20px;"><strong>4 Actionable Segments:</strong> Champions (High F & M), Frequent Flyer (High F, Low M), Premium Occasional (Low F, High M), At Risk (Low F & M)</li>
    </ul>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Key Findings:</h4>
    <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
        <li style="margin-right: 20px;"><strong>Performance Disparity:</strong> Current loyalty members (FG1) show higher Frequency and Monetary values than ex-loyalty members (FG2). This is consistent with the loyalty program attracting and retaining high-value customers, although the comparison alone does not establish causality</li>
        <li style="margin-right: 20px;"><strong>Segment Redistribution:</strong> When benchmarked against combined thresholds, many FG2 customers shift from higher segments (in their isolated peer group) to lower segments, revealing a performance gap that represents win-back opportunity</li>
        <li style="margin-right: 20px;"><strong>High-Value Win-Back Targets:</strong> FG2 customers who remain in Champions/Elite tiers even in the combined view (competing against FG1) represent exceptional performers who are flying at loyalty-member levels despite not being enrolled—these are the highest-priority re-enrollment targets</li>
        <li style="margin-right: 20px;"><strong>Segment Characteristics:</strong> Champions generate the most value across both frequency and distance; Frequent Flyers have engagement potential through route/class upgrades; Premium Occasional customers are retention-focused targets; At Risk requires re-engagement strategies</li>
    </ul>
</div>
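
The median-split rule and the Elite flag summarized above can be illustrated with a minimal, self-contained sketch. It does not reproduce the notebook's own `assign_fm_segment` / `assign_fm_tier` helpers: the function `assign_segment_sketch`, the `toy_*` names, and the synthetic data are hypothetical, and treating "high" as "greater than or equal to the median" is an assumption.

```python
import numpy as np
import pandas as pd


def assign_segment_sketch(freq, mon, freq_median, mon_median):
    """Median-split rule; 'high' is ASSUMED to mean >= median."""
    high_f = freq >= freq_median
    high_m = mon >= mon_median
    if high_f and high_m:
        return "Champions"
    if high_f:
        return "Frequent Flyer"
    if high_m:
        return "Premium Occasional"
    return "At Risk"


# Synthetic customers for illustration only (not project data)
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "Frequency": rng.gamma(shape=2.0, scale=1.5, size=500),
    "Monetary": rng.gamma(shape=2.0, scale=900.0, size=500),
})

# Thresholds computed on the toy population
toy_freq_median = toy["Frequency"].median()
toy_mon_median = toy["Monetary"].median()
toy_freq_p90 = toy["Frequency"].quantile(0.90)
toy_mon_p90 = toy["Monetary"].quantile(0.90)

toy["segment"] = [
    assign_segment_sketch(f, m, toy_freq_median, toy_mon_median)
    for f, m in zip(toy["Frequency"], toy["Monetary"])
]

# Elite flag: top 10% on BOTH dimensions
toy["is_elite"] = (toy["Frequency"] >= toy_freq_p90) & (toy["Monetary"] >= toy_mon_p90)

print(toy["segment"].value_counts())
print(f"Elite share: {toy['is_elite'].mean():.1%}")
```

Because the 90th percentile is never below the median, every Elite customer in this sketch also falls in the Champions segment, which matches the "within the Champions segment" wording above.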


<div style="background-color: #fefde9ff; padding: 15px; margin-right: 20px; border-left: 5px solid; border-image: linear-gradient(to bottom, #909090, #d4d400, #e6e600) 1; border-radius: 5px;">
    <h3 style="margin-top: 0; margin-bottom: 10px; color: #6b6b00; font-weight: bold;">To be updated -- Next Steps: Multi-Dimensional Clustering (Chapters 7-9)</h3>
    <p style="margin: 10px 0; color: #000; margin-right: 40px; margin-bottom: 10px;">
        The FM segments created in this chapter serve as the <strong>value-based foundation</strong> for our customer segmentation strategy. To develop comprehensive, actionable customer personas, we will layer <strong>demographic</strong> and <strong>behavioral</strong> insights on top of these value segments through advanced clustering algorithms.
    </p>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #6b6b00; font-weight: bold;">Strategic Rationale:</h4>
    <ul style="margin: 10px 0; padding-left: 20px; color: #000; margin-right: 40px;">
        <li style="margin-bottom: 6px;"><strong>Combined Target Group (ALL Active Customers):</strong> From Chapter 7 onwards, we analyze FG1 + FG2 together rather than separately. The Combined View demonstrated that unified benchmarking provides consistent performance standards and enables direct comparison for win-back prioritization.</li>
        <li style="margin-bottom: 6px;"><strong>Multi-Perspective Segmentation:</strong> Value (FM) + Demographics + Behavior = comprehensive understanding of WHO customers are (demographics), HOW they behave (patterns), and WHAT they're worth (value)</li>
        <li style="margin-bottom: 6px;"><strong>Focus Group Identifier Preserved:</strong> While clustering on combined data, we retain FG1/FG2 labels to identify win-back targets versus retention priorities in final personas</li>
    </ul>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #6b6b00; font-weight: bold;">Chapter 7: Demographic Clustering</h4>
    <ul style="margin: 10px 0; padding-left: 20px; color: #000; margin-right: 40px;">
        <li style="margin-bottom: 6px;"><strong>Features:</strong> Gender, Education, Income, Marital Status, Location Code and more</li>
        <li style="margin-bottom: 6px;"><strong>Algorithms:</strong> Hierarchical Clustering, K-Means, Mean Shift, GMM, SOM + K-Means | Compare performance using Silhouette Score, Davies-Bouldin Index, Calinski-Harabasz Score, R²</li>
        <li style="margin-bottom: 6px;"><strong>Business Goal:</strong> Identify demographic profiles</li>
    </ul>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #6b6b00; font-weight: bold;">Chapter 8: Behavioral Clustering</h4>
    <ul style="margin: 10px 0; padding-left: 20px; color: #000; margin-right: 40px;">
        <li style="margin-bottom: 6px;"><strong>Features:</strong> Flight patterns (frequency distribution, seasonality), points accumulation/redemption behavior, companion travel frequency, route preferences, booking patterns and more</li>
        <li style="margin-bottom: 6px;"><strong>Algorithms:</strong> Hierarchical Clustering, K-Means, DBSCAN | Compare performance using evaluation metrics</li>
        <li style="margin-bottom: 6px;"><strong>Business Goal:</strong> Identify behavioral archetypes</li>
    </ul>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #6b6b00; font-weight: bold;">Chapter 9: Final Persona Integration</h4>
    <ul style="margin: 10px 0; padding-left: 20px; color: #000; margin-right: 40px;">
        <li style="margin-bottom: 6px;"><strong>Integration Strategy:</strong> Combine FM segments (value), demographic clusters, and behavioral clusters to create multi-dimensional customer personas</li>
        <li style="margin-bottom: 6px;"><strong>Win-Back Prioritization:</strong> Use Focus Group identifier to flag FG2 customers within high-value personas for targeted re-enrollment campaigns</li>
        <li style="margin-bottom: 6px;"><strong>Deliverables:</strong> Final customer personas with actionable marketing strategies, service personalization recommendations, and revenue opportunity sizing</li>
        <li style="margin-bottom: 6px;"><strong>Business Impact:</strong> Quantify revenue uplift from converting FG2 Champions to FG1 engagement levels | Design persona-specific campaigns with projected ROI</li>
    </ul>
</div>


---

# <a class='anchor' id='7'></a>
<br>

<div style="background: linear-gradient(to right, #00411E, #00622D, #00823C, #45AF28, #82BA72); 
            padding: 10px; color: white; text-align: center;  max-width: 97%;">
    <center><h1 style="margin-top: 10px; margin-bottom: 4px; color: white;
                       font-size: 32px; font-family: 'Roboto', sans-serif;">
        <b>7. Demographic Clustering</b></h1></center>
</div>

<div style="background-color: #e1e1e1ff; padding: 15px; margin-right: 20px; border-left: 5px solid; border-image: linear-gradient(to bottom, #212121, #313131, #595959, #909090) 1; border-radius: 5px;">
    <h3 style="margin-top: 0; margin-bottom: 10px; color: #000000ff; font-weight: bold;">Section 7: Demographic Clustering</h3>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #313131; font-weight: bold;">Why, Rationale:</h4>
    <p style="margin: 10px 0; color: #000; margin-right: 40px;">
        While FM segmentation (Section 6) captures customer value through transactional behavior (frequency, monetary value), demographic characteristics provide complementary insights into <strong>who</strong> our customers are beyond their purchasing patterns. Understanding demographic profiles enables more targeted marketing strategies, personalized communication, and product recommendations tailored to specific customer groups. By combining value-based segmentation with demographic clustering, we can create multi-dimensional customer personas that capture both <strong>what customers do</strong> (transactional behavior) and <strong>who they are</strong> (socio-demographic profile). This section systematically explores multiple clustering algorithms to identify natural demographic groupings within our customer base. The resulting demographic segments will be synthesized with the FM segments and the behavioral segments in Section 9 to create comprehensive, actionable customer personas for strategic decision-making.
    </p>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #313131; font-weight: bold;">What, Objectives:</h4>
    <ul style="margin: 10px 0; padding-left: 20px; color: #000; margin-right: 40px;">
        <li style="margin-bottom: 6px;"><strong>Multi-Algorithm Comparison:</strong> Evaluate four distinct clustering approaches, Hierarchical Clustering (agglomerative, bottom-up), K-Means (centroid-based, partition), Mean Shift (density-based, non-parametric), and GMM (probabilistic, soft assignments), plus a SOM-based two-stage approach to understand how different algorithmic assumptions and optimization criteria segment the same demographic data differently</li>
        <li style="margin-bottom: 6px;"><strong>Optimal Cluster Selection:</strong> For each algorithm, determine the optimal number of clusters using multiple validation metrics (Silhouette Score, Calinski-Harabasz Index, Davies-Bouldin Index, R² variance explained) while consistently favoring parsimonious solutions that balance statistical fit with business interpretability</li>
        <li style="margin-bottom: 6px;"><strong>Initial Cluster Profiling:</strong> Characterize each algorithm's clusters using Z-score profile analysis (standardized centroid deviations from population mean) and feature importance analysis (variance of centroids across clusters). This initial profiling identifies which demographic features drive segmentation for each method. Comprehensive profiling with business interpretation, cross-tabulations, and strategic recommendations will follow in Section 9 after final segment selection and comparison across all clustering approaches</li>
        <li style="margin-bottom: 6px;"><strong>Algorithm Behavior Analysis:</strong> Document how different algorithms handle the demographic feature space, which features emerge as primary differentiators, how cluster sizes distribute, and what segment structures each method reveals, to inform the final clustering method selection in Section 9</li>
    </ul>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #313131; font-weight: bold;">How, Method & Structure:</h4>
    <p style="margin: 10px 0; color: #000; margin-right: 40px;">
        <strong>Dataset:</strong> All active customers (FG1 + FG2 combined, n=14,527) from the demographic feature set prepared in Section 5.
    </p>
    <p style="margin: 10px 0; color: #000; margin-right: 40px;">
        <strong>Features (9 variables, Z-score standardized):</strong>
    </p>
    <ul style="margin: 10px 0; padding-left: 20px; color: #000; margin-right: 40px;">
        <li style="margin-bottom: 4px;"><strong>Geographic:</strong> Province_Encoded, City_Encoded, FSA_Encoded, Location_Code_Num</li>
        <li style="margin-bottom: 4px;"><strong>Personal:</strong> Gender_Encoded, Education_Level_Num, Income_Bin_Num</li>
        <li style="margin-bottom: 4px;"><strong>Marital Status:</strong> Marital_Divorced, Marital_Married (one-hot encoded)</li>
    </ul>
    <p style="margin: 10px 0; color: #000; margin-right: 40px;">
        <strong>Notebook Structure:</strong>
    </p>
    <ul style="margin: 10px 0; padding-left: 20px; color: #000; margin-right: 40px;">
        <li style="margin-bottom: 6px;"><strong>7.1 Hierarchical Clustering:</strong> Agglomerative clustering with linkage method comparison (Ward, Complete, Average, Single), optimal k selection via dendrogram analysis and validation metrics → <strong>k=6 selected</strong></li>
        <li style="margin-bottom: 6px;"><strong>7.2 K-Means Clustering:</strong> Centroid-based partitioning with Elbow method and multi-metric evaluation → <strong>k=3 selected</strong></li>
        <li style="margin-bottom: 6px;"><strong>7.3 Mean Shift Clustering:</strong> Non-parametric density-based clustering with bandwidth optimization via quantile grid search → <strong>4 clusters (q=0.132)</strong></li>
        <li style="margin-bottom: 6px;"><strong>7.4 GMM Clustering:</strong> Gaussian Mixture Model with full covariance, probabilistic soft assignments, component selection via BIC/AIC → <strong>n=4 selected</strong></li>
        <li style="margin-bottom: 6px;"><strong>7.5 SOM + K-Means:</strong> Two-stage approach using 40×40 Self-Organizing Map for topology-preserving dimensionality reduction, followed by K-Means on learned neuron weights → <strong>k=4 selected</strong></li>
    </ul>
    <p style="margin: 10px 0; color: #000; margin-right: 40px;">
        Each subsection follows a consistent structure: (1) Methodology explanation, (2) Parameter/hyperparameter optimization, (3) Candidate solution comparison, (4) Final model selection, (5) Initial cluster profiling with Z-score heatmaps and feature importance analysis.
    </p>
</div>
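
The initial profiling step described above (Z-score cluster profiles plus feature importance as the variance of centroids across clusters) can be sketched as follows. This is a minimal illustration on synthetic data, not the notebook's pipeline: the cluster labels are random and hypothetical, only three of the nine feature names are borrowed, and a shift is injected into one feature so that the profiles differ.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the demographic feature matrix (illustration only)
rng = np.random.default_rng(0)
raw = pd.DataFrame(
    rng.normal(size=(300, 3)),
    columns=["Income_Bin_Num", "Education_Level_Num", "Location_Code_Num"],
)

# Hypothetical cluster labels (0, 1, 2); in the notebook these come from a fitted model
labels = pd.Series(rng.integers(0, 3, size=300), name="cluster")

# Inject a cluster-dependent shift so one feature clearly separates the clusters
raw["Income_Bin_Num"] += labels * 1.5

# Z-score standardization: each feature has population mean 0 and std 1
scaled = (raw - raw.mean()) / raw.std(ddof=0)

# Cluster profile: mean z-score per cluster (deviation from the population mean)
profiles = scaled.groupby(labels).mean()

# Feature importance: variance of the cluster centroids across clusters
importance = profiles.var(axis=0).sort_values(ascending=False)

print(profiles.round(2))
print(importance.round(3))
```

A feature whose centroids sit far apart in z-score units gets a high variance and is read as a primary differentiator; features with near-zero centroid variance contribute little to the segmentation.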


### Functions
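
The R² helpers in this section (`get_ss`, `get_ssb`, `get_ssw`, `get_rsq`) rest on the variance decomposition SST = SSB + SSW. The sketch below checks that identity directly with NumPy on a small synthetic dataset; the data and the two-cluster labels are hypothetical and serve only to illustrate the arithmetic.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two hypothetical, well-separated clusters in 2D (illustration only)
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(50, 2)),
    rng.normal(loc=5.0, scale=1.0, size=(50, 2)),
])
labels = np.repeat([0, 1], 50)

# Total sum of squares around the overall mean
overall_mean = X.mean(axis=0)
sst = np.sum((X - overall_mean) ** 2)

# Between-cluster (SSB) and within-cluster (SSW) sums of squares
ssb = 0.0
ssw = 0.0
for k in np.unique(labels):
    X_k = X[labels == k]
    centroid = X_k.mean(axis=0)
    ssb += len(X_k) * np.sum((centroid - overall_mean) ** 2)
    ssw += np.sum((X_k - centroid) ** 2)

# R² is the share of total variance explained by the cluster assignment
r2 = ssb / sst
print(f"SST={sst:.2f}  SSB={ssb:.2f}  SSW={ssw:.2f}  R2={r2:.3f}")
```

Since SSB and SST - SSW are equal up to floating-point error, computing R² as `(SST - SSW) / SST`, as `get_rsq` does, is equivalent to `SSB / SST`.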

In [None]:
# Helper functions for variance decomposition and R² calculation

def get_ss(df, feats):
    """
    Calculate Total Sum of Squares (SST) for given features.
    
    SST = Σ(n-1) * Var(feature_j) across all features
    
    Parameters:
    -----------
    df : pd.DataFrame
        Input dataset
    feats : list
        List of feature column names
    
    Returns:
    --------
    float : Total sum of squares
    """
    df_ = df[feats]
    ss = np.sum(df_.var() * (df_.count() - 1))
    return ss


def get_ssb(df, feats, label_col):
    """
    Calculate Between-Cluster Sum of Squares (SSB).
    
    SSB = Σ n_k * (x̄_k - x̄)²
    Measures variance between cluster centroids and overall mean.
    
    Parameters:
    -----------
    df : pd.DataFrame
        Input dataset with cluster labels
    feats : list
        List of feature column names
    label_col : str
        Column name containing cluster labels
    
    Returns:
    --------
    float : Between-cluster sum of squares
    """
    ssb_i = 0
    X_ = df[feats].values
    overall_mean = X_.mean(axis=0)
    
    for cluster_id in np.unique(df[label_col]):
        X_k = df.loc[df[label_col] == cluster_id, feats].values
        n_k = X_k.shape[0]
        cluster_mean = X_k.mean(axis=0)
        ssb_i += n_k * np.square(cluster_mean - overall_mean)
    
    return np.sum(ssb_i)


def get_ssw(df, feats, label_col):
    """
    Calculate Within-Cluster Sum of Squares (SSW).
    
    SSW = Σ Σ (x_i - x̄_k)² for all points in each cluster
    Measures total variance within all clusters.
    
    Parameters:
    -----------
    df : pd.DataFrame
        Input dataset with cluster labels
    feats : list
        List of feature column names
    label_col : str
        Column name containing cluster labels
    
    Returns:
    --------
    float : Within-cluster sum of squares
    """
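    feats_label = feats + [label_col]
    # Sum of squares within each cluster, then summed over clusters.
    # Note: `include_groups` requires pandas >= 2.2.
    df_k = df[feats_label].groupby(by=label_col).apply(
        lambda group: get_ss(group, feats), 
        include_groups=False
    )
    return df_k.sum()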


def get_rsq(df, feats, label_col):
    """
    Calculate R² (coefficient of determination) for clustering solution.
    
    R² = SSB / SST = (SST - SSW) / SST
    
    Interpretation:
    - R² close to 1: High separation between clusters (good clustering)
    - R² close to 0: Low separation between clusters (poor clustering)
    
    Parameters:
    -----------
    df : pd.DataFrame
        Input dataset with cluster labels
    feats : list
        List of feature column names
    label_col : str
        Column name containing cluster labels
    
    Returns:
    --------
    float : R² value between 0 and 1
    """
    df_sst = get_ss(df, feats)
    df_ssw = get_ssw(df, feats, label_col)
    df_ssb = df_sst - df_ssw
    
    return df_ssb / df_sst


def get_r2_hc(df, link_method, max_nclus, min_nclus=1, dist="euclidean"):
    """
    Compute R² for hierarchical clustering across multiple k values.
    
    Applies hierarchical clustering for k = min_nclus to max_nclus and
    calculates R² for each solution to identify optimal cluster count.
    
    Parameters:
    -----------
    df : pd.DataFrame
        Input dataset (should be scaled)
    link_method : str
        Linkage method: 'ward', 'complete', 'average', or 'single'
    max_nclus : int
        Maximum number of clusters to test
    min_nclus : int, default=1
        Minimum number of clusters to test
    dist : str, default='euclidean'
        Distance metric for clustering
    
    Returns:
    --------
    np.ndarray : Array of R² values for each k
    """
    r2 = []
    feats = df.columns.tolist()
    
    for i in range(min_nclus, max_nclus + 1):
        cluster = AgglomerativeClustering(
            linkage=link_method, 
            metric=dist, 
            n_clusters=i
        )
        hclabels = cluster.fit_predict(df[feats])
        
        df_concat = pd.concat([
            df, 
            pd.Series(hclabels, name='labels', index=df.index)
        ], axis=1)
        
        r2.append(get_rsq(df_concat, feats, 'labels'))
    
    return np.array(r2)

# Reusable clustering utility functions

def compute_cophenetic_correlation(df, linkage_method, metric='euclidean'):
    """
    Compute the Cophenetic Correlation Coefficient (CCC) for hierarchical clustering.
    
    The CCC measures how well the dendrogram preserves the original pairwise distances.
    Higher values (closer to 1) indicate better preservation of the data structure.
    
    Parameters:
    -----------
    df : pd.DataFrame or np.ndarray
        Scaled feature data
    linkage_method : str
        Linkage method: 'ward', 'complete', 'average', 'single'
    metric : str, default='euclidean'
        Distance metric
    
    Returns:
    --------
    float : Cophenetic correlation coefficient (range: -1 to 1)
    """
    
    # Compute linkage matrix
    Z = linkage(df, method=linkage_method, metric=metric)
    
    # Compute cophenetic distances
    c, coph_dists = cophenet(Z, pdist(df, metric=metric))
    
    return c


def plot_linkage_comparison(linkage_results, palette):
    """
    Visualize comparison of linkage methods using CCC scores.
    
    Parameters:
    -----------
    linkage_results : pd.DataFrame
        DataFrame with columns: 'Linkage Method', 'CCC'
    palette : list
        Color palette for visualization
    
    Returns:
    --------
    None (displays plot)
    """
    fig, ax = plt.subplots(figsize=(10, 6))
    
    bars = ax.bar(linkage_results['Linkage Method'], 
                  linkage_results['CCC'],
                  color=palette[:len(linkage_results)],
                  edgecolor='black',
                  linewidth=1.5,
                  alpha=0.85)
    
    # Add value labels on bars
    for bar in bars:
        height = bar.get_height()
        ax.text(bar.get_x() + bar.get_width()/2., height,
                f'{height:.4f}',
                ha='center', va='bottom', fontweight='bold', fontsize=11)
    
    ax.set_ylabel('Cophenetic Correlation Coefficient (CCC)', fontsize=12, fontweight='bold')
    ax.set_xlabel('Linkage Method', fontsize=12, fontweight='bold')
    ax.set_title('Hierarchical Clustering: Linkage Method Comparison\nCophenetic Correlation Coefficient (Higher is Better)', 
                 fontsize=14, fontweight='bold', pad=15)
    ax.set_ylim(0, 1.05)
    ax.grid(False)
    
    plt.tight_layout()
    plt.show()


def evaluate_clustering_metrics(df, labels, algorithm_name=''):
    """
    Calculate comprehensive clustering evaluation metrics.
    
    Parameters:
    -----------
    df : pd.DataFrame or np.ndarray
        Scaled feature data
    labels : np.ndarray
        Cluster labels
    algorithm_name : str, optional
        Name for display purposes
    
    Returns:
    --------
    dict : Dictionary with metric names and values
    """
    
    metrics = {
        'Silhouette Score': silhouette_score(df, labels),
        'Calinski-Harabasz Index': calinski_harabasz_score(df, labels),
        'Davies-Bouldin Index': davies_bouldin_score(df, labels)
    }
    
    return metrics


def plot_cluster_profiles_heatmap(cluster_profiles, population_mean, palette_continuous, title='Cluster Profiles'):
    """
    Create heatmap visualization of cluster profiles with z-scores.
    
    Parameters:
    -----------
    cluster_profiles : pd.DataFrame
        Mean feature values per cluster
    population_mean : pd.Series
        Population mean for comparison
    palette_continuous : matplotlib colormap
        Continuous color palette
    title : str
        Plot title
    
    Returns:
    --------
    None (displays plot)
    """
    profiles_with_pop = pd.concat([
        cluster_profiles.T,
        population_mean.rename('Population')
    ], axis=1)
    
    fig, ax = plt.subplots(figsize=(12, 8))
    
    sns.heatmap(
        profiles_with_pop,
        annot=True,
        fmt='.2f',
        cmap=palette_continuous,
        center=0,
        ax=ax,
        cbar_kws={'label': 'Z-Score'},
        linewidths=0.5,
        linecolor='white'
    )
    
    ax.set_title(title, fontweight='bold', fontsize=14, pad=15)
    ax.set_ylabel('Features', fontweight='bold', fontsize=11)
    ax.set_xlabel('Cluster', fontweight='bold', fontsize=11)
    
    plt.tight_layout()
    plt.show()


def plot_cluster_sizes(labels, k, palette, title='Cluster Size Distribution'):
    """
    Visualize cluster size distribution with counts and percentages.
    
    Parameters:
    -----------
    labels : np.ndarray
        Cluster labels
    k : int
        Number of clusters
    palette : list
        Color palette
    title : str
        Plot title
    
    Returns:
    --------
    None (displays plot)
    """
    cluster_sizes = pd.Series(labels).value_counts().sort_index()
    
    fig, ax = plt.subplots(figsize=(10, 6))
    
    bars = ax.bar(cluster_sizes.index, cluster_sizes.values,
                  color=[palette[i % len(palette)] for i in range(k)], linewidth=1.2, alpha=0.85)

    total = cluster_sizes.sum()
    for i, (idx, count) in enumerate(cluster_sizes.items()):
        percentage = (count / total) * 100
        ax.text(idx, count + (max(cluster_sizes.values) * 0.02), 
                f'{count}\n({percentage:.1f}%)',
                ha='center', va='bottom', fontweight='bold', fontsize=12)
    
    ax.set_xlabel('Cluster', fontsize=12, fontweight='bold')
    ax.set_ylabel('Number of Customers', fontsize=12, fontweight='bold')
    ax.set_title(title, fontweight='bold', fontsize=13, pad=15)
    ax.set_xticks(range(k))
    ax.grid(False)

    ax.set_ylim(0, max(cluster_sizes.values) * 1.15)
    
    plt.tight_layout()
    plt.show()


def plot_feature_importance(feature_variance, palette, title='Feature Importance: Variance Analysis'):
    """
    Visualize feature importance based on variance across clusters.
    
    Parameters:
    -----------
    feature_variance : pd.Series
        Variance of each feature across clusters (sorted descending)
    palette : list
        Color palette
    title : str
        Plot title
    
    Returns:
    --------
    None (displays plot)
    """
    fig, ax = plt.subplots(figsize=(12, 5))
    
    ax.bar(range(len(feature_variance)), feature_variance.values, color=palette[0])
    ax.set_xticks(range(len(feature_variance)))
    ax.set_xticklabels(feature_variance.index, rotation=45, ha='right', fontsize=10)
    ax.set_ylabel('Variance Across Clusters', fontweight='bold', fontsize=11)
    ax.set_title(title, fontweight='bold', fontsize=13, pad=15)
    ax.grid(False)
    
    plt.tight_layout()
    plt.show()

def plot_cluster_size_comparison(labels_dict, palette, title='Cluster Size Comparison'):
    """
    Compare cluster size distributions across multiple clustering solutions.
    
    Parameters:
    -----------
    labels_dict : dict
        Dictionary with format {k_value: labels_array}
        e.g., {6: labels_k6, 8: labels_k8}
    palette : list
        Color palette
    title : str
        Plot title
    
    Returns:
    --------
    None (displays plot)
    """
    n_solutions = len(labels_dict)
    fig, axes = plt.subplots(1, n_solutions, figsize=(7*n_solutions, 5), sharey=True)
    
    # If only one solution, axes is not a list
    if n_solutions == 1:
        axes = [axes]
    
    # Find max cluster size for consistent y-axis
    max_size = 0
    for labels in labels_dict.values():
        cluster_sizes = pd.Series(labels).value_counts()
        max_size = max(max_size, cluster_sizes.max())
    
    # Plot each solution
    for idx, (k_value, labels) in enumerate(labels_dict.items()):
        ax = axes[idx]
        
        # Calculate cluster sizes
        cluster_sizes = pd.Series(labels).value_counts().sort_index()
        
        # Plot bars
        ax.bar(cluster_sizes.index, cluster_sizes.values,
               color=palette[:len(cluster_sizes)], 
               linewidth=1.2, alpha=0.85)
        
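        # Add value labels on bars, anchored at each bar's own x position
        # so the text stays aligned even if cluster ids are not 0..k-1
        for cluster_id, v in cluster_sizes.items():
            percentage = v / len(labels) * 100
            ax.text(cluster_id, v + (max_size * 0.02), f'{v}\n({percentage:.1f}%)',
                   ha='center', va='bottom', fontweight='bold', fontsize=11)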
        
        # Styling
        ax.set_title(f'k={k_value}', fontweight='bold', fontsize=14, pad=10)
        ax.set_xlabel('Cluster', fontweight='bold', fontsize=12)
        if idx == 0:
            ax.set_ylabel('Number of Customers', fontweight='bold', fontsize=12)
        ax.grid(False)
        ax.set_ylim(0, max_size * 1.15)
        ax.set_xticks(range(len(cluster_sizes)))
    
    plt.suptitle(title, fontsize=14, fontweight='bold', y=1.02)
    plt.tight_layout()
    plt.show()

def plot_elbow_method(k_range, inertia_values, palette, title='Elbow Method: Optimal k Selection'):
    """
    Visualize the Elbow Method plot for K-Means clustering.
    
    Parameters:
    -----------
    k_range : range or list
        Range of k values tested
    inertia_values : list
        Inertia (WCSS) values for each k
    palette : list
        Color palette
    title : str
        Plot title
    
    Returns:
    --------
    None (displays plot)
    """
    fig, ax = plt.subplots(figsize=(10, 6))
    
    ax.plot(k_range, inertia_values, marker='o', linewidth=2.5, markersize=8,
            color=palette[2], markerfacecolor=palette[3])
    ax.set_xticks(k_range)
    ax.set_ylabel("Inertia (Within-Cluster Sum of Squares)", fontsize=11, fontweight='bold')
    ax.set_xlabel("Number of Clusters (k)", fontsize=11, fontweight='bold')
    ax.set_title(title, fontsize=14, fontweight='bold', pad=15)
    ax.grid(False)
    
    plt.tight_layout()
    plt.show()


def plot_clustering_metrics(metrics_df, k_range, palette, title='Clustering Metrics Evaluation'):
    """
    Visualize clustering evaluation metrics (Silhouette, Calinski-Harabasz, Davies-Bouldin).
    
    Parameters:
    -----------
    metrics_df : pd.DataFrame
        DataFrame with columns: k, Silhouette, Calinski-Harabasz, Davies-Bouldin
    k_range : range or list
        Range of k values
    palette : list
        Color palette
    title : str
        Plot title
    
    Returns:
    --------
    None (displays plot)
    """
    fig, axes = plt.subplots(1, 3, figsize=(16, 5))
    
    # Plot 1: Silhouette Score (maximize)
    axes[0].plot(metrics_df['k'], metrics_df['Silhouette'],
                 marker='o', linewidth=2.5, markersize=8, color=palette[0])
    max_sil_k = metrics_df.loc[metrics_df['Silhouette'].idxmax(), 'k']
    axes[0].axvline(x=max_sil_k, color='red', linestyle='--', alpha=0.5, linewidth=2)
    axes[0].set_xlabel('Number of Clusters (k)', fontweight='bold', fontsize=11)
    axes[0].set_ylabel('Silhouette Score', fontweight='bold', fontsize=11)
    axes[0].set_title('Silhouette Score (Higher is Better)', fontweight='bold', fontsize=12)
    axes[0].grid(False)
    axes[0].set_xticks(k_range)
    
    # Plot 2: Calinski-Harabasz Index (maximize)
    axes[1].plot(metrics_df['k'], metrics_df['Calinski-Harabasz'],
                 marker='o', linewidth=2.5, markersize=8, color=palette[1])
    max_ch_k = metrics_df.loc[metrics_df['Calinski-Harabasz'].idxmax(), 'k']
    axes[1].axvline(x=max_ch_k, color='red', linestyle='--', alpha=0.5, linewidth=2)
    axes[1].set_xlabel('Number of Clusters (k)', fontweight='bold', fontsize=11)
    axes[1].set_ylabel('Calinski-Harabasz Index', fontweight='bold', fontsize=11)
    axes[1].set_title('Calinski-Harabasz Index (Higher is Better)', fontweight='bold', fontsize=12)
    axes[1].grid(False)
    axes[1].set_xticks(k_range)
    
    # Plot 3: Davies-Bouldin Index (minimize)
    axes[2].plot(metrics_df['k'], metrics_df['Davies-Bouldin'],
                 marker='o', linewidth=2.5, markersize=8, color=palette[2])
    min_db_k = metrics_df.loc[metrics_df['Davies-Bouldin'].idxmin(), 'k']
    axes[2].axvline(x=min_db_k, color='red', linestyle='--', alpha=0.5, linewidth=2)
    axes[2].set_xlabel('Number of Clusters (k)', fontweight='bold', fontsize=11)
    axes[2].set_ylabel('Davies-Bouldin Index', fontweight='bold', fontsize=11)
    axes[2].set_title('Davies-Bouldin Index (Lower is Better)', fontweight='bold', fontsize=12)
    axes[2].grid(False)
    axes[2].set_xticks(k_range)
    
    fig.suptitle(title, fontsize=15, fontweight='bold', y=1.02)
    plt.tight_layout()
    plt.show()

def plot_linkage_comparison(df, linkage_methods, palette, title='Linkage Method Comparison'):
    """
    Compare hierarchical clustering linkage methods using CCC and R² metrics.
    
    Creates a side-by-side visualization showing:
    - Left: CCC bar chart (dendrogram preservation quality)
    - Right: R² line chart (variance explained across k values)
    
    Parameters:
    -----------
    df : pd.DataFrame
        Scaled feature dataset
    linkage_methods : list
        List of linkage method names (e.g., ['ward', 'complete', 'average', 'single'])
    palette : list
        Color palette - each method gets consistent color across both plots
    title : str
        Overall plot title
    
    Returns:
    --------
    tuple : (ccc_df, r2_results_all)
        - ccc_df: DataFrame with CCC values
        - r2_results_all: List of R² arrays for each method
    """
    # Define consistent color mapping for linkage methods
    method_colors = {
        'ward': palette[0],
        'complete': palette[1],
        'average': palette[2],
        'single': palette[3]
    }
    
    # Compute CCC for each linkage method
    ccc_results = []
    for method in linkage_methods:
        ccc = compute_cophenetic_correlation(df, method, metric='euclidean')
        ccc_results.append({'Linkage Method': method.capitalize(), 'CCC': ccc})
    
    # Create CCC comparison DataFrame
    ccc_df = pd.DataFrame(ccc_results)
    
    # Compute R² for each linkage method
    r2_results_all = []
    for method in linkage_methods:
        r2_values = get_r2_hc(df, link_method=method, max_nclus=10, min_nclus=2)
        r2_results_all.append(r2_values)
    
    # Create side-by-side visualization
    fig, axes = plt.subplots(1, 2, figsize=(16, 6))
    
    # Plot 1: CCC Bar Chart (preserve original order for consistent colors)
    ax1 = axes[0]
    bar_colors = [method_colors[method] for method in linkage_methods]
    bars = ax1.bar(ccc_df['Linkage Method'], ccc_df['CCC'], 
                   color=bar_colors, alpha=0.85, linewidth=1.2)
    ax1.set_ylabel('Cophenetic Correlation Coefficient', fontweight='bold', fontsize=12)
    ax1.set_xlabel('Linkage Method', fontweight='bold', fontsize=12)
    ax1.set_title('CCC: Dendrogram Preservation Quality\n(Higher = Better)', 
                  fontweight='bold', fontsize=13, pad=15)
    ax1.set_ylim(0, 1.0)
    ax1.grid(False)
    
    # Add value labels on bars
    for bar in bars:
        height = bar.get_height()
        ax1.text(bar.get_x() + bar.get_width()/2., height + 0.02,
                f'{height:.3f}', ha='center', va='bottom', fontweight='bold', fontsize=11)
    
    # Plot 2: R² Line Chart (same colors as bar chart)
    ax2 = axes[1]
    k_values = list(range(2, 11))
    for idx, method in enumerate(linkage_methods):
        ax2.plot(k_values, r2_results_all[idx], marker='o', linewidth=2, 
                 markersize=7, label=method.capitalize(), color=method_colors[method])
    
    ax2.set_xlabel('Number of Clusters (k)', fontweight='bold', fontsize=12)
    ax2.set_ylabel('R² (Variance Explained)', fontweight='bold', fontsize=12)
    ax2.set_title('R²: Clustering Variance Explained\n(Higher = Better)', 
                  fontweight='bold', fontsize=13, pad=15)
    ax2.set_xticks(k_values)
    ax2.grid(False)
    ax2.legend(title='Linkage Method', fontsize=10, title_fontsize=11)
    ax2.set_ylim(0, 1.0)
    
    plt.suptitle(title, fontsize=15, fontweight='bold', y=1.02)
    plt.tight_layout()
    plt.show()
    
    return ccc_df, r2_results_all

def plot_meanshift_quantile_vs_clusters(
    ms_results_df: pd.DataFrame,
    title: str = "Mean Shift: Quantile vs Number of Clusters"
) -> None:
    """
    Plot the relationship between Mean Shift bandwidth quantile and the resulting number of clusters.

    This visualization helps identify:
    - Regimes where the clustering solution is stable (plateaus in n_clusters)
    - Thresholds where clusters merge rapidly as the bandwidth increases

    Parameters
    ----------
    ms_results_df : pd.DataFrame
        DataFrame containing Mean Shift evaluation results with at least:
        - 'quantile' (float): bandwidth quantile used in estimate_bandwidth()
        - 'n_clusters' (int): number of clusters produced by MeanShift for that bandwidth
    title : str, optional
        Plot title.

    Returns
    -------
    None
        Displays the plot.
    """
    dfp = ms_results_df.sort_values("quantile")

    fig, ax = plt.subplots(figsize=(10, 5))
    ax.plot(dfp["quantile"], dfp["n_clusters"], marker="o", linewidth=2)

    ymin = int(dfp["n_clusters"].min())
    ymax = int(dfp["n_clusters"].max())
    ax.set_yticks(list(range(ymin, ymax + 1, 1)))

    ax.set_xlabel("Bandwidth Quantile", fontweight="bold")
    ax.set_ylabel("Number of Clusters", fontweight="bold")
    ax.set_title(title, fontweight="bold", pad=12)
    ax.grid(False)

    plt.tight_layout()
    plt.show()

def evaluate_gmm_grid(
    X: pd.DataFrame,
    feats: list,
    n_components_list: list,
    covariance_types: list,
    random_state: int = 1,
    n_init: int = 5,
    max_iter: int = 500
) -> pd.DataFrame:
    """
    Grid-evaluate Gaussian Mixture Models (GMM) across n_components and covariance_type.

    Parameters
    ----------
    X : pd.DataFrame
        Scaled feature matrix (e.g., df_demographic_a_scaled[feats]).
    feats : list[str]
        Feature column names (needed for R² calculation via get_rsq).
    n_components_list : list[int]
        Candidate numbers of mixture components.
    covariance_types : list[str]
        Candidate covariance types: ["full", "tied", "diag", "spherical"].
    random_state : int, default=1
        Random seed for reproducibility.
    n_init : int, default=5
        Number of initializations to reduce local optimum risk.
    max_iter : int, default=500
        Maximum EM iterations.

    Returns
    -------
    pd.DataFrame
        One row per configuration with:
        n_components, covariance_type, n_clusters, BIC, AIC, R2, Silhouette
    """
    results = []
    X_np = X[feats].values

    for cov in covariance_types:
        for k in n_components_list:
            gmm = GaussianMixture(
                n_components=k,
                covariance_type=cov,
                random_state=random_state,
                init_params="kmeans",
                n_init=n_init,
                max_iter=max_iter  # previously documented but never passed on
            )

            labels = gmm.fit_predict(X_np)
            n_clusters = len(np.unique(labels))

            # R² computed with the get_rsq helper
            df_tmp = X[feats].copy()
            df_tmp["labels"] = labels
            r2 = get_rsq(df_tmp, feats, "labels")

            # Silhouette (only defined if >= 2 clusters)
            sil = silhouette_score(X_np, labels) if n_clusters >= 2 else np.nan

            results.append({
                "n_components": int(k),
                "covariance_type": cov,
                "n_clusters": int(n_clusters),
                "BIC": float(gmm.bic(X_np)),
                "AIC": float(gmm.aic(X_np)),
                "R2": float(r2),
                "Silhouette": float(sil) if not np.isnan(sil) else np.nan
            })

    return pd.DataFrame(results)

def plot_gmm_covtype_bic_aic(gmm_results_df, gmm_cov_types, gmm_n_components):
    """
    Compare covariance_type options using BIC and AIC side-by-side (1x2).
    Each line = one covariance_type across n_components.

    Use this to pick the best covariance_type family (primarily by lowest BIC/AIC).
    """
    dfp = gmm_results_df.copy()

    fig, axes = plt.subplots(1, 2, figsize=(14, 5), sharex=True)

    for cov in gmm_cov_types:
        d = dfp[dfp["covariance_type"] == cov].sort_values("n_components")
        axes[0].plot(d["n_components"], d["BIC"], marker="o", linewidth=2, label=cov)
        axes[1].plot(d["n_components"], d["AIC"], marker="o", linewidth=2, label=cov)

    axes[0].set_title("BIC by covariance_type (Lower is Better)", fontweight="bold", pad=10)
    axes[1].set_title("AIC by covariance_type (Lower is Better)", fontweight="bold", pad=10)

    for ax, yl in zip(axes, ["BIC", "AIC"]):
        ax.set_xlabel("n_components", fontweight="bold")
        ax.set_ylabel(yl, fontweight="bold")
        ax.set_xticks(gmm_n_components)
        ax.grid(False)
        ax.legend(title="covariance_type", fontsize=9, title_fontsize=9, loc="lower left")

    plt.tight_layout()
    plt.show()


def plot_gmm_n_selection_for_covtype(gmm_results_df, chosen_covariance_type, gmm_n_components):
    """
    After selecting a winning covariance_type, plot n_components selection in 3 charts:
    1) BIC + AIC together
    2) R²
    3) Silhouette

    Legend:
    - BIC/AIC and Silhouette: lower left
    - R²: lower right
    """
    dfc = (gmm_results_df[gmm_results_df["covariance_type"] == chosen_covariance_type]
           .sort_values("n_components")
           .copy())

    fig, axes = plt.subplots(1, 3, figsize=(18, 5), sharex=True)

    # 1) BIC + AIC in one chart
    axes[0].plot(dfc["n_components"], dfc["BIC"], marker="o", linewidth=2, label="BIC")
    axes[0].plot(dfc["n_components"], dfc["AIC"], marker="o", linewidth=2, label="AIC")
    axes[0].set_title(f"{chosen_covariance_type}: BIC & AIC (Lower is Better)", fontweight="bold", pad=10)
    axes[0].set_ylabel("Score", fontweight="bold")
    axes[0].legend(loc="lower left", fontsize=9)

    # 2) R²
    axes[1].plot(dfc["n_components"], dfc["R2"], marker="o", linewidth=2, label="R²")
    axes[1].set_title(f"{chosen_covariance_type}: R² (Higher is Better)", fontweight="bold", pad=10)
    axes[1].set_ylabel("R²", fontweight="bold")
    axes[1].legend(loc="lower right", fontsize=9)

    # 3) Silhouette
    axes[2].plot(dfc["n_components"], dfc["Silhouette"], marker="o", linewidth=2, label="Silhouette")
    axes[2].set_title(f"{chosen_covariance_type}: Silhouette (Higher is Better)", fontweight="bold", pad=10)
    axes[2].set_ylabel("Silhouette", fontweight="bold")
    axes[2].legend(loc="lower left", fontsize=9)

    for ax in axes:
        ax.set_xlabel("n_components", fontweight="bold")
        ax.set_xticks(gmm_n_components)
        ax.grid(False)

    plt.tight_layout()
    plt.show()

def visualize_som_grid(som_model, values, plot_title, fig_width=9, fig_height=6, ax=None):
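    """
    Visualize SOM grid with hexagonal cells.
    
    Parameters:
    -----------
    som_model : object
        Trained SOM exposing convert_map_to_euclidean() (e.g., a MiniSom instance)
    values : np.ndarray
        2D array with one value per SOM node, used to color the cells
    plot_title : str
        Plot title
    fig_width, fig_height : float
        Figure size, used only when ax is None
    ax : matplotlib.axes.Axes, optional
        Axis to draw on. If None, a new figure is created and shown.
    
    Returns:
    --------
    None (draws on ax; displays the plot only when ax is None)
    """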
    if ax is None:
        fig, ax = plt.subplots(figsize=(fig_width, fig_height))
        show_plot = True
    else:
        show_plot = False
    
    color_norm = Normalize(vmin=values.min(), vmax=values.max())
    
    for row_idx in range(values.shape[0]):
        for col_idx in range(values.shape[1]):
            x, y = som_model.convert_map_to_euclidean((row_idx, col_idx))
            ax.add_patch(RegularPolygon(
                (x, y), numVertices=6, radius=np.sqrt(1/3),
                facecolor=cm.RdYlBu_r(color_norm(values[row_idx, col_idx])),
                edgecolor='white', linewidth=0.5
            ))
    
    ax.set_xlim(-1, values.shape[1])
    ax.set_ylim(-1, values.shape[0])
    ax.set_aspect('equal')
    ax.axis('off')
    ax.set_title(plot_title, fontsize=13 if show_plot else 11, fontweight='bold', pad=15 if show_plot else 0)
    
    sm = mpl.cm.ScalarMappable(cmap=cm.RdYlBu_r, norm=color_norm)
    sm.set_array([])
    cbar = plt.colorbar(sm, ax=ax, fraction=0.046, pad=0.04)
    if show_plot:
        cbar.set_label('Value', fontsize=10, fontweight='bold')
    
    if show_plot:
        plt.tight_layout()
        plt.show()


## **7.1 Hierarchical Clustering**

<div style="background-color: #e1e1e1ff; padding: 15px; margin-right: 30px; border-left: 5px solid; border-image: linear-gradient(to bottom, #212121, #313131, #595959, #909090) 1; border-radius: 5px; max-width: 95%;">
    <h3 style="margin-top: 0; margin-bottom: 10px; color: #000000ff; font-weight: bold;">Hierarchical Clustering Methodology</h3>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        <strong>Agglomerative hierarchical clustering</strong> is a <strong>bottom-up algorithm</strong> that builds a tree-like structure (dendrogram) by iteratively merging the closest data points or clusters. Unlike K-Means, it does not require the number of clusters to be specified in advance: the finished tree can be cut at any level, which provides a complete hierarchical view of data relationships.
    </p>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #313131; font-weight: bold;">Algorithm:</h4>
    <ol style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
        <li style="margin-bottom: 5px;"><strong>Initialization:</strong> Start with each data point as its own cluster (n clusters for n points)</li>
        <li style="margin-bottom: 5px;"><strong>Distance Calculation:</strong> Compute pairwise distances between all clusters using a linkage criterion</li>
        <li style="margin-bottom: 5px;"><strong>Merge Step:</strong> Iteratively merge the two closest clusters into one larger cluster</li>
        <li style="margin-bottom: 5px;"><strong>Repeat:</strong> Continue until all points are merged into a single cluster, forming a hierarchical tree (dendrogram)</li>
    </ol>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #313131; font-weight: bold;">Workflow for This Analysis:</h4>
    <ol style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
        <li style="margin-bottom: 5px;"><strong>Linkage Method Selection:</strong> Test Ward, Complete, Average, and Single linkage methods using CCC (Cophenetic Correlation Coefficient) and R² (Variance Explained)</li>
        <li style="margin-bottom: 5px;"><strong>Optimal k Selection:</strong> For the best linkage method (Ward), identify optimal k using Silhouette, Calinski-Harabasz, Davies-Bouldin indices and dendrogram visual inspection</li>
        <li style="margin-bottom: 5px;"><strong>Solution Comparison:</strong> Compare k=3 vs k=7 solutions by examining cluster size distributions and feature profiles</li>
        <li style="margin-bottom: 5px;"><strong>Final Model & Profiling:</strong> Fit final Ward linkage model with k=3 and analyze demographic characteristics of each cluster</li>
    </ol>
</div>
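
The four algorithm steps above can be illustrated on a handful of points. The following is a minimal sketch on made-up toy data, not the project dataset; the names `points`, `merge_history`, and `toy_labels` are illustrative assumptions. It relies on SciPy's `linkage`, which records every merge, and `fcluster`, which cuts the finished tree into a chosen number of clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Toy data (illustrative only): two tight groups of 2-D points
points = np.array([
    [0.0, 0.0], [0.1, 0.2], [0.2, 0.1],   # group near the origin
    [5.0, 5.0], [5.1, 4.9], [4.9, 5.2],   # group near (5, 5)
])

# Start from n singleton clusters and repeatedly merge the two closest
# clusters until one remains. Each row of the result describes one merge:
# [cluster_a, cluster_b, merge_distance, size_of_new_cluster]
merge_history = linkage(points, method="ward", metric="euclidean")

# n points always produce n - 1 merges
n_merges = merge_history.shape[0]

# The number of clusters is chosen only now, by cutting the finished tree
toy_labels = fcluster(merge_history, t=2, criterion="maxclust")

print(n_merges)
print(toy_labels)
```

Because the full merge history is stored, trying a different number of clusters only requires another `fcluster` call, not a new fit.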


### **7.1.1 Finding the best Linkage Method**

In [None]:
# Compare linkage methods using two complementary metrics:
# 1. CCC (Cophenetic Correlation Coefficient)
# 2. R²

linkage_methods = ['ward', 'complete', 'average', 'single']

# Use plot_linkage_comparison
ccc_df, r2_results_all = plot_linkage_comparison(
    df=df_demographic_a_scaled,
    linkage_methods=linkage_methods,
    palette=CUSTOM_HEX,
    title='Linkage Method Comparison'
)

<div style="background-color: #e8f3e5ff; padding: 15px; margin-right: 30px; border-left: 5px solid; border-image: linear-gradient(to bottom, #00411E, #00622D, #00823C, #45AF28, #82BA72) 1; border-radius: 5px; max-width: 95%;">
    <h3 style="margin-top: 0; margin-bottom: 10px; color: #00411E; font-weight: bold;">Linkage Method Selection Summary</h3>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        Compared four linkage methods (Ward, Complete, Average, Single) using dendrogram quality and clustering performance metrics to identify the optimal approach for hierarchical clustering.
    </p>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Goal:</h4>
    <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
        <li style="margin-right: 20px;">Determine which linkage criterion best preserves hierarchical structure while maximizing clustering quality</li>
    </ul>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Methodology:</h4>
    <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
        <li style="margin-right: 20px;"><strong>Linkage Methods Tested:</strong> Ward (minimizes within-cluster variance), Complete (maximum distance), Average (mean distance), Single (minimum distance)</li>
        <li style="margin-right: 20px;"><strong>CCC (Cophenetic Correlation Coefficient):</strong> Measures how faithfully the dendrogram preserves pairwise distances (higher is better)</li>
        <li style="margin-right: 20px;"><strong>R² (Variance Explained):</strong> Proportion of variance explained by clustering across k=2-10 (higher is better)</li>
    </ul>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Findings:</h4>
    <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
        <li style="margin-right: 20px;"><strong>CCC Results:</strong> Average (0.680) achieves highest dendrogram preservation, followed by Complete (0.596), Ward (0.578), and Single (0.527)</li>
        <li style="margin-right: 20px;"><strong>R² Results:</strong> Ward consistently explains the most variance across all k values, outperforming other methods</li>
        <li style="margin-right: 20px;"><strong>Decision:</strong> Selected Ward linkage. Although Average has the highest CCC, Ward's consistently higher R² shows that its clusters are more compact and explain more of the variance, which matters more for customer segmentation than reproducing the original pairwise distances</li>
    </ul>
</div>
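
To make the two metrics concrete, the sketch below computes both on synthetic data. It is an illustration under stated assumptions, not the notebook's own `compute_cophenetic_correlation` or `get_r2_hc` helpers: `toy_X`, `toy_r2`, and `toy_scores` are hypothetical names, and R² is taken here as one minus the within-cluster sum of squares over the total sum of squares, which is assumed to match the definition used by the helpers.

```python
import numpy as np
from scipy.cluster.hierarchy import cophenet, fcluster, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)

# Toy data (illustrative only): three Gaussian blobs in 2-D
toy_X = np.vstack([
    rng.normal(loc=center, scale=0.5, size=(30, 2))
    for center in ([0, 0], [4, 0], [2, 4])
])


def toy_r2(X, labels):
    """Share of total variance explained by the partition (1 - SSW / SST)."""
    sst = ((X - X.mean(axis=0)) ** 2).sum()
    ssw = sum(
        ((X[labels == c] - X[labels == c].mean(axis=0)) ** 2).sum()
        for c in np.unique(labels)
    )
    return 1 - ssw / sst


# Original pairwise distances, needed as the reference for the CCC
pairwise = pdist(toy_X, metric="euclidean")

toy_scores = {}
for method in ["ward", "complete", "average", "single"]:
    Z = linkage(toy_X, method=method, metric="euclidean")
    # CCC: correlation between original and dendrogram (cophenetic) distances
    ccc, _ = cophenet(Z, pairwise)
    # R²: evaluated on the partition obtained by cutting the tree at k=3
    labels_k3 = fcluster(Z, t=3, criterion="maxclust")
    toy_scores[method] = {"CCC": float(ccc), "R2": float(toy_r2(toy_X, labels_k3))}

for method, scores in toy_scores.items():
    print(method, scores)
```

The two metrics answer different questions: CCC scores the whole tree against the raw distances, while R² scores one specific cut of that tree, so a method can lead on one and trail on the other.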


### **7.1.2 Defining the number of clusters**

In [None]:
# Determine optimal k using multiple evaluation metrics
# Ward linkage selected in the previous linkage comparison (highest R²; Average had the highest CCC)

k_range = range(2, 11)
hc_metrics = {
    'k': [],
    'Silhouette': [],
    'Calinski-Harabasz': [],
    'Davies-Bouldin': []
}

for k in k_range:
    hc = AgglomerativeClustering(n_clusters=k, linkage='ward', metric='euclidean')
    labels = hc.fit_predict(df_demographic_a_scaled)
    
    metrics = evaluate_clustering_metrics(df_demographic_a_scaled, labels)
    
    hc_metrics['k'].append(k)
    hc_metrics['Silhouette'].append(metrics['Silhouette Score'])
    hc_metrics['Calinski-Harabasz'].append(metrics['Calinski-Harabasz Index'])
    hc_metrics['Davies-Bouldin'].append(metrics['Davies-Bouldin Index'])
    

# Create metrics DataFrame
hc_metrics_df = pd.DataFrame(hc_metrics)

In [None]:
# Visualize clustering metrics
plot_clustering_metrics(
    hc_metrics_df,
    k_range,
    CUSTOM_HEX,
    title='Hierarchical Clustering: Optimal k Evaluation (Ward Linkage)'
)

# Display metrics table
hc_metrics_df

In [None]:
# Dendrogram visualization for visual confirmation of cluster structure

linkage_matrix = linkage(df_demographic_a_scaled, method='ward', metric='euclidean')

fig, ax = plt.subplots(figsize=(14, 6))

dendrogram(
    linkage_matrix,
    ax=ax,
    truncate_mode='lastp',
    p=30,
    leaf_font_size=10,
    show_leaf_counts=True,
    color_threshold=0.7*max(linkage_matrix[:,2])
)

ax.set_title('Hierarchical Clustering Dendrogram (Ward Linkage)\nVisual Confirmation of Natural Grouping Structure', 
             fontweight='bold', fontsize=14, pad=15)
ax.set_xlabel('Sample Index or Cluster Size', fontsize=11, fontweight='bold')
ax.set_ylabel('Euclidean Distance', fontsize=11, fontweight='bold')
ax.grid(False)

plt.tight_layout()
plt.show()

<div style="background-color: #e8f3e5ff; padding: 15px; margin-right: 30px; border-left: 5px solid; border-image: linear-gradient(to bottom, #00411E, #00622D, #00823C, #45AF28, #82BA72) 1; border-radius: 5px; max-width: 95%;">
    <h3 style="margin-top: 0; margin-bottom: 10px; color: #00411E; font-weight: bold;">Optimal k Selection Summary (Hierarchical Clustering)</h3>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        Multi-metric approach combining internal validation indices and dendrogram visual inspection to determine optimal k for hierarchical clustering with Ward linkage.
    </p>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Goal:</h4>
    <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
        <li style="margin-right: 20px;">Identify the number of clusters (k) that maximizes cluster quality while maintaining business interpretability</li>
    </ul>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Methodology:</h4>
    <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
        <li style="margin-right: 20px;"><strong>Silhouette Score:</strong> Measures how similar data points are to their own cluster vs. neighboring clusters (range -1 to 1, higher is better)</li>
        <li style="margin-right: 20px;"><strong>Calinski-Harabasz Index:</strong> Ratio of between-cluster to within-cluster variance (higher values indicate better-defined clusters)</li>
        <li style="margin-right: 20px;"><strong>Davies-Bouldin Index:</strong> Average similarity between each cluster and its most similar cluster (lower values indicate better separation)</li>
        <li style="margin-right: 20px;"><strong>Dendrogram Analysis:</strong> Visual inspection of hierarchical tree structure to identify natural cluster boundaries</li>
    </ul>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Findings:</h4>
    <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
        <li style="margin-right: 20px;"><strong>Silhouette Score:</strong> Increases steadily from k=2 (0.186) to k=10 (0.214), with notable improvement at k=7 (0.200)</li>
        <li style="margin-right: 20px;"><strong>Calinski-Harabasz Peak:</strong> Maximum at k=3 (2779), then declining; k=7 (2176) still maintains reasonable separation</li>
        <li style="margin-right: 20px;"><strong>Davies-Bouldin Minimum:</strong> Best separation at k=10 (1.55); k=4 (1.74) and k=7 (1.70) show good mid-range performance</li>
        <li style="margin-right: 20px;"><strong>Candidate Selection:</strong> k=3 and k=7 emerge as candidates - k=3 achieves highest Calinski-Harabasz (2779) indicating strong cluster separation, while k=7 provides better Silhouette (0.200) with improved DBI (1.70) for more granular segmentation</li>
    </ul>
</div>
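
The three indices described above are available in scikit-learn. The following sketch shows how they could be collected for a range of k on synthetic data; it does not use the notebook's `evaluate_clustering_metrics` helper, and `toy_X`, `toy_rows`, and `best_k_sil` are hypothetical names introduced for illustration.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

rng = np.random.default_rng(42)

# Toy data (illustrative only): three clearly separated blobs in 2-D
toy_X = np.vstack([
    rng.normal(loc=center, scale=0.5, size=(30, 2))
    for center in ([0, 0], [6, 0], [3, 6])
])

toy_rows = []
for k in range(2, 6):
    # Ward linkage with the default Euclidean distance
    toy_labels = AgglomerativeClustering(n_clusters=k, linkage="ward").fit_predict(toy_X)
    toy_rows.append({
        "k": k,
        "Silhouette": silhouette_score(toy_X, toy_labels),                # higher is better
        "Calinski-Harabasz": calinski_harabasz_score(toy_X, toy_labels),  # higher is better
        "Davies-Bouldin": davies_bouldin_score(toy_X, toy_labels),        # lower is better
    })

# k preferred by the Silhouette score alone
best_k_sil = max(toy_rows, key=lambda row: row["Silhouette"])["k"]

for row in toy_rows:
    print(row)
print("Best k by Silhouette:", best_k_sil)
```

On clean synthetic blobs the indices tend to agree; on real customer data they often disagree, as in the findings above, which is why the candidates are compared on interpretability as well.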


### **7.1.3 Comparison of Clustering Solutions**


In [None]:
# Compare two candidate solutions based on clustering metrics analysis
# k=3
# k=7

hc_k_candidate_1 = 3
hc_k_candidate_2 = 7

# Fit both candidate solutions
hc_k1 = AgglomerativeClustering(n_clusters=hc_k_candidate_1, linkage='ward', metric='euclidean')
hc_k2 = AgglomerativeClustering(n_clusters=hc_k_candidate_2, linkage='ward', metric='euclidean')

hc_labels_k1 = hc_k1.fit_predict(df_demographic_a_scaled)
hc_labels_k2 = hc_k2.fit_predict(df_demographic_a_scaled)

# Create temporary DataFrames with cluster labels
df_temp_k1 = df_demographic_a_scaled.copy()
df_temp_k1['Cluster'] = hc_labels_k1

df_temp_k2 = df_demographic_a_scaled.copy()
df_temp_k2['Cluster'] = hc_labels_k2

# Calculate cluster profiles (mean values per cluster)
cluster_profiles_k1 = df_temp_k1.groupby('Cluster').mean()
cluster_profiles_k2 = df_temp_k2.groupby('Cluster').mean()

# Display both profiles for comparison
print(f"\nCluster Profiles for k={hc_k_candidate_1}:")
display(cluster_profiles_k1.round(3))

print(f"\n\nCluster Profiles for k={hc_k_candidate_2}:")
display(cluster_profiles_k2.round(3))

# Visualize cluster size comparison
plot_cluster_size_comparison(
    labels_dict={hc_k_candidate_1: hc_labels_k1, hc_k_candidate_2: hc_labels_k2},
    palette=CUSTOM_HEX,
    title='Hierarchical Clustering: Cluster Size Comparison'
)

<div style="background-color: #e8f3e5ff; padding: 15px; margin-right: 30px; border-left: 5px solid; border-image: linear-gradient(to bottom, #00411E, #00622D, #00823C, #45AF28, #82BA72) 1; border-radius: 5px; max-width: 95%;">
    <h3 style="margin-top: 0; margin-bottom: 10px; color: #00411E; font-weight: bold;">Comparison of Clustering Solutions Summary</h3>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        Compared k=3 and k=7 candidate solutions to determine the optimal granularity for customer segmentation.
    </p>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        <strong>Decision: k=3 selected</strong>
    </p>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        k=3 achieves the highest Calinski-Harabasz Index (2779), the best ratio of between-cluster to within-cluster dispersion among the tested values of k. Although k=7 scores better on Silhouette (0.200 vs 0.190) and DBI (1.70 vs 1.83), its additional clusters differ on only a few features, while most features remain similar across clusters. k=3 yields more distinct, interpretable segments with balanced cluster sizes (33.1%, 27.7%, 39.2%), each with a clearer demographic profile, which supports actionable marketing strategies without over-segmentation.
    </p>
</div>
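
Since both candidates are cuts of the same Ward tree, every cluster of the finer solution should sit entirely inside one cluster of the coarser solution. A cross-tabulation makes it easy to see which coarse segments are subdivided. The sketch below demonstrates this on random toy data; `toy_X`, `coarse`, `fine`, and `nesting` are hypothetical names, and SciPy's `fcluster` is used so that both cuts provably come from one linkage matrix.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(7)

# Toy data (illustrative only): 200 observations, 4 standardized features
toy_X = rng.normal(size=(200, 4))

# One Ward tree, cut at two different levels
toy_Z = linkage(toy_X, method="ward")
coarse = fcluster(toy_Z, t=3, criterion="maxclust")
fine = fcluster(toy_Z, t=7, criterion="maxclust")

# Rows: coarse clusters, columns: fine clusters, cells: number of observations
nesting = pd.crosstab(pd.Series(coarse, name="k=3"), pd.Series(fine, name="k=7"))

# For a nested pair of cuts, each fine cluster has exactly one coarse parent
parents_per_fine = (nesting > 0).sum(axis=0)

print(nesting)
print(parents_per_fine)
```

Applied to the actual labels, the same table would show which of the three segments the seven-cluster solution splits further, which helps judge whether the extra clusters add usable detail.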


### **7.1.4 Final Hierarchical Clustering Solution**

In [None]:
# Final Hierarchical Clustering Solution
# Using k=3 from comparison analysis

hc_final_k = hc_k_candidate_1  # k=3

# Reuse labels from comparison step
hc_labels_final = hc_labels_k1

# Create labeled dataset
df_demographic_a_scaled_labeled = df_demographic_a_scaled.copy()
df_demographic_a_scaled_labeled['Cluster'] = hc_labels_final

# Calculate final metrics
final_metrics = evaluate_clustering_metrics(df_demographic_a_scaled, hc_labels_final)

# Store for final comparison (Section 9)
if 'demo_clustering_results' not in globals():
    demo_clustering_results = {}

demo_clustering_results['Hierarchical'] = {
    'k': hc_final_k,
    'Silhouette': final_metrics['Silhouette Score'],
    'Calinski-Harabasz': final_metrics['Calinski-Harabasz Index'],
    'Davies-Bouldin': final_metrics['Davies-Bouldin Index'],
    'R2': get_rsq(df_demographic_a_scaled_labeled, df_demographic_a_scaled.columns.tolist(), 'Cluster'),
    'labels': hc_labels_final
}


### **7.1.5 Cluster Profiling**

In [None]:
# 1. Cluster Profiles Heatmap - Z-scores of demographic features per cluster
feats = df_demographic_a_scaled.columns.tolist()
hc_cluster_profiles = df_demographic_a_scaled_labeled.groupby('Cluster')[feats].mean()
population_mean = df_demographic_a_scaled[feats].mean()

plot_cluster_profiles_heatmap(
    hc_cluster_profiles, 
    population_mean, 
    GROUP80_palette_continuous,
    title='Hierarchical Clustering: Demographic Profiles (k=3)\nStandardized Z-Scores per Cluster'
)

In [None]:
# 2. Cluster Size Distribution
plot_cluster_sizes(
    hc_labels_final, 
    hc_final_k, 
    CUSTOM_HEX,
    title='Hierarchical Clustering - Final Cluster Sizes'
)

# Display cluster size statistics
cluster_sizes = pd.Series(hc_labels_final).value_counts().sort_index()
cluster_dist_df = pd.DataFrame({
    'Cluster': cluster_sizes.index,
    'Count': cluster_sizes.values,
    'Percentage': (cluster_sizes.values / len(hc_labels_final) * 100).round(2)
})

cluster_dist_df


In [None]:
# 3. Feature Importance Analysis - Variance across clusters
# Features with high variance differentiate clusters most effectively

hc_feature_variance = hc_cluster_profiles.var(axis=0).sort_values(ascending=False)

plot_feature_importance(
    hc_feature_variance,
    CUSTOM_HEX,
    title='Hierarchical Clustering: Feature Importance Analysis\nWhich Demographics Differentiate Clusters?'
)

# Display feature importance ranking
hc_feature_importance_df = pd.DataFrame({
    'Feature': hc_feature_variance.index,
    'Variance': hc_feature_variance.values.round(4),
})

hc_feature_importance_df
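
As a sanity check on how this ranking is computed, the sketch below recomputes the variance from the rounded cluster means quoted in this section's profiling summary (Education: +0.30, -1.35, +0.70; Income: +0.45, -1.14, +0.43). The names `profile_check` and `variance_check` are hypothetical, and because the inputs are rounded, the results only approximate the reported variances (about 1.18 and 0.83 versus the reported 1.19 and 0.83).

```python
import pandas as pd

# Rounded cluster means (Z-scores) as quoted in the profiling summary
profile_check = pd.DataFrame(
    {
        "Education": [0.30, -1.35, 0.70],
        "Income": [0.45, -1.14, 0.43],
    },
    index=pd.Index([0, 1, 2], name="Cluster"),
)

# Same operation as in the cell above: sample variance (ddof=1) of the
# cluster means, computed separately for each feature
variance_check = profile_check.var(axis=0).sort_values(ascending=False)

print(variance_check.round(2))
```

Note that this measure weights every cluster equally regardless of its size, so it ranks features by how far the cluster centroids spread apart, not by the share of customer-level variance they explain.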

<div style="background-color: #e8f3e5ff; padding: 15px; margin-right: 30px; border-left: 5px solid; border-image: linear-gradient(to bottom, #00411E, #00622D, #00823C, #45AF28, #82BA72) 1; border-radius: 5px; max-width: 95%;">
    <h3 style="margin-top: 0; margin-bottom: 10px; color: #00411E; font-weight: bold;">Hierarchical Clustering Profiling Summary (k=3)</h3>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        Analyzed demographic characteristics of the final 3 hierarchical clusters to identify distinct customer segments. Ward linkage produces well-separated clusters with Education, Income, and City as primary differentiators.
    </p>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Key Findings:</h4>
    <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
        <li style="margin-right: 20px;"><strong>Cluster 0 (33.1%, n=4,314) - Common Regions, Higher-Income Higher-Education:</strong> Common cities (Z=+0.98), common FSA regions (Z=+0.74), common provinces (Z=+0.47), higher education (Z=+0.30), higher income (Z=+0.45).</li>
        <li style="margin-right: 20px;"><strong>Cluster 1 (27.7%, n=3,610) - Lower-Income Lower-Education:</strong> Significantly lower education (Z=-1.35), significantly lower income (Z=-1.14), average geographic frequency, married rate above the population mean (Z=+0.38).</li>
        <li style="margin-right: 20px;"><strong>Cluster 2 (39.2%, n=5,114) - Rare Regions, Higher-Education:</strong> Rare provinces (Z=-0.38), rare cities (Z=-0.77), rare FSA regions (Z=-0.45), higher education (Z=+0.70), higher income (Z=+0.43).</li>
        <li style="margin-right: 20px;"><strong>Primary Segmentation Driver - Education (Variance: 1.19):</strong> Education creates the clearest separation between clusters, with Cluster 1 showing significantly lower education levels.</li>
        <li style="margin-right: 20px;"><strong>Secondary Drivers - Income (Variance: 0.83) & City (Variance: 0.77):</strong> Income strongly correlates with education, while city frequency differentiates mainstream (Cluster 0) from niche geographic segments.</li>
        <li style="margin-right: 20px;"><strong>Gender Independence (Variance: 0.0003):</strong> Gender shows virtually no differentiation across hierarchical clusters.</li>
    </ul>
</div>
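The feature importance ranking in this section is the variance of the per-cluster feature means. The sketch below reproduces that calculation on toy data. The function name `cluster_mean_variance` and the toy arrays are illustrative assumptions, not part of the notebook's helper library.

```python
import numpy as np

def cluster_mean_variance(X, labels):
    """Variance of the per-cluster feature means (sample variance, ddof=1)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # One row of feature means per cluster
    means = np.vstack([X[labels == c].mean(axis=0) for c in np.unique(labels)])
    # ddof=1 mirrors the pandas default used by DataFrame.var()
    return means.var(axis=0, ddof=1)

# Hypothetical standardized data: feature 0 separates the clusters, feature 1 does not
X_toy = np.array([
    [-1.2,  0.1], [-1.0, -0.1],   # cluster 0
    [ 0.0,  0.1], [ 0.2, -0.1],   # cluster 1
    [ 1.1,  0.1], [ 0.9, -0.1],   # cluster 2
])
labels_toy = np.array([0, 0, 1, 1, 2, 2])

importance = cluster_mean_variance(X_toy, labels_toy)
print(importance)
```

Note that this measure weights every cluster equally regardless of its size, so a small cluster with an extreme mean can dominate the ranking.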


## **7.2 K-Means Clustering**

<div style="background-color: #e1e1e1ff; padding: 15px; margin-right: 30px; border-left: 5px solid; border-image: linear-gradient(to bottom, #212121, #313131, #595959, #909090) 1; border-radius: 5px; max-width: 95%;">
    <h3 style="margin-top: 0; margin-bottom: 10px; color: #000000ff; font-weight: bold;">K-Means Clustering Methodology</h3>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        K-Means is an <strong>iterative partitioning algorithm</strong> that assigns data points to k clusters by minimizing within-cluster variance (Sum of Squared Errors). Unlike hierarchical clustering, it requires pre-specifying k and uses a centroid-based approach to create spherical, compact clusters.
    </p>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #313131; font-weight: bold;">Algorithm:</h4>
    <ol style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
        <li style="margin-bottom: 5px;"><strong>Choose Seeds:</strong> Select k initial centroids (using k-means++ for better starting positions)</li>
        <li style="margin-bottom: 5px;"><strong>Assignment:</strong> Associate each data point with the nearest seed/centroid based on Euclidean distance</li>
        <li style="margin-bottom: 5px;"><strong>Update Centroids:</strong> Calculate the centroids of the formed clusters as the mean of all assigned points</li>
        <li style="margin-bottom: 5px;"><strong>Iterate:</strong> Return to step 2 and repeat until the centroids stop moving (convergence)</li>
    </ol>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #313131; font-weight: bold;">Workflow for This Analysis:</h4>
    <ol style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
        <li style="margin-bottom: 5px;"><strong>Optimal k Selection:</strong> Test k=2-13 using Elbow Method (Inertia/SSE), Silhouette Score, Calinski-Harabasz Index, and Davies-Bouldin Index</li>
        <li style="margin-bottom: 5px;"><strong>Solution Comparison:</strong> Compare k=3 vs k=5 solutions by examining cluster sizes and feature profiles</li>
        <li style="margin-bottom: 5px;"><strong>Final Model & Profiling:</strong> Fit final K-Means model with k=3 (k-means++ initialization) and analyze demographic characteristics of each cluster</li>
    </ol>
</div>
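The four algorithm steps above can be sketched in a few lines of NumPy. This is a simplified illustration, not the scikit-learn implementation used in the analysis: the names `nearest_centroid` and `kmeans_sketch` are hypothetical, and the seeds are passed in explicitly rather than chosen with k-means++.

```python
import numpy as np

def nearest_centroid(X, centroids):
    """Index of the closest centroid (Euclidean distance) for every row of X."""
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def kmeans_sketch(X, init_centroids, max_iter=300):
    """Plain Lloyd iterations starting from user-supplied seeds."""
    X = np.asarray(X, dtype=float)
    centroids = np.asarray(init_centroids, dtype=float).copy()  # step 1: seeds
    for _ in range(max_iter):
        labels = nearest_centroid(X, centroids)                 # step 2: assignment
        new_centroids = centroids.copy()
        for j in range(len(centroids)):                         # step 3: update
            members = X[labels == j]
            if len(members) > 0:   # keep the old centroid if a cluster empties
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):               # step 4: converged
            break
        centroids = new_centroids
    labels = nearest_centroid(X, centroids)
    inertia = float(((X - centroids[labels]) ** 2).sum())       # SSE
    return labels, centroids, inertia

# Hypothetical toy data: two tight groups of three points each
X_toy = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)

# Both seeds deliberately start inside the first group
labels_toy, centroids_toy, inertia_toy = kmeans_sketch(X_toy, X_toy[[0, 1]])
print(labels_toy, inertia_toy)
```

Plain Lloyd iterations depend on the starting seeds and can settle in a poor local optimum, which is why the analysis relies on k-means++ initialization with several restarts (`n_init`).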


### **7.2.1 Defining the number of clusters**

In [None]:
# Evaluate K-Means clustering across range of k values
# Using k-means++ initialization and multiple runs for stability

km_k_range = range(2, 14)
km_metrics = {
    'k': [],
    'Inertia': [],
    'Silhouette': [],
    'Calinski-Harabasz': [],
    'Davies-Bouldin': []
}

# Store labels and silhouette samples for visualization
km_fitted_labels = {}
km_silhouette_samples = {}

for k in km_k_range:
    kmeans = KMeans(n_clusters=k, init='k-means++', n_init=15, random_state=42, max_iter=300)
    km_labels = kmeans.fit_predict(df_demographic_a_scaled)
    
    # Store labels and silhouette samples
    km_fitted_labels[k] = km_labels
    km_silhouette_samples[k] = silhouette_samples(df_demographic_a_scaled, km_labels)
    
    metrics = evaluate_clustering_metrics(df_demographic_a_scaled, km_labels)
    
    km_metrics['k'].append(k)
    km_metrics['Inertia'].append(kmeans.inertia_)
    km_metrics['Silhouette'].append(metrics['Silhouette Score'])
    km_metrics['Calinski-Harabasz'].append(metrics['Calinski-Harabasz Index'])
    km_metrics['Davies-Bouldin'].append(metrics['Davies-Bouldin Index'])

km_metrics_df = pd.DataFrame(km_metrics)

In [None]:
# Visualize Elbow Method
plot_elbow_method(
    km_k_range,
    km_metrics_df['Inertia'].tolist(),
    CUSTOM_HEX,
    title='K-Means Elbow Method: Optimal k Selection'
)

In [None]:
# Silhouette Analysis

for nclus in km_k_range:
    fig, ax = plt.subplots(figsize=(12, 7))
    
    km_labels = km_fitted_labels[nclus]
    sample_silhouette_values = km_silhouette_samples[nclus]
    silhouette_avg = km_metrics_df[km_metrics_df['k'] == nclus]['Silhouette'].values[0]
    
    y_lower = 10
    for i in range(nclus):
        ith_cluster_silhouette_values = sample_silhouette_values[km_labels == i]
        ith_cluster_silhouette_values.sort()
        
        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i
        
        color = CUSTOM_HEX[i % len(CUSTOM_HEX)]
        ax.fill_betweenx(np.arange(y_lower, y_upper),
                         0, ith_cluster_silhouette_values,
                         facecolor=color, edgecolor=color, alpha=0.7)
        
        ax.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i), fontweight='bold', fontsize=11)
        y_lower = y_upper + 10
    
    ax.set_title(f"K-Means Silhouette Analysis for k={nclus}\nAverage Silhouette Score: {silhouette_avg:.4f}", 
                 fontsize=14, fontweight='bold', pad=15)
    ax.set_xlabel("Silhouette Coefficient Values", fontsize=12, fontweight='bold')
    ax.set_ylabel("Cluster", fontsize=12, fontweight='bold')
    
    ax.axvline(x=silhouette_avg, color="red", linestyle="--", linewidth=2.5, 
               label=f'Average: {silhouette_avg:.4f}')
    
    xmin = max(-0.3, np.round(sample_silhouette_values.min() - 0.1, 2))
    xmax = min(1.0, np.round(sample_silhouette_values.max() + 0.1, 2))
    ax.set_xlim([xmin, xmax])
    ax.set_ylim([0, len(df_demographic_a_scaled) + (nclus + 1) * 10])
    
    ax.set_yticks([])
    ax.set_xticks(np.arange(xmin, xmax + 0.1, 0.1))
    ax.legend(loc='upper right', fontsize=11)
    ax.grid(True, alpha=0.3, axis='x')
    
    plt.tight_layout()
    plt.show()

In [None]:
# Visualize clustering metrics
plot_clustering_metrics(
    km_metrics_df,
    km_k_range,
    CUSTOM_HEX,
    title='K-Means Clustering: Optimal k Evaluation'
)

# Display metrics table
km_metrics_df

<div style="background-color: #e8f3e5ff; padding: 15px; margin-right: 30px; border-left: 5px solid; border-image: linear-gradient(to bottom, #00411E, #00622D, #00823C, #45AF28, #82BA72) 1; border-radius: 5px; max-width: 95%;">
    <h3 style="margin-top: 0; margin-bottom: 10px; color: #00411E; font-weight: bold;">Optimal k Selection Summary (K-Means Clustering)</h3>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        Evaluated k=2 to k=13 using multiple validation indices to identify the optimal number of clusters for K-Means partitioning of airline customer demographics.
    </p>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Goal:</h4>
    <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
        <li style="margin-right: 20px;">Determine k that balances cluster quality metrics with business interpretability for actionable customer segmentation</li>
    </ul>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Methodology:</h4>
    <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
        <li style="margin-right: 20px;"><strong>Elbow Method (Inertia):</strong> Total within-cluster sum of squared distances - look for "elbow" where adding clusters yields diminishing returns</li>
        <li style="margin-right: 20px;"><strong>Silhouette Score:</strong> Measures cluster cohesion and separation (range -1 to 1, higher is better)</li>
        <li style="margin-right: 20px;"><strong>Calinski-Harabasz Index:</strong> Ratio of between-cluster to within-cluster variance (higher is better)</li>
        <li style="margin-right: 20px;"><strong>Davies-Bouldin Index:</strong> Average similarity between clusters (lower is better)</li>
    </ul>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Findings:</h4>
    <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
        <li style="margin-right: 20px;"><strong>k=3 - Statistical Optimum:</strong> Clear elbow in Inertia curve, peak Calinski-Harabasz (3071), good Silhouette (0.203), DBI (1.74).</li>
        <li style="margin-right: 20px;"><strong>k=5 - Balanced Solution:</strong> Good Inertia reduction, good DBI (1.59), solid Silhouette (0.207), reasonable CH (2594).</li>
        <li style="margin-right: 20px;"><strong>Candidate Selection:</strong> k=3 vs k=5 - k=3 provides statistically optimal broad segments, k=5 offers better cluster separation (lowest DBI) with moderate granularity.</li>
    </ul>
</div>
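To make the three label-based indices concrete, the sketch below computes them by hand with NumPy on toy data. It is an illustration of the definitions only: the function `validation_indices` is a hypothetical stand-in, and the notebook's own `evaluate_clustering_metrics` helper may be implemented differently (for example, by calling scikit-learn).

```python
import numpy as np

def validation_indices(X, labels):
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    n, k = len(X), len(clusters)
    centroids = np.vstack([X[labels == c].mean(axis=0) for c in clusters])
    sizes = np.array([(labels == c).sum() for c in clusters])

    # Calinski-Harabasz: between-cluster vs within-cluster dispersion
    ssb = (sizes * ((centroids - X.mean(axis=0)) ** 2).sum(axis=1)).sum()
    ssw = sum(((X[labels == c] - centroids[i]) ** 2).sum()
              for i, c in enumerate(clusters))
    ch = (ssb / (k - 1)) / (ssw / (n - k))

    # Davies-Bouldin: worst scatter-to-separation ratio, averaged over clusters
    scatter = np.array([np.linalg.norm(X[labels == c] - centroids[i], axis=1).mean()
                        for i, c in enumerate(clusters)])
    sep = np.linalg.norm(centroids[:, None] - centroids[None, :], axis=2)
    np.fill_diagonal(sep, np.inf)   # ignore a cluster's distance to itself
    db = ((scatter[:, None] + scatter[None, :]) / sep).max(axis=1).mean()

    # Silhouette: (b - a) / max(a, b) per sample, then averaged
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)   # O(n^2) memory
    sil = np.zeros(n)
    for i in range(n):
        same = labels == labels[i]
        if same.sum() == 1:
            continue                # singleton clusters score 0 by convention
        a = D[i, same].sum() / (same.sum() - 1)            # mean intra-cluster distance
        b = min(D[i, labels == c].mean() for c in clusters if c != labels[i])
        sil[i] = (b - a) / max(a, b)

    return {"Silhouette": sil.mean(), "Calinski-Harabasz": ch, "Davies-Bouldin": db}

# Hypothetical toy data: two tight groups of three points each
X_toy = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)

good = validation_indices(X_toy, [0, 0, 0, 1, 1, 1])   # labels follow the groups
bad = validation_indices(X_toy, [0, 1, 0, 1, 0, 1])    # labels cut across the groups
print(good)
print(bad)
```

The labelling that follows the true groups scores higher on Silhouette and Calinski-Harabasz and lower on Davies-Bouldin, which is the direction of "better" listed in the methodology above.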


### **7.2.2 Comparison of Clustering Solutions**


In [None]:
# Compare two candidate solutions based on clustering metrics analysis
# k=3
# k=5

km_k_candidate_1 = 3
km_k_candidate_2 = 5

# Use pre-fitted labels from 7.2.1
km_labels_k1 = km_fitted_labels[km_k_candidate_1]
km_labels_k2 = km_fitted_labels[km_k_candidate_2]

# Create temporary DataFrames with cluster labels
df_temp_k1 = df_demographic_a_scaled.copy()
df_temp_k1['Cluster'] = km_labels_k1

df_temp_k2 = df_demographic_a_scaled.copy()
df_temp_k2['Cluster'] = km_labels_k2

# Calculate cluster profiles (mean values per cluster)
cluster_profiles_k1 = df_temp_k1.groupby('Cluster').mean()
cluster_profiles_k2 = df_temp_k2.groupby('Cluster').mean()

# Display both profiles for comparison
print(f"\nCluster Profiles for k={km_k_candidate_1}:")
display(cluster_profiles_k1.round(3))

print(f"\n\nCluster Profiles for k={km_k_candidate_2}:")
display(cluster_profiles_k2.round(3))

# Visualize cluster size comparison
plot_cluster_size_comparison(
    labels_dict={km_k_candidate_1: km_labels_k1, km_k_candidate_2: km_labels_k2},
    palette=CUSTOM_HEX,
    title='K-Means Clustering: Cluster Size Comparison (k=3 vs k=5)'
)


<div style="background-color: #e8f3e5ff; padding: 15px; margin-right: 30px; border-left: 5px solid; border-image: linear-gradient(to bottom, #00411E, #00622D, #00823C, #45AF28, #82BA72) 1; border-radius: 5px; max-width: 95%;">
    <h3 style="margin-top: 0; margin-bottom: 10px; color: #00411E; font-weight: bold;">Comparison of Clustering Solutions Summary</h3>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        <strong>Decision: k=3 selected</strong>
    </p>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        k=3 shows a clear elbow in the Inertia curve and achieves the highest Calinski-Harabasz Index (3071). Cluster sizes are balanced (32.0%, 29.3%, 38.7%), with distinct profiles across Education and Income. Higher k values produce clusters that differ on only a few features while remaining similar across most demographics, which reduces interpretability.
    </p>
</div>


### **7.2.3 Final K-Means Clustering Solution**

In [None]:
# Final K-Means Clustering Solution
# Using k=3 from comparison analysis

km_final_k = km_k_candidate_1  # k=3

# Reuse labels from comparison step
km_labels_final = km_labels_k1

# Create labeled dataset
df_demographic_a_scaled_labeled_km = df_demographic_a_scaled.copy()
df_demographic_a_scaled_labeled_km['Cluster'] = km_labels_final

# Calculate final metrics
km_final_metrics = evaluate_clustering_metrics(df_demographic_a_scaled, km_labels_final)

# Store for final comparison (Section 9)
demo_clustering_results['K-Means'] = {
    'k': km_final_k,
    'Silhouette': km_final_metrics['Silhouette Score'],
    'Calinski-Harabasz': km_final_metrics['Calinski-Harabasz Index'],
    'Davies-Bouldin': km_final_metrics['Davies-Bouldin Index'],
    'R2': get_rsq(df_demographic_a_scaled_labeled_km, df_demographic_a_scaled.columns.tolist(), 'Cluster'),
    'labels': km_labels_final
}

### **7.2.4 Cluster Profiling**

In [None]:
# 1. Cluster Profiles Heatmap - Z-scores of demographic features per cluster
km_cluster_profiles = df_demographic_a_scaled_labeled_km.groupby('Cluster')[feats].mean()
km_population_mean = df_demographic_a_scaled[feats].mean()

plot_cluster_profiles_heatmap(
    km_cluster_profiles, 
    km_population_mean, 
    GROUP80_palette_continuous,
    title='K-Means Clustering: Demographic Profiles (k=3)\nStandardized Z-Scores per Cluster'
)

In [None]:
# 2. Cluster Size Distribution
plot_cluster_sizes(
    km_labels_final, 
    km_final_k, 
    CUSTOM_HEX,
    title='K-Means Clustering - Final Cluster Sizes'
)

In [None]:
# 3. Feature Importance Analysis - Variance across clusters
km_feature_variance = km_cluster_profiles.var(axis=0).sort_values(ascending=False)

plot_feature_importance(
    km_feature_variance,
    CUSTOM_HEX,
    title='K-Means Clustering: Feature Importance Analysis\nWhich Demographics Differentiate Clusters?'
)

# Display feature importance ranking
km_feature_importance_df = pd.DataFrame({
    'Feature': km_feature_variance.index,
    'Variance': km_feature_variance.values.round(4)
})

km_feature_importance_df

<div style="background-color: #e8f3e5ff; padding: 15px; margin-right: 30px; border-left: 5px solid; border-image: linear-gradient(to bottom, #00411E, #00622D, #00823C, #45AF28, #82BA72) 1; border-radius: 5px; max-width: 95%;">
    <h3 style="margin-top: 0; margin-bottom: 10px; color: #00411E; font-weight: bold;">K-Means Clustering Profiling Summary (k=3)</h3>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        Analyzed demographic characteristics of the final 3 K-Means clusters to identify distinct customer segments. Education, Income, and City emerge as primary differentiators.
    </p>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Key Findings:</h4>
    <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
        <li style="margin-right: 20px;"><strong>Cluster 0 (32.0%, n=4,170) - Common Regions, Higher-Income Higher-Education:</strong> Common cities (Z=+1.03), common FSA (Z=+0.59), common provinces (Z=+0.53), higher education (Z=+0.56), higher income (Z=+0.50).</li>
        <li style="margin-right: 20px;"><strong>Cluster 1 (29.3%, n=3,824) - Lower-Income Lower-Education:</strong> Significantly lower education (Z=-1.35), significantly lower income (Z=-1.19), average geographic frequency, married rate above the population mean (Z=+0.37).</li>
        <li style="margin-right: 20px;"><strong>Cluster 2 (38.7%, n=5,044) - Rare Regions, Higher-Education:</strong> Rare provinces (Z=-0.45), rare cities (Z=-0.85), rare FSA (Z=-0.50), higher education (Z=+0.57), higher income (Z=+0.49).</li>
        <li style="margin-right: 20px;"><strong>Primary Segmentation Driver - Education (Variance: 1.22):</strong> Education creates the clearest separation, with Cluster 1 showing significantly lower education levels.</li>
        <li style="margin-right: 20px;"><strong>Secondary Drivers - Income (Variance: 0.94) & City (Variance: 0.89):</strong> Income strongly correlates with education, while city frequency differentiates mainstream (Cluster 0) from niche geographic segments.</li>
        <li style="margin-right: 20px;"><strong>Gender Independence (Variance: 0.0002):</strong> Gender shows virtually no differentiation across K-Means clusters.</li>
    </ul>
</div>


---

## **7.3 Mean Shift Clustering**

<div style="background-color: #e1e1e1ff; padding: 15px; margin-right: 30px; border-left: 5px solid; border-image: linear-gradient(to bottom, #212121, #313131, #595959, #909090) 1; border-radius: 5px; max-width: 95%;">
    <h3 style="margin-top: 0; margin-bottom: 10px; color: #000000ff; font-weight: bold;">Mean Shift Clustering Methodology</h3>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        Mean Shift is a <strong>density-based mode-seeking algorithm</strong> that identifies clusters by shifting a sliding window toward regions of highest point density. Unlike K-Means, it does not require pre-specifying k and can discover clusters of arbitrary shape by following the gradient of the underlying density estimate.
    </p>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #313131; font-weight: bold;">Algorithm:</h4>
    <ol style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
        <li style="margin-bottom: 5px;"><strong>Initialize a Sliding Window:</strong> Begin with a circular kernel window centered at a point <strong>C</strong> (randomly selected or one per data point) with radius <strong>r</strong> (bandwidth)</li>
        <li style="margin-bottom: 5px;"><strong>Shift Toward Higher Density:</strong> At each iteration, compute the mean of all points inside the window and shift the center <strong>C</strong> to this mean, gradually moving toward higher-density regions</li>
        <li style="margin-bottom: 5px;"><strong>Convergence:</strong> Repeat the shift step until the movement of the window center becomes negligible (the center has converged to a mode)</li>
        <li style="margin-bottom: 5px;"><strong>Merge Modes and Assign Clusters:</strong> Run the process from many initial centers, then merge converged centers that are within a tolerance distance. Assign each data point to the cluster of the nearest converged center. If sliding windows overlap, the densest mode (window containing the most points) is preserved and points are grouped accordingly</li>
    </ol>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #313131; font-weight: bold;">Workflow for This Analysis:</h4>
    <ol style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
        <li style="margin-bottom: 5px;"><strong>Selecting the Best Bandwidth:</strong> Tested a targeted set of <strong>bandwidth quantiles</strong> using <strong>estimate_bandwidth()</strong>, then fitted Mean Shift for each bandwidth and tracked <strong>n_clusters</strong>, <strong>R² (variance explained)</strong>, and <strong>Silhouette</strong> (only if ≥ 2 clusters). Visualized <strong>quantile vs. cluster count</strong> to detect stability regions and avoid extreme cases (too large -> 1 cluster, too small -> fragmentation)</li>
        <li style="margin-bottom: 5px;"><strong>Evaluation of Mean Shift Solutions:</strong> Selected two candidate regimes from the results table and compared them by fitting both solutions: <strong>Candidate A</strong> (quantile=0.061 -> 3 clusters) vs <strong>Candidate B</strong> (quantile=0.06 -> 5 clusters)</li>
        <li style="margin-bottom: 5px;"><strong>Final Mean Shift Clustering Solution:</strong> Selected quantile=0.06 (5 clusters) for stable plateau regime with meaningful geographic differentiation</li>
        <li style="margin-bottom: 5px;"><strong>Mean Shift Cluster Profiling:</strong> Profiled the final solution using (1) a <strong>cluster profile heatmap</strong> (cluster means vs population mean), (2) <strong>final cluster size distribution</strong>, and (3) <strong>feature importance</strong> via variance across cluster centroids</li>
    </ol>
</div>
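The algorithm steps above can be sketched with a flat (uniform) kernel in NumPy. This is a simplified illustration under stated assumptions: the function `mean_shift_sketch` is hypothetical, it starts one window at every data point, and its merge step keeps the first converged center it meets rather than the densest one, so it is not equivalent to scikit-learn's `MeanShift`.

```python
import numpy as np

def mean_shift_sketch(X, bandwidth, max_iter=300, tol=1e-6):
    X = np.asarray(X, dtype=float)
    modes = X.copy()                        # step 1: one window per data point
    for _ in range(max_iter):
        shifted = np.empty_like(modes)
        for i, center in enumerate(modes):
            in_window = np.linalg.norm(X - center, axis=1) <= bandwidth
            shifted[i] = X[in_window].mean(axis=0)   # step 2: move to the local mean
        moved = np.linalg.norm(shifted - modes, axis=1).max()
        modes = shifted
        if moved < tol:                     # step 3: every window has settled
            break

    # step 4: merge converged centers that lie within one bandwidth of each other
    centers = []
    for m in modes:
        if all(np.linalg.norm(m - c) > bandwidth for c in centers):
            centers.append(m)
    centers = np.vstack(centers)

    # assign each point to its nearest surviving center
    labels = np.linalg.norm(X[:, None] - centers[None, :], axis=2).argmin(axis=1)
    return labels, centers

# Hypothetical toy data: two tight groups of three points each
X_toy = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)

labels_mid, centers_mid = mean_shift_sketch(X_toy, bandwidth=2.0)
labels_wide, centers_wide = mean_shift_sketch(X_toy, bandwidth=50.0)
labels_narrow, centers_narrow = mean_shift_sketch(X_toy, bandwidth=0.5)
print(len(centers_mid), len(centers_wide), len(centers_narrow))
```

The three bandwidths show the behaviour the workflow guards against: a moderate bandwidth recovers the two groups, a very large one collapses everything into a single cluster, and a very small one fragments the data into one cluster per point.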


### **7.3.1 Selecting the best Bandwidth**

In [None]:
# Candidate quantiles for estimate_bandwidth(); a larger quantile gives a
# larger bandwidth and therefore fewer clusters
bandwidth_quantiles = [
    0.0635, 0.061, 0.06095, 0.06, 0.0475, 0.04747, 0.0473, 0.0472, 0.04715,
    0.0471, 0.046, 0.045, 0.0435, 0.043, 0.04, 0.035, 0.03,
]
ms_results = []

X_ms = df_demographic_a_scaled[feats]

for q in bandwidth_quantiles:
    bw = estimate_bandwidth(X_ms, quantile=q, random_state=1)
    ms = MeanShift(bandwidth=bw, bin_seeding=True, n_jobs=-1)
    labels = ms.fit_predict(X_ms)

    n_clusters = len(np.unique(labels))

    # R²
    df_tmp = X_ms.copy()
    df_tmp["labels"] = labels
    r2 = get_rsq(df_tmp, feats, "labels")

    # Silhouette (only defined if >= 2 clusters)
    sil = silhouette_score(X_ms, labels) if n_clusters >= 2 else np.nan

    ms_results.append({
        "quantile": q,
        "bandwidth": float(bw),
        "n_clusters": int(n_clusters),
        "R2": float(r2),
        "Silhouette": float(sil) if not np.isnan(sil) else np.nan
    })

ms_results_df = pd.DataFrame(ms_results).sort_values("quantile", ascending=False)
ms_results_df


In [None]:
plot_meanshift_quantile_vs_clusters(ms_results_df)

<div style="background-color: #e8f3e5ff; padding: 15px; margin-right: 30px; border-left: 5px solid; border-image: linear-gradient(to bottom, #00411E, #00622D, #00823C, #45AF28, #82BA72) 1; border-radius: 5px; max-width: 95%;">
  <h3 style="margin-top: 0; margin-bottom: 10px; color: #00411E; font-weight: bold;">Mean Shift Bandwidth Selection Summary</h3>

  <p style="margin: 10px 0; margin-right: 40px; color: #000;">
    Determined the Mean Shift kernel bandwidth by testing a targeted set of bandwidth quantiles (0.03 to 0.0635). For each quantile, bandwidth was estimated via <strong>estimate_bandwidth()</strong>, Mean Shift was fitted, and the solution was evaluated using <strong>cluster count</strong>, <strong>R²</strong>, and <strong>Silhouette</strong> to balance segmentation granularity and separation quality.
  </p>

  <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Goal:</h4>
  <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
    <li style="margin-right: 20px;">Select a bandwidth regime that yields interpretable, stable clusters with good separation, while avoiding over-fragmentation (many micro-clusters) or collapse into too few clusters</li>
  </ul>
  <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Methodology:</h4>
  <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
    <li style="margin-right: 20px;"><strong>Quantile scan (bandwidth selection):</strong> Evaluated 17 quantiles (0.03-0.0635) to observe how cluster count changes as bandwidth varies.</li>
    <li style="margin-right: 20px;"><strong>Evaluation criteria:</strong> Compared <strong>n_clusters</strong> (solution granularity), <strong>Silhouette</strong> (cluster separation) and <strong>R²</strong> (variance explained).</li>
  </ul>

  <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Findings:</h4>
  <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
    <li style="margin-right: 20px;"><strong>Stable 5-cluster plateau:</strong> quantile=0.06 and 0.0475 yield stable 5 clusters (R²=0.35-0.36, Silhouette=0.14)</li>
    <li style="margin-right: 20px;"><strong>3-cluster solution:</strong> quantile=0.061 yields 3 clusters with good Silhouette (0.15) - comparable to previous k=3 solutions</li>
    <li style="margin-right: 20px;"><strong>Candidate A (compact):</strong> quantile=0.061 yields 3 clusters - interpretable, consistent with previous algorithms</li>
    <li style="margin-right: 20px;"><strong>Candidate B (plateau):</strong> quantile=0.06 yields 5 clusters - stable regime across multiple quantiles</li>
    <li style="margin-right: 20px;"><strong>Decision for next step:</strong> Compare 3 clusters vs 5 clusters</li>
  </ul>
</div>
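R² in this comparison is the share of total variance explained by the cluster assignment. A minimal sketch of that definition follows, assuming the usual decomposition into within-cluster and total sums of squares. The function `r_squared` is a hypothetical stand-in and the notebook's `get_rsq` helper may differ in its details.

```python
import numpy as np

def r_squared(X, labels):
    """R2 = 1 - SSW / SST for a given cluster assignment."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Total sum of squares around the overall mean
    sst = ((X - X.mean(axis=0)) ** 2).sum()
    # Within-cluster sum of squares around each cluster's own mean
    ssw = sum(((X[labels == c] - X[labels == c].mean(axis=0)) ** 2).sum()
              for c in np.unique(labels))
    return 1.0 - ssw / sst

# Hypothetical toy data: two tight groups of three points each
X_toy = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)

r2_one = r_squared(X_toy, [0, 0, 0, 0, 0, 0])    # a single cluster explains nothing
r2_two = r_squared(X_toy, [0, 0, 0, 1, 1, 1])    # the two natural groups
r2_all = r_squared(X_toy, [0, 1, 2, 3, 4, 5])    # one cluster per point
print(r2_one, r2_two, r2_all)
```

Because R² reaches 1 when every point is its own cluster, it tends to increase as clusters are split further, so it is read here together with cluster count and Silhouette rather than on its own.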


### **7.3.2 Evaluation of Mean Shift Solutions**

In [None]:
# Compare two candidate Mean Shift solutions based on previous evaluation results
# Candidate A: quantile=0.061 → 3 clusters
# Candidate B: quantile=0.06 → 5 clusters

ms_q_1 = 0.061
ms_q_2 = 0.06

X_ms = df_demographic_a_scaled[feats]

# Get bandwidths from the already computed results table
ms_bw_1 = float(ms_results_df.loc[ms_results_df["quantile"] == ms_q_1, "bandwidth"].iloc[0])
ms_bw_2 = float(ms_results_df.loc[ms_results_df["quantile"] == ms_q_2, "bandwidth"].iloc[0])

# Fit both candidate solutions
ms_cand_1 = MeanShift(bandwidth=ms_bw_1, bin_seeding=True, n_jobs=-1)
ms_cand_2 = MeanShift(bandwidth=ms_bw_2, bin_seeding=True, n_jobs=-1)

ms_labels_1 = ms_cand_1.fit_predict(X_ms)
ms_labels_2 = ms_cand_2.fit_predict(X_ms)

# Cluster profiles
df_temp_ms1 = df_demographic_a_scaled.copy()
df_temp_ms1["Cluster"] = ms_labels_1
cluster_profiles_ms1 = df_temp_ms1.groupby("Cluster")[feats].mean()

df_temp_ms2 = df_demographic_a_scaled.copy()
df_temp_ms2["Cluster"] = ms_labels_2
cluster_profiles_ms2 = df_temp_ms2.groupby("Cluster")[feats].mean()

display(cluster_profiles_ms1.round(3))
display(cluster_profiles_ms2.round(3))

# Visualize cluster size comparison
plot_cluster_size_comparison(
    labels_dict={
        f"q={ms_q_1} (bw={ms_bw_1:.2f})": ms_labels_1,
        f"q={ms_q_2} (bw={ms_bw_2:.2f})": ms_labels_2,
    },
    palette=CUSTOM_HEX,
    title="Mean Shift Clustering: Cluster Size Comparison (Quantile & Bandwidth)"
)

<div style="background-color: #e8f3e5ff; padding: 15px; margin-right: 30px; border-left: 5px solid; border-image: linear-gradient(to bottom, #00411E, #00622D, #00823C, #45AF28, #82BA72) 1; border-radius: 5px; max-width: 95%;">
  <h3 style="margin-top: 0; margin-bottom: 10px; color: #00411E; font-weight: bold;">Comparison of Clustering Solutions Summary</h3>

  <p style="margin: 10px 0; margin-right: 40px; color: #000;">
    <strong>Decision: Mean Shift quantile=0.06 selected (5 clusters)</strong>
  </p>

  <p style="margin: 10px 0; margin-right: 40px; color: #000;">
    Compared 3 clusters (q=0.061) vs 5 clusters (q=0.06). Both solutions share a stable core: Cluster 1 (Lower-Education Lower-Income) is nearly identical in size (29.7% vs 29.9%) and profile (Education Z=-1.35, Income Z=-1.16). The 5-cluster solution preserves this core while splitting the dominant cluster into more meaningful segments instead of merging into one large mainstream group. This maintains interpretability while providing additional detail on geographic and FSA patterns.
  </p>
</div>


### **7.3.3 Final Mean Shift Clustering Solution**

In [None]:
# Final Mean Shift Clustering Solution
# Using quantile=0.06 (5 clusters) from comparison analysis

chosen_quantile_ms = ms_q_2  # 0.06
chosen_bandwidth_ms = ms_bw_2

# Reuse labels from comparison step
ms_labels_final = ms_labels_2

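# Note: this adds a label column to the shared scaled frame, so later code
# must keep selecting [feats] whenever it needs the features only
df_demographic_a_scaled["ms_cluster"] = ms_labels_final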

# Calculate final metrics
ms_final_metrics = evaluate_clustering_metrics(df_demographic_a_scaled[feats], ms_labels_final)

# Store for final comparison (Section 9)
demo_clustering_results['Mean Shift'] = {
    'k': len(np.unique(ms_labels_final)),
    'Silhouette': ms_final_metrics['Silhouette Score'],
    'Calinski-Harabasz': ms_final_metrics['Calinski-Harabasz Index'],
    'Davies-Bouldin': ms_final_metrics['Davies-Bouldin Index'],
    'R2': ms_results_df[ms_results_df['quantile'] == chosen_quantile_ms]['R2'].values[0],
    'labels': ms_labels_final
}


### **7.3.4 Mean Shift Cluster Profiling**

In [None]:
# 1. Profile Heatmap - Z-scores of demographic features per cluster (Mean Shift)
ms_cluster_profiles = (df_demographic_a_scaled
                       .groupby("ms_cluster")[feats]
                       .mean())
ms_population_mean = df_demographic_a_scaled[feats].mean()

plot_cluster_profiles_heatmap(
    ms_cluster_profiles,
    ms_population_mean,
    GROUP80_palette_continuous,
    title="Mean Shift - Cluster Profiles"
)

In [None]:
# 2. Cluster sizes
nclus_ms = len(np.unique(ms_labels_final))
plot_cluster_sizes(
    ms_labels_final,
    nclus_ms,
    CUSTOM_HEX,
    title="Mean Shift - Final Cluster Sizes"
)

In [None]:
# 3. Feature Importance Analysis - Variance across clusters
ms_feature_variance = ms_cluster_profiles.var(axis=0).sort_values(ascending=False)

plot_feature_importance(
    ms_feature_variance,
    CUSTOM_HEX,
    title="Mean Shift Clustering: Feature Importance Analysis\nWhich Demographics Differentiate Clusters?"
)

# Display feature importance ranking
ms_feature_importance_df = pd.DataFrame({
    "Feature": ms_feature_variance.index,
    "Variance": ms_feature_variance.values.round(4),
})

ms_feature_importance_df

<div style="background-color: #e8f3e5ff; padding: 15px; margin-right: 30px; border-left: 5px solid; border-image: linear-gradient(to bottom, #00411E, #00622D, #00823C, #45AF28, #82BA72) 1; border-radius: 5px; max-width: 95%;">
  <h3 style="margin-top: 0; margin-bottom: 10px; color: #00411E; font-weight: bold;">Mean Shift Profiling Summary (5 clusters)</h3>

  <p style="margin: 10px 0; margin-right: 40px; color: #000;">
    Profiled the final Mean Shift solution (quantile=0.06) with <strong>5 clusters</strong>. Segmentation driven by <strong>FSA</strong>, <strong>Education</strong>, and <strong>Province/City</strong>.
  </p>
  <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Key Findings:</h4>
  <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
    <li style="margin-right: 20px;"><strong>Cluster 0 (44.1%, n=5,748) - Mainstream Higher-Income:</strong> Average geography, higher income (Z=+0.70), moderate education (Z=+0.40).</li>
    <li style="margin-right: 20px;"><strong>Cluster 1 (29.9%, n=3,898) - Lower-Education Lower-Income:</strong> Significantly lower education (Z=-1.35), lower income (Z=-1.16). Stable across solutions.</li>
    <li style="margin-right: 20px;"><strong>Cluster 2 (5.7%, n=741) - Common Regions Highly-Educated:</strong> Common provinces (Z=+0.83), common cities (Z=+1.07), highest education (Z=+1.31).</li>
    <li style="margin-right: 20px;"><strong>Cluster 3 (13.8%, n=1,801) - Rare Regions Higher-Education:</strong> Rare provinces (Z=-1.51), rare cities (Z=-0.91), rare FSA (Z=-0.67), higher education (Z=+0.79).</li>
    <li style="margin-right: 20px;"><strong>Cluster 4 (6.5%, n=850) - High FSA Segment:</strong> Very high FSA (Z=+2.58), common cities (Z=+0.80), moderate education (Z=+0.69).</li>
    <li style="margin-right: 20px;"><strong>Primary Driver - FSA (Variance: 1.68):</strong> FSA frequency creates strongest separation, isolating Cluster 4.</li>
    <li style="margin-right: 20px;"><strong>Secondary Drivers - Education (1.03) & Province (0.80):</strong> Education separates Cluster 1, Province differentiates Cluster 3.</li>
    <li style="margin-right: 20px;"><strong>Gender Independence (Variance: 0.0006):</strong> No differentiation across clusters.</li>
  </ul>
</div>


---

## **7.4 Gaussian Mixture Models (GMM)**

<div style="background-color: #e1e1e1ff; padding: 15px; margin-right: 30px; border-left: 5px solid; border-image: linear-gradient(to bottom, #212121, #313131, #595959, #909090) 1; border-radius: 5px; max-width: 95%;">
    <h3 style="margin-top: 0; margin-bottom: 10px; color: #000000ff; font-weight: bold;">Gaussian Mixture Model (GMM) Methodology</h3>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        GMM is a <strong>probabilistic clustering algorithm</strong> that models data as a mixture of k Gaussian distributions. Unlike hard clustering (K-Means, Hierarchical), GMM provides soft assignments where each point has a probability of belonging to each cluster, enabling uncertainty quantification and overlapping segments.
    </p>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #313131; font-weight: bold;">Algorithm:</h4>
    <ol style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
        <li style="margin-bottom: 5px;"><strong>Initialization:</strong> Initialize k Gaussian components (each with mean, covariance, and mixing weight), either randomly or from a K-Means solution (this analysis uses <code>init_params="kmeans"</code>)</li>
        <li style="margin-bottom: 5px;"><strong>Expectation Step (E-step):</strong> Calculate probability that each data point belongs to each Gaussian component using current parameters</li>
        <li style="margin-bottom: 5px;"><strong>Maximization Step (M-step):</strong> Update component parameters (means, covariances, weights) to maximize the expected log-likelihood of the data under the current soft assignments</li>
        <li style="margin-bottom: 5px;"><strong>Convergence:</strong> Iterate E-step and M-step until parameters stabilize (EM algorithm converges to local maximum)</li>
    </ol>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #313131; font-weight: bold;">Workflow for This Analysis:</h4>
    <ol style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
        <li style="margin-bottom: 5px;"><strong>Component Selection:</strong> Test n=2-10 Gaussian components, evaluate using AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion) to penalize model complexity</li>
        <li style="margin-bottom: 5px;"><strong>Candidate Comparison:</strong> Compare n=3 vs n=4 components using validation metrics (Silhouette, Calinski-Harabasz, Davies-Bouldin, R²)</li>
        <li style="margin-bottom: 5px;"><strong>Final Model & Profiling:</strong> Selected n=4 components for superior feature utilization and model fit, producing four segments (30.7%, 31.4%, 14.2%, 23.7%) differentiated primarily by FSA, Education, Income, and Gender</li>
    </ol>
</div>
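
The E-step and M-step described above can be illustrated with a minimal NumPy sketch. It is not the notebook's implementation (which relies on scikit-learn's `GaussianMixture`); it assumes a one-dimensional, two-component mixture on synthetic data, and the function names `gaussian_pdf` and `em_gmm_1d` are hypothetical.

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian evaluated at each value in x."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def em_gmm_1d(x, means, variances, weights, n_iter=50):
    """Run EM for a 1-D Gaussian mixture; returns parameters and responsibilities."""
    means = np.asarray(means, dtype=float).copy()
    variances = np.asarray(variances, dtype=float).copy()
    weights = np.asarray(weights, dtype=float).copy()
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point (rows sum to 1)
        dens = weights * gaussian_pdf(x[:, None], means, variances)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means and variances from the soft counts
        nk = resp.sum(axis=0)
        weights = nk / len(x)
        means = (resp * x[:, None]).sum(axis=0) / nk
        variances = (resp * (x[:, None] - means) ** 2).sum(axis=0) / nk
    # resp holds the soft assignments from the last E-step
    return means, variances, weights, resp

# Synthetic data (assumption): two well-separated groups
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2.0, 0.5, 300), rng.normal(3.0, 1.0, 200)])

means, variances, weights, resp = em_gmm_1d(
    x, means=[-1.0, 1.0], variances=[1.0, 1.0], weights=[0.5, 0.5]
)
hard_labels = resp.argmax(axis=1)  # hard labels are just the most probable component
print("means:", means.round(2), "weights:", weights.round(2))
```

Each row of `resp` is a probability distribution over components, which is what makes the assignment "soft"; taking the `argmax` recovers a hard label comparable to K-Means output.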

### **7.4.1 Selecting covariance_type & n_components**

In [None]:
# Candidates
gmm_n_components = list(range(2, 11))
gmm_cov_types = ["full", "tied", "diag", "spherical"]

X_gmm = df_demographic_a_scaled[feats].copy()

gmm_results_df = evaluate_gmm_grid(
    X=X_gmm,
    feats=feats,
    n_components_list=gmm_n_components,
    covariance_types=gmm_cov_types,
    n_init=10,
    random_state=1
)

gmm_results_df_sorted = (
    gmm_results_df
    .sort_values(["BIC", "AIC"], ascending=[True, True])
    .reset_index(drop=True)
)

gmm_results_df_sorted

In [None]:
# 1) pick covariance_type
plot_gmm_covtype_bic_aic(gmm_results_df, gmm_cov_types, gmm_n_components)

In [None]:
# 2) Find best n_components for chosen covariance_type
chosen_covariance_type_gmm = "full" # based on previous plot
plot_gmm_n_selection_for_covtype(gmm_results_df, chosen_covariance_type_gmm, gmm_n_components)

<div style="background-color: #e8f3e5ff; padding: 15px; margin-right: 30px; border-left: 5px solid; border-image: linear-gradient(to bottom, #00411E, #00622D, #00823C, #45AF28, #82BA72) 1; border-radius: 5px; max-width: 95%;">
  <h3 style="margin-top: 0; margin-bottom: 10px; color: #00411E; font-weight: bold;">
    GMM Parameter Selection Summary
  </h3>

  <p style="margin: 10px 0; margin-right: 40px; color: #000;">
    Evaluated Gaussian Mixture Models by running a grid over <strong>n_components</strong> (2–10) and
    <strong>covariance_type</strong> (<strong>full</strong>, <strong>tied</strong>, <strong>diag</strong>, <strong>spherical</strong>)
    using <strong>init_params="kmeans"</strong> to stabilize initialization. Each configuration was scored with
    <strong>BIC</strong> and <strong>AIC</strong> (model selection), plus <strong>R²</strong> (variance explained) and
    <strong>Silhouette</strong> (cluster separation sanity check).
  </p>

  <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Goal:</h4>
  <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
    <li style="margin-right: 20px;">
      Select a <strong>covariance structure</strong> that best fits the data distribution, then choose
      <strong>n_components</strong> that balances model fit (BIC/AIC), separation (Silhouette), and interpretability
      (avoid unnecessary micro-segmentation).
    </li>
  </ul>

  <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Methodology:</h4>
  <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
    <li style="margin-right: 20px;">
      <strong>Step 1 – covariance_type selection:</strong>
      Compared BIC &amp; AIC curves across all covariance types. Since BIC/AIC are the standard
      likelihood-penalized criteria for GMMs, the best covariance_type is the one that achieves the
      <strong>lowest BIC/AIC</strong> consistently across n.
    </li>
    <li style="margin-right: 20px;">
      <strong>Step 2 – choose n_components within the winning covariance_type:</strong>
      For the selected covariance_type, inspected (i) <strong>BIC/AIC</strong> for the best trade-off between fit and
      complexity, (ii) <strong>Silhouette</strong> to avoid "overfitting into overlapping components", and (iii)
      <strong>R²</strong> to understand how much additional variance is gained by increasing n.
    </li>
    <li style="margin-right: 20px;">
      <strong>Important note on negative scores:</strong>
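      <strong>BIC/AIC can be negative</strong> because both start from −2 × log-likelihood, and the log-likelihood of a
      continuous density can be positive (density values may exceed 1 on tightly scaled data). Only
      <strong>relative differences</strong> between models fitted to the same data matter (lower is better), not the absolute sign.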
    </li>
  </ul>

  <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Findings:</h4>
    <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
      <li style="margin-right: 20px;">
        <strong>Best covariance_type:</strong>
        <strong>Full</strong> and <strong>diag</strong> achieve the lowest BIC/AIC at higher n_components. However, while BIC/AIC improves with higher n (minimum at n=9-10), high component counts produce micro-segments that reduce interpretability. <strong>Full covariance</strong> is selected for capturing feature correlations.
      </li>
      <li style="margin-right: 20px;">
        <strong>Candidate A (best separation):</strong>
        With <strong>full</strong>, <strong>n=3</strong> achieves the <strong>highest Silhouette</strong> (0.200) among lower n values, indicating well-separated clusters. This aligns with Hierarchical and K-Means 3-cluster solutions, enabling cross-method comparison.
      </li>
      <li style="margin-right: 20px;">
        <strong>Candidate B (balanced fit):</strong>
        With <strong>full</strong>, <strong>n=4</strong> shows <strong>substantial BIC improvement</strong> (16,643 vs 119,653) and higher R² (0.350 vs 0.316). While Silhouette (0.183) is slightly lower, the additional component may reveal a meaningful fourth segment.
      </li>
      <li style="margin-right: 20px;">
        <strong>Why n=3 and n=4 instead of higher n?</strong>
        Silhouette scores at n=3-4 (0.183-0.200) are comparable to or better than higher n (n=7: 0.168), indicating separation does not improve with more components. Following the principle of parsimony, we prioritize actionable segments over maximum statistical fit.
      </li>
      <li style="margin-right: 20px;">
        <strong>Decision:</strong>
        Proceed with <strong>full, n=3</strong> (best separation) and <strong>full, n=4</strong> (balanced fit) for deeper profiling.
      </li>
    </ul>
</div>
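
To make the note on negative scores concrete, the sketch below computes BIC and AIC by hand for a single Gaussian fitted to synthetic data at two different scales. It is an illustration under stated assumptions, not the notebook's `evaluate_gmm_grid` helper: the data are simulated, and the helper names (`gaussian_log_likelihood`, `n_params_full_gmm`, `bic`, `aic`) are hypothetical.

```python
import numpy as np

def gaussian_log_likelihood(x):
    """Total log-likelihood of x under a single Gaussian fitted by maximum likelihood."""
    var = x.var()  # ML variance estimate (ddof=0)
    return -0.5 * len(x) * (np.log(2.0 * np.pi * var) + 1.0)

def n_params_full_gmm(k, d):
    """Free parameters of a k-component, d-dimensional GMM with full covariances."""
    # k*d means + k symmetric d x d covariances + (k - 1) free mixing weights
    return k * d + k * d * (d + 1) // 2 + (k - 1)

def bic(log_likelihood, n_params, n_samples):
    return -2.0 * log_likelihood + n_params * np.log(n_samples)

def aic(log_likelihood, n_params):
    return -2.0 * log_likelihood + 2.0 * n_params

rng = np.random.default_rng(1)
wide = rng.normal(0.0, 1.0, 1000)
tight = 0.01 * wide  # the same data, only rescaled

n_params = 2  # one mean and one variance
bic_wide = bic(gaussian_log_likelihood(wide), n_params, len(wide))
bic_tight = bic(gaussian_log_likelihood(tight), n_params, len(tight))
print(f"BIC wide:  {bic_wide:.1f}")
print(f"BIC tight: {bic_tight:.1f}")
```

Rescaling the data flips the sign of the BIC without changing anything about model quality, which is why only differences between models fitted to the same data are interpretable. The parameter count also shows why full covariance is penalized most heavily: the covariance term grows quadratically in the number of features.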


### **7.4.2 Evaluation of GMM Solutions**

In [None]:
# Fit only the two selected candidates: full covariance with n=3 and n=4 components
gmm_3 = GaussianMixture(n_components=3, covariance_type='full', n_init=10, init_params='kmeans', random_state=1)
gmm_4 = GaussianMixture(n_components=4, covariance_type='full', n_init=10, init_params='kmeans', random_state=1)

gmm_labels_3 = gmm_3.fit_predict(df_demographic_a_scaled[feats])
gmm_labels_4 = gmm_4.fit_predict(df_demographic_a_scaled[feats])

# Calculate cluster means
df_temp_3 = df_demographic_a_scaled[feats].copy()
df_temp_3['Cluster'] = gmm_labels_3
cluster_means_3 = df_temp_3.groupby('Cluster').mean()

df_temp_4 = df_demographic_a_scaled[feats].copy()
df_temp_4['Cluster'] = gmm_labels_4
cluster_means_4 = df_temp_4.groupby('Cluster').mean()

display(cluster_means_3.round(3))
display(cluster_means_4.round(3))

# Visualize cluster size comparison
plot_cluster_size_comparison(
    labels_dict={'GMM (n=3)': gmm_labels_3, 'GMM (n=4)': gmm_labels_4},
    palette=CUSTOM_HEX,
    title='GMM Clustering: Cluster Size Comparison'
)


In [None]:
# Uncertainty analysis: share of customers whose highest component probability is below 70%
uncertainty_threshold = 0.7

gmm_max_probs_3 = gmm_3.predict_proba(df_demographic_a_scaled[feats]).max(axis=1)
gmm_max_probs_4 = gmm_4.predict_proba(df_demographic_a_scaled[feats]).max(axis=1)

uncertain_pct_3 = (gmm_max_probs_3 < uncertainty_threshold).sum() / len(gmm_max_probs_3) * 100
uncertain_pct_4 = (gmm_max_probs_4 < uncertainty_threshold).sum() / len(gmm_max_probs_4) * 100

uncertainty_summary = pd.DataFrame({
    'n_components': [3, 4],
    'uncertain_pct': [uncertain_pct_3, uncertain_pct_4],
    'mean_prob': [gmm_max_probs_3.mean(), gmm_max_probs_4.mean()],
    'min_prob': [gmm_max_probs_3.min(), gmm_max_probs_4.min()]
})

display(uncertainty_summary)


<div style="background-color: #e8f3e5ff; padding: 15px; margin-right: 30px; border-left: 5px solid; border-image: linear-gradient(to bottom, #00411E, #00622D, #00823C, #45AF28, #82BA72) 1; border-radius: 5px; max-width: 95%;">
    <h3 style="margin-top: 0; margin-bottom: 10px; color: #00411E; font-weight: bold;">Comparison of Clustering Solutions Summary</h3>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        <strong>Decision: n=4 components selected with full covariance</strong>
    </p>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        Compared n=3 vs n=4 components using <strong>full covariance structure</strong>. While n=3 achieves slightly higher Silhouette (0.200 vs 0.183), n=4 provides <strong>superior feature utilization</strong>: it leverages Gender as a primary differentiator (Z-scores of +1.00 and -1.00 for male/female segments) and isolates a distinct High-FSA segment (Z=+2.12), whereas n=3 ignores Gender entirely (all clusters near Z=0). The "Lower Education/Income" segment remains stable across both solutions (Education Z=-1.35, Income Z=-1.26). The n=4 solution also achieves substantially better model fit (<strong>BIC=16,643</strong> vs 119,653) and higher R² (0.350 vs 0.316). Both solutions show <strong>0.0% uncertain assignments</strong> at the 70% probability threshold. Cluster sizes remain balanced and actionable (30.7%, 31.4%, 14.2%, 23.7%). Following the principle of maximizing demographic differentiation, <strong>n=4 with full covariance</strong> is selected.
    </p>
</div>
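
The R² values quoted above come from helper code that is not shown in this section. A minimal sketch of how such a score can be computed from hard labels is given below, assuming the usual definition R² = 1 − SSW/SST (within-cluster over total sum of squares); the function name `r2_from_labels` and the toy data are hypothetical.

```python
import numpy as np

def r2_from_labels(X, labels):
    """Share of total variance explained by a hard partition: 1 - SSW / SST."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Total sum of squares around the global mean
    sst = ((X - X.mean(axis=0)) ** 2).sum()
    # Within-cluster sum of squares around each cluster's own mean
    ssw = 0.0
    for c in np.unique(labels):
        members = X[labels == c]
        ssw += ((members - members.mean(axis=0)) ** 2).sum()
    return 1.0 - ssw / sst

# Toy data (assumption): two tight groups on one feature
X_toy = np.array([[0.0], [2.0], [10.0], [12.0]])
r2_two_clusters = r2_from_labels(X_toy, [0, 0, 1, 1])
r2_one_cluster = r2_from_labels(X_toy, [0, 0, 0, 0])
print(round(r2_two_clusters, 4), round(r2_one_cluster, 4))
```

Because SSW can only shrink when a cluster is split, R² rises with every added component, so it is best read alongside BIC and Silhouette rather than on its own.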

### **7.4.3 Final GMM Clustering Solution**

In [None]:
# Final parameters based on the evaluation in the previous section
# Selected: n_components=4, covariance_type='full'
chosen_n_components_gmm = 4

gmm_labels_final = gmm_labels_4
gmm_final = gmm_4

df_demographic_a_scaled['gmm_cluster'] = gmm_labels_final

# Calculate final metrics
gmm_final_metrics = evaluate_clustering_metrics(df_demographic_a_scaled[feats], gmm_labels_final)

# Store for final comparison (Section 9)
demo_clustering_results['GMM'] = {
    'k': chosen_n_components_gmm,
    'Silhouette': gmm_final_metrics['Silhouette Score'],
    'Calinski-Harabasz': gmm_final_metrics['Calinski-Harabasz Index'],
    'Davies-Bouldin': gmm_final_metrics['Davies-Bouldin Index'],
    'R2': gmm_results_df.loc[
        (gmm_results_df['n_components'] == chosen_n_components_gmm)
        & (gmm_results_df['covariance_type'] == chosen_covariance_type_gmm),
        'R2'
    ].values[0],
    'labels': gmm_labels_final
}

### **7.4.4 GMM Cluster Profiling**

In [None]:
# 1. Profile Heatmap - Z-scores of demographic features per cluster (GMM)
df_temp_gmm = df_demographic_a_scaled[feats].copy()
df_temp_gmm['Cluster'] = gmm_labels_final
cluster_profiles_gmm = df_temp_gmm.groupby('Cluster').mean()
gmm_population_mean = df_demographic_a_scaled[feats].mean()

plot_cluster_profiles_heatmap(
    cluster_profiles_gmm,
    gmm_population_mean,
    GROUP80_palette_continuous,
    title='GMM Clustering: Demographic Profiles (n=4)\nStandardized Z-Scores per Cluster'
)

In [None]:
# 2. Cluster Size Distribution
plot_cluster_sizes(
    gmm_labels_final,
    chosen_n_components_gmm,
    CUSTOM_HEX,
    title='GMM Clustering - Final Cluster Sizes'
)

In [None]:
# 3. Feature Importance Analysis - Variance across clusters
gmm_feature_variance = cluster_profiles_gmm.var(axis=0).sort_values(ascending=False)

plot_feature_importance(
    gmm_feature_variance,
    CUSTOM_HEX,
    title='GMM Clustering: Feature Importance Analysis\nWhich Demographics Differentiate Clusters?'
)

# Display feature importance as DataFrame
gmm_feature_importance_df = pd.DataFrame({
    'Feature': gmm_feature_variance.index,
    'Variance': gmm_feature_variance.values
}).reset_index(drop=True)

display(gmm_feature_importance_df)


<div style="background-color: #e8f3e5ff; padding: 15px; margin-right: 30px; border-left: 5px solid; border-image: linear-gradient(to bottom, #00411E, #00622D, #00823C, #45AF28, #82BA72) 1; border-radius: 5px; max-width: 95%;">
    <h3 style="margin-top: 0; margin-bottom: 10px; color: #00411E; font-weight: bold;">GMM Clustering Profiling Summary (n=4)</h3>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        Analyzed demographic characteristics of the final 4 GMM clusters (full covariance) to identify probabilistic customer segments. The model captures feature correlations through full covariance, with FSA, Education, Income, and Gender emerging as primary differentiators.
    </p>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Key Findings:</h4>
    <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
        <li style="margin-right: 20px;"><strong>Cluster 0 (30.7%, n=3,999) - Higher-Income Males:</strong> Predominantly male (Z=+1.00), higher education (Z=+0.52), higher income (Z=+0.49), married (Z=+0.67).</li>
        <li style="margin-right: 20px;"><strong>Cluster 1 (31.4%, n=4,096) - Higher-Income Females:</strong> Predominantly female (Z=-1.00), higher education (Z=+0.53), higher income (Z=+0.48), married (Z=+0.66).</li>
        <li style="margin-right: 20px;"><strong>Cluster 2 (14.2%, n=1,857) - High-FSA Segment:</strong> Very high FSA frequency (Z=+2.12), common cities (Z=+0.65), average education (Z=-0.03), average income (Z=-0.02).</li>
        <li style="margin-right: 20px;"><strong>Cluster 3 (23.7%, n=3,086) - Lower-Income Lower-Education:</strong> Significantly lower education (Z=-1.35), significantly lower income (Z=-1.26), lower married share than Clusters 0 and 1 (Z=+0.36).</li>
        <li style="margin-right: 20px;"><strong>Primary Segmentation Driver - FSA (Variance: 1.52):</strong> FSA frequency emerges as the strongest differentiator, primarily isolating Cluster 2.</li>
        <li style="margin-right: 20px;"><strong>Secondary Drivers - Education (0.78), Income (0.68), Gender (0.67):</strong> Education and Income separate Cluster 3 from the rest; Gender differentiates Cluster 0 (male) from Cluster 1 (female).</li>
        <li style="margin-right: 20px;"><strong>Geographic & Marital Independence (Variance: <0.15):</strong> Province, City, and Marital Status show minimal differentiation across GMM clusters.</li>
    </ul>
</div>
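
The "Variance" scores above are the variance of each feature's cluster means, as computed by `cluster_profiles_gmm.var(axis=0)`. The worked example below reproduces the arithmetic for Gender. It takes the Z-scores of Clusters 0 and 1 from the summary and assumes values of 0.0 for Clusters 2 and 3, which the summary does not report, so it is an illustration rather than a recomputation.

```python
import numpy as np

# Gender Z-scores per cluster: Clusters 0 and 1 are taken from the summary above;
# Clusters 2 and 3 are ASSUMED to be 0.0 because the summary does not report them.
gender_cluster_means = np.array([1.00, -1.00, 0.0, 0.0])

# pandas' DataFrame.var() defaults to the sample variance (ddof=1),
# so the same convention is used here.
gender_importance = gender_cluster_means.var(ddof=1)
print(round(gender_importance, 2))
```

Under that assumption the result is about 0.67, matching the reported Gender score. Note that this measure treats every cluster equally regardless of size, so a small but extreme cluster (such as the High-FSA segment) can dominate a feature's score.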


---

## **7.5 Self-Organizing Maps**

<div style="background-color: #e1e1e1ff; padding: 15px; margin-right: 30px; border-left: 5px solid; border-image: linear-gradient(to bottom, #212121, #313131, #595959, #909090) 1; border-radius: 5px; max-width: 95%;">
    <h3 style="margin-top: 0; margin-bottom: 10px; color: #000000ff; font-weight: bold;">Self-Organizing Maps (SOM) + K-Means Two-Stage Clustering Methodology</h3>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        Self-Organizing Maps are <strong>unsupervised neural networks</strong> that project high-dimensional data onto a low-dimensional grid while preserving topological relationships. Each neuron represents a weight vector in the input space, and during training, neurons are "pulled" toward data patterns, dragging their neighbors along. The two-stage approach combines SOM's dimensionality reduction with K-Means clustering on the learned neuron weights.
    </p>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #313131; font-weight: bold;">Algorithm:</h4>
    <ol style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
        <li style="margin-bottom: 5px;"><strong>Initialization:</strong> Randomly initialize neuron weight vectors and set neighborhood radius (σ) and learning rate (α)</li>
        <li style="margin-bottom: 5px;"><strong>BMU Selection:</strong> For each input pattern, find the Best Matching Unit (BMU) - the neuron with minimum Euclidean distance to the input</li>
        <li style="margin-bottom: 5px;"><strong>Weight Update:</strong> Update the BMU and its neighbors: w(new) = w(old) + α · h · [x - w(old)], where the neighborhood function h (1 at the BMU, decaying with grid distance) controls the update strength</li>
        <li style="margin-bottom: 5px;"><strong>Parameter Decay:</strong> Gradually reduce learning rate and neighborhood radius over iterations</li>
        <li style="margin-bottom: 5px;"><strong>Two-Stage Clustering:</strong> Apply K-Means to the trained SOM weight vectors, then map customers to clusters via their BMUs</li>
    </ol>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #313131; font-weight: bold;">Key Quality Metrics:</h4>
    <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
        <li style="margin-bottom: 5px;"><strong>Quantization Error (QE):</strong> Average distance between data points and their BMUs - measures data representation accuracy (lower is better)</li>
        <li style="margin-bottom: 5px;"><strong>Topographic Error (TE):</strong> Proportion of data points where 1st and 2nd BMUs are not adjacent - measures topology preservation (lower is better)</li>
    </ul>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #313131; font-weight: bold;">Workflow for This Analysis:</h4>
    <ol style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
        <li style="margin-bottom: 5px;"><strong>Parameter Grid Search:</strong> Test 18 combinations of grid sizes (10×10, 20×20, 40×40), learning rates (0.5, 0.75, 1.0), and sigma values (0.5, 1.0) to optimize QE/TE trade-off</li>
        <li style="margin-bottom: 5px;"><strong>SOM Visualization:</strong> Analyze trained SOM using Component Planes (feature distributions), U-Matrix (cluster boundaries), and Hit Map (customer density)</li>
        <li style="margin-bottom: 5px;"><strong>Two-Stage Clustering:</strong> Apply K-Means to 1,600 neuron weight vectors (40×40 grid), evaluate k=2-13 using Silhouette, Calinski-Harabasz, and Davies-Bouldin indices</li>
        <li style="margin-bottom: 5px;"><strong>Solution Comparison:</strong> Compare k=3 vs k=6 solutions by examining SOM grid visualizations, cluster sizes, and feature profiles</li>
        <li style="margin-bottom: 5px;"><strong>Final Model & Profiling:</strong> Select k=3 solution and analyze demographic characteristics, comparing segmentation patterns to standalone K-Means results</li>
    </ol>
</div>
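
A single training step (BMU selection followed by the neighborhood-weighted update) can be sketched in NumPy as follows. This is a simplified illustration, not MiniSom's implementation: it assumes a rectangular grid rather than the hexagonal topology used in this analysis, omits parameter decay, and the function name `som_update` is hypothetical.

```python
import numpy as np

def som_update(weights, x, learning_rate, sigma):
    """One online SOM step on a rectangular grid; returns updated weights and the BMU index."""
    rows, cols, _ = weights.shape
    # BMU: neuron whose weight vector is closest (Euclidean) to the input
    dists = np.linalg.norm(weights - x, axis=2)
    bmu = np.unravel_index(np.argmin(dists), (rows, cols))
    # Gaussian neighborhood measured on the grid, centred on the BMU
    rr, cc = np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij")
    grid_dist_sq = (rr - bmu[0]) ** 2 + (cc - bmu[1]) ** 2
    h = np.exp(-grid_dist_sq / (2.0 * sigma ** 2))
    # w(new) = w(old) + alpha * h * [x - w(old)]
    new_weights = weights + learning_rate * h[:, :, None] * (x - weights)
    return new_weights, bmu

# Toy setup (assumption): 5x5 grid, 3 input features
rng = np.random.default_rng(1)
weights0 = rng.normal(size=(5, 5, 3))
x = rng.normal(size=3)

weights1, bmu = som_update(weights0, x, learning_rate=0.5, sigma=1.0)
print("BMU grid position:", bmu)
```

With σ=0.5 the neighborhood weight falls to roughly 0.14 one grid step away from the BMU, so neighbors are barely dragged along, whereas σ=1.0 gives them a weight of about 0.61.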


### **7.5.1 SOM Parameter Grid Search**

In [None]:
# Define parameter grid for SOM optimization
grid_sizes = [10, 20, 40]
learning_rates = [0.5, 0.75, 1.0]
sigma_values = [0.5, 1.0]

# Prepare scaled data for SOM training
som_data = df_demographic_a_scaled[feats].values


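# Grid search (disabled): the loop below is wrapped in a string literal so it is not re-run;
# its cached results are loaded from CSV in a later cell.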
'''grid_search_results = []

# Perform grid search
for grid_size in grid_sizes:
    for lr in learning_rates:
        for sigma in sigma_values:
            # Initialize SOM with current parameters
            som = MiniSom(
                x=grid_size,
                y=grid_size,
                input_len=len(feats),
                sigma=sigma,
                learning_rate=lr,
                neighborhood_function='gaussian',
                topology='hexagonal',
                activation_distance='euclidean',
                random_seed=1
            )
            
            # Initialize weights randomly from data
            som.random_weights_init(som_data)
            
            # Train SOM
            # Scale iterations with map size (500 per neuron): larger grids have more neurons that need sufficient updates to converge properly
            num_iterations = 500 * grid_size * grid_size
            som.train_batch(som_data, num_iteration=num_iterations, verbose=False)
            
            # Calculate quality metrics
            qe = som.quantization_error(som_data)
            te = som.topographic_error(som_data)
            
            # Store results
            grid_search_results.append({
                'grid_size': f'{grid_size}x{grid_size}',
                'learning_rate': lr,
                'sigma': sigma,
                'units': grid_size * grid_size,
                'quantization_error': qe,
                'topographic_error': te
            })

# Convert results to DataFrame
grid_search_df = pd.DataFrame(grid_search_results)
grid_search_df = grid_search_df.sort_values(['quantization_error', 'topographic_error']).reset_index(drop=True)'''

In [None]:
'''grid_search_df.to_csv('data/output_data/som_grid_search_results.csv', index=False)'''

In [None]:
grid_search_df = pd.read_csv('data/output_data/som_grid_search_results.csv')

In [None]:
# Visualize grid search results
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Quantization Error by Grid Size
for lr in learning_rates:
    for sigma in sigma_values:
        subset = grid_search_df[(grid_search_df['learning_rate'] == lr) & (grid_search_df['sigma'] == sigma)]
        axes[0].plot(subset['units'], subset['quantization_error'], 
                    marker='o', alpha=0.6, label=f'LR={lr}, σ={sigma}')

axes[0].set_xlabel('Number of SOM Units', fontsize=11, fontweight='bold')
axes[0].set_ylabel('Quantization Error (QE)', fontsize=11, fontweight='bold')
axes[0].set_title('SOM Grid Search: Quantization Error', fontsize=12, fontweight='bold')
axes[0].grid(False)

# Plot 2: Topographic Error by Grid Size
for lr in learning_rates:
    for sigma in sigma_values:
        subset = grid_search_df[(grid_search_df['learning_rate'] == lr) & (grid_search_df['sigma'] == sigma)]
        axes[1].plot(subset['units'], subset['topographic_error'], 
                    marker='o', alpha=0.6, label=f'LR={lr}, σ={sigma}')

axes[1].set_xlabel('Number of SOM Units', fontsize=11, fontweight='bold')
axes[1].set_ylabel('Topographic Error (TE)', fontsize=11, fontweight='bold')
axes[1].set_title('SOM Grid Search: Topographic Error', fontsize=12, fontweight='bold')
axes[1].grid(False)
axes[1].legend(bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=8)

plt.tight_layout()
plt.show()

# Display best parameter combinations
grid_search_df

In [None]:
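# Select best parameters: lowest QE, with TE as tie-breaker.
# Sort explicitly so the choice does not depend on the row order of the cached CSV.
best_params = grid_search_df.sort_values(['quantization_error', 'topographic_error']).iloc[0]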

# Extract parameters for later use
selected_grid_size = int(best_params['grid_size'].split('x')[0])
selected_lr = best_params['learning_rate']
selected_sigma = best_params['sigma']

# Display best parameters
best_params_display = pd.DataFrame({
    'Parameter': ['Grid Size', 'Learning Rate', 'Sigma', 'Total Units', 'Quantization Error', 'Topographic Error'],
    'Value': [
        best_params['grid_size'],
        best_params['learning_rate'],
        best_params['sigma'],
        best_params['units'],
        f"{best_params['quantization_error']:.4f}",
        f"{best_params['topographic_error']:.4f}"
    ]
})

best_params_display

<div style="background-color: #e8f3e5ff; padding: 15px; margin-right: 30px; border-left: 5px solid; border-image: linear-gradient(to bottom, #00411E, #00622D, #00823C, #45AF28, #82BA72) 1; border-radius: 5px; max-width: 95%;">
  <h3 style="margin-top: 0; margin-bottom: 10px; color: #00411E; font-weight: bold;">SOM Parameter Grid Search Summary</h3>

  <p style="margin: 10px 0; margin-right: 40px; color: #000;">
    Performed a systematic grid search over 18 parameter combinations to identify a suitable SOM configuration for demographic clustering. Evaluated 3 grid sizes (10×10, 20×20, 40×40), 3 learning rates (0.5, 0.75, 1.0), and 2 sigma values (0.5, 1.0), with iterations scaled proportionally to the number of neurons (500 × grid_size²) to support convergence.
  </p>

  <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Goal:</h4>
  <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
    <li style="margin-right: 20px;">Find optimal SOM parameters that balance low Quantization Error (accurate data representation) with low Topographic Error (preserved neighborhood topology) for subsequent two-stage clustering</li>
  </ul>

  <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Methodology:</h4>
  <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
    <li style="margin-right: 20px;"><strong>Quality metrics:</strong> Quantization Error (QE) measures average distance between data points and their Best Matching Units (lower = better data representation). Topographic Error (TE) measures proportion of data points where first and second BMUs are not adjacent (lower = better topology preservation)</li>
    <li style="margin-right: 20px;"><strong>Iteration scaling:</strong> Training iterations set to 500 × grid_size² (e.g., 50,000 for 10×10, 200,000 for 20×20, 800,000 for 40×40) following the standard heuristic of ~500 iterations per neuron to ensure proper convergence across all grid sizes</li>
    <li style="margin-right: 20px;"><strong>Parameter selection rationale:</strong> For two-stage clustering (SOM + K-Means), moderate grid sizes are preferred to ensure meaningful data compression. While larger grids minimize QE, they reduce the SOM's ability to aggregate similar customers into prototype neurons, diminishing the benefit of the two-stage approach</li>
  </ul>

  <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Findings:</h4>
  <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
    <li style="margin-right: 20px;"><strong>Grid size dominates QE:</strong> Larger grids yield substantially lower QE. 10×10 produced QE 0.84–0.92, 20×20 achieved 0.44–0.48, and 40×40 reached 0.16–0.22.</li>
    <li style="margin-right: 20px;"><strong>Sigma controls QE-TE trade-off:</strong> Small sigma (σ=0.5) damages topology preservation (TE 0.30–0.80), while σ=1.0 keeps TE markedly lower (0.08–0.30) at the cost of a marginal QE increase.</li>
    <li style="margin-right: 20px;"><strong>Learning rate has moderate impact:</strong> lr=1.0 outperforms lr=0.5 by 5–8% in QE, though lr=0.5 produces slightly better TE.</li>
    <li style="margin-right: 20px;"><strong>Best configuration:</strong> 40×40 grid with lr=1.0 and σ=1.0 achieved lowest QE (0.157) while maintaining excellent TE (0.092).</li>
    <li style="margin-right: 20px;"><strong>Two-stage clustering:</strong> 40×40 with 1,600 neurons still provides data compression (about 8 customers per neuron on average), so K-Means on the SOM weights is not equivalent to direct K-Means on the raw data.</li>
    <li style="margin-right: 20px;"><strong>Decision:</strong> Use 40×40 with lr=1.0 and σ=1.0 for SOM visualization and two-stage clustering.</li>
  </ul>
</div>
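
The two quality metrics used in this grid search can be written out by hand. The sketch below follows the definitions given above; it assumes a rectangular grid where neurons at most one step apart in each direction count as adjacent, which differs from the hexagonal topology used in this analysis, and the function name `som_quality` is hypothetical. The notebook itself uses MiniSom's `quantization_error` and `topographic_error`.

```python
import numpy as np

def som_quality(weights, data):
    """Quantization and topographic error for a rectangular SOM grid."""
    rows, cols, dim = weights.shape
    flat = weights.reshape(rows * cols, dim)
    # Distance from every sample to every neuron: shape (n_samples, n_neurons)
    dists = np.linalg.norm(data[:, None, :] - flat[None, :, :], axis=2)
    order = np.argsort(dists, axis=1)
    first, second = order[:, 0], order[:, 1]
    # QE: mean distance between each sample and its best matching unit
    qe = dists[np.arange(len(data)), first].mean()
    # TE: share of samples whose 1st and 2nd BMUs are not grid neighbors
    r1, c1 = np.divmod(first, cols)
    r2, c2 = np.divmod(second, cols)
    adjacent = (np.abs(r1 - r2) <= 1) & (np.abs(c1 - c2) <= 1)
    te = 1.0 - adjacent.mean()
    return qe, te

# Toy 1x3 maps (assumption): same weight values, different placement on the grid
ordered = np.array([[[0.0], [1.0], [10.0]]])  # similar weights sit next to each other
folded = np.array([[[0.0], [10.0], [1.0]]])   # similar weights are two steps apart
sample = np.array([[0.4]])

qe_ordered, te_ordered = som_quality(ordered, sample)
qe_folded, te_folded = som_quality(folded, sample)
print(qe_ordered, te_ordered, qe_folded, te_folded)
```

Both toy maps represent the sample equally well (identical QE), but only the ordered map preserves topology. This is why QE and TE have to be read together: a small σ can lower QE while leaving the map "folded".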


### **7.5.2 SOM Visualizations**

In [None]:
# Train SOM with selected best parameters
som_best = MiniSom(
    x=selected_grid_size,
    y=selected_grid_size,
    input_len=len(feats),
    sigma=selected_sigma,
    learning_rate=selected_lr,
    neighborhood_function='gaussian',
    topology='hexagonal',
    activation_distance='euclidean',
    random_seed=1
)

# Initialize and train
som_best.random_weights_init(som_data)
som_best.train_batch(som_data, num_iteration=500 * selected_grid_size * selected_grid_size, verbose=False)

print(f"SOM trained with parameters: Grid={selected_grid_size}x{selected_grid_size}, LR={selected_lr}, σ={selected_sigma}")

In [None]:
# Component Planes
n_cols = 3
n_rows = int(np.ceil(len(feats) / n_cols))

fig, axes = plt.subplots(n_rows, n_cols, figsize=(15, n_rows * 4.5))
axes = axes.flatten()

for idx, feature in enumerate(feats):
    weights = som_best.get_weights()[:, :, idx]
    visualize_som_grid(som_best, weights, feature, ax=axes[idx])

for idx in range(len(feats), len(axes)):
    axes[idx].axis('off')

plt.suptitle('Component Planes: Feature Distribution across SOM Grid', fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()


In [None]:
# U-Matrix - Unified Distance Matrix
visualize_som_grid(som_best, som_best.distance_map(), 'U-Matrix: Unified Distance Matrix')

In [None]:
# Hit Map - Customer distribution across SOM grid
hitsmatrix = som_best.activation_response(som_data)
visualize_som_grid(som_best, hitsmatrix, 'Hit Map: Customer Distribution')

display(pd.DataFrame({
    'Metric': ['Total Customers', 'Avg per Unit', 'Max in Single Unit', 'Min in Single Unit'],
    'Value': [f"{hitsmatrix.sum():.0f}", f"{hitsmatrix.mean():.2f}", f"{hitsmatrix.max():.0f}", f"{hitsmatrix.min():.0f}"]
}))

# Count units with 0 hits
n_zero_hit_units = np.sum(hitsmatrix == 0)
print("Number of units with 0 hits:", n_zero_hit_units)

<div style="background-color: #e8f3e5ff; padding: 15px; margin-right: 30px; border-left: 5px solid; border-image: linear-gradient(to bottom, #00411E, #00622D, #00823C, #45AF28, #82BA72) 1; border-radius: 5px; max-width: 95%;">
  <h3 style="margin-top: 0; margin-bottom: 10px; color: #00411E; font-weight: bold;">SOM Visualization Summary</h3>

  <p style="margin: 10px 0; margin-right: 40px; color: #000;">
    Visualized the trained 40×40 SOM using three complementary outputs: Component Planes (feature distributions), U-Matrix (cluster boundaries), and Hit Map (customer density). Each visualization serves a distinct purpose in understanding the SOM's learned structure.
  </p>

  <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Goal:</h4>
  <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
    <li style="margin-right: 20px;">Understand how demographic features are distributed across the SOM grid, identify natural cluster boundaries, and assess whether customer mappings are evenly distributed or concentrated in specific regions</li>
  </ul>

  <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Methodology:</h4>
  <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
    <li style="margin-right: 20px;"><strong>Component Planes:</strong> Display each feature's weight distribution across the SOM grid. <span style="color:#b2182b;">Red</span> shades indicate higher Z-scores, <span style="color:#2166ac;">blue</span> shades indicate lower Z-scores</li>
    <li style="margin-right: 20px;"><strong>U-Matrix:</strong> Shows average distance between each SOM unit and its neighbors. <span style="color:#b2182b;">Red</span> regions indicate high distances (cluster boundaries), <span style="color:#2166ac;">blue</span> regions indicate homogeneous areas (cluster centers)</li>
    <li style="margin-right: 20px;"><strong>Hit Map:</strong> Displays how many customers are mapped to each SOM unit. <span style="color:#b2182b;">Red</span> units have more customers, <span style="color:#2166ac;">blue</span> units have fewer</li>
  </ul>

  <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Findings:</h4>
  <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
    <li style="margin-right: 20px;"><strong>Education and Income show strong correlation:</strong> Both Education_Level_Num and Income_Bin_Num display similar gradients with <span style="color:#f4a582;">yellow/beige</span> regions (higher values, Z=+1.5 to +2.0) in the upper-left area transitioning to <span style="color:#2166ac;">blue</span> regions (lower values, Z=-1.0) in the lower portion, indicating these features co-vary and define a primary segmentation axis</li>
    <li style="margin-right: 20px;"><strong>Gender shows clear binary segmentation:</strong> Gender_Encoded displays distinct <span style="color:#b2182b;">red</span> (Z=+1, male) and <span style="color:#2166ac;">blue</span> (Z=-1, female) regions distributed in a patchy pattern across the grid, representing a secondary segmentation dimension independent of the Education-Income axis</li>
    <li style="margin-right: 20px;"><strong>Geographic features exhibit correlated patterns:</strong> Province_Encoded, City_Encoded, and Location_Code_Num show similar spatial distributions with <span style="color:#b2182b;">red</span> regions (common areas) concentrated in the upper-left and <span style="color:#2166ac;">blue</span> regions (rare areas) elsewhere, capturing related geographic information</li>
    <li style="margin-right: 20px;"><strong>FSA shows sparse high-value hotspots:</strong> FSA_Encoded is predominantly <span style="color:#2166ac;">blue</span> (Z=0 to -1.0) with isolated <span style="color:#b2182b;">red</span> points reaching extreme values (Z=+2.5), indicating most customers have low FSA frequency with a small subset from high-frequency postal codes</li>
    <li style="margin-right: 20px;"><strong>Marital status shows scattered patterns:</strong> Marital_Divorced displays uniformly scattered <span style="color:#b2182b;">red</span> points across a predominantly <span style="color:#2166ac;">blue</span> grid, while Marital_Married shows more regional variation but no clear cluster alignment</li>
    <li style="margin-right: 20px;"><strong>U-Matrix reveals diffuse cluster boundaries:</strong> The U-Matrix shows predominantly <span style="color:#92c5de;">light blue</span> regions (low inter-neuron distances, 0.2-0.4) with scattered <span style="color:#f4a582;">orange</span>/<span style="color:#b2182b;">red</span> points (higher distances, 0.6-1.0) throughout, suggesting gradual transitions between customer segments rather than sharp cluster boundaries</li>
    <li style="margin-right: 20px;"><strong>Hit Map reveals uneven coverage:</strong> With 13,038 total customers mapped (avg 8.15 per unit), the distribution shows concentration in certain regions (max 78 per unit) while <strong>657 units (41%) remain empty</strong>. This indicates the data naturally clusters in specific SOM regions, supporting the two-stage clustering approach where K-Means can identify these dense areas</li>
  </ul>
</div>
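
To make the U-Matrix and Hit Map definitions above concrete, here is a minimal pure-Python sketch on a hypothetical 2×3 toy grid (not the trained 40×40 SOM). The names `toy_weights`, `toy_samples`, `u_matrix`, and `hit_map` are illustrative assumptions. MiniSom's `distance_map` applies its own neighborhood and normalization, so its values will not match this simple 4-neighbor average exactly.

```python
import math

def u_matrix(weights):
    """Mean distance from each unit's weight vector to its 4-connected neighbors."""
    rows, cols = len(weights), len(weights[0])
    um = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            dists = []
            for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ni, nj = i + di, j + dj
                if 0 <= ni < rows and 0 <= nj < cols:
                    dists.append(math.dist(weights[i][j], weights[ni][nj]))
            um[i][j] = sum(dists) / len(dists)
    return um

def hit_map(weights, samples):
    """Count how many samples have each unit as their best-matching unit (BMU)."""
    rows, cols = len(weights), len(weights[0])
    hits = [[0] * cols for _ in range(rows)]
    for s in samples:
        bi, bj = min(
            ((i, j) for i in range(rows) for j in range(cols)),
            key=lambda ij: math.dist(s, weights[ij[0]][ij[1]]),
        )
        hits[bi][bj] += 1
    return hits

# Hypothetical toy grid: the two left columns are similar, the right column is far away
toy_weights = [
    [[0.0, 0.0], [0.1, 0.0], [2.0, 2.0]],
    [[0.0, 0.1], [0.1, 0.1], [2.0, 2.1]],
]
toy_samples = [[0.0, 0.0], [0.02, 0.01], [2.0, 2.0], [2.1, 2.1]]

toy_um = u_matrix(toy_weights)
toy_hits = hit_map(toy_weights, toy_samples)
empty_units = sum(h == 0 for row in toy_hits for h in row)

print(toy_um)       # units next to the far column get larger values (a "boundary")
print(toy_hits)     # per-unit customer counts
print(empty_units)  # units with zero hits
```

Units bordering the distant right-hand column receive a higher U-Matrix value than units inside the homogeneous left-hand block, which is the same reading applied to the red and blue regions above.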


### **7.5.3 Emergent SOM Training**

In [None]:
# Emergent SOM for Two-Stage Clustering
# Using 40x40 grid - optimal balance between granularity and neuron coverage

som_emergent = som_best  # Use already trained 40x40 SOM from Grid Search
emergent_grid_size = 40

qe_emergent = som_emergent.quantization_error(som_data)
te_emergent = som_emergent.topographic_error(som_data)

pd.DataFrame({
    'Metric': ['Grid Size', 'Total Units', 'Avg Customers/Unit', 'Quantization Error', 'Topographic Error'],
    'Value': [f'{emergent_grid_size}x{emergent_grid_size}', f'{emergent_grid_size**2:,}', f'{len(som_data)/emergent_grid_size**2:.2f}', f'{qe_emergent:.4f}', f'{te_emergent:.4f}']
})

<div style="background-color: #e8f3e5ff; padding: 15px; margin-right: 30px; border-left: 5px solid; border-image: linear-gradient(to bottom, #00411E, #00622D, #00823C, #45AF28, #82BA72) 1; border-radius: 5px; max-width: 95%;">
  <h3 style="margin-top: 0; margin-bottom: 10px; color: #00411E; font-weight: bold;">SOM for Two-Stage Clustering</h3>

  <p style="margin: 10px 0; margin-right: 40px; color: #000;">
    The 40×40 SOM (1,600 units) trained with the optimal Grid Search parameters (lr=1.0, σ=1.0) serves as the foundation for two-stage clustering. K-Means will cluster the neuron weight vectors to identify the final customer segments.
  </p>

  <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Why 40×40 Grid for Two-Stage Clustering:</h4>
  <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
    <li style="margin-right: 20px;"><strong>Meaningful data compression:</strong> With 13,038 customers across 1,600 neurons, each neuron represents ~8.15 customers on average. This compression ratio ensures the SOM provides genuine noise reduction and prototype learning, differentiating two-stage clustering from direct K-Means on raw data</li>
    <li style="margin-right: 20px;"><strong>Robust neuron coverage:</strong> The moderate density of ~8 customers per neuron ensures weight vectors are stable representations of local customer profiles, not dominated by individual outliers</li>
    <li style="margin-right: 20px;"><strong>Balanced granularity:</strong> 1,600 neurons provide sufficient resolution for K-Means to identify natural cluster boundaries while avoiding the over-granularity problem where very large grids (e.g., 50×50 or larger) would produce results nearly identical to direct clustering</li>
  </ul>

  <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Quality Metrics:</h4>
  <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
    <li style="margin-right: 20px;"><strong>Quantization Error: 0.1571</strong> - low error indicating neurons accurately represent the underlying data distribution while providing meaningful compression</li>
    <li style="margin-right: 20px;"><strong>Topographic Error: 0.0915</strong> - only 9.2% of data points have non-adjacent 1st and 2nd BMUs, confirming excellent topology preservation for interpretable clustering results</li>
  </ul>
</div>
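
The two quality metrics above can be sketched by hand as follows. This is an illustrative pure-Python version on a hypothetical 1×4 toy grid, not MiniSom's implementation: the function names are assumptions, and "adjacent" is taken here to mean a Chebyshev grid distance of at most 1 (so diagonal units count as neighbors), which is an assumption about the adjacency rule.

```python
import math

def two_best_units(sample, weights):
    """Grid coordinates of the best and second-best matching units for one sample."""
    ranked = sorted(
        ((i, j) for i in range(len(weights)) for j in range(len(weights[0]))),
        key=lambda ij: math.dist(sample, weights[ij[0]][ij[1]]),
    )
    return ranked[0], ranked[1]

def quantization_error(samples, weights):
    """Average distance between each sample and the weight vector of its BMU."""
    total = 0.0
    for s in samples:
        (i, j), _ = two_best_units(s, weights)
        total += math.dist(s, weights[i][j])
    return total / len(samples)

def topographic_error(samples, weights):
    """Share of samples whose 1st and 2nd BMUs are not adjacent on the grid."""
    errors = 0
    for s in samples:
        (i1, j1), (i2, j2) = two_best_units(s, weights)
        if max(abs(i1 - i2), abs(j1 - j2)) > 1:
            errors += 1
    return errors / len(samples)

# Hypothetical 1x4 grid whose last unit is "out of order" in weight space
toy_weights = [[[0.0], [1.0], [2.0], [0.4]]]
toy_samples = [[0.1], [1.6], [2.0], [1.9]]

qe = quantization_error(toy_samples, toy_weights)
te = topographic_error(toy_samples, toy_weights)
print(round(qe, 4), te)
```

Only the first sample has a second-best unit three columns away from its BMU, so one of four samples counts as a topographic error.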

### **7.5.4 Defining the number of clusters**

In [None]:
# Flatten SOM weights for K-Means clustering
som_weights_flat = som_emergent.get_weights().reshape(-1, len(feats))

pd.DataFrame({
    'Description': ['SOM weights shape', 'Total neurons', 'Features per neuron'],
    'Value': [str(som_weights_flat.shape), emergent_grid_size**2, len(feats)]
})

In [None]:
# Evaluate K-Means on SOM weights across range of k values
som_km_k_range = range(2, 14)
som_km_metrics = {
    'k': [],
    'Inertia': [],
    'Silhouette': [],
    'Calinski-Harabasz': [],
    'Davies-Bouldin': []
}

# Store fitted neuron labels for later use
som_km_fitted_neuron_labels = {}

# Helper function to map neuron labels to customers via BMU
def get_customer_labels(som, data, neuron_labels, grid_size):
    customer_labels = []
    for sample in data:
        bmu = som.winner(sample)
        neuron_idx = bmu[0] * grid_size + bmu[1]
        customer_labels.append(neuron_labels[neuron_idx])
    return np.array(customer_labels)

for k in som_km_k_range:
    kmeans = KMeans(n_clusters=k, init='k-means++', n_init=15, random_state=1, max_iter=300)
    km_neuron_labels = kmeans.fit_predict(som_weights_flat)
    
    # Store neuron labels
    som_km_fitted_neuron_labels[k] = km_neuron_labels
    
    # Map neuron labels to customers via BMU
    customer_labels = get_customer_labels(som_emergent, som_data, km_neuron_labels, emergent_grid_size)
    
    # Metrics on customer data
    metrics = evaluate_clustering_metrics(som_data, customer_labels)
    
    som_km_metrics['k'].append(k)
    som_km_metrics['Inertia'].append(kmeans.inertia_)
    som_km_metrics['Silhouette'].append(metrics['Silhouette Score'])
    som_km_metrics['Calinski-Harabasz'].append(metrics['Calinski-Harabasz Index'])
    som_km_metrics['Davies-Bouldin'].append(metrics['Davies-Bouldin Index'])

som_km_metrics_df = pd.DataFrame(som_km_metrics)
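
The helper `get_customer_labels` relies on the flat neuron index `row * grid_size + col` matching the row-major order produced by `reshape(-1, len(feats))`. The sketch below illustrates that lookup on a hypothetical 3×3 grid; `neuron_labels` and `bmus` are made-up values for illustration only.

```python
def flat_index(row, col, grid_size):
    """Row-major position of grid cell (row, col) in a flattened grid."""
    return row * grid_size + col

grid_size = 3

# Hypothetical K-Means labels for the 9 neurons, listed in row-major order
neuron_labels = [0, 0, 1,
                 0, 1, 1,
                 2, 2, 1]

# Hypothetical BMU coordinates (row, col) for five customers
bmus = [(0, 0), (0, 2), (1, 1), (2, 0), (2, 2)]

# Each customer inherits the cluster label of its best-matching neuron
customer_labels = [neuron_labels[flat_index(r, c, grid_size)] for r, c in bmus]
print(customer_labels)
```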

In [None]:
# Elbow Method
plot_elbow_method(
    som_km_k_range,
    som_km_metrics_df['Inertia'].tolist(),
    CUSTOM_HEX,
    title='SOM + K-Means: Elbow Method for Optimal k'
)

In [None]:
# Clustering metrics comparison
plot_clustering_metrics(
    som_km_metrics_df,
    som_km_k_range,
    CUSTOM_HEX,
    title='SOM + K-Means: Clustering Metrics Evaluation'
)

som_km_metrics_df

<div style="background-color: #e8f3e5ff; padding: 15px; margin-right: 30px; border-left: 5px solid; border-image: linear-gradient(to bottom, #00411E, #00622D, #00823C, #45AF28, #82BA72) 1; border-radius: 5px; max-width: 95%;">
    <h3 style="margin-top: 0; margin-bottom: 10px; color: #00411E; font-weight: bold;">Optimal k Selection Summary (SOM + K-Means)</h3>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
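        Evaluated k=2 to k=13 by running K-Means on the 1,600 SOM neuron weight vectors. Inertia is computed on the neuron weights, while Silhouette, Calinski-Harabasz, and Davies-Bouldin are computed on the customer data after each customer inherits the cluster of its BMU.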
    </p>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Goal:</h4>
    <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
        <li style="margin-right: 20px;">Determine k that balances cluster quality metrics with business interpretability for actionable customer segmentation</li>
    </ul>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Methodology:</h4>
    <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
        <li style="margin-right: 20px;"><strong>Elbow Method (Inertia):</strong> Total within-cluster sum of squared distances on SOM weights - look for "elbow" where adding clusters yields diminishing returns</li>
        <li style="margin-right: 20px;"><strong>Silhouette Score:</strong> Measures cluster cohesion and separation (range -1 to 1, higher is better)</li>
        <li style="margin-right: 20px;"><strong>Calinski-Harabasz Index:</strong> Ratio of between-cluster to within-cluster variance (higher is better)</li>
        <li style="margin-right: 20px;"><strong>Davies-Bouldin Index:</strong> Average similarity between clusters (lower is better)</li>
    </ul>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Findings:</h4>
    <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
        <li style="margin-right: 20px;"><strong>k=3 - Statistical Optimum:</strong> Achieves the <strong>highest Calinski-Harabasz Index (3,064)</strong> across all k values and shows a clear <strong>elbow point</strong> where inertia reduction slows significantly. Silhouette (0.203) and Davies-Bouldin (1.740) are competitive. This aligns with the 3-cluster solutions from Hierarchical and K-Means clustering, enabling cross-method comparison</li>
        <li style="margin-right: 20px;"><strong>k=6 - Best Cluster Quality:</strong> Achieves <strong>highest Silhouette (0.214)</strong> among lower k values and <strong>best Davies-Bouldin (1.654)</strong>, indicating well-separated, compact clusters. This solution provides finer granularity while maintaining strong cluster cohesion</li>
        <li style="margin-right: 20px;"><strong>Trade-off Analysis:</strong> k=3 maximizes between-cluster separation (highest CH), aligns with the elbow point, and offers simplicity for high-level strategic segmentation. k=6 optimizes cluster cohesion (best Silhouette, DBI), suitable for more detailed customer differentiation</li>
        <li style="margin-right: 20px;"><strong>Decision for next step:</strong> Compare k=3 vs k=6 to determine the final clustering configuration based on cluster profiles and business interpretability</li>
    </ul>
</div>
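
As a worked illustration of one of the indices above, the Calinski-Harabasz Index can be computed by hand as the between-cluster dispersion divided by (k - 1), over the within-cluster dispersion divided by (n - k). The sketch below is a pure-Python version on hypothetical toy points; the notebook itself obtains the value through `evaluate_clustering_metrics`, whose internals are not shown here.

```python
def calinski_harabasz(points, labels):
    """Ratio of between-cluster to within-cluster dispersion, each scaled by its degrees of freedom."""
    n, dims = len(points), len(points[0])
    clusters = sorted(set(labels))
    k = len(clusters)
    overall = [sum(p[d] for p in points) / n for d in range(dims)]

    between = 0.0
    within = 0.0
    for c in clusters:
        members = [p for p, lab in zip(points, labels) if lab == c]
        centroid = [sum(p[d] for p in members) / len(members) for d in range(dims)]
        between += len(members) * sum((centroid[d] - overall[d]) ** 2 for d in range(dims))
        within += sum((p[d] - centroid[d]) ** 2 for p in members for d in range(dims))

    return (between / (k - 1)) / (within / (n - k))

# Hypothetical toy data: two tight groups far apart along the first axis
toy_points = [[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]]
good_labels = [0, 0, 1, 1]   # follows the natural grouping
bad_labels = [0, 1, 0, 1]    # mixes the two groups

ch_good = calinski_harabasz(toy_points, good_labels)
ch_bad = calinski_harabasz(toy_points, bad_labels)
print(ch_good, ch_bad)
```

The labeling that respects the natural grouping scores far higher, which is the sense in which a higher CH value is read as better separation.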


### **7.5.5 Comparison of k Solutions**

In [None]:
# Compare two candidate solutions based on clustering metrics analysis
# k=3
# k=6

som_km_k_candidate_1 = 3
som_km_k_candidate_2 = 6

# Use pre-fitted neuron labels from 7.5.4
som_km_labels_k1 = som_km_fitted_neuron_labels[som_km_k_candidate_1]
som_km_labels_k2 = som_km_fitted_neuron_labels[som_km_k_candidate_2]

# Map to customers 
customer_labels_k1 = get_customer_labels(som_emergent, som_data, som_km_labels_k1, emergent_grid_size)
customer_labels_k2 = get_customer_labels(som_emergent, som_data, som_km_labels_k2, emergent_grid_size)

# Create temporary DataFrames with cluster labels
df_temp_k1 = pd.DataFrame(som_data, columns=feats)
df_temp_k1['Cluster'] = customer_labels_k1

df_temp_k2 = pd.DataFrame(som_data, columns=feats)
df_temp_k2['Cluster'] = customer_labels_k2

# Calculate cluster profiles (mean values per cluster)
cluster_profiles_k1 = df_temp_k1.groupby('Cluster').mean()
cluster_profiles_k2 = df_temp_k2.groupby('Cluster').mean()

# Display both profiles for comparison
print(f"Cluster Profiles for k={som_km_k_candidate_1}:")
display(cluster_profiles_k1.round(3))

print(f"\nCluster Profiles for k={som_km_k_candidate_2}:")
display(cluster_profiles_k2.round(3))

# Visualize cluster size comparison
plot_cluster_size_comparison(
    labels_dict={som_km_k_candidate_1: customer_labels_k1, som_km_k_candidate_2: customer_labels_k2},
    palette=CUSTOM_HEX,
    title='SOM + K-Means: Cluster Size Comparison (k=3 vs k=6)'
)

In [None]:
# Visualize k=3
cluster_grid_k3 = som_km_labels_k1.reshape((emergent_grid_size, emergent_grid_size))
visualize_som_grid(som_emergent, cluster_grid_k3.astype(float), f'SOM + K-Means Clustering (k={som_km_k_candidate_1})')


In [None]:
# Visualize k=6
cluster_grid_k6 = som_km_labels_k2.reshape((emergent_grid_size, emergent_grid_size))
visualize_som_grid(som_emergent, cluster_grid_k6.astype(float), f'SOM + K-Means Clustering (k={som_km_k_candidate_2})')

<div style="background-color: #e8f3e5ff; padding: 15px; margin-right: 30px; border-left: 5px solid; border-image: linear-gradient(to bottom, #00411E, #00622D, #00823C, #45AF28, #82BA72) 1; border-radius: 5px; max-width: 95%;">
    <h3 style="margin-top: 0; margin-bottom: 10px; color: #00411E; font-weight: bold;">Comparison of Clustering Solutions Summary</h3>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        <strong>Decision: k=3 selected</strong>
    </p>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        K=3 emerges as the optimal solution, primarily driven by the <strong>clear elbow point</strong> in the inertia curve where additional clusters yield diminishing returns. This is further supported by the <strong>highest Calinski-Harabasz Index (3,064)</strong> across all k values, indicating optimal between-cluster separation. The SOM grid visualization shows three clearly defined, contiguous cluster regions with minimal fragmentation. The cluster size distribution is well-balanced (31.9%, 38.3%, 29.8%), ensuring all segments are substantial enough for actionable marketing strategies. While k=6 achieves better Silhouette (0.214 vs 0.203) and Davies-Bouldin (1.654 vs 1.740), the elbow criterion clearly favors k=3, and k=6 produces smaller clusters (smallest at 8.7%) with more fragmented spatial distribution. K=3 aligns with the solutions from Hierarchical and K-Means clustering, enabling direct cross-method comparison.
    </p>
</div>

### **7.5.6 Final SOM Model**

In [None]:
# Select k=3 and use pre-fitted labels
selected_k = 3
df_demographic_a_scaled['Cluster_SOM_KMeans'] = customer_labels_k1

# Calculate final metrics
som_labels_final = df_demographic_a_scaled['Cluster_SOM_KMeans'].values
som_final_metrics = evaluate_clustering_metrics(df_demographic_a_scaled[feats], som_labels_final)

# Store for final comparison (Section 9)
demo_clustering_results['SOM + K-Means'] = {
    'k': selected_k,
    'Silhouette': som_final_metrics['Silhouette Score'],
    'Calinski-Harabasz': som_final_metrics['Calinski-Harabasz Index'],
    'Davies-Bouldin': som_final_metrics['Davies-Bouldin Index'],
    'R2': get_rsq(df_demographic_a_scaled[feats + ['Cluster_SOM_KMeans']], feats, 'Cluster_SOM_KMeans'),
    'labels': som_labels_final
}

### **7.5.7 SOM Cluster Profiling**

In [None]:
# 1. Profile Heatmap - Z-scores of demographic features per cluster (SOM)
# Reuse cluster_profiles
cluster_profiles_som = cluster_profiles_k1
som_population_mean = df_demographic_a_scaled[feats].mean()

plot_cluster_profiles_heatmap(
    cluster_profiles_som,
    som_population_mean,
    GROUP80_palette_continuous,
    title='SOM + K-Means Clustering: Demographic Profiles (n=3)\nStandardized Z-Scores per Cluster'
)


In [None]:
# 2. Cluster Size Distribution
plot_cluster_sizes(
    df_demographic_a_scaled['Cluster_SOM_KMeans'].values,
    selected_k,
    CUSTOM_HEX,
    title='SOM + K-Means Clustering - Final Cluster Sizes'
)

In [None]:
# 3. Feature Importance Analysis - Variance across clusters
som_feature_variance = cluster_profiles_som.var(axis=0).sort_values(ascending=False)

plot_feature_importance(
    som_feature_variance,
    CUSTOM_HEX,
    title='SOM + K-Means Clustering: Feature Importance Analysis\nWhich Demographics Differentiate Clusters?'
)

# Display feature importance as DataFrame
som_feature_importance_df = pd.DataFrame({
    'Feature': som_feature_variance.index,
    'Variance': som_feature_variance.values
}).reset_index(drop=True)

display(som_feature_importance_df)


<div style="background-color: #e8f3e5ff; padding: 15px; margin-right: 30px; border-left: 5px solid; border-image: linear-gradient(to bottom, #00411E, #00622D, #00823C, #45AF28, #82BA72) 1; border-radius: 5px; max-width: 95%;">
    <h3 style="margin-top: 0; margin-bottom: 10px; color: #00411E; font-weight: bold;">SOM + K-Means Clustering Profiling Summary (k=3)</h3>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        Analyzed demographic characteristics of the final 3 SOM + K-Means clusters to identify distinct customer segments. The two-stage approach produces segmentation driven by Education, Income, and Geographic location, with Gender showing no differentiation.
    </p>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Key Findings:</h4>
    <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
        <li style="margin-right: 20px;"><strong>Cluster 0 (31.9%, n=4,156) - Common Regions Higher-Income:</strong> Common cities (Z=+1.03), common provinces (Z=+0.52), common FSA (Z=+0.59), higher education (Z=+0.58), higher income (Z=+0.50), married (Z=+0.66).</li>
        <li style="margin-right: 20px;"><strong>Cluster 1 (38.3%, n=4,997) - Rare Regions Higher-Income:</strong> Rare cities (Z=-0.85), rare provinces (Z=-0.45), rare FSA (Z=-0.50), higher education (Z=+0.57), higher income (Z=+0.49), married (Z=+0.67).</li>
        <li style="margin-right: 20px;"><strong>Cluster 2 (29.8%, n=3,885) - Lower-Education Lower-Income:</strong> Average geographic distribution, significantly lower education (Z=-1.35), significantly lower income (Z=-1.17), a lower married share than the other two clusters (Z=+0.38).</li>
        <li style="margin-right: 20px;"><strong>Primary Segmentation Driver - Education (Variance: 1.24):</strong> Education level emerges as the strongest differentiator, isolating Cluster 2 with significantly lower education.</li>
        <li style="margin-right: 20px;"><strong>Secondary Drivers - Income (0.92) & City (0.88):</strong> Income correlates strongly with Education; City frequency differentiates Clusters 0 and 1 (common vs rare regions).</li>
        <li style="margin-right: 20px;"><strong>Gender Independence (Variance: 0.0003):</strong> Gender shows virtually no differentiation across clusters, indicating that the k=3 K-Means partition of the SOM does not separate customers by gender, even though the Gender component plane itself is visibly patchy.</li>
    </ul> 
</div>
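
The feature-importance figures above are the variance of each feature's cluster means across the three clusters. As a check, the sketch below recomputes two of them from the rounded cluster means reported in this summary, using the sample variance (the default for pandas `.var()`). The dictionary layout is an illustrative assumption; only the numbers come from the summary.

```python
import statistics

# Rounded cluster means (Clusters 0, 1, 2) as reported in the summary above
cluster_means = {
    "Education_Level_Num": [0.58, 0.57, -1.35],
    "Income_Bin_Num": [0.50, 0.49, -1.17],
}

# Sample variance (ddof=1) of the cluster means, per feature
importance = {feat: statistics.variance(vals) for feat, vals in cluster_means.items()}

# Rank features from most to least differentiating
ranking = sorted(importance, key=importance.get, reverse=True)

for feat in ranking:
    print(feat, round(importance[feat], 2))
```

The recomputed values agree with the reported 1.24 for Education and 0.92 for Income to two decimals, despite starting from rounded means.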

# <a class='anchor' id='8'></a>
<br>

<div style="background: linear-gradient(to right, #00411E, #00622D, #00823C, #45AF28, #82BA72); 
            padding: 10px; color: white; text-align: center;  max-width: 97%;">
    <center><h1 style="margin-top: 10px; margin-bottom: 4px; color: white;
                       font-size: 32px; font-family: 'Roboto', sans-serif;">
        <b>8. Behavioral Clustering</b></h1></center>
</div>

## **8.0 Multivariate Outlier Detection (DBSCAN)**

<div style="background-color: #e1e1e1ff; padding: 15px; margin-right: 30px; border-left: 5px solid; border-image: linear-gradient(to bottom, #212121, #313131, #595959, #909090) 1; border-radius: 5px; max-width: 95%;">
    <h3 style="margin-top: 0; margin-bottom: 10px; color: #000000ff; font-weight: bold;">DBSCAN for Multivariate Outlier Detection</h3>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a <strong>density-based algorithm</strong> that groups points in high-density regions while explicitly identifying outliers as noise. Unlike univariate methods (IQR, Z-score), DBSCAN detects <strong>multivariate outliers</strong>, meaning customers whose combination of behavioral features is anomalous even if individual features appear normal. Points labeled as noise (cluster = -1) are natural outlier candidates that can distort centroid-based clustering algorithms and should be removed before subsequent analysis.
    </p>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #313131; font-weight: bold;">Algorithm:</h4>
    <ol style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
        <li style="margin-bottom: 5px;"><strong>Pick a Point:</strong> Start with an arbitrary <strong>unvisited</strong> point and mark it as visited.</li>
        <li style="margin-bottom: 5px;"><strong>Check its Neighborhood (ε-neighborhood):</strong> Retrieve all points within distance <strong>ε</strong> of the selected point and count them.</li>
        <li style="margin-bottom: 5px;"><strong>Core vs. Border/Noise Decision:</strong> 
            If the neighborhood size is <strong>≥ MinPts</strong>, the point is a <strong>core point</strong> and a new cluster is started. 
            If it is <strong>&lt; MinPts</strong>, the point is temporarily labeled as <strong>noise</strong> (it may later become a <strong>border point</strong> if found within ε of a core point).
        </li>
        <li style="margin-bottom: 5px;"><strong>Expand the Cluster:</strong> For a core point, add all ε-neighbors to the cluster. Maintain a "seed set" (queue) of neighbors to process.</li>
        <li style="margin-bottom: 5px;"><strong>Recursive Growth via Core Neighbors:</strong> For each neighbor in the seed set, if it is unvisited, mark it visited and compute its ε-neighborhood. 
            If that neighbor is also a <strong>core point</strong>, add its neighbors to the seed set. This continues until no new points can be added.
        </li>
        <li style="margin-bottom: 5px;"><strong>Outlier Identification:</strong> Points not assigned to any cluster remain labeled as <strong>noise</strong> (cluster = -1). These are the multivariate outliers.</li>
    </ol>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #313131; font-weight: bold;">Why DBSCAN for Outlier Detection:</h4>
    <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
        <li style="margin-bottom: 5px;"><strong>Multivariate sensitivity:</strong> Detects customers with unusual feature combinations (e.g., high distance variability but zero redemption) that univariate methods miss</li>
        <li style="margin-bottom: 5px;"><strong>No distribution assumptions:</strong> Does not assume normality, making it robust for behavioral data with skewed distributions</li>
        <li style="margin-bottom: 5px;"><strong>Automatic outlier labeling:</strong> Noise points are flagged by the algorithm itself, so no per-feature cutoffs are needed; only ε and MinPts have to be chosen</li>
    </ul>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #313131; font-weight: bold;">Workflow for This Analysis:</h4>
    <ol style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
        <li style="margin-bottom: 5px;"><strong>ε Selection via k-distance Graph:</strong> Compute the k-distance curve with <strong>k=8</strong> (rule of thumb: <strong>2 × n_features</strong>, here 4 features) and identify the elbow region for ε candidates</li>
        <li style="margin-bottom: 5px;"><strong>Parameter Grid Evaluation:</strong> Test combinations of ε and MinPts, tracking <strong>noise count</strong> and <strong>noise percentage</strong> to achieve target outlier rate (typically 1-5%)</li>
        <li style="margin-bottom: 5px;"><strong>Outlier Identification:</strong> Select parameters that identify a meaningful outlier group, label noise points as multivariate outliers</li>
        <li style="margin-bottom: 5px;"><strong>Clean Dataset:</strong> Remove identified outliers from the behavioral dataset and proceed with clustering algorithms (8.1-8.5) on the cleaned data</li>
    </ol>
</div>
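
The six algorithm steps above can be sketched in plain Python as follows. This is a minimal illustration on hypothetical toy points, not the scikit-learn implementation used in the cells below; `region_query` and `dbscan` are illustrative names, and the neighborhood count includes the point itself.

```python
import math
from collections import deque

NOISE = -1
UNVISITED = None

def region_query(points, idx, eps):
    """Indices of all points within eps of points[idx] (the point itself included)."""
    return [j for j, q in enumerate(points) if math.dist(points[idx], q) <= eps]

def dbscan(points, eps, min_pts):
    labels = [UNVISITED] * len(points)
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:           # step 1: only start from unvisited points
            continue
        neighbors = region_query(points, i, eps)  # step 2: eps-neighborhood
        if len(neighbors) < min_pts:              # step 3: not a core point
            labels[i] = NOISE                     # may become a border point later
            continue
        cluster_id += 1                           # step 3: core point starts a new cluster
        labels[i] = cluster_id
        seeds = deque(n for n in neighbors if n != i)  # step 4: seed set
        while seeds:                              # step 5: grow through core neighbors
            j = seeds.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster_id            # former noise becomes a border point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                seeds.extend(j_neighbors)
    return labels                                 # step 6: remaining -1 labels are outliers

# Hypothetical toy data: two dense squares and one isolated point
toy_points = [
    (0.0, 0.0), (0.0, 0.5), (0.5, 0.0), (0.5, 0.5),
    (5.0, 5.0), (5.0, 5.5), (5.5, 5.0), (5.5, 5.5),
    (10.0, 0.0),
]
toy_labels = dbscan(toy_points, eps=1.0, min_pts=3)
print(toy_labels)
```

The isolated point never reaches the MinPts threshold and is never within ε of a core point, so it keeps the noise label.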

### **8.0.1 Selecting ε and MinPts**

In [None]:
# k-distance graph for ε selection
# k = 2 × n_features (4 behavioral features -> k=8)
behavioral_feats = [ 'distance_variability', 'companion_flight_ratio', 
                    'flight_regularity', 'redemption_frequency']

k = 8
neigh = NearestNeighbors(n_neighbors=k)
neigh.fit(df_behavioral_a_scaled[behavioral_feats])
distances, _ = neigh.kneighbors(df_behavioral_a_scaled[behavioral_feats])
k_distances = np.sort(distances[:, -1])

fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(k_distances, color=CUSTOM_HEX[0], linewidth=1.5)
ax.axhline(y=1, color='red', linestyle='--', linewidth=1.5, label='ε = 1.0 (candidate elbow)')
ax.set_xlabel('Sorted Points', fontweight='bold', fontsize=11)
ax.set_ylabel(f'{k}-NN Distance', fontweight='bold', fontsize=11)
ax.set_title('DBSCAN Outlier Detection - k-Distance Graph', fontweight='bold', fontsize=13, pad=15)
ax.grid(True, alpha=0.3)
ax.legend()
plt.tight_layout()
plt.show()

In [None]:
# Parameter grid evaluation
# The curve is fairly smooth up to roughly 0.8-1.2, then rises sharply (and shoots toward 4 at the very end)
# So the first ε candidates (with MinPts around 8) lie in 0.8-1.2, i.e., just before the steep ramp begins

eps_values = [0.8, 0.9, 1.0, 1.1, 1.2]
minpts_values = [7, 8, 9, 10, 11, 12, 13, 14, 15]
dbscan_outlier_results = []

for eps in eps_values:
    for minpts in minpts_values:
        dbscan_model = DBSCAN(eps=eps, min_samples=minpts, n_jobs=-1)
        labels = dbscan_model.fit_predict(df_behavioral_a_scaled[behavioral_feats])
        n_noise = np.sum(labels == -1)
        noise_pct = (n_noise / len(labels)) * 100
        dbscan_outlier_results.append({
            "eps": eps, 
            "min_samples": minpts, 
            "n_outliers": n_noise,
            "outlier_pct": round(noise_pct, 2)
        })

dbscan_outlier_results_df = pd.DataFrame(dbscan_outlier_results)
dbscan_outlier_results_df

<div style="background-color: #e8f3e5ff; padding: 15px; margin-right: 30px; border-left: 5px solid; border-image: linear-gradient(to bottom, #00411E, #00622D, #00823C, #45AF28, #82BA72) 1; border-radius: 5px; max-width: 95%;">
    <h3 style="margin-top: 0; margin-bottom: 10px; color: #00411E; font-weight: bold;">DBSCAN Parameter Selection Summary</h3>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        Determined ε and MinPts for multivariate outlier detection using a k-distance plot to anchor ε at the elbow, then validated via grid search covering 5 ε values (0.8 to 1.2) and 9 MinPts values (7 to 15), totaling 45 parameter combinations.
    </p>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Methodology:</h4>
    <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
        <li style="margin-right: 20px;"><strong>ε selection:</strong> k-distance graph (k=8) shows clear elbow around ε=1.0, representing the natural density break in the 4-dimensional behavioral feature space.</li>
        <li style="margin-right: 20px;"><strong>MinPts selection:</strong> Following the rule of thumb MinPts = 2 × n_features, MinPts=8 ensures robust local density estimation for 4 behavioral features.</li>
    </ul>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Findings:</h4>
    <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
        <li style="margin-right: 20px;"><strong>Grid search pattern:</strong> Lower ε values (0.8) produced 2.3% to 3.5% outliers, while higher ε values (1.2) dropped below 0.8%. The ε=1.0 range yields 1.1% to 1.6% outliers across MinPts values.</li>
        <li style="margin-right: 20px;"><strong>Selected parameters:</strong> ε=1.0, MinPts=8 identifies <strong>153 outliers (1.17%)</strong>, representing customers with unusual behavioral patterns in sparsely populated regions of the feature space.</li>
    </ul>
</div>
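
The k-distance curve used for ε selection can be sketched in pure Python as below, on hypothetical 1-D toy points. One detail worth noting: when `NearestNeighbors.kneighbors` is called with the same data it was fitted on, each point is returned as its own nearest neighbor at distance 0, so with `n_neighbors=8` the plotted value is the distance to the 7th other customer. That is consistent with DBSCAN's `min_samples`, which also counts the point itself. The `include_self` flag is an illustrative assumption to show both conventions.

```python
import math

def k_distances(points, k, include_self=True):
    """Sorted distance from each point to its k-th nearest neighbor."""
    out = []
    for i, p in enumerate(points):
        dists = sorted(
            math.dist(p, q)
            for j, q in enumerate(points)
            if include_self or j != i
        )
        out.append(dists[k - 1])
    return sorted(out)

# Hypothetical toy data: four evenly spaced points and one far-away point
toy_points = [[0.0], [1.0], [2.0], [3.0], [10.0]]

with_self = k_distances(toy_points, k=2, include_self=True)
without_self = k_distances(toy_points, k=2, include_self=False)
print(with_self)     # the jump at the end of the curve marks the isolated point
print(without_self)
```

In both variants the last value jumps well above the rest, which is the elbow pattern read off the k-distance graph.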

### **8.0.2 Outlier Identification**

In [None]:
# Final DBSCAN parameters for outlier detection
chosen_eps_outlier = 1.0
chosen_minpts_outlier = 8

# Fit DBSCAN and identify outliers (label = -1)
dbscan_outlier = DBSCAN(eps=chosen_eps_outlier, min_samples=chosen_minpts_outlier, n_jobs=-1)
outlier_labels = dbscan_outlier.fit_predict(df_behavioral_a_scaled[behavioral_feats])

outlier_mask = outlier_labels == -1
behavioral_outlier_indices = df_behavioral_a_scaled[outlier_mask].index.tolist()

# Create clean behavioral dataset
df_behavioral_clean = df_behavioral_a_scaled[~outlier_mask].copy()
df_behavioral_clean.shape

In [None]:
# 3D Visualization of outliers in PCA space

pca_3d = PCA(n_components=3)
behavioral_pca_3d = pca_3d.fit_transform(df_behavioral_a_scaled[behavioral_feats])

fig = plt.figure(figsize=(12, 10))
ax = fig.add_subplot(111, projection='3d')

ax.scatter(behavioral_pca_3d[~outlier_mask, 0], behavioral_pca_3d[~outlier_mask, 1], behavioral_pca_3d[~outlier_mask, 2],
           c=CUSTOM_HEX[2], alpha=0.15, s=8, label=f'Clean ({(~outlier_mask).sum():,})')
ax.scatter(behavioral_pca_3d[outlier_mask, 0], behavioral_pca_3d[outlier_mask, 1], behavioral_pca_3d[outlier_mask, 2],
           c='red', alpha=0.9, s=30, marker='x', label=f'Outliers ({outlier_mask.sum():,})')

ax.set_xlabel(f'PC1 ({pca_3d.explained_variance_ratio_[0]:.1%})', fontweight='bold')
ax.set_ylabel(f'PC2 ({pca_3d.explained_variance_ratio_[1]:.1%})', fontweight='bold')
ax.set_zlabel(f'PC3 ({pca_3d.explained_variance_ratio_[2]:.1%})', fontweight='bold')
ax.set_title('DBSCAN Outlier Detection - 3D PCA Projection', fontweight='bold', fontsize=13, pad=15)
ax.legend()
plt.tight_layout()
plt.show()

<div style="background-color: #e8f3e5ff; padding: 15px; margin-right: 30px; border-left: 5px solid; border-image: linear-gradient(to bottom, #00411E, #00622D, #00823C, #45AF28, #82BA72) 1; border-radius: 5px; max-width: 95%;">
    <h3 style="margin-top: 0; margin-bottom: 10px; color: #00411E; font-weight: bold;">DBSCAN Outlier Identification Summary</h3>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        We applied DBSCAN (ε=1.0, MinPts=8) to identify multivariate outliers in the 4-dimensional behavioral feature space. In the 3D PCA projection, the flagged outliers (red) sit at the periphery or in sparsely populated regions, which is consistent with what density-based detection should produce.
    </p>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Results:</h4>
    <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
        <li style="margin-right: 20px;"><strong>Outliers removed:</strong> 153 customers (1.17%)</li>
        <li style="margin-right: 20px;"><strong>Clean dataset:</strong> 12,924 customers</li>
    </ul>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Next Steps:</h4>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        Behavioral clustering (8.1-8.5) proceeds on <strong>df_behavioral_clean</strong>. Outlier indices are stored in <strong>behavioral_outlier_indices</strong> for post-hoc analysis/profiling.
    </p>
</div>
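
To make the outlier rule concrete, here is a minimal sketch on synthetic data. The dense blob and the three isolated points are invented for illustration and are not AIAI records. It reuses the `eps` and `min_samples` values from the cell above and treats the DBSCAN label `-1` as noise.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(42)

# Synthetic stand-in for the scaled behavioral features: one dense blob in 4-D
dense_points = rng.normal(loc=0.0, scale=0.3, size=(300, 4))
# Three isolated points placed far from the blob and from each other
isolated_points = np.array([
    [8.0, 8.0, 8.0, 8.0],
    [-8.0, 8.0, -8.0, 8.0],
    [8.0, -8.0, 8.0, -8.0],
])
X_demo = np.vstack([dense_points, isolated_points])

# Same parameters as the notebook cell above
demo_labels = DBSCAN(eps=1.0, min_samples=8).fit_predict(X_demo)
demo_outlier_mask = demo_labels == -1  # -1 marks noise points

print(f"Flagged as noise: {demo_outlier_mask.sum()} of {len(X_demo)}")
```

The isolated points have no neighbors within `eps`, so they can be neither core points nor border points and are labeled as noise.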

---

## **8.1 Hierarchical Clustering**

<div style="background-color: #e1e1e1ff; padding: 15px; margin-right: 30px; border-left: 5px solid; border-image: linear-gradient(to bottom, #212121, #313131, #595959, #909090) 1; border-radius: 5px; max-width: 95%;">
    <h3 style="margin-top: 0; margin-bottom: 10px; color: #000000ff; font-weight: bold;">Hierarchical Clustering Methodology</h3>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        The hierarchical clustering used here is <strong>agglomerative (bottom-up)</strong>: it builds a tree-like structure (dendrogram) by iteratively merging the closest data points or clusters. Unlike K-Means, it does not require pre-specifying the number of clusters and provides a complete hierarchical view of data relationships.
    </p>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #313131; font-weight: bold;">Algorithm:</h4>
    <ol style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
        <li style="margin-bottom: 5px;"><strong>Initialization:</strong> Start with each data point as its own cluster (n clusters for n points)</li>
        <li style="margin-bottom: 5px;"><strong>Distance Calculation:</strong> Compute pairwise distances between all clusters using a linkage criterion</li>
        <li style="margin-bottom: 5px;"><strong>Merge Step:</strong> Iteratively merge the two closest clusters into one larger cluster</li>
        <li style="margin-bottom: 5px;"><strong>Repeat:</strong> Continue until all points are merged into a single cluster, forming a hierarchical tree (dendrogram)</li>
    </ol>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #313131; font-weight: bold;">Workflow for This Analysis:</h4>
    <ol style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
        <li style="margin-bottom: 5px;"><strong>Linkage Method Selection:</strong> Test Ward, Complete, Average, and Single linkage methods using CCC (Cophenetic Correlation Coefficient) and R² (Variance Explained)</li>
        <li style="margin-bottom: 5px;"><strong>Optimal k Selection:</strong> For the best linkage method (Ward), identify optimal k using Silhouette, Calinski-Harabasz, Davies-Bouldin indices and dendrogram visual inspection</li>
        <li style="margin-bottom: 5px;"><strong>Solution Comparison:</strong> Compare k=2 vs k=6 solutions by examining cluster size distributions and feature profiles</li>
        <li style="margin-bottom: 5px;"><strong>Final Model & Profiling:</strong> Fit final Ward linkage model with k=6 and analyze behavioral characteristics of each cluster</li>
    </ol>
</div>
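
The algorithm steps above can be sketched with SciPy on a toy 2-D dataset (the six points are invented for illustration). `linkage` records every merge, and `fcluster` cuts the resulting tree into a chosen number of clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data: two tight groups of three points each
toy_points = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
])

# Steps 1-4: start from n singleton clusters and merge until one remains.
# Each row of the result is one merge: (cluster_a, cluster_b, distance, new_size)
toy_linkage = linkage(toy_points, method='ward')
print(toy_linkage.shape)  # n - 1 merges, 4 columns each

# Cutting the tree into k=2 clusters recovers the two groups
toy_labels = fcluster(toy_linkage, t=2, criterion='maxclust')
print(toy_labels)
```

Because the full merge history is stored, any k can be read off the same tree afterwards without refitting.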

### **8.1.1 Finding the best Linkage Method**

In [None]:
# Compare linkage methods using two complementary metrics:
# 1. CCC (Cophenetic Correlation Coefficient)
# 2. R²

linkage_methods = ['ward', 'complete', 'average', 'single']

# Use plot_linkage_comparison
ccc_df_behav, r2_results_all_behav = plot_linkage_comparison(
    df=df_behavioral_clean,
    linkage_methods=linkage_methods,
    palette=CUSTOM_HEX,
    title='Linkage Method Comparison'
)

<div style="background-color: #e8f3e5ff; padding: 15px; margin-right: 30px; border-left: 5px solid; border-image: linear-gradient(to bottom, #00411E, #00622D, #00823C, #45AF28, #82BA72) 1; border-radius: 5px; max-width: 95%;">
    <h3 style="margin-top: 0; margin-bottom: 10px; color: #00411E; font-weight: bold;">Linkage Method Selection Summary</h3>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        Compared four linkage methods (Ward, Complete, Average, Single) using dendrogram quality and clustering performance metrics to identify the optimal approach for hierarchical clustering of behavioral features.
    </p>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Goal:</h4>
    <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
        <li style="margin-right: 20px;">Determine which linkage criterion best preserves hierarchical structure while maximizing clustering quality</li>
    </ul>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Methodology:</h4>
    <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
        <li style="margin-right: 20px;"><strong>Linkage Methods Tested:</strong> Ward (minimizes within-cluster variance), Complete (maximum distance), Average (mean distance), Single (minimum distance)</li>
        <li style="margin-right: 20px;"><strong>CCC (Cophenetic Correlation Coefficient):</strong> Measures how faithfully the dendrogram preserves pairwise distances (higher is better)</li>
        <li style="margin-right: 20px;"><strong>R² (Variance Explained):</strong> Proportion of variance explained by clustering across k=2 to 10 (higher is better)</li>
    </ul>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Findings:</h4>
    <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
        <li style="margin-right: 20px;"><strong>CCC Results:</strong> Average (0.558) achieves highest dendrogram preservation, followed by Single (0.485), Complete (0.388), and Ward (0.354)</li>
        <li style="margin-right: 20px;"><strong>R² Results:</strong> Ward consistently explains the most variance across all k values (reaching 0.52 at k=10), outperforming Average (0.45), Complete (0.11), and Single (0.01). Note that Ward has an inherent R² advantage: at every merge it chooses the pair of clusters that adds the least within-cluster variance, which is the same quantity that R² penalizes</li>
        <li style="margin-right: 20px;"><strong>Decision:</strong> Selected Ward linkage. Although Ward's R² advantage is mathematically built-in, this is precisely what we want for customer segmentation: compact, well-separated clusters suitable for actionable marketing personas. DBSCAN has already removed multivariate outliers, mitigating Ward's sensitivity to extreme values</li>
    </ul>
</div>
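
The helper `plot_linkage_comparison` is defined elsewhere in the notebook, so the following is only a sketch of how the two metrics could be computed, not its actual implementation. It assumes R² is defined as 1 - SSW/SST, uses SciPy's `cophenet` for the cophenetic correlation, and runs on synthetic blobs; the function name `r_squared` is hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet, fcluster
from scipy.spatial.distance import pdist

def r_squared(X, labels):
    """Share of total variance explained by the partition: 1 - SSW / SST."""
    sst = ((X - X.mean(axis=0)) ** 2).sum()
    ssw = sum(((X[labels == c] - X[labels == c].mean(axis=0)) ** 2).sum()
              for c in np.unique(labels))
    return 1.0 - ssw / sst

rng = np.random.default_rng(0)
# Synthetic stand-in data: three Gaussian blobs in 4-D
X_demo = np.vstack([rng.normal(center, 0.5, size=(50, 4)) for center in (0.0, 3.0, 6.0)])

pairwise = pdist(X_demo)  # condensed Euclidean distance matrix
demo_scores = {}
for method in ['ward', 'complete', 'average', 'single']:
    Z = linkage(X_demo, method=method)
    ccc, _ = cophenet(Z, pairwise)  # correlation of tree distances with real distances
    labels = fcluster(Z, t=3, criterion='maxclust')
    demo_scores[method] = {'CCC': ccc, 'R2 (k=3)': r_squared(X_demo, labels)}

for method, scores in demo_scores.items():
    print(method, {name: round(float(value), 3) for name, value in scores.items()})
```

The two metrics answer different questions: CCC scores the whole tree, while R² scores one particular cut of it, which is why a method can rank first on one and last on the other.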

### **8.1.2 Defining the number of clusters**

In [None]:
# Determine optimal k using multiple evaluation metrics
# Ward linkage selected

hc_k_range_behav = range(2, 11)
hc_metrics_behav = {
    'k': [],
    'Silhouette': [],
    'Calinski-Harabasz': [],
    'Davies-Bouldin': []
}

for k in hc_k_range_behav:
    hc = AgglomerativeClustering(n_clusters=k, linkage='ward', metric='euclidean')
    labels = hc.fit_predict(df_behavioral_clean)
    
    metrics = evaluate_clustering_metrics(df_behavioral_clean, labels)
    
    hc_metrics_behav['k'].append(k)
    hc_metrics_behav['Silhouette'].append(metrics['Silhouette Score'])
    hc_metrics_behav['Calinski-Harabasz'].append(metrics['Calinski-Harabasz Index'])
    hc_metrics_behav['Davies-Bouldin'].append(metrics['Davies-Bouldin Index'])
    

# Create metrics DataFrame
hc_metrics_df_behav = pd.DataFrame(hc_metrics_behav)

In [None]:
# Visualize clustering metrics
plot_clustering_metrics(
    hc_metrics_df_behav,
    hc_k_range_behav,
    CUSTOM_HEX,
    title='Hierarchical Clustering: Optimal k Evaluation (Ward Linkage)'
)

# Display metrics table
hc_metrics_df_behav

In [None]:
# Dendrogram visualization for visual confirmation of cluster structure

linkage_matrix_behav = linkage(df_behavioral_clean, method='ward', metric='euclidean')

fig, ax = plt.subplots(figsize=(14, 6))

dendrogram(
    linkage_matrix_behav,
    ax=ax,
    truncate_mode='lastp',
    p=30,
    leaf_font_size=10,
    show_leaf_counts=True,
    color_threshold=0.7*max(linkage_matrix_behav[:,2])
)

ax.set_title('Hierarchical Clustering Dendrogram (Ward Linkage)\nVisual Confirmation of Natural Grouping Structure', 
             fontweight='bold', fontsize=14, pad=15)
ax.set_xlabel('Sample Index or Cluster Size', fontsize=11, fontweight='bold')
ax.set_ylabel('Euclidean Distance', fontsize=11, fontweight='bold')
ax.grid(False)

plt.tight_layout()
plt.show()

<div style="background-color: #e8f3e5ff; padding: 15px; margin-right: 30px; border-left: 5px solid; border-image: linear-gradient(to bottom, #00411E, #00622D, #00823C, #45AF28, #82BA72) 1; border-radius: 5px; max-width: 95%;">
    <h3 style="margin-top: 0; margin-bottom: 10px; color: #00411E; font-weight: bold;">Optimal k Selection Summary (Hierarchical Clustering)</h3>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        Multi-metric approach combining internal validation indices and dendrogram visual inspection to determine optimal k for hierarchical clustering with Ward linkage.
    </p>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Goal:</h4>
    <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
        <li style="margin-right: 20px;">Identify the number of clusters (k) that maximizes cluster quality while maintaining business interpretability</li>
    </ul>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Methodology:</h4>
    <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
        <li style="margin-right: 20px;"><strong>Silhouette Score:</strong> Measures how similar data points are to their own cluster vs. neighboring clusters (range -1 to 1, higher is better)</li>
        <li style="margin-right: 20px;"><strong>Calinski-Harabasz Index:</strong> Ratio of between-cluster to within-cluster variance (higher values indicate better-defined clusters)</li>
        <li style="margin-right: 20px;"><strong>Davies-Bouldin Index:</strong> Average similarity between each cluster and its most similar cluster (lower values indicate better separation)</li>
        <li style="margin-right: 20px;"><strong>Dendrogram Analysis:</strong> Visual inspection of hierarchical tree structure to identify natural cluster boundaries</li>
    </ul>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Findings:</h4>
    <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
        <li style="margin-right: 20px;"><strong>Silhouette Score:</strong> Decreases steadily from k=2 (0.169) to k=10 (0.086), with k=2 achieving the highest score</li>
        <li style="margin-right: 20px;"><strong>Calinski-Harabasz Peak:</strong> Maximum at k=2 (2370), then declining steadily; k=3 (2168) and k=4 (2112) still maintain reasonable between-cluster separation</li>
        <li style="margin-right: 20px;"><strong>Davies-Bouldin Minimum:</strong> Best separation at k=6 (1.66), followed by k=5 (1.70); k=2 shows DBI of 1.94, k=3 shows worst DBI (2.03)</li>
        <li style="margin-right: 20px;"><strong>Candidate Selection:</strong> k=2 and k=6 emerge as candidates. k=2 achieves optimal Silhouette (0.169) and CH (2370), corresponding to the primary dendrogram split at Euclidean distance 120. k=6 offers the best cluster separation (DBI 1.66) with Silhouette (0.099) and CH (1833), providing finer behavioral granularity for marketing personas</li>
    </ul>
</div>
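
For reference, the three indices can be computed directly with scikit-learn. The sketch below assumes that `evaluate_clustering_metrics` (defined elsewhere in the notebook) wraps these same functions; `internal_metrics_sketch` is a hypothetical stand-in, and the data are synthetic blobs rather than the behavioral features.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import (
    silhouette_score,
    calinski_harabasz_score,
    davies_bouldin_score,
)

def internal_metrics_sketch(X, labels):
    """Hypothetical helper: three internal validation indices for one labeling."""
    return {
        'Silhouette': silhouette_score(X, labels),                # higher is better
        'Calinski-Harabasz': calinski_harabasz_score(X, labels),  # higher is better
        'Davies-Bouldin': davies_bouldin_score(X, labels),        # lower is better
    }

rng = np.random.default_rng(0)
# Synthetic stand-in data: three Gaussian blobs in 4-D
X_demo = np.vstack([rng.normal(center, 0.5, size=(60, 4)) for center in (0.0, 3.0, 6.0)])

demo_metrics = {}
for k in range(2, 7):
    labels = AgglomerativeClustering(n_clusters=k, linkage='ward').fit_predict(X_demo)
    demo_metrics[k] = internal_metrics_sketch(X_demo, labels)

best_k_by_silhouette = max(demo_metrics, key=lambda k: demo_metrics[k]['Silhouette'])
print(best_k_by_silhouette)
```

On clean synthetic blobs the indices tend to agree; on the real behavioral data they disagree, which is why two candidate values of k are carried forward.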

### **8.1.3 Comparison of Clustering Solutions**

In [None]:
# Compare two candidate solutions based on clustering metrics analysis
# k=2
# k=6

hc_k_candidate_1_behav = 2
hc_k_candidate_2_behav = 6

# Fit both candidate solutions
hc_k1_behav = AgglomerativeClustering(n_clusters=hc_k_candidate_1_behav, linkage='ward', metric='euclidean')
hc_k2_behav = AgglomerativeClustering(n_clusters=hc_k_candidate_2_behav, linkage='ward', metric='euclidean')

hc_labels_k1_behav = hc_k1_behav.fit_predict(df_behavioral_clean)
hc_labels_k2_behav = hc_k2_behav.fit_predict(df_behavioral_clean)

# Create temporary DataFrames with cluster labels
df_temp_k1 = df_behavioral_clean.copy()
df_temp_k1['Cluster'] = hc_labels_k1_behav

df_temp_k2 = df_behavioral_clean.copy()
df_temp_k2['Cluster'] = hc_labels_k2_behav

# Calculate cluster profiles (mean values per cluster)
cluster_profiles_k1 = df_temp_k1.groupby('Cluster').mean()
cluster_profiles_k2 = df_temp_k2.groupby('Cluster').mean()

# Display both profiles for comparison
print(f"\nCluster Profiles for k={hc_k_candidate_1_behav}:")
display(cluster_profiles_k1.round(3))

print(f"\n\nCluster Profiles for k={hc_k_candidate_2_behav}:")
display(cluster_profiles_k2.round(3))

# Visualize cluster size comparison
plot_cluster_size_comparison(
    labels_dict={hc_k_candidate_1_behav: hc_labels_k1_behav, hc_k_candidate_2_behav: hc_labels_k2_behav},
    palette=CUSTOM_HEX,
    title='Hierarchical Clustering: Cluster Size Comparison'
)

<div style="background-color: #e8f3e5ff; padding: 15px; margin-right: 30px; border-left: 5px solid; border-image: linear-gradient(to bottom, #00411E, #00622D, #00823C, #45AF28, #82BA72) 1; border-radius: 5px; max-width: 95%;">
    <h3 style="margin-top: 0; margin-bottom: 10px; color: #00411E; font-weight: bold;">Comparison of Clustering Solutions Summary</h3>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        <strong>Decision: k=6 selected</strong>
    </p>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        k=6 is selected as the preferred solution, primarily because it has the <strong>lowest Davies-Bouldin Index (1.66)</strong> of all k values tested, indicating the best cluster separation on that metric. While k=2 achieves a higher Silhouette (0.169 vs 0.099) and Calinski-Harabasz (2370 vs 1833), it only distinguishes two broad segments (71.3% vs 28.7%), which is too coarse for targeted action. The k=6 solution provides six clusters ranging from 9.6% to 31.4% of customers, each with a distinct dominant behavioral feature and each large enough to support a targeted marketing strategy. The dendrogram supports this choice, with clear secondary branching below the primary divide.
    </p>
</div>
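
One property worth keeping in mind when comparing the two cuts: with Ward linkage, a k=6 solution is a refinement of the k=2 solution taken from the same tree. The sketch below illustrates this with a single SciPy linkage matrix on synthetic data (an assumption for illustration; the notebook itself fits two separate `AgglomerativeClustering` models).

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 4))  # synthetic stand-in for the scaled features

Z = linkage(X_demo, method='ward')
labels_k2 = fcluster(Z, t=2, criterion='maxclust')
labels_k6 = fcluster(Z, t=6, criterion='maxclust')

# Rows: k=6 clusters, columns: k=2 clusters.
# Each row has exactly one non-zero cell, so every fine cluster sits inside one coarse cluster
nesting = pd.crosstab(labels_k6, labels_k2, rownames=['k=6'], colnames=['k=2'])
print(nesting)

# Cluster shares in percent, comparable to the size ranges quoted above
shares_k6 = pd.Series(labels_k6).value_counts(normalize=True).sort_index() * 100
print(shares_k6.round(1))
```

Reading the table row by row shows which of the six segments descend from each of the two broad segments.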

### **8.1.4 Final Hierarchical Clustering Solution**

In [None]:
# Final Hierarchical Clustering Solution
# Using k=6 from comparison analysis

hc_final_k_behav = hc_k_candidate_2_behav  # k=6

# Reuse labels from comparison step
hc_labels_final_behav = hc_labels_k2_behav

# Create labeled dataset
df_behavioral_clean_labeled_hc = df_behavioral_clean.copy()
df_behavioral_clean_labeled_hc['Cluster'] = hc_labels_final_behav

# Calculate final metrics
hc_final_metrics_behav = evaluate_clustering_metrics(df_behavioral_clean, hc_labels_final_behav)

# Store for final comparison (Section 9)
if 'behavioral_clustering_results' not in globals():
    behavioral_clustering_results = {}

behavioral_clustering_results['Hierarchical'] = {
    'k': hc_final_k_behav,
    'Silhouette': hc_final_metrics_behav['Silhouette Score'],
    'Calinski-Harabasz': hc_final_metrics_behav['Calinski-Harabasz Index'],
    'Davies-Bouldin': hc_final_metrics_behav['Davies-Bouldin Index'],
    'R2': get_rsq(df_behavioral_clean_labeled_hc, df_behavioral_clean.columns.tolist(), 'Cluster'),
    'labels': hc_labels_final_behav
}

### **8.1.5 Cluster Profiling**

In [None]:
# 1. Cluster Profiles Heatmap - Z-scores of behavioral features per cluster
behavioral_feats = df_behavioral_clean.columns.tolist()
hc_cluster_profiles_behav = df_behavioral_clean_labeled_hc.groupby('Cluster')[behavioral_feats].mean()
hc_population_mean_behav = df_behavioral_clean[behavioral_feats].mean()

plot_cluster_profiles_heatmap(
    hc_cluster_profiles_behav, 
    hc_population_mean_behav, 
    GROUP80_palette_continuous,
    title='Hierarchical Clustering: Behavioral Profiles (k=6)\nStandardized Z-Scores per Cluster'
)

In [None]:
# 2. Cluster Size Distribution
plot_cluster_sizes(
    hc_labels_final_behav, 
    hc_final_k_behav, 
    CUSTOM_HEX,
    title='Hierarchical Clustering - Final Cluster Sizes'
)

# Display cluster size statistics
cluster_sizes_hc_behav = pd.Series(hc_labels_final_behav).value_counts().sort_index()
cluster_dist_df_hc_behav = pd.DataFrame({
    'Cluster': cluster_sizes_hc_behav.index,
    'Count': cluster_sizes_hc_behav.values,
    'Percentage': (cluster_sizes_hc_behav.values / len(hc_labels_final_behav) * 100).round(2)
})

cluster_dist_df_hc_behav

In [None]:
# 3. Feature Importance Analysis - Variance across clusters
# Features with high variance differentiate clusters most effectively

hc_feature_variance_behav = hc_cluster_profiles_behav.var(axis=0).sort_values(ascending=False)

plot_feature_importance(
    hc_feature_variance_behav,
    CUSTOM_HEX,
    title='Hierarchical Clustering: Feature Importance Analysis\nWhich Features Differentiate Clusters?'
)

# Display feature importance ranking
hc_feature_importance_df_behav = pd.DataFrame({
    'Feature': hc_feature_variance_behav.index,
    'Variance': hc_feature_variance_behav.values.round(4),
})

hc_feature_importance_df_behav

<div style="background-color: #e8f3e5ff; padding: 15px; margin-right: 30px; border-left: 5px solid; border-image: linear-gradient(to bottom, #00411E, #00622D, #00823C, #45AF28, #82BA72) 1; border-radius: 5px; max-width: 95%;">
    <h3 style="margin-top: 0; margin-bottom: 10px; color: #00411E; font-weight: bold;">Hierarchical Clustering Profiling Summary (k=6)</h3>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        Analyzed behavioral characteristics of the final 6 hierarchical clusters to identify distinct customer segments. Ward linkage yields six interpretable personas, each dominated by a distinct behavioral pattern.
    </p>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Cluster Profiles:</h4>
    <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
        <li style="margin-right: 20px;"><strong>Cluster 0 (31.4%, n=4,043) - Sporadic Mainstream:</strong> Irregular flight patterns (flight_regularity Z=+0.75), average across other features. Largest segment representing occasional travelers.</li>
        <li style="margin-right: 20px;"><strong>Cluster 1 (18.2%, n=2,347) - Engaged Social Redeemer:</strong> Very high redemption activity (Z=+1.27), travels with companions (Z=+0.65). Most engaged loyalty program segment.</li>
        <li style="margin-right: 20px;"><strong>Cluster 2 (9.6%, n=1,240) - Ultra-Routine Solo:</strong> Minimal destination variability (Z=-1.09), strongly solo (Z=-1.00), passive redeemer (Z=-0.73). Fixed-route business travelers.</li>
        <li style="margin-right: 20px;"><strong>Cluster 3 (17.4%, n=2,236) - Solo Explorer:</strong> High distance variability (Z=+1.10), travels alone (Z=-0.64), passive redeemer (Z=-0.54). Adventure-seeking solo travelers.</li>
        <li style="margin-right: 20px;"><strong>Cluster 4 (10.5%, n=1,353) - Regular Family Traveler:</strong> Very high companion ratio (Z=+1.25), consistent schedule (Z=-0.48), average redemption. Family vacation segment.</li>
        <li style="margin-right: 20px;"><strong>Cluster 5 (12.9%, n=1,666) - Ultra-Regular Commuter:</strong> Very consistent monthly patterns (Z=-1.04), passive redeemer (Z=-0.41). Predictable business commuters.</li>
    </ul>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Feature Importance (Variance across clusters):</h4>
    <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
        <li style="margin-right: 20px;"><strong>companion_flight_ratio (0.69):</strong> Primary driver distinguishing social vs. solo travelers</li>
        <li style="margin-right: 20px;"><strong>distance_variability (0.55):</strong> Separates explorers from routine travelers</li>
        <li style="margin-right: 20px;"><strong>redemption_frequency (0.51):</strong> Identifies engaged vs. passive loyalty members</li>
        <li style="margin-right: 20px;"><strong>flight_regularity (0.35):</strong> Distinguishes sporadic vs. consistent flyers</li>
    </ul>
</div>
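
The profile Z-scores and the variance-based importance ranking can be reproduced in a few lines of pandas. This is a sketch on synthetic data with hypothetical feature names (`feat_a`, `feat_b`, `feat_c`); it assumes Z-scores are taken relative to the population mean and standard deviation, which may differ from what `plot_cluster_profiles_heatmap` does internally.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
feature_names = ['feat_a', 'feat_b', 'feat_c']  # hypothetical feature names

# Synthetic standardized data with made-up cluster labels
df_demo = pd.DataFrame(rng.normal(size=(300, 3)), columns=feature_names)
df_demo['Cluster'] = rng.integers(0, 3, size=300)
# Shift feat_a by cluster so that it clearly separates the groups
df_demo['feat_a'] += df_demo['Cluster'] * 2.0

cluster_means = df_demo.groupby('Cluster')[feature_names].mean()
population_mean = df_demo[feature_names].mean()
population_std = df_demo[feature_names].std()

# Z-score of each cluster mean relative to the whole population
profile_z = (cluster_means - population_mean) / population_std

# Variance of the cluster means per feature: larger = stronger differentiator
feature_spread = cluster_means.var(axis=0).sort_values(ascending=False)
print(profile_z.round(2))
print(feature_spread.round(3))
```

A feature whose cluster means barely differ gets a variance near zero, so it contributes little to telling the segments apart.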

---

## **8.2 K-Means Clustering**

<div style="background-color: #e1e1e1ff; padding: 15px; margin-right: 30px; border-left: 5px solid; border-image: linear-gradient(to bottom, #212121, #313131, #595959, #909090) 1; border-radius: 5px; max-width: 95%;">
    <h3 style="margin-top: 0; margin-bottom: 10px; color: #000000ff; font-weight: bold;">K-Means Clustering Methodology</h3>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        K-Means is an <strong>iterative partitioning algorithm</strong> that assigns data points to k clusters by minimizing within-cluster variance (Sum of Squared Errors). Unlike hierarchical clustering, it requires pre-specifying k, and its centroid-based assignment tends to produce compact, roughly spherical clusters.
    </p>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #313131; font-weight: bold;">Algorithm:</h4>
    <ol style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
        <li style="margin-bottom: 5px;"><strong>Choose Seeds:</strong> Select k initial centroids (using k-means++ for better starting positions)</li>
        <li style="margin-bottom: 5px;"><strong>Assignment:</strong> Associate each data point with the nearest seed/centroid based on Euclidean distance</li>
        <li style="margin-bottom: 5px;"><strong>Update Centroids:</strong> Calculate the centroids of the formed clusters as the mean of all assigned points</li>
        <li style="margin-bottom: 5px;"><strong>Iterate:</strong> Return to step 2 and repeat until the centroids stop moving (convergence)</li>
    </ol>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #313131; font-weight: bold;">Workflow for This Analysis:</h4>
    <ol style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
        <li style="margin-bottom: 5px;"><strong>Optimal k Selection:</strong> Test k=2-13 using Elbow Method (Inertia/SSE), Silhouette Score, Calinski-Harabasz Index, and Davies-Bouldin Index</li>
        <li style="margin-bottom: 5px;"><strong>Solution Comparison:</strong> Compare k=4 vs k=5 solutions by examining cluster sizes and feature profiles</li>
        <li style="margin-bottom: 5px;"><strong>Final Model & Profiling:</strong> Fit final K-Means model with k=5 (k-means++ initialization) and analyze behavioral characteristics of each cluster</li>
    </ol>
</div>
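
The four algorithm steps above map onto a short NumPy loop. This is a teaching sketch only: it seeds with randomly chosen data points rather than k-means++, and the function name `kmeans_sketch` is hypothetical. The analysis itself relies on scikit-learn's `KMeans`.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Plain Lloyd iterations with random seeds (not k-means++)."""
    rng = np.random.default_rng(seed)
    # Step 1: choose k distinct data points as initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each point to its nearest centroid (Euclidean distance)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    inertia = ((X - centroids[labels]) ** 2).sum()  # within-cluster SSE
    return labels, centroids, inertia

rng = np.random.default_rng(42)
# Synthetic data: two well-separated 2-D blobs
X_demo = np.vstack([
    rng.normal(0.0, 0.5, size=(100, 2)),
    rng.normal(5.0, 0.5, size=(100, 2)),
])
demo_labels, demo_centroids, demo_inertia = kmeans_sketch(X_demo, k=2)
print(demo_centroids.round(2), round(float(demo_inertia), 1))
```

The returned `inertia` is the same quantity that the Elbow Method tracks across values of k.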

### **8.2.1 Defining the number of clusters**

In [None]:
# Evaluate K-Means clustering across range of k values
# Using k-means++ initialization and multiple runs for stability

km_k_range_behav = range(2, 14)
km_metrics_behav = {
    'k': [],
    'Inertia': [],
    'Silhouette': [],
    'Calinski-Harabasz': [],
    'Davies-Bouldin': []
}

# Store labels and silhouette samples for visualization
km_fitted_labels_behav = {}
km_silhouette_samples_behav = {}

for k in km_k_range_behav:
    kmeans = KMeans(n_clusters=k, init='k-means++', n_init=15, random_state=42, max_iter=300)
    km_labels = kmeans.fit_predict(df_behavioral_clean)
    
    # Store labels and silhouette samples
    km_fitted_labels_behav[k] = km_labels
    km_silhouette_samples_behav[k] = silhouette_samples(df_behavioral_clean, km_labels)
    
    metrics = evaluate_clustering_metrics(df_behavioral_clean, km_labels)
    
    km_metrics_behav['k'].append(k)
    km_metrics_behav['Inertia'].append(kmeans.inertia_)
    km_metrics_behav['Silhouette'].append(metrics['Silhouette Score'])
    km_metrics_behav['Calinski-Harabasz'].append(metrics['Calinski-Harabasz Index'])
    km_metrics_behav['Davies-Bouldin'].append(metrics['Davies-Bouldin Index'])

km_metrics_df_behav = pd.DataFrame(km_metrics_behav)

In [None]:
# Visualize Elbow Method
plot_elbow_method(
    km_k_range_behav,
    km_metrics_df_behav['Inertia'].tolist(),
    CUSTOM_HEX,
    title='K-Means Elbow Method: Optimal k Selection'
)

In [None]:
# Silhouette Analysis

for nclus in km_k_range_behav:
    fig, ax = plt.subplots(figsize=(12, 7))
    
    km_labels = km_fitted_labels_behav[nclus]
    sample_silhouette_values = km_silhouette_samples_behav[nclus]
    silhouette_avg = km_metrics_df_behav[km_metrics_df_behav['k'] == nclus]['Silhouette'].values[0]
    
    y_lower = 10
    for i in range(nclus):
        ith_cluster_silhouette_values = sample_silhouette_values[km_labels == i]
        ith_cluster_silhouette_values.sort()
        
        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i
        
        color = CUSTOM_HEX[i % len(CUSTOM_HEX)]
        ax.fill_betweenx(np.arange(y_lower, y_upper),
                         0, ith_cluster_silhouette_values,
                         facecolor=color, edgecolor=color, alpha=0.7)
        
        ax.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i), fontweight='bold', fontsize=11)
        y_lower = y_upper + 10
    
    ax.set_title(f"K-Means Silhouette Analysis for k={nclus}\nAverage Silhouette Score: {silhouette_avg:.4f}", 
                 fontsize=14, fontweight='bold', pad=15)
    ax.set_xlabel("Silhouette Coefficient Values", fontsize=12, fontweight='bold')
    ax.set_ylabel("Cluster", fontsize=12, fontweight='bold')
    
    ax.axvline(x=silhouette_avg, color="red", linestyle="--", linewidth=2.5, 
               label=f'Average: {silhouette_avg:.4f}')
    
    xmin = max(-0.3, np.round(sample_silhouette_values.min() - 0.1, 2))
    xmax = min(1.0, np.round(sample_silhouette_values.max() + 0.1, 2))
    ax.set_xlim([xmin, xmax])
    ax.set_ylim([0, len(df_behavioral_clean) + (nclus + 1) * 10])
    
    ax.set_yticks([])
    ax.set_xticks(np.arange(xmin, xmax + 0.1, 0.1))
    ax.legend(loc='upper right', fontsize=11)
    ax.grid(True, alpha=0.3, axis='x')
    
    plt.tight_layout()
    plt.show()

In [None]:
# Visualize clustering metrics
plot_clustering_metrics(
    km_metrics_df_behav,
    km_k_range_behav,
    CUSTOM_HEX,
    title='K-Means Clustering: Optimal k Evaluation'
)

# Display metrics table
km_metrics_df_behav

<div style="background-color: #e8f3e5ff; padding: 15px; margin-right: 30px; border-left: 5px solid; border-image: linear-gradient(to bottom, #00411E, #00622D, #00823C, #45AF28, #82BA72) 1; border-radius: 5px; max-width: 95%;">
    <h3 style="margin-top: 0; margin-bottom: 10px; color: #00411E; font-weight: bold;">Optimal k Selection Summary (K-Means Clustering)</h3>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        Evaluated k=2 to k=13 using multiple validation indices to identify the optimal number of clusters for K-Means partitioning of airline customer behavioral data.
    </p>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Goal:</h4>
    <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
        <li style="margin-right: 20px;">Determine k that balances cluster quality metrics with business interpretability for actionable customer segmentation</li>
    </ul>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Methodology:</h4>
    <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
        <li style="margin-right: 20px;"><strong>Elbow Method (Inertia):</strong> Total within-cluster sum of squared distances - look for "elbow" where adding clusters yields diminishing returns</li>
        <li style="margin-right: 20px;"><strong>Silhouette Score:</strong> Measures cluster cohesion and separation (range -1 to 1, higher is better)</li>
        <li style="margin-right: 20px;"><strong>Calinski-Harabasz Index:</strong> Ratio of between-cluster to within-cluster variance (higher is better)</li>
        <li style="margin-right: 20px;"><strong>Davies-Bouldin Index:</strong> Average similarity between clusters (lower is better)</li>
    </ul>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Findings:</h4>
    <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
        <li style="margin-right: 20px;"><strong>Elbow Analysis:</strong> Inertia curve shows continuous decline from 38,612 (k=2) to 17,127 (k=13). The rate of decrease slows notably after k=5, suggesting an elbow region around k=4 to k=5</li>
        <li style="margin-right: 20px;"><strong>Silhouette Score:</strong> Peaks at k=2 (0.194), with k=5 (0.186) and k=4 (0.185) achieving the next best scores before declining at k=6 (0.172)</li>
        <li style="margin-right: 20px;"><strong>Davies-Bouldin Index:</strong> Improves steadily, with k=5 (1.36) outperforming k=4 (1.45), indicating better cluster separation at k=5</li>
        <li style="margin-right: 20px;"><strong>Candidate Selection:</strong> k=4 and k=5 emerge as candidates. Both achieve nearly identical Silhouette scores (0.185 vs 0.186), while k=5 offers better cluster separation (DBI 1.36 vs 1.45) with no loss in Silhouette</li>
    </ul>
</div>
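
Reading an elbow by eye is subjective, so it can help to look at the relative drop in inertia from one k to the next. The sketch below does this on synthetic data with four blobs; it is an illustration of the idea, not part of the notebook's helper `plot_elbow_method`.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
# Synthetic stand-in: four Gaussian blobs in 2-D
centers = np.array([[0, 0], [6, 0], [0, 6], [6, 6]], dtype=float)
X_demo = np.vstack([rng.normal(c, 0.6, size=(80, 2)) for c in centers])

k_values = list(range(2, 9))
inertias = [
    KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=42).fit(X_demo).inertia_
    for k in k_values
]

# Relative drop in inertia when moving from k-1 to k clusters
relative_drops = -np.diff(inertias) / np.array(inertias[:-1])
for k, drop in zip(k_values[1:], relative_drops):
    print(f"k={k}: inertia falls by {drop:.1%}")
```

The elbow sits at the last k with a large relative drop; beyond it, each extra cluster buys only a small reduction.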

### **8.2.2 Comparison of Clustering Solutions**

In [None]:
# Compare two candidate solutions based on clustering metrics analysis
# k=4
# k=5

km_k_candidate_1_behav = 4
km_k_candidate_2_behav = 5

# Use pre-fitted labels
km_labels_k1_behav = km_fitted_labels_behav[km_k_candidate_1_behav]
km_labels_k2_behav = km_fitted_labels_behav[km_k_candidate_2_behav]

# Create temporary DataFrames with cluster labels
df_temp_k1 = df_behavioral_clean.copy()
df_temp_k1['Cluster'] = km_labels_k1_behav

df_temp_k2 = df_behavioral_clean.copy()
df_temp_k2['Cluster'] = km_labels_k2_behav

# Calculate cluster profiles (mean values per cluster)
cluster_profiles_k1 = df_temp_k1.groupby('Cluster').mean()
cluster_profiles_k2 = df_temp_k2.groupby('Cluster').mean()

# Display both profiles for comparison
print(f"\nCluster Profiles for k={km_k_candidate_1_behav}:")
display(cluster_profiles_k1.round(3))

print(f"\n\nCluster Profiles for k={km_k_candidate_2_behav}:")
display(cluster_profiles_k2.round(3))

# Visualize cluster size comparison
plot_cluster_size_comparison(
    labels_dict={km_k_candidate_1_behav: km_labels_k1_behav, km_k_candidate_2_behav: km_labels_k2_behav},
    palette=CUSTOM_HEX,
    title='K-Means Clustering: Cluster Size Comparison'
)

<div style="background-color: #e8f3e5ff; padding: 15px; margin-right: 30px; border-left: 5px solid; border-image: linear-gradient(to bottom, #00411E, #00622D, #00823C, #45AF28, #82BA72) 1; border-radius: 5px; max-width: 95%;">
    <h3 style="margin-top: 0; margin-bottom: 10px; color: #00411E; font-weight: bold;">Comparison of Clustering Solutions Summary</h3>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        <strong>Decision: k=5 selected</strong>
    </p>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        k=5 achieves the lower Davies-Bouldin Index of the two candidates (1.36 vs 1.45) with a comparable Silhouette (0.186 vs 0.185). Its cluster sizes are balanced (17.6% to 23.6%), and profiles are distinct across all four behavioral features. k=4 merges the high-redemption segment with sporadic flyers, losing a key marketing target, whereas k=5 isolates the "Engaged Social Redeemer" (19.7%, redemption Z=+1.40) as a distinct, actionable segment.
    </p>
</div>
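
The comparison above weighs Silhouette, Davies-Bouldin, and cluster-size balance for two candidate values of k. A minimal sketch of that comparison is shown below. It runs on synthetic data (`X_demo` is a stand-in, not the notebook's `df_behavioral_clean`), so the numbers it prints will not match the ones reported in the summary.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the standardized behavioral features (assumption)
X_raw, _ = make_blobs(n_samples=600, centers=5, n_features=4, random_state=1)
X_demo = StandardScaler().fit_transform(X_raw)

comparison = {}
for k in (4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X_demo)
    # Share of observations per cluster, in percent
    shares = np.bincount(labels, minlength=k) / len(labels) * 100
    comparison[k] = {
        "silhouette": silhouette_score(X_demo, labels),          # higher is better
        "davies_bouldin": davies_bouldin_score(X_demo, labels),  # lower is better
        "min_share_pct": shares.min(),
        "max_share_pct": shares.max(),
    }

for k, row in comparison.items():
    print(k, {name: round(float(value), 3) for name, value in row.items()})
```

Reading the two rows side by side mirrors the decision logic: prefer the lower Davies-Bouldin value when Silhouette is nearly tied, and check that no cluster share collapses toward zero.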

### **8.2.3 Final K-Means Clustering Solution**

In [None]:
# Final K-Means Clustering Solution
# Using k=5 from comparison analysis

km_final_k_behav = km_k_candidate_2_behav  # k=5

# Reuse labels from comparison step
km_labels_final_behav = km_labels_k2_behav

# Create labeled dataset
df_behavioral_clean_labeled_km = df_behavioral_clean.copy()
df_behavioral_clean_labeled_km['Cluster'] = km_labels_final_behav

# Calculate final metrics on the behavioral features only, so that label columns
# added to df_behavioral_clean by later cells (e.g. 'ms_cluster', 'gmm_cluster')
# cannot leak into the metrics if this cell is re-run
km_final_metrics_behav = evaluate_clustering_metrics(df_behavioral_clean[behavioral_feats], km_labels_final_behav)

# Store for final comparison (Section 9)
behavioral_clustering_results['K-Means'] = {
    'k': km_final_k_behav,
    'Silhouette': km_final_metrics_behav['Silhouette Score'],
    'Calinski-Harabasz': km_final_metrics_behav['Calinski-Harabasz Index'],
    'Davies-Bouldin': km_final_metrics_behav['Davies-Bouldin Index'],
    'R2': get_rsq(df_behavioral_clean_labeled_km, behavioral_feats, 'Cluster'),
    'labels': km_labels_final_behav
}

### **8.2.4 Cluster Profiling**

In [None]:
# 1. Cluster Profiles Heatmap - Z-scores of behavioral features per cluster
km_cluster_profiles_behav = df_behavioral_clean_labeled_km.groupby('Cluster')[behavioral_feats].mean()
km_population_mean_behav = df_behavioral_clean[behavioral_feats].mean()

plot_cluster_profiles_heatmap(
    km_cluster_profiles_behav, 
    km_population_mean_behav, 
    GROUP80_palette_continuous,
    title='K-Means Clustering: Behavioral Profiles (k=5)\nStandardized Z-Scores per Cluster'
)

In [None]:
# 2. Cluster Size Distribution
plot_cluster_sizes(
    km_labels_final_behav, 
    km_final_k_behav, 
    CUSTOM_HEX,
    title='K-Means Clustering - Final Cluster Sizes'
)

In [None]:
# 3. Feature Importance Analysis - Variance across clusters
km_feature_variance_behav = km_cluster_profiles_behav.var(axis=0).sort_values(ascending=False)

plot_feature_importance(
    km_feature_variance_behav,
    CUSTOM_HEX,
    title='K-Means Clustering: Feature Importance Analysis\nWhich Features Differentiate Clusters?'
)

# Display feature importance ranking
km_feature_importance_df_behav = pd.DataFrame({
    'Feature': km_feature_variance_behav.index,
    'Variance': km_feature_variance_behav.values.round(4)
})

km_feature_importance_df_behav

<div style="background-color: #e8f3e5ff; padding: 15px; margin-right: 30px; border-left: 5px solid; border-image: linear-gradient(to bottom, #00411E, #00622D, #00823C, #45AF28, #82BA72) 1; border-radius: 5px; max-width: 95%;">
    <h3 style="margin-top: 0; margin-bottom: 10px; color: #00411E; font-weight: bold;">K-Means Clustering Profiling Summary (k=5)</h3>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        Analyzed behavioral characteristics of the final 5 K-Means clusters to identify distinct customer segments. Redemption frequency, companion ratio, and flight regularity emerge as primary differentiators.
    </p>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Key Findings:</h4>
    <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
        <li style="margin-right: 20px;"><strong>Cluster 0 (17.6%, n=2,264) - Family Travelers:</strong> Very high companion ratio (Z=+1.18), below-average regularity (Z=-0.60), average redemption. Family vacation segment traveling together.</li>
        <li style="margin-right: 20px;"><strong>Cluster 1 (21.4%, n=2,760) - Business Commuters:</strong> High flight regularity (Z=+0.89), low distance variability (Z=-0.59), passive redeemer (Z=-0.38). Predictable business travelers.</li>
        <li style="margin-right: 20px;"><strong>Cluster 2 (17.7%, n=2,285) - Disengaged Solo:</strong> Very low regularity (Z=-0.91), solo traveler (Z=-0.81), passive redeemer (Z=-0.59). Re-engagement target segment.</li>
        <li style="margin-right: 20px;"><strong>Cluster 3 (23.6%, n=3,040) - Explorers:</strong> Very high distance variability (Z=+1.11), travels alone (Z=-0.38), average regularity. Adventure-seeking travelers visiting diverse destinations.</li>
        <li style="margin-right: 20px;"><strong>Cluster 4 (19.7%, n=2,536) - Engaged Loyalists:</strong> Very high redemption (Z=+1.40), travels with companions (Z=+0.37), somewhat regular (Z=+0.38). Most engaged loyalty segment.</li>
        <li style="margin-right: 20px;"><strong>Primary Segmentation Driver - Redemption Frequency (Variance: 0.64):</strong> Redemption creates the clearest separation, with Cluster 4 redeeming far more often than average (Z=+1.40).</li>
        <li style="margin-right: 20px;"><strong>Secondary Drivers - Companion Ratio (0.60) & Flight Regularity (0.53):</strong> Companion ratio distinguishes family vs. solo travelers, while regularity separates commuters from sporadic flyers.</li>
        <li style="margin-right: 20px;"><strong>Distance Variability (Variance: 0.48):</strong> Differentiates explorers (Cluster 3) from routine travelers.</li>
    </ul>
</div>
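
The Z-scores and variances quoted above come from two simple computations: the per-cluster feature means expressed in population standard deviations, and the variance of those means across clusters. The sketch below illustrates both on toy data. The feature names and the data are hypothetical, and the Z-score formula is an assumption about what `plot_cluster_profiles_heatmap` computes, since that helper is not shown in this section.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Toy data with hypothetical feature names (not the notebook's real columns)
feats = ["redemption_frequency", "companion_ratio"]
df_demo = pd.DataFrame(rng.normal(size=(300, 2)), columns=feats)
df_demo["Cluster"] = rng.integers(0, 3, size=300)

# Shift cluster 2 on the first feature so that it clearly stands out
df_demo.loc[df_demo["Cluster"] == 2, "redemption_frequency"] += 2.0

# Per-cluster means, expressed as Z-scores relative to the whole population
cluster_means = df_demo.groupby("Cluster")[feats].mean()
pop_mean = df_demo[feats].mean()
pop_std = df_demo[feats].std(ddof=0)
z_profiles = (cluster_means - pop_mean) / pop_std

# "Feature importance": how much the cluster means differ from each other
feature_variance = z_profiles.var(axis=0).sort_values(ascending=False)
print(z_profiles.round(2))
print(feature_variance.round(3))
```

When the input features are already standardized, the variance of the raw cluster means (as computed in the cell above) and the variance of the Z-score profiles coincide.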

---

## **8.3 Mean Shift Clustering**

<div style="background-color: #e1e1e1ff; padding: 15px; margin-right: 30px; border-left: 5px solid; border-image: linear-gradient(to bottom, #212121, #313131, #595959, #909090) 1; border-radius: 5px; max-width: 95%;">
    <h3 style="margin-top: 0; margin-bottom: 10px; color: #000000ff; font-weight: bold;">Mean Shift Clustering Methodology</h3>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        Mean Shift is a <strong>density-based mode-seeking algorithm</strong> that identifies clusters by shifting a sliding window toward regions of highest point density. Unlike K-Means, it does not require pre-specifying k and can discover clusters of arbitrary shape by following the gradient of the underlying density estimate.
    </p>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #313131; font-weight: bold;">Algorithm:</h4>
    <ol style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
        <li style="margin-bottom: 5px;"><strong>Initialize a Sliding Window:</strong> Begin with a circular kernel window centered at a point <strong>C</strong> (randomly selected or one per data point) with radius <strong>r</strong> (bandwidth)</li>
        <li style="margin-bottom: 5px;"><strong>Shift Toward Higher Density:</strong> At each iteration, compute the mean of all points inside the window and shift the center <strong>C</strong> to this mean, gradually moving toward higher-density regions</li>
        <li style="margin-bottom: 5px;"><strong>Convergence:</strong> Repeat the shift step until the movement of the window center becomes negligible (the center has converged to a mode)</li>
        <li style="margin-bottom: 5px;"><strong>Merge Modes and Assign Clusters:</strong> Run the process from many initial centers, then merge converged centers that are within a tolerance distance. Assign each data point to the cluster of the nearest converged center. If sliding windows overlap, the densest mode (window containing the most points) is preserved and points are grouped accordingly</li>
    </ol>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #313131; font-weight: bold;">Workflow for This Analysis:</h4>
    <ol style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
        <li style="margin-bottom: 5px;"><strong>Selecting the Best Bandwidth:</strong> Test bandwidth quantiles from 0.018 to 0.027 using <strong>estimate_bandwidth()</strong>, then fit Mean Shift for each bandwidth and track <strong>n_clusters</strong>, <strong>R²</strong>, and <strong>Silhouette</strong></li>
        <li style="margin-bottom: 5px;"><strong>Evaluation of Mean Shift Solutions:</strong> Compare q=0.027 (2 clusters) vs q=0.025 (3 clusters) by examining cluster sizes and feature profiles</li>
        <li style="margin-bottom: 5px;"><strong>Final Mean Shift Clustering Solution:</strong> Select q=0.025 (bandwidth=1.07, 3 clusters)</li>
        <li style="margin-bottom: 5px;"><strong>Mean Shift Cluster Profiling:</strong> Profile the final solution using cluster profile heatmap, cluster size distribution, and feature importance</li>
    </ol>
</div>
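
The four algorithm steps listed above can be sketched in a few lines of NumPy. This is a simplified illustration with a flat kernel and one window per data point, not scikit-learn's `MeanShift` implementation. The function name `mean_shift_flat` and the demo data are hypothetical.

```python
import numpy as np

def mean_shift_flat(X, bandwidth, max_iter=300, tol=1e-4):
    """Flat-kernel mean shift: one sliding window per data point."""
    # Step 1: start one window at every data point
    centers = X.astype(float).copy()
    for _ in range(max_iter):
        # Step 2: move each center to the mean of the points inside its window
        dists = np.linalg.norm(centers[:, None, :] - X[None, :, :], axis=2)
        inside = (dists <= bandwidth).astype(float)
        new_centers = (inside @ X) / inside.sum(axis=1, keepdims=True)
        shift = np.linalg.norm(new_centers - centers, axis=1).max()
        centers = new_centers
        # Step 3: stop once no center moves noticeably
        if shift < tol:
            break

    # Step 4: merge converged centers, keeping the densest window first
    dists = np.linalg.norm(centers[:, None, :] - X[None, :, :], axis=2)
    counts = (dists <= bandwidth).sum(axis=1)
    modes = []
    for idx in np.argsort(-counts):
        if all(np.linalg.norm(centers[idx] - m) > bandwidth for m in modes):
            modes.append(centers[idx])
    modes = np.array(modes)

    # Assign every point to its nearest surviving mode
    labels = np.linalg.norm(X[:, None, :] - modes[None, :, :], axis=2).argmin(axis=1)
    return modes, labels

# Two well-separated synthetic blobs (demo data, assumption)
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(0.0, 0.3, size=(50, 2)),
    rng.normal(5.0, 0.3, size=(50, 2)),
])
modes, labels = mean_shift_flat(X_demo, bandwidth=1.5)
print(len(modes), np.bincount(labels))
```

The pairwise-distance matrices make this sketch quadratic in memory, so it is only suitable for small illustrations. For the actual analysis, `MeanShift` with `bin_seeding=True` starts from a coarse grid of seeds instead of every point.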

### **8.3.1 Selecting the Best Bandwidth**

In [None]:
# Estimate bandwidth values
bandwidth_quantiles_behav = [0.018, 0.02, 0.021, 0.0215, 0.023, 0.024, 0.025, 0.026, 0.027]
ms_results_behav = []

X_ms_behav = df_behavioral_clean[behavioral_feats]

for q in bandwidth_quantiles_behav:
    bw = estimate_bandwidth(X_ms_behav, quantile=q, random_state=1)
    ms = MeanShift(bandwidth=bw, bin_seeding=True, n_jobs=-1)
    labels = ms.fit_predict(X_ms_behav)

    n_clusters = len(np.unique(labels))

    # R²
    df_tmp = X_ms_behav.copy()
    df_tmp["labels"] = labels
    r2 = get_rsq(df_tmp, behavioral_feats, "labels")

    # Silhouette (only defined if >= 2 clusters)
    sil = silhouette_score(X_ms_behav, labels) if n_clusters >= 2 else np.nan

    ms_results_behav.append({
        "quantile": q,
        "bandwidth": float(bw),
        "n_clusters": int(n_clusters),
        "R2": float(r2),
        "Silhouette": float(sil) if not np.isnan(sil) else np.nan
    })

ms_results_df_behav = pd.DataFrame(ms_results_behav).sort_values("quantile", ascending=False)
ms_results_df_behav

In [None]:
plot_meanshift_quantile_vs_clusters(ms_results_df_behav)

<div style="background-color: #e8f3e5ff; padding: 15px; margin-right: 30px; border-left: 5px solid; border-image: linear-gradient(to bottom, #00411E, #00622D, #00823C, #45AF28, #82BA72) 1; border-radius: 5px; max-width: 95%;">
    <h3 style="margin-top: 0; margin-bottom: 10px; color: #00411E; font-weight: bold;">Mean Shift Bandwidth Selection Summary</h3>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        Determined the Mean Shift kernel bandwidth by testing a targeted set of bandwidth quantiles. For each quantile, bandwidth was estimated via <strong>estimate_bandwidth()</strong>, Mean Shift was fitted, and the solution was evaluated using <strong>cluster count</strong>, <strong>R²</strong>, and <strong>Silhouette</strong> to balance segmentation granularity and separation quality.
    </p>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Goal:</h4>
    <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
        <li style="margin-right: 20px;">Select a bandwidth regime that yields interpretable, stable clusters with good separation, while avoiding over-fragmentation (many micro-clusters) or collapse into too few clusters</li>
    </ul>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Methodology:</h4>
    <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
        <li style="margin-right: 20px;"><strong>Quantile scan (bandwidth selection):</strong> Evaluated quantiles from 0.018 to 0.027 to observe how cluster count changes as bandwidth varies.</li>
        <li style="margin-right: 20px;"><strong>Evaluation criteria:</strong> Compared <strong>n_clusters</strong> (solution granularity), <strong>Silhouette</strong> (cluster separation) and <strong>R²</strong> (variance explained).</li>
    </ul>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Findings:</h4>
    <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
        <li style="margin-right: 20px;"><strong>Cluster count pattern:</strong> Higher quantiles (q=0.027) produce 2 clusters, q=0.025-0.026 yield 3 clusters, q=0.021-0.024 produce 5-6 clusters, and lower quantiles (q=0.018) fragment into 16 micro-clusters.</li>
        <li style="margin-right: 20px;"><strong>Candidate A (q=0.027):</strong> 2 clusters with highest Silhouette (0.241) but insufficient granularity for actionable segmentation.</li>
        <li style="margin-right: 20px;"><strong>Candidate B (q=0.025):</strong> 3 clusters with second-highest Silhouette (0.147), R²=0.072, bandwidth=1.068. Provides more granularity than Candidate A, although both Silhouette and R² remain low.</li>
        <li style="margin-right: 20px;"><strong>Decision for next step:</strong> Select q=0.025 (3 clusters) as candidate for comparison, balancing interpretability with cluster quality metrics.</li>
    </ul>
</div>
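
To make the role of the quantile more concrete: `estimate_bandwidth()` derives the bandwidth from nearest-neighbour distances, and a larger `quantile` considers more neighbours per point, which yields a larger bandwidth. The sketch below scans a few quantiles on synthetic data. The quantile values and `X_demo` are illustrative assumptions and differ from the range used in this analysis.

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

# Synthetic stand-in data (assumption): three blobs in two dimensions
X_demo, _ = make_blobs(n_samples=400, centers=3, cluster_std=0.8, random_state=1)

scan = []
for q in (0.05, 0.1, 0.2, 0.4):
    # Larger quantile -> more neighbours considered -> wider kernel
    bw = estimate_bandwidth(X_demo, quantile=q)
    labels = MeanShift(bandwidth=bw, bin_seeding=True).fit_predict(X_demo)
    scan.append((q, float(bw), len(np.unique(labels))))

for q, bw, n_clusters in scan:
    print(f"quantile={q:<5} bandwidth={bw:.3f} n_clusters={n_clusters}")
```

A wider kernel smooths the density estimate, so neighbouring modes tend to merge and the cluster count tends to fall as the quantile grows, which is the pattern reported in the findings above.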

### **8.3.2 Evaluation of Mean Shift Solutions**

In [None]:
# Compare two candidate Mean Shift solutions based on previous evaluation results
# Candidate A: quantile=0.027 -> 2 Clusters
# Candidate B: quantile=0.025 -> 3 Clusters

ms_q_1_behav = 0.027
ms_q_2_behav = 0.025

X_ms_behav = df_behavioral_clean[behavioral_feats]

# Get bandwidths from the already computed results table
ms_bw_1_behav = float(ms_results_df_behav.loc[ms_results_df_behav["quantile"] == ms_q_1_behav, "bandwidth"].iloc[0])
ms_bw_2_behav = float(ms_results_df_behav.loc[ms_results_df_behav["quantile"] == ms_q_2_behav, "bandwidth"].iloc[0])

# Fit both candidate solutions
ms_cand_1_behav = MeanShift(bandwidth=ms_bw_1_behav, bin_seeding=True, n_jobs=-1)
ms_cand_2_behav = MeanShift(bandwidth=ms_bw_2_behav, bin_seeding=True, n_jobs=-1)

ms_labels_1_behav = ms_cand_1_behav.fit_predict(X_ms_behav)
ms_labels_2_behav = ms_cand_2_behav.fit_predict(X_ms_behav)

# Cluster profiles
df_temp_ms1 = df_behavioral_clean.copy()
df_temp_ms1["Cluster"] = ms_labels_1_behav
cluster_profiles_ms1_behav = df_temp_ms1.groupby("Cluster")[behavioral_feats].mean()

df_temp_ms2 = df_behavioral_clean.copy()
df_temp_ms2["Cluster"] = ms_labels_2_behav
cluster_profiles_ms2_behav = df_temp_ms2.groupby("Cluster")[behavioral_feats].mean()

display(cluster_profiles_ms1_behav.round(3))
display(cluster_profiles_ms2_behav.round(3))

# Visualize cluster size comparison
plot_cluster_size_comparison(
    labels_dict={
        f"q={ms_q_1_behav} (bw={ms_bw_1_behav:.2f})": ms_labels_1_behav,
        f"q={ms_q_2_behav} (bw={ms_bw_2_behav:.2f})": ms_labels_2_behav,
    },
    palette=CUSTOM_HEX,
    title="Mean Shift Clustering: Cluster Size Comparison (Quantile & Bandwidth)"
)

<div style="background-color: #e8f3e5ff; padding: 15px; margin-right: 30px; border-left: 5px solid; border-image: linear-gradient(to bottom, #00411E, #00622D, #00823C, #45AF28, #82BA72) 1; border-radius: 5px; max-width: 95%;">
    <h3 style="margin-top: 0; margin-bottom: 10px; color: #00411E; font-weight: bold;">Comparison of Clustering Solutions Summary</h3>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        <strong>Decision: q=0.025 selected (3 clusters), but limited practical value</strong>
    </p>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        Both solutions produce highly imbalanced clusters: q=0.027 yields one dominant cluster (97.3%) plus one micro-cluster (2.7%), while q=0.025 yields one dominant cluster (95.3%) plus two micro-clusters (2.1%, 2.7%). Mean Shift identifies density modes, but the behavioral feature space lacks the natural density separation needed for actionable segmentation. The algorithm finds one large mainstream mode with small outlier groups rather than distinct behavioral personas.
    </p>
</div>


### **8.3.3 Final Mean Shift Clustering Solution**

In [None]:
# Final Mean Shift Clustering Solution
# Using quantile from comparison analysis

chosen_quantile_ms_behav = ms_q_2_behav  # 0.025
chosen_bandwidth_ms_behav = ms_bw_2_behav

# Reuse labels from comparison step
ms_labels_final_behav = ms_labels_2_behav

df_behavioral_clean["ms_cluster"] = ms_labels_final_behav

# Calculate final metrics
ms_final_metrics_behav = evaluate_clustering_metrics(df_behavioral_clean[behavioral_feats], ms_labels_final_behav)

# Store for final comparison (Section 9)
behavioral_clustering_results['Mean Shift'] = {
    'k': len(np.unique(ms_labels_final_behav)),
    'Silhouette': ms_final_metrics_behav['Silhouette Score'],
    'Calinski-Harabasz': ms_final_metrics_behav['Calinski-Harabasz Index'],
    'Davies-Bouldin': ms_final_metrics_behav['Davies-Bouldin Index'],
    'R2': ms_results_df_behav[ms_results_df_behav['quantile'] == chosen_quantile_ms_behav]['R2'].values[0],
    'labels': ms_labels_final_behav
}

### **8.3.4 Mean Shift Cluster Profiling**

In [None]:
# 1. Profile Heatmap - Z-scores of behavioral features per cluster (Mean Shift)
ms_cluster_profiles_behav = (df_behavioral_clean
                       .groupby("ms_cluster")[behavioral_feats]
                       .mean())
ms_population_mean_behav = df_behavioral_clean[behavioral_feats].mean()

plot_cluster_profiles_heatmap(
    ms_cluster_profiles_behav,
    ms_population_mean_behav,
    GROUP80_palette_continuous,
    title="Mean Shift - Cluster Profiles"
)

In [None]:
# 2. Cluster sizes
nclus_ms_behav = len(np.unique(ms_labels_final_behav))
plot_cluster_sizes(
    ms_labels_final_behav,
    nclus_ms_behav,
    CUSTOM_HEX,
    title="Mean Shift - Final Cluster Sizes"
)

In [None]:
# 3. Feature Importance Analysis - Variance across clusters
ms_feature_variance_behav = ms_cluster_profiles_behav.var(axis=0).sort_values(ascending=False)

plot_feature_importance(
    ms_feature_variance_behav,
    CUSTOM_HEX,
    title="Mean Shift Clustering: Feature Importance Analysis\nWhich Features Differentiate Clusters?"
)

# Display feature importance ranking
ms_feature_importance_df_behav = pd.DataFrame({
    "Feature": ms_feature_variance_behav.index,
    "Variance": ms_feature_variance_behav.values.round(4),
})

ms_feature_importance_df_behav

<div style="background-color: #e8f3e5ff; padding: 15px; margin-right: 30px; border-left: 5px solid; border-image: linear-gradient(to bottom, #00411E, #00622D, #00823C, #45AF28, #82BA72) 1; border-radius: 5px; max-width: 95%;">
    <h3 style="margin-top: 0; margin-bottom: 10px; color: #00411E; font-weight: bold;">Mean Shift Profiling Summary</h3>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        Profiled the final Mean Shift solution (q=0.025, 3 clusters). Highly imbalanced: one dominant mainstream cluster captures 95.3% of customers.
    </p>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Key Findings:</h4>
    <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
        <li style="margin-right: 20px;"><strong>Cluster 0 (95.3%, n=12,273) - Mainstream:</strong> Near-average across all features. Represents the dense core of the behavioral feature space.</li>
        <li style="margin-right: 20px;"><strong>Cluster 1 (2.1%, n=266) - Extreme Explorer:</strong> Very high distance variability (Z=+1.84), solo (Z=-1.05), irregular schedule (Z=-1.19).</li>
        <li style="margin-right: 20px;"><strong>Cluster 2 (2.7%, n=346) - Extreme Regular Routine:</strong> Very low distance variability (Z=-1.67), highly regular (Z=+1.33).</li>
        <li style="margin-right: 20px;"><strong>Primary Driver - Distance Variability (Variance: 3.08):</strong> Mean Shift separates only extreme outliers on distance patterns, failing to segment the mainstream population meaningfully.</li>
    </ul>
</div>

---

## **8.4 Gaussian Mixture Models (GMM)**

<div style="background-color: #e1e1e1ff; padding: 15px; margin-right: 30px; border-left: 5px solid; border-image: linear-gradient(to bottom, #212121, #313131, #595959, #909090) 1; border-radius: 5px; max-width: 95%;">
    <h3 style="margin-top: 0; margin-bottom: 10px; color: #000000ff; font-weight: bold;">Gaussian Mixture Model (GMM) Methodology</h3>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        GMM is a <strong>probabilistic clustering algorithm</strong> that models data as a mixture of k Gaussian distributions. Unlike hard clustering (K-Means, Hierarchical), GMM provides soft assignments where each point has a probability of belonging to each cluster, enabling uncertainty quantification and overlapping segments.
    </p>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #313131; font-weight: bold;">Algorithm:</h4>
    <ol style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
        <li style="margin-bottom: 5px;"><strong>Initialization:</strong> Initialize k Gaussian components (each with mean, covariance, and mixing weight), either randomly or, as in this analysis, from a K-Means solution (<strong>init_params="kmeans"</strong>)</li>
        <li style="margin-bottom: 5px;"><strong>Expectation Step (E-step):</strong> Calculate the probability (responsibility) that each data point belongs to each Gaussian component using the current parameters</li>
        <li style="margin-bottom: 5px;"><strong>Maximization Step (M-step):</strong> Update component parameters (means, covariances, weights) to maximize the likelihood of the data given the current responsibilities</li>
        <li style="margin-bottom: 5px;"><strong>Convergence:</strong> Iterate E-step and M-step until parameters stabilize (EM algorithm converges to local maximum)</li>
    </ol>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #313131; font-weight: bold;">Workflow for This Analysis:</h4>
    <ol style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
        <li style="margin-bottom: 5px;"><strong>Covariance Type Selection:</strong> Compare full, tied, diag, and spherical covariance structures using BIC and AIC to identify the best fitting model</li>
        <li style="margin-bottom: 5px;"><strong>Component Selection:</strong> For diag covariance, test n=2-10 components and evaluate using BIC, AIC, Silhouette, and R²</li>
        <li style="margin-bottom: 5px;"><strong>Candidate Comparison:</strong> Compare n=3 vs n=4 components using cluster profiles, assignment uncertainty, and cluster size distribution</li>
        <li style="margin-bottom: 5px;"><strong>Final Model and Profiling:</strong> Selected n=3 components with diag covariance for better assignment certainty (12.6% uncertain vs 29.8%), producing three segments (48.1%, 19.3%, 32.6%) differentiated primarily by redemption frequency</li>
    </ol>
</div>
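
The E-step and M-step described above can be illustrated with a small one-dimensional example. The sketch below is a simplified EM loop written for readability, not the `GaussianMixture` implementation used in this notebook. The function name `em_gmm_1d` and the data are hypothetical, and the means are initialized at evenly spaced quantiles (an assumption made for reproducibility) rather than randomly or via K-Means.

```python
import numpy as np

def em_gmm_1d(x, k=2, n_iter=100, tol=1e-6):
    """Fit a k-component 1D Gaussian mixture with plain EM."""
    # Initialization: spread the means over the data, equal weights
    means = np.quantile(x, (np.arange(k) + 0.5) / k)
    variances = np.full(k, x.var())
    weights = np.full(k, 1.0 / k)
    prev_ll = -np.inf

    for _ in range(n_iter):
        # E-step: weighted component densities -> responsibilities
        dens = (
            weights
            * np.exp(-0.5 * (x[:, None] - means) ** 2 / variances)
            / np.sqrt(2.0 * np.pi * variances)
        )
        total = dens.sum(axis=1, keepdims=True)
        resp = dens / total

        # M-step: re-estimate weights, means, and variances
        nk = resp.sum(axis=0)
        weights = nk / len(x)
        means = (resp * x[:, None]).sum(axis=0) / nk
        variances = (resp * (x[:, None] - means) ** 2).sum(axis=0) / nk + 1e-6

        # Convergence: stop when the log-likelihood no longer improves
        ll = np.log(total).sum()
        if ll - prev_ll < tol:
            break
        prev_ll = ll

    return weights, means, variances, resp

# Synthetic data (assumption): two Gaussians centered at 0 and 6
rng = np.random.default_rng(1)
x_demo = np.concatenate([rng.normal(0.0, 1.0, 300), rng.normal(6.0, 1.0, 300)])
weights, means, variances, resp = em_gmm_1d(x_demo)
print(weights.round(3), means.round(3), variances.round(3))
```

Each row of `resp` is the soft assignment of one point, which is the quantity that `predict_proba` exposes in scikit-learn and that the uncertainty analysis later in this section relies on.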

### **8.4.1 Selecting covariance_type & n_components**

In [None]:
# Candidates
gmm_n_components_behav = list(range(2, 11))
gmm_cov_types_behav = ["full", "tied", "diag", "spherical"]

X_gmm_behav = df_behavioral_clean[behavioral_feats].copy()

gmm_results_df_behav = evaluate_gmm_grid(
    X=X_gmm_behav,
    feats=behavioral_feats,
    n_components_list=gmm_n_components_behav,
    covariance_types=gmm_cov_types_behav,
    n_init=10,
    random_state=1
)

gmm_results_df_sorted_behav = (
    gmm_results_df_behav
    .sort_values(["BIC", "AIC"], ascending=[True, True])
    .reset_index(drop=True)
)

gmm_results_df_sorted_behav

In [None]:
# 1) pick covariance_type
plot_gmm_covtype_bic_aic(gmm_results_df_behav, gmm_cov_types_behav, gmm_n_components_behav)

In [None]:
# 2) Find best n_components for chosen covariance_type
chosen_covariance_type_gmm_behav = "diag" # based on previous plot
plot_gmm_n_selection_for_covtype(gmm_results_df_behav, chosen_covariance_type_gmm_behav, gmm_n_components_behav)

<div style="background-color: #e8f3e5ff; padding: 15px; margin-right: 30px; border-left: 5px solid; border-image: linear-gradient(to bottom, #00411E, #00622D, #00823C, #45AF28, #82BA72) 1; border-radius: 5px; max-width: 95%;">
    <h3 style="margin-top: 0; margin-bottom: 10px; color: #00411E; font-weight: bold;">
        GMM Parameter Selection Summary
    </h3>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        Evaluated Gaussian Mixture Models by running a grid over <strong>n_components</strong> (2–10) and
        <strong>covariance_type</strong> (<strong>full</strong>, <strong>tied</strong>, <strong>diag</strong>, <strong>spherical</strong>)
        using <strong>init_params="kmeans"</strong> to stabilize initialization. Each configuration was scored with
        <strong>BIC</strong> and <strong>AIC</strong> (model selection), plus <strong>R²</strong> (variance explained) and
        <strong>Silhouette</strong> (cluster separation sanity check).
    </p>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Goal:</h4>
    <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
        <li style="margin-right: 20px;">
            Select a <strong>covariance structure</strong> that best fits the data distribution, then choose
            <strong>n_components</strong> that balances model fit (BIC/AIC), separation (Silhouette), and interpretability.
        </li>
    </ul>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Methodology:</h4>
    <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
        <li style="margin-right: 20px;">
            <strong>Step 1 – covariance_type selection:</strong>
            Compared BIC &amp; AIC curves across all covariance types. The best covariance_type achieves the
            <strong>lowest BIC/AIC</strong> consistently across n.
        </li>
        <li style="margin-right: 20px;">
            <strong>Step 2 – choose n_components within the winning covariance_type:</strong>
            For the selected covariance_type, inspected BIC/AIC, Silhouette, and R².
        </li>
    </ul>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Findings:</h4>
    <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
        <li style="margin-right: 20px;"><strong>Best covariance_type:</strong> diag (diagonal) achieves lowest BIC/AIC consistently across n_components, with BIC decreasing from 141,446 (n=2) to 70,531 (n=10)</li>
        <li style="margin-right: 20px;"><strong>Candidate A (n=3):</strong> BIC 112,337, Silhouette 0.118 (highest among diag), R² 0.234. Best cluster separation.</li>
        <li style="margin-right: 20px;"><strong>Candidate B (n=4):</strong> BIC 112,165, Silhouette 0.110, R² 0.295. Better model fit and variance explained.</li>
        <li style="margin-right: 20px;"><strong>Decision:</strong> Compare n=3 vs n=4 for diag covariance. Higher n values (5+) achieve lower BIC but Silhouette drops below 0.06, indicating poor cluster separation.</li>
    </ul>
</div>
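
The grid search summarized above relies on the helper `evaluate_gmm_grid`, which is not shown in this section. The sketch below is one plausible way to build such a grid, under stated assumptions: the data is synthetic, the helper `r_squared` is hypothetical, and R² is taken to mean one minus the ratio of within-cluster to total sum of squares.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture

def r_squared(X, labels):
    """Assumed definition: share of total variance explained, 1 - SSW / SST."""
    sst = ((X - X.mean(axis=0)) ** 2).sum()
    ssw = sum(
        ((X[labels == c] - X[labels == c].mean(axis=0)) ** 2).sum()
        for c in np.unique(labels)
    )
    return 1.0 - ssw / sst

# Synthetic stand-in data (assumption)
X_demo, _ = make_blobs(n_samples=500, centers=3, n_features=4, random_state=1)

rows = []
for cov_type in ("full", "tied", "diag", "spherical"):
    for n in range(2, 6):
        gmm = GaussianMixture(
            n_components=n,
            covariance_type=cov_type,
            n_init=3,
            init_params="kmeans",
            random_state=1,
        )
        labels = gmm.fit_predict(X_demo)
        n_found = len(np.unique(labels))
        rows.append({
            "covariance_type": cov_type,
            "n_components": n,
            "BIC": gmm.bic(X_demo),  # lower is better
            "AIC": gmm.aic(X_demo),  # lower is better
            # Silhouette is only defined for at least two non-empty clusters
            "Silhouette": silhouette_score(X_demo, labels) if n_found > 1 else np.nan,
            "R2": r_squared(X_demo, labels),
        })

grid_demo = pd.DataFrame(rows).sort_values(["BIC", "AIC"]).reset_index(drop=True)
print(grid_demo.head())
```

BIC and AIC reward likelihood and penalize parameter count, so they often keep improving as components are added. Checking Silhouette and R² alongside them, as done above, guards against choosing a model that fits well but separates poorly.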

### **8.4.2 Evaluation of GMM Solutions**

In [None]:
# Candidate A: n_components=3
# Candidate B: n_components=4

# Fit only the selected candidates
gmm_3_behav = GaussianMixture(n_components=3, covariance_type='diag', n_init=10, init_params='kmeans', random_state=1)
gmm_4_behav = GaussianMixture(n_components=4, covariance_type='diag', n_init=10, init_params='kmeans', random_state=1)

gmm_labels_3_behav = gmm_3_behav.fit_predict(df_behavioral_clean[behavioral_feats])
gmm_labels_4_behav = gmm_4_behav.fit_predict(df_behavioral_clean[behavioral_feats])

# Calculate cluster means
df_temp_3 = df_behavioral_clean[behavioral_feats].copy()
df_temp_3['Cluster'] = gmm_labels_3_behav
cluster_means_3_behav = df_temp_3.groupby('Cluster').mean()

df_temp_4 = df_behavioral_clean[behavioral_feats].copy()
df_temp_4['Cluster'] = gmm_labels_4_behav
cluster_means_4_behav = df_temp_4.groupby('Cluster').mean()

display(cluster_means_3_behav.round(3))
display(cluster_means_4_behav.round(3))

# Visualize cluster size comparison
plot_cluster_size_comparison(
    labels_dict={'GMM (n=3)': gmm_labels_3_behav, 'GMM (n=4)': gmm_labels_4_behav},
    palette=CUSTOM_HEX,
    title='GMM Clustering: Cluster Size Comparison'
)

In [None]:
# Uncertainty Analysis: Assignment probability < 70%
uncertainty_threshold = 0.7

gmm_max_probs_3_behav = gmm_3_behav.predict_proba(df_behavioral_clean[behavioral_feats]).max(axis=1)
gmm_max_probs_4_behav = gmm_4_behav.predict_proba(df_behavioral_clean[behavioral_feats]).max(axis=1)

uncertain_pct_3_behav = (gmm_max_probs_3_behav < uncertainty_threshold).sum() / len(gmm_max_probs_3_behav) * 100
uncertain_pct_4_behav = (gmm_max_probs_4_behav < uncertainty_threshold).sum() / len(gmm_max_probs_4_behav) * 100

uncertainty_summary_behav = pd.DataFrame({
    'n_components': [3, 4],
    'uncertain_pct': [uncertain_pct_3_behav, uncertain_pct_4_behav],
    'mean_prob': [gmm_max_probs_3_behav.mean(), gmm_max_probs_4_behav.mean()],
    'min_prob': [gmm_max_probs_3_behav.min(), gmm_max_probs_4_behav.min()]
})

display(uncertainty_summary_behav)

<div style="background-color: #e8f3e5ff; padding: 15px; margin-right: 30px; border-left: 5px solid; border-image: linear-gradient(to bottom, #00411E, #00622D, #00823C, #45AF28, #82BA72) 1; border-radius: 5px; max-width: 95%;">
    <h3 style="margin-top: 0; margin-bottom: 10px; color: #00411E; font-weight: bold;">Comparison of Clustering Solutions Summary</h3>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        <strong>Decision: n=3 components selected with diag covariance</strong>
    </p>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        With n=3, far fewer customers are assigned with uncertainty (12.6% have a maximum membership probability below 0.70, vs 29.8% for n=4), and the mean maximum probability is higher (0.881 vs 0.798). The n=3 solution captures the key behavioral segments: passive redeemers (Cluster 1: redemption Z=-1.31), engaged social travelers (Cluster 2: redemption Z=+1.11, companion Z=+0.40), and mainstream (Cluster 0). N=4 splits off a small cluster (10.3%) without meaningful improvement. With shares of 48.1%, 19.3%, and 32.6%, every n=3 segment is large enough to act on.
    </p>
</div>

### **8.4.3 Final GMM Clustering Solution**

In [None]:
# Final parameters based on the evaluation in the previous section
# Selected: n_components=3, covariance_type='diag'
chosen_n_components_gmm_behav = 3

gmm_labels_final_behav = gmm_labels_3_behav
gmm_final_behav = gmm_3_behav

df_behavioral_clean['gmm_cluster'] = gmm_labels_final_behav

# Calculate final metrics
gmm_final_metrics_behav = evaluate_clustering_metrics(df_behavioral_clean[behavioral_feats], gmm_labels_final_behav)

# Store for final comparison (Section 9)
behavioral_clustering_results['GMM'] = {
    'k': chosen_n_components_gmm_behav,
    'Silhouette': gmm_final_metrics_behav['Silhouette Score'],
    'Calinski-Harabasz': gmm_final_metrics_behav['Calinski-Harabasz Index'],
    'Davies-Bouldin': gmm_final_metrics_behav['Davies-Bouldin Index'],
    'R2': gmm_results_df_behav[(gmm_results_df_behav['n_components'] == chosen_n_components_gmm_behav) & (gmm_results_df_behav['covariance_type'] == 'diag')]['R2'].values[0],
    'labels': gmm_labels_final_behav
}

### **8.4.4 GMM Cluster Profiling**

In [None]:
# 1. Profile Heatmap - Z-scores of behavioral features per cluster (GMM)
df_temp_gmm_behav = df_behavioral_clean[behavioral_feats].copy()
df_temp_gmm_behav['Cluster'] = gmm_labels_final_behav
cluster_profiles_gmm_behav = df_temp_gmm_behav.groupby('Cluster').mean()
gmm_population_mean_behav = df_behavioral_clean[behavioral_feats].mean()

plot_cluster_profiles_heatmap(
    cluster_profiles_gmm_behav,
    gmm_population_mean_behav,
    GROUP80_palette_continuous,
    title='GMM Clustering: Behavioral Profiles (n=3)\nStandardized Z-Scores per Cluster'
)

In [None]:
# 2. Cluster Size Distribution
plot_cluster_sizes(
    gmm_labels_final_behav,
    chosen_n_components_gmm_behav,
    CUSTOM_HEX,
    title='GMM Clustering - Final Cluster Sizes'
)

In [None]:
# 3. Feature Importance Analysis - Variance across clusters
gmm_feature_variance_behav = cluster_profiles_gmm_behav.var(axis=0).sort_values(ascending=False)

plot_feature_importance(
    gmm_feature_variance_behav,
    CUSTOM_HEX,
    title='GMM Clustering: Feature Importance Analysis\nWhich Features Differentiate Clusters?'
)

# Display feature importance as DataFrame
gmm_feature_importance_df_behav = pd.DataFrame({
    'Feature': gmm_feature_variance_behav.index,
    'Variance': gmm_feature_variance_behav.values
}).reset_index(drop=True)

display(gmm_feature_importance_df_behav)

<div style="background-color: #e8f3e5ff; padding: 15px; margin-right: 30px; border-left: 5px solid; border-image: linear-gradient(to bottom, #00411E, #00622D, #00823C, #45AF28, #82BA72) 1; border-radius: 5px; max-width: 95%;">
    <h3 style="margin-top: 0; margin-bottom: 10px; color: #00411E; font-weight: bold;">GMM Clustering Profiling Summary (n=3)</h3>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        Analyzed behavioral characteristics of the final 3 GMM clusters (diag covariance) to identify probabilistic customer segments. Redemption frequency emerges as the dominant differentiator.
    </p>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Key Findings:</h4>
    <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
        <li style="margin-right: 20px;"><strong>Cluster 0 (48.1%, n=6,203) - Mainstream Travelers:</strong> Average across all features, slightly lower redemption (Z=-0.24). Largest segment representing typical customers.</li>
        <li style="margin-right: 20px;"><strong>Cluster 1 (19.3%, n=2,484) - Passive Redeemers:</strong> Very low redemption activity (Z=-1.31), solo travelers (companion Z=-0.33), irregular flight patterns (Z=-0.22).</li>
        <li style="margin-right: 20px;"><strong>Cluster 2 (32.6%, n=4,198) - Engaged Social Travelers:</strong> High redemption frequency (Z=+1.11), travels with companions (Z=+0.40), regular flight patterns (Z=+0.24).</li>
        <li style="margin-right: 20px;"><strong>Primary Driver - Redemption Frequency (Variance: 1.47):</strong> Redemption behavior creates the strongest separation between clusters.</li>
        <li style="margin-right: 20px;"><strong>Limited Multi-Feature Differentiation:</strong> Secondary features show minimal variance (companion: 0.15, regularity: 0.05, distance: 0.01). The GMM solution essentially segments by redemption behavior alone, with other behavioral dimensions contributing little to cluster separation.</li>
    </ul>
</div>
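
The variance reported for redemption frequency can be reproduced from the three cluster-mean Z-scores listed above. This is a minimal sketch that assumes the sample variance (`ddof=1`), which is the pandas default used by `.var()` in the profiling cell; the array name is illustrative and not part of the notebook.

```python
import numpy as np

# Cluster-mean Z-scores for redemption frequency (Clusters 0, 1, 2), taken from the summary above
redemption_cluster_means = np.array([-0.24, -1.31, 1.11])

# Sample variance across the cluster means (ddof=1 matches the pandas default)
redemption_variance = np.var(redemption_cluster_means, ddof=1)
print(f"{redemption_variance:.2f}")  # prints 1.47
```

A feature whose cluster means sit close together yields a variance near zero, which is why the secondary features contribute so little to separation.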

---

## **8.5 Self-Organizing Maps**

<div style="background-color: #e1e1e1ff; padding: 15px; margin-right: 30px; border-left: 5px solid; border-image: linear-gradient(to bottom, #212121, #313131, #595959, #909090) 1; border-radius: 5px; max-width: 95%;">
    <h3 style="margin-top: 0; margin-bottom: 10px; color: #000000ff; font-weight: bold;">Self-Organizing Maps (SOM) + K-Means Two-Stage Clustering Methodology</h3>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        Self-Organizing Maps are <strong>unsupervised neural networks</strong> that project high-dimensional data onto a low-dimensional grid while preserving topological relationships. Each neuron represents a weight vector in the input space, and during training, neurons are "pulled" toward data patterns, dragging their neighbors along. The two-stage approach combines SOM's dimensionality reduction with K-Means clustering on the learned neuron weights.
    </p>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #313131; font-weight: bold;">Algorithm:</h4>
    <ol style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
        <li style="margin-bottom: 5px;"><strong>Initialization:</strong> Randomly initialize neuron weight vectors and set neighborhood radius (σ) and learning rate (α)</li>
        <li style="margin-bottom: 5px;"><strong>BMU Selection:</strong> For each input pattern, find the Best Matching Unit (BMU) - the neuron with minimum Euclidean distance to the input</li>
        <li style="margin-bottom: 5px;"><strong>Weight Update:</strong> Update the BMU and its neighbors: w(new) = w(old) + α·h·[x - w(old)], where the neighborhood function h (largest at the BMU and decaying with grid distance) controls the update strength</li>
        <li style="margin-bottom: 5px;"><strong>Parameter Decay:</strong> Gradually reduce learning rate and neighborhood radius over iterations</li>
        <li style="margin-bottom: 5px;"><strong>Two-Stage Clustering:</strong> Apply K-Means to the trained SOM weight vectors, then map customers to clusters via their BMUs</li>
    </ol>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #313131; font-weight: bold;">Key Quality Metrics:</h4>
    <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
        <li style="margin-bottom: 5px;"><strong>Quantization Error (QE):</strong> Average distance between data points and their BMUs - measures data representation accuracy (lower is better)</li>
        <li style="margin-bottom: 5px;"><strong>Topographic Error (TE):</strong> Proportion of data points where 1st and 2nd BMUs are not adjacent - measures topology preservation (lower is better)</li>
    </ul>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #313131; font-weight: bold;">Workflow for This Analysis:</h4>
    <ol style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
        <li style="margin-bottom: 5px;"><strong>Parameter Grid Search:</strong> Test 18 combinations of grid sizes (10x10, 20x20, 40x40), learning rates (0.5, 0.75, 1.0), and sigma values (0.5, 1.0) to optimize QE/TE trade-off</li>
        <li style="margin-bottom: 5px;"><strong>SOM Visualization:</strong> Analyze trained SOM using Component Planes (feature distributions), U-Matrix (cluster boundaries), and Hit Map (customer density)</li>
        <li style="margin-bottom: 5px;"><strong>Two-Stage Clustering:</strong> Apply K-Means to 1,600 neuron weight vectors (40x40 grid with lr=0.5, σ=1.0), evaluate k=2-13 using Silhouette, Calinski-Harabasz, and Davies-Bouldin indices</li>
        <li style="margin-bottom: 5px;"><strong>Solution Comparison:</strong> Compare k=4 vs k=5 solutions by examining SOM grid visualizations, cluster sizes, and feature profiles</li>
        <li style="margin-bottom: 5px;"><strong>Final Model and Profiling:</strong> Select k=5 for better cluster separation (DBI 1.38) and more actionable marketing segments. Results highly consistent with standalone K-Means, validating behavioral patterns</li>
    </ol>
</div>
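
The BMU selection and weight update steps described above can be sketched in plain NumPy. This is an illustrative single-step implementation under simplifying assumptions: a rectangular grid (the notebook uses a hexagonal topology), a Gaussian neighborhood, and no parameter decay. The function name `som_training_step` is hypothetical and this is not MiniSom's internal code.

```python
import numpy as np

def som_training_step(weights, x, learning_rate, sigma):
    """Apply one online SOM update for a single input vector x.

    weights has shape (rows, cols, n_features).
    Returns the BMU grid coordinates and the updated weights.
    """
    rows, cols, _ = weights.shape

    # BMU selection: neuron with minimum Euclidean distance to x
    distances = np.linalg.norm(weights - x, axis=2)
    bmu = np.unravel_index(np.argmin(distances), (rows, cols))

    # Gaussian neighborhood on the grid, centred on the BMU (assumed rectangular grid)
    grid_r, grid_c = np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij")
    grid_dist_sq = (grid_r - bmu[0]) ** 2 + (grid_c - bmu[1]) ** 2
    h = np.exp(-grid_dist_sq / (2 * sigma ** 2))

    # Weight update: w(new) = w(old) + alpha * h * (x - w(old))
    updated = weights + learning_rate * h[:, :, None] * (x - weights)
    return bmu, updated

rng = np.random.default_rng(1)
weights = rng.normal(size=(5, 5, 4))  # toy 5x5 map with 4 features
x = rng.normal(size=4)                # one standardized input vector

bmu, new_weights = som_training_step(weights, x, learning_rate=0.5, sigma=1.0)
```

With `learning_rate=0.5` the BMU moves exactly halfway toward `x`, while neurons further away on the grid move by a smaller fraction. A full training loop would repeat this step over many samples while shrinking the learning rate and sigma.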

### **8.5.1 SOM Parameter Grid Search**

In [None]:
# Define parameter grid for SOM optimization
grid_sizes_behav = [10, 20, 40]
learning_rates_behav = [0.5, 0.75, 1.0]
sigma_values_behav = [0.5, 1.0]

# Prepare scaled data for SOM training
som_data_behav = df_behavioral_clean[behavioral_feats].values


'''# Store grid search results
grid_search_results_behav = []

# Perform grid search
for grid_size in grid_sizes_behav:
    for lr in learning_rates_behav:
        for sigma in sigma_values_behav:
            # Initialize SOM with current parameters
            som = MiniSom(
                x=grid_size,
                y=grid_size,
                input_len=len(behavioral_feats),
                sigma=sigma,
                learning_rate=lr,
                neighborhood_function='gaussian',
                topology='hexagonal',
                activation_distance='euclidean',
                random_seed=1
            )
            
            # Initialize weights randomly from data
            som.random_weights_init(som_data_behav)
            
            # Train SOM
            # Scale iterations with map size (500 per neuron): larger grids have more neurons that need sufficient updates to converge properly
            num_iterations = 500 * grid_size * grid_size
            som.train_batch(som_data_behav, num_iteration=num_iterations, verbose=False)
            
            # Calculate quality metrics
            qe = som.quantization_error(som_data_behav)
            te = som.topographic_error(som_data_behav)
            
            # Store results
            grid_search_results_behav.append({
                'grid_size': f'{grid_size}x{grid_size}',
                'learning_rate': lr,
                'sigma': sigma,
                'units': grid_size * grid_size,
                'quantization_error': qe,
                'topographic_error': te
            })

# Convert results to DataFrame
grid_search_df_behav = pd.DataFrame(grid_search_results_behav)
grid_search_df_behav = grid_search_df_behav.sort_values(['quantization_error', 'topographic_error']).reset_index(drop=True)'''

In [None]:
'''grid_search_df_behav.to_csv('data/output_data/som_grid_search_results_behav.csv', index=False)'''

In [None]:
grid_search_df_behav = pd.read_csv('data/output_data/som_grid_search_results_behav.csv')

In [None]:
# Visualize grid search results
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Quantization Error by Grid Size
for lr in learning_rates_behav:
    for sigma in sigma_values_behav:
        subset = grid_search_df_behav[(grid_search_df_behav['learning_rate'] == lr) & (grid_search_df_behav['sigma'] == sigma)]
        axes[0].plot(subset['units'], subset['quantization_error'], 
                    marker='o', alpha=0.6, label=f'LR={lr}, σ={sigma}')

axes[0].set_xlabel('Number of SOM Units', fontsize=11, fontweight='bold')
axes[0].set_ylabel('Quantization Error (QE)', fontsize=11, fontweight='bold')
axes[0].set_title('SOM Grid Search: Quantization Error', fontsize=12, fontweight='bold')
axes[0].grid(False)

# Plot 2: Topographic Error by Grid Size
for lr in learning_rates_behav:
    for sigma in sigma_values_behav:
        subset = grid_search_df_behav[(grid_search_df_behav['learning_rate'] == lr) & (grid_search_df_behav['sigma'] == sigma)]
        axes[1].plot(subset['units'], subset['topographic_error'], 
                    marker='o', alpha=0.6, label=f'LR={lr}, σ={sigma}')

axes[1].set_xlabel('Number of SOM Units', fontsize=11, fontweight='bold')
axes[1].set_ylabel('Topographic Error (TE)', fontsize=11, fontweight='bold')
axes[1].set_title('SOM Grid Search: Topographic Error', fontsize=12, fontweight='bold')
axes[1].grid(False)
axes[1].legend(bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=8)

plt.tight_layout()
plt.show()

# Display best parameter combinations
grid_search_df_behav

In [None]:
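# Select the configuration that balances low QE with low TE: 40x40 grid, lr=0.5, σ=1.0
best_params_behav = grid_search_df_behav[
    (grid_search_df_behav['grid_size'] == '40x40')
    & (grid_search_df_behav['learning_rate'] == 0.5)
    & (grid_search_df_behav['sigma'] == 1.0)
].iloc[0]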

# Extract parameters for later use
selected_grid_size_behav = int(best_params_behav['grid_size'].split('x')[0])
selected_lr_behav = best_params_behav['learning_rate']
selected_sigma_behav = best_params_behav['sigma']

# Display best parameters
best_params_display_behav = pd.DataFrame({
    'Parameter': ['Grid Size', 'Learning Rate', 'Sigma', 'Total Units', 'Quantization Error', 'Topographic Error'],
    'Value': [
        best_params_behav['grid_size'],
        best_params_behav['learning_rate'],
        best_params_behav['sigma'],
        best_params_behav['units'],
        f"{best_params_behav['quantization_error']:.4f}",
        f"{best_params_behav['topographic_error']:.4f}"
    ]
})

best_params_display_behav

<div style="background-color: #e8f3e5ff; padding: 15px; margin-right: 30px; border-left: 5px solid; border-image: linear-gradient(to bottom, #00411E, #00622D, #00823C, #45AF28, #82BA72) 1; border-radius: 5px; max-width: 95%;">
  <h3 style="margin-top: 0; margin-bottom: 10px; color: #00411E; font-weight: bold;">SOM Parameter Grid Search Summary</h3>
  <p style="margin: 10px 0; margin-right: 40px; color: #000;">
    Performed systematic grid search over 18 parameter combinations to identify optimal SOM configuration for behavioral clustering. Evaluated 3 grid sizes (10x10, 20x20, 40x40), 3 learning rates (0.5, 0.75, 1.0), and 2 sigma values (0.5, 1.0), with training iterations scaled in proportion to the number of neurons (500 x grid_size²) to support convergence.
  </p>
  <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Goal:</h4>
  <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
    <li style="margin-right: 20px;">Find optimal SOM parameters that balance low Quantization Error (accurate data representation) with low Topographic Error (preserved neighborhood topology) for subsequent two-stage clustering</li>
  </ul>
  <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Methodology:</h4>
  <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
    <li style="margin-right: 20px;"><strong>Quality metrics:</strong> Quantization Error (QE) measures average distance between data points and their Best Matching Units (lower = better data representation). Topographic Error (TE) measures proportion of data points where first and second BMUs are not adjacent (lower = better topology preservation)</li>
    <li style="margin-right: 20px;"><strong>Iteration scaling:</strong> Training iterations set to 500 x grid_size² (e.g., 50,000 for 10x10, 200,000 for 20x20, 800,000 for 40x40) following the standard heuristic of 500 iterations per neuron to ensure proper convergence across all grid sizes</li>
    <li style="margin-right: 20px;"><strong>Parameter selection rationale:</strong> For two-stage clustering (SOM + K-Means), moderate grid sizes are preferred to ensure meaningful data compression. While larger grids minimize QE, they reduce the SOM's ability to aggregate similar customers into prototype neurons, diminishing the benefit of the two-stage approach</li>
  </ul>
  <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Findings:</h4>
  <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
    <li style="margin-right: 20px;"><strong>Grid size dominates QE:</strong> Larger grids yield substantially lower QE. 10x10 produced QE 0.68-0.71, 20x20 achieved 0.44-0.47, and 40x40 reached 0.25-0.26.</li>
    <li style="margin-right: 20px;"><strong>Sigma controls QE-TE trade-off:</strong> Small sigma (σ=0.5) severely damages topology preservation (TE 0.76-0.97), while σ=1.0 maintains acceptable TE (0.39-0.77) with marginal QE increase.</li> 
    <li style="margin-right: 20px;"><strong>Learning rate has moderate impact:</strong> lr=0.5 produces slightly better TE than lr=1.0 at σ=1.0, while QE remains similar across learning rates.</li>
    <li style="margin-right: 20px;"><strong>Grid size impact on two-stage clustering:</strong> Tested each grid size, plus an additional 60x60 grid, with subsequent K-Means clustering. Smaller grids (10x10, 20x20) produced worse Silhouette and Calinski-Harabasz scores in the final clustering, while the larger 60x60 grid converged toward standalone K-Means results. The 40x40 grid achieves the best clustering metrics while still providing meaningful SOM-based data compression.</li>
    <li style="margin-right: 20px;"><strong>Best configuration:</strong> 40x40 grid with lr=0.5 and σ=1.0 achieved the lowest TE of the search (0.389) with a QE of 0.258, within the 0.25-0.26 range of the 40x40 grids.</li>
    <li style="margin-right: 20px;"><strong>Two-stage clustering:</strong> 40x40 with 1,600 neurons provides meaningful data compression (approximately 8 customers per neuron). Note that results are similar to standalone K-Means, indicating the behavioral feature space is well-suited for centroid-based clustering.</li>
    <li style="margin-right: 20px;"><strong>Decision:</strong> Use 40x40 with lr=0.5 and σ=1.0 for SOM visualization and two-stage clustering.</li>
  </ul>
</div>
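
To make the two quality metrics concrete, the sketch below computes QE and TE from scratch on a toy map. It is a simplified illustration, not the notebook's method: it assumes a rectangular grid where two units count as adjacent when they differ by at most one step in each grid direction, whereas the notebook relies on MiniSom's `quantization_error` and `topographic_error` with a hexagonal topology, so values may differ. The function name `som_quality` is hypothetical.

```python
import numpy as np

def som_quality(weights, data):
    """Return (quantization_error, topographic_error) for a rectangular grid."""
    rows, cols, n_features = weights.shape
    flat = weights.reshape(-1, n_features)

    # Distance from every sample to every neuron: shape (n_samples, n_neurons)
    dists = np.linalg.norm(data[:, None, :] - flat[None, :, :], axis=2)

    # Indices of the best and second-best matching units per sample
    order = np.argsort(dists, axis=1)
    bmu1, bmu2 = order[:, 0], order[:, 1]

    # QE: mean distance between each sample and its BMU
    qe = dists[np.arange(len(data)), bmu1].mean()

    # TE: share of samples whose two best units are not grid neighbours
    r1, c1 = np.unravel_index(bmu1, (rows, cols))
    r2, c2 = np.unravel_index(bmu2, (rows, cols))
    not_adjacent = np.maximum(np.abs(r1 - r2), np.abs(c1 - c2)) > 1
    te = not_adjacent.mean()
    return qe, te

# Toy 1x3 map with one feature: neuron weights 0, 10 and 1 along the row
toy_weights = np.array([[[0.0], [10.0], [1.0]]])
toy_data = np.array([[0.2], [9.0]])

qe, te = som_quality(toy_weights, toy_data)
print(qe, te)
```

In the toy map, the sample 0.2 has its two best units at opposite ends of the row, so it counts as a topographic error, while the sample 9.0 does not. This gives TE = 0.5 and QE = (0.2 + 1.0) / 2 = 0.6.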

### **8.5.2 SOM Visualizations**

In [None]:
# Train SOM with selected best parameters
som_best_behav = MiniSom(
    x=selected_grid_size_behav,
    y=selected_grid_size_behav,
    input_len=len(behavioral_feats),
    sigma=selected_sigma_behav,
    learning_rate=selected_lr_behav,
    neighborhood_function='gaussian',
    topology='hexagonal',
    activation_distance='euclidean',
    random_seed=1
)

# Initialize and train
som_best_behav.random_weights_init(som_data_behav)
som_best_behav.train_batch(som_data_behav, num_iteration=500 * selected_grid_size_behav * selected_grid_size_behav, verbose=False)

print(f"SOM trained with parameters: Grid={selected_grid_size_behav}x{selected_grid_size_behav}, LR={selected_lr_behav}, σ={selected_sigma_behav}")

In [None]:
# Component Planes
n_cols = 3
n_rows = int(np.ceil(len(behavioral_feats) / n_cols))

fig, axes = plt.subplots(n_rows, n_cols, figsize=(15, n_rows * 4.5))
axes = axes.flatten()

for idx, feature in enumerate(behavioral_feats):
    weights = som_best_behav.get_weights()[:, :, idx]
    visualize_som_grid(som_best_behav, weights, feature, ax=axes[idx])

for idx in range(len(behavioral_feats), len(axes)):
    axes[idx].axis('off')

plt.suptitle('Component Planes: Feature Distribution across SOM Grid', fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

In [None]:
# U-Matrix - Unified Distance Matrix
visualize_som_grid(som_best_behav, som_best_behav.distance_map(), 'U-Matrix: Unified Distance Matrix')

In [None]:
# Hit Map - Customer distribution across SOM grid
hitsmatrix_behav = som_best_behav.activation_response(som_data_behav)
visualize_som_grid(som_best_behav, hitsmatrix_behav, 'Hit Map: Customer Distribution')

display(pd.DataFrame({
    'Metric': ['Total Customers', 'Avg per Unit', 'Max in Single Unit', 'Min in Single Unit'],
    'Value': [f"{hitsmatrix_behav.sum():.0f}", f"{hitsmatrix_behav.mean():.2f}", f"{hitsmatrix_behav.max():.0f}", f"{hitsmatrix_behav.min():.0f}"]
}))

# Count units with 0 hits
n_zero_hit_units_behav = np.sum(hitsmatrix_behav == 0)
print("Number of units with 0 hits:", n_zero_hit_units_behav)

<div style="background-color: #e8f3e5ff; padding: 15px; margin-right: 30px; border-left: 5px solid; border-image: linear-gradient(to bottom, #00411E, #00622D, #00823C, #45AF28, #82BA72) 1; border-radius: 5px; max-width: 95%;">
  <h3 style="margin-top: 0; margin-bottom: 10px; color: #00411E; font-weight: bold;">SOM Visualization Summary</h3>

  <p style="margin: 10px 0; margin-right: 40px; color: #000;">
    Visualized the trained 40x40 SOM using three complementary outputs: Component Planes (feature distributions), U-Matrix (cluster boundaries), and Hit Map (customer density). Each visualization serves a distinct purpose in understanding the SOM's learned structure.
  </p>

  <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Goal:</h4>
  <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
    <li style="margin-right: 20px;">Understand how behavioral features are distributed across the SOM grid, identify natural cluster boundaries, and assess whether customer mappings are evenly distributed or concentrated in specific regions</li>
  </ul>

  <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Methodology:</h4>
  <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
    <li style="margin-right: 20px;"><strong>Component Planes:</strong> Display each feature's weight distribution across the SOM grid. <span style="color:#b2182b;">Red</span> shades indicate higher Z-scores, <span style="color:#2166ac;">blue</span> shades indicate lower Z-scores</li>
    <li style="margin-right: 20px;"><strong>U-Matrix:</strong> Shows average distance between each SOM unit and its neighbors. <span style="color:#b2182b;">Red</span> regions indicate high distances (cluster boundaries), <span style="color:#2166ac;">blue</span> regions indicate homogeneous areas (cluster centers)</li>
    <li style="margin-right: 20px;"><strong>Hit Map:</strong> Displays how many customers are mapped to each SOM unit. <span style="color:#b2182b;">Red</span> units have more customers, <span style="color:#2166ac;">blue</span> units have fewer</li>
  </ul>

  <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Findings:</h4>
  <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
    <li style="margin-right: 20px;"><strong>Distance variability shows left-side concentration:</strong> <span style="color:#b2182b;">Red</span> regions (high variability, Z=+2 to +3) concentrate in both the upper-left and lower-left corners, while <span style="color:#2166ac;">blue</span> regions (low variability) form a diagonal band from lower-right toward the center. No simple gradient pattern</li>
    <li style="margin-right: 20px;"><strong>Companion flight ratio shows central clustering:</strong> <span style="color:#b2182b;">Red</span> cluster (social travelers, Z=+2) distinctly concentrated in the center of the grid with additional spots at the bottom edge, while <span style="color:#2166ac;">blue</span> regions (solo travelers, Z=-2) are located on the left side and upper areas. <span style="color:#fddbc7;">Yellow</span> transition zones between red and blue</li>
    <li style="margin-right: 20px;"><strong>Flight regularity shows scattered hotspots:</strong> Four concentrated <span style="color:#b2182b;">red</span> hotspots (irregular flights, Z=+2 to +3) distributed across the grid, with <span style="color:#2166ac;">blue</span> regions (regular patterns) appearing sparsely between the red and yellow sections. <span style="color:#fddbc7;">Yellow</span> (moderate) dominates most of the grid</li>
    <li style="margin-right: 20px;"><strong>Redemption frequency shows sparse high-value islands:</strong> One larger <span style="color:#fddbc7;">yellow</span>/<span style="color:#b2182b;">red</span> cluster (high redemption, Z=+1 to +3) in the center plus three smaller islands, while <span style="color:#2166ac;">blue</span> (passive redeemers, Z=-1) dominates the majority of the grid including upper-left, right side, and bottom areas. This confirms most customers are passive redeemers with engaged users forming distinct minority clusters</li>
    <li style="margin-right: 20px;"><strong>U-Matrix reveals diffuse cluster boundaries:</strong> Predominantly <span style="color:#92c5de;">light blue</span> regions (low inter-neuron distances, 0.2-0.5) with scattered <span style="color:#f4a582;">orange</span>/<span style="color:#b2182b;">red</span> spots (higher distances, 0.7-1.0) distributed throughout, suggesting gradual transitions between behavioral segments rather than sharp cluster boundaries</li>
    <li style="margin-right: 20px;"><strong>Hit Map shows relatively uniform coverage:</strong> With 12,885 customers mapped (avg 8.05 per unit), the distribution ranges from 0 to 20 customers per unit. Only 20 units (1.25%) remain empty, indicating excellent grid utilization.</li>
  </ul>
</div>
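
The U-Matrix idea (average distance between each unit and its neighbors) can be illustrated on a toy map. The sketch assumes a rectangular grid with up/down/left/right neighbors and no rescaling; MiniSom's `distance_map` used above may differ in neighbor definition and scaling. The function name `u_matrix` is hypothetical.

```python
import numpy as np

def u_matrix(weights):
    """Mean Euclidean distance from each unit to its up/down/left/right neighbours."""
    rows, cols, _ = weights.shape
    umat = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            neighbour_dists = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    neighbour_dists.append(
                        np.linalg.norm(weights[r, c] - weights[nr, nc])
                    )
            umat[r, c] = np.mean(neighbour_dists)
    return umat

# Toy 4x4 map with two homogeneous halves: left columns at 0, right columns at 3
toy_weights = np.zeros((4, 4, 2))
toy_weights[:, 2:, :] = 3.0

toy_umat = u_matrix(toy_weights)
print(toy_umat.round(2))
```

Only the two columns on either side of the split receive non-zero values, which is the sharp-boundary pattern the behavioral U-Matrix does not show. Its diffuse, scattered high-distance spots point to gradual transitions instead.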

### **8.5.3 Emergent SOM Training**

In [None]:
# Emergent SOM for Two-Stage Clustering
# Using selected grid - optimal balance between granularity and neuron coverage

som_emergent_behav = som_best_behav  # Reuse the SOM trained in 8.5.2 with the selected grid search parameters
emergent_grid_size_behav = selected_grid_size_behav

qe_emergent_behav = som_emergent_behav.quantization_error(som_data_behav)
te_emergent_behav = som_emergent_behav.topographic_error(som_data_behav)

pd.DataFrame({
    'Metric': ['Grid Size', 'Total Units', 'Avg Customers/Unit', 'Quantization Error', 'Topographic Error'],
    'Value': [f'{emergent_grid_size_behav}x{emergent_grid_size_behav}', f'{emergent_grid_size_behav**2:,}', f'{len(som_data_behav)/emergent_grid_size_behav**2:.2f}', f'{qe_emergent_behav:.4f}', f'{te_emergent_behav:.4f}']
})

<div style="background-color: #e8f3e5ff; padding: 15px; margin-right: 30px; border-left: 5px solid; border-image: linear-gradient(to bottom, #00411E, #00622D, #00823C, #45AF28, #82BA72) 1; border-radius: 5px; max-width: 95%;">
  <h3 style="margin-top: 0; margin-bottom: 10px; color: #00411E; font-weight: bold;">SOM for Two-Stage Clustering</h3>

  <p style="margin: 10px 0; margin-right: 40px; color: #000;">
    Using the 40x40 SOM (1,600 units) trained with optimal Grid Search parameters (lr=0.5, σ=1.0) as foundation for two-stage clustering. K-Means will cluster the neuron weight vectors to identify final customer segments.
  </p>

  <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Why 40x40 Grid for Two-Stage Clustering:</h4>
  <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
    <li style="margin-right: 20px;"><strong>Meaningful data compression:</strong> With 12,885 customers across 1,600 neurons, each neuron represents 8.05 customers on average. This compression ratio ensures the SOM provides genuine noise reduction and prototype learning, differentiating two-stage clustering from direct K-Means on raw data</li>
    <li style="margin-right: 20px;"><strong>Robust neuron coverage:</strong> The moderate density of 8 customers per neuron ensures weight vectors are stable representations of local customer profiles, not dominated by individual outliers</li>
    <li style="margin-right: 20px;"><strong>Balanced granularity:</strong> 1,600 neurons provide sufficient resolution for K-Means to identify natural cluster boundaries while avoiding the over-granularity problem where very large grids would produce results nearly identical to direct clustering</li>
  </ul>

  <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Quality Metrics:</h4>
  <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
    <li style="margin-right: 20px;"><strong>Quantization Error: 0.258</strong> - low error indicating neurons accurately represent the underlying data distribution while providing meaningful compression</li>
    <li style="margin-right: 20px;"><strong>Topographic Error: 0.389</strong> - 38.9% of data points have non-adjacent 1st and 2nd BMUs. Higher than demographic SOM due to the 4-dimensional behavioral space, but still acceptable for interpretable clustering results</li>
  </ul>
</div>
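
The second stage assigns each customer the cluster label of its BMU. The sketch below shows that mapping in vectorized NumPy on a toy map. It assumes a Euclidean BMU search and row-major flattening of the weight grid; the function `map_samples_to_clusters` and all `toy_*` names are hypothetical, and in practice the neuron labels would come from K-Means on the flattened weights.

```python
import numpy as np

def map_samples_to_clusters(weights, data, neuron_labels):
    """Assign each sample the cluster label of its best matching unit."""
    n_features = weights.shape[2]
    # Row-major flattening: neuron (i, j) gets flat index i * n_cols + j
    flat_weights = weights.reshape(-1, n_features)

    # Squared Euclidean distances via |x|^2 - 2 x.w + |w|^2 (argmin is unchanged)
    sq_dists = (
        (data ** 2).sum(axis=1)[:, None]
        - 2.0 * data @ flat_weights.T
        + (flat_weights ** 2).sum(axis=1)[None, :]
    )
    bmu_flat_idx = sq_dists.argmin(axis=1)
    return neuron_labels[bmu_flat_idx]

# Toy 2x2 map with 2 features; one cluster label per flattened neuron
toy_weights = np.array([[[0.0, 0.0], [0.0, 1.0]],
                        [[5.0, 5.0], [5.0, 6.0]]])
toy_neuron_labels = np.array([0, 0, 1, 1])
toy_customers = np.array([[0.1, 0.2], [4.8, 5.9], [0.2, 0.9]])

toy_customer_labels = map_samples_to_clusters(
    toy_weights, toy_customers, toy_neuron_labels
)
print(toy_customer_labels)  # prints [0 1 0]
```

Because customers inherit labels from prototype neurons rather than being clustered directly, the result depends on how well the neurons summarize the data, which is why the compression ratio and QE matter for this step.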

### **8.5.4 Defining the Number of Clusters**

In [None]:
# Flatten SOM weights for K-Means clustering
som_weights_flat_behav = som_emergent_behav.get_weights().reshape(-1, len(behavioral_feats))

pd.DataFrame({
    'Description': ['SOM weights shape', 'Total neurons', 'Features per neuron'],
    'Value': [str(som_weights_flat_behav.shape), emergent_grid_size_behav**2, len(behavioral_feats)]
})

In [None]:
# Evaluate K-Means on SOM weights across range of k values
som_km_k_range_behav = range(2, 14)
som_km_metrics_behav = {
    'k': [],
    'Inertia': [],
    'Silhouette': [],
    'Calinski-Harabasz': [],
    'Davies-Bouldin': []
}

# Store fitted neuron labels for later use
som_km_fitted_neuron_labels_behav = {}

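# Helper function to map neuron labels to customers via BMU
def get_customer_labels_behav(som, data, neuron_labels, grid_size):
    """Return one cluster label per sample, taken from the sample's BMU."""
    customer_labels = []
    for sample in data:
        bmu = som.winner(sample)  # (row, col) of the best matching unit
        # Row-major flat index, matching get_weights().reshape(-1, n_features)
        neuron_idx = bmu[0] * grid_size + bmu[1]
        customer_labels.append(neuron_labels[neuron_idx])
    return np.array(customer_labels)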

for k in som_km_k_range_behav:
    kmeans = KMeans(n_clusters=k, init='k-means++', n_init=15, random_state=1, max_iter=300)
    km_neuron_labels = kmeans.fit_predict(som_weights_flat_behav)
    
    # Store neuron labels
    som_km_fitted_neuron_labels_behav[k] = km_neuron_labels
    
    # Map neuron labels to customers via BMU
    customer_labels = get_customer_labels_behav(som_emergent_behav, som_data_behav, km_neuron_labels, emergent_grid_size_behav)
    
    # Metrics on customer data
    metrics = evaluate_clustering_metrics(som_data_behav, customer_labels)
    
    som_km_metrics_behav['k'].append(k)
    som_km_metrics_behav['Inertia'].append(kmeans.inertia_)
    som_km_metrics_behav['Silhouette'].append(metrics['Silhouette Score'])
    som_km_metrics_behav['Calinski-Harabasz'].append(metrics['Calinski-Harabasz Index'])
    som_km_metrics_behav['Davies-Bouldin'].append(metrics['Davies-Bouldin Index'])

som_km_metrics_df_behav = pd.DataFrame(som_km_metrics_behav)

In [None]:
# Elbow Method
plot_elbow_method(
    som_km_k_range_behav,
    som_km_metrics_df_behav['Inertia'].tolist(),
    CUSTOM_HEX,
    title='SOM + K-Means: Elbow Method for Optimal k'
)

In [None]:
# Clustering metrics comparison
plot_clustering_metrics(
    som_km_metrics_df_behav,
    som_km_k_range_behav,
    CUSTOM_HEX,
    title='SOM + K-Means: Clustering Metrics Evaluation'
)

som_km_metrics_df_behav

<div style="background-color: #e8f3e5ff; padding: 15px; margin-right: 30px; border-left: 5px solid; border-image: linear-gradient(to bottom, #00411E, #00622D, #00823C, #45AF28, #82BA72) 1; border-radius: 5px; max-width: 95%;">
    <h3 style="margin-top: 0; margin-bottom: 10px; color: #00411E; font-weight: bold;">Optimal k Selection Summary (SOM + K-Means)</h3>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
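        Fitted K-Means for k=2 to k=13 on the 1,600 SOM neuron weight vectors and mapped customers to clusters via their BMUs. Inertia is measured on the neuron weights, while Silhouette, Calinski-Harabasz, and Davies-Bouldin are computed on the customer data to identify the optimal number of clusters for two-stage clustering.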
    </p>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Goal:</h4>
    <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
        <li style="margin-right: 20px;">Determine k that balances cluster quality metrics with business interpretability for actionable customer segmentation</li>
    </ul>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Methodology:</h4>
    <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
        <li style="margin-right: 20px;"><strong>Elbow Method (Inertia):</strong> Total within-cluster sum of squared distances on SOM weights - look for "elbow" where adding clusters yields diminishing returns</li>
        <li style="margin-right: 20px;"><strong>Silhouette Score:</strong> Measures cluster cohesion and separation (range -1 to 1, higher is better)</li>
        <li style="margin-right: 20px;"><strong>Calinski-Harabasz Index:</strong> Ratio of between-cluster to within-cluster variance (higher is better)</li>
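        <li style="margin-right: 20px;"><strong>Davies-Bouldin Index:</strong> Average similarity between each cluster and its most similar cluster, where similarity is the ratio of within-cluster scatter to between-centroid distance (lower is better)</li>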
    </ul>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Findings:</h4>
    <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
        <li style="margin-right: 20px;"><strong>k=4 - Elbow Point:</strong> Clear elbow in the inertia curve where additional clusters yield diminishing returns. Good Silhouette (0.180), solid CH (2,900), and improved DBI (1.47) compared to lower k values</li>
        <li style="margin-right: 20px;"><strong>k=5 - Best Cluster Separation:</strong> Achieves best Davies-Bouldin Index (1.38) among k=2-6, indicating optimal cluster separation. Comparable Silhouette (0.178) to k=4 with slightly lower CH (2,791)</li>
        <li style="margin-right: 20px;"><strong>Trade-off Analysis:</strong> k=2-3 have higher Silhouette and CH but provide insufficient granularity. k=6+ shows declining Silhouette (drops to 0.163) without meaningful DBI improvement. k=4 and k=5 represent the optimal balance between separation quality and interpretability</li>
        <li style="margin-right: 20px;"><strong>Decision for next step:</strong> Compare k=4 vs k=5 to determine the final clustering configuration based on cluster profiles and business interpretability</li>
    </ul>
</div>
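
Because the Davies-Bouldin Index drives the k=4 vs k=5 discussion, a from-scratch sketch may help clarify what it measures. The code follows the common definition (mean distance to the own centroid as scatter, Euclidean centroid distances). It is an illustration only: the notebook's `evaluate_clustering_metrics` helper is defined elsewhere and its internals are not shown here, and the function name `davies_bouldin` and the toy arrays are hypothetical.

```python
import numpy as np

def davies_bouldin(data, labels):
    """Davies-Bouldin index: mean over clusters of the worst-case similarity ratio."""
    cluster_ids = np.unique(labels)
    centroids = np.array([data[labels == c].mean(axis=0) for c in cluster_ids])

    # Scatter: mean distance of cluster members to their own centroid
    scatter = np.array([
        np.linalg.norm(data[labels == c] - centroids[i], axis=1).mean()
        for i, c in enumerate(cluster_ids)
    ])

    k = len(cluster_ids)
    worst_ratios = []
    for i in range(k):
        # Similarity of cluster i to every other cluster; keep the most similar one
        ratios = [
            (scatter[i] + scatter[j]) / np.linalg.norm(centroids[i] - centroids[j])
            for j in range(k) if j != i
        ]
        worst_ratios.append(max(ratios))
    return float(np.mean(worst_ratios))

toy_labels = np.array([0, 0, 1, 1])
loose = np.array([[0.0], [2.0], [10.0], [12.0]])   # scatter 1.0, centroids 10 apart
tight = np.array([[0.5], [1.5], [10.5], [11.5]])   # scatter 0.5, same centroids

dbi_loose = davies_bouldin(loose, toy_labels)
dbi_tight = davies_bouldin(tight, toy_labels)
print(dbi_loose, dbi_tight)
```

The tighter clusters halve the index (0.2 vs 0.1) even though the centroids stay put, which shows why a lower value is read as better separation relative to within-cluster spread.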

### **8.5.5 Comparison of k Solutions**

In [None]:
# Compare two candidate solutions based on clustering metrics analysis
# k=4
# k=5

som_km_k_candidate_1_behav = 4
som_km_k_candidate_2_behav = 5

# Use pre-fitted neuron labels from 8.5.4
som_km_labels_k1_behav = som_km_fitted_neuron_labels_behav[som_km_k_candidate_1_behav]
som_km_labels_k2_behav = som_km_fitted_neuron_labels_behav[som_km_k_candidate_2_behav]

# Map to customers 
customer_labels_k1_behav = get_customer_labels_behav(som_emergent_behav, som_data_behav, som_km_labels_k1_behav, emergent_grid_size_behav)
customer_labels_k2_behav = get_customer_labels_behav(som_emergent_behav, som_data_behav, som_km_labels_k2_behav, emergent_grid_size_behav)

# Create temporary DataFrames with cluster labels
df_temp_k1_behav = pd.DataFrame(som_data_behav, columns=behavioral_feats)
df_temp_k1_behav['Cluster'] = customer_labels_k1_behav

df_temp_k2_behav = pd.DataFrame(som_data_behav, columns=behavioral_feats)
df_temp_k2_behav['Cluster'] = customer_labels_k2_behav

# Calculate cluster profiles (mean values per cluster)
cluster_profiles_k1_behav = df_temp_k1_behav.groupby('Cluster').mean()
cluster_profiles_k2_behav = df_temp_k2_behav.groupby('Cluster').mean()

# Display both profiles for comparison
print(f"Cluster Profiles for k={som_km_k_candidate_1_behav}:")
display(cluster_profiles_k1_behav.round(3))

print(f"\nCluster Profiles for k={som_km_k_candidate_2_behav}:")
display(cluster_profiles_k2_behav.round(3))

# Visualize cluster size comparison
plot_cluster_size_comparison(
    labels_dict={som_km_k_candidate_1_behav: customer_labels_k1_behav, som_km_k_candidate_2_behav: customer_labels_k2_behav},
    palette=CUSTOM_HEX,
    title='SOM + K-Means: Cluster Size Comparison'
)

In [None]:
# Visualize k=candidate_1
cluster_grid_k1_behav = som_km_labels_k1_behav.reshape((emergent_grid_size_behav, emergent_grid_size_behav))
visualize_som_grid(som_emergent_behav, cluster_grid_k1_behav.astype(float), f'SOM + K-Means Clustering (k={som_km_k_candidate_1_behav})')

In [None]:
# Visualize k=candidate_2
cluster_grid_k2_behav = som_km_labels_k2_behav.reshape((emergent_grid_size_behav, emergent_grid_size_behav))
visualize_som_grid(som_emergent_behav, cluster_grid_k2_behav.astype(float), f'SOM + K-Means Clustering (k={som_km_k_candidate_2_behav})')

<div style="background-color: #e8f3e5ff; padding: 15px; margin-right: 30px; border-left: 5px solid; border-image: linear-gradient(to bottom, #00411E, #00622D, #00823C, #45AF28, #82BA72) 1; border-radius: 5px; max-width: 95%;">
    <h3 style="margin-top: 0; margin-bottom: 10px; color: #00411E; font-weight: bold;">Comparison of Clustering Solutions Summary</h3>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        <strong>Decision: k=5 selected</strong>
    </p>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        k=5 achieves better cluster separation (DBI 1.38 vs 1.47) while providing more actionable marketing segments. On the SOM grid, k=5 produces clearer spatial regions; in particular, the yellow high-redemption cluster forms a distinct contiguous area in the center-right and is more sharply isolated than in k=4 (redemption Z=+1.39 vs Z=+1.15). k=5 also separates two strategically different customer types, Explorers (variable destinations) and Sporadic Flyers (irregular schedule), which require different marketing approaches. Balanced cluster sizes (17.1% to 23.4%) ensure all segments are substantial enough for targeted campaigns.
    </p>
</div>
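
The two-stage mapping used in this section (K-Means labels per neuron, then one label per customer) can be sketched as follows. This is an illustration only: `map_customers_to_clusters` is a hypothetical helper, and it assumes that the notebook's `get_customer_labels_behav` (defined earlier, not shown here) assigns each customer the label of its best-matching unit and that neuron labels are flattened in row-major order.

```python
import numpy as np


def map_customers_to_clusters(weights, X, neuron_labels):
    """Assign each sample the cluster label of its best-matching unit (BMU).

    weights:        (rows, cols, n_features) SOM codebook
    X:              (n_samples, n_features) data, scaled like the codebook
    neuron_labels:  (rows * cols,) cluster label per neuron, row-major order
    """
    rows, cols, n_features = weights.shape
    codebook = weights.reshape(rows * cols, n_features)  # row-major flatten
    # Squared Euclidean distance from every sample to every neuron
    d2 = ((X[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    bmu_flat = d2.argmin(axis=1)  # index of the closest neuron per sample
    return np.asarray(neuron_labels)[bmu_flat]


# Toy 2x2 codebook: neurons 0-1 belong to cluster 0, neurons 2-3 to cluster 1
toy_weights = np.array([[[0.0, 0.0], [0.0, 1.0]],
                        [[5.0, 5.0], [5.0, 6.0]]])
toy_neuron_labels = np.array([0, 0, 1, 1])
toy_X = np.array([[0.1, 0.2], [4.9, 5.8], [0.0, 0.9]])

toy_customer_labels = map_customers_to_clusters(toy_weights, toy_X, toy_neuron_labels)
print(toy_customer_labels)  # [0 1 0]
```

Because customers inherit labels from neurons, the cluster sizes reported for k=4 and k=5 depend on how many customers each neuron attracts, not on the number of neurons per cluster.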

### **8.5.6 Final SOM Model**

In [None]:
# Select k=5 and use pre-fitted labels
selected_k_som_behav = som_km_k_candidate_2_behav
df_behavioral_clean['Cluster_SOM_KMeans'] = customer_labels_k2_behav

# Calculate final metrics
som_labels_final_behav = df_behavioral_clean['Cluster_SOM_KMeans'].values
som_final_metrics_behav = evaluate_clustering_metrics(df_behavioral_clean[behavioral_feats], som_labels_final_behav)

# Store for final comparison (Section 9)
behavioral_clustering_results['SOM + K-Means'] = {
    'k': selected_k_som_behav,
    'Silhouette': som_final_metrics_behav['Silhouette Score'],
    'Calinski-Harabasz': som_final_metrics_behav['Calinski-Harabasz Index'],
    'Davies-Bouldin': som_final_metrics_behav['Davies-Bouldin Index'],
    'R2': get_rsq(df_behavioral_clean[behavioral_feats + ['Cluster_SOM_KMeans']], behavioral_feats, 'Cluster_SOM_KMeans'),
    'labels': som_labels_final_behav
}

### **8.5.7 SOM Cluster Profiling**

In [None]:
# 1. Profile Heatmap - Z-scores of behavioral features per cluster (SOM)
# Reuse the k=5 cluster profiles computed in 8.5.5
cluster_profiles_som_behav = cluster_profiles_k2_behav
som_population_mean_behav = df_behavioral_clean[behavioral_feats].mean()

plot_cluster_profiles_heatmap(
    cluster_profiles_som_behav,
    som_population_mean_behav,
    GROUP80_palette_continuous,
    title='SOM + K-Means Clustering: Behavioral Profiles (k=5)\nStandardized Z-Scores per Cluster'
)

In [None]:
# 2. Cluster Size Distribution
plot_cluster_sizes(
    df_behavioral_clean['Cluster_SOM_KMeans'].values,
    selected_k_som_behav,
    CUSTOM_HEX,
    title='SOM + K-Means Clustering - Final Cluster Sizes'
)

In [None]:
# 3. Feature Importance Analysis - Variance across clusters
som_feature_variance_behav = cluster_profiles_som_behav.var(axis=0).sort_values(ascending=False)

plot_feature_importance(
    som_feature_variance_behav,
    CUSTOM_HEX,
    title='SOM + K-Means Clustering: Feature Importance Analysis\nWhich Behaviors Differentiate Clusters?'
)

# Display feature importance as DataFrame
som_feature_importance_df_behav = pd.DataFrame({
    'Feature': som_feature_variance_behav.index,
    'Variance': som_feature_variance_behav.values
}).reset_index(drop=True)

display(som_feature_importance_df_behav)

<div style="background-color: #e8f3e5ff; padding: 15px; margin-right: 30px; border-left: 5px solid; border-image: linear-gradient(to bottom, #00411E, #00622D, #00823C, #45AF28, #82BA72) 1; border-radius: 5px; max-width: 95%;">
    <h3 style="margin-top: 0; margin-bottom: 10px; color: #00411E; font-weight: bold;">SOM + K-Means Clustering Profiling Summary (k=5)</h3>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        Analyzed behavioral characteristics of the final 5 SOM + K-Means clusters. The two-stage approach produces segmentation highly consistent with standalone K-Means, validating that the 40x40 SOM preserves the essential cluster structure while providing topological visualization benefits.
    </p>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Key Findings:</h4>
    <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; padding-right: 0; color: #000;">
        <li style="margin-right: 20px;"><strong>Cluster 0 (23.4%, n=3,018) - Explorers:</strong> High distance variability (Z=+1.11), solo traveler (companion Z=-0.38). Visits diverse destinations. Equivalent to K-Means Cluster 3.</li>
        <li style="margin-right: 20px;"><strong>Cluster 1 (21.3%, n=2,744) - Business Commuters:</strong> High flight regularity (regularity Z=+0.87), fixed routes (distance Z=-0.65). Predictable booking behavior. Equivalent to K-Means Cluster 1.</li>
        <li style="margin-right: 20px;"><strong>Cluster 2 (17.1%, n=2,200) - Family Travelers:</strong> High companion ratio (Z=+1.16), below-average regularity (regularity Z=-0.69). Travels with family on leisure patterns. Equivalent to K-Means Cluster 0.</li>
        <li style="margin-right: 20px;"><strong>Cluster 3 (19.6%, n=2,525) - Engaged Loyalists:</strong> Very high redemption (Z=+1.39), moderate companion (Z=+0.38), somewhat regular (Z=+0.37). Most valuable loyalty segment. Equivalent to K-Means Cluster 4.</li>
        <li style="margin-right: 20px;"><strong>Cluster 4 (18.6%, n=2,398) - Disengaged Solo:</strong> All features negative: fixed routes (distance Z=-0.51), solo (companion Z=-0.83), low regularity (regularity Z=-0.82), passive redeemer (Z=-0.59). Re-engagement target. Equivalent to K-Means Cluster 2.</li>
        <li style="margin-right: 20px;"><strong>Primary Driver - Redemption Frequency (Variance: 0.62):</strong> Redemption behavior creates strongest cluster separation, isolating the high-value Engaged segment.</li>
        <li style="margin-right: 20px;"><strong>Secondary Drivers - Companion (0.59) and Regularity (0.51):</strong> Differentiate Family from Solo segments and Commuters from irregular travelers.</li>
        <li style="margin-right: 20px;"><strong>SOM vs K-Means Consistency:</strong> Both methods produce identical segment archetypes with similar cluster sizes, confirming robust behavioral patterns. SOM adds topological visualization for segment proximity analysis.</li>
    </ul>
</div>
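
The "variance" figures quoted for the primary and secondary drivers are variances of the per-cluster mean values, computed per feature. A minimal sketch on toy numbers (the values below are invented for illustration and are not the notebook's data):

```python
import pandas as pd

# Toy standardized data (assumption: features are already z-scored)
toy = pd.DataFrame({
    "redemption_frequency":   [1.2, 1.0, -0.9, -1.1, 0.1, -0.3],
    "companion_flight_ratio": [0.2, 0.1, 0.0, -0.1, -0.1, -0.1],
    "Cluster":                [0, 0, 1, 1, 2, 2],
})

# Mean z-score per cluster: one row per cluster, one column per feature
toy_profiles = toy.groupby("Cluster").mean()

# Variance of the cluster means per feature; a larger value means the
# feature pulls the cluster centers further apart
toy_importance = toy_profiles.var(axis=0).sort_values(ascending=False)

print(toy_profiles.round(2))
print(toy_importance.round(3))
```

In this toy case `redemption_frequency` ranks first because its cluster means differ far more than those of `companion_flight_ratio`. The measure ignores within-cluster spread, so it is a ranking heuristic rather than a formal importance test.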

---

# <a class='anchor' id='9'></a>
<br>

<div style="background: linear-gradient(to right, #00411E, #00622D, #00823C, #45AF28, #82BA72); 
            padding: 10px; color: white; text-align: center;  max-width: 97%;">
    <center><h1 style="margin-top: 10px; margin-bottom: 4px; color: white;
                       font-size: 32px; font-family: 'Roboto', sans-serif;">
        <b>9. Final Clustering</b></h1></center>
</div>

## **9.1 Select Final Model**

In [None]:
# Convert the per-method result dictionaries to DataFrames for the final comparison
demo_clustering_results_df = pd.DataFrame.from_dict(demo_clustering_results, orient='index')
behavioral_clustering_results_df = pd.DataFrame.from_dict(behavioral_clustering_results, orient='index')

In [None]:
# Create comparison table with Silhouette (selected k in parentheses) and Davies-Bouldin
comparison_data = []

methods = ['Hierarchical', 'K-Means', 'SOM + K-Means', 'Mean Shift', 'GMM']

for method in methods:
    demo = demo_clustering_results_df.loc[method]
    behav = behavioral_clustering_results_df.loc[method]
    
    comparison_data.append({
        'Clustering Method': method,
        'Demo Silhouette': f"{demo['Silhouette']:.3f} ({int(demo['k'])})",
        'Demo DBI': f"{demo['Davies-Bouldin']:.3f}",
        'Behav Silhouette': f"{behav['Silhouette']:.3f} ({int(behav['k'])})",
        'Behav DBI': f"{behav['Davies-Bouldin']:.3f}"
    })

comparison_df = pd.DataFrame(comparison_data)
comparison_df.set_index('Clustering Method', inplace=True)
comparison_df

<div style="background-color: #e8f3e5ff; padding: 15px; margin-right: 30px; border-left: 5px solid; border-image: linear-gradient(to bottom, #00411E, #00622D, #00823C, #45AF28, #82BA72) 1; border-radius: 5px; max-width: 95%;">
    <h3 style="margin-top: 0; margin-bottom: 10px; color: #00411E; font-weight: bold;">Final Model Selection Summary</h3>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        <strong>Evaluation Metrics: Silhouette Score and Davies-Bouldin Index</strong>
    </p>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        Silhouette measures cluster cohesion and separation (higher is better), while Davies-Bouldin captures cluster compactness relative to inter-cluster distance (lower is better). R² was excluded because it inherently increases with more clusters, making cross-method comparison misleading when cluster counts differ. Silhouette and DBI do not improve automatically as k grows, so they provide complementary quality measures that remain comparable across different cluster counts.
    </p>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        <strong>Demographic Perspective: K-Means (k=3)</strong><br>
        Achieves top Silhouette (0.203) and best DBI (1.736), outperforming all other methods including SOM + K-Means (DBI 1.740). The lean 3-cluster solution balances statistical quality with business interpretability.
    </p>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        <strong>Behavioral Perspective: K-Means (k=5)</strong><br>
        Clearly outperforms all methods with highest Silhouette (0.186) and lowest DBI (1.360). The 5-cluster solution provides more actionable marketing segments than k=4.
    </p>
</div>
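
The reason for excluding R² can be checked directly. The sketch below uses a hypothetical `r_squared` helper (the notebook's own `get_rsq` is defined elsewhere and may differ in detail) and cuts a single Ward tree at increasing k, so that the partitions are nested. Even on pure noise, R² never decreases as k grows.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def r_squared(X, labels):
    """R2 = 1 - SSW / SST: share of total variance explained by the partition."""
    X = np.asarray(X, dtype=float)
    sst = ((X - X.mean(axis=0)) ** 2).sum()
    ssw = sum(
        ((X[labels == c] - X[labels == c].mean(axis=0)) ** 2).sum()
        for c in np.unique(labels)
    )
    return 1.0 - ssw / sst


rng = np.random.default_rng(0)
X_noise = rng.normal(size=(300, 4))  # pure noise, no real cluster structure

# Cutting one Ward tree at increasing k yields nested partitions
Z_noise = linkage(X_noise, method="ward")
r2_by_k = {
    k: r_squared(X_noise, fcluster(Z_noise, t=k, criterion="maxclust"))
    for k in range(2, 9)
}
for k, r2 in r2_by_k.items():
    print(f"k={k}: R2={r2:.3f}")
```

A method that happens to select a larger k would therefore look better on R² regardless of whether its clusters are more meaningful, which is why the comparison table relies on Silhouette and DBI instead.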

## **9.2 Merge Clustering Perspectives**

### Step 1: Combine Clustering Labels from both Perspectives

In [None]:
# Add the final K-Means cluster labels to each dataframe
# Demographic: k=3 clusters | Behavioral: k=5 clusters

df_demographic_a_scaled['km_cluster'] = demo_clustering_results_df.loc['K-Means', 'labels']
df_behavioral_clean['km_cluster'] = behavioral_clustering_results_df.loc['K-Means', 'labels']

# Create crosstab: Count customers in each (demo, behav) cluster combination
# Rows = Demographic clusters (0-2), Columns = Behavioral clusters (0-4)
crosstab = pd.crosstab(
    df_demographic_a_scaled['km_cluster'],
    df_behavioral_clean['km_cluster'],
    margins=True
)

In [None]:
crosstab

In [None]:
colors = ["#00411E", "#00823C", "#82BA72", "#D5E6D0", "#FFFFFF"]

fig, ax = plt.subplots(figsize=(10, 6))
sns.heatmap(
    crosstab.iloc[:-1, :-1],
    annot=True, 
    fmt='d', 
    cmap=LinearSegmentedColormap.from_list('gw', colors[::-1]),  # reversed
    cbar_kws={'label': 'Number of Customers'},
    linewidths=0.5,
    ax=ax
)
ax.set_title('Cluster Overlap: Demographic vs Behavioral Perspective\n(Customer Count per Combination)', 
             fontweight='bold', fontsize=14)
ax.set_xlabel('Behavioral Cluster (k=5)', fontweight='bold', fontsize=12)
ax.set_ylabel('Demographic Cluster (k=3)', fontweight='bold', fontsize=12)
plt.tight_layout()
plt.show()


### Step 2: Calculate Centroids for each Cluster Combination

In [None]:
# Step 2: Calculate Centroids for Each Cluster Combination
# Align indices (behavioral has fewer rows due to outlier removal)
common_idx = df_demographic_a_scaled.index.intersection(df_behavioral_clean.index)

# Define feature lists (exclude all cluster label columns)
exclude_cols = ['km_cluster', 'ms_cluster', 'gmm_cluster', 'Cluster_SOM_KMeans', 'Cluster']
demographic_feats = [col for col in df_demographic_a_scaled.columns if col not in exclude_cols]
all_features = demographic_feats + behavioral_feats

# Combine perspectives: each customer gets demo + behav features and both cluster labels
df_merged = pd.DataFrame({
    'demo_cluster': df_demographic_a_scaled.loc[common_idx, 'km_cluster'].values,
    'behav_cluster': df_behavioral_clean.loc[common_idx, 'km_cluster'].values
})
df_merged[demographic_feats] = df_demographic_a_scaled.loc[common_idx, demographic_feats].values
df_merged[behavioral_feats] = df_behavioral_clean.loc[common_idx, behavioral_feats].values

# Calculate centroids: mean feature values for each (demo, behav) cluster combination
df_centroids = df_merged.groupby(['demo_cluster', 'behav_cluster'])[all_features].mean()

print(f"Features: {len(all_features)} | Combinations: {len(df_centroids)}")
df_centroids

### Step 3: Hierarchical Clustering on Centroids

In [None]:
# Create linkage matrix using Ward method
linkage_matrix = linkage(df_centroids[all_features], method='ward', metric='euclidean')

y_threshold_first_option = 5
y_threshold_second_option = 2.3

# Plot dendrogram to decide final k
fig, ax = plt.subplots(figsize=(14, 6))
dendrogram(
    linkage_matrix,
    ax=ax,
    labels=[f"D{i[0]}_B{i[1]}" for i in df_centroids.index],
    leaf_font_size=10,
    color_threshold=0.7 * max(linkage_matrix[:, 2])
)
plt.axhline(y=y_threshold_first_option, color='r', linestyle='--', linewidth=2, label=f'First option at {y_threshold_first_option}')
plt.axhline(y=y_threshold_second_option, color='b', linestyle='--', linewidth=2, label=f'Second option at {y_threshold_second_option}')
ax.set_title('Dendrogram: Hierarchical Clustering of Combined Centroids', fontweight='bold', fontsize=14)
ax.set_xlabel('Cluster Combination (Demo_Behav)', fontweight='bold')
ax.set_ylabel('Distance (Ward)', fontweight='bold')
plt.legend()
plt.tight_layout()
plt.show()

### Step 4: Apply Final HC with chosen k

In [None]:
# Step 4: Compare both k options
k_option_1 = 3  # cut at distance ~5
k_option_2 = 6  # cut at distance ~2.3

# Fit both options
hc_k3 = AgglomerativeClustering(linkage='ward', n_clusters=k_option_1)
hc_k6 = AgglomerativeClustering(linkage='ward', n_clusters=k_option_2)

df_centroids['merged_k3'] = hc_k3.fit_predict(df_centroids[all_features])
df_centroids['merged_k6'] = hc_k6.fit_predict(df_centroids[all_features])

### Step 5: Map merged labels back to all customers

In [None]:
# Option 1: k=3
# Look up each customer's (demo, behav) pair in the centroid-level mapping
mapper_k3 = df_centroids['merged_k3'].to_dict()
df_merged['merged_k3'] = [
    mapper_k3[key] for key in zip(df_merged['demo_cluster'], df_merged['behav_cluster'])
]

# Option 2: k=6
mapper_k6 = df_centroids['merged_k6'].to_dict()
df_merged['merged_k6'] = [
    mapper_k6[key] for key in zip(df_merged['demo_cluster'], df_merged['behav_cluster'])
]

# Compare cluster sizes
print("Option 1 (k=3) cluster sizes:")
print(df_merged['merged_k3'].value_counts().sort_index())
print("\nOption 2 (k=6) cluster sizes:")
print(df_merged['merged_k6'].value_counts().sort_index())


### Step 6: Compare metrics for both options

In [None]:
X_merged = df_merged[all_features].values

results = []
for k, col in [(3, 'merged_k3'), (6, 'merged_k6')]:
    labels = df_merged[col].values
    results.append({
        'k': k,
        'Silhouette': silhouette_score(X_merged, labels),
        'Davies-Bouldin': davies_bouldin_score(X_merged, labels),
        'Calinski-Harabasz': calinski_harabasz_score(X_merged, labels),
        'R2': get_rsq(df_merged, all_features, col)
    })

metrics_comparison = pd.DataFrame(results).set_index('k')
metrics_comparison

<div style="background-color: #e8f3e5ff; padding: 15px; margin-right: 30px; border-left: 5px solid; border-image: linear-gradient(to bottom, #00411E, #00622D, #00823C, #45AF28, #82BA72) 1; border-radius: 5px; max-width: 95%;">
    <h3 style="margin-top: 0; margin-bottom: 10px; color: #00411E; font-weight: bold;">Multi-Perspective Merging Analysis</h3>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Goal</h4>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        The goal of multi-perspective merging is to combine demographic (K-Means k=3) and behavioral (K-Means k=5) clustering perspectives into a unified segmentation solution. This approach attempts to create customer segments that simultaneously capture both "who customers are" (demographics: income, education, location, marital status) and "how they behave" (flight patterns, companion preferences, redemption behavior). The hypothesis is that combining both perspectives yields richer, more actionable segments than either perspective alone.
    </p>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Methodology</h4>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        The merging process follows an established multi-perspective segmentation workflow:
    </p>
    <ul style="margin: 5px 0; margin-right: 40px; color: #000;">
        <li><strong>Step 1 - Index Alignment:</strong> Aligned customer indices between demographic (n=13,038) and behavioral (n=12,885) datasets. The difference of 153 customers results from DBSCAN outlier removal in behavioral clustering. Analysis proceeds with 12,885 common customers.</li>
        <li><strong>Step 2 - Cluster Overlap Analysis:</strong> Created crosstab showing customer distribution across all 15 possible combinations (3 demographic x 5 behavioral clusters). Heatmap visualization reveals how perspectives interact and whether certain demographic-behavioral pairings are over- or under-represented.</li>
        <li><strong>Step 3 - Centroid Calculation:</strong> For each of the 15 cluster combinations, calculated centroids representing the mean feature values across all 13 features (9 demographic + 4 behavioral). Each centroid captures the "average customer profile" for customers belonging to that specific (demographic, behavioral) combination.</li>
        <li><strong>Step 4 - Hierarchical Clustering on Centroids:</strong> Applied Ward linkage hierarchical clustering to the 15 centroids. Ward linkage minimizes within-cluster variance, making it suitable for identifying compact, well-separated groups. The dendrogram visualization enables visual inspection of the hierarchical structure and potential cut points.</li>
        <li><strong>Step 5 - Cut Point Evaluation:</strong> Evaluated two cut options based on dendrogram structure:
            <ul>
                <li>Option 1: Cut at distance 5.0 --> k=3 merged clusters</li>
                <li>Option 2: Cut at distance 2.3 --> k=6 merged clusters</li>
            </ul>
        </li>
        <li><strong>Step 6 - Label Mapping & Evaluation:</strong> Mapped merged cluster labels from centroids back to all 12,885 customers using the (demo_cluster, behav_cluster) --> merged_cluster mapping. Calculated Silhouette Score, Davies-Bouldin Index, Calinski-Harabasz Index, and R² for both options.</li>
    </ul>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Findings</h4>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        <strong>Quantitative Evaluation:</strong> Merged clustering shows significant quality degradation compared to standalone K-Means Behavioral clustering:
    </p>
    <table style="margin: 10px 0; border-collapse: collapse; font-size: 14px; color: #000;">
        <tr style="background-color: #82BA72;">
            <th style="padding: 8px; border: 1px solid #00622D; color: #000;">Solution</th>
            <th style="padding: 8px; border: 1px solid #00622D; color: #000;">k</th>
            <th style="padding: 8px; border: 1px solid #00622D; color: #000;">Silhouette</th>
            <th style="padding: 8px; border: 1px solid #00622D; color: #000;">Davies-Bouldin</th>
            <th style="padding: 8px; border: 1px solid #00622D; color: #000;">R²</th>
        </tr>
        <tr style="background-color: #D5E6D0;">
            <td style="padding: 8px; border: 1px solid #00622D; color: #000;">Merged (cut at 5.0)</td>
            <td style="padding: 8px; border: 1px solid #00622D; color: #000;">3</td>
            <td style="padding: 8px; border: 1px solid #00622D; color: #000;">0.129</td>
            <td style="padding: 8px; border: 1px solid #00622D; color: #000;">2.30</td>
            <td style="padding: 8px; border: 1px solid #00622D; color: #000;">0.212</td>
        </tr>
        <tr style="background-color: #e8f3e5;">
            <td style="padding: 8px; border: 1px solid #00622D; color: #000;">Merged (cut at 2.3)</td>
            <td style="padding: 8px; border: 1px solid #00622D; color: #000;">6</td>
            <td style="padding: 8px; border: 1px solid #00622D; color: #000;">0.065</td>
            <td style="padding: 8px; border: 1px solid #00622D; color: #000;">3.42</td>
            <td style="padding: 8px; border: 1px solid #00622D; color: #000;">0.268</td>
        </tr>
        <tr style="background-color: #00823C;">
            <td style="padding: 8px; border: 1px solid #00622D; color: #fff;"><strong>K-Means Behavioral</strong></td>
            <td style="padding: 8px; border: 1px solid #00622D; color: #fff;"><strong>5</strong></td>
            <td style="padding: 8px; border: 1px solid #00622D; color: #fff;"><strong>0.186</strong></td>
            <td style="padding: 8px; border: 1px solid #00622D; color: #fff;"><strong>1.36</strong></td>
            <td style="padding: 8px; border: 1px solid #00622D; color: #fff;"><strong>0.471</strong></td>
        </tr>
    </table>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        <strong>Key Observations:</strong>
    </p>
    <ul style="margin: 5px 0; margin-right: 40px; color: #000;">
        <li><strong>Silhouette Degradation:</strong> Merged k=3 achieves only 69% of behavioral-only Silhouette (0.129 vs 0.186). Merged k=6 drops to 35% (0.065), indicating poor cluster cohesion and separation.</li>
        <li><strong>Davies-Bouldin Increase:</strong> DBI increases from 1.36 (behavioral) to 2.30 (merged k=3) and 3.42 (merged k=6), i.e. 69% and 151% higher (worse) respectively, meaning the merged clusters are looser relative to the distance between them.</li>
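        <li><strong>R² Trade-off:</strong> The merged solutions explain only 21% (k=3) and 27% (k=6) of the variance across all 13 features, whereas the behavioral-only solution explains 47% of the variance in the 4 behavioral features that actually drive segment differentiation. Because these R² values refer to different feature sets, the comparison is indicative rather than exact.</li>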
    </ul>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        <strong>Dendrogram Structure Insight:</strong> The hierarchical clustering dendrogram reveals a critical pattern: merged clusters group almost exclusively by demographic perspective. All D0_* combinations (regardless of behavioral cluster B0-B4) cluster together, all D1_* together, and all D2_* together. This indicates that when demographic and behavioral features are combined in the same feature space, the demographic features dominate the distance calculations, effectively "drowning out" the behavioral signal, likely in part because 9 of the 13 features are demographic. Within the merged solution, the behavioral perspective therefore adds little differentiation beyond the demographic split.
    </p>
</div>
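
The merging workflow described above can be sketched end to end on synthetic data. All names and numbers below are illustrative assumptions; the demographic feature is deliberately given the larger scale so that the dominance effect noted in the dendrogram becomes visible.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(1)
n = 600

# Toy customers with two independent label sets (3 demographic x 5 behavioral)
toy_merge = pd.DataFrame({
    "demo_cluster": rng.integers(0, 3, size=n),
    "behav_cluster": rng.integers(0, 5, size=n),
})
# Features shifted by cluster; the demographic shift is much larger (assumption)
toy_merge["demo_feat"] = toy_merge["demo_cluster"] * 4.0 + rng.normal(size=n)
toy_merge["behav_feat"] = toy_merge["behav_cluster"] * 0.5 + rng.normal(size=n)
feats = ["demo_feat", "behav_feat"]

# 1. One centroid per (demo, behav) combination
toy_centroids = toy_merge.groupby(["demo_cluster", "behav_cluster"])[feats].mean()

# 2. Ward linkage on the centroids, cut into 3 merged clusters
Z_merge = linkage(toy_centroids.to_numpy(), method="ward")
toy_centroids["merged"] = fcluster(Z_merge, t=3, criterion="maxclust")

# 3. Map merged labels back to customers via the combination key
toy_merge = toy_merge.join(toy_centroids["merged"], on=["demo_cluster", "behav_cluster"])

# Rows = demographic cluster, columns = merged cluster
toy_overlap = pd.crosstab(toy_merge["demo_cluster"], toy_merge["merged"])
print(toy_overlap)
```

In this toy setup each demographic cluster lands in exactly one merged cluster, so the behavioral labels have no influence on the merged result. This mirrors the D0_*, D1_*, D2_* grouping seen in the dendrogram.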

<div style="background-color: #fce8e8ff; padding: 15px; margin-right: 30px; border-left: 5px solid; border-image: linear-gradient(to bottom, #8B0000, #A52A2A, #CD5C5C, #F08080) 1; border-radius: 5px; max-width: 95%;">
    <h3 style="margin-top: 0; margin-bottom: 10px; color: #8B0000; font-weight: bold;">Critical Decision: Multi-Perspective Merging rejected</h3>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #A52A2A; font-weight: bold;">Final Decision</h4>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        <strong>The multi-perspective merging approach is rejected.</strong> The final customer segmentation will be based exclusively on behavioral clustering (K-Means k=5). Demographic attributes will serve only as descriptive profiling variables for the behavioral segments, not as inputs to segment creation. Value-based features (Frequency, Monetary) will be used for a separate prioritization layer.
    </p>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #A52A2A; font-weight: bold;">Rationale: Why Merging fails</h4>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;"><strong>1. Significant Statistical Quality Degradation</strong></p>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        The merged solutions show unacceptable degradation in clustering quality metrics. The Silhouette score drops by roughly 30-65% compared to behavioral-only clustering, while Davies-Bouldin increases by roughly 70-150%. A Silhouette score of 0.065 (merged k=6) is close to 0, the value at which clusters overlap, so the merged clusters lack internal cohesion and external separation.
    </p>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;"><strong>2. Demographic Features Issues</strong></p>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        The demographic features in this dataset suffer from systematic data quality problems that fundamentally undermine their value for clustering:
    </p>
    <ul style="margin: 5px 0; margin-right: 40px; color: #000;">
        <li><strong>Income Artifact:</strong> All customers with college education show Income=0, creating an artificial pattern where education and income are confounded. This systematic error means income-based segments reflect data collection issues rather than true economic differences.</li>
        <li><strong>Geographic Concentration:</strong> Location features (Province, City, FSA) show heavy concentration in a few regions, with sparse representation elsewhere. This limits the discriminative power of geographic segmentation.</li>
        <li><strong>Limited Variability:</strong> Features such as Gender and Marital Status take only a few distinct values and therefore provide limited clustering value.</li>
    </ul>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;"><strong>3. Limited Strategic Actionability of Demographics</strong></p>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        Even if demographic data quality were perfect, demographic-based segments offer limited actionability for marketing strategy. Knowing that a customer is male, married, aged 35-44, or lives in Ontario does not directly inform:
    </p>
    <ul style="margin: 5px 0; margin-right: 40px; color: #000;">
        <li>What offer or promotion to send</li>
        <li>Which communication channel to prioritize</li>
        <li>When to engage (timing, frequency)</li>
        <li>What messaging or creative to use</li>
    </ul>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        In contrast, behavioral segments directly suggest differentiated strategies: "Explorers" (high distance variability) may respond to destination discovery campaigns, "Regular Families" (high companion ratio) to family package offers, "Engaged Actives" (high redemption) to loyalty program communications.
    </p>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;"><strong>4. Value-Based Features Reserved for Prioritization</strong></p>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        RFM-style value features (Frequency, Monetary) are intentionally excluded from clustering. Including value in cluster creation would conflate "what customers do" with "how valuable they are," making it impossible to identify high-potential customers in low-value segments. Instead, value indicators will create a separate prioritization layer, enabling strategic resource allocation: invest heavily in high-value customers across all behavioral segments, while identifying growth opportunities in high-potential, currently-low-value customers.
    </p>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #A52A2A; font-weight: bold;">Strategic Framework: Three-Layer Segmentation</h4>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        The final segmentation strategy adopts a three-layer approach that leverages each data perspective for its appropriate purpose:
    </p>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;"><strong>Layer 1: Behavioral Segmentation (5 Clusters)</strong></p>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        K-Means clustering on behavioral features (distance_variability, companion_flight_ratio, flight_regularity, redemption_frequency) creates 5 actionable customer segments. These segments answer "How do customers engage with our service?" and directly inform differentiated marketing strategies:
    </p>
    <ul style="margin: 5px 0; margin-right: 40px; color: #000;">
        <li><strong>Explorers:</strong> High distance variability, destination-seekers</li>
        <li><strong>Sporadic Flyers:</strong> Irregular patterns, re-engagement targets</li>
        <li><strong>Regular Families:</strong> High companion ratio, family travelers</li>
        <li><strong>Engaged Actives:</strong> High redemption, loyalty program enthusiasts</li>
        <li><strong>Passive Solos:</strong> Low engagement across features, activation candidates</li>
    </ul>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;"><strong>Layer 2: Value-Based Prioritization (4 Tiers per Cluster)</strong></p>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        Each behavioral segment will be mapped onto a Frequency x Monetary matrix, creating 4 priority tiers:
    </p>
    <ul style="margin: 5px 0; margin-right: 40px; color: #000;">
        <li><strong>Champions (High F, High M):</strong> Protect and reward</li>
        <li><strong>Potential Loyalists (High F, Low M):</strong> Upsell opportunities</li>
        <li><strong>Big Spenders (Low F, High M):</strong> Increase engagement frequency</li>
        <li><strong>At Risk (Low F, Low M):</strong> Re-activation or deprioritize</li>
    </ul>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        This yields 20 strategic combinations (5 behavioral clusters x 4 value tiers), enabling precise resource allocation and tailored strategies.
    </p>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;"><strong>Layer 3: Demographic Profiling</strong></p>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        Demographic attributes (income, education, location, gender, marital status) will be used exclusively for descriptive profiling of the 20 strategic combinations. This provides "persona" characteristics ("Who are our Engaged Active Champions?") for communication personalization and channel selection, without influencing segment membership or prioritization.
    </p>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #A52A2A; font-weight: bold;">Next Steps</h4>
    <ul style="margin: 5px 0; margin-right: 40px; color: #000;">
        <li><strong>Section 9.3 - Final Profiling:</strong> Comprehensive profiling of the 5 behavioral segments including cluster heatmaps, parallel coordinates, t-SNE/UMAP visualizations, feature importance analysis, and demographic distributions.</li>
        <li><strong>Section 9.4 - Value Mapping:</strong> Implementation of the Frequency x Monetary prioritization matrix, creating 4 tiers per behavioral cluster and visualizing the 20 strategic combinations.</li>
        <li><strong>Section 9.5 - Strategic Recommendations:</strong> Actionable marketing strategies for each behavioral segment and value tier combination, including recommended channels, offer types, and engagement timing.</li>
    </ul>
</div>
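
The Frequency x Monetary mapping described above can be sketched in a few lines. This is a minimal illustration, not the Section 9.4 implementation: the median split, the function name `assign_value_tier`, and the toy customer values are all assumptions made for the example.

```python
from statistics import median

def assign_value_tier(frequency, monetary, f_cut, m_cut):
    """Map one customer onto the 2x2 Frequency x Monetary matrix.

    Assumption: a value strictly above its cut-off counts as "High".
    """
    high_f = frequency > f_cut
    high_m = monetary > m_cut
    if high_f and high_m:
        return "Champions"
    if high_f:
        return "Potential Loyalists"
    if high_m:
        return "Big Spenders"
    return "At Risk"

# Toy customers (hypothetical values): id -> (frequency, monetary)
customers = {"A": (30, 9000), "B": (28, 1200), "C": (4, 8000), "D": (3, 500)}

# Assumed thresholds: the median of each dimension
f_cut = median(f for f, _ in customers.values())
m_cut = median(m for _, m in customers.values())

tiers = {cid: assign_value_tier(f, m, f_cut, m_cut) for cid, (f, m) in customers.items()}
print(tiers)
```

Whether the cut-offs are computed globally or separately within each behavioral cluster is a design choice; the sketch uses a single global split only to keep the example short.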

## **9.3 Final Profiling**

### Step 1: Cluster Sizes & Distribution

In [None]:
# Step 1: Setup final clustering data
final_labels = behavioral_clustering_results_df.loc['K-Means', 'labels']
df_final = df_behavioral_clean.copy()
df_final['Cluster'] = final_labels
k_final = 5

# Cluster sizes (reuse existing function)
plot_cluster_sizes(final_labels, k_final, CUSTOM_HEX, title='Final Segmentation: Cluster Size Distribution (K-Means k=5)')

### Step 2: Final Profiles (Z-Scores + Original Values)

In [None]:
# Step 2: Behavioral profiles - Z-Scores vs Original Values
cluster_profiles = df_final.groupby('Cluster')[behavioral_feats].mean()

# Convert to original scale
original_stats = df_behavioral_a[behavioral_feats].agg(['mean', 'std'])
cluster_profiles_original = cluster_profiles.copy()
for feat in behavioral_feats:
    cluster_profiles_original[feat] = (cluster_profiles[feat] * original_stats.loc['std', feat]) + original_stats.loc['mean', feat]

# Normalize per row (feature) for better visualization
cluster_profiles_normalized = cluster_profiles_original.T.copy()
for feat in cluster_profiles_normalized.index:
    row_min = cluster_profiles_normalized.loc[feat].min()
    row_max = cluster_profiles_normalized.loc[feat].max()
    cluster_profiles_normalized.loc[feat] = (cluster_profiles_normalized.loc[feat] - row_min) / (row_max - row_min)

# Color palette
colors = ["#00411E", "#00823C", "#82BA72", "#D5E6D0", "#FFFFFF"]
green_white_cmap = LinearSegmentedColormap.from_list('gw', colors[::-1])

# Side-by-side comparison
fig, axes = plt.subplots(1, 2, figsize=(18, 6))

# Left: Z-Scores (global scale)
sns.heatmap(cluster_profiles.T, annot=True, fmt='.2f', cmap=GROUP80_palette_continuous, 
            center=0, ax=axes[0], cbar_kws={'label': 'Z-Score'}, linewidths=0.5)
axes[0].set_title('Final Profiles: Z-Scores', fontweight='bold')
axes[0].set_xlabel('Cluster', fontweight='bold')
axes[0].set_ylabel('Feature', fontweight='bold')

# Right: Original Values (normalized per row, no colorbar)
sns.heatmap(cluster_profiles_normalized, annot=cluster_profiles_original.T.values, fmt='.2f', 
            cmap=green_white_cmap, ax=axes[1], cbar=False, linewidths=0.5)
axes[1].set_title('Final Profiles: Original Scale\n(Colors normalized per feature: dark=high, light=low)', fontweight='bold')
axes[1].set_xlabel('Cluster', fontweight='bold')
axes[1].set_ylabel('Feature', fontweight='bold')

plt.tight_layout()
plt.show()

<div style="background-color: #e8f3e5ff; padding: 15px; margin-right: 30px; border-left: 5px solid; border-image: linear-gradient(to bottom, #00411E, #00622D, #00823C, #45AF28, #82BA72) 1; border-radius: 5px; max-width: 95%;">
    <h3 style="margin-top: 0; margin-bottom: 10px; color: #00411E; font-weight: bold;">Key Findings: Five Distinct Final Segments</h3>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        The K-Means clustering (k=5) reveals <strong>five behaviorally distinct customer segments</strong> based on travel patterns and loyalty program engagement. Each cluster exhibits a unique combination of characteristics that will inform targeted marketing strategies in Section 9.5.
    </p>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Cluster Profiles:</h4>
    <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; color: #000;">
        <li style="margin-bottom: 8px;"><strong>Cluster 0 - Family Travelers:</strong> 36% of flights with companions (highest), moderate route diversity, below-average monthly consistency. Families or friends traveling together for leisure.</li>
        <li style="margin-bottom: 8px;"><strong>Cluster 1 - Business Commuters:</strong> 55% flight regularity score (highest), consistent routes with low distance variation, predominantly solo (23% companion flights). Predictable business travelers with fixed schedules.</li>
        <li style="margin-bottom: 8px;"><strong>Cluster 2 - Disengaged Solo:</strong> Only 18% companion flights, 46% regularity (lowest), redeems in just 4% of available months (lowest). Infrequent, disengaged travelers. Re-engagement target.</li>
        <li style="margin-bottom: 8px;"><strong>Cluster 3 - Explorers:</strong> Distance CV of 1.09 (highest), meaning the standard deviation of their flight distances is slightly larger than the mean distance. Adventure seekers visiting diverse destinations near and far.</li>
        <li style="margin-bottom: 8px;"><strong>Cluster 4 - Engaged Loyalists:</strong> Redeems points in 16% of available months (highest, four times the rate of Cluster 2), above-average regularity (53%) and companions (28%). Most engaged with the loyalty program.</li>
    </ul>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Feature Interpretation:</h4>
    <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; color: #000;">
        <li style="margin-bottom: 5px;"><strong>companion_flight_ratio:</strong> Share of flights taken with at least one companion (0.36 = 36% of flights).</li>
        <li style="margin-bottom: 5px;"><strong>flight_regularity:</strong> Consistency of monthly flight activity (0.55 = moderate-high consistency, 1.0 = perfectly regular).</li>
        <li style="margin-bottom: 5px;"><strong>distance_variability:</strong> Coefficient of Variation of flight distances (1.09 = high diversity in destinations).</li>
        <li style="margin-bottom: 5px;"><strong>redemption_frequency:</strong> Proportion of months with point redemptions (0.16 = redeems in 16% of months).</li>
    </ul>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Strategic Implications:</h4>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        These behavioral segments provide actionable foundations for differentiated marketing approaches. The distinguishing features directly translate to marketing levers: group packages, subscription models, re-engagement campaigns, destination promotions, and loyalty rewards. Detailed strategies will be developed in <strong>Section 9.5</strong>.
    </p>
</div>
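
The feature definitions above can be made concrete with a small worked example. The sketch below computes a companion ratio and a distance CV for one hypothetical customer. The record layout and values are invented, and the notebook's own feature engineering may differ in detail (for example, in whether the population or the sample standard deviation is used).

```python
from statistics import mean, pstdev

# Hypothetical flight records for one customer: (distance, number of companions)
flights = [(300, 0), (2500, 2), (450, 0), (6000, 1), (350, 0)]

distances = [d for d, _ in flights]

# Share of flights taken with at least one companion
companion_flight_ratio = sum(1 for _, c in flights if c > 0) / len(flights)

# Coefficient of variation: standard deviation relative to the mean.
# Assumption: population standard deviation (pstdev).
distance_variability = pstdev(distances) / mean(distances)

print(f"companion_flight_ratio = {companion_flight_ratio:.2f}")
print(f"distance_variability   = {distance_variability:.2f}")
```

A CV above 1, as in this toy case, means the spread of flight distances exceeds the average distance, which is the pattern attributed to the Explorers segment.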

### Step 3: Feature Importance

In [None]:
# Step 3: Feature Importance - Variance across cluster centroids
feature_variance = cluster_profiles.var().sort_values(ascending=False)

fig, ax = plt.subplots(figsize=(10, 6))
bars = ax.bar(feature_variance.index, feature_variance.values, color=CUSTOM_HEX[1])

# Add values inside bars
for bar, val in zip(bars, feature_variance.values):
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height()/2, 
            f'{val:.3f}', ha='center', va='center', fontweight='bold', fontsize=11, color='white')

ax.set_xlabel('Feature', fontweight='bold')
ax.set_ylabel('Variance', fontweight='bold')
ax.set_title('Feature Importance: Variance Across Cluster Centroids', fontweight='bold', fontsize=13, pad=15)
# Rotate the existing tick labels (avoids the set_xticklabels() warning on a non-fixed locator)
plt.setp(ax.get_xticklabels(), rotation=45, ha='right')
plt.tight_layout()
plt.show()

<div style="background-color: #e8f3e5ff; padding: 15px; margin-right: 30px; border-left: 5px solid; border-image: linear-gradient(to bottom, #00411E, #00622D, #00823C, #45AF28, #82BA72) 1; border-radius: 5px; max-width: 95%;">
    <h3 style="margin-top: 0; margin-bottom: 10px; color: #00411E; font-weight: bold;">Feature Importance Summary</h3>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        <strong>Redemption frequency</strong> (0.64) is the strongest cluster differentiator, clearly separating Engaged Loyalists from other segments. <strong>Companion flight ratio</strong> (0.60) ranks second, distinguishing Family Travelers from solo segments. <strong>Flight regularity</strong> (0.53) and <strong>distance variability</strong> (0.48) contribute moderately, identifying Business Commuters and Explorers respectively. All four features show substantial variance (0.48-0.64), confirming each contributes meaningfully to the segmentation.
    </p>
</div>
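
The importance score used above is the variance of each feature's cluster means. A minimal sketch with made-up centroid z-scores (not the notebook's values) shows the calculation. `statistics.variance` uses the sample variance (n - 1 denominator), which matches the default of pandas `DataFrame.var()`.

```python
from statistics import variance

# Hypothetical centroid z-scores: one value per cluster (5 clusters) for each feature
centroids = {
    "feature_a": [1.2, -0.3, -0.6, 0.1, -0.4],   # one cluster stands out
    "feature_b": [0.1, 0.0, -0.1, 0.05, -0.05],  # clusters nearly identical
}

# Variance of the cluster means: large when clusters sit far apart on a feature
importance = {name: variance(values) for name, values in centroids.items()}

for name, score in sorted(importance.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

Because the score ignores cluster sizes and within-cluster spread, it is best read as a descriptive ranking rather than a statistical test.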

### Step 4: PCA Visualization

In [None]:
behavioral_feats

In [None]:
# Step 4: PCA - 2D Visualization

# Fit PCA
pca = PCA(n_components=2, random_state=42)
X_pca = pca.fit_transform(df_final[behavioral_feats])

# Add to dataframe
df_final['pca_1'] = X_pca[:, 0]
df_final['pca_2'] = X_pca[:, 1]

# Variance explained
var_explained = pca.explained_variance_ratio_
print("PCA Variance Explained:")
print(f"PC1: {var_explained[0]:.1%}")
print(f"PC2: {var_explained[1]:.1%}")
print(f"Total: {var_explained.sum():.1%}")

In [None]:
df_final

In [None]:
# PCA Scatter Plot
fig, ax = plt.subplots(figsize=(12, 8))

# Colorful palette
colors = cm.tab10(np.linspace(0, 1, 10))

# Plot each cluster separately
for label in sorted(df_final['Cluster'].unique()):
    cluster_data = df_final[df_final['Cluster'] == label]
    ax.scatter(
        cluster_data['pca_1'],
        cluster_data['pca_2'],
        c=[colors[label]],
        alpha=0.6,
        s=30,
        edgecolors='none',
        label=f'Cluster {label}'
    )

# Add cluster centroids
for label in sorted(df_final['Cluster'].unique()):
    cluster_data = df_final[df_final['Cluster'] == label]
    centroid_x = cluster_data['pca_1'].mean()
    centroid_y = cluster_data['pca_2'].mean()
    ax.scatter(centroid_x, centroid_y, c='black', marker='X', s=200, 
               edgecolors=colors[label], linewidths=2, zorder=10)
    ax.annotate(f'C{label}', (centroid_x + 0.1, centroid_y + 0.1), 
                fontsize=12, fontweight='bold', ha='center', va='center', zorder=20)

ax.legend(loc='best', framealpha=1, title='Cluster', fontsize=10, markerscale=1.5)
ax.set_xlabel(f'PC1 ({var_explained[0]:.1%} variance)', fontweight='bold')
ax.set_ylabel(f'PC2 ({var_explained[1]:.1%} variance)', fontweight='bold')
ax.set_title('PCA: 2D Projection of 5 Behavioral Segments', fontweight='bold', fontsize=13, pad=15)
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()


In [None]:
# PCA Loadings Heatmap
loadings = pd.DataFrame(
    pca.components_.T,
    columns=['PC1', 'PC2'],
    index=behavioral_feats
)

fig, ax = plt.subplots(figsize=(8, 5))
sns.heatmap(loadings, annot=True, cmap='PiYG', center=0, fmt='.3f', ax=ax, linewidths=0.5)
ax.set_title('PCA Component Loadings', fontweight='bold', fontsize=13, pad=15)
plt.tight_layout()
plt.show()

<div style="background-color: #e8f3e5ff; padding: 15px; margin-right: 30px; border-left: 5px solid; border-image: linear-gradient(to bottom, #00411E, #00622D, #00823C, #45AF28, #82BA72) 1; border-radius: 5px; max-width: 95%;">
    <h3 style="margin-top: 0; margin-bottom: 10px; color: #00411E; font-weight: bold;">PCA Visualization Summary</h3>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Methodology:</h4>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        Principal Component Analysis (PCA) reduces the 4 behavioral features to 2 dimensions while preserving maximum variance. PC1 and PC2 together explain <strong>58.1%</strong> of total variance (32.3% + 25.8%), providing a reasonable 2D representation of the cluster structure.
    </p>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Loading Interpretation:</h4>
    <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; color: #000;">
        <li style="margin-bottom: 8px;"><strong>PC1 (Engagement Axis):</strong> Driven by redemption_frequency (0.74) and companion_flight_ratio (0.56). High PC1 values indicate engaged customers who redeem points frequently and travel with companions. Low PC1 values indicate disengaged solo travelers.</li>
        <li style="margin-bottom: 8px;"><strong>PC2 (Travel Style Axis):</strong> Driven by distance_variability (0.76) and flight_regularity (0.46), with negative loading on companion_flight_ratio (-0.45). High PC2 values indicate solo explorers visiting diverse destinations. Low PC2 values indicate family travelers with consistent routes.</li>
    </ul>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Key Findings:</h4>
    <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; color: #000;">
        <li style="margin-bottom: 8px;"><strong>Four distinct corner clusters:</strong> Clusters 0, 2, 3, and 4 occupy separate regions of the PCA space, confirming behavioral differentiation.</li>
        <li style="margin-bottom: 8px;"><strong>Cluster 4 (Engaged Loyalists):</strong> Far right (high PC1) - highest redemption and companion activity.</li>
        <li style="margin-bottom: 8px;"><strong>Cluster 2 (Disengaged Solo):</strong> Far left (low PC1) - lowest engagement across all metrics.</li>
        <li style="margin-bottom: 8px;"><strong>Cluster 3 (Explorers):</strong> Top (high PC2) - highest distance variability, diverse destinations.</li>
        <li style="margin-bottom: 8px;"><strong>Cluster 0 (Family Travelers):</strong> Bottom (low PC2) - highest companion ratio, consistent travel patterns.</li>
        <li style="margin-bottom: 8px;"><strong>Cluster 1 (Business Commuters):</strong> Centered in the middle, overlapping with all other clusters. This segment has <strong>average values across most features</strong> except high flight regularity. Their central position reflects their balanced, predictable behavior without extreme characteristics in any direction.</li>
    </ul>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Interpretation:</h4>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        The PCA projection is consistent with the K-Means segmentation: four segments show clear spatial separation representing distinct behavioral profiles, while Business Commuters form a general baseline cluster. Because the two components capture only 58.1% of the variance, part of the visible overlap may come from information lost in the projection. Even so, the overlap of Cluster 1 with others suggests these customers could potentially migrate to other segments with targeted marketing interventions.
    </p>
</div>
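
To make the loading interpretation concrete: a customer's position on a principal component is the dot product of their centered feature vector with that component's loadings. The sketch below uses illustrative loadings and z-scores, not the fitted values from the notebook, and the helper name `pc_score` is hypothetical.

```python
def pc_score(z_scores, loadings):
    """Project one centered feature vector onto a single principal component."""
    return sum(z_scores[name] * loadings[name] for name in loadings)

# Illustrative PC1 loadings (hypothetical, for demonstration only)
pc1_loadings = {
    "redemption_frequency": 0.7,
    "companion_flight_ratio": 0.6,
    "flight_regularity": 0.3,
    "distance_variability": 0.2,
}

# Two hypothetical customers expressed as z-scores
engaged = {"redemption_frequency": 2.0, "companion_flight_ratio": 1.0,
           "flight_regularity": 0.5, "distance_variability": 0.0}
disengaged = {"redemption_frequency": -1.0, "companion_flight_ratio": -1.0,
              "flight_regularity": -0.5, "distance_variability": 0.0}

print(pc_score(engaged, pc1_loadings))     # positive: right-hand side of the PC1 axis
print(pc_score(disengaged, pc1_loadings))  # negative: left-hand side of the PC1 axis
```

scikit-learn's `PCA.transform` performs the same projection for all components at once after subtracting the feature means it learned during fitting.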

### Step 5: t-SNE Visualization

In [None]:
# Step 5: t-SNE with different perplexity values

perplexity_values = [5, 20, 40]
tsne_results = {}

for perp in perplexity_values:
    print(f"Computing t-SNE with perplexity={perp}...")
    tsne = TSNE(
        n_components=2,
        perplexity=perp,
        random_state=42,
        max_iter=1000,
        learning_rate='auto',
        init='pca'
    )
    tsne_results[perp] = tsne.fit_transform(df_final[behavioral_feats])

print("Done!")


In [None]:
# Visualize t-SNE perplexity comparison
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

colors = cm.tab10(np.linspace(0, 1, 10))

for idx, perp in enumerate(perplexity_values):
    ax = axes[idx]
    X_tsne = tsne_results[perp]
    
    for label in sorted(df_final['Cluster'].unique()):
        mask = df_final['Cluster'] == label
        ax.scatter(
            X_tsne[mask, 0],
            X_tsne[mask, 1],
            c=[colors[label]],
            alpha=0.6,
            s=20,
            edgecolors='none',
            label=f'Cluster {label}' if idx == 0 else ''
        )
    
    ax.set_title(f't-SNE (perplexity={perp})', fontweight='bold', fontsize=12)
    ax.set_xlabel('t-SNE 1', fontweight='bold')
    ax.set_ylabel('t-SNE 2', fontweight='bold')
    ax.grid(True, alpha=0.3)

# Add legend to first subplot
axes[0].legend(loc='best', framealpha=1, title='Cluster', fontsize=9, markerscale=1.5)

plt.suptitle('t-SNE: Effect of Perplexity on Cluster Visualization', fontweight='bold', fontsize=14, y=1.02)
plt.tight_layout()
plt.show()

In [None]:
# t-SNE Stability Test - multiple runs with same perplexity
perp_test = 30
random_seeds = [42, 123, 456]
stability_results = {}

print(f"Testing t-SNE stability with perplexity={perp_test}\n")
for seed in random_seeds:
    print(f"Run with random_state={seed}...")
    tsne = TSNE(
        n_components=2,
        perplexity=perp_test,
        random_state=seed,
        max_iter=1000,
        learning_rate='auto',
        init='pca'
    )
    stability_results[seed] = tsne.fit_transform(df_final[behavioral_feats])

print("Done!")

In [None]:
# Visualize stability comparison
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

colors = cm.tab10(np.linspace(0, 1, 10))

for idx, seed in enumerate(random_seeds):
    ax = axes[idx]
    X_tsne = stability_results[seed]
    
    for label in sorted(df_final['Cluster'].unique()):
        mask = df_final['Cluster'] == label
        ax.scatter(
            X_tsne[mask, 0],
            X_tsne[mask, 1],
            c=[colors[label]],
            alpha=0.6,
            s=20,
            edgecolors='none',
            label=f'Cluster {label}' if idx == 0 else ''
        )
    
    ax.set_title(f't-SNE Run {idx+1} (seed={seed})', fontweight='bold', fontsize=12)
    ax.set_xlabel('t-SNE 1', fontweight='bold')
    ax.set_ylabel('t-SNE 2', fontweight='bold')
    ax.grid(True, alpha=0.3)

axes[0].legend(loc='best', framealpha=1, title='Cluster', fontsize=9, markerscale=1.5)

plt.suptitle(f't-SNE Stability Test (perplexity={perp_test})', fontweight='bold', fontsize=14, y=1.02)
plt.tight_layout()
plt.show()

<div style="background-color: #e8f3e5ff; padding: 15px; margin-right: 30px; border-left: 5px solid; border-image: linear-gradient(to bottom, #00411E, #00622D, #00823C, #45AF28, #82BA72) 1; border-radius: 5px; max-width: 95%;">
    <h3 style="margin-top: 0; margin-bottom: 10px; color: #00411E; font-weight: bold;">t-SNE Visualization Summary</h3>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Methodology - How t-SNE Differs from PCA:</h4>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        t-SNE (t-Distributed Stochastic Neighbor Embedding) is a <strong>non-linear</strong> dimensionality reduction technique that preserves <strong>local neighborhood structure</strong>. Unlike PCA, which finds linear combinations that maximize variance, t-SNE focuses on keeping similar points close together in the low-dimensional space. The axes in t-SNE have <strong>no interpretable meaning</strong>; only the relative distances between nearby points matter.
    </p>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Perplexity Parameter (5, 20, 40):</h4>
    <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; color: #000;">
        <li style="margin-bottom: 8px;"><strong>Perplexity=5 (too local):</strong> Clusters are heavily fragmented and mixed. The algorithm focuses on too few neighbors, creating artificial micro-clusters and losing all global structure. Worst separation.</li>
        <li style="margin-bottom: 8px;"><strong>Perplexity=20 (balanced):</strong> Clusters begin to separate into distinct regions. Sub-cluster structure becomes visible within segments, revealing potential heterogeneity (e.g., Cluster 1 and 3 show internal groupings).</li>
        <li style="margin-bottom: 8px;"><strong>Perplexity=40 (more global):</strong> Best overall cluster separation. Clusters form cohesive groups with clearer boundaries. Cluster 4 (purple) and Cluster 0 (blue) are well-separated; Cluster 1 (orange) remains distributed across the space.</li>
    </ul>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Stability Test (perplexity=30, different seeds):</h4>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        The three runs with different random seeds reveal a <strong>critical limitation</strong>: cluster positions change dramatically between runs. In Run 1, Cluster 4 (purple) appears on the right; in Run 2, it moves to the bottom; in Run 3, it shifts again. However, <strong>local groupings within clusters remain consistent</strong> - the same customers stay together, only their global position changes. This confirms t-SNE preserves local structure but not global distances.
    </p>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Limitations:</h4>
    <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; color: #000;">
        <li style="margin-bottom: 5px;"><strong>Seed-dependent:</strong> Different random seeds produce different layouts; a run is reproducible only when random_state is fixed.</li>
        <li style="margin-bottom: 5px;"><strong>No interpretable axes:</strong> Cannot explain what "t-SNE 1" or "t-SNE 2" represent.</li>
        <li style="margin-bottom: 5px;"><strong>Global distances unreliable:</strong> Distance between clusters does not reflect true similarity.</li>
        <li style="margin-bottom: 5px;"><strong>Computationally expensive:</strong> Slower than PCA, especially for large datasets.</li>
        <li style="margin-bottom: 5px;"><strong>Cannot project new data:</strong> Requires full recomputation for new customers.</li>
    </ul>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Key Insight:</h4>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        t-SNE confirms that clusters have meaningful <strong>local cohesion</strong> - customers within each segment are genuinely similar to each other. The consistent finding across all perplexity values that <strong>Cluster 1 (Business Commuters) remains distributed</strong> throughout the space reinforces the PCA finding: this segment represents a general baseline without extreme behavioral characteristics.
    </p>
</div>
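
The stability observation above (global positions move, local groupings persist) can be checked numerically rather than visually. One option, sketched here as an assumption and not something the notebook does, is to compare each point's k nearest neighbours across two embeddings. The helper names and toy coordinates are hypothetical.

```python
import math

def knn_sets(points, k):
    """Return, for every point, the index set of its k nearest neighbours (brute force)."""
    result = []
    for i, p in enumerate(points):
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: math.dist(p, points[j]))
        result.append(set(others[:k]))
    return result

def neighbour_overlap(emb_a, emb_b, k=2):
    """Mean fraction of shared k-nearest neighbours between two embeddings."""
    sets_a, sets_b = knn_sets(emb_a, k), knn_sets(emb_b, k)
    return sum(len(a & b) / k for a, b in zip(sets_a, sets_b)) / len(emb_a)

# Toy embedding: two tight groups of three points each
run_1 = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]

# Second "run": the same layout rotated by 90 degrees, so global positions differ
run_2 = [(-y, x) for x, y in run_1]

# Third "run": the same coordinates assigned to different points, so neighbourhoods change
run_3 = [run_1[i] for i in (0, 3, 1, 4, 2, 5)]

print(neighbour_overlap(run_1, run_2, k=2))
print(neighbour_overlap(run_1, run_3, k=2))
```

An overlap near 1 indicates that local neighbourhoods are preserved even though the picture looks different. In the notebook's setting the inputs would be the arrays stored in `stability_results`; for large customer counts a tree-based search such as scikit-learn's `NearestNeighbors` would be more practical than this brute-force loop.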

### Step 6: UMAP Visualization

In [None]:
# Step 6: UMAP with different n_neighbors values

n_neighbors_values = [5, 15, 50]
umap_results = {}

print("Running UMAP with different n_neighbors values...\n")

for n_neigh in n_neighbors_values:
    print(f"Computing UMAP with n_neighbors={n_neigh}...")
    reducer = umap.UMAP(
        n_components=2,
        n_neighbors=n_neigh,
        min_dist=0.1,
        random_state=42,
        n_jobs=1
    )
    umap_results[n_neigh] = reducer.fit_transform(df_final[behavioral_feats])

print("Done!")

In [None]:
# Visualize UMAP n_neighbors comparison
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

colors = cm.tab10(np.linspace(0, 1, 10))

for idx, n_neigh in enumerate(n_neighbors_values):
    ax = axes[idx]
    X_umap = umap_results[n_neigh]
    
    for label in sorted(df_final['Cluster'].unique()):
        mask = df_final['Cluster'] == label
        ax.scatter(
            X_umap[mask, 0],
            X_umap[mask, 1],
            c=[colors[label]],
            alpha=0.6,
            s=20,
            edgecolors='none',
            label=f'Cluster {label}' if idx == 0 else ''
        )
    
    ax.set_title(f'UMAP (n_neighbors={n_neigh})', fontweight='bold', fontsize=12)
    ax.set_xlabel('UMAP 1', fontweight='bold')
    ax.set_ylabel('UMAP 2', fontweight='bold')
    ax.grid(True, alpha=0.3)

axes[0].legend(loc='best', framealpha=1, title='Cluster', fontsize=9, markerscale=1.5)

plt.suptitle('UMAP: Effect of n_neighbors on Cluster Visualization', fontweight='bold', fontsize=14, y=1.02)
plt.tight_layout()
plt.show()

In [None]:
# UMAP with different min_dist values
min_dist_values = [0.0, 0.1, 0.5]
umap_mindist_results = {}

print("Running UMAP with different min_dist values...\n")

for min_d in min_dist_values:
    print(f"Computing UMAP with min_dist={min_d}...")
    reducer = umap.UMAP(
        n_components=2,
        n_neighbors=15,
        min_dist=min_d,
        random_state=42,
        n_jobs=1
    )
    umap_mindist_results[min_d] = reducer.fit_transform(df_final[behavioral_feats])

print("Done!")

In [None]:
# Visualize min_dist comparison
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

colors = cm.tab10(np.linspace(0, 1, 10))

for idx, min_d in enumerate(min_dist_values):
    ax = axes[idx]
    X_umap = umap_mindist_results[min_d]
    
    for label in sorted(df_final['Cluster'].unique()):
        mask = df_final['Cluster'] == label
        ax.scatter(
            X_umap[mask, 0],
            X_umap[mask, 1],
            c=[colors[label]],
            alpha=0.6,
            s=20,
            edgecolors='none',
            label=f'Cluster {label}' if idx == 0 else ''
        )
    
    ax.set_title(f'UMAP (min_dist={min_d})', fontweight='bold', fontsize=12)
    ax.set_xlabel('UMAP 1', fontweight='bold')
    ax.set_ylabel('UMAP 2', fontweight='bold')
    ax.grid(True, alpha=0.3)

axes[0].legend(loc='best', framealpha=1, title='Cluster', fontsize=9, markerscale=1.5)

plt.suptitle('UMAP: Effect of min_dist on Cluster Visualization', fontweight='bold', fontsize=14, y=1.02)
plt.tight_layout()
plt.show()

<div style="background-color: #e8f3e5ff; padding: 15px; margin-right: 30px; border-left: 5px solid; border-image: linear-gradient(to bottom, #00411E, #00622D, #00823C, #45AF28, #82BA72) 1; border-radius: 5px; max-width: 95%;">
    <h3 style="margin-top: 0; margin-bottom: 10px; color: #00411E; font-weight: bold;">UMAP Visualization Summary</h3>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Methodology:</h4>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        UMAP (Uniform Manifold Approximation and Projection) is a non-linear dimensionality reduction technique that balances local and global structure. Like t-SNE it is stochastic, so the runs here are reproducible only because random_state is fixed (with n_jobs=1); it is typically faster than t-SNE.
    </p>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">n_neighbors Parameter (5, 15, 50):</h4>
    <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; color: #000;">
        <li style="margin-bottom: 8px;"><strong>n_neighbors=5:</strong> Clusters form horizontal band-like structures stretched across the space.</li>
        <li style="margin-bottom: 8px;"><strong>n_neighbors=15:</strong> Clusters form distinct, separated blob-like groups with clearer boundaries.</li>
        <li style="margin-bottom: 8px;"><strong>n_neighbors=50:</strong> Clusters become round, cohesive blobs that connect to each other in a chain-like pattern.</li>
    </ul>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">min_dist Parameter (0.0, 0.1, 0.5):</h4>
    <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; color: #000;">
        <li style="margin-bottom: 8px;"><strong>min_dist=0.0:</strong> Tightest clustering - points pack closely together within each cluster.</li>
        <li style="margin-bottom: 8px;"><strong>min_dist=0.1:</strong> Similar to 0.0, clusters remain compact with good separation.</li>
        <li style="margin-bottom: 8px;"><strong>min_dist=0.5:</strong> Clusters spread out significantly and begin to overlap, boundaries less clear.</li>
    </ul>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Key Findings:</h4>
    <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; color: #000;">
        <li style="margin-bottom: 8px;"><strong>Cluster 0 (Family Travelers, blue):</strong> Consistently positioned in the center, surrounded by other clusters. This central position indicates behavioral overlap with multiple segments - families share characteristics with other traveler types.</li>
        <li style="margin-bottom: 8px;"><strong>Cluster 4 (Engaged Loyalists, purple):</strong> The only cluster that forms its own isolated group across all parameter settings. This confirms Engaged Loyalists are behaviorally the most distinct segment with unique characteristics (high redemption frequency) that separate them from all other customers.</li>
    </ul>
</div>

### Step 7: Multi-Method Comparison

In [None]:
# Side-by-side comparison: PCA, t-SNE, UMAP
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

colors = cm.tab10(np.linspace(0, 1, 10))

# PCA
for label in sorted(df_final['Cluster'].unique()):
    mask = df_final['Cluster'] == label
    axes[0].scatter(
        df_final.loc[mask, 'pca_1'],
        df_final.loc[mask, 'pca_2'],
        c=[colors[label]],
        alpha=0.6,
        s=20,
        edgecolors='none',
        label=f'Cluster {label}'
    )
axes[0].set_title('PCA: Global Structure', fontweight='bold', fontsize=12)
axes[0].set_xlabel(f'PC1 ({var_explained[0]:.1%})', fontweight='bold')
axes[0].set_ylabel(f'PC2 ({var_explained[1]:.1%})', fontweight='bold')
axes[0].grid(True, alpha=0.3)

# t-SNE (perplexity=20)
X_tsne_best = tsne_results[20]  # available keys: 5, 20, 40 (perplexity=30 was only used in the stability test)
for label in sorted(df_final['Cluster'].unique()):
    mask = df_final['Cluster'] == label
    axes[1].scatter(
        X_tsne_best[mask, 0],
        X_tsne_best[mask, 1],
        c=[colors[label]],
        alpha=0.6,
        s=20,
        edgecolors='none'
    )
axes[1].set_title('t-SNE: Local Structure (perplexity=20)', fontweight='bold', fontsize=12)
axes[1].set_xlabel('t-SNE 1', fontweight='bold')
axes[1].set_ylabel('t-SNE 2', fontweight='bold')
axes[1].grid(True, alpha=0.3)

# UMAP (n_neighbors=15, min_dist=0.1)
X_umap_best = umap_results[15]
for label in sorted(df_final['Cluster'].unique()):
    mask = df_final['Cluster'] == label
    axes[2].scatter(
        X_umap_best[mask, 0],
        X_umap_best[mask, 1],
        c=[colors[label]],
        alpha=0.6,
        s=20,
        edgecolors='none'
    )
axes[2].set_title('UMAP: Balanced View (n_neighbors=15)', fontweight='bold', fontsize=12)
axes[2].set_xlabel('UMAP 1', fontweight='bold')
axes[2].set_ylabel('UMAP 2', fontweight='bold')
axes[2].grid(True, alpha=0.3)

# Legend
handles, labels = axes[0].get_legend_handles_labels()
fig.legend(handles, labels, loc='center right', bbox_to_anchor=(0.98, 0.5),
           framealpha=1, title='Cluster', fontsize=10, markerscale=2)

plt.suptitle('Method Comparison: Same 5 Segments, Different Views', fontweight='bold', fontsize=14, y=1.02)
plt.tight_layout()
plt.subplots_adjust(right=0.88)
plt.show()

<div style="background-color: #e8f3e5ff; padding: 15px; margin-right: 30px; border-left: 5px solid; border-image: linear-gradient(to bottom, #00411E, #00622D, #00823C, #45AF28, #82BA72) 1; border-radius: 5px; max-width: 95%;">
    <h3 style="margin-top: 0; margin-bottom: 10px; color: #00411E; font-weight: bold;">Multi-Method Comparison Summary</h3>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Where do the methods agree? And disagree?</h4>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        <strong>Agreement:</strong> All three methods show Cluster 4 (Engaged Loyalists, purple) as the most concentrated cluster, and Cluster 1 (Business Commuters, orange) consistently in the center overlapping with other segments.<br><br>
        <strong>Disagreement:</strong> PCA shows heavy overlap in the center. t-SNE fragments everything into scattered sub-groups. UMAP forms three distinct blobs but with mixed cluster membership.
    </p>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Where can you find overlap? Sub-clusters? Isolated clusters?</h4>
    <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; color: #000;">
        <li style="margin-bottom: 5px;"><strong>Overlap:</strong> In UMAP, Clusters 0, 1, 2, and 3 appear in all three blobs. In PCA, all clusters overlap in the center.</li>
        <li style="margin-bottom: 5px;"><strong>Sub-clusters:</strong> t-SNE fragments all clusters into multiple scattered groups, revealing internal heterogeneity.</li>
        <li style="margin-bottom: 5px;"><strong>Concentrated cluster:</strong> Cluster 4 (purple) is the only cluster that appears in just one UMAP blob.</li>
    </ul>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">What areas represent specific behaviors? (PCA)</h4>
    <ul style="margin: 10px 40px 10px 20px; padding-left: 20px; color: #000;">
        <li style="margin-bottom: 5px;"><strong>High PC1 (right):</strong> High redemption + high companions &rarr; Cluster 4 (Engaged Loyalists)</li>
        <li style="margin-bottom: 5px;"><strong>High PC2 (top):</strong> High distance variability &rarr; Cluster 3 (Explorers)</li>
        <li style="margin-bottom: 5px;"><strong>Low PC2 (bottom):</strong> High companion ratio &rarr; Cluster 0 (Family Travelers)</li>
    </ul>
    <h4 style="margin-top: 15px; margin-bottom: 8px; color: #00622D; font-weight: bold;">Cross-Selling and Upselling Opportunities:</h4>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        The visualizations provide insights into which customers are most likely to respond to cross-selling and upselling campaigns. The key principle: customers who are positioned close to another cluster or in overlapping regions are behaviorally similar to that cluster and therefore more likely to adopt those behaviors with targeted marketing.
    </p>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        Clusters that appear in the center (like Cluster 1 in PCA) or spread across multiple blobs (like Clusters 0, 1, 2, 3 in UMAP) indicate customers who share characteristics with multiple segments. These are prime targets for behavioral migration campaigns. Conversely, isolated clusters (like Cluster 4) represent customers with unique, distinct behaviors who are less likely to shift but more valuable to retain.
    </p>
    <p style="margin: 10px 0; margin-right: 40px; color: #000;">
        <strong>Detailed segment-specific strategies will be developed in Section 9.5.</strong>
    </p>
</div>
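
The PCA reading in the summary above (which features drive PC1 and PC2) can be checked against the component loadings. The following is a minimal sketch on synthetic stand-in data, not the notebook's actual features or results. The four feature names are taken from the profiling step further down, and it is an assumption that the PCA plot was built on exactly these features; `X_demo`, `pca_demo`, and `loadings` are hypothetical names.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Synthetic stand-in for the scaled behavioral features (illustrative only).
rng = np.random.default_rng(42)
features = [
    "distance_variability",
    "companion_flight_ratio",
    "flight_regularity",
    "redemption_frequency",
]
X_demo = pd.DataFrame(rng.normal(size=(500, 4)), columns=features)

pca_demo = PCA(n_components=2, random_state=42).fit(X_demo)

# Rows = original features, columns = principal components.
loadings = pd.DataFrame(
    pca_demo.components_.T, index=features, columns=["PC1", "PC2"]
)
print(loadings.round(2))

# Feature with the largest absolute weight on each component.
print(loadings.abs().idxmax())
```

On the real data, a large positive loading of a feature on PC1 would support reading the right side of the plot as "high" in that feature; the sign of a component is arbitrary, so only the relative direction of features matters.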

### Step 9: Demographic Profiling (Step 8 still to come)

In [None]:
# Step 9: Add behavioral and demographic cluster labels to df_Customer
# The indices of df_behavioral_clean and df_demographic_a_scaled match df_Customer's index

# Behavioral cluster labels
df_profiling = df_Customer.copy()
df_profiling['Behavioral_Cluster'] = df_behavioral_clean['km_cluster'].reindex(df_profiling.index)

# Demographic cluster labels
df_profiling['Demographic_Cluster'] = df_demographic_a_scaled['km_cluster'].reindex(df_profiling.index)

# Drop rows without behavioral cluster (outliers removed by DBSCAN)
df_profiling = df_profiling.dropna(subset=['Behavioral_Cluster'])
df_profiling['Behavioral_Cluster'] = df_profiling['Behavioral_Cluster'].astype(int)
# Nullable integer dtype: a plain astype(int) raises if any demographic label is missing
df_profiling['Demographic_Cluster'] = df_profiling['Demographic_Cluster'].astype('Int64')

print(f"Profiling DataFrame shape: {df_profiling.shape}")
print(f"\nBehavioral Clusters:")
print(df_profiling['Behavioral_Cluster'].value_counts().sort_index())
print(f"\nDemographic Clusters:")
print(df_profiling['Demographic_Cluster'].value_counts().sort_index())
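
With both label columns on one DataFrame, a row-normalized cross-tabulation shows how each behavioral cluster is distributed across the demographic clusters. The following is a minimal sketch on randomly generated labels: the five behavioral clusters (0 to 4) follow the summary above, while the three demographic clusters and the name `df_demo` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Random stand-in labels (illustrative only, not the notebook's results).
rng = np.random.default_rng(0)
df_demo = pd.DataFrame({
    "Behavioral_Cluster": rng.integers(0, 5, size=1000),
    "Demographic_Cluster": rng.integers(0, 3, size=1000),
})

# normalize="index" makes every behavioral cluster's row sum to 1,
# so rows can be compared even when cluster sizes differ.
ct = pd.crosstab(
    df_demo["Behavioral_Cluster"],
    df_demo["Demographic_Cluster"],
    normalize="index",
)
print(ct.round(2))
```

Applied to `df_profiling`, the same call would use its `Behavioral_Cluster` and `Demographic_Cluster` columns directly.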

Add the scaled behavioral features to `df_profiling` and run PCA for the 3D visualization.

In [None]:
# Get the scaled behavioral features for profiling
df_profiling["distance_variability_scaled"] = df_behavioral_clean["distance_variability"].reindex(df_profiling.index)
df_profiling["companion_flight_ratio_scaled"] = df_behavioral_clean["companion_flight_ratio"].reindex(df_profiling.index)
df_profiling["flight_regularity_scaled"] = df_behavioral_clean["flight_regularity"].reindex(df_profiling.index)
df_profiling["redemption_frequency_scaled"] = df_behavioral_clean["redemption_frequency"].reindex(df_profiling.index)

behavioral_feats_scaled = [
    "distance_variability_scaled",
    "companion_flight_ratio_scaled",
    "flight_regularity_scaled",
    "redemption_frequency_scaled"
]

In [None]:
# Project the scaled behavioral features onto 3 principal components for 3D visualization
pca = PCA(n_components=3, random_state=42)
X_pca_final = pca.fit_transform(df_profiling[behavioral_feats_scaled])

# Add to dataframe
df_profiling['pca_1'] = X_pca_final[:, 0]
df_profiling['pca_2'] = X_pca_final[:, 1]
df_profiling['pca_3'] = X_pca_final[:, 2]

# Save as CSV
df_profiling.to_csv('data/clustering_data/customer_segmentation_profiles.csv', index=True)

# Drop the scaled columns to clean up
df_profiling.drop(columns=behavioral_feats_scaled, inplace=True)
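
Before relying on a 3D PCA plot, it helps to check how much of the variance the three components retain. The following is a minimal sketch on synthetic data with four features, standing in for the scaled behavioral columns; `X_demo` and `pca_demo` are hypothetical names, and the printed numbers say nothing about the real data.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for four scaled features (illustrative only).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 4))

pca_demo = PCA(n_components=3, random_state=42).fit(X_demo)

# Share of total variance captured by each component, in descending order.
explained = pca_demo.explained_variance_ratio_
print("Per component:", explained.round(3))
print("Cumulative:   ", explained.cumsum().round(3))
```

In the notebook, the same attribute is available on the fitted `pca` object as `pca.explained_variance_ratio_`.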

## 9.4 Value Mapping

## 9.5 Strategic Recommendations

In [None]:
# Open to-dos for later: once the final clustering is ready, add more in-depth cluster
# visualization and profiling, plus a decision tree classifier for feature importance

# for feature importance
'''
- Train a Decision Tree Classifier model over the cluster labels
    - Estimate the respective feature importance (Gini importance)
    - Apply the trained model to classify multivariate outliers  
'''


# For cluster profiling visualizations:
# plot single variables to visualize features across clusters
# (example from the tutorial)
'''
pd.crosstab(df['merged_labels'], df['education'], normalize='index').plot.bar(
    stacked=True,
    figsize=(10, 6)
)
plt.title("Education Level Distribution by Merged Segment", fontsize=14)
plt.xlabel("Merged Segment")
plt.ylabel("Proportion")
plt.legend(title="Education Level", bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()
'''
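
The decision tree to-do above could look roughly like the sketch below. It runs on synthetic data with K-Means labels as a stand-in for the final clustering; the feature names are borrowed from the profiling step, and `X_demo`, `labels_demo`, `X_outliers`, the tree depth, and the split ratio are all assumptions for illustration rather than the notebook's settings.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in features and cluster labels (illustrative only).
rng = np.random.default_rng(42)
features = [
    "distance_variability",
    "companion_flight_ratio",
    "flight_regularity",
    "redemption_frequency",
]
X_demo = pd.DataFrame(rng.normal(size=(600, 4)), columns=features)
labels_demo = KMeans(n_clusters=5, n_init=10, random_state=42).fit_predict(X_demo)

# Hold out part of the data to see how well the tree reproduces the labels.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, labels_demo, test_size=0.2, random_state=42, stratify=labels_demo
)
tree = DecisionTreeClassifier(max_depth=4, random_state=42).fit(X_train, y_train)
print(f"Hold-out accuracy: {tree.score(X_test, y_test):.2f}")

# Gini importance: normalized total impurity reduction per feature.
importances = pd.Series(tree.feature_importances_, index=features)
print(importances.sort_values(ascending=False).round(3))

# Assign previously excluded multivariate outliers to the closest-matching cluster.
X_outliers = pd.DataFrame(rng.normal(scale=3.0, size=(5, 4)), columns=features)
outlier_labels = tree.predict(X_outliers)
print(outlier_labels)
```

A shallow tree keeps the rules readable; if its hold-out accuracy is low, the importances describe the clusters only loosely and should be read with caution.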



**Cluster 1 (Business Commuters) as Gateway:** In PCA, Cluster 1 (orange) sits in the center of the plot, surrounded by and overlapping with all other clusters. In UMAP, orange points appear in all three blobs. This central positioning indicates Business Commuters share behavioral characteristics with every other segment, making them the natural "gateway" cluster. Customers likely transition through this segment when changing travel patterns, so cross-selling different behaviors (more companions, more redemptions, diverse destinations) has the highest success probability here.

**Cluster 4 (Engaged Loyalists), Protect and Retain:** Purple points concentrate in one area (PCA right side, UMAP bottom blob only). This isolation shows Engaged Loyalists have unique behavioral patterns distinct from all other customers. Focus should be on retention rather than cross-selling, since these customers already exhibit the desired high-engagement behavior.

**Cluster 2 (Disengaged Solo), Re-activation Target:** Green points spread across all areas but consistently show low engagement. The overlap with other clusters in the visualizations suggests these customers could exhibit other behaviors but currently don't. Re-activation campaigns should target moving them toward Cluster 1 (Business Commuters) first as the easiest behavioral shift.
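
The idea that customers near another cluster are the best migration targets can be made concrete by comparing each customer's distance to their own centroid with the distance to the nearest other centroid. The following is a minimal sketch of one possible way to do this, on synthetic 2D data with three clusters. The ratio threshold of 0.8 and all names (`X_demo`, `labels_demo`, `is_candidate`) are hypothetical choices, not part of the analysis above.

```python
import numpy as np

# Synthetic 2D data with three clusters (illustrative only).
rng = np.random.default_rng(7)
centers = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
X_demo = np.vstack([c + rng.normal(size=(100, 2)) for c in centers])
labels_demo = np.repeat(np.arange(3), 100)

# Centroid of each cluster, computed from the labels.
centroids = np.vstack([X_demo[labels_demo == k].mean(axis=0) for k in range(3)])

# Distance from every point to every centroid: shape (n_points, n_clusters).
dists = np.linalg.norm(X_demo[:, None, :] - centroids[None, :, :], axis=2)

rows = np.arange(len(X_demo))
own = dists[rows, labels_demo]

# Mask out the own cluster, then find the nearest other centroid.
others = dists.copy()
others[rows, labels_demo] = np.inf
nearest_other = others.argmin(axis=1)

# Ratio near or above 1: the point lies about as close to another cluster as to its own.
ratio = own / others.min(axis=1)
is_candidate = ratio > 0.8  # hypothetical threshold

print(f"Boundary customers: {is_candidate.sum()} of {len(X_demo)}")
print("Target clusters:", np.bincount(nearest_other[is_candidate], minlength=3))
```

Computing the distances in the scaled feature space used for clustering, rather than in the 2D projections, avoids artifacts from t-SNE or UMAP, whose between-blob distances are not directly interpretable.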

---