# K-Means

## 1. Definition
K-means is an unsupervised clustering algorithm that groups data into k clusters by minimizing the distance between points and their assigned cluster centers (centroids).
It solves the problem of finding hidden structures or groupings in unlabeled data without predefined categories. It models clustering as an optimization problem to minimize the variance within each cluster.

## 2. Core Idea
Many datasets have hidden structure: customers with similar behaviour, products with similar attributes or documents with similar topics. K-means assumes:
* The "center" (mean) of a cluster is the best representative of that group.
* Points closer to each other are more similar than points far apart (Euclidean distance hypothesis).
* Within-cluster distance should be small, between-cluster distance should be large.


## 3. Mechanism
K-means built on an iterative process known as Expectation-Maximization (EM). The process starts from
1. Randomly assign k random points as intial centriods
2. Assign every data point to the nearest centroid based on Euclidean distance
3. Move the centroid to the average position of all the points assigned to it.
4. Repeat steps 2 and 3 until centroids stop moving or a maximum number of iteration is reached.


## 4. Mathematical Details and Training
The objective function is to minimize within-cluster sum of squared distances (WCSS):
$$ J = \sum_{i=1}^k \sum_{x \in C_i} ||x - \mu_i||^2$$

There is no gradient descent with a learning rate. Instead, it solves the minimization analytically in two steps:
* Assignment: minimize distance
* Update: recompute mean


## 5. Pros and Cons (trade-offs)
* Pros:
    * Very fast $O(n \cdot K \cdot I \cdot d)$ (where $I$ is iterations, $d$ is dimensions).
    * Scales to millions of points: Easily adapts to new examples.
    * Performs well on spherical, evenly sized clusters
    * Interpretability: Centroids gives a clear archetype for the cluster
* Cons:
    * Requires k upfront
    * Sensitive to initialization (local minimum)
    * Struggles with non-spherical clusters e.g. moons
    * Sensitive to outliers
    * High-dimensional data can collapse distances 


## 6. Production Consideration
* Use k-means ++ to avoid bad starting positions
* Choosing k: use elbow method, silhouette score
* Scaling: must standardize features
* Large scale training: Use minibatch k-means (streaming friendly)
* Deployment: Store final centroids → predictions are cheap (just nearest centroid).
* Memory considerations: Only centroids stored → tiny model footprint.
    * Monitoring:
    * Drift: cluster shapes and centroids shift over time
    * Re-run clustering periodically on new data


## 7. Other Variants
1. K-means++ (Better Initialization)
* Purpose: Reduce sensitivity to random initialization.
* Key idea: Choose initial centroids far apart using a probabilistic approach.
* Benefit: Faster convergence, better cluster quality.


2. Mini-Batch K-means (For large-scale data)
* Purpose: Handle datasets with millions of points.
* Key idea: Uses small random batches of data to update centroids.
* Benefit: Much faster; slightly noisier but good approximation.


3. Kernel K-means (Non-linear clusters)
* Purpose: Handle non-spherical clusters.
* Key idea: Map data into a high-dimensional feature space via kernels.
* Benefit: Similar to kernel SVM → can separate arbitrary shapes.

In [8]:
import sys, os
root = os.path.abspath("..")

sys.path.append(root)

import numpy as numpy
import pandas as pd
from src.preprocessing import StandardScaler
from src.k_means import KMeans
import matplotlib.pyplot as plt

## Case Study:  Dynamic Pricing

### Business Problem
You are working at a large e-commerce / marketplace, managing over 50,000 SKUs. The pricing strategy was mostly rule-based - fixed markups, generic discounts, and ad-hoc promotions.
There are some major issues:
* Some products were constantly discounted unnecessarily, hurting margin.
* Others were extremely price-sensitive, and small price changes caused demand to collapse.
* Others barely react to price changes at all.

Management wanted a scalable dynamic pricing system, but
* Per-SKU price elasticity is noisy and hard to estimate.
* They need something simple, interpretable, and scalable.


### Data Problem
Our goals are to:
1.	Identify product behavioral segments that reflect price sensitivity, promo dependency, seasonality, and volatility.
2.	Build a clustering model that can group similar SKUs so that:
    * Each cluster has a clear business meaning
    * Cluster-level elasticity models are stable.
    * Pricing rules can be defined per segment

Strategies: Use k-means to cluster products by demand-price behaviour, then apply cluster-level pricing rules.



### Dataset
You are working with 1 year of transactional data.

For each product $i$:
* Historical prices: $P_={i,t}$
* Units sold: $q_{i,t}$
* Revenue: $r_{i,t} = p_{i,t} * q_{i. t}$
* Discount flags / promo participation
* Category, brand, etc. (optional)


### Feature Engineering
* Price sensitivity / elasticity proxies
    * Correlation between price and log-sales
    * Slope of simple regression: log(sale) ~ price
    * Promotion lift: sales lift during promotions vs non-promotions
    * % of sales happening under discount
* Demand pattern features
    * Average weekly sales
    * Sales volatility $(std/mean)$
    * Seasonality index (peak vs off-peak ratio)
* Margin / economics
    * Average margin %
    * Return rate (for some categories like fashion)
You standardize these features (e.g. z-scores)

In [2]:
df = pd.read_csv(f"{root}/data/kmeans_product_pricing.csv")
df.head()

df

Unnamed: 0,product_id,avg_weekly_sales,volatility,promo_lift,price_slope,margin_pct,seasonality_idx,current_price
0,P001,120,0.12,0.05,-0.10,0.38,1.05,19.99
1,P002,85,0.20,0.30,-0.45,0.25,1.10,14.99
2,P003,15,0.55,1.20,-1.40,0.14,0.95,6.99
3,P004,45,0.18,0.10,-0.20,0.50,1.25,24.99
4,P005,300,0.05,0.02,-0.05,0.40,1.02,9.99
...,...,...,...,...,...,...,...,...
95,P096,6,0.76,1.95,-1.90,0.07,0.78,4.29
96,P097,198,0.10,0.03,-0.08,0.44,1.01,10.59
97,P098,56,0.31,0.39,-0.55,0.31,1.18,13.09
98,P099,21,0.51,0.96,-1.12,0.14,0.90,7.59


### K-means

In [3]:
features = ['avg_weekly_sales', 'volatility', 'promo_lift',
            'price_slope', 'margin_pct', 'seasonality_idx']
X = df[features].values

In [4]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

In [5]:
model = KMeans(n_clusters=5, random_state=101)
model.fit(X_scaled)
df['cluster'] = model.predict(X)

Converged after 3 iteration.


In [6]:
df['cluster'].value_counts()

cluster
0    100
Name: count, dtype: int64

### How Dynamic Pricing Uses the Clusters

For each cluster $c$:
1. Estimate elasticity at cluster level, fit:
$$\log{q_{i,t}} - \beta_{0,c} + \beta_{1,c}p_{i,t}...$$

using data from all products in cluster $c$. This reduce noise vs per-product.

2. Define a pricing objective
* Maximize expected revenue: $p * \hat{q}(p)$
* Subject to constraints: min margin %, price bounds, legal/brand limit

3. Derive price rules per cluster
* Inelastic staples — high margin, low promo dependency
* Highly elastic deal seekers — big response to discounts
* Seasonal products — strong monthly seasonality
* Stable mid-elastic products — predictable baseline
* Promo-dependent “zombie” products — almost no sales without discount
    

4. Online system
* For a product, look up its cluster label.
* Apply corresponding pricing policy.
* Recompute daily / hourly depending on use case.

In [7]:
# Example: define a simple pricing uplift rule based on cluster
cluster_price_uplift = {
    0: 0.03,  # +3% for inelastic
    1: -0.02, # -2% for very elastic (discount)
    2: -0.10, # -10% to clear zombie stock
    3: 0.00,  # no change
    4: 0.05   # +5% in peak season windows
}

df['price_factor'] = df['cluster'].map(cluster_price_uplift)

# new dynamic price (simple placeholder)
df['new_price'] = df['current_price'] * (1 + df['price_factor'])

In [10]:
df.head()

Unnamed: 0,product_id,avg_weekly_sales,volatility,promo_lift,price_slope,margin_pct,seasonality_idx,current_price,cluster,price_factor,new_price
0,P001,120,0.12,0.05,-0.1,0.38,1.05,19.99,0,0.03,20.5897
1,P002,85,0.2,0.3,-0.45,0.25,1.1,14.99,0,0.03,15.4397
2,P003,15,0.55,1.2,-1.4,0.14,0.95,6.99,0,0.03,7.1997
3,P004,45,0.18,0.1,-0.2,0.5,1.25,24.99,0,0.03,25.7397
4,P005,300,0.05,0.02,-0.05,0.4,1.02,9.99,0,0.03,10.2897
