# 1. Clustering-Based Anomaly Detection: K-Means

## Concept & Theory

The core idea behind using k-means for anomaly detection is to group data points into clusters. Normal points tend to lie close to the center of a cluster (its centroid), while anomalies lie far from every centroid. The distance from a point to its nearest centroid can therefore serve as an anomaly score. The method has three steps:

1. **Clustering**: The k-means algorithm partitions the data into *k* clusters by minimizing the within-cluster sum of squares (WCSS):

    $$ \text{WCSS} = \sum_{i=1}^{k} \sum_{x \in C_i} ||x - \mu_i||^2 $$

    where $C_i$ is the $i$-th cluster and $\mu_i$ is its centroid.

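2. **Anomaly Score**: For any data point $x$, its anomaly score is its Euclidean distance to the nearest centroid:

    $$ \text{Anomaly Score}(x) = \min_{i=1,\dots,k} ||x - \mu_i|| $$

    Using the squared distance $||x - \mu_i||^2$ instead gives the same ranking of points, because squaring is monotonic for non-negative values.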

3. **Thresholding**: A threshold on the anomaly scores separates anomalies from normal points. A common approach is to use a high quantile of the scores, for example flagging every point whose score exceeds the 95th or 99th percentile.
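
The three steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a definitive implementation: the synthetic data, the choice of $k = 3$, the 99th-percentile threshold, and the names `X_train`, `X_new`, `train_scores`, and `flags` are all made up for the example. It also fits on a reference set and scores separate new points, which is one possible variant of the approach.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Hypothetical reference data: three tight 2-D groups of 100 points each.
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X_train = np.vstack([c + 0.3 * rng.standard_normal((100, 2)) for c in centers])

# Step 1: clustering. k = 3 is assumed to be known here.
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_train)

# Step 2: anomaly score. transform() returns the Euclidean distance from each
# sample to every centroid; the minimum over centroids is the score.
train_scores = model.transform(X_train).min(axis=1)

# Sanity check: the squared scores on the fitted data sum to the WCSS,
# which scikit-learn exposes as inertia_.
wcss = np.sum(train_scores ** 2)

# Step 3: thresholding at an arbitrarily chosen 99th percentile.
threshold = np.quantile(train_scores, 0.99)

# Score two illustrative new points: one near a group, one far from all groups.
X_new = np.array([[0.1, -0.2], [20.0, 20.0]])
new_scores = model.transform(X_new).min(axis=1)
flags = new_scores > threshold

print(f"WCSS from scores: {wcss:.3f}, inertia_: {model.inertia_:.3f}")
print("new scores:", new_scores, "flagged:", flags)
```

The point at `(20, 20)` lies far outside the region covered by the reference data, so its score exceeds the threshold. Fitting on data that already contains extreme outliers can behave differently, since k-means may pull a centroid toward an outlier or give it a cluster of its own.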

## Python Implementation

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs