- [k-Means Clustering](#k-means-clustering)
    - [Implementation](#implementation)

# k-Means Clustering

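The key to any clustering problem is how distance, or similarity, is defined.

k-means clustering is in effect a partition of the sample set. The number of clusters k is fixed in advance, the squared Euclidean distance measures the distance between samples, and each cluster is represented by its center, that is, the mean of its samples. The objective function to minimize is the sum of the distances between each sample and the center of the cluster it belongs to.

Partitioning n samples into k clusters is a combinatorial optimization problem that is NP-hard, so a heuristic iterative algorithm is used instead, and it cannot guarantee a global optimum. Each sample belongs to exactly one cluster, which makes this a hard clustering. The resulting clusters are flat rather than hierarchical.

Because the global optimum is not guaranteed, the choice of initial centers directly affects the result: different initial centers can give different clusterings.

The number of clusters k must be specified in advance. In practice a good k can be found quickly, for example with binary search. The quality criterion is the average cluster diameter: in general, the average diameter grows as the number of clusters shrinks, and once the number of clusters exceeds a certain value the average diameter stops changing. That value is the optimal k.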
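
As a rough illustration of the criterion above, the sketch below computes the average cluster diameter for a given labeling. It assumes that the diameter of a cluster is the largest pairwise Euclidean distance between its samples. The function name `average_diameter` and the toy data are hypothetical and not part of the original notebook.

```python
import numpy as np

def average_diameter(X, labels):
    """Mean over clusters of the largest pairwise distance within each cluster."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    diameters = []
    for label in np.unique(labels):
        members = X[labels == label]
        # Pairwise differences via broadcasting: shape (n, n, n_features)
        diffs = members[:, None, :] - members[None, :, :]
        dists = np.sqrt((diffs ** 2).sum(axis=-1))
        diameters.append(dists.max())
    return float(np.mean(diameters))

# Toy data (assumed): two tight groups on a line
X = np.array([[0.0], [1.0], [10.0], [11.0]])
one_cluster = average_diameter(X, [0, 0, 0, 0])
two_clusters = average_diameter(X, [0, 0, 1, 1])
four_clusters = average_diameter(X, [0, 1, 2, 3])
print(one_cluster, two_clusters, four_clusters)
```

On this toy data the average diameter drops from 11.0 with one cluster to 1.0 with two, then to 0.0 when every sample is its own cluster, which matches the trend described above.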

- Distance

$$
\begin{aligned} d\left(x_{i}, x_{j}\right) &=\sum_{k=1}^{m}\left(x_{k i}-x_{k j}\right)^{2} \\ &=\left\|x_{i}-x_{j}\right\|^{2} \end{aligned}
$$

- Loss function

$$
W(C)=\sum_{l=1}^{k} \sum_{C(i)=l}\left\|x_{i}-\bar{x}_{l}\right\|^{2}
$$

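$l$ denotes a cluster, $C$ is the many-to-one function that maps each sample to its cluster, and $\bar{x}_{l}$ is the mean of the samples in cluster $l$.

- Objective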

$$
\min _{C} \sum_{l=1}^{k} \sum_{C(i)=l}\left\|x_{i}-m_{l}\right\|^{2}
$$

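$m_l$ is the center of cluster $l$.

- Algorithm

    1. Choose k cluster centers and assign each sample to the cluster with the nearest center, which gives a clustering

    2. Update the mean of each cluster and use it as the new center

    3. Repeat the assignment and update steps until the clustering no longer changes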
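
The steps above can be sketched in a few lines of vectorized NumPy. This is a minimal illustration under stated assumptions, not the notebook's own implementation: the name `kmeans_sketch`, the `seed` parameter, the rule of keeping the old center when a cluster is empty, and the toy data are all hypothetical.

```python
import numpy as np

def kmeans_sketch(X, k, max_iter=100, seed=0):
    """Minimal k-means (assumes max_iter >= 1): returns (labels, centers, loss)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.RandomState(seed)
    # Step 1a: pick k distinct samples as the initial centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        # Step 1b: squared Euclidean distance from every sample to every center
        sq_dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        new_labels = sq_dists.argmin(axis=1)
        # Step 3: stop once the assignment no longer changes
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 2: move each center to the mean of its cluster
        for j in range(k):
            if np.any(labels == j):  # assumption: keep the old center if a cluster is empty
                centers[j] = X[labels == j].mean(axis=0)
    # Loss W(C): sum of squared distances to the assigned centers
    loss = float(((X - centers[labels]) ** 2).sum())
    return labels, centers, loss

# Toy data (assumed): two well-separated groups of two points each
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
labels, centers, loss = kmeans_sketch(X, k=2)
print(labels, loss)
```

On this toy data the two groups end up in separate clusters with a loss of 1.0. Which group receives label 0 depends on the random initial centers, so the label order may differ between seeds.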

# Implementation

- Import the required libraries

In [3]:
import numpy as np
import math

- Hardware and version information

In [2]:
%load_ext watermark
%watermark -m -v -p ipywidgets,numpy

CPython 3.7.3
IPython 7.6.1

ipywidgets 7.5.0
numpy 1.16.4

compiler   : MSC v.1915 64 bit (AMD64)
system     : Windows
release    : 10
machine    : AMD64
processor  : Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
CPU cores  : 4
interpreter: 64bit


- Helper functions

In [4]:
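def euclidean_distance(x, y):
    '''Euclidean distance between two vectors of equal length'''
    distance = 0
    for i in range(len(x)):
        distance += pow((x[i] - y[i]), 2)
    # The square root is monotonic, so the nearest center is the same
    # as under the squared distance used in the formulas above
    return math.sqrt(distance)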

In [5]:
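def normalize(X, axis=-1, order=2):
    '''Scale the data set to unit norm along the given axis'''
    # np.linalg.norm takes `ord` before `axis`, so pass both by keyword
    temp = np.atleast_1d(np.linalg.norm(X, ord=order, axis=axis))
    temp[temp == 0] = 1  # avoid division by zero for all-zero vectors
    return X / np.expand_dims(temp, axis)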
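
As a small, hedged illustration of what row-wise L2 normalization does, the standalone sketch below normalizes a toy array. The array values are assumptions chosen for easy arithmetic, and the norm is computed with keyword arguments because `np.linalg.norm` takes `ord` before `axis`.

```python
import numpy as np

# Toy data (assumed): one ordinary row, one all-zero row, one unit row
X = np.array([[3.0, 4.0], [0.0, 0.0], [1.0, 0.0]])

# Keyword arguments avoid mixing up `ord` and `axis`
row_norms = np.linalg.norm(X, ord=2, axis=-1)
row_norms[row_norms == 0] = 1  # leave all-zero rows unchanged instead of dividing by zero
X_unit = X / np.expand_dims(row_norms, -1)
print(X_unit)
```

The first row becomes `[0.6, 0.8]`, the all-zero row stays zero, and the row that already has unit length is unchanged.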

- Implementation

In [None]:
class KMeans(object):
    '''k-means clustering

    k: int
        Number of clusters to partition the data into
    max_iter: int
        Maximum number of iterations
    '''
    
    def __init__(self, k=3, max_iter=400):
        self.k = k
        self.max_iter = max_iter
    
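    def predict(self, X_test):
        '''Cluster X_test and return the cluster index of each sample'''
        centers = self.__init_centers(X_test)

        for _ in range(self.max_iter):
            # Assign every sample to its nearest center
            clusters = self.__create_clusters(centers, X_test)

            prev_centers = centers

            # Move each center to the mean of its cluster
            centers = self.__calculate_centers(clusters, X_test)

            # Stop once no center has moved
            diff = centers - prev_centers
            if not diff.any():
                break

        return self.__get_cluster_labels(clusters, X_test)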
    
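    def __init_centers(self, X):
        '''Pick k distinct samples at random as the initial centers'''
        n_samples, n_features = np.shape(X)

        centers = np.zeros((self.k, n_features))
        # Sample without replacement: a repeated center would leave a cluster empty
        chosen = np.random.choice(n_samples, self.k, replace=False)
        for i, sample_idx in enumerate(chosen):
            centers[i] = X[sample_idx]
        return centers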
    
    def __create_clusters(self, centers, X):
        '''Group sample indices by the index of their nearest center'''
        clusters = [[] for _ in range(self.k)]
        for i, sample in enumerate(X):
            center_idx = self.__closest(sample, centers)
            clusters[center_idx].append(i)
        return clusters
    
    def __closest(self, x, centers):
        '''Return the index of the center nearest to sample x'''
        idx = 0
        min_dist = float('inf')
        for i, center in enumerate(centers):
            distance = euclidean_distance(x, center)
            if distance < min_dist:
                min_dist = distance
                idx = i
        return idx
    
    def __calculate_centers(self, clusters, X):
        '''Compute each new center as the mean of the samples in its cluster'''
        n_features = np.shape(X)[1]
        centers = np.zeros((self.k, n_features))
        for i, cluster in enumerate(clusters):
            center = np.mean(X[cluster], axis=0)
            centers[i] = center
        return centers
    
    def __get_cluster_labels(self, clusters, X):
        '''Convert the per-cluster index lists into one label per sample'''
        y_pred = np.zeros(np.shape(X)[0])
        for i, cluster in enumerate(clusters):
            for x in cluster:
                y_pred[x] = i
        return y_pred

---

**Author:** Daniel Meng

**GitHub:** [LibertyDream](https://github.com/LibertyDream)

**Blog:** [明月轩](https://LIbertydream.github.io)