# Clustering Evaluation Methods

## Methods
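+ Adjusted Rand index (ARI)
  > Properties: values lie in $[-1, 1]$; the measure is symmetric, and a larger value means the two partitions agree more closely (1 for identical partitions, close to 0 for random labelings).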

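+ Adjusted mutual information score (AMI)
  > Properties: upper bound 1; a larger value means the label distributions of the two partitions match more closely. Values are close to 0 for random labelings and can be slightly negative.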

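+ Homogeneity
  > Properties: homogeneity measure in $[0, 1]$; it is 1 when every cluster contains samples from a single class only.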

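+ Completeness
  > Properties: completeness measure in $[0, 1]$; it is 1 when all samples of a given class are assigned to the same cluster.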

+ V-measure
  > Properties: harmonic mean of homogeneity and completeness.

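+ FMI (Fowlkes-Mallows score)
  > Properties: geometric mean of pairwise precision and recall, $FMI(labels\_true, labels\_pred) = \frac{TP}{\sqrt{(TP+FP)(TP+FN)}}$, where $TP$, $FP$ and $FN$ count pairs of samples: $TP$ pairs are grouped together in both labelings, $FP$ pairs only in `labels_pred`, and $FN$ pairs only in `labels_true`.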

+ Silhouette Coefficient
  > Properties: values lie in $[-1, 1]$; a higher value means samples are closer to their own cluster and farther from the other clusters.

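+ Calinski-Harabasz Index
  > Properties: $CHI = \frac{tr(B_k)}{tr(W_k)} \cdot \frac{m-k}{k-1}$, where $B_k$ is the between-cluster dispersion (scatter) matrix, $W_k$ is the within-cluster dispersion matrix, $tr$ is the matrix trace, $m$ is the number of samples and $k$ is the number of clusters.
  > Larger is better; it is fast to compute.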
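The adjusted Rand index and the Fowlkes-Mallows score can both be derived from the same counts over pairs of samples. The sketch below is a minimal pure-Python illustration, not the implementation behind `cluster.evaluate`; the helper names `pair_counts`, `fowlkes_mallows` and `adjusted_rand` are hypothetical, and the quadratic loop over pairs is only suitable for small inputs.

```python
from itertools import combinations
from math import sqrt


def pair_counts(labels_true, labels_pred):
    """Classify every pair of samples by whether each labeling groups it together."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true and same_pred:
            tp += 1  # together in both labelings
        elif same_pred:
            fp += 1  # together only in labels_pred
        elif same_true:
            fn += 1  # together only in labels_true
        else:
            tn += 1  # separated in both labelings
    return tp, fp, fn, tn


def fowlkes_mallows(labels_true, labels_pred):
    """Geometric mean of pairwise precision and recall."""
    tp, fp, fn, _ = pair_counts(labels_true, labels_pred)
    denom = sqrt((tp + fp) * (tp + fn))
    return tp / denom if denom else 0.0


def adjusted_rand(labels_true, labels_pred):
    """Pair-counting form of the adjusted Rand index."""
    tp, fp, fn, tn = pair_counts(labels_true, labels_pred)
    if fp == 0 and fn == 0:
        return 1.0  # the two partitions are identical
    denom = (tp + fn) * (fn + tn) + (tp + fp) * (fp + tn)
    return 2.0 * (tp * tn - fn * fp) / denom


truth = [0, 0, 1, 1]
pred = [0, 0, 1, 2]
print(round(adjusted_rand(truth, pred), 3))    # 0.571
print(round(fowlkes_mallows(truth, pred), 3))  # 0.707
```

Swapping the two arguments exchanges `fp` and `fn`, which leaves both scores unchanged; this is why both measures are symmetric.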

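Homogeneity, completeness and V-measure can be written in terms of conditional entropy: homogeneity is $1 - H(C|K)/H(C)$ and completeness is $1 - H(K|C)/H(K)$, with $C$ the true classes and $K$ the predicted clusters. The following is a small pure-Python sketch under that standard definition; the function names are hypothetical and chosen only for this illustration.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy H(labels) in nats."""
    n = len(labels)
    return -sum(c / n * log(c / n) for c in Counter(labels).values())


def conditional_entropy(labels, given):
    """Conditional entropy H(labels | given) in nats."""
    n = len(labels)
    joint = Counter(zip(given, labels))
    marginal = Counter(given)
    return -sum(c / n * log(c / marginal[g]) for (g, _), c in joint.items())


def homogeneity_completeness_v(labels_true, labels_pred):
    """Return (homogeneity, completeness, V-measure)."""
    h_c = entropy(labels_true)
    h_k = entropy(labels_pred)
    # By convention a score is 1 when the corresponding entropy is 0.
    homogeneity = 1.0 if h_c == 0 else 1.0 - conditional_entropy(labels_true, labels_pred) / h_c
    completeness = 1.0 if h_k == 0 else 1.0 - conditional_entropy(labels_pred, labels_true) / h_k
    total = homogeneity + completeness
    v_measure = 0.0 if total == 0 else 2.0 * homogeneity * completeness / total
    return homogeneity, completeness, v_measure


# Every cluster is pure (homogeneity 1), but class 1 is split across two
# clusters, so completeness drops below 1.
h, c, v = homogeneity_completeness_v([0, 0, 1, 1], [0, 0, 1, 2])
print(round(h, 3), round(c, 3), round(v, 3))  # 1.0 0.667 0.8
```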

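Unlike the measures that compare two labelings, the silhouette coefficient and the Calinski-Harabasz index need only the data and one labeling. The sketch below computes both in pure Python on a toy data set. It assumes Euclidean distance, at least two clusters, and non-degenerate data (a nonzero within-cluster dispersion); the function names are hypothetical.

```python
from math import sqrt


def _mean(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]


def _sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def _group(points, labels):
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    return clusters


def calinski_harabasz(points, labels):
    """CHI = tr(B_k) / tr(W_k) * (m - k) / (k - 1)."""
    clusters = _group(points, labels)
    m, k = len(points), len(clusters)
    overall = _mean(points)
    tr_b = tr_w = 0.0
    for rows in clusters.values():
        center = _mean(rows)
        tr_b += len(rows) * _sq_dist(center, overall)        # between-cluster part
        tr_w += sum(_sq_dist(row, center) for row in rows)   # within-cluster part
    return (tr_b / tr_w) * (m - k) / (k - 1)


def silhouette(points, labels):
    """Mean of (b - a) / max(a, b) over all samples."""
    clusters = _group(points, labels)
    total = 0.0
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # a singleton cluster contributes 0 by convention
        # a: mean distance to the other members of the sample's own cluster
        a = sum(sqrt(_sq_dist(point, q)) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(sqrt(_sq_dist(point, q)) for q in rows) / len(rows)
            for other, rows in clusters.items() if other != label
        )
        total += (b - a) / max(a, b)
    return total / len(points)


points = [(0, 0), (0, 2), (10, 0), (10, 2)]
good = [0, 0, 1, 1]  # groups nearby points together
bad = [0, 1, 0, 1]   # groups distant points together

print(calinski_harabasz(points, good), calinski_harabasz(points, bad))
print(silhouette(points, good), silhouette(points, bad))
```

On this toy data both scores rank the `good` labeling above the `bad` one, and the silhouette of the `bad` labeling is negative.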

In [2]:
import matplotlib.pyplot as plt
from matplotlib import cm
import numpy as np
import pandas as pd
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

from sklearn.manifold import TSNE
# RandomizedPCA was removed in scikit-learn 0.20; PCA(svd_solver='randomized') replaces it
from sklearn.decomposition import KernelPCA, PCA, TruncatedSVD
from sklearn.metrics import pairwise

from cluster.models import KMeans, MiniBatchKMeans, SpectralClustering 
from cluster.dataset import load_time_series
from cluster.visual import plot_cluster_sequence, plot_cluster_dim_reduction
from cluster import evaluate

In [3]:
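data = load_time_series(1)
# Sum the power readings ('pwr') that share the same timestamp
data = data.groupby('datetime')['pwr'].sum()
# One row per day, 48 readings per row (assumes a complete series with 48 readings per day), indexed by date string
data = pd.DataFrame(data.values.reshape(-1, 48), index=np.unique(pd.to_datetime(data.index).date.astype(str)))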

In [5]:
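kmeans_pp = KMeans(
    n_clusters=8,  # number of clusters
    max_iter=300,  # maximum number of iterations
    n_init=10,  # number of runs with different random initializations; the best result is kept
    init='k-means++',  # {'k-means++', 'random', ndarray}
    algorithm='auto',  # {'auto', 'full': EM-style algorithm, 'elkan': uses the triangle inequality, no sparse support}
)

kmeans_random = KMeans(
    n_clusters=8,  # number of clusters
    max_iter=300,  # maximum number of iterations
    n_init=10,  # number of runs with different random initializations; the best result is kept
    init='random',  # {'k-means++', 'random', ndarray}
    algorithm='auto',  # {'auto', 'full': EM-style algorithm, 'elkan': uses the triangle inequality, no sparse support}
)

kernel_kmeans = SpectralClustering(
    n_clusters=8,
    affinity='rbf',
    gamma=1.,
#     degree=3,
    n_init=10,  # same meaning as above
)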

kmeans_pp.fit(data.values)
kmeans_random.fit(data.values)
kernel_kmeans.fit(data.values)

SpectralClustering(affinity='rbf', assign_labels='kmeans', coef0=1, degree=3,
          eigen_solver=None, eigen_tol=0.0, gamma=1.0, kernel_params=None,
          n_clusters=8, n_init=10, n_jobs=1, n_neighbors=10,
          random_state=None)

## Adjusted Rand index

In [6]:
evaluate.adjusted_rand_score(kmeans_pp.labels_, kernel_kmeans.labels_)

-0.002043593009480593

In [8]:
str(kernel_kmeans)

"SpectralClustering(affinity='rbf', assign_labels='kmeans', coef0=1, degree=3,\n          eigen_solver=None, eigen_tol=0.0, gamma=1.0, kernel_params=None,\n          n_clusters=8, n_init=10, n_jobs=1, n_neighbors=10,\n          random_state=None)"

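In [43]:
from sklearn.metrics import adjusted_rand_score


def adjusted_rand_score_matrix(labelings):
    """Pairwise adjusted Rand index between several clusterings.

    Args:
        labelings: sequence of label arrays, one per clustering,
            all defined over the same samples.
    Returns:
        Symmetric (n, n) ndarray; entry [i, j] is the adjusted Rand
        index between labelings[i] and labelings[j].
    """
    size = len(labelings)
    assert size >= 2, 'need at least two clusterings to compare'
    score_matrix = np.eye(size)  # a clustering compared with itself scores 1
    for i in range(size):
        for j in range(i + 1, size):
            # the score is symmetric, so each pair is computed once
            score = adjusted_rand_score(labelings[i], labelings[j])
            score_matrix[i, j] = score_matrix[j, i] = score
    return score_matrix

In [46]:
adjusted_rand_score_matrix([kmeans_pp.labels_, kernel_kmeans.labels_, kmeans_random.labels_])

array([[ 1.        , -0.00204359,  0.97944737],
       [-0.00204359,  1.        , -0.00244869],
       [ 0.97944737, -0.00244869,  1.        ]])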
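The pairwise comparison used for the adjusted Rand index carries over to any symmetric agreement score. As a rough sketch, the hypothetical helper `pairwise_score_matrix` below takes the score function as a parameter. To stay self-contained it is demonstrated with a plain (unadjusted) Rand index written in pure Python rather than with scikit-learn. It assumes the score of a labeling with itself is 1.

```python
from itertools import combinations


def rand_index(a, b):
    """Unadjusted Rand index: fraction of sample pairs on which a and b agree."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)


def pairwise_score_matrix(labelings, score_fn):
    """Symmetric matrix of score_fn over every pair of labelings."""
    n = len(labelings)
    matrix = [[1.0] * n for _ in range(n)]  # assumes score_fn(x, x) == 1
    for i, j in combinations(range(n), 2):
        matrix[i][j] = matrix[j][i] = score_fn(labelings[i], labelings[j])
    return matrix


labelings = [
    [0, 0, 1, 1],
    [1, 1, 0, 0],  # same partition as the first, with the label names swapped
    [0, 1, 0, 1],
]
scores = pairwise_score_matrix(labelings, rand_index)
for row in scores:
    print([round(value, 3) for value in row])
```

The first two labelings score 1.0 against each other because such scores depend only on which samples are grouped together, not on the label values themselves.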
