# K-Means

- Unsupervised learning
- Groups data when the number of classes in the target data is not known in advance
- The value of k is chosen with the __elbow method__, using a plot of k against the sum of squared errors

In [37]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

<hr>

### 1. Prepare dataset & plot it

In [38]:
iris = load_iris()
df = pd.DataFrame(iris['data'], columns=['SL', 'SW', 'PL', 'PW'])
df['target'] = iris['target']
df.head()

| | SL | SW | PL | PW | target |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |


<hr>

### 2. KMeans Clustering

In [39]:
from sklearn.cluster import KMeans
model = KMeans(n_clusters=3, max_iter=10000)
# for now, n_clusters (k) is simply taken from the number of classes in the dataset

model.fit(df[['SL', 'SW', 'PL', 'PW']])

# fit and predict in one step
# model.fit_predict(df[['SL', 'SW', 'PL', 'PW']])

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=10000,
       n_clusters=3, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=None, tol=0.0001, verbose=0)
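The value k = 3 above comes from the known class count. When that count is unknown, the elbow method mentioned at the top can be sketched roughly as follows. This is a minimal illustration, not part of the original notebook: the range of k, `n_init=10`, and `random_state=0` are assumptions chosen for reproducibility, and `inertia_` is scikit-learn's name for the within-cluster sum of squared errors.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris()["data"]

ks = list(range(1, 9))  # candidate values of k (an arbitrary range)
sse = []                # sum of squared errors for each k

for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse.append(km.inertia_)  # within-cluster sum of squared distances

for k, value in zip(ks, sse):
    print(k, round(value, 2))
```

Plotting `ks` against `sse` (for example with `plt.plot(ks, sse, marker='o')`) gives the elbow curve. The k at which the curve stops dropping steeply is the usual candidate.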

In [40]:
# predict the cluster label of each sample
model.predict(df[['SL', 'SW', 'PL', 'PW']])

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2,
       2, 2, 2, 1, 1, 2, 2, 2, 2, 1, 2, 1, 2, 1, 2, 2, 1, 1, 2, 2, 2, 2,
       2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 1])

In [41]:
df['prediksi'] = model.predict(df[['SL', 'SW', 'PL', 'PW']])
df.head()

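| | SL | SW | PL | PW | target | prediksi |
|---|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 | 0 |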


In [42]:
model.labels_

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2,
       2, 2, 2, 1, 1, 2, 2, 2, 2, 1, 2, 1, 2, 1, 2, 2, 1, 1, 2, 2, 2, 2,
       2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 1])

<hr>

### 3. Clustering Evaluation Metrics

In [44]:
from sklearn import metrics

<hr>

- #### __Adjusted Rand index__

    The adjusted Rand index is a function that measures the __similarity__ of the two assignments, ignoring permutations and with chance normalization. __*Perfect labeling is scored 1.0.*__

In [52]:
# the score is symmetric: swapping the two arguments gives the same result
print(metrics.adjusted_rand_score(df['target'], df['prediksi']))
print(metrics.adjusted_rand_score(df['prediksi'], df['target']))

0.7302382722834697
0.7302382722834697
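As a rough check of what the score measures, the adjusted Rand index can be recomputed from pair counts alone. The sketch below assumes the contingency counts that this notebook reports for the same clustering in its Contingency Matrix section. The variable names are illustrative and not part of scikit-learn.

```python
from math import comb

# rows: true classes, columns: predicted clusters (counts taken from this notebook)
table = [[50, 0, 0],
         [0, 48, 2],
         [0, 14, 36]]

n = sum(map(sum, table))

# number of sample pairs that share both class and cluster
sum_cells = sum(comb(v, 2) for row in table for v in row)
# number of pairs sharing a class, and number of pairs sharing a cluster
sum_rows = sum(comb(sum(row), 2) for row in table)
sum_cols = sum(comb(sum(col), 2) for col in zip(*table))

expected = sum_rows * sum_cols / comb(n, 2)  # agreement expected by chance
max_index = (sum_rows + sum_cols) / 2        # agreement of a perfect labeling

ari = (sum_cells - expected) / (max_index - expected)
print(round(ari, 4))  # 0.7302
```

Because only the counts enter the formula, renaming the cluster ids or swapping the roles of rows and columns leaves the result unchanged.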


<hr>

- #### __Mutual Information based scores__

    Mutual Information is a function that measures the __agreement__ of the two assignments, ignoring permutations. Two normalized versions of this measure are available: __Normalized Mutual Information (NMI)__ and __Adjusted Mutual Information (AMI)__. NMI is often used in the literature, while AMI was proposed more recently and is normalized against chance. Perfect labeling is scored 1.0.

In [51]:
# swapping the arguments gives the same score (up to floating-point rounding)
print(metrics.adjusted_mutual_info_score(df['target'], df['prediksi']))
print(metrics.adjusted_mutual_info_score(df['prediksi'], df['target']))

0.7551191675800484
0.7551191675800483
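To make the unadjusted quantities concrete, the following sketch computes the mutual information and an NMI with arithmetic-mean normalization. It uses the contingency counts that this notebook reports for the same clustering. It is an illustration under those assumptions and deliberately omits the chance correction that AMI adds, so its result is not expected to equal the AMI value above.

```python
import numpy as np

# rows: true classes, columns: predicted clusters (counts taken from this notebook)
table = np.array([[50, 0, 0],
                  [0, 48, 2],
                  [0, 14, 36]], dtype=float)

p_joint = table / table.sum()    # joint distribution of (class, cluster)
p_class = p_joint.sum(axis=1)    # marginal over classes
p_cluster = p_joint.sum(axis=0)  # marginal over clusters

def entropy(p):
    """Shannon entropy in nats, skipping zero-probability entries."""
    p = p[p > 0]
    return -(p * np.log(p)).sum()

nonzero = p_joint > 0
independent = np.outer(p_class, p_cluster)
mi = (p_joint[nonzero] * np.log(p_joint[nonzero] / independent[nonzero])).sum()

# normalize by the arithmetic mean of the two entropies
nmi = mi / ((entropy(p_class) + entropy(p_cluster)) / 2)
print(round(nmi, 4))  # 0.7582
```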


<hr>

- #### Homogeneity, completeness and V-measure

    1. __Homogeneity__: each cluster contains only members of a single class (higher is better).

    2. __Completeness__: all members of a given class are assigned to the same cluster (higher is better).
    
    3. __V-measure__: harmonic mean of Homogeneity & Completeness.

In [60]:
# homogeneity
metrics.homogeneity_score(df['target'], df['prediksi'])

0.7514854021988338

In [61]:
# completeness
metrics.completeness_score(df['target'], df['prediksi'])

0.7649861514489815

In [62]:
# v-measure
metrics.v_measure_score(df['target'], df['prediksi'])

0.7581756800057784

In [63]:
metrics.homogeneity_completeness_v_measure(df['target'], df['prediksi'])

(0.7514854021988338, 0.7649861514489815, 0.7581756800057784)
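The harmonic-mean relationship can be verified directly from the two scores reported above. This is a small arithmetic check, assuming equal weighting of homogeneity and completeness.

```python
# scores copied from the outputs above
h = 0.7514854021988338  # homogeneity
c = 0.7649861514489815  # completeness

# harmonic mean of the two scores
v = 2 * h * c / (h + c)
print(round(v, 6))  # 0.758176
```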

<hr>

- #### __Fowlkes-Mallows scores__

    The Fowlkes-Mallows index (FMI) is defined as the geometric mean of the pairwise precision and recall. Perfect labeling is scored 1.0.

In [65]:
metrics.fowlkes_mallows_score(df['target'], df['prediksi'])

0.8208080729114153
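"Pairwise" here means that precision and recall are counted over pairs of samples rather than single samples. A minimal sketch of that computation follows. It assumes the contingency counts that this notebook reports for the same clustering, and the names `tk`, `pk` and `qk` are illustrative.

```python
import numpy as np

# rows: true classes, columns: predicted clusters (counts taken from this notebook)
table = np.array([[50, 0, 0],
                  [0, 48, 2],
                  [0, 14, 36]])
n = table.sum()

# ordered pairs of distinct samples that share both class and cluster
tk = (table ** 2).sum() - n
# ordered pairs that share a predicted cluster, and pairs that share a true class
pk = (table.sum(axis=0) ** 2).sum() - n
qk = (table.sum(axis=1) ** 2).sum() - n

precision = tk / pk  # of the pairs clustered together, how many share a class
recall = tk / qk     # of the pairs sharing a class, how many are clustered together

fmi = np.sqrt(precision * recall)
print(round(float(fmi), 4))  # 0.8208
```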

<hr>

- #### __Silhouette Coefficient__

    The Silhouette Coefficient evaluates a clustering using only the data and the assigned labels, so it does not require ground-truth classes. A higher score relates to a model with better defined clusters. The score is bounded between -1 for incorrect clustering and +1 for highly dense clustering. Scores around zero indicate overlapping clusters.

In [67]:
metrics.silhouette_score(
    df[['SL', 'SW', 'PL', 'PW']], 
    model.labels_, 
    metric='euclidean'
)

0.5528190123564091
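For each sample, the coefficient compares `a`, the mean distance to the other members of its own cluster, with `b`, the mean distance to the members of the nearest other cluster, as `(b - a) / max(a, b)`. The sketch below is a plain NumPy illustration on a hypothetical four-point dataset, not scikit-learn's implementation.

```python
import numpy as np

def silhouette_manual(X, labels):
    """Mean silhouette coefficient with Euclidean distance (illustrative only)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # full pairwise distance matrix
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue  # a singleton cluster keeps a score of 0
        # mean distance to the other members of the same cluster
        a = dist[i, same].sum() / (n_same - 1)
        # mean distance to the members of the nearest other cluster
        b = min(dist[i, labels == other].mean()
                for other in np.unique(labels) if other != labels[i])
        scores[i] = (b - a) / max(a, b)
    return float(scores.mean())

# hypothetical toy data: two tight, well-separated clusters on a line
X_toy = [[0.0], [1.0], [10.0], [11.0]]
labels_toy = [0, 0, 1, 1]
score = silhouette_manual(X_toy, labels_toy)
print(round(score, 4))  # 0.8997
```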

<hr>

- #### __Calinski-Harabasz Index__

    Also known as __the Variance Ratio Criterion__, it can be used to evaluate the model, where a higher Calinski-Harabasz score relates to a model with better defined clusters. The index is the ratio of the between-cluster dispersion to the within-cluster dispersion, summed over all clusters (where dispersion is defined as the sum of squared distances).

In [68]:
metrics.calinski_harabasz_score(df[['SL', 'SW', 'PL', 'PW']], model.labels_)

561.62775662962
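The ratio can be written out in a few lines of NumPy. The sketch below assumes the usual normalization of the between-cluster term by `k - 1` and the within-cluster term by `n - k`, and runs on a hypothetical four-point dataset rather than the iris data.

```python
import numpy as np

def calinski_harabasz_manual(X, labels):
    """Variance ratio criterion (illustrative only)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n = len(X)
    ids = np.unique(labels)
    k = len(ids)
    overall_mean = X.mean(axis=0)
    between = 0.0
    within = 0.0
    for c in ids:
        members = X[labels == c]
        centroid = members.mean(axis=0)
        # squared distance of the centroid from the overall mean, weighted by cluster size
        between += len(members) * np.sum((centroid - overall_mean) ** 2)
        # squared distances of the members from their own centroid
        within += np.sum((members - centroid) ** 2)
    return float((between / (k - 1)) / (within / (n - k)))

# hypothetical toy data: two tight, well-separated clusters on a line
X_toy = [[0.0], [1.0], [10.0], [11.0]]
labels_toy = [0, 0, 1, 1]
ch = calinski_harabasz_manual(X_toy, labels_toy)
print(ch)  # 200.0
```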

<hr>

- #### __Davies-Bouldin Index__

    __Lower Davies-Bouldin index relates to a model with better separation between the clusters__. This index signifies the average ‘similarity’ between clusters, where the similarity is a measure that compares the distance between clusters with the size of the clusters themselves. Zero is the lowest possible score. Values closer to zero indicate a better partition.

In [69]:
metrics.davies_bouldin_score(df[['SL', 'SW', 'PL', 'PW']], model.labels_)

0.6619715465007528
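The "similarity" of two clusters can be illustrated as the sum of their sizes (mean distance of members to the centroid) divided by the distance between their centroids. The index then averages each cluster's worst-case similarity. The following is a NumPy sketch under that definition on a hypothetical four-point dataset, not scikit-learn's code.

```python
import numpy as np

def davies_bouldin_manual(X, labels):
    """Davies-Bouldin index with Euclidean distance (illustrative only)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    ids = np.unique(labels)
    centroids = np.array([X[labels == c].mean(axis=0) for c in ids])
    # cluster "size": mean distance of the members to their centroid
    scatter = np.array([
        np.linalg.norm(X[labels == c] - centroids[i], axis=1).mean()
        for i, c in enumerate(ids)
    ])
    worst = []
    for i in range(len(ids)):
        ratios = [
            (scatter[i] + scatter[j]) / np.linalg.norm(centroids[i] - centroids[j])
            for j in range(len(ids)) if j != i
        ]
        worst.append(max(ratios))  # most similar (worst separated) neighbour
    return float(np.mean(worst))

# hypothetical toy data: two tight, well-separated clusters on a line
X_toy = [[0.0], [1.0], [10.0], [11.0]]
labels_toy = [0, 0, 1, 1]
db = davies_bouldin_manual(X_toy, labels_toy)
print(db)  # 0.1
```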

<hr>

- #### __Contingency Matrix__

    Contingency matrix reports the intersection cardinality for every true/predicted cluster pair. The contingency matrix provides sufficient statistics for all clustering metrics where the samples are independent and identically distributed and one doesn’t need to account for some instances not being clustered.

In [70]:
metrics.cluster.contingency_matrix(df['target'], df['prediksi'])

array([[50,  0,  0],
       [ 0, 48,  2],
       [ 0, 14, 36]])

- The first row of the output array indicates that there are 50 samples whose true cluster is 0. Of them, 50 are in predicted cluster 0, 0 are in 1, and 0 are in 2.
- The second row of the output array indicates that there are 50 samples whose true cluster is 1. Of them, 0 are in predicted cluster 0, 48 are in 1, and 2 are in 2.
- The third row of the output array indicates that there are 50 samples whose true cluster is 2. Of them, 0 are in predicted cluster 0, 14 are in 1, and 36 are in 2.
- A confusion matrix for classification is a square contingency matrix where the order of rows and columns corresponds to a list of classes.

In [71]:
metrics.confusion_matrix(df['target'], df['prediksi'])

array([[50,  0,  0],
       [ 0, 48,  2],
       [ 0, 14, 36]], dtype=int64)
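The two matrices coincide here only because the cluster ids produced by this run happen to line up with the class ids. Cluster ids are arbitrary, so in general the large counts need not fall on the diagonal. The sketch below builds a contingency table with NumPy on hypothetical labels to illustrate this. The `contingency` helper is illustrative and not a scikit-learn function.

```python
import numpy as np

def contingency(true_labels, pred_labels):
    """Count samples for every (true class, predicted cluster) pair."""
    classes, class_idx = np.unique(true_labels, return_inverse=True)
    clusters, cluster_idx = np.unique(pred_labels, return_inverse=True)
    table = np.zeros((len(classes), len(clusters)), dtype=int)
    # increment the cell of each sample's (class, cluster) pair
    np.add.at(table, (class_idx, cluster_idx), 1)
    return table

# hypothetical labels: the cluster ids do not match the class ids
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([1, 1, 0, 0, 2, 0])

table = contingency(y_true, y_pred)
print(table)
# [[0 2 0]
#  [2 0 0]
#  [1 0 1]]
print(table.sum(axis=1))  # class sizes:   [2 2 2]
print(table.sum(axis=0))  # cluster sizes: [3 2 1]
```

Reading this table as a confusion matrix would suggest poor agreement for classes 0 and 1, even though each of them maps cleanly onto a single cluster.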