# K-Means Clustering (Assignment 5)

K-Means Clustering is an effective algorithm for finding clusters in a data set. It performs a cluster analysis that partitions N objects into K groups (clusters), placing each object in the group whose mean is nearest.

The algorithm proceeds as follows:

**First,** choose the number of clusters, k.

**Second,** randomly pick k records to serve as the initial cluster centers.

**Third,** find the nearest cluster center for each record. The distance most often used for this is the Euclidean distance:

$$
d_{euclidean}(x, y) = \sqrt{\sum_{i=1}^{m} (x_{i} - y_{i})^{2}}
$$

where $x = (x_1, x_2, \dots, x_m)$ and $y = (y_1, y_2, \dots, y_m)$ are two records and $m$ is the number of attribute values in each record.

**Fourth,** assign each record to the cluster whose center is nearest, then update each cluster center to the mean of the records assigned to it:

$$
ClusterCenter = \frac{1}{n} \sum_{i=1}^{n} a_{i}
$$

where $a_1, \dots, a_n$ are the $n$ records currently assigned to the cluster.

**Fifth,** repeat steps 3 and 4 until no record moves to a different cluster, that is, until the algorithm converges.
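
The five steps above can be sketched directly in NumPy. This is a minimal illustration and not the scikit-learn implementation used later in this notebook (scikit-learn's `KMeans` uses k-means++ initialization by default, for instance). The function name `kmeans_sketch`, its parameters, and the toy data are assumptions made for this example.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Minimal K-Means following the five steps above (illustrative only)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 2: pick k distinct records at random as the initial centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        # Step 3: Euclidean distance from every record to every center.
        dists = np.sqrt(((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2))
        # Step 4 (first half): assign each record to its nearest center.
        new_labels = dists.argmin(axis=1)
        # Step 5: stop once no record changes cluster.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 4 (second half): move each center to the mean of its members.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    return labels, centers

# Two well-separated toy groups (made-up data, not the iris set).
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])
labels, centers = kmeans_sketch(X_toy, k=2)
print(labels)
print(centers)
```

Which group receives id 0 and which receives id 1 depends on the random initial centers, so only the grouping is meaningful, not the id values.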

## Import libraries

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # for data visualization
import seaborn as sns # for statistical data visualization
from sklearn.cluster import KMeans
from sklearn.preprocessing import LabelEncoder

## Import dataset

In [None]:
df = pd.read_csv("https://gist.githubusercontent.com/netj/8836201/raw/6f9306ad21398ea43cba4f7d537619d0e07d5ae3/iris.csv")

## Exploratory data analysis

In [None]:
df.head()

Unnamed: 0,sepal.length,sepal.width,petal.length,petal.width,variety
0,5.1,3.5,1.4,0.2,Setosa
1,4.9,3.0,1.4,0.2,Setosa
2,4.7,3.2,1.3,0.2,Setosa
3,4.6,3.1,1.5,0.2,Setosa
4,5.0,3.6,1.4,0.2,Setosa


## Declare feature vector and target variable

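Encode the categorical `variety` column as integers. Note that `X` keeps the encoded `variety` column, so the target is also part of the feature matrix that is scaled and clustered below.

In [None]:
# Work on a copy so the original DataFrame keeps its string labels
X = df.copy()

y = df["variety"]

le = LabelEncoder()

# Replace the class names in X with integer codes
X['variety'] = le.fit_transform(X['variety'])

# Encode the target with the same fitted encoder
y = le.transform(y)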

y

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

In [None]:
X.head()

Unnamed: 0,sepal.length,sepal.width,petal.length,petal.width,variety
0,5.1,3.5,1.4,0.2,0
1,4.9,3.0,1.4,0.2,0
2,4.7,3.2,1.3,0.2,0
3,4.6,3.1,1.5,0.2,0
4,5.0,3.6,1.4,0.2,0


## Feature Scaling

In [None]:
cols = X.columns

In [None]:
from sklearn.preprocessing import MinMaxScaler

ms = MinMaxScaler()

X = ms.fit_transform(X)

In [None]:
# Pass the column Index directly; wrapping it in a list would create MultiIndex columns
X = pd.DataFrame(X, columns=cols)

In [None]:
X

Unnamed: 0,sepal.length,sepal.width,petal.length,petal.width,variety
0,0.222222,0.625000,0.067797,0.041667,0.0
1,0.166667,0.416667,0.067797,0.041667,0.0
2,0.111111,0.500000,0.050847,0.041667,0.0
3,0.083333,0.458333,0.084746,0.041667,0.0
4,0.194444,0.666667,0.067797,0.041667,0.0
...,...,...,...,...,...
145,0.666667,0.416667,0.711864,0.916667,1.0
146,0.555556,0.208333,0.677966,0.750000,1.0
147,0.611111,0.416667,0.711864,0.791667,1.0
148,0.527778,0.583333,0.745763,0.916667,1.0
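
To make the scaling step concrete, the sketch below applies the min-max formula by hand. With its default `feature_range=(0, 1)`, `MinMaxScaler` rescales each column as `(x - min) / (max - min)`. The array `X_toy` is made-up data used only for illustration.

```python
import numpy as np

# Made-up measurements: rows are records, columns are features.
X_toy = np.array([[5.1, 3.5],
                  [4.9, 3.0],
                  [7.0, 3.2],
                  [6.3, 3.3]])

col_min = X_toy.min(axis=0)
col_max = X_toy.max(axis=0)

# Rescale each column to the range [0, 1]: (x - min) / (max - min).
X_scaled = (X_toy - col_min) / (col_max - col_min)
print(X_scaled)
```

After scaling, the smallest value in every column is 0 and the largest is 1, so no single feature dominates the Euclidean distance because of its units.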


## K-Means model with three clusters 

In [None]:
# Create a K-Means model with three clusters (KMeans was imported above)
kmeans = KMeans(n_clusters=3, random_state=0)

# Fit the model to the scaled data
kmeans.fit(X)

# Predict the cluster for all the samples
pred = kmeans.predict(X)



## K-Means model parameters study

In [None]:
kmeans.cluster_centers_

array([[0.19611111, 0.595     , 0.07830508, 0.06083333, 0.        ],
       [0.63555556, 0.40583333, 0.77152542, 0.8025    , 1.        ],
       [0.45444444, 0.32083333, 0.55254237, 0.51083333, 0.5       ]])

In [None]:
kmeans.inertia_

7.801559361268047
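
The `inertia_` attribute is the sum of squared distances from each sample to its closest cluster center. The sketch below recomputes it by hand on a small made-up array, assuming scikit-learn is installed. `X_toy` and `manual_inertia` are names introduced for this example only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two small, well-separated groups (made-up data).
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [8.0, 8.0], [8.0, 9.0], [9.0, 8.0]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_toy)

# Look up the center assigned to each sample, then sum the squared distances.
assigned_centers = km.cluster_centers_[km.labels_]
manual_inertia = ((X_toy - assigned_centers) ** 2).sum()
print(manual_inertia, km.inertia_)
```

Lower inertia means tighter clusters, but the value always decreases as the number of clusters grows, so it is only comparable between models fitted on the same data.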

## Check quality of weak classification by the model

In [None]:
kmeans.labels_

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int32)

In [None]:
labels = kmeans.labels_

# Count the samples whose cluster id equals the encoded class id.
# Cluster ids are arbitrary, so this direct comparison can undercount matches.
correct_labels = sum(y == labels)

print("Result: %d out of %d samples were correctly labeled." % (correct_labels, y.size))

Result: 50 out of 150 samples were correctly labeled.


In [None]:
print('Accuracy score: {0:0.2f}'.format(correct_labels / float(y.size)))

Accuracy score: 0.33
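
The score of 0.33 is largely an artifact of label numbering. K-Means assigns cluster ids arbitrarily, and in the outputs above the second and third groups of 50 samples simply received ids 2 and 1 instead of 1 and 2. One possible way to account for this is to try every relabeling of the cluster ids and keep the best one. The sketch below does so on arrays shaped like the printed `y` and `kmeans.labels_`. The names `y_true`, `cluster_ids`, and `best_map` are assumptions for the example.

```python
from itertools import permutations
import numpy as np

# Arrays shaped like the outputs above: classes 0, 1, 2 and clusters 0, 2, 1.
y_true = np.repeat([0, 1, 2], 50)
cluster_ids = np.repeat([0, 2, 1], 50)

# Direct comparison, as in the cell above.
naive = (y_true == cluster_ids).mean()

# Try every relabeling of the cluster ids and keep the best match.
best_acc, best_map = 0.0, None
for perm in permutations(range(3)):
    mapped = np.array(perm)[cluster_ids]  # cluster id j becomes class perm[j]
    acc = (mapped == y_true).mean()
    if acc > best_acc:
        best_acc, best_map = acc, perm

print(naive, best_acc, best_map)
```

With the relabeling applied, every sample matches its class. Bear in mind that in this notebook `X` still contains the encoded `variety` column, so the clustering has seen the target and this perfect agreement should not be read as a genuine measure of cluster quality.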


In [None]:
# Copy the scaled features so that adding a column does not modify X
df2 = X.copy()
df2["cluster"] = pred
df2

Unnamed: 0,sepal.length,sepal.width,petal.length,petal.width,variety,cluster
0,0.222222,0.625000,0.067797,0.041667,0.0,0
1,0.166667,0.416667,0.067797,0.041667,0.0,0
2,0.111111,0.500000,0.050847,0.041667,0.0,0
3,0.083333,0.458333,0.084746,0.041667,0.0,0
4,0.194444,0.666667,0.067797,0.041667,0.0,0
...,...,...,...,...,...,...
145,0.666667,0.416667,0.711864,0.916667,1.0,1
146,0.555556,0.208333,0.677966,0.750000,1.0,1
147,0.611111,0.416667,0.711864,0.791667,1.0,1
148,0.527778,0.583333,0.745763,0.916667,1.0,1
