# Clustering in Python
This notebook demonstrates how to implement clustering in Python.


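## Data Preparation

Weka ships with the iris dataset among its default datasets.

In Python, by contrast, you have to obtain the data yourself. The cells below read the red wine quality data from a local file, `winequality-red.csv`, rather than the iris dataset.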


In [None]:
import pandas as pd
from sklearn import datasets, cluster, metrics, preprocessing

In [None]:
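# Load the red wine quality data from a local CSV file
# (the file is expected in the working directory)
raw_data = pd.read_csv('winequality-red.csv')

# Features: the first 12 columns (indices 0 to 11).
# Note that index 11 is used as the y-axis in the plots below.
X = raw_data.iloc[:, :12]

# Reference labels: kept only for comparison, never passed to k-means
y = raw_data[['quality category']]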


In [None]:
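# Encode the category labels as integers (0, 1, 2, ...)
le = preprocessing.LabelEncoder()
class_ls = le.fit_transform(y['quality category'])

# Wrap the encoded labels in a DataFrame with the same column name
new_y = pd.DataFrame(class_ls, columns=['quality category'])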
new_y
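
To make the encoding step easier to read, here is a minimal sketch of how `LabelEncoder` assigns integer codes. The category names `low`, `medium`, and `high` are hypothetical and only serve the illustration; the actual values in `quality category` may differ.

```python
import pandas as pd
from sklearn import preprocessing

# Hypothetical category labels, for illustration only
toy = pd.DataFrame({'quality category': ['low', 'high', 'medium', 'low']})

le_demo = preprocessing.LabelEncoder()
codes = le_demo.fit_transform(toy['quality category'])

# classes_ holds the distinct labels in sorted order;
# a label's position in classes_ is its integer code
mapping = {label: code for code, label in enumerate(le_demo.classes_)}
print(mapping)
print(codes)

# inverse_transform recovers the original strings from the codes
restored = le_demo.inverse_transform(codes)
print(list(restored))
```

The codes follow the sorted order of the labels, so they carry no ordinal meaning: here `high` is encoded before `low`.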

## Model Training

With the data prepared, run k-means clustering.

In [None]:
km = cluster.KMeans(n_clusters=3, init='k-means++')
km.fit(X)
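
k-means assigns points by Euclidean distance, so a feature with a much larger numeric range can dominate the result. The notebook clusters the raw values; the sketch below shows one common alternative, standardizing the features first. The toy data, the variable names, and the `n_init` and `random_state` settings are assumptions for illustration, not part of the original workflow.

```python
import numpy as np
from sklearn import cluster, preprocessing

# Hypothetical two-feature data: the second column is about 100x larger in scale
rng = np.random.default_rng(0)
X_toy = np.column_stack([rng.normal(0, 1, 100), rng.normal(0, 100, 100)])

# Rescale every column to zero mean and unit variance
scaler = preprocessing.StandardScaler()
X_scaled = scaler.fit_transform(X_toy)

# Same estimator as in the notebook, fitted on the scaled features;
# random_state is fixed only to make the sketch repeatable
km_scaled = cluster.KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=0)
km_scaled.fit(X_scaled)
print(km_scaled.labels_[:10])
```

Whether scaling helps depends on the data, so it is worth comparing the silhouette score with and without it.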

In [28]:
km.labels_

array([0, 2, 2, ..., 2, 2, 2])

In [None]:
new_y['quality category'].to_numpy()
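
Cluster ids are arbitrary, so cluster `0` need not correspond to class `0`, and comparing the two arrays element by element can be misleading. The sketch below uses hypothetical toy labels to show two permutation-safe comparisons: the adjusted Rand index and a contingency table.

```python
import numpy as np
import pandas as pd
from sklearn import metrics

# Hypothetical labels: the grouping is identical, only the ids are permuted
true_labels = np.array([0, 0, 1, 1, 2, 2])
cluster_ids = np.array([2, 2, 0, 0, 1, 1])

# Element-wise agreement ignores that ids are arbitrary
naive_match = float((true_labels == cluster_ids).mean())

# The adjusted Rand index does not depend on how the clusters are numbered
ari = metrics.adjusted_rand_score(true_labels, cluster_ids)

# A contingency table shows which cluster each class ended up in
table = pd.crosstab(true_labels, cluster_ids, rownames=['true'], colnames=['cluster'])

print(naive_match, ari)
print(table)
```

In the notebook, the same calls could be applied to `new_y['quality category']` and `km.labels_`.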

## Visualization
Visualize the clustering result.

In [None]:
# Import the plotting modules
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
np_wine_X = X.to_numpy()
np_wine_X

In [None]:
np_wine_X[:, 11]

In [None]:
new_y['quality category']

0       0
1       0
2       0
3       0
4       0
       ..
1594    0
1595    0
1596    0
1597    0
1598    0
Name: quality category, Length: 1599, dtype: int32

In [None]:
# plt.subplots(number of rows, number of columns)
# sharey=True makes both panels share the y-axis
f, axes = plt.subplots(1, 2, sharey=True, figsize=(14, 6))

# Left panel: column 0 against column 11 of X, colored by k-means cluster
axes[0].set_title('K-Means')
axes[0].scatter(np_wine_X[:, 0], np_wine_X[:, 11], c=km.labels_, cmap='viridis')
axes[0].set_xlabel(X.columns[0])
axes[0].set_ylabel(X.columns[11])

# Right panel: the same two columns, colored by the original category labels
axes[1].set_title('Original')
axes[1].scatter(np_wine_X[:, 0], np_wine_X[:, 11], c=new_y['quality category'], cmap='viridis')
axes[1].set_xlabel(X.columns[0])
axes[1].set_ylabel(X.columns[11])

In [None]:
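# scatter() requires x and y data; calling it with no arguments raises a TypeError
axes[0].scatter(np_wine_X[:, 0], np_wine_X[:, 11], c=km.labels_, cmap='viridis')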

## Evaluation
Evaluate how well the model clusters the data.

In [None]:
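# The silhouette score ranges from -1 to 1.
# Close to 1: small within-cluster differences and large between-cluster differences => good clustering
# Close to 0: large within-cluster differences and small between-cluster differences (overlapping clusters) => poor clustering
# Negative values suggest that samples have been assigned to the wrong cluster.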
metrics.silhouette_score(X, km.labels_)
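
To show what the score measures, the sketch below computes the silhouette coefficient by hand on four hypothetical 1-D points and checks it against scikit-learn. For a sample, `a` is the mean distance to the other members of its own cluster, `b` is the mean distance to the nearest other cluster, and the coefficient is `(b - a) / max(a, b)`. For the first point, `a = 1` and `b = 10.5`, giving roughly 0.905.

```python
import numpy as np
from sklearn import metrics

# Hypothetical toy data: two well-separated clusters on a line
pts = np.array([[0.0], [1.0], [10.0], [11.0]])
labels = np.array([0, 0, 1, 1])

def silhouette_by_hand(points, labels, i):
    """Silhouette coefficient of sample i (assumes its cluster has at least 2 members)."""
    dists = np.linalg.norm(points - points[i], axis=1)
    same = labels == labels[i]
    same[i] = False  # exclude the sample itself
    a = dists[same].mean()  # mean distance within its own cluster
    # mean distance to each other cluster; keep the nearest one
    b = min(dists[labels == other].mean()
            for other in np.unique(labels) if other != labels[i])
    return (b - a) / max(a, b)

by_hand = np.array([silhouette_by_hand(pts, labels, i) for i in range(len(pts))])
from_sklearn = metrics.silhouette_samples(pts, labels)

print(by_hand)
print(by_hand.mean(), metrics.silhouette_score(pts, labels))
```

`silhouette_score` is the mean of these per-sample values.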

In [None]:
def get_kscore(k):
  """Fit k-means with k clusters on X and return the silhouette score."""
  km = cluster.KMeans(n_clusters=k)
  km.fit(X)
  return metrics.silhouette_score(X, km.labels_)

In [None]:
# Iterate over candidate cluster counts to find the best one
for k in range(2, 11):
  plt.bar(k, get_kscore(k), edgecolor="black")

plt.xlabel('n_cluster')
plt.ylabel('silhouette score')
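
The best cluster count can also be selected in code rather than read off a chart. The sketch below is one way to do it, run on synthetic blobs as a stand-in for the wine data. The names `X_demo`, `scores`, and `best_k`, and the `n_init` and `random_state` settings, are assumptions for illustration.

```python
from sklearn import cluster, metrics
from sklearn.datasets import make_blobs

# Synthetic stand-in data: three tight, well-separated blobs
X_demo, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)

# Silhouette score for each candidate number of clusters
scores = {}
for k in range(2, 7):
    km_k = cluster.KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_demo)
    scores[k] = metrics.silhouette_score(X_demo, km_k.labels_)

# Pick the k with the highest score
best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```

Storing the scores in a dictionary also avoids refitting the models when the values are needed for both a plot and the final choice.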