## Key words
### Normalization, standardization, k-means, MinMaxScaler, StandardScaler, KMeans

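### Characteristics of k-means clustering
- A clustering technique that starts from k arbitrary `points`, groups each observation with the nearest point, and then moves each point to the mean of its group
- Settling on the number of clusters (k) usually takes several rounds of trial and error
- Set a seed to make the result reproducible: random_state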
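
The assign-then-average loop described above can be sketched in a few lines of NumPy. This is only an illustration of the idea, not how sklearn implements it: `kmeans_sketch` and `assign_labels` are hypothetical helper names, the toy data is made up, and the starting centers are passed in explicitly so the run is deterministic.

```python
import numpy as np

def assign_labels(X, centers):
    """Return, for each row of X, the index of the nearest center."""
    # Pairwise Euclidean distances, shape (n_rows, n_centers)
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)

def kmeans_sketch(X, init_centers, max_iter=100):
    """Hypothetical helper: repeat assignment and mean update until stable."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(init_centers, dtype=float).copy()
    for _ in range(max_iter):
        labels = assign_labels(X, centers)
        new_centers = centers.copy()
        for j in range(len(centers)):
            members = X[labels == j]
            if len(members) > 0:  # keep the old center if a cluster is empty
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break  # centers stopped moving
        centers = new_centers
    return assign_labels(X, centers), centers

# Made-up toy data: two well-separated groups of three points each
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]])
labels, centers = kmeans_sketch(X, init_centers=X[:2])
print(labels)
print(centers)
```

Because the starting points are arbitrary in real use, different starts can give different clusterings, which is why the seed matters.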

### sklearn - MinMaxScaler()
- The sklearn class that performs `min-max normalization`
- The fit() method learns the scaling rule (each column's minimum and maximum), and the transform() method applies it
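
For reference, with the default `feature_range` of (0, 1), the rule that fit() learns and transform() applies to each column is usually written as follows, where $x_{\min}$ and $x_{\max}$ are the column minimum and maximum seen during fit():

$$
x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}
$$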

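### sklearn - StandardScaler()
- The sklearn class that performs `standardization`
- The fit() method learns the scaling rule (each column's mean and standard deviation), and the transform() method applies it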
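
The notebook only runs MinMaxScaler, so here is a small sketch of the StandardScaler equivalent on made-up numbers. It assumes the usual definition, subtract the column mean and divide by the population standard deviation (`ddof=0`), and checks that against sklearn's output.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: two columns on very different scales
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

scaler = StandardScaler().fit(X)  # learns the per-column mean and scale
Z = scaler.transform(X)

# Manual version: NumPy's std() uses ddof=0 (population std) by default
Z_manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(Z)
print(np.allclose(Z, Z_manual))
```

After the transform each column has mean 0 and standard deviation 1, regardless of its original scale.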

In [1]:
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler

In [2]:
df = pd.read_csv("iris.csv")
df.head(2)

Unnamed: 0,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa


In [5]:
# Split the data into two pieces
df_1 = df.head()   # first 5 rows
df_2 = df.tail(1)  # last row only

In [4]:
df_1

Unnamed: 0,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


In [6]:
df_2

Unnamed: 0,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
149,5.9,3.0,5.1,1.8,virginica


In [8]:
# nor_minmax = MinMaxScaler().fit(df_1) # raises an error because of the string column Species

In [9]:
nor_minmax = MinMaxScaler().fit(df_1.iloc[:, :-1]) # fit on the numeric columns only
nor_minmax # the fitted object stores the per-column minimum and maximum that the transform needs

MinMaxScaler()

In [10]:
# Apply the fitted scaler
nor_minmax.transform(df_1.iloc[:, :-1]) # returns a NumPy array; Petal.Width is constant in df_1, so the whole column becomes 0

array([[1.        , 0.83333333, 0.5       , 0.        ],
       [0.6       , 0.        , 0.5       , 0.        ],
       [0.2       , 0.33333333, 0.        , 0.        ],
       [0.        , 0.16666667, 1.        , 0.        ],
       [0.8       , 1.        , 0.5       , 0.        ]])

In [11]:
nor_minmax.transform(df_2.iloc[:, :-1])

array([[ 2.6,  0. , 19. ,  1.6]])
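
The values above fall outside the 0 to 1 range because transform() reuses the minimum and maximum learned from df_1 rather than recomputing them on df_2. The sketch below reproduces that arithmetic by hand, with the rows from df_1 and df_2 typed in as arrays. Replacing a zero range with 1 for the constant Petal.Width column is my reading of how sklearn avoids dividing by zero, so treat that step as an assumption.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Numeric columns of df_1 (first five iris rows) and df_2 (last row)
train = np.array([
    [5.1, 3.5, 1.4, 0.2],
    [4.9, 3.0, 1.4, 0.2],
    [4.7, 3.2, 1.3, 0.2],
    [4.6, 3.1, 1.5, 0.2],
    [5.0, 3.6, 1.4, 0.2],
])
new_row = np.array([[5.9, 3.0, 5.1, 1.8]])

scaler = MinMaxScaler().fit(train)

# Rebuild the transform from the statistics stored on the fitted scaler
value_range = scaler.data_max_ - scaler.data_min_
value_range[value_range == 0] = 1.0  # constant column: avoid dividing by zero
manual = (new_row - scaler.data_min_) / value_range

print(manual)
print(np.allclose(manual, scaler.transform(new_row)))
```

This is the behavior you want when scaling test data with rules learned from training data, but it also means new values are not guaranteed to stay inside the fitted range.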

In [12]:
nor_minmax = MinMaxScaler().fit(df_2.iloc[:, :-1]) # fit on df_2 this time
nor_minmax.transform(df_2.iloc[:, :-1]) # df_2 has a single row, so each column has min == max and every value becomes 0

array([[0., 0., 0., 0.]])

- MinMaxScaler and StandardScaler are used in exactly the same way.

Converting the result to a DataFrame

In [14]:
nor_minmax = MinMaxScaler().fit(df_1.iloc[:, :-1]) # fit on the numeric columns
a = nor_minmax.transform(df_1.iloc[:, :-1])

In [16]:
pd.DataFrame(a, columns=df_1.columns[:4]) # use this when the normalized values are needed as a DataFrame again

Unnamed: 0,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width
0,1.0,0.833333,0.5,0.0
1,0.6,0.0,0.5,0.0
2,0.2,0.333333,0.0,0.0
3,0.0,0.166667,1.0,0.0
4,0.8,1.0,0.5,0.0
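
A slightly more defensive variant of the same round trip is sketched below. It assumes you would rather pick numeric columns by dtype than by position and keep the original row index. `df_demo` and its column names are made up for the illustration.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Made-up frame with one string column and a non-default index
df_demo = pd.DataFrame(
    {"length": [5.1, 4.9, 4.7], "width": [3.5, 3.0, 3.2], "species": ["a", "a", "b"]},
    index=[10, 11, 12],
)

numeric = df_demo.select_dtypes(include="number")  # drops the string column
scaled = pd.DataFrame(
    MinMaxScaler().fit_transform(numeric),  # fit and transform in one call
    columns=numeric.columns,
    index=numeric.index,                    # keep the original row labels
)
print(scaled)
```

Keeping the index matters later if the scaled frame is joined back to the original rows.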


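### sklearn - KMeans()
- The sklearn class that performs k-means clustering
- The n_clusters, max_iter, and random_state arguments set the number of clusters, the maximum number of iterations, and the seed that fixes the result, respectively
- Training is done by passing the data to the fit() method of KMeans()
- The cluster_centers_ and labels_ attributes of the fitted object give the cluster centers and the cluster number of each row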
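
The arguments and attributes listed above can be tried on a tiny made-up dataset before moving to the real files. In this sketch `n_init=10` is set explicitly as an assumption on my part, since its default has changed between sklearn versions. The predict() call is not used elsewhere in this notebook and is shown only to illustrate assigning new points to the fitted clusters.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up data: two tight groups, around (1, 1) and around (8, 8)
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.8, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.9, 8.3]])

model = KMeans(n_clusters=2, max_iter=300, n_init=10, random_state=123).fit(X)

print(model.labels_)           # cluster number of each row
print(model.cluster_centers_)  # one row per cluster, one column per feature

# Assign unseen points to the nearest fitted center
print(model.predict(np.array([[0.9, 1.0], [8.1, 8.1]])))
```

The numeric labels themselves are arbitrary, so only the grouping is meaningful and not which group is called 0.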

In [17]:
from sklearn.cluster import KMeans

In [18]:
df = pd.read_csv("iris.csv")
df.head(2)

Unnamed: 0,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa


In [20]:
model = KMeans(n_clusters = 3, random_state = 123).fit(df.iloc[:, :-1])
model

KMeans(n_clusters=3, random_state=123)

In [21]:
model.labels_

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 0, 0, 0, 0, 2, 0, 0, 0,
       0, 0, 0, 2, 2, 0, 0, 0, 0, 2, 0, 2, 0, 2, 0, 0, 2, 2, 0, 0, 0, 0,
       0, 2, 0, 0, 0, 0, 2, 0, 0, 0, 2, 0, 0, 0, 2, 0, 0, 2])

In [22]:
model.cluster_centers_ # centroid of each cluster, i.e. the per-cluster column means you would get from a groupby on labels_

array([[6.85      , 3.07368421, 5.74210526, 2.07105263],
       [5.006     , 3.428     , 1.462     , 0.246     ],
       [5.9016129 , 2.7483871 , 4.39354839, 1.43387097]])

In [27]:
df["cluster"] = model.labels_
df.groupby("cluster").mean(numeric_only=True) # same values as cluster_centers_ above; numeric_only skips the Species column

cluster,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width
0,6.85,3.073684,5.742105,2.071053
1,5.006,3.428,1.462,0.246
2,5.901613,2.748387,4.393548,1.433871
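
Rather than comparing the two outputs by eye, the match can be checked in code. The sketch below does so on made-up data, assuming the fit converges fully. If k-means stops early, the centers and the group means may differ slightly.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Made-up data with two obvious groups
df_demo = pd.DataFrame({
    "x": [1.0, 1.2, 0.8, 8.0, 8.2, 7.9],
    "y": [1.0, 0.8, 1.1, 8.0, 7.9, 8.3],
})

model = KMeans(n_clusters=2, n_init=10, random_state=123).fit(df_demo)

# groupby sorts by cluster number, so row i lines up with cluster_centers_[i]
group_means = df_demo.assign(cluster=model.labels_).groupby("cluster").mean()
same = np.allclose(group_means.to_numpy(), model.cluster_centers_)
print(same)
```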


### 1. When k-means clustering is run on the people whose BMI is not 0, what is the mean Insulin of the cluster with the most members?
- diabetes.csv
- Use 4 clusters and a seed of 123

In [28]:
df = pd.read_csv("diabetes.csv")
df.head(2)

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0


In [51]:
df_sub = df.loc[df["BMI"] != 0, ]
df_sub.head(2)

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0


In [52]:
model = KMeans(n_clusters=4, random_state=123).fit(df_sub)
model

KMeans(n_clusters=4, random_state=123)

In [53]:
df_sub = df_sub.assign(clusters=model.labels_) # assign() returns a new DataFrame, so no SettingWithCopyWarning is raised
df_sub.head(2)

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome,clusters
0,6,148,72,35,0,33.6,0.627,50,1,0
1,1,85,66,29,0,26.6,0.351,31,0,0


In [54]:
pd.crosstab(df_sub["Pregnancies"], df_sub["clusters"])

Pregnancies \ clusters,0,1,2,3
0,57,3,29,19
1,57,6,47,25
2,44,3,42,12
3,34,3,27,10
4,44,1,12,11
5,38,1,11,6
6,30,1,10,8
7,25,2,10,7
8,24,3,4,6
9,17,1,6,4


In [55]:
df_sub.groupby("clusters")["Insulin"].mean()

clusters
0      4.103194
1    509.166667
2    102.674528
3    224.035088
Name: Insulin, dtype: float64

Answer to question 1: cluster 0 has the most members (407), and its mean Insulin is about 4.10

In [56]:
df_sub["clusters"].value_counts()

0    407
2    212
3    114
1     24
Name: clusters, dtype: int64

In [57]:
df_sub.groupby("clusters")["Insulin"].mean()

clusters
0      4.103194
1    509.166667
2    102.674528
3    224.035088
Name: Insulin, dtype: float64
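
The two lookups above (find the largest cluster, then read off its mean) can also be chained so the answer does not depend on reading the tables by eye. This is a sketch on made-up numbers. `df_demo` is hypothetical and only mimics the `Insulin` and `clusters` columns.

```python
import pandas as pd

# Made-up stand-in for df_sub: cluster 0 has the most rows
df_demo = pd.DataFrame({
    "Insulin": [0, 10, 5, 200, 220, 500],
    "clusters": [0, 0, 0, 2, 2, 1],
})

largest = df_demo["clusters"].value_counts().idxmax()  # label of the biggest cluster
answer = df_demo.loc[df_demo["clusters"] == largest, "Insulin"].mean()
print(largest, answer)
```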

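### 2. When k-means clustering is run on the people whose BMI is not 0, what is the mean age of the cluster with the most members?
- diabetes.csv
- 4 clusters, seed of 123
- Apply min-max normalization before clustering
- Compute the age from the data as it was before normalization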

In [99]:
df = pd.read_csv("diabetes.csv")
df.head(2)

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0


In [100]:
df_sub = df.loc[df["BMI"] != 0, ]
df_sub.head(2)

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0


In [101]:
model = MinMaxScaler().fit(df_sub)
model

MinMaxScaler()

In [102]:
model.transform(df_sub)

array([[0.35294118, 0.74371859, 0.59016393, ..., 0.23441503, 0.48333333,
        1.        ],
       [0.05882353, 0.42713568, 0.54098361, ..., 0.11656704, 0.16666667,
        0.        ],
       [0.47058824, 0.91959799, 0.52459016, ..., 0.25362938, 0.18333333,
        1.        ],
       ...,
       [0.29411765, 0.6080402 , 0.59016393, ..., 0.07130658, 0.15      ,
        0.        ],
       [0.05882353, 0.63316583, 0.49180328, ..., 0.11571307, 0.43333333,
        1.        ],
       [0.05882353, 0.46733668, 0.57377049, ..., 0.10119556, 0.03333333,
        0.        ]])

In [103]:
df_j = pd.DataFrame(model.transform(df_sub), columns = df_sub.columns)
df_j.head(2)

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,0.352941,0.743719,0.590164,0.353535,0.0,0.314928,0.234415,0.483333,1.0
1,0.058824,0.427136,0.540984,0.292929,0.0,0.171779,0.116567,0.166667,0.0


In [94]:
model2 = KMeans(n_clusters=4, random_state=123).fit(df_j) # fit the model on the normalized data

In [95]:
df_sub = df_sub.assign(cluster=model2.labels_) # assign() returns a new DataFrame, so no SettingWithCopyWarning is raised
df_sub.head(3)


Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome,cluster
0,6,148,72,35,0,33.6,0.627,50,1,2
1,1,85,66,29,0,26.6,0.351,31,0,0
2,8,183,64,0,0,23.3,0.672,32,1,2


In [96]:
df_sub["cluster"].value_counts()

0    361
1    135
2    131
3    130
Name: cluster, dtype: int64

In [97]:
df_sub.groupby("cluster")["Age"].mean() # cluster 0 is the largest (361 rows), so its mean Age, about 25.67, is the answer

cluster
0    25.667590
1    29.977778
2    44.297710
3    46.753846
Name: Age, dtype: float64
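
The whole flow for this question (scale, cluster on the scaled values, then summarize the unscaled values) is sketched below on made-up data. `raw` is hypothetical and deliberately has gaps in its index, as a filtered frame would. `n_init=10` is an assumption added to keep the result stable across sklearn versions.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

# Made-up data: three younger rows and two older rows, index with gaps
raw = pd.DataFrame(
    {"Age": [21, 23, 22, 60, 62], "Insulin": [100, 90, 110, 300, 320]},
    index=[0, 2, 3, 5, 8],
)

scaled = MinMaxScaler().fit_transform(raw)  # NumPy array, same row order as raw
labels = KMeans(n_clusters=2, n_init=10, random_state=123).fit(scaled).labels_

# labels is a plain array, so assign() matches it to rows by position
result = raw.assign(cluster=labels)
largest = result["cluster"].value_counts().idxmax()
mean_age = result.loc[result["cluster"] == largest, "Age"].mean()
print(mean_age)
```

The clustering sees only scaled values, while the reported mean comes from the original `Age` column, which is what the question asks for.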

### 3. When k-means clustering is run on the people whose BMI is not 0, what is the distance between the two clusters whose centers are closest in Euclidean distance?
- diabetes.csv
- Use 3 clusters and a seed of 123

In [104]:
df = pd.read_csv("diabetes.csv")
df_sub = df.loc[df["BMI"] != 0, ]
df_sub.head(2)

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0


In [105]:
model = KMeans(n_clusters=3, random_state=123).fit(df_sub)
df_centers = pd.DataFrame(model.cluster_centers_, columns=df_sub.columns)
df_centers

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,4.026316,158.447368,72.0,32.263158,441.289474,35.107895,0.569211,34.763158,0.578947
1,3.975258,114.237113,68.647423,15.259794,14.696907,31.440619,0.434579,33.808247,0.301031
2,3.542735,129.376068,71.478632,30.337607,159.401709,34.134615,0.535188,31.948718,0.418803


In [106]:
model.cluster_centers_

array([[4.02631579e+00, 1.58447368e+02, 7.20000000e+01, 3.22631579e+01,
        4.41289474e+02, 3.51078947e+01, 5.69210526e-01, 3.47631579e+01,
        5.78947368e-01],
       [3.97525773e+00, 1.14237113e+02, 6.86474227e+01, 1.52597938e+01,
        1.46969072e+01, 3.14406186e+01, 4.34579381e-01, 3.38082474e+01,
        3.01030928e-01],
       [3.54273504e+00, 1.29376068e+02, 7.14786325e+01, 3.03376068e+01,
        1.59401709e+02, 3.41346154e+01, 5.35188034e-01, 3.19487179e+01,
        4.18803419e-01]])


In [107]:
df_centers = df_centers.transpose()
df_centers

Unnamed: 0,0,1,2
Pregnancies,4.026316,3.975258,3.542735
Glucose,158.447368,114.237113,129.376068
BloodPressure,72.0,68.647423,71.478632
SkinThickness,32.263158,15.259794,30.337607
Insulin,441.289474,14.696907,159.401709
BMI,35.107895,31.440619,34.134615
DiabetesPedigreeFunction,0.569211,0.434579,0.535188
Age,34.763158,33.808247,31.948718
Outcome,0.578947,0.301031,0.418803


In [108]:
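# Euclidean distance between each pair of cluster centers (one column per cluster after the transpose)
print(sum((df_centers.iloc[:, 0] - df_centers.iloc[:, 1]) ** 2) ** 0.5) # clusters 0 and 1
print(sum((df_centers.iloc[:, 1] - df_centers.iloc[:, 2]) ** 2) ** 0.5) # clusters 1 and 2
print(sum((df_centers.iloc[:, 0] - df_centers.iloc[:, 2]) ** 2) ** 0.5) # clusters 0 and 2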

429.24419310888464
146.33847909815492
283.405999774738
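
With more than three clusters, writing one line per pair gets tedious. The sketch below loops over all pairs using only the standard library, with the three centers copied from the `cluster_centers_` output above. `math.dist` needs Python 3.8 or later.

```python
from itertools import combinations
from math import dist

# Cluster centers copied from the cluster_centers_ output (rounded as printed)
centers = [
    [4.02631579, 158.447368, 72.0, 32.2631579, 441.289474,
     35.1078947, 0.569210526, 34.7631579, 0.578947368],
    [3.97525773, 114.237113, 68.6474227, 15.2597938, 14.6969072,
     31.4406186, 0.434579381, 33.8082474, 0.301030928],
    [3.54273504, 129.376068, 71.4786325, 30.3376068, 159.401709,
     34.1346154, 0.535188034, 31.9487179, 0.418803419],
]

# Euclidean distance for every unordered pair of centers
distances = {
    (i, j): dist(centers[i], centers[j])
    for i, j in combinations(range(len(centers)), 2)
}
closest_pair = min(distances, key=distances.get)
print(closest_pair, round(distances[closest_pair], 2))
```

The closest pair it reports is clusters 1 and 2, in line with the smallest of the three printed distances.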


Answer: 146 (the distance between clusters 1 and 2, about 146.34)