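# Clustering
- Earlier, we checked the correlation between the sales amount (sales) and the continuous variables in the df_train_final dataset.
- However, apart from sales volume (pd_count), no variable was highly correlated with the sales amount.
    - A strong positive correlation between sales volume and sales amount is expected, since the sales amount is the unit price multiplied by the order quantity.
- We therefore turn to **clustering** to look for meaningful patterns and factors.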
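
As a rough illustration of the correlation check described above, the sketch below computes the Pearson correlation coefficient by hand. The function name `pearson` and all toy values are hypothetical and not taken from the notebook. With a fixed unit price, the toy sales amount is an exact multiple of pd_count, so that pair correlates perfectly, while an unrelated made-up variable does not.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy data: one product sold at a fixed unit price
pd_price = 39900
pd_count = [53, 110, 82, 175, 168]
sales = [pd_price * c for c in pd_count]  # sales amount = unit price x order quantity
temp = [1.0, -2.0, 3.0, 0.0, 3.5]         # made-up weather values

r_count = pearson(pd_count, sales)
r_temp = pearson(temp, sales)
print(round(r_count, 4), round(r_temp, 4))
```

On a real DataFrame the same check would typically be a pandas `.corr()` call rather than a hand-written function.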

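**<Raw Data Variable Descriptions>**
1. date: broadcast date and time
2. exp_mins: exposure time (minutes)
3. mom_code: mother code
4. pd_code: product code
5. pd_name: product name
6. pd_group: product group
7. pd_price: unit price
8. sales: sales amount
    - sales amount = unit price x order quantity
9. weekdays: day of the week
10. seasons: season
    - 1: winter: December to February
    - 2: spring: March to May
    - 3: summer: June to August
    - 4: autumn: September to November
11. rating: viewer rating
    - Unit: "%"
    - Average value computed over the exposure period (exp_mins)
12. temp: temperature
13. rain: precipitation
14. humidity: humidity
15. snow: snowfall
16. dust: fine dust concentration
17. is_rain: whether it rained
    - 0: no rain
    - 1: rain
18. is_snow: whether it snowed
    - 0: no snow
    - 1: snow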
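
The season coding in item 10 can be reproduced from a calendar month with a little arithmetic. This is a minimal sketch; the helper name `season_code` is hypothetical, and the notebook does not show how the `seasons` column was actually built.

```python
def season_code(month):
    """Map a calendar month (1-12) to the season code listed above."""
    if not 1 <= month <= 12:
        raise ValueError(f"month must be in 1..12, got {month}")
    # December wraps around to 0, so Dec/Jan/Feb fall into the same bucket
    return (month % 12) // 3 + 1

SEASON_NAMES = {1: "winter", 2: "spring", 3: "summer", 4: "autumn"}

codes = [season_code(m) for m in range(1, 13)]
print(codes)  # [1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 1]
```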

In [1]:
import numpy as np
import pandas as pd
import datetime as dt
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
%matplotlib inline

# Package for showing the progress of for loops
from tqdm.notebook import tqdm

# Package for merging several DataFrames at once
from functools import reduce

# These two lines switch from matplotlib's default scheme to the seaborn scheme
# With seaborn's font_scale there is no need to set each graph's font size by hand
# Note: matplotlib 3.6+ renamed this style to 'seaborn-v0_8'
plt.style.use('seaborn')
sns.set(font_scale = 2.5)

# Code to fix broken Korean fonts in graphs
from matplotlib import font_manager, rc
plt.rcParams['axes.unicode_minus'] = False

import platform

if platform.system() == 'Darwin':
    rc('font', family='AppleGothic')
elif platform.system() == 'Windows':
    path = "c:/Windows/Fonts/malgun.ttf"
    font_name = font_manager.FontProperties(fname = path).get_name()
    rc('font', family = font_name)
else:
    print('Unknown system... sorry~~~~')

# Checking the dataset

In [2]:
df_train = pd.read_csv('C:/Users/Playdata/2020_Bigcontest_working/dataset/df_train_final2.csv')
df_train

Unnamed: 0,date,exp_mins,mom_code,pd_code,pd_name,pd_group,pd_price,sales,weekdays,seasons,...,snow,dust,is_rain,is_snow,pd_count,month,day,hour,month_cat,hour_cat
0,2019-01-01 06:00:00,20.0,100346,201072,테이트 남성 셀린니트3종,의류,39900,2099000.0,Tuesday,1,...,,65.0,0,0,53,1,1,6,상반기,오전
1,2019-01-01 06:00:00,20.0,100346,201079,테이트 여성 셀린니트3종,의류,39900,4371000.0,Tuesday,1,...,,65.0,0,0,110,1,1,6,상반기,오전
2,2019-01-01 06:20:00,20.0,100346,201072,테이트 남성 셀린니트3종,의류,39900,3262000.0,Tuesday,1,...,,65.0,0,0,82,1,1,6,상반기,오전
3,2019-01-01 06:20:00,20.0,100346,201079,테이트 여성 셀린니트3종,의류,39900,6955000.0,Tuesday,1,...,,65.0,0,0,175,1,1,6,상반기,오전
4,2019-01-01 06:40:00,20.0,100346,201072,테이트 남성 셀린니트3종,의류,39900,6672000.0,Tuesday,1,...,,65.0,0,0,168,1,1,6,상반기,오전
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
37367,2019-12-31 23:40:00,20.0,100448,201391,일시불쿠첸압력밥솥 6인용,주방,148000,10157000.0,Tuesday,1,...,,24.0,0,0,69,12,31,23,하반기,밤
37368,2020-01-01 00:00:00,20.0,100448,201383,무이자쿠첸압력밥솥 10인용,주방,178000,50929000.0,Wednesday,1,...,,27.0,0,0,287,1,1,0,상반기,심야
37369,2020-01-01 00:00:00,20.0,100448,201390,일시불쿠첸압력밥솥 10인용,주방,168000,104392000.0,Wednesday,1,...,,27.0,0,0,622,1,1,0,상반기,심야
37370,2020-01-01 00:00:00,20.0,100448,201384,무이자쿠첸압력밥솥 6인용,주방,158000,13765000.0,Wednesday,1,...,,27.0,0,0,88,1,1,0,상반기,심야


In [3]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37372 entries, 0 to 37371
Data columns (total 24 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   date       37372 non-null  object 
 1   exp_mins   37372 non-null  float64
 2   mom_code   37372 non-null  int64  
 3   pd_code    37372 non-null  int64  
 4   pd_name    37372 non-null  object 
 5   pd_group   37372 non-null  object 
 6   pd_price   37372 non-null  int64  
 7   sales      37372 non-null  float64
 8   weekdays   37372 non-null  object 
 9   seasons    37372 non-null  int64  
 10  rating     37372 non-null  float64
 11  temp       37372 non-null  float64
 12  rain       37372 non-null  float64
 13  humidity   37372 non-null  float64
 14  snow       368 non-null    float64
 15  dust       37372 non-null  float64
 16  is_rain    37372 non-null  int64  
 17  is_snow    37372 non-null  int64  
 18  pd_count   37372 non-null  int64  
 19  month      37372 non-null  int64  
 20  day   

# Changing data types
- Among the columns used in the analysis, we convert the categorical variables to the 'category' dtype.

In [4]:
categoricalFeatureNames = ['month', 'day', 'hour', 'month_cat', 'hour_cat', 'weekdays', 'seasons',
                           'pd_group', 'is_rain', 'is_snow']

for var in categoricalFeatureNames:
    df_train[var] = df_train[var].astype('category')
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37372 entries, 0 to 37371
Data columns (total 24 columns):
 #   Column     Non-Null Count  Dtype   
---  ------     --------------  -----   
 0   date       37372 non-null  object  
 1   exp_mins   37372 non-null  float64 
 2   mom_code   37372 non-null  int64   
 3   pd_code    37372 non-null  int64   
 4   pd_name    37372 non-null  object  
 5   pd_group   37372 non-null  category
 6   pd_price   37372 non-null  int64   
 7   sales      37372 non-null  float64 
 8   weekdays   37372 non-null  category
 9   seasons    37372 non-null  category
 10  rating     37372 non-null  float64 
 11  temp       37372 non-null  float64 
 12  rain       37372 non-null  float64 
 13  humidity   37372 non-null  float64 
 14  snow       368 non-null    float64 
 15  dust       37372 non-null  float64 
 16  is_rain    37372 non-null  category
 17  is_snow    37372 non-null  category
 18  pd_count   37372 non-null  int64   
 19  month      37372 non-null

# Extracting categorical variables
- Because we will use the K-prototypes algorithm, which handles categorical variables directly, we do not apply One-Hot Encoding.
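
To make concrete why no encoding is needed: K-prototypes, as described by Huang, measures the distance between a row and a cluster prototype as the squared Euclidean distance over the numeric attributes plus a weight gamma times the number of mismatching categorical attributes. The following is a minimal sketch of that measure. The function name, the toy row, the toy prototype, and the gamma value are all hypothetical, and the `kmodes` implementation may differ in details such as how gamma is chosen when it is not given.

```python
def mixed_dissimilarity(row_num, row_cat, proto_num, proto_cat, gamma):
    """Squared Euclidean distance on the numeric parts plus
    gamma times the count of categorical mismatches."""
    numeric_part = sum((a - b) ** 2 for a, b in zip(row_num, proto_num))
    categorical_part = sum(a != b for a, b in zip(row_cat, proto_cat))
    return numeric_part + gamma * categorical_part

# Hypothetical toy row and prototype (scaled numerics, raw category labels)
row_num, row_cat = [0.5, 0.25], ["Tuesday", "clothing", 0]
proto_num, proto_cat = [0.25, 0.25], ["Tuesday", "kitchen", 0]

d = mixed_dissimilarity(row_num, row_cat, proto_num, proto_cat, gamma=0.5)
print(d)  # 0.0625 (numeric) + 0.5 * 1 (one mismatch) = 0.5625
```

Because categories are only compared for equality, their labels never need a numeric encoding.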

In [5]:
df_cate = df_train[['month', 'day', 'hour', 'month_cat', 'hour_cat', 'weekdays', 
                    'seasons', 'pd_group', 'is_rain', 'is_snow']]
df_cate.head()

Unnamed: 0,month,day,hour,month_cat,hour_cat,weekdays,seasons,pd_group,is_rain,is_snow
0,1,1,6,상반기,오전,Tuesday,1,의류,0,0
1,1,1,6,상반기,오전,Tuesday,1,의류,0,0
2,1,1,6,상반기,오전,Tuesday,1,의류,0,0
3,1,1,6,상반기,오전,Tuesday,1,의류,0,0
4,1,1,6,상반기,오전,Tuesday,1,의류,0,0


# Extracting continuous variables and Feature Scaling
- We use `MinMaxScaler()` as the scaling method.
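
Min-max scaling maps each column linearly onto [0, 1] via x' = (x - min) / (max - min). The library-free sketch below shows the arithmetic for a single column. The helper name `min_max_scale` and the sample prices are hypothetical, and mapping a constant column to 0.0 is a choice made here for illustration.

```python
def min_max_scale(values):
    """Rescale a list of numbers linearly onto [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; map every value to 0.0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

prices = [39900, 148000, 158000, 168000, 178000]  # hypothetical sample
scaled = min_max_scale(prices)
print(scaled)
```

The smallest value becomes 0 and the largest becomes 1, so a single extreme value in a column compresses all the others toward one end of the range.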

In [6]:
# Dataset containing only the continuous variables
df_cont = df_train[['exp_mins', 'pd_price', 'rating', 'temp', 'rain', 'humidity', 'dust']]

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaler.fit(df_cont)
df_cont_scaled = scaler.transform(df_cont)

# transform() returns the scaled data as a NumPy ndarray, so convert it back to a DataFrame
df_cont_scaled = pd.DataFrame(df_cont_scaled, columns = df_cont.columns)

df_cont_scaled.head()

Unnamed: 0,exp_mins,pd_price,rating,temp,rain,humidity,dust
0,0.46714,0.003423,0.0,0.061053,0.04902,0.574713,0.237736
1,0.46714,0.003423,0.0,0.061053,0.04902,0.574713,0.237736
2,0.46714,0.003423,0.0,0.061053,0.04902,0.574713,0.237736
3,0.46714,0.003423,0.0,0.061053,0.04902,0.574713,0.237736
4,0.46714,0.003423,0.0,0.061053,0.04902,0.574713,0.237736


# Building the dataset for clustering
- "categorical variables" + "scaled continuous variables"

In [7]:
df_cate.shape

(37372, 10)

In [8]:
df_cont_scaled.shape

(37372, 7)

In [9]:
df_cluster = pd.concat([df_cate, df_cont_scaled], axis = 1)
df_cluster

Unnamed: 0,month,day,hour,month_cat,hour_cat,weekdays,seasons,pd_group,is_rain,is_snow,exp_mins,pd_price,rating,temp,rain,humidity,dust
0,1,1,6,상반기,오전,Tuesday,1,의류,0,0,0.46714,0.003423,0.000000,0.061053,0.04902,0.574713,0.237736
1,1,1,6,상반기,오전,Tuesday,1,의류,0,0,0.46714,0.003423,0.000000,0.061053,0.04902,0.574713,0.237736
2,1,1,6,상반기,오전,Tuesday,1,의류,0,0,0.46714,0.003423,0.000000,0.061053,0.04902,0.574713,0.237736
3,1,1,6,상반기,오전,Tuesday,1,의류,0,0,0.46714,0.003423,0.000000,0.061053,0.04902,0.574713,0.237736
4,1,1,6,상반기,오전,Tuesday,1,의류,0,0,0.46714,0.003423,0.000000,0.061053,0.04902,0.574713,0.237736
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
37367,12,31,23,하반기,밤,Tuesday,1,주방,0,0,0.46714,0.017077,0.172264,0.071579,0.00000,0.402299,0.083019
37368,1,1,0,상반기,심야,Wednesday,1,주방,0,0,0.46714,0.020866,0.219676,0.090526,0.00000,0.321839,0.094340
37369,1,1,0,상반기,심야,Wednesday,1,주방,0,0,0.46714,0.019603,0.219676,0.090526,0.00000,0.321839,0.094340
37370,1,1,0,상반기,심야,Wednesday,1,주방,0,0,0.46714,0.018340,0.219676,0.090526,0.00000,0.321839,0.094340


# Clustering

## K-prototypes algorithm
- The **dataset we will use contains both categorical and continuous variables**.
- We therefore use the **K-prototypes algorithm**.

- First, we convert the dataset used in the analysis to an ndarray.

In [10]:
# Converting the dataset into ndarray
df_cluster_ndarray = df_cluster.to_numpy()
df_cluster_ndarray

array([[1, 1, 6, ..., 0.04901960784313726, 0.574712643678161,
        0.23773584905660378],
       [1, 1, 6, ..., 0.04901960784313726, 0.574712643678161,
        0.23773584905660378],
       [1, 1, 6, ..., 0.04901960784313726, 0.574712643678161,
        0.23773584905660378],
       ...,
       [1, 1, 0, ..., 0.0, 0.32183908045977017, 0.09433962264150944],
       [1, 1, 0, ..., 0.0, 0.32183908045977017, 0.09433962264150944],
       [1, 1, 0, ..., 0.0, 0.32183908045977017, 0.09433962264150944]],
      dtype=object)
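
The K-prototypes calls in this section identify the categorical columns by their position in the ndarray (`categorical = [0, 1, ..., 9]`). Hard-coded positions break silently if the column order changes, so one option is to derive them from the column names. This is a sketch under the assumption that the column order matches `df_cluster` as shown above; the variable names are hypothetical.

```python
# Column order assumed to match df_cluster in this notebook
columns = ['month', 'day', 'hour', 'month_cat', 'hour_cat', 'weekdays',
           'seasons', 'pd_group', 'is_rain', 'is_snow',
           'exp_mins', 'pd_price', 'rating', 'temp', 'rain', 'humidity', 'dust']
categorical_names = ['month', 'day', 'hour', 'month_cat', 'hour_cat', 'weekdays',
                     'seasons', 'pd_group', 'is_rain', 'is_snow']

# Look up each categorical column's position instead of typing the indices
categorical_idx = [columns.index(name) for name in categorical_names]
print(categorical_idx)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

With the real DataFrame, `list(df_cluster.columns)` would supply `columns`.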

### Finding the optimal K for clustering

In [None]:
from kmodes.kprototypes import KPrototypes

cost = []
for num_clusters in tqdm(list(range(2, 6))):
    kproto = KPrototypes(n_clusters = num_clusters, init = 'Huang', n_init = 50, max_iter = 15, n_jobs = -2, random_state = 0) 
    kproto.fit_predict(df_cluster_ndarray, categorical = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
    cost.append(kproto.cost_)

# Plot the cost against the K values actually tried (2 to 5), not the list index
plt.plot(range(2, 6), cost)
plt.xlabel('K')
plt.ylabel('cost')
plt.show()

- The optimal K found with the elbow method is ???.
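
Reading the elbow off the plot is a judgment call. One common heuristic, offered here only as a sketch and not as the method used in this notebook, picks the K at which the cost curve bends most sharply, that is, where the second difference of the cost is largest. The function name `elbow_k` and the cost values below are made up for illustration.

```python
def elbow_k(ks, costs):
    """Return the K with the largest second difference in cost.
    Needs at least three (K, cost) pairs; the endpoints cannot be chosen."""
    if len(ks) != len(costs) or len(ks) < 3:
        raise ValueError("need at least three matching K/cost pairs")
    best_k, best_bend = None, float("-inf")
    for i in range(1, len(ks) - 1):
        # How much the drop in cost slows down around ks[i]
        bend = costs[i - 1] - 2 * costs[i] + costs[i + 1]
        if bend > best_bend:
            best_k, best_bend = ks[i], bend
    return best_k

ks = [2, 3, 4, 5]
costs = [1000.0, 600.0, 550.0, 520.0]  # hypothetical, not from the notebook
print(elbow_k(ks, costs))  # 3
```

With K tried only from 2 to 5, this heuristic can return only 3 or 4, so a wider range of K gives it more to work with.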

### Clustering with the optimal K
- Using the **elbow method**, we **judged the optimal K to be ???**.

In [None]:
# Replace ? with the K chosen from the elbow plot before running this cell
kproto = KPrototypes(n_clusters = ?, init = 'Huang', n_init = 50, max_iter = 15, n_jobs = -2, random_state = 0) 
clusters = kproto.fit_predict(df_cluster_ndarray, categorical = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [None]:
# Add a 'cluster' column to df_train marking the cluster each row belongs to
# (fit_predict returns one label per row, in row order)
df_train['cluster'] = clusters
df_train['cluster'].value_counts()
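
Once each row carries a cluster label, a natural next step is to compare the clusters on variables that were held out of the clustering features, such as sales. The sketch below uses a small hypothetical frame in place of `df_train`; the values and the product group labels 'A' and 'B' are made up.

```python
import pandas as pd

# Hypothetical toy frame standing in for df_train with its 'cluster' column
toy = pd.DataFrame({
    'cluster': [0, 0, 1, 1, 1],
    'sales': [2_000_000.0, 4_000_000.0, 10_000_000.0, 50_000_000.0, 30_000_000.0],
    'pd_group': ['A', 'A', 'B', 'B', 'A'],
})

# Size and average sales of each cluster
summary = toy.groupby('cluster')['sales'].agg(['count', 'mean'])

# Most frequent product group within each cluster
top_group = toy.groupby('cluster')['pd_group'].agg(lambda s: s.mode().iloc[0])

print(summary)
print(top_group)
```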