# 고혈압 분석 모델
## @author: sh22h

- year0_NA를 DataFrame으로 불러오고 4개로 나눔
  - binary: 범주형(binary) 변수, 0 / 1 로 변경
  - categoryH0: 계층 없는 범주형(>3) 변수, one-hot-encoding
  - categoryH1: 계층 있는 범주형 변수, 표준화
  - ctn:연속형 변수, 정규화, 표준화 
  - hyperTension: 고혈압

- 각각 정규화 또는 표준화한 후 변수는 X 고혈압은 y로 둠
  - 정규화(normalization): 0, 1
  - 표준화(standardization): 평균: 0 표준편차: 1
  - 계층화(quantile transform): 4분위 수

## 학습 모델

- 0차
  - 로컬에서 구현

- 1차
  - 2021-07-20
  - normalize_ctn
  - 
  ```
  model = Sequential()
  model.add(Dense(12, input_dim=52, activation='relu'))  # input layer requires input_dim param
  model.add(Dense(15, activation='relu'))
  model.add(Dense(8, activation='relu'))
  model.add(Dense(10, activation='relu'))
  model.add(Dense(1, activation='sigmoid'))  # sigmoid instead of relu for final probability between 0 and 1
  model.compile(loss="binary_crossentropy", optimizer="adam", metrics=['accuracy'])
  history = model.fit(X_train, y_train, epochs=100, batch_size=10, verbose=0)
  scores = model.evaluate(X_test, y_test)
  print("%s: %.2f%%" % (model.metrics_names[1], scores[1] * 100))
  ```

- 2차
  - 2021-07-26
  - Decision Tree 구현

- 3차
  - 2021-08-02
  - one-hot encoding 구현

- 4차
  - 2021-08-07
  - 전처리 개선
  - Keras Tunor 사용 준비
  - P1 ~ P4 모델 구현

- 5차
  - 2021-08-08
  - 데이터 결측값 관리
    - TOTALC 유의미한 값: 15개
    - 키, 몸무게 결측값 매우 많음
  - 데이터 결측값 제거
    - TOTALC 삭제
    - KNNImputer로 결측값 채우기
    - https://scikit-learn.org/stable/modules/generated/sklearn.impute.KNNImputer.html#sklearn.impute.KNNImputer

- 6차
  - 2021-08-12
  - 데이터 결측값 관리
    - 변수 모두 포함하라!
    - 범주형: 최빈값
    - 수치형: 평균값(kNN)
    - TOTALC 포함하라
    - 주말까지
  - DicisionTree 보는 법을 해석하라.
  - 모델 정확도 올리기

- 7차
  - 2021-08-23
  - unit 통일
  - dropout 통일
  - 모델 정확도 여전히 안좋음

- 8차
  - 2021-08-24
  - unit, dropout 다시 분리
  - 레이어 늘림

- 9차
  - 2021-08-26
  - 변수 중요도 낮은 변수 일부 제거하고 모델 돌려보기
    - 영향을 조금이라도 주는 것
      1. 'AS1_AGE'
      2. 'AS1_WEIGHT'
      3. 'AS1_B18'
      4. 'AS1_SEX'
      5. 'P3'
      6. 'AS1_B01'
      7. 'AS1_B04'
    - 나머지, 영향 없음.

- 10차
  - 2021-08-29
  - unit 통일
  - dropout 통일
 
- 11차
  - 2021-08-30
  - max_epoch = 10으로 통일

- 12차
  - 데이터셋 변경

- 13차
  - 2021-09-09
  - 데이터셋 변경
    - P1 ~ P4 -> FA1 ~ FA5 
    - columns 54

- 14차
  - 2021-09-13
  - 데이터 추가
    - AS1_WAIST3_A
    - 허리둘레
    - 계층있는 연속형
    - columns: 55

- 15차
  - 2021-09-26
  - 누락 되었던 식품군 F1 ~ F17 데이터 추가
  - 모델 제작에는 사용하지 않음

- 16차
  - 2021-10-01
  - JOBB 추가
  - 키 대신에 BMI 넣음
  - FA를 DP로 변경

- 17차
  - 2022-05-04
  - _1000 변수만으로 모델 학습

- 18차
  - 2022-05-13
  - 혈압을 예측하는 회귀모델 구현
  - _1000 변수 + 원래 쓰던거
  - 77777 to 0
    - AS1_DRDUA: 77777 to 0
    - AS1_HVSMAM: 77777 to 0

  - 그냥 삭제
    - 


In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tensorflow import keras

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Conv1D, GlobalMaxPooling1D, Embedding
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## 데이터 전처리

### 데이터 불러오기

In [None]:
# 종속변수에 결측값이 있는 열은 제외하고 불러옴. 데이터의 순수성 지킴
dataset = pd.read_excel('/content/drive/MyDrive/Colab Notebooks/HyperTension_Returns/dataset220513_3.xlsx',
                        index_col=0, na_values=['NA', ' ', '#NULL!','#DIV/0!']).dropna(subset=['AS1_DRUGHTCU', 'AS1_BPLIE2S_A', 'AS1_BPLIE2D_A'])
dataset

In [None]:
dataset.info()

### 데이터 그룹 별로 분류

'AS1_HEIGHT', 'AS1_WEIGHT'는 결측값이 매우 높은데 어떻게 스케일링 해야할까?

- 40 50 60대 남녀 6개 그룹으로 나눔
- 각 그룹의 평균값으로 대치
- np.select 사용

In [None]:
condition = [(dataset['AS1_SEX'] == 1 ) & (dataset['AS1_AGE'] >= 60),  # 60대 남자
             (dataset['AS1_SEX'] == 1 ) & (dataset['AS1_AGE'] >= 50),  # 50대 남자
             (dataset['AS1_SEX'] == 1 ) & (dataset['AS1_AGE'] >= 40),  # 40대 남자
             (dataset['AS1_SEX'] == 2 ) & (dataset['AS1_AGE'] >= 60),  # 60대 여자
             (dataset['AS1_SEX'] == 2 ) & (dataset['AS1_AGE'] >= 50),  # 50대 여자
             (dataset['AS1_SEX'] == 2 ) & (dataset['AS1_AGE'] >= 40)   # 40대 여자
            ]
choice = ['M60', 'M50', 'M40', 'F60', 'F50', 'F40']

dataset['group'] = np.select(condition, choice, default=np.nan)

### dataset 결측값 대치

#### 결측값이 너무 많은 데이터는 제거
- AS1_TOTALC
- AS1_FMHTREL1A
- AS1_FMDMREL1A
- AS1_FMHEREL1A
- AS1_FMCVAREL1A
- AS1_FMCDREL1A
- AS1_FMCDREL1AG
- AS1_FMCHREL1A
- AS1_FMPVREL1A
- AS1_FMLPREL1A


In [None]:
dataset = dataset.drop(columns=['AS1_TOTALC', 'AS1_FMHTREL1A', 'AS1_FMDMREL1A',
                                'AS1_FMHEREL1A', 'AS1_FMCVAREL1A', 'AS1_FMCDREL1AG',
                                'AS1_FMCDREL1A', 'AS1_FMCHREL1A', 
                                'AS1_FMPVREL1A', 'AS1_FMLPREL1A'])

#### AS1_WAIST3_A 

In [None]:
dataset = dataset.dropna(subset=['AS1_WAIST3_A'])

#### AS1_HEIGHT, AS1_WEIGHT

In [None]:
fill_mean_func = lambda g: g.fillna(g.mean()) # 각 그룹별 평균으로 결측값 대치

dataset['AS1_HEIGHT'] = dataset.groupby('group')['AS1_HEIGHT'].apply(fill_mean_func)
dataset['AS1_WEIGHT'] = dataset.groupby('group')['AS1_WEIGHT'].apply(fill_mean_func)

#### AS1_BMI

In [None]:
dataset['AS1_BMI'] = np.where(pd.notnull(dataset['AS1_BMI']) == True, dataset['AS1_BMI'], dataset['AS1_WEIGHT'] / ((dataset['AS1_HEIGHT']/100) ** 2))

#### 마지막까지 남아있는 것 dropna, 완료



In [None]:
dataset.dropna(inplace=True)

In [None]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
Index: 8997 entries, EPI20_026_2_000002 to EPI20_026_2_010030
Data columns (total 56 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   AS1_SEX        8997 non-null   int64  
 1   AS1_AGE        8997 non-null   int64  
 2   AS1_HEIGHT     8997 non-null   float64
 3   AS1_WEIGHT     8997 non-null   float64
 4   AS1_BMI        8997 non-null   float64
 5   AS1_WAIST3_A   8997 non-null   float64
 6   AS1_EDUA       8997 non-null   int64  
 7   AS1_INCOME     8997 non-null   int64  
 8   AS1_DRINK      8997 non-null   int64  
 9   AS1_DRDUA      8997 non-null   int64  
 10  AS1_SMOKEA     8997 non-null   int64  
 11  AS1_HVSMAM     8997 non-null   float64
 12  AS1_HVSMDU     8997 non-null   int64  
 13  AS1_PHYSTB     8997 non-null   int64  
 14  AS1_PHYSIT     8997 non-null   int64  
 15  AS1_PHYACTL    8997 non-null   int64  
 16  AS1_PHYACTM    8997 non-null   int64  
 17  AS1_PHYACTH    8997 non-nu

In [None]:
sum(dataset.isnull().sum())

0

In [None]:
dataset

Unnamed: 0_level_0,AS1_SEX,AS1_AGE,AS1_HEIGHT,AS1_WEIGHT,AS1_BMI,AS1_WAIST3_A,AS1_EDUA,AS1_INCOME,AS1_DRINK,AS1_DRDUA,...,AS1_B24_1000,FA1,FA2,FA3,FA4,FA5,AS1_DRUGHTCU,AS1_BPLIE2S_A,AS1_BPLIE2D_A,group
RID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
EPI20_026_2_000002,1,66,164.677632,62.936330,23.207705,68.000000,1,1,3,5,...,61.317678,24,56,42,0,786,1,128.0,80.0,M60
EPI20_026_2_000004,2,56,153.679198,68.000000,99999.000000,89.333333,1,4,1,1,...,126.462816,75,695,89,0,735,1,158.0,81.0,F50
EPI20_026_2_000006,2,43,155.673302,70.000000,28.884827,81.000000,2,4,3,1,...,133.461354,3,90,594,115,638,1,146.0,81.0,F40
EPI20_026_2_000007,1,56,176.000000,71.000000,22.920971,84.000000,6,8,3,5,...,59.332509,12,62,20,30,818,1,108.0,75.0,M50
EPI20_026_2_000010,1,50,175.000000,83.000000,27.102041,94.000000,3,3,3,5,...,113.557358,75,43,39,30,960,1,122.0,86.0,M50
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
EPI20_026_2_010025,1,55,166.869880,67.493373,24.238471,88.000000,3,2,3,5,...,84.629187,67,196,277,25,1709,1,120.0,70.0,M50
EPI20_026_2_010026,2,41,165.000000,60.000000,22.038567,73.333333,3,6,1,1,...,106.329114,16,28,292,20,729,1,73.0,47.0,F40
EPI20_026_2_010028,1,40,168.589552,73.000000,25.683934,84.000000,3,5,3,5,...,85.513078,30,1,115,20,746,1,121.0,85.0,M40
EPI20_026_2_010029,1,53,166.869880,65.000000,25.000000,87.000000,1,2,3,5,...,55.441478,37,2,15,0,960,1,109.0,70.0,M50


In [None]:
dataset.to_pickle('/content/drive/MyDrive/Colab Notebooks/HyperTension_Returns/ReplacedDatasets.pkl')
dataset.to_csv('/content/drive/MyDrive/Colab Notebooks/HyperTension_Returns/ReplacedDatasets.csv')

In [None]:
# dataset에 있는 변수 분리
# 범주형, 연속형 등으로 분리하여 raw_var 형태로 저장하여 각각 관리한다.

# 종속변수
label = 

# 범주형(binary, 0 or 1)
raw_binary = dataset.reindex(columns=['AS1_SEX', 'AS0_TIED', 'AS0_SLPAMSF', 'AS1_STRPHYSJ'])
col_b = raw_binary.columns

# 범주형(계층 없음, without hierarchy)
# raw_categoryH0 = dataset.reindex(columns=['AS1_INSUR'])
# col_H0 = raw_categoryH0.columns

# 범주형(계층 있음, with hierarchy)
raw_categoryH1 = dataset.reindex(columns=['AS1_EDUA', 'AS1_INCOME', 'AS1_DRINK', 'AS1_DRDUA',
                                          'AS1_SMOKEA', 'AS1_PHYSTB', 'AS1_PHYSIT', 'AS1_PHYACTL',
                                          'AS1_PHYACTM', 'AS1_PHYACTH', 'AS1_HEALTH'
                                          ])
col_H1 = raw_categoryH1.columns

# 연속형 변수
raw_ctn = dataset.reindex(columns=['AS1_AGE', 'AS1_HVSMAM', 'AS1_HVSMDU', 'AS1_TOTALC',
                                   'AS1_SLPAMTM', 'AS1_RGMEALFQA',
                                   'AS1_HEIGHT', 'AS1_WEIGHT', 'AS1_WAIST3_A',
                                   'AS1_B01', 'AS1_B02', 'AS1_B03', 'AS1_B04', 'AS1_B05',
                                   'AS1_B06', 'AS1_B07', 'AS1_B08', 'AS1_B09', 'AS1_B10',
                                   'AS1_B11', 'AS1_B12', 'AS1_B13', 'AS1_B14', 'AS1_B15',
                                   'AS1_B16', 'AS1_B17', 'AS1_B18', 'AS1_B19', 'AS1_B20',
                                   'AS1_B21', 'AS1_B23', 'AS1_B24',
                                   'DP1', 'DP2', 'DP3', 'DP4', 'DP5',
                                   ])
col_c = raw_ctn.columns

In [None]:
# 연속형 변수
raw_X = dataset.reindex(columns=['AS1_B02_1000', 'AS1_B03_1000', 'AS1_B04_1000',
                                 'AS1_B05_1000', 'AS1_B06_1000', 'AS1_B07_1000',
                                 'AS1_B08_1000', 'AS1_B09_1000', 'AS1_B10_1000',
                                 'AS1_B11_1000', 'AS1_B12_1000', 'AS1_B13_1000',
                                 'AS1_B14_1000', 'AS1_B15_1000', 'AS1_B16_1000',
                                 'AS1_B17_1000', 'AS1_B18_1000', 'AS1_B19_1000',
                                 'AS1_B20_1000', 'AS1_B21_1000', 'AS1_B23_1000',
                                 'AS1_B24_1000'
                                 ])
idx = raw_X.index
col = raw_X.columns

y = dataset['HYPERTENSION']

### dataset 스케일링

In [None]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler, QuantileTransformer
from sklearn.model_selection import train_test_split

scaler0 = StandardScaler()
scaler1 = MinMaxScaler()
scaler2 = QuantileTransformer()

In [None]:
X = pd.DataFrame(QuantileTransformer().fit_transform(raw_X), index=idx, columns=col)

## 피어슨 상관계수 분석

In [None]:
df = pd.concat([X, y], axis=1)

In [None]:
df.corr()

Unnamed: 0,AS1_B02_1000,AS1_B03_1000,AS1_B04_1000,AS1_B05_1000,AS1_B06_1000,AS1_B07_1000,AS1_B08_1000,AS1_B09_1000,AS1_B10_1000,AS1_B11_1000,...,AS1_B15_1000,AS1_B16_1000,AS1_B17_1000,AS1_B18_1000,AS1_B19_1000,AS1_B20_1000,AS1_B21_1000,AS1_B23_1000,AS1_B24_1000,HYPERTENSION
AS1_B02_1000,1.0,0.730212,-0.847473,0.624672,0.826015,0.66538,0.533243,0.487741,0.300626,0.216483,...,0.671888,0.461164,0.440043,0.556319,0.391156,0.33013,0.233472,0.495018,0.740685,-0.04961
AS1_B03_1000,0.730212,1.0,-0.963065,0.419656,0.476841,0.322374,0.262545,0.321681,0.077483,0.163598,...,0.421978,0.246435,0.139787,0.626092,0.202407,0.115774,-0.097781,0.417275,0.67372,-0.118918
AS1_B04_1000,-0.847473,-0.963065,1.0,-0.45383,-0.573565,-0.372089,-0.264205,-0.321898,-0.101518,-0.162416,...,-0.513994,-0.265393,-0.153537,-0.630524,-0.20133,-0.121627,0.08892,-0.385892,-0.730614,0.104365
AS1_B05_1000,0.624672,0.419656,-0.45383,1.0,0.821788,0.641102,0.719745,0.66708,0.45735,0.122636,...,0.495061,0.459379,0.6089,0.605746,0.56032,0.505352,0.437128,0.536202,0.483955,-0.033586
AS1_B06_1000,0.826015,0.476841,-0.573565,0.821788,1.0,0.757977,0.736095,0.53779,0.402381,0.184786,...,0.598631,0.492365,0.59684,0.548793,0.440889,0.445345,0.495191,0.475567,0.566849,-0.012975
AS1_B07_1000,0.66538,0.322374,-0.372089,0.641102,0.757977,1.0,0.786161,0.667472,0.445738,0.206929,...,0.526269,0.583938,0.802257,0.2674,0.646802,0.525512,0.717707,0.653184,0.404211,-0.00216
AS1_B08_1000,0.533243,0.262545,-0.264205,0.719745,0.736095,0.786161,1.0,0.716145,0.59072,0.229526,...,0.415672,0.601529,0.806449,0.286663,0.68986,0.655829,0.769953,0.666859,0.315668,-0.001727
AS1_B09_1000,0.487741,0.321681,-0.321898,0.66708,0.53779,0.667472,0.716145,1.0,0.586973,0.204687,...,0.39925,0.533195,0.76738,0.366084,0.966639,0.628895,0.569728,0.634395,0.389412,-0.014014
AS1_B10_1000,0.300626,0.077483,-0.101518,0.45735,0.402381,0.445738,0.59072,0.586973,1.0,0.184733,...,0.198034,0.393814,0.603286,0.041044,0.612657,0.825739,0.627494,0.345518,0.13775,0.020809
AS1_B11_1000,0.216483,0.163598,-0.162416,0.122636,0.184786,0.206929,0.229526,0.204687,0.184733,1.0,...,0.168589,0.086386,0.202772,0.033388,0.20487,0.195432,0.197198,0.115679,0.101255,0.01535


## dataset 분리
- train, test

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=415)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(7763, 22) (1941, 22) (7763,) (1941,)


## 모델 제작 CNN

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Dropout, Conv1D, GlobalMaxPooling1D, Dense
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

In [None]:
class ClearTrainingOutput(keras.callbacks.Callback):
  def on_train_end(*args, **kwargs):
    IPython.display.clear_output(wait = True)

es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=3)
# 검증 데이터 손실이 3회 증가하면 정해진 에포크가 도달하지 못하였더라도 학습을 조기 종료(Early Stopping)

mc = ModelCheckpoint('best_model.h5', monitor='val_acc', mode='max', verbose=1, save_best_only=True)
# 검증 데이터의 정확도(val_acc)가 이전보다 좋아질 경우에만 모델을 저장

In [None]:
EPOCH = 128
DROPOUT = 0.05
LEARNINGRATE = 0.01
i = 8

embedding_dim = 256 # 임베딩 벡터의 차원
dropout_ratio = 0.05 # 드롭아웃 비율
num_filters = 256 # 커널의 수
kernel_size = 3 # 커널의 크기
hidden_units = 16 # 뉴런의 수

In [None]:
embedding_dim = 8 # 임베딩 벡터의 차원
dropout_ratio = 0.05 # 드롭아웃 비율
num_filters = 8 # 커널의 수
kernel_size = 3 # 커널의 크기
hidden_units = 64 # 뉴런의 수

model = Sequential()

model.add(Embedding(len(X_train), embedding_dim))
model.add(Dropout(dropout_ratio))

model.add(Conv1D(num_filters, kernel_size, padding='valid', activation='relu'))
model.add(GlobalMaxPooling1D())

model.add(Dense(hidden_units, activation='relu'))
model.add(Dropout(dropout_ratio))

model.add(Dense(1, activation='sigmoid'))

model.compile(optimizer='adam', loss='mse', metrics=['acc'])
history = model.fit(X_train, y_train, epochs=100, validation_data=(X_test, y_test), callbacks=[es, mc])

Epoch 1/100
Epoch 1: val_acc improved from -inf to 0.77846, saving model to best_model.h5
Epoch 2/100
Epoch 2: val_acc did not improve from 0.77846
Epoch 3/100
Epoch 3: val_acc did not improve from 0.77846
Epoch 4/100
Epoch 4: val_acc did not improve from 0.77846
Epoch 5/100
Epoch 5: val_acc did not improve from 0.77846
Epoch 6/100
Epoch 6: val_acc did not improve from 0.77846
Epoch 7/100
Epoch 7: val_acc did not improve from 0.77846
Epoch 8/100
Epoch 8: val_acc did not improve from 0.77846
Epoch 9/100
Epoch 9: val_acc did not improve from 0.77846
Epoch 9: early stopping


In [None]:
model.summary()
model.evaluate(X_test, y_test)

Model: "sequential_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_3 (Embedding)     (None, None, 8)           62104     
                                                                 
 dropout_7 (Dropout)         (None, None, 8)           0         
                                                                 
 conv1d_3 (Conv1D)           (None, None, 8)           200       
                                                                 
 global_max_pooling1d_3 (Glo  (None, 8)                0         
 balMaxPooling1D)                                                
                                                                 
 dense_6 (Dense)             (None, 64)                576       
                                                                 
 dropout_8 (Dropout)         (None, 64)                0         
                                                      

[0.17405691742897034, 0.7784647345542908]

In [None]:
EPOCH = 64
DROPOUT = 0.05
LEARNINGRATE = 0.01
embedding_dim = 256 # 임베딩 벡터의 차원
dropout_ratio = 0.05 # 드롭아웃 비율
num_filters = 256 # 커널의 수
kernel_size = 3 # 커널의 크기
hidden_units = 16 # 뉴런의 수

def model_builder(hp):
  model = Sequential()
  hp_units = hp.Int('units', min_value = 4, max_value = EPOCH, step = 4)
  hp_dropout = hp.Float('dropout', min_value=0.0, max_value=0.5, default=0.05, step=0.05)
  hp_emb_dim = hp.Int('embedding_dim', min_value = 64, max_value = 512, step = 64)
  hp_filters = hp.Int('filters', min_value = 8, max_value = 5)

  model.add(Embedding(len(X_train), embedding_dim))
  model.add(Dropout(dropout_ratio))

  model.add(Conv1D(hp_filters, kernel_size, padding='valid', activation='relu'))
  model.add(GlobalMaxPooling1D())
  
  model.add(Dropout(hp_dropout))
  model.add(Dense(units = hp_units, activation='relu'))

  model.add(Dropout(hp_dropout))
  model.add(Dense(1, activation='sigmoid')) # 출력층

  # Tune the learning rate for the optimizer S
  hp_learning_rate = hp.Choice('learning_rate', values = [1e-2, 1e-3]) # 0.01 or 0.001

  model.compile(optimizer = Adam(learning_rate = hp_learning_rate),
                # loss="binary_crossentropy", # 손실함수: binary_crossentropy
                loss='mse'
                # metrics = ['accuracy']
                ) # 평가지표

  # model.compile(optimizer='rmsprop', 
  #               loss='mse', metrics=['mse']) #손실함수: MSE(mean squared error)
  
  return model

In [None]:
# input_shape = (X_train.shape[1],)
# hypermodel = RegressionHyperModel(input_shape)

tuner = kt.Hyperband(model_builder,
                     objective = 'val_accuracy',
                     max_epochs = EPOCH,
                     hyperband_iterations = EPOCH,
                     directory = '/content/drive/MyDrive/Colab Notebooks/HyperTension_Returns',
                     project_name = '0507_2')

tuner.search(X_train, y_train,
             epochs = EPOCH,
             validation_split=0.2,
             callbacks = [ClearTrainingOutput(), es, mc])

KeyError: ignored

In [None]:
# Get the optimal hyperparameters
best_hps = tuner.get_best_hyperparameters(num_trials = 1)[0]

print(f"""
//하이퍼 파라미터 검색 완료//
최적의 은닉층 unit 수는
{best_hps.get('units')}
최적의 학습률은
{best_hps.get('learning_rate')}
최적의 드롭아웃 확률은
{best_hps.get('dropout')}.
""")

## 기존 모델 사용

In [None]:
EPOCH = 128
DROPOUT = 0.05
LEARNINGRATE = 0.01
i = 8

In [None]:
model = Sequential()
model.add(Dense(i, activation='relu'))  # input layer requires input_dim param

model.add(Dropout(DROPOUT))
model.add(Dense(i, activation='relu'))

model.add(Dropout(DROPOUT))
model.add(Dense(i, activation='relu'))

model.add(Dense(1, activation='sigmoid'))  # sigmoid instead of relu for final probability between 0 and 1

model.compile(loss="binary_crossentropy",
              optimizer = Adam(learning_rate = LEARNINGRATE),
              metrics=['accuracy'])

history = model.fit(X_train, y_train, epochs=EPOCH, verbose=0)
scores = model.evaluate(X_test, y_test)

print("%s: %.2f%%" % (model.metrics_names[1], scores[1] * 100))

In [None]:
model.summary()

## 모델 제작 Regression

In [None]:
import IPython
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.optimizers import Adam

!pip install -q -U keras-tuner
import keras_tuner as kt
from keras_tuner import HyperModel

In [None]:
EPOCH = 64
DROPOUT = 0.05
LEARNINGRATE = 0.01
i = 8

In [None]:
def model_builder(hp):
  model = Sequential()
  hp_units = hp.Int('units', min_value = 4, max_value = EPOCH, step = 4)
  # hp_dropout = hp.Float('dropout', min_value=0.0, max_value=0.5, default=0.05, step=0.05)
  # Tune the learning rate for the optimizer S
  # hp_learning_rate = hp.Choice('learning_rate', values = [1e-2, 1e-3]) # 0.01 or 0.001


  model.add(Dense(units = hp_units, activation='relu')) # input_shape = 63
  
  model.add(Dropout(0.1))
  model.add(Dense(units = hp_units, activation='relu'))
  
  model.add(Dropout(0.1))
  model.add(Dense(units = hp_units, activation='relu'))
  
  model.add(Dropout(0.1))
  model.add(Dense(units = hp_units, activation='relu'))
  
  model.add(Dropout(0.1))
  model.add(Dense(units = hp_units, activation='relu'))

  model.add(Dropout(0.1))
  model.add(Dense(1, activation='sigmoid')) # 출력층

  model.compile(optimizer = Adam(learning_rate = 0.001),
                loss="binary_crossentropy", # 손실함수: binary_crossentropy
                metrics = ['accuracy']) # 평가지표

  # model.compile(optimizer='rmsprop', 
  #               loss='mse', metrics=['mse']) #손실함수: MSE(mean squared error)
  
  return model

In [None]:
class ClearTrainingOutput(keras.callbacks.Callback):
  def on_train_end(*args, **kwargs):
    IPython.display.clear_output(wait = True)

es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=3)
# 검증 데이터 손실이 3회 증가하면 정해진 에포크가 도달하지 못하였더라도 학습을 조기 종료(Early Stopping)

mc = ModelCheckpoint('best_model.h5', monitor='val_acc', mode='max', verbose=1, save_best_only=True)
# 검증 데이터의 정확도(val_acc)가 이전보다 좋아질 경우에만 모델을 저장

In [None]:
# input_shape = (X_train.shape[1],)
# hypermodel = RegressionHyperModel(input_shape)

tuner = kt.Hyperband(model_builder,
                     objective = 'val_accuracy',
                     max_epochs = EPOCH,
                     hyperband_iterations = EPOCH,
                     directory = '/content/drive/MyDrive/Colab Notebooks/HyperTension_Returns',
                     project_name = '0507_3')

In [None]:
tuner.search(X_train, y_train,
             epochs = EPOCH,
             validation_split=0.2,
             callbacks = [ClearTrainingOutput(), es])

Trial 231 Complete [00h 00m 10s]
val_accuracy: 0.7675467133522034

Best val_accuracy So Far: 0.7675467133522034
Total elapsed time: 00h 36m 42s

Search: Running Trial #232

Value             |Best Value So Far |Hyperparameter
12                |40                |units
0.1               |0.45              |dropout
0.01              |0.01              |learning_rate
64                |3                 |tuner/epochs
22                |0                 |tuner/initial_epoch
3                 |3                 |tuner/bracket
3                 |0                 |tuner/round
0227              |None              |tuner/trial_id

Epoch 23/64
Epoch 24/64
Epoch 25/64
Epoch 26/64

In [None]:
# Get the optimal hyperparameters
best_hps = tuner.get_best_hyperparameters(num_trials = 1)[0]

print(f"""
//하이퍼 파라미터 검색 완료//
최적의 은닉층 unit 수는
{best_hps.get('units')}
최적의 학습률은
{best_hps.get('learning_rate')}
최적의 드롭아웃 확률은
{best_hps.get('dropout')}.
""")

#### best_hps

- 2021-08-08
  ```
  INFO:tensorflow:Oracle triggered exit
  The hyperparameter search is complete. The optimal number of units in the densely-connected layer is
  72, 8, 104, 32)
   the optimal learning rate for the optimizer is
  0.001
  drop-out is
  (0.03, 0.06).
  ```
  - min_value = 8, max_value = 128, step = 8

- 2021-08-12
  ```
  Best val_accuracy So Far: 0.7678571343421936
  Total elapsed time: 00h 00m 33s
  INFO:tensorflow:Oracle triggered exit
  The hyperparameter search is complete. The optimal number of units in the densely-connected layer is
  (10, 16, 6, 24)
  and the optimal learning rate for the optimizer is
  0.0001
  drop-out is
  (0.044, 0.096).
  ```

  - min_value = 2, max_value = 32, step = 2

- 2021-08-14
  ```
  Best val_accuracy So Far: 0.7726648449897766
  Total elapsed time: 00h 01m 27s
  INFO:tensorflow:Oracle triggered exit

  The hyperparameter search is complete. The optimal number of units in the densely-connected layer is
  (16, 8, 14, 8)
  and the optimal learning rate for the optimizer is
  0.0001
  drop-out is
  (0.085, 0.09).
  ```

  - min_value = 4, max_value = 32, step = 2

- 2021-08-23
  ```
  Trial 16 Complete [00h 00m 21s]
  val_accuracy: 0.7726648449897766

  Best val_accuracy So Far: 0.7743818759918213
  Total elapsed time: 00h 01m 49s
  INFO:tensorflow:Oracle triggered exit

  The hyperparameter search is complete. The optimal number of units in the densely-connected layer is
  28
  and the optimal learning rate for the optimizer is
  0.01
  drop-out is
  0.05.
  ```
- 중요한 변수만 했을때
  - 0.7743818759918213
  - 큰 차이 없음

- 2021-08-29
  ```
  Trial 184 Complete [00h 00m 02s]
  val_accuracy: 0.7779740691184998

  Best val_accuracy So Far: 0.7891637086868286
  Total elapsed time: 00h 22m 51s
  INFO:tensorflow:Oracle triggered exit

  The hyperparameter search is complete.
  The optimal number of units in the densely-connected layer is
  30
  and the optimal learning rate for the optimizer is
  0.001
  drop-out is
  0.05.
  ```

- 2021-08-30
  ```
  Trial 382 Complete [00h 00m 07s]
  val_accuracy: 0.7623953819274902

  Best val_accuracy So Far: 0.7707662582397461
  Total elapsed time: 00h 27m 33s
  INFO:tensorflow:Oracle triggered exit

  The hyperparameter search is complete.
  The optimal number of units in the densely-connected layer is
  22
  and the optimal learning rate for the optimizer is
  0.001
  drop-out is
  0.0.
  ```

```
Trial 441 Complete [00h 00m 05s]
val_accuracy: 0.765614926815033

Best val_accuracy So Far: 0.7701223492622375
Total elapsed time: 00h 33m 03s
INFO:tensorflow:Oracle triggered exit

The hyperparameter search is complete.
The optimal number of units in the densely-connected layer is
30
and the optimal learning rate for the optimizer is
0.001
drop-out is
0.05.
```

```
Trial 90 Complete [00h 00m 21s]
val_accuracy: 0.7604635953903198

Best val_accuracy So Far: 0.7617514729499817
Total elapsed time: 00h 07m 40s
INFO:tensorflow:Oracle triggered exit

//하이퍼 파라미터 검색 완료//
최적의 은닉층 unit 수는
56
최적의 학습률은
0.01
최적의 드롭아웃 확률은
0.2
```

```
Trial 843 Complete [00h 00m 01s]
val_accuracy: 0.769478440284729

Best val_accuracy So Far: 0.7746297717094421
Total elapsed time: 01h 38m 26s
INFO:tensorflow:Oracle triggered exit

//하이퍼 파라미터 검색 완료//
최적의 은닉층 unit 수는
20
최적의 학습률은
0.001
최적의 드롭아웃 확률은
0.1.
```
```
jobb 제외, bmi 추가

Trial 814 Complete [00h 00m 02s]
val_accuracy: 0.7675467133522034

Best val_accuracy So Far: 0.7797810435295105
Total elapsed time: 01h 46m 50s
INFO:tensorflow:Oracle triggered exit

//하이퍼 파라미터 검색 완료//
최적의 은닉층 unit 수는
92
최적의 학습률은
0.001
최적의 드롭아웃 확률은
0.25.
```
```
Trial 830 Complete [00h 00m 02s]
val_accuracy: 0.7675467133522034

Best val_accuracy So Far: 0.7765614986419678
Total elapsed time: 01h 58m 33s
INFO:tensorflow:Oracle triggered exit
//하이퍼 파라미터 검색 완료//
최적의 은닉층 unit 수는
48
최적의 학습률은
0.001
최적의 드롭아웃 확률은
0.1.


In [None]:
model = tuner.hypermodel.build(best_hps)
scores = model.evaluate(X_test, y_test)
print("%s: %.2f, %s: %.2f%%" % (model.metrics_names[0], scores[0], model.metrics_names[1], scores[1] * 100))