<a href="https://colab.research.google.com/github/Deok97/AIHub_nipa/blob/main/lifelog_dementia_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

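# Numeric / Playground / Dementia and Cognitive Ability Classification

- This notebook is designed so that working through it in order, filling in the blanks in the code and running each cell, walks you through the overall workflow of the numeric-data task.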

## Task Description
- A cognitive-ability monitoring task based on sleep/activity lifelog data collected in real time


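## Data Description
- Input data (features)
  - A ring-type daily sleep/activity tracker records the wearer's sleep data (sleep start/end time, sleep score, sleep disturbances, sleep efficiency, REM sleep time, sleep depth, etc.) and activity data (activity start/end time, exercise time, activity score, metabolic rate, recovery time, distance moved, calories burned, etc.) at 5-minute intervals. The result is a lifelog that monitors a person's basic daily pattern, such as activity -> sleep -> activity, over 24 hours.


- Output data (labels)
  - CN : Cognitive Normal
  - MCI : Mild Cognitive Impairment
  - Dem : Dementia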

## Setup
### Libraries
Install and load the libraries used throughout the code.

In [87]:
# Python version >= 3.5
import sys
assert sys.version_info >= (3, 5)

# scikit learn version >= 0.20 
import sklearn
assert sklearn.__version__ >= '0.20'

# tensorflow version >= 2.0
import tensorflow as tf
assert tf.__version__ >= '2.0'

# Is this notebook running on Colab?
IS_COLAB = 'google.colab' in sys.modules

if not tf.config.list_physical_devices('GPU'):
  if IS_COLAB:
    print('Runtime > Change the runtime type, or Operation Speed can be so slow')
  else:
    print('Machine Learning procedure can be so slow')

# Import the required libraries
import os
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
from sklearn.base import BaseEstimator, TransformerMixin

from sklearn.preprocessing import LabelBinarizer, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.metrics import f1_score
from datetime import datetime, timezone, timedelta
import random

# from tf.keras.layers

from google.colab import drive

# To see the same results
np.random.seed(42)
tf.random.set_seed(42)

# Clear graph drawing
%matplotlib inline
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

Runtime > Change the runtime type, or Operation Speed can be so slow


In [2]:
drive.mount('/content/gdrive')

Mounted at /content/gdrive


### Miscellaneous
- Directory setup : store the current directory path, which is reused repeatedly later on.
  Data is stored in the `data/` folder of the current directory.  
- Working directory structure  
  |--code.ipynb  
  |--data/  
  |--|--train/  
  |--|--|--train.csv  
  |--|--test/  
  |--|--|--test.csv

In [3]:
# Path settings
ROOT_PATH = '/content/gdrive/MyDrive/lifelog'  ##### set the path where the code is located #####
DATA_DIR = '/content/gdrive/MyDrive/lifelog/data'  ##### set the path where the data is located #####

# Train / validation split ratio
TRAIN_RATIO = 0.9
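
The working directory structure above can be turned into file paths with `os.path.join`, so later cells do not have to repeat the full Drive path. This is a minimal sketch that assumes the layout shown above; `TRAIN_CSV`, `TRAIN_LABEL_CSV`, and `TEST_CSV` are illustrative names that the rest of this notebook does not use, and `train_label.csv` is assumed to sit next to `train.csv`.

```python
import os

# Same value as DATA_DIR above; repeated so the sketch runs on its own
DATA_DIR = '/content/gdrive/MyDrive/lifelog/data'

# Hypothetical helper names, built from the working directory structure
TRAIN_CSV = os.path.join(DATA_DIR, 'train', 'train.csv')
TRAIN_LABEL_CSV = os.path.join(DATA_DIR, 'train', 'train_label.csv')
TEST_CSV = os.path.join(DATA_DIR, 'test', 'test.csv')

for path in (TRAIN_CSV, TRAIN_LABEL_CSV, TEST_CSV):
    # os.path.isfile returns False (rather than raising) when the file is missing
    print(path, '->', 'found' if os.path.isfile(path) else 'missing')
```

Checking for the files up front makes a wrong Drive mount or folder name visible before any data is loaded.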

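### EDA (Exploratory Data Analysis)
Let's take a quick look at the data.  
Add any further analysis you think is needed to understand the data.  

`EDA: Exploratory Data Analysis is used by data scientists to analyze and investigate a dataset and identify its main characteristics, often with the help of data visualization methods.`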

In [4]:
train_df = pd.read_csv('/content/gdrive/MyDrive/lifelog/data/train/train.csv')
test_df = pd.read_csv('/content/gdrive/MyDrive/lifelog/data/test/test.csv') 
train_df.head()

Unnamed: 0,EMAIL,summary_date,activity_average_met,activity_cal_active,activity_cal_total,activity_class_5min,activity_daily_movement,activity_high,activity_inactive,activity_inactivity_alerts,...,sleep_temperature_delta,sleep_temperature_deviation,sleep_temperature_trend_deviation,timezone,sleep_total,CONVERT(activity_class_5min USING utf8),CONVERT(activity_met_1min USING utf8),CONVERT(sleep_hr_5min USING utf8),CONVERT(sleep_hypnogram_5min USING utf8),CONVERT(sleep_rmssd_5min USING utf8)
0,nia+404@rowan.kr,2020-11-27,1.71875,730,2944,...,14346,0,417,0,...,-0.12,-0.12,99.99,,\r,2/1/1/1/1/1/2/2/1/1/1/1/1/1/2/2/2/3/2/2/2/2/2/...,0.9/0.9/1.4/1.9/1.1/0.9/0.9/1.1/1.3/1/0.9/1.1/...,0/73/73/73/72/71/70/71/71/71/70/70/73/72/74/74...,4/2/4/3/3/1/2/2/2/2/2/2/3/3/3/4/4/3/2/2/2/2/2/...,0/10/10/10/11/11/10/12/18/13/14/12/10/10/18/17...
1,nia+404@rowan.kr,2020-11-28,1.40625,342,2449,...,6352,0,473,0,...,-0.32,-0.32,99.99,,\r,1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/2/2/2/2/2/2/...,1.2/1.1/0.9/0.9/0.9/0.9/0.9/0.9/0.9/0.9/0.9/0....,69/70/69/69/70/72/71/72/70/69/69/69/68/68/63/6...,2/4/2/2/2/2/3/1/2/2/4/4/2/2/2/2/2/2/2/2/2/2/4/...,23/23/26/24/18/13/15/14/17/20/24/30/23/25/22/1...
2,nia+404@rowan.kr,2020-11-29,1.46875,401,2544,...,7297,0,586,0,...,0.07,0.07,99.99,,\r,1/1/1/1/1/1/1/2/1/1/1/1/2/2/2/2/2/1/1/1/1/1/2/...,1.1/1.1/1.2/1.1/0.9/0.9/0.9/0.9/0.9/0.9/0.9/0....,0/74/73/73/74/74/74/71/71/70/70/69/70/68/66/69...,4/2/4/4/1/1/1/4/4/4/4/4/4/4/2/3/4/2/2/4/2/2/2/...,0/11/14/20/13/14/14/16/27/29/27/20/19/19/14/12...
3,nia+404@rowan.kr,2020-11-30,0.34375,27,1850,...,491,0,176,0,...,-0.41,-0.41,99.99,,\r,2/1/2/2/1/2/1/1/2/1/1/1/1/1/2/1/1/1/1/1/2/2/2/...,0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/...,73/70/71/72/75/75/73/70/70/70/67/63/63/63/63/6...,4/4/4/4/3/3/3/2/4/4/4/2/2/2/2/2/2/2/2/4/2/2/2/...,24/28/19/17/12/10/17/20/23/23/25/31/26/25/34/3...
4,nia+404@rowan.kr,2020-12-01,1.46875,333,2518,...,5861,0,646,0,...,-0.27,-0.27,99.99,,\r,1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/2/2/3/3/2/...,0.9/0.9/0.9/0.9/0.9/0.9/0.9/0.9/0.9/0.9/0.9/0....,0/0/0/0/0/0/0/0/69/69/71/69/65/66/64/64/65/66/...,4/4/4/4/4/4/4/4/4/4/4/2/2/2/2/3/3/2/4/4/4/2/2/...,0/0/0/0/0/0/0/0/21/22/26/23/19/29/22/17/14/13/...


In [114]:
train_df.info() # 9327(n_rows) x 65(n_columns)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9327 entries, 0 to 9326
Data columns (total 65 columns):
 #   Column                                    Non-Null Count  Dtype  
---  ------                                    --------------  -----  
 0   EMAIL                                     9327 non-null   object 
 1   summary_date                              9327 non-null   object 
 2   activity_average_met                      9327 non-null   float64
 3   activity_cal_active                       9327 non-null   int64  
 4   activity_cal_total                        9327 non-null   int64  
 5   activity_class_5min                       9327 non-null   object 
 6   activity_daily_movement                   9327 non-null   int64  
 7   activity_high                             9327 non-null   int64  
 8   activity_inactive                         9327 non-null   int64  
 9   activity_inactivity_alerts                9327 non-null   int64  
 10  activity_low                        

In [None]:
train_df.columns

In [None]:
train_label_df = pd.read_csv(os.path.join('/content/gdrive/MyDrive/lifelog/data/train', 'train_label.csv'))
train_label_df.head()

Unnamed: 0,SAMPLE_EMAIL,DIAG_NM
0,nia+315@rowan.kr,CN
1,nia+220@rowan.kr,CN
2,nia+096@rowan.kr,MCI
3,nia+163@rowan.kr,CN
4,nia+396@rowan.kr,CN
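
One possible extra EDA step is to attach each subject's diagnosis to that subject's daily rows and compare a feature across the three classes. The sketch below uses made-up toy values with the real column names (`EMAIL`, `SAMPLE_EMAIL`, `DIAG_NM`, `sleep_score_total`); the frames `features` and `label_df` are stand-ins for `train_df` and `train_label_df`, not the actual data.

```python
import pandas as pd

# Toy stand-ins for train_df and train_label_df (made-up values, real column names)
features = pd.DataFrame({
    'EMAIL': ['a@x.kr', 'a@x.kr', 'b@x.kr'],
    'summary_date': ['2020-11-27', '2020-11-28', '2020-11-27'],
    'sleep_score_total': [62, 44, 71],
})
label_df = pd.DataFrame({
    'SAMPLE_EMAIL': ['b@x.kr', 'a@x.kr'],
    'DIAG_NM': ['MCI', 'CN'],
})

# Attach each subject's diagnosis to every daily row of that subject
merged = features.merge(label_df, left_on='EMAIL', right_on='SAMPLE_EMAIL', how='left')

# Mean of one feature per diagnosis group
per_class_mean = merged.groupby('DIAG_NM')['sleep_score_total'].mean()
print(per_class_mean)
```

Note that the label file is keyed by `SAMPLE_EMAIL` and is not in the same subject order as the feature file, which is why the join is done on the e-mail column rather than by position.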


In [141]:
# SimpleImputer can handle only numeric columns. The other columns must be dropped
# Select only numeric columns
numeric_cls = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
train_df_num = train_df.select_dtypes(include = numeric_cls)
train_df_num.head()

Unnamed: 0,activity_average_met,activity_cal_active,activity_cal_total,activity_daily_movement,activity_high,activity_inactive,activity_inactivity_alerts,activity_low,activity_medium,activity_met_min_high,...,sleep_score_deep,sleep_score_disturbances,sleep_score_efficiency,sleep_score_latency,sleep_score_rem,sleep_score_total,sleep_temperature_delta,sleep_temperature_deviation,sleep_temperature_trend_deviation,timezone
0,1.71875,730,2944,14346,0,417,0,545,47,0,...,41,50,27,97,66,62,-0.12,-0.12,99.99,
1,1.40625,342,2449,6352,0,473,0,392,8,0,...,49,43,46,91,35,44,-0.32,-0.32,99.99,
2,1.46875,401,2544,7297,0,586,0,362,24,0,...,56,47,34,89,41,62,0.07,0.07,99.99,
3,0.34375,27,1850,491,0,176,0,34,1,0,...,35,40,44,86,29,56,-0.41,-0.41,99.99,
4,1.46875,333,2518,5861,0,646,0,377,6,0,...,64,55,39,14,54,59,-0.27,-0.27,99.99,


#### Check how many NaN values exist

In [148]:
na_list = train_df_num.isna().sum()
na_list[na_list != 0]


timezone    9327
dtype: int64

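# Preprocessing - Building the Pipeline

## 1. SimpleImputer: using a transformer that handles missing values
The imputer computes a statistic for each feature according to `strategy` (here, the median) and stores the result in the object's `statistics_` attribute.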

In [134]:
si = SimpleImputer(strategy = 'median') # NA -> np.nan
train_df_si = si.fit_transform(train_df_num)
print(len(train_df_num.columns))

52


In [140]:
si.statistics_

array([1.4375e+00, 4.1600e+02, 2.4650e+03, 7.8440e+03, 1.0000e+00,
       5.1400e+02, 0.0000e+00, 2.7300e+02, 4.2000e+01, 7.0000e+00,
       7.0000e+00, 1.7600e+02, 1.2600e+02, 4.0000e+00, 5.2600e+02,
       8.5000e+01, 9.5000e+01, 1.0000e+02, 1.0000e+02, 8.0000e+01,
       1.0000e+02, 9.7000e+01, 9.7650e+03, 3.2700e+02, 4.7700e+03,
       1.6875e+01, 4.8900e+03, 2.9040e+04, 8.3000e+01, 5.9830e+01,
       5.5000e+01, 1.0000e+00, 1.4610e+04, 8.4080e+03, 1.4460e+04,
       5.7000e+02, 1.0000e+00, 3.6000e+03, 3.5000e+01, 3.0000e+01,
       7.5000e+01, 1.0000e+02, 9.6000e+01, 6.6000e+01, 7.9000e+01,
       8.1000e+01, 5.6000e+01, 7.4000e+01, 0.0000e+00, 0.0000e+00,
       9.9990e+01,        nan])
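
The last entry of `statistics_` above is `nan` because `timezone` contains no observed values at all. The sketch below reproduces this on a toy matrix and shows that, with scikit-learn's default settings, such a column is left out of the transformed output, so the result can have one column fewer than the input. The array `X` is made up for illustration.

```python
import warnings
import numpy as np
from sklearn.impute import SimpleImputer

# Toy matrix: the second column is entirely missing, like `timezone` above
X = np.array([[1.0, np.nan],
              [3.0, np.nan],
              [5.0, np.nan]])

imputer = SimpleImputer(strategy='median')
with warnings.catch_warnings():
    # Some scikit-learn versions warn about the skipped all-missing column
    warnings.simplefilter('ignore')
    X_imputed = imputer.fit_transform(X)

print(imputer.statistics_)  # the median of an all-NaN column stays NaN
print(X_imputed.shape)      # the all-NaN column is not present in the output
```

Because of this, column names taken from the original DataFrame may no longer line up with the imputed array; dropping all-missing columns before imputing keeps the two in sync.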

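## 2. Implementing a transformer that removes variables unhelpful for training  
- Inheriting from `TransformerMixin` provides `fit_transform()` automatically, so you can use it without implementing it yourself.  
- Inheriting from `BaseEstimator` (and not using \*args or \*\*kwargs in the constructor) additionally gives you the `get_params()` and `set_params()` methods needed for hyperparameter tuning.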

In [129]:
# Custom transformer that removes unnecessary columns
class RemoveTrashCols(BaseEstimator, TransformerMixin):
  def __init__(self, df=None):
    self.df = df

  def fit(self, df, y=None):
    # A column whose values are all identical carries no information,
    # so treat it as unnecessary. Use the `df` argument, not the global train_df.
    self.trash_col = [col for col in df.columns
                      if df[col].nunique(dropna=False) == 1]
    return self # fit must return self; otherwise fit_transform() raises an error

  def transform(self, df):
    print(f'unnecessary columns(every data in these columns has same value): \n{self.trash_col}')
    return df.loc[:, [x for x in df.columns if x not in self.trash_col]]

rtc = RemoveTrashCols()
train_df_rmv = rtc.fit_transform(train_df)
len(train_df_rmv.columns)

unnecessary columns(every data in these columns has same value): 
['activity_class_5min', 'activity_met_1min', 'sleep_hr_5min', 'sleep_hypnogram_5min', 'sleep_is_longest', 'sleep_rmssd_5min', 'sleep_temperature_trend_deviation', 'timezone', 'sleep_total']


56
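
To see what the two base classes contribute, the sketch below defines a small transformer and calls only the inherited methods. `DropConstantColumns` and its `max_unique` parameter are hypothetical names introduced for this illustration; they are not part of the notebook's pipeline.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class DropConstantColumns(BaseEstimator, TransformerMixin):
    """Hypothetical variant of RemoveTrashCols with one tunable parameter."""

    def __init__(self, max_unique=1):
        self.max_unique = max_unique

    def fit(self, X, y=None):
        # Remember which columns have at most `max_unique` distinct values
        self.drop_cols_ = [c for c in X.columns
                           if X[c].nunique(dropna=False) <= self.max_unique]
        return self

    def transform(self, X):
        return X.drop(columns=self.drop_cols_)

toy = pd.DataFrame({'a': [1, 2, 3], 'b': [7, 7, 7], 'c': [None, None, None]})
dcc = DropConstantColumns()

params_before = dcc.get_params()           # provided by BaseEstimator
kept = list(dcc.fit_transform(toy).columns)  # fit_transform comes from TransformerMixin
dcc.set_params(max_unique=3)               # also provided by BaseEstimator

print(params_before, kept, dcc.get_params())
```

Only `__init__`, `fit`, and `transform` are written by hand; `get_params()` works because every constructor argument is stored under an attribute of the same name.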

In [None]:
train_df_grouped = train_df.pivot_table(train_df,
                                        index = ['EMAIL', 'summary_date'])
len(train_df.iloc[0]), len(train_df_grouped.iloc[0])

(65, 51)

In [None]:
test_df.head()

Unnamed: 0,EMAIL,summary_date,activity_average_met,activity_cal_active,activity_cal_total,activity_class_5min,activity_daily_movement,activity_high,activity_inactive,activity_inactivity_alerts,...,sleep_temperature_delta,sleep_temperature_deviation,sleep_temperature_trend_deviation,timezone,sleep_total,CONVERT(activity_class_5min USING utf8),CONVERT(activity_met_1min USING utf8),CONVERT(sleep_hr_5min USING utf8),CONVERT(sleep_hypnogram_5min USING utf8),CONVERT(sleep_rmssd_5min USING utf8)
0,nia+075@rowan.kr,2020-10-19,1.738393,627.0,2718.0,...,17125.461981,0.0,588.0,1,...,-0.022292,-0.346215,99.99,,\r,1/1/1/1/1/1/1/1/2/1/1/1/1/1/1/1/1/1/1/1/1/1/2/...,1.4/1.8/1.2/3.9/0.9/0.1/0.9/0.9/1.3/1.2/0.9/0....,71/0/65/64/71/64/0/60/66/69/64/66/61/64/62/66/...,2/2/3/2/3/1/3/2/4/4/2/2/4/4/3/2/1/2/4//2/4/2/2...,44/19/19/16/0/20/32/26/13/14/25/17/22/25/17/14...
1,nia+075@rowan.kr,2020-10-20,1.442223,137.0,2672.0,...,11410.09949,0.0,544.0,1,...,0.481816,0.019516,99.99,,\r,1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/...,1.6/0.9/0.9/1.3/3.9/2.9/0.9/0.9/0.9/1.1/0.9/1/...,65/56/56/60/59/58/59/58/57/59/62/60/59/56/54/5...,4/3/3/2/1/2/2/1/3/4/2/3/3/1/3/3/1/2/2/2/2/2/3/...,21/17/23//34/19/17/15/19/19/17/30/29/13/19/21/...
2,nia+075@rowan.kr,2020-10-21,1.4797,175.0,2514.0,...,5497.769969,2.0,559.0,2,...,-0.082092,0.022043,99.99,,\r,1/1/1/1/1/2/2/1/1/1/1/1/1/1/1/1/1/1/1/1/1/2/1/...,1.5/0.9/1.4/0.9/0.9/0.9/3.6/1.5/0.9/0.9/0.9/1....,64/0/57/66/52/0/62/0/61/61/0/54/54/52/0/55/64/...,3/1/4/2/2/2/4/4/3/2/2/3/1/2/1/2/3/4/1/2/2/4/2/...,0/21/0/19/53/30/0/27/0/0/70/0/0/44/0/24/31/0/1...
3,nia+075@rowan.kr,2020-10-22,1.069079,217.0,2653.0,...,3852.789155,0.0,671.0,0,...,0.177269,0.013762,99.99,,\r,1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/...,4.8/1.9/1/1.3/1.4/1.8//2.2/2.1/1.2/1.2/1.2/2.3...,61/61/63/64/59/62/59/60/62/63/58/65/64/61/61/6...,4/1/2/1/4/2/1/2/1/1/2/2/4/4/2/1/2/3/1/3/2/3/4/...,27/20/15/15/16/17/18/22/16/16/16/37/18/18/16/1...
4,nia+075@rowan.kr,2020-10-23,1.645156,16.0,2327.0,...,4483.044208,0.0,767.0,0,...,-0.037126,-0.110565,99.99,,\r,1/1/1/1/1/2/1/1/2/2/1/1/1/1/1/1/1/1/1/1/1/1/1/...,1.9/1.4/1.2/1.3/1.4/1.7/1.2/1/1.8/1.2/0.9/0.9/...,58/0/60/57/60/63/57/55/56/57/57/0//57/58/56/58...,4/4/2/2/4/4/4/4/2/2/2/4/2/1/4/2/4//2/2/4/4/4/4...,25/33/13/19/14/29/25/19/0/22/0/29/0/37/0/27/0/...


In [None]:
customers, time_steps_per_customer  = np.unique(train_df['EMAIL'], return_counts=True)
print(f'number of customers: {len(customers)}         \n\
time steps per customer (as shown, the number of time steps varies; length matches the number of customers: {len(time_steps_per_customer)}): \n\
{time_steps_per_customer}')

number of customers: 148         
time steps per customer (as shown, the number of time steps varies; length matches the number of customers: 148): 
[62 85 52 93 35 65 73 61 59 71 69 87 58 67 82 66 85 65 64 74 92 69 53 65
 73 73 41 54 48 77 62 90 72 61 69 72 89 82 65 52 69 66 69 72 70 49 68 76
 76 66 90 73 65 43 62 61 62 64 59 57 73 70 67 77 86 71 43 71 50 65 56 56
 70 52 64 65 50 59 63 81 82 93 62 59 82 66 63 73 66 63 87 58 74 85 57 71
 85 67 56 67 75 72 65 61 55 67 60 48 79 62 57 75 59 49 72 56 65 76 78 77
 72 61 51 54 47 53 36 38 36 43 41 37 36 39 49 53 39 38 52 41 40 51 36 36
 52 48 40 46]
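
Since the number of recorded days differs per subject, the rows cannot be stacked into a single 3-D array as they are. One simple option, and the one the baseline below relies on, is to cut every subject down to the shortest series. The sketch shows that idea with `groupby(...).head(...)` on a made-up toy frame; it is an alternative formulation for illustration, not the baseline's own code.

```python
import pandas as pd

# Toy frame: subject 'a' has 3 days, subject 'b' has 2 days (made-up values)
toy = pd.DataFrame({
    'EMAIL': ['a', 'a', 'a', 'b', 'b'],
    'summary_date': ['d1', 'd2', 'd3', 'd1', 'd2'],
    'activity_cal_total': [2944, 2449, 2544, 2718, 2672],
})

shortest_length = int(toy['EMAIL'].value_counts().min())

# Keep only the first `shortest_length` rows of every subject
trimmed = toy.groupby('EMAIL', sort=False).head(shortest_length).reset_index(drop=True)

# Every subject now contributes the same number of time steps,
# so the values can be stacked as (subjects, time steps, features)
values = trimmed[['activity_cal_total']].to_numpy()
stacked = values.reshape(trimmed['EMAIL'].nunique(), shortest_length, -1)
print(stacked.shape)
```

Trimming discards data from subjects with longer recordings; padding or windowing are other options, at the cost of more preprocessing code.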


In [None]:
batch = range(len(customers))
time_steps_list = list()
for time_steps in time_steps_per_customer:
  temp = list(range(time_steps))
  time_steps_list.append(temp)
print(f'data batch_size: {batch}')
print(f'length of time steps batch: {len(time_steps_list)}      \ntime steps list[0]: \n{time_steps_list[0]}')

data batch_size: range(0, 148)
length of time steps batch: 148      
time steps list[0]: 
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61]


In [None]:
unique, counts= np.unique(train_label_df['DIAG_NM'], return_counts=True)
print(f'labels: {unique}   counts: {counts}')

(array(['CN', 'Dem', 'MCI'], dtype=object), array([97, 10, 41]))

***
# Just Baseline Worksheet (no manipulation)

In [None]:
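# Imports required by the PyTorch baseline below
import torch
from torch.utils.data import Dataset, DataLoader
from sklearn import preprocessing

class CustomDataset(Dataset):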
    def __init__(self, data_dir, mode):
        self.data_dir = data_dir
        self.mode = mode
                
        # Label dictionary for encoding
        self.states = ##### code #####
        self.inputs, self.labels = self.data_loader(data_dir)

    def data_loader(self, path):
        print('Loading ' + self.mode + ' dataset..')
        if not os.path.isdir(self.data_dir):
            print(f'!!! Cannot find {self.data_dir}... !!!')
            sys.exit()
            
        if self.mode == 'train':

            inputs, labels = pd.read_csv(os.path.join(self.data_dir, self.mode, self.mode + '.csv')), pd.read_csv(os.path.join(self.data_dir, self.mode, self.mode + '_label.csv'))
            inputs, labels = self.preprocessing(inputs, labels)
            inputs = inputs[:int(len(inputs)*TRAIN_RATIO)]
            labels = labels[:int(len(labels)*TRAIN_RATIO)]

            return inputs, labels
        
        elif self.mode == 'val':

            inputs, labels = pd.read_csv(os.path.join(self.data_dir, 'train/train.csv')), pd.read_csv(os.path.join(self.data_dir, 'train/train_label.csv'))
            inputs, labels = self.preprocessing(inputs, labels)

            inputs = inputs[int(len(inputs)*TRAIN_RATIO):]
            labels = labels[int(len(labels)*TRAIN_RATIO):]

            return inputs, labels

    
    
    def preprocessing(self, inputs, labels):
        print('Preprocessing ' + self.mode + ' dataset..')
        
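        # Cut time series length based on the shortest length
        train_df = pd.read_csv(#### path to train.csv before the train / val split ####)
        test_df = pd.read_csv(#### path to test.csv ####)
        time_series_length= pd.concat([train_df['EMAIL'].value_counts(), test_df['EMAIL'].value_counts()])
        # Use the minimum over both sets; [-1] only picked the last entry of the test counts
        shortest_length = int(time_series_length.min())

        # Reorder labels to follow the subject order in inputs, so that row i of
        # the reshaped tensor below is paired with the label of the same subject
        labels = labels.set_index('SAMPLE_EMAIL').loc[inputs['EMAIL'].unique()].reset_index()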

        for id in inputs['EMAIL'].unique():
            idx = inputs['EMAIL'][inputs['EMAIL'] == id].index
            start_idx = idx[0]
            end_idx = idx[-1]
            inputs.drop(list((range(start_idx + shortest_length , end_idx+1))), axis=0, inplace=True)
            inputs = inputs.reset_index(drop=True)

        # Selecting usage columns
        del_col = ['EMAIL', 'summary_date',
                   'activity_class_5min', 'activity_met_1min',
                   'sleep_hr_5min', 'sleep_hypnogram_5min', 'sleep_rmssd_5min', 'timezone', 'sleep_total',
                   'CONVERT(activity_class_5min USING utf8)', 'CONVERT(activity_met_1min USING utf8)',
                   'CONVERT(sleep_hr_5min USING utf8)', 'CONVERT(sleep_hypnogram_5min USING utf8)',
                   'CONVERT(sleep_rmssd_5min USING utf8)']
        inputs.drop(del_col, axis=1, inplace=True)

        #Normalization
        scaler = preprocessing.StandardScaler()
        if self.mode == 'test' :
            train_df.drop(del_col, axis=1, inplace=True)
            scaler.fit_transform(train_df)
            inputs = scaler.transform(inputs)
        else:
            inputs = scaler.fit_transform(inputs)

        # Convert dataframe to tensor
        inputs = torch.FloatTensor(inputs).reshape(len(labels), -1, inputs.shape[1])
        
        
        
        labels = list(map(lambda x: self.states[x], labels['DIAG_NM'].tolist()))
        labels = torch.LongTensor(labels)
        #labels = self.label_encoder(labels)
        #labels = torch.FloatTensor(labels).rehshape(len(labels),-1)

        return inputs, labels

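    def label_encoder(self, labels):
        try:
            labels = list(map(lambda x : self.states[x], labels['DIAG_NM'].tolist()))
            return labels
        except KeyError as err:
            # `assert 'Invalid states'` always passed silently; raise an error instead
            raise ValueError('Invalid states') from err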

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, index):
        return self.inputs[index, :, :], self.labels[index]

In [None]:
# Create the datasets
train_dataset = CustomDataset(data_dir=DATA_DIR, mode='train')
validation_dataset = CustomDataset(data_dir=DATA_DIR, mode='val')

# Data loading parameters
BATCH_SIZE = 512

# Build the data loaders
train_dataloader = DataLoader(dataset=train_dataset, batch_size=BATCH_SIZE, shuffle=True)
validation_dataloader = DataLoader(dataset=validation_dataset, batch_size=BATCH_SIZE, shuffle=False)

## Model Design
### Parameters
- `LEARNING_RATE` : Intuitively, the step size taken while searching for the minimum of the loss function with gradient descent. If the step is too large, the search can easily overshoot the optimum; if it is too small, the search takes longer.
- `EPOCHS` : 
  - One epoch means that the neural network has completed a forward pass and a backward pass over the entire dataset.
  - In other words, after 1 epoch the model has been trained once on the whole dataset.
  - Choosing an appropriate number of epochs helps avoid both underfitting and overfitting.
  - 1 epoch = (number of samples / batch size) iterations, rounded up
- `HIDDEN_SIZE` : 
    - The number of units in each hidden layer, that is, in the layers between the input layer and the output layer.
    - A network needs at least one hidden layer, and the number of hidden units is commonly set to a multiple of the number of input units.
    - In a stacked `nn.LSTM`, every layer uses the same `hidden_size`.
- `EARLY_STOPPING_PATIENCE` :
  - Too many epochs cause overfitting and too few cause underfitting. Early stopping avoids this dilemma by halting training at a suitable point.
  - This variable sets how many more epochs training continues after the validation score stops improving. For example, if `EARLY_STOPPING_PATIENCE` is 5 and the validation score peaks at epoch 10 and declines from the next epoch on, training continues through epoch 15 to check whether the score improves again, and stops if it does not.
- `WEIGHT_DECAY` :
  - A technique for suppressing overfitting: to reduce the complexity of the trained model, a penalty term for large weights is added to the loss function so that the weights do not grow too large during training.
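
As a worked example of the iterations-per-epoch relation in the `EPOCHS` item, the sketch below plugs in numbers from this notebook: 148 labeled subjects (97 CN + 10 Dem + 41 MCI), `TRAIN_RATIO = 0.9`, and the `BATCH_SIZE = 512` used for the data loaders. It assumes one sample per subject, as in the baseline dataset, and the DataLoader default `drop_last=False`.

```python
import math

n_subjects = 148      # total labeled subjects (97 CN + 10 Dem + 41 MCI)
TRAIN_RATIO = 0.9
BATCH_SIZE = 512      # value used for the DataLoader in this notebook

# The dataset keeps the first int(n * ratio) subjects for training
n_train = int(n_subjects * TRAIN_RATIO)

# With drop_last=False a DataLoader yields ceil(n / batch_size) batches per epoch
iterations_per_epoch = math.ceil(n_train / BATCH_SIZE)
print(n_train, iterations_per_epoch)
```

With these numbers the whole training set fits into one batch, so each epoch performs a single optimizer step; a smaller batch size gives more updates per epoch.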

#### Fill in the code
- Set the hyperparameter values used to train the model.
- Training speed and model performance can vary with the values you choose.

In [None]:
# hyper-parameters
LEARNING_RATE = ##### code #####
EPOCHS = ##### code #####
HIDDEN_SIZE = ##### code #####
EARLY_STOPPING_PATIENCE = ##### code #####
WEIGHT_DECAY = ##### code #####
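
For orientation, here is one possible way to fill in the blanks. Every value below is an assumption chosen as a plausible starting point, not a setting from the original worksheet; the learning rate and hidden size simply mirror the `max_lr=0.0001` and `hidden_dim=512` that appear hard-coded in later cells.

```python
# Illustrative starting values only; none of these come from the original worksheet
LEARNING_RATE = 1e-4          # mirrors the max_lr hard-coded in the scheduler cell
EPOCHS = 100                  # upper bound; early stopping may end training sooner
HIDDEN_SIZE = 512             # mirrors hidden_dim passed when the model is created
EARLY_STOPPING_PATIENCE = 10  # epochs to wait without a lower validation loss
WEIGHT_DECAY = 1e-5           # L2 penalty coefficient passed to the optimizer

hyper_params = {
    'LEARNING_RATE': LEARNING_RATE,
    'EPOCHS': EPOCHS,
    'HIDDEN_SIZE': HIDDEN_SIZE,
    'EARLY_STOPPING_PATIENCE': EARLY_STOPPING_PATIENCE,
    'WEIGHT_DECAY': WEIGHT_DECAY,
}
for name, value in hyper_params.items():
    print(f'{name:>24}: {value}')
```

Treat these as a first guess and adjust them while watching the validation loss and macro F1 score.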

In [None]:
import torch
import torch.nn as nn

class LSTM(nn.Module):

    def __init__(self, input_dim, hidden_dim, output_dim, device, n_layers=1):
        super(LSTM, self).__init__()
        self.device = device
        self.n_layers = n_layers
        self.hidden_dim = hidden_dim
        self.lstm = nn.LSTM(input_size = input_dim, hidden_size = hidden_dim, num_layers = n_layers, batch_first=True, bidirectional=True)
        # A bidirectional LSTM concatenates both directions, hence 2 * hidden_dim
        self.fc = nn.Linear(2 * hidden_dim, output_dim)

    def init_hidden(self, batch_size):
        # Optional random initial states; forward() uses nn.LSTM's default zero states
        h_0 = torch.randn(2 * self.n_layers, batch_size, self.hidden_dim, device=self.device)
        c_0 = torch.randn(2 * self.n_layers, batch_size, self.hidden_dim, device=self.device)
        return (h_0, c_0)

    def forward(self, x):
        lstm_out, self.h_c = self.lstm(x)
        # Return raw logits: nn.CrossEntropyLoss applies log-softmax internally,
        # so an extra softmax here would be applied twice
        return self.fc(lstm_out[:, -1, :])
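
The tensor shapes inside the model are easiest to follow on a small example. The sketch below builds a standalone bidirectional `nn.LSTM` with arbitrary toy sizes (4 subjects, 35 time steps, 50 features, 8 hidden units, all chosen for illustration) and prints the shape at each stage.

```python
import torch
import torch.nn as nn

batch_size, time_steps, n_features, hidden_dim = 4, 35, 50, 8  # arbitrary toy sizes

lstm = nn.LSTM(input_size=n_features, hidden_size=hidden_dim,
               num_layers=1, batch_first=True, bidirectional=True)
fc = nn.Linear(2 * hidden_dim, 3)   # 3 classes: CN, MCI, Dem

x = torch.randn(batch_size, time_steps, n_features)
lstm_out, (h_n, c_n) = lstm(x)

print(lstm_out.shape)  # (batch, time steps, 2 * hidden_dim)
print(h_n.shape)       # (2 * num_layers, batch, hidden_dim)

logits = fc(lstm_out[:, -1, :])   # features at the last time step
print(logits.shape)    # (batch, number of classes)
```

The factor of 2 comes from `bidirectional=True`: the forward and backward hidden states are concatenated along the last dimension, which is why the linear layer takes `2 * hidden_dim` inputs.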

In [None]:
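# Create the model
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')  # use the GPU when available
model = LSTM(input_dim=train_dataset.inputs.shape[2], hidden_dim=512, output_dim=3, device=device).to(device)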

In [None]:
import torch.optim as optim

# Set optimizer, scheduler, loss function, metric function
optimizer = optim.Adam(model.parameters(), lr=LEARNING_RATE, weight_decay=WEIGHT_DECAY)
loss_fn = nn.CrossEntropyLoss()
# Note: OneCycleLR takes over the learning rate once created; its peak is max_lr, not LEARNING_RATE
scheduler = optim.lr_scheduler.OneCycleLR(optimizer=optimizer, pct_start=0.1, div_factor=1e5, max_lr=0.0001, epochs=EPOCHS, steps_per_epoch=len(train_dataloader))
metric_fn = f1_score
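
The trainer below scores predictions with macro-averaged F1, which matters here because the classes are imbalanced (97 CN, 41 MCI, 10 Dem in the training labels). The sketch compares accuracy and macro F1 for a degenerate model that always predicts CN, using the encoding CN=0, MCI=1, Dem=2 from the test dataset cell; the all-CN predictor is a made-up example, not a result from this notebook.

```python
from sklearn.metrics import f1_score, accuracy_score

# Label counts from the training labels: 97 CN, 41 MCI, 10 Dem
y_true = [0] * 97 + [1] * 41 + [2] * 10

# A degenerate model that always predicts CN (class 0)
y_pred = [0] * len(y_true)

accuracy = accuracy_score(y_true, y_pred)
# zero_division=0 scores the never-predicted classes as 0 without a warning
macro_f1 = f1_score(y_true, y_pred, average='macro', zero_division=0)

print(f'accuracy: {accuracy:.3f}, macro F1: {macro_f1:.3f}')
```

Accuracy looks acceptable because most subjects are CN, while macro F1 averages the per-class scores equally and is pulled down by the two classes that are never predicted.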

#### Fill in the code
- In the `train_epoch` and `validate_epoch` functions, assign data and target to the device defined above.  
- The code that goes in the train_epoch part and the validate_epoch part is the same.

In [None]:
class Trainer():
    def __init__(self, model, device, loss_fn, metric_fn, optimizer=None, scheduler=None):
        self.model = model
        self.device = device
        self.loss_fn = loss_fn
        self.metric_fn = metric_fn
        self.optimizer = optimizer
        self.scheduler = scheduler


    def train_epoch(self, dataloader, epoch_index=0):
        self.model.train()
        self.train_total_loss = 0
        target_lst = []
        pred_lst = []
        for batch_idx, (data, target) in enumerate(dataloader):
            data = ##### code #####
            target = ##### code #####
            output = self.model(data)
            self.optimizer.zero_grad()
            loss = self.loss_fn(output, target)
            self.train_total_loss += loss.item()
            loss.backward()
            self.optimizer.step()
            self.scheduler.step()
            target_lst.extend(target.tolist())
            pred_lst.extend(output.argmax(dim=1).tolist())
        self.train_mean_loss = self.train_total_loss / len(dataloader)
        self.train_score = f1_score(y_true=target_lst, y_pred=pred_lst, average='macro')
        msg = f'Epoch {epoch_index}, Train, loss: {self.train_mean_loss}, Score: {self.train_score}'
        print(msg)


    def validate_epoch(self, dataloader, epoch_index=0):
        self.model.eval()
        self.val_total_loss = 0
        target_lst = []
        pred_lst = []
        with torch.no_grad():
            for batch_index, (data, target) in enumerate(dataloader):
                data = ##### code #####
                target = ##### code #####
                output = self.model(data)
                loss = self.loss_fn(output, target)
                self.val_total_loss += loss.item()
                target_lst.extend(target.tolist())
                pred_lst.extend(output.argmax(dim=1).tolist())
            self.val_mean_loss = self.val_total_loss / len(dataloader)
            self.validation_score = f1_score(y_true=target_lst, y_pred=pred_lst, average='macro')
            msg = f'Epoch {epoch_index}, Validation, loss: {self.val_mean_loss}, Score: {self.validation_score}'
            print(msg)
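
As a hint for the blanks in `train_epoch` and `validate_epoch`, the sketch below shows the usual PyTorch pattern for moving a batch to a device with `Tensor.to`. The tensors here are random toy data shaped like one batch; inside the `Trainer` the device would be `self.device` rather than the local `device` variable used in this standalone example.

```python
import torch

# Use the GPU when one is available, otherwise fall back to the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Toy batch shaped like one DataLoader item: (batch, time steps, features) and labels
data = torch.randn(4, 35, 50)
target = torch.tensor([0, 1, 2, 0])

# .to(device) returns a tensor on the requested device (a no-op if it is already there)
data = data.to(device)
target = target.to(device)

print(data.device, target.device)
```

The model and every tensor passed to it must live on the same device, otherwise PyTorch raises a runtime error at the first operation that mixes them.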

In [None]:
class LossEarlyStopper():
    """Early stopper
    
    Attributes:
        patience (int): number of epochs to keep training while the loss does not decrease
        verbose (bool): whether to print logs; logs are printed when True
        patience_counter (int): incremented by 1 each time the loss does not decrease
        min_loss (float): minimum loss so far
        stop (bool): training stops when True

    """

    def __init__(self, patience: int, verbose: bool)-> None: # logger:logging.RootLogger=None
        """ Initialization

        Args:
            patience (int): number of epochs to keep training while the loss does not decrease
            verbose (bool): whether to print logs; logs are printed when True
        """
        self.patience = patience
        self.verbose = verbose

        self.patience_counter = 0
        self.min_loss = np.inf  # np.Inf was removed in NumPy 2.0
        self.stop = False

    def check_early_stopping(self, loss: float)-> None:

        if self.min_loss == np.inf:
            self.min_loss = loss
            # self.save_checkpoint(loss=loss, model=model)

        elif loss > self.min_loss:
            self.patience_counter += 1
            msg = f"Early stopper, Early stopping counter {self.patience_counter}/{self.patience}"
            if self.verbose:
                print(msg)

            if self.patience_counter == self.patience:
                self.stop = True

        elif loss <= self.min_loss:
            self.patience_counter = 0
            self.save_model = True
            msg = f"Early stopper, Validation loss decreased {self.min_loss} -> {loss}"
            if self.verbose:
                print(msg)
            self.min_loss = loss
            # self.save_checkpoint(loss=loss, model=model)
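
The patience logic can be checked without training a model. The sketch below re-implements the same counter rule as a plain function and runs it on a made-up sequence of validation losses; `epochs_until_stop` is a hypothetical helper written for this illustration, not part of the baseline.

```python
def epochs_until_stop(val_losses, patience):
    """Return the 0-based epoch at which training would stop, or None."""
    min_loss = float('inf')
    counter = 0
    for epoch, loss in enumerate(val_losses):
        if loss > min_loss:
            counter += 1          # no improvement this epoch
            if counter == patience:
                return epoch
        else:
            counter = 0           # improvement (or tie): reset the counter
            min_loss = loss
    return None

# Made-up validation losses: best value at epoch 2, then no further improvement
losses = [0.90, 0.80, 0.70, 0.75, 0.72, 0.71, 0.74]
stop_epoch = epochs_until_stop(losses, patience=3)
print(stop_epoch)
```

Note that the counter compares each loss against the best loss so far, not against the previous epoch, so a loss that falls from 0.75 to 0.72 still counts as "no improvement" while 0.70 remains the minimum.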

In [None]:
# Set up the Trainer
trainer = Trainer(model, device, loss_fn, metric_fn, optimizer, scheduler)

# Set up the early stopper
early_stopper = LossEarlyStopper(patience=EARLY_STOPPING_PATIENCE, verbose=True)

## Training

In [None]:
criterion = 0

for epoch_index in range(EPOCHS):
    trainer.train_epoch(train_dataloader, epoch_index=epoch_index)
    trainer.validate_epoch(validation_dataloader, epoch_index=epoch_index)
     
    # early_stopping check
    early_stopper.check_early_stopping(loss=trainer.val_mean_loss)

    if early_stopper.stop:
        print('Early stopped')
        break

    if trainer.validation_score > criterion:
        # Model improved -> update the validation score and save the weights
        criterion = trainer.validation_score
        check_point = {
            'model': model.state_dict(),
            'optimizer': optimizer.state_dict(),
            'scheduler': scheduler.state_dict()
        }
        torch.save(check_point, os.path.join(ROOT_PATH, 'best.pt'))

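## Inference
Save the predicted target variable for the test data in the `submit` format and submit the file through the aiconnect platform to check your inference score.  

Fill the `answer` column with your model's inference results to create the submission file (currently every row holds the same value).

Please note that scoring is performed based on the ID values.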

In [None]:
class TestDataset(Dataset):
    def __init__(self, data_dir):
        self.data_dir = data_dir
        self.mode = 'test'
                
        # Label dictionary for encoding
        self.states = {'CN': 0, 'MCI': 1, 'Dem': 2}
        self.inputs = self.data_loader(data_dir)

    def data_loader(self, path):
        print('Loading ' + self.mode + ' dataset..')
        if not os.path.isdir(self.data_dir):
            print(f'!!! Cannot find {self.data_dir}... !!!')
            sys.exit()
            
        if os.path.isfile(os.path.join(self.data_dir, self.mode, self.mode + '_X.pt')):
            inputs = torch.load(os.path.join(self.data_dir, self.mode, self.mode + '_X.pt'))

        else:
            inputs = pd.read_csv(os.path.join(self.data_dir, self.mode, self.mode + '.csv'))
            inputs = self.preprocessing(inputs)
            torch.save(inputs, os.path.join(self.data_dir, self.mode, self.mode + '_X.pt'))
            
        return inputs
        
    
    def preprocessing(self, inputs):
        print('Preprocessing ' + self.mode + ' dataset..')
        
        # Cut time series length based on the shortest length
        train_df = pd.read_csv(##### path to train.csv before the train / val split #####)
        test_df = pd.read_csv(##### path to test.csv #####) 
        time_series_length= pd.concat([train_df['EMAIL'].value_counts(), test_df['EMAIL'].value_counts()])
        # Use the minimum over both sets; [-1] only picked the last entry of the test counts
        shortest_length = int(time_series_length.min())

        for id in inputs['EMAIL'].unique():
            idx = inputs['EMAIL'][inputs['EMAIL'] == id].index
            start_idx = idx[0]
            end_idx = idx[-1]
            inputs.drop(list((range(start_idx + shortest_length , end_idx+1))), axis=0, inplace=True)
            inputs = inputs.reset_index(drop=True)

        # Selecting usage columns
        del_col = ['EMAIL', 'summary_date',
                   'activity_class_5min', 'activity_met_1min',
                   'sleep_hr_5min', 'sleep_hypnogram_5min', 'sleep_rmssd_5min', 'timezone', 'sleep_total',
                   'CONVERT(activity_class_5min USING utf8)', 'CONVERT(activity_met_1min USING utf8)',
                   'CONVERT(sleep_hr_5min USING utf8)', 'CONVERT(sleep_hypnogram_5min USING utf8)',
                   'CONVERT(sleep_rmssd_5min USING utf8)']
        inputs.drop(del_col, axis=1, inplace=True)

        #Normalization
        scaler = preprocessing.StandardScaler()
        if self.mode == 'test' :
            train_df.drop(del_col, axis=1, inplace=True)
            scaler.fit_transform(train_df)
            inputs = scaler.transform(inputs)
        else:
            inputs = scaler.fit_transform(inputs)

        # Convert dataframe to tensor
        inputs = torch.FloatTensor(inputs).reshape(len(test_df['EMAIL'].unique()), -1, inputs.shape[1])

        return inputs

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, index):
        return self.inputs[index, :, :]

In [None]:
TRAINED_MODEL_PATH = os.path.join(ROOT_PATH, 'best.pt')

# Load dataset & dataloader
test_dataset = TestDataset(data_dir=DATA_DIR)
test_dataloader = DataLoader(dataset=test_dataset, batch_size=BATCH_SIZE, shuffle=False)

# Load Model
model = LSTM(input_dim=test_dataset.inputs.shape[2], hidden_dim=512, output_dim=3, device=device).to(device)
model.load_state_dict(torch.load(TRAINED_MODEL_PATH)['model'])
model.eval()

# Run inference and collect the predicted class indices
pred_lst = []
with torch.no_grad():
    for batch_index, data in enumerate(test_dataloader):
        data = data.to(device)
        output = model(data)
        pred_lst.extend(output.argmax(dim=1).tolist())

In [None]:
pred_lst[:5]

In [None]:
# Decode the inference results
for i in range(len(pred_lst)):
    if pred_lst[i] == 0:
        pred_lst[i] = 'CN'
    elif pred_lst[i] == 1:
        pred_lst[i] = 'MCI'
    else:
        pred_lst[i] = 'Dem'
pred_lst[:5]
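
The same decoding can also be written by inverting the `states` dictionary, which keeps the encoding and decoding tied to a single mapping. A minimal sketch, using made-up class indices in place of real model output:

```python
# Same encoding as the `states` dictionary in TestDataset
states = {'CN': 0, 'MCI': 1, 'Dem': 2}

# Invert the mapping: class index -> label name
index_to_label = {index: label for label, index in states.items()}

example_preds = [0, 2, 1, 0]   # made-up class indices for illustration
decoded = [index_to_label[p] for p in example_preds]
print(decoded)
```

Unlike the `if`/`elif`/`else` chain, this raises a `KeyError` for an unexpected index instead of silently mapping it to `'Dem'`.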

In [None]:
submit = pd.read_csv(##### path to sample_submission.csv #####) 
submit = pd.DataFrame(submit)
submit.head()

In [None]:
submit['DIAG_NM'] = pred_lst
submit.head()

In [None]:
# Create the submission file
submit.to_csv("submission.csv", index=False)