
# Category 5

Forecasting with the `Individual House Hold Electric Power Consumption Dataset`

Updated July 1, 2021

## Checklist

1. Make sure the GPU option is enabled! (Edit > Notebook settings > Hardware accelerator: GPU)

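## Steps

1. **import**: import the required modules.
2. **Preprocessing**: preprocess the data needed for training.
3. **Modeling (model)**: define the model.
4. **Compile (compile)**: configure the model's loss, optimizer, and metrics.
5. **Training (fit)**: train the model.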

## Problem

ABOUT THE DATASET

Original Source:
https://archive.ics.uci.edu/ml/datasets/individual+household+electric+power+consumption

The original 'Individual House Hold Electric Power Consumption Dataset'
has measurements of electric power consumption in one household with
a one-minute sampling rate over a period of almost 4 years.

Different electrical quantities and some sub-metering values are available.

For the purpose of the examination we have provided a subset containing
the data for the first 60 days in the dataset. We have also cleaned the
dataset beforehand to remove missing values. The dataset is provided as a
csv file in the project.

The dataset has a total of 7 features ordered by time.

---

INSTRUCTIONS

Complete the code in the following functions:
1. windowed_dataset()
2. solution_model()

The model input and output shapes must match the following
specifications.

1. Model input_shape must be (BATCH_SIZE, N_PAST = 24, N_FEATURES = 7),
   since the testing infrastructure expects a window of past N_PAST = 24
   observations of the 7 features to predict the next 24 observations of
   the same features.

2. Model output_shape must be (BATCH_SIZE, N_FUTURE = 24, N_FEATURES = 7)

3. DON'T change the values of the following constants:
   N_PAST, N_FUTURE, SHIFT in windowed_dataset();
   BATCH_SIZE in solution_model() (see code for additional note on
   BATCH_SIZE).
4. Code for normalizing the data is provided - DON'T change it.
   Changing the normalizing code will affect your score.

HINT: Your neural network must have a **validation MAE of approximately 0.055** or
less on the normalized validation dataset for top marks.

WARNING: Do not use lambda layers in your model, they are not supported
on the grading infrastructure.

WARNING: If you are using the GRU layer, it is advised not to use the
'recurrent_dropout' argument (you can alternatively set it to 0),
since it has not been implemented in the cuDNN kernel and may
result in much longer training times.

## Import required modules

In [1]:
import urllib.request  # the request submodule must be imported explicitly
import os
import zipfile
import pandas as pd

import tensorflow as tf
from tensorflow.keras.layers import Dense, Conv1D, LSTM, Bidirectional
from tensorflow.keras.models import Sequential
from tensorflow.keras.callbacks import ModelCheckpoint

## Download the dataset

In [2]:
def download_and_extract_data():
    url = 'https://storage.googleapis.com/download.tensorflow.org/data/certificate/household_power.zip'
    urllib.request.urlretrieve(url, 'household_power.zip')
    with zipfile.ZipFile('household_power.zip', 'r') as zip_ref:
        zip_ref.extractall()

In [3]:
download_and_extract_data()

In [4]:
df = pd.read_csv('household_power_consumption.csv', sep=',', parse_dates=True, index_col='datetime', header=0)
df.head(5)

| datetime | Global_active_power | Global_reactive_power | Voltage | Global_intensity | Sub_metering_1 | Sub_metering_2 | Sub_metering_3 |
|---|---|---|---|---|---|---|---|
| 2006-12-16 17:24:00 | 4.216 | 0.418 | 234.84 | 18.4 | 0.0 | 1.0 | 17.0 |
| 2006-12-16 17:25:00 | 5.36 | 0.436 | 233.63 | 23.0 | 0.0 | 1.0 | 16.0 |
| 2006-12-16 17:26:00 | 5.374 | 0.498 | 233.29 | 23.0 | 0.0 | 2.0 | 17.0 |
| 2006-12-16 17:27:00 | 5.388 | 0.502 | 233.74 | 23.0 | 0.0 | 1.0 | 17.0 |
| 2006-12-16 17:28:00 | 3.666 | 0.528 | 235.68 | 15.8 | 0.0 | 1.0 | 17.0 |


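## Normalize the data

Scale each feature toward the 0 to 1 range. Note that `normalize_series` subtracts the column minimum and then divides by the column maximum, not by (max - min), so a column whose minimum is far from zero stays well below 1 after scaling (in the `describe()` output, column 2 peaks at about 0.107).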

In [5]:
def normalize_series(data, min, max):
    data = data - min
    data = data / max
    return data
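
As an illustration of the point about scaling, the sketch below compares the same arithmetic as `normalize_series` with conventional min-max scaling. The toy column and the helper name `normalize_like_notebook` are assumptions made for this example and are not part of the dataset or the provided code.

```python
import numpy as np

def normalize_like_notebook(data, lo, hi):
    # Same arithmetic as the provided normalize_series:
    # subtract the minimum, then divide by the maximum.
    return (data - lo) / hi

# Hypothetical toy column with a large offset (not taken from the dataset).
toy = np.array([[223.0], [235.0], [250.0]])

provided = normalize_like_notebook(toy, toy.min(axis=0), toy.max(axis=0))
min_max = (toy - toy.min(axis=0)) / (toy.max(axis=0) - toy.min(axis=0))

print(provided.ravel())  # largest value is (250 - 223) / 250 = 0.108
print(min_max.ravel())   # largest value is 1.0
```

Because the problem statement forbids changing the normalization code, this comparison is only meant to explain the value ranges, not to suggest a replacement.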

In [6]:
# Set N_FEATURES to the number of DataFrame columns
N_FEATURES = len(df.columns)

# Get the DataFrame as a NumPy array and assign it to data
data = df.values

# Normalize the data
data = normalize_series(data, data.min(axis=0), data.max(axis=0))
data

array([[0.43377912, 0.47826087, 0.04036551, ..., 0.        , 0.01282051,
        0.85      ],
       [0.55716135, 0.49885584, 0.0355582 , ..., 0.        , 0.01282051,
        0.8       ],
       [0.55867127, 0.56979405, 0.03420739, ..., 0.        , 0.02564103,
        0.85      ],
       ...,
       [0.03710095, 0.        , 0.05983313, ..., 0.        , 0.        ,
        0.        ],
       [0.03559103, 0.        , 0.06515693, ..., 0.        , 0.        ,
        0.        ],
       [0.03774806, 0.        , 0.06730234, ..., 0.        , 0.01282051,
        0.        ]])

In [7]:
pd.DataFrame(data).describe()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| count | 86400.0 | 86400.0 | 86400.0 | 86400.0 | 86400.0 | 86400.0 | 86400.0 |
| mean | 0.156411 | 0.147141 | 0.064697 | 0.152278 | 0.01695 | 0.024085 | 0.375711 |
| std | 0.14404 | 0.134578 | 0.0139 | 0.139343 | 0.086787 | 0.097022 | 0.433595 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.021786 | 0.0 | 0.055344 | 0.024752 | 0.0 | 0.0 | 0.0 |
| 50% | 0.131795 | 0.132723 | 0.065713 | 0.123762 | 0.0 | 0.0 | 0.0 |
| 75% | 0.239431 | 0.224256 | 0.074652 | 0.227723 | 0.0 | 0.012821 | 0.85 |
| max | 0.979077 | 1.0 | 0.10735 | 0.980198 | 1.0 | 1.0 | 1.0 |


## Split the data

In [8]:
# Split the dataset (0.8).
# Changed from the original 0.5 to 0.8 // other ratios may be used
split_time = int(len(data) * 0.8)

In [9]:
x_train = data[:split_time]
x_valid = data[split_time:]
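
To make the split sizes concrete, here is a small sketch that repeats the same slicing on a placeholder array. The zero-filled array is an assumption standing in for the normalized data; only its shape (86400 rows, 7 features, as reported by `describe()`) is taken from the notebook.

```python
import numpy as np

# Placeholder with the same shape as the normalized data.
data = np.zeros((86400, 7))

split_time = int(len(data) * 0.8)

# The split is chronological: the first 80% of rows train the model,
# the remaining 20% validate it. Rows are not shuffled before splitting.
x_train = data[:split_time]
x_valid = data[split_time:]

print(split_time, x_train.shape, x_valid.shape)
# 69120 (69120, 7) (17280, 7)
```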

## Create the windowed dataset

The function below converts the series into a windowed dataset in which each
window holds both the observations used as features and the observations used as targets.

Don't change the shift parameter. The test windows are
created with the specified shift, so changing it might affect your
scores. Calculate the window size so that, based on the **past 24 observations (time steps t=1, t=2, ..., t=24) of the 7 variables**
in the dataset, you predict the **next 24 observations
(time steps t=25, t=26, ..., t=48) of the same 7 variables.**

Hint: Each window should include both the past observations and
the future observations which are to be predicted. Calculate the
window size based on n_past and n_future.

In [10]:
def windowed_dataset(series, batch_size, n_past=24, n_future=24, shift=1):
    ds = tf.data.Dataset.from_tensor_slices(series)
    print(type(ds))
    ds = ds.window(size=(n_past + n_future), shift = shift, drop_remainder = True)
    ds = ds.flat_map(lambda w: w.batch(n_past + n_future))
    ds = ds.shuffle(len(series))
    ds = ds.map(
        lambda w: (w[:n_past], w[n_past:])   # w[:n_past] => x / w[n_past:] => y
    )
    return ds.batch(batch_size).prefetch(1)
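
The slicing performed by `windowed_dataset` can be sketched without TensorFlow. The helper below, `make_windows`, is a hypothetical NumPy stand-in written for this illustration (it does no shuffling or batching); it only shows how each window of `n_past + n_future` rows is cut into an input part and a target part.

```python
import numpy as np

def make_windows(series, n_past=24, n_future=24, shift=1):
    """Slice a (time, features) array into (past, future) pairs."""
    window = n_past + n_future
    xs, ys = [], []
    # Only full windows are kept, mirroring drop_remainder=True.
    for start in range(0, len(series) - window + 1, shift):
        w = series[start:start + window]
        xs.append(w[:n_past])   # first n_past rows -> model input
        ys.append(w[n_past:])   # remaining n_future rows -> target
    return np.stack(xs), np.stack(ys)

# Toy series: 100 time steps, 7 features (values are arbitrary).
toy = np.arange(100 * 7, dtype=np.float32).reshape(100, 7)
x, y = make_windows(toy)

print(x.shape, y.shape)  # (53, 24, 7) (53, 24, 7)
```

With 100 rows and a window of 48, there are 100 - 48 + 1 = 53 full windows, and each target block starts exactly where its input block ends.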

Create `train_set` and `valid_set`.

In [11]:
# The following four options are given.
BATCH_SIZE = 32 # May be changed, but raising it is not recommended (lowering it works but training takes longer)
N_PAST = 24 # Do not change.
N_FUTURE = 24 # Do not change.
SHIFT = 1 # Do not change.

In [12]:
train_set = windowed_dataset(series=x_train, 
                             batch_size=BATCH_SIZE,
                             n_past=N_PAST, 
                             n_future=N_FUTURE,
                             shift=SHIFT)

valid_set = windowed_dataset(series=x_valid, 
                             batch_size=BATCH_SIZE,
                             n_past=N_PAST, 
                             n_future=N_FUTURE,
                             shift=SHIFT)

<class 'tensorflow.python.data.ops.dataset_ops.TensorSliceDataset'>
<class 'tensorflow.python.data.ops.dataset_ops.TensorSliceDataset'>


## Build the model

In [13]:
model = tf.keras.models.Sequential([
    Conv1D(filters=32, 
            kernel_size=3,
            padding="causal",
            activation="relu",
            input_shape=[N_PAST, 7],
            ),
    Bidirectional(LSTM(32, return_sequences=True)),
    Dense(32, activation="relu"),
    Dense(16, activation="relu"),
    Dense(N_FEATURES)
])

In [14]:
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 conv1d (Conv1D)             (None, 24, 32)            704       
                                                                 
 bidirectional (Bidirectiona  (None, 24, 64)           16640     
 l)                                                              
                                                                 
 dense (Dense)               (None, 24, 32)            2080      
                                                                 
 dense_1 (Dense)             (None, 24, 16)            528       
                                                                 
 dense_2 (Dense)             (None, 24, 7)             119       
                                                                 
Total params: 20,071
Trainable params: 20,071
Non-trainable params: 0
____________________________________________________
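
The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for Conv1D, LSTM, and Dense layers with biases; the helper function names are made up for this example.

```python
def conv1d_params(kernel_size, in_channels, filters):
    # One kernel of size (kernel_size x in_channels) plus a bias per filter.
    return (kernel_size * in_channels + 1) * filters

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias.
    return 4 * (input_dim + units + 1) * units

def dense_params(input_dim, units):
    # Weight matrix plus a bias per output unit.
    return (input_dim + 1) * units

counts = [
    conv1d_params(3, 7, 32),   # conv1d
    2 * lstm_params(32, 32),   # bidirectional: forward + backward LSTM
    dense_params(64, 32),      # dense (input is 2 * 32 from the BiLSTM)
    dense_params(32, 16),      # dense_1
    dense_params(16, 7),       # dense_2
]

print(counts, sum(counts))
# [704, 16640, 2080, 528, 119] 20071
```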

## Create the checkpoint

In [15]:
checkpoint_path='model/my_checkpoint.ckpt'

checkpoint = ModelCheckpoint(checkpoint_path,
                             save_weights_only=True,
                             save_best_only=True,
                             monitor='val_loss',
                             verbose=1,
                             )
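
A rough sketch of the bookkeeping behind `save_best_only=True` with `monitor='val_loss'`: weights are written only when the monitored value improves on the best seen so far. This is plain Python for illustration, not the Keras implementation. The first three values are taken from the training log in this notebook; the last one is a hypothetical non-improving epoch.

```python
# First three values come from the training log; the fourth is hypothetical.
val_losses = [0.04466, 0.04374, 0.04294, 0.04400]

best = float("inf")
saved_epochs = []

for epoch, val_loss in enumerate(val_losses, start=1):
    if val_loss < best:
        # An improvement: this is where the checkpoint would be overwritten.
        best = val_loss
        saved_epochs.append(epoch)

print(saved_epochs, best)  # [1, 2, 3] 0.04294
```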

## Compile the model

In [16]:
# Adam optimizer with learning_rate=0.0005
optimizer = tf.keras.optimizers.Adam(learning_rate=0.0005)

model.compile(loss='mae',
              optimizer=optimizer,
              metrics=["mae"]
              )
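
Since `'mae'` is used as both the loss and the metric, the two reported numbers track the same quantity. A minimal NumPy sketch of mean absolute error follows; the arrays are made-up values chosen for easy arithmetic.

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    # Average of the absolute element-wise differences.
    return float(np.mean(np.abs(y_true - y_pred)))

# Hypothetical targets and predictions on the normalized scale.
y_true = np.array([[0.2, 0.4], [0.6, 0.8]])
y_pred = np.array([[0.25, 0.35], [0.5, 0.9]])

mae = mean_absolute_error(y_true, y_pred)
print(mae)  # approximately 0.075
```

The absolute errors here are 0.05, 0.05, 0.1, and 0.1, which average to about 0.075.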

## Train the model

In [17]:
model.fit(train_set, 
        validation_data=(valid_set), 
        epochs=20, 
        callbacks=[checkpoint], 
        )

Epoch 1/20
   2158/Unknown - 38s 9ms/step - loss: 0.0520 - mae: 0.0520
Epoch 1: val_loss improved from inf to 0.04466, saving model to model/my_checkpoint.ckpt
Epoch 2/20
Epoch 2: val_loss improved from 0.04466 to 0.04374, saving model to model/my_checkpoint.ckpt
Epoch 3/20
Epoch 3: val_loss improved from 0.04374 to 0.04294, saving model to model/my_checkpoint.ckpt
Epoch 4/20
Epoch 4: val_loss improved from 0.04294 to 0.04166, saving model to model/my_checkpoint.ckpt
Epoch 5/20
Epoch 5: val_loss improved from 0.04166 to 0.04076, saving model to model/my_checkpoint.ckpt
Epoch 6/20
Epoch 6: val_loss improved from 0.04076 to 0.03965, saving model to model/my_checkpoint.ckpt
Epoch 7/20
Epoch 7: val_loss improved from 0.03965 to 0.03935, saving model to model/my_checkpoint.ckpt
Epoch 8/20
Epoch 8: val_loss improved from 0.03935 to 0.03920, saving model to model/my_checkpoint.ckpt
Epoch 9/20
Epoch 9: val_loss improved from 0.03920 to 0.03910, saving model to model/my_checkpoint.ckpt
Epoch 10

<keras.callbacks.History at 0x7f34bd5886a0>

Load the best saved weights with `load_weights`.

In [18]:
model.load_weights(checkpoint_path)

<tensorflow.python.training.tracking.util.CheckpointLoadStatus at 0x7f344e4ed610>

## Evaluate the model

In [19]:
# HINT: Your neural network must have a validation MAE of approximately 0.055 or
# less on the normalized validation dataset for top marks.

In [20]:
model.evaluate(valid_set)



[0.037752263247966766, 0.03775227442383766]
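
`model.evaluate` returns the loss followed by the compiled metrics, so the list above is `[loss, mae]`. As a small sketch, the values can be compared against the 0.055 target from the problem statement. The numbers are copied from the output above; the variable names are chosen for this example.

```python
# Values copied from the evaluate() output: [loss, mae].
results = [0.037752263247966766, 0.03775227442383766]
loss, mae = results

TARGET_MAE = 0.055  # threshold quoted in the problem statement

meets_target = mae <= TARGET_MAE
print(meets_target)  # True
```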