## [배경]

태양광 발전은 매일 기상 상황과 계절에 따른 일사량의 영향을 받습니다.

이에 대한 예측이 가능하다면 보다 원활한 전력 수급 계획이 가능합니다.

인공지능 기반 태양광 발전량 예측 모델을 만들어주세요.

## [평가]

- 심사 기준: NMAE-10(Normalized Mean Absolute Error)

- Public 평가: 학습용 제공 데이터를 이용하여 미래 한 달간 발전량 예측 후 평가

- Private 평가: 대회 종료 시점부터 30일간 실제 발전량을 하루씩 평가

- 1일 1회 채점 후 누적 결과 리더보드 반영

- Private 평가 기간 제출물 업데이트 가능 (※제출 후 선택 파일 확인을 반드시 해주시기 바랍니다.)

## [외부 데이터 및 사전학습 모델]

- 예측일 전날 자정까지 확인이 가능한 데이터만 학습 및 추론 과정에서 사용 가능

```
ex) 2021년 6월 11일 예측 -> 2021년 6월 10일 24:00까지 획득 가능한 데이터만 사용
(6월 10일 기상 관측 정보, 6월10일에 예보한 6월 11일 예보 등...)
```

- 예측 이전 시점의 데이터만 사용 가능

- 공공데이터와 같이 누구나 얻을 수 있고 법적 제약이 없는 외부 데이터 허용

- 사전학습 모델의 경우 사전학습에 사용된 데이터를 명시해야함

- 대회 진행 중 data leakage 및 규칙 위반 사항이 의심되는 경우 코드 제출 요청을 할 수 있으며 요청 2일 이내 코드 미제출 혹은 외부 데이터 사용이 확인되었을 경우 리더보드 기록 삭제

- 최종 평가시 외부데이터 및 출처 제출

## [유의 사항]

- 1일 최대 제출 횟수: 3회
- 사용 가능 언어: Python, R
- 모델 학습에서 검증 혹은 평가 데이터셋 활용 시(Data Leakage 등) 실격
- 최종 순위는 선택된 파일 중에서 채점되므로 참가자는 제출 창에서 자신이 최종적으로 채점 받고 싶은 파일을 선택해야 함
- 대회 직후 공개되는 Private 랭킹은 최종 순위가 아니며 코드 검증 후 수상자가 결정됨
- 데이콘은 부정 제출 행위를 금지하고 있으며 데이콘 대회 부정 제출 이력이 있는 경우 평가가 제한됩니다. 자세한 사항은 아래의 링크를 참고해 주시기 바랍니다. https://dacon.io/notice/notice/13

In [1]:
from google.colab import auth
auth.authenticate_user()

from google.colab import drive
drive.mount('/content/drive', force_remount=False)

Mounted at /content/drive


In [2]:
import os
import pandas as pd
import datetime as dt
import seaborn as sns
import matplotlib.pyplot as plt
from pathlib import Path
from functools import reduce

In [3]:
folder = "dacon"
project_dir = "korea_east_west_power"

base_path = Path("/content/drive/My Drive/")
project_path = base_path / folder / project_dir
os.chdir(project_path)

for x in list(project_path.glob("*")):
    if x.is_dir():
        dir_name = str(x.relative_to(project_path))
        os.rename(dir_name, dir_name.split(" ", 1)[0])

print(f"현재 디렉토리 위치: {os.getcwd()}")

현재 디렉토리 위치: /content/drive/My Drive/dacon/korea_east_west_power


In [75]:
dangjin_fcst_df = pd.read_csv('./data/dangjin_fcst_data.csv')
energy_df = pd.read_csv('./data/energy.csv')
ulsan_fcst_df = pd.read_csv('./data/ulsan_fcst_data.csv')
sample_submission_df = pd.read_csv('./data/sample_submission.csv')

## energy.csv - 발전소별 발전량
- time : 1시간 단위 계량된 시간 (ex-2018-03-01 1:00:00 => 2018-03-01 00:00:00 ~ 2018-03-01 1:00:00 1시간동안 발전량 계량)
- dangjin_floating : 당진수상태양광 발전량(KW)
- dangjin_warehouse : 당진자재창고태양광 발전량(KW)
- dangjin : 당진태양광 발전량(KW)
- ulsan : 울산태양광 발전량(KW)

In [76]:
energy_df

Unnamed: 0,time,dangjin_floating,dangjin_warehouse,dangjin,ulsan
0,2018-03-01 1:00:00,0.0,0.0,0,0
1,2018-03-01 2:00:00,0.0,0.0,0,0
2,2018-03-01 3:00:00,0.0,0.0,0,0
3,2018-03-01 4:00:00,0.0,0.0,0,0
4,2018-03-01 5:00:00,0.0,0.0,0,0
...,...,...,...,...,...
25627,2021-01-31 20:00:00,0.0,0.0,0,0
25628,2021-01-31 21:00:00,0.0,0.0,0,0
25629,2021-01-31 22:00:00,0.0,0.0,0,0
25630,2021-01-31 23:00:00,0.0,0.0,0,0


In [77]:
def to_datetime(date_str):
    if date_str[11:13] != '24':
        return pd.to_datetime(date_str, format='%Y-%m-%d %H:%M:%S')

    date_str = date_str[0:11] + '00' + date_str[13:]
    
    return pd.to_datetime(date_str, format='%Y%m%d %H:%M:%S') + \
           dt.timedelta(days=1)

def set_date_info(df, column_name):
    df['year'] = df[column_name].dt.year
    df['month'] = df[column_name].dt.month
    df['day'] = df[column_name].dt.day
    df['dayofweek'] = df[column_name].dt.dayofweek
    df['hour'] = df[column_name].dt.hour

    return df

In [78]:
energy_df['time'] = energy_df.time.apply(to_datetime)
energy_df = set_date_info(energy_df, 'time')

In [79]:
energy_df = energy_df[['time', 'year', 'month', 'day', 'dayofweek', 'hour', 'dangjin_floating', 'dangjin_warehouse', 'dangjin', 'ulsan']]
energy_df

Unnamed: 0,time,year,month,day,dayofweek,hour,dangjin_floating,dangjin_warehouse,dangjin,ulsan
0,2018-03-01 01:00:00,2018,3,1,3,1,0.0,0.0,0,0
1,2018-03-01 02:00:00,2018,3,1,3,2,0.0,0.0,0,0
2,2018-03-01 03:00:00,2018,3,1,3,3,0.0,0.0,0,0
3,2018-03-01 04:00:00,2018,3,1,3,4,0.0,0.0,0,0
4,2018-03-01 05:00:00,2018,3,1,3,5,0.0,0.0,0,0
...,...,...,...,...,...,...,...,...,...,...
25627,2021-01-31 20:00:00,2021,1,31,6,20,0.0,0.0,0,0
25628,2021-01-31 21:00:00,2021,1,31,6,21,0.0,0.0,0,0
25629,2021-01-31 22:00:00,2021,1,31,6,22,0.0,0.0,0,0
25630,2021-01-31 23:00:00,2021,1,31,6,23,0.0,0.0,0,0


## dangjin_fcst_data.csv - 당진지역 발전소 동네 예보
- Forecast time : 예보 발표 시점
- forecast : 예보 시간 (ex - Forecast time:2018-03-01 11:00:00, forecast:4.0 => 2018-03-01 11:00:00에 발표한 2018-03-01 15:00:00 예보

### 예보 발표 시점 'forecast'시간 후 기상 예보
- Temperature : 온도(℃)
- Humidity : 습도(%)
- WindSpeed : 풍속(m/s)
- WindDirection : 풍향(º)
- Cloud : 하늘상태(1-맑음, 2-구름보통, 3-구름많음, 4-흐림)

In [80]:
dangjin_fcst_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 162208 entries, 0 to 162207
Data columns (total 7 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   Forecast time  162208 non-null  object 
 1   forecast       162208 non-null  float64
 2   Temperature    162208 non-null  float64
 3   Humidity       162208 non-null  float64
 4   WindSpeed      162208 non-null  float64
 5   WindDirection  162208 non-null  float64
 6   Cloud          162208 non-null  float64
dtypes: float64(6), object(1)
memory usage: 8.7+ MB


In [81]:
dangjin_fcst_df

Unnamed: 0,Forecast time,forecast,Temperature,Humidity,WindSpeed,WindDirection,Cloud
0,2018-03-01 11:00:00,4.0,0.0,60.0,7.3,309.0,2.0
1,2018-03-01 11:00:00,7.0,-2.0,60.0,7.1,314.0,1.0
2,2018-03-01 11:00:00,10.0,-2.0,60.0,6.7,323.0,1.0
3,2018-03-01 11:00:00,13.0,-2.0,55.0,6.7,336.0,1.0
4,2018-03-01 11:00:00,16.0,-4.0,55.0,5.5,339.0,1.0
...,...,...,...,...,...,...,...
162203,2021-03-01 08:00:00,52.0,7.0,40.0,3.2,187.0,1.0
162204,2021-03-01 08:00:00,55.0,8.0,40.0,4.5,217.0,1.0
162205,2021-03-01 08:00:00,58.0,5.0,55.0,2.2,210.0,1.0
162206,2021-03-01 08:00:00,61.0,1.0,80.0,1.9,164.0,1.0


In [82]:
dangjin_feb_temperature_df = pd.read_csv('./data/dangjin/석문면_3시간기온_201802_201802.csv')
dangjin_feb_humidity_df = pd.read_csv('./data/dangjin/석문면_습도_201802_201802.csv')
dangjin_feb_wind_direction_df = pd.read_csv('./data/dangjin/석문면_풍향_201802_201802.csv')
dangjin_feb_wind_speed_df = pd.read_csv('./data/dangjin/석문면_풍속_201802_201802.csv')
dangjin_feb_cloud_df = pd.read_csv('./data/dangjin/석문면_하늘상태_201802_201802.csv')

In [83]:
# 2018-02-28의 14시 예보만 사용
def filter_feb_14hour_data(df, key):
    df = df[(df[' format: day'] == 28) & (df['hour'] == 1400)]
    df_ = df.copy()
    df_[key] = df_.copy()[df_.columns[-1]].astype(float)
    df_['Forecast time'] = pd.to_datetime('2018-02-28 14:00:00')
    df_['forecast'] = df_['forecast'].astype(float)

    return df_[['Forecast time', 'forecast', key]]

dangjin_feb_temperature_df = filter_feb_14hour_data(dangjin_feb_temperature_df, 'Temperature')
dangjin_feb_humidity_df = filter_feb_14hour_data(dangjin_feb_humidity_df, 'Humidity')
dangjin_feb_wind_direction_df = filter_feb_14hour_data(dangjin_feb_wind_direction_df, 'WindDirection')
dangjin_feb_wind_speed_df = filter_feb_14hour_data(dangjin_feb_wind_speed_df, 'WindSpeed')
dangjin_feb_cloud_df = filter_feb_14hour_data(dangjin_feb_cloud_df, 'Cloud')

In [84]:
dfs = [dangjin_feb_temperature_df, dangjin_feb_humidity_df, dangjin_feb_wind_direction_df, dangjin_feb_wind_speed_df, dangjin_feb_cloud_df]
dangjin_feb_fcst_df = reduce(lambda df_left, df_right: pd.merge(df_left, df_right, how = 'left'), dfs)

In [85]:
# 2018-02-28 14시 예보 추가
dangjin_fcst_df = pd.concat([dangjin_feb_fcst_df, dangjin_fcst_df])

In [86]:
"""
- DACON.Dobby님의 소스를 참고하였습니다.
- https://dacon.io/competitions/official/235720/codeshare/2499?page=1&dtype=recent
"""
def make_fcst_df(df):
    df['Forecast time'] = pd.to_datetime(df['Forecast time'])

    # 14시 예보만 사용
    fcst_14_df = df[df['Forecast time'].dt.hour == 14]

    # 다음 날의 기상 예보가 필요하여 예보시간 기준 10시간 후(00:00)부터 33시간 후(23:00) 데이터만 사용
    fcst_14_df = fcst_14_df[(fcst_14_df['forecast'] >= 10) & (fcst_14_df['forecast'] <= 33)]

    # 예보 시점에 focast를 더하여 예보 시각을 구한다.
    fcst_14_df['Forecast_time'] = fcst_14_df['Forecast time'] + fcst_14_df['forecast'].map(lambda x: pd.DateOffset(hours=x))

    # 풍향을 radian으로
    # fcst_14_df['WindDirection_Rad'] = fcst_14_df['WindDirection'] * np.pi / 180

    fcst_14_df =  fcst_14_df[['Forecast_time', 'Temperature', 'Humidity', 'WindSpeed', 'WindDirection', 'Cloud']]

    # 태양광 발전량 예측은 1시간 간격으로 해야하나 예보는 3시간 간격으로 나오기 떄문에 선형보간법을 활용하여 비어있는 값을 채운다
    # 한시간 간격의 데이터프레임을 생성
    fcst_14_df_ = pd.DataFrame()
    fcst_14_df_['Forecast_time'] = pd.date_range(start='2018-03-01 00:00:00', end='2021-03-01 23:00:00', freq='H')

    # 기존 예보 df와 병합
    fcst_14_df_ = pd.merge(fcst_14_df_, fcst_14_df, on='Forecast_time', how='outer')

    # 선형보간
    inter_fcst_14_df = fcst_14_df_.interpolate()

    return inter_fcst_14_df
    

In [87]:
# 선형보간
inter_dangjin_fcst_14_df = make_fcst_df(dangjin_fcst_df)



## ulsan_fcst_data.csv - 울산지역 발전소 동네 예보
- Forecast time : 예보 발표 시점
- forecast : 예보 시간 (ex - Forecast time:2018-03-01 11:00:00, forecast:4.0 => 2018-03-01 11:00:00에 발표한 2018-03-01 15:00:00 예보

### 예보 발표 시점 'forecast'시간 후 기상 예보
- Temperature : 온도(℃)
- Humidity : 습도(%)
- WindSpeed : 풍속(m/s)
- WindDirection : 풍향(º)
- Cloud : 하늘상태(1-맑음, 2-구름보통, 3-구름많음, 4-흐림)

In [88]:
ulsan_fcst_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 162208 entries, 0 to 162207
Data columns (total 7 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   Forecast time  162208 non-null  object 
 1   forecast       162208 non-null  float64
 2   Temperature    162208 non-null  float64
 3   Humidity       162208 non-null  float64
 4   WindSpeed      162208 non-null  float64
 5   WindDirection  162208 non-null  float64
 6   Cloud          162208 non-null  float64
dtypes: float64(6), object(1)
memory usage: 8.7+ MB


In [89]:
ulsan_feb_temperature_df = pd.read_csv('./data/ulsan/선암동_3시간기온_201802_201802.csv')
ulsan_feb_humidity_df = pd.read_csv('./data/ulsan/선암동_습도_201802_201802.csv')
ulsan_feb_wind_direction_df = pd.read_csv('./data/ulsan/선암동_풍향_201802_201802.csv')
ulsan_feb_wind_speed_df = pd.read_csv('./data/ulsan/선암동_풍속_201802_201802.csv')
ulsan_feb_cloud_df = pd.read_csv('./data/ulsan/선암동_하늘상태_201802_201802.csv')

In [90]:
ulsan_feb_temperature_df = filter_feb_14hour_data(ulsan_feb_temperature_df, 'Temperature')
ulsan_feb_humidity_df = filter_feb_14hour_data(ulsan_feb_humidity_df, 'Humidity')
ulsan_feb_wind_direction_df = filter_feb_14hour_data(ulsan_feb_wind_direction_df, 'WindDirection')
ulsan_feb_wind_speed_df = filter_feb_14hour_data(ulsan_feb_wind_speed_df, 'WindSpeed')
ulsan_feb_cloud_df = filter_feb_14hour_data(ulsan_feb_cloud_df, 'Cloud')

In [91]:
dfs = [ulsan_feb_temperature_df, ulsan_feb_humidity_df, ulsan_feb_wind_direction_df, ulsan_feb_wind_speed_df, ulsan_feb_cloud_df]
ulsan_feb_fcst_df = reduce(lambda df_left, df_right: pd.merge(df_left, df_right, how = 'left'), dfs)

In [92]:
# 2018-02-28 14시 예보 추가
ulsan_fcst_df = pd.concat([ulsan_feb_fcst_df, ulsan_fcst_df])

In [93]:
# # 선형보간
inter_ulsan_fcst_14_df = make_fcst_df(ulsan_fcst_df)



## 학습데이터 전처리
### 1. 기상정보와 발전량 merge
### 2. 결측치 확인 및 fillna
### 3. 발전량이 0인 데이터 제거
### 4. 시간 데이터 전처리

In [94]:
energy_df

Unnamed: 0,time,year,month,day,dayofweek,hour,dangjin_floating,dangjin_warehouse,dangjin,ulsan
0,2018-03-01 01:00:00,2018,3,1,3,1,0.0,0.0,0,0
1,2018-03-01 02:00:00,2018,3,1,3,2,0.0,0.0,0,0
2,2018-03-01 03:00:00,2018,3,1,3,3,0.0,0.0,0,0
3,2018-03-01 04:00:00,2018,3,1,3,4,0.0,0.0,0,0
4,2018-03-01 05:00:00,2018,3,1,3,5,0.0,0.0,0,0
...,...,...,...,...,...,...,...,...,...,...
25627,2021-01-31 20:00:00,2021,1,31,6,20,0.0,0.0,0,0
25628,2021-01-31 21:00:00,2021,1,31,6,21,0.0,0.0,0,0
25629,2021-01-31 22:00:00,2021,1,31,6,22,0.0,0.0,0,0
25630,2021-01-31 23:00:00,2021,1,31,6,23,0.0,0.0,0,0


In [97]:
dangjin_floating_df = energy_df.loc[:, 'time':'dangjin_floating']
# dangjin_warehouse_df = energy_df.iloc[:, [0, 1, 2, 4]]
# dangjin_df = energy_df.iloc[:, [0, 1, 2, 5]]
# ulsan_df = energy_df.iloc[:, [0, 1, 2, 6]]

In [98]:
dangjin_floating_df

Unnamed: 0,time,year,month,day,dayofweek,hour,dangjin_floating
0,2018-03-01 01:00:00,2018,3,1,3,1,0.0
1,2018-03-01 02:00:00,2018,3,1,3,2,0.0
2,2018-03-01 03:00:00,2018,3,1,3,3,0.0
3,2018-03-01 04:00:00,2018,3,1,3,4,0.0
4,2018-03-01 05:00:00,2018,3,1,3,5,0.0
...,...,...,...,...,...,...,...
25627,2021-01-31 20:00:00,2021,1,31,6,20,0.0
25628,2021-01-31 21:00:00,2021,1,31,6,21,0.0
25629,2021-01-31 22:00:00,2021,1,31,6,22,0.0
25630,2021-01-31 23:00:00,2021,1,31,6,23,0.0


In [99]:
columns = ['time', 'year', 'month', 'day', 'dayofweek', 'hour', 'Temperature', 'Humidity', 'WindSpeed', 'WindDirection', 'Cloud']

# 기상정보와 발전량 merge
dangjin_floating_df = pd.concat([dangjin_floating_df, inter_dangjin_fcst_14_df],axis=1, join='inner')[columns + ['dangjin_floating']]
# dangjin_warehouse_df = pd.concat([dangjin_warehouse_df, inter_dangjin_fcst_14_df],axis=1, join='inner')[columns + ['dangjin_warehouse']]
# dangjin_df = pd.concat([dangjin_df, inter_dangjin_fcst_14_df],axis=1, join='inner')[columns + ['dangjin']]
# ulsan_df = pd.concat([ulsan_df, inter_dangjin_fcst_14_df],axis=1, join='inner')[columns + ['ulsan']]

In [101]:
dangjin_floating_df

Unnamed: 0,time,year,month,day,dayofweek,hour,Temperature,Humidity,WindSpeed,WindDirection,Cloud,dangjin_floating
0,2018-03-01 01:00:00,2018,3,1,3,1,1.000000,85.000000,7.500000,309.000000,3.000000,0.0
1,2018-03-01 02:00:00,2018,3,1,3,2,1.000000,80.000000,7.133333,310.333333,2.666667,0.0
2,2018-03-01 03:00:00,2018,3,1,3,3,1.000000,75.000000,6.766667,311.666667,2.333333,0.0
3,2018-03-01 04:00:00,2018,3,1,3,4,1.000000,70.000000,6.400000,313.000000,2.000000,0.0
4,2018-03-01 05:00:00,2018,3,1,3,5,0.666667,66.666667,6.666667,311.666667,2.000000,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...
25627,2021-01-31 20:00:00,2021,1,31,6,20,5.666667,76.666667,3.266667,170.666667,4.000000,0.0
25628,2021-01-31 21:00:00,2021,1,31,6,21,5.333333,78.333333,3.433333,167.333333,4.000000,0.0
25629,2021-01-31 22:00:00,2021,1,31,6,22,5.000000,80.000000,3.600000,164.000000,4.000000,0.0
25630,2021-01-31 23:00:00,2021,1,31,6,23,5.333333,80.000000,4.266667,168.333333,4.000000,0.0


In [100]:
# 결측치 확인
display(dangjin_floating_df['dangjin_floating'].isnull().sum())
# display(dangjin_warehouse_df['dangjin_warehouse'].isnull().sum())
# display(dangjin_df['dangjin'].isnull().sum()) # 결측치 없음
# display(ulsan_df['ulsan'].isnull().sum()) # 결측치 없음

24

In [105]:
def set_missing_value_group_by_time(df, key):
    df[key] = df.groupby(df.time.dt.time, sort=False)[key].apply(lambda x: x.fillna(x.mean()))
    return df

def filter_zero(df, key):
    df = df[df[key] != 0]
    return df

In [106]:
# 결측치는 시간별 평균 값으로 채움
dangjin_floating_df = set_missing_value_group_by_time(dangjin_floating_df, 'dangjin_floating')
# dangjin_warehouse_df = set_missing_value_group_by_time(dangjin_warehouse_df, 'dangjin_warehouse')

In [None]:
# 발전량이 0인 데이터 제거
dangjin_floating_df = filter_zero(dangjin_floating_df, 'dangjin_floating')
# dangjin_warehouse_df = filter_zero(dangjin_warehouse_df, 'dangjin_warehouse')
# dangjin_df = filter_zero(dangjin_df, 'dangjin')
# ulsan_df = filter_zero(ulsan_df, 'ulsan')

In [107]:
dangjin_floating_df

Unnamed: 0,time,year,month,day,dayofweek,hour,Temperature,Humidity,WindSpeed,WindDirection,Cloud,dangjin_floating
0,2018-03-01 01:00:00,2018,3,1,3,1,1.000000,85.000000,7.500000,309.000000,3.000000,0.0
1,2018-03-01 02:00:00,2018,3,1,3,2,1.000000,80.000000,7.133333,310.333333,2.666667,0.0
2,2018-03-01 03:00:00,2018,3,1,3,3,1.000000,75.000000,6.766667,311.666667,2.333333,0.0
3,2018-03-01 04:00:00,2018,3,1,3,4,1.000000,70.000000,6.400000,313.000000,2.000000,0.0
4,2018-03-01 05:00:00,2018,3,1,3,5,0.666667,66.666667,6.666667,311.666667,2.000000,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...
25627,2021-01-31 20:00:00,2021,1,31,6,20,5.666667,76.666667,3.266667,170.666667,4.000000,0.0
25628,2021-01-31 21:00:00,2021,1,31,6,21,5.333333,78.333333,3.433333,167.333333,4.000000,0.0
25629,2021-01-31 22:00:00,2021,1,31,6,22,5.000000,80.000000,3.600000,164.000000,4.000000,0.0
25630,2021-01-31 23:00:00,2021,1,31,6,23,5.333333,80.000000,4.266667,168.333333,4.000000,0.0


In [None]:
# # 시간데이터 전처리
# import time
# import numpy as np

# def set_normalize_time(df, key):
#     day = 24 * 60 * 60
#     year = (365.2425) * day

#     timestamp_s = df['start_time'].map(lambda x: time.mktime(pd.Timestamp(x).timetuple()))

#     df['Day_sin'] = np.sin(timestamp_s * (2 * np.pi / day))
#     df['Day_cos'] = np.cos(timestamp_s * (2 * np.pi / day))
#     df['Year_sin'] = np.sin(timestamp_s * (2 * np.pi / year))
#     df['Year_cos'] = np.cos(timestamp_s * (2 * np.pi / year))

#     return df[['Day_sin', 'Day_cos', 'Year_sin', 'Year_cos', 'Temperature', 'Humidity', 'WindSpeed', 'WindDirection', 'Cloud', key]]

In [None]:
dangjin_floating_df = set_normalize_time(dangjin_floating_df, 'dangjin_floating')
# dangjin_warehouse_df = set_normalize_time(dangjin_warehouse_df, 'dangjin_warehouse')
# dangjin_df = set_normalize_time(dangjin_df, 'dangjin')
# ulsan_df = set_normalize_time(ulsan_df, 'ulsan')

In [108]:
# 데이터 분할 및 정규화
def split_df(df):
    n = len(df)
    train_df = df[0:int(n*0.7)]
    val_df = df[int(n*0.7):int(n*0.9)]
    test_df = df[int(n*0.9):]

    display()
    

    # train_x = train_df.loc[:, 'Day_sin': 'Cloud']
    train_x = train_df.loc[:, 'year': 'Cloud']
    train_y = train_df.iloc[:,-1:]
    
    # val_x = val_df.loc[:, 'Day_sin': 'Cloud']
    val_x = val_df.loc[:, 'year': 'Cloud']
    val_y = val_df.iloc[:, -1:]

    # test_x = test_df.loc[:, 'Day_sin': 'Cloud']
    test_x = test_df.loc[:, 'year': 'Cloud']
    test_y = test_df.iloc[:, -1:]

    train_mean = train_x.mean()
    train_std = train_x.std()

    train_x = (train_x - train_mean) / train_std
    val_x = (val_x - train_mean) / train_std
    test_x = (test_x - train_mean) / train_std

    return train_x, train_y, val_x, val_y, test_x, test_y

In [109]:
train_x, train_y, val_x, val_y, test_x, test_y = split_df(dangjin_floating_df)

In [110]:
train_x.shape

(17942, 10)

In [111]:
train_y.shape

(17942, 1)

In [112]:
# 데이터셋 생성
import tensorflow as tf

WINDOW_SIZE = 24 * 30
BATCH_SIZE = 32
LEARNING_RATE = 0.0005

def windowed_dataset(x, y, window_size, batch_size, shuffle, target):
    # X 데이터
    ds_x = tf.data.Dataset.from_tensor_slices(x)
    ds_x = ds_x.window(window_size, shift = 1, stride = 1, drop_remainder = True)
    ds_x = ds_x.flat_map(lambda x: x.batch(window_size))

    # Y 데이터
    ds_y = tf.data.Dataset.from_tensor_slices(y[target][window_size:])

    # zip
    ds = tf.data.Dataset.zip((ds_x, ds_y))

    if shuffle:
        ds = ds.shuffle(1000)
    
    return ds.batch(batch_size).prefetch(1)

In [113]:
train_data = windowed_dataset(train_x, train_y, WINDOW_SIZE, BATCH_SIZE, True, 'dangjin_floating')
val_data = windowed_dataset(val_x, val_y, WINDOW_SIZE, BATCH_SIZE, False, 'dangjin_floating')

In [114]:
for x, y in train_data.take(1):
    print(x.shape)

(32, 720, 10)


In [115]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, Conv1D, Lambda, GRU, Bidirectional, Dropout
from tensorflow.keras.losses import Huber
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

model = Sequential([
    Bidirectional(LSTM(32, return_sequences=True, dropout=0.5, input_shape = (None, 10, ))),
    Bidirectional(LSTM(16, return_sequences=True, dropout=0.5, input_shape = (None, 10, ))),
    Dense(16, activation='relu'),
    Dropout(0.5),
    Dense(8, activation='relu'),
    Dropout(0.5),
    Dense(1)
])

In [116]:
loss = Huber()
optimizer = Adam(LEARNING_RATE)
model.compile(loss=Huber(), optimizer=optimizer, metrics=['mae'])

In [117]:
# earlystopping은 10번 epoch통안 val_loss 개선이 없다면 학습을 멈춥니다.
early_stopping = EarlyStopping(monitor='val_loss', patience=100)

# val_loss 기준 체크포인터도 생성합니다.
filename = os.path.join('tmp', 'ckeckpointer.ckpt')
checkpoint = ModelCheckpoint(
    filename, 
    save_weights_only=True, 
    save_best_only=True, 
    monitor='val_loss', 
    verbose=1
)

In [None]:
EPOCHS = 50

history = model.fit(
    train_data, 
    validation_data = (val_data), 
    epochs = EPOCHS,
    batch_size = BATCH_SIZE,
    callbacks=[early_stopping, checkpoint]
)

Epoch 1/50

Epoch 00001: val_loss improved from inf to 121.10912, saving model to tmp/ckeckpointer.ckpt
Epoch 2/50

Epoch 00002: val_loss improved from 121.10912 to 115.14478, saving model to tmp/ckeckpointer.ckpt
Epoch 3/50
  5/539 [..............................] - ETA: 42s - loss: 133.2997 - mae: 133.7520