# TF Data API - Training on File Data
In this section we use TensorFlow's Data API to process files on disk directly.

First, let's write out the parts that are unchanged from the previous sections.

In [0]:
from google.colab import drive
drive.mount('/content/drive')

In [0]:
import pandas as pd
import tensorflow as tf
from tensorflow import keras
from tensorflow import feature_column

path = '/content/drive/My Drive/dnn_tutorial/'
#   set LABEL first
output_cols = ['vel']
#   set traffic column
input_cols_num = ['vel_t05', 'vel_t10', 'vel_t15', 'vel_t20',
               'vel_t25', 'vel_t30', 'vel_t35',  'vel_t40']
input_cols_cat = ['V_ID']
input_cols= input_cols_num + input_cols_cat

traffic_data = pd.read_csv(path + 'traffic_data_2link.csv', index_col = 0)
from sklearn.model_selection import train_test_split

train_data, test_data = train_test_split(traffic_data, test_size = 1024)
V_ID_list = train_data['V_ID'].unique().tolist()

vel_upper_limit = train_data['vel'].quantile(q=0.98)
train_data = train_data[train_data['vel']<=vel_upper_limit]

from sklearn.preprocessing import MinMaxScaler
scaler= MinMaxScaler()
scaler.fit(train_data['vel'].values.reshape(-1,1))

def normalize_numeric(dataframe, input_col_list, output_col_list):
    return_df = dataframe.copy()
    for col in input_col_list:
        return_df.loc[:, col] = pd.DataFrame(
            scaler.transform(return_df[col].values.reshape(-1, 1)),
            columns=[col], index=return_df.index)
    for col in output_col_list:
        col_backup = 'backup_'+col
        return_df[col_backup] = return_df[col].copy()
    for col in output_col_list:
        return_df.loc[:, col] = pd.DataFrame(
            scaler.transform(return_df[col].values.reshape(-1, 1)),
            columns=[col], index=return_df.index)
    return return_df

In [0]:
feature_columns = []
for col in input_cols_num:
    feature_columns.append(feature_column.numeric_column(col))
V_ID_column = feature_column.categorical_column_with_vocabulary_list('V_ID',
                                                                     V_ID_list )
feature_columns.append(feature_column.indicator_column(V_ID_column))

Next, we normalize train_data and test_data and save them to files.

(They are saved as traffic_2links_train_normalized.csv and traffic_2links_test_normalized.csv, respectively.)


In [0]:
train_data = normalize_numeric(train_data, input_cols_num, output_cols)
test_data = normalize_numeric(test_data, input_cols_num, output_cols)
train_path = path+'traffic_2links_train_normalized.csv'
test_path = path+'traffic_2links_test_normalized.csv'
train_data.to_csv(train_path)
test_data.to_csv(test_path)
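
For reference, MinMaxScaler with its default feature range maps a value x to (x - min) / (max - min), where min and max are learned from the data passed to `fit`. The sketch below reproduces that arithmetic in plain Python. The speeds and helper names are made up for illustration and do not come from the dataset.

```python
def min_max_fit(values):
    """Return the (min, max) pair a min-max scaler would learn."""
    return min(values), max(values)

def min_max_transform(x, lo, hi):
    """Scale x into [0, 1] relative to the fitted range."""
    return (x - lo) / (hi - lo)

def min_max_inverse(z, lo, hi):
    """Map a scaled value back to the original units."""
    return z * (hi - lo) + lo

speeds = [20.0, 35.0, 50.0, 80.0]   # illustrative values only
lo, hi = min_max_fit(speeds)

scaled = [min_max_transform(v, lo, hi) for v in speeds]
restored = [min_max_inverse(z, lo, hi) for z in scaled]
print(scaled)     # [0.0, 0.25, 0.5, 1.0]
print(restored)   # [20.0, 35.0, 50.0, 80.0]

# A value outside the fitted range is not clipped; it simply falls outside [0, 1].
outside = min_max_transform(95.0, lo, hi)
print(outside)    # 1.25
```

Because the scaler here is fitted only on the filtered training `vel` column, unfiltered test rows can scale to values above 1.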

The csv_to_dataset function lets us feed a CSV file to the DNN model as an input pipeline.

To do this, we need to write a _parse_line function that converts the CSV file into tensors line by line.


In [0]:
def _parse_line(line, default_types, input_col_list, output, 
                input_col_list_name):
    #    Decode the line into its fields
    fields = tf.io.decode_csv(
        line, record_defaults=default_types, select_cols=input_col_list)
    #    Pack the result into a dictionary
    features = dict(zip(input_col_list_name, fields))
    labels = tf.stack([features[x] for x in output], axis=0)
    for i in output:
        features.pop(i)

    return features, labels

from functools import partial
def csv_to_dataset(csv_path, input_col_num_list,input_col_cat_list, 
                   output_col_list, batch_size):
    reader = pd.read_csv(csv_path, chunksize=1)
    DF = reader.get_chunk()

    default_types = []
    col_list = []
    col_list_name = []

    for i in range(DF.columns.shape[0]):
        if DF.columns[i] in input_col_num_list:
            default_types.append([0.0])
            col_list_name.append(DF.columns[i])
            col_list.append(i)
        elif DF.columns[i] in input_col_cat_list:
            default_types.append(['0'])
            col_list_name.append(DF.columns[i])
            col_list.append(i)
        elif DF.columns[i] in output_col_list:
            default_types.append([0.0])
            col_list_name.append(DF.columns[i])
            col_list.append(i)

    dataset = tf.data.TextLineDataset(csv_path).skip(1)
    dataset = dataset.shuffle(20000)
    partial_parse_line = partial(_parse_line, default_types=default_types, 
                                 input_col_list=col_list, 
                                 output=output_col_list,
                                 input_col_list_name=col_list_name)
    dataset = dataset.map(partial_parse_line, num_parallel_calls=4)
    dataset = dataset.batch(batch_size)
    return dataset
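
To see what the parsing step does to a single line, the following minimal sketch calls `tf.io.decode_csv` on one hand-written record. The line contents, the column names, and the `L01` identifier are hypothetical and only loosely mimic the tutorial's CSV layout.

```python
import tensorflow as tf

# Hypothetical CSV line: row index, V_ID, one lagged speed, and the label.
line = tf.constant("7,L01,0.42,0.40")

# select_cols keeps columns 1..3 and drops the row index.
# record_defaults supplies one default (and therefore one dtype) per kept column.
fields = tf.io.decode_csv(
    line,
    record_defaults=[["0"], [0.0], [0.0]],
    select_cols=[1, 2, 3],
)

# Pair each parsed tensor with its column name, then split off the label.
features = dict(zip(["V_ID", "vel_t05", "vel"], fields))
label = tf.stack([features.pop("vel")], axis=0)

print(features["V_ID"].numpy())   # b'L01'
print(sorted(features.keys()))    # ['V_ID', 'vel_t05']
print(tuple(label.shape))         # (1,)
```

The string default `"0"` makes the categorical column parse as a string tensor, while the `0.0` defaults make the numeric columns parse as float32.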

Now we create the pipeline from the CSV file path and the other parameters.
Then we build the DNN model and train it.

In [8]:
train_dataset = csv_to_dataset(train_path, input_cols_num,input_cols_cat, 
                               output_cols, 1024)

model = keras.Sequential()
model.add(keras.layers.DenseFeatures(feature_columns))
model.add(keras.layers.Dense(30, activation='relu'))
model.add(keras.layers.Dense(30, activation='relu'))
model.add(keras.layers.Dense(30, activation='relu'))
model.add(keras.layers.Dense(30, activation='relu'))
model.add(keras.layers.Dense(1, activation=None))
model.compile(optimizer=tf.keras.optimizers.Adam(0.001), loss='mse', metrics=['mape'])
model.fit(train_dataset, epochs= 10)
model.summary()

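# csv_to_dataset shuffles the rows, so predictions on train_dataset would not line up
# with train_data.index. Predict on an unshuffled dataset built from the DataFrame instead
# (assumes the in-memory column dtypes match what the model saw during training).
predict_dataset = tf.data.Dataset.from_tensor_slices(dict(train_data[input_cols])).batch(1024)
train_predict = model.predict(predict_dataset)
train_predict = pd.DataFrame(scaler.inverse_transform(train_predict), index=train_data.index, columns=['prediction'])
percentage_error = (train_predict['prediction'] - train_data['backup_vel']).abs() / train_data['backup_vel'] * 100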

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_features_1 (DenseFeatu multiple                  0         
_________________________________________________________________
dense_5 (Dense)              multiple                  330       
_________________________________________________________________
dense_6 (Dense)              multiple                  930       
_________________________________________________________________
dense_7 (Dense)              multiple                  930       
_________________________________________________________________
dense_8 (Dense)              multiple                  930       
_________________________________________________________________
dense_9 (Dense)              multiple                  31        
Total par
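
The per-layer parameter counts in the summary can be checked by hand: a Dense layer with n inputs and m units has n*m weights plus m biases. The sketch below assumes the DenseFeatures layer emits 10 values (8 numeric columns plus a 2-entry one-hot V_ID). That width is an inference from the 330 parameters of the first Dense layer, not something the source states.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

n_features = 8 + 2                  # assumption: 8 numeric inputs + 2 one-hot V_ID slots
layer_units = [30, 30, 30, 30, 1]   # the five Dense layers of the model

counts = []
n_in = n_features
for units in layer_units:
    counts.append(dense_params(n_in, units))
    n_in = units                    # this layer's output feeds the next layer

print(counts)   # [330, 930, 930, 930, 31]
```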

We also create a pipeline for test_data and run evaluate on it.

In [9]:
test_dataset = csv_to_dataset(test_path, input_cols_num,input_cols_cat,  output_cols, 1024)

model.evaluate(test_dataset)



[0.001202865387313068, 3.4598822593688965]
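
`model.evaluate` returns the loss followed by the compiled metrics, so the two numbers above are the MSE and the MAPE (in percent), both computed on the normalized labels. The plain-Python sketch below spells out the two formulas with made-up values. It omits the small epsilon that Keras uses to guard against division by zero.

```python
def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [0.50, 0.40, 0.80]   # illustrative normalized labels
y_pred = [0.45, 0.44, 0.80]   # illustrative predictions

mse_value = mse(y_true, y_pred)
mape_value = mape(y_true, y_pred)
print(round(mse_value, 6))    # 0.001367
print(round(mape_value, 2))   # 6.67
```

Since min-max scaling shifts the values by the fitted minimum, a MAPE measured on normalized labels generally differs from the percentage error in the original speed units unless that minimum is zero.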