### **13. Loading and Preprocessing Data with TensorFlow**

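- Loading and preprocessing a large dataset efficiently can be tricky to implement with other deep learning libraries; TensorFlow's Data API is designed to make this easier.

- The tf.data API helps you work with large datasets, can read from several data formats, and supports complex transformations.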

In [40]:
import tensorflow as tf
import pandas as pd
import numpy as np
import os

### **13.1 The Data API**


In [21]:
X = tf.range(10) #Sample Tensor
X

<tf.Tensor: shape=(10,), dtype=int32, numpy=array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=int32)>

In [24]:
dataset = tf.data.Dataset.from_tensor_slices(X)  # builds a dataset whose elements are the slices of X along its first dimension
dataset

<TensorSliceDataset shapes: (), types: tf.int32>
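`from_tensor_slices()` also accepts a tuple of arrays and slices them together along the first axis. The following is a minimal sketch of that behavior; the toy `features` and `labels` arrays are made up for illustration and are not part of the original notebook.

```python
import numpy as np
import tensorflow as tf

# Hypothetical toy data: 4 samples with 2 features each, plus one label per sample.
features = np.arange(8, dtype=np.float32).reshape(4, 2)
labels = np.array([0, 1, 0, 1], dtype=np.int32)

# Slicing happens along the first axis, so the dataset has 4 elements,
# each one a (feature_vector, label) pair.
pair_dataset = tf.data.Dataset.from_tensor_slices((features, labels))

pairs = [(x.numpy().tolist(), int(y.numpy())) for x, y in pair_dataset]
for pair in pairs:
    print(pair)
```

Each element keeps a feature row paired with its label, which is the form typically fed to a model during training.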

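#### **13.1.1 Chaining Transformations**

The Data API applies transformations through method calls on a dataset. Each method returns a new dataset, so the calls can be chained.

- repeat() : repeats the elements of the dataset the given number of times.
- batch() : groups the elements into batches of the given size.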

In [25]:
dataset1 = dataset.repeat(3).batch(7)  # repeat the 10 items 3 times, then group them into batches of 7
for item in dataset1:
    print(item)

tf.Tensor([0 1 2 3 4 5 6], shape=(7,), dtype=int32)
tf.Tensor([7 8 9 0 1 2 3], shape=(7,), dtype=int32)
tf.Tensor([4 5 6 7 8 9 0], shape=(7,), dtype=int32)
tf.Tensor([1 2 3 4 5 6 7], shape=(7,), dtype=int32)
tf.Tensor([8 9], shape=(2,), dtype=int32)
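The order of chained calls matters, and the last batch can be smaller than the others. The sketch below compares `repeat(3).batch(7)` with `batch(7).repeat(3)` and shows the `drop_remainder` argument of `batch()`, which discards the short final batch. The variable names are illustrative only.

```python
import tensorflow as tf

dataset = tf.data.Dataset.from_tensor_slices(tf.range(10))

def batch_sizes(ds):
    """Return the number of items in each batch of a batched dataset."""
    return [int(batch.numpy().shape[0]) for batch in ds]

# Repeat first: 30 items are cut into batches of 7, leaving a final batch of 2.
repeat_then_batch = batch_sizes(dataset.repeat(3).batch(7))

# Batch first: the 10 items give batches of 7 and 3, and that pair is repeated 3 times.
batch_then_repeat = batch_sizes(dataset.batch(7).repeat(3))

# drop_remainder=True discards the short final batch so every batch has 7 items.
full_only = batch_sizes(dataset.repeat(3).batch(7, drop_remainder=True))

print(repeat_then_batch)  # [7, 7, 7, 7, 2]
print(batch_then_repeat)  # [7, 3, 7, 3, 7, 3]
print(full_only)          # [7, 7, 7, 7]
```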


- map() : applies a transformation function to each element of the dataset.

In [26]:
dataset2 = dataset.map(lambda x: x * 2)
for item in dataset2:
    print(item)

tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
tf.Tensor(4, shape=(), dtype=int32)
tf.Tensor(6, shape=(), dtype=int32)
tf.Tensor(8, shape=(), dtype=int32)
tf.Tensor(10, shape=(), dtype=int32)
tf.Tensor(12, shape=(), dtype=int32)
tf.Tensor(14, shape=(), dtype=int32)
tf.Tensor(16, shape=(), dtype=int32)
tf.Tensor(18, shape=(), dtype=int32)
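When the mapped function is expensive, `map()` can run it on several elements in parallel through its `num_parallel_calls` argument. This is a small sketch under the assumption that two parallel calls are appropriate; the function `double_and_shift` is a made-up example.

```python
import tensorflow as tf

dataset = tf.data.Dataset.from_tensor_slices(tf.range(10))

def double_and_shift(x):
    # Illustrative per-element transformation.
    return x * 2 + 1

# Process up to two elements at a time; by default the output order is preserved.
parallel = dataset.map(double_and_shift, num_parallel_calls=2)

values = [int(v.numpy()) for v in parallel]
print(values)
```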


- filter() : keeps only the elements for which the given predicate is true.

In [28]:
dataset3 = dataset.filter(lambda x: x<5)
for item in dataset3:
    print(item)

tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
tf.Tensor(3, shape=(), dtype=int32)
tf.Tensor(4, shape=(), dtype=int32)
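The predicate passed to `filter()` must return a scalar boolean tensor. As a further sketch, the example below keeps only the even values, using `tf.equal` to build the boolean explicitly.

```python
import tensorflow as tf

dataset = tf.data.Dataset.from_tensor_slices(tf.range(10))

# Keep an element only when its remainder modulo 2 is zero.
even_only = dataset.filter(lambda x: tf.equal(x % 2, 0))

evens = [int(v.numpy()) for v in even_only]
print(evens)  # [0, 2, 4, 6, 8]
```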


- take() : returns a dataset containing only the first n elements.

In [29]:
for item in dataset.take(3):
  print(item)

tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
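`take()` pairs naturally with `skip()`, which drops the first n elements instead. A minimal sketch of splitting one dataset into a head and a tail this way (the split point of 3 is arbitrary):

```python
import tensorflow as tf

dataset = tf.data.Dataset.from_tensor_slices(tf.range(10))

head = dataset.take(3)  # first three elements
tail = dataset.skip(3)  # everything after the first three

head_values = [int(v.numpy()) for v in head]
tail_values = [int(v.numpy()) for v in tail]
print(head_values)  # [0, 1, 2]
print(tail_values)  # [3, 4, 5, 6, 7, 8, 9]
```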


#### **13.1.2 Shuffling the Data**

shuffle()
- Randomly shuffles the elements of the dataset, drawing each output element from a buffer of `buffer_size` elements.

In [65]:
tf.random.set_seed(42)

dataset = tf.data.Dataset.range(10).repeat(3) # the integers 0 to 9, repeated 3 times
dataset = dataset.shuffle(buffer_size=10, seed=42).batch(7) # shuffle with a buffer of 10 elements and random seed 42, then split into batches of 7
for item in dataset:
    print(item)

tf.Tensor([5 1 1 0 2 0 9], shape=(7,), dtype=int64)
tf.Tensor([8 4 2 5 7 6 9], shape=(7,), dtype=int64)
tf.Tensor([7 1 3 3 7 6 8], shape=(7,), dtype=int64)
tf.Tensor([9 5 2 0 4 3 8], shape=(7,), dtype=int64)
tf.Tensor([4 6], shape=(2,), dtype=int64)
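The buffer size limits how far an element can move from its original position. The sketch below illustrates this on the integers 0 to 9: a buffer of 1 leaves the order unchanged, and with a buffer of 3 the element produced at position i can only come from the first i + 3 source elements. The helper `shuffled_values` is a hypothetical name introduced for this example.

```python
import tensorflow as tf

def shuffled_values(buffer_size):
    """Shuffle the integers 0..9 with the given buffer size and return them as a list."""
    ds = tf.data.Dataset.range(10).shuffle(buffer_size=buffer_size, seed=42)
    return [int(v.numpy()) for v in ds]

no_shuffle = shuffled_values(1)     # a buffer of one element cannot reorder anything
small_buffer = shuffled_values(3)   # only local reordering is possible
full_buffer = shuffled_values(10)   # buffer covers the whole dataset

print(no_shuffle)
print(small_buffer)
print(full_buffer)
```

For a thorough shuffle, the buffer should be at least as large as the dataset, which is why shuffling very large datasets often also relies on shuffling at the file level.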


In [31]:

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

housing = fetch_california_housing()
X_train_full, X_test, y_train_full, y_test = train_test_split(
    housing.data, housing.target.reshape(-1, 1), random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_train_full, y_train_full, random_state=42)

scaler = StandardScaler()
scaler.fit(X_train)
X_mean = scaler.mean_
X_std = scaler.scale_

Downloading Cal. housing from https://ndownloader.figshare.com/files/5976036 to /root/scikit_learn_data
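The fitted scaler's `mean_` and `scale_` attributes are kept so that the same standardization can be applied by hand later, for example inside a preprocessing function. As a sketch of what those two arrays represent, the example below uses synthetic data (not the housing dataset) and checks that `(X - mean_) / scale_` matches `scaler.transform(X)`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a feature matrix: 100 samples, 3 features.
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 3))

demo_scaler = StandardScaler()
demo_scaler.fit(X_demo)

# Standardize manually with the per-feature mean and standard deviation.
manual = (X_demo - demo_scaler.mean_) / demo_scaler.scale_
matches = bool(np.allclose(manual, demo_scaler.transform(X_demo)))
print(matches)  # True
```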


In [35]:
def save_to_multiple_csv_files(data, name_prefix, header=None, n_parts=10):
    housing_dir = os.path.join("datasets", "housing")
    os.makedirs(housing_dir, exist_ok=True)
    path_format = os.path.join(housing_dir, "my_{}_{:02d}.csv")

    filepaths = []
    m = len(data)
    for file_idx, row_indices in enumerate(np.array_split(np.arange(m), n_parts)):
        part_csv = path_format.format(name_prefix, file_idx)
        filepaths.append(part_csv)
        with open(part_csv, "wt", encoding="utf-8") as f:
            if header is not None:
                f.write(header)
                f.write("\n")
            for row_idx in row_indices:
                f.write(",".join([repr(col) for col in data[row_idx]]))
                f.write("\n")
    return filepaths
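The function above relies on `np.array_split` to divide the row indices into nearly equal parts, and on a format string to build zero-padded file names. A small sketch of both pieces in isolation, using 10 rows and 3 parts as arbitrary example values:

```python
import numpy as np

# Split 10 row indices into 3 parts; the sizes differ by at most one.
parts = np.array_split(np.arange(10), 3)
part_sizes = [len(p) for p in parts]
print(part_sizes)  # [4, 3, 3]

# The same naming pattern as in the function: prefix plus a two-digit file index.
example_name = "my_{}_{:02d}.csv".format("train", 3)
print(example_name)  # my_train_03.csv
```

Unlike `np.split`, `np.array_split` does not require the number of rows to be divisible by the number of parts.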

In [38]:
# Concatenate the features and the target column for each of the train/valid/test sets.
train_data = np.c_[X_train, y_train]
valid_data = np.c_[X_valid, y_valid]
test_data = np.c_[X_test, y_test]
header_cols = housing.feature_names + ["MedianHouseValue"]
header = ",".join(header_cols)

# Split each set into multiple CSV files.
train_filepaths = save_to_multiple_csv_files(train_data, "train", header, n_parts=20)
valid_filepaths = save_to_multiple_csv_files(valid_data, "valid", header, n_parts=10)
test_filepaths = save_to_multiple_csv_files(test_data, "test", header, n_parts=10)

In [41]:
pd.read_csv(train_filepaths[0]).head() # preview the first five rows of the first training file

|   | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedianHouseValue |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 3.5214 | 15.0 | 3.049945 | 1.106548 | 1447.0 | 1.605993 | 37.63 | -122.43 | 1.442 |
| 1 | 5.3275 | 5.0 | 6.49006 | 0.991054 | 3464.0 | 3.44334 | 33.69 | -117.39 | 1.687 |
| 2 | 3.1 | 29.0 | 7.542373 | 1.591525 | 1328.0 | 2.250847 | 38.44 | -122.98 | 1.621 |
| 3 | 7.1736 | 12.0 | 6.289003 | 0.997442 | 1054.0 | 2.695652 | 33.55 | -117.7 | 2.621 |
| 4 | 2.0549 | 13.0 | 5.312457 | 1.085092 | 3297.0 | 2.244384 | 33.93 | -116.93 | 0.956 |


In [43]:
train_filepaths

['datasets/housing/my_train_00.csv',
 'datasets/housing/my_train_01.csv',
 'datasets/housing/my_train_02.csv',
 'datasets/housing/my_train_03.csv',
 'datasets/housing/my_train_04.csv',
 'datasets/housing/my_train_05.csv',
 'datasets/housing/my_train_06.csv',
 'datasets/housing/my_train_07.csv',
 'datasets/housing/my_train_08.csv',
 'datasets/housing/my_train_09.csv',
 'datasets/housing/my_train_10.csv',
 'datasets/housing/my_train_11.csv',
 'datasets/housing/my_train_12.csv',
 'datasets/housing/my_train_13.csv',
 'datasets/housing/my_train_14.csv',
 'datasets/housing/my_train_15.csv',
 'datasets/housing/my_train_16.csv',
 'datasets/housing/my_train_17.csv',
 'datasets/housing/my_train_18.csv',
 'datasets/housing/my_train_19.csv']

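- list_files() : takes a list of file paths or a glob pattern and returns a dataset of the matching file paths. The paths are shuffled by default; passing `seed` makes the order reproducible.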

In [47]:
filepath_dataset = tf.data.Dataset.list_files(train_filepaths, seed=42)
for f in filepath_dataset.take(5):
  print(f.numpy())

b'datasets/housing/my_train_15.csv'
b'datasets/housing/my_train_08.csv'
b'datasets/housing/my_train_03.csv'
b'datasets/housing/my_train_01.csv'
b'datasets/housing/my_train_10.csv'
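`list_files()` also accepts a glob pattern, and shuffling can be turned off with `shuffle=False`. The sketch below creates three small placeholder files in a temporary directory so that it runs on its own; the `part_*.csv` names are made up for illustration.

```python
import os
import tempfile
import tensorflow as tf

# Create a few placeholder CSV files in a temporary directory.
tmp_dir = tempfile.mkdtemp()
for i in range(3):
    with open(os.path.join(tmp_dir, "part_{:02d}.csv".format(i)), "w") as f:
        f.write("a,b\n1,2\n")

# Match the files with a glob pattern and disable the default shuffling.
pattern = os.path.join(tmp_dir, "part_*.csv")
ordered_paths = tf.data.Dataset.list_files(pattern, shuffle=False)

names = [os.path.basename(p.numpy().decode("utf-8")) for p in ordered_paths]
print(names)
```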


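- interleave() : reads from several files in turn. With cycle_length = 5, five files are open at the same time and the dataset takes one line from each in rotation (block_length defaults to 1). The `skip(1)` call drops the header row of each file.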

In [48]:
n_readers = 5
dataset = filepath_dataset.interleave(
    lambda filepath: tf.data.TextLineDataset(filepath).skip(1),
    cycle_length=n_readers)

In [49]:
for line in dataset.take(5):
    print(line.numpy())

b'4.6477,38.0,5.03728813559322,0.911864406779661,745.0,2.5254237288135593,32.64,-117.07,1.504'
b'8.72,44.0,6.163179916317992,1.0460251046025104,668.0,2.794979079497908,34.2,-118.18,4.159'
b'3.8456,35.0,5.461346633416459,0.9576059850374065,1154.0,2.8778054862842892,37.96,-122.05,1.598'
b'3.3456,37.0,4.514084507042254,0.9084507042253521,458.0,3.2253521126760565,36.67,-121.7,2.526'
b'3.6875,44.0,4.524475524475524,0.993006993006993,457.0,3.195804195804196,34.04,-118.15,1.625'
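The effect of `cycle_length` and `block_length` is easier to see on a toy dataset than on real CSV lines. In the sketch below, three hypothetical "files" each yield four numbers, where the tens digit identifies the file and the units digit the line; the helper `fake_lines` is an assumption made for this example.

```python
import tensorflow as tf

# Three hypothetical "files", identified by 0, 1 and 2.
file_ids = tf.data.Dataset.range(3)

def fake_lines(file_id):
    # File k yields the four values 10*k, 10*k + 1, 10*k + 2, 10*k + 3.
    return tf.data.Dataset.range(file_id * 10, file_id * 10 + 4)

# One item per file per turn (the default block_length).
one_per_turn = file_ids.interleave(fake_lines, cycle_length=3, block_length=1)
# Two consecutive items per file per turn.
two_per_turn = file_ids.interleave(fake_lines, cycle_length=3, block_length=2)

order_one = [int(v.numpy()) for v in one_per_turn]
order_two = [int(v.numpy()) for v in two_per_turn]
print(order_one)  # [0, 10, 20, 1, 11, 21, 2, 12, 22, 3, 13, 23]
print(order_two)  # [0, 1, 10, 11, 20, 21, 2, 3, 12, 13, 22, 23]
```

Here `cycle_length` controls how many sources are read concurrently, while `block_length` controls how many consecutive items are taken from one source before moving to the next.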
