
## 0. How data is usually read in Python

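1. List the files in a directory with `os.listdir(path)`.
2. Read each file we need into memory in full.
3. With pandas, this becomes very slow once a file gets large.
4. At industry level, the volume of data arriving each day is very large, and it does not come from a single device or person.
5. Separating out the training data is also a problem, the data cannot be re-read every day, and splitting the files brings problems of its own.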
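To make steps 1 and 2 concrete, here is a minimal sketch that lists a directory and then streams rows one at a time instead of loading whole files. The helper name `iter_rows`, the file names, and the two tiny demo files are assumptions made purely for illustration.

```python
import csv
import os
import tempfile


def iter_rows(directory, suffix=".csv"):
    """Yield one row (as a dict) at a time from every matching file."""
    for name in sorted(os.listdir(directory)):
        if not name.endswith(suffix):
            continue
        path = os.path.join(directory, name)
        with open(path, newline="", encoding="utf-8") as f:
            # DictReader reads lazily, so only one row is held in memory.
            yield from csv.DictReader(f)


# Hypothetical demo data: two small "daily" files in a temporary directory.
demo_dir = tempfile.mkdtemp()
demo_files = {"day1.csv": [("s1", 3)], "day2.csv": [("s2", 1), ("s3", 7)]}
for file_name, rows in demo_files.items():
    with open(os.path.join(demo_dir, file_name), "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["sessionID", "TARGET"])
        writer.writerows(rows)

# Aggregate over all files without ever materialising them in full.
total = sum(float(row["TARGET"]) for row in iter_rows(demo_dir))
print(total)
```

Streaming like this keeps memory use flat, but every run still re-parses the text, which is part of the motivation for a binary format.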

## 1. A simple example

In [1]:
import tensorflow as tf
import pandas as pd

In [2]:
data={"col1":[1,2,3],"col2":["a","b","c"],"col3":[True,False,True]}

In [3]:
# How we have inspected data so far
pd.read_csv('/content/train.csv')

Unnamed: 0,sessionID,userID,TARGET,browser,OS,device,new,quality,duration,bounced,transaction,transaction_revenue,continent,subcontinent,country,traffic_source,traffic_medium,keyword,referral_path
0,SESSION_000000,USER_000000,17.0,Chrome,Macintosh,desktop,0,45.0,839.0,0,0.0,0.0,Americas,Northern America,United States,google,organic,Category8,
1,SESSION_000001,USER_000001,3.0,Chrome,Windows,desktop,1,1.0,39.0,0,0.0,0.0,Europe,Western Europe,Germany,google,organic,Category8,
2,SESSION_000002,USER_000002,1.0,Samsung Internet,Android,mobile,1,1.0,0.0,1,0.0,0.0,Asia,Southeast Asia,Malaysia,(direct),(none),,
3,SESSION_000003,USER_000003,1.0,Chrome,Macintosh,desktop,1,1.0,0.0,1,0.0,0.0,Americas,Northern America,United States,Partners,affiliate,,
4,SESSION_000004,USER_000004,1.0,Chrome,iOS,mobile,0,1.0,0.0,1,0.0,0.0,Americas,Northern America,United States,groups.google.com,referral,,Category6_Path_0000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
37856,SESSION_037856,USER_032130,2.0,Safari,iOS,mobile,1,1.0,58.0,0,0.0,0.0,Americas,Northern America,United States,google,organic,Category8,
37857,SESSION_037857,USER_032090,1.0,Chrome,Windows,desktop,0,1.0,0.0,1,0.0,0.0,Europe,Western Europe,Germany,youtube.com,referral,,Category2_Path_0018
37858,SESSION_037858,USER_032131,1.0,Chrome,iOS,mobile,1,1.0,0.0,1,0.0,0.0,Americas,South America,Brazil,google,organic,Category8,
37859,SESSION_037859,USER_032132,1.0,Edge,Windows,desktop,1,1.0,0.0,1,0.0,0.0,Americas,Northern America,United States,Partners,affiliate,,


## 2. TFRecord data types

|Type|Python/NumPy types it can hold|
|--|--|
|BytesList|string,byte|
|FloatList|float,double|
|Int64List|bool,enum,...|

Each value is wrapped in one of these types and then serialized to binary.


In [4]:
# Helper functions that wrap a value in each of the types above
def _bytes_feature(value):
  if isinstance(value, type(tf.constant(0))):
    value = value.numpy() # BytesList won't unpack a string from an EagerTensor.
  return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _float_feature(value):
  return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))

def _int64_feature(value):
  return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

In [5]:
# test
_bytes_feature(b"a")

bytes_list {
  value: "a"
}

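One thing to note here:

- A `b` prefix before a string literal marks a bytes object.
- A `u` prefix marks a unicode string (in Python 3 this is the same as a plain `str`).

A unicode string therefore has to be encoded to bytes first, as below.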

In [6]:
_bytes_feature(u"test_encoding".encode("utf-8"))

bytes_list {
  value: "test_encoding"
}
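As a further illustration of the three types, the sketch below turns the first row of the small `data` dict defined earlier into a `tf.train.Example` and reads it back. The helper `row_to_example` is a hypothetical name, and storing the boolean column as `0`/`1` in an `Int64List` is one possible choice rather than the only one.

```python
import tensorflow as tf

data = {"col1": [1, 2, 3], "col2": ["a", "b", "c"], "col3": [True, False, True]}


def row_to_example(i):
    """Build one tf.train.Example from row i of `data` (illustrative helper)."""
    feature = {
        # Integers go into an Int64List.
        "col1": tf.train.Feature(
            int64_list=tf.train.Int64List(value=[data["col1"][i]])),
        # Strings must be encoded to bytes before they fit a BytesList.
        "col2": tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[data["col2"][i].encode("utf-8")])),
        # Booleans are stored as 0/1 in an Int64List.
        "col3": tf.train.Feature(
            int64_list=tf.train.Int64List(value=[int(data["col3"][i])])),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))


serialized = [row_to_example(i).SerializeToString() for i in range(3)]

# Round trip: parse the first serialized row back into Python values.
restored = tf.train.Example.FromString(serialized[0])
col1 = restored.features.feature["col1"].int64_list.value[0]
col2 = restored.features.feature["col2"].bytes_list.value[0].decode("utf-8")
col3 = bool(restored.features.feature["col3"].int64_list.value[0])
print(col1, col2, col3)
```

Note that the string comes back as bytes and the boolean as an integer, so the reader has to know the intended types to restore them.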

In plain Python, a CSV file can be read as follows.

In [7]:
# TFrecord
def read_csv(file_path):
  with open(file_path,"r") as r:
    lines=r.readlines()
  return lines

Read this way, the first line of the file is the header row:

In [8]:
# The first line holds the feature names
read_csv("/content/train.csv")[0]

'sessionID,userID,TARGET,browser,OS,device,new,quality,duration,bounced,transaction,transaction_revenue,continent,subcontinent,country,traffic_source,traffic_medium,keyword,referral_path\n'

In [9]:
# Convert each CSV line into a serialized tf.train.Example
def tfrecord_maker(data,output_file):
  with tf.io.TFRecordWriter(output_file) as w:
    for line in data[1:]: # skip the first line, which holds the column names
      feature={"css_line":tf.train.Feature(bytes_list=tf.train.BytesList(value=[line.encode('utf-8')]))}
      example=tf.train.Example(features=tf.train.Features(feature=feature))
      w.write(example.SerializeToString())

In [10]:
# Write the TFRecord file
tfrecord_maker(read_csv("/content/train.csv"),"test_tfr")

In [11]:
# Storing the column names as metadata is the standard approach
# The metadata need not look like this; it may also be kept in a separate file or take a Java-style form
feature_name=read_csv("/content/train.csv")[0].strip().split(",") # strip() removes the trailing newline
print(feature_name)

['sessionID', 'userID', 'TARGET', 'browser', 'OS', 'device', 'new', 'quality', 'duration', 'bounced', 'transaction', 'transaction_revenue', 'continent', 'subcontinent', 'country', 'traffic_source', 'traffic_medium', 'keyword', 'referral_path']
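As one possible way of keeping the column names in a separate file, the sketch below writes them to a JSON sidecar next to the TFRecord. The file name `test_tfr.meta.json`, the `"columns"` key, and the shortened header line are all assumptions for illustration, not a TensorFlow convention.

```python
import json
import os
import tempfile

# Shortened stand-in for the real header line of train.csv.
header_line = "sessionID,userID,TARGET\n"
feature_names = header_line.strip().split(",")

# Hypothetical sidecar file stored alongside the TFRecord.
meta_path = os.path.join(tempfile.mkdtemp(), "test_tfr.meta.json")
with open(meta_path, "w", encoding="utf-8") as f:
    json.dump({"columns": feature_names}, f)

# Later, whoever reads the TFRecord can restore the column names first.
with open(meta_path, encoding="utf-8") as f:
    restored_names = json.load(f)["columns"]
print(restored_names)
```

Keeping the metadata outside the record file means every record in the TFRecord has the same structure, which makes parsing simpler.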


In [12]:
# A TFRecord that also carries the header can be written like this.
def tfrecord_maker(data,output_file):
  with tf.io.TFRecordWriter(output_file) as w:
    # First record: the header line, stored under the "meta" key
    feature={"meta":tf.train.Feature(bytes_list=tf.train.BytesList(value=[data[0].encode("utf-8")]))}
    example=tf.train.Example(features=tf.train.Features(feature=feature))
    w.write(example.SerializeToString())
    for line in data[1:]: # the first line holds the column names, so start from the second
      feature={"css_line":tf.train.Feature(bytes_list=tf.train.BytesList(value=[line.encode('utf-8')]))}
      example=tf.train.Example(features=tf.train.Features(feature=feature))
      w.write(example.SerializeToString())

In [13]:
tfrecord_maker(read_csv("/content/train.csv"),"test_tfr")

## 3. Reading the file back

In [14]:
test_tfr=tf.data.TFRecordDataset(["/content/test_tfr"])

In [15]:
test_tfr

<TFRecordDatasetV2 element_spec=TensorSpec(shape=(), dtype=tf.string, name=None)>

In [16]:
j=0
for i in test_tfr:
  print(i)
  j+=1
  if j==3:
    break

tf.Tensor(b'\n\xcc\x01\n\xc9\x01\n\x04meta\x12\xc0\x01\n\xbd\x01\n\xba\x01sessionID,userID,TARGET,browser,OS,device,new,quality,duration,bounced,transaction,transaction_revenue,continent,subcontinent,country,traffic_source,traffic_medium,keyword,referral_path\n', shape=(), dtype=string)
tf.Tensor(b'\n\xa8\x01\n\xa5\x01\n\x08css_line\x12\x98\x01\n\x95\x01\n\x92\x01SESSION_000000,USER_000000,17.0,Chrome,Macintosh,desktop,0,45.0,839.0,0,0.0,0.0,Americas,Northern America,United States,google,organic,Category8,\n', shape=(), dtype=string)
tf.Tensor(b'\n\x99\x01\n\x96\x01\n\x08css_line\x12\x89\x01\n\x86\x01\n\x83\x01SESSION_000001,USER_000001,3.0,Chrome,Windows,desktop,1,1.0,39.0,0,0.0,0.0,Europe,Western Europe,Germany,google,organic,Category8,\n', shape=(), dtype=string)
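The tensors printed above are still serialized `tf.train.Example` protos. A minimal sketch of decoding them follows, assuming the same layout as the file written earlier (one `meta` record followed by `css_line` records). It writes its own three-line demo file so that it runs on its own; `feature_spec` and `parse` are illustrative names.

```python
import os
import tempfile

import tensorflow as tf


def bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))


def make_example(key, text):
    feature = {key: bytes_feature(text.encode("utf-8"))}
    return tf.train.Example(features=tf.train.Features(feature=feature))


# Tiny stand-in for train.csv: a header line plus two data lines.
lines = ["a,b\n", "1,2\n", "3,4\n"]
path = os.path.join(tempfile.mkdtemp(), "demo_tfr")
with tf.io.TFRecordWriter(path) as w:
    w.write(make_example("meta", lines[0]).SerializeToString())
    for line in lines[1:]:
        w.write(make_example("css_line", line).SerializeToString())

# Each record has only one of the two keys, so give both a default value.
feature_spec = {
    "meta": tf.io.FixedLenFeature([], tf.string, default_value=""),
    "css_line": tf.io.FixedLenFeature([], tf.string, default_value=""),
}


def parse(serialized):
    return tf.io.parse_single_example(serialized, feature_spec)


parsed = tf.data.TFRecordDataset([path]).map(parse)
records = [{k: v.numpy().decode("utf-8") for k, v in r.items()} for r in parsed]

header = records[0]["meta"].strip().split(",")
rows = [r["css_line"].strip().split(",") for r in records[1:]]
print(header, rows)
```

The default values are needed only because the header and the data rows use different keys inside the same file.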


In [17]:
train_data=pd.read_csv("/content/train.csv")

In [18]:
train=train_data.astype("str")

In [19]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37861 entries, 0 to 37860
Data columns (total 19 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   sessionID            37861 non-null  object
 1   userID               37861 non-null  object
 2   TARGET               37861 non-null  object
 3   browser              37861 non-null  object
 4   OS                   37861 non-null  object
 5   device               37861 non-null  object
 6   new                  37861 non-null  object
 7   quality              37861 non-null  object
 8   duration             37861 non-null  object
 9   bounced              37861 non-null  object
 10  transaction          37861 non-null  object
 11  transaction_revenue  37861 non-null  object
 12  continent            37861 non-null  object
 13  subcontinent         37861 non-null  object
 14  country              37861 non-null  object
 15  traffic_source       37861 non-null  object
 16  traf

In [20]:
values=[]
for col in train.columns:
  values.append(train[col].values)
values=tuple(values)

In [21]:
values

(array(['SESSION_000000', 'SESSION_000001', 'SESSION_000002', ...,
        'SESSION_037858', 'SESSION_037859', 'SESSION_037860'], dtype=object),
 array(['USER_000000', 'USER_000001', 'USER_000002', ..., 'USER_032131',
        'USER_032132', 'USER_032133'], dtype=object),
 array(['17.0', '3.0', '1.0', ..., '1.0', '1.0', '1.0'], dtype=object),
 array(['Chrome', 'Chrome', 'Samsung Internet', ..., 'Chrome', 'Edge',
        'Chrome'], dtype=object),
 array(['Macintosh', 'Windows', 'Android', ..., 'iOS', 'Windows',
        'Windows'], dtype=object),
 array(['desktop', 'desktop', 'mobile', ..., 'mobile', 'desktop',
        'desktop'], dtype=object),
 array(['0', '1', '1', ..., '1', '1', '1'], dtype=object),
 array(['45.0', '1.0', '1.0', ..., '1.0', '1.0', '1.0'], dtype=object),
 array(['839.0', '39.0', '0.0', ..., '0.0', '0.0', '0.0'], dtype=object),
 array(['0', '0', '1', ..., '1', '1', '1'], dtype=object),
 array(['0.0', '0.0', '0.0', ..., '0.0', '0.0', '0.0'], dtype=object),
 array(['0.0',

We will not actually use this approach, though: it is a roundabout route and very inefficient.

In [22]:
trans_train=tf.data.Dataset.from_tensor_slices(values)

In [23]:
trans_train

<_TensorSliceDataset element_spec=(TensorSpec(shape=(), dtype=tf.string, name=None), TensorSpec(shape=(), dtype=tf.string, name=None), TensorSpec(shape=(), dtype=tf.string, name=None), TensorSpec(shape=(), dtype=tf.string, name=None), TensorSpec(shape=(), dtype=tf.string, name=None), TensorSpec(shape=(), dtype=tf.string, name=None), TensorSpec(shape=(), dtype=tf.string, name=None), TensorSpec(shape=(), dtype=tf.string, name=None), TensorSpec(shape=(), dtype=tf.string, name=None), TensorSpec(shape=(), dtype=tf.string, name=None), TensorSpec(shape=(), dtype=tf.string, name=None), TensorSpec(shape=(), dtype=tf.string, name=None), TensorSpec(shape=(), dtype=tf.string, name=None), TensorSpec(shape=(), dtype=tf.string, name=None), TensorSpec(shape=(), dtype=tf.string, name=None), TensorSpec(shape=(), dtype=tf.string, name=None), TensorSpec(shape=(), dtype=tf.string, name=None), TensorSpec(shape=(), dtype=tf.string, name=None), TensorSpec(shape=(), dtype=tf.string, name=None))>

In [24]:
j=0
for i in trans_train:
  print(i)
  j+=1
  if j==3:
    break

(<tf.Tensor: shape=(), dtype=string, numpy=b'SESSION_000000'>, <tf.Tensor: shape=(), dtype=string, numpy=b'USER_000000'>, <tf.Tensor: shape=(), dtype=string, numpy=b'17.0'>, <tf.Tensor: shape=(), dtype=string, numpy=b'Chrome'>, <tf.Tensor: shape=(), dtype=string, numpy=b'Macintosh'>, <tf.Tensor: shape=(), dtype=string, numpy=b'desktop'>, <tf.Tensor: shape=(), dtype=string, numpy=b'0'>, <tf.Tensor: shape=(), dtype=string, numpy=b'45.0'>, <tf.Tensor: shape=(), dtype=string, numpy=b'839.0'>, <tf.Tensor: shape=(), dtype=string, numpy=b'0'>, <tf.Tensor: shape=(), dtype=string, numpy=b'0.0'>, <tf.Tensor: shape=(), dtype=string, numpy=b'0.0'>, <tf.Tensor: shape=(), dtype=string, numpy=b'Americas'>, <tf.Tensor: shape=(), dtype=string, numpy=b'Northern America'>, <tf.Tensor: shape=(), dtype=string, numpy=b'United States'>, <tf.Tensor: shape=(), dtype=string, numpy=b'google'>, <tf.Tensor: shape=(), dtype=string, numpy=b'organic'>, <tf.Tensor: shape=(), dtype=string, numpy=b'Category8'>, <tf.Tens

-----


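## 6. So why do we need TFRecord?

- It is the format best suited to `tf.data.Dataset`.
- Storage size can shrink considerably, especially when the file is written with compression.
- GZIP-compressed TFRecord files can be used to exchange data over a network.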
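The compression point can be checked with a small experiment. The sketch below writes the same records twice, once plain and once with GZIP, and compares the file sizes. The 1,000 synthetic lines are an assumption; real data will compress by a different ratio.

```python
import os
import tempfile

import tensorflow as tf

# Synthetic, repetitive CSV-like lines (illustrative data only).
lines = [f"SESSION_{i:06d},Chrome,desktop,Americas\n".encode("utf-8")
         for i in range(1000)]


def write_records(path, options=None):
    """Write every line as one tf.train.Example, optionally compressed."""
    with tf.io.TFRecordWriter(path, options=options) as w:
        for line in lines:
            feature = {"css_line": tf.train.Feature(
                bytes_list=tf.train.BytesList(value=[line]))}
            example = tf.train.Example(features=tf.train.Features(feature=feature))
            w.write(example.SerializeToString())


out_dir = tempfile.mkdtemp()
raw_path = os.path.join(out_dir, "plain_tfr")
gz_path = os.path.join(out_dir, "gzip_tfr")

write_records(raw_path)
write_records(gz_path, tf.io.TFRecordOptions(compression_type="GZIP"))

raw_size = os.path.getsize(raw_path)
gz_size = os.path.getsize(gz_path)
print(raw_size, gz_size)

# The reader must be told about the compression as well.
gz_dataset = tf.data.TFRecordDataset(gz_path, compression_type="GZIP")
n_read = sum(1 for _ in gz_dataset)
```

Without compression, wrapping each text line in an `Example` adds some bytes per record, so the size benefit comes mainly from the GZIP option.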

---------

## A look at TensorFlow's dataset format

### 1. Data API
Let's take a brief look.

In [25]:
# The approach described above as inefficient
test_dataset=tf.data.Dataset.from_tensor_slices([
    [1,2,3],
    [4,5,6]
])

In [26]:
for i in test_dataset:
  print(i)

tf.Tensor([1 2 3], shape=(3,), dtype=int32)
tf.Tensor([4 5 6], shape=(3,), dtype=int32)


We need the functionality shown below.

Applying the method above, we can build a dataset like the following.

In [27]:
# The data shown earlier
train_data=pd.read_csv("/content/train.csv")

In [28]:
# Column names
dict(train_data).keys()

dict_keys(['sessionID', 'userID', 'TARGET', 'browser', 'OS', 'device', 'new', 'quality', 'duration', 'bounced', 'transaction', 'transaction_revenue', 'continent', 'subcontinent', 'country', 'traffic_source', 'traffic_medium', 'keyword', 'referral_path'])

NaN values cause errors, so the next cell preprocesses them away.

Feel free to preprocess the data in whatever way you find convenient.

In [29]:
# NaN values raise errors, so impute them first
import sklearn.preprocessing as skpre
import sklearn.pipeline as skpip
import sklearn.impute as skimp
import sklearn.compose as skcom
import numpy as np

num_cols=train_data.select_dtypes(np.number).columns # numeric columns
obj_cols=train_data.select_dtypes("object").columns # string (object) columns

num_pipe=skpip.make_pipeline(skimp.SimpleImputer(strategy="mean")) # fill NaN with the column mean
obj_pipe=skpip.make_pipeline(skimp.SimpleImputer(strategy="most_frequent")) # fill NaN with the most frequent value

total_pipe=skcom.make_column_transformer(
    (num_pipe,num_cols),
    (obj_pipe,obj_cols),
    remainder="passthrough" # columns not listed above are passed through unchanged -> the details are not important here
)

total_pipe.fit(train_data)

In [30]:
def recall_column_name(x):
  x=x.replace("pipeline-1__","")
  x=x.replace("pipeline-2__","")
  return x
column_names_=list(map(recall_column_name,total_pipe.get_feature_names_out()))

In [31]:
train_data=pd.DataFrame(total_pipe.transform(train_data),columns=column_names_)

In [32]:
train_data.head(3)

Unnamed: 0,TARGET,new,quality,duration,bounced,transaction,transaction_revenue,sessionID,userID,browser,OS,device,continent,subcontinent,country,traffic_source,traffic_medium,keyword,referral_path
0,17.0,0.0,45.0,839.0,0.0,0.0,0.0,SESSION_000000,USER_000000,Chrome,Macintosh,desktop,Americas,Northern America,United States,google,organic,Category8,Category1
1,3.0,1.0,1.0,39.0,0.0,0.0,0.0,SESSION_000001,USER_000001,Chrome,Windows,desktop,Europe,Western Europe,Germany,google,organic,Category8,Category1
2,1.0,1.0,1.0,0.0,1.0,0.0,0.0,SESSION_000002,USER_000002,Samsung Internet,Android,mobile,Asia,Southeast Asia,Malaysia,(direct),(none),Category8,Category1


In [33]:
train_data[num_cols]=train_data[num_cols].astype("float")

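# Convert the data into a form that is convenient to use. This route was called inefficient above, but we follow it here for demonstration.
features=train_data.columns.difference(["TARGET"]) # exclude the dependent variable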
target="TARGET"

In [34]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45462 entries, 0 to 45461
Data columns (total 19 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   TARGET               45462 non-null  float64
 1   new                  45462 non-null  float64
 2   quality              45462 non-null  float64
 3   duration             45462 non-null  float64
 4   bounced              45462 non-null  float64
 5   transaction          45462 non-null  float64
 6   transaction_revenue  45462 non-null  float64
 7   sessionID            45462 non-null  object 
 8   userID               45462 non-null  object 
 9   browser              45462 non-null  object 
 10  OS                   45462 non-null  object 
 11  device               45462 non-null  object 
 12  continent            45462 non-null  object 
 13  subcontinent         45462 non-null  object 
 14  country              45462 non-null  object 
 15  traffic_source       45462 non-null 

In [35]:
# Focus on the structure
# Columns must be grouped by dtype to avoid errors -> for the TFRecord everything was cast to str (to mirror the CSV file)
num_features=train_data.select_dtypes(np.number).columns.difference([target])
obj_features=train_data.select_dtypes("object").columns.difference([target])
train_slice_={"num_features":train_data[num_features].values,
              "obj_features":train_data[obj_features].values,
              "target":train_data[target].values}

In [36]:
train_tfd=tf.data.Dataset.from_tensor_slices(train_slice_)

In [37]:
j=0
for i in train_tfd:
  print(i)
  j+=1
  if j==3:
    break

{'num_features': <tf.Tensor: shape=(6,), dtype=float64, numpy=array([  0., 839.,   0.,  45.,   0.,   0.])>, 'obj_features': <tf.Tensor: shape=(12,), dtype=string, numpy=
array([b'Macintosh', b'Chrome', b'Americas', b'United States', b'desktop',
       b'Category8', b'Category1', b'SESSION_000000', b'Northern America',
       b'organic', b'google', b'USER_000000'], dtype=object)>, 'target': <tf.Tensor: shape=(), dtype=float64, numpy=17.0>}
{'num_features': <tf.Tensor: shape=(6,), dtype=float64, numpy=array([ 0., 39.,  1.,  1.,  0.,  0.])>, 'obj_features': <tf.Tensor: shape=(12,), dtype=string, numpy=
array([b'Windows', b'Chrome', b'Europe', b'Germany', b'desktop',
       b'Category8', b'Category1', b'SESSION_000001', b'Western Europe',
       b'organic', b'google', b'USER_000001'], dtype=object)>, 'target': <tf.Tensor: shape=(), dtype=float64, numpy=3.0>}
{'num_features': <tf.Tensor: shape=(6,), dtype=float64, numpy=array([1., 0., 1., 1., 0., 0.])>, 'obj_features': <tf.Tensor: shape=(12

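As shown above, the data can be converted with its parts kept separate.

The dataset also supports the following operations, which we go through briefly.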

#### 1. Repeating a dataset

In [41]:
train_tfd.repeat(3)

<_RepeatDataset element_spec={'num_features': TensorSpec(shape=(6,), dtype=tf.float64, name=None), 'obj_features': TensorSpec(shape=(12,), dtype=tf.string, name=None), 'target': TensorSpec(shape=(), dtype=tf.float64, name=None)}>

#### 2. Setting the batch size

In [42]:
train_tfd.batch(30) # set the batch size to 30

<_BatchDataset element_spec={'num_features': TensorSpec(shape=(None, 6), dtype=tf.float64, name=None), 'obj_features': TensorSpec(shape=(None, 12), dtype=tf.string, name=None), 'target': TensorSpec(shape=(None,), dtype=tf.float64, name=None)}>
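Because `repeat` and `batch` each return a new dataset, they can be chained, and the order of the calls matters. The sketch below uses a toy `tf.data.Dataset.range(5)` instead of `train_tfd` so that the effect is easy to see.

```python
import tensorflow as tf

ds = tf.data.Dataset.range(5)  # elements 0, 1, 2, 3, 4

# Repeat first: batches may span the boundary between two passes.
repeat_then_batch = [b.tolist() for b in ds.repeat(2).batch(3).as_numpy_iterator()]

# Batch first: each pass keeps its own (possibly smaller) final batch.
batch_then_repeat = [b.tolist() for b in ds.batch(3).repeat(2).as_numpy_iterator()]

print(repeat_then_batch)  # [[0, 1, 2], [3, 4, 0], [1, 2, 3], [4]]
print(batch_then_repeat)  # [[0, 1, 2], [3, 4], [0, 1, 2], [3, 4]]
```

If batches should never mix samples from two epochs, batching before repeating is the safer order.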

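#### 3. Shuffling a dataset
- 1. Fill a buffer with `buffer_size` samples, then draw one sample from it at random.
- 2. Refill the freed slot in the buffer with the next sample from the source.
- 3. Repeat until the source data is exhausted.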
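The three steps can be simulated in plain Python. The generator below is a rough model of the idea, not TensorFlow's actual implementation. `buffered_shuffle` is a hypothetical name.

```python
import random


def buffered_shuffle(source, buffer_size, seed=None):
    """Yield items from `source` in buffer-shuffled order (illustrative model)."""
    rng = random.Random(seed)
    buffer = []
    for item in source:
        if len(buffer) < buffer_size:
            # Step 1: fill the buffer before emitting anything.
            buffer.append(item)
            continue
        # Draw one random sample and refill its slot with the new item (step 2).
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Step 3: the source is exhausted, so drain the remaining buffer at random.
    rng.shuffle(buffer)
    yield from buffer


shuffled = list(buffered_shuffle(range(100), buffer_size=10, seed=5))
print(shuffled[:10])
```

One consequence of this scheme is that an element can only move forward by a limited amount. With a buffer of 10, the first output is always one of the first 10 inputs, so a small buffer gives only a local shuffle. A buffer at least as large as the dataset gives a full shuffle.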

In [45]:
shuffled_tdf=train_tfd.shuffle(buffer_size=30,seed=5)

In [47]:
j=0
for i in shuffled_tdf:
  print(i)
  j+=1
  if j==3:
    break

{'num_features': <tf.Tensor: shape=(6,), dtype=float64, numpy=array([ 0., 34.,  1.,  2.,  0.,  0.])>, 'obj_features': <tf.Tensor: shape=(12,), dtype=string, numpy=
array([b'Macintosh', b'Chrome', b'Americas', b'United States', b'desktop',
       b'Category8', b'Category1', b'SESSION_000014', b'Northern America',
       b'organic', b'google', b'USER_000014'], dtype=object)>, 'target': <tf.Tensor: shape=(), dtype=float64, numpy=4.0>}
{'num_features': <tf.Tensor: shape=(6,), dtype=float64, numpy=array([1., 0., 1., 1., 0., 0.])>, 'obj_features': <tf.Tensor: shape=(12,), dtype=string, numpy=
array([b'iOS', b'Safari (in-app)', b'Americas', b'Mexico', b'mobile',
       b'Category9', b'Category1', b'SESSION_000016', b'Central America',
       b'cpc', b'google', b'USER_000016'], dtype=object)>, 'target': <tf.Tensor: shape=(), dtype=float64, numpy=1.0>}
{'num_features': <tf.Tensor: shape=(6,), dtype=float64, numpy=array([  0., 208.,   0.,  13.,   0.,   0.])>, 'obj_features': <tf.Tensor: shape=(1

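TFRecord is a data format that fits TensorFlow very well, and this notebook has given a brief introduction to how TensorFlow handles data.

Data preprocessing and further features will be covered step by step later on.

That concludes this short introduction to TFRecord and TensorFlow datasets.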