tf.Example (TFExample) lets you serialize your data into binary form.

It simplifies creating binary records for saving and loading data with TFRecords.

Data is held as key/value pairs inside a dictionary-like object.

Each value is typed (a bytes, float, or int64 list) and nested inside one message per record of the dataset.

In short, tf.Example is a way of serializing dictionaries to byte strings.

In [1]:
import numpy as np
import pandas as pd
import tensorflow as tf

##### To read data efficiently, it can be helpful to serialize your data and store it in a set of files (100-200 MB each) that can each be read linearly. This is especially true if the data is being streamed over a network. It can also be useful for caching the results of any data preprocessing.

##### The TFRecord format is a simple format for storing a sequence of binary records.

##### Protocol buffers are a cross-platform, cross-language library for efficient serialization of structured data.

##### Protocol messages are defined by .proto files; these are often the easiest way to understand a message type.

##### The tf.Example message (or protobuf) is a flexible message type that represents a {"string": value} mapping. It is designed for use with TensorFlow and is used throughout the higher-level APIs such as TFX.
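The advice above about storing serialized data in a set of files can be sketched as follows. This is a minimal illustration, not part of the original notebook: the helper name `write_shards`, the shard file naming pattern, and the toy records are all assumptions, and records are distributed round-robin rather than by target file size.

```python
import os
import tempfile

import tensorflow as tf


def write_shards(serialized_records, out_dir, num_shards):
    """Write byte strings round-robin into `num_shards` TFRecord files.

    Hypothetical helper for illustration; returns the list of shard paths.
    """
    paths = [
        os.path.join(out_dir, f"data-{i:05d}-of-{num_shards:05d}.tfrecord")
        for i in range(num_shards)
    ]
    writers = [tf.io.TFRecordWriter(p) for p in paths]
    try:
        for i, record in enumerate(serialized_records):
            writers[i % num_shards].write(record)
    finally:
        # Always close the writers so every shard is flushed to disk.
        for w in writers:
            w.close()
    return paths


# Toy records standing in for serialized tf.Example messages.
records = [f"record-{i}".encode() for i in range(10)]
shard_paths = write_shards(records, tempfile.mkdtemp(), num_shards=3)

# TFRecordDataset accepts a list of files and reads them one after another.
read_back = sorted(r.numpy() for r in tf.data.TFRecordDataset(shard_paths))
```

In a real pipeline the number of shards would be chosen so that each file lands in the 100-200 MB range mentioned above.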

Let's create a dataset with four features of different types using tf.data.Dataset.

In [2]:
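n_observations = 10000

# Boolean feature, False or True.
feature0 = np.random.choice([False, True], n_observations)
# Integer feature, random from 0 to 4.
feature1 = np.random.randint(0, 5, n_observations)
# String feature, looked up from the integer feature.
strings = np.array([b'a', b'b', b'c', b'd', b'e'])
feature2 = strings[feature1]
# Float feature, drawn from a standard normal distribution.
feature3 = np.random.randn(n_observations)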

In [3]:
features_dataset = tf.data.Dataset.from_tensor_slices((feature0, feature1, feature2, feature3))

In [4]:
features_dataset

<TensorSliceDataset shapes: ((), (), (), ()), types: (tf.bool, tf.int32, tf.string, tf.float64)>

In [6]:
for f0, f1, f2, f3 in features_dataset.take(1):
    print(f0)
    print(f1)
    print(f2)
    print(f3)

tf.Tensor(False, shape=(), dtype=bool)
tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(b'a', shape=(), dtype=string)
tf.Tensor(-0.15304671816324567, shape=(), dtype=float64)


Here are a couple of helper functions that convert NumPy values to tf.train.Feature types, followed by a function that serializes one observation.

In [12]:
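def _bytes_features(value):
    """Returns a bytes_list Feature from a string / bytes value."""
    if isinstance(value, type(tf.constant(0))):
        # BytesList won't unpack a string from an EagerTensor.
        value = value.numpy()
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _float_features(value):
    """Returns a float_list Feature from a float / double value."""
    return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))

def _int64_features(value):
    """Returns an int64_list Feature from a bool / enum / int / uint value."""
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

def serialize_example(feature0, feature1, feature2, feature3):
    """Creates a tf.train.Example message and serializes it to a byte string."""
    # Map each feature name to its tf.train.Feature-compatible value.
    feature = {
        'feature0': _int64_features(feature0),
        'feature1': _int64_features(feature1),
        'feature2': _bytes_features(feature2),
        'feature3': _float_features(feature3),
    }
    # Create a Features message using tf.train.Example.
    example_proto = tf.train.Example(features=tf.train.Features(feature=feature))
    return example_proto.SerializeToString()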

In [13]:
# let's check the serialize_example function

serialized_example = serialize_example(False, 4, b'c', 0.1234)
serialized_example

b'\nO\n\x11\n\x08feature0\x12\x05\x1a\x03\n\x01\x00\n\x11\n\x08feature1\x12\x05\x1a\x03\n\x01\x04\n\x11\n\x08feature2\x12\x05\n\x03\n\x01c\n\x14\n\x08feature3\x12\x08\x12\x06\n\x04$\xb9\xfc='
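The last four bytes of the serialized output above, `$\xb9\xfc=`, are the value 0.1234 stored as a 32-bit float, which is the width FloatList uses. The standard-library sketch below reproduces those bytes and shows why a value read back from a float_list differs slightly from the 64-bit Python float that went in. The variable names are illustrative only.

```python
import struct

# Pack 0.1234 as a little-endian 32-bit float.
packed = struct.pack("<f", 0.1234)

# Unpack it again: the result is the nearest float32 value, widened to float64.
round_tripped = struct.unpack("<f", packed)[0]

print(packed)                   # the same four bytes that end the serialized message
print(round_tripped == 0.1234)  # False: precision was lost in the 32-bit encoding
```

If full float64 precision matters, one option is to store the raw bytes of the value in a bytes_list instead.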

In [14]:
example_returned = tf.train.Example.FromString(serialized_example)
example_returned

features {
  feature {
    key: "feature0"
    value {
      int64_list {
        value: 0
      }
    }
  }
  feature {
    key: "feature1"
    value {
      int64_list {
        value: 4
      }
    }
  }
  feature {
    key: "feature2"
    value {
      bytes_list {
        value: "c"
      }
    }
  }
  feature {
    key: "feature3"
    value {
      float_list {
        value: 0.1234000027179718
      }
    }
  }
}
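Individual values can also be read back from a decoded message by key. The sketch below is self-contained and rebuilds a small two-feature message rather than reusing the notebook's variables; the names `restored` and `feature_map` are illustrative.

```python
import tensorflow as tf

# Build and serialize a small Example with two of the features used above.
example = tf.train.Example(features=tf.train.Features(feature={
    "feature1": tf.train.Feature(int64_list=tf.train.Int64List(value=[4])),
    "feature2": tf.train.Feature(bytes_list=tf.train.BytesList(value=[b"c"])),
}))
restored = tf.train.Example.FromString(example.SerializeToString())

# `features.feature` behaves like a dictionary keyed by feature name.
feature_map = restored.features.feature
f1 = feature_map["feature1"].int64_list.value[0]
f2 = feature_map["feature2"].bytes_list.value[0]

# Each Feature holds exactly one of bytes_list, float_list, or int64_list.
kind = feature_map["feature2"].WhichOneof("kind")
```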

### TFRecords

TFRecord is a simple format for storing a sequence of binary records. It lets data be read linearly and efficiently from disk, which is especially beneficial when the data is streamed over a network.

tf.Example converts each observation into a binary record. TFRecord stores a sequence of those records in a file, so the data can be loaded from disk more quickly.
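To make "a sequence of binary records" concrete, here is a pure-Python sketch of the record framing as described in TensorFlow's TFRecord documentation: an 8-byte little-endian length, a masked CRC-32C of the length, the payload, and a masked CRC-32C of the payload. The helper names (`crc32c`, `masked_crc`, `frame_record`, `read_records`) are hypothetical and exist only for illustration; in practice you would use `tf.io.TFRecordWriter` and `tf.data.TFRecordDataset`.

```python
import struct


def _make_crc32c_table():
    # Table for the reflected CRC-32C (Castagnoli) polynomial.
    table = []
    for byte in range(256):
        crc = byte
        for _ in range(8):
            crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
        table.append(crc)
    return table


_CRC32C_TABLE = _make_crc32c_table()


def crc32c(data: bytes) -> int:
    crc = 0xFFFFFFFF
    for b in data:
        crc = _CRC32C_TABLE[(crc ^ b) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def masked_crc(data: bytes) -> int:
    # Rotate right by 15 bits and add a constant, in 32-bit arithmetic.
    crc = crc32c(data)
    return (((crc >> 15) | (crc << 17)) + 0xA282EAD8) & 0xFFFFFFFF


def frame_record(payload: bytes) -> bytes:
    """Wrap one payload in the length + checksum framing."""
    length = struct.pack("<Q", len(payload))
    return (length + struct.pack("<I", masked_crc(length))
            + payload + struct.pack("<I", masked_crc(payload)))


def read_records(blob: bytes):
    """Yield payloads from concatenated framed records, verifying checksums."""
    offset = 0
    while offset < len(blob):
        length_bytes = blob[offset:offset + 8]
        (length,) = struct.unpack("<Q", length_bytes)
        (length_crc,) = struct.unpack("<I", blob[offset + 8:offset + 12])
        if length_crc != masked_crc(length_bytes):
            raise ValueError("corrupt length checksum")
        payload = blob[offset + 12:offset + 12 + length]
        (data_crc,) = struct.unpack(
            "<I", blob[offset + 12 + length:offset + 16 + length])
        if data_crc != masked_crc(payload):
            raise ValueError("corrupt data checksum")
        yield payload
        offset += 16 + length


blob = frame_record(b"a") + frame_record(b"hello")
payloads = list(read_records(blob))
```

Because every record carries its own length, a reader can move through the file strictly front to back, which is what makes linear reads cheap.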


Caveat: TFRecord adds complexity. Only use it if loading data is the bottleneck in training.

In [15]:
# Now, save and load tf.Example data using TFRecord

In [16]:
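def tf_serialize_example(f0, f1, f2, f3):
    # Wrap the Python function so it can be used inside Dataset.map.
    tf_string = tf.py_function(
        serialize_example,
        (f0, f1, f2, f3),  # Pass these args to the function above.
        tf.string)         # The return type is tf.string.
    return tf.reshape(tf_string, ())  # The result is a scalar.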

In [17]:
serialized_dataset = features_dataset.map(tf_serialize_example)

In [18]:
serialized_dataset

<MapDataset shapes: (), types: tf.string>

In [19]:
def generator():
    for features in features_dataset:
        yield serialize_example(*features)

In [20]:
serialized_dataset = tf.data.Dataset.from_generator(generator, output_types=tf.string, output_shapes=())

In [21]:
serialized_dataset

<DatasetV1Adapter shapes: (), types: tf.string>
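In more recent TensorFlow releases, the `output_types` and `output_shapes` arguments of `from_generator` are deprecated in favor of `output_signature`. The sketch below assumes TensorFlow 2.4 or later and uses a toy generator in place of the notebook's `serialize_example` calls.

```python
import tensorflow as tf


def generator():
    # Stand-in for yielding serialized tf.Example byte strings.
    for i in range(3):
        yield f"record-{i}".encode()


dataset = tf.data.Dataset.from_generator(
    generator,
    # One scalar string tensor per element.
    output_signature=tf.TensorSpec(shape=(), dtype=tf.string),
)
values = [v.numpy() for v in dataset]
```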

# writing

filename = 'tf_data_pipelines.tfrecord'
writer = tf.data.experimental.TFRecordWriter(filename)
writer.write(serialized_dataset)


# loading

filenames = [filename]

raw_dataset = tf.data.TFRecordDataset(filenames)

raw_dataset
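
At this point each element of the raw dataset is still a serialized byte string. A minimal sketch of decoding the records back into typed tensors follows, assuming the four-feature layout used in this notebook. To stay self-contained it writes its own small file to a temporary directory; `make_example`, `feature_description`, and `parse_example` are illustrative names, not part of the original notebook.

```python
import os
import tempfile

import tensorflow as tf


def make_example(f0, f1, f2, f3):
    # Same layout as serialize_example: int64, int64, bytes, float.
    feature = {
        "feature0": tf.train.Feature(int64_list=tf.train.Int64List(value=[int(f0)])),
        "feature1": tf.train.Feature(int64_list=tf.train.Int64List(value=[f1])),
        "feature2": tf.train.Feature(bytes_list=tf.train.BytesList(value=[f2])),
        "feature3": tf.train.Feature(float_list=tf.train.FloatList(value=[f3])),
    }
    example = tf.train.Example(features=tf.train.Features(feature=feature))
    return example.SerializeToString()


path = os.path.join(tempfile.mkdtemp(), "demo.tfrecord")
with tf.io.TFRecordWriter(path) as writer:
    writer.write(make_example(False, 4, b"c", 0.5))
    writer.write(make_example(True, 1, b"b", -1.5))

# Describe the expected type and shape of every feature.
feature_description = {
    "feature0": tf.io.FixedLenFeature([], tf.int64),
    "feature1": tf.io.FixedLenFeature([], tf.int64),
    "feature2": tf.io.FixedLenFeature([], tf.string),
    "feature3": tf.io.FixedLenFeature([], tf.float32),
}


def parse_example(serialized):
    # Decode one serialized tf.Example into a dict of tensors.
    return tf.io.parse_single_example(serialized, feature_description)


parsed_dataset = tf.data.TFRecordDataset([path]).map(parse_example)
rows = [{k: v.numpy() for k, v in row.items()} for row in parsed_dataset]
```

Each element of `parsed_dataset` is a dictionary keyed by feature name, mirroring the key/value structure of the original tf.Example.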