# TensorFlow Records

TFRecord is TensorFlow's own binary storage format. Binary data takes up less space on disk, takes less time to copy, and can be read much more efficiently. The benefit is largest when your data is stored on spinning disks, whose read/write performance is much lower than that of SSDs.

In this tutorial, we use the movie recommendation application from the TensorFlow documentation as an example:

In [1]:
import tensorflow as tf

In [2]:
# Create example data
data = {
    # Context
    'Locale': 'pt_BR',
    'Age': 19,
    'Favorites': ['Majesty Rose', 'Savannah Outen', 'One Direction'],
    # Data
    'Data': [
        {   # Movie 1
            'Movie Name': 'The Shawshank Redemption',
            'Movie Rating': 9.0,
            'Actors': ['Tim Robbins', 'Morgan Freeman']
        },
        {   # Movie 2
            'Movie Name': 'Fight Club',
            'Movie Rating': 9.7,
            'Actors': ['Brad Pitt', 'Edward Norton', 'Helena Bonham Carter']
        }
    ]
}

print(data)

{'Locale': 'pt_BR', 'Age': 19, 'Favorites': ['Majesty Rose', 'Savannah Outen', 'One Direction'], 'Data': [{'Movie Name': 'The Shawshank Redemption', 'Movie Rating': 9.0, 'Actors': ['Tim Robbins', 'Morgan Freeman']}, {'Movie Name': 'Fight Club', 'Movie Rating': 9.7, 'Actors': ['Brad Pitt', 'Edward Norton', 'Helena Bonham Carter']}]}


## Structuring TFRecords

A TFRecord file stores your data as a sequence of binary strings, so you need to specify the structure of your data before you write it to the file. TensorFlow provides two components for this purpose: `tf.train.Example` and `tf.train.SequenceExample`. Store each sample of your data in one of these structures, serialize it, and write it to disk with a `tf.io.TFRecordWriter`.

`tf.train.BytesList`, `tf.train.FloatList`, and `tf.train.Int64List` are the building blocks of a `tf.train.Feature`. Each has a single attribute, `value`, which expects a list of bytes, floats, or integers, respectively.

In [3]:
movie_name_list = tf.train.BytesList(value=[b'The Shawshank Redemption', b'Fight Club'])
movie_rating_list = tf.train.FloatList(value=[9.0, 9.7])
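
The cell above covers bytes and floats. As a minimal sketch of the remaining type, the block below wraps the `Age` and `Favorites` values from the example data. The variable names are illustrative and not part of the original notebook. It also shows two details that are easy to miss: `BytesList` needs `bytes`, so Python strings must be encoded first, and `FloatList` stores 32-bit floats, so a value such as 9.7 is only kept approximately.

```python
import tensorflow as tf

# Int64List holds integers, e.g. the user's age from the example data.
age_list = tf.train.Int64List(value=[19])

# BytesList expects bytes, so str values are encoded (UTF-8 assumed here).
favorites = ['Majesty Rose', 'Savannah Outen', 'One Direction']
favorites_list = tf.train.BytesList(value=[s.encode('utf-8') for s in favorites])

# FloatList stores 32-bit floats, so 9.7 is kept only approximately.
rating_list = tf.train.FloatList(value=[9.7])

print(list(age_list.value))  # [19]
print(list(favorites_list.value))
print(rating_list.value[0])
```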

`tf.train.Feature` wraps a list of data of a specific type so TensorFlow can understand it. It has a single attribute, which is a **union** of `bytes_list`, `float_list`, and `int64_list`. Being a union, it holds exactly one list at a time: a `tf.train.BytesList` (attribute name `bytes_list`), a `tf.train.FloatList` (attribute name `float_list`), or a `tf.train.Int64List` (attribute name `int64_list`).

In [4]:
movie_names = tf.train.Feature(bytes_list=movie_name_list)
movie_ratings = tf.train.Feature(float_list=movie_rating_list)
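
To make the union concrete, here is a small sketch that asks a feature which member is populated. It relies on the protobuf method `WhichOneof` and on the union being named `kind`, as it is in `example.proto`. The variable names are illustrative.

```python
import tensorflow as tf

rating = tf.train.Feature(float_list=tf.train.FloatList(value=[9.0, 9.7]))

# The union in example.proto is a protobuf oneof called 'kind'.
print(rating.WhichOneof('kind'))  # float_list

# The other members of the union stay empty.
print(len(rating.bytes_list.value))  # 0
print(len(rating.int64_list.value))  # 0

# A feature with nothing set reports no active member.
empty = tf.train.Feature()
print(empty.WhichOneof('kind'))  # None
```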

`tf.train.Features` is a collection of named features. It has a single attribute, `feature`, which expects a dictionary that maps each feature name to a `tf.train.Feature`.

In [5]:
movie_dict = {
  'Movie Names': movie_names,
  'Movie Ratings': movie_ratings
}
movies = tf.train.Features(feature=movie_dict)

`tf.train.Example` is one of the main components for structuring a TFRecord. A `tf.train.Example` stores its features in a single attribute, `features`, of type `tf.train.Features`.

In [6]:
example = tf.train.Example(features=movies)
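
`tf.train.SequenceExample` was mentioned earlier but is not used in this notebook. The sketch below shows one way the example data could be mapped onto it: the per-user values go into `context`, and the per-movie values go into `feature_lists`, with one `tf.train.Feature` per movie. The helper functions and the feature names (`Movie Actors`, for instance) are assumptions made for illustration.

```python
import tensorflow as tf

# Hypothetical helpers, not part of the original notebook.
def bytes_feature(values):
    return tf.train.Feature(bytes_list=tf.train.BytesList(
        value=[v.encode('utf-8') for v in values]))

def float_feature(values):
    return tf.train.Feature(float_list=tf.train.FloatList(value=values))

def int64_feature(values):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=values))

movies_data = [
    {'Movie Name': 'The Shawshank Redemption', 'Movie Rating': 9.0,
     'Actors': ['Tim Robbins', 'Morgan Freeman']},
    {'Movie Name': 'Fight Club', 'Movie Rating': 9.7,
     'Actors': ['Brad Pitt', 'Edward Norton', 'Helena Bonham Carter']},
]

# Context: values that describe the whole sample (the user).
context = tf.train.Features(feature={
    'Locale': bytes_feature(['pt_BR']),
    'Age': int64_feature([19]),
    'Favorites': bytes_feature(['Majesty Rose', 'Savannah Outen', 'One Direction']),
})

# Feature lists: one Feature per movie, so each list has length 2 here.
feature_lists = tf.train.FeatureLists(feature_list={
    'Movie Names': tf.train.FeatureList(
        feature=[bytes_feature([m['Movie Name']]) for m in movies_data]),
    'Movie Ratings': tf.train.FeatureList(
        feature=[float_feature([m['Movie Rating']]) for m in movies_data]),
    'Movie Actors': tf.train.FeatureList(
        feature=[bytes_feature(m['Actors']) for m in movies_data]),
})

sequence_example = tf.train.SequenceExample(
    context=context, feature_lists=feature_lists)
print(sequence_example)
```

Unlike the flat `tf.train.Example`, this layout keeps each movie's actors grouped with that movie, even though the two movies have different numbers of actors.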

`tf.io.TFRecordWriter` takes a file path as its `path` argument and creates a writer object that works much like a file object. It offers `write`, `flush`, and `close` methods. `write` accepts a byte string and writes it to disk as one record, so structured data must be serialized first. To this end, `tf.train.Example` and `tf.train.SequenceExample` provide a `SerializeToString` method:

In [8]:
with tf.io.TFRecordWriter('movie_ratings.tfrecord') as writer:
    writer.write(example.SerializeToString())
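
As a quick sanity check, a record can be read back and decoded without declaring any feature schema, using the protobuf class method `FromString`. The sketch below is self-contained: it rebuilds the example and writes it to a temporary file (the file name is arbitrary) so that it does not touch `movie_ratings.tfrecord`.

```python
import os
import tempfile

import tensorflow as tf

example = tf.train.Example(features=tf.train.Features(feature={
    'Movie Names': tf.train.Feature(bytes_list=tf.train.BytesList(
        value=[b'The Shawshank Redemption', b'Fight Club'])),
    'Movie Ratings': tf.train.Feature(float_list=tf.train.FloatList(
        value=[9.0, 9.7])),
}))

# Write a single record to a throwaway location.
path = os.path.join(tempfile.mkdtemp(), 'roundtrip_check.tfrecord')
with tf.io.TFRecordWriter(path) as writer:
    writer.write(example.SerializeToString())

# Each dataset element is a scalar string tensor holding one serialized record.
for raw_record in tf.data.TFRecordDataset(path):
    decoded = tf.train.Example.FromString(raw_record.numpy())
    print(decoded)
```

This is convenient for inspecting a file whose feature names and types you do not know yet. For training pipelines, a schema-based parser such as `tf.io.parse_single_example` is the usual choice.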

In [10]:
# Read and print TFRecord file
filepaths = ["movie_ratings.tfrecord"]  # same relative path the writer used above
dataset = tf.data.TFRecordDataset(filepaths)

# Define features
feature_description = {
    'Movie Names': tf.io.VarLenFeature(dtype=tf.string),
    'Movie Ratings': tf.io.VarLenFeature(dtype=tf.float32),
}

# Function to parse the examples
def _parse_function(example_proto):
    return tf.io.parse_single_example(example_proto, feature_description)

# Use the parse function
dataset = dataset.map(_parse_function)

# Iterate and print features
for parsed in dataset:
    for name, tensor in parsed.items():
        # VarLenFeature yields a SparseTensor; .values holds its entries
        print('{}: {}'.format(name, tensor.values.numpy()))

Movie Names: [b'The Shawshank Redemption' b'Fight Club']
Movie Ratings: [9.  9.7]
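
The parse step above uses `tf.io.VarLenFeature`, which returns a `tf.sparse.SparseTensor`. That is why the loop reads the `.values` attribute. The sketch below compares two ways of getting a dense tensor instead: converting the sparse result with `tf.sparse.to_dense`, or declaring a `tf.io.FixedLenFeature` when the number of values per record is known in advance. The fixed shape `[2]` is an assumption that only holds for this two-movie example.

```python
import tensorflow as tf

serialized = tf.train.Example(features=tf.train.Features(feature={
    'Movie Ratings': tf.train.Feature(float_list=tf.train.FloatList(
        value=[9.0, 9.7])),
})).SerializeToString()

# Variable-length parsing: the result is a SparseTensor.
sparse = tf.io.parse_single_example(serialized, {
    'Movie Ratings': tf.io.VarLenFeature(dtype=tf.float32),
})
dense_from_sparse = tf.sparse.to_dense(sparse['Movie Ratings'])

# Fixed-length parsing: the result is already a dense tensor,
# but every record must contain exactly 2 values.
fixed = tf.io.parse_single_example(serialized, {
    'Movie Ratings': tf.io.FixedLenFeature(shape=[2], dtype=tf.float32),
})

print(dense_from_sparse.numpy())
print(fixed['Movie Ratings'].numpy())
```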


## References
- https://medium.com/mostly-ai/tensorflow-records-what-they-are-and-how-to-use-them-c46bc4bbb564
- https://github.com/tensorflow/tensorflow/blob/r1.5/tensorflow/core/example/example.proto