# Creating `tfrecords` files

As we saw in [`001-tf-intro`](001-tf-intro.ipynb), you can use TensorFlow with data that is held in memory. For many use cases, however, the data set is too large to read into memory all at once. On the other hand, you don't want to spend time in every learning iteration reading data from disk. To resolve this tension, TensorFlow offers a file format that integrates neatly with its underlying computation dispatch system. That is, the format lets you read data directly into the computation graph, using specialized functions.

We'll see these functions in action in the next notebook. For now, let's stuff some data into `tfrecords` files:

In [1]:
import numpy as np
import pandas as pd
import tensorflow as tf
import skimage.io as sio
import os.path as op

The metadata file contains the names of the data files and the labels associated with them. The classification problem at hand is to assign each image to one of three categories, coded as 0, 1, or 2 in the `label` column:

In [2]:
labels_df = pd.read_csv(op.join('..', 'data', 'labels.csv'))

In [3]:
labels_df.head(5)

| Unnamed: 0 | file | label |
|---|---|---|
| 0 | 9890.13_80_drn_f_0000_class2.jpg | 2 |
| 1 | p2540_98_drn-f_0012_class1.jpg | 1 |
| 2 | 11477.13_96_drn_0005_class1.jpg | 1 |
| 3 | 11477.13_104_drn_final_0034_class1.jpg | 1 |
| 4 | p2540_98_drn-f_0016_class1.jpg | 1 |


The next cell contains most of the action. We read in the data (label and image) one by one from the respective files.
We then convert each item into a TF "Feature" and bundle the features of one image into an "Example". Each Example is serialized to a string and written to the file as one record. We will deserialize these records as needed; details are in [`003-tf-linear-classifier`](003-tf-linear-classifier.ipynb).
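
To make the nesting concrete, here is a minimal sketch of the round trip for a single record, kept separate from the notebook's own pipeline. The helper `make_demo_example`, the label value and the three-byte payload are hypothetical stand-ins; the `tf.train` proto classes and the protobuf `SerializeToString`/`FromString` methods are the only real API used.

```python
import tensorflow as tf


def make_demo_example(label, payload):
    """Hypothetical helper: wrap one integer and one byte string in an Example."""
    return tf.train.Example(features=tf.train.Features(feature={
        'label': tf.train.Feature(
            int64_list=tf.train.Int64List(value=[label])),
        'image/raw': tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[payload])),
    }))


# Serialize to a byte string, as would be written to a tfrecords file
demo_serialized = make_demo_example(2, b'\x00\x01\x02').SerializeToString()

# Parse the byte string back into an Example and pull the values out again
demo_restored = tf.train.Example.FromString(demo_serialized)
demo_label = demo_restored.features.feature['label'].int64_list.value[0]
demo_bytes = demo_restored.features.feature['image/raw'].bytes_list.value[0]
```

Every value sits inside a list-typed field (`int64_list`, `bytes_list`), which is why a scalar is wrapped in a one-element list when it is written and indexed with `[0]` when it is read back.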

In [4]:
def _int64_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

def _bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def write_tfrecords(labels_df, fname, idx):
    writer = tf.python_io.TFRecordWriter(fname)
    for example_idx in idx:
        # Read in the data one-by-one:
        image = sio.imread(op.join('../data/cells/', 
                                   labels_df['file'][example_idx]))
        label = labels_df['label'][example_idx]
        rows = image.shape[0]
        cols = image.shape[1]
        depth = image.shape[2]
        image_raw = image.tobytes()
        # construct the Example proto object
        example = tf.train.Example(
            # Example contains a Features proto object
            features=tf.train.Features(
                # Features contains a map of string to Feature proto objects
                feature={
                    'image/height': _int64_feature(rows),
                    'image/width': _int64_feature(cols),
                    'image/depth': _int64_feature(depth),
                    'label': _int64_feature(int(label)),
                    'image/raw': _bytes_feature(image_raw),
                }))

        # use the proto object to serialize the example to a string
        serialized = example.SerializeToString()
        # write the serialized object to disk
        writer.write(serialized)

    writer.close()
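
The raw byte string stored under `image/raw` is flat: it carries neither the shape nor the dtype of the array, which is presumably why the function above stores height, width and depth next to it. The sketch below illustrates this with NumPy alone. The tiny `demo_image` array is a hypothetical stand-in for the output of `sio.imread`, and the `uint8` dtype is an assumption (typical for JPEG images, but not recorded in the file).

```python
import numpy as np

# Hypothetical 2x3 RGB image standing in for the output of sio.imread
demo_image = np.arange(2 * 3 * 3, dtype=np.uint8).reshape(2, 3, 3)
demo_rows, demo_cols, demo_depth = demo_image.shape

# Flat byte string: shape and dtype are not stored in it
demo_raw = demo_image.tobytes()

# Reading it back needs the dtype (assumed uint8) and the saved dimensions
demo_flat = np.frombuffer(demo_raw, dtype=np.uint8)
demo_restored_image = demo_flat.reshape(demo_rows, demo_cols, demo_depth)
```

If the images were stored with a different dtype, the same bytes would decode to different values, so the reader has to know the dtype in advance.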


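We generate a shuffled array of indices, which sets the order in which examples are written to file. Shuffling in advance matters because TF's out-of-core shuffling mechanisms mix examples only within a bounded buffer, so a file that is sorted (for example, by class) would stay poorly mixed: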

In [5]:
idx = np.arange(labels_df.shape[0])
np.random.shuffle(idx)
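
The shuffle above gives a different order on every run, so the three files change each time the notebook is re-executed. If a reproducible split is wanted, one option is a seeded generator, sketched below. The seed value `0`, the size `10` and the `demo_` names are arbitrary assumptions for illustration, not part of the original notebook.

```python
import numpy as np

demo_n_examples = 10  # hypothetical stand-in for labels_df.shape[0]

# A RandomState with a fixed seed always yields the same permutation
demo_idx = np.random.RandomState(0).permutation(demo_n_examples)
demo_idx_again = np.random.RandomState(0).permutation(demo_n_examples)
```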

We split the data into three sets. The first is used for training. The second is used to evaluate the training procedure while it is still taking place. The third is a test set, used to evaluate the whole procedure at its end. See Chapter 7 of Hastie, Tibshirani and Friedman's ["Elements of Statistical Learning"](http://statweb.stanford.edu/~tibs/ElemStatLearn/printings/ESLII_print10.pdf).

In [6]:
prop_train = 0.6
prop_eval = 0.2
n_examples = idx.shape[0]
train_end = int(prop_train * n_examples)
eval_end = int(prop_train * n_examples + prop_eval * n_examples)
# First 60% are for training:
train_idx = idx[:train_end]
# Next 20% are for evaluation:
eval_idx = idx[train_end:eval_end]
# Last 20% are for testing:
test_idx = idx[eval_end:]
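
A quick sanity check on such a split is that the three sets are disjoint and together cover every example. The sketch below wraps the same slicing arithmetic in a hypothetical helper, `split_indices`, and runs it on ten made-up indices. The helper name and the `demo_` variables are assumptions for illustration.

```python
import numpy as np


def split_indices(idx, prop_train=0.6, prop_eval=0.2):
    """Hypothetical helper mirroring the slicing arithmetic of the split cell."""
    n = idx.shape[0]
    train_end = int(prop_train * n)
    eval_end = int(prop_train * n + prop_eval * n)
    return idx[:train_end], idx[train_end:eval_end], idx[eval_end:]


demo_indices = np.random.RandomState(0).permutation(10)
demo_train, demo_eval, demo_test = split_indices(demo_indices)

# Concatenating the three parts should give back every index exactly once
demo_combined = np.concatenate([demo_train, demo_eval, demo_test])
```

Because the slice boundaries are shared between neighbouring sets, no index can land in two sets, whatever rounding `int()` applies.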

In [7]:
tfrecords_train_file = op.join('../data', 'cells_train.tfrecords')
tfrecords_eval_file = op.join('../data', 'cells_eval.tfrecords')
tfrecords_test_file = op.join('../data', 'cells_test.tfrecords')

In [8]:
write_tfrecords(labels_df, tfrecords_train_file, train_idx)

In [9]:
write_tfrecords(labels_df, tfrecords_eval_file, eval_idx)

In [10]:
write_tfrecords(labels_df, tfrecords_test_file, test_idx)