### Reading data with TensorFlow

In this notebook I cover how to prepare a queued and batched dataset to feed into your TensorFlow machine learning model. The data is read from a Google Cloud Storage (GCS) bucket, a setup suited to cloud training with Google's ML Engine. All functions work for local paths as well as for paths pointing to a GCS bucket, and use only NumPy and TensorFlow.

The data used here is the <a href = "http://yann.lecun.com/exdb/mnist/"> MNIST dataset </a>, which contains handwritten digits from <em>0</em> to <em>9</em>.

The data is given in 4 binary files: training images, training labels, test images and test labels.


All we need to know about the dataset:

        Training and test label data starts at byte 8. Each label is one byte (uint8) with a value from 0 to 9.
    
        Training and test image data starts at byte 16. Pixels are ordered by row; each image is [28,28] = 784 bytes of uint8 values from 0 to 255 (these are grayscale images).
    


For more information about the data files, see the <a href = "http://yann.lecun.com/exdb/mnist/">MNIST website.</a>
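
The byte layout described above can be checked with plain NumPy. The snippet below is a minimal sketch and not part of the pipeline in this notebook: it builds a tiny synthetic buffer that mimics the layout, skips the header, and views the remaining bytes as uint8. The helper names `make_fake_idx` and `parse_idx_payload` are made up for illustration, and the header is zero-filled rather than a real MNIST header.

```python
import numpy as np

LABEL_HEADER_BYTES = 8    # label data starts at byte 8
IMAGE_HEADER_BYTES = 16   # image data starts at byte 16
HEIGHT, WIDTH = 28, 28


def make_fake_idx(header_bytes, payload):
    """Hypothetical helper: zero-filled header followed by a uint8 payload."""
    return bytes(header_bytes) + payload.astype(np.uint8).tobytes()


def parse_idx_payload(raw, header_bytes):
    """Skip the header and view everything after it as a flat uint8 array."""
    return np.frombuffer(raw, dtype=np.uint8, offset=header_bytes)


# Two synthetic "images" (placeholder pixel values) and their two labels.
fake_images = np.arange(2 * HEIGHT * WIDTH) % 256
fake_labels = np.array([5, 0])

image_raw = make_fake_idx(IMAGE_HEADER_BYTES, fake_images)
label_raw = make_fake_idx(LABEL_HEADER_BYTES, fake_labels)

images = parse_idx_payload(image_raw, IMAGE_HEADER_BYTES).reshape(-1, HEIGHT, WIDTH)
labels = parse_idx_payload(label_raw, LABEL_HEADER_BYTES)

print(images.shape, images.dtype)
print(labels)
```

With a real file, the same two header sizes would be passed for the label and image files respectively.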





In [1]:
import tensorflow as tf
import numpy as np
import os

First things first: we need to know which files we want to read. Assuming all our data is in one directory, we can list the files using <em> tf.gfile.ListDirectory() </em>. I hosted the MNIST data in a GCS bucket so you can run this notebook directly, or you can download the MNIST data yourself and run locally. Remember that tf.gfile works for local directories as well.

In [2]:
BUCKET = "gs://organicml-reading-data"
data_prefix = BUCKET + "/data"
file_list = tf.gfile.ListDirectory(data_prefix)
print("Files to read: ",file_list)

Files to read:  ['t10k-images-idx3-ubyte', 't10k-labels-idx1-ubyte', 'train-images-idx3-ubyte', 'train-labels-idx1-ubyte']


Great, now let's read our data!

Setting up the tensorflow data pipeline consists of the following steps:

        First, we prepare a list of files to read (already done!).
    
        Next, we queue the file list for TensorFlow to read.
    
        Third, we initialize a reader and read from our filename queue.
    
        After reading, we decode our data into the correct format (plus any additional preprocessing steps desired).

        Lastly, we batch our data. Now it is ready to be passed to our model!
 

Because our images and labels are contained in separate files, we will use two queues, two readers and two decoders, and then batch images and labels together.
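
Since two independent queues are involved, it is worth making sure each image file lines up with the label file from the same split. Here is a minimal sketch in plain Python, assuming the file names follow the `<split>-images-...` / `<split>-labels-...` pattern seen in the listing above. `pair_by_split` is a hypothetical helper, not a TensorFlow function.

```python
def pair_by_split(file_list):
    """Group image and label file names by their split prefix (e.g. 'train')."""
    pairs = {}
    for name in sorted(file_list):
        split, kind = name.split("-")[:2]   # e.g. ('train', 'images')
        pairs.setdefault(split, {})[kind] = name
    # Keep only splits that have both an image file and a label file.
    return {
        split: (files["images"], files["labels"])
        for split, files in pairs.items()
        if "images" in files and "labels" in files
    }


file_list = ['t10k-images-idx3-ubyte', 't10k-labels-idx1-ubyte',
             'train-images-idx3-ubyte', 'train-labels-idx1-ubyte']
pairs = pair_by_split(file_list)
print(pairs["train"])
```

Pairing by name first makes it easy to confirm that both queues will see files from the same split in the same order.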



In [3]:
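# Split image and label files. Keep only the training split so that every
# image file is paired with the label file from the same split.
image_files = [os.path.join(data_prefix,i) for i in file_list if "train" in i and "image" in i]
label_files = [os.path.join(data_prefix,i) for i in file_list if "train" in i and "label" in i]

# Create queues. shuffle=False keeps both queues in the same file order; the
# default (shuffle=True) shuffles each queue independently, which could pair
# images from one file with labels from another.
image_queue = tf.train.string_input_producer(image_files, shuffle=False, name = "image_queue")
label_queue = tf.train.string_input_producer(label_files, shuffle=False, name = "label_queue")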

After creating our filename queues, we are ready to read. We use a TensorFlow <em>reader</em> to pull records from the files, then a <em>decoder</em> to convert each record into a suitable data type. We can keep track of the number of records read so far with the reader's <em>num_records_produced()</em> method.

Before reading we need to recall some details:

    The size of the header data in the file (8 bytes for labels & 16 for images).

    The data size and type (28 x 28 pixel images, all data is unsigned 8 bit).
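
Conceptually, a fixed-length record reader skips the header once and then hands back one fixed-size chunk per read. The sketch below imitates that behaviour on an in-memory byte stream using only the standard library and NumPy. `iter_fixed_length_records` is an illustrative name and not TensorFlow's implementation, and the label file is synthetic.

```python
import io
import numpy as np


def iter_fixed_length_records(stream, record_bytes, header_bytes=0):
    """Yield consecutive records of `record_bytes` bytes after skipping the header."""
    stream.seek(header_bytes)
    while True:
        record = stream.read(record_bytes)
        if len(record) < record_bytes:   # end of stream (or a truncated record)
            return
        yield record


# Synthetic label file: 8 header bytes, then one byte per label.
fake_label_file = io.BytesIO(bytes(8) + bytes([5, 0, 4, 1]))

labels = [
    np.frombuffer(record, dtype=np.uint8)   # plays the role of decoding raw bytes to uint8
    for record in iter_fixed_length_records(fake_label_file, record_bytes=1, header_bytes=8)
]
print([int(label[0]) for label in labels])
```

For images, the same loop would use a 16-byte header and 784-byte records.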

In [4]:
IMAGE_SHAPE = [28,28,1]  # The size-1 dimension is included for generalization to color images.
image_start_byte = 16 
label_start_byte = 8

height = IMAGE_SHAPE[0]
width = IMAGE_SHAPE[1]
depth = IMAGE_SHAPE[2]

image_bytes = height * width * depth 
label_bytes = 1  

# Create reader. 
image_reader = tf.FixedLengthRecordReader(record_bytes = image_bytes, 
                                        header_bytes = image_start_byte)

label_reader = tf.FixedLengthRecordReader(record_bytes = label_bytes, 
                                          header_bytes = label_start_byte)


# Read! Key identifies the file and record that was read. Value is the raw record data.
image_key, image_value = image_reader.read(image_queue)
label_key, label_value = label_reader.read(label_queue)

# Convert from a string to a vector of uint8 that is record_bytes long.
decoded_image = tf.decode_raw(image_value, tf.uint8)
decoded_label = tf.decode_raw(label_value, tf.uint8)

Our data has been read from the files and is almost ready to be batched. We obtain single image and label pairs by taking slices of decoded_image and decoded_label.

Before batching, tensorflow requires shapes to be fully defined.

Labels are shaped to <em>[1]</em>.

Our images need to be reshaped to <em>[height,width,depth]</em>.
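
The same reshaping can be tried in NumPy to see how a flat 784-byte record maps onto rows and columns. This is only an illustration on made-up pixel values, assuming the row-ordered layout described at the top of the notebook.

```python
import numpy as np

height, width, depth = 28, 28, 1

# A made-up flat record in which each "pixel" stores its own position (mod 256).
flat_record = (np.arange(height * width * depth) % 256).astype(np.uint8)

# Reshape to [height, width, depth], then cast to float32 as in the pipeline.
image = flat_record.reshape(height, width, depth).astype(np.float32)

row, col = 3, 5
# With rows stored one after another, pixel (row, col) sits at flat index row * width + col.
assert image[row, col, 0] == flat_record[row * width + col]
print(image.shape, image.dtype)
```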

In [5]:
training_label = tf.reshape(
    tf.strided_slice(decoded_label,[0],[label_bytes]),
    [1])

uint8_img = tf.reshape(
    tf.strided_slice(decoded_image, [0],[image_bytes]),
    [height, width, depth])

# Cast image to desired dtype 
training_image = tf.cast(uint8_img,tf.float32)


All we have left to do is make our batches. Nearly all deep learning models accept data in batches, which speeds up training and can help reduce overfitting. For large datasets, multiple threads can be used to prepare data in parallel.
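
To make the shapes concrete before involving queues, here is a NumPy-only sketch of what batching does: single examples are collected until `batch_size` of them are available and are then stacked along a new leading axis. `batch_examples` is a hypothetical helper that ignores queues and threads entirely, and the examples are made up.

```python
import numpy as np


def batch_examples(examples, batch_size):
    """Yield (image_batch, label_batch) pairs; a short final batch is dropped."""
    images, labels = [], []
    for image, label in examples:
        images.append(image)
        labels.append(label)
        if len(images) == batch_size:
            yield np.stack(images), np.stack(labels)
            images, labels = [], []


# Ten made-up examples: blank 28x28x1 images with labels 0..9.
examples = [
    (np.zeros((28, 28, 1), dtype=np.float32), np.array([i], dtype=np.uint8))
    for i in range(10)
]

batches = list(batch_examples(examples, batch_size=4))
first_images, first_labels = batches[0]
print(len(batches), first_images.shape, first_labels.shape)
```

Each image batch has shape `[batch_size, height, width, depth]` and each label batch has shape `[batch_size, 1]`.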

In [6]:
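batch_size = 4
# One thread keeps each image paired with its label: with several threads, the
# two independent readers can interleave their reads and drift out of step.
num_preprocess_threads = 1
min_queue_examples = 1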

image_batch, label_batch = tf.train.batch(
    [training_image, training_label],
    batch_size=batch_size,
    num_threads=num_preprocess_threads,
    capacity=min_queue_examples + 16 * batch_size)

The image batches are of shape <em>[batch_size,height,width,depth]</em> and ready to be fed into a model. Each time image_batch and label_batch are evaluated in a session, one new batch is returned. To fetch one batch of training data, we can run the session:

In [7]:
with tf.Session() as sess:
    BATCH = [image_batch,label_batch] 
    
    sess.run(tf.global_variables_initializer())
    sess.run(tf.local_variables_initializer())
    # Start the threads that fill the filename and batch queues.
    coord = tf.train.Coordinator() 
    threads = tf.train.start_queue_runners(coord=coord)
    # Each call to sess.run(BATCH) dequeues one (images, labels) batch.
    B = sess.run(BATCH)
            
    # Ask the queue threads to stop and wait for them to finish.
    coord.request_stop()
    coord.join(threads)


In [8]:
img = B[0]
label = B[1]
print(img.shape)
print()
print(label)

(4, 28, 28, 1)

[[5]
 [0]
 [4]
 [1]]


Here is our first batch! We can view the images and labels to make sure they match.

In [9]:
import matplotlib.pyplot as plt
import matplotlib.animation
matplotlib.rc('animation', html='html5')


def plot_image_batch(image,label):
    shape = image.shape
    num_steps = shape[0]
    fig = plt.figure(figsize = (5,5))
    ax = plt.subplot(1,1,1)
    def animate(i):
        im = ax.imshow(image[i],cmap=plt.get_cmap("gray"))
        return im,
    return matplotlib.animation.FuncAnimation(fig, animate, frames=range(0,num_steps), interval=2000, blit=True)




In [10]:
img = np.reshape(img,[4,28,28])
print("Labels: {}, {}, {}, {}".format(label[0],label[1],label[2],label[3]))
plot_image_batch(img,label) 

Labels: [5], [0], [4], [1]


The training batches are good to go.