# Preliminary Notes

Using this notebook requires downloading the new HDF5-formatted data files. These will *eventually* be found in the same directory as the previous data sets.

Once again, note that this code should not be run in notebook form, but rather from within a monitoring script.



In [1]:
import logging
import h5py
from keras.layers import Convolution2D, MaxPooling2D # Layer Def
from keras.models import Model
from keras.layers import Input, Flatten, Dense
from keras.utils.io_utils import HDF5Matrix
import numpy as np
from collections import defaultdict

from keras import backend as K

Using TensorFlow backend.


# $\mu$BooNEDataGenerator for Large Data Sets

This was created in all of about 5 minutes. Therefore, don't expect it to be robust at all.
The idea here is that since the data sets are rather large, the generator only pulls out a slice at a time.

For the HDF5 data, the frames have been compressed using `gzip` and `chunk`'d with a chunk size of `10`.

Therefore, for the sake of performance, the batch size I'm defining in the generator is the same as the chunk size.
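
To see why matching the batch size to the chunk size helps, the sketch below counts how many chunks a slice along the first axis overlaps. It assumes the data is chunked only along that axis, so each chunk holds 10 whole frames. `chunks_touched` is a hypothetical helper for illustration, not part of h5py. Since a compressed chunk has to be decompressed as a whole, touching fewer chunks per read means less wasted work.

```python
def chunks_touched(start, stop, chunk_size=10):
    """Count the chunks overlapped by the half-open slice [start, stop)."""
    if stop <= start:
        return 0
    first_chunk = start // chunk_size
    last_chunk = (stop - 1) // chunk_size
    return last_chunk - first_chunk + 1


aligned = chunks_touched(20, 30)      # batch starts on a chunk boundary
misaligned = chunks_touched(25, 35)   # same batch size, shifted by 5 frames
print(aligned, misaligned)
```

Under these assumptions, a batch that starts on a chunk boundary reads one chunk, while the same batch shifted by half a chunk reads two.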

Lastly, I tend to prefer class-oriented generators over functional ones, as this will allow me to check the state of the generator after training completes. That check has been omitted here.
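
As a rough idea of what such a state check could look like, here is a minimal sketch. `BatchCursor` is a hypothetical stand-in that mirrors only the index bookkeeping of a class-based generator and touches no HDF5 data. The `passes` counter is an assumption added for illustration and is not part of this notebook's generator.

```python
class BatchCursor(object):
    """Hypothetical stand-in for a data generator: tracks slice bounds only."""

    def __init__(self, n_samples, batch_size=10):
        self.n_samples = n_samples
        self.batch_size = batch_size
        self.current_index = 0
        self.passes = 0  # number of times the cursor has wrapped around

    def __iter__(self):
        return self

    def __next__(self):
        # Wrap around once the data is exhausted, so iteration never stops.
        if self.current_index >= self.n_samples:
            self.current_index = 0
            self.passes += 1
        # The final batch of a pass may be shorter than batch_size.
        stop = min(self.current_index + self.batch_size, self.n_samples)
        bounds = (self.current_index, stop)
        self.current_index = stop
        return bounds

    next = __next__  # Python 2 spelling


cursor = BatchCursor(n_samples=25)
bounds = [next(cursor) for _ in range(4)]
print(bounds)
print(cursor.current_index, cursor.passes)
```

Because the state lives on the object, it can still be inspected after the consumer (here a list comprehension, in practice the training loop) has finished.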

In [4]:
class UBooNEDataGenerator(object):
  logger = logging.getLogger("uboone.data")
  def __init__(self, datapath, dataset, labelset):
    self.logger.info("Assembling DataSet")
    self._file = h5py.File(datapath,'r')
    self._dataset = self._file[dataset]
    self._labelset = self._file[labelset]
    self.current_index=0

  def __len__(self):
    return self._dataset.shape[0]

  def __iter__(self):
    return self

  def __next__(self):
    return self.next()

  def next(self):
    batch_size = 10
    #This next bit causes the generator to loop indefinitely
    if self.current_index>= len(self):
        self.logger.info("Reusing Data at Size: {}".format(len(self)))
        self.current_index = 0
    # Shrink the final batch so the slice does not run past the data
    if self.current_index+batch_size>= len(self):
        batch_size = len(self)-self.current_index
    x = self._dataset[self.current_index:self.current_index+batch_size]
    y = self._labelset[self.current_index:self.current_index+batch_size]
    self.current_index+=batch_size

    return (x,y)


# Model Definition

Once again, for the sake of performance, I've predefined the network parameters. The indexing of the data sets here is critical, as it defines the output of each successive layer.
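
To make the layer-by-layer sizes concrete, the following sketch works out the feature-map shape for a 576x576 input passed through five VGG-style blocks. It assumes `'same'` convolutions (which keep the spatial size) and 2x2 max-pooling with stride 2 (which halves it). `vgg16_feature_shape` is a hypothetical helper, not a Keras function.

```python
def vgg16_feature_shape(size=576, n_pools=5, channels=512):
    """Shape (channels, rows, cols) after the convolutional blocks.

    'same' convolutions keep the spatial size; each 2x2 max-pool with
    stride 2 halves it (floor division).
    """
    for _ in range(n_pools):
        size //= 2
    return (channels, size, size)


shape = vgg16_feature_shape()
flat = shape[0] * shape[1] * shape[2]   # length of the Flatten output
fc1_params = flat * 4096 + 4096         # fc1 weights plus biases
print(shape, flat, fc1_params)
```

Under these assumptions, the first dense layer alone holds roughly 680 million parameters, which is worth keeping in mind when estimating memory use.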

In [3]:
class VGG16(Model):
  logger = logging.getLogger('uboone.vgg16')
  def __init__(self):

    self.logger.info("Assembling Model")
    # The input shape is defined as 3 planes at 576x576 pixels, i.e.
    # channels-first ('th') dimension ordering.
    # TODO: I think with 'tf' dimension ordering, this would need to be
    # reversed to (576, 576, 3).

    if K.image_dim_ordering() != 'th':
        self.logger.error("Dimension Ordering Incorrect")

    self._input = Input(shape=(3,576,576))
    #self.logger.debug("Input Shape: {}".format(self._input.output_shape))

    # Block 1
    layer = Convolution2D(64, 3, 3, activation='relu', border_mode='same', 
                          name='block1_conv1')(self._input)
    layer = Convolution2D(64, 3, 3, activation='relu', border_mode='same', 
                          name='block1_conv2')(layer)
    layer = MaxPooling2D((2, 2), strides=(2, 2), name='block1_pool')(layer)

    # Block 2
    layer = Convolution2D(128, 3, 3, activation='relu', border_mode='same', 
                          name='block2_conv1')(layer)
    layer = Convolution2D(128, 3, 3, activation='relu', border_mode='same', 
                          name='block2_conv2')(layer)
    layer = MaxPooling2D((2, 2), strides=(2, 2), name='block2_pool')(layer)

    # Block 3
    layer = Convolution2D(256, 3, 3, activation='relu', border_mode='same', 
                          name='block3_conv1')(layer)
    layer = Convolution2D(256, 3, 3, activation='relu', border_mode='same', 
                          name='block3_conv2')(layer)
    layer = Convolution2D(256, 3, 3, activation='relu', border_mode='same', 
                          name='block3_conv3')(layer)
    layer = MaxPooling2D((2, 2), strides=(2, 2), name='block3_pool')(layer)

    # Block 4
    layer = Convolution2D(512, 3, 3, activation='relu', border_mode='same', 
                          name='block4_conv1')(layer)
    layer = Convolution2D(512, 3, 3, activation='relu', border_mode='same', 
                          name='block4_conv2')(layer)
    layer = Convolution2D(512, 3, 3, activation='relu', border_mode='same', 
                          name='block4_conv3')(layer)
    layer = MaxPooling2D((2, 2), strides=(2, 2), name='block4_pool')(layer)

    # Block 5
    layer = Convolution2D(512, 3, 3, activation='relu', border_mode='same', 
                          name='block5_conv1')(layer)
    layer = Convolution2D(512, 3, 3, activation='relu', border_mode='same', 
                          name='block5_conv2')(layer)
    layer = Convolution2D(512, 3, 3, activation='relu', border_mode='same', 
                          name='block5_conv3')(layer)
    layer = MaxPooling2D((2, 2), strides=(2, 2), name='block5_pool')(layer)

    # Classification block
    layer = Flatten(name='flatten')(layer)
    layer = Dense(4096, activation='relu', name='fc1')(layer)
    layer = Dense(4096, activation='relu', name='fc2')(layer)
    layer = Dense(5, activation='softmax', name='predictions')(layer)
    
    super(VGG16, self).__init__(self._input, layer)
    self.logger.info("Compiling Model")
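    # categorical_crossentropy pairs with the 5-way softmax output
    # (this assumes the labels are one-hot encoded)
    self.compile(loss='categorical_crossentropy', optimizer='sgd')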


# The training step

First of all, I'm a huge fan of using the logging module, as it allows me to monitor performance externally. Thus, the first step is to set up the logging configuration.
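
As a small illustration of how the named loggers fit together, the sketch below attaches a single handler to a parent `uboone` logger and lets the `uboone.data` and `uboone.vgg16` children propagate to it. The in-memory `StringIO` stream and the format string are assumptions made to keep the example self-contained. A monitoring script would more likely point the handler at a file.

```python
import io
import logging

# Capture output in memory; a real setup would likely use a FileHandler.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(name)s %(levelname)s %(message)s"))

# Configure the parent once; child loggers inherit its level and handler.
parent = logging.getLogger("uboone")
parent.setLevel(logging.INFO)
parent.addHandler(handler)

logging.getLogger("uboone.data").info("Assembling DataSet")
logging.getLogger("uboone.vgg16").debug("Dropped: below the INFO threshold")

captured = stream.getvalue()
print(captured)
```

Only the `INFO` record reaches the stream, since the children inherit the parent's effective level.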

The data generator is the next thing to be instantiated. Here, the network is set up to train on the input TPC images, with the output expected to match the particle type.

Lastly, the model is instantiated and trained with a call to `fit_generator`; the terms `train` and `fit` seem to be interchangeable in Keras.
The samples-per-epoch and number-of-epochs figures were chosen arbitrarily. This was done out of sheer laziness and because this was written while the data was still being processed.
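
For a sense of scale, these figures can be translated into generator calls as sketched below. The sketch assumes every batch holds exactly 10 samples (the shorter final batch of a pass is ignored) and that an epoch ends once `samples_per_epoch` samples have been drawn, which is my understanding of the Keras 1.x behaviour. `generator_calls` is a hypothetical helper.

```python
def generator_calls(samples_per_epoch, nb_epoch, batch_size=10):
    """Return (calls per epoch, total calls), assuming full batches only."""
    calls_per_epoch = -(-samples_per_epoch // batch_size)  # ceiling division
    return calls_per_epoch, calls_per_epoch * nb_epoch


per_epoch, total = generator_calls(20000, 10)
print(per_epoch, total)
```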

In [None]:
logging.basicConfig(level=logging.DEBUG)
logging.info("Starting...")

data_generator = UBooNEDataGenerator('hdf5/eminus_train.h5', 'image2d/tpc', 'labels/type')

model = VGG16()
model.fit_generator(data_generator, samples_per_epoch = 20000, nb_epoch=10)
