# CIFAR100 Hierarchical Classification (Tensorflow 2.2.0)

    Author: Lukas Friedrichsen (friedrichsen.luk@googlemail.com)
    License: Apache License, Version 2.0

Description: In this notebook, different experiments are conducted on various scenarios from the field of image recognition to assess the viability of modularization as a technique to counteract specific inherent shortcomings of Neural Networks. In the first experiment, different performance criteria are measured on the CIFAR100 dataset for a hierarchically composed network and a comparative reference model. The second experiment demonstrated in this notebook serves to assess the effect of modularization with regard to predictive model robustness. For this, the performance of the two models from the first experiment is evaluated and compared on distributionally shifted and OOD (Out-Of-Distribution) testing data (the former modelled by the CIFAR100-C, the latter by the CIFAR10 dataset).

## Table of contents

1. [Imports](#imports)
2. [Configuration](#config)
3. [Loading the dataset](#load)
4. [Mapping synset relationships](#synset_mapping)
5. [Processing and augmentation](#processing_augmentation)
6. [Composed Network (CompNet)](#compnet)
  1. [Model](#compnet_model)
  2. [Training](#compnet_train)
  3. [Testing](#compnet_test)
  4. [Predictive uncertainty under dataset shift](#compnet_predictive_uncertainty)
7. [Benchmark: Maxout (Goodfellow et al., 2013)](#benchmark)
  1. [Model](#benchmark_model)
  2. [Training](#benchmark_train)
  3. [Testing](#benchmark_test)
  4. [Predictive uncertainty under dataset shift](#benchmark_predictive_uncertainty)

---
## Imports
<a id ='imports'></a>

In [None]:
import time

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
import matplotlib as mpl

# Matplotlib configuration

# General settings
mpl.rcParams['axes.grid'] = True
mpl.rcParams['grid.alpha'] = 0.5
mpl.rcParams['grid.linestyle'] = '--'
mpl.rcParams['legend.framealpha'] = 1.0
mpl.rcParams['savefig.bbox'] = 'tight'

# Font sizes
mpl.rcParams['axes.labelsize'] = 15
mpl.rcParams['axes.titlesize'] = 15
mpl.rcParams['figure.titlesize'] = 20
mpl.rcParams['legend.fontsize'] = 15
mpl.rcParams['xtick.labelsize'] = 15 
mpl.rcParams['ytick.labelsize'] = 15

In [None]:
import tensorflow as tf
import tensorflow_datasets as tfds
print('Tensorflow: v{}'.format(tf.__version__))
print('Tensorflow Datasets: v{}'.format(tfds.__version__))

In [None]:
# Set random seed for reproducibility
tf.random.set_seed(42)

---
## Configuration
<a id ='config'></a>

In [None]:
# Download and storage locations for Tensorflow Datasets
TFDS_DATA_DIR = '/Volumes/Data/tensorflow_datasets/'
TFDS_DOWNLOAD_DIR = '/Volumes/Data/tensorflow_datasets/downloads/'

In [None]:
# Storage locations for model checkpoints and training logs
CKPT_DIR = '.model_checkpoints/cifar100/'
LOG_DIR = 'logs/cifar100/'

In [None]:
# Storage location for data from experiments
RESULTS_DIR = 'results/'

In [None]:
# Dataset specific configuration

# Storage locations for data from experiments on the composed resp. benchmark network
CIFAR100_RESULTS_DIR_COMPNET = RESULTS_DIR + 'cifar100/compnet/'
CIFAR100_RESULTS_DIR_BENCHMARK = RESULTS_DIR + 'cifar100/benchmark/'

# Files containing the string literals corresponding to the numeric labels from the CIFAR100 dataset
# Label 'i' in the dataset corresponds to the i-th entry of the respective file
CIFAR100_COARSE_LABELS_TO_LITERALS_FILE = TFDS_DATA_DIR + 'cifar100/1.3.1/coarse_label.labels.txt'
CIFAR100_FINE_LABELS_TO_LITERALS_FILE = TFDS_DATA_DIR + 'cifar100/1.3.1/label.labels.txt'

# Keys of the dataset fields containing images and labels
CIFAR100_IMG_KEY = 'image'
CIFAR100_COARSE_LABEL_KEY = 'coarse_label'
CIFAR100_FINE_LABEL_KEY = 'label'

# Synset level of the different label categories (indicating the hierarchical relation between the
# different categories)
CIFAR100_COARSE_LABEL_LEVEL = 0
CIFAR100_FINE_LABEL_LEVEL = 1

In [None]:
# Preprocessing configuration

# Cropping dimensions
CROP_SIZE_H = 28
CROP_SIZE_W = 28

---
## Loading the dataset
<a id='load'></a>

CIFAR100 dataset

In [None]:
[cifar100_train_raw, cifar100_val_raw, cifar100_test_raw], cifar100_info = tfds.load('cifar100',
                                                                                     split=[tfds.Split.TRAIN.subsplit(tfds.percent[:80]),
                                                                                            tfds.Split.TRAIN.subsplit(tfds.percent[-20:]),
                                                                                            tfds.Split.TEST],
                                                                                     data_dir=TFDS_DATA_DIR,
                                                                                     download_and_prepare_kwargs={'download_dir': TFDS_DOWNLOAD_DIR},
                                                                                     with_info=True)
print(cifar100_info)

In [None]:
# Print a sample image to make sure loading worked correctly

fig, ax = plt.subplots(1, 3, figsize=(20, 5))
titles = ['Train', 'Validation', 'Test']

for idx, dataset in enumerate([cifar100_train_raw, cifar100_val_raw, cifar100_test_raw]):
    for record in dataset.take(1):
        ax[idx].imshow(record[CIFAR100_IMG_KEY])
        ax[idx].set_title(titles[idx])
        ax[idx].axis('off')
    
fig.show()

In [None]:
# Plot the label distribution of the train, validation and test dataset to validate that the former
# is representative of the latter two (requires the initialization of `CIFAR100_SYNSET_MAP` and
# `CIFAR100_NUM_FINE_LABELS` prior to execution (cf. below))

fig, ax = plt.subplots(1, 3, figsize=(20, 5))
titles = ['Train', 'Validation', 'Test']

for idx, dataset in enumerate([cifar100_train_raw, cifar100_val_raw, cifar100_test_raw]):
    # Number of distinct fine labels (labels are zero-based, i.e. they range from 0 to 99)
    label_depth = CIFAR100_NUM_FINE_LABELS

    # Initialize the label distribution
    dist = [0] * label_depth
    
    # Get the label distribution for the current dataset
    for record in dataset:
        dist[record[CIFAR100_FINE_LABEL_KEY].numpy()] += 1
        
    # Normalize the distribution to percent
    total = sum(dist)
    dist = [entry / total * 100 for entry in dist]
                
    # Plot the distribution (the bar for label `i` is drawn at position `i + 1`)
    ax[idx].bar(range(1, len(dist) + 1), dist, width=1)
    ax[idx].set_title(titles[idx])
    ax[idx].set_xlim([0.5, 100.5])
    # Every class is expected to hold a share of about 1 %
    ax[idx].set_ylim([0, 1.5])
    ax[idx].set_xlabel('Label')
    ax[idx].set_ylabel('Share in %')
    
fig.show()
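
The visual check above can be complemented by a numeric one. The following is a minimal sketch in plain Python, assuming the labels of each split have already been collected into a list; the helper names `label_shares` and `max_share_gap` as well as the toy label lists are hypothetical and not part of this notebook.

```python
from collections import Counter

def label_shares(labels, num_labels):
    '''Returns the share of every label in percent, indexed by label'''
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError('`labels` may not be empty.')
    return [counts.get(label, 0) / total * 100 for label in range(num_labels)]

def max_share_gap(labels_1, labels_2, num_labels):
    '''Returns the largest per-label difference in share (in percentage points)'''
    shares_1 = label_shares(labels_1, num_labels)
    shares_2 = label_shares(labels_2, num_labels)
    return max(abs(a - b) for a, b in zip(shares_1, shares_2))

# Toy label lists (hypothetical) standing in for the labels of two splits
train_labels = [0, 0, 1, 1, 2, 2, 3, 3]
val_labels = [0, 1, 2, 3]
skewed_labels = [0, 0, 0, 1]

balanced_gap = max_share_gap(train_labels, val_labels, 4)
skewed_gap = max_share_gap(train_labels, skewed_labels, 4)
print(balanced_gap, skewed_gap)
```

A gap close to zero indicates that both splits follow the same label distribution, while a large gap (as for the skewed toy list) hints at a split that is not representative.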

---
## Mapping synset relationships
<a id='synset_mapping'></a>

In this section, we create a mapping between hyper- and hyponyms (i.e. coarse and fine labels), analogous to the WordNet corpus underlying ImageNet. The mapping models the relations between the different hierarchy levels of labels and serves as the basis for the structure of the composed network.
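
To illustrate the idea, the following minimal sketch builds such a mapping as a nested `dict` from a handful of toy records and derives the path from the root layer down to every synset. It is a simplified stand-alone illustration under the assumption that the key order (coarse before fine) is already known; the function names `build_hierarchy` and `hypernym_paths` are hypothetical.

```python
# Toy records; the notebook derives the real ones from the CIFAR100 dataset
records = [
    {'coarse_label': 'flowers', 'label': 'rose'},
    {'coarse_label': 'flowers', 'label': 'tulip'},
    {'coarse_label': 'trees', 'label': 'oak_tree'},
]

def build_hierarchy(records, ordered_keys):
    '''Nests the labels of every record from the coarsest to the finest key'''
    hierarchy = {}
    for record in records:
        node = hierarchy
        for key in ordered_keys:
            # Create the child node on first sight, then descend into it
            node = node.setdefault(record[key], {})
    return hierarchy

def hypernym_paths(hierarchy, prefix=()):
    '''Maps every synset to its path from the root layer down to itself'''
    paths = {}
    for synset, hyponyms in hierarchy.items():
        path = prefix + (synset,)
        paths[synset] = list(path)
        paths.update(hypernym_paths(hyponyms, path))
    return paths

hierarchy = build_hierarchy(records, ['coarse_label', 'label'])
paths = hypernym_paths(hierarchy)
print(hierarchy)
print(paths['rose'])
```

Precomputing the paths once trades a little memory for constant-time lookups of hypernyms and hierarchy levels later on.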

In [None]:
# Annotation: We'll eventually outsource this class into a dedicated module, thus we're including
# necessary imports, etc. here instead of putting them at the top of the notebook together with
# the rest to keep all resources in one place.

from copy import deepcopy

class synset_map(object):
    '''Representational model for hierarchical syntactical structures
    
    This class implements a representational model for (injective) hierarchical syntactical structures.
    It provides the functionality to map multilayered hyper- and hyponym compositions, to trace the
    relations between them and to measure the semantic distance between different synsets.
    
    Args:
        synset_map (optional): Representational model for a hierarchical syntactical structure (e.g.
            manually constructed; takes highest priority if provided together with `dataset` and
            `keys`)
        dataset (optional): Dataset-like structure that can be accessed via `keys` and that contains
            the hyper- and hyponyms whose relation is to be mapped (assuming an unambiguous, injective
            structure of synsets)
        keys (optional): List of keys that can be used to access the fields of `dataset` that contain
            the synset specifiers
    
    Attributes:
        synset_map (dict): Nested structure of dicts that serves to store the hierarchical relationships
            between the different synsets
    '''
    
    def __init__(self, synset_map=None, dataset=None, keys=None):
        if synset_map is not None:
            self.synset_map = synset_map
        elif (dataset is not None) and (keys is not None):
            self.synset_map_from_dataset(dataset, keys)
        else:
            self.synset_map = {}
    
    
    @property
    def synset_map(self):
        return deepcopy(self._synset_map)
    
    
    @synset_map.setter
    def synset_map(self, synset_map):
        elements = [synset_map]
        for element in elements:
            if not isinstance(element, dict):
                raise TypeError(
                    'All entries of `synset_map` have to be of type `dict`.\n'
                )

            if element:
                for value in element.values():
                    elements.append(value)
        
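        self._synset_map = synset_map
        
        # Reset the lookup tables and rebuild them only for a non-empty map (the `construct_*`
        # functions raise a `UserWarning` on an empty map, which would otherwise make the
        # default constructor fail)
        self._hyponyms = {}
        self._hypernym_paths = {}
        if synset_map:
            self.construct_hyponym_map()
            self.construct_hypernym_map()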
        
    
    def construct_hyponym_map(self):
        '''Constructs a dictionary containing the hyponyms for every synset in `synset_map`
        
        Constructs a dictionary containing the hyponyms for every synset in `synset_map`. This
        function is called exactly once every time `synset_map` is set, thus reducing the complexity
        of subsequent lookup operations to O(1). Has to be called manually if changes to an existing
        synset map are made.
        
        Raises:
            UserWarning: If `synset_map` has not been initialized at call time
        '''
                
        if not self._synset_map:
            raise UserWarning(
                'Please initialize `synset_map` before calling this function.\n'
            )
                    
        self._hyponyms = {}
        
        synsets = [self._synset_map]
        for synset in synsets:
            for hypernym in synset.keys():                
                self._hyponyms[hypernym] = list(synset[hypernym].keys())
                synsets.append(synset[hypernym])
            
        
    def construct_hypernym_map(self):
        '''Constructs a dictionary containing the complete hypernym path for every synset in `synset_map`
        
        Constructs a dictionary containing the complete hypernym path for every synset in `synset_map`.
        This function is called exactly once every time `synset_map` is set, thus reducing the
        complexity of subsequent lookup operations to O(1). Has to be called manually if changes to
        an existing synset map are made.
        
        Raises:
            UserWarning: If `synset_map` has not been initialized at call time
        '''
        
        if not self._synset_map:
            raise UserWarning(
                'Please initialize `synset_map` before calling this function.\n'
            )
        
        self._hypernym_paths = {}
        
        synsets = [self._synset_map]
        for synset in synsets:
            for hypernym, hyponyms in synset.items():
                hypernym_path = [hypernym]

                while True:
                    # Check if the current `hypernym` is part of the root layer (meaning it has no
                    # further hypernyms)
                    if hypernym in self._synset_map.keys():
                        break

                    elements = [self._synset_map]
                    for element in elements:
                        for key in element.keys():
                            if hypernym in element[key].keys():
                                hypernym = key
                                break
                            else:
                                elements.append(element[key])

                    hypernym_path.append(hypernym)

                self._hypernym_paths[hypernym_path[0]] = hypernym_path[::-1]
                
                synsets.append(hyponyms)
    
    
    def synset_map_from_dataset(self, dataset, keys):
        '''Creates a mapping between hyper- and hyponyms from a given dataset
        
        Args:
            dataset: Dataset-like structure that can be accessed via `keys` and that contains the
                hyper- and hyponyms whose relation is to be mapped (assuming an unambiguous, injective
                structure of synsets)
            keys: List of keys that can be used to access the fields of `dataset` that contain the
                synset specifiers
            
        Raises:
            TypeError: If one of the input arguments is of the wrong type
            ValueError: If invalid values are specified for one or more input arguments
        '''
        
        if not isinstance(dataset, list):
            raise TypeError(
                '`dataset` is expected to be of type `list`. Is of type {}.\n'.format(type(dataset))
            )
        if not dataset:
            raise ValueError(
                '`dataset` may not be empty.\n'
            )
        if not isinstance(keys, list):
            raise TypeError(
                '`keys` is expected to be of type `list`. Is of type {}.\n'.format(type(keys))
            )
        if not keys:
            raise ValueError(
                '`keys` may not be empty.\n'
            )
            
        # Determine the hierarchical relationship between the given keys (i.e. which fields in the
        # dataset contain the highest-level synsets, which ones contain the second highest and so on)
        
        ordered_keys = list(keys)  # Copy to avoid reordering the caller's list in place
        len_ordered_keys = len(ordered_keys)
        if len_ordered_keys > 1:
            # Assuming an injective structure of the syntactic relationships, determine the hierarchical
            # relationship between two values of `keys` by simply iterating over the dataset until one
            # of the fields differs for the same value of the field referenced by the other key,
            # indicating that the former is a hyponym of the latter. The keys can then be sorted
            # using e.g. Bubble Sort as is done here.
            for i in range(len_ordered_keys):
                for j in range(0, len_ordered_keys - i - 1):
                    key_1 = ordered_keys[j]
                    key_2 = ordered_keys[j + 1]
                    
                    val_key_1 = dataset[0][key_1]
                    val_key_2 = dataset[0][key_2]
                        
                    for record in dataset:
                        if (val_key_1 != record[key_1]) and (val_key_2 == record[key_2]):
                            ordered_keys[j], ordered_keys[j + 1] = ordered_keys[j + 1], ordered_keys[j]
                            break
        
        # Construct the mapping between hierarchically related synsets from the dataset
        
        synset_map = {}
        for record in dataset:
            synset = synset_map
            
            for key in ordered_keys:
                hyponym = record[key]
                
                if hyponym not in synset:
                    synset[hyponym] = {}
                    
                synset = synset[hyponym]
        
        self.synset_map = synset_map
        
    
    def hyponyms(self, synset):
        '''Returns the hyponyms for a given synset
        
        Args:
            synset: Key / label of the synset whose hyponyms are to be returned
        
        Returns:
            hyponyms: List containing the hyponyms of the given synset (empty if `synset` is not in
                `synset_map` or if `synset` is a leaf node)
            
        Raises:
            UserWarning: If `synset_map` has not been initialized at call time
        '''
        
        if not self._hyponyms:
            raise UserWarning(
                'Please initialize `synset_map` or call `construct_hyponym_map` before calling this function.\n'
            )
            
        hyponyms = []
                
        if synset in self._hyponyms.keys():
            hyponyms = self._hyponyms[synset]

        return hyponyms
    
    
    def hypernym(self, synset):
        '''Returns the hypernym for a given synset
        
        Args:
            synset: Key / label of the synset whose hypernym is to be returned
        
        Returns:
            hypernym: Hypernym of the given synset (`None` if `synset` is not in `synset_map` or if
                `synset` is part of the root layer)
            
        Raises:
            UserWarning: If `synset_map` has not been initialized at call time
        '''
        
        if not self._hypernym_paths:
            raise UserWarning(
                'Please initialize `synset_map` or call `construct_hypernym_map` before calling this function.\n'
            )
        
        hypernym = None
           
        if synset in self._hypernym_paths.keys():
            # Check if `synset` is part of the root layer
            if len(self._hypernym_paths[synset]) > 1:
                hypernym = self._hypernym_paths[synset][-2]
        
        return hypernym
    
    
    def hypernym_path(self, synset):
        '''Returns the hypernym path from a given synset to the root layer
        
        Args:
            synset: Key / label of the synset whose hypernym path is to be returned
        
        Returns:
            hypernym_path: List containing the hypernym path in descending order from the root layer
                down to `synset` (empty if `synset` is not in `synset_map`)
            
        Raises:
            UserWarning: If `synset_map` has not been initialized at call time
        '''
        
        if not self._hypernym_paths:
            raise UserWarning(
                'Please initialize `synset_map` or call `construct_hypernym_map` before calling this function.\n'
            )
            
        hypernym_path = []
        
        if synset in self._hypernym_paths.keys():
            hypernym_path = self._hypernym_paths[synset]
                        
        return hypernym_path
    
    
    def is_a(self, hyponym, hypernym):
        '''Checks whether a synset is a hyponym of another synset
        
        Args:
            hyponym: Key / label of the hierarchically subordinate synset in question
            hypernym: Key / label of the hierarchically superordinate synset in question
        
        Returns:
            is_hyponym: Indicator whether `hyponym` is a hyponym of `hypernym`
            
        Raises:
            UserWarning: If `synset_map` has not been initialized at call time
        '''
        
        if not self._synset_map:
            raise UserWarning(
                'Please initialize `synset_map` before calling this function.\n'
            )
        
        return (hypernym in self.hypernym_path(hyponym))
    
    
    def semantic_distance(self, synset_1, synset_2):
        '''Calculates the semantic distance between two synsets in accordance with (Fergus et al., 2010)
        
        Args:
            synset_1: Key / label of the first synset
            synset_2: Key / label of the second synset
            
        Returns:
            semantic_dist (float): Semantic distance between the two given synsets (0.0 if one of
                the synsets is not in `synset_map`)
            
        Raises:
            UserWarning: If `synset_map` has not been initialized at call time
        '''
        
        if not self._synset_map:
            raise UserWarning(
                'Please initialize `synset_map` before calling this function.\n'
            )
            
        semantic_dist = 0.0
            
        hypernym_path_1 = self.hypernym_path(synset_1)
        hypernym_path_2 = self.hypernym_path(synset_2)
        
        if hypernym_path_1 and hypernym_path_2:
            # Semantic distance is defined by (Fergus et al., 2010) as follows:
            # S(i, j) = |intersect(path(i), path(j))| / max(length(path(i)), length(path(j)))
            shared_hypernyms = [hypernym for hypernym in hypernym_path_1 if hypernym in hypernym_path_2]
            semantic_dist = len(shared_hypernyms) / max(len(hypernym_path_1), len(hypernym_path_2))
        
        return semantic_dist
    
    
    def synset_level(self, synset):
        '''Returns the level of the given synset in the syntactical structure
        
        Args:
            synset: Key / label of the synset in question
            
        Returns:
            level: Level of the given synset in the hierarchical structure (zero indicating root
                level, one the first level below root level, etc.; `None` if `synset` is not in
                `synset_map`)
            
        Raises:
            UserWarning: If `synset_map` has not been initialized at call time
        '''
        
        if not self._synset_map:
            raise UserWarning(
                'Please initialize `synset_map` before calling this function.\n'
            )
        
        hypernym_path = self.hypernym_path(synset)
        
        if not hypernym_path:
            return None
        
        return len(hypernym_path) - 1
    
    
    def get_all_synsets_of_level(self, level):
        '''Returns all synsets of the specified level in the hierarchical syntactic structure
        
        Args:
            level (int): Non-negative integer specifying the level of `synset_map` from which the
                synsets are to be returned (zero indicating root level)
        
        Returns:
            synset: List of all synsets on the specified level in `synset_map` (empty if `level` is
                greater than the maximum depth of `synset_map`)
                
        Raises:
            TypeError: If one of the input arguments is of the wrong type
            ValueError: If invalid values are specified for one or more input arguments
            UserWarning: If `synset_map` has not been initialized at call time
        '''
        
        if not isinstance(level, int):
            raise TypeError(
                '`level` is expected to be of type `int`. Is of type {}.\n'.format(type(level))
            )
        if level < 0:
            raise ValueError(
                '`level` has to be a non-negative value.\n'
            )
        if not self._hypernym_paths:
            raise UserWarning(
                'Please initialize `synset_map` or call `construct_hypernym_map` before calling this function.\n'
            )
            
        synset = []
        
        for hypernym_path in self._hypernym_paths.values():
            if len(hypernym_path) - 1 >= level:
                if hypernym_path[level] not in synset:
                    synset.append(hypernym_path[level])
                
        return synset
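
The measure computed by `semantic_distance` can be checked by hand on short hypernym paths. The following stand-alone sketch re-implements only the formula from the comment in that method, taking the two paths directly as input; the toy paths are hypothetical. Note that the value grows with the overlap of the two paths, so it behaves like a similarity score: `1.0` for identical synsets and `0.0` for synsets without a common hypernym.

```python
def semantic_distance(path_1, path_2):
    '''Share of common synsets relative to the longer of the two hypernym paths'''
    if not path_1 or not path_2:
        return 0.0
    shared = [synset for synset in path_1 if synset in path_2]
    return len(shared) / max(len(path_1), len(path_2))

# Toy hypernym paths, ordered from the root layer down to the synset
rose = ['flowers', 'rose']
tulip = ['flowers', 'tulip']
oak = ['trees', 'oak_tree']

same = semantic_distance(rose, rose)       # identical synsets
siblings = semantic_distance(rose, tulip)  # same hypernym, different hyponym
unrelated = semantic_distance(rose, oak)   # no common hypernym
print(same, siblings, unrelated)
```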

In [None]:
# Mappings between (encoded) numeric and (decoded) string literal labels
CIFAR100_COARSE_LABELS_TO_LITERALS = [literal.numpy().decode('utf8') for literal in tf.data.TextLineDataset(CIFAR100_COARSE_LABELS_TO_LITERALS_FILE)]
CIFAR100_FINE_LABELS_TO_LITERALS = [literal.numpy().decode('utf8') for literal in tf.data.TextLineDataset(CIFAR100_FINE_LABELS_TO_LITERALS_FILE)]

In [None]:
def cifar100_decode(label, level=CIFAR100_FINE_LABEL_LEVEL):
    '''Returns the literal associated with the given combination of encoded label and hierarchy level
    
    Args:
        label (int): Encoded label
        level (int, optional): Level of the given label in the hierarchical structure (`0` indicating
            coarse, `1` (default) fine labels)
    
    Returns:
        decoded_label (str): String literal belonging to the given label / level combination
    '''
   
    # Cut off invalid values for `level` (has to be either `0` or `1`)
    level = max(0, min(CIFAR100_FINE_LABEL_LEVEL, level))
    
    if level == CIFAR100_COARSE_LABEL_LEVEL:
        return CIFAR100_COARSE_LABELS_TO_LITERALS[label]
   
    return CIFAR100_FINE_LABELS_TO_LITERALS[label]

In [None]:
def cifar100_encode(label):
    '''Returns the combination of encoded label and hierarchy level associated with the given literal
    
    Args:
        label (str): Decoded label
    
    Returns:
        encoded_label (int): Numeric label belonging to the given string literal
        level (int): Level of the `encoded_label` in the hierarchical structure (`0` indicating coarse,
            `1` fine labels)
    '''
    
    if label in CIFAR100_COARSE_LABELS_TO_LITERALS:
        return CIFAR100_COARSE_LABELS_TO_LITERALS.index(label), CIFAR100_COARSE_LABEL_LEVEL
        
    return CIFAR100_FINE_LABELS_TO_LITERALS.index(label), CIFAR100_FINE_LABEL_LEVEL

In [None]:
# Annotation: This function serves to aggregate and encapsulate all native (non-tensorized) Python
# functionalities that require eager execution for direct data access during the preprocessing of
# the dataset to minimize the overhead resulting from converting TensorFlow data structures to Python
# objects and back at runtime.

def cifar100_resolve_hypernym(label, level, encoded=False):
    '''Returns the hypernym of the given (fine) label on the specified hierarchy level

    Given a label on the finest supported categorical resolution (i.e. fine labels for the CIFAR100
    dataset) returns the related hypernym on the specified hierarchy level

    Args:
        label: Label of the synset on the finest supported categorical resolution whose hypernym is to
            be returned in numerical or string literal format
        level (int): Hierarchy level on which the hypernym is to be resolved (`0` indicating coarse,
            `1` fine labels)
        encoded (bool, optional): Indicator whether to return the hypernym in encoded or decoded
            format

    Returns:
        hypernym: Hypernym of the given synset on the hierarchy level specified by `level` in numerical
            or string literal format depending on `encoded`
    '''

    # `tf.Tensor` compatibility
    if isinstance(label, tf.Tensor):
        label = label.numpy()

    # Decode the label if it was given in encoded format
    if not isinstance(label, str):
        label = cifar100_decode(label, CIFAR100_FINE_LABEL_LEVEL)

    hypernym = CIFAR100_SYNSET_MAP.hypernym_path(label)[level]
    
    if encoded:
        hypernym, _ = cifar100_encode(hypernym)

    return hypernym

# TensorFlow wrappers that allow applying this function to placeholder objects and thus employing
# it as part of the preprocessing pipeline
def tf_cifar100_resolve_hypernym_decoded(label, level):
    return tf.py_function(cifar100_resolve_hypernym, inp=(label, level, False), Tout=tf.dtypes.string)
def tf_cifar100_resolve_hypernym_encoded(label, level):
    return tf.py_function(cifar100_resolve_hypernym, inp=(label, level, True), Tout=tf.dtypes.int64)

In [None]:
# Create a mapping between hyper- and hyponyms (i.e. coarse and fine labels) of the CIFAR100 dataset
CIFAR100_SYNSET_MAP = synset_map(
    dataset=[{CIFAR100_COARSE_LABEL_KEY: cifar100_decode(
                  record[CIFAR100_COARSE_LABEL_KEY].numpy(), CIFAR100_COARSE_LABEL_LEVEL),
              CIFAR100_FINE_LABEL_KEY: cifar100_decode(
                  record[CIFAR100_FINE_LABEL_KEY].numpy(), CIFAR100_FINE_LABEL_LEVEL)}
             for record in cifar100_test_raw],
    keys=[CIFAR100_COARSE_LABEL_KEY, CIFAR100_FINE_LABEL_KEY])

# Macros for the number of coarse resp. fine labels
CIFAR100_NUM_COARSE_LABELS = len(CIFAR100_SYNSET_MAP.get_all_synsets_of_level(CIFAR100_COARSE_LABEL_LEVEL))
CIFAR100_NUM_FINE_LABELS = len(CIFAR100_SYNSET_MAP.get_all_synsets_of_level(CIFAR100_FINE_LABEL_LEVEL))

---
## Processing and augmentation
<a id='processing_augmentation'></a>

In this section, different functions are provided for preprocessing the dataset prior to the training and inference adhering to the standard 10-crop procedure with contrast normalization and ZCA whitening as described in (Goodfellow et al., 2013), (Krizhevsky, 2009) and (Krizhevsky et al., 2012).
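
As a rough orientation, the 10-crop procedure takes the four corner patches and the central patch of an image together with their mirrored counterparts. The following sketch only computes the crop offsets in plain Python, assuming 32x32 input images and the 28x28 crop size configured above; the function name `ten_crop_offsets` is hypothetical and the actual cropping in this notebook is done by the TensorFlow helpers defined below.

```python
def ten_crop_offsets(height, width, crop_height, crop_width):
    '''Returns (offset_height, offset_width, mirrored) for each of the ten crops'''
    bottom = height - crop_height
    right = width - crop_width
    positions = [
        (0, 0),                     # upper left corner
        (0, right),                 # upper right corner
        (bottom, 0),                # lower left corner
        (bottom, right),            # lower right corner
        (bottom // 2, right // 2),  # center
    ]
    # Every position is used twice: once as is and once mirrored
    return [(top, left, mirrored)
            for mirrored in (False, True)
            for top, left in positions]

crops = ten_crop_offsets(32, 32, 28, 28)
for crop in crops:
    print(crop)
```

At inference time, the predictions for the ten crops of an image are typically averaged to obtain the final prediction.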

Crop out a patch of an image

In [None]:
def crop_image(image, offset_height, offset_width, target_height, target_width):
    '''Simple, type-preserving wrapper around `tf.image.crop_to_bounding_box`
    
    Args:
        image: 3-D image `Tensor`
        offset_height (int): Vertical coordinate of the top-left corner of the crop
        offset_width (int): Horizontal coordinate of the top-left corner of the crop
        target_height (int): Height of the crop
        target_width (int): Width of the crop
        
    Returns:
        cropped_image: 3-D tensor containing the cropped image
    '''
    
    return tf.cast(
        tf.image.crop_to_bounding_box(
            image,
            offset_height, offset_width,
            target_height, target_width),
        image.dtype)

In [None]:
def crop_central(image, target_height, target_width):
    '''Crops the central patch of an image
    
    Args:
        image: 3-D image `Tensor`
        target_height (int): Height of the crop
        target_width (int): Width of the crop
        
    Returns:
        cropped_image: 3-D tensor containing the central crop of the image
    '''
    
    shape = tf.shape(input=image)
    height, width = shape[0], shape[1]

    offset_height = (height - target_height) // 2
    offset_width = (width - target_width) // 2
    
    return crop_image(
        image,
        offset_height, offset_width,
        target_height, target_width)

In [None]:
def crop_corner_upper_left(image, target_height, target_width):
    '''Crops the upper left corner patch of an image
    
    Args:
        image: 3-D image `Tensor`
        target_height (int): Height of the crop
        target_width (int): Width of the crop
        
    Returns:
        cropped_image: 3-D tensor containing the crop at the upper left corner position of the image
    '''
    
    shape = tf.shape(input=image)
    height, width = shape[0], shape[1]

    offset_height = 0
    offset_width = 0

    return crop_image(
        image,
        offset_height, offset_width,
        target_height, target_width)

In [None]:
def crop_corner_upper_right(image, target_height, target_width):
    '''Crops the upper right corner patch of an image
    
    Args:
        image: 3-D image `Tensor`
        target_height (int): Height of the crop
        target_width (int): Width of the crop
        
    Returns:
        cropped_image: 3-D tensor containing the crop at the upper right corner position of the image
    '''
    
    shape = tf.shape(input=image)
    height, width = shape[0], shape[1]

    offset_height = 0
    offset_width = (width - target_width)

    return crop_image(
        image,
        offset_height, offset_width,
        target_height, target_width)

In [None]:
def crop_corner_lower_right(image, target_height, target_width):
    '''Crops the lower right corner patch of an image
    
    Args:
        image: 3-D image `Tensor`
        target_height (int): Height of the crop
        target_width (int): Width of the crop
        
    Returns:
        cropped_image: 3-D tensor containing the crop at the lower right corner position of the image
    '''
    
    shape = tf.shape(input=image)
    height, width = shape[0], shape[1]

    offset_height = (height - target_height)
    offset_width = (width - target_width)

    return crop_image(
        image,
        offset_height, offset_width,
        target_height, target_width)

In [None]:
def crop_corner_lower_left(image, target_height, target_width):
    '''Crops the lower left corner patch of an image
    
    Args:
        image: 3-D image `Tensor`
        target_height (int): Height of the crop
        target_width (int): Width of the crop
        
    Returns:
        cropped_image: 3-D tensor containing the crop at the lower left corner position of the image
    '''
    
    shape = tf.shape(input=image)
    height, width = shape[0], shape[1]

    offset_height = (height - target_height)
    offset_width = 0

    return crop_image(
        image,
        offset_height, offset_width,
        target_height, target_width)

In [None]:
def crop_random(image, target_height, target_width):
    '''Crops a random patch of an image
    
    Args:
        image: 3-D image `Tensor`
        target_height (int): Height of the crop
        target_width (int): Width of the crop
        
    Returns:
        cropped_image: 3-D tensor containing a crop at a random position of the image
    '''
    
    shape = tf.shape(input=image)
    height, width = shape[0], shape[1]

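    # `maxval` is exclusive, so 1 is added to make the last valid offset reachable (this also keeps
    # the sampling range non-empty if the crop is as large as the image)
    offset_height = tf.random.uniform((1,), maxval=(height - target_height + 1), dtype=tf.dtypes.int32)[0]
    offset_width = tf.random.uniform((1,), maxval=(width - target_width + 1), dtype=tf.dtypes.int32)[0]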
    
    return crop_image(
        image,
        offset_height, offset_width,
        target_height, target_width)

Flip a given image

In [None]:
def flip_horizontally(image):
    '''Simple wrapper for `tf.image.flip_left_right`
    
    Args:
        image: 3-D image `Tensor`
        
    Returns:
        flipped_image: 3-D tensor containing the horizontally flipped (left-right mirrored) image
    '''
    
    return tf.image.flip_left_right(image)

In [None]:
def flip_vertically(image):
    '''Simple wrapper for `tf.image.flip_up_down`
    
    Args:
        image: 3-D image `Tensor`
        
    Returns:
        flipped_image: 3-D tensor containing the vertically flipped (upside-down) image
    '''
    
    return tf.image.flip_up_down(image)

Contrast correction

In [None]:
def global_contrast_correction(feature_mat, scale=1, bias=0, eps=1e-3, rowvar=True):
    '''Global contrast correction for a given feature matrix
    
    Performs global contrast correction for a given feature matrix (i.e. bias correction followed
    by normalization and rescaling of the standard deviation)
    
    Args:
        feature_mat (float): 2-D tensor containing the features to be contrast corrected
        scale (int, optional): Rescaling factor for the standard deviation after normalizing; the
            resulting image will thus have a standard deviation of `scale`
        bias (int, optional): Regularization term to bias the standard deviation prior to normalizing
        eps (float, optional): Cutoff constant to prevent division by zero; standard deviations below
            this value are automatically set to `eps`
        rowvar (bool, optional): Numpy-style indicator, whether the features are in the rows or in the
            columns of the matrix (with the respective other dimension representing the observations);
            True (default) means the features are in the rows, False implies the features are in the
            columns
            
    Returns:
        contrast_corrected_feature_mat: 2-D tensor containing the contrast corrected feature matrix
            with feature mean zero and standard deviation `scale`
    '''
    
    if not rowvar:
        feature_mat = tf.transpose(feature_mat)
        
    # Offset / bias correction
    feature_mat = feature_mat - tf.expand_dims(tf.reduce_mean(feature_mat, axis=1), axis=-1)
    
    # Standard deviation normalization and rescaling
    feature_mat = scale * feature_mat / tf.math.maximum(eps, tf.math.sqrt(bias + tf.expand_dims(tf.reduce_mean(feature_mat ** 2, axis=1), axis=-1)))
    
    if not rowvar:
        feature_mat = tf.transpose(feature_mat)
    
    return feature_mat
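
The centering and rescaling performed above can be checked on a small example. The following is a minimal NumPy sketch of the same arithmetic, not the notebook's TensorFlow implementation; the function name `global_contrast_correction_np` and the random test data are assumptions made for illustration only.

```python
import numpy as np

def global_contrast_correction_np(feature_mat, scale=1.0, bias=0.0, eps=1e-3):
    """NumPy stand-in (hypothetical helper): center each row and rescale it to std `scale`."""
    # Offset / bias correction: subtract the row-wise mean
    centered = feature_mat - feature_mat.mean(axis=1, keepdims=True)
    # Root-mean-square of each centered row, floored at `eps` to avoid division by zero
    norm = np.maximum(eps, np.sqrt(bias + (centered ** 2).mean(axis=1, keepdims=True)))
    return scale * centered / norm

# Illustrative data: 4 features (rows) with 1000 observations each
rng = np.random.default_rng(0)
features = rng.normal(loc=5.0, scale=3.0, size=(4, 1000))
corrected = global_contrast_correction_np(features)

print(corrected.mean(axis=1))  # close to zero for every row
print(corrected.std(axis=1))   # close to `scale` for every row
```

With `bias=0`, every row ends up with zero mean and a standard deviation of `scale`, which is the property stated in the docstring above.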

Decorrelation of pixel values

In [None]:
def zca_whitening(feature_mat, eps=1e-3, rowvar=True):
    '''Performs ZCA whitening for a given feature matrix
    
    Performs ZCA whitening (aka "Mahalanobis transformation") to de-correlate the data dimensions
    of a given feature matrix following (Krizhevsky, 2009)
    
    Args:
        feature_mat (float): 2-D tensor containing the features to be de-correlated
        eps (float, optional): Cutoff constant to prevent division by zero; eigenvalues of the
            covariance matrix below this value are automatically set to `eps`
        rowvar (bool, optional): Numpy-style indicator, whether the features are in the rows or in the
            columns of the matrix (with the respective other dimension representing the observations);
            True (default) means the features are in the rows, False implies the features are in the
            columns
            
    Returns:
        zca_whitened_feature_mat: 2-D tensor containing the ZCA whitened feature matrix with
            de-correlated data dimensions
    '''
    
    if not rowvar:
        feature_mat = tf.transpose(feature_mat)
    
    # Calculate the covariance matrix (for zero-mean centered data: cov(X) = X * X.T / (n - 1))
    cov = tf.linalg.matmul(feature_mat, feature_mat, transpose_b=True) / tf.cast(tf.shape(feature_mat)[1] - 1, tf.float32)
            
    # Singular value decomposition
    s, u, _ = tf.linalg.svd(cov)
    
    # Calculate the ZCA whitening matrix (w = P * D ** (-1/2) * P.T with P an orthogonal and D a diagonal matrix constituting cov)
    w = tf.linalg.matmul(u, tf.linalg.matmul(tf.linalg.diag(tf.math.maximum(eps, s) ** (-1/2)), u, transpose_b=True))
        
    # ZCA whitening of the feature matrix
    feature_mat = tf.linalg.matmul(w, feature_mat)
                                                                           
    if not rowvar:
        feature_mat = tf.transpose(feature_mat)                               
    
    return feature_mat
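
A quick way to sanity-check the whitening step is to confirm that the covariance of the transformed data is approximately the identity matrix. The sketch below redoes the computation in NumPy under the same assumptions as the function above (features in rows, zero-mean rows); `zca_whitening_np`, the mixing matrix and the sample size are hypothetical and chosen only for the demonstration.

```python
import numpy as np

def zca_whitening_np(feature_mat, eps=1e-3):
    """NumPy stand-in (hypothetical helper); expects zero-mean features in the rows."""
    n = feature_mat.shape[1]
    cov = feature_mat @ feature_mat.T / (n - 1)
    # Note the return order: np.linalg.svd yields (u, s, vh), tf.linalg.svd yields (s, u, v)
    u, s, _ = np.linalg.svd(cov)
    # w = P * D ** (-1/2) * P.T, with small eigenvalues floored at `eps`
    w = u @ np.diag(np.maximum(eps, s) ** -0.5) @ u.T
    return w @ feature_mat

# Illustrative data: three correlated features with 5000 observations
rng = np.random.default_rng(0)
mixing = np.array([[2.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 0.7]])
data = mixing @ rng.normal(size=(3, 5000))
data -= data.mean(axis=1, keepdims=True)

whitened = zca_whitening_np(data)
whitened_cov = whitened @ whitened.T / (data.shape[1] - 1)
print(np.round(whitened_cov, 3))  # approximately the 3x3 identity matrix
```

The identity is only reached for directions whose eigenvalue exceeds `eps`; directions below the cutoff are scaled by `eps ** (-1/2)` instead and therefore keep a variance smaller than one.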

General preprocessing applicable to all images (dataset level)

In [None]:
def general_preprocessing_dataset_level(dataset):
    '''Preprocessing steps applicable to all images
    
    Preprocessing steps applicable to all images; includes global contrast correction and ZCA whitening
    
    Args:
        dataset: Dataset containing image records in the format (`image`(float32), `label`)
        
    Returns:
        processed_dataset: Preprocessed dataset of (`image`, `label`) tuples
    '''
    
    # Get the number of entries in the dataset
    dataset_length = 0
    for _ in dataset:
        dataset_length += 1
    
    # Transform the dataset to a 2-D tensor with each row containing one image in flattened form and
    # a 1-D tensor containing the respective labels
    (images, labels) = tf.data.experimental.get_single_element(dataset.batch(dataset_length))
    image_dims = tf.shape(images)
    images = tf.reshape(images, (image_dims[0], image_dims[1] * image_dims[2] * image_dims[3]))
    
    # Global contrast correction
    images = global_contrast_correction(images, rowvar=False)
    
    # ZCA whitening
    images = zca_whitening(images, rowvar=False)
    
    # Reconstruction of the original dataset structure
    images = tf.reshape(images, (image_dims[0], image_dims[1], image_dims[2], image_dims[3]))
    
    return tf.data.Dataset.from_tensor_slices((images, labels))    

General preprocessing applicable to all images (record level)

In [None]:
def general_preprocessing_record_level(record, flip=False):
    '''Preprocessing steps applicable to all images
    
    Preprocessing steps applicable to all images; includes horizontal flipping
    
    Args:
        record: (`image`, `label`) tuple
        flip (optional): Indicator whether to flip the image
        
    Returns:
        processed_image: 3-D tensor containing the processed image
        label: Label belonging to `processed_image`
    '''
    
    image = record[0]
    label = record[1]
            
    if flip:
        processed_image = flip_horizontally(image)
    else:
        processed_image = image
    
    return (processed_image, label)

Preprocessing dependent on whether the images are used for training or evaluation (record level)

In [None]:
def train_preprocessing_record_level(record):
    '''Preprocessing steps applicable only to training images
    
    Preprocessing steps applicable only to training images; includes cropping of random image patches
    
    Args:
        record: (`image`, `label`) tuple
        
    Returns:
        processed_image: 3-D tensor containing the processed image
        label: Label belonging to `processed_image`
    '''
            
    image = record[0]
    label = record[1]
    
    processed_image = crop_random(image, CROP_SIZE_H, CROP_SIZE_W)
    
    return (processed_image, label)

In [None]:
def test_preprocessing_record_level(record, crop_mode):
    '''Preprocessing steps applicable only to test images
    
    Preprocessing steps applicable only to test images; includes cropping of corner and center
    image patches
    
    Args:
        record: (`image`, `label`) tuple
        crop_mode (int): Indicates whether to crop at the center or the corners of the image ([0:3]
            corners (clockwise starting at the upper left corner) and [4] image center)
            
    Returns:
        processed_image: 3-D tensor containing the processed image
        label: Label belonging to `processed_image`
    '''
    
    image = record[0]
    label = record[1]
    
    if crop_mode == 0:
        processed_image = crop_corner_upper_left(image, CROP_SIZE_H, CROP_SIZE_W)
    elif crop_mode == 1:
        processed_image = crop_corner_upper_right(image, CROP_SIZE_H, CROP_SIZE_W)
    elif crop_mode == 2:
        processed_image = crop_corner_lower_right(image, CROP_SIZE_H, CROP_SIZE_W)
    elif crop_mode == 3:
        processed_image = crop_corner_lower_left(image, CROP_SIZE_H, CROP_SIZE_W)
    else:
        processed_image = crop_central(image, CROP_SIZE_H, CROP_SIZE_W)
    
    return (processed_image, label)
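
To make the `crop_mode` convention concrete, the sketch below lists the crop offsets for each mode in plain Python. It assumes 32x32 input images and 28x28 crops (matching the `(28, 28, 3)` model input used later) and assumes that the central crop uses integer-halved offsets; `five_crop_offsets` is a hypothetical helper, not part of the notebook.

```python
def five_crop_offsets(height, width, target_height, target_width):
    """Return the (offset_height, offset_width) pair for crop modes 0..4 (hypothetical helper)."""
    bottom = height - target_height
    right = width - target_width
    return [
        (0, 0),                      # 0: upper left corner
        (0, right),                  # 1: upper right corner
        (bottom, right),             # 2: lower right corner
        (bottom, 0),                 # 3: lower left corner
        (bottom // 2, right // 2),   # 4: image center (assumed integer-halved offsets)
    ]

offsets = five_crop_offsets(32, 32, 28, 28)
for mode, (offset_height, offset_width) in enumerate(offsets):
    print(mode, offset_height, offset_width)
```

The first four entries follow the clockwise order of the corner crop functions defined above.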

Given a raw dataset, construct an iterator over the preprocessed and augmented records

In [None]:
def process_and_augment(dataset,
                        batch_size,
                        synset_level=CIFAR100_FINE_LABEL_LEVEL,
                        hypernym=None,
                        is_train=False,
                        num_rnd_crops=5,
                        shuffle_buffer_size=1000,
                        num_epochs=1):
    '''Given a raw dataset, construct an iterator over the preprocessed and augmented records
    
    Args:
        dataset: Dataset containing raw records
        batch_size: Number of samples per batch
        synset_level (optional): Specifier of the synset level (i.e. fineness of label resolution)
            to use (defaults to the finest possible resolution for the CIFAR100 dataset)
        hypernym (optional): Identifier of the parental (hypernym) category whose subordinate (hyponym)
            labels are to be included in the dataset (only evaluated below the root level, i.e. if
            `synset_level` is greater than `0`; value `None` (default) indicates that all categories
            are to be included; all records with other labels are discarded)
        is_train (optional): Indicator whether the input is for training
        num_rnd_crops (optional): Number of random crops to perform per image (as each image is also
            included in horizontally flipped form, this results in a total of 2 * `num_rnd_crops`
            augmented samples per image; only evaluated if `is_train` is set to True)
        shuffle_buffer_size (optional): Size of the buffer to use when shuffling records (only used
            if `is_train` is set to True)
        num_epochs (optional): Number of epochs to repeat the dataset (only used when `is_train` is
            set to True)
            
    Returns:
        dataset: Dataset of (`image`, `label`) pairs ready for iteration
    '''
    
    # Cut off invalid values for `synset_level` (has to be either `0` or `1` as the CIFAR100 dataset
    # only has two levels of label granularity)
    synset_level = max(0, min(CIFAR100_FINE_LABEL_LEVEL, synset_level))
    
    # Get the number of labels (called `depth` in Tensorflow) for one-hot encoding (cf. below)
    label_depth = len(CIFAR100_SYNSET_MAP.get_all_synsets_of_level(synset_level))

    # Convert each record to the form (image, label)
    dataset = dataset.map(
        lambda record: (record[CIFAR100_IMG_KEY],
                        tf_cifar100_resolve_hypernym_encoded(record[CIFAR100_FINE_LABEL_KEY], synset_level)),
        num_parallel_calls=tf.data.experimental.AUTOTUNE)
    
    if (synset_level != 0) and (hypernym is not None):
        # Check, whether `hypernym` is compatible with `synset_level`
        if (CIFAR100_SYNSET_MAP.synset_level(hypernym) == synset_level - 1):
            hyponyms = CIFAR100_SYNSET_MAP.hyponyms(hypernym)

            if hyponyms:
                # Apply filter w/ regards to the parental (coarse) category (i.e. filter out all images
                # that don't belong to a given parent category)
                dataset = dataset.filter(
                    lambda _, label: tf.math.reduce_any(
                        [tf.math.equal(label, cifar100_encode(hyponym)[0]) for hyponym in hyponyms]))
    
    # One-hot encode the remaining entries and convert the images to float32 to make subsequent
    # calculations go smoothly
    dataset = dataset.map(
        lambda image, label: (tf.cast(image, tf.dtypes.float32), tf.one_hot(label, label_depth)),
        num_parallel_calls=tf.data.experimental.AUTOTUNE)
    
    # General (dataset level) preprocessing as described in (Goodfellow et al., 2013) and
    # (Krizhevsky, 2009)
    dataset = general_preprocessing_dataset_level(dataset)
        
    # Cache the dataset to prevent the subsequent record level preprocessing steps from re-running
    # the dataset level preprocessing for each record (only applicable if we're not filtering; in
    # that case we're only processing a small fraction of the dataset at once so we don't
    # necessarily need the caching anyway)
    if (synset_level == 0) or (hypernym is None):
        dataset = dataset.cache()

    # Parse the raw records into images and labels and augment the dataset as described in
    # (Krizhevsky et al., 2012)
    
    # Repeat each image twice (once per flip)
    dataset = dataset.enumerate().interleave(
        lambda _, record: tf.data.Dataset.from_tensors(record).repeat(2),
        cycle_length=1,
        num_parallel_calls=tf.data.experimental.AUTOTUNE)
    
    # General (record level) preprocessing applicable to all images
    dataset = dataset.enumerate().map(
        lambda idx, record: general_preprocessing_record_level(record, flip=tf.dtypes.cast(idx % 2, tf.dtypes.bool)),
        num_parallel_calls=tf.data.experimental.AUTOTUNE)
        
    # Mode dependent augmentation
    if is_train:
        # Repeat each image `num_rnd_crops` times (once per crop)
        dataset = dataset.enumerate().interleave(
            lambda _, record: tf.data.Dataset.from_tensors(record).repeat(num_rnd_crops),
            cycle_length=1,
            num_parallel_calls=tf.data.experimental.AUTOTUNE)
    
        dataset = dataset.enumerate().map(
            lambda idx, record: train_preprocessing_record_level(record),
            num_parallel_calls=tf.data.experimental.AUTOTUNE)
    else:
        # Repeat each image five times (once per crop)
        dataset = dataset.enumerate().interleave(
            lambda _, record: tf.data.Dataset.from_tensors(record).repeat(5),
            cycle_length=1,
            num_parallel_calls=tf.data.experimental.AUTOTUNE)
        
        dataset = dataset.enumerate().map(
            lambda idx, record: test_preprocessing_record_level(record, crop_mode=idx%5),
            num_parallel_calls=tf.data.experimental.AUTOTUNE)
    
    if is_train:
        # Shuffle records before repeating to respect epoch boundaries
        dataset = dataset.shuffle(buffer_size=shuffle_buffer_size)
        # Repeats the dataset for the number of epochs to train
        dataset = dataset.repeat(num_epochs)
    
    # Batching
    dataset = dataset.batch(batch_size)

    # Operations between the final prefetch and the get_next call to the iterator will happen
    # synchronously during run time. Manual prefetching at this point backgrounds all of the above
    # processing work and helps keep it out of the critical training path.
    dataset = dataset.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)
    
    return dataset
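
The interplay of repeating records and deriving the augmentation from the running index can be hard to follow in the `tf.data` pipeline above. The following plain-Python sketch mimics the evaluation branch with `itertools`; it assumes that `interleave` with `cycle_length=1` emits all copies of one record before moving on to the next, and `repeat_each` as well as the placeholder record names are hypothetical.

```python
from itertools import chain, repeat

def repeat_each(records, times):
    """Emit every record `times` times in a row (stand-in for the interleave/repeat step)."""
    return chain.from_iterable(repeat(record, times) for record in records)

records = ['img_a', 'img_b']  # placeholders for (image, label) records

# Stage 1: two copies per image; the copy with an odd running index is flipped
flipped = [(record, idx % 2 == 1)
           for idx, record in enumerate(repeat_each(records, 2))]

# Stage 2 (evaluation mode): five copies per flip variant, one per crop mode
augmented = [(name, flip, idx % 5)
             for idx, (name, flip) in enumerate(repeat_each(flipped, 5))]

for entry in augmented[:10]:
    print(entry)  # (record, flipped?, crop_mode)
```

Under this assumption every image yields 2 * 5 = 10 evaluation samples, and each flip variant is paired with every crop mode exactly once.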

In [None]:
# Benchmark the performance of the preprocessing pipeline

# Number of samples to use for the benchmark (higher numbers result in a lower bias from one-time
# operations but increase the computation time)
NUM_SAMPLES = 1000

cifar100_train = process_and_augment(cifar100_train_raw,
                                     batch_size=1,
                                     synset_level=CIFAR100_FINE_LABEL_LEVEL,
                                     hypernym='aquatic_mammals',
                                     is_train=True,
                                     num_rnd_crops=5,
                                     shuffle_buffer_size=1000,
                                     num_epochs=1)

start_time_raw = time.perf_counter()
for record in cifar100_train_raw.take(NUM_SAMPLES):
    continue
end_time_raw = time.perf_counter()

mean_time_raw = (end_time_raw - start_time_raw) / NUM_SAMPLES
print('Mean time per sample (raw): {} s'.format(mean_time_raw))

start_time_processed = time.perf_counter()
for record in cifar100_train.take(NUM_SAMPLES):
    continue
end_time_processed = time.perf_counter()

mean_time_processed = (end_time_processed - start_time_processed) / NUM_SAMPLES
print('Mean time per sample (preprocessed): {} s'.format(mean_time_processed))

mean_time_diff = mean_time_processed - mean_time_raw
print('Time difference per sample: {} s'.format(mean_time_diff))

---
## Composed Network (CompNet)
<a id='compnet'></a>

In this section, we examine the performance of a composed model on the CIFAR100 dataset. 

The structure of the model is based on the inherent structure of the CIFAR100 dataset, i.e. two interconnected "layers" of classifiers where the predictions of the first layer are used to route the inputs to the corresponding sub-modules of the second, specialized layer. For simplicity, we keep the basic architecture and hyperparameters the same for each sub-module (i.e. we don't perform individual hyperparameter tuning) and limit the amount of data each module is trained on to the number of examples in the smallest (coarse) category. Furthermore, for comparability with results reported in the literature as well as our own benchmark (cf. below), we employ MaxOut units (Goodfellow et al., 2013) in the convolutional as well as in the fully connected layers.

It should be noted at this point that the literature offers several promising approaches for improving the resilience of hierarchically composed models (cf. e.g. (Fergus et al., 2010), (Deng et al., 2014) or (Roy et al., 2019)). We deliberately avoided incorporating these ideas into our work: the goal of this project is to examine whether modularisation of networks is a feasible approach in general, and we want to keep our results as unbiased as possible in this regard (i.e. not "unfairly" improving our approach relative to the benchmark). In doing so, we leave a lot of room for improvement.

### Model
<a id='compnet_model'></a>

<a id='compnet_model_maxout'></a>
In a convolutional network, MaxOut feature maps can be constructed by chaining the following elements (Goodfellow et al., 2013):
    
    Input: (batch_size, image_width, image_height, num_channels)
        tf.keras.layers.Conv2D w/ activation=None and padding='same'
    Intermediate output: (batch_size, image_width, image_height, num_filters)
        tf.keras.layers.Reshape w/ dimensions (image_width, image_height, num_filters, 1)
    Intermediate output: (batch_size, image_width, image_height, num_filters, 1)
        tf.keras.layers.MaxPool3D over the affine feature maps
    Intermediate output: (batch_size, image_width, image_height, num_filters / pool_size, 1)
        tf.keras.layers.Reshape w/ dimensions (image_width, image_height, num_filters / pool_size)
    Final output: (batch_size, image_width, image_height, num_filters / pool_size)
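
The effect of the reshape and pooling chain listed above is a maximum over consecutive groups of `pool_size` affine feature maps. As a minimal sketch, the same operation can be written in NumPy as follows; `maxout_channels` and the toy input are assumptions for illustration, and the models below use the Keras layers rather than this helper.

```python
import numpy as np

def maxout_channels(feature_maps, pool_size):
    """Cross-channel max pooling over consecutive groups of `pool_size` channels."""
    batch, height, width, channels = feature_maps.shape
    assert channels % pool_size == 0, 'channels must be divisible by pool_size'
    # Split the channel axis into (groups, pool_size) and take the maximum per group
    grouped = feature_maps.reshape(batch, height, width, channels // pool_size, pool_size)
    return grouped.max(axis=-1)

# Toy "affine feature maps": 2 images of size 3x3 with 8 channels
affine = np.arange(2 * 3 * 3 * 8, dtype=np.float32).reshape(2, 3, 3, 8)
pooled = maxout_channels(affine, pool_size=4)

print(pooled.shape)     # (2, 3, 3, 2)
print(pooled[0, 0, 0])  # maxima of channels 0..3 and 4..7 at the first pixel
```

The spatial dimensions are untouched; only the channel count shrinks by the factor `pool_size`.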

On the topic of native API versus custom layers: Usually, one would argue that, considering the "hack" used in the form of reshaping to enable cross-channel pooling, it might be a good idea to implement a custom MaxOut layer instead to increase the readability of the code and remove unnecessary operations from the model (i.e. increase the "execution smoothness" of the model). Nevertheless, as the goal of this project is the scientific evaluation of the effect of our proposed method, we decided against this (usually good) practice to keep our "side-channel error vector" as small as possible (in this specific case: errors introduced by faulty programming) and thereby reduce the risk of accidentally distorting the results. Additionally, we believe that by limiting ourselves to standard functions provided by the native API, we increase the reproducibility of our experiments even across framework boundaries.

#### Coarse model

In [None]:
# Note: We employ a combination of Dropout and MaxNorm for weight regularization as is reported
# in (Srivastava et al., 2014) to be most effective. Furthermore, we also apply Dropout to the
# convolutional layers to facilitate the discovery of informative features as described in
# (Park et al., 2016).

# Input specification
inputs = tf.keras.Input(shape=(28, 28, 3))

# First convolutional MaxOut layer
x = tf.keras.layers.Conv2D(filters=64,
                           kernel_size=(3, 3),
                           strides=1,
                           padding='same',
                           activation=None,
                           use_bias=True,
                           kernel_constraint='max_norm',
                           bias_constraint='max_norm')(inputs)
x = tf.keras.layers.Reshape((28, 28, 64, 1))(x)
x = tf.keras.layers.MaxPool3D(pool_size=(1, 1, 4),
                              strides=(1, 1, 4),
                              padding='same')(x)
x = tf.keras.layers.Reshape((28, 28, 16))(x)

# First MaxPool layer
x = tf.keras.layers.MaxPool2D(pool_size=(2, 2),
                              strides=(2, 2),
                              padding='same')(x)

# First Dropout layer
x = tf.keras.layers.Dropout(rate=0.1)(x)

# Second convolutional MaxOut layer
x = tf.keras.layers.Conv2D(filters=128,
                           kernel_size=(3, 3),
                           strides=1,
                           padding='same',
                           activation=None,
                           use_bias=True,
                           kernel_constraint='max_norm',
                           bias_constraint='max_norm')(x)
x = tf.keras.layers.Reshape((14, 14, 128, 1))(x)
x = tf.keras.layers.MaxPool3D(pool_size=(1, 1, 4),
                              strides=(1, 1, 4),
                              padding='same')(x)
x = tf.keras.layers.Reshape((14, 14, 32))(x)

# Second MaxPool layer
x = tf.keras.layers.MaxPool2D(pool_size=(2, 2),
                              strides=(2, 2),
                              padding='same')(x)

# Second Dropout layer
x = tf.keras.layers.Dropout(rate=0.1)(x)

# Third convolutional MaxOut layer
x = tf.keras.layers.Conv2D(filters=256,
                           kernel_size=(3, 3),
                           strides=1,
                           padding='same',
                           activation=None,
                           use_bias=True,
                           kernel_constraint='max_norm',
                           bias_constraint='max_norm')(x)
x = tf.keras.layers.Reshape((7, 7, 256, 1))(x)
x = tf.keras.layers.MaxPool3D(pool_size=(1, 1, 4),
                              strides=(1, 1, 4),
                              padding='same')(x)
x = tf.keras.layers.Reshape((7, 7, 64))(x)

# Transition between convolutional and fully connected layers
x = tf.keras.layers.Flatten()(x)

# Third Dropout layer
x = tf.keras.layers.Dropout(rate=0.5)(x)

# Fully connected MaxOut layer
x = tf.keras.layers.Dense(units=128,
                          activation=None,
                          use_bias=True,
                          kernel_constraint='max_norm',
                          bias_constraint='max_norm')(x)
x = tf.keras.layers.Reshape((128, 1))(x)
x = tf.keras.layers.MaxPool1D(pool_size=4,
                              strides=4,
                              padding='same')(x)
x = tf.keras.layers.Flatten()(x)

# Fourth Dropout layer
x = tf.keras.layers.Dropout(rate=0.5)(x)

# Softmax output layer
outputs = tf.keras.layers.Dense(units=CIFAR100_NUM_COARSE_LABELS,
                                activation='softmax',
                                use_bias=True,
                                kernel_constraint='max_norm',
                                bias_constraint='max_norm')(x)

In [None]:
coarse_model = tf.keras.Model(inputs=inputs, outputs=outputs)
coarse_model.summary()
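
The shapes passed to the `Reshape` layers above follow from simple arithmetic: every MaxOut stage divides the number of filters by the pool size of 4, and every `MaxPool2D` layer with `padding='same'` and stride 2 halves the spatial size (rounding up). The following plain-Python sketch retraces these numbers for the coarse model; the helper name `same_pool` is hypothetical.

```python
import math

def same_pool(size, stride):
    """Output length of a pooling layer with padding='same'."""
    return math.ceil(size / stride)

height = width = 28
shapes = []
for filters in (64, 128, 256):
    channels = filters // 4              # MaxOut over groups of 4 affine feature maps
    shapes.append((height, width, channels))
    if filters != 256:                   # no spatial pooling after the third block
        height, width = same_pool(height, 2), same_pool(width, 2)

flattened = height * width * channels    # input size of the fully connected layer
dense_maxout_units = 128 // 4            # units left after the fully connected MaxOut

print(shapes)
print(flattened, dense_maxout_units)
```

The three tuples should match the target shapes of the second `Reshape` layer in each convolutional block as well as the output shapes reported by `coarse_model.summary()`.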

In [None]:
# Use Categorical Cross Entropy as loss function
loss = tf.keras.losses.CategoricalCrossentropy()

# Define metrics to watch during training
metrics = [tf.keras.metrics.CategoricalAccuracy(),
           tf.keras.metrics.CategoricalCrossentropy(),
           tf.keras.metrics.AUC()]

# Use Adam (Kingma et al., 2017) as optimizer during training
# Annotation: We don't set `learning_rate` here as this is automatically handled by the
# LearningRateScheduler (cf. callback section below).
optimizer = tf.keras.optimizers.Adam(beta_1=0.9,
                                     beta_2=0.999,
                                     epsilon=1e-07)

In [None]:
coarse_model.compile(loss=loss, metrics=metrics, optimizer=optimizer)

In [None]:
# Alternatively: Restore model from checkpoint
# coarse_model = tf.keras.models.load_model(CKPT_DIR + 'compnet_coarse')
# coarse_model.summary()

#### Fine models

We've got as many subordinate specialist modules as there are coarse categories (cf. https://www.cs.toronto.edu/~kriz/cifar.html for a comprehensive list). The downstream models are grouped in a list in order of their respective coarse label (i.e. 0..n-1 where n is the number of coarse categories), so that the prediction of the first layer can be used directly as an index when routing between the two layers.
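
The index-based routing described above can be illustrated with stand-in predictors. The sketch below is only an assumption about how such routing might look, not the notebook's evaluation code: `route_and_predict`, the two-category setup and the stub prediction functions are all hypothetical.

```python
import numpy as np

def route_and_predict(images, coarse_predict, fine_predicts):
    """Route each image to the fine predictor indexed by its predicted coarse label."""
    coarse_labels = np.argmax(coarse_predict(images), axis=-1)
    fine_labels = np.empty(len(images), dtype=np.int64)
    for category in np.unique(coarse_labels):
        selected = coarse_labels == category
        # The predicted coarse label is used directly as an index into the list
        fine_probs = fine_predicts[category](images[selected])
        fine_labels[selected] = np.argmax(fine_probs, axis=-1)
    return coarse_labels, fine_labels

# Stub predictors standing in for `coarse_model.predict` and `fine_models[i].predict`
coarse_stub = lambda x: np.eye(2)[(x[:, 0] > 0).astype(int)]
fine_stubs = [
    lambda x: np.tile([0.9, 0.1, 0.0], (len(x), 1)),  # specialist for coarse label 0
    lambda x: np.tile([0.0, 0.2, 0.8], (len(x), 1)),  # specialist for coarse label 1
]

images = np.array([[-1.0], [2.0], [-3.0], [4.0]])  # toy one-pixel "images"
coarse_labels, fine_labels = route_and_predict(images, coarse_stub, fine_stubs)
print(coarse_labels, fine_labels)
```

Grouping the inputs by predicted coarse label keeps the number of calls to the specialists at one per category and batch instead of one per image.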

In [None]:
fine_models = []

for _ in range(CIFAR100_NUM_COARSE_LABELS):
    # Note: We employ a combination of Dropout and MaxNorm for weight regularization as is reported
    # in (Srivastava et al., 2014) to be most effective. Furthermore, we also apply Dropout to the
    # convolutional layers to facilitate the discovery of informative features as described in
    # (Park et al., 2016).

    # Input specification
    inputs = tf.keras.Input(shape=(28, 28, 3))

    # First convolutional MaxOut layer
    x = tf.keras.layers.Conv2D(filters=64,
                               kernel_size=(3, 3),
                               strides=1,
                               padding='same',
                               activation=None,
                               use_bias=True,
                               kernel_constraint='max_norm',
                               bias_constraint='max_norm')(inputs)
    x = tf.keras.layers.Reshape((28, 28, 64, 1))(x)
    x = tf.keras.layers.MaxPool3D(pool_size=(1, 1, 4),
                                  strides=(1, 1, 4),
                                  padding='same')(x)
    x = tf.keras.layers.Reshape((28, 28, 16))(x)

    # First MaxPool layer
    x = tf.keras.layers.MaxPool2D(pool_size=(2, 2),
                                  strides=(2, 2),
                                  padding='same')(x)

    # First Dropout layer
    x = tf.keras.layers.Dropout(rate=0.1)(x)

    # Second convolutional MaxOut layer
    x = tf.keras.layers.Conv2D(filters=128,
                               kernel_size=(3, 3),
                               strides=1,
                               padding='same',
                               activation=None,
                               use_bias=True,
                               kernel_constraint='max_norm',
                               bias_constraint='max_norm')(x)
    x = tf.keras.layers.Reshape((14, 14, 128, 1))(x)
    x = tf.keras.layers.MaxPool3D(pool_size=(1, 1, 4),
                                  strides=(1, 1, 4),
                                  padding='same')(x)
    x = tf.keras.layers.Reshape((14, 14, 32))(x)

    # Second MaxPool layer
    x = tf.keras.layers.MaxPool2D(pool_size=(2, 2),
                                  strides=(2, 2),
                                  padding='same')(x)

    # Transition between convolutional and fully connected layers
    x = tf.keras.layers.Flatten()(x)

    # Second Dropout layer
    x = tf.keras.layers.Dropout(rate=0.5)(x)

    # Fully connected MaxOut layer
    x = tf.keras.layers.Dense(units=128,
                              activation=None,
                              use_bias=True,
                              kernel_constraint='max_norm',
                              bias_constraint='max_norm')(x)
    x = tf.keras.layers.Reshape((128, 1))(x)
    x = tf.keras.layers.MaxPool1D(pool_size=4,
                                  strides=4,
                                  padding='same')(x)
    x = tf.keras.layers.Flatten()(x)

    # Third Dropout layer
    x = tf.keras.layers.Dropout(rate=0.5)(x)

    # Softmax output layer
    outputs = tf.keras.layers.Dense(units=CIFAR100_NUM_FINE_LABELS,
                                    activation='softmax',
                                    use_bias=True,
                                    kernel_constraint='max_norm',
                                    bias_constraint='max_norm')(x)
    
    # Build the current model and append it to the list of sub-modules
    model = tf.keras.Model(inputs=inputs, outputs=outputs)
    fine_models.append(model)

In [None]:
fine_models[0].summary()

In [None]:
for idx in range(CIFAR100_NUM_COARSE_LABELS):
    # Use Categorical Cross Entropy as loss function
    loss = tf.keras.losses.CategoricalCrossentropy()

    # Define metrics to watch during training
    metrics = [tf.keras.metrics.CategoricalAccuracy(),
               tf.keras.metrics.CategoricalCrossentropy(),
               tf.keras.metrics.AUC()]

    # Use Adam (Kingma et al., 2017) as optimizer during training
    # Annotation: We don't set `learning_rate` here as this is automatically handled by the
    # LearningRateScheduler (cf. callback section below).
    optimizer = tf.keras.optimizers.Adam(beta_1=0.9,
                                         beta_2=0.999,
                                         epsilon=1e-07)
    
    # Compile the respective sub-module
    fine_models[idx].compile(loss=loss, metrics=metrics, optimizer=optimizer)

In [None]:
# Alternatively: Restore models from checkpoint
# fine_models = []
# for idx in range(CIFAR100_NUM_COARSE_LABELS):
#     model = tf.keras.models.load_model(CKPT_DIR + 'compnet_fine_' + str(idx))
#     fine_models.append(model)
#
# fine_models[0].summary()

### Training
<a id='compnet_train'></a>

#### Coarse model

In [None]:
cifar100_train_coarse = process_and_augment(cifar100_train_raw, batch_size=32, synset_level=CIFAR100_COARSE_LABEL_LEVEL, is_train=True, num_rnd_crops=5, shuffle_buffer_size=1000, num_epochs=1)
cifar100_val_coarse = process_and_augment(cifar100_val_raw, batch_size=32, synset_level=CIFAR100_COARSE_LABEL_LEVEL, is_train=True, num_rnd_crops=5, shuffle_buffer_size=1000, num_epochs=1)

In [None]:
# EarlyStopping: Stop training early if no significant improvement in the monitored quantity is
#     observed for at least `patience` epochs
# LearningRateScheduler: Dynamically adapt the learning rate depending on the training epoch to
#     facilitate accelerated learning during the first few epochs
# ModelCheckpoint: Save the model after each epoch (if `save_best_only` is set to True, only keep
#     the best model with regard to the monitored quantity)
# TensorBoard: Enable TensorBoard visualization
callbacks = [tf.keras.callbacks.EarlyStopping(monitor='val_loss',
                                              min_delta=0.01,
                                              patience=3),
             tf.keras.callbacks.LearningRateScheduler(lambda epoch: 1e-03 if epoch < 3 else 1e-04),
             tf.keras.callbacks.ModelCheckpoint(filepath=CKPT_DIR + 'compnet_coarse',
                                                monitor='val_loss',
                                                verbose=False,
                                                save_best_only=True),
             tf.keras.callbacks.TensorBoard(log_dir=LOG_DIR,
                                            histogram_freq=1)]
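
The interplay of the step-wise learning rate schedule and the patience-based stopping rule configured above can be illustrated without TensorFlow. The following is a minimal sketch that mirrors the parameter values used in this notebook (`min_delta=0.01`, `patience=3`, switch from `1e-03` to `1e-04` after three epochs). The function names `step_schedule` and `early_stopping_epoch` are hypothetical helpers introduced for illustration; they are a simplified model of the callbacks' behavior, not the Keras implementation.

```python
import math


def step_schedule(epoch, warmup_epochs=3, high_lr=1e-3, low_lr=1e-4):
    """Learning rate for a zero-based epoch index (hypothetical helper)."""
    return high_lr if epoch < warmup_epochs else low_lr


def early_stopping_epoch(val_losses, min_delta=0.01, patience=3):
    """Zero-based epoch at which patience-based stopping would trigger.

    An epoch only counts as an improvement if the validation loss drops by
    more than `min_delta` below the best value seen so far. Returns None if
    the stopping criterion is never met.
    """
    best = math.inf
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Made-up validation losses: the last three improvements are below `min_delta`
losses = [1.00, 0.90, 0.895, 0.893, 0.892, 0.80]
print([step_schedule(epoch) for epoch in range(5)])
print(early_stopping_epoch(losses))
```

With these toy losses, training would stop before the final value of 0.80 is ever reached, which shows how a small `patience` trades potential late improvements for shorter training runs.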

In [None]:
coarse_model.fit(x=cifar100_train_coarse,
                 epochs=100,
                 verbose=True,
                 callbacks=callbacks,
                 validation_data=cifar100_val_coarse,
                 shuffle=True,
                 validation_freq=1)

#### Fine models

In [None]:
cifar100_train_fine = [process_and_augment(cifar100_train_raw, batch_size=32, synset_level=CIFAR100_FINE_LABEL_LEVEL, hypernym=cifar100_decode(category, CIFAR100_COARSE_LABEL_LEVEL), is_train=True, num_rnd_crops=5, shuffle_buffer_size=1000, num_epochs=1) for category in range(CIFAR100_NUM_COARSE_LABELS)] 
cifar100_val_fine = [process_and_augment(cifar100_val_raw, batch_size=32, synset_level=CIFAR100_FINE_LABEL_LEVEL, hypernym=cifar100_decode(category, CIFAR100_COARSE_LABEL_LEVEL), is_train=True, num_rnd_crops=5, shuffle_buffer_size=1000, num_epochs=1) for category in range(CIFAR100_NUM_COARSE_LABELS)] 

In [None]:
# Count the number of batches in each training and validation set for the different sub-modules
# (iterating a batched dataset yields batches, not individual examples) and determine the minimum
# among them (used as a restrictive factor during the training procedure; cf. below)

dataset_length_train = []
dataset_length_val = []

for dataset in cifar100_train_fine:
    dataset_length = 0
    for _ in dataset:
        dataset_length += 1

    dataset_length_train.append(dataset_length)

min_dataset_length_train = min(dataset_length_train)
        
for dataset in cifar100_val_fine:
    dataset_length = 0
    for _ in dataset:
        dataset_length += 1

    dataset_length_val.append(dataset_length)

min_dataset_length_val = min(dataset_length_val)

In [None]:
for idx in range(CIFAR100_NUM_COARSE_LABELS):
    # EarlyStopping: Stop training early if no significant improvement in the monitored quantity is
    #     observed for at least `patience` epochs
    # LearningRateScheduler: Dynamically adapt the learning rate depending on the training epoch to
    #     facilitate accelerated learning during the first few epochs
    # ModelCheckpoint: Save the model after each epoch (if `save_best_only` is set to True, only keep
    #     the best model with regard to the monitored quantity)
    # TensorBoard: Enable TensorBoard visualization
    callbacks = [tf.keras.callbacks.EarlyStopping(monitor='val_loss',
                                                  min_delta=0.01,
                                                  patience=3),
                 tf.keras.callbacks.LearningRateScheduler(lambda epoch: 1e-03 if epoch < 3 else 1e-04),
                 tf.keras.callbacks.ModelCheckpoint(filepath=CKPT_DIR + 'compnet_fine_' + str(idx),
                                                    monitor='val_loss',
                                                    verbose=False,
                                                    save_best_only=True),
                 tf.keras.callbacks.TensorBoard(log_dir=LOG_DIR,
                                                histogram_freq=1)]
    
    fine_models[idx].fit(x=cifar100_train_fine[idx].take(min_dataset_length_train),
                         epochs=100,
                         verbose=True,
                         callbacks=callbacks,
                         validation_data=cifar100_val_fine[idx].take(min_dataset_length_val),
                         shuffle=True,
                         validation_freq=1)
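
Truncating every sub-module's dataset to the length of the shortest one means that each sub-module sees the same number of batches per epoch. The idea can be sketched framework-free, with `itertools.islice` standing in for `Dataset.take` and plain lists standing in for the batched datasets. The helper names `count_batches` and `truncate_to_shortest` are assumptions made for this illustration only.

```python
from itertools import islice


def count_batches(batches):
    """Count elements by exhausting the iterable once."""
    return sum(1 for _ in batches)


def truncate_to_shortest(datasets):
    """Cut every dataset down to the length of the shortest one."""
    min_len = min(count_batches(dataset) for dataset in datasets)
    truncated = [list(islice(dataset, min_len)) for dataset in datasets]
    return truncated, min_len


# Three toy "datasets" with 3, 2 and 4 batches respectively
toy_datasets = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
truncated, min_len = truncate_to_shortest(toy_datasets)
print(min_len, truncated)
```

The price of this equalization is that batches beyond the minimum length are not drawn within one epoch for the larger sub-sets.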

### Testing
<a id='compnet_test'></a>

In [None]:
cifar100_test_coarse = process_and_augment(cifar100_test_raw, batch_size=10, synset_level=CIFAR100_COARSE_LABEL_LEVEL, is_train=False)
cifar100_test_fine = [process_and_augment(cifar100_test_raw, batch_size=10, synset_level=CIFAR100_FINE_LABEL_LEVEL, hypernym=cifar100_decode(category, CIFAR100_COARSE_LABEL_LEVEL), is_train=False) for category in range(CIFAR100_NUM_COARSE_LABELS)]
cifar100_test_composed = process_and_augment(cifar100_test_raw, batch_size=10, synset_level=CIFAR100_FINE_LABEL_LEVEL, hypernym=None, is_train=False)

#### Coarse model

In [None]:
# Define metrics to watch during the evaluation of the model on the test data set

# Use the same metrics as for the training
# test_metrics = metrics

# Use different metrics than during the training
test_metrics = [tf.keras.metrics.CategoricalAccuracy(name='CategoricalAccuracy'),
                tf.keras.metrics.CategoricalCrossentropy(name='CategoricalCrossentropy'),
                tf.keras.metrics.AUC(name='AUC')]

In [None]:
# Reset all metrics before starting the evaluation
for metric in test_metrics:
    metric.reset_states()

for (imgs, ground_truths) in cifar100_test_coarse:
    # Generate predictions for each image in the current batch
    batch_scores = coarse_model.predict(imgs)
    
    # Since we constructed our test data set in a way that each image in one batch is an augmented
    # version of the same base image, we can simply average the individual scores to get the final
    # prediction for the respective base image in adherence to (Goodfellow et al., 2013) and
    # (Krizhevsky et al., 2012).
    prediction = tf.math.reduce_mean(batch_scores, axis=0)
    
    # Update the metrics w/ the result for the current base image; as all images in one batch
    # originate from the same base image (cf. above), the ground truth is hence identical as well.
    for metric in test_metrics:
        metric.update_state(ground_truths[0], prediction)

print('Coarse Model')
print()
print('==================================================')
print()
print('Final results:')
for metric in test_metrics:
    print('{}: {}'.format(metric.name, metric.result().numpy()))
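
The averaging step in the loop above (one batch holds several augmented views of the same base image) can be illustrated with a small, self-contained sketch. The scores below are made up, and `average_scores` and `argmax` are hypothetical helpers that stand in for `tf.math.reduce_mean(..., axis=0)` and `tf.math.argmax`.

```python
def average_scores(batch_scores):
    """Average per-class scores over all augmented views of one base image."""
    num_views = len(batch_scores)
    num_classes = len(batch_scores[0])
    return [sum(view[k] for view in batch_scores) / num_views
            for k in range(num_classes)]


def argmax(values):
    """Index of the largest entry (first one wins on ties)."""
    return max(range(len(values)), key=values.__getitem__)


# Three augmented views of the same image, three classes (made-up scores)
views = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],  # on its own, this view would vote for class 1
    [0.5, 0.4, 0.1],
]
mean_scores = average_scores(views)
print(mean_scores, argmax(mean_scores))
```

Averaging lets the two views that agree on class 0 outweigh the single dissenting view, which is the motivation for evaluating on several crops of the same image.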

#### Fine models

In [None]:
for idx in range(CIFAR100_NUM_COARSE_LABELS):
    # Define metrics to watch during the evaluation of the model on the test data set

    # Use the same metrics as for the training
    # test_metrics = metrics

    # Use different metrics than during the training
    test_metrics = [tf.keras.metrics.CategoricalAccuracy(name='CategoricalAccuracy'),
                    tf.keras.metrics.CategoricalCrossentropy(name='CategoricalCrossentropy'),
                    tf.keras.metrics.AUC(name='AUC')]

    # Reset all metrics before starting the evaluation
    for metric in test_metrics:
        metric.reset_states()

    for (imgs, ground_truths) in cifar100_test_fine[idx]:
        # Generate predictions for each image in the current batch
        batch_scores = fine_models[idx].predict(imgs)

        # Since we constructed our test data set in a way that each image in one batch is an augmented
        # version of the same base image, we can simply average the individual scores to get the final
        # prediction for the respective base image in adherence to (Goodfellow et al., 2013) and
        # (Krizhevsky et al., 2012).
        prediction = tf.math.reduce_mean(batch_scores, axis=0)

        # Update the metrics w/ the result for the current base image; as all images in one batch
        # originate from the same base image (cf. above), the ground truth is hence identical as well.
        for metric in test_metrics:
            metric.update_state(ground_truths[0], prediction)

    print('Fine Model #{}'.format(cifar100_decode(idx, CIFAR100_COARSE_LABEL_LEVEL)))
    print()
    print('==================================================')
    print()
    print('Final results:')
    for metric in test_metrics:
        print('{}: {}'.format(metric.name, metric.result().numpy()))
    print()
    print('====================================================================================================')
    print()

#### Composed model

In [None]:
# Define metrics to watch during the evaluation of the model on the test data set

# Use the same metrics as for the training
# test_metrics = metrics

# Use different metrics than during the training
test_metrics = [tf.keras.metrics.CategoricalAccuracy(name='CategoricalAccuracy'),
                tf.keras.metrics.CategoricalCrossentropy(name='CategoricalCrossentropy'),
                tf.keras.metrics.AUC(name='AUC')]

In [None]:
# Reset all metrics before starting the evaluation
for metric in test_metrics:
    metric.reset_states()
    
# Initialize additional custom metrics to watch during the evaluation

# Overall label distribution (predicted and ground truth)
dist_ground_truth = [0] * CIFAR100_NUM_FINE_LABELS
dist_predicted = [0] * CIFAR100_NUM_FINE_LABELS

# Predicted label distribution for each coarse category
dist_predicted_coarse = [[0] * CIFAR100_NUM_COARSE_LABELS for _ in range(CIFAR100_NUM_COARSE_LABELS)]

# Predicted label distribution for each fine category
dist_predicted_fine = [[0] * CIFAR100_NUM_FINE_LABELS for _ in range(CIFAR100_NUM_FINE_LABELS)]

# Fine label distribution for each coarse category (predicted and ground truth)
dist_ground_truth_coarse_fine = [[0] * CIFAR100_NUM_FINE_LABELS for _ in range(CIFAR100_NUM_COARSE_LABELS)]
dist_predicted_coarse_fine = [[0] * CIFAR100_NUM_FINE_LABELS for _ in range(CIFAR100_NUM_COARSE_LABELS)]

# Semantic distance between the predicted category and the ground truth in accordance with
# (Fergus et al., 2010)
semantic_distance = 0.0

In [None]:
for (imgs, ground_truths) in cifar100_test_composed:
    # Generate predictions for each image in the current batch from the coarse (parental) model
    batch_scores_coarse = coarse_model.predict(imgs)
    
    # Since we constructed our test data set in a way that each image in one batch is an augmented
    # version of the same base image, we can simply average the individual scores to get the final
    # prediction for the respective base image in adherence to (Goodfellow et al., 2013) and
    # (Krizhevsky et al., 2012).
    prediction_coarse = tf.math.reduce_mean(batch_scores_coarse, axis=0)
    
    # Route the current batch to the sub-module predicted by the coarse model and generate the
    # predictions for the respective fine category labels
    batch_scores_fine = fine_models[tf.math.argmax(prediction_coarse)].predict(imgs)
    
    # Average the scores for the individual images in the batch as described above
    prediction_fine = tf.math.reduce_mean(batch_scores_fine, axis=0)
    
    # Update the metrics w/ the result for the current base image; as all images in one batch
    # originate from the same base image (cf. above), the ground truth is hence identical as well.
    for metric in test_metrics:
        metric.update_state(ground_truths[0], prediction_fine)
        
    # Update custom metrics w/ the result for the current base image; cf. above concerning the
    # ground truth for each batch
    
    ground_truth_fine = tf.math.argmax(ground_truths[0]).numpy()
    ground_truth_fine_decoded = cifar100_decode(ground_truth_fine, CIFAR100_FINE_LABEL_LEVEL)
    
    ground_truth_coarse = cifar100_resolve_hypernym(ground_truth_fine, level=CIFAR100_COARSE_LABEL_LEVEL, encoded=True)
    
    prediction_fine = tf.math.argmax(prediction_fine).numpy()
    prediction_fine_decoded = cifar100_decode(prediction_fine, CIFAR100_FINE_LABEL_LEVEL)
    
    prediction_coarse = tf.math.argmax(prediction_coarse).numpy()
    
    dist_ground_truth[ground_truth_fine] += 1
    dist_predicted[prediction_fine] += 1

    dist_predicted_coarse[ground_truth_coarse][prediction_coarse] += 1

    dist_predicted_fine[ground_truth_fine][prediction_fine] += 1

    dist_ground_truth_coarse_fine[ground_truth_coarse][ground_truth_fine] += 1
    dist_predicted_coarse_fine[ground_truth_coarse][prediction_fine] += 1
    
    semantic_distance += CIFAR100_SYNSET_MAP.semantic_distance(
        ground_truth_fine_decoded,
        prediction_fine_decoded
    )

print('Composed Model')
print()
print('==================================================')
print()
print('Final results:')
for metric in test_metrics:
    print('{}: {}'.format(metric.name, metric.result().numpy()))
print('Semantic distance: {}'.format(semantic_distance))
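
The two-stage routing used in the evaluation loop above can be sketched independently of Keras. In the following minimal example, the coarse model and the sub-modules are plain Python callables that return one score vector per augmented view; `predict_composed`, `mean_over_views` and the toy models are hypothetical stand-ins, not the notebook's actual models.

```python
def argmax(values):
    """Index of the largest entry (first one wins on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def mean_over_views(batch_scores):
    """Average per-class scores over all views in the batch."""
    num_views = len(batch_scores)
    return [sum(column) / num_views for column in zip(*batch_scores)]


def predict_composed(views, coarse_model, fine_models):
    """Route a batch of views through the coarse model to one sub-module."""
    coarse_scores = mean_over_views(coarse_model(views))
    coarse_idx = argmax(coarse_scores)
    # Only the sub-module selected by the coarse model is evaluated
    fine_scores = mean_over_views(fine_models[coarse_idx](views))
    return coarse_idx, argmax(fine_scores)


# Toy stand-ins: two coarse categories, two fine labels per sub-module
toy_coarse = lambda views: [[0.2, 0.8] for _ in views]
toy_fine = [
    lambda views: [[0.9, 0.1] for _ in views],
    lambda views: [[0.3, 0.7] for _ in views],
]
print(predict_composed(["view_a", "view_b"], toy_coarse, toy_fine))
```

A consequence of this design is that a wrong coarse decision cannot be corrected downstream: the selected sub-module only knows the fine labels of its own category. How a sub-module's output index maps to a global fine label depends on how its output layer is defined, which this sketch does not model.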

In [None]:
# Store the results
# Note: Plain Python lists of ints are converted to int32 tensors by default. The results are
# therefore cast explicitly to the dtypes expected when loading them again (int64 resp. float32).

tf.io.write_file(
    CIFAR100_RESULTS_DIR_COMPNET + 'dist_ground_truth.data',
    tf.io.serialize_tensor(tf.cast(dist_ground_truth, tf.dtypes.int64)))

tf.io.write_file(
    CIFAR100_RESULTS_DIR_COMPNET + 'dist_predicted.data',
    tf.io.serialize_tensor(tf.cast(dist_predicted, tf.dtypes.int64)))

tf.io.write_file(
    CIFAR100_RESULTS_DIR_COMPNET + 'dist_predicted_coarse.data',
    tf.io.serialize_tensor(tf.cast(dist_predicted_coarse, tf.dtypes.int64)))

tf.io.write_file(
    CIFAR100_RESULTS_DIR_COMPNET + 'dist_predicted_fine.data',
    tf.io.serialize_tensor(tf.cast(dist_predicted_fine, tf.dtypes.int64)))

tf.io.write_file(
    CIFAR100_RESULTS_DIR_COMPNET + 'dist_ground_truth_coarse_fine.data',
    tf.io.serialize_tensor(tf.cast(dist_ground_truth_coarse_fine, tf.dtypes.int64)))

tf.io.write_file(
    CIFAR100_RESULTS_DIR_COMPNET + 'dist_predicted_coarse_fine.data',
    tf.io.serialize_tensor(tf.cast(dist_predicted_coarse_fine, tf.dtypes.int64)))

tf.io.write_file(
    CIFAR100_RESULTS_DIR_COMPNET + 'semantic_distance.data',
    tf.io.serialize_tensor(tf.cast(semantic_distance, tf.dtypes.float32)))

In [None]:
# Load the results

dist_ground_truth = tf.io.parse_tensor(
    tf.io.read_file(
        CIFAR100_RESULTS_DIR_COMPNET + 'dist_ground_truth.data'),
        tf.dtypes.int64)

dist_predicted = tf.io.parse_tensor(
    tf.io.read_file(
        CIFAR100_RESULTS_DIR_COMPNET + 'dist_predicted.data'),
        tf.dtypes.int64)

dist_predicted_coarse = tf.io.parse_tensor(
    tf.io.read_file(
        CIFAR100_RESULTS_DIR_COMPNET + 'dist_predicted_coarse.data'),
        tf.dtypes.int64)

dist_predicted_fine = tf.io.parse_tensor(
    tf.io.read_file(
        CIFAR100_RESULTS_DIR_COMPNET + 'dist_predicted_fine.data'),
        tf.dtypes.int64)

dist_ground_truth_coarse_fine = tf.io.parse_tensor(
    tf.io.read_file(
        CIFAR100_RESULTS_DIR_COMPNET + 'dist_ground_truth_coarse_fine.data'),
        tf.dtypes.int64)

dist_predicted_coarse_fine = tf.io.parse_tensor(
    tf.io.read_file(
        CIFAR100_RESULTS_DIR_COMPNET + 'dist_predicted_coarse_fine.data'),
        tf.dtypes.int64)

semantic_distance = tf.io.parse_tensor(
    tf.io.read_file(
        CIFAR100_RESULTS_DIR_COMPNET + 'semantic_distance.data'),
        tf.dtypes.float32)

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(21, 5.5))
fig.suptitle('Predicted vs Ground Truth Distribution')

titles = ['Predicted', 'Ground Truth']

for idx, dist in enumerate([dist_predicted, dist_ground_truth]):
    # Normalize the distribution
    dist = list(map(lambda entry: entry / sum(dist) * 100, dist)) 
                
    # Plot the distribution
    ax[idx].bar(range(1, len(dist) + 1), dist, width=1)
    ax[idx].set_title(titles[idx])
    ax[idx].set_xticks(range(0, CIFAR100_NUM_FINE_LABELS, 10))
    ax[idx].set_xlim([0.5, 100.5])
    ax[idx].set_ylim([0, 3.3])
    ax[idx].set_xlabel('Label')
    ax[idx].set_ylabel('Share in %')

fig.savefig('cifar100_compnet_dist_pred_groundTruth.jpg', format='jpg')

In [None]:
fig, ax = plt.subplots(10, 2, figsize=(21, 65.5))
fig.suptitle('Predicted Coarse Distribution per Coarse Ground Truth', y=0.88885)

for label in range(CIFAR100_NUM_COARSE_LABELS):
    row = int(label / 2)
    col = label % 2

    # Normalize the distribution
    dist = list(map(
        lambda entry: entry / sum(dist_predicted_coarse[label]) * 100, dist_predicted_coarse[label])) 

    # Plot the distribution
    ax[row][col].bar(range(1, len(dist) + 1), dist, width=1)
    ax[row][col].set_title(
        '{}'.format(cifar100_decode(label, CIFAR100_COARSE_LABEL_LEVEL))
    )
    ax[row][col].set_xticks(range(0, CIFAR100_NUM_COARSE_LABELS, 2))
    ax[row][col].set_xlim([0.5, 20.5])
    ax[row][col].set_ylim([0, 100])
    ax[row][col].set_xlabel('Label')
    ax[row][col].set_ylabel('Share in %')

fig.savefig('cifar100_compnet_dist_pred_coarse_per_coarse_groundTruth.jpg', format='jpg')
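
The per-row normalization used for these plots turns absolute counts into percentage shares. A minimal sketch of that step is given below; `row_percentages` is a hypothetical helper that computes the row total once and, unlike the inline `map` expression, also guards against an all-zero row (a ground-truth category without any test example).

```python
def row_percentages(counts):
    """Convert one row of absolute counts into percentage shares."""
    total = sum(counts)
    if total == 0:
        # No examples for this ground truth: avoid a division by zero
        return [0.0] * len(counts)
    return [100.0 * count / total for count in counts]


# Toy confusion matrix: rows are ground truths, columns are predictions
confusion = [[1, 3], [0, 0]]
for row in confusion:
    print(row_percentages(row))
```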

In [None]:
fig, ax = plt.subplots(50, 2, figsize=(21, 330.5))
fig.suptitle('Predicted Fine Distribution per Fine Ground Truth', y=0.88175)

for label in range(CIFAR100_NUM_FINE_LABELS):
    row = int(label / 2)
    col = int(label % 2)
    
    # Normalize the distribution
    dist = list(map(
        lambda entry: entry / sum(dist_predicted_fine[label]) * 100, dist_predicted_fine[label])) 

    # Plot the distribution
    ax[row][col].bar(range(1, len(dist) + 1), dist, width=1)
    ax[row][col].set_title(
        '{}'.format(cifar100_decode(label, CIFAR100_FINE_LABEL_LEVEL))
    )
    ax[row][col].set_xticks(range(0, CIFAR100_NUM_FINE_LABELS, 10))
    ax[row][col].set_xlim([0.5, 100.5])
    ax[row][col].set_ylim([0, 100])
    ax[row][col].set_xlabel('Label')
    ax[row][col].set_ylabel('Share in %')

fig.savefig('cifar100_compnet_dist_pred_fine_per_fine_groundTruth.jpg', format='jpg')

In [None]:
fig, ax = plt.subplots(20, 2, figsize=(21, 130.5))
fig.suptitle('Predicted vs Ground Truth Fine Distribution per Coarse Ground Truth', y=0.88425)

for label in range(CIFAR100_NUM_COARSE_LABELS):
    titles = ['Predicted for {}'.format(cifar100_decode(label, CIFAR100_COARSE_LABEL_LEVEL)),
          'Ground Truth for {}'.format(cifar100_decode(label, CIFAR100_COARSE_LABEL_LEVEL))]

    for idx, dist in enumerate([dist_predicted_coarse_fine, dist_ground_truth_coarse_fine]):
        # Normalize the distribution
        dist = list(map(lambda entry: entry / sum(dist[label]) * 100, dist[label])) 
                    
        # Plot the distribution
        ax[label][idx].bar(range(1, len(dist) + 1), dist, width=1)
        ax[label][idx].set_title(titles[idx])
        ax[label][idx].set_xticks(range(0, CIFAR100_NUM_FINE_LABELS, 10))
        ax[label][idx].set_xlim([0.5, 100.5])
        ax[label][idx].set_ylim([0, 50])
        ax[label][idx].set_xlabel('Label')
        ax[label][idx].set_ylabel('Share in %')

fig.savefig('cifar100_compnet_dist_pred_groundTruth_fine_per_coarse_groundTruth.jpg', format='jpg')

### Predictive uncertainty under dataset shift
<a id='compnet_predictive_uncertainty'></a>

In this section, we examine how the composed model's predictive uncertainty behaves under dataset shift. Methodologically, we follow the basic approach described by Ovadia et al. (2019): the corrupted CIFAR100-C dataset (Hendrycks et al., 2019) serves as the shifted in-distribution data, and the CIFAR10 dataset serves as the out-of-distribution (OOD) reference.

#### Loading the datasets

CIFAR100-C dataset

In [None]:
# Dataset specific configuration

# Storage location of the CIFAR100-C dataset
CIFAR100_C_DATA_DIR = TFDS_DATA_DIR + 'cifar100/corrupted/'

# List of corruption types to include in the dataset on load
# Note: For a comprehensive list of all corruption types, cf. https://github.com/hendrycks/robustness.
CIFAR100_C_CORRUPTIONS = [
    'defocus_blur',
    'gaussian_blur',
    'glass_blur',
    'motion_blur',
    'zoom_blur',
    'gaussian_noise',
    'impulse_noise',
    'shot_noise',
    'speckle_noise',
    'fog',
    'frost',
    'snow',
    'brightness',
    'contrast',
    'saturate',
    'elastic_transform',
    'jpeg_compression',
    'pixelate',
    'spatter'
]

# Indicator whether to include the uncorrupted dataset for reference in addition to the above
# corruptions during the subsequent evaluation
CIFAR100_C_INCLUDE_UNCORRUPTED = True

# Number of severity grades (five corruption levels plus, if enabled, the uncorrupted reference)
# and number of samples per severity grade of a single corruption
CIFAR100_C_NUM_SEVERITY_GRADES = (5 * (1 if CIFAR100_C_CORRUPTIONS else 0) + CIFAR100_C_INCLUDE_UNCORRUPTED)
CIFAR100_C_NUM_SAMPLES_PER_SEVERITY = 10000

In [None]:
# Load the CIFAR100-C dataset

# Template of the structure of the individual records to apply when parsing the serialized dataset
record_template = {
    CIFAR100_IMG_KEY: tf.io.FixedLenFeature([], tf.dtypes.string, default_value=''),
    CIFAR100_FINE_LABEL_KEY: tf.io.FixedLenFeature([], tf.dtypes.int64, default_value=-1)
}

# Load the dataset including all corruptions specified by `CIFAR100_C_CORRUPTIONS`
cifar100_c_test_raw = tf.data.TFRecordDataset([CIFAR100_C_DATA_DIR + corruption + '.tfrecord' for corruption in CIFAR100_C_CORRUPTIONS])

# Parse the serialized `tf.train.Example` protos
cifar100_c_test_raw = cifar100_c_test_raw.map(
    lambda record: tf.io.parse_single_example(record, record_template),
    num_parallel_calls=tf.data.experimental.AUTOTUNE)

# JPEG-decode the images
cifar100_c_test_raw = cifar100_c_test_raw.map(
    lambda record: {CIFAR100_IMG_KEY: tf.io.decode_jpeg(record[CIFAR100_IMG_KEY]),
                    CIFAR100_FINE_LABEL_KEY: record[CIFAR100_FINE_LABEL_KEY]},
    num_parallel_calls=tf.data.experimental.AUTOTUNE)

CIFAR10 dataset

In [None]:
# Dataset specific configuration

# Number of samples in the test dataset
CIFAR10_NUM_TEST_SAMPLES = 10000

In [None]:
cifar10_test_raw, cifar10_info = tfds.load('cifar10',
                                           split='test',
                                           data_dir=TFDS_DATA_DIR,
                                           download_and_prepare_kwargs={'download_dir': TFDS_DOWNLOAD_DIR},
                                           with_info=True)
print(cifar10_info)

#### Distributional shift

In [None]:
cifar100_c_test = [None] * CIFAR100_C_NUM_SEVERITY_GRADES

if CIFAR100_C_INCLUDE_UNCORRUPTED:
    # Include the uncorrupted test dataset for reference
    cifar100_c_test[0] = process_and_augment(cifar100_test_raw, batch_size=10, synset_level=CIFAR100_FINE_LABEL_LEVEL, hypernym=None, is_train=False)

# Construct one test dataset per severity grade
for corruption in range(len(CIFAR100_C_CORRUPTIONS)):
    print('Corruption: {}'.format(corruption))

    # The corruption files only hold the five corrupted severity grades; the uncorrupted reference
    # (if included) is not part of them and must hence not be iterated over here
    num_corrupted_grades = CIFAR100_C_NUM_SEVERITY_GRADES - CIFAR100_C_INCLUDE_UNCORRUPTED

    for severity in range(num_corrupted_grades):
        print('Severity: {}'.format(severity))

        # As the corruption files are read in consecutively and each file contains
        # `CIFAR100_C_NUM_SAMPLES_PER_SEVERITY` * `num_corrupted_grades` records with the
        # first `CIFAR100_C_NUM_SAMPLES_PER_SEVERITY` samples being of the lowest severity grade
        # followed by `CIFAR100_C_NUM_SAMPLES_PER_SEVERITY` samples of the second severity grade, etc.,
        # we can construct a dataset containing only samples of corruption `corruption` and severity
        # grade `severity` as follows:
        dataset = cifar100_c_test_raw.enumerate().filter(
            lambda idx, _: idx // CIFAR100_C_NUM_SAMPLES_PER_SEVERITY == corruption * num_corrupted_grades + severity)
        
        # Remove the index residue from the previous filtering operation
        dataset = dataset.map(
            lambda _, record: record,
            num_parallel_calls=tf.data.experimental.AUTOTUNE)
        
        dataset = process_and_augment(dataset, batch_size=10, synset_level=CIFAR100_FINE_LABEL_LEVEL, hypernym=None, is_train=False)
        
        idx = severity + CIFAR100_C_INCLUDE_UNCORRUPTED
        if cifar100_c_test[idx] is None:
            cifar100_c_test[idx] = dataset
        else:
            cifar100_c_test[idx] = cifar100_c_test[idx].concatenate(dataset)
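
The index arithmetic behind the filter above can be checked in isolation. The sketch below assumes the file layout described in the comments (files concatenated in order, five severity grades per file, 10000 records per grade); `locate` is a hypothetical helper that maps a running record index to its corruption and severity.

```python
SAMPLES_PER_SEVERITY = 10000
SEVERITIES_PER_CORRUPTION = 5


def locate(record_idx):
    """Map a running record index to (corruption index, severity index)."""
    # Each block of SAMPLES_PER_SEVERITY records shares corruption and severity
    block = record_idx // SAMPLES_PER_SEVERITY
    return divmod(block, SEVERITIES_PER_CORRUPTION)


for record_idx in (0, 49999, 50000, 123456):
    print(record_idx, locate(record_idx))
```

A record belongs to corruption `c` at severity `s` exactly when its block index equals `c * 5 + s`, which is the condition the filter evaluates.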

In [None]:
# Initialize metrics to watch during the evaluation in accordance with (Ovadia et al., 2019)

# Confidence
confidence = [[] for _ in range(CIFAR100_C_NUM_SEVERITY_GRADES)]

# Categorical accuracy
cat_acc = [[] for _ in range(CIFAR100_C_NUM_SEVERITY_GRADES)]

# Negative log-likelihood
neg_log_likelihood = [[] for _ in range(CIFAR100_C_NUM_SEVERITY_GRADES)]

# Brier score
brier_score = [[] for _ in range(CIFAR100_C_NUM_SEVERITY_GRADES)]

# Predictive entropy
pred_entropy = [[] for _ in range(CIFAR100_C_NUM_SEVERITY_GRADES)]
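
For reference, the five per-example quantities initialized above can be written down as follows, matching how this notebook's evaluation loop computes them. The notation is ours and not taken from the cited papers: $p \in [0, 1]^K$ denotes the averaged prediction for one base image, $y \in \{0, 1\}^K$ the one-hot ground truth, and $K$ the number of classes.

```latex
\begin{align}
  \mathrm{confidence}(x) &= \max_{k} \, p_k \\
  \mathrm{accuracy}(x)   &= \mathbb{1}\left[\arg\max_{k} p_k = \arg\max_{k} y_k\right] \\
  \mathrm{NLL}(x)        &= -\log p_{c}, \qquad c = \arg\max_{k} y_k \\
  \mathrm{Brier}(x)      &= \sum_{k=1}^{K} \left(p_k - y_k\right)^2 \\
  H(x)                   &= -\sum_{k=1}^{K} p_k \log p_k
\end{align}
```

Confidence and accuracy are better when higher, NLL and Brier score are better when lower, and the predictive entropy $H$ should ideally grow as the input moves away from the training distribution.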

In [None]:
for severity in range(CIFAR100_C_NUM_SEVERITY_GRADES):
    print('Severity: {}'.format(severity))

    for idx, (imgs, ground_truths) in cifar100_c_test[severity].enumerate():
        # Generate predictions for each image in the current batch from the coarse (parental) model
        batch_scores_coarse = coarse_model.predict(imgs)
        
        # Since we constructed our test data set in a way that each image in one batch is an augmented
        # version of the same base image, we can simply average the individual scores to get the final
        # prediction for the respective base image in adherence to (Goodfellow et al., 2013) and
        # (Krizhevsky et al., 2012).
        prediction_coarse = tf.math.reduce_mean(batch_scores_coarse, axis=0)
        
        # Route the current batch to the sub-module predicted by the coarse model and generate the
        # predictions for the respective fine category labels
        batch_scores_fine = fine_models[tf.math.argmax(prediction_coarse)].predict(imgs)
        
        # Average the scores for the individual images in the batch as described above
        prediction_fine = tf.math.reduce_mean(batch_scores_fine, axis=0)
        
        # Update the metrics w/ the result for the current base image; as all images in one batch
        # originate from the same base image (cf. above), the ground truth is hence identical as well.
        
        ground_truth = ground_truths[0]
        ground_truth_max_idx = tf.math.argmax(ground_truth).numpy()

        prediction = prediction_fine
        prediction_max_idx = tf.math.argmax(prediction_fine).numpy()

        confidence[severity].append(
            prediction[prediction_max_idx])
        
        cat_acc[severity].append(tf.dtypes.cast(
            tf.math.equal(ground_truth_max_idx, prediction_max_idx), tf.dtypes.float32))

        neg_log_likelihood[severity].append(
            -tf.math.log(prediction[ground_truth_max_idx]))

        brier_score[severity].append(
            tf.math.reduce_sum((prediction - ground_truth) ** 2))

        pred_entropy[severity].append(
            -tf.math.reduce_sum(tf.map_fn(lambda p: p * tf.math.log(p), prediction)))

        if idx % 1000 == 0:
            print('{}'.format(idx))
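
The per-example metrics collected in the loop above can be reproduced on a toy prediction with the standard library alone. The sketch below is not the notebook's TensorFlow code: `uncertainty_metrics` is a hypothetical helper, and it deliberately treats `0 * log(0)` as `0` in the entropy term, whereas evaluating `p * log(p)` directly for `p = 0` in floating-point arithmetic does not yield a finite number.

```python
import math


def uncertainty_metrics(probs, true_idx):
    """Confidence, accuracy, NLL, Brier score and entropy for one prediction."""
    pred_idx = max(range(len(probs)), key=probs.__getitem__)
    p_true = probs[true_idx]
    return {
        "confidence": probs[pred_idx],
        "accuracy": 1.0 if pred_idx == true_idx else 0.0,
        # The NLL is unbounded if the true class receives zero probability
        "nll": -math.log(p_true) if p_true > 0 else math.inf,
        "brier": sum((p - (1.0 if k == true_idx else 0.0)) ** 2
                     for k, p in enumerate(probs)),
        # Skip zero entries, i.e. use the convention 0 * log(0) = 0
        "entropy": -sum(p * math.log(p) for p in probs if p > 0),
    }


# Toy prediction over three classes with one zero entry; true class is 0
metrics = uncertainty_metrics([0.5, 0.5, 0.0], true_idx=0)
for name, value in metrics.items():
    print(name, value)
```

Skipping zero entries keeps the entropy finite for predictions that assign exactly zero probability to some class, so no example has to be discarded or zeroed out afterwards.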

In [None]:
# Store the results

# Inflate the arrays in case only the uncorrupted test dataset was evaluated
if CIFAR100_C_INCLUDE_UNCORRUPTED and not CIFAR100_C_CORRUPTIONS:
    for _ in range(5):
        confidence.append([])
        cat_acc.append([])
        neg_log_likelihood.append([])
        brier_score.append([])
        pred_entropy.append([])
        
# Prepend a single placeholder for the uncorrupted dataset in case it was not evaluated
if not CIFAR100_C_INCLUDE_UNCORRUPTED and CIFAR100_C_CORRUPTIONS:
    confidence.insert(0, [])
    cat_acc.insert(0, [])
    neg_log_likelihood.insert(0, [])
    brier_score.insert(0, [])
    pred_entropy.insert(0, [])

tf.io.write_file(
    CIFAR100_RESULTS_DIR_COMPNET + 'confidence_shifted.data',
    tf.io.serialize_tensor(
        [tf.io.serialize_tensor(entry) for entry in confidence]))

tf.io.write_file(
    CIFAR100_RESULTS_DIR_COMPNET + 'cat_acc.data',
    tf.io.serialize_tensor(
        [tf.io.serialize_tensor(entry) for entry in cat_acc]))

tf.io.write_file(
    CIFAR100_RESULTS_DIR_COMPNET + 'neg_log_likelihood.data',
    tf.io.serialize_tensor(
        [tf.io.serialize_tensor(entry) for entry in neg_log_likelihood]))

tf.io.write_file(
    CIFAR100_RESULTS_DIR_COMPNET + 'brier_score.data',
    tf.io.serialize_tensor(
        [tf.io.serialize_tensor(entry) for entry in brier_score]))

tf.io.write_file(
    CIFAR100_RESULTS_DIR_COMPNET + 'pred_entropy_shifted.data',
    tf.io.serialize_tensor(
        [tf.io.serialize_tensor(entry) for entry in pred_entropy]))

In [None]:
# Load the results

confidence = [tf.io.parse_tensor(
    entry, tf.dtypes.float32) for entry in tf.io.parse_tensor(
        tf.io.read_file(
            CIFAR100_RESULTS_DIR_COMPNET + 'confidence_shifted.data'),
            tf.dtypes.string)]

cat_acc = [tf.io.parse_tensor(
    entry, tf.dtypes.float32) for entry in tf.io.parse_tensor(
        tf.io.read_file(
            CIFAR100_RESULTS_DIR_COMPNET + 'cat_acc.data'),
            tf.dtypes.string)]

neg_log_likelihood = [tf.io.parse_tensor(
    entry, tf.dtypes.float32) for entry in tf.io.parse_tensor(
        tf.io.read_file(
            CIFAR100_RESULTS_DIR_COMPNET + 'neg_log_likelihood.data'),
            tf.dtypes.string)]

brier_score = [tf.io.parse_tensor(
    entry, tf.dtypes.float32) for entry in tf.io.parse_tensor(
        tf.io.read_file(
            CIFAR100_RESULTS_DIR_COMPNET + 'brier_score.data'),
            tf.dtypes.string)]

pred_entropy = [tf.io.parse_tensor(
    entry, tf.dtypes.float32) for entry in tf.io.parse_tensor(
        tf.io.read_file(
            CIFAR100_RESULTS_DIR_COMPNET + 'pred_entropy_shifted.data'),
            tf.dtypes.string)]

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(21, 7.5))
fig.suptitle('Categorical Accuracy over varying Corruption Severities', y=0.9225)

bplot = ax.boxplot([val * 100 for val in cat_acc], sym='', showmeans= True, meanline=True, meanprops={'color':'C0'})

ax.set_xlabel('Corruption severity')
ax.set_ylabel('Accuracy in %')
ax.set_xticklabels(range(CIFAR100_C_NUM_SEVERITY_GRADES))

ax.legend([bplot["medians"][0], bplot["means"][0]], ['Median', 'Mean'], loc='upper right')

fig.savefig('uncertainty_compnet_ds_cat_acc.jpg', format='jpg')

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(21, 7.5))
fig.suptitle('Brier Score over varying Corruption Severities', y=0.9225)

bplot = ax.boxplot(brier_score, sym='', showmeans= True, meanline=True, meanprops={'color':'C0'})

ax.set_xlabel('Corruption severity')
ax.set_ylabel('Brier Score')
ax.set_xticklabels(range(CIFAR100_C_NUM_SEVERITY_GRADES))

ax.legend([bplot["medians"][0], bplot["means"][0]], ['Median', 'Mean'], loc='upper right')

fig.savefig('uncertainty_compnet_ds_brier_score.jpg', format='jpg')

In [None]:
fig, ax = plt.subplots(3, 2, figsize=(21, 19))
fig.suptitle('Confidence over varying Severity Grades', y=0.9109)

for idx in range(CIFAR100_C_NUM_SEVERITY_GRADES):
    row = int(idx / 2)
    col = idx % 2
    
    ax[row][col].hist(confidence[idx] * 100, bins=10000, cumulative=-1, histtype='step')

    ax[row][col].set_title('Severity Grade {}'.format(idx))
    ax[row][col].set_xlabel('Confidence ' + r'$\tau$' + ' in %')
    ax[row][col].set_ylabel('Number of examples P(x) > ' +  r'$\tau$')
    ax[row][col].set_xlim([0, 100])

fig.savefig('uncertainty_compnet_ds_confidence.jpg', format='jpg')
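
The reverse cumulative histograms above answer the question "how many examples are predicted with a confidence of at least $\tau$?". Up to the bin resolution, this corresponds to the following count; `count_at_least` is a hypothetical helper and the confidence values are made up.

```python
def count_at_least(values, thresholds):
    """For every threshold, count the values that reach or exceed it."""
    return [sum(1 for value in values if value >= threshold)
            for threshold in thresholds]


# Made-up confidences of four examples
confidences = [0.2, 0.5, 0.9, 0.9]
print(count_at_least(confidences, [0.0, 0.5, 0.9, 1.0]))
```

A well-behaved model should show this curve dropping earlier (fewer highly confident predictions) as the corruption severity increases.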

In [None]:
fig, ax = plt.subplots(3, 2, figsize=(21, 19))
fig.suptitle('Negative Log-likelihood over varying Severity Grades', y=0.9109)

for idx in range(CIFAR100_C_NUM_SEVERITY_GRADES):
    row = int(idx / 2)
    col = idx % 2
    
    ax[row][col].hist([min(val, 50) for val in neg_log_likelihood[idx]], bins=10000, cumulative=1, histtype='step')

    ax[row][col].set_title('Severity Grade {}'.format(idx))
    ax[row][col].set_xlabel('Negative log-likelihood ' + r'$l$')
    ax[row][col].set_ylabel('Number of examples L(x) ' + r'$\leq l$')
    ax[row][col].set_xlim([0, 50])

fig.savefig('uncertainty_compnet_ds_neg_log_likelihood.jpg', format='jpg')

In [None]:
fig, ax = plt.subplots(3, 2, figsize=(21, 19))
fig.suptitle('Predictive Entropy over varying Severity Grades', y=0.9109)

# Replace NaN entries in `pred_entropy` with zero
for idx, severity in enumerate(pred_entropy):
    pred_entropy[idx] = tf.where(tf.math.is_nan(severity), tf.zeros_like(severity), severity)

for idx in range(CIFAR100_C_NUM_SEVERITY_GRADES):
    row = int(idx / 2)
    col = idx % 2
    
    ax[row][col].hist(pred_entropy[idx], bins=10000, cumulative=-1, histtype='step')

    ax[row][col].set_title('Severity Grade {}'.format(idx))
    ax[row][col].set_xlabel('Entropy ' + r'$h$' + ' in nats')
    ax[row][col].set_ylabel('Number of examples H(x) > ' +  r'$h$')
    ax[row][col].set_xlim([0, 2])

fig.savefig('uncertainty_compnet_ds_pred_entropy.jpg', format='jpg')

In [None]:
WINDOW_SIZE = 10000

fig, ax = plt.subplots(3, 2, figsize=(21, 19))
fig.suptitle('Confidence vs Accuracy over varying Severity Grades', y=0.9109)

for idx in range(CIFAR100_C_NUM_SEVERITY_GRADES):
    row = int(idx / 2)
    col = idx % 2

    if idx == 0:
        window_size = int(WINDOW_SIZE / 10)
    else:
        window_size = WINDOW_SIZE

    # Sort confidence and categorical accuracy in ascending order
    confidence_sorted = [val * 100 for val in sorted(confidence[idx])]
    cat_acc_sorted = [val for _, val in sorted(zip(confidence[idx], cat_acc[idx]))]

    # Calculate the moving averages of the sorted categorical accuracy
    cat_acc_cumsum = tf.math.cumsum(cat_acc_sorted)
    cat_acc_moving_avgs = (cat_acc_cumsum[window_size:] - cat_acc_cumsum[:-window_size]) / window_size * 100

    ax[row][col].plot(confidence_sorted[window_size:], cat_acc_moving_avgs)
    
    ax[row][col].set_title('Severity Grade {}'.format(idx))
    ax[row][col].set_xlabel('Confidence > ' + r'$\tau$' + ' in %')
    ax[row][col].set_ylabel('Accuracy on examples P(x) > ' + r'$\tau$' + ' in %')
    ax[row][col].set_xlim([0, 100])
    ax[row][col].set_ylim([0, 100])

fig.savefig('uncertainty_compnet_ds_confidence_accuracy.jpg', format='jpg')
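
The moving average in the cell above is computed from cumulative sums, which avoids re-summing every window. A minimal pure-Python sketch of the same idea follows; `moving_average` is a hypothetical helper, and unlike the cell above it prepends a zero to the prefix sums so that the very first window is included as well:

```python
from itertools import accumulate


def moving_average(values, window_size):
    """Mean of every full window of `window_size` consecutive values."""
    # Prefix sums with a leading zero: prefix[i] == sum(values[:i])
    prefix = [0.0] + list(accumulate(values))
    return [(prefix[i + window_size] - prefix[i]) / window_size
            for i in range(len(values) - window_size + 1)]


# Correctness flags (1.0 = correct), assumed to be sorted by ascending confidence
correct_sorted = [0.0, 0.0, 1.0, 0.0, 1.0, 1.0]
print(moving_average(correct_sorted, window_size=3))
```

Each output value is the accuracy within one window of neighbouring confidence values, so a rising sequence indicates that higher confidence goes along with higher accuracy.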

#### Out-of-distribution (OOD)

In [None]:
cifar10_test = process_and_augment(cifar10_test_raw, batch_size=10, synset_level=CIFAR100_FINE_LABEL_LEVEL, hypernym=None, is_train=False)

In [None]:
# Initialize metrics to watch during the evaluation in accordance with (Ovadia et al., 2019)
# Note: As there is no ground truth for fully OOD samples, we only report confidence and
# predictive entropy for those.

# Confidence
confidence = [0] * CIFAR10_NUM_TEST_SAMPLES

# Predictive entropy
pred_entropy = [0] * CIFAR10_NUM_TEST_SAMPLES

In [None]:
for idx, (imgs, ground_truths) in cifar10_test.enumerate():
    # Generate predictions for each image in the current batch from the coarse (parental) model
    batch_scores_coarse = coarse_model.predict(imgs)
    
    # Since we constructed our test data set in a way that each image in one batch is an augmented
    # version of the same base image, we can simply average the individual scores to get the final
    # prediction for the respective base image in adherence to (Goodfellow et al., 2013) and
    # (Krizhevsky et al., 2012).
    prediction_coarse = tf.math.reduce_mean(batch_scores_coarse, axis=0)
    
    # Route the current batch to the sub-module predicted by the coarse model and generate the
    # predictions for the respective fine category labels
    batch_scores_fine = fine_models[tf.math.argmax(prediction_coarse)].predict(imgs)
    
    # Average the scores for the individual images in the batch as described above
    prediction_fine = tf.math.reduce_mean(batch_scores_fine, axis=0)
    
    # Update the metrics w/ the result for the current base image; as all images in one batch
    # originate from the same base image (cf. above), the ground truth is hence identical as well.
    
    ground_truth = ground_truths[0]
    ground_truth_max_idx = tf.math.argmax(ground_truth).numpy()

    prediction = prediction_fine
    prediction_max_idx = tf.math.argmax(prediction_fine).numpy()

    confidence[idx] = prediction[prediction_max_idx]

    pred_entropy[idx] = -tf.math.reduce_sum(tf.map_fn(lambda p: p * tf.math.log(p), prediction))
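
The predictive entropy used above is the Shannon entropy of the averaged softmax output, measured in nats. In floating-point arithmetic, a probability of exactly zero turns `p * log(p)` into NaN, which is presumably why the plotting cells replace NaN entries by zero. A minimal pure-Python sketch that sidesteps the problem by skipping zero probabilities (`predictive_entropy` is a hypothetical helper, not part of this notebook):

```python
import math


def predictive_entropy(probs):
    """Shannon entropy in nats, using the convention 0 * log(0) = 0."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# A uniform prediction over four classes has the maximum entropy log(4)
print(predictive_entropy([0.25, 0.25, 0.25, 0.25]))
# A fully confident prediction has zero entropy
print(predictive_entropy([1.0, 0.0, 0.0]))
```

For the 100 fine CIFAR100 labels, the entropy is therefore bounded above by log(100), roughly 4.6 nats.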

In [None]:
# Store the results

tf.io.write_file(
    CIFAR100_RESULTS_DIR_BENCHMARK + 'confidence_ood.data',
    tf.io.serialize_tensor(
        [tf.io.serialize_tensor(entry) for entry in confidence]))

tf.io.write_file(
    CIFAR100_RESULTS_DIR_BENCHMARK + 'pred_entropy_ood.data',
    tf.io.serialize_tensor(
        [tf.io.serialize_tensor(entry) for entry in pred_entropy]))

In [None]:
# Load the results

confidence = [tf.io.parse_tensor(
    entry, tf.dtypes.float32) for entry in tf.io.parse_tensor(
        tf.io.read_file(
            CIFAR100_RESULTS_DIR_BENCHMARK + 'confidence_ood.data'),
            tf.dtypes.string)]

pred_entropy = [tf.io.parse_tensor(
    entry, tf.dtypes.float32) for entry in tf.io.parse_tensor(
        tf.io.read_file(
            CIFAR100_RESULTS_DIR_BENCHMARK + 'pred_entropy_ood.data'),
            tf.dtypes.string)]

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(21, 7.5))
fig.suptitle('Confidence on OOD Data', y=0.9225)

ax.hist([val * 100 for val in confidence], bins=10000, cumulative=-1, histtype='step')

ax.set_xlabel('Confidence ' + r'$\tau$' + ' in %')
ax.set_ylabel('Number of examples P(x) > ' +  r'$\tau$')
ax.set_xlim([0, 100])

fig.savefig('uncertainty_compnet_ood_confidence.jpg', format='jpg')

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(21, 7.5))
fig.suptitle('Predictive Entropy on OOD Data', y=0.9225)

# Filter out NaN entries from `pred_entropy`
pred_entropy = tf.where(tf.math.is_nan(pred_entropy), tf.zeros_like(pred_entropy), pred_entropy)

ax.hist(pred_entropy, bins=10000, cumulative=-1, histtype='step')

ax.set_xlabel('Entropy ' + r'$h$' + ' in nats')
ax.set_ylabel('Number of examples H(x) > ' +  r'$h$')
ax.set_xlim([0, 2])

fig.savefig('uncertainty_compnet_ood_pred_entropy.jpg', format='jpg')

---
## Benchmark: Maxout Network (Goodfellow et al., 2013)
<a id='benchmark'></a>

As a comparative benchmark, a (monolithic) Maxout Network as described in (Goodfellow et al., 2013) was trained on the same preprocessed dataset. Hyperparameters were tuned experimentally since, to the best of my knowledge, the configuration used to achieve the results reported in the publication has not been made available.

Criteria that were taken into account when selecting the benchmark network's architecture:
- a) Simplicity / Leanness (i.e. no inferred knowledge in the architecture for the same [reasons](#compnet) mentioned when arguing about the composed model's architecture)
- b) Existence of reference values in literature
- c) Scalability (so that we could use a downscaled version for the composed model for comparative reasons)
- d) Contemporary performance levels

### Model
<a id='benchmark_model'></a>

See [above](#compnet_model_maxout) for the construction of MaxOut feature maps and the question of native API vs. custom layers.
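
In the model cell of this section, each MaxOut feature map is built by reshaping the convolution output and max-pooling along the channel axis with a pool size and stride of 4, so that every group of four consecutive channels is reduced to its maximum (e.g. 128 channels become 32). For a single pixel, this reduction can be sketched in pure Python as follows; `maxout` is a hypothetical helper used for illustration only:

```python
def maxout(channels, group_size):
    """Maximum over each group of `group_size` consecutive channel values."""
    if len(channels) % group_size != 0:
        raise ValueError('number of channels must be divisible by group_size')
    return [max(channels[i:i + group_size])
            for i in range(0, len(channels), group_size)]


# Eight linear (pre-activation) channel values of one pixel, pooled in groups of four
pixel = [1.0, 5.0, 2.0, 0.0, 3.0, 3.0, 9.0, -1.0]
print(maxout(pixel, group_size=4))
```

Because the convolutions use `activation=None`, the maximum over the group is the only non-linearity in each MaxOut layer.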

In [None]:
# Note: We employ a combination of Dropout and MaxNorm for weight regularization, which is
# reported in (Srivastava et al., 2014) to be most effective. Furthermore, Dropout can also be
# applied to the convolutional layers to facilitate the discovery of informative features as
# described in (Park et al., 2016); the corresponding layers are commented out below.

# Input specification
inputs = tf.keras.Input(shape=(28, 28, 3))

# First convolutional MaxOut layer
x = tf.keras.layers.Conv2D(filters=128,
                           kernel_size=(3, 3),
                           strides=1,
                           padding='same',
                           activation=None,
                           use_bias=True,
                           kernel_constraint='max_norm',
                           bias_constraint='max_norm')(inputs)
x = tf.keras.layers.Reshape((28, 28, 128, 1))(x)
x = tf.keras.layers.MaxPool3D(pool_size=(1, 1, 4),
                              strides=(1, 1, 4),
                              padding='same')(x)
x = tf.keras.layers.Reshape((28, 28, 32))(x)

# First MaxPool layer
x = tf.keras.layers.MaxPool2D(pool_size=(2, 2),
                              strides=(2, 2),
                              padding='same')(x)

# First Dropout layer
#x = tf.keras.layers.Dropout(rate=0.1)(x)

# Second convolutional MaxOut layer
x = tf.keras.layers.Conv2D(filters=256,
                           kernel_size=(3, 3),
                           strides=1,
                           padding='same',
                           activation=None,
                           use_bias=True,
                           kernel_constraint='max_norm',
                           bias_constraint='max_norm')(x)
x = tf.keras.layers.Reshape((14, 14, 256, 1))(x)
x = tf.keras.layers.MaxPool3D(pool_size=(1, 1, 4),
                              strides=(1, 1, 4),
                              padding='same')(x)
x = tf.keras.layers.Reshape((14, 14, 64))(x)

# Second MaxPool layer
x = tf.keras.layers.MaxPool2D(pool_size=(2, 2),
                              strides=(2, 2),
                              padding='same')(x)

# Second Dropout layer
#x = tf.keras.layers.Dropout(rate=0.1)(x)

# Third convolutional MaxOut layer
x = tf.keras.layers.Conv2D(filters=512,
                           kernel_size=(3, 3),
                           strides=1,
                           padding='same',
                           activation=None,
                           use_bias=True,
                           kernel_constraint='max_norm',
                           bias_constraint='max_norm')(x)
x = tf.keras.layers.Reshape((7, 7, 512, 1))(x)
x = tf.keras.layers.MaxPool3D(pool_size=(1, 1, 4),
                              strides=(1, 1, 4),
                              padding='same')(x)
x = tf.keras.layers.Reshape((7, 7, 128))(x)

# Transition between convolutional and fully connected layers
x = tf.keras.layers.Flatten()(x)

# Third Dropout layer
#x = tf.keras.layers.Dropout(rate=0.5)(x)

# Fully connected MaxOut layer
x = tf.keras.layers.Dense(units=512,
                          activation=None,
                          use_bias=True,
                          kernel_constraint='max_norm',
                          bias_constraint='max_norm')(x)
x = tf.keras.layers.Reshape((512, 1))(x)
x = tf.keras.layers.MaxPool1D(pool_size=4,
                              strides=4,
                              padding='same')(x)
x = tf.keras.layers.Flatten()(x)

# Fourth Dropout layer
x = tf.keras.layers.Dropout(rate=0.5)(x)

# Softmax output layer
outputs = tf.keras.layers.Dense(units=CIFAR100_NUM_FINE_LABELS,
                                activation='softmax',
                                use_bias=True,
                                kernel_constraint='max_norm',
                                bias_constraint='max_norm')(x)

In [None]:
model = tf.keras.Model(inputs=inputs, outputs=outputs)
model.summary()

In [None]:
# Use Categorical Cross Entropy as loss function
loss = tf.keras.losses.CategoricalCrossentropy()

# Define metrics to watch during training
metrics = [tf.keras.metrics.CategoricalAccuracy(),
           tf.keras.metrics.CategoricalCrossentropy(),
           tf.keras.metrics.AUC()]

# Use Adam (Kingma et al., 2017) as optimizer during training
# Annotation: We don't set `learning_rate` here as this is automatically handled by the
# LearningRateScheduler (cf. callback section below).
optimizer = tf.keras.optimizers.Adam(beta_1=0.9,
                                     beta_2=0.999,
                                     epsilon=1e-07)

In [None]:
model.compile(loss=loss, metrics=metrics, optimizer=optimizer)

In [None]:
# Alternatively: Restore model from checkpoint
# model = tf.keras.models.load_model(CKPT_DIR + 'maxout')
# model.summary()

### Training
<a id='benchmark_train'></a>

In [None]:
cifar100_train = process_and_augment(cifar100_train_raw, batch_size=32, synset_level=CIFAR100_FINE_LABEL_LEVEL, hypernym=None, is_train=True, num_rnd_crops=5, shuffle_buffer_size=1000, num_epochs=1)
cifar100_val = process_and_augment(cifar100_val_raw, batch_size=32, synset_level=CIFAR100_FINE_LABEL_LEVEL, hypernym=None, is_train=True, num_rnd_crops=5, shuffle_buffer_size=1000, num_epochs=1)

In [None]:
# EarlyStopping: Stop training early if no significant improvement in the monitored quantity is
#     observed for at least `patience` epochs
# LearningRateScheduler: Dynamically adapt the learning rate depending on the training epoch to
#     facilitate accelerated learning during the first few epochs
# ModelCheckpoint: Save the model after each epoch (if `save_best_only` is set to True, only keep
#     the best model with regard to the monitored quantity)
# TensorBoard: Enable TensorBoard visualization
callbacks = [tf.keras.callbacks.EarlyStopping(monitor='val_loss',
                                              min_delta=0.01,
                                              patience=3),
             tf.keras.callbacks.LearningRateScheduler(lambda epoch: 1e-03 if epoch < 3 else 1e-04),
             tf.keras.callbacks.ModelCheckpoint(filepath=CKPT_DIR + 'maxout',
                                                monitor='val_loss',
                                                verbose=False,
                                                save_best_only=True),
             tf.keras.callbacks.TensorBoard(log_dir=LOG_DIR,
                                            histogram_freq=1)]
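
To make the interplay of `min_delta`, `patience` and the learning rate schedule above more tangible, the following pure-Python sketch mimics the stopping rule in a simplified form. It is not Keras' implementation; `learning_rate` and `early_stopping_epoch` are hypothetical helpers, and the loss values are made up:

```python
def learning_rate(epoch):
    """Same schedule as the lambda above: 1e-3 for the first three epochs, then 1e-4."""
    return 1e-03 if epoch < 3 else 1e-04


def early_stopping_epoch(val_losses, min_delta=0.01, patience=3):
    """Return the 0-based epoch at which training would stop, or None if it never stops."""
    best = float('inf')
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            # Improvement by more than `min_delta`: remember it and reset the counter
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Made-up validation losses: only the first two epochs improve by more than 0.01
val_losses = [1.0, 0.9, 0.895, 0.893, 0.892, 0.5]
print(early_stopping_epoch(val_losses))
print([learning_rate(epoch) for epoch in range(5)])
```

In this example, training stops before the large improvement in the last epoch is ever seen, which illustrates the trade-off of a small `patience`.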

In [None]:
model.fit(x=cifar100_train,
          epochs=100,
          verbose=True,
          callbacks=callbacks,
          validation_data=cifar100_val,
          shuffle=True,
          validation_freq=1)

### Testing
<a id='benchmark_test'></a>

We perform model evaluation adhering to the standard 10-crop procedure as described in (Goodfellow et al., 2013) and (Krizhevsky et al., 2012).
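
The core of this procedure is to average the class scores over all crops of one base image and to take the arg max of the averaged scores. A minimal pure-Python sketch with three crops and two classes instead of ten crops and 100 classes (`average_scores` and `argmax` are hypothetical helpers, and the scores are made up):

```python
def average_scores(batch_scores):
    """Element-wise mean over the per-crop score vectors of one base image."""
    num_crops = len(batch_scores)
    return [sum(column) / num_crops for column in zip(*batch_scores)]


def argmax(values):
    """Index of the largest entry."""
    return max(range(len(values)), key=values.__getitem__)


# The second crop alone would vote for class 0, but the average favours class 1
batch_scores = [[0.2, 0.8],
                [0.6, 0.4],
                [0.1, 0.9]]
prediction = average_scores(batch_scores)
print(prediction, argmax(prediction))
```

Averaging the scores rather than taking a majority vote keeps the confidence information of each crop in the final prediction.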

In [None]:
cifar100_test = process_and_augment(cifar100_test_raw, batch_size=10, synset_level=CIFAR100_FINE_LABEL_LEVEL, hypernym=None, is_train=False)

In [None]:
# Define metrics to watch during the evaluation of the model on the test data set

# Use the same metrics as for the training
# test_metrics = metrics

# Use different metrics than during the training
test_metrics = [tf.keras.metrics.CategoricalAccuracy(name='CategoricalAccuracy'),
                tf.keras.metrics.CategoricalCrossentropy(name='CategoricalCrossentropy'),
                tf.keras.metrics.AUC(name='AUC')]

In [None]:
# Reset all metrics before starting the evaluation
for metric in test_metrics:
    metric.reset_states()
    
# Initialize additional custom metrics to watch during the evaluation

# Overall label distribution (predicted and ground truth)
dist_ground_truth = [0] * CIFAR100_NUM_FINE_LABELS
dist_predicted = [0] * CIFAR100_NUM_FINE_LABELS

# Predicted label distribution for each coarse category
dist_predicted_coarse = [[0] * CIFAR100_NUM_COARSE_LABELS for _ in range(CIFAR100_NUM_COARSE_LABELS)]

# Predicted label distribution for each fine category
dist_predicted_fine = [[0] * CIFAR100_NUM_FINE_LABELS for _ in range(CIFAR100_NUM_FINE_LABELS)]

# Fine label distribution for each coarse category (predicted and ground truth)
dist_ground_truth_coarse_fine = [[0] * CIFAR100_NUM_FINE_LABELS for _ in range(CIFAR100_NUM_COARSE_LABELS)]
dist_predicted_coarse_fine = [[0] * CIFAR100_NUM_FINE_LABELS for _ in range(CIFAR100_NUM_COARSE_LABELS)]

# Semantic distance between the predicted category and the ground truth in accordance with
# (Fergus et al., 2010)
semantic_distance = 0.0

In [None]:
for (imgs, ground_truths) in cifar100_test:
    # Generate predictions for each image in the current batch
    batch_scores = model.predict(imgs)
    
    # Since we constructed our test data set in a way that each image in one batch is an augmented
    # version of the same base image, we can simply average the individual scores to get the final
    # prediction for the respective base image in adherence to (Goodfellow et al., 2013) and
    # (Krizhevsky et al., 2012).
    prediction = tf.math.reduce_mean(batch_scores, axis=0)
    
    # Update the metrics w/ the result for the current base image; as all images in one batch
    # originate from the same base image (cf. above), the ground truth is hence identical as well.
    for metric in test_metrics:
        metric.update_state(ground_truths[0], prediction)
        
    # Update custom metrics w/ the result for the current base image; cf. above concerning the
    # ground truth for each batch
    
    ground_truth_fine = tf.math.argmax(ground_truths[0]).numpy()
    ground_truth_fine_decoded = cifar100_decode(ground_truth_fine, CIFAR100_FINE_LABEL_LEVEL)
    
    ground_truth_coarse = cifar100_resolve_hypernym(ground_truth_fine, level=CIFAR100_COARSE_LABEL_LEVEL, encoded=True)
        
    prediction_fine = tf.math.argmax(prediction).numpy()
    prediction_fine_decoded = cifar100_decode(prediction_fine, CIFAR100_FINE_LABEL_LEVEL)
    
    prediction_coarse = cifar100_resolve_hypernym(prediction_fine, level=CIFAR100_COARSE_LABEL_LEVEL, encoded=True)
    
    dist_ground_truth[ground_truth_fine] += 1
    dist_predicted[prediction_fine] += 1

    dist_predicted_coarse[ground_truth_coarse][prediction_coarse] += 1

    dist_predicted_fine[ground_truth_fine][prediction_fine] += 1

    dist_ground_truth_coarse_fine[ground_truth_coarse][ground_truth_fine] += 1
    dist_predicted_coarse_fine[ground_truth_coarse][prediction_fine] += 1
    
    semantic_distance += CIFAR100_SYNSET_MAP.semantic_distance(
        ground_truth_fine_decoded,
        prediction_fine_decoded
    )

print('Benchmark')
print()
print('==================================================')
print()
print('Final results:')
for metric in test_metrics:
    print('{}: {}'.format(metric.name, metric.result().numpy()))

In [None]:
# Store the results

tf.io.write_file(
    CIFAR100_RESULTS_DIR_BENCHMARK + 'dist_ground_truth.data',
    tf.io.serialize_tensor(dist_ground_truth))

tf.io.write_file(
    CIFAR100_RESULTS_DIR_BENCHMARK + 'dist_predicted.data',
    tf.io.serialize_tensor(dist_predicted))

tf.io.write_file(
    CIFAR100_RESULTS_DIR_BENCHMARK + 'dist_predicted_coarse.data',
    tf.io.serialize_tensor(dist_predicted_coarse))

tf.io.write_file(
    CIFAR100_RESULTS_DIR_BENCHMARK + 'dist_predicted_fine.data',
    tf.io.serialize_tensor(dist_predicted_fine))

tf.io.write_file(
    CIFAR100_RESULTS_DIR_BENCHMARK + 'dist_ground_truth_coarse_fine.data',
    tf.io.serialize_tensor(dist_ground_truth_coarse_fine))

tf.io.write_file(
    CIFAR100_RESULTS_DIR_BENCHMARK + 'dist_predicted_coarse_fine.data',
    tf.io.serialize_tensor(dist_predicted_coarse_fine))

tf.io.write_file(
    CIFAR100_RESULTS_DIR_BENCHMARK + 'semantic_distance.data',
    tf.io.serialize_tensor(semantic_distance))

In [None]:
# Load the results

dist_ground_truth = tf.io.parse_tensor(
    tf.io.read_file(
        CIFAR100_RESULTS_DIR_BENCHMARK + 'dist_ground_truth.data'),
        tf.dtypes.int64)

dist_predicted = tf.io.parse_tensor(
    tf.io.read_file(
        CIFAR100_RESULTS_DIR_BENCHMARK + 'dist_predicted.data'),
        tf.dtypes.int64)

dist_predicted_coarse = tf.io.parse_tensor(
    tf.io.read_file(
        CIFAR100_RESULTS_DIR_BENCHMARK + 'dist_predicted_coarse.data'),
        tf.dtypes.int64)

dist_predicted_fine = tf.io.parse_tensor(
    tf.io.read_file(
        CIFAR100_RESULTS_DIR_BENCHMARK + 'dist_predicted_fine.data'),
        tf.dtypes.int64)

dist_ground_truth_coarse_fine = tf.io.parse_tensor(
    tf.io.read_file(
        CIFAR100_RESULTS_DIR_BENCHMARK + 'dist_ground_truth_coarse_fine.data'),
        tf.dtypes.int64)

dist_predicted_coarse_fine = tf.io.parse_tensor(
    tf.io.read_file(
        CIFAR100_RESULTS_DIR_BENCHMARK + 'dist_predicted_coarse_fine.data'),
        tf.dtypes.int64)

semantic_distance = tf.io.parse_tensor(
    tf.io.read_file(
        CIFAR100_RESULTS_DIR_BENCHMARK + 'semantic_distance.data'),
        tf.dtypes.float32)

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(21, 5.5))
fig.suptitle('Predicted vs Ground Truth Distribution')

titles = ['Predicted', 'Ground Truth']

for idx, dist in enumerate([dist_predicted, dist_ground_truth]):
    # Normalize the distribution
    dist = list(map(lambda entry: entry / sum(dist) * 100, dist)) 
                
    # Plot the distribution
    ax[idx].bar(range(1, len(dist) + 1), dist, width=1)
    ax[idx].set_title(titles[idx])
    ax[idx].set_xticks(range(0, CIFAR100_NUM_FINE_LABELS, 10))
    ax[idx].set_xlim([0.5, 100.5])
    ax[idx].set_ylim([0, 3.3])
    ax[idx].set_xlabel('Label')
    ax[idx].set_ylabel('Share in %')

fig.savefig('cifar100_benchmark_dist_pred_groundTruth.jpg', format='jpg')

In [None]:
fig, ax = plt.subplots(10, 2, figsize=(21, 65.5))
fig.suptitle('Predicted Coarse Distribution per Coarse Ground Truth', y=0.88885)

for label in range(CIFAR100_NUM_COARSE_LABELS):
    row = int(label / 2)
    col = label % 2

    # Normalize the distribution
    dist = list(map(
        lambda entry: entry / sum(dist_predicted_coarse[label]) * 100, dist_predicted_coarse[label])) 

    # Plot the distribution
    ax[row][col].bar(range(1, len(dist) + 1), dist, width=1)
    ax[row][col].set_title(
        '{}'.format(cifar100_decode(label, CIFAR100_COARSE_LABEL_LEVEL))
    )
    ax[row][col].set_xticks(range(0, CIFAR100_NUM_COARSE_LABELS, 2))
    ax[row][col].set_xlim([0.5, 20.5])
    ax[row][col].set_ylim([0, 100])
    ax[row][col].set_xlabel('Label')
    ax[row][col].set_ylabel('Share in %')

fig.savefig('cifar100_benchmark_dist_pred_coarse_per_coarse_groundTruth.jpg', format='jpg')

In [None]:
fig, ax = plt.subplots(50, 2, figsize=(21, 330.5))
fig.suptitle('Predicted Fine Distribution per Fine Ground Truth', y=0.88175)

for label in range(CIFAR100_NUM_FINE_LABELS):
    row = int(label / 2)
    col = int(label % 2)
    
    # Normalize the distribution
    dist = list(map(
        lambda entry: entry / sum(dist_predicted_fine[label]) * 100, dist_predicted_fine[label])) 

    # Plot the distribution
    ax[row][col].bar(range(1, len(dist) + 1), dist, width=1)
    ax[row][col].set_title(
        '{}'.format(cifar100_decode(label, CIFAR100_FINE_LABEL_LEVEL))
    )
    ax[row][col].set_xticks(range(0, CIFAR100_NUM_FINE_LABELS, 10))
    ax[row][col].set_xlim([0.5, 100.5])
    ax[row][col].set_ylim([0, 100])
    ax[row][col].set_xlabel('Label')
    ax[row][col].set_ylabel('Share in %')

fig.savefig('cifar100_benchmark_dist_pred_fine_per_fine_groundTruth.jpg', format='jpg')

In [None]:
fig, ax = plt.subplots(20, 2, figsize=(21, 130.5))
fig.suptitle('Predicted vs Ground Truth Fine Distribution per Coarse Ground Truth', y=0.88425)

for label in range(CIFAR100_NUM_COARSE_LABELS):
    titles = ['Predicted for {}'.format(cifar100_decode(label, CIFAR100_COARSE_LABEL_LEVEL)),
          'Ground Truth for {}'.format(cifar100_decode(label, CIFAR100_COARSE_LABEL_LEVEL))]

    for idx, dist in enumerate([dist_predicted_coarse_fine, dist_ground_truth_coarse_fine]):
        # Normalize the distribution
        dist = list(map(lambda entry: entry / sum(dist[label]) * 100, dist[label])) 
                    
        # Plot the distribution
        ax[label][idx].bar(range(1, len(dist) + 1), dist, width=1)
        ax[label][idx].set_title(titles[idx])
        ax[label][idx].set_xticks(range(0, CIFAR100_NUM_FINE_LABELS, 10))
        ax[label][idx].set_xlim([0.5, 100.5])
        ax[label][idx].set_ylim([0, 50])
        ax[label][idx].set_xlabel('Label')
        ax[label][idx].set_ylabel('Share in %')

fig.savefig('cifar100_benchmark_dist_pred_groundTruth_fine_per_coarse_groundTruth.jpg', format='jpg')

### Predictive uncertainty under dataset shift
<a id='benchmark_predictive_uncertainty'></a>

In this section, we examine how the benchmark model behaves with respect to predictive uncertainty under dataset shift. Methodologically, we follow the basic approach described in (Ovadia et al., 2019): the corrupted CIFAR100-C dataset (Hendrycks et al., 2019) serves as shifted in-distribution data and the CIFAR10 dataset as out-of-distribution (OOD) reference.

#### Loading the datasets

CIFAR100-C dataset

In [None]:
# Dataset specific configuration

# Storage location of the CIFAR100-C dataset
CIFAR100_C_DATA_DIR = TFDS_DATA_DIR + 'cifar100/corrupted/'

# List of corruption types to include in the dataset on load
# Note: For a comprehensive list of all corruption types, cf. https://github.com/hendrycks/robustness.
CIFAR100_C_CORRUPTIONS = [
    'defocus_blur',
    'gaussian_blur',
    'glass_blur',
    'motion_blur',
    'zoom_blur',
    'gaussian_noise',
    'impulse_noise',
    'shot_noise',
    'speckle_noise',
    'fog',
    'frost',
    'snow',
    'brightness',
    'contrast',
    'saturate',
    'elastic_transform',
    'jpeg_compression',
    'pixelate',
    'spatter'
]

# Indicator whether to include the uncorrupted dataset for reference in addition to the above
# corruptions during the subsequent evaluation
CIFAR100_C_INCLUDE_UNCORRUPTED = True

# Number of severity grades and respectively associated samples per corruption
CIFAR100_C_NUM_SEVERITY_GRADES = (5 * (1 if CIFAR100_C_CORRUPTIONS else 0) + CIFAR100_C_INCLUDE_UNCORRUPTED)
CIFAR100_C_NUM_SAMPLES_PER_SEVERITY = 10000

In [None]:
# Load the CIFAR100-C dataset

# Template of the structure of the individual records to apply when parsing the serialized dataset
record_template = {
    CIFAR100_IMG_KEY: tf.io.FixedLenFeature([], tf.dtypes.string, default_value=''),
    CIFAR100_FINE_LABEL_KEY: tf.io.FixedLenFeature([], tf.dtypes.int64, default_value=-1)
}

# Load the dataset including all corruptions specified by `CIFAR100_C_CORRUPTIONS`
cifar100_c_test_raw = tf.data.TFRecordDataset([CIFAR100_C_DATA_DIR + corruption + '.tfrecord' for corruption in CIFAR100_C_CORRUPTIONS])

# Parse the serialized `tf.train.Example` protos
cifar100_c_test_raw = cifar100_c_test_raw.map(
    lambda record: tf.io.parse_single_example(record, record_template),
    num_parallel_calls=tf.data.experimental.AUTOTUNE)

# JPEG-decode the images
cifar100_c_test_raw = cifar100_c_test_raw.map(
    lambda record: {CIFAR100_IMG_KEY: tf.io.decode_jpeg(record[CIFAR100_IMG_KEY]),
                    CIFAR100_FINE_LABEL_KEY: record[CIFAR100_FINE_LABEL_KEY]},
    num_parallel_calls=tf.data.experimental.AUTOTUNE)

CIFAR10 dataset

In [None]:
# Dataset specific configuration

# Number of samples in the test dataset
CIFAR10_NUM_TEST_SAMPLES = 10000

In [None]:
cifar10_test_raw, cifar10_info = tfds.load('cifar10',
                                           split='test',
                                           data_dir=TFDS_DATA_DIR,
                                           download_and_prepare_kwargs={'download_dir': TFDS_DOWNLOAD_DIR},
                                           with_info=True)
print(cifar10_info)

#### Distributional shift

In [None]:
cifar100_c_test = [None] * CIFAR100_C_NUM_SEVERITY_GRADES

if CIFAR100_C_INCLUDE_UNCORRUPTED:
    # Include the uncorrupted test dataset for reference
    cifar100_c_test[0] = process_and_augment(cifar100_test_raw, batch_size=10, synset_level=CIFAR100_FINE_LABEL_LEVEL, hypernym=None, is_train=False)

# Construct one test dataset per severity grade
for corruption in range(len(CIFAR100_C_CORRUPTIONS)):
    print('Corruption: {}'.format(corruption))

    # Iterate over the corrupted severity grades only; slot 0 of `cifar100_c_test` is reserved for
    # the uncorrupted data if it is included
    for severity in range(CIFAR100_C_NUM_SEVERITY_GRADES - CIFAR100_C_INCLUDE_UNCORRUPTED):
        print('Severity: {}'.format(severity))

        # As the corruption files are read in consecutively and each file contains
        # `CIFAR100_C_NUM_SAMPLES_PER_SEVERITY` records per corrupted severity grade, with the
        # first `CIFAR100_C_NUM_SAMPLES_PER_SEVERITY` samples being of the lowest severity grade
        # followed by `CIFAR100_C_NUM_SAMPLES_PER_SEVERITY` samples of the second severity grade,
        # etc., we can construct a dataset containing only samples of corruption `corruption` and
        # severity grade `severity` as follows:
        dataset = cifar100_c_test_raw.enumerate().filter(
            lambda idx, _: int(idx / CIFAR100_C_NUM_SAMPLES_PER_SEVERITY) == corruption * (CIFAR100_C_NUM_SEVERITY_GRADES - CIFAR100_C_INCLUDE_UNCORRUPTED) + severity)
        
        # Remove the index residue from the previous filtering operation
        dataset = dataset.map(
            lambda _, record: record,
            num_parallel_calls=tf.data.experimental.AUTOTUNE)
        
        dataset = process_and_augment(dataset, batch_size=10, synset_level=CIFAR100_FINE_LABEL_LEVEL, hypernym=None, is_train=False)
        
        idx = severity + CIFAR100_C_INCLUDE_UNCORRUPTED
        if cifar100_c_test[idx] is None:
            cifar100_c_test[idx] = dataset
        else:
            cifar100_c_test[idx] = cifar100_c_test[idx].concatenate(dataset)
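
The filter in the cell above relies on the position of a record in the concatenated corruption files. The index arithmetic can be sketched in pure Python as follows, assuming (as the cell does) five corrupted severity grades per file with 10,000 samples each; `locate` and the two constants are hypothetical names used for illustration only:

```python
SAMPLES_PER_SEVERITY = 10000  # assumed, mirrors CIFAR100_C_NUM_SAMPLES_PER_SEVERITY
CORRUPTED_GRADES = 5          # assumed number of corrupted severity grades per file


def locate(record_idx):
    """Map a global record index to (corruption index, 0-based severity index)."""
    # Each block of SAMPLES_PER_SEVERITY consecutive records shares corruption and severity
    block = record_idx // SAMPLES_PER_SEVERITY
    corruption, severity = divmod(block, CORRUPTED_GRADES)
    return corruption, severity


print(locate(0))       # first record of the first file
print(locate(10000))   # first record of the second severity grade
print(locate(50000))   # first record of the second file
```

A record belongs to corruption `c` and severity `s` exactly when its block number equals `c * CORRUPTED_GRADES + s`, which is the condition checked by the filter.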

In [None]:
# Initialize metrics to watch during the evaluation in accordance with (Ovadia et al., 2019)

# Confidence
confidence = [[] for _ in range(CIFAR100_C_NUM_SEVERITY_GRADES)]

# Categorical accuracy
cat_acc = [[] for _ in range(CIFAR100_C_NUM_SEVERITY_GRADES)]

# Negative log-likelihood
neg_log_likelihood = [[] for _ in range(CIFAR100_C_NUM_SEVERITY_GRADES)]

# Brier score
brier_score = [[] for _ in range(CIFAR100_C_NUM_SEVERITY_GRADES)]

# Predictive entropy
pred_entropy = [[] for _ in range(CIFAR100_C_NUM_SEVERITY_GRADES)]

In [None]:
for severity in range(CIFAR100_C_NUM_SEVERITY_GRADES):
    print('Severity: {}'.format(severity))

    for idx, (imgs, ground_truths) in cifar100_c_test[severity].enumerate():
        # Generate predictions for each image in the current batch
        batch_scores = model.predict(imgs)
        
        # Since we constructed our test data set in a way that each image in one batch is an augmented
        # version of the same base image, we can simply average the individual scores to get the final
        # prediction for the respective base image in adherence to (Goodfellow et al., 2013) and
        # (Krizhevsky et al., 2012).
        prediction = tf.math.reduce_mean(batch_scores, axis=0)
        
        # Update the metrics w/ the result for the current base image; as all images in one batch
        # originate from the same base image (cf. above), the ground truth is hence identical as well.
        
        ground_truth = ground_truths[0]
        ground_truth_max_idx = tf.math.argmax(ground_truth).numpy()

        prediction_max_idx = tf.math.argmax(prediction).numpy()

        confidence[severity].append(
            prediction[prediction_max_idx])
        
        cat_acc[severity].append(tf.dtypes.cast(
            tf.math.equal(ground_truth_max_idx, prediction_max_idx), tf.dtypes.float32))

        neg_log_likelihood[severity].append(
            -tf.math.log(prediction[ground_truth_max_idx]))

        brier_score[severity].append(
            tf.math.reduce_sum((prediction - ground_truth) ** 2))

        pred_entropy[severity].append(
            -tf.math.reduce_sum(tf.map_fn(lambda p: p * tf.math.log(p), prediction)))

        if idx % 1000 == 0:
            print('{}'.format(idx))
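
For reference, the two proper scoring rules collected in the loop above can be written out for a single prediction as follows. This is a minimal pure-Python sketch with made-up probabilities; `brier_score` and `negative_log_likelihood` are hypothetical helpers that take the index of the true class instead of a one-hot vector:

```python
import math


def brier_score(probs, label):
    """Squared error between the predicted distribution and the one-hot ground truth."""
    return sum((p - (1.0 if k == label else 0.0)) ** 2 for k, p in enumerate(probs))


def negative_log_likelihood(probs, label):
    """Negative log of the probability assigned to the true class."""
    return -math.log(probs[label])


# Made-up prediction over three classes with true class 0
probs = [0.7, 0.2, 0.1]
print(brier_score(probs, label=0))              # 0.3^2 + 0.2^2 + 0.1^2
print(negative_log_likelihood(probs, label=0))  # -log(0.7)
```

Both scores are zero for a fully confident correct prediction; the negative log-likelihood is unbounded for confident mistakes, whereas the Brier score is at most 2.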

In [None]:
# Store the results

# Inflate the arrays in case only the uncorrupted test dataset was evaluated
if CIFAR100_C_INCLUDE_UNCORRUPTED and not CIFAR100_C_CORRUPTIONS:
    for _ in range(5):
        confidence.append([])
        cat_acc.append([])
        neg_log_likelihood.append([])
        brier_score.append([])
        pred_entropy.append([])
        
# Prepend a single placeholder for the uncorrupted dataset in case it was not evaluated
if not CIFAR100_C_INCLUDE_UNCORRUPTED and CIFAR100_C_CORRUPTIONS:
    confidence.insert(0, [])
    cat_acc.insert(0, [])
    neg_log_likelihood.insert(0, [])
    brier_score.insert(0, [])
    pred_entropy.insert(0, [])

tf.io.write_file(
    CIFAR100_RESULTS_DIR_BENCHMARK + 'confidence_shifted.data',
    tf.io.serialize_tensor(
        [tf.io.serialize_tensor(entry) for entry in confidence]))

tf.io.write_file(
    CIFAR100_RESULTS_DIR_BENCHMARK + 'cat_acc.data',
    tf.io.serialize_tensor(
        [tf.io.serialize_tensor(entry) for entry in cat_acc]))

tf.io.write_file(
    CIFAR100_RESULTS_DIR_BENCHMARK + 'neg_log_likelihood.data',
    tf.io.serialize_tensor(
        [tf.io.serialize_tensor(entry) for entry in neg_log_likelihood]))

tf.io.write_file(
    CIFAR100_RESULTS_DIR_BENCHMARK + 'brier_score.data',
    tf.io.serialize_tensor(
        [tf.io.serialize_tensor(entry) for entry in brier_score]))

tf.io.write_file(
    CIFAR100_RESULTS_DIR_BENCHMARK + 'pred_entropy_shifted.data',
    tf.io.serialize_tensor(
        [tf.io.serialize_tensor(entry) for entry in pred_entropy]))

In [None]:
# Load the results

confidence = [tf.io.parse_tensor(
    entry, tf.dtypes.float32) for entry in tf.io.parse_tensor(
        tf.io.read_file(
            CIFAR100_RESULTS_DIR_BENCHMARK + 'confidence_shifted.data'),
            tf.dtypes.string)]

cat_acc = [tf.io.parse_tensor(
    entry, tf.dtypes.float32) for entry in tf.io.parse_tensor(
        tf.io.read_file(
            CIFAR100_RESULTS_DIR_BENCHMARK + 'cat_acc.data'),
            tf.dtypes.string)]

neg_log_likelihood = [tf.io.parse_tensor(
    entry, tf.dtypes.float32) for entry in tf.io.parse_tensor(
        tf.io.read_file(
            CIFAR100_RESULTS_DIR_BENCHMARK + 'neg_log_likelihood.data'),
            tf.dtypes.string)]

brier_score = [tf.io.parse_tensor(
    entry, tf.dtypes.float32) for entry in tf.io.parse_tensor(
        tf.io.read_file(
            CIFAR100_RESULTS_DIR_BENCHMARK + 'brier_score.data'),
            tf.dtypes.string)]

pred_entropy = [tf.io.parse_tensor(
    entry, tf.dtypes.float32) for entry in tf.io.parse_tensor(
        tf.io.read_file(
            CIFAR100_RESULTS_DIR_BENCHMARK + 'pred_entropy_shifted.data'),
            tf.dtypes.string)]

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(21, 7.5))
fig.suptitle('Categorical Accuracy over varying Corruption Severities', y=0.9225)

bplot = ax.boxplot([val * 100 for val in cat_acc], sym='', showmeans=True, meanline=True, meanprops={'color': 'C0'})

ax.set_xlabel('Corruption severity')
ax.set_ylabel('Accuracy in %')
ax.set_xticklabels(range(CIFAR100_C_NUM_SEVERITY_GRADES))

ax.legend([bplot["medians"][0], bplot["means"][0]], ['Median', 'Mean'], loc='upper right')

fig.savefig('uncertainty_benchmark_ds_cat_acc.jpg', format='jpg')

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(21, 7.5))
fig.suptitle('Brier Score over varying Corruption Severities', y=0.9225)

bplot = ax.boxplot(brier_score, sym='', showmeans=True, meanline=True, meanprops={'color': 'C0'})

ax.set_xlabel('Corruption severity')
ax.set_ylabel('Brier Score')
ax.set_xticklabels(range(CIFAR100_C_NUM_SEVERITY_GRADES))

ax.legend([bplot["medians"][0], bplot["means"][0]], ['Median', 'Mean'], loc='upper right')

fig.savefig('uncertainty_benchmark_ds_brier_score.jpg', format='jpg')

In [None]:
fig, ax = plt.subplots(3, 2, figsize=(21, 19))
fig.suptitle('Confidence over varying Severity Grades', y=0.9109)

for idx in range(CIFAR100_C_NUM_SEVERITY_GRADES):
    row = int(idx / 2)
    col = idx % 2
    
    ax[row][col].hist(confidence[idx] * 100, bins=10000, cumulative=-1, histtype='step')

    ax[row][col].set_title('Severity Grade {}'.format(idx))
    ax[row][col].set_xlabel('Confidence ' + r'$\tau$' + ' in %')
    ax[row][col].set_ylabel('Number of examples P(x) > ' +  r'$\tau$')
    ax[row][col].set_xlim([0, 100])

fig.savefig('uncertainty_benchmark_ds_confidence.jpg', format='jpg')

In [None]:
fig, ax = plt.subplots(3, 2, figsize=(21, 19))
fig.suptitle('Negative Log-likelihood over varying Severity Grades', y=0.9109)

for idx in range(CIFAR100_C_NUM_SEVERITY_GRADES):
    row = int(idx / 2)
    col = idx % 2
    
    # Clip the values at 10 so that outliers end up in the last bin
    ax[row][col].hist(tf.math.minimum(neg_log_likelihood[idx], 10.0), bins=10000, cumulative=1, histtype='step')

    ax[row][col].set_title('Severity Grade {}'.format(idx))
    ax[row][col].set_xlabel('Negative log-likelihood ' + r'$l$')
    ax[row][col].set_ylabel('Number of examples L(x) ' + r'$\leq l$')
    ax[row][col].set_xlim([0, 10])

fig.savefig('uncertainty_benchmark_ds_neg_log_likelihood.jpg', format='jpg')

In [None]:
fig, ax = plt.subplots(3, 2, figsize=(21, 19))
fig.suptitle('Predictive Entropy over varying Severity Grades', y=0.9109)

# Replace NaN entries in `pred_entropy` (caused by 0 * log(0)) with zero
for idx, severity in enumerate(pred_entropy):
    pred_entropy[idx] = tf.where(tf.math.is_nan(severity), tf.zeros_like(severity), severity)

for idx in range(CIFAR100_C_NUM_SEVERITY_GRADES):
    row = int(idx / 2)
    col = idx % 2
    
    ax[row][col].hist(pred_entropy[idx], bins=10000, cumulative=-1, histtype='step')

    ax[row][col].set_title('Severity Grade {}'.format(idx))
    ax[row][col].set_xlabel('Entropy ' + r'$h$' + ' in nats')
    ax[row][col].set_ylabel('Number of examples H(x) > ' +  r'$h$')
    ax[row][col].set_xlim([0, 5])

fig.savefig('uncertainty_benchmark_ds_pred_entropy.jpg', format='jpg')

In [None]:
WINDOW_SIZE = 10000

fig, ax = plt.subplots(3, 2, figsize=(21, 19))
fig.suptitle('Confidence vs Accuracy over varying Severity Grades', y=0.9109)

for idx in range(CIFAR100_C_NUM_SEVERITY_GRADES):
    row = int(idx / 2)
    col = idx % 2

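    # Use a smaller window for the uncorrupted dataset (severity grade 0), which holds fewer
    # examples than the corrupted grades when several corruption types are evaluated
    if idx == 0:
        window_size = WINDOW_SIZE // 10
    else:
        window_size = WINDOW_SIZE

    # Sort confidence and categorical accuracy in ascending order of confidence
    order = tf.argsort(confidence[idx])
    confidence_sorted = tf.gather(confidence[idx], order) * 100
    cat_acc_sorted = tf.gather(cat_acc[idx], order)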

    # Calculate the moving averages of the sorted categorical accuracy
    cat_acc_cumsum = tf.math.cumsum(cat_acc_sorted)
    cat_acc_moving_avgs = (cat_acc_cumsum[window_size:] - cat_acc_cumsum[:-window_size]) / window_size * 100

    ax[row][col].plot(confidence_sorted[window_size:], cat_acc_moving_avgs)
    
    ax[row][col].set_title('Severity Grade {}'.format(idx))
    ax[row][col].set_xlabel('Confidence ' + r'$\tau$' + ' in %')
    ax[row][col].set_ylabel('Accuracy (moving average up to ' + r'$\tau$' + ') in %')
    ax[row][col].set_xlim([0, 100])
    ax[row][col].set_ylim([0, 100])

fig.savefig('uncertainty_benchmark_ds_confidence_accuracy.jpg', format='jpg')
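
The moving average in the cell above is obtained from differences of cumulative sums over the confidence-sorted correctness values. A minimal sketch of this technique in plain Python follows; the function `moving_average_accuracy` and the four-example input are assumptions for illustration only.

```python
from itertools import accumulate

def moving_average_accuracy(confidence, correct, window_size):
    """Windowed accuracy over examples sorted by ascending confidence.

    Hypothetical helper mirroring the cumulative-sum trick used above.
    """
    order = sorted(range(len(confidence)), key=confidence.__getitem__)
    conf_sorted = [confidence[i] for i in order]
    correct_sorted = [correct[i] for i in order]
    cumsum = list(accumulate(correct_sorted))
    # cumsum[i] - cumsum[i - w] is the number of correct predictions among
    # the w examples ending at position i
    averages = [(cumsum[i] - cumsum[i - window_size]) / window_size
                for i in range(window_size, len(cumsum))]
    return conf_sorted[window_size:], averages

conf_axis, acc_axis = moving_average_accuracy(
    [0.9, 0.1, 0.5, 0.7], [1, 0, 0, 1], window_size=2)
```

Each returned accuracy value therefore describes the `window_size` examples whose confidence lies at or just below the corresponding confidence value.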

#### Out-of-distribution (OOD)

In [None]:
cifar10_test = process_and_augment(cifar10_test_raw, batch_size=10, synset_level=CIFAR100_FINE_LABEL_LEVEL, hypernym=None, is_train=False)

In [None]:
# Initialize metrics to watch during the evaluation in accordance with (Ovadia et al., 2019)
# Note: As there is no ground truth for fully OOD samples, we only report confidence and
# predictive entropy for those.

# Confidence
confidence = [0] * CIFAR10_NUM_TEST_SAMPLES

# Predictive entropy
pred_entropy = [0] * CIFAR10_NUM_TEST_SAMPLES

In [None]:
for idx, (imgs, ground_truths) in cifar10_test.enumerate():
    # Generate predictions for each image in the current batch
    batch_scores = model.predict(imgs)
    
    # Since we constructed our test data set in a way that each image in one batch is an augmented
    # version of the same base image, we can simply average the individual scores to get the final
    # prediction for the respective base image in adherence to (Goodfellow et al., 2013) and
    # (Krizhevsky et al., 2012).
    prediction = tf.math.reduce_mean(batch_scores, axis=0)
    
    # Update the metrics w/ the result for the current base image; as there is no ground truth
    # for OOD samples (cf. above), only confidence and predictive entropy are recorded.
    prediction_max_idx = tf.math.argmax(prediction).numpy()

    confidence[idx] = prediction[prediction_max_idx]

    # tf.math.xlogy returns 0 for p == 0, which avoids NaN entries caused by 0 * log(0)
    pred_entropy[idx] = -tf.math.reduce_sum(tf.math.xlogy(prediction, prediction))
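
The two label-free metrics recorded for OOD data can be traced on a small example. The sketch below uses plain Python instead of TensorFlow and assumes a batch of two augmented views over three classes; the helper names `average_predictions` and `confidence_and_entropy` are hypothetical.

```python
import math

def average_predictions(batch_scores):
    """Average the class scores of all augmented views of one base image."""
    num_views = len(batch_scores)
    return [sum(column) / num_views for column in zip(*batch_scores)]

def confidence_and_entropy(prediction):
    """Return the maximum probability and the entropy in nats."""
    confidence = max(prediction)
    # Terms with p == 0 contribute 0 by convention (limit of p * log(p))
    entropy = -sum(p * math.log(p) for p in prediction if p > 0.0)
    return confidence, entropy

batch_scores = [[0.6, 0.4, 0.0], [0.4, 0.6, 0.0]]
prediction = average_predictions(batch_scores)
confidence_value, entropy_value = confidence_and_entropy(prediction)
```

Here the two views disagree, so the averaged prediction is split evenly between two classes and the entropy equals ln 2, roughly 0.693 nats.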

In [None]:
# Store the results

tf.io.write_file(
    CIFAR100_RESULTS_DIR_BENCHMARK + 'confidence_ood.data',
    tf.io.serialize_tensor(
        [tf.io.serialize_tensor(entry) for entry in confidence]))

tf.io.write_file(
    CIFAR100_RESULTS_DIR_BENCHMARK + 'pred_entropy_ood.data',
    tf.io.serialize_tensor(
        [tf.io.serialize_tensor(entry) for entry in pred_entropy]))

In [None]:
# Load the results

confidence = [tf.io.parse_tensor(
    entry, tf.dtypes.float32) for entry in tf.io.parse_tensor(
        tf.io.read_file(
            CIFAR100_RESULTS_DIR_BENCHMARK + 'confidence_ood.data'),
            tf.dtypes.string)]

pred_entropy = [tf.io.parse_tensor(
    entry, tf.dtypes.float32) for entry in tf.io.parse_tensor(
        tf.io.read_file(
            CIFAR100_RESULTS_DIR_BENCHMARK + 'pred_entropy_ood.data'),
            tf.dtypes.string)]

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(21, 7.5))
fig.suptitle('Confidence on OOD Data', y=0.9225)

ax.hist([val * 100 for val in confidence], bins=10000, cumulative=-1, histtype='step')

ax.set_xlabel('Confidence ' + r'$\tau$' + ' in %')
ax.set_ylabel('Number of examples P(x) > ' +  r'$\tau$')
ax.set_xlim([0, 100])

fig.savefig('uncertainty_benchmark_ood_confidence.jpg', format='jpg')

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(21, 7.5))
fig.suptitle('Predictive Entropy on OOD Data', y=0.9225)

# Replace NaN entries in `pred_entropy` (caused by 0 * log(0)) with zero
pred_entropy = tf.where(tf.math.is_nan(pred_entropy), tf.zeros_like(pred_entropy), pred_entropy)

ax.hist(pred_entropy, bins=10000, cumulative=-1, histtype='step')

ax.set_xlabel('Entropy ' + r'$h$' + ' in nats')
ax.set_ylabel('Number of examples H(x) > ' +  r'$h$')
ax.set_xlim([0, 5])

fig.savefig('uncertainty_benchmark_ood_pred_entropy.jpg', format='jpg')