In this practice, we will implement a ***Wide Residual Network*** (WRN) and learn essential concepts of convolutional neural networks.

In [1]:
import tensorflow as tf
from tensorflow.keras.datasets.cifar10 import load_data

# 1. Load CIFAR-10 dataset and check the number of data samples

In [2]:
train_set, test_set = load_data()

In [3]:
x_train, y_train = train_set[0], train_set[1]
x_test, y_test = test_set[0], test_set[1]

In [4]:
x_train.shape

(50000, 32, 32, 3)

In [5]:
y_train.shape

(50000, 1)

In [6]:
x_test.shape

(10000, 32, 32, 3)

In [7]:
y_test.shape

(10000, 1)

The official CIFAR-10 dataset consists of 50,000 training images and 10,000 test images. Each input image has shape [32,32,3], i.e. [Height, Width, Channels], and the labels are stored as arrays of shape [N,1].

# 2. Build WideResNet model

* Original paper: https://arxiv.org/abs/1605.07146
* Blog (Korean): https://norman3.github.io/papers/docs/wide_resnet.html
* TensorFlow implementation in MixMatch: https://github.com/google-research/mixmatch

WRNs aim to achieve good performance on classification tasks by widening residual networks, i.e. increasing the number of channels in each block, rather than relying on depth alone. The main residual blocks discussed in the paper are shown below.

![](WRN-figure1.png)

Note that batch normalization and the non-linear activation precede each convolution (BN-Act-Conv).

* (a) basic block: does not change the shape of tensors.
* (b) bottleneck: 1x1 convolutions reduce and then expand the channel dimensionality of tensors. In this practice we won't use the bottleneck block.
* (c) basic wide: wide 3x3 convolutions that increase the number of channels and, when strided, reduce the height and width of input tensors. --> What are appropriate `filters`, `kernel_size` and `strides` for this block?
* (d) dropout: since widening the tensors requires a large number of parameters, the authors inserted dropout regularization between the consecutive convolutions. In this practice we won't use dropout.
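
One way to approach the question in (c) is to work out output shapes by hand. The sketch below assumes TensorFlow's documented shape rules for `padding='same'` (output size = ceil(input / stride)) and `padding='valid'` (ceil((input - kernel + 1) / stride)). The helper name `conv2d_output_shape` is hypothetical and is not part of TensorFlow.

```python
import math

def conv2d_output_shape(height, width, filters, kernel_size, stride, padding="same"):
    """Return (H, W, C) of a 2-D convolution output, ignoring the batch axis."""
    if padding == "same":
        # 'same' padding: the kernel size does not affect the spatial size.
        out_h = math.ceil(height / stride)
        out_w = math.ceil(width / stride)
    elif padding == "valid":
        # 'valid' padding: no padding, so the kernel shrinks the feature map.
        out_h = math.ceil((height - kernel_size + 1) / stride)
        out_w = math.ceil((width - kernel_size + 1) / stride)
    else:
        raise ValueError("padding must be 'same' or 'valid'")
    # The number of output channels always equals the number of filters.
    return (out_h, out_w, filters)

print(conv2d_output_shape(32, 32, 32, kernel_size=3, stride=2))                   # (16, 16, 32)
print(conv2d_output_shape(32, 32, 16, kernel_size=3, stride=1, padding="valid"))  # (30, 30, 16)
```

With `padding='same'`, `filters` alone controls the channel count and `strides` alone controls the spatial reduction, so the two can be chosen independently.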

Let's investigate how the tensor shapes change when we apply the residual blocks.

In [8]:
bn_args = dict(training=True, momentum=0.999) # Note that "training=False" for inference time.

def conv_args(k, f):
    return dict(padding='same',
                kernel_initializer=tf.random_normal_initializer(stddev=tf.rsqrt(0.5*k*k*f)))

def residual(x0, filters, stride=1):
    x = tf.layers.batch_normalization(x0, **bn_args)
    x = tf.nn.elu(x)
    x = tf.layers.conv2d(x, filters, 3, stride, **conv_args(3, filters))
    x = tf.layers.batch_normalization(x, **bn_args) # normalize the output of the first convolution, not x0
    x = tf.nn.elu(x)
    x = tf.layers.conv2d(x, filters, 3, **conv_args(3, filters))
    return x + x0

x = tf.placeholder(tf.float32, [None, 32, 32, 3]) # [32,32,3] for CIFAR-10 input tensors whose shape is [H,W,C]
x = tf.layers.conv2d(x, 16, 3, **conv_args(3, 16)) # For the initial change of the input tensor shape.
x

W0805 13:59:30.568976 140620181837568 deprecation.py:323] From <ipython-input-8-d72946d4c16b>:17: conv2d (from tensorflow.python.layers.convolutional) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.keras.layers.Conv2D` instead.


<tf.Tensor 'conv2d/BiasAdd:0' shape=(?, 32, 32, 16) dtype=float32>
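
To make the initializer in `conv_args` more concrete, the following sketch evaluates the same expression in plain Python. The standard deviation `1 / sqrt(0.5 * k * k * f)` equals `sqrt(2 / (k * k * f))`, which has the form of He initialization computed from the fan-out `k * k * f`. The function name `conv_init_stddev` is an illustrative assumption, not part of the notebook.

```python
import math

def conv_init_stddev(kernel_size, filters):
    """Standard deviation used by conv_args: 1 / sqrt(0.5 * k * k * f)."""
    return 1.0 / math.sqrt(0.5 * kernel_size * kernel_size * filters)

# Larger kernels or more filters give a smaller initial weight scale.
for k, f in [(3, 16), (3, 32), (1, 32)]:
    print(k, f, round(conv_init_stddev(k, f), 4))
```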

## Basic block - when the input and output tensors have the same shape

Let's see the change of tensors after applying the basic block.

In [9]:
x1 = residual(x, 16, 1)
x1

W0805 13:59:30.800174 140620181837568 deprecation.py:323] From <ipython-input-8-d72946d4c16b>:8: batch_normalization (from tensorflow.python.layers.normalization) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.BatchNormalization instead.  In particular, `tf.control_dependencies(tf.GraphKeys.UPDATE_OPS)` should not be used (consult the `tf.keras.layers.batch_normalization` documentation).


<tf.Tensor 'add:0' shape=(?, 32, 32, 16) dtype=float32>
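
As a side note on `bn_args`: the `momentum` argument controls the moving averages that batch normalization uses at inference time (`training=False`). The sketch below is a plain-Python illustration of the update rule `moving = moving * momentum + batch_value * (1 - momentum)`. The function name and the constant batch statistic of 1.0 are assumptions made for illustration only.

```python
def update_moving_average(moving, batch_value, momentum=0.999):
    """One moving-average update step, as used for the batch-norm statistics."""
    return moving * momentum + batch_value * (1.0 - momentum)

# Start from 0 and repeatedly observe a batch statistic of 1.0.
moving_mean = 0.0
for step in range(1000):
    moving_mean = update_moving_average(moving_mean, 1.0)

# With momentum=0.999 the average has covered roughly 63% of the gap after 1000 steps.
print(round(moving_mean, 3))
```

A momentum this close to 1 makes the inference statistics smooth but slow to adapt, so they are only reliable after many training steps.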

What will happen if we use a different number of filters (channels)?

In [10]:
x2 = residual(x, 32, 1)
x2

ValueError: Dimensions must be equal, but are 32 and 16 for 'add_1' (op: 'Add') with input shapes: [?,32,32,32], [?,32,32,16].

Since the shapes of the input (x0) and output (x) tensors are different, the residual block cannot compute the sum `x + x0`. Thus, we need to revise the implementation of the residual block above.

In [11]:
def residual(x0, filters, stride=1):
    x = tf.layers.batch_normalization(x0, **bn_args)
    x = tf.nn.elu(x)
    x = tf.layers.conv2d(x, filters, 3, stride, **conv_args(3, filters))
    x = tf.layers.batch_normalization(x, **bn_args)
    x = tf.nn.elu(x)
    x = tf.layers.conv2d(x, filters, 3, **conv_args(3, filters))
    
    if x0.get_shape()[3] != filters or x0.get_shape()[1] != x.get_shape()[1]:
        x0 = tf.layers.conv2d(x0, filters, 1, stride, **conv_args(1, filters)) # 1x1 convolution projects the shortcut to the shape of x
    return x + x0

x2 = residual(x, 32, 1)
x2

<tf.Tensor 'add_2:0' shape=(?, 32, 32, 32) dtype=float32>

In [12]:
x3 = residual(x, 16, 2)
x3

<tf.Tensor 'add_3:0' shape=(?, 16, 16, 16) dtype=float32>

In [13]:
x4 = residual(x, 32, 2)
x4

<tf.Tensor 'add_4:0' shape=(?, 16, 16, 32) dtype=float32>

Yes! This implementation of the residual block can reduce the height/width of the feature maps (with stride 2) and change the number of channels, which is what we need to build WRNs.
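
The shape logic of the revised residual block can be checked without TensorFlow. The sketch below assumes `padding='same'` as in `conv_args`, and the helper name `residual_output_shape` is hypothetical. It returns the output shape and whether the 1x1 projection on the shortcut is needed.

```python
import math

def residual_output_shape(in_shape, filters, stride=1):
    """Mirror the shape logic of the revised residual block.

    in_shape is (H, W, C) without the batch axis.
    Returns (out_shape, needs_projection).
    """
    h, w, c = in_shape
    # The first 3x3 convolution applies the stride; the second keeps the shape.
    out_shape = (math.ceil(h / stride), math.ceil(w / stride), filters)
    # Same condition as in the block: channels differ or the height differs.
    needs_projection = (c != filters) or (h != out_shape[0])
    return out_shape, needs_projection

# Same calls as x1..x4 above, starting from the (32, 32, 16) conv1 output.
for filters, stride in [(16, 1), (32, 1), (16, 2), (32, 2)]:
    print(filters, stride, residual_output_shape((32, 32, 16), filters, stride))
```

Only the first case keeps the identity shortcut; the other three need the projection, which matches the shapes printed for `x2`, `x3` and `x4`.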

![](WRN-figure2.png)

With the above residual block, we can now implement the classifier built on WRNs, whose architecture is shown in the figure above. "k" denotes the widening factor, which multiplies the number of channels by k. (Therefore, k=1 does not widen the network.)
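
To get a feel for the cost of widening, the sketch below counts the parameters of a single 3x3 convolution (weights plus biases) in the widest group for a few values of k. The channel list `[16k, 32k, 64k]` follows the classifier in this notebook; the helper names are hypothetical.

```python
def group_channels(k_widen):
    """Channels of the three residual groups for widening factor k."""
    return [16 * k_widen, 32 * k_widen, 64 * k_widen]

def conv3x3_params(in_channels, out_channels):
    """Weights plus biases of one 3x3 convolution."""
    return 3 * 3 * in_channels * out_channels + out_channels

for k in (1, 2, 4):
    widest = group_channels(k)[-1]
    # Both input and output channels scale with k, so the count grows roughly as k^2.
    print(k, group_channels(k), conv3x3_params(widest, widest))
```

Doubling k roughly quadruples the number of weights per convolution, which is why the paper considers dropout as an additional regularizer.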

Let's implement the WRNs. 

In [14]:
x # the output of the conv1 group.

<tf.Tensor 'conv2d/BiasAdd:0' shape=(?, 32, 32, 16) dtype=float32>

In [17]:
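def classifier(x0, k_widen, repeat):
    # Number of channels in each of the three residual groups.
    channels = [16*k_widen, 32*k_widen, 64*k_widen]
    y = x0
    for scale in range(len(channels)):
        # The first block of each group halves height/width and sets the channel count.
        y = residual(y, channels[scale], stride=2)
        print(y)
        # The remaining blocks keep the tensor shape unchanged.
        for i in range(repeat-1):
            y = residual(y, channels[scale])
            print(y)
    y = tf.reduce_mean(y, [1,2]) # global average pooling over height and width
    print(y)
    # Final dense layer produces one logit per CIFAR-10 class.
    logits = tf.layers.dense(y, 10, kernel_initializer=tf.glorot_normal_initializer())
    return logits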

In [18]:
logits = classifier(x, 2, 4)
logits

Tensor("add_17:0", shape=(?, 16, 16, 32), dtype=float32)
Tensor("add_18:0", shape=(?, 16, 16, 32), dtype=float32)
Tensor("add_19:0", shape=(?, 16, 16, 32), dtype=float32)
Tensor("add_20:0", shape=(?, 16, 16, 32), dtype=float32)
Tensor("add_21:0", shape=(?, 8, 8, 64), dtype=float32)
Tensor("add_22:0", shape=(?, 8, 8, 64), dtype=float32)
Tensor("add_23:0", shape=(?, 8, 8, 64), dtype=float32)
Tensor("add_24:0", shape=(?, 8, 8, 64), dtype=float32)
Tensor("add_25:0", shape=(?, 4, 4, 128), dtype=float32)
Tensor("add_26:0", shape=(?, 4, 4, 128), dtype=float32)
Tensor("add_27:0", shape=(?, 4, 4, 128), dtype=float32)
Tensor("add_28:0", shape=(?, 4, 4, 128), dtype=float32)
Tensor("Mean_1:0", shape=(?, 128), dtype=float32)


<tf.Tensor 'dense_1/BiasAdd:0' shape=(?, 10) dtype=float32>
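
The printed shapes can be reproduced with a small pure-Python trace, which is handy for checking other values of `k_widen` and `repeat` before building the graph. This is a sketch that assumes `padding='same'` and stride 2 in the first block of every group, as in the `classifier` above; the function name `trace_classifier_shapes` is hypothetical.

```python
import math

def trace_classifier_shapes(in_shape, k_widen, repeat, num_classes=10):
    """Return the per-example shapes produced by each step of the classifier."""
    h, w, _ = in_shape
    shapes = []
    for channels in (16 * k_widen, 32 * k_widen, 64 * k_widen):
        # First block of the group: stride 2 halves height and width.
        h, w = math.ceil(h / 2), math.ceil(w / 2)
        shapes.append((h, w, channels))
        # Remaining blocks keep the shape.
        shapes.extend([(h, w, channels)] * (repeat - 1))
    shapes.append((channels,))     # global average pooling over H and W
    shapes.append((num_classes,))  # dense layer producing the logits
    return shapes

for shape in trace_classifier_shapes((32, 32, 16), k_widen=2, repeat=4):
    print(shape)
```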

We have successfully implemented the WRN. 
We provide "model.py" and "train.py" in this repository. 
Please fill in the blanks to complete the implementation. 
Also check the train/test accuracy on the CIFAR-10 dataset!
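
For the accuracy check, a minimal sketch of the computation is given below. It assumes the logits are available as a list of per-class scores and the labels as rows of the form `[class_id]`, matching the `(N, 1)` label shape of CIFAR-10. The toy values are made up for illustration and are not results from this model.

```python
def accuracy(logits, labels):
    """Fraction of examples whose arg-max logit equals the label.

    logits: list of per-class score lists, one per example.
    labels: list of single-element rows [class_id], like the CIFAR-10 label arrays.
    """
    correct = 0
    for scores, label in zip(logits, labels):
        # Index of the largest score is the predicted class.
        predicted = max(range(len(scores)), key=scores.__getitem__)
        correct += int(predicted == label[0])
    return correct / len(labels)

# Toy example with 4 samples and 3 classes (illustrative values only).
toy_logits = [[0.1, 2.0, -1.0], [1.5, 0.2, 0.3], [0.0, 0.1, 0.9], [0.4, 0.3, 0.2]]
toy_labels = [[1], [0], [1], [0]]
print(accuracy(toy_logits, toy_labels))  # 0.75
```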