## U-Net+residual blocks: A modified U-Net inspired by the Kaggle Carvana Image Masking Challenge solutions


Date created: Oct 10, 2017   
Last modified: Dec 20, 2017  
Tags: U-Net, semantic segmentation, CNNs, dilated convolutions, residual block  
About: modified U-Nets

The U-Net <a href="#ref1">[1]</a> was the most widely implemented deep learning network in the recently concluded [Carvana Image Masking Challenge](https://www.kaggle.com/c/carvana-image-masking-challenge) on Kaggle, and the competition was an opportunity to experiment with different [U-Net](#UNet) architectures. The [third place solution](https://www.kaggle.com/c/carvana-image-masking-challenge/discussion/40199) ([github code](https://github.com/lyakaap/Kaggle-Carvana-3rd-place-solution)) was a modified U-Net that used dilated convolution blocks in the bottleneck region (the region at the junction of the contracting and expanding paths). This simple change, which was responsible for the high-scoring result (0.997193), also used fewer parameters: 8.5 million versus roughly 34 million for a lower-scoring U-Net model 7 or 8 blocks deep. [Dilated convolutional filters](#dilated) increase the receptive field but require far fewer parameters than a deeper network with standard convolution filters. Since the ResNet <a href="#ref2">[2]</a> is a deep network that uses fewer parameters than older conv-net architectures, a worthwhile experiment would be to replace the dilated convolutions with ResNet [residual blocks](#residual-blocks).

In this exercise we will compare the *U-Net+residual block* network with the *U-Net+dilated convolutions* network. A template for these two modified U-Net architectures is shown below:

<center><img src="files/Unet.png" width=650px align="center"></center>

We will only train the models for a few epochs to see how the initial numbers compare since it would take roughly 3 days to train each model (50 mins/epoch x 100 epochs on the Crestle/AWS GPU platform).
The models are implemented in Keras with the TensorFlow backend. The use of the following code is acknowledged:
  1. [Image processing](https://github.com/petrosgk/Kaggle-Carvana-Image-Masking-Challenge) code from Kaggle user Peter Giannakopoulos
  2. [U-Net+dilated conv and loss function](https://github.com/lyakaap/Kaggle-Carvana-3rd-Place-Solution) code from Kaggle user @lyakaap
  3. [Resnet50 residual block](https://github.com/fchollet/keras/blob/master/keras/applications/resnet50.py) code from the Keras project



The notebook is organized into the following sections:  
  1. [Brief description of U-Net, Dilated convolutions and ResNet residual blocks](#Basics) for context.
  2. [Data preparation](#Data)  
  3. Initial results from U-Net+dilated conv (using lyakaap's model)  
  4. Initial results from U-Net+residual blocks


<a id='Basics'></a>
## Basics of U-Net, Dilated Convolutions and Residual Blocks

<a id='UNet'></a>
#### U-Net
The U-Net is an example of a fully convolutional network (FCN) with an encoder-decoder architecture. The encoder is a standard convolutional neural network which learns features for classification. During this process, downsampling operations reduce the spatial resolution, so an upsampling stage is needed to restore the spatial resolution for semantic segmentation. This is the job of the decoder network. SegNet is an example of an encoder-decoder in which two VGG16 networks, one a downsampling path and the other an upsampling path, mirror each other. In the U-Net, the downsampling and upsampling paths are connected via a convolutional block. This set of convolutional filters lets the network go deeper while preserving the dimensions of the contracted segmentation maps. Variants of this block include dilated convolutions or a bottleneck layer.
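
As a rough illustration of why the decoder is needed, the sketch below traces the side length of a feature map through an encoder-decoder in which every encoder stage halves the resolution (2x2 max pooling) and every decoder stage doubles it (2x upsampling). The function name and the depth of 4 are assumptions made for illustration; they are not taken from the models used in this notebook.

```python
def trace_resolution(input_size, depth):
    """Return the feature-map side length after each pooling and upsampling stage."""
    sizes = [input_size]
    size = input_size
    for _ in range(depth):   # contracting path: 2x2 max pooling halves the size
        size //= 2
        sizes.append(size)
    for _ in range(depth):   # expanding path: 2x upsampling doubles the size
        size *= 2
        sizes.append(size)
    return sizes


print(trace_resolution(512, depth=4))
# [512, 256, 128, 64, 32, 64, 128, 256, 512]
```

The smallest value in the list is the resolution at which the connecting block between the two paths operates.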

<a id='dilated'></a>
#### Dilated Convolutional Filters
Dilated (or atrous) convolutional filters are used in semantic segmentation architectures to increase the receptive field with far fewer parameters than a deeper network with standard convolution filters would require. To see why, note that a 3x3 filter with a dilation rate of 2 covers the same area as a 5x5 filter. The 3x3 filter uses 9 parameters whereas the 5x5 filter uses 25. (Max pooling also increases the receptive field, but it downsamples aggressively and so is not used in a U-Net once the features are learned.)
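
The 3x3 versus 5x5 comparison generalises to other dilation rates. The sketch below uses the standard relation for the span of a dilated filter, k + (k - 1)(d - 1), and counts weights per filter for a single input channel. The helper names and the list of rates are illustrative assumptions, not values read from the models in this notebook.

```python
def effective_kernel_size(kernel_size, dilation_rate):
    """Side length of the region spanned by a dilated square filter."""
    return kernel_size + (kernel_size - 1) * (dilation_rate - 1)


def weights_per_filter(kernel_size):
    """Number of weights in one square filter for a single input channel."""
    return kernel_size * kernel_size


for rate in (1, 2, 4, 8):
    span = effective_kernel_size(3, rate)
    print("3x3 filter, dilation rate %d: spans %dx%d using %d weights; "
          "a dense %dx%d filter would need %d"
          % (rate, span, span, weights_per_filter(3),
             span, span, weights_per_filter(span)))
```

The dilated filter always uses 9 weights, while the dense filter covering the same span grows quadratically with the span.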

<a id='residual-blocks'></a>
#### Residual Blocks
ResNets employ residual blocks to create deep networks. The basic residual block consists of a sequence of convolutional layers bypassed by a skip connection that adds the block input to the output of the convolutions; this gives the gradient a direct path back to earlier layers during backpropagation. The ResNet50 residual block that we use is made up of three sequential convolutions:
  1. 1x1 convolutions (this is the NiN, or Network-in-Network, concept: correlated features are combined efficiently, which reduces redundancy and the number of parameters, effectively creating a *bottleneck*)
  2. 3x3 convolutions (to get a rich set of new feature combinations; this is computationally expensive, but the preceding bottleneck step significantly reduces the computational burden)
  3. 1x1 convolutions (to expand the number of filters, usually restoring the number of output channels to that of the block input)
  
This is shown in the figure below:
<center><img src="files/bottleneck.png" width=200px align="center"></center>
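
The three steps and the skip connection can be sketched in plain NumPy. This is a simplified illustration, not the Keras ResNet50 block used in the notebook: it leaves out batch normalization and biases, and the function names, channel counts (16 squeezed to 4) and random weights are all assumptions chosen to keep the example small.

```python
import numpy as np


def conv2d_same(x, w):
    """'Same'-padded, stride-1 convolution. x: (H, W, C_in); w: (k, k, C_in, C_out)."""
    k = w.shape[0]
    pad = k // 2
    height, width = x.shape[:2]
    x_padded = np.pad(x, ((pad, pad), (pad, pad), (0, 0)), mode="constant")
    out = np.zeros((height, width, w.shape[3]))
    for i in range(k):
        for j in range(k):
            # Each kernel offset contributes a shifted view times a (C_in, C_out) matrix.
            out += x_padded[i:i + height, j:j + width, :] @ w[i, j]
    return out


def relu(x):
    return np.maximum(x, 0.0)


def bottleneck_block(x, w_reduce, w_conv, w_expand):
    """1x1 reduce -> 3x3 conv -> 1x1 expand, then add the block input (skip connection)."""
    h = relu(conv2d_same(x, w_reduce))
    h = relu(conv2d_same(h, w_conv))
    h = conv2d_same(h, w_expand)
    return relu(h + x)


rng = np.random.RandomState(0)
channels, squeezed = 16, 4                      # hypothetical channel counts
x = rng.rand(8, 8, channels)
w_reduce = rng.randn(1, 1, channels, squeezed) * 0.1
w_conv = rng.randn(3, 3, squeezed, squeezed) * 0.1
w_expand = rng.randn(1, 1, squeezed, channels) * 0.1

y = bottleneck_block(x, w_reduce, w_conv, w_expand)
print(y.shape)  # (8, 8, 16)

# Weight count of the bottleneck versus a single full-width 3x3 convolution.
bottleneck_weights = w_reduce.size + w_conv.size + w_expand.size
plain_weights = 3 * 3 * channels * channels
print(bottleneck_weights, plain_weights)  # 272 2304
```

With all three weight tensors set to zero the block returns its (non-negative) input unchanged, which is the identity path that the skip connection provides.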

## Import libraries

In [4]:
import cv2
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

from keras.callbacks import EarlyStopping, ReduceLROnPlateau, ModelCheckpoint, TensorBoard
from keras.models import Model
from keras.layers import Input, concatenate, Conv2D, MaxPooling2D, Activation, UpSampling2D, BatchNormalization
from keras.optimizers import RMSprop
from keras.losses import binary_crossentropy
import keras.backend as K

from sklearn.model_selection import train_test_split
import h5py
import random

import losses
import new_models

import augmentation

DATAPATH = "../data"

Using TensorFlow backend.


## Models

The *U-Net+dilated blocks* (dilated_unet) model has 8.6 million parameters; the bottleneck has 6 dilated convolutions. The model architecture is shown [here](https://github.com/lyakaap/Kaggle-Carvana-3rd-Place-Solution/blob/master/network.png).

The *U-Net+residual blocks* (resblock_unet) model uses one convolution + 5 residual blocks. It has 3.6 million parameters, less than half as many as dilated_unet. The model summary is shown below.

In [5]:
model = new_models.get_resblock_unet(input_shape=(512,512, 3))
model.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            (None, 512, 512, 3)  0                                            
__________________________________________________________________________________________________
enc_conv_1a (Conv2D)            (None, 512, 512, 44) 1232        input_1[0][0]                    
__________________________________________________________________________________________________
enc_conv_1b (Conv2D)            (None, 512, 512, 44) 17468       enc_conv_1a[0][0]                
__________________________________________________________________________________________________
max_pooling2d_1 (MaxPooling2D)  (None, 256, 256, 44) 0           enc_conv_1b[0][0]                
__________________________________________________________________________________________________
enc_conv_2
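
The parameter counts in the summary can be checked by hand. The sketch below assumes 3x3 kernels with one bias term per filter; the kernel size is not stated in the summary, but this assumption reproduces the two counts listed for `enc_conv_1a` and `enc_conv_1b`. The helper name is hypothetical.

```python
def conv2d_params(kernel_size, in_channels, out_channels, use_bias=True):
    """Number of trainable parameters in a 2D convolution layer."""
    weights = kernel_size * kernel_size * in_channels * out_channels
    biases = out_channels if use_bias else 0
    return weights + biases


print(conv2d_params(3, 3, 44))    # enc_conv_1a: 1232
print(conv2d_params(3, 44, 44))   # enc_conv_1b: 17468
```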

## Data, Preprocessing, Augmentation

#### Data
The training dataset consists of 5088 images. Each car is presented in 16 fixed photo angles. The ground truth training mask images were converted from *.gif* to *.png* format so as to be compatible with the *OpenCV* library. 

The Carvana dataset can be found [here](https://www.kaggle.com/c/carvana-image-masking-challenge/data). The training images and ground truth masks are in the *train* and *train_masks* folders respectively. The test data was not used.

In [13]:
input_width = 512
input_height = 512
max_epochs = 10
orig_width = 1918
orig_height= 1280
threshold  = 0.5

In [5]:
df_train = pd.read_csv(DATAPATH+'/train_masks.csv')

In [6]:
df_train = df_train.iloc[2:,:]
#df_train.head()

In [8]:
ids_train = df_train['img'].map(lambda s: s.split('.')[0])

In [9]:
ids_train_split, ids_valid_split = train_test_split(ids_train, test_size=0.2, random_state=42)

#### Rescaling (input and mask data)

In [10]:
all_imgs  = {}
all_masks = {}

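for id in ids_train:
    # cv2.imread returns a BGR array; cv2.resize takes the target size as (width, height)
    img  = cv2.imread(DATAPATH+'/train/{}.jpg'.format(id))
    img  = cv2.resize(img, (input_width, input_height))
    # Masks are read as single-channel images with values in the range 0-255
    mask = cv2.imread(DATAPATH+'/train_masks_png/{}_mask.png'.format(id), cv2.IMREAD_GRAYSCALE)
    mask = cv2.resize(mask, (input_width, input_height))
    # Keep the resized arrays in memory, keyed by image id
    all_imgs[id]  = img
    all_masks[id] = mask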

#### Augmentation

The following transformations using the *OpenCV* library were made:
* Hue, Saturation, Value using randomHueSaturationValue
* Shift, Scale, Rotate using randomShiftScaleRotate
* Horizontal flips using randomHorizontalFlip
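
The key point for segmentation is that geometric transformations must be applied to the image and its mask together. The sketch below is a NumPy stand-in for a horizontal flip; it is not the `augmentation.randomHorizontalFlip` implementation, which is not shown in this notebook, and the function name and `probability` parameter are assumptions.

```python
import numpy as np


def random_horizontal_flip(image, mask, probability=0.5, rng=np.random):
    """Flip image and mask left-right together with the given probability."""
    if rng.random_sample() < probability:
        image = image[:, ::-1]   # reverse the width axis
        mask = mask[:, ::-1]     # apply the same flip so the mask stays aligned
    return image, mask


image = np.arange(12).reshape(2, 2, 3)       # tiny 2x2 "image" with 3 channels
mask = np.array([[0, 255], [255, 0]])        # matching 2x2 mask

flipped_image, flipped_mask = random_horizontal_flip(image, mask, probability=1.0)
print(flipped_mask)
# [[255   0]
#  [  0 255]]
```

Colour transformations such as hue, saturation and value shifts only alter the image, which is why `randomHueSaturationValue` takes no mask argument in the training generator.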

## Training

In [12]:
def train_generator(train_batch_size):
    while True:
        this_ids_train_split = random.sample(list(ids_train_split), len(ids_train_split))
        for start in range(0, len(ids_train_split), train_batch_size):
            x_batch = []
            y_batch = []
            end = min(start + train_batch_size, len(ids_train_split))
            ids_train_batch = this_ids_train_split[start:end]
            for id in ids_train_batch:
                img  = all_imgs[id]
                mask = all_masks[id]
                img = augmentation.randomHueSaturationValue(img,
                                               hue_shift_limit=(-50, 50),
                                               sat_shift_limit=(-5, 5),
                                               val_shift_limit=(-15, 15))
                img, mask = augmentation.randomShiftScaleRotate(img, mask,
                                                   shift_limit=(-0.0625, 0.0625),
                                                   scale_limit=(-0.1, 0.1),
                                                   rotate_limit=(-0, 0))
                img, mask = augmentation.randomHorizontalFlip(img, mask)
                mask = np.expand_dims(mask, axis=2)
                x_batch.append(img)
                y_batch.append(mask)
            x_batch = np.array(x_batch, np.float32) / 255
            y_batch = np.array(y_batch, np.float32) / 255
            yield x_batch, y_batch


In [13]:
def valid_generator(val_batch_size):
    while True:
        for start in range(0, len(ids_valid_split), val_batch_size):
            x_batch = []
            y_batch = []
            end = min(start + val_batch_size, len(ids_valid_split))
            ids_valid_batch = ids_valid_split[start:end]
            for id in ids_valid_batch.values:
                img  = all_imgs[id]
                mask = all_masks[id]
                mask = np.expand_dims(mask, axis=2)
                x_batch.append(img)
                y_batch.append(mask)
            x_batch = np.array(x_batch, np.float32) / 255
            y_batch = np.array(y_batch, np.float32) / 255
            yield x_batch, y_batch


In [14]:
train_batch_size = 6
val_batch_size   = 16

In [15]:
callbacks = [ReduceLROnPlateau(monitor='val_dice_coef',
                               factor=0.2,
                               patience=3,
                               verbose=1,
                               epsilon=1e-4,
                               mode='max'),
             ModelCheckpoint(monitor='val_dice_coef',
                             filepath='../weights/best_weights_resblock_1.hdf5',
                             save_best_only=True,
                             save_weights_only=True,
                             mode='max')]

history = model.fit_generator(generator=train_generator(train_batch_size),
                    steps_per_epoch=np.ceil(float(len(ids_train_split)) / float(train_batch_size)),
                    epochs=max_epochs,
                    verbose=2,
                    callbacks=callbacks,
                    validation_data=valid_generator(val_batch_size),
                    validation_steps=np.ceil(float(len(ids_valid_split)) / float(val_batch_size)))


Epoch 1/10
 - 2391s - loss: 0.1446 - dice_coef: 0.9391 - val_loss: 0.7907 - val_dice_coef: 0.7309
Epoch 2/10
 - 2235s - loss: 0.0451 - dice_coef: 0.9815 - val_loss: 0.0524 - val_dice_coef: 0.9791
Epoch 3/10
 - 2235s - loss: 0.0358 - dice_coef: 0.9857 - val_loss: 0.0646 - val_dice_coef: 0.9750
Epoch 4/10
 - 2231s - loss: 0.0282 - dice_coef: 0.9885 - val_loss: 0.0279 - val_dice_coef: 0.9886
Epoch 5/10
 - 2233s - loss: 0.0246 - dice_coef: 0.9899 - val_loss: 1.9311 - val_dice_coef: 0.6366
Epoch 6/10
 - 2235s - loss: 0.0223 - dice_coef: 0.9909 - val_loss: 0.5545 - val_dice_coef: 0.8320
Epoch 7/10
 - 2235s - loss: 0.0194 - dice_coef: 0.9920 - val_loss: 0.0254 - val_dice_coef: 0.9900
Epoch 8/10
 - 2221s - loss: 0.0186 - dice_coef: 0.9924 - val_loss: 0.0180 - val_dice_coef: 0.9925
Epoch 9/10
 - 2213s - loss: 0.0168 - dice_coef: 0.9930 - val_loss: 0.0166 - val_dice_coef: 0.9932
Epoch 10/10
 - 2214s - loss: 0.0160 - dice_coef: 0.9933 - val_loss: 0.1499 - val_dice_coef: 0.9600
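
The `dice_coef` values in the log measure the overlap between predicted and ground truth masks. The metric itself lives in the `losses` module, which is not shown here, so the NumPy sketch below is an assumption about its general form (twice the intersection divided by the sum of the two mask areas, with a smoothing term to avoid division by zero) and not the notebook's implementation. The example arrays are made up.

```python
import numpy as np


def dice_coefficient(y_true, y_pred, smooth=1.0):
    """Dice overlap between two masks with values in [0, 1]."""
    y_true = np.asarray(y_true, dtype=np.float64).ravel()
    y_pred = np.asarray(y_pred, dtype=np.float64).ravel()
    intersection = np.sum(y_true * y_pred)
    return (2.0 * intersection + smooth) / (np.sum(y_true) + np.sum(y_pred) + smooth)


truth = np.array([[1, 1, 0, 0],
                  [1, 1, 0, 0]])
probabilities = np.array([[0.9, 0.8, 0.6, 0.1],
                          [0.7, 0.4, 0.2, 0.0]])

# Binarise the predicted probabilities at 0.5, the value of `threshold` set earlier.
prediction = (probabilities > 0.5).astype(np.float64)

print(dice_coefficient(truth, prediction))  # 7/9, about 0.778
print(dice_coefficient(truth, truth))       # 1.0 for a perfect match
```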


## Discussion

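The initial numbers (from 5 epochs) for both the resblock U-Net and the dilated U-Net are comparable, and both models score very highly. There is some indication that the dilated U-Net is more stable: in the resblock run logged above, the validation Dice coefficient drops from 0.9886 at epoch 4 to 0.6366 at epoch 5 and from 0.9932 at epoch 9 to 0.9600 at epoch 10, even though the training Dice coefficient rises at every epoch.  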

The U-Nets were trained from scratch on an NVIDIA Tesla K80 GPU computing platform. It took around 40-50 mins per epoch.
The dilated U-Net, with 8.6 million parameters, took longer than the resblock U-Net, which has about 3.6 million parameters. The run times are listed below.

**Dilated U-Net** 
* about 3300 secs per epoch for the first 5 epochs (2 runs)

**Resblock U-Net** 
* about 2700 secs per epoch for the first 5 epochs (2 runs)
* about 2250 secs per epoch for the 10-epoch run (possibly because the code was rewritten or the computing resources on the Crestle platform changed)
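
To put these figures in context, the sketch below projects them to the 100-epoch training run mentioned in the introduction. It reads each figure as seconds per epoch, an interpretation consistent with the per-epoch times in the training log, and assumes the per-epoch time stays constant. The helper name is hypothetical.

```python
def projected_hours(seconds_per_epoch, epochs=100):
    """Total training time in hours, assuming a constant time per epoch."""
    return seconds_per_epoch * epochs / 3600.0


runs = [("dilated U-Net", 3300),
        ("resblock U-Net", 2700),
        ("resblock U-Net (10-epoch run)", 2250)]

for name, seconds in runs:
    hours = projected_hours(seconds)
    print("%s: %.1f hours (%.1f days)" % (name, hours, hours / 24))
```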


## References and Further Reading

<a name="ref1"></a>[1] [Olaf Ronneberger, Philipp Fischer, Thomas Brox. "U-Net: Convolutional Networks for Biomedical Image Segmentation." arXiv:1505.04597v1 [cs.CV]](https://arxiv.org/pdf/1505.04597.pdf)

<a name="ref2"></a>[2] [He, Kaiming, et al. "Deep Residual Learning for Image Recognition." arXiv:1512.03385v1 [cs.CV]](https://arxiv.org/pdf/1512.03385.pdf)

[3] [Pröve, Paul-Louis. "An Introduction to different Types of Convolutions in Deep Learning." Towards Data Science/Medium, Jul 22 2017](https://towardsdatascience.com/types-of-convolutions-in-deep-learning-717013397f4d)

[4] [Culurciello, Eugenio. "Neural Network Architectures." Towards Data Science/Medium, Mar 23 2017](https://towardsdatascience.com/neural-network-architectures-156e5bad51ba)

<div style="background-color: #FAAC58; margin-left: 0px; margin-right: 20px; padding-bottom: 8px; padding-left: 8px; padding-right: 8px; padding-top: 8px;">


Author:  Meena Mani  <br>
email:   meenas.mailbag@gmail.com   <br> 
Twitter: @meena_uvaca    <br>
</div>