# U-Net: Convolutional Networks for Biomedical Image Segmentation 
### Olaf Ronneberger, Philipp Fischer, and Thomas Brox

### Some general notes
* The original paper dates back to May 2015. That was the first release, so it's quite "old".
* At the time it was simply the best.
* It is still state-of-the-art for some biomedical applications.

## 1.1 Introduction
* Discusses previous state-of-the-art approaches.
* Shows how CNNs outperform previous methods in visual recognition tasks.
* Highlights the need for per-pixel classification in some visual tasks (especially for biomedical images).
* Notes the small number of training examples available for biomedical tasks, because annotating them requires very specific expertise.

In the later part of the introduction, the authors discuss the theory behind the U-Net architecture:

* A more elegant fully convolutional network proposal.
* The basic idea is to use a downsampling path for **feature extraction** followed by an upsampling path for **precise localization** of these features in higher-resolution layers.
* Concatenating downsampling feature maps with the corresponding upsampling layers helps successive conv layers assemble a more precise output.

(Don't worry, this is just the theory; it will make much more sense once you see the architecture and the implementation.)

One thing to mention before we jump into the actual architecture is the *overlap-tile strategy*:
* Input size > output size.
* The output segmentation map only contains the pixels for which the full context is available in the input image.
* Missing context at the borders of the input image is extrapolated by mirroring.
![](images/overlap-tile-strategy.png)
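
To make the tiling concrete, here is a minimal NumPy sketch of the overlap-tile idea, using the 572 input / 388 output sizes from the architecture below. The function name `overlap_tiles` is made up for this note, and `mode="reflect"` (mirroring without repeating the edge pixel) is an assumption; the paper only says the missing context is mirrored.

```python
import numpy as np

def overlap_tiles(image, in_size=572, out_size=388):
    """Yield (row, col, tile) for a 2-D image.

    Each tile is in_size x in_size; the network's prediction for it would
    cover image[row:row + out_size, col:col + out_size].
    """
    margin = (in_size - out_size) // 2  # context consumed by the unpadded convs
    h, w = image.shape
    # Extra padding so the image splits into a whole number of output tiles
    extra_h, extra_w = -h % out_size, -w % out_size
    padded = np.pad(
        image,
        ((margin, margin + extra_h), (margin, margin + extra_w)),
        mode="reflect",  # assumption: mirror without repeating the edge pixel
    )
    for row in range(0, h, out_size):
        for col in range(0, w, out_size):
            yield row, col, padded[row:row + in_size, col:col + in_size]

image = np.random.rand(512, 512)
tiles = list(overlap_tiles(image))
```

For a 512x512 image this gives a 2x2 grid of 572x572 tiles; the predictions for the last row and column of tiles would have to be cut back to the original image size.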

For the tasks this network was designed for, very little training data was available, so the authors had to come up with a very heavy data augmentation strategy.

Another challenge was the accurate separation of touching objects. When an image contains a lot of cells, many of them touch each other, and it is easy for a segmentation network to simply merge them. The authors therefore penalize the network for doing so, which makes it focus on drawing the separation borders between cells.

My intuition is that this idea came out of a good error analysis, so for our purpose we may want to omit that part of the loss function; it looks quite domain-specific.
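
For reference, the border penalty is a per-pixel weight map. In the paper it is w(x) = w_c(x) + w0 * exp(-(d1(x) + d2(x))^2 / (2 * sigma^2)), where d1 and d2 are the distances to the border of the nearest and second-nearest cell, with w0 = 10 and sigma of about 5 pixels. Below is a brute-force NumPy sketch for small label images. Several things here are assumptions of this note, not the authors' code: the function name, the form of the class-balancing term w_c, measuring distances to the nearest pixel of each cell, and applying the border term to background pixels only. For real images `scipy.ndimage.distance_transform_edt` would be a much faster way to get the distances.

```python
import numpy as np

def border_weight_map(labels, w0=10.0, sigma=5.0):
    """labels: 2-D int array, 0 = background, 1..K = individual cells."""
    foreground = labels > 0
    fg_freq = foreground.mean()
    # Class-balancing term w_c: the rarer class gets the larger weight (assumed form)
    weights = np.where(foreground, 1.0 - fg_freq, fg_freq)

    cell_ids = [k for k in np.unique(labels) if k != 0]
    if len(cell_ids) < 2:
        return weights  # no pair of cells, so no separation border to emphasize

    # Coordinates of every pixel, in row-major order: shape (P, 2)
    pixels = np.argwhere(np.ones_like(labels, dtype=bool)).astype(float)
    dists = np.empty((len(cell_ids), len(pixels)))
    for i, k in enumerate(cell_ids):
        cell_pixels = np.argwhere(labels == k).astype(float)  # (N_k, 2)
        diff = pixels[:, None, :] - cell_pixels[None, :, :]
        dists[i] = np.sqrt((diff ** 2).sum(axis=2)).min(axis=1)

    dists.sort(axis=0)
    d1, d2 = dists[0], dists[1]  # nearest and second-nearest cell
    border = w0 * np.exp(-((d1 + d2) ** 2) / (2 * sigma ** 2))
    border = border.reshape(labels.shape)
    return weights + border * ~foreground  # border term on background only

labels = np.zeros((20, 20), dtype=int)
labels[7:13, 3:9] = 1    # first cell
labels[7:13, 11:17] = 2  # second cell, separated by a two-pixel gap
weights = border_weight_map(labels)
```

Pixels in the narrow gap between the two cells end up with a much larger weight than background pixels far away from both, which is exactly the "focus on the separation border" effect.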

## 1.2 Network architecture

It's very well defined in the paper, and it was very easy to reconstruct the network (as @ppisarski did).

A picture is worth a thousand words, so let's jump straight to the image of the overall architecture:
![](https://raw.githubusercontent.com/cienciaydatos/ai-challenge-trees/master/unet/images/unet.png)
The downsampling path is basically feature extraction, as in a common CNN, but instead of flattening the feature maps at some point, the authors preserve the spatial structure along the entire network. On top of that, the skip connections seem to carry some of the spatial information over to the upsampling path.

### Downsampling path
* 4 conv blocks (2 conv layers each), each followed by a 2x2 max pooling layer with stride 2 for downsampling
* 5th conv block without max pooling (connection to the upsampling path)
* First conv block with 64 filters in each conv layer
* Number of filters doubled with each consecutive conv block
* Reduce resolution, increase depth
* No padding (valid padding)
* 3x3 filters with ReLU activation

```python
from keras.models import Model
from keras.layers import BatchNormalization, Conv2D, Conv2DTranspose, Cropping2D
from keras.layers import MaxPooling2D, Dropout, UpSampling2D, Input, concatenate

def conv2d_block(inputs, filters=16):
    """Two unpadded 3x3 convolutions, each followed by a ReLU."""
    c = inputs
    for _ in range(2):
        c = Conv2D(filters, (3, 3), activation='relu', padding='valid')(c)
    return c
```


```python
inputs = Input((572, 572, 1))  # kept so a Model can be built from it later
x = inputs

# Downsampling path
down_layers = []  # output of each conv block, reused by the skip connections
filters = 64
for _ in range(4):
    x = conv2d_block(x, filters)
    down_layers.append(x)
    x = MaxPooling2D((2, 2), strides=2)(x)
    filters *= 2  # Number of filters doubled with each block

x = conv2d_block(x, filters)  # 5th conv block without max pooling
print(x)
```

```
Tensor("conv2d_10/Relu:0", shape=(?, 28, 28, 1024), dtype=float32)
```


### Upsampling path
* Symmetric to the downsampling path (thus U-shape => U-Net)
* Number of filters for each consecutive conv block equals half of the filters of the previous conv block
* Increase resolution, reduce depth (number of feature channels)
* Concatenate feature maps from the corresponding downsampling layers for more precise localization
* Final layer is a 1x1 conv used to map each 64-component feature vector to the desired number of classes
* There are several upsampling operators; the authors use "up-convolution"
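
To see what a 2x2 up-convolution with stride 2 actually computes, here is a small NumPy sketch. Because the stride equals the kernel size, every input pixel simply expands into its own 2x2 output patch, with no overlap between patches. The function name and the `(2, 2, C_in, C_out)` kernel layout are choices made for this note (they are not Keras' internal layout), and the bias is left out.

```python
import numpy as np

def up_conv_2x2(x, kernel):
    """x: (H, W, C_in); kernel: (2, 2, C_in, C_out) -> (2H, 2W, C_out)."""
    h, w, _ = x.shape
    c_out = kernel.shape[-1]
    # blocks[i, a, j, b, f] = sum_c x[i, j, c] * kernel[a, b, c, f]
    blocks = np.einsum("ijc,abcf->iajbf", x, kernel)
    # Output pixel (2i + a, 2j + b) comes from input pixel (i, j)
    return blocks.reshape(2 * h, 2 * w, c_out)

x = np.arange(1.0, 5.0).reshape(2, 2, 1)
kernel = np.array([[1.0, 2.0], [3.0, 4.0]]).reshape(2, 2, 1, 1)
up = up_conv_2x2(x, kernel)
```

With a single channel, the result is just each input value multiplied by the 2x2 kernel and laid out as a block; with more channels, the kernel also mixes (and here halves) the feature channels.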

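```python
def crop_shape(down, up):
    """Per-side crop that shrinks a `down` feature map to the size of `up`.

    Both arguments are (batch, height, width, channels) shape tuples.
    """
    ch = int(down[1] - up[1])
    cw = int(down[2] - up[2])
    # Split the difference over both sides; an odd difference puts the extra pixel last
    ch1, ch2 = ch // 2, ch - ch // 2
    cw1, cw2 = cw // 2, cw - cw // 2

    return (ch1, ch2), (cw1, cw2)
```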

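```python
from keras import backend as K

# Upsampling path
for conv in reversed(down_layers):
    filters //= 2  # Number of filters halved with each block
    x = Conv2DTranspose(filters, (2, 2), strides=(2, 2),
                        padding='same')(x)

    # Crop the skip connection to the (smaller) upsampled size before concatenating.
    # K.int_shape is the public way to read the static shape (instead of the private _keras_shape).
    ch, cw = crop_shape(K.int_shape(conv), K.int_shape(x))
    conv = Cropping2D((ch, cw))(conv)

    x = concatenate([x, conv])
    x = conv2d_block(x, filters)

# 1x1 convolution mapping each 64-component feature vector to 2 classes
output = Conv2D(2, (1, 1), activation='softmax')(x)
print(output)
```

```
Tensor("conv2d_19/truediv:0", shape=(?, 388, 388, 2), dtype=float32)
```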


### Some implementation notes
* The paper mentions dropout layers at the end of the downsampling path, but does not give the exact position or rate.
* Presumably you can use `same` padding and avoid all the cropping.
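
The effect of the padding choice can be checked with plain arithmetic, without building the model. The helper below is a sketch written for this note (the name `trace_unet_sizes` is made up): it follows the spatial size through the network and reports the per-side crop needed at each skip connection. It rejects odd sizes before pooling, following the paper's advice to pick tile sizes so that every 2x2 max pooling sees an even size.

```python
def trace_unet_sizes(input_size, depth=4, padding="valid"):
    """Return (output_size, crops) for a square input.

    crops[i] is the number of pixels cut from each side of the skip
    connection at level i (0 = highest resolution).
    """
    shrink = 4 if padding == "valid" else 0  # two unpadded 3x3 convs lose 2 px each
    size = input_size
    skip_sizes = []
    for _ in range(depth):
        size -= shrink  # conv block
        if size <= 0 or size % 2:
            raise ValueError(f"max pooling needs a positive even size, got {size}")
        skip_sizes.append(size)
        size //= 2  # 2x2 max pooling, stride 2
    size -= shrink  # 5th conv block
    if size <= 0:
        raise ValueError("input too small for this depth")

    crops = []
    for skip in reversed(skip_sizes):
        size *= 2  # 2x2 up-convolution, stride 2
        crops.append((skip - size) // 2)
        size -= shrink  # conv block after the concatenation
    return size, crops[::-1]

print(trace_unet_sizes(572))                  # valid padding, as in the paper
print(trace_unet_sizes(512, padding="same"))  # same padding, no cropping
```

With valid padding, 572 goes to 388 and the skip connections need crops of 88, 40, 16 and 4 pixels per side. With `same` padding all crops are zero and the output has the input size, but the input side then has to be divisible by 16.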

## 1.3 Training
* They implemented everything in Caffe; the code is available [here](https://lmb.informatik.uni-freiburg.de/people/ronneber/u-net/u-net-release-2015-10-02.tar.gz).
* The optimizer is SGD.
* To minimize overhead and make maximum use of GPU memory, they favor large input tiles over a large batch size, and ended up using a batch of a single image. This is a historical note; a batch size of one is often a bad idea.
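* High momentum (0.99), so that a large number of previously seen training samples determine the update in the current step.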
* The loss function is a pixel-wise softmax over the final feature map combined with the standard cross-entropy. They also precompute a weight map for each ground-truth segmentation to compensate for the different frequency of pixels from each class.
* We don't often see such a weight map in neural networks.
* Good initialization of the weights is extremely important.
* They mention that generating smooth deformations using random displacement vectors was the most influential data augmentation.
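
The loss described above can be sketched in a few lines of NumPy. This is an illustration written for this note, not the authors' Caffe code; the function name is made up, and the sum is negated so that lower is better.

```python
import numpy as np

def weighted_softmax_cross_entropy(logits, labels, weights):
    """logits: (H, W, K) scores; labels: (H, W) int class ids; weights: (H, W)."""
    # Pixel-wise log-softmax over the class axis, shifted for numerical stability
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Log-probability of the true class at every pixel
    rows, cols = np.indices(labels.shape)
    picked = log_probs[rows, cols, labels]
    # Each pixel contributes in proportion to its precomputed weight
    return -(weights * picked).sum()

logits = np.zeros((2, 2, 2))           # undecided network: p = 0.5 everywhere
labels = np.zeros((2, 2), dtype=int)
uniform = weighted_softmax_cross_entropy(logits, labels, np.ones((2, 2)))
emphasized = weighted_softmax_cross_entropy(
    logits, labels, np.array([[2.0, 1.0], [1.0, 1.0]])
)
```

Doubling the weight of one pixel makes its error count twice, which is how the weight map steers the network towards rare classes and separation borders.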

## Future tasks
* Define loss function.
* Standardize input pipeline.
* Transfer learning on feature extractor.
* Define model metrics.
* Investigate the upsampling operator, arXiv:1603.07285 is an awesome starting point.
* Real implications of valid vs same padding.
* Is there a way to add metadata to the model?
* Post-processing?