Modeling

Our model is a modified version of U-Net, a popular image segmentation architecture. Starting from U-Net as the baseline for our segmentation task, we studied recent image segmentation techniques and applied them to it.

Model description

Like U-Net, our model has skip connections from the encoder to the decoder. Our addition is an Atrous Spatial Pyramid Pooling (ASPP) block at the end of the encoder, taken from Google's DeepLabv3. The block applies several atrous (dilated) convolutions with different dilation rates to the same feature map and concatenates their outputs. The result is a feature map that captures context at multiple scales, which helps the decoder segment the image as it upsamples with transpose convolutions.
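
To make the effect of dilation concrete, here is a minimal one-dimensional sketch in plain Python. It is not the model's actual layer: the function names `atrous_conv1d`, `aspp_1d`, and `effective_kernel_size` are hypothetical, and the rates (1, 2, 4) are assumed for illustration because the rates used in our ASPP block are not stated here. Each branch applies the same 3-tap kernel with a different gap between taps, and the per-rate outputs are stacked as channels, mirroring the concatenation step.

```python
def atrous_conv1d(signal, kernel, rate):
    """Zero-padded 'same' 1-D convolution whose taps are `rate` samples apart."""
    half = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        total = 0.0
        for k, weight in enumerate(kernel):
            j = i + (k - half) * rate  # tap position, spread out by the rate
            if 0 <= j < len(signal):   # positions outside the signal count as zero
                total += weight * signal[j]
        out.append(total)
    return out


def aspp_1d(signal, kernel, rates=(1, 2, 4)):
    """Run one atrous branch per rate and stack the results as channels."""
    branches = [atrous_conv1d(signal, kernel, r) for r in rates]
    # Transpose so every position carries one value per branch.
    return [list(channel_values) for channel_values in zip(*branches)]


def effective_kernel_size(kernel_size, rate):
    """Span covered by a dilated kernel: k + (k - 1) * (rate - 1)."""
    return kernel_size + (kernel_size - 1) * (rate - 1)


signal = [1, 2, 3, 4, 5]
features = aspp_1d(signal, kernel=[1, 1, 1])
print(features[2])  # one value per rate at the centre position
print([effective_kernel_size(3, r) for r in (1, 2, 4)])
```

A larger rate widens the span a kernel sees without adding weights, which is why combining several rates gives multi-scale context at a fixed parameter cost per branch.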

Encoder 

The encoder has 4 convolutional blocks followed by an ASPP block. Each convolutional block contains 2 convolutional layers with a dropout layer between them and a max pooling layer at the end. The number of filters in the convolutional layers increases with every block.
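
The shape bookkeeping for such an encoder can be sketched as follows. This is only an illustration under assumptions that are not given in this report: 2x2 max pooling with stride 2, a filter count that doubles per block, a hypothetical starting width of 16 filters, and a hypothetical 256x256 input. `encoder_shapes` is a made-up helper name.

```python
def encoder_shapes(height, width, base_filters=16, num_blocks=4):
    """Trace (height, width, filters) at the output of each encoder block.

    Assumes 2x2 max pooling with stride 2 and a filter count that doubles
    per block; both are illustrative choices, not taken from the report.
    """
    shapes = []
    filters = base_filters
    for _ in range(num_blocks):
        height, width = height // 2, width // 2  # pooling halves each side
        shapes.append((height, width, filters))
        filters *= 2
    return shapes


for block, shape in enumerate(encoder_shapes(256, 256), start=1):
    print(f"block {block}: {shape}")
```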

Decoder

The decoder has 2 upscaling blocks. Each upscaling block has a transpose convolution layer followed by 2 convolutional layers with a dropout layer between them. Finally, once the feature map has been upscaled back to the input size, a convolutional layer with a 1x1 filter and a sigmoid activation produces the output mask.
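
The final layer is easy to picture in NumPy: a 1x1 convolution with a single filter is a per-pixel dot product over the channel axis, and the sigmoid maps each result to a value between 0 and 1. The sketch below uses random toy data; `pointwise_sigmoid_head`, the 4x4 map size, and the 8 channels are assumptions for illustration only.

```python
import numpy as np


def pointwise_sigmoid_head(features, weights, bias=0.0):
    """Apply a single-filter 1x1 convolution followed by a sigmoid.

    features: (H, W, C) decoder output; weights: (C,) filter; returns (H, W).
    """
    logits = features @ weights + bias      # per-pixel dot product over channels
    return 1.0 / (1.0 + np.exp(-logits))    # squash logits into (0, 1)


rng = np.random.default_rng(0)
features = rng.normal(size=(4, 4, 8))       # toy 4x4 map with 8 channels
weights = rng.normal(size=8)
mask = pointwise_sigmoid_head(features, weights)
print(mask.shape, float(mask.min()), float(mask.max()))
```

Each output value can be read as the estimated probability that the pixel belongs to the subject.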

Hyper-parameter information

We used a loss function that combines binary cross-entropy loss with Dice loss, which is derived from the Dice coefficient.
The optimizer is Adam. We trained the model for 100 epochs with a batch size of 32.
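
As a worked example of the combined loss, here is a NumPy restatement on a tiny 2x2 mask. It mirrors the smoothing constant of 1 used by the notebook's `dice_coef`, but it is a simplified sketch: the cross-entropy is averaged over all pixels, the `eps` clipping is an added assumption, and the toy arrays are invented for illustration.

```python
import numpy as np


def dice_coef(y_true, y_pred, smooth=1.0):
    """Smoothed Dice coefficient over all pixels."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    intersection = np.sum(y_true * y_pred)
    return (2.0 * intersection + smooth) / (y_true.sum() + y_pred.sum() + smooth)


def bce(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy; eps keeps log() away from zero."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1.0 - eps)
    losses = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return float(np.mean(losses))


def bce_dice_loss(y_true, y_pred):
    """Cross-entropy plus Dice loss (1 - Dice coefficient)."""
    return bce(y_true, y_pred) + (1.0 - dice_coef(y_true, y_pred))


truth = np.array([[1, 1], [0, 0]])
good = np.array([[0.9, 0.8], [0.1, 0.2]])   # close to the ground truth
bad = np.array([[0.2, 0.1], [0.9, 0.8]])    # nearly inverted
print(bce_dice_loss(truth, good), bce_dice_loss(truth, bad))
```

The cross-entropy term penalizes each pixel independently, while the Dice term rewards overlap of the whole mask, so the nearly inverted prediction scores worse on both.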

Experimentation

We ran our experiments on Google Colab. The dataset we used is the PFCN dataset, from the paper “Xiaoyong Shen, Xin Tao, Hongyun Gao, Chao Zhou, Jiaya Jia. Deep Automatic Portrait Matting. European Conference on Computer Vision (ECCV), 2016”. The images in this dataset vary widely in foreground and background structure. The dataset contains 2,000 images with high-quality mattes, of which 1,700 are used for training and 300 for testing. We held out 15% of the 1,700 training images as the validation set.
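
The split sizes work out as follows. This is a minimal sketch, assuming a shuffled index split; the report does not say how the validation images were chosen, and `split_indices` and the seed are hypothetical.

```python
import random


def split_indices(num_items, val_fraction, seed=0):
    """Shuffle item indices and hold out a fraction for validation."""
    indices = list(range(num_items))
    random.Random(seed).shuffle(indices)      # seeded for a repeatable split
    num_val = round(num_items * val_fraction)
    return indices[num_val:], indices[:num_val]


train_idx, val_idx = split_indices(1700, 0.15)
print(len(train_idx), len(val_idx))  # 1445 255
```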

The main framework for modeling and training is TensorFlow 2.0. We started with a smaller, similar implementation of the U-Net architecture and researched current image segmentation techniques. Our goal is an image segmentation model for blurring the background of a portrait that stays small and simple like U-Net.

Literature Review

U-Net: Convolutional Networks for Biomedical Image Segmentation.
https://lmb.informatik.uni-freiburg.de/people/ronneber/u-net/

Google DeepLab v3
https://towardsdatascience.com/review-deeplabv3-atrous-convolution-semantic-segmentation-6d818bfd1d74




In [24]:
from google.colab import drive
drive.mount('/content/drive',force_remount=True)


Mounted at /content/drive


In [25]:
import tensorflow as tf
from tensorflow.keras import models, layers
import numpy as np
import cv2
from tensorflow.keras import backend as K
from tensorflow.keras.losses import binary_crossentropy

def dice_coef(y_true, y_pred, smooth=1):
    # Smoothed Dice coefficient computed over the flattened masks.
    y_true_f = K.flatten(y_true)
    y_pred_f = K.flatten(y_pred)
    intersection = K.sum(y_true_f * y_pred_f)
    return (2. * intersection + smooth) / (K.sum(y_true_f) + K.sum(y_pred_f) + smooth)

def dice_loss(y_true, y_predict):
    # Dice loss: 0 for a perfect overlap, approaching 1 for no overlap.
    return 1 - dice_coef(y_true, y_predict)

def bce_dice_loss(y_true, y_predict):
    # Combined training loss: binary cross-entropy plus Dice loss.
    return binary_crossentropy(y_true, y_predict) + dice_loss(y_true, y_predict)

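# Load the trained model; the custom loss and metric must be registered by name.
model = models.load_model('/content/drive/My Drive/CVProject/unetv2', custom_objects={'dice_coef': dice_coef, 'bce_dice_loss': bce_dice_loss})

frame = cv2.imread('/content/drive/My Drive/CVProject/test13.jpg')
if frame is None:
    raise FileNotFoundError('cv2.imread could not read the input image')

# Remember the original size, then resize to 600x800 (width x height) for the model.
height, width, channels = frame.shape
frame = cv2.resize(frame, (600, 800))
frame1 = frame[np.newaxis, :, :, :]  # add a batch dimension

# Predict the mask and drop the batch and channel dimensions.
mask = model(frame1)
mask = np.array(mask)
mask = mask[0][:, :, 0]

# Repeat the single-channel mask across the 3 colour channels.
full_mask = np.dstack([mask, mask, mask])

frame = frame.astype(float)
full_mask = full_mask.astype(float)

# Split the frame into subject and background, then blur only the background.
subject = cv2.multiply(full_mask, frame)
bg = cv2.multiply(1 - full_mask, frame)
bg = cv2.GaussianBlur(bg, (7, 7), 2)

# Recombine, clip to the valid 8-bit range, and restore the original size.
final = cv2.add(bg, subject)
final = np.clip(final, 0, 255).astype('uint8')
final = cv2.resize(final, (width, height))
cv2.imwrite('test13_res.jpg', final)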



True
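
One design point in the compositing step above: blurring the already-masked background lets the zeroed subject region bleed into the pixels near the mask edge. A common alternative, which is not what the notebook does, is to blur the whole frame first and then blend it with the mask. The 1-D NumPy sketch below compares the two on a flat signal, where any deviation from 100 is an artifact; `box_blur` is a hypothetical stand-in for `cv2.GaussianBlur`, and the hard 0/1 mask is a simplification of the model's soft output.

```python
import numpy as np


def box_blur(x):
    """3-tap box blur with edge replication (stand-in for a Gaussian blur)."""
    padded = np.pad(x, 1, mode="edge")
    return np.convolve(padded, np.ones(3) / 3.0, mode="valid")


frame = np.full(6, 100.0)                   # flat "image": any change is an artifact
mask = np.array([1, 1, 1, 0, 0, 0], float)  # 1 = subject, 0 = background

# Notebook approach: mask the background first, then blur it.
masked_then_blurred = mask * frame + box_blur((1 - mask) * frame)

# Alternative: blur the full frame, then blend with the mask.
blurred_then_blended = mask * frame + (1 - mask) * box_blur(frame)

print(np.round(masked_then_blurred, 1))
print(np.round(blurred_then_blended, 1))
```

In this toy case the first approach brightens the last subject pixel and darkens the first background pixel, while the second leaves the flat signal unchanged.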

Results

Metrics used 

We used the Dice coefficient, which measures the overlap between the predicted mask and the ground-truth mask. On our test set, the model reached a score of 93.56% after training for 100 epochs.

Visual results

input:
https://drive.google.com/file/d/136G5xOcE8hM6V4YZP__AMfLY2Q6fxbHM/view?usp=sharing


output:
https://drive.google.com/file/d/1eEFkYzQO4gg20TObutE00qX-2goabQIa/view?usp=sharing

We believe this is good progress, but it is not the final step. We plan to explore different loss functions and make further changes to the model architecture.
