<h1>Image Processing and Handling WS 2018/19</h1>

Exercise instructor: Marko Jovanović, mjovanovic@mi.rwth-aachen.de

<strong style="color: red">Notice: </strong>Attendance at <strong>all</strong> exercise sessions <strong>is mandatory</strong>. Submitted exercise solutions are not graded, nor are they a prerequisite for the exam, but you will receive feedback on them.

The exercise sessions are held from 12.30-14.00 on the following dates at COMA 1:

22.10.2018 - OpenCV and Python intro<br />
05.11.2018 - Image Enhancement<br />
19.11.2018 - Fourier Transform<br />
03.12.2018 - Low-level Image Segmentation<br />
<strong>17.12.2018 - High-level Image Segmentation</strong><br />
14.01.2019 - Visualization<br />
21.01.2019 - Automation<br />
28.01.2019 - Solving a Problem<br />

The topics listed are for orientation only and are subject to change in accordance with the lectures.

<h2>Exercise 4: High-level Image Segmentation</h2>

Due date: <strong>24.12.2018</strong>

<h3>Semantic Segmentation</h3>

Semantic segmentation aims to assign an object class to each image pixel. Unlike simple object detection, where we try to find and locate objects, possibly with their bounding boxes, we aim to delineate objects with dense, pixel-wise predictions from our models.
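To make the contrast with detection concrete, the following minimal sketch builds a toy label map by hand and derives a bounding box from it. The 4x6 array and its class IDs are invented for illustration; no model is involved.

```python
import numpy as np

# Toy 4x6 "segmentation": one class ID per pixel (0 = background, 1 = object).
# The values are made up for illustration only.
label_map = np.array([
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 0, 0, 1, 0, 0],
])

# A detector would only report the enclosing box of the object ...
rows, cols = np.nonzero(label_map == 1)
bbox = tuple(int(v) for v in (cols.min(), rows.min(), cols.max(), rows.max()))

# ... whereas the label map says exactly which pixels belong to it.
box_area = (bbox[2] - bbox[0] + 1) * (bbox[3] - bbox[1] + 1)
object_pixels = int((label_map == 1).sum())

print("bounding box (x0, y0, x1, y1):", bbox)
print("pixels inside box:", box_area, "of which object:", object_pixels)
```

In this toy case the box covers 12 pixels, but only 7 of them belong to the object; the dense label map carries that extra information.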

In the lecture, you were taught several segmentation methods: pixel-based methods such as Otsu thresholding; edge-based methods such as active contours and the watershed method; region growing; and model-based methods such as the Hough transform. In the last exercise, you programmed a thresholding method, K-means clustering, mean shift, the Canny and Sobel edge detectors, and a graph-cut segmentation method.

You were also introduced to artificial neural networks (ANNs) during the lecture.

In this exercise, you will learn about another segmentation method, this one based on deep learning. Deep learning is part of a broader family of machine learning algorithms that learn data representations (as opposed to task-specific algorithms). A deep neural network (DNN) is an artificial neural network (ANN) with multiple layers between the input and output layers. You will learn how to use such a DNN to apply semantic segmentation and extract a dense, pixel-wise class map from both images and video streams.
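As a rough illustration of what "multiple layers between the input and output layers" means, here is a minimal NumPy sketch of a forward pass through a small fully connected network. The layer sizes and the random weights are hypothetical and have nothing to do with ENet, which is a convolutional architecture; the sketch only shows how layers are stacked.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    # Element-wise non-linearity applied after each hidden layer.
    return np.maximum(x, 0.0)

def softmax(x):
    # Turn raw class scores into probabilities that sum to 1 per sample.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical layer sizes: 8 inputs -> 16 -> 16 -> 3 class scores.
sizes = [8, 16, 16, 3]
weights = [rng.normal(0, 0.1, size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

def forward(x):
    # Hidden layers: affine transform followed by a non-linearity.
    for W, b in zip(weights[:-1], biases[:-1]):
        x = relu(x @ W + b)
    # Output layer: class scores turned into probabilities.
    return softmax(x @ weights[-1] + biases[-1])

probs = forward(rng.normal(size=(5, 8)))  # batch of 5 input vectors
print(probs.shape)
```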

The semantic segmentation architecture we’re using for this exercise is ENet, which was introduced in Paszke et al.’s 2016 publication, <a href="https://arxiv.org/abs/1606.02147" target="_blank">ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation.</a>

Here is a diagram of the network's architecture:

<img src="architecture.jpg" />

One of the primary benefits of ENet is its speed: according to the authors, it is up to 18x faster and requires 79x fewer parameters than larger models, with similar or better accuracy. The model file itself is only 3.2MB, which makes it well suited to real-time applications and object tracking in video sequences.

Paszke et al. trained the model on <a href="https://www.cityscapes-dataset.com/dataset-overview/" target="_blank">The Cityscapes Dataset</a>, which provides semantic, instance-wise, dense pixel annotations of 20-30 classes (depending on which model you’re using). The model we will be using distinguishes the following classes:

<ul>
    <li>Unlabeled (i.e., background)</li>
    <li>Road</li>
    <li>Sidewalk</li>
    <li>Building</li>
    <li>Wall</li>
    <li>Fence</li>
    <li>Pole</li>
    <li>TrafficLight</li>
    <li>TrafficSign</li>
    <li>Vegetation</li>
    <li>Terrain</li>
    <li>Sky</li>
    <li>Person</li>
    <li>Rider</li>
    <li>Car</li>
    <li>Truck</li>
    <li>Bus</li>
    <li>Train</li>
    <li>Motorcycle</li>
    <li>Bicycle</li>
</ul>

Note that this exercise is purely a demonstration. You do not need to write code this time; instead, learn how to use the provided architecture to perform the semantic segmentation task.

Please note that this exercise's example code uses a library we have not introduced before: https://github.com/jrosebr1/imutils

Familiarize yourself with imutils, as it provides a few convenience functions that you may also use during the exam.
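One such convenience function is imutils.resize, which this exercise uses to resize the input image to a given width while preserving its aspect ratio. The sketch below reproduces only the size computation in plain Python. The helper target_size is hypothetical and not part of imutils, and it assumes the new height is obtained by scaling the old height with width / w and truncating to an integer.

```python
def target_size(h, w, width):
    """Hypothetical helper: (new_width, new_height) for an aspect-preserving resize.

    Assumes the height is scaled by the same ratio as the width and
    truncated to an integer.
    """
    ratio = width / float(w)
    return (width, int(h * ratio))

# Example: an image of 1024 rows x 2048 columns resized to a width of 500.
print(target_size(1024, 2048, 500))
```

With cv2.resize alone, both target dimensions have to be supplied explicitly, so a helper of this kind mainly saves the ratio bookkeeping.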

In [1]:
import numpy as np
import imutils
import time
import cv2

# path to deep learning segmentation model
model = 'opencv-semantic-segmentation/enet-cityscapes/enet-model.net' 

# path to .txt file containing class labels
classes = 'opencv-semantic-segmentation/enet-cityscapes/enet-classes.txt'

# path to input image
image = 'opencv-semantic-segmentation/images/example_01.png'

# path to .txt file containing colors for labels
colors = 'opencv-semantic-segmentation/enet-cityscapes/enet-colors.txt'

# desired width (in pixels) of input image
width = 500 

# load the class label names
with open(classes) as f:
    CLASSES = f.read().strip().split("\n")

# if a colors file was supplied, load it from disk
if colors:
    with open(colors) as f:
        COLORS = f.read().strip().split("\n")
    COLORS = [np.array(c.split(",")).astype("int") for c in COLORS]
    COLORS = np.array(COLORS, dtype="uint8")
# otherwise, we need to randomly generate RGB colors for each class label
else:
    # initialize a list of colors to represent each class label in
    # the mask (starting with 'black' for the background/unlabeled
    # regions)
    np.random.seed(42)
    COLORS = np.random.randint(0, 255, size=(len(CLASSES) - 1, 3), dtype="uint8")
    COLORS = np.vstack([[0, 0, 0], COLORS]).astype("uint8")

In [2]:
# initialize the legend visualization
legend = np.zeros(((len(CLASSES) * 25) + 25, 300, 3), dtype="uint8")
 
# loop over the class names + colors
for (i, (className, color)) in enumerate(zip(CLASSES, COLORS)):
	# draw the class name + color on the legend
	color = [int(c) for c in color]
	cv2.putText(legend, className, (5, (i * 25) + 17),
		cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 0, 255), 2)
	cv2.rectangle(legend, (100, (i * 25)), (300, (i * 25) + 25),
		tuple(color), -1)

In the following, we use OpenCV's DNN module.
See https://github.com/opencv/opencv/wiki/Deep-Learning-in-OpenCV and https://www.pyimagesearch.com/2017/08/21/deep-learning-with-opencv/ for an overview of the module's capabilities.

Please take a moment to look at what cv2.dnn.blobFromImage does:

https://www.pyimagesearch.com/2017/11/06/deep-learning-opencvs-blobfromimage-works/
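As a rough mental model, the preprocessing can be sketched in NumPy for the parameters used in this exercise (scale factor 1/255, zero mean, swapRB=True). The function blob_sketch is a hypothetical stand-in written for illustration: it deliberately skips the resizing step and is not the OpenCV implementation.

```python
import numpy as np

def blob_sketch(image_bgr, scalefactor=1 / 255.0, swap_rb=True):
    """Hypothetical stand-in for the non-resizing part of blob creation."""
    x = image_bgr.astype("float32") * scalefactor  # scale pixel values to [0, 1]
    if swap_rb:
        x = x[:, :, ::-1]                          # reverse channels: BGR -> RGB
    x = np.transpose(x, (2, 0, 1))                 # HWC -> CHW
    return np.ascontiguousarray(x[np.newaxis])     # add batch axis -> NCHW

# Synthetic 512x1024 image that is pure blue in BGR order.
image_bgr = np.zeros((512, 1024, 3), dtype="uint8")
image_bgr[..., 0] = 255

blob = blob_sketch(image_bgr)
print(blob.shape, blob.dtype)
```

The point to take away is the layout change: the network receives a 4-D float array of shape (batch, channels, height, width) rather than the usual height x width x channels image.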

In [3]:
# load our serialized model from disk
print("[INFO] loading model...")
net = cv2.dnn.readNet(model)
 
# load the input image, resize it, and construct a blob from it,
# keeping in mind that ENet was trained on input images of size
# 1024x512
image = cv2.imread(image)
image = imutils.resize(image, width=width)
blob = cv2.dnn.blobFromImage(image, 1 / 255.0, (1024, 512), 0, swapRB=True, crop=False)
 
# perform a forward pass using the segmentation model
net.setInput(blob)
start = time.time()
output = net.forward()
end = time.time()
 
# show the amount of time inference took
print("[INFO] inference took {:.4f} seconds".format(end - start))

[INFO] loading model...
[INFO] inference took 0.2590 seconds


In [4]:
# infer the total number of classes along with the spatial dimensions
# of the mask image via the shape of the output array
(numClasses, height, width) = output.shape[1:4]
 
# our output class ID map will be num_classes x height x width in
# size, so we take the argmax to find the class label with the
# largest probability for each and every (x, y)-coordinate in the
# image
classMap = np.argmax(output[0], axis=0)
 
# given the class ID map, we can map each of the class IDs to its
# corresponding color
mask = COLORS[classMap]
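The two steps in the cell above, the argmax over the class axis and the color lookup, can be traced on a toy example. The scores and the three-entry palette below are invented for illustration and are much smaller than the real network output.

```python
import numpy as np

# Toy network output: 1 image, 3 classes, 2x2 pixels (scores are made up).
scores = np.array([[
    [[0.1, 0.7], [0.2, 0.1]],   # class 0
    [[0.8, 0.2], [0.3, 0.1]],   # class 1
    [[0.1, 0.1], [0.5, 0.8]],   # class 2
]])

# One color triple per class (hypothetical palette).
palette = np.array([[0, 0, 0], [255, 0, 0], [0, 255, 0]], dtype="uint8")

# For each pixel, pick the class with the highest score.
class_map = np.argmax(scores[0], axis=0)   # shape (2, 2)

# Indexing the palette with the class map yields one color per pixel.
mask = palette[class_map]                  # shape (2, 2, 3)

print(class_map)
print(mask.shape)
```

Indexing an array of colors with an integer array is what makes the single line mask = COLORS[classMap] sufficient; no explicit loop over pixels is needed.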

In [5]:
# resize the mask and class map such that its dimensions match the
# original size of the input image (we're not using the class map
# here for anything else but this is how you would resize it just in
# case you wanted to extract specific pixels/classes)
mask = cv2.resize(mask, (image.shape[1], image.shape[0]), interpolation=cv2.INTER_NEAREST)
classMap = cv2.resize(classMap, (image.shape[1], image.shape[0]), interpolation=cv2.INTER_NEAREST)
 
# perform a weighted combination of the input image with the mask to
# form an output visualization
output = ((0.4 * image) + (0.6 * mask)).astype("uint8")

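from matplotlib import pyplot as plt

plt.rcParams['figure.figsize'] = (15.0, 15.0)

# OpenCV stores images in BGR order, while matplotlib expects RGB, so
# convert before display; the legend is converted as well so that its
# colors stay consistent with the output image
plt.subplot(121), plt.imshow(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
plt.title('Input image'), plt.xticks([]), plt.yticks([])
plt.subplot(122), plt.imshow(cv2.cvtColor(legend, cv2.COLOR_BGR2RGB))
plt.title('Legend'), plt.xticks([]), plt.yticks([])
plt.show()

plt.subplot(111), plt.imshow(cv2.cvtColor(output, cv2.COLOR_BGR2RGB))
plt.title('Output image'), plt.xticks([]), plt.yticks([])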

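The comment in cell [5] mentions that the resized class map could be used to extract specific pixels or classes. A minimal sketch of that idea follows, using a tiny hand-made class map and a shortened, hypothetical class list in place of the real classMap and CLASSES.

```python
import numpy as np

# Shortened, hypothetical class list and a 3x3 class map (values made up).
class_names = ["Unlabeled", "Road", "Car"]
class_map = np.array([
    [1, 1, 2],
    [1, 2, 2],
    [0, 1, 1],
])
image = np.full((3, 3, 3), 200, dtype="uint8")  # uniform gray stand-in image

# Boolean mask of all pixels predicted as "Car".
car_id = class_names.index("Car")
car_pixels = class_map == car_id

# Keep only the car pixels of the image, black out everything else.
cutout = image.copy()
cutout[~car_pixels] = 0

# Fraction of the image covered by the class.
share = car_pixels.mean()
print("car pixels:", int(car_pixels.sum()), "share of image:", round(float(share), 3))
```

The same comparison works on the full-size classMap as long as it has been resized with nearest-neighbor interpolation, as in cell [5], so that it still contains valid integer class IDs.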

As you can see, performing semantic image segmentation with a pre-trained model is a fairly easy task.

Read this tutorial if you are interested in training your own ENet model: https://github.com/TimoSaemann/ENet/tree/master/Tutorial