# Segmentation

## What is segmentation?

For images, which are what we consider in this notebook, segmentation is the task of assigning every pixel to a class.

![](images/segmentation.jpeg)

What you see above is called a segmentation map.

Segmentation can also be performed on other types of data such as video or text.

## Please note
Segmentation is not a solved problem, and we don't expect to achieve state-of-the-art results in this notebook.
Instead, we'll just set up a CNN that trains, even if only a little, to show you how to hook things up from input to output for learning and inference.

## Different types of segmentation

**Instance segmentation** is the task of distinguishing between different instances of the same class, as well as between different classes.

**Semantic segmentation** is the task of distinguishing only between different classes; separate instances of the same class share one label.

![](images/semantic_vs_instance.png)

## What does a segmentation dataset look like?

Let's read in the PASCAL VOC segmentation dataset and look at an example.

You can download the dataset from [here](https://github.com/life-efficient/VOC).

```python
from VOCDataset import VOC
import matplotlib.pyplot as plt

pascal = VOC(root='/Users/ice/ai-core/VOC2007') # set the root to where the VOC dataset is stored on your machine

img, maps = pascal[3]
img.show()
print(type(img))
print(img)
print(img.size)
print()
print(type(maps))
print(maps)
print(maps.shape)
```

```
<class 'PIL.JpegImagePlugin.JpegImageFile'>
<PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=375x500 at 0x7F80E95A3340>
(375, 500)

<class 'torch.Tensor'>
tensor([[[0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         ...,
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.]]])
torch.Size([1, 500, 375])
```


We can see that the input is the usual image that we're used to. 
But what about the label?

It's a tensor with the same height and width as the input image (plus a leading channel dimension of size 1), which contains integer values, stored here as floats.
Each value is the index of the class to which the pixel at the corresponding location belongs.
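To make this concrete, here is a minimal sketch that uses a small made-up label map rather than the real dataset. The tensor `toy_map` and its class indices are hypothetical, and treating index 0 as background is an assumption. `torch.unique` lists which class indices appear in the map, and casting to `long` gives the integer dtype that PyTorch's cross entropy loss expects for class-index targets.

```python
import torch

# Hypothetical 1 x 4 x 4 label map (channel, height, width), stored as floats
# like the VOC example above. Index 0 is assumed to mean "background".
toy_map = torch.tensor([[[0., 0., 0., 0.],
                         [0., 5., 5., 0.],
                         [0., 5., 5., 0.],
                         [0., 0., 12., 12.]]])

# Which class indices appear anywhere in this map?
class_indices = torch.unique(toy_map).long()
print(class_indices.tolist())  # [0, 5, 12]

# Drop the channel dimension and cast to integers: one class index per pixel
target = toy_map.squeeze(0).long()
print(target.shape)         # torch.Size([4, 4])
print(target[1, 1].item())  # 5
```

Running the same two lines on `maps` from the cell above should show which classes are present in that particular image.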


## What should our output prediction look like?

This is a multiclass classification problem, right?
So we know that we are going to use the cross entropy loss function.
That loss function takes in the prediction and the label. 
As usual, the label is of size $batch\_size$, with an integer label in each position.

However, our classification models never output the predicted class explicitly. 
Instead they output a probability distribution over all classes (a vector of probabilities for each example). 
But in the case of segmentation, we are making this vector prediction at every pixel location!
So our prediction should be of size $batch\_size \times height \times width \times num\_classes$ ($B \times H \times W \times K$).
This should give you an idea of how much more computationally expensive this task is compared to regular classification.
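The shapes involved can be sketched with random stand-in tensors. All sizes here are hypothetical, and $K = 21$ assumes the 20 VOC object classes plus background. Note one practical detail: PyTorch's `F.cross_entropy` expects the class dimension second, so the prediction is laid out as $B \times K \times H \times W$ rather than $B \times H \times W \times K$, and it takes unnormalised scores (logits) rather than probabilities.

```python
import torch
import torch.nn.functional as F

# Hypothetical sizes; K = 21 assumes 20 VOC classes + background
B, K, H, W = 2, 21, 8, 6

# Stand-in for a model output: one unnormalised score (logit) per class, per pixel
logits = torch.randn(B, K, H, W)

# Stand-in for the labels: one integer class index per pixel
target = torch.randint(0, K, (B, H, W))

# cross_entropy applies the softmax over the class dimension (dim 1) itself,
# then averages the per-pixel losses into a single scalar
loss = F.cross_entropy(logits, target)
print(loss.shape)  # torch.Size([])

# The same prediction as probabilities, rearranged to the B x H x W x K layout
probs = logits.softmax(dim=1).permute(0, 2, 3, 1)
print(probs.shape)  # torch.Size([2, 8, 6, 21])
```

Each pixel contributes its own $K$-way classification loss, which is why the cost grows with $H \times W$ compared to classifying the whole image once.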

![](images/segmentation-in-out.jpg)

We know that, without padding, a convolution reduces the spatial size of the data it processes. Along each spatial axis, the filter produces a single number for each region it covers, and it reaches the side or bottom of the image before it has visited as many positions as there are input pixels, so the output is smaller than the input.
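The amount of shrinkage follows the standard output-size formula for a convolution without dilation, $\lfloor (n + 2p - k) / s \rfloor + 1$. The helper below is a hypothetical illustration of that formula, not part of any library, applied to the 500 x 375 image from earlier.

```python
def conv_output_size(input_size, kernel_size, padding=0, stride=1):
    """Output length along one spatial axis of a convolution (no dilation)."""
    return (input_size + 2 * padding - kernel_size) // stride + 1

# One unpadded 3x3 convolution trims one pixel from each edge
print(conv_output_size(500, kernel_size=3))  # 498
print(conv_output_size(375, kernel_size=3))  # 373

# Stacking three such layers keeps shrinking the feature map
size = 500
for _ in range(3):
    size = conv_output_size(size, kernel_size=3)
print(size)  # 494
```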

How can we produce an output of the same size as the input whilst using convolutional layers? This is the question of: what should the architecture of our hidden layers be?

There are many ways to do this. One thing that we learnt previously was that we can add a specific amount of padding to the input of a conv layer so that the output stays the same size as the input.
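As a minimal sketch of the padding approach, the toy network below uses 3x3 convolutions with `padding=1`, so the height and width never change. The layer widths, the input size, and `num_classes = 21` (20 VOC classes plus background) are all assumptions chosen for illustration; this is not a model that would segment well.

```python
import torch
import torch.nn as nn

num_classes = 21  # assumption: 20 VOC classes + background

# Hypothetical toy network: every 3x3 conv uses padding=1,
# so the spatial size of the output matches the input
toy_net = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(16, num_classes, kernel_size=3, padding=1),
)

images = torch.randn(2, 3, 50, 37)  # stand-in batch: B x 3 x H x W
logits = toy_net(images)
print(logits.shape)  # torch.Size([2, 21, 50, 37])

# Predicted segmentation map: the highest-scoring class at every pixel
pred_map = logits.argmax(dim=1)
print(pred_map.shape)  # torch.Size([2, 50, 37])
```

The final layer has one output channel per class, so taking the `argmax` over the channel dimension turns the per-pixel scores into a map of class indices with the same height and width as the input.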

We won't dive into the specifics of how state-of-the-art segmentation models work. They change often and can get very complicated. But if you're interested, you should look into:
- FCN
- Mask R-CNN