# Introduction to Semantic Segmentation     
## Steve Elston     
## Introduction
**Semantic segmentation** is one of three types of image segmentation tasks. The goal of semantic segmentation is to label each of the pixels in an image by category of the 'thing' associated with each pixel. In other words, the labeling associates the semantics of the things in the image with pixel labels.     

You will use a convolutional neural network with an encoder-decoder architecture known as [**DeepLabV3+**](https://arxiv.org/abs/1802.02611v3) to perform multi-class semantic segmentation. For this exercise you will use the [**Crowd Instance-level Human Parsing (CIHP) Dataset**](https://github.com/nikhilroxtomar/Multiclass-Segmentation-on-Crowd-Instance-level-Human-Parsing-CHIP-Dataset-using-UNET) to segment images with multiple people into up to 20 semantic categories. The categories include face, hair, arms and hands, legs, feet, etc. The dataset contains pairs of a $512 \times 512 \times 3$ RGB image and a $512 \times 512 \times 1$ segmentation mask.

## Setup    
You will work with the Keras example Jupyter notebook [**Multiclass semantic segmentation using DeepLabV3+**](https://keras.io/examples/vision/deeplabv3_plus/). This notebook uses a DeepLabV3+ backbone network to perform multi-class semantic segmentation on the images in the dataset.

It is recommended that you run this notebook with a paid [**Google Colab Pro+**](https://colab.research.google.com/signup) subscription. Colab Pro+ allows background execution, so model training is not lost when the session ends. For the fastest execution, the **A100 GPU** is recommended, along with High-RAM.

Before you execute the code in the notebook, make the following changes. In the interest of limiting the development time and complexity of this example, we are taking a few shortcuts, violating best practices in the process.
- Only one model evaluation metric is used, accuracy. Ideally, we should be using others like mAP.
- No early stopping is used. The model simply runs for a fixed number of epochs, even if the result is not optimal. 

### 1. Initial visualization of dataset

The dataset contains pairs of a $512 \times 512 \times 3$ RGB image and a $512 \times 512 \times 1$ ground truth segmentation mask. This segmentation mask contains the labels for training the multi-class segmentation model. To visualize a few of these pairs, create a code cell following the cell that creates the TensorFlow dataset and paste in the following code.
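A minimal sketch of such a visualization cell is given here, under stated assumptions. The helper name `show_image_mask_pairs`, the `tab20` colormap, and the min-max rescaling of the image for display are illustrative choices, not part of the original notebook. The commented usage assumes the dataset variable is named `train_dataset`.

```python
import matplotlib.pyplot as plt
import numpy as np


def show_image_mask_pairs(images, masks, num_pairs=3, num_classes=20):
    """Plot up to `num_pairs` images next to their ground truth masks.

    Hypothetical helper: `images` has shape (N, H, W, 3) and `masks`
    has shape (N, H, W, 1) or (N, H, W).
    """
    num_pairs = min(num_pairs, len(images))
    fig, axes = plt.subplots(
        num_pairs, 2, figsize=(8, 4 * num_pairs), squeeze=False
    )
    for row in range(num_pairs):
        image = np.asarray(images[row], dtype=np.float32)
        # Min-max rescale for display only, since the input pipeline may
        # have normalized the pixel values outside the [0, 1] range.
        value_range = image.max() - image.min()
        if value_range > 0:
            image = (image - image.min()) / value_range
        else:
            image = np.zeros_like(image)
        mask = np.asarray(masks[row]).squeeze()  # (H, W, 1) -> (H, W)
        axes[row, 0].imshow(image)
        axes[row, 0].set_title("Image")
        # Fixed vmin/vmax keep the class colors consistent across images.
        axes[row, 1].imshow(mask, cmap="tab20", vmin=0, vmax=num_classes - 1)
        axes[row, 1].set_title("Ground truth mask")
        for ax in axes[row]:
            ax.axis("off")
    fig.tight_layout()
    return fig


# Assumed usage in the notebook, where `train_dataset` is taken to be the
# tf.data.Dataset of (image, mask) batches built in the preceding cell:
# for image_batch, mask_batch in train_dataset.take(1):
#     show_image_mask_pairs(image_batch.numpy(), mask_batch.numpy())
#     plt.show()
```

Fixing the color range to the number of classes means that a given category is drawn in the same color in every mask, which makes the pairs easier to compare.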

### 2. Training

As provided, the model training code in the original Jupyter notebook could use some improvement. Replace the original code with the code shown below. The updated code addresses two issues.     
1. **Model overfitting:** The original code resulted in erratic training and validation loss and accuracy curves, indicating that the model was overfit. The updated code addresses this problem in two ways:
     - The original learning rate appears to be too high. To address this problem, an exponentially decaying learning rate schedule with a lower initial rate is used.
     - The original optimizer specification did not include any weight decay. The updated code includes the `weight_decay` hyperparameter.    
2. **Better charts for learning curves:** The updated code places the training and validation curves on the same plot axis so that the viewer gets a better sense of the trajectory of the model learning. 
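To illustrate the first point, the sketch below computes an exponentially decaying learning rate in plain Python. The formula is the one used by `keras.optimizers.schedules.ExponentialDecay` without the staircase option. The function name `exponential_decay` and all numeric values (initial rate, decay steps, decay rate, weight decay) are assumed for illustration and are not the values from the updated notebook code.

```python
def exponential_decay(step, initial_lr=1e-4, decay_steps=1000, decay_rate=0.9):
    """Learning rate after `step` optimizer updates (no staircase).

    All default values are illustrative assumptions.
    """
    return initial_lr * decay_rate ** (step / decay_steps)


# The rate shrinks by a factor of `decay_rate` every `decay_steps` updates.
for step in (0, 1000, 5000, 10000):
    print(f"step {step:>6}: lr = {exponential_decay(step):.2e}")

# A sketch of the corresponding Keras objects, left as comments so that this
# cell runs without TensorFlow installed (values are again assumptions):
# import keras
# lr_schedule = keras.optimizers.schedules.ExponentialDecay(
#     initial_learning_rate=1e-4, decay_steps=1000, decay_rate=0.9
# )
# optimizer = keras.optimizers.Adam(
#     learning_rate=lr_schedule, weight_decay=1e-4
# )
```

Large steps early in training move the weights quickly, while the smaller steps later on tend to damp the erratic swings in the loss and accuracy curves.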

### 3. Visualizing errors in the learned masks     
One may well wonder about the error between the ground truth segmentation masks and the predicted segmentation masks. The code below does the following:   
1. Displays the ground truth mask, the predicted mask, and the absolute difference between the ground truth and predicted masks as images.
2. Computes and prints an average error between the ground truth mask and the predicted mask. This average error is scaled to a range of $[0,1]$, with 0 being no error and 1 being maximum disagreement.

Following the code cell in the Inference on Train Images section, create a new code cell and paste in the code shown below.
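A minimal sketch of the error computation follows. It assumes that the error is the absolute difference between class indices and that the average is scaled by the largest possible index difference, `num_classes - 1`. The function name `mask_error` and this choice of scaling are assumptions made for illustration.

```python
import numpy as np


def mask_error(true_mask, pred_mask, num_classes=20):
    """Return the per-pixel absolute difference and its mean scaled to [0, 1].

    Hypothetical helper: both masks hold integer class indices in
    the range 0 .. num_classes - 1.
    """
    # Cast to a signed type so that unsigned masks do not wrap around
    # when subtracted.
    true_mask = np.asarray(true_mask, dtype=np.int64).squeeze()
    pred_mask = np.asarray(pred_mask, dtype=np.int64).squeeze()
    abs_diff = np.abs(true_mask - pred_mask)
    # The largest possible index difference is num_classes - 1, so dividing
    # by it maps the mean error onto [0, 1].
    mean_error = float(abs_diff.mean() / (num_classes - 1))
    return abs_diff, mean_error


# Tiny worked example with 2 x 2 masks.
true_mask = np.array([[0, 0], [5, 19]])
pred_mask = np.array([[0, 1], [5, 0]])
abs_diff, mean_error = mask_error(true_mask, pred_mask)
print(abs_diff)
print(f"Average error: {mean_error:.3f}")
```

The returned `abs_diff` array can be displayed with `plt.imshow` next to the two masks. Note that the distance between class indices has no semantic meaning, so the fraction of mismatched pixels, `np.mean(true_mask != pred_mask)`, is an alternative measure on the same $[0,1]$ scale.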

## Exercises   
Now that you have made the required updates to the Jupyter notebook, execute the code. Expect the model training to take about 1 hour. 
Once execution has completed examine the results and answer the questions in the following exercises.   

**Exercise 13-1:** Examine the images and ground truth masks displayed by the code added to the Creating a TensorFlow Dataset section of the notebook and answer these questions:    
1. The semantic segmentation ground truth mask labels are coded as colors in these plots. Compare the mask labels to the body parts of the people shown in the associated image. Do these ground truth labels have semantic meaning in relation to the structure of these humans? Provide an example. 
2. Are the semantic segmentation categories consistent from image to image and why is this important in training a model?   

**Answers:** 
1.        
2.    

> **Note:** Before proceeding to the next exercise you might want to review the first two sections of [**Multi-Scale Context Aggregation by Dilated Convolutions**](https://arxiv.org/pdf/1511.07122v3) by Yu and Koltun, 2016.  

**Exercise 13-2:**     
1. Examine the ground truth masks. In one or a few brief sentences, describe the different scales that you see, giving a few examples.
2. Keeping the different scales in the images in mind, explain how the encoder or backbone in the first cell specifying the DeepLabV3+ encoder deals with scales. What is the range of scales the encoder incorporates?
3. DeepLabV3+ has a fully convolutional architecture. What are the dimensions of the final (output) convolution layer, and how do these dimensions correspond to the required output?

**Answers:**     
1.       
2.      
3.   

**Exercise 13-3:** Examine the fully convolutional architecture of the DeepLabV3+ model. Notice the output spatial dimensions and channel numbers of the DeepLabV3+ model output layer. In one or a few short sentences, explain why these are the appropriate spatial dimensions and channel numbers.

**Answer:**      

**Exercise 13-4:** The Training code block defines the loss function that operates on the output of the convolutional model. Answer the following questions.    
1. Examine the mask examples and consider the large fraction of total pixels that are background. Further consider the fraction of the total pixels that make up the different semantic categories. What do these observations tell you about the class imbalance for this problem? Is this imbalance likely to be common in other segmentation problems, and why?
2. Given your observations about inherent class imbalance, why is the [`keras.losses.SparseCategoricalCrossentropy`](https://keras.io/api/losses/probabilistic_losses/#sparsecategoricalcrossentropy-class) function a good choice for a loss function? 

**Answers:**  
1.    
2.    

**Exercise 13-5:** In the code cell of the Inference using Colormap Overlay section, examine the `infer` function. Keeping in mind the output shape of the fully convolutional (no MLP or softmax) DeepLabV3+ network, in one or a few sentences explain how this function returns the category of each pixel. 

**Answer:**      

**Exercise 13-6:** Examine the images displayed by the code you added to the Inference on Train Images section of the notebook. In each row, left to right, are the ground truth segmentation mask, the predicted mask, and the difference between the ground truth and the prediction. The brighter the color in the right-hand image, the greater the error. Notice also that the average relative error on a $[0,1]$ scale is printed. What do you notice about the size of the high-error regions compared to the size of the overall image, and what does this tell you about possible problems with class imbalance?  

**Answer:**    

#### Copyright, 2025, Stephen F Elston. All rights reserved. 