# Image segmentation for self-driving car

Welcome to Image segmentation for self-driving car project.
In this project I will explore about semantic image segmentation for self-driving car.
In this jupyter notebook I am going to go through the complete process of the project along with code snippets.

## Table of Content
- [1 - Problem statement](#1)
    - [1.1 - The goals of the project](#1-1)
    - [1.2 - The challenges of the project](#1-2)
- [2 - Introduction to the Problem](#2)
    - [2.1 - What is image segmentation?](#2-1)
    - [2.2 - Types of image segmentation](#2-2)
        - [2.2.1 - Image semantic segmentation](#2-2-1)
        - [2.2.2 - Image Instance segmentation](#2-2-2)
        - [2.2.3 - Image Panoptic segmentation](#2-2-3)
    - [2.3 - Image segmentation Methods](#2-3)
        - [2.3.1 - Traditional Methods](#2-3-1)
        - [2.3.2 - Deep learning Methods](#2-3-2)
- [3 - CityScapes dataset](#3)
    - [3.1 - Features](#3-1)
    - [3.2 - Classes](#3-2)
- [4 - Packages](#4)
- [5 - Preprocessing](#5)
    - [5.1 - Load and split our data to train/dev/test datasets](#5-1)  
    - [5.2 - Process Labels](#5-2)
    - [5.3 - Process path](#5-3)
    - [5.4 - Explore train dataset](#5-4)
    - [5.5 - Preprocessing the dataset](#5-5) 
        - [5.5.1 - Why we need to resize our dataset?](#5-5-1)
        - [5.5.2 - Why we need to normalize our dataset?](#5-5-2)
        - [5.5.3 - Code Preprocessing function](#5-5-3)
    - [5.6 - Data augmentation](#5-6)
        - [5.6.1 - What is Data augmentation and why we use this?](#5-6-1)
        - [5.6.2 - Why some data augmentation techniques is not good for self-driving car?](#5-6-2)
        - [5.6.3 - Which data augmentation techniques is still good for self-driving car?](#5-6-3)
        - [5.6.4 - Code Data augmentation](#5-6-4)
    - [5.7 - Divide our train dataset to mini batches and shuffle our train dataset](#5-7)
    - [5.8 - Use prefetch](#5-8)
- [6 - Unet explanation](#6)
    - [6.1 - What is Unet?](#6-1)
    - [6.2 - Unet model detials](#6-2)
        - [6.2.1 - Unet model genral detials and Architecture](#6-2-1)
        - [6.2.2 - Unet Encoder](#6-2-2)
        - [6.2.3 - Unet Decoder](#6-2-3)
        - [6.2.4 - Unet Connecting paths](#6-2-4)
        - [6.2.5 - Unet bottlenack](#6-2-5)
        - [6.2.6 - Initialization of the weights](#6-2-6)
        - [6.2.7 - Regulzation](#6-2-7)    
    - [6.3 - Pros and Cons of Unet](#6-3)
        - [6.3.1 - Pros of Unet](#6-3-1)
        - [6.3.2 - Cons of Unet](#6-3-2)
- [7 - Cost functions and evaluation metrics](#7)
    - [7.1 - Pixel accuracy](#7-1)
    - [7.2 - What is the problem of pixel accuracy?](#7-2)
    - [7.3 - Sparse Categorical Cross entropy](#7-3)
    - [7.4 - Why still Sparse Categorical Cross entropy is not good enough?](#7-4)
    - [7.5 - The solutions for highly unbalanced segmentations and giving importance to certain pixels](#7-5)
        - [7.5.1 - Weighted Sparse Categorical Cross entropy for each class](#7-5-1)
        - [7.5.2 - Weighted Sparse Categorical Cross entropy for each pixel](#7-5-2)
        - [7.5.3 - Dice Coefficient and soft Dice Coefficient](#7-5-3)
            - [7.5.3.1 - What is Precision and Recall?](#7-5-3-1)
            - [7.5.3.2 - What is F1 score?](#7-5-3-2)
            - [7.5.3.3 - Dice Coefficient explanation and formula explanation](#7-5-3-3)
            - [7.5.3.4 - Soft Dice Coefficient](#7-5-3-4)
            - [7.5.3.5 - Why Soft Dice Coefficient is good?](#7-5-3-5)
- [8 - The models that we will build](#8)
- [9 - Let's build Unet encoders and decoders](#9)
    - [9.1 - Let's build Unet encoder](#9-1)
    - [9.2 - Let's build Unet decoder](#9-2)
- [10 - Let's build the genral unet model](#10)
- [11 - Let's implement the cost functions and evaluation metrics that we need for the models](#11)
- [12 - Let's implement plot function for plot history of the model](#12)
- [13 - Let's implement functions for shows model predictions](#13)
- [14 - Let's implement visualization callbacks functions](#14)
- [15 - Let's implement schedulers](#15)
    - [15.1 - Let's implement Dropout scheduler](#15-1)
    - [15.2 - Let's implement Learning Rate scheduler](#15-2) 
- [16 - Regular Unet model](#14)
    - [16.1 - First model version](#16-1)
        - [16.1.1 - Create the model](#16-1-1)
        - [16.1.2 - Train the model and evaluate him on train and dev datasets](#16-1-2)
        - [16.1.3 - Model's predictions on the train and dev datasets](#16-1-3)
    - [16.2 - Second model version](#16-2)
        - [16.2.1 - Create the model](#16-2-1)
        - [16.2.2 - Train the model and evaluate him on train and dev datasets](#16-2-2)
    - [16.3 - Third model version](#16-3)
        - [16.3.1 - Create the model](#16-3-1)
        - [16.3.2 - Train the model and evaluate him on train and dev datasets](#16-3-2)
        - [16.3.3 - Model's predictions on the train and dev datasets](#16-3-3)
    - [16.4 - Fourth model version](#16-4)
        - [16.4.1 - Create the model](#16-4-1)
        - [16.4.2 - Train the model and evaluate him on train and dev datasets](#16-4-2)
    - [16.5 - Fiveth model version](#16-5)
        - [16.5.1 - Create the model](#16-5-1)
        - [16.5.2 - Train the model and evaluate him on train and dev datasets](#16-5-2)
    - [16.6 - Sixth model version](#16-6)
        - [16.6.1 - Create the model](#16-6-1)
        - [16.6.2 - Train the model and evaluate him on train and dev datasets](#16-6-2)
    - [16.7 - Seventh model version](#16-7)
        - [16.7.1 - Create the model](#16-7-1)
        - [16.7.2 - Train the model and evaluate him on train and dev datasets](#16-7-2)
    - [16.8 - Eighth model version](#16-8)
        - [16.8.1 - Create the model](#16-8-1)
        - [16.8.2 - Train the model and evaluate him on train and dev datasets](#16-8-2)
        - [16.8.3 - Model's predictions on the train and dev datasets](#16-8-3)
    - [16.9 - Ninth model version](#16-9)
        - [16.9.1 - Create the model](#16-9-1)
        - [16.9.2 - Train the model and evaluate him on train and dev datasets](#16-9-2)
    - [16.10 - Tenth model version](#16-10)
        - [16.10.1 - Create the model](#16-10-1)
        - [16.10.2 - Train the model and evaluate him on train and dev datasets](#16-10-2)
        - [16.10.3 - Model's predictions on the train and dev datasets](#16-10-3)
    - [16.11 - Eleventh model version](#16-11)
        - [16.11.1 - Create the model](#16-11-1)
        - [16.11.2 - Train the model and evaluate him on train and dev datasets](#16-11-2)
        - [16.11.3 - Model's predictions on the train and dev datasets](#16-11-3)
    - [16.12 - Twelfth model version](#16-12)
        - [16.12.1 - Create the model](#16-12-1)
        - [16.12.2 - Train the model and evaluate him on train and dev datasets](#16-12-2)
        - [16.12.3 - Model's predictions on the train and dev datasets](#16-12-3)
    - [16.13 - Thirteen model version](#16-13)
        - [16.13.1 - Create the model](#16-13-1)
        - [16.13.2 - Train the model and evaluate him on train and dev datasets](#16-13-2)
        - [16.13.3 - Model's predictions on the train and dev datasets](#16-13-3)
- [17 - Final model](#17)
    - [17.1 - Choose the model that we will use](#17-1)
    - [17.2 - Final model's predictions on the test dataset](#17-2)
    - [17.3 - Final model evaluation on the test dataset](#17-3)
- [18 - Summary](#18)
- [19 - Goals for the future](#19)

<a name='1'></a>
## 1 - Problem statement

Our problem is to explore Unet architectures for do semantic segmentation for self-driving car on Cityscapes dataset
and try to get the lowest Computational Cost, lowest storage and the best accuracy.

<a name='1-1'></a>
### 1.1 - The goals of the project
The goals of the project are:

* To explore semantic segmentation for self-driving car.

* To explore how Unet architecture work for self-driving car.

* To get the the lowest Computational Cost, lowest storage and the best accuracy that I can on Cityscapes Dataset when I use Unet architectures.

* To explore different models that will help to achieve the thrid goal
and compare between them.

* To experience build projects of this magnitude(my first project of this magnitude).

* To experience kaggle and their computing power for run the models that we will explore.

<a name='1-2'></a>
### 1.2 - The challenges of the project
The challenges of the project are:

* Our data set is contains large images. This is Makes it difficult for us to do semantic segmentation problem for self-driving car and increases the Computational Cost and storage. This leads us to use complex architectures for get good accuracy and therefore we encounter situation that we need to striking balance the trade off between accuracy of the model and Computational Cost and storage.

* Self-driving car tasks are real-time tasks and therefore we must to solve the problem with the lowest Computational Cost, lowest storage and use efficient segmentation algorithms

* Self-driving car tasks are difficlut, because there are many situations we can encounter while driving, for example: different weather conditions, daytime vs nighttime, different road conditions, blurry noisable and unintelligible images and more.

* Our data set contain 34 classes, so we have a lot of classes.

* Semantic segmentation for self-driving car is difficult. We need to achieve high accuracy in pixel-level segmentation for ensure safe driving. Even small error can lead significant consequences.

* Because we have limited dataset size we have more chance for overfitting.

* Because we have limited dataset size we have more chance for bad generalization and bad adaptation to scenarios and environments that the model don't saw.

And many more challenges that we will encounter during the project.

This is my first project of this magnitude so this is challenging for me and I will have to solve and overcome many challenges.

<a name='2'></a>
## 2 - Introduction to the Problem
In this section I will give some introduction to the Problem.
I will explain briefly about image segmentation.

<a name='2-1'></a>
### 2.1 - What is image segmentation?
Image segmentation is a famous task in computer vision.

Image segmentation is the process of dividing an image into multiple meaningful regions or objects based on their inherent characteristics, such as color, texture, shape, or brightness. 

In this way we can get more information about the image and we can easier to analyze the image.
For example in self-driving car tasks we can get better scene understanding and get desicions accrding this, like stop if we close to car or pedestrian.

<a name='2-2'></a>
### 2.2 - Types of image segmentation
There are 3 types of image segmentation: semantic segmentation, instance segmentation and panoptic segmentation.

In this project I will work on semantic segmentation but now I will cover briefly all the 3 types of image segmentation.

<a name='2-2-1'></a>
#### 2.2.1 - Image semantic segmentation
Image semantic segmentation is a famous task in computer vision of labelling each pixel of the image into a predefined set of classes.
That mean that we ask the following question:

"What objects are in this image and where exactly in the image are those objects located? 
Give me precise mask for each object in the image by labeling each pixel in the image with its corresponding class."

For example:

<div style="text-align:center">
<img src="Images/semantic segmentation.webp" style="width:500px;height:250;">
</div>
<caption><center> <u><b>Figure 1</u></b>: Example of a semantic segmented image <a href="https://medium.com/analytics-vidhya/introduction-to-semantic-image-segmentation-856cda5e5de8">(Source)</a> <br> </center></caption>

<a name='2-2-2'></a>
#### 2.2.2 - Image instance segmentation
Image instance segmentation involves detecting and segmenting each object in an image.

Differently from semantic segmentation that labelling each pixel of the image into a predefined set of classes, this type of segmentation segmenting the objectâ€™s boundaries.
That mean that we ask the following question:

"What objects are in this image and what the boundaries of those objects?"

For example:

<div style="text-align:center">
<img src="Images/Instance segmentation input.png" style="width:500px;height:300px;">
<img src="Images/Instance segmentation output.png" style="width:500px;height:300px;">
</div>
<caption><center> <u><b>Figure 2</u></b>: Example of a instance segmented image<br> </center></caption>


<a name='2-2-3'></a>
#### 2.2.3 - Image panoptic segmentation
Panoptic segmentation generalizes both semantic and instance segmentation.

That mean that we ask the following question:

"What different objects are in this image and where exactly in the image are those objects located?
Give me precise mask for each object in the image by labeling each pixel in the image with its corresponding class and distinguish between objects that are different from the same class."

This type of image segmentation get the most information from the image, because he knows also tell where there are different object. For example in self-driving car taks this is important because we want to know what and where the different objects but also want to distinguish between different cars or pedestrians.

For example:

<div style="text-align:center">
<img src="Images/Panoptic segmentation input.png" style="width:500px;height:300;">
<img src="Images/Panoptic segmentation output.png" style="width:500px;height:300;">
</div>
<caption><center> <u><b>Figure 3</u></b>: Example of a panoptic segmented image<br><a href="https://medium.com/hasty-ai/panoptic-segmentation-explained-ca10597fb357">(Source)</a> </center></caption>


<a name='2-3'></a>
### 2.3 - Image segmentation Methods
In this project I will explore Unet architectures for image semantic segmentation.

Now I will cover briefly Image segmentation methods.

<a name='2-3-1'></a>
#### 2.3.1 - Traditional Methods
Traditional methods for image segmentation are usually computationally efficient and relatively simple to implement(for example relatively to deep learning methods). 

Some examples of traditional methods for image segmentation are:
* Thresholding like: Global thresholding and Adaptive thresholding
* Edge-based Segmentation like: Canny edge detection and Sobel edge detection
* Clustering like: K-means clustering

Traditional image segmentation methods are more suitable for simpler image segmentation and have limited accuracy on complex scenes. 

Therefore when deep learning field start raise, pepole started use deep learning methods for image segmentation tasks, because they are have high accuracy(relatively to traditional methods) on complex scenes and suitable not only for simpler image segmentation, but also for complex image segmentation.

<a name='2-3-2'></a>
#### 2.3.2 - Deep learning Methods
Deap learning methods for image segmentation are less computationally efficient than Traditional methods for image segmentation and relatively complex to implement. 

As I said in section 2.3.1 deap learning methods for image segmentation are have high accuracy(relatively to traditional methods) on complex scenes and suitable not only for simpler image segmentation, but also for complex image segmentation.

Some examples of deap learning methods for image segmentation are:
* U-Net, in this project I will explore Unet architectures for image semantic segmentation.
* SegNet
* DeepLab
* Panoptic FPN

<a name='3'></a>
## 3 - CityScapes dataset
In this project I will use the CityScapes dataset. 

In this section I will give overiew of CityScapes dataset.

<a name='3-1'></a>
### 3.1 - Features
The features of CityScapes dataset that I use according the <a href="https://www.cityscapes-dataset.com/dataset-overview/#features">CityScapes site</a>, are(Some of the things on the site are not updated, so I changed and wrote here what is correct):

* Complexity
    * 34 classes
    * See Class Definitions in section 3.2

* Diversity
    * 50 cities
    * Several months (spring, summer, fall)
    * Daytime
    * Good/medium weather conditions
    * Manually selected frames
        * Large number of dynamic objects
        * Varying scene layout
        * Varying background

* Volume
    * 3475 annotated images with fine annotations

* Images dimensions
    * 2048x1024 pixels 

* Division into train/dev/test sets
    * Train set contains 2975 images, that mean the size of the train set is 59.5% of the dataset.
    * Dev(or validation set) set contains 500 images, that mean the size of the dev set is 10% of the dataset. 
    * Test set contains 1,525 images, that mean the size of the dev set is 30.5% of the dataset.

    In summary, the dataset division into train/dev/test sets is 59.5%/10%/30.5%.
    The problem that the test dataset not has labels, and therefore we need to divide the dataset into train/dev/test sets manually.
    
    We will divide the dataset into train/dev/test sets in this way:

    * Train set contains 2085 images, that mean the size of the train set is 60% of the dataset. 
    * Dev set contains 521 images, that mean the size of the train set is approx 15% of the dataset. 
    * Test set contains 869 images, that mean the size of the train set is approx 25% of the dataset. 

    In total this is a logical division into train/dev/test, according our dataset size.
    
<div style="text-align:center">
<img src="Images/cityscapesCities.jpg" style="width:500px;height:400px;">
</div>
<caption><center> <u><b>Figure 4</u></b>: Contained cities in CityScapes dataset<br><a href="https://www.cityscapes-dataset.com/dataset-overview/#features">(Source)</a> </center></caption>

<a name='3-2'></a>
### 3.2 - Classes
The Classes of CityScapes dataset according the CityScapes site</a>, are:
<div style="text-align:center">
<img src="Images/cityscapesClasses.png" style="width:700px;height:400px;">
</div>
<caption><center> <u><b>Figure 5</u></b>: Contained classes in CityScapes dataset<br><a href="https://www.cityscapes-dataset.com/dataset-overview/#features">(Source)</a> </center></caption>

The classes of CityScapes dataset for semantic segmentation are described in the picture without sign and with * sign.

In summary for semantic segmentation task we have 8 groups and 19 classes.  

<a name='4'></a>
## 4 - Packages
After all the introduction to the project, let's start work on project and import all the libraries that we need.

In [None]:
import tensorflow as tf
import numpy as np
import os
import cv2 as cv
import matplotlib.pyplot as plt
from typing import Tuple
from tqdm import tqdm
import seaborn as sns
import copy

from tensorflow.keras.layers import Input
from tensorflow.keras.layers import Conv2D
from tensorflow.keras.layers import MaxPooling2D
from tensorflow.keras.layers import Dropout 
from tensorflow.keras.layers import Conv2DTranspose
from tensorflow.keras.layers import concatenate
from tensorflow.keras.callbacks import LearningRateScheduler


<a name='5'></a>
## 5 - Preprocessing
In this section I will preprocess the data set. This is an important and essential part before I start building a model to solve my problem.

<a name='5-1'></a>
### 5.1 - Load and split our data to train/dev/test datasets

Firstly I will get the paths of all the images and all the masks in the train/dev/test datasets.

In [None]:
train_images_top_directory = './data/leftImg8bit_trainvaltest/leftImg8bit/train/'
train_ground_truth_images_top_directory = './data/gtFine_trainvaltest/gtFine/train/'
dev_images_top_directory = './data/leftImg8bit_trainvaltest/leftImg8bit/val/'
dev_ground_truth_images_top_directory = './data/gtFine_trainvaltest/gtFine/val/'

TOTAL_FILES = 3475
TOTAL_TRAIN_FILES = 2085
TOTAL_DEV_FILES = 521
TOTAL_TEST_FILES = 869

def read_files_from_directory(directory, extensions):
    files = []
    for root, dirs, dir_files in os.walk(directory):
        for file in dir_files:
            if any(file.endswith(ext) for ext in extensions):
                files.append(os.path.join(root, file))
                pbar.update(1)
    return sorted(files)  # Sort the collected files

images_extensions = ('.jpg', '.png')
ground_truth_extensions = ('labelIds.jpg', 'labelIds.png')

with tqdm(total=TOTAL_FILES, desc="Processing images files") as pbar:
    images_path_first_part = read_files_from_directory(train_images_top_directory, images_extensions)
    images_path_second_part = read_files_from_directory(dev_images_top_directory, images_extensions)
    images_path = images_path_first_part + images_path_second_part

with tqdm(total=TOTAL_FILES, desc="Processing ground truth images files") as pbar:
    ground_truth_images_path_first_part = read_files_from_directory(train_ground_truth_images_top_directory, ground_truth_extensions)
    ground_truth__images_path_second_part = read_files_from_directory(dev_ground_truth_images_top_directory, ground_truth_extensions)
    ground_truth_images_path = ground_truth_images_path_first_part + ground_truth__images_path_second_part

train_images_path = images_path[:TOTAL_TRAIN_FILES]
train_ground_truth_images_path = ground_truth_images_path[:TOTAL_TRAIN_FILES]
dev_images_path = images_path[TOTAL_TRAIN_FILES:TOTAL_TRAIN_FILES+TOTAL_DEV_FILES]
dev_ground_truth_path = ground_truth_images_path[TOTAL_TRAIN_FILES:TOTAL_TRAIN_FILES+TOTAL_DEV_FILES]
test_images_path = images_path[TOTAL_TRAIN_FILES+TOTAL_DEV_FILES:TOTAL_TRAIN_FILES+TOTAL_DEV_FILES+TOTAL_TEST_FILES]
test_ground_truth_path = ground_truth_images_path[TOTAL_TRAIN_FILES+TOTAL_DEV_FILES:TOTAL_TRAIN_FILES+TOTAL_DEV_FILES+TOTAL_TEST_FILES]

Now We will use tensorflow in order to do load our datasets efficiently.

Firstly we will use tf.data.Dataset.from_tensor_slices method for create datasets from our lists of the images paths.

In [None]:
train_images_path = tf.constant(train_images_path)
train_ground_truth_images_path = tf.constant(train_ground_truth_images_path)

dev_images_path = tf.constant(dev_images_path)
dev_ground_truth_path = tf.constant(dev_ground_truth_path)

test_images_path = tf.constant(test_images_path)
test_ground_truth_path = tf.constant(test_ground_truth_path)

train_image_dataset_before_path_processing = tf.data.Dataset.from_tensor_slices((train_images_path, train_ground_truth_images_path))
dev_image_dataset_before_path_processing = tf.data.Dataset.from_tensor_slices((dev_images_path, dev_ground_truth_path))
test_image_dataset_before_path_processing = tf.data.Dataset.from_tensor_slices((test_images_path, test_ground_truth_path))

<a name='5-2'></a>
### 5.2 - Process Labels
In this section I will process the labels in the dataset.

For each image in the datasets we have corresponding mask image that represents segmantation map of our image. For each object we have unique color, that represents object's label.
How we will know what the pixel's label, what the label represents, to which catagory of labels he belongs, which labels we will need to take in considaration when we evalute our models?

The solution for these questions is to create list of Labels, when each label will contain her features and with this solution we will can for example to convert color to the corresponding label id in order to encode our mask to segmatation map that contatins the labels ids.

Let's start create label class:

In [None]:
class Label:
    def __init__(self, name: str, id: int, catagory: str, catagory_id: int, color: Tuple[int, int, int]):
        self.__name = name
        self.__id = id
        self.__catagory = catagory
        self.__catagory_id = catagory_id
        self.__color = color

    def get_name(self) -> str:
        return self.__name
    
    def get_id(self) -> int:
        return self.__id
    
    def get_catagory(self) -> str:
        return self.__catagory
    
    def get_catagory_id(self) -> int:
        return self.__catagory_id
    
    def get_color(self) -> Tuple[int, int, int]:
        return self.__color

Now I will initialize the list of the Labels in our dataset, inspired by this <a href="https://github.com/mcordts/cityscapesScripts/blob/master/cityscapesscripts/helpers/labels.py">page</a> and according section
3.2:

In [None]:
labels = [
    #       name                     id      category       catagory Id       color
    Label(  'unlabeled'            ,  0 ,    'void'            , 0       , (  0,  0,  0) ),
    Label(  'ego vehicle'          ,  1 ,    'void'            , 0       , (  0,  0,  0) ),
    Label(  'rectification border' ,  2 ,    'void'            , 0       , (  0,  0,  0) ),
    Label(  'out of roi'           ,  3 ,    'void'            , 0       , (  0,  0,  0) ),
    Label(  'static'               ,  4 ,    'void'            , 0       , (  0,  0,  0) ),
    Label(  'dynamic'              ,  5 ,    'void'            , 0       , (111, 74,  0) ),
    Label(  'ground'               ,  6 ,    'void'            , 0       , ( 81,  0, 81) ),
    Label(  'road'                 ,  7 ,    'flat'            , 1       , (128, 64,128) ),
    Label(  'sidewalk'             ,  8 ,    'flat'            , 1       , (244, 35,232) ),
    Label(  'parking'              ,  9 ,    'flat'            , 1       , (250,170,160) ),
    Label(  'rail track'           , 10 ,    'flat'            , 1       , (230,150,140) ),
    Label(  'building'             , 11 ,    'construction'    , 2       , ( 70, 70, 70) ),
    Label(  'wall'                 , 12 ,    'construction'    , 2       , (102,102,156) ),
    Label(  'fence'                , 13 ,    'construction'    , 2       , (190,153,153) ),
    Label(  'guard rail'           , 14 ,    'construction'    , 2       , (180,165,180) ),
    Label(  'bridge'               , 15 ,    'construction'    , 2       , (150,100,100) ),
    Label(  'tunnel'               , 16 ,    'construction'    , 2       , (150,120, 90) ),
    Label(  'pole'                 , 17 ,    'object'          , 3       , (153,153,153) ),
    Label(  'polegroup'            , 18 ,    'object'          , 3       , (153,153,153) ),
    Label(  'traffic light'        , 19 ,    'object'          , 3       , (250,170, 30) ),
    Label(  'traffic sign'         , 20 ,    'object'          , 3       , (220,220,  0) ),
    Label(  'vegetation'           , 21 ,    'nature'          , 4       , (107,142, 35) ),
    Label(  'terrain'              , 22 ,    'nature'          , 4       , (152,251,152) ),
    Label(  'sky'                  , 23 ,    'sky'             , 5       , ( 70,130,180) ),
    Label(  'person'               , 24 ,    'human'           , 6       , (220, 20, 60) ),
    Label(  'rider'                , 25 ,    'human'           , 6       , (255,  0,  0) ),
    Label(  'car'                  , 26 ,    'vehicle'         , 7       , (  0,  0,142) ),
    Label(  'truck'                , 27 ,    'vehicle'         , 7       , (  0,  0, 70) ),
    Label(  'bus'                  , 28 ,    'vehicle'         , 7       , (  0, 60,100) ),
    Label(  'caravan'              , 29 ,    'vehicle'         , 7       , (  0,  0, 90) ),
    Label(  'trailer'              , 30 ,    'vehicle'         , 7       , (  0,  0,110) ),
    Label(  'train'                , 31 ,    'vehicle'         , 7       , (  0, 80,100) ),
    Label(  'motorcycle'           , 32 ,    'vehicle'         , 7       , (  0,  0,230) ),
    Label(  'bicycle'              , 33 ,    'vehicle'         , 7       , (119, 11, 32) ),
    Label(  'license plate'        , -1 ,    'vehicle'         , 7       , (  0,  0,142) ),
]


One important note is that I can include only some classes in the evaluation matric and some classes not include, but in this project I want to include all of them, and challange myself to build a model that give good performance to all the classes.
In industery applications this is will be more smart to include sometimes not all the classes, but only the classes that are the most important to us. In addition I include the columns catagory and catagory id in case if we will want to evalutate our model on spesific catagory.

<a name='5-3'></a>
### 5.3 - Process path
In this section I will process the format and the type of the datasets.
I will decode each img and convert her to floast32 type.

This is essential process because in order our model will work good and will not happen bugs, we need that all the images in our datasets will be in the same format and with the same type.

In addition, I will resize each mask image, because now our mask's shape is (2048, 1024, 3), but for each pixel all his three chanels have the same value, because this is labal ids mask and not rgb mask for visualizetion.
Thus we will resize each mask image to shape (2048, 1024), that mean now each value of pixel will be the label id of the pixel, according section 5.2 .
 
We will convert all the images in our datasets to png format and to float32 type.

In [None]:
def process_path(image_path, ground_truth_path):
    img = tf.io.read_file(image_path)  
    img = tf.image.decode_png(img, channels=3)
    img = tf.image.convert_image_dtype(img, tf.float32) 

    ground_truth_img = tf.io.read_file(ground_truth_path)  
    ground_truth_img = tf.image.decode_png(ground_truth_img, channels=1)

    return img, ground_truth_img

In [None]:
train_image_dataset_before_pre_processing = train_image_dataset_before_path_processing.map(process_path)
dev_image_dataset_before_pre_processing = dev_image_dataset_before_path_processing.map(process_path)
test_image_dataset_before_pre_processing = test_image_dataset_before_path_processing.map(process_path)

<a name='5-4'></a>
### 5.4 - Explore dataset
In this section I will explore our dataset, in order I will get better familiar with the dataset.

Firstly, let's see the sizes of our datasets:

In [None]:
train_image_dataset_m = len(train_image_dataset_before_path_processing)
dev_image_dataset_m = len(dev_image_dataset_before_path_processing)
test_image_dataset_m = len(test_image_dataset_before_path_processing)
NUM_CLASSES = 34
print("Train dataset size is: " + str(train_image_dataset_m))
print("Dev dataset size is: " + str(dev_image_dataset_m))
print("Test dataset size is: " + str(test_image_dataset_m))
print("Number of classes is: " + str(NUM_CLASSES))

Before we start plot some examples of images and their corresponding masks from the train dataset,
we need to create a function that can convert a mask to a visualization mask.

What this is mean?
So, our masks are label ids mask, that mean each pixel in the mask have only one value, that represents his label id, according section 5.2.
Now in order visulaize our mask we need to convert each label id to his corresponding color, according section 5.2.

This function is importent, because visualization mask give us better understanding of the segmantation,
and for pepole that does not knows what is label ids(like pepole that not understanding DL) they are can better understanding the segmantation with visualization mask.

In summary, visualization mask is more beautiful :)

In [None]:
def convert_mask_to_visualization_mask(ground_truth_img, labels):
    if ground_truth_img.ndim == 3:
        height, width, _ = ground_truth_img.shape
        new_ground_truth_img = tf.math.reduce_max(ground_truth_img, axis=-1, keepdims=False)
    else:
        height, width = ground_truth_img.shape
        new_ground_truth_img = ground_truth_img
    visualize_ground_truth_img = np.zeros((height, width, 3), dtype=int)
    for label in labels:
        visualize_ground_truth_img[new_ground_truth_img[:,:] == label.get_id(), :] = label.get_color()
    return visualize_ground_truth_img

Now let's plot some exampels of images and their corresponding visualization masks from the train dataset:

In [None]:
def plot_example(img, ground_truth_img):
    fig, arr = plt.subplots(1, 2, figsize=(14, 10))
    arr[0].imshow(img)
    arr[0].set_title('Image')
    arr[1].imshow(ground_truth_img)
    arr[1].set_title('Segmentation')


AMOUNT_OF_EXAMPELS_TO_PLOT = 3 
for img, ground_truth_img in train_image_dataset_before_pre_processing:
    if AMOUNT_OF_EXAMPELS_TO_PLOT:
        visualize_ground_truth_img = convert_mask_to_visualization_mask(ground_truth_img, labels)
        plot_example(img, visualize_ground_truth_img)
        AMOUNT_OF_EXAMPELS_TO_PLOT -= 1
    else:
        break

<a name='5-5'></a>
### 5.5 - Preprocessing the dataset
In this section I will cover the part of Preprocessing the dataset that include:
* Resize our dataset
* Normalize our dataset

<a name='5-5-1'></a>
#### 5.5.1 - Why we need to resize our dataset?
Our dataset contains images with dimensions of 2048x1024 pixels, that mean we are handle with large images.
Most of the models architectures in DL are not suitable for images of this size.

However we can change these models architectures in order they will be suitable for images of this size,
but the result of this operation is to create very big and deep architectures what lead to higher computational cost and higher memory storage.
Also, usually big and deep architectures tends to be with higher chance for overfitting, because they are more complex.
There are other problems with large and deep architectures such as vanishing and exploding gradients (because they are deep).

Now let's recall our problem. Our problem is to do semantic segmentation for self-driving car, so we need to build model for Real-Time problem.
Thus, we need to create model with the lowest Computational Cost an the lowest storage as we can.
And of course, we not want to create model with higher chance for overfitting, because we need that our model has a succseful adaptive for new images and has high accuarcy on the dev dataset and test dataset.

Thus, we can not use large images because this leed to problem, that can leed our model to slower performence in Real-Time and bad adaptive for new images, what can lead to for accidents(let's recal our problem is for self-driving car, so we need to create fastest and safe(make as few mistakes as possible) model).

I have not large memory storage and large Computational Resoreces, and therefore I will resize the images to $ 128 \times 128 $ pixels.

<a name='5-5-2'></a>
#### 5.5.2 - Why we need to normalize our dataset?
Our dataset is not normalized in the beginning and we need to normalize him,
that mean to do opeartion that will lead that all our pixels in the images on the dataset will be in the same scale and between 0 and 1 inclusive.

But why we need to do this? 
1. **Sigmoid and tanh activations:** 

    Sigmoid and tanh activations are sometimes used in DL(usually sigmoid used in the output layer) and they are look like this:
    <div style="text-align:center">
    <img src="Images/Sigmoid.webp" style="width:500px;height:250px;">
    <img src="Images/tanh.webp" style="width:500px;height:250px;">
    </div>
    <caption><center> <u><b>Figure 6</u></b>: Sigmoid and tanh functions <a href="https://medium.com/@toprak.mhmt/activation-functions-for-deep-learning-13d8b9b20e">(Source)</a> <br> </center></caption>

    We can see that when z is become larger and larger number or become smaller and smaller number the slope of the Sigmoid and tanh functions gets closer and closer to zero, that mean:
    
    When f(z) = sigmoid(z) and g(z) = tanh(z) then:
     $$\lim_{{z \to +-\infty}} \frac{df(z)}{dz} = 0$$ 
     $$\lim_{{z \to +-\infty}} \frac{dg(z)}{dz} = 0$$

    So if our pixels in the images on the dataset will be between 0 and 255 inclusive so there is a bigger chance that z will be larger(in different layers) and thus the slope of the Sigmoid and tanh functions gets around zero, which will lead that the gradients will be around zero and thus the network will be more slowly learn the optimal parameters for the network.

    We want that the network will learn the optimal parameters for the network, as fast as possible, and thus we need that all our pixels in the images on the dataset will be between 0 and 1 inclusive(in this way the slope of the sigmoid and tanh functions probably will not gets around zero, at least in the beginning of the learning). 

2. **Lower cost function and gradients**

    If our pixels in the images on the dataset will not be between 0 and 1 inclusive, but will be between 0 and 255 inclusive, so if we using Relu(what very commonly used in DL) also the activations values will be huge and thus after the forward pass we will end up with hugh loss value and hugh gradients values.

    Hugh loss value will lead the network be more slowly learn the optimal parameters for the network.
    
    Hugh gradients values can lead that the cost function not decrease after each iteration, because in the update parameters part we will change each parameter with relativly big value. 
    We can try prevent this with set very small learning rate, but this will cause to the network be more slowly learn the optimal parameters for the network, and this is not good.

    The solution of this problem is to do that all our pixels in the images on the dataset will be between 0 and 1 inclusive.

    <div style="text-align:center">
    <img src="Images/Relu.webp" style="width:500px;height:250px;">
    </div>
    <caption><center> <u><b>Figure 7</u></b>: Relu function <a href="https://medium.com/@toprak.mhmt/activation-functions-for-deep-learning-13d8b9b20e">(Source)</a> <br> </center></caption>

3. **Avoid weird mathematical artifacts with floating-point number precision**

    If our pixels in the images on the dataset will not be between 0 and 1 inclusive, but will be between 0 and 255 inclusive, so if we using Relu(what very commonly used in DL) also the activations values will be huge, and this can lead us to handle mathematical operations in our network on really large or really small numbers.

    When we handle mathematical operations on really large or really small numbers we can loss information and lose accuracy, because computers lose accuracy when performing math operations on really large or really small numbers.

    The solution of this problem is to do that all our pixels in the images on the dataset will be between 0 and 1 inclusive(now we will work on numbers that are not really large and not really small numbers).

4. **Fastest learning**

    If our pixels in the images on the dataset will not be in the same scale so the parameters that associated with each pixel will be in different scale.

    Assume that we have 2 pixels and they are have 2 parameters that associated with them. They(the parameters) are will be in different scale and the cost function will be like in this image:

    <div style="text-align:center">
    <img src="Images/cost_func_with_diff_scale.png" style="width:500px;height:250;">
    </div>
    <caption><center> <u><b>Figure 8</u></b>: Cost function with parameters that have different scale <a href="https://www.coursera.org/specializations/deep-learning">From andraw ng course</a> <br> </center></caption>


    Let's recall that we want to minimize the cost function, but with this cost function the minimize process will be more slower, because each parmeter has different scale, and thus if we want our cost function to converge we need set our learning rate to small number.

    Setting our learning rate to small number will cause to the network be more slowly learn the optimal parameters for the network, and this is not good. 

    Thus, we need that our pixels in the images on the dataset will be in the same scale, and in this situation our cost function will probably looks like this(with two parameters):

    <div style="text-align:center">
    <img src="Images/cost_func_with_same_scale.png" style="width:500px;height:250;">
    </div>
    <caption><center> <u><b>Figure 9</u></b>: Cost function with parameters that have same scale <a href="https://www.coursera.org/specializations/deep-learning">From andraw ng course</a> <br> </center></caption>

    With this cost function the minimize process will be more faster.

<a name='5-5-3'></a>
#### 5.5.3 - Code Preprocessing function

There are two possible methods for normalizing the input.  

**1.** We will divide each pixel by 255, and the result of this is that all our pixels in the images on the dataset will be in the between 0 and 1 inclusive.

**2.** 
Compute the mean across the entire dataset:

$Mean = \frac{1}{m} \sum_{i=1}^{m} X_i$ , when $X_i$ is the i img in the dataset. 

Compute the variance across the entire dataset:

$Var = \frac{1}{m} \sum_{i=1}^{m} (X_i - Mean)^2$

And compute:

$X_i = \frac{X_i - Mean}{\sqrt{Var}}$, when $X_i$ is the i img in the dataset. 

In this project we will experiment the two methods. One important thing that it is worth noting is that the second method for normalizing the input can cause to negative values of pixels, but we still can use her, because we enter this to our model, that can handle with this.

Firstly we will resize our images in the datasets.

We will resize our images to $ 128 \times 128 $ pixels.

In [None]:
INPUT_SHAPE = (128, 128, 3)
TARGET_SIZE_TO_RESIZE = (128, 128)

def resize_dataset(img, ground_truth_img):
    resized_img = tf.image.resize(img, TARGET_SIZE_TO_RESIZE, method=tf.image.ResizeMethod.LANCZOS3)
    resized_ground_truth_img = tf.image.resize(ground_truth_img, TARGET_SIZE_TO_RESIZE, method='nearest')
    return resized_img, resized_ground_truth_img


train_image_dataset_before_normalize = train_image_dataset_before_pre_processing.map(resize_dataset)
dev_image_dataset_before_normalize = dev_image_dataset_before_pre_processing.map(resize_dataset)
test_image_dataset_before_normalize = test_image_dataset_before_pre_processing.map(resize_dataset)

Let's compute the mean and the variance of the datasets:

In [None]:
def compute_example_variance(example, dataset_size):
    img = example[0]
    variance = tf.math.reduce_variance(img)
    return variance / dataset_size


def compute_example_mean(example, dataset_size):
    img = example[0]
    return tf.cast(img, tf.float32) / dataset_size

train_dataset_mean = train_image_dataset_before_normalize.reduce(
    tf.zeros(INPUT_SHAPE, dtype=tf.float32), lambda x, example: x + compute_example_mean(example, train_image_dataset_m))
dev_dataset_mean = dev_image_dataset_before_normalize.reduce(
    tf.zeros(INPUT_SHAPE, dtype=tf.float32), lambda x, example: x + compute_example_mean(example, dev_image_dataset_m))
test_dataset_mean = test_image_dataset_before_normalize.reduce(
    tf.zeros(INPUT_SHAPE, dtype=tf.float32), lambda x, example: x + compute_example_mean(example, test_image_dataset_m))

train_dataset_variance = train_image_dataset_before_normalize.reduce(
    0.0, lambda x, example: x + compute_example_variance(example, train_image_dataset_m))
dev_dataset_variance = dev_image_dataset_before_normalize.reduce(
    0.0, lambda x, example: x + compute_example_variance(example, dev_image_dataset_m))
test_dataset_variance = test_image_dataset_before_normalize.reduce(
    0.0, lambda x, example: x + compute_example_variance(example, test_image_dataset_m))

In [None]:
def normalize_dataset_first_version(img, ground_truth_img):
    normalized_img = img / 255
    return normalized_img, ground_truth_img


def normalize_dataset_second_version(img, ground_truth_img, dataset_mean, dataset_variance):
    normalized_img = img - dataset_mean
    normalized_img /= tf.sqrt(dataset_variance)
    return normalized_img, ground_truth_img


train_dataset_before_normalize_first_version = copy.deepcopy(train_image_dataset_before_normalize)
dev_dataset_before_normalize_first_version = copy.deepcopy(dev_image_dataset_before_normalize)
test_dataset_before_normalize_first_version = copy.deepcopy(test_image_dataset_before_normalize)

train_dataset = train_dataset_before_normalize_first_version.map(normalize_dataset_first_version)
dev_dataset = dev_dataset_before_normalize_first_version.map(normalize_dataset_first_version)
test_dataset = test_dataset_before_normalize_first_version.map(normalize_dataset_first_version)

train_dataset_second_version = train_image_dataset_before_normalize.map(
    lambda img, ground_truth_img: normalize_dataset_second_version(img, ground_truth_img, train_dataset_mean, train_dataset_variance))
dev_dataset_second_version = dev_image_dataset_before_normalize.map(
    lambda img, ground_truth_img: normalize_dataset_second_version(img, ground_truth_img, dev_dataset_mean, dev_dataset_variance))
test_dataset_second_version = test_image_dataset_before_normalize.map(
    lambda img, ground_truth_img: normalize_dataset_second_version(img, ground_truth_img, test_dataset_mean, test_dataset_variance))

One important thing that it is worth noting is that copy.deepcopy not alwayes work in all the computers, so we can use other alternative method that will copy our dataset that if we will do changes on this copy this not affect on the original dataset.

<a name='5-6'></a>
### 5.6 - Data augmentation
In this section I will cover the part of data augmentation.

<a name='5-6-1'></a>
#### 5.6.1 - What is Data augmentation and why we use this?
Overfitting is a common problem for deep neural networks.
Neural networks are often very big and deep relative to our dataset,
what means that neural networks often have more parameters than we really need for our size of dataset.

As result of this, neural networks often over fitting our dataset, that is they can memorize some unimportent features of our dataset, instead of learn some genral and useful information about our dataset and genralize this information to other examples that she does not saw.
Thus, when we give to our neural network new, real-world data that he never saw, he fail to yield useful results.

There are some techniques to address overfitting like dropout, regularization, early stopping, get bigger dataset and data augmentation.

In this project we will discuss about data augmentation.

Data augmentation is a technic to address overfitting, and what she offer is to "augment" our training dataset.

What this mean? "Augment" our training dataset mean to take our training dataset and on each iteration of our model do for each example an operation like: randomly flipping horizontally, shifting their hues, cropping random sections and more.
In this way we increasing the amount of information we have, and on each iteration our model will randomly change the current example with one or more types of data augmentation.

The result of this is our model need to be more genral because now he have more data information.
For example if I train a network to recognize cat and I have example with cat facing right,
if I will do on her flipping horizontally, so the model will learn that cat is cat, regardless of orientation.
In this way our model must to learn genral and useful information about our dataset, what will genralize this information to other examples that he does not saw.

Of course, that to add more new data is better way to cause our model to generalize itself, but in the most of cases add more new data is expensive, so data augmentation is good solution.

In addition, model genralzation is very important in real-world problems, because many datasets contain images from many sources, taken from different cameras in various conditions. Thus networks need to generalize over many factors to perform well. 
Some factors for example are lighting, scale, camera conditions and more.
With data augmentation we can cause our model to genralize over all of these factors.
For example if I train a network to recognize cat and I have example with cat in image with a lot of light with data augmentation we can cause our model to understand that lighting conditions not detrmine whether it is a cat in the picture or not.Thus cat in image that more darker, is also cat.

<div style="text-align:center">
    <img src="Images/data_augmentation_exampels.png" style="width:500px;height:250;">
</div>
<caption><center> <u><b>Figure 10</u></b>: Data augmentation exampels <a href="https://medium.com/@tagxdata/data-augmentation-for-computer-vision-9c9ed474291e">(Source)</a> <br> </center></caption>

<a name='5-6-2'></a>
#### 5.6.2 - Why some data augmentation techniques is not good for self-driving car?
In self-driving car we have consistent data, because cars generally have consistent pose with respect to other vehicles and road objects. For example car alwayes will be in the right side of the road and the camera that took the pictures in the data set that we have, always in the same position, orientation and zoom.
Thus all our data always will be from the same system that have consistent camera and features.

This is because in self-driving car we collect our data with the same sensor system as will be used in production, and therefore we alwayes will have the some properties that will be the same in every image.

Thus we not need that our neural network genralize these properties.
For example we don't need that our neural network genralize to fliped, croped and rotationed images,
beacuse always our camera will be in the same position, orientation and zoom.
Therefore our neural network will never get images that fliped, croped or rotationed.

**And now we come to out conclusion:**

If we still do data augmentation like flip, crop or rotation this is will hurt the performance of our neural network because we take some of the resources of our neural network and assign them to genralize our neural network for fliped, croped and rotationed images.
But the problem is that our neural network will never get images that fliped, croped or rotationed,
and therefore we wasted a lot of resources for something that will never happend.
The result of this is that our neural network will have more worse performance(because she not use all her resources), and when we will do not use data augmentation like flip, crop or rotation, the neural network will have more good performance.

This is beutiful problem because, always in the internet we are told us ,that data augmentation only can improve and help to our neural network get better preformance on real-world data and production, but this is not right in all situations.

In genral, in DL field there are many tips and rules of thumb about things, like that overfitting is bad, how to chosse some hyperparaters(some rules of thumb for default recommended values), but one important thing
that we always should to remember is to these tips and rules of thumb does not right to all situations, and do not rely on them.

In summary I think this section is very interesting, and I was very surprised to discover that sometimes overfitting is very good thing, and try to address overfitting only leads to worse performance.

<a name='5-6-3'></a>
#### 5.6.3 - Which data augmentation techniques is still good for self-driving car?
There are some data augmentation techniques that are still good for self-driving car.
For example hue jitter augmentation technique is good for self-driving car because she has not affect camera properties(that are consistent). This is help to our nerual network to genralize the color of objects in the image, like car. This result of this, is that our nerual network understand that a red car and a blue car should both be detected the same, and the color is not important.

There is more data augmentation techniques that are still good for self-driving car, like random contrast, random brightness, random saturation and more, for reasons similar to those we described earlier.

<a name='5-6-4'></a>
#### 5.6.4 - Code data augmentation

Let's code our data augmenters:

In [30]:
class DataAugmentationLayer(tf.keras.layers.Layer):
    def __init__(self, augmentation_func, data_augmentation_prob, **kwargs):
        super(DataAugmentationLayer, self).__init__(**kwargs)
        self.__augmentation_func = augmentation_func
        self.__data_augmentation_prob = data_augmentation_prob

    @tf.autograph.experimental.do_not_convert
    def call(self, input_data, training):
        apply_augmentation = np.random.uniform() < self.__data_augmentation_prob
        if apply_augmentation and training:
            augmented_data = self.__augmentation_func(input_data)
        else:
            augmented_data = input_data 
        return augmented_data

def basic_data_augmentation(input_data):
    augmented_data = tf.image.random_hue(input_data, max_delta=0.1)
    return augmented_data

def advance_data_augmentation(input_data):
    augmented_data = tf.image.random_hue(input_data, max_delta=0.1)
    augmented_data = tf.image.random_brightness(augmented_data)
    augmented_data = tf.image.random_contrast(augmented_data)
    return augmented_data

def bad_data_augmentation(input_data):
    augmented_data = tf.image.random_flip_left_right(input_data)
    augmented_data = tf.image.random_flip_up_down(augmented_data)
    return augmented_data

<a name='5-7'></a>
### 5.7 - Divide our train dataset to mini batches and shuffle our train dataset

<span style="border-bottom: 2px solid blue">Divide our train dataset to mini batches:</span>

Beacuse our training dataset is big(2085 images), we need to divide him to mini batches.
If we will not do this, so for each iteration in gradient deceant we will pass over 2085 images, that mean update our parmeters after every time we go through 2085 images, that is do forward propogation and backward propogation of 2085 images. This is storage costs a lot, and this slow down our learning process because we update our parmeters after every time we go through 2085 images, that is large amount of images.
Thus for example after 100 iterations we will pass over $ 2085 \times 100 = 208,500$ images, but do only 100 updates to our parameters. Probably 100 updates to our parameters is not enogth to get a good result, but we still wasted expensive resources in terms of calculation, which is not good for us.

In conclusion, we are wasted expensive resources in terms of calculation for 100 iterations, but not get good result.

The solution of this is divide our training dataset to mini batches, that mean we divide our training dataset to mini groups of data, and pass over each group individually and update the parameters, and like this until we finish going through all the mini groups.

In this way we update the parameters more times after each pass over all the training dataset,
and this is more effinantly and speed our neural networks. In addition pass over mini group is more effienct in terms of storage than pass over the entire dataset. 

I have not large memory storage and large Computational Resources, and therefore I will choose mini batch size of 32, and during the project I will change this if I have to.

The pros of mini batch size that is 32 are that our learning procees more faster, because we do a lot of updates to our parameters after we pass all over our training dataset, and that we use less memory(this is good when you are have limited hardware or larger models).

The cons of mini batch size that is 32 are that in the learning process may be noisier gradients, and this can lead us to slower convergence and require do more training iterations.

<span style="border-bottom: 2px solid blue">Shuffle our train dataset:</span>

After we divide our train dataset to mini batches, it's important to shuffle our train dataset in order each mini batch contain random data, that mean we want that each mini batch contain genral and diverse data.

This is important thing because if spesific mini batch will contain only one type of data (for example only data from spesific city), our learning process will be not effecitve, and will not reflect all information.
This will lead to that when we update the model parameters for the spesific mini batch, this update is worng, and does not contribute to our algorithm to minimize the cost function, because it only represents a very specific type of data.

Thus, we shuffle our train dataset. 

Before we start writing the code I want to comment on one more small thing.

<span style="border-bottom: 2px solid blue">Buffer size:</span>

When we want to shuffle our train dataset we need to detrmine the buffer size.
Buffer size refers to the number of images that are loaded into memory and shuffled at a time, that mean if our buffer size is 300 we load randomly 300 images from our train dataset and shuffled them. After this we take randomly amount of images from the 300 images, according our mini batch size.

We need to use buffer size because if we have large dataset and low available memory so we can not loaded into memory our entire dataset and shuffled him at the same time.
Thus, we use buffer size.

As we increase the buffer size and will be closer and closer to our training dataset size we will shuffle more images and thus the chosse of the images from shuffled images, will be more randomly and genral(because for example if I chosse 32 images from 300 shuffled images it's less randomly and genral, than if I chosse 32 images from 900 shuffled images).

In conclusion we need to have buffer size that will balance between available memory that we have and the need for randomness.

I will start with buffer size of 2085, and during the project I will change this if I have to.

Now let's code this section:

We will set initial sizes for BUFFER_SIZE and BATCH_SIZE and during the project I will experiment more values if I have to.

In [None]:
BUFFER_SIZE = 2085
BATCH_SIZE = 32

train_dataset.batch(BATCH_SIZE)
train_dataset = train_dataset.cache().shuffle(BUFFER_SIZE).batch(BATCH_SIZE)
dev_dataset.batch(BATCH_SIZE)
dev_dataset = dev_dataset.cache().shuffle(BUFFER_SIZE).batch(BATCH_SIZE)
test_dataset.batch(BATCH_SIZE)
test_dataset = test_dataset.cache().shuffle(BUFFER_SIZE).batch(BATCH_SIZE)

train_dataset_second_version.batch(BATCH_SIZE)
train_dataset_second_version = train_dataset_second_version.cache().shuffle(BUFFER_SIZE).batch(BATCH_SIZE)
dev_dataset_second_version.batch(BATCH_SIZE)
dev_dataset_second_version = dev_dataset_second_version.cache().shuffle(BUFFER_SIZE).batch(BATCH_SIZE)
test_dataset_second_version.batch(BATCH_SIZE)
test_dataset_second_version = test_dataset_second_version.cache().shuffle(BUFFER_SIZE).batch(BATCH_SIZE)

<a name='5-8'></a>
### 5.8 - Use prefetch

In this section we will use prefetch method that prevents a memory bottleneck that can occur when reading from disk. It save aside some amount of images(from our train dataset), and keeps this data ready for when it's needed.

We can set the number of images that we save aside, or we can use `tf.data.experimental.AUTOTUNE` to choose the number of images that we save aside automatically.

In [None]:
AUTOTUNE = tf.data.experimental.AUTOTUNE

train_dataset = train_dataset.prefetch(buffer_size=AUTOTUNE)
dev_dataset = dev_dataset.prefetch(buffer_size=AUTOTUNE)
test_dataset = test_dataset.prefetch(buffer_size=AUTOTUNE)

train_dataset_second_version = train_dataset_second_version.prefetch(buffer_size=AUTOTUNE)
dev_dataset_second_version = dev_dataset_second_version.prefetch(buffer_size=AUTOTUNE)
test_dataset_second_version = test_dataset_second_version.prefetch(buffer_size=AUTOTUNE)

<a name='6'></a>
## 6 - Unet explnation

In this section I will discuss about the Unet architecture.

In my explanation I am going to use the paper <a href="https://arxiv.org/pdf/1505.04597.pdf">U-Net: Convolutional Networks for Biomedical Image Segmentation</a> of the Unet authors.

<a name='6-1'></a>
### 6.1 - What is Unet?
Unet is architecture that published in 2015 by Olaf Ronneberger, Philipp Fischer, and Thomas Brox From University of Freiburg, Germany.

Unet is very common arcitcthure and in the beginnig the purpose of the Unet authors was to find solution for mediacl segmentation application, and so Unet invented.

The surprising thing that since 2015, in other fields of computer vision also started use Unet too, and discovered that Unet is powerful architecture.

**But how the Unet authors thought about to do model such?**

So, before 2015 we had good models for calssifcation images tasks.
These models says as what there is an the image, but does not answer about important answer that is where are the objects(that we intersting them) in the image, i.e., we want that a class label to be assigned to each pixel in the image.

Thus the localization problem raise. Before 2015, were already models for localization images tasks, but they are still was not very effective and did not given good results of semantic segmentation on smaller datasets.

So the purpose of the Unet authors was to found model arcitcthure, that will be very effective 
and can give very good results of on segmentation on smaller datasets spesfily in biomedical tasks(remember that the Unet authors worked on mediacl problems), that have small datasets.

Often, thousands of training images are usually beyond reach in biomedical tasks, because very difficult and expensive to achive this dataset.
This is also true to a certain extent about other fields in computer vision, because in genral it is more difficult and more expensive to achive dataset for computer vision tasks compare to other fields in DL.

Hence, this is very important to us to find model arcitcthures, that will can give very good results of different tasks on smaller datasets.

**One more intersting question is what the unet aplications?**

Some aplications of Unet are:
 * Image segmentation
 * Super resulotion, that is get lower resultion image and output higher resultion image
 * Diffusion models were transforming gausian noise to newly genrated images.

<div style="text-align:center">
    <img src="Images/semantic segmentation.webp" style="width:370px;height:300px;">
    <img src="Images/super_resulotion.png" style="width:370px;height:300px;">
    <img src="Images/diffusion_models.jpg" style="width:370px;height:300px;">
</div>
<caption><center> <u><b>Figure 11</u></b>: <a href="https://medium.com/analytics-vidhya/introduction-to-semantic-image-segmentation-856cda5e5de8">Image segmentation</a>, <a href="https://www.v7labs.com/blog/image-super-resolution-guide">Super resulotion</a> and <a href="https://www.youtube.com/watch?app=desktop&v=fbLgFrlTnGU">Diffusion models</a> examples from left to right <br> </center></caption>

<a name='6-2'></a>
### 6.2 - Unet model details

<a name='6-2-1'></a>
#### 6.2.1 - Unet Architecture

The Unet architecture that described in the Unet papaer is:

<div style="text-align:center">
    <img src="Images/Unet_architecture.png" style="width:700px;height:400;">
</div>
<caption><center> <u><b>Figure 12</u></b>: Unet Architecture that was described in the Unet paper.

Each blue box corresponds to a multi-channel feature map.
The number of channels is denoted on top of the box.
The x-y-size is provided at the lower left edge of the box.
White boxes represent copied feature maps. The arrows denote the different operations.
<a href="https://arxiv.org/pdf/1505.04597.pdf">(Source)</a><br> </center></caption>

In the originl paper, the Unet only uses the valid part of each convolution, i.e., the segmentation map only contains the pixels, for which the full context is available in the input image.
In this project we will use padding, in order we will output the same image size as the input image size and we will refer to the information at the edges of the input image.

The result of padding operation described in this image:

<div style="text-align:center">
    <img src="Images/padding_example_Unet.png" style="width:500px;height:400;">
</div>
<caption><center> <u><b>Figure 13</u></b>: Result of padding operation on image
<a href="https://arxiv.org/pdf/1505.04597.pdf">(Source)</a><br> </center></caption>


We can see in the example that prediction of the segmentation in the yellow area, requires image data within the blue area as input. One possible solution for completing missing input data is to complete the missing input data by mirroring.

In addition in this project in each blue arrow we will do also batch-normalization(after conv $3\times3$ and ReLU).
Another change that we will do in this project is to use Unet model on images with resolution of $286\times572$.  

In addition, Unet have encoder and decoder. The encoder is the left part of the Unet and he responsible for extract importent information and usful features from the image. 
The decoder is the right part of the Unet and he responsible for take this features back by starting returning to the size of the input image, and try to do perfect segmentation for each pixel. 

At the final layer a 1x1 convolution is used to map each 64-component feature vector to the desired number of classes.

Another thing that worth noting is that Unet architecture is FCN, i.e. it only contains conv layers and does not contain any FC layer, and this is a strong advantage, that we will talk about him soon.

In total the network has 23 conv layers, 4 connecting paths and 4 max-pooling layers.

Another thing that worth noting is that we use ReLU after each conv layer but we can also use ELU.
In this project I will use ReLU.

<div style="text-align:center">
    <img src="Images/relu_vs_elu.png" style="width:500px;height:400;">
</div>
<caption><center> <u><b>Figure 14</u></b>: Fun image of ReLU versus ELU
<a href="https://pallawi-ds.medium.com/understand-semantic-segmentation-with-the-fully-convolutional-network-u-net-step-by-step-9d287b12c852">(Source)</a><br> </center></caption>

<a name='6-2-2'></a>
#### 6.2.2 - Unet Encoder

The encoder is a network that takes the input and outputs a feature map of the input.
We can consider the encoder as just FCN that tries to understand what information the image has.

There is 4 levels in the ecnoder, where that each level repated by two $ 3\times3 $ conv(with padding) + Relu + Batch normalzation layers, and one $ 2\times2 $ max pooling layer to downsample the resolution of the image. After each level we double the amount of the chanels.

This mean that after each level in the encoder the resolution of the image decrease but the amount of the chanels doubles.

This is give us levels with multi chanels, and this is very good because multi chanels gives us more fetures and more diverse information about the image. As bigger our resultion we have more detail informatoin about the fetures.

The encoder purpose is to get a lot of fetures, and in the end to output feture map with many fetures, that each feture has relative small resolution. In this way the encoder can learn very much information about the image and learn complex realtionships in the image data. 

So until now the encoder is regular FCN and behaves like many FCN for understeend very well what is the image represents. In the last stage, instead to add dense layer and fully connected layers like reulgar CNN for classifcation tasks, the Unet use the bootlenack and after this the decoder.

This is because for our purpose know what the image represents is not enough. We want to know where each object located and what the pixels of each object in the image.


<div style="text-align:center">
    <img src="Images/Unet_encoder.png" style="width:500;height:500px;">
</div>
<caption><center> <u><b>Figure 15</u></b>: Unet encoder from Unet paper
<a href="https://arxiv.org/pdf/1505.04597.pdf">(Source)</a><br> </center></caption>

<a name='6-2-3'></a>
#### 6.2.3 - Unet Decoder

The decoder is a network that takes as input the feature map and give us information where are the objects in the image, in that she upsample our bottleneck resolution to the original resolution.

There is 4 levels in the decoder, where the first 3 levels repated by two $ 3\times3 $ conv(with padding) + Relu + Batch normalzation layers, and one $ 2\times2 $ Transpose conv/Deconvultion layer to upsample the resolution of the image. After each level we reduce by 2 times the amount of the chanels.
In the final level we do two $ 3\times3 $ conv(with padding) + Relu + Batch normalzation layers, and one $ 1\times1 $ conv(with padding), in order to map each 64-component feature vector to the desired number of classes. 

In each level we get feature map from the connection path and we combine this with the output of the last Deconvultion layer. We will discuss about this in the next section.

Deconvultion layer is the reverse of max pooling layer. She upsample the resolution of the image.
We use Deconvultion layer because we want to get closer to the resolution of the input image.

In the decoder in each level our resolution increase but the amount of chanels decrese, because we want to use the information that we found and combine this to come close the orginal resulation, and to know where each object located and what the pixels of each object in the image.

<div style="text-align:center">
    <img src="Images/Unet_decoder.png" style="width:500;height:500px;">
</div>
<caption><center> <u><b>Figure 16</u></b>: Unet decoder from Unet paper
<a href="https://arxiv.org/pdf/1505.04597.pdf">(Source)</a><br> </center></caption>

<a name='6-2-4'></a>
#### 6.2.4 - Unet Connecting paths

In order to localize, features maps from the encoder are combined with the upsampled layers outputs from the decoder.

In each level in the decoder we take the fetures of the symetrical part of the encoder, and concante them onto their opoosing level in the decoder.

**A good question is why are we even thinking of doing this?**
The answer is simple: in the encoder part we learn so much information about the image so why not to use this information in the decoder part and help to our decoder with extra information. In this way the conv layers can operate on both the decoder features and the encoder features.
Connecting paths should provide the necessary detail in order to know the acuurate pixels that belonging to each object. We can recover more fine-grain detail with the addition of these skip connections,
becuase in the encoder we learn useful information about our image.

This is good becuase the encoder fetures can tell us for example more detail information of the pixels and the decoder can tell us what the area of the object. When we combining this we get more powerful information.

<div style="text-align:center">
    <img src="Images/Unet_connecting_paths.png" style="width:500;height:500px;">
</div>
<caption><center> <u><b>Figure 17</u></b>: Unet connecting paths
<a href="https://pallawi-ds.medium.com/understand-semantic-segmentation-with-the-fully-convolutional-network-u-net-step-by-step-9d287b12c852">(Source)</a><br> </center></caption>

<a name='6-2-5'></a>
#### 6.2.5 - Unet Bottlenack 
The lowest level in the Unet called bottlenack.

This is the level that connect between the encoder and the decoder. 

The bootlenack repated by two $ 3\times3 $ conv(with padding) + Relu + Batch normalzation layers, and one $ 2\times2 $ Transpose conv/Deconvultion layer to upsample the resolution of the image.

<a name='6-2-6'></a>
#### 6.2.6 Intialization of the weights

In deep networks with many convs layers, a good intialization of the weights is very important.
Otherwise, for example if we set some of the weights to very big values and the set the other weights to very small values, we will get that some parts of the network will give excessive activations, while other parts will never contribute or will give very small activations.

Another bad intialization is zero-intialziation that set all the weights to zero.
This cause that all the parts in our network will compute exactly the same function, i.e, our network is not powrful, and there is really no difference between it compared to a simple learning algorithm like logistic regression for example.

This is can cause bad accuracy and that our network will give bad semantic segmentation of the input image.

Hence, we will use he-normal initialization.

<a name='6-2-7'></a>
#### 6.2.7 - Regulzation

In Unet we need to care about regulzation in order our model will can adapt to new exampels that he did not saw and genralize.

One regulzation technique that we covered is data augmentation, and we saw how to do data augmentation for self-driving car tasks.

In addition, we will use Dropout that is famous technique for prevent overfitting.

<a name='6-3'></a>
### 6.3 - Pros and Cons of Unet

In this section we will discuss about the Pros and the Cons of Unet.

<a name='6-3-1'></a>
#### 6.3.1 - Pros of Unet

Here are several pros of U Net:

* According the Unet paper, Unet can use the available annotated samples more efficiently and even if we have small dataset we can get very good results.

* According the Unet paper, Unet is fast(on small resultion images). Segmentation of a 512x512 image takes less than a second on a recent GPU(in 2015 year).

* Unet is an end-to-end fully convolutional network (FCN), because of which it can accept the training and testing images of any size.

* Unet has unique structure that lead that unet very effective for tasks with high resulotion inputs and outputs.

* Unet is a genral architecture, i.e, we can use many different encoders and decoders. For example we can use VGG-16 in the encoder and decoder or use other architecture. This is useful because we combine Unet architecture with other good architectures and can use Transfer Learning.


<a name='6-3-2'></a>
#### 6.3.2 - Cons of Unet 

Here are several cons of U Net:

* Unet can has optimization challenges especially when we have small train dataset. This is because Unet contains a lot of parameters and tuning them well is not simple task.

* Unet can has not good performance on Imbalanced Data. We will discuss about this and will discuss how to try to address this later in the project.

* Unet can overfit the train dataset especially when we have small train dataset. This is because Unet contains a lot of parameters and when we have small dataset this can cause to overfitting the train dataset. We can solve this with regulzation tecniques.

* Unet can be computationally and memory expensive, when we need to deal with high resolution images. This can be challenging and impossible to use when we use limited GPU memory.

<a name='7'></a>
## 7 - Loss functions and evaluation metrics

In this section I will disucss about loss functions and evaluation metrics for semantic segmentation.

<a name='7-1'></a>
### 7.1 - Pixel accuracy

Pixel accuracy is a very simple evaluation metric for semantic segmentation.

Pixel accuracy measures the overall accuracy of the segmentation.

For each image we take the output of our network for the image, and in each pixel take the id of the class that have the highest probability that pixel will be this class.
For example if I in spesific pixel, I have 4 classes and in the output of our network these are the probabilties for the pixel: 0.1, 0.2, 0.4, 0.3 , so the pixel will have the value 2 because the class with the id 2, has the highest probability that pixel will be this class.

After this we have 2d array that each pixel in the array contain the id of the class that we predicted. Now we will use this formula for the evaluation of our prediction for the image:

$Pixel\hspace{0.1cm}Accuracy = \frac{Number\hspace{0.1cm}of\hspace{0.1cm}correctly\hspace{0.1cm}classified\hspace{0.1cm}pixels}{Total\hspace{0.1cm}number\hspace{0.1cm}of\hspace{0.1cm}Pixels}$ 

We compute number of correctly classified pixels in that we compare our 2d array that we get, to the ground-truth of the image.

The pixel accuracy is between 0 and 1 and as closer the pixel accuracy to 1, so we probably better predicted our pixels classes. I say probably, because high pixel accuracy does not necessarily mean that our model is good. There some problems with pixel accuracy that we will discuss about them in the next section.

If we want to compute the loss for spesific example(that is image), we will do:

$Loss = 1 - Pixel\hspace{0.1cm}Accuracy$

<a name='7-2'></a>
### 7.2 - What is the problem of pixel accuracy?

Pixel accuracy has two main problems.
The first problem is that we calcultate a rough estimate of our accuracy, and we do not consider the probabilities of each pixel to be the class we have chosen. For example if we chose that for sepsific pixel the prediceted id class is 3, that is important if the pixel had very high probabilty to be this class or low probabilty to be this class. 

When we consider the probabilities of each pixel to be the class we have chosen, the learning process is better, and we say to our model to  to be more sure about his decisions, and not only choose the right id of the class for each pixel, but also be as sure as possible if it is a correct decision(that is get right decisions and get the highest probabilities for them).

Is like the differnce between softmax and hardmax.

The second problem is class imbalance, that says the pixel accuracy can be very good, but can be some classes, that the model accuracy on them is very bad.

This is often cause where there is class imbalance, i.e, there is classes that appear more in our dataset and there is calsses that appear less in our dataset. This problem is often cause in computer vision apllications.

Because pixel accuracy is give overall accuracy on our modal on image, she does not good for situations like class imbalance.

For example if we have 2 classes and our image resolution is $10\times10$, and in the ground-truth the first class appear in 95 pixels and the second class appear in 5 pixels.
If our model predicted all the pixels in the image that belongs to the first class as first class, and predicted all the pixels that belongs to the second class as first class, the pixel accuracy that we will have for our image is:

$Pixel\hspace{0.1cm}Accuracy = \frac{Number\hspace{0.1cm}of\hspace{0.1cm}correctly\hspace{0.1cm}classified\hspace{0.1cm}pixels}{Total\hspace{0.1cm}number\hspace{0.1cm}of\hspace{0.1cm}Pixels} = \frac{95}{100} = 0.95 $ 

So we have very high accuracy but in actual our model is not good,  because our model predicted all the pixels in the image that belongs to the second class as first class.

In addition, sometimes we need to give more importance to certain pixels in the images, because if our model will mistake in them this will be more worse, than he will mistake in other pixels. Pixel accuracy can not give more importance to certain pixels than others pixels, and this is a problem.

We will cover the problem of giving more importance to certain pixels in the image in 7.5.2 section.

Thus, we need to find other evaluation matrix that solves these problems.

<a name='7-3'></a>
### 7.3 - Sparse Categorical Cross entropy

The Sparse Categorical Cross entropy is very common loss function for semantic segmentation.

This is actually pixel-wise soft-max over the final output(i.e final feature map) of our network combined with the sparse cross entropy loss function.

Let's defined the soft-max function:

$ p_c(x) = \frac {exp(\hat{y}_c(x))}{\sum_{c'=1}^{C} exp(\hat{y}_{c'}(x))} $

Where C is the amount of the classes, $ \hat{y}_c(x) $ is the value that the pixel have in the final feature map of our network in the c feature chanel, and x is the coordinates of the pixel in our image.
Of course, $ p_c(x) $ for spesific class c and the coordinates of the pixel x, represents the probability of x to be the c class. 

In addition, $ p_c(x) \approx 1 $ when $ \hat{y}_c(x) $ is the maximum value from all the values in the feature chanels for x(in the the final feature map), and $ p_c(x) \approx 0 $ for all the other classes.

This is good propertie, because thanks to this propertie the cross entropy loss function forces our model to give the higest value that he can to $ p_c(x) $ when c is the correct class, and the lowest value that he can to $ p_c(x) $ to all the other classes.
In this way our model will learn the right class for each pixel, and will more sure with his decisions(because he will give more probability to the correct class and give approximate zero probabilty to the other classes).

The sparse cross entropy loss function for specific image, when P contains all the pixels coordinates in the image is:

$ L = - \sum\limits_{\substack{x \in P}} \log(p_{g(x)}(x)) $

Where g(x) is a function that gives the true class of x. 

<a name='7-4'></a>
### 7.4 - Why still Sparse Categorical Cross entropy is not good enough?

Firstly, the first problem that was in pixel accuracy solved, because the Sparse Categorical Cross entropy take in consider the probabilities of each pixel to be the class we have chosen.

But the second problem that was in pixel accuracy not solved, i.e, the Sparse Categorical Cross entropy also not good for highly unbalanced segmentations and not good when we want to give more importance to certain pixels in the image, from reasons similar to what we described in section 7.2.

In the next section we will cover other evaluation metrics and losses, that solves the problems of giving more importance to certain pixels in the image, and highly unbalanced segmentation.

<a name='7-5'></a>
### 7.5 - The solutions for highly unbalanced segmentations and giving importance to certain pixels

In this section I will cover the solutions for highly unbalanced segmentations and giving importance to certain pixels.

In this section I am helped with the paper <a href="https://arxiv.org/pdf/1707.03237v3.pdf">Generalised Dice overlap as a deep learning loss function for highly unbalanced segmentations </a> of Carole H. Sudre, Wenqi Li, Tom Vercauteren, Sebastien Ourselin, and M. Jorge Cardoso.

<a name='7-5-1'></a>
#### 7.5.1 - Weighted Sparse Categorical Cross entropy for each class

Let's recall in the first part of our second problem that was in pixel accuracy and Sparse Categorical Cross entropy:

They are not good for highly unbalanced segmentations.

We discussed in section 7.2 about the highly unbalanced segmentation problem and why this important to solve this, and now we will solve this.

Let's recall our formula for Sparse Categorical Cross entropy.
The sparse cross entropy loss function for specific image, when P contains all the pixels coordinates in the image is:

$ L = - \sum\limits_{\substack{x \in P}} \log(p_{g(x)}(x)) $

Now we want to give more importance to some classes that have very small data relative to other classes or some classes that more important for us, and we will do this with the function $w : \{1,..., C\} \to \mathbb{R}$ where C is the amount of the classes. Now, the new loss function, i.e, the Weighted Sparse Categorical Cross entropy(for each class) for specific image, when P contains all the pixels coordinates in the image is:

$ L = - \sum\limits_{\substack{x \in P}} w(g(x))\log(p_{g(x)}(x)) $

In this loss function we will give different weights for different classes.

Now, if our model will do mistake that related to class that have very small data relative to other classes or this class more important for us, so the loss function will be bigger than before(without w, i.e, in the original loss function), and in this way we will force our model to learn the minority classes or important classes.

In this way, we solved the problem of highly unbalanced segmentations.

But, how we create the w function, and know that it is a good function?

So, we will pass over all the ground-truth images in our train dataset, and count the amount of the pixels that belong to each class. After this we compute w(c) in this way:

$ w(c) = \frac{Total\hspace{0.1cm} amount\hspace{0.1cm} of\hspace{0.1cm} pixels\hspace{0.1cm} in\hspace{0.1cm} our\hspace{0.1cm} train\hspace{0.1cm} dataset}{Amount\hspace{0.1cm} of\hspace{0.1cm} pixels\hspace{0.1cm} in\hspace{0.1cm} our\hspace{0.1cm} train\hspace{0.1cm} dataset\hspace{0.1cm} that\hspace{0.1cm} belongs\hspace{0.1cm} to\hspace{0.1cm} class\hspace{0.1cm} c} $

For example if we have train dataset that contains 10 images with resultion $ 10 \times 10 $, have 2 classes A and B, and the amount of pixels in our train dataset that belongs to class A are 200, we will get:

$ w(A) = \frac{1000}{200} = 5 $, that give more impotance to pixels that belongs to the class A.

In this way we give more importance to classes that have very small data relative to other classes.

<a name='7-5-2'></a>
#### 7.5.2 - Weighted Sparse Categorical Cross entropy for each pixel

Let's recall in the second part of our second problem that was in pixel accuracy and Sparse Categorical Cross entropy:

They are not good when we want to give more importance to certain pixels in the image.

Firstly before we will find solution to this problem, we need to understand why and when we need to give more importance to certain pixels in the image.
In order to understand this, I will give example:

In Medicine applications often, there are very small details and little borders that separate different cells.  
Hence, we need to build a model that will be very good with predicting successfully very small details and borders, because if not, we will do big mistakes, that can lead to a wrong decision and harm to the person we are treating.

For example, the <a href="https://arxiv.org/pdf/1505.04597.pdf"> Unet paper </a> show this example:

<div style="text-align:center">
    <img src="Images/Weighted_sparse_categorical_cross_entropy_for_each_pixel.png" style="width:200;height:200px;">
</div>
<caption><center> <u><b>Figure 18</u></b>: HeLa cells on glass recorded with DIC (differential interference contrast) microscopy. (a) raw image. (b) overlay with ground truth segmentation. Different colors
indicate different instances of the HeLa cells. (c) generated segmentation mask (white:
foreground, black: background). (d) map with a pixel-wise loss weight to force the
network to learn the border pixels.
<a href="https://arxiv.org/pdf/1505.04597.pdf">(Source)</a><br> </center></caption>

We can see that very important to prdict correctly the borders that separate different cells, and hence we give the pixels of the borders more importance than other pixels.
In this way we force our model to learn and predict succesfully borders and little diteals,
and we see that we get good prediction for our image.

So, let's summarize our problem:

We want to give more importance to certain pixels in the image that more important to us.

After we recall our problem, let's recall our formula for Sparse Categorical Cross entropy.
The sparse cross entropy loss function for specific image, when P contains all the pixels coordinates in the image is:

$ L = - \sum\limits_{\substack{x \in P}} \log(p_{g(x)}(x)) $

Now we want to give more importance to some pixels in the image, and we will do this with the function $w : P \to \mathbb{R}$ where P contains all the pixels coordinates in the image. Now, the new loss function, i.e, the Weighted Sparse Categorical Cross entropy(for each pixel) for specific image, when P contains all the pixels coordinates in the image is:

$ L = - \sum\limits_{\substack{x \in P}} w(x)\log(p_{g(x)}(x)) $

In this loss function we will give different weights for different pixels.

Now, if our model will do mistake that related to pixel that very important to us(like the border that separate different cells in the medicial application that we discussed before), so the loss function will be bigger than before(without w, i.e, in the original loss function), and in this way we will force our model to learn and predict succesfully borders and little diteals.

The Unet paper describe how to create the w function. I don't cover this in this project because we will dont use this loss function in the project.

<a name='7-5-3'></a>
#### 7.5.3 - Dice Coefficient and soft Dice Coefficient

In this section I will discuss about another solution for highly unbalanced segmentations, that is Dice Coefficient.

<a name='7-5-3-1'></a>
##### 7.5.3.1 - What is Precision, Recall and F1 score?

Firstly, before we will talk about Dice Coefficient we need to talk about Precision, Recall and F1 score.

For sake of explanation, I will use example in this section, and will explain what is Recall, Precision and F1 score with this example.

Let's assume that we work on the following medical application:

We need to build a model that will predict if a patient is diagnosed with cancer or not, according some features. The relevant element is patient that with cancer.

When we build this model we need a way to estimate how good our model predictions.

So, the most trivial way to estimate how good our model predictions is accuarcy:

$ Accuarcy = \frac{Number\hspace{0.1cm}of\hspace{0.1cm}correctly\hspace{0.1cm}classified\hspace{0.1cm}patients}{Total\hspace{0.1cm}number\hspace{0.1cm}of\hspace{0.1cm}patients} $

The problem of this way is that we can get very good accuarcy on our model, but in actually our model is not good. For example, if we have 1000 patients that 998 of them are actually not with cancer(let's call this group A) and 2 of them are actually with cancer(let's call this group B), and our model predict correctly on group A, but on one patient from the two patients in group B our model predict that he is not have cancer, we will get that our model accuracy is:

$ Accuarcy = \frac{Number\hspace{0.1cm}of\hspace{0.1cm}correctly\hspace{0.1cm}classified\hspace{0.1cm}patients}{Total\hspace{0.1cm}number\hspace{0.1cm}of\hspace{0.1cm}patients} = \frac{999}{1000} = 99.9$

Wow! This is amazing result! Our model is the best!

Not so... If we will look on the amount of the patients that have cancer, our model predict them 50% correctly. This is very bad, because you tell to patient with cancer that he not have cancer, and risking his life.

Hence, accuracy is not a good way to estimate how good our model predictions, and we need to find another way to estimate how good our model predictions.

That's why we come to talk about Precision, Recall and F1 score.

**So what is Precision?**

Precision is how good the model avoids to predict patient as diagnosed with cancer where is not has cancer, that mean how good the model avoids to predict FP.
Another way to think about accuracy is the probabilty that if I randomly selected patient that predicted with cancer is TP.

The formula of Precision is:

$ Precision = \frac{TP}{TP + FP} $ 


**So what is Recall?**

Recall is how good the model avoids to predict patient that has cancer as not diagnosed with cancer, that mean how good the model avoids to predict FN.
Another way to think about recall is the probabilty that if I randomly selected patient that labeld with cancer(the patient has actually cancer), I will predict his as diagnosed with cancer, that mean TP.

The formula of Recall is:

$ Recall = \frac{TP}{TP + FN} $

So, if we want that our model will work good, we need that our model will have good score both in Recall and in Precision.
Hence, we want to find score that will care of Recall and Precision.

F1 score is good solution for this.

**So what is F1 score?**

F1 score take into account both Recall and Precision and try to balance both of them.
It's the hermonic mean of Recall and Precision, and thus he tend to the lowest value between Recall and Precision. This is good, becuase if we have very good precision but have bad recall, this is not good and the F1 score give us score that tend to recall, i.e, bad score.

The formula of F1 score is:

$ F1 = \frac{2}{\frac{1}{Precision} + \frac{1}{Recall}} = \frac{2 \times Precision \times Recall}{Precision + Recall}$

Hence, we have good way to estimate how good our model predictions. For example, for the example that we provided before with the 1000 patients, the F1 score is:

$ F1 = \frac{2 \times Precision \times Recall}{Precision + Recall} = \frac{2 \times 1 \times 0.5}{1 + 0.5} = \frac{1}{1.5} = \frac{2}{3} $

And this is very good estimate for our example, compare to the accuracy method that we had use(that give us 99.9).

One last thing to note is that in some cases we will want to give more importance to Recall or Precision. For example, in the example that we described Recall is more important than Precision, because this is more dangerous to mistakly predict paitents with cancer as diagnose with no cancer, than to mistakly predict paitents that have not cancer as diagnose with cancer.

If we want to give more importance to Recall or Precision, we will change the F1 score.

<a name='7-5-3-2'></a>
#### 7.5.3.2 - Dice Coefficient explanation and formula explanation

So, we discussed in the previous section about the importance of Precision, Recall and F1 score. Now we will talk about Dice Coefficient that based on Precision, Recall and F1 score.

The Dice Coefficient is evalution matric for semantic segmentation image that is actually F1 score, but suited for semantic segmentation image problem. It may not sound so clear at the moment but now we will understand it better.

Firstly, let's open the formula of F1 score so that it contains only TP, FN and FP expressions.

Let's recall the F1 score formula:

$ F1 = \frac{2}{\frac{1}{Precision} + \frac{1}{Recall}} \xrightarrow{\text{Precision and Recall formulas}} \frac{2}{\frac{TP + FP}{TP} + \frac{TP + FN}{TP}} = \frac{2}{\frac{2 \times TP + FP + FN}{TP}} = \frac{2 \times TP}{2 \times TP + FP + FN}$

Now let's assume that we have problem that we need to do semantic segmentation only on one relvant element, that mean that we have two classes that is relvant class and not relvant class. For example to do semantic semgentation on images and classify each pixel as cat or not cat.

So we have Y that represents the ground truth for our image and $ \hat{Y} $ that represents our model prediction for our image.The Y and $ \hat{Y} $ are 2d arrays that in each pixel in the array has value 0 or 1(for example indicates cat or not cat). We will want that the intersection between Y and $ \hat{Y} $ will be bigger as possible and in actually the intersection will be as close as possible to the union between Y and $ \hat{Y} $. If we will achive this, so this mean that for the relvant class(in our example is cat) the model predictions are very well.

But now the question arises as to what union and intersection between Y and $ \hat{Y} $ actually is in terms of TP, FN and FP expressions.
So intersection between Y and $ \hat{Y} $ is actually TP, and union is actually TP + FN + FP, because the union include the pixels that we predict as positive(in our example cat) correctly(this is actually the intersection between Y and $ \hat{Y} $), the pixels that we think that are negative, but we were wrong about them(pixels in Y but not in $ \hat{Y} $), and pixels that we think that are poitive, but we were wrong about them (pixels in $ \hat{Y} $ but not in Y). 

Hence, the formulas of union and intersection between Y and $ \hat{Y} $ in terms of TP, FN and FP expressions are:

$ Intersection = TP $

$ Union = TP + FN + FP $ 

and therefore:

$ F1 = \frac{2 \times TP}{2 \times TP + FP + FN} = \frac{2 \times Intersection}{Intersection + Union} $ 

And like we said before Dice Coefficient is evalution matric for semantic segmentation image that is actually F1 score. Hence:

$ Dice \hspace{0.1cm} Coefficient = \frac{2 \times Intersection}{Intersection + Union} $ 

Okey, we get formula for Dice Coefficient, but we still have problem.
In semantic segmentation $ \hat{Y} $ is not 2d array that each pixel in the array has value 0 or 1(for example indicates cat or not cat), but 2d array that each pixel in the array has probability that the pixel will be 1 value(for our example the probability that the pixel will be cat), so we need a way to compute the Intersection and the Union, when according what we were said before:

$ Intersection = TP $

$ Union = TP + FN + FP $ 

We can compute this for Y and $ \hat{Y} $ in this way:

$ Dice \hspace{0.1cm} Coefficient = \frac{2 \times Intersection}{Intersection + Union} = \frac{2 \times \sum\limits_{\substack{y,\hat{y} \in P}} y \times \hat{y}}{\sum\limits_{\substack{y,\hat{y} \in P}} (y + \hat{y})}$ 

When P contains all the pixels coordinates in the image, and in each sum y and \hat{y} on the same pixel coordinate. 
Pay attention that when we sum $ y + \hat{y} $, we sum the intersecion between y and \hat{y} twice and therefore we don't need to add the intersection.

Finally, how we are use Dice Coefficient when we have semantic segmentation problem on more than one relevant class?

The answer is simple: We will compute Dice Coefficient on each class and take the average. 
The formula for this is:

$ Dice \hspace{0.1cm} Coefficient \hspace{0.1cm} for \hspace{0.1cm} K \hspace{0.1cm} classes = \frac{\sum_{k=1}^{K} Dice \hspace{0.1cm} Coefficient \hspace{0.1cm} for \hspace{0.1cm} class \hspace{0.1cm} k}{K}  $

<a name='7-5-3-3'></a>
#### 7.5.3.3 - Soft Dice Coefficient

In the previous section we discussed about the Dice Coefficient evaluation matrix, but now the question arises as to how we can use Dice Coefficient in the loss function?

So, the answer is simple:

The Dice Coefficient evalution matric is between 0 and 1 inclusive, and as closer Dice Coefficient evalution matric to 1 this means our model is better, so if we want to know the loss function we can use this formula that called Soft Dice Coefficient:

$ Soft \hspace{0.1cm} Dice \hspace{0.1cm} Coefficient = 1 - Dice \hspace{0.1cm} Coefficient$

And we can see that as closer Dice Coefficient evalution matric to 1 this means our model is better, and therefore the Soft Dice Coefficient is smaller and closer to zero, and vice versa. Of course that the Soft Dice Coefficient is between 0 and 1 inclusive.

In gernal the Soft Dice Coefficient for K classes is:

$ Soft \hspace{0.1cm} Dice \hspace{0.1cm} Coefficient \hspace{0.1cm} for \hspace{0.1cm} K \hspace{0.1cm} classes = 1 - \frac{\sum_{k=1}^{K} Dice \hspace{0.1cm} Coefficient \hspace{0.1cm} for \hspace{0.1cm} class \hspace{0.1cm} k}{K}  $


<a name='7-5-3-4'></a>
#### 7.5.3.4 - Why Soft Dice Coefficient is good?

The Soft Dice Coefficient is good loss function for several main reasons:

* She depends on the Dice Coefficient evaluation matrix that is actually F1 score, and we discussed before why F1 score is a good way to estimate our model

* She very useful for imbalanced datasets, that this is very common problem. The loss function force the model be good in all the classes, because if the model is not good in some classes, it results in a higher Soft Dice Coefficient value.

<a name='8'></a>
## 8 - The models that we will build
In this section I will write the models that I will build.

**1.** Regular Unet model architecture with dropout layer after each max pooling layer in the encoder and after each deconvultion layer in the decoder, basic_data_augmentation function, first version of normalization and Sparse Categorical Cross entropy loss function.

**2.** Regular Unet model architecture with dropout layer after each max pooling layer in the encoder and after each deconvultion layer in the decoder, basic_data_augmentation function, first version of normalization and Soft Dice Coefficient loss function.

**3.** Regular Unet model architecture with dropout layer after each max pooling layer in the encoder and after each deconvultion layer in the decoder, basic_data_augmentation function, first version of normalization and Weighted Sparse Categorical Cross entroy loss function.

**4.** Regular Unet model architecture without dropout layers, with basic_data_augmentation function, first version of normalization and with Sparse Categorical Cross entropy loss function.

**5.** Regular Unet model architecture with dropout layer after each max pooling layer in the encoder and after each deconvultion layer in the decoder, without data augmentation, first version of normalization and with Sparse Categorical Cross entropy loss function.

**6.** Regular Unet model architecture without dropout layers, without data augmentation, first version of normalization and with Sparse Categorical Cross entropy loss function.

**7.** Regular Unet model architecture with dropout layer after each max pooling layer in the encoder and after each deconvultion layer in the decoder, bad_data_augmentation function, first version of normalization and Sparse Categorical Cross entropy loss function.

**8.** Regular Unet model architecture with dropout layer after each max pooling layer in the encoder and after each deconvultion layer in the decoder, basic_data_augmentation function, first version of normalization, Weighted Sparse Categorical Cross entroy loss function and Learning Rate scheduler.

**9.** Regular Unet model architecture with dropout layer after each max pooling layer in the encoder and after each deconvultion layer in the decoder, basic_data_augmentation function, first version of normalization, Weighted Sparse Categorical Cross entroy loss function and Dropout scheduler.

**10.** Regular Unet model architecture with dropout layer after each max pooling layer in the encoder and after each deconvultion layer in the decoder, basic_data_augmentation function, first version of normalization, Weighted Sparse Categorical Cross entroy loss function and Dropout and Learning Rate scheduler.

**11.** Regular Unet model architecture with dropout layer after each max pooling layer in the encoder and after each deconvultion layer in the decoder, advance_data_augmentation function, first version of normalization, Weighted Sparse Categorical Cross entroy loss function and Learning Rate scheduler.

**12.** Regular Unet model architecture with dropout layer after each max pooling layer in the encoder and after each deconvultion layer in the decoder, basic_data_augmentation function,second version of normalization, Weighted Sparse Categorical Cross entroy loss function and Learning Rate scheduler.

**13.** Regular Unet model architecture with dropout layer after each max pooling layer in the encoder and after each deconvultion layer in the decoder, advance_data_augmentation function, second version of normalization, Weighted Sparse Categorical Cross entroy loss function and Learning Rate scheduler.

<a name='9'></a>
## 9 - Let's build Unet encoders and decoders
In this section we will build Unet encoders and decoders.

They will be used to build the models in the next section.

<a name='9-1'></a>
### 9.1 - Let's build Unet encoder
In this section we will build Unet encoder.

Let's recall the structure of Unet encoder:

There is 4 levels in the ecnoder, where that each level repated by two $ 3\times3 $ conv(with padding) + Relu + Batch normalzation layers, and one $ 2\times2 $ max pooling layer to downsample the resolution of the image.

For simplicity, we will implement the bottleneck part in the encoder part.

Before we will build the encoder of Unet, we need to build a function that will represent each level in the encoder.
This function will implement the two $ 3\times3 $ conv(with padding) + Relu + Batch normalzation layers, and one $ 2\times2 $ max pooling layer to downsample the resolution of the image(This is optional. In the boottleneck part we will not use max pooling layer).

In [16]:
def downsampling_block(input, n_filters, dropout_prob = 0, max_pooling=True):
    """
    Down sampling block that contains two 3*3 conv(with padding) + Relu + Batch normalzation layers,
    and one optional 2*2 max pooling layer. 

    Arguments:
        input - Input tensor
        n_filters - Number of the filters in the block
        dropout_prob - Dropout probability. If equal to 0, so we not use dropout.
        max_pooling - Bool value that indicates if we need to do max pooling in this block

    Returns:
        outputs_dic - Dictionary that contains the next layer and skip connection outputs
    """ 

    conv = Conv2D(n_filters, 3, activation='relu', padding='same', kernel_initializer='he_normal')(input)
    conv = tf.keras.layers.BatchNormalization()(conv)
    conv = Conv2D(n_filters, 3, activation='relu', padding='same', kernel_initializer='he_normal')(conv)
    conv = tf.keras.layers.BatchNormalization()(conv)
    
    if max_pooling:
        next_layer = MaxPooling2D(2, strides=2)(conv)
    else:
        next_layer = conv
        
    if dropout_prob > 0:
        next_layer = Dropout(dropout_prob)(next_layer)
        
    skip_connection = conv

    outputs_dic = {"next_layer":next_layer, "skip_connection":skip_connection}

    return outputs_dic

In [17]:
def unet_encoder(input, start_n_filters = 64, dropout_prob = 0.3):
    """
    Unet encoder

    Arguments:
        input - Input tensor
        start_n_filters - Number of the filters in the first level of the encoder
        dropout_prob - Dropout probability. If equal to 0, so we not use dropout.

    Returns:
        Outputs that we will need in the decoder.
        Outputs(feature maps) of the encoder that we need for the skip connections in this order:
        first skip connection, ..., fourth skip connection  
        And the output(feature map) of the last level in the encoder for the first level in the decoder.
    """ 

    dblock1 = downsampling_block(input, start_n_filters, dropout_prob)
    dblock2 = downsampling_block(dblock1["next_layer"], start_n_filters*2, dropout_prob)
    dblock3 = downsampling_block(dblock2["next_layer"], start_n_filters*4, dropout_prob)
    dblock4 = downsampling_block(dblock3["next_layer"], start_n_filters*8, dropout_prob)
    dblock5 = downsampling_block(dblock4["next_layer"], start_n_filters*16, max_pooling=False)

    return dblock1["skip_connection"], dblock2["skip_connection"], \
    dblock3["skip_connection"], dblock4["skip_connection"], dblock5["next_layer"]

<a name='9-2'></a>
### 9.2 - Let's build Unet decoder
In this section we will build Unet decoder.

Let's recall the structure of Unet decoder:

There is 4 levels in the decoder, where the first 3 levels repated by two $ 3\times3 $ conv(with padding) + Relu + Batch normalzation layers, and one $ 2\times2 $ Transpose conv/Deconvultion layer to upsample the resolution of the image

In the final level we do two $ 3\times3 $ conv(with padding) + Relu + Batch normalzation layers, and one $ 1\times1 $ conv(with padding).

Before we will build the decoder of Unet, we need to build a function that will represent each level in the decoder.

This function will implement the one $ 2\times2 $ Transpose conv/Deconvultion layer and two $ 3\times 3 $ conv(with padding) + Relu + Batch normalzation layers.

The $ 1\times 1 $ conv(with padding) layer not include in this function and we will implement this layer in the decoder function.

In [18]:
def upsampling_block(input, skip_connection_input, n_filters, dropout_prob = 0):
    """
    Up sampling block that contains one 2*2 Transpose conv/Deconvultion layer, and 
    two 3*3 conv(with padding) + Relu + Batch normalzation layers. 

    Arguments:
        input - Input tensor from the last block
        skip_connection_input - Skip connection input tensor
        n_filters - Number of the filters in the block
        dropout_prob - Dropout probability. If equal to 0, so we not use dropout.

    Returns:
        output - Output tensor  
    """ 
    
    upsample_input = Conv2DTranspose(n_filters, kernel_size=3, strides=2,
                                      padding='same')(input)
    
    if dropout_prob > 0:
        upsample_input = Dropout(dropout_prob)(upsample_input)
    merge = concatenate([upsample_input, skip_connection_input], axis=-1)
    output = Conv2D(n_filters, 3, activation='relu', padding='same',
                     kernel_initializer='he_normal')(merge)
    output = tf.keras.layers.BatchNormalization()(output)
    output = Conv2D(n_filters, 3, activation='relu', padding='same',
                     kernel_initializer='he_normal')(output)
    output = tf.keras.layers.BatchNormalization()(output)

    return output

In [19]:
def unet_decoder(encoder_output, n_classes, start_n_filters = 512, dropout_prob = 0.3):
    """
    Unet decoder

    Arguments:
        encoder_output - The output that we get from the encoder function.
        n_classes - Number of classes in the dataset
        start_n_filters - Number of the filters in the first level of the decoder
        dropout_prob - Dropout probability. If equal to 0, so we not use dropout.
        
    Returns:
        output - The output of the decoder
    """ 

    skip_connection_1, skip_connection_2, skip_connection_3, skip_connection_4, \
    input_for_decoder_first_level = encoder_output

    ublock1 = upsampling_block(input_for_decoder_first_level, skip_connection_4, start_n_filters, dropout_prob)
    ublock2 = upsampling_block(ublock1, skip_connection_3, start_n_filters/2, dropout_prob)
    ublock3 = upsampling_block(ublock2, skip_connection_2, start_n_filters/4, dropout_prob)
    ublock4 = upsampling_block(ublock3, skip_connection_1, start_n_filters/8, dropout_prob)

    output = Conv2D(filters=n_classes, kernel_size=1, padding='same', activation = "softmax")(ublock4)

    return output


<a name='10'></a>
## 10 - Let's build genral unet model

In [21]:
def genral_unet_model(input_size, data_augmentation_func, encoder_func, decoder_func, data_augmentation_prob = 0.5, dropout_prob = 0.3, start_n_filters = 64, n_classes = NUM_CLASSES):
    """
    Unet model

    Arguments:
        input_size - Input shape
        data_augmentation_func - Data augmentation function that we will use
        data_augmentation_prob - The probabilty for data augmentation. 
        We will do this in order that not always will be data augmentation,
        and our model also for some iterations will learn our original images.
        If equal to 0, so we not use data augmentation

        encoder_func - Encoder function that we will use
        decoder_func - Decoder function that we will use
        dropout_prob - Dropout probability. If equal to 0, so we not use dropout.
        start_n_filters - Number of the filters in the first level of the encoder
        n_classes - Number of classes in the dataset
        
    Returns:
        model - tf.keras.Model
    """

    input = Input(input_size)
    data_augmentation_layer = DataAugmentationLayer(data_augmentation_func, data_augmentation_prob)
    augmented_input = data_augmentation_layer(input)

    encoder_output = encoder_func(augmented_input, start_n_filters, dropout_prob)
    decoder_output = decoder_func(encoder_output, n_classes, start_n_filters*8) 

    model = tf.keras.Model(input, decoder_output)

    return model

<a name='11'></a>
## 11 - Let's code the cost functions and evaluation metrics that we need for the models

In section 7 we discussed about cost functions and evaluation metrics for Unet.

In this project we will use Sparse Categorical Cross entropy, Weighted Sparse Categorical Cross entropy for each class and Soft Dice Coefficient as loss functions, and our evalution matric will include:

1. Dice Coefficient
2. Pixel accuracy

We include Pixel accuracy in our evaluation matrix, in order to see how Pixel accuracy don't indicates if our model is good, even if Pixel accuracy has a high value.

We can include only some classes in the evaluation matric and some classes not include, but in this project I want to include all of them, and challange myself to build a model that give good performance to all the classes.
In industery applications this is will be more smart to include somtimes not all the classes, but only the classes that are the most important to us.

Now, we will implement Weighted Sparse Categorical Cross entropy for each class and Soft Dice Coefficient loss functions, and Dice Coefficient evaluation matrix.

**Let's implement Dice Coefficient:**

In [22]:
def dice_coefficient_for_one_class(y_true, y_pred, class_id):
    y_true = tf.math.reduce_max(y_true, axis=-1, keepdims=False)
    y_true = tf.cast(tf.equal(y_true, class_id), tf.float32)
    y_pred = y_pred[..., class_id]
    
    intersection = tf.reduce_sum(y_true * y_pred)
    union = tf.reduce_sum(y_true + y_pred)
    
    return (2.0 * intersection) / (union)

**And now let's implement Dice Coefficient and Soft Dice Coefficient for K classes:**

In [23]:
def dice_coefficient(y_true, y_pred):
    dice = 0.0
    for class_id in range(NUM_CLASSES):
        dice += dice_coefficient_for_one_class(y_true, y_pred, class_id)
    return dice / NUM_CLASSES

def soft_dice_coefficient(y_true, y_pred):
    dice = 0.0
    for class_id in range(NUM_CLASSES):
        dice += dice_coefficient_for_one_class(y_true, y_pred, class_id)
    return 1.0 - dice / NUM_CLASSES

**And now let's implement Weighted Sparse Categorical Cross entropy for each class:**

In [None]:
def compute_amount_of_pixels_that_belong_to_class(example, class_id):
    ground_truth = example[1]
    ground_truth = tf.math.reduce_max(ground_truth, axis=-1, keepdims=False)
    count = tf.reduce_sum(tf.cast(tf.equal(ground_truth, class_id), tf.float32))
    return count


def calculate_class_frequency(train_dataset):
    total_amount_of_pixels_in_train_dataset = TOTAL_TRAIN_FILES * TARGET_SIZE_TO_RESIZE[0] * TARGET_SIZE_TO_RESIZE[1]
    class_frequency = tf.Variable(tf.zeros(NUM_CLASSES, dtype=tf.float32))
    
    for class_id in range(NUM_CLASSES):
        amount_of_pixels_that_belong_to_class = train_dataset.reduce(
            0.0, lambda x, example: x + compute_amount_of_pixels_that_belong_to_class(example, class_id))
        
        class_frequency[class_id].assign(amount_of_pixels_that_belong_to_class)
        
    return class_frequency 

class_frequency = calculate_class_frequency(train_dataset)

Now let's assign more importance to the classes that are minority, i.e, that don't show up much in the train dataset.
To the 2 classes that have the most little frequency we will give the higest importance and so we will continue and bring less and less importance as we go to classes that have higer frequency. The classes that have good frequency in the train dataset we will give regular importance, i.e, 1.

In [None]:
class_weights = tf.Variable(tf.zeros(NUM_CLASSES, dtype=tf.float32))

_, minority_indices = tf.math.top_k(-class_frequency, k=15)

minority_indices = minority_indices.numpy()

for class_id in range(NUM_CLASSES):
    class_weights[class_id].assign(1.0)
    
for index, value in enumerate(minority_indices):
    if index == 0:
        class_weights[value].assign(6.0)
    elif index < 3:
        class_weights[value].assign(5.0)
    elif index < 5:
        class_weights[value].assign(4.0)
    elif index < 7:
        class_weights[value].assign(3.0)
    else:
        class_weights[value].assign(2.0)

In [None]:
def weighted_sparse_categorical_crossentropy(y_true, y_pred):
    y_pred_with_class_weights = tf.math.pow(y_pred, class_weights)
    weighted_cross_entropy = tf.keras.losses.sparse_categorical_crossentropy(y_true, y_pred_with_class_weights, from_logits=False)
    return weighted_cross_entropy

<a name='12'></a>
## 12 - Let's implement plot function for plot history of the model

In this section we will implement the plot function for plotting history of the model, because after we will finish train our model we need to plot the history of the model.

We will plot the history of the model for the train, dev and test datasets in the loss function and our evaluation matrix.

In [24]:
def plot_history(model_history):
  fig, axes = plt.subplots(nrows = 1, ncols = 3, figsize=(20, 7))

  # Training
  sns.lineplot(range(1, len(model_history.history["loss"]) + 1), model_history.history["loss"], ax = axes[0], label="Training Loss")
  sns.lineplot(range(1, len(model_history.history["loss"]) + 1), model_history.history["accuracy"], ax = axes[1], label="Training Accuracy")
  sns.lineplot(range(1, len(model_history.history["loss"]) + 1), model_history.history["dice_coefficient"], ax = axes[2], label="Training Dice Coefficient")

  # Validation
  sns.lineplot(range(1, len(model_history.history["loss"]) + 1), model_history.history["val_loss"], ax = axes[0], label="Validation Loss")
  sns.lineplot(range(1, len(model_history.history["loss"]) + 1), model_history.history["val_accuracy"], ax = axes[1], label="Validation Accuracy")
  sns.lineplot(range(1, len(model_history.history["loss"]) + 1), model_history.history["val_dice_coefficient"], ax = axes[2], label="Validation Dice Coefficient")
  
  axes[0].set_title("Loss", fontdict = {'fontsize': 15})
  axes[0].set_xlabel("Epoch")
  axes[0].set_ylabel("Loss")

  axes[1].set_title("Accuracy", fontdict = {'fontsize': 15})
  axes[1].set_xlabel("Epoch")
  axes[1].set_ylabel("Accuracy")

  axes[2].set_title("Dice Coefficient", fontdict = {'fontsize': 15})
  axes[2].set_xlabel("Epoch")
  axes[2].set_ylabel("Dice Coefficient")
  plt.tight_layout()
  plt.show()

<a name='13'></a>
## 13 - Let's implement functions for shows model predictions

In this section we will implement the functions for shows model predictions after we trained the model.

In [25]:
FIRST_EXAMPLE_IN_BATCH = 0

def create_mask(pred):
    mask = tf.argmax(pred, axis=-1)
    mask = mask[..., tf.newaxis]
    return mask


def plot_prediction(ground_troth, pred):
    fig, arr = plt.subplots(1, 2)
    arr[0].imshow(ground_troth)
    arr[0].set_title('Ground truth')
    arr[1].imshow(pred)
    arr[1].set_title('Prediction')
    plt.tight_layout()
    plt.show()


def show_predictions(model, dataset, num=1):
    """
    Displays the first image of each of the num batches
    """
    for img, ground_truth in dataset.take(num):
        batch_pred = model.predict(img)
        mask = create_mask(batch_pred[FIRST_EXAMPLE_IN_BATCH])
        mask_vis = convert_mask_to_visualization_mask(mask, labels)
        groud_truth_vis = convert_mask_to_visualization_mask(ground_truth[FIRST_EXAMPLE_IN_BATCH], labels)
        plot_prediction(groud_truth_vis, mask_vis)

<a name='14'></a>
## 14 - Let's implement visualization callbacks functions

In this section we will implement visualization callbacks functions, that let us see the model prediction on spesific example in real time when the model training.

In [26]:
INPUT_HEIGHT, INPUT_WIDTH, INPUT_CHANNELS = INPUT_SHAPE

class VizCallback(tf.keras.callbacks.Callback):
    
    def __init__(self, model, img, ground_truth, **kwargs):
        super().__init__(**kwargs)
        self.model = model
        self.img = tf.reshape(img, [1, INPUT_HEIGHT, INPUT_WIDTH, INPUT_CHANNELS])
        self.ground_truth = ground_truth
    
    def on_epoch_end(self, epoch, logs=None):
        pred = self.model.predict(self.img)
        mask = create_mask(pred[FIRST_EXAMPLE_IN_BATCH])
        mask_vis = convert_mask_to_visualization_mask(mask, labels)
        groud_truth_vis = convert_mask_to_visualization_mask(self.ground_truth, labels)
        plot_prediction(groud_truth_vis, mask_vis)

<a name='15'></a>
## 15 - Let's implement schedulers

In this section we will implement schedulers for our model.

<a name='15-1'></a>
### 15.1 - Let's implement Dropout scheduler

In this section we will implement Dropout scheduler for our model.

In [27]:
class DropoutScheduler(tf.keras.callbacks.Callback):
    def __init__(self, drop_schedule):
        """
        drop_schedule is a dictionary that have keys of epochs and values of new dropout probabilities for them
        """
        super(DropoutScheduler, self).__init__()
        self.drop_schedule = drop_schedule
        
    def on_epoch_begin(self, epoch, logs=None):
        if epoch in self.drop_schedule:
            new_prob = self.drop_schedule[epoch]
            for layer in self.model.layers:
                if isinstance(layer, tf.keras.layers.Dropout):
                    layer.rate = new_prob

<a name='15-2'></a>
### 15.2 - Let's implement Learning Rate scheduler

In this section we will implement Learning Rate scheduler for our model.

In [28]:
def lr_scheduler(epoch, initial_lr, new_lr, num_of_epochs_for_initial_lr):
    if epoch < num_of_epochs_for_initial_lr:
        return initial_lr
    else:
        return new_lr

<a name='16'></a>
## 16 - Regular Unet model

In this section we will create models that are versions of regular unet model(Unet model that we described in section 6), and we will research there performance, when we do some changes like change the data augmentation, hyperparameters, regulzation and loss functions and more.

One important note is that I not run on my computer the model training process, beacause I have not good enough hardware. Instead, I run my notebook on kaggle, and I will provide in this notebook screenshots that will desribe my results.

<a name='16-1'></a>
### 16.1 - First model version

In this section we will create model that has the regular Unet model architecture with dropout layer after each max pooling layer in the encoder and after each deconvultion layer in the decoder, basic_data_augmentation function and Sparse Categorical Cross entropy loss function.

<a name='16-1-1'></a>
#### 16.1.1 - Create the model

In this section we will create the model.

In [31]:
first_version_model  = genral_unet_model(INPUT_SHAPE, basic_data_augmentation, unet_encoder, unet_decoder)

Let's see the summary of the model

In [32]:
first_version_model.summary()

Model: "model_1"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 input_2 (InputLayer)        [(None, 128, 128, 3)]        0         []                            
                                                                                                  
 data_augmentation_layer_1   (None, 128, 128, 3)          0         ['input_2[0][0]']             
 (DataAugmentationLayer)                                                                          
                                                                                                  
 conv2d_19 (Conv2D)          (None, 128, 128, 64)         1792      ['data_augmentation_layer_1[0]
                                                                    [0]']                         
                                                                                            

Let's compile the model with the Sparse Categorical Cross entropy loss function.

In [None]:
first_version_model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
              metrics=['accuracy', dice_coefficient])

Let's create visualization callback:

In [None]:
for img, ground_truth in train_dataset.take(1):
    first_version_viz_callback = VizCallback(first_version_model, img[FIRST_EXAMPLE_IN_BATCH], ground_truth[FIRST_EXAMPLE_IN_BATCH])

<a name='16-1-2'></a>
#### 16.1.2 - Train the model and evaluate him on train and dev datasets

In this section we will train the model.

Firstly, we will train our model for 40 epochs, and we will see the results.

In [None]:
FIRST_VERSION_EPOCHS = 40
first_version_history = first_version_model.fit(train_dataset, epochs=FIRST_VERSION_EPOCHS, validation_data=dev_dataset, callbacks=[first_version_viz_callback])

Let's plot the history of this training process:

In [None]:
plot_history(first_version_history)

The history of this training process is:

<div style="text-align:center">
    <img src="Images\first_version_first_run_history_result.png" style="width:200;height:200;">
</div>
<caption><center> <u><b>Figure 19</u></b>: History result of the first version model in the first training run

And we can see the results of the model in the second epoch and in the last epoch:

<div style="text-align:center">
    <img src="Images\first_version_first_run_second_epoch_result.png" style="width:200;height:200;">
    <img src="Images\first_version_first_run_last_epoch_result.png" style="width:200;height:200;">
</div>
<caption><center> <u><b>Figure 20</u></b>: Results of the first version model in the first training run in the second epoch and in the last epoch

**Conclusions:**

* We can see in the history result, that the loss, Accuracy and Dice Cooefficient on the train dataset converge in a pretty good way, but on the dev dataset they are less converging but more moving in a zigzag pattern.

* We get good results for first version of our model(i.e, the first idea that we try).
We can see that the model accuracy is preety high, but the Dice Cooefficient score not high, and the reason for this is that our model less good in predict classes that have small amount of data(we talked about this in section 7).

* We can see that our model little overfit the loss and Accuracy, but in the Dice Coefficient the model do much more overfiting. A logical reason for this that maybee the train dataset and the dev dataset has differernt classes that are minority and maybe there are classes in the dev dataset that in the train dataset these classes not have many examples, so the model not trained on them like the other classes.

In conclusion, we need that our model will be more genral and less overfitt the train dataset.
In addition, we need that the loss, Accuracy and Dice Cooefficient on the dev dataset will more converge and will not move in a zigzag pattern.

Now let's try train the model more in order to see if we can get better results.

In [None]:
FIRST_VERSION_SECOND_RUN_EPOCHS = 40
first_version_second_run_history = first_version_model.fit(train_dataset, epochs=FIRST_VERSION_SECOND_RUN_EPOCHS, validation_data=dev_dataset, callbacks=[first_version_viz_callback])

Let's plot the history of this training process:

In [None]:
plot_history(first_version_second_run_history)

The history of this training process is:

<div style="text-align:center">
    <img src="Images\first_version_second_run_history_result.png" style="width:200;height:200;">
</div>
<caption><center> <u><b>Figure 21</u></b>: History result of second training run in the first version model

And we can compare the results of the model in the last epoch of the first training run and in the last epoch of the second training:

<div style="text-align:center">
    <img src="Images\first_version_first_run_last_epoch_result.png" style="width:200;height:200;">
    <img src="Images\first_version_second_run_last_epoch_result.png" style="width:200;height:200;">
</div>
<caption><center> <u><b>Figure 22</u></b>: Results of the first version model in the first training run in the last epoch and in the second training run in the last epoch

**Conclusions:**

* We can see in the history result, that the loss, Accuracy and Dice Cooefficient on the train dataset converge in a pretty good way, but on the dev dataset our model become less stable, and moving in a zigzag pattern.

* We get better results in the last epoch in the second running compare to the last epoch in the first running. Now our model can predict correctly more little detials in the image.
We can see that our model much improved in Accuracy and Dice Cooefficient.

* We can see that our model more overfit the train dataset espacially in the Dice Cooefficient.

In overall, the model performance improved on the train dataset, and the performance on the the dev dataset didn't get any worse, but our model overfit the train dataset.

Now let's try train the model little more in order to see if we can get better results in the dev dataset and maybee less overfitting.

In [None]:
FIRST_VERSION_THIRD_RUN_EPOCHS = 25
first_version_third_run_history = first_version_model.fit(train_dataset, epochs=FIRST_VERSION_THIRD_RUN_EPOCHS, validation_data=dev_dataset, callbacks=[first_version_viz_callback])

Let's plot the history of this training process:

In [None]:
plot_history(first_version_third_run_history)

The history of this training process is:

<div style="text-align:center">
    <img src="Images\first_version_third_run_history_result.png" style="width:200;height:200;">
</div>
<caption><center> <u><b>Figure 23</u></b>: History result of second training run in the first version model

And we can compare the results of the model in the last epoch of the second training run and in the last epoch of the third training:

<div style="text-align:center">
    <img src="Images\first_version_second_run_last_epoch_result.png" style="width:200;height:200;">
    <img src="Images\first_version_third_run_last_epoch_result.png" style="width:200;height:200;">
</div>
<caption><center> <u><b>Figure 24</u></b>: Results of the first version model in the second training run in the last epoch and in the third training run in the last epoch

**Conclusions:**

* We can see in the history result, that the loss, Accuracy and Dice Cooefficient on the train dataset converge in a pretty good way(in the Dice Cooefficient less converges), but on the dev dataset our model become much less stable, and moving in a zigzag pattern(espacially in the 5-10 epochs).

* We get better results in the last epoch in the third running compare to the last epoch in the second running in the train dataset, but our model performance getting worse on the dev dataset compare to the second running.

* We can see that our model much more overfit the train dataset espacially in the Dice Cooefficient.

In overall, the model performance improved on the train dataset, and the model performance on the the dev dataset gets worse, and our model overfit the train dataset.

Hence, we will use the first version model in the second running.

<a name='16-1-3'></a>
#### 16.1.3 - Model's predictions on the train and dev datasets

In this section we will show examples of model's predictions on the train and dev datasets.

Let's plot the model's predictions on train dataset:

In [None]:
show_predictions(first_version_model, train_dataset, 10)

Six examples of model's predictions on the train dataset are:

<div style="text-align:center">
    <img src="Images\first_version_predictions_train_part1.png" style="width:200;height:200;">
    <img src="Images\first_version_predictions_train_part2.png" style="width:200;height:200;">
</div>
<caption><center> <u><b>Figure 25</u></b>: Model's predictions on the train dataset

Let's plot the model's predictions on dev dataset:

In [None]:
show_predictions(first_version_model, dev_dataset, 10)

Six examples of model's predictions on the dev dataset are:

<div style="text-align:center">
    <img src="Images\first_version_predictions_dev_part1.png" style="width:200;height:200;">
    <img src="Images\first_version_predictions_dev_part2.png" style="width:200;height:200;">
</div>
<caption><center> <u><b>Figure 26</u></b>: Model's predictions on the dev dataset

In conclusion, we get very good results for first version of our model(i.e, the first idea that we try).
We need to improve the model performance in the dev dataset and his performance on minority classes.

In addition, we need that our model will be more genral and less overfitt the train dataset, and the loss, Accuracy and Dice Cooefficient on the dev dataset will more converge and will not move in a zigzag pattern.

<a name='16-2'></a>
### 16.2 - Second model version

In this section we will create model that has the regular Unet model architecture with dropout layer after each max pooling layer in the encoder and after each deconvultion layer in the decoder, basic_data_augmentation function and Soft Dice Coefficient loss function.

<a name='16-2-1'></a>
#### 16.2.1 - Create the model

In this section we will create the model.

In [33]:
second_version_model  = genral_unet_model(INPUT_SHAPE, basic_data_augmentation, unet_encoder, unet_decoder)

Let's see the summary of the model

In [34]:
second_version_model.summary()

Model: "model_2"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 input_3 (InputLayer)        [(None, 128, 128, 3)]        0         []                            
                                                                                                  
 data_augmentation_layer_2   (None, 128, 128, 3)          0         ['input_3[0][0]']             
 (DataAugmentationLayer)                                                                          
                                                                                                  
 conv2d_38 (Conv2D)          (None, 128, 128, 64)         1792      ['data_augmentation_layer_2[0]
                                                                    [0]']                         
                                                                                            

Let's compile the model with the Soft Dice Coefficient loss function.

In [None]:
second_version_model.compile(optimizer='adam',
              loss=soft_dice_coefficient,
              metrics=['accuracy', dice_coefficient])

Let's create visualization callback:

In [None]:
for img, ground_truth in train_dataset.take(1):
    second_version_viz_callback = VizCallback(second_version_model, img[FIRST_EXAMPLE_IN_BATCH], ground_truth[FIRST_EXAMPLE_IN_BATCH])

<a name='16-2-2'></a>
#### 16.2.2 - Train the model and evaluate him on train and dev datasets

In this section we will train the model.

Firstly, we will train our model for 40 epochs, and we will see the results.

In [None]:
SECOND_VERSION_EPOCHS = 40
second_version_history = second_version_model.fit(train_dataset, epochs=SECOND_VERSION_EPOCHS, validation_data=dev_dataset, callbacks=[second_version_viz_callback])

Let's plot the history of this training process:

In [None]:
plot_history(second_version_history)

The history of this training process is:

<div style="text-align:center">
    <img src="Images\second_version_history_result.png" style="width:200;height:200;">
</div>
<caption><center> <u><b>Figure 27</u></b>: History result of the second version model

And we can see the results of the model in the second epoch and in the last epoch:

<div style="text-align:center">
    <img src="Images\second_version_second_epoch_result.png" style="width:200;height:200;">
    <img src="Images\second_version_last_epoch_result.png" style="width:200;height:200;">
</div>
<caption><center> <u><b>Figure 28</u></b>: Results of the second version model in the second epoch and in the 25 epoch

**Conclusions:**

* We can see in the history result, that the loss, Accuracy and Dice Coefficient on the train dataset converge in a pretty good way, but on the dev dataset they are not converging but more moving in a zigzag pattern, i.e, the model is really unstable on the dev dataset, and this not good.

* We get not good results compared to the first version model in the first running both in train dataset and dev dataset. We get better result in the Dice Cooefficient on the train dataset compared to the first version model in the first running, but this is because we use Soft Dice Cooefficient loss function.

* We can see that our model overfit the loss, Accuracy and Dice Coefficient in the train dataset. 

In conclusion, we can see that Soft Dice Coefficient loss function not good loss function for our problem.
She gave us better result in the Dice Cooefficient on the train dataset compared to the first version model in the first running, but much worsere results on the other things. In addition we can see that the learning process with this loss function is much more slower(I assume that one of the causes of this is because the scale of the loss function), and not stable in the dev dataset(A logical reason for this that maybee the train dataset and the dev dataset has differernt classes that are minority and while the Soft Dice Cooefficient loss function forces the model learn the minority of classes in train dataset, he not learned the minority of classes in the dev dataset).

For this model version we will not show the model's predictions, because this version not good compared to the first version model.

<a name='16-3'></a>
### 16.3 - Third model version

In this section we will create model that has the regular Unet model architecture with dropout layer after each max pooling layer in the encoder and after each deconvultion layer in the decoder, basic_data_augmentation function and Weighted Sparse Categorical Cross entropy for each class loss function.

<a name='16-3-1'></a>
#### 16.3.1 - Create the model

In this section we will create the model.

In [35]:
third_version_model  = genral_unet_model(INPUT_SHAPE, basic_data_augmentation, unet_encoder, unet_decoder)

Let's see the summary of the model

In [36]:
third_version_model.summary()

Model: "model_3"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 input_4 (InputLayer)        [(None, 128, 128, 3)]        0         []                            
                                                                                                  
 data_augmentation_layer_3   (None, 128, 128, 3)          0         ['input_4[0][0]']             
 (DataAugmentationLayer)                                                                          
                                                                                                  
 conv2d_57 (Conv2D)          (None, 128, 128, 64)         1792      ['data_augmentation_layer_3[0]
                                                                    [0]']                         
                                                                                            

Let's compile the model with the Weighted Sparse Categorical Cross entropy for each class loss function.

In [None]:
third_version_model.compile(optimizer='adam',
              loss=weighted_sparse_categorical_crossentropy,
              metrics=['accuracy', dice_coefficient])

Let's create visualization callback:

In [None]:
for img, ground_truth in train_dataset.take(1):
    third_version_viz_callback = VizCallback(third_version_model, img[FIRST_EXAMPLE_IN_BATCH], ground_truth[FIRST_EXAMPLE_IN_BATCH])

<a name='16-3-2'></a>
#### 16.3.2 - Train the model and evaluate him on train and dev datasets

In this section we will train the model.

Firstly, we will train our model for 40 epochs, and we will see the results.

In [None]:
THIRD_VERSION_EPOCHS = 40
third_version_history = third_version_model.fit(train_dataset, epochs=THIRD_VERSION_EPOCHS, validation_data=dev_dataset, callbacks=[third_version_viz_callback])

Let's plot the history of this training process:

In [None]:
plot_history(third_version_history)

The history of this training process is:

<div style="text-align:center">
    <img src="Images\third_version_first_run_history_result.png" style="width:200;height:200;">
</div>
<caption><center> <u><b>Figure 29</u></b>: History result of the third version model in the first training run

And we can see the results of the model in the second epoch and in the last epoch:

<div style="text-align:center">
    <img src="Images\third_version_first_run_second_epoch_result.png" style="width:200;height:200;">
    <img src="Images\third_version_first_run_last_epoch_result.png" style="width:200;height:200;">
</div>
<caption><center> <u><b>Figure 30</u></b>: Results of the third version model in the first training run in the second epoch and in the last epoch

**Conclusions:**

* We can see in the history result, that the loss, Accuracy and Dice Cooefficient on the train dataset converge in a pretty good way, but on the dev dataset they are less converging but more moving in a zigzag pattern(espcially in 15 epoch). Compare to the first version model, this model much less converging in the loss, Accuracy and Dice Cooefficient on the dev dataset.

* We can see that the model accuracy is preety high, but the Dice Cooefficient score not high, and the reson for this is that our model less good in predict classes that have small amount of data(we talked about this in section 7).

* We can see that our model little overfit the loss and Accuracy, but in the Dice Coefficient the model do more overfit. Compare to the first version model, this model less overfit the train dataset.

In conclusion, we need that the loss, Accuracy and Dice Cooefficient on the dev dataset will more converge and will not move in a zigzag pattern.
In addition we want to get a higer performance on the train and dev datasets, while not overfitt the train dataset.

Now let's try train the model more in order to see if we can get better results.

In [None]:
THIRD_VERSION_SECOND_RUN_EPOCHS = 40
third_version_second_run_history = third_version_model.fit(train_dataset, epochs=THIRD_VERSION_SECOND_RUN_EPOCHS, validation_data=dev_dataset, callbacks=[third_version_viz_callback])

Let's plot the history of this training process:

In [None]:
plot_history(third_version_second_run_history)

The history of this training process is:

<div style="text-align:center">
    <img src="Images\third_version_second_run_history_result.png" style="width:200;height:200;">
</div>
<caption><center> <u><b>Figure 31</u></b>: History result of the third version model in the second training run

And we can compare the results of the model in the last epoch of the first training run and in the last epoch of the second training:

<div style="text-align:center">
    <img src="Images\third_version_first_run_last_epoch_result.png" style="width:200;height:200;">
    <img src="Images\third_version_second_run_last_epoch_result.png" style="width:200;height:200;">
</div>
<caption><center> <u><b>Figure 32</u></b>: Results of the third version model in the first training run in the last epoch and in the second training run in the last epoch

**Conclusions:**

* We can see in the history result, that the loss, Accuracy and Dice Cooefficient on the train dataset converge worse compare to the first training run, and on the dev dataset our model become less stable, and moving in a zigzag pattern.

* We get better results in the last epoch in the second running compare to the last epoch in the first running. Now our model can predict correctly more little detials in the image.
We can see that our model much improved in Accuracy and Dice Cooefficient.

* We can see that our model more overfit the train dataset.

In overall, the model performance improved on the train dataset, and the performance on the the dev dataset didn't get any worse in Accuracy and Dice Cooefficient, but our model overfit the train dataset.

Now let's try train the model little more in order to see if we can get better results in the dev dataset and maybee less overfitting.

In [None]:
THIRD_VERSION_THIRD_RUN_EPOCHS = 25
third_version_third_run_history = third_version_model.fit(train_dataset, epochs=THIRD_VERSION_THIRD_RUN_EPOCHS, validation_data=dev_dataset, callbacks=[third_version_viz_callback])

Let's plot the history of this training process:

In [None]:
plot_history(third_version_third_run_history)

The history of this training process is:

<div style="text-align:center">
    <img src="Images\third_version_third_run_history_result.png" style="width:200;height:200;">
</div>
<caption><center> <u><b>Figure 33</u></b>: History result of the third version model in the third training run

And we can compare the results of the model in the last epoch of the first training run and in the last epoch of the second training:

<div style="text-align:center">
    <img src="Images\third_version_second_run_last_epoch_result.png" style="width:200;height:200;">
    <img src="Images\third_version_third_run_last_epoch_result.png" style="width:200;height:200;">
</div>
<caption><center> <u><b>Figure 34</u></b>: Results of the third version model in the second training run in the last epoch and in the third training run in the last epoch

**Conclusions:**

* We can see in the history result, that the loss, Accuracy and Dice Cooefficient on the train dataset converge in a pretty good way(in the Dice Cooefficient less converges), but on the dev dataset our model become much less stable, and moving in a zigzag pattern(espacially in the fifth epoch).

* We get better results in the last epoch in the third running compare to the last epoch in the second running in the train dataset, but our model performance getting more worse on the dev dataset compare to the second running.

* We can see that our model much more overfit the train dataset.

In overall, the model performance improved on the train dataset, but the model performance on the the dev dataset gets worse, and our model overfit the train dataset.

Hence, we will use the third version model in the second running.

<a name='16-3-3'></a>
#### 16.3.3 - Model's predictions on the train and dev datasets

In this section we will show examples of model's predictions on the train and dev datasets.

Let's plot the model's predictions on train dataset:

In [None]:
show_predictions(third_version_model, train_dataset, 10)

Six examples of model's predictions on the train dataset are:

<div style="text-align:center">
    <img src="Images\third_version_predictions_train_part1.png" style="width:200;height:200;">
    <img src="Images\third_version_predictions_train_part2.png" style="width:200;height:200;">
</div>
<caption><center> <u><b>Figure 35</u></b>: Model's predictions on the train dataset

Let's plot the model's predictions on dev dataset:

In [None]:
show_predictions(third_version_model, dev_dataset, 10)

Six examples of model's predictions on the dev dataset are:

<div style="text-align:center">
    <img src="Images\third_version_predictions_dev_part1.png" style="width:200;height:200;">
    <img src="Images\third_version_predictions_dev_part2.png" style="width:200;height:200;">
</div>
<caption><center> <u><b>Figure 36</u></b>: Model's predictions on the dev dataset

In conclusion, we get good results. We saw that the model predictions on minority classes improve, because we use Weighted Sparse Categorical Cross entropy.
We need to try to improve the model performance espically in the dev dataset, i.e, we need that our model will be more genral and less overfitt the train dataset.

In addition we need that the loss, Accuracy and Dice Cooefficient on the dev dataset will more converge and will not move in a zigzag pattern.

<a name='16-4'></a>
### 16.4 - Fourth model version

In this section we will create model that has the regular Unet model architecture without dropout layers,with basic_data_augmentation function and Sparse Categorical Cross entropy loss function.

<a name='16-4-1'></a>
#### 16.4.1 - Create the model

In this section we will create the model.

In [37]:
fourth_version_model  = genral_unet_model(INPUT_SHAPE, basic_data_augmentation, unet_encoder, unet_decoder, dropout_prob = 0)

Let's see the summary of the model

In [38]:
fourth_version_model.summary()

Model: "model_4"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 input_5 (InputLayer)        [(None, 128, 128, 3)]        0         []                            
                                                                                                  
 data_augmentation_layer_4   (None, 128, 128, 3)          0         ['input_5[0][0]']             
 (DataAugmentationLayer)                                                                          
                                                                                                  
 conv2d_76 (Conv2D)          (None, 128, 128, 64)         1792      ['data_augmentation_layer_4[0]
                                                                    [0]']                         
                                                                                            

Let's compile the model with the Sparse Categorical Cross entropy loss function.

In [None]:
fourth_version_model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
              metrics=['accuracy', dice_coefficient])

Let's create visualization callback:

In [None]:
for img, ground_truth in train_dataset.take(1):
    fourth_version_viz_callback = VizCallback(fourth_version_model, img[FIRST_EXAMPLE_IN_BATCH], ground_truth[FIRST_EXAMPLE_IN_BATCH])

<a name='16-4-2'></a>
#### 16.4.2 - Train the model and evaluate him on train and dev datasets

In this section we will train the model.

Firstly, we will train our model for 40 epochs, and we will see the results.

In [None]:
FOURTH_VERSION_EPOCHS = 40
fourth_version_history = fourth_version_model.fit(train_dataset, epochs=FOURTH_VERSION_EPOCHS, validation_data=dev_dataset, callbacks=[fourth_version_viz_callback])

Let's plot the history of this training process:

In [None]:
plot_history(fourth_version_history)

The history of this training process is:

<div style="text-align:center">
    <img src="Images\fourth_version_first_run_history_result.png" style="width:200;height:200;">
</div>
<caption><center> <u><b>Figure 37</u></b>: History result of the fourth version model in the first training run

**Conclusions:**

* We can see in the history result, that the loss, Accuracy and Dice Cooefficient on the train dataset converge in a pretty good way, but on the dev dataset they are less converging and little moving in a zigzag pattern. Compare to the previous models, this model much more converging in the loss, Accuracy and Dice Cooefficient on the dev dataset.

* We can see that the model accuracy is preety high, but the Dice Cooefficient score not high, and the reason for this is that our model less good in predict classes that have small amount of data(we talked about this in section 7). We want get a higher results because this still not good results compare to previous models.

* We can see that our model little overfit the loss and Accuracy, but in the Dice Coefficient the model do more overfit. Because this model has not dropout layers, we will see that as long as we train more our model ,our model will overfit more the train dataset. 

In conclusion, we want to get a higer performance on the train and dev datasets, while not overfitt the train dataset.

Now let's try train the model more in order to see if we can get better results.

In [None]:
FOURTH_VERSION_SECOND_RUN_EPOCHS = 40
fourth_version_second_run_history = fourth_version_model.fit(train_dataset, epochs=FOURTH_VERSION_SECOND_RUN_EPOCHS, validation_data=dev_dataset, callbacks=[fourth_version_viz_callback])

Let's plot the history of this training process:

In [None]:
plot_history(fourth_version_second_run_history)

The history of this training process is:

<div style="text-align:center">
    <img src="Images\fourth_version_second_run_history_result.png" style="width:200;height:200;">
</div>
<caption><center> <u><b>Figure 38</u></b>: History result of the fourth version model in the second training run

**Conclusions:**

* We can see in the history result, that the loss, Accuracy and Dice Coefficient on the train dataset converge in a pretty good way(in the Dice Cooefficient less converges), but on the dev dataset they are not converging but more moving in a zigzag pattern, i.e, the model is really unstable on the dev dataset, and this not good.

* We can see that our model overfit the loss, Accuracy and Dice Coefficient in the train dataset.  The model performance in loss, Accuracy and Dice Coefficient on the dev dataset become worser. This is because we not use dropout layers, and hence our model overfit the the train dataset.

In conclusion, we can see that dropout layers are necessary for get good performance on the dev dataset, and in order our model can generalize to new examples.

For this model version we will not show the model's predictions, because this version not good compared to the previous models.

<a name='16-5'></a>
### 16.5 - Fifth model version

In this section we will create model that has the regular Unet model architecture with dropout layer after each max pooling layer in the encoder and after each deconvultion layer in the decoder, without data augmentation and with Sparse Categorical Cross entropy loss function.

<a name='16-5-1'></a>
#### 16.5.1 - Create the model

In this section we will create the model.

In [39]:
fifth_version_model  = genral_unet_model(INPUT_SHAPE, basic_data_augmentation, unet_encoder, unet_decoder, data_augmentation_prob = 0)

Let's see the summary of the model

In [40]:
fifth_version_model.summary()

Model: "model_5"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 input_6 (InputLayer)        [(None, 128, 128, 3)]        0         []                            
                                                                                                  
 data_augmentation_layer_5   (None, 128, 128, 3)          0         ['input_6[0][0]']             
 (DataAugmentationLayer)                                                                          
                                                                                                  
 conv2d_95 (Conv2D)          (None, 128, 128, 64)         1792      ['data_augmentation_layer_5[0]
                                                                    [0]']                         
                                                                                            

Let's compile the model with the Sparse Categorical Cross entropy loss function.

In [None]:
fifth_version_model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
              metrics=['accuracy', dice_coefficient])

Let's create visualization callback:

In [None]:
for img, ground_truth in train_dataset.take(1):
    fifth_version_viz_callback = VizCallback(fifth_version_model, img[FIRST_EXAMPLE_IN_BATCH], ground_truth[FIRST_EXAMPLE_IN_BATCH])

<a name='16-5-2'></a>
#### 16.5.2 - Train the model and evaluate him on train and dev datasets

In this section we will train the model.

Firstly, we will train our model for 40 epochs, and we will see the results.

In [None]:
FIFTH_VERSION_EPOCHS = 40
fifth_version_history = fifth_version_model.fit(train_dataset, epochs=FIFTH_VERSION_EPOCHS, validation_data=dev_dataset, callbacks=[fifth_version_viz_callback])

Let's plot the history of this training process:

In [None]:
plot_history(fifth_version_history)

The history of this training process is:

<div style="text-align:center">
    <img src="Images\fifth_version_first_run_history_result.png" style="width:200;height:200;">
</div>
<caption><center> <u><b>Figure 39</u></b>: History result of the fifth version model in the first training run

**Conclusions:**

* We can see in the history result, that the loss, Accuracy and Dice Coefficient on the train dataset converge in a pretty good way(in the Dice Cooefficient less converges good in the end), but on the dev dataset they are not converging but more moving in a zigzag pattern, i.e, the model is really unstable on the dev dataset, and this not good.

* We can see that our model overfit the loss, Accuracy and Dice Coefficient in the train dataset. This is because we not use data augmentation, and hence our model overfit the the train dataset.

In conclusion, we can see that data augmentation is necessary for get good performance on the dev dataset, and in order our model can generalize to new examples.

For this model version we will not show the model's predictions, because this version not good compared to the previous models.

<a name='16-6'></a>
### 16.6 - Sixth model version

In this section we will create model that has the regular Unet model architecture without dropout layers, without data augmentation and with Sparse Categorical Cross entropy loss function.

<a name='16-6-1'></a>
#### 16.6.1 - Create the model

In this section we will create the model.

In [41]:
sixth_version_model  = genral_unet_model(INPUT_SHAPE, basic_data_augmentation, unet_encoder, unet_decoder, data_augmentation_prob = 0, dropout_prob = 0)

Let's see the summary of the model

In [42]:
sixth_version_model.summary()

Model: "model_6"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 input_7 (InputLayer)        [(None, 128, 128, 3)]        0         []                            
                                                                                                  
 data_augmentation_layer_6   (None, 128, 128, 3)          0         ['input_7[0][0]']             
 (DataAugmentationLayer)                                                                          
                                                                                                  
 conv2d_114 (Conv2D)         (None, 128, 128, 64)         1792      ['data_augmentation_layer_6[0]
                                                                    [0]']                         
                                                                                            

Let's compile the model with the Sparse Categorical Cross entropy loss function.

In [None]:
sixth_version_model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
              metrics=['accuracy', dice_coefficient])

Let's create visualization callback:

In [None]:
for img, ground_truth in train_dataset.take(1):
    sixth_version_viz_callback = VizCallback(fifth_version_model, img[FIRST_EXAMPLE_IN_BATCH], ground_truth[FIRST_EXAMPLE_IN_BATCH])

<a name='16-6-2'></a>
#### 16.6.2 - Train the model and evaluate him on train and dev datasets

In this section we will train the model.

Firstly, we will train our model for 40 epochs, and we will see the results.

In [None]:
SIXTH_VERSION_EPOCHS = 40
sixth_version_history = sixth_version_model.fit(train_dataset, epochs=SIXTH_VERSION_EPOCHS, validation_data=dev_dataset, callbacks=[sixth_version_viz_callback])

Let's plot the history of this training process:

In [None]:
plot_history(sixth_version_history)

The history of this training process is:

<div style="text-align:center">
    <img src="Images\sixth_version_first_run_history_result.png" style="width:200;height:200;">
</div>
<caption><center> <u><b>Figure 40</u></b>: History result of the sixth version model in the first training run

**Conclusions:**

* We can see in the history result, that the loss, Accuracy and Dice Cooefficient on the train dataset converge in a pretty good way, but on the dev dataset they are less converging and moving in a zigzag pattern(especially in 20 epoch and 25-30 epochs). 

* We can see that the model accuracy is preety high, but the Dice Cooefficient score not high, and the reason for this is that our model less good in predict classes that have small amount of data(we talked about this in section 7). We want get a higher results because this still not good results compare to previous models.

* We can see that our model little overfit the loss and Accuracy, but in the Dice Coefficient the model do more overfit. Because this model has not dropout layers and has not data augmentation, we will see that as long as we train more our model ,our model will overfit more the train dataset. 

In conclusion, we want to get a higer performance on the train and dev datasets, while not overfitt the train dataset.

Now let's try train the model more in order to see if we can get better results.

In [None]:
SIXTH_VERSION_SECOND_RUN_EPOCHS = 40
sixth_version_second_run_history = sixth_version_model.fit(train_dataset, epochs=SIXTH_VERSION_SECOND_RUN_EPOCHS, validation_data=dev_dataset, callbacks=[sixth_version_viz_callback])

Let's plot the history of this training process:

In [None]:
plot_history(sixth_version_second_run_history)

The history of this training process is:

<div style="text-align:center">
    <img src="Images\sixth_version_second_run_history_result.png" style="width:200;height:200;">
</div>
<caption><center> <u><b>Figure 41</u></b>: History result of the sixth version model in the second training run

**Conclusions:**

* We can see in the history result, that the loss, Accuracy and Dice Coefficient on the train dataset converge in a good way(less good than first training run), but on the dev dataset they are not converging but more moving in a zigzag pattern, i.e, the model is really unstable on the dev dataset, and this not good.

* We can see that our model overfit the loss, Accuracy and Dice Coefficient in the train dataset.  The model performance in loss, Accuracy and Dice Coefficient on the dev dataset become worser. This is because we not use dropout layers and data augmentation, and hence our model overfit the train dataset.

In conclusion, we can see that dropout layers and data augmentation are necessary for get good performance on the dev dataset, and in order our model can generalize to new examples.

For this model version we will not show the model's predictions, because this version not good compared to the previous models.

<a name='16-7'></a>
### 16.7 - Seventh model version

In this section we will create model that has the regular Unet model architecture with dropout layer after each max pooling layer in the encoder and after each deconvultion layer in the decoder, bad_data_augmentation function and Sparse Categorical Cross entropy loss function.

<a name='16-7-1'></a>
#### 16.7.1 - Create the model

In this section we will create the model.

In [43]:
seventh_version_model  = genral_unet_model(INPUT_SHAPE, bad_data_augmentation, unet_encoder, unet_decoder)

Let's see the summary of the model

In [44]:
seventh_version_model.summary()

Model: "model_7"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 input_8 (InputLayer)        [(None, 128, 128, 3)]        0         []                            
                                                                                                  
 data_augmentation_layer_7   (None, 128, 128, 3)          0         ['input_8[0][0]']             
 (DataAugmentationLayer)                                                                          
                                                                                                  
 conv2d_133 (Conv2D)         (None, 128, 128, 64)         1792      ['data_augmentation_layer_7[0]
                                                                    [0]']                         
                                                                                            

Let's compile the model with the Sparse Categorical Cross entropy loss function.

In [None]:
seventh_version_model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
              metrics=['accuracy', dice_coefficient])

Let's create visualization callback:

In [None]:
for img, ground_truth in train_dataset.take(1):
    seventh_version_viz_callback = VizCallback(seventh_version_model, img[FIRST_EXAMPLE_IN_BATCH], ground_truth[FIRST_EXAMPLE_IN_BATCH])

<a name='16-7-2'></a>
#### 16.7.2 - Train the model and evaluate him on train and dev datasets

In this section we will train the model.

Firstly, we will train our model for 40 epochs, and we will see the results.

In [None]:
SEVENTH_VERSION_EPOCHS = 40
seventh_version_history = seventh_version_model.fit(train_dataset, epochs=SEVENTH_VERSION_EPOCHS, validation_data=dev_dataset, callbacks=[seventh_version_viz_callback])

Let's plot the history of this training process:

In [None]:
plot_history(seventh_version_history)

The history of this training process is:

<div style="text-align:center">
    <img src="Images\seventh_version_first_run_history_result.png" style="width:200;height:200;">
</div>
<caption><center> <u><b>Figure 42</u></b>: History result of the seventh version model in the first training run

**Conclusions:**

* We can see in the history result, that the loss, Accuracy and Dice Coefficient on the train dataset converge in a pretty good way, but on the dev dataset they are not converging but more moving in a zigzag pattern, i.e, the model is really unstable on the dev dataset, and this not good.

* We can see that our model overfit the Accuracy and Dice Coefficient in the train dataset. This is because we use bad data augmentation.

* We get worse results in the loss, Accuracy and Dice Coefficient on the train and dev datasets, compare to previous models. This is because we use bad data augmentation.

In conclusion, we can see that good data augmentation is necessary for get good performance on the train and dev datasets.

For this model version we will not show the model's predictions, because this version not good compared to the previous models.

<a name='16-8'></a>
### 16.8 - Eighth model version

In this section we will create model that has the regular Unet model architecture with dropout layer after each max pooling layer in the encoder and after each deconvultion layer in the decoder, basic_data_augmentation function, Weighted Sparse Categorical Cross entropy loss function and Learning Rate scheduler.

<a name='16-8-1'></a>
#### 16.8.1 - Create the model

In this section we will create the model.

In [45]:
eighth_version_model  = genral_unet_model(INPUT_SHAPE, basic_data_augmentation, unet_encoder, unet_decoder)

Let's see the summary of the model

In [46]:
eighth_version_model.summary()

Model: "model_8"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 input_9 (InputLayer)        [(None, 128, 128, 3)]        0         []                            
                                                                                                  
 data_augmentation_layer_8   (None, 128, 128, 3)          0         ['input_9[0][0]']             
 (DataAugmentationLayer)                                                                          
                                                                                                  
 conv2d_152 (Conv2D)         (None, 128, 128, 64)         1792      ['data_augmentation_layer_8[0]
                                                                    [0]']                         
                                                                                            

Let's compile the model with the Weighted Sparse Categorical Cross entropy loss function.

In [None]:
eighth_version_model.compile(optimizer='adam',
              loss=weighted_sparse_categorical_crossentropy,
              metrics=['accuracy', dice_coefficient])

Let's create visualization callback:

In [None]:
for img, ground_truth in train_dataset.take(1):
    eighth_version_viz_callback = VizCallback(eighth_version_model, img[FIRST_EXAMPLE_IN_BATCH], ground_truth[FIRST_EXAMPLE_IN_BATCH])

Let's create Learning Rate scheduler:

In [None]:
INITIAL_LR_EIGHTH_VERSION = 0.001
NEW_LR_EIGHTH_VERSION = 0.0001
NUM_OF_EPOCHS_FOR_INITIAL_LR_EIGHTH_VERSION = 20
    
eighth_version_lr_callback = LearningRateScheduler(lambda epoch: lr_scheduler(epoch, INITIAL_LR_EIGHTH_VERSION, NEW_LR_EIGHTH_VERSION, NUM_OF_EPOCHS_FOR_INITIAL_LR_EIGHTH_VERSION))

<a name='16-8-2'></a>
#### 16.8.2 - Train the model and evaluate him on train and dev datasets

In this section we will train the model.

Firstly, we will train our model for 40 epochs, and we will see the results.

In [None]:
EIGHTH_VERSION_EPOCHS = 40
eighth_version_history = eighth_version_model.fit(train_dataset, epochs=EIGHTH_VERSION_EPOCHS, validation_data=dev_dataset, callbacks=[eighth_version_viz_callback, eighth_version_lr_callback])

Let's plot the history of this training process:

In [None]:
plot_history(eighth_version_history)

The history of this training process is:

<div style="text-align:center">
    <img src="Images\eighth_version_first_run_history_result.png" style="width:200;height:200;">
</div>
<caption><center> <u><b>Figure 43</u></b>: History result of the eighth version model in the first training run

And we can see the results of the model in the second epoch and in the last epoch:

<div style="text-align:center">
    <img src="Images\eighth_version_first_run_second_epoch_result.png" style="width:200;height:200;">
    <img src="Images\eighth_version_first_run_last_epoch_result.png" style="width:200;height:200;">
</div>
<caption><center> <u><b>Figure 44</u></b>: Results of the eighth version model in the first training run in the second epoch and in the last epoch

**Conclusions:**

* We can see in the history result, that the loss, Accuracy and Dice Cooefficient on the train dataset and on the dev dataset converge in a pretty good way. This is because we use Learning Rate scheduler and from epoch 20 we take little steps in the training process.

* We can see that the model accuracy is preety high, but the Dice Cooefficient score not high, and the reason for this is that our model less good in predict classes that have small amount of data(we talked about this in section 7).

* We can see that our model little overfit the loss, Accuracy and Dice Coefficient on train dataset. He less overfit the train dataset compare to previous models.

In conclusion, we need to try get better results in the dev dataset while not overfit the train dataset.

Let's create Learning Rate scheduler:

In [None]:
NUM_OF_EPOCHS_FOR_INITIAL_LR_EIGHTH_VERSION_SECOND_RUN = 0
    
eighth_version_second_run_lr_callback = LearningRateScheduler(lambda epoch: lr_scheduler(epoch, INITIAL_LR_EIGHTH_VERSION, NEW_LR_EIGHTH_VERSION, NUM_OF_EPOCHS_FOR_INITIAL_LR_EIGHTH_VERSION_SECOND_RUN))

Now let's try train the model more in order to see if we can get better results.

In [None]:
EIGHTH_VERSION_SECOND_RUN_EPOCHS = 40
eighth_version_second_run_history = eighth_version_model.fit(train_dataset, epochs=EIGHTH_VERSION_SECOND_RUN_EPOCHS, validation_data=dev_dataset, callbacks=[eighth_version_viz_callback, eighth_version_second_run_lr_callback])

Let's plot the history of this training process:

In [None]:
plot_history(eighth_version_second_run_history)

The history of this training process is:

<div style="text-align:center">
    <img src="Images\eighth_version_second_run_history_result.png" style="width:200;height:200;">
</div>
<caption><center> <u><b>Figure 45</u></b>: History result of the eighth version model in the second training run

And we can compare the results of the model in the last epoch of the first training run and in the last epoch of the second training:

<div style="text-align:center">
    <img src="Images\eighth_version_first_run_last_epoch_result.png" style="width:200;height:200;">
    <img src="Images\eighth_version_second_run_last_epoch_result.png" style="width:200;height:200;">
</div>
<caption><center> <u><b>Figure 46</u></b>: Results of the eighth version model in the first training run in the last epoch and in the second training run in the last epoch

**Conclusions:**

* We can see in the history result, that the loss, Accuracy and Dice Cooefficient on the train dataset converge in a pretty good way, but on the dev dataset our model become less stable, and moving in a zigzag pattern.

* We get better results in the last epoch in the second running compare to the last epoch in the first running. Now our model can predict correctly more little detials in the image.
We can see that our model much improved in Accuracy and Dice Cooefficient.

* We can see that our model more overfit the train dataset.

In overall, the model performance improved on the train dataset, and the performance on the the dev dataset didn't get any worse, but our model overfit the train dataset.

Now let's try train the model little more in order to see if we can get better results in the dev dataset and maybee less overfitting.

Let's create Learning Rate scheduler:

In [None]:
NUM_OF_EPOCHS_FOR_INITIAL_LR_EIGHTH_VERSION_THIRD_RUN = 0
    
eighth_version_third_run_lr_callback = LearningRateScheduler(lambda epoch: lr_scheduler(epoch, INITIAL_LR_EIGHTH_VERSION, NEW_LR_EIGHTH_VERSION, NUM_OF_EPOCHS_FOR_INITIAL_LR_EIGHTH_VERSION_THIRD_RUN))

Let's train the model:

In [None]:
EIGHTH_VERSION_THIRD_RUN_EPOCHS = 20
eighth_version_third_run_history = eighth_version_model.fit(train_dataset, epochs=EIGHTH_VERSION_THIRD_RUN_EPOCHS, validation_data=dev_dataset, callbacks=[eighth_version_viz_callback, eighth_version_third_run_lr_callback])

Let's plot the history of this training process:

In [None]:
plot_history(eighth_version_third_run_history)

The history of this training process is:

<div style="text-align:center">
    <img src="Images\eighth_version_third_run_history_result.png" style="width:200;height:200;">
</div>
<caption><center> <u><b>Figure 47</u></b>: History result of the eighth version model in the third training run

And we can compare the results of the model in the last epoch of the second training run and in the last epoch of the third training:

<div style="text-align:center">
    <img src="Images\eighth_version_second_run_last_epoch_result.png" style="width:200;height:200;">
    <img src="Images\eighth_version_third_run_last_epoch_result.png" style="width:200;height:200;">
</div>
<caption><center> <u><b>Figure 48</u></b>: Results of the third version model in the second training run in the last epoch and in the third training run in the last epoch

**Conclusions:**

* We can see in the history result, that the loss, Accuracy and Dice Cooefficient on the train dataset converge in a pretty good way, but on the dev dataset our model not stable, and moving in a zigzag pattern.

* We get better results in the last epoch in the third running compare to the last epoch in the second running in the train dataset, but our model performance getting worse on the dev dataset compare to the second running.

* We can see that our model much more overfit the train dataset.

In overall, the model performance improved on the train dataset, and the model performance on the the dev dataset gets worse, and our model overfit the train dataset.

Hence, we will use the eighth version model in the second running.

<a name='16-8-3'></a>
#### 16.8.3 - Model's predictions on the train and dev datasets

In this section we will show examples of model's predictions on the train and dev datasets.

Let's plot the model's predictions on train dataset:

In [None]:
show_predictions(eighth_version_model, train_dataset, 10)

Six examples of model's predictions on the train dataset are:

<div style="text-align:center">
    <img src="Images\eighth_version_predictions_train_part1.png" style="width:200;height:200;">
    <img src="Images\eighth_version_predictions_train_part2.png" style="width:200;height:200;">
</div>
<caption><center> <u><b>Figure 49</u></b>: Model's predictions on the train dataset

Let's plot the model's predictions on dev dataset:

In [None]:
show_predictions(eighth_version_model, dev_dataset, 10)

Six examples of model's predictions on the dev dataset are:

<div style="text-align:center">
    <img src="Images\eighth_version_predictions_dev_part1.png" style="width:200;height:200;">
    <img src="Images\eighth_version_predictions_dev_part2.png" style="width:200;height:200;">
</div>
<caption><center> <u><b>Figure 50</u></b>: Model's predictions on the dev dataset

In conclusion, we get good results and the loss, Accuracy and Dice Cooefficient on the dev dataset were more converge and were not move in a zigzag pattern, but we still need to try to improve the model performance espically in the dev dataset, i.e, we need that our model will be more genral and less overfitt the train dataset.

<a name='16-9'></a>
### 16.9 - Nineth model version

In this section we will create model that has the regular Unet model architecture with dropout layer after each max pooling layer in the encoder and after each deconvultion layer in the decoder, basic_data_augmentation function, Weighted Sparse Categorical Cross entropy loss function and Dropout scheduler.

<a name='16-9-1'></a>
#### 16.9.1 - Create the model

In this section we will create the model.

In [47]:
nineth_version_model  = genral_unet_model(INPUT_SHAPE, basic_data_augmentation, unet_encoder, unet_decoder)

Let's see the summary of the model

In [48]:
nineth_version_model.summary()

Model: "model_9"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 input_10 (InputLayer)       [(None, 128, 128, 3)]        0         []                            
                                                                                                  
 data_augmentation_layer_9   (None, 128, 128, 3)          0         ['input_10[0][0]']            
 (DataAugmentationLayer)                                                                          
                                                                                                  
 conv2d_171 (Conv2D)         (None, 128, 128, 64)         1792      ['data_augmentation_layer_9[0]
                                                                    [0]']                         
                                                                                            

Let's compile the model with the Weighted Sparse Categorical Cross entropy loss function.

In [None]:
nineth_version_model.compile(optimizer='adam',
              loss=weighted_sparse_categorical_crossentropy,
              metrics=['accuracy', dice_coefficient])

Let's create visualization callback:

In [None]:
for img, ground_truth in train_dataset.take(1):
    nineth_version_viz_callback = VizCallback(nineth_version_model, img[FIRST_EXAMPLE_IN_BATCH], ground_truth[FIRST_EXAMPLE_IN_BATCH])

Let's create Dropout scheduler:

In [None]:
NINETH_VERSION_DROP_SCHEDULE = {13:0.4, 23:0.5}  
nineth_version_dropout_scheduler = DropoutScheduler(drop_schedule=NINETH_VERSION_DROP_SCHEDULE)

<a name='16-9-2'></a>
#### 16.9.2 - Train the model and evaluate him on train and dev datasets

In this section we will train the model.

Firstly, we will train our model for 40 epochs, and we will see the results.

In [None]:
NINETH_VERSION_EPOCHS = 40
nineth_version_history = nineth_version_model.fit(train_dataset, epochs=NINETH_VERSION_EPOCHS, validation_data=dev_dataset, callbacks=[nineth_version_viz_callback, nineth_version_dropout_scheduler])

Let's plot the history of this training process:

In [None]:
plot_history(nineth_version_history)

The history of this training process is:

<div style="text-align:center">
    <img src="Images\nineth_version_first_run_history_result.png" style="width:200;height:200;">
</div>
<caption><center> <u><b>Figure 51</u></b>: History result of the nineth version model in the first training run

And we can see the results of the third version model in the first training run in the last epoch(up) and of the nineth version model in the first training run in the last epoch(down):

<div style="text-align:center">
    <img src="Images\third_version_first_run_last_epoch_result.png" style="width:200;height:200;">
    <img src="Images\nineth_version_first_run_last_epoch_result.png" style="width:200;height:200;">
</div>
<caption><center> <u><b>Figure 52</u></b>: Results of the third version model in the first training run in the last epoch(up) and of the nineth version model in the first training run in the last epoch(down) 

**Conclusions:**

* We can see in the history result, that the loss, Accuracy and Dice Cooefficient on the train dataset converge in a pretty good way, but on the dev dataset they are less converging but more moving in a zigzag pattern(ecpeically in epochs 0-5 and 25-30).

* We can see that our model overfit the loss and Accuracy and Dice Coefficient on train dataset.

Although we apply Dropout scheduler, we still have the problem of overfitting and bad converge in loss and Accuracy and Dice Coefficient on dev dataset. Hence, we don't sucess improve the third version model, and therefore we will not continue with this model.

Hence, for this model version we will not show the model's predictions.

<a name='16-10'></a>
### 16.10 - Tenth model version

In this section we will create model that has the regular Unet model architecture with dropout layer after each max pooling layer in the encoder and after each deconvultion layer in the decoder, basic_data_augmentation function, Weighted Sparse Categorical Cross entropy loss function and Learning Rate and Dropout scheduler.

<a name='16-10-1'></a>
#### 16.10.1 - Create the model

In this section we will create the model.

In [49]:
tenth_version_model  = genral_unet_model(INPUT_SHAPE, basic_data_augmentation, unet_encoder, unet_decoder)

Let's see the summary of the model

In [50]:
tenth_version_model.summary()

Model: "model_10"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 input_11 (InputLayer)       [(None, 128, 128, 3)]        0         []                            
                                                                                                  
 data_augmentation_layer_10  (None, 128, 128, 3)          0         ['input_11[0][0]']            
  (DataAugmentationLayer)                                                                         
                                                                                                  
 conv2d_190 (Conv2D)         (None, 128, 128, 64)         1792      ['data_augmentation_layer_10[0
                                                                    ][0]']                        
                                                                                           

Let's compile the model with the Weighted Sparse Categorical Cross entropy loss function.

In [None]:
tenth_version_model.compile(optimizer='adam',
              loss=weighted_sparse_categorical_crossentropy,
              metrics=['accuracy', dice_coefficient])

Let's create visualization callback:

In [None]:
for img, ground_truth in train_dataset.take(1):
    tenth_version_viz_callback = VizCallback(tenth_version_model, img[FIRST_EXAMPLE_IN_BATCH], ground_truth[FIRST_EXAMPLE_IN_BATCH])

Let's create Dropout scheduler:

In [None]:
TENTH_VERSION_DROP_SCHEDULE = {13: 0.35, 22: 0.4}
tenth_version_dropout_scheduler = DropoutScheduler(drop_schedule=TENTH_VERSION_DROP_SCHEDULE)

Let's create Learning Rate scheduler:

In [None]:
INITIAL_LR_TENTH_VERSION = 0.001
NEW_LR_TENTH_VERSION = 0.0001
NUM_OF_EPOCHS_FOR_INITIAL_LR_TENTH_VERSION = 20
    
tenth_version_lr_callback = LearningRateScheduler(lambda epoch: lr_scheduler(epoch, INITIAL_LR_TENTH_VERSION, NEW_LR_TENTH_VERSION, NUM_OF_EPOCHS_FOR_INITIAL_LR_TENTH_VERSION))

<a name='16-10-2'></a>
#### 16.10.2 - Train the model and evaluate him on train and dev datasets

In this section we will train the model.

Firstly, we will train our model for 40 epochs, and we will see the results.

In [None]:
TENTH_VERSION_EPOCHS = 40
tenth_version_history = tenth_version_model.fit(train_dataset, epochs=TENTH_VERSION_EPOCHS, validation_data=dev_dataset, callbacks=[tenth_version_viz_callback, tenth_version_lr_callback, tenth_version_dropout_scheduler])

Let's plot the history of this training process:

In [None]:
plot_history(tenth_version_history)

The history of this training process is:

<div style="text-align:center">
    <img src="Images\tenth_version_first_run_history_result.png" style="width:200;height:200;">
</div>
<caption><center> <u><b>Figure 53</u></b>: History result of the tenth version model in the first training run

And we can see the results of the model in the second epoch and in the last epoch:

<div style="text-align:center">
    <img src="Images\tenth_version_first_run_second_epoch_result.png" style="width:200;height:200;">
    <img src="Images\tenth_version_first_run_last_epoch_result.png" style="width:200;height:200;">
</div>
<caption><center> <u><b>Figure 54</u></b>: Results of the tenth version model in the first training run in the second epoch and in the last epoch

**Conclusions:**

* We can see in the history result, that the loss, Accuracy and Dice Cooefficient on the train dataset and on the dev dataset converge in a pretty good way. This is because we use Learning Rate scheduler and from epoch 20 we take little steps in the training process.

* We can see that the model accuracy is preety high, but the Dice Cooefficient score not high, and the reason for this is that our model less good in predict classes that have small amount of data(we talked about this in section 7). Compared to previous models we need to try to get higer preformance.

* We can see that our model little overfit the loss, Accuracy and Dice Coefficient on train dataset. This is because we use Dropout scheduler.

In conclusion, we need to try get better results in the dev dataset while not overfit the train dataset.

Now let's try train the model more in order to see if we can get better results.

Let's create Dropout scheduler:

In [None]:
TENTH_VERSION_SECOND_RUN_DROP_SCHEDULE = {1: 0.4} 
tenth_version_second_run_dropout_scheduler = DropoutScheduler(drop_schedule=TENTH_VERSION_SECOND_RUN_DROP_SCHEDULE)

Let's create Learning Rate scheduler:

In [None]:
NUM_OF_EPOCHS_FOR_INITIAL_LR_TENTH_VERSION_SECOND_RUN = 20
    
tenth_version_second_run_lr_callback = LearningRateScheduler(lambda epoch: lr_scheduler(epoch, INITIAL_LR_TENTH_VERSION, NEW_LR_TENTH_VERSION, NUM_OF_EPOCHS_FOR_INITIAL_LR_TENTH_VERSION_SECOND_RUN))

Let's train the model:

In [None]:
TENTH_VERSION_SECOND_RUN_EPOCHS = 40
tenth_version_second_run_history = tenth_version_model.fit(train_dataset, epochs=TENTH_VERSION_SECOND_RUN_EPOCHS, validation_data=dev_dataset, callbacks=[tenth_version_viz_callback, tenth_version_second_run_lr_callback, tenth_version_second_run_dropout_scheduler])

Let's plot the history of this training process:

In [None]:
plot_history(tenth_version_second_run_history)

The history of this training process is:

<div style="text-align:center">
    <img src="Images\tenth_version_second_run_history_result.png" style="width:200;height:200;">
</div>
<caption><center> <u><b>Figure 55</u></b>: History result of the tenth version model in the second training run

And we can compare the results of the model in the last epoch of the first training run and in the last epoch of the second training:

<div style="text-align:center">
    <img src="Images\tenth_version_first_run_last_epoch_result.png" style="width:200;height:200;">
    <img src="Images\tenth_version_second_run_last_epoch_result.png" style="width:200;height:200;">
</div>
<caption><center> <u><b>Figure 56</u></b>: Results of the tenth version model in the first training run in the last epoch and in the second training run in the last epoch

**Conclusions:**

* We can see in the history result, that the loss, Accuracy and Dice Cooefficient on the train dataset converge in a pretty good way(but less converge compare to the first training), but on the dev dataset our model become much less stable, and moving in a zigzag pattern.

* We get better results in the last epoch in the second running compare to the last epoch in the first running in the train dataset, but our model performance not improved much on the dev dataset compare to the first running.

* We can see that our model much more overfit the train dataset.

In overall, the model performance improved on the train dataset, and the model performance on the the dev dataset not improved, and our model overfit the train dataset(much more compered to the first training run). In genral the model performance is not good compared pervious models, and thus adding Learning Rate and Dropout scheduler not helped us, and we will not continue to train this model.

Hence, we will use the tenth version model in the first running.

<a name='16-10-3'></a>
#### 16.10.3 - Model's predictions on the train and dev datasets

In this section we will show examples of model's predictions on the train and dev datasets.

Let's plot the model's predictions on train dataset:

In [None]:
show_predictions(tenth_version_model, train_dataset, 6)

Six examples of model's predictions on the train dataset are:

<div style="text-align:center">
    <img src="Images\tenth_version_predictions_train_part1.png" style="width:200;height:200;">
    <img src="Images\tenth_version_predictions_train_part2.png" style="width:200;height:200;">
</div>
<caption><center> <u><b>Figure 57</u></b>: Model's predictions on the train dataset

Let's plot the model's predictions on dev dataset:

In [None]:
show_predictions(tenth_version_model, dev_dataset, 6)

Six examples of model's predictions on the dev dataset are:

<div style="text-align:center">
    <img src="Images\tenth_version_predictions_dev_part1.png" style="width:200;height:200;">
    <img src="Images\tenth_version_predictions_dev_part2.png" style="width:200;height:200;">
</div>
<caption><center> <u><b>Figure 58</u></b>: Model's predictions on the dev dataset

In conclusion, we got nice results and our model less overfit the train dataset, but the model performance is not good compared pervious models.

<a name='16-11'></a>
### 16.11 - Eleventh model version

In this section we will create model that has the regular Unet model architecture with dropout layer after each max pooling layer in the encoder and after each deconvultion layer in the decoder, advance_data_augmentation function and Weighted Sparse Categorical Cross entropy loss function.

<a name='16-11-1'></a>
#### 16.11.1 - Create the model

In this section we will create the model.

In [51]:
eleventh_version_model  = genral_unet_model(INPUT_SHAPE, advance_data_augmentation, unet_encoder, unet_decoder)

Let's see the summary of the model

In [52]:
eleventh_version_model.summary()

Model: "model_11"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 input_12 (InputLayer)       [(None, 128, 128, 3)]        0         []                            
                                                                                                  
 data_augmentation_layer_11  (None, 128, 128, 3)          0         ['input_12[0][0]']            
  (DataAugmentationLayer)                                                                         
                                                                                                  
 conv2d_209 (Conv2D)         (None, 128, 128, 64)         1792      ['data_augmentation_layer_11[0
                                                                    ][0]']                        
                                                                                           

Let's compile the model with the Weighted Sparse Categorical Cross entropy loss function.

In [None]:
eleventh_version_model.compile(optimizer='adam',
              loss=weighted_sparse_categorical_crossentropy,
              metrics=['accuracy', dice_coefficient])

Let's create visualization callback:

In [None]:
for img, ground_truth in train_dataset.take(1):
    eleventh_version_viz_callback = VizCallback(eleventh_version_model, img[FIRST_EXAMPLE_IN_BATCH], ground_truth[FIRST_EXAMPLE_IN_BATCH])

Let's create Learning Rate scheduler:

In [None]:
INITIAL_LR_ELEVENTH_VERSION = 0.001
NEW_LR_ELEVENTH_VERSION = 0.0001
NUM_OF_EPOCHS_FOR_INITIAL_LR_ELEVENTH_VERSION = 20
    
eleventh_version_lr_callback = LearningRateScheduler(lambda epoch: lr_scheduler(epoch, INITIAL_LR_ELEVENTH_VERSION, NEW_LR_ELEVENTH_VERSION, NUM_OF_EPOCHS_FOR_INITIAL_LR_ELEVENTH_VERSION))

<a name='16-11-2'></a>
#### 16.11.2 - Train the model and evaluate him on train and dev datasets

In this section we will train the model.

Firstly, we will train our model for 40 epochs, and we will see the results.

In [None]:
ELEVENTH_VERSION_EPOCHS = 40
eleventh_version_history = eleventh_version_model.fit(train_dataset, epochs=ELEVENTH_VERSION_EPOCHS, validation_data=dev_dataset, callbacks=[eleventh_version_viz_callback, eleventh_version_lr_callback])

Let's plot the history of this training process:

In [None]:
plot_history(eleventh_version_history)

The history of this training process is:

<div style="text-align:center">
    <img src="Images\eleventh_version_first_run_history_result.png" style="width:200;height:200;">
</div>
<caption><center> <u><b>Figure 59</u></b>: History result of the eleventh version model in the first training run

And we can see the results of the model in the second epoch and in the last epoch:

<div style="text-align:center">
    <img src="Images\eleventh_version_first_run_second_epoch_result.png" style="width:200;height:200;">
    <img src="Images\eleventh_version_first_run_last_epoch_result.png" style="width:200;height:200;">
</div>
<caption><center> <u><b>Figure 60</u></b>: Results of the eleventh version model in the first training run in the second epoch and in the last epoch

**Conclusions:**

* We can see in the history result, that the loss, Accuracy and Dice Cooefficient on the train dataset and on the dev dataset converge in a pretty good way. This is because we use Learning Rate scheduler and from epoch 20 we take little steps in the training process.

* We can see that the model accuracy is preety high, but the Dice Cooefficient score not high, and the reason for this is that our model less good in predict classes that have small amount of data(we talked about this in section 7). We need to try to get higer preformance in train and dev datasets.

* We can see that our model less overfit the loss, Accuracy and Dice Coefficient on train dataset compare to previous models. This is because we use new function of data augmentation that she better than the initial data augmentation function.

In conclusion, we need to try get better results in the train and dev datasets while not overfit the train dataset.

Now let's try train the model more in order to see if we can get better results.

Let's create Learning Rate scheduler:

In [None]:
NUM_OF_EPOCHS_FOR_INITIAL_LR_ELEVENTH_VERSION_SECOND_RUN = 0

eleventh_version_second_run_lr_callback = LearningRateScheduler(lambda epoch: lr_scheduler(epoch, INITIAL_LR_ELEVENTH_VERSION, NEW_LR_ELEVENTH_VERSION, NUM_OF_EPOCHS_FOR_INITIAL_LR_ELEVENTH_VERSION_SECOND_RUN))

Let's train the model:

In [None]:
ELEVENTH_VERSION_SECOND_RUN_EPOCHS = 40
eleventh_version_second_run_history = eleventh_version_model.fit(train_dataset, epochs=ELEVENTH_VERSION_SECOND_RUN_EPOCHS, validation_data=dev_dataset, callbacks=[eleventh_version_viz_callback, eleventh_version_second_run_lr_callback])

Let's plot the history of this training process:

In [None]:
plot_history(eleventh_version_second_run_history)

The history of this training process is:

<div style="text-align:center">
    <img src="Images\eleventh_version_second_run_history_result.png" style="width:200;height:200;">
</div>
<caption><center> <u><b>Figure 61</u></b>: History result of the eleventh version model in the second training run

And we can compare the results of the model in the last epoch of the first training run and in the last epoch of the second training:

<div style="text-align:center">
    <img src="Images\eleventh_version_first_run_last_epoch_result.png" style="width:200;height:200;">
    <img src="Images\eleventh_version_second_run_last_epoch_result.png" style="width:200;height:200;">
</div>
<caption><center> <u><b>Figure 62</u></b>: Results of the eleventh version model in the first training run in the last epoch and in the second training run in the last epoch

**Conclusions:**

* We can see in the history result, that the loss, Accuracy and Dice Cooefficient on the train dataset converge in a pretty good way(but less converge compare to the first training), but on the dev dataset our model become much less stable, and moving in a zigzag pattern.

* We get better results in the last epoch in the second running compare to the last epoch in the first running in the train dataset, but our model performance not improved much on the dev dataset compare to the first running, and even get worse.

* We can see that our model much more overfit the train dataset.

In overall, the model performance improved on the train dataset, and the model performance on the the dev dataset not improved(and even get worse), and our model overfit the train dataset(much more compered to the first training run). 

Hence, we will use the eleventh version model in the first running.

<a name='16-11-3'></a>
#### 16.11.3 - Model's predictions on the train and dev datasets

In this section we will show examples of model's predictions on the train and dev datasets.

Let's plot the model's predictions on train dataset:

In [None]:
show_predictions(eleventh_version_model, train_dataset, 6)

Six examples of model's predictions on the train dataset are:

<div style="text-align:center">
    <img src="Images\eleventh_version_predictions_train_part1.png" style="width:200;height:200;">
    <img src="Images\eleventh_version_predictions_train_part2.png" style="width:200;height:200;">
</div>
<caption><center> <u><b>Figure 63</u></b>: Model's predictions on the train dataset

Let's plot the model's predictions on dev dataset:

In [None]:
show_predictions(eleventh_version_model, dev_dataset, 6)

Six examples of model's predictions on the dev dataset are:

<div style="text-align:center">
    <img src="Images\eleventh_version_predictions_dev_part1.png" style="width:200;height:200;">
    <img src="Images\eleventh_version_predictions_dev_part2.png" style="width:200;height:200;">
</div>
<caption><center> <u><b>Figure 64</u></b>: Model's predictions on the dev dataset

In conclusion, we got nice results, our model less overfit the loss, Accuracy and Dice Coefficient on train dataset compare to previous models, and the loss, Accuracy and Dice Cooefficient on the dev dataset were more converge and were not move in a zigzag pattern, but the model performance is not good enough compared pervious models.

<a name='16-12'></a>
### 16.12 - Twelfth model version

In this section we will create model that has the regular Unet model architecture with dropout layer after each max pooling layer in the encoder and after each deconvultion layer in the decoder, basic_data_augmentation function, second version of normalization, Weighted Sparse Categorical Cross entroy loss function and Learning Rate scheduler.

<a name='16-12-1'></a>
#### 16.12.1 - Create the model

In this section we will create the model.

In [53]:
twelfth_version_model  = genral_unet_model(INPUT_SHAPE, basic_data_augmentation, unet_encoder, unet_decoder)

Let's see the summary of the model

In [54]:
twelfth_version_model.summary()

Model: "model_12"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 input_13 (InputLayer)       [(None, 128, 128, 3)]        0         []                            
                                                                                                  
 data_augmentation_layer_12  (None, 128, 128, 3)          0         ['input_13[0][0]']            
  (DataAugmentationLayer)                                                                         
                                                                                                  
 conv2d_228 (Conv2D)         (None, 128, 128, 64)         1792      ['data_augmentation_layer_12[0
                                                                    ][0]']                        
                                                                                           

Let's compile the model with the Weighted Sparse Categorical Cross entropy loss function.

In [None]:
twelfth_version_model.compile(optimizer='adam',
              loss=weighted_sparse_categorical_crossentropy,
              metrics=['accuracy', dice_coefficient])

Let's create visualization callback:

In [None]:
for img, ground_truth in train_dataset.take(1):
    twelfth_version_viz_callback = VizCallback(twelfth_version_model, img[FIRST_EXAMPLE_IN_BATCH], ground_truth[FIRST_EXAMPLE_IN_BATCH])

Let's create Learning Rate scheduler:

In [None]:
INITIAL_LR_TWELFTH_VERSION = 0.001
NEW_LR_TWELFTH_VERSION = 0.0001
NUM_OF_EPOCHS_FOR_INITIAL_LR_TWELFTH_VERSION = 20
    
twelfth_version_lr_callback = LearningRateScheduler(lambda epoch: lr_scheduler(epoch, INITIAL_LR_TWELFTH_VERSION, NEW_LR_TWELFTH_VERSION, NUM_OF_EPOCHS_FOR_INITIAL_LR_TWELFTH_VERSION))

<a name='16-12-2'></a>
#### 16.12.2 - Train the model and evaluate him on train and dev datasets

In this section we will train the model.

Firstly, we will train our model for 40 epochs, and we will see the results.

In [None]:
TWELFTH_VERSION_EPOCHS = 40
twelfth_version_history = twelfth_version_model.fit(train_dataset, epochs=TWELFTH_VERSION_EPOCHS, validation_data=dev_dataset, callbacks=[twelfth_version_viz_callback, twelfth_version_lr_callback])

Let's plot the history of this training process:

In [None]:
plot_history(twelfth_version_history)

The history of this training process is:

<div style="text-align:center">
    <img src="Images\twelfth_version_first_run_history_result.png" style="width:200;height:200;">
</div>
<caption><center> <u><b>Figure 65</u></b>: History result of the twelfth version model in the first training run

And we can see the results of the model in the second epoch and in the last epoch:

<div style="text-align:center">
    <img src="Images\twelfth_version_first_run_second_epoch_result.png" style="width:200;height:200;">
    <img src="Images\twelfth_version_first_run_last_epoch_result.png" style="width:200;height:200;">
</div>
<caption><center> <u><b>Figure 66</u></b>: Results of the twelfth version model in the first training run in the second epoch and in the last epoch

**Conclusions:**

* We can see in the history result, that the loss, Accuracy and Dice Cooefficient on the train dataset and on the dev dataset converge in a pretty good way. This is because we use Learning Rate scheduler and from epoch 20 we take little steps in the training process.

* We can see that the model accuracy is preety high, but the Dice Cooefficient score not high, and the reason for this is that our model less good in predict classes that have small amount of data(we talked about this in section 7). We need to try to get higer preformance in train and dev datasets.

* We can see that our model less overfit the loss, Accuracy and Dice Coefficient on train dataset compare to previous models. Maybee this is because we use new normalization method, or because we use Learning Rate scheduler and from epoch 20 we take little steps in the training process.

In conclusion, we need to try get better results in the train and dev datasets while not overfit the train dataset.

Now let's try train the model more in order to see if we can get better results.

Let's create Learning Rate scheduler:

In [None]:
NUM_OF_EPOCHS_FOR_INITIAL_LR_TWELFTH_VERSION_SECOND_RUN = 0

twelfth_version_second_run_lr_callback = LearningRateScheduler(lambda epoch: lr_scheduler(epoch, INITIAL_LR_TWELFTH_VERSION, NEW_LR_TWELFTH_VERSION, NUM_OF_EPOCHS_FOR_INITIAL_LR_TWELFTH_VERSION_SECOND_RUN))

Let's train the model:

In [None]:
TWELFTH_VERSION_SECOND_RUN_EPOCHS = 40
twelfth_version_second_run_history = twelfth_version_model.fit(train_dataset_second_version, epochs=TWELFTH_VERSION_SECOND_RUN_EPOCHS, validation_data=dev_dataset_second_version, callbacks=[twelfth_version_viz_callback, twelfth_version_second_run_lr_callback])

Let's plot the history of this training process:

In [None]:
plot_history(twelfth_version_second_run_history)

The history of this training process is:

<div style="text-align:center">
    <img src="Images\twelfth_version_second_run_history_result.png" style="width:200;height:200;">
</div>
<caption><center> <u><b>Figure 67</u></b>: History result of the twelfth version model in the second training run

And we can compare the results of the model in the last epoch of the first training run and in the last epoch of the second training:

<div style="text-align:center">
    <img src="Images\twelfth_version_first_run_last_epoch_result.png" style="width:200;height:200;">
    <img src="Images\twelfth_version_second_run_last_epoch_result.png" style="width:200;height:200;">
</div>
<caption><center> <u><b>Figure 68</u></b>: Results of the twelfth version model in the first training run in the last epoch and in the second training run in the last epoch

**Conclusions:**

* We can see in the history result, that the loss, Accuracy and Dice Cooefficient on the train dataset converge in a pretty good way(but less converge compare to the first training), but on the dev dataset our model become much less stable, and moving in a zigzag pattern.

* We get better results in the last epoch in the second running compare to the last epoch in the first running in the train dataset, but our model performance not improved much on the dev dataset compare to the first running, and even get worse in the loss.

* We can see that our model much more overfit the train dataset.

In overall, the model performance improved on the train dataset, and the model performance on the the dev dataset not improved much(and even get worse in the loss), and our model more overfit the train dataset. 
We can see that this model has very similar performance to the eighth version model performance, and thus the fact that we changed the normaliztation method did not help. Hence, we will not continue to train the model.

<a name='16-12-3'></a>
#### 16.12.3 - Model's predictions on the train and dev datasets

In this section we will show examples of model's predictions on the train and dev datasets.

Let's plot the model's predictions on train dataset:

In [None]:
show_predictions(twelfth_version_model, train_dataset_second_version, 6)

Six examples of model's predictions on the train dataset are:

<div style="text-align:center">
    <img src="Images\twelfth_version_predictions_train_part1.png" style="width:200;height:200;">
    <img src="Images\twelfth_version_predictions_train_part2.png" style="width:200;height:200;">
</div>
<caption><center> <u><b>Figure 69</u></b>: Model's predictions on the train dataset

Let's plot the model's predictions on dev dataset:

In [None]:
show_predictions(twelfth_version_model, dev_dataset_second_version, 6)

Six examples of model's predictions on the dev dataset are:

<div style="text-align:center">
    <img src="Images\twelfth_version_predictions_dev_part1.png" style="width:200;height:200;">
    <img src="Images\twelfth_version_predictions_dev_part2.png" style="width:200;height:200;">
</div>
<caption><center> <u><b>Figure 70</u></b>: Model's predictions on the dev dataset

In conclusion, we got similar results to the eighth version model performance, and thus the fact that we changed the normaliztation method did not help us.

<a name='16-13'></a>
### 16.13 - Thirteen model version

In this section we will create model that has the regular Unet model architecture with dropout layer after each max pooling layer in the encoder and after each deconvultion layer in the decoder, advance_data_augmentation function, second version of normalization, Weighted Sparse Categorical Cross entroy loss function and Learning Rate scheduler.

<a name='16-13-1'></a>
#### 16.13.1 - Create the model

In this section we will create the model.

In [55]:
thirteen_version_model  = genral_unet_model(INPUT_SHAPE, advance_data_augmentation, unet_encoder, unet_decoder)

Let's see the summary of the model

In [56]:
thirteen_version_model.summary()

Model: "model_13"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 input_14 (InputLayer)       [(None, 128, 128, 3)]        0         []                            
                                                                                                  
 data_augmentation_layer_13  (None, 128, 128, 3)          0         ['input_14[0][0]']            
  (DataAugmentationLayer)                                                                         
                                                                                                  
 conv2d_247 (Conv2D)         (None, 128, 128, 64)         1792      ['data_augmentation_layer_13[0
                                                                    ][0]']                        
                                                                                           

Let's compile the model with the Weighted Sparse Categorical Cross entropy loss function.

In [None]:
thirteen_version_model.compile(optimizer='adam',
              loss=weighted_sparse_categorical_crossentropy,
              metrics=['accuracy', dice_coefficient])

Let's create visualization callback:

In [None]:
for img, ground_truth in train_dataset.take(1):
    thirteen_version_viz_callback = VizCallback(thirteen_version_model, img[FIRST_EXAMPLE_IN_BATCH], ground_truth[FIRST_EXAMPLE_IN_BATCH])

Let's create Learning Rate scheduler:

In [None]:
INITIAL_LR_THIRTEEN_VERSION = 0.001
NEW_LR_THIRTEEN_VERSION = 0.0001
NUM_OF_EPOCHS_FOR_INITIAL_LR_THIRTEEN_VERSION = 20
    
thirteen_version_lr_callback = LearningRateScheduler(lambda epoch: lr_scheduler(epoch, INITIAL_LR_THIRTEEN_VERSION, NEW_LR_THIRTEEN_VERSION, NUM_OF_EPOCHS_FOR_INITIAL_LR_THIRTEEN_VERSION))

<a name='16-13-2'></a>
#### 16.13.2 - Train the model and evaluate him on train and dev datasets

In this section we will train the model.

Firstly, we will train our model for 40 epochs, and we will see the results.

In [None]:
THIRTEEN_VERSION_EPOCHS = 40
thirteen_version_history = thirteen_version_model.fit(train_dataset_second_version, epochs=THIRTEEN_VERSION_EPOCHS, validation_data=dev_dataset_second_version, callbacks=[thirteen_version_viz_callback, thirteen_version_lr_callback])

Let's plot the history of this training process:

In [None]:
plot_history(thirteen_version_history)

The history of this training process is:

<div style="text-align:center">
    <img src="Images\thirteen_version_first_run_history_result.png" style="width:200;height:200;">
</div>
<caption><center> <u><b>Figure 71</u></b>: History result of the thirteen version model in the first training run

And we can see the results of the model in the second epoch and in the last epoch:

<div style="text-align:center">
    <img src="Images\thirteen_version_first_run_second_epoch_result.png" style="width:200;height:200;">
    <img src="Images\thirteen_version_first_run_last_epoch_result.png" style="width:200;height:200;">
</div>
<caption><center> <u><b>Figure 72</u></b>: Results of the thirteen version model in the first training run in the second epoch and in the last epoch

**Conclusions:**

* We can see in the history result, that the loss, Accuracy and Dice Cooefficient on the train dataset and on the dev dataset converge in a pretty good way. This is because we use Learning Rate scheduler and from epoch 20 we take little steps in the training process.

* We can see that the model accuracy is preety high, but the Dice Cooefficient score not high, and the reason for this is that our model less good in predict classes that have small amount of data(we talked about this in section 7). We need to try to get higer preformance in train and dev datasets.

* We can see that our model less overfit the loss, Accuracy and Dice Coefficient on train dataset compare to previous models. This is because we use new function of data augmentation that she better than the initial data augmentation function. In additoin maybee this is because we use new normalization method, or because we use Learning Rate scheduler and from epoch 20 we take little steps in the training process.

In conclusion, we need to try get better results in the train and dev datasets while not overfit the train dataset.

Now let's try train the model more in order to see if we can get better results.

Let's create Learning Rate scheduler:

In [None]:
NUM_OF_EPOCHS_FOR_INITIAL_LR_THIRTEEN_VERSION_SECOND_RUN = 0

thirteen_version_second_run_lr_callback = LearningRateScheduler(lambda epoch: lr_scheduler(epoch, INITIAL_LR_THIRTEEN_VERSION, NEW_LR_THIRTEEN_VERSION, NUM_OF_EPOCHS_FOR_INITIAL_LR_THIRTEEN_VERSION_SECOND_RUN))

Let's train the model:

In [None]:
THIRTEEN_VERSION_SECOND_RUN_EPOCHS = 40
thirteen_version_second_run_history = thirteen_version_model.fit(train_dataset_second_version, epochs=THIRTEEN_VERSION_SECOND_RUN_EPOCHS, validation_data=dev_dataset_second_version, callbacks=[thirteen_version_viz_callback, thirteen_version_second_run_lr_callback])

Let's plot the history of this training process:

In [None]:
plot_history(thirteen_version_second_run_history)

The history of this training process is:

<div style="text-align:center">
    <img src="Images\thirteen_version_second_run_history_result.png" style="width:200;height:200;">
</div>
<caption><center> <u><b>Figure 73</u></b>: History result of the thirteen version model in the second training run

And we can compare the results of the model in the last epoch of the first training run and in the last epoch of the second training:

<div style="text-align:center">
    <img src="Images\thirteen_version_first_run_last_epoch_result.png" style="width:200;height:200;">
    <img src="Images\thirteen_version_second_run_last_epoch_result.png" style="width:200;height:200;">
</div>
<caption><center> <u><b>Figure 74</u></b>: Results of the thirteen version model in the first training run in the last epoch and in the second training run in the last epoch

**Conclusions:**

* We can see in the history result, that the loss, Accuracy and Dice Cooefficient on the train dataset converge in a pretty good way(but less converge compare to the first training), but on the dev dataset our model become much less stable, and moving in a zigzag pattern.

* We get better results in the last epoch in the second running compare to the last epoch in the first running in the train dataset, but our model performance not improved much on the dev dataset compare to the first running, and even get worse in the loss.

* We can see that our model much more overfit the train dataset.

In overall, the model performance improved on the train dataset, and the model performance on the the dev dataset not improved much(and even get worse in the loss), and our model more overfit the train dataset. 
We can see that this model has very similar performance to the eleventh version model performance, and thus the fact that we changed the normaliztation method did not help. Hence, we will not continue to train the model.

<a name='16-13-3'></a>
#### 16.13.3 - Model's predictions on the train and dev datasets

In this section we will show examples of model's predictions on the train and dev datasets.

Let's plot the model's predictions on train dataset:

In [None]:
show_predictions(thirteen_version_model, train_dataset_second_version, 6)

Six examples of model's predictions on the train dataset are:

<div style="text-align:center">
    <img src="Images\thirteen_version_predictions_train_part1.png" style="width:200;height:200;">
    <img src="Images\thirteen_version_predictions_train_part2.png" style="width:200;height:200;">
</div>
<caption><center> <u><b>Figure 75</u></b>: Model's predictions on the train dataset

Let's plot the model's predictions on dev dataset:

In [None]:
show_predictions(thirteen_version_model, dev_dataset_second_version, 6)

Six examples of model's predictions on the dev dataset are:

<div style="text-align:center">
    <img src="Images\thirteen_version_predictions_dev_part1.png" style="width:200;height:200;">
    <img src="Images\thirteen_version_predictions_dev_part2.png" style="width:200;height:200;">
</div>
<caption><center> <u><b>Figure 76</u></b>: Model's predictions on the dev dataset

In conclusion, we got similar results to the eleventh version model performance, and thus the fact that we changed the normaliztation method did not help us.

<a name='17'></a>
## 17 - Final model

In this section we will choose the final model that we will use, show his predictions on the test dataset and evaluate him on the test dataset.

<a name='17-1'></a>
### 17.1 - Choose the model that we will use

We need to Choose the final model from versions 1, 3, 8, 10, 11, 12 and 13(other versions not good as we explained before).

Firstly, let's recall the final results of each version. Let's plot the final history result of each version and result of the last epoch(in the training run that we choosed in each model):


<div style="text-align:center">
    <img src="Images\first_version_second_run_history_result.png" style="width:200;height:200;">
</div>
<caption><center> <u><b>Figure 77</u></b>: History result of the first version model
<div style="text-align:center">
    <img src="Images\first_version_second_run_last_epoch_result.png" style="width:200;height:200;">
</div>
<caption><center> <u><b>Figure 78</u></b>: Results of the first version model in the last epoch


<div style="text-align:center">
    <img src="Images\third_version_second_run_history_result.png" style="width:200;height:200;">
</div>
<caption><center> <u><b>Figure 79</u></b>: History result of the third version model
<div style="text-align:center">
    <img src="Images\third_version_second_run_last_epoch_result.png" style="width:200;height:200;">
</div>
<caption><center> <u><b>Figure 80</u></b>: Results of the third version model in the last epoch


<div style="text-align:center">
    <img src="Images\eighth_version_second_run_history_result.png" style="width:200;height:200;">
</div>
<caption><center> <u><b>Figure 81</u></b>: History result of the eighth version model
<div style="text-align:center">
    <img src="Images\eighth_version_second_run_last_epoch_result.png" style="width:200;height:200;">
</div>
<caption><center> <u><b>Figure 82</u></b>: Results of the eighth version model in the last epoch


<div style="text-align:center">
    <img src="Images\tenth_version_first_run_history_result.png" style="width:200;height:200;">
</div>
<caption><center> <u><b>Figure 83</u></b>: History result of the tenth version model
<div style="text-align:center">
    <img src="Images\tenth_version_first_run_last_epoch_result.png" style="width:200;height:200;">
</div>
<caption><center> <u><b>Figure 84</u></b>: Results of the tenth version model in the last epoch


<div style="text-align:center">
    <img src="Images\eleventh_version_first_run_history_result.png" style="width:200;height:200;">
</div>
<caption><center> <u><b>Figure 85</u></b>: History result of the eleventh version model
<div style="text-align:center">
    <img src="Images\eleventh_version_first_run_last_epoch_result.png" style="width:200;height:200;">
</div>
<caption><center> <u><b>Figure 86</u></b>: Results of the eleventh version model in the last epoch


<div style="text-align:center">
    <img src="Images\twelfth_version_second_run_history_result.png" style="width:200;height:200;">
</div>
<caption><center> <u><b>Figure 87</u></b>: History result of the twelfth version model
<div style="text-align:center">
    <img src="Images\twelfth_version_second_run_last_epoch_result.png" style="width:200;height:200;">
</div>
<caption><center> <u><b>Figure 88</u></b>: Results of the twelfth version model in the last epoch


<div style="text-align:center">
    <img src="Images\thirteen_version_second_run_history_result.png" style="width:200;height:200;">
</div>
<caption><center> <u><b>Figure 89</u></b>: History result of the thirteen version model
<div style="text-align:center">
    <img src="Images\thirteen_version_second_run_last_epoch_result.png" style="width:200;height:200;">
</div>
<caption><center> <u><b>Figure 90</u></b>: Results of the thirteen version model in the last epoch

We can see that in the first version model we get the best results(acoording the evaluation matric that we defined before) on the dev dataset(also on the train dataset). 

Thus we will choose the first version model as our final model.

<a name='17-2'></a>
### 17.2 - Final model's predictions on the test dataset

Let's plot the final model's predictions on the test dataset:

In [None]:
show_predictions(first_version_model, test_dataset, 6)

Six examples of the final model's predictions on the test dataset are:

<div style="text-align:center">
    <img src="Images\first_version_predictions_test_part1.png" style="width:200;height:200;">
    <img src="Images\first_version_predictions_test_part2.png" style="width:200;height:200;">
</div>
<caption><center> <u><b>Figure 91</u></b>: Final model's predictions on the test dataset

<a name='17-3'></a>
### 17.3 - Final model evaluation on the test dataset

Let's show the final model's evaluation on the test dataset:

In [None]:
evaluation_result_test = first_version_model.evaluate(test_dataset)

And the output of this is:

<div style="text-align:center">
    <img src="Images\first_version_test_evaluation.png" style="width:200;height:200;">
</div>
<caption><center> <u><b>Figure 92</u></b>: Final model's evaluation on the test dataset

<a name='18'></a>
## 18 - Summary 

So we have reached the end of the project, and I must to say that I really enjoyed from the project. I feel that I improved much while from this prject, and I managed to solve and overcome many challenges. This is my first project of this magnitude, and thus I learned so much from this project, include important thing that is read papers. I am very proud of me, and I am see that I managed to achieve many goals that I set for myself, and of course I still have a lot to improve on.

The subject of the project in my opinion is very interesting, and what I learned in this project also useful for other applications like medical applications and more.

Also, I managed to sucess to get very good results for first project, and I sure that with more resarch and try more architectures that are not Unet can improve the performence. One thing that worth to note is that we had small dataset and I belive that if we had larger dataset with more data from each class(espically from minority classes) we could get better performance.

In conclusion, I really enjoyed from this project, and I sure that I learned so much from this project.

<a name='19'></a>
## 19 - Goals for the future

I can imporve this project. My goals for the future for this project are:

* Try more encoders and decoders for the Unet architecture. For example we can use VGG-16 or ResNet in the encoder and decoder.

* Try more architectures that are not Unet for semantic segmentation

* Try to improve the performance and overcome the problem that our dataset is small and we have the problem of imbalanced classes.

And that's it, we finished this project ( for now:) ).

Thank you very much for reading this project. I am very appreciate this!

<div style="text-align:center">
    <img src="Images\thank_you.gif" style="width:800px;height:500px;">
</div>