<a href="https://colab.research.google.com/github/MedicalImageAnalysisTutorials/DeepLearning4All/blob/main/IA_DNN_ImageObjectDetection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction


**The Object Detection Problem:** involves classification and localization. It takes an image as input and produces one or more bounding boxes with the class label attached to each bounding box. \\

**The Object Recognition Problem:** is to identify the objects in images or videos. \\

**Draft version**
In this notebook, I try to provide a practical deep learning tutorial built on simple examples. I keep the implementation simple and avoid built-in functions where possible, to give a clear idea of the concepts. You need basic programming knowledge. 
I made the code flexible, so you can try different approaches, datasets, optimisers, and loss functions, which are selected through if/else statements. 

**The general code structure:**

    1. Read and pre-process the data
    2. Create the model, optimiser, loss function, and metrics
    3. Start the training loop
    4. Evaluate the final model




**The Object Detection Problem:** \\
When we’re shown an image, our brain instantly recognizes the objects contained in it. A machine, on the other hand, needs a lot of time and training data to identify these objects. With the recent advances in hardware and deep learning, however, this area of computer vision has become much more accessible.
First, let us look at what the object detection problem is: \\

![image1](https://cdn.analyticsvidhya.com/wp-content/uploads/2018/10/I1_2009_09_08_drive_0012_001351-768x223.png)\\
![image1](https://cdn.analyticsvidhya.com/wp-content/uploads/2018/06/I1_2009_09_08_drive_0012_001351-copy-768x223.png) \\
Our objective in object detection is twofold:

1.   To identify which objects are present in the image and where they’re located
2.   To filter out the object of attention
\\
**Next, we introduce object detection using deep learning, starting from the RCNN family of networks and ending with YOLO.** \\



## RCNN
The RCNN algorithm proposes a set of boxes in the image and checks whether any of these boxes contains an object. **RCNN uses selective search to extract these boxes from an image (these boxes are called regions).**\\


*   First, an image is taken as an input: \\
![image1](https://cdn.analyticsvidhya.com/wp-content/uploads/2018/10/Screenshot-from-2018-10-08-14-59-02.png)
*   Then, we get the Regions of Interest (ROI) using some proposal method (for example, selective search as seen above): \\
![image1](https://cdn.analyticsvidhya.com/wp-content/uploads/2018/10/Screenshot-from-2018-10-08-15-00-09.png)
*   All these regions are then reshaped as per the input of the CNN, and each region is passed to the ConvNet: \\
![image1](https://cdn.analyticsvidhya.com/wp-content/uploads/2018/10/Screenshot-from-2018-10-08-15-01-56.png)
*   CNN then extracts features for each region and SVMs are used to divide these regions into different classes: \\
![image1](https://cdn.analyticsvidhya.com/wp-content/uploads/2018/10/Screenshot-from-2018-10-08-15-03-02.png)
*   Finally, a bounding box regression (Bbox reg) is used to predict the bounding boxes for each identified region: \\
![image1](https://cdn.analyticsvidhya.com/wp-content/uploads/2018/10/Screenshot-from-2018-10-08-15-06-33.png)
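
The reshaping step in the third bullet (warping every region to the CNN input size) can be sketched with NumPy alone. This is only an illustration under stated assumptions, not the original RCNN code: `warp_region` is a hypothetical helper, the 227x227 output size is an assumed CNN input size, the proposals are made up, and nearest-neighbour sampling stands in for proper interpolation.

```python
import numpy as np

def warp_region(image, box, out_size=(227, 227)):
    """Crop box = (x1, y1, x2, y2) from image and resize the crop to out_size
    with nearest-neighbour sampling (hypothetical helper)."""
    x1, y1, x2, y2 = box
    crop = image[y1:y2, x1:x2]
    out_h, out_w = out_size
    # Map every output row/column back to a source row/column of the crop.
    rows = np.arange(out_h) * crop.shape[0] // out_h
    cols = np.arange(out_w) * crop.shape[1] // out_w
    return crop[rows[:, None], cols[None, :]]

image = np.random.rand(300, 400, 3)                      # dummy H x W x C image
proposals = [(10, 20, 110, 220), (50, 60, 350, 160)]     # made-up regions
warped = np.stack([warp_region(image, b) for b in proposals])
print(warped.shape)  # (2, 227, 227, 3)
```

Every region ends up with the same shape regardless of its original size, which is what allows the regions to be passed to the same ConvNet.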

##  Fast RCNN 
Instead of running a CNN 2,000 times per image, we can run it just once per image and get all the regions of interest (regions containing some object).

Ross Girshick, the author of RCNN, came up with the idea of running the CNN just once per image and then sharing that computation across the 2,000 regions. In Fast RCNN, we feed the input image to the CNN, which generates the convolutional feature maps. The region proposals are then extracted on these maps. We then use an RoI pooling layer to reshape all the proposed regions to a fixed size, so that they can be fed into a fully connected network. \\

The differences from the RCNN network are as follows: \\
*   The input image is passed to a ConvNet, which returns the regions of interest accordingly. Then we apply the RoI pooling layer on the extracted regions of interest to make sure all the regions are of the same size: \\
![image1](https://cdn.analyticsvidhya.com/wp-content/uploads/2018/10/Screenshot-from-2018-10-08-15-45-26.png)
*   Finally, these regions are passed on to a fully connected network which classifies them, as well as returns the bounding boxes using softmax and linear regression layers simultaneously: \\
![image1](https://cdn.analyticsvidhya.com/wp-content/uploads/2018/10/Screenshot-from-2018-10-08-15-47-18.png)
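
The RoI pooling step can be illustrated on a tiny single-channel feature map. The following is a minimal sketch rather than the Fast RCNN implementation: `roi_max_pool` is a hypothetical helper, the 2x2 output size is arbitrary, and the region is assumed to be at least as large as the output grid.

```python
import numpy as np

def roi_max_pool(feature_map, roi, out_size=(2, 2)):
    """Max-pool the region roi = (x1, y1, x2, y2) of a 2D feature map
    into a fixed out_size grid (hypothetical helper)."""
    x1, y1, x2, y2 = roi
    region = feature_map[y1:y2, x1:x2]
    out_h, out_w = out_size
    # Split the region into out_h x out_w roughly equal bins.
    row_edges = np.linspace(0, region.shape[0], out_h + 1).astype(int)
    col_edges = np.linspace(0, region.shape[1], out_w + 1).astype(int)
    pooled = np.zeros(out_size, dtype=feature_map.dtype)
    for i in range(out_h):
        for j in range(out_w):
            bin_values = region[row_edges[i]:row_edges[i + 1],
                                col_edges[j]:col_edges[j + 1]]
            pooled[i, j] = bin_values.max()
    return pooled

feature_map = np.arange(36).reshape(6, 6)
print(roi_max_pool(feature_map, (0, 0, 4, 4)))  # [[ 7  9] [19 21]]
print(roi_max_pool(feature_map, (1, 2, 6, 5)))  # [[14 17] [26 29]]
```

The two regions have different sizes (4x4 and 3x5), yet both are reduced to the same 2x2 output, so they can feed the same fully connected layers.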


##  Faster RCNN
Faster RCNN is the modified version of Fast RCNN. The major difference between them is that Fast RCNN uses selective search for generating Regions of Interest, while Faster RCNN uses “Region Proposal Network”, aka RPN. RPN takes image feature maps as an input and generates a set of object proposals, each with an objectness score as output. The main steps are as follows: \\ 

1.   We take an image as input and pass it to the ConvNet which returns the feature map for that image.
2.   Region proposal network is applied on these feature maps. This returns the object proposals along with their objectness score.
3.   A RoI pooling layer is applied on these proposals to bring down all the proposals to the same size.
4.   Finally, the proposals are passed to a fully connected layer which has a softmax layer and a linear regression layer at its top, to classify and output the bounding boxes for objects.


*   The input image is passed to a ConvNet, which returns the feature maps. The Region Proposal Network generates proposals from these maps, and the RoI pooling layer then makes sure all the proposals are of the same size: \\
![image1](https://cdn.analyticsvidhya.com/wp-content/uploads/2018/10/Screenshot-from-2018-10-09-14-15-36.png)
*   Let me briefly explain how this Region Proposal Network (RPN) actually works. To begin with, Faster RCNN takes the feature maps from the CNN and passes them on to the Region Proposal Network. The RPN slides a window over these feature maps, and at each window position it generates k anchor boxes of different shapes and sizes: \\
![image1](https://cdn.analyticsvidhya.com/wp-content/uploads/2018/10/0n6pZEyvW47nlcdQz.png)

Anchor boxes are fixed-size bounding boxes that are placed throughout the image and have different shapes and sizes. For each anchor, the RPN predicts two things:

*   The first is the probability that an anchor is an object (it does not consider which class the object belongs to)
*   Second is the bounding box regressor for adjusting the anchors to better fit the object
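
Anchor generation at each sliding-window position can be sketched as follows. This is an illustration only: `make_anchors` is a hypothetical helper, and the stride, scales, and aspect ratios are assumed values chosen for the example, giving k = 9 anchors per position.

```python
import numpy as np

def make_anchors(feat_h, feat_w, stride=16,
                 scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (feat_h * feat_w * k, 4) anchors as (x1, y1, x2, y2),
    where k = len(scales) * len(ratios)."""
    anchors = []
    for row in range(feat_h):
        for col in range(feat_w):
            # Centre of this sliding-window position in image pixels.
            cx, cy = (col + 0.5) * stride, (row + 0.5) * stride
            for scale in scales:
                for ratio in ratios:          # ratio = height / width
                    w = scale / np.sqrt(ratio)
                    h = scale * np.sqrt(ratio)
                    anchors.append([cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2])
    return np.array(anchors)

anchors = make_anchors(feat_h=2, feat_w=3)
print(anchors.shape)  # (54, 4): 2 * 3 positions x 9 anchors each
```

For every one of these anchors, the RPN would output an objectness score and four regression offsets; that prediction step is not part of this sketch.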

##  YOLO

Next, we introduce another well-known deep learning object detection model: YOLO.

The YOLO framework (You Only Look Once) deals with object detection in a different way. It takes the entire image in a single instance and predicts the bounding box coordinates and class probabilities for these boxes. The biggest advantage of YOLO is its speed: the base model in the original paper processes 45 frames per second. YOLO also learns generalizable representations of objects.

This is one of the best algorithms for object detection and has shown performance comparable to that of the R-CNN algorithms. 

**Yolo network structure**

Our model will be trained as follows:

![image1](https://cdn.analyticsvidhya.com/wp-content/uploads/2018/12/Screenshot-from-2018-11-15-18-46-10.png)

*   The framework first takes the raw image with its annotation and then divides the image into a grid (3x3 in our example): \\
![image1](https://cdn.analyticsvidhya.com/wp-content/uploads/2018/12/Screenshot-from-2018-11-15-17-46-32.png) \\
*   Image classification and localization are applied to each grid cell. YOLO then predicts the bounding boxes and their corresponding class probabilities for objects (if any are found, of course). 

The labelled data can be encoded as follows: \\
![image1](https://cdn.analyticsvidhya.com/wp-content/uploads/2018/12/Screenshot-from-2018-11-15-18-01-24.png) \\
Here,
*   pc defines whether an object is present in the grid cell or not (it is the probability)
*   bx, by, bh, bw specify the bounding box if there is an object
*   c1, c2, c3 represent the classes; other datasets can of course have more classes. So, if the object is a car, c2 will be 1 and c1 and c3 will be 0, and so on. \\

As an example, take the grid cell that contains a car (c2 = 1): \\
![image1](https://cdn.analyticsvidhya.com/wp-content/uploads/2018/12/Screenshot-from-2018-11-15-19-35-31.png) \\

The bounding box can be encoded as follows: \\
![image1](https://cdn.analyticsvidhya.com/wp-content/uploads/2018/12/Screenshot-from-2018-11-15-18-27-25.png) \\

pc will be equal to 1. \\
bx, by, bh, and bw are calculated relative to this grid cell only. bx and by are the coordinates of the center point of the object, and bh and bw are the height and width of the bounding box as ratios of the height and width of the grid cell. \\
In this case, the values will be approximately bx = 0.4, by = 0.3, bh = 0.9, bw = 0.5: \\

![image17](https://cdn.analyticsvidhya.com/wp-content/uploads/2018/12/Screenshot-from-2018-11-15-19-39-51.png) \\
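
The encoding described above can be written down in a few lines. This is a sketch under stated assumptions, not the code used later in the notebook: `encode_box` is a hypothetical helper, the box is given by its center and size as fractions of the whole image, and the example box is made up so that it reproduces the approximate values quoted above.

```python
import numpy as np

def encode_box(cx, cy, w, h, class_id, S=3, num_classes=3):
    """Encode one box into the (S, S, 5 + num_classes) target
    [pc, bx, by, bh, bw, c1..cn] (hypothetical helper)."""
    y = np.zeros((S, S, 5 + num_classes))
    # Grid cell that contains the center (clamped so cx = 1.0 stays inside).
    col = min(int(cx * S), S - 1)
    row = min(int(cy * S), S - 1)
    bx, by = cx * S - col, cy * S - row      # center offset inside that cell
    bh, bw = h * S, w * S                    # size in units of one cell
    y[row, col, :5] = [1.0, bx, by, bh, bw]
    y[row, col, 5 + class_id] = 1.0          # one-hot class
    return y

# Made-up car (class index 1, i.e. c2) whose center lies in the middle-right cell.
y = encode_box(cx=0.8, cy=1.3 / 3, w=0.5 / 3, h=0.3, class_id=1)
print(np.round(y[1, 2], 2))   # approximately [1, 0.4, 0.3, 0.9, 0.5, 0, 1, 0]
print(y.sum() - y[1, 2].sum())  # all other cells stay zero
```

Note that bh and bw can be larger than 1 when the object spans more than one grid cell, while bx and by always stay between 0 and 1.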


**Intersection over Union and Non-Max Suppression**

How do we decide whether a predicted bounding box is a good one? This is where Intersection over Union (IoU) comes into the picture. It measures the overlap between the actual bounding box and the predicted bounding box. \\

![image18](https://cdn.analyticsvidhya.com/wp-content/uploads/2018/12/Screenshot-from-2018-11-16-13-12-02.png) \\

IoU = Area of the intersection / Area of the union, i.e.

IoU = Area of yellow box / Area of green box \\
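
The formula translates directly into code. This is a minimal sketch, assuming boxes are given as (x1, y1, x2, y2) corner coordinates; the function name `iou` and the example boxes are illustrative.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    # Width and height are clamped at 0 when the boxes do not overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1 / 7, about 0.143
print(iou((0, 0, 2, 2), (0, 0, 2, 2)))  # 1.0 for identical boxes
print(iou((0, 0, 1, 1), (2, 2, 3, 3)))  # 0.0 for disjoint boxes
```

In the first call the intersection has area 1 and the union has area 4 + 4 - 1 = 7.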

One of the most common problems with object detection algorithms is that an object might be detected multiple times. To improve the output of YOLO significantly, Non-Max Suppression is used. \\

![image19](https://cdn.analyticsvidhya.com/wp-content/uploads/2018/12/Screenshot-from-2018-11-16-13-32-40.png) \\

1. Discard all the boxes having probabilities less than or equal to a pre-defined threshold (for example, 0.5)
2. For the remaining boxes: pick the box with the highest probability and take that as the output prediction. Discard any other box whose IoU with this output box is greater than an IoU threshold
3. Repeat step 2 until all the boxes are either taken as the output prediction or discarded

![image20](https://cdn.analyticsvidhya.com/wp-content/uploads/2018/12/Screenshot-from-2018-11-17-12-21-31.png) \\

The output consists of the boxes with maximum probability; the nearby boxes with non-maximal probabilities are suppressed.
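
The three steps can be sketched in plain Python with NumPy. This is an illustration under stated assumptions: boxes are (x1, y1, x2, y2) corners, both thresholds default to 0.5, the example boxes and scores are made up, and `box_iou` and `non_max_suppression` are names chosen for this example.

```python
import numpy as np

def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def non_max_suppression(boxes, scores, score_threshold=0.5, iou_threshold=0.5):
    """Return the indices of the boxes kept after non-max suppression."""
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float)
    # Step 1: drop low-probability boxes, then sort the rest by score.
    order = [int(i) for i in np.argsort(-scores) if scores[i] > score_threshold]
    keep = []
    while order:                      # Step 3: repeat until nothing is left
        best = order.pop(0)           # Step 2: highest remaining score
        keep.append(best)
        order = [i for i in order
                 if box_iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (0, 0, 5, 5)]
scores = [0.9, 0.8, 0.7, 0.3]
print(non_max_suppression(boxes, scores))  # [0, 2]
```

Box 1 overlaps box 0 with an IoU of about 0.68 and is suppressed, box 3 is removed by the score threshold, and box 2 survives because it does not overlap the others.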

**Anchor Boxes**

So far, each grid cell can identify only one object. Since a single grid cell may contain multiple objects, anchor boxes are used. \\

Each object is assigned to a grid cell based on the location of its midpoint. In the example below, we divide the image into a 3x3 grid and use 2 anchor boxes in each grid cell.

![image21](https://cdn.analyticsvidhya.com/wp-content/uploads/2018/12/Screenshot-from-2018-11-17-13-20-41.png) \\

So the y label for YOLO with 2 anchor boxes will be: \\

![image22](https://cdn.analyticsvidhya.com/wp-content/uploads/2018/12/Screenshot-from-2018-11-17-13-33-31.png) \\
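
With 3 classes, each anchor box needs 8 values (pc, bx, by, bh, bw, c1, c2, c3), so a 3x3 grid with 2 anchor boxes gives a 3x3x16 label. The sketch below only illustrates this layout: `set_anchor` is a hypothetical helper, the two objects are made up, and the choice of which anchor receives which object is hard-coded here rather than computed from the box shapes.

```python
import numpy as np

S, num_anchors, num_classes = 3, 2, 3
per_anchor = 5 + num_classes                 # pc, bx, by, bh, bw, c1, c2, c3
y = np.zeros((S, S, num_anchors * per_anchor))

def set_anchor(y, row, col, anchor, box, class_id):
    """Write one object into the slot of the given anchor (hypothetical helper)."""
    start = anchor * per_anchor
    y[row, col, start:start + 5] = [1.0, *box]        # pc, bx, by, bh, bw
    y[row, col, start + 5 + class_id] = 1.0           # one-hot class

# Two made-up objects whose midpoints fall in the same center cell.
set_anchor(y, 1, 1, anchor=0, box=(0.5, 0.6, 1.8, 0.6), class_id=0)  # tall box
set_anchor(y, 1, 1, anchor=1, box=(0.5, 0.5, 0.7, 1.9), class_id=1)  # wide box
print(y.shape)   # (3, 3, 16)
print(y[1, 1])   # first 8 values: anchor 0, last 8 values: anchor 1
```

Both objects now fit into the same grid cell, which would not be possible with a single 8-value vector per cell.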


# Notebook setting

In [None]:
#TODOS:
#  correct rcnn, yolo,...  diagrams
#  try using pytorch code directly (using same dataset from the tutorial)
#  organise the code
#  use on spine 2d data            img       seg            loc       loc    
#         download 3d dataset: img.nii.gz img_seg.nii.gz img.fcsv img.json
#         1. resize all images  (resize image and segmentation)
#         2. extract 2d images (3 views) (from image and segmentation) 
#         3. compute the center (based on segmentation)
#            image vtLabel1 center x,y       
#         4. create yolo labels (your task)
#         5. augmentation
#    
#  use on spine 3d data 

usePytorch = 1
datasetID  = 4  # 1: MNIST (default), 2: CIFAR10, 3: facial (FDDB), 4: VOC
NNID = 5  # 1: NN (default), 2 or 3: DNN, 4: 3D, 5: YOLO
number_of_classes = 20

batch_size = 2 # you can try larger batch size e.g. 1024 * 6
# if you have large GPU memory you can combine the images to batches 
# for faster training.
# It is good to try different values

lossFunctionID = 4
# 1: SparseCategoricalCrossentropy is by default, for MeanSquaredError use 2
# for CategoricalCrossentropy use 3, for yolo_loss use 4


In [None]:
# TODO:
# complete the introduction
# add the introduction with the figures to this notebook https://github.com/idhamari/Deep-Learning-Coursera/blob/master/Convolutional%20Neural%20Networks/Week3/Car%20detection%20for%20Autonomous%20Driving/Autonomous_driving_application_Car_detection_v3a.ipynb
# define yolo model and its related functions 
# find dataset and use it 

import sys
# Setup 
doInstall = 1
if doInstall:
    !pip install SimpleITK
    !pip install labelImg 


import os, random, time, math, colorsys, imghdr, PIL, csv

from glob import glob

import numpy as np
import scipy.io
import scipy.misc
import pandas as pd

from skimage.transform import resize
import cv2 
import SimpleITK as sitk 
from PIL import Image, ImageDraw, ImageFont

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import datasets, layers, models

from keras import backend as K
from keras.layers import Input, Lambda, Conv2D, BatchNormalization
from keras.models import load_model, Model

import torch
import torch.nn as nn
import torchvision.transforms as transforms
import torch.optim as optim
import torchvision.transforms.functional as FT
from tqdm import tqdm
from torch.utils.data import DataLoader

import matplotlib.pyplot as plt
import matplotlib.patches as patches
from collections import Counter

# Yolo
# from yad2k.yad2k.models.keras_yolo import yolo_head, yolo_boxes_to_corners, preprocess_true_boxes, yolo_loss, yolo_body

%matplotlib inline

# to reproduce the same results given same input
np.random.seed(1)               
!ls



# Image Object Detection


## Reading and exploring the datasets

### Using the MNIST / CIFAR10 datasets

In [None]:
# MNIST dataset
if datasetID == 1:
  (x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
  class_names = range(10)


# cifar10 dataset
if datasetID == 2:
  # The CIFAR10 dataset contains 60,000 color images in 10 classes, 
  # with 6,000 images in each class.
  # The dataset is divided into 50,000 training images and 10,000 testing images.
  # The classes are mutually exclusive and there is no overlap between them.

  (x_train, y_train), (x_test, y_test) = keras.datasets.cifar10.load_data()
  class_names = ['airplane', 'automobile', 'bird', 'cat', 'deer',
              'dog', 'frog', 'horse', 'ship', 'truck']


if NNID == 4:
  # 3D dataset
  # TODO: fix this 
  showSamples =0
  
  x_train = x_train.reshape(-1,32*32*3) # (32 * 32 * 3)        
  x_train = np.resize(x_train,(x_train.shape[0],15,15,15))        
  x_train = x_train.reshape(-1,15,15,15)
  x_test = x_test.reshape(-1,32*32*3) # (32 * 32 * 3)
  x_test = np.resize(x_test,(x_test.shape[0],15,15,15))
  x_test = x_test.reshape(-1,15,15,15)
  x_train  =  x_train[..., np.newaxis] # np.reshape(x_train, (-1, h,w,1))
  y_train  =  y_train[..., np.newaxis] # np.reshape(y_train, (-1, h,w,1))
  x_test   =  x_test[..., np.newaxis]  # np.reshape(x_test,  (-1, h,w,1))
  y_test   =  y_test[..., np.newaxis]  # np.reshape(y_test,  (-1, h,w,1))


### Using facial dataset


We are using the Face Detection Data Set and Benchmark (FDDB) [dataset](http://vis-www.cs.umass.edu/fddb/README.txt) from the University of Massachusetts.  


FDDB-folds contains files with names: FDDB-fold-xx.txt and FDDB-fold-xx-ellipseList.txt, where xx = {01, 02, ..., 10} represents the fold-index.

Each line in the "FDDB-fold-xx.txt" file specifies a path to an image in the above-mentioned data set. For instance, the entry "2002/07/19/big/img_130" corresponds to "originalPics/2002/07/19/big/img_130.jpg."

The corresponding annotations are included in the file "FDDB-fold-xx-ellipseList.txt" in the following format:



```
<image name i>
<number of faces in this image =im>
<face i1>
<face i2>
...
<face im>
```

Here, each face is denoted by:

            major_axis_radius minor_axis_radius angle center_x center_y 1
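
The format above can be parsed with a small state-free loop, and each ellipse can be turned into an axis-aligned bounding box. The following is a sketch under stated assumptions, not the parser used in the next cell: `parse_ellipse_list` and `ellipse_to_bbox` are hypothetical helpers, the sample lines are made up, and the angle is assumed to be in radians and measured from the horizontal axis.

```python
import math

def parse_ellipse_list(lines):
    """Parse FDDB-style annotation lines into [(image_path, [face, ...])],
    where face = (major_r, minor_r, angle, center_x, center_y)."""
    lines = [ln.strip() for ln in lines if ln.strip()]
    records, i = [], 0
    while i < len(lines):
        image_path = lines[i]
        num_faces = int(lines[i + 1])
        faces = [tuple(float(v) for v in lines[i + 2 + k].split()[:5])
                 for k in range(num_faces)]
        records.append((image_path, faces))
        i += 2 + num_faces
    return records

def ellipse_to_bbox(major_r, minor_r, angle, cx, cy):
    """Axis-aligned box (x1, y1, x2, y2) that encloses the rotated ellipse."""
    half_w = math.sqrt((major_r * math.cos(angle)) ** 2 + (minor_r * math.sin(angle)) ** 2)
    half_h = math.sqrt((major_r * math.sin(angle)) ** 2 + (minor_r * math.cos(angle)) ** 2)
    return (cx - half_w, cy - half_h, cx + half_w, cy + half_h)

# Made-up sample in the documented format (not real FDDB entries).
sample = """some/dir/img_1
1
100.0 50.0 1.5707963 200.0 150.0  1
some/dir/img_2
2
60.0 40.0 0.0 100.0 120.0  1
30.0 20.0 0.0 300.0 80.0  1
""".splitlines()

records = parse_ellipse_list(sample)
print(len(records), [len(faces) for _, faces in records])  # 2 [1, 2]
print(ellipse_to_bbox(*records[1][1][0]))  # (40.0, 80.0, 160.0, 160.0)
```

Reading the face count first and then exactly that many face lines avoids relying on the contents of a line (such as the presence of "img" or ".") to decide what kind of line it is.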


In [None]:
showSamples = 1
doDownload = 0

# Facial dataset
# more info http://vis-www.cs.umass.edu/fddb/README.txt

if datasetID == 3:
  doDownload = 1

if doDownload:
  !wget http://vis-www.cs.umass.edu/fddb/originalPics.tar.gz
  !tar zxvf  originalPics.tar.gz

  !wget http://vis-www.cs.umass.edu/fddb/FDDB-folds.tgz
  !tar zxvf FDDB-folds.tgz
  !mkdir originalPics
  !mv 2002 originalPics
  !mv 2003 originalPics

In [None]:
if datasetID == 3:
  trnRatio = 0.90
  tstRatio = 1 - trnRatio
  img_fnms = sorted([y for x in os.walk("originalPics") for y in glob(os.path.join(x[0], '*.jpg'))])
  lbls_fnms = sorted([y for x in os.walk("FDDB-folds") for y in glob(os.path.join(x[0], '*ellipseList.txt'))])
  # print((img_fnms[0:2]))
  # print(len(img_fnms))
  # print((lbls_fnms[0:2]))
  # print(len(lbls_fnms))
  # print("--------------------------")
  lblInfo = []
  numObj = 0
  for fnm in lbls_fnms:
      # use a context manager so the annotation file is closed after reading
      with open(fnm, 'r') as f:
          lines = f.readlines()
      imgLoc=[]
      for x in lines:
          if "img" in x: 
              imgPath = x.strip()
          elif "." in x:     
              imgLoc.append([float(y) for y in x.split()])
              if len(imgLoc) == numObj:
                lblInfo.append([imgPath,numObj,imgLoc,])                   
                imgLoc=[]
                numObj=0
          else:
              numObj = int(x)

  # print("---------------------------")
  # annotations for 5171 faces
  # 2845 images  ???
  # print(len(lblInfo))
  s = 0 
  for x in lblInfo:
  #     print(x[0])
  #     print(x[1])
        s = s  + x[1]
  #     for i in range (x[1]):
  #         print(x[2][i])
  # print(s)
  # imgLst = [1             , 2               ]
  # lblLst = [[1,3,[loc]]   , [1,3,[loc]]     ]

  #(x_train, y_train), (x_test, y_test)
  # class_names = range(10)

### Using VOC dataset

In [None]:
if datasetID == 4:
  if not usePytorch:
    !wget http://host.robots.ox.ac.uk/pascal/VOC/voc2007/VOCtrainval_06-Nov-2007.tar
    !wget http://host.robots.ox.ac.uk/pascal/VOC/voc2007/VOCtest_06-Nov-2007.tar

    !tar xvf VOCtrainval_06-Nov-2007.tar
    !tar xvf VOCtest_06-Nov-2007.tar

    !rm VOCtrainval_06-Nov-2007.tar
    !rm VOCtest_06-Nov-2007.tar
  
  else:
    # DOWNLOAD FROM HERE (FASTER DOWNLOAD)                                          
    # VOC2007 DATASET                                                              
    !wget http://host.robots.ox.ac.uk/pascal/VOC/voc2007/VOCtrainval_06-Nov-2007.tar
    !wget http://host.robots.ox.ac.uk/pascal/VOC/voc2007/VOCtest_06-Nov-2007.tar # 

    # VOC2012 DATASET                                                              
    !wget http://host.robots.ox.ac.uk/pascal/VOC/voc2012/VOCtrainval_11-May-2012.tar

    # Extract tar files
    !tar xf VOCtrainval_11-May-2012.tar
    !tar xf VOCtrainval_06-Nov-2007.tar
    !tar xf VOCtest_06-Nov-2007.tar

    # Need voc_label.py to clean up data from xml files
    !wget https://pjreddie.com/media/files/voc_label.py

    # Run python file to clean data from xml files
    !python voc_label.py

    # Get train by using train+val from 2007 and 2012
    # Then we only test on 2007 test set
    # Unclear from paper what they actually used as a dev set
    !cat 2007_train.txt 2007_val.txt 2012_*.txt > train.txt
    !cp 2007_test.txt test.txt

    # Move txt files we won't be using to clean up a little bit
    !mkdir old_txt_files
    !mv 2007* 2012* old_txt_files/

    def write_split_csv(txt_path, csv_path):
        # Write one "image.jpg,label.txt" row per image path listed in txt_path.
        with open(txt_path, "r") as txt_file, \
             open(csv_path, mode="w", newline="") as csv_file:
            writer = csv.writer(csv_file)
            for line in txt_file:
                image_file = line.split("/")[-1].replace("\n", "")
                text_file = image_file.replace(".jpg", ".txt")
                writer.writerow([image_file, text_file])

    write_split_csv("train.txt", "train.csv")
    write_split_csv("test.txt", "test.csv")

    !mkdir data
    !mkdir data/images
    !mkdir data/labels                                                                       
                                                                                            
    !mv VOCdevkit/VOC2007/JPEGImages/*.jpg data/images/                                      
    !mv VOCdevkit/VOC2012/JPEGImages/*.jpg data/images/                                      
    !mv VOCdevkit/VOC2007/labels/*.txt data/labels/                                          
    !mv VOCdevkit/VOC2012/labels/*.txt data/labels/ 

    # We don't need VOCdevkit folder anymore, can remove
    # in order to save some space 
    !rm -rf VOCdevkit/
    !mv test.txt old_txt_files/
    !mv train.txt old_txt_files/


## Data preparation

In [None]:
if datasetID < 4:
  # get size 
  h = x_train.shape[1] # image height
  w = x_train.shape[2] # image width
  # check for rgb 
  try:
    # number of channels
    c =  x_train.shape[3]
  except IndexError:
    # number of channels
    c =  1
    # if there is no number of channels, add 1
    x_train  =  x_train[..., np.newaxis] # np.reshape(x_train, (-1, h,w,1))
    y_train  =  y_train[..., np.newaxis] # np.reshape(y_train, (-1, h,w,1))
    x_test   =  x_test[..., np.newaxis]  # np.reshape(x_test,  (-1, h,w,1))
    y_test   =  y_test[..., np.newaxis]  # np.reshape(y_test,  (-1, h,w,1))


  # Reserve 10,000 samples for validation.
  x_val = x_train[-10000:]
  y_val = y_train[-10000:]
  x_train = x_train[:-10000]
  y_train = y_train[:-10000]

  number_of_pixels = h * w * c


  # print("dataset shape   : ",x_train.shape)
  # print("number of images: ",x_train.shape[0])
  # print("image size      : ",x_train[0].shape)
  # print("image data type : ",type(x_train[0][0][0][0]))
  # print("image max  value: ",np.max(x_train[0]))
  # print("image min  value: ",np.min(x_train[0]))
  # if c==1:
  #    print("gray or binary image (not color image)")
  # elif c==3:
  #    print("rgb color image (or probably non-color image represented with 3 channels)")


  # display sample images 
  if showSamples:
    plt.figure(figsize=(10,10))
    for i in range(25):
      plt.subplot(5,5,i+1)
      plt.xticks([])
      plt.yticks([])
      plt.grid(False)
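      # Keras datasets are already RGB (or grayscale), so no BGR->RGB swap is needed;
      # squeeze drops the single channel axis of grayscale images
      plt.imshow(np.squeeze(x_train[i]), cmap='gray' if c == 1 else None)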

      # The CIFAR labels happen to be arrays, 
      # which is why you need the extra index
      if datasetID==1:
        plt.xlabel(y_train[i])
      elif datasetID==2:
        plt.xlabel(class_names[y_train[i][0]])
    plt.show()


  # normalisation
  x_train = x_train / 255.0
  x_val   = x_val / 255.0
  x_test  = x_test / 255.0
  #y_train = y_train.astype(np.float32)

  # for NN we need 1D 
  if NNID == 1:
    x_train = np.reshape(x_train, (-1, number_of_pixels))
    x_val = np.reshape(x_val, (-1, number_of_pixels))
    x_test = np.reshape(x_test, (-1, number_of_pixels))

  # Prepare the training dataset.
  # print(x_train.shape,y_train.shape)
  train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train))
  train_dataset = train_dataset.shuffle(buffer_size=1024).batch(batch_size)
  #train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train)).batch(batch_size)

  # Prepare the validation dataset.
  val_dataset = tf.data.Dataset.from_tensor_slices((x_val, y_val))
  val_dataset = val_dataset.batch(batch_size)

  # Prepare the test dataset.
  tst_dataset = tf.data.Dataset.from_tensor_slices((x_test, y_test))
  tst_dataset = tst_dataset.batch(batch_size)

In [None]:
import argparse
import xml.etree.ElementTree as ET
import os
import cv2 as cv
import numpy as np
from tensorflow import keras

if datasetID == 4:
  if not usePytorch:
    parser = argparse.ArgumentParser(description='Build Annotations.')
    parser.add_argument('dir', default='..', help='Annotations.')

    sets = [('2007', 'train'), ('2007', 'val'), ('2007', 'test')]

    classes_num = {'aeroplane': 0, 'bicycle': 1, 'bird': 2, 'boat': 3, 'bottle': 4, 'bus': 5,
                  'car': 6, 'cat': 7, 'chair': 8, 'cow': 9, 'diningtable': 10, 'dog': 11,
                  'horse': 12, 'motorbike': 13, 'person': 14, 'pottedplant': 15, 'sheep': 16,
                  'sofa': 17, 'train': 18, 'tvmonitor': 19}


    def convert_annotation(year, image_id, f):
        in_file = os.path.join('VOCdevkit/VOC%s/Annotations/%s.xml' % (year, image_id))
        tree = ET.parse(in_file)
        root = tree.getroot()

        for obj in root.iter('object'):
            difficult = obj.find('difficult').text
            cls = obj.find('name').text
            classes = list(classes_num.keys())
            if cls not in classes or int(difficult) == 1:
                continue
            cls_id = classes.index(cls)
            xmlbox = obj.find('bndbox')
            b = (int(xmlbox.find('xmin').text), int(xmlbox.find('ymin').text),
                int(xmlbox.find('xmax').text), int(xmlbox.find('ymax').text))
            f.write(' ' + ','.join([str(a) for a in b]) + ',' + str(cls_id))

    for year, image_set in sets:
      print(year, image_set)
      with open(os.path.join('VOCdevkit/VOC%s/ImageSets/Main/%s.txt' % (year, image_set)), 'r') as f:
          image_ids = f.read().strip().split()
      with open(os.path.join("VOCdevkit", '%s_%s.txt' % (year, image_set)), 'w') as f:
          for image_id in image_ids:
              f.write('%s/VOC%s/JPEGImages/%s.jpg' % ("VOCdevkit", year, image_id))
              convert_annotation(year, image_id, f)
              f.write('\n')

    # load data into batches
    def read(image_path, label):
        image = cv.imread(image_path)
        image = cv.cvtColor(image, cv.COLOR_BGR2RGB)
        image_h, image_w = image.shape[0:2]
        image = cv.resize(image, (448, 448))
        image = image / 255.

        label_matrix = np.zeros([7, 7, 30])
        for l in label:
            l = l.split(',')
            l = np.array(l, dtype=int)  # np.int was removed in NumPy 1.24
            xmin = l[0]
            ymin = l[1]
            xmax = l[2]
            ymax = l[3]
            cls = l[4]
            x = (xmin + xmax) / 2 / image_w
            y = (ymin + ymax) / 2 / image_h
            w = (xmax - xmin) / image_w
            h = (ymax - ymin) / image_h
            loc = [7 * x, 7 * y]
            loc_i = int(loc[1])
            loc_j = int(loc[0])
            y = loc[1] - loc_i
            x = loc[0] - loc_j

            if label_matrix[loc_i, loc_j, 24] == 0:
                label_matrix[loc_i, loc_j, cls] = 1
                label_matrix[loc_i, loc_j, 20:24] = [x, y, w, h]
                label_matrix[loc_i, loc_j, 24] = 1  # response

        return image, label_matrix

In [None]:
class My_Custom_Generator(keras.utils.Sequence) :
  
  def __init__(self, images, labels, batch_size) :
    self.images = images
    self.labels = labels
    self.batch_size = batch_size
    
    
  def __len__(self) :
    # np.int was removed in NumPy 1.24; the built-in int works everywhere
    return int(np.ceil(len(self.images) / float(self.batch_size)))
  
  
  def __getitem__(self, idx) :
    batch_x = self.images[idx * self.batch_size : (idx+1) * self.batch_size]
    batch_y = self.labels[idx * self.batch_size : (idx+1) * self.batch_size]

    train_image = []
    train_label = []

    for i in range(0, len(batch_x)):
      img_path = batch_x[i]
      label = batch_y[i]
      image, label_matrix = read(img_path, label)
      train_image.append(image)
      train_label.append(label_matrix)
    return np.array(train_image), np.array(train_label)

if datasetID == 4:
  if not usePytorch:
    train_datasets = []
    val_datasets = []

    with open(os.path.join("VOCdevkit", '2007_train.txt'), 'r') as f:
        train_datasets = train_datasets + f.readlines()
    with open(os.path.join("VOCdevkit", '2007_val.txt'), 'r') as f:
        val_datasets = val_datasets + f.readlines()

    X_train = []
    Y_train = []

    X_val = []
    Y_val = []

    for item in train_datasets:
      item = item.replace("\n", "").split(" ")
      X_train.append(item[0])
      arr = []
      for i in range(1, len(item)):
        arr.append(item[i])
      Y_train.append(arr)

    for item in val_datasets:
      item = item.replace("\n", "").split(" ")
      X_val.append(item[0])
      arr = []
      for i in range(1, len(item)):
        arr.append(item[i])
      Y_val.append(arr)

In [None]:
class VOCDataset(torch.utils.data.Dataset):
    def __init__(
        self, csv_file, img_dir, label_dir, S=7, B=2, C=20, transform=None,
    ):
        self.annotations = pd.read_csv(csv_file)
        self.img_dir = img_dir
        self.label_dir = label_dir
        self.transform = transform
        self.S = S
        self.B = B
        self.C = C

    def __len__(self):
        return len(self.annotations)

    def __getitem__(self, index):
        label_path = os.path.join(self.label_dir, self.annotations.iloc[index, 1])
        boxes = []
        with open(label_path) as f:
            for label in f.readlines():
                class_label, x, y, width, height = [
                    float(x) if float(x) != int(float(x)) else int(float(x))
                    for x in label.replace("\n", "").split()
                ]

                boxes.append([class_label, x, y, width, height])

        img_path = os.path.join(self.img_dir, self.annotations.iloc[index, 0])
        image = Image.open(img_path)
        boxes = torch.tensor(boxes)

        if self.transform:
            # image = self.transform(image)
            image, boxes = self.transform(image, boxes)

        # Convert To Cells
        label_matrix = torch.zeros((self.S, self.S, self.C + 5 * self.B))
        for box in boxes:
            class_label, x, y, width, height = box.tolist()
            class_label = int(class_label)

            # i,j represents the cell row and cell column
            i, j = int(self.S * y), int(self.S * x)
            x_cell, y_cell = self.S * x - j, self.S * y - i

            """
            Calculating the width and height of cell of bounding box,
            relative to the cell is done by the following, with
            width as the example:
            
            width_pixels = (width*self.image_width)
            cell_pixels = (self.image_width / self.S)
            
            Then to find the width relative to the cell is simply:
            width_pixels/cell_pixels, simplification leads to the
            formulas below.
            """
            width_cell, height_cell = (
                width * self.S,
                height * self.S,
            )

            # If no object already found for specific cell i,j
            # Note: This means we restrict to ONE object
            # per cell!
            if label_matrix[i, j, self.C] == 0:
                # Set that there exists an object
                label_matrix[i, j, self.C] = 1

                # Box coordinates
                box_coordinates = torch.tensor(
                    [x_cell, y_cell, width_cell, height_cell]
                )

                label_matrix[i, j, self.C + 1 : self.C + 5] = box_coordinates

                # Set one hot encoding for class_label
                label_matrix[i, j, class_label] = 1

        return image, label_matrix
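
The cell encoding used in `__getitem__` can be checked by hand on one box. The sketch below repeats the same arithmetic in plain Python, assuming a box given as midpoint and size normalized to the image; the helper name `box_to_cell` is hypothetical and not part of the notebook.

```python
def box_to_cell(x, y, width, height, S=7):
    """Map a box normalized to the image ([0, 1]) onto an S x S grid.

    Returns (i, j, x_cell, y_cell, width_cell, height_cell), where i is the
    cell row, j is the cell column, and the other values are in cell units.
    """
    i, j = int(S * y), int(S * x)          # cell that contains the midpoint
    x_cell, y_cell = S * x - j, S * y - i  # midpoint offset inside that cell
    return i, j, x_cell, y_cell, width * S, height * S

i, j, x_cell, y_cell, w_cell, h_cell = box_to_cell(0.6, 0.3, 0.2, 0.4)
print(i, j)                                   # 2 4
print(round(x_cell, 3), round(y_cell, 3))     # 0.2 0.1
print(round(w_cell, 3), round(h_cell, 3))     # 1.4 2.8
```

Note that the width and height in cell units can be larger than 1, because a box may span several cells.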

## Dataset augmentation

It is important to train the model on many variations of the data. It is also important to have a large dataset for training.

Dataset augmentation helps to achieve both goals. From a single image, one can generate hundreds of thousands of images using image transformations.

One type of transformation is the spatial transform (or point transform), where we move the points of the image to new locations, e.g. by shifting, flipping, and/or rotating the image.

Another type is the intensity transform (or pixel transform), where we change the color values of the pixels in the image, e.g. by inverting the colors or making the image brighter or darker.
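
The difference between the two kinds of transformation can be seen on a tiny array. This is a minimal sketch using NumPy only, assuming float intensities in the range [0, 1]; the toy image is made up for illustration.

```python
import numpy as np

# Toy 2x3 grayscale image with float intensities in [0, 1] (assumed range).
img = np.array([[0.0, 0.5, 1.0],
                [0.2, 0.4, 0.9]])

# Spatial (point) transform: pixels move, their values do not change.
flipped = np.fliplr(img)                  # mirror left-right

# Intensity (pixel) transforms: values change, positions do not.
brighter = np.clip(img + 0.3, 0.0, 1.0)   # clip keeps values in [0, 1]
inverted = 1.0 - img

print(flipped)
print(brighter)
print(inverted)
```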



In [None]:
# TODO
doAug = 0

def imagePixelTransforms(img):    
    # let's make 3 simple transformations
    # assumes float pixel values in [0, 1]; clip so the results stay in range
    img1   = 1.0 - img                      # invert color
    img2   = np.clip(img + 0.3, 0.0, 1.0)   # more brightness
    img3   = np.clip(img - 0.3, 0.0, 1.0)   # more darkness
    images = [img1, img2, img3]

    # plt.figure() ;    plt.imshow(img)
    # plt.figure() ;    plt.imshow(img1)
    # plt.figure() ;    plt.imshow(img2)
    # plt.figure() ;    plt.imshow(img3)    
    return images

def imagePointTransforms(img):
    images = []
    # let's make 3 simple transformations
    # Perform the rotation
    # cv2 expects the center as (x, y), i.e. (columns / 2, rows / 2)
    center  = (img.shape[1] / 2, img.shape[0] / 2)
    sz      = (img.shape[1], img.shape[0])
    tMatrix = cv2.getRotationMatrix2D(center, 45, 1)
    img1 = cv2.warpAffine(img, tMatrix, sz)
    img1 = img1[...,np.newaxis] if img1.shape !=img.shape else img1
    tMatrix = cv2.getRotationMatrix2D(center, 90, 1)
    img2 = cv2.warpAffine(img, tMatrix, sz)
    img2 = img2[...,np.newaxis] if img2.shape !=img.shape else img2
    tMatrix = cv2.getRotationMatrix2D(center, 270, 1)
    img3 = cv2.warpAffine(img, tMatrix, sz)
    img3 = img3[...,np.newaxis] if img3.shape !=img.shape else img3

    images = np.array([img1,img2,img3])
    # plt.figure() ;    plt.imshow(img)
    # plt.figure() ;    plt.imshow(img1)
    # plt.figure() ;    plt.imshow(img2)
    # plt.figure() ;    plt.imshow(img3)    
    #plt.figure() ;    plt.imshow(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))
    # plt.figure() ;    plt.imshow(cv2.cvtColor(img1, cv2.COLOR_BGR2RGB))
    return images


# define a function for sitk transform
def resample(img_array, transform):
    # Output image Origin, Spacing, Size, Direction are taken from the reference
    # image in this call to Resample
    image = sitk.GetImageFromArray(img_array)
    reference_image = image
    interpolator = sitk.sitkCosineWindowedSinc
    default_value = 100.0
    resampled_img = sitk.Resample(image, reference_image, transform,
                         interpolator, default_value)
    resampled_array = sitk.GetArrayFromImage(resampled_img)
    return resampled_array

def affine_rotate(transform, degrees):
    parameters = np.array(transform.GetParameters())
    new_transform = sitk.AffineTransform(transform)
    dimension =3 
    matrix = np.array(transform.GetMatrix()).reshape((dimension,dimension))
    radians = -np.pi * degrees / 180.
    rotation = np.array([[1  ,0,0], 
                         [0, np.cos(radians), -np.sin(radians)],
                         [0, np.sin(radians), np.cos(radians)]]
                        )
    new_matrix = np.dot(rotation, matrix)
    new_transform.SetMatrix(new_matrix.ravel())
    return new_transform


def imagePoint3DTransforms(img):
    #print("imagePoint3DTransforms")
    images = []
    # let's make 3 simple transformations
    # Perform the rotation
    # In SimpleITK resampling convention, the transformation maps points 
    # from the fixed image to the moving image,
    # so inverse of the transform is applied

    # SimpleITK uses (x, y, z) order while the numpy array is indexed (z, y, x)
    center = (img.shape[2] / 2, img.shape[1] / 2, img.shape[0] / 2)
    rotation_around_center = sitk.AffineTransform(3)
    rotation_around_center.SetCenter(center)
    
    rotation_around_center = affine_rotate(rotation_around_center, -45)
    img1 = resample(img, rotation_around_center)

    rotation_around_center = affine_rotate(rotation_around_center, -90)
    img2 = resample(img, rotation_around_center)

    rotation_around_center = affine_rotate(rotation_around_center, -90)
    img3 = resample(img, rotation_around_center)

    images = np.array([img1,img2,img3])
    # plt.figure() ;    plt.imshow(img)
    # plt.figure() ;    plt.imshow(img1)
    # plt.figure() ;    plt.imshow(img2)
    # plt.figure() ;    plt.imshow(img3)    
    #plt.figure() ;    plt.imshow(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))
    # plt.figure() ;    plt.imshow(cv2.cvtColor(img1, cv2.COLOR_BGR2RGB))
    return images

def doAugmentation(images,labels,batch_size):
    # input is an image or a batch e.g. list of images 
    # get numpy arrays from the tensor    
    images = images.numpy()
    labels = labels.numpy()
    # if 1d convert back to 2d
    #print(images.shape)
    rgb = 0 ; is3d = 0
    if len(images.shape) == 2:
       try: 
          img2d_shape = int(math.sqrt(images.shape[1])) # gray or binary image
          images =images.reshape(-1,img2d_shape,img2d_shape)
       except ValueError: # reshape fails if the size is not a perfect square
          try: 
            img2d_shape = int(math.sqrt(images.shape[1]/3)) # rgb image
            images =images.reshape(-1,img2d_shape,img2d_shape,3)
            rgb = 1  
          except ValueError:
            pass  
            # img3d_shape = int(math.sqrt(images.shape[1]/3)) # rgb image
            # images =images.reshape(-1,img2d_shape,img2d_shape,3)
            # is3d = 1  



    x_outputs = [] ; y_outputs = []
    i = 0
    for img in images:
        #print("-------------------------", i ,"--------------------")
        if NNID==4:
           img = img.squeeze() 
        # from each image we generate 6 new images (7 in total with the original)
        # so a batch of 64 images will generate 448
        x_outputs.extend([img])
        imgs1 = imagePoint3DTransforms(img)
        imgs2 = imagePixelTransforms(img)
        #if not rgb:
           #imgs1 = np.array( x[...,np.newaxis] for x in imgs1 if len(x.shape)<3) 
           #imgs2 = np.array( x[...,np.newaxis] for x in imgs2 if len(x.shape)<3)
        x_outputs.extend(imgs1) # 3 images
        x_outputs.extend(imgs2) # 3 images
        # print(img.shape)
        # print(imgs1[0].shape)
        # print(imgs2[0].shape)
        # assign the same label to all transformed images
        for j in range ( len(imgs1) +len(imgs2)+1):
            y_outputs.extend([labels[i]])

        i = i +1
    x_outputs = np.array(x_outputs)
    if NNID==4:
       x_outputs = np.array([x[...,np.newaxis] for x in x_outputs])
    y_outputs = np.array(y_outputs)

    if (not rgb) and (NNID==1):
       x_outputs = np.reshape(x_outputs, (-1,img2d_shape*img2d_shape,1))
    elif (rgb) and (NNID==1):
       x_outputs = np.reshape(x_outputs, (-1,img2d_shape*img2d_shape*3))   

    new_train_dataset = tf.data.Dataset.from_tensor_slices((x_outputs, y_outputs))
    new_train_dataset = new_train_dataset.shuffle(buffer_size=1024).batch(batch_size)

    return new_train_dataset
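
`doAugmentation` assigns the original label to every transformed image. That is fine for classification labels and for intensity transforms, but for detection labels a spatial transform moves the objects, so the box coordinates need the same transform. A minimal sketch for a horizontal flip follows, assuming boxes in the normalized midpoint format `[class, x, y, w, h]` used by `VOCDataset`; the helper name `hflip_boxes` is hypothetical.

```python
def hflip_boxes(boxes):
    """Mirror normalized midpoint boxes [class, x, y, w, h] left-right.

    Only the x midpoint changes (x -> 1 - x); y, width and height stay the same.
    """
    return [[c, 1.0 - x, y, w, h] for c, x, y, w, h in boxes]

boxes = [[11, 0.25, 0.5, 0.2, 0.4]]
flipped = hflip_boxes(boxes)
print(flipped)  # [[11, 0.75, 0.5, 0.2, 0.4]]
```

Rotations need more care, because the axis-aligned box that encloses a rotated object generally changes size as well as position.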

In [None]:
# NN TensorFlow
def getNNModel(number_of_pixels,number_of_classes):
    inputs = keras.Input(shape=(number_of_pixels,), name="digits")
    x = layers.Dense(64, activation="relu", name="dense_1")(inputs)
    x = layers.Dense(64, activation="relu", name="dense_2")(x)
    outputs = layers.Dense(number_of_classes, name="predictions")(x)
    model = keras.Model(inputs=inputs, outputs=outputs)
    return model
print("NN model is defined ...")

NN model is defined ...


In [None]:
# TODO
# NN pytorch


## Define optimiser and loss function

In [None]:
# defined loss functions of Yolo

def xywh2minmax(xy, wh):
  xy_min = xy - wh / 2
  xy_max = xy + wh / 2

  return xy_min, xy_max


def iou(pred_mins, pred_maxes, true_mins, true_maxes):
  intersect_mins = K.maximum(pred_mins, true_mins)
  intersect_maxes = K.minimum(pred_maxes, true_maxes)
  intersect_wh = K.maximum(intersect_maxes - intersect_mins, 0.)
  intersect_areas = intersect_wh[..., 0] * intersect_wh[..., 1]

  pred_wh = pred_maxes - pred_mins
  true_wh = true_maxes - true_mins
  pred_areas = pred_wh[..., 0] * pred_wh[..., 1]
  true_areas = true_wh[..., 0] * true_wh[..., 1]

  union_areas = pred_areas + true_areas - intersect_areas
  iou_scores = intersect_areas / union_areas

  return iou_scores


def yolo_head(feats):
  # Dynamic implementation of conv dims for fully convolutional model.
  conv_dims = K.shape(feats)[1:3]  # assuming channels last
  # In YOLO the height index is the inner most iteration.
  conv_height_index = K.arange(0, stop=conv_dims[0])
  conv_width_index = K.arange(0, stop=conv_dims[1])
  conv_height_index = K.tile(conv_height_index, [conv_dims[1]])

  # TODO: Repeat_elements and tf.split doesn't support dynamic splits.
  # conv_width_index = K.repeat_elements(conv_width_index, conv_dims[1], axis=0)
  conv_width_index = K.tile(K.expand_dims(conv_width_index, 0), [conv_dims[0], 1])
  conv_width_index = K.flatten(K.transpose(conv_width_index))
  conv_index = K.transpose(K.stack([conv_height_index, conv_width_index]))
  conv_index = K.reshape(conv_index, [1, conv_dims[0], conv_dims[1], 1, 2])
  conv_index = K.cast(conv_index, K.dtype(feats))

  conv_dims = K.cast(K.reshape(conv_dims, [1, 1, 1, 1, 2]), K.dtype(feats))

  box_xy = (feats[..., :2] + conv_index) / conv_dims * 448
  box_wh = feats[..., 2:4] * 448

  return box_xy, box_wh

def yolo_loss(y_true, y_pred):
  # S is the split size of image
  # B is the num of boxes
  # C is the num of classes

  
  y_pred = K.cast(y_pred, K.dtype(y_true))
  
  label_class   = y_true[..., :20]  # ? * 7 * 7 * 20
  label_box     = y_true[..., 20:24]  # ? * 7 * 7 * 4
  response_mask = y_true[..., 24]  # ? * 7 * 7
  response_mask = K.expand_dims(response_mask)  # ? * 7 * 7 * 1

  predict_class  = y_pred[..., :20]  # ? * 7 * 7 * 20
  predict_trust  = y_pred[..., 20:22]  # ? * 7 * 7 * 2

  predict_box    = y_pred[..., 22:]  # ? * 7 * 7 * 8

  _label_box    = K.reshape(label_box, [-1, 7, 7, 1, 4])
  _predict_box  = K.reshape(predict_box, [-1, 7, 7, 2, 4])

  label_xy, label_wh = yolo_head(_label_box)  # ? * 7 * 7 * 1 * 2, ? * 7 * 7 * 1 * 2
  label_xy = K.expand_dims(label_xy, 3)  # ? * 7 * 7 * 1 * 1 * 2
  label_wh = K.expand_dims(label_wh, 3)  # ? * 7 * 7 * 1 * 1 * 2
  label_xy_min, label_xy_max = xywh2minmax(label_xy, label_wh)  # ? * 7 * 7 * 1 * 1 * 2, ? * 7 * 7 * 1 * 1 * 2

  predict_xy, predict_wh = yolo_head(_predict_box)  # ? * 7 * 7 * 2 * 2, ? * 7 * 7 * 2 * 2
  predict_xy = K.expand_dims(predict_xy, 4)  # ? * 7 * 7 * 2 * 1 * 2
  predict_wh = K.expand_dims(predict_wh, 4)  # ? * 7 * 7 * 2 * 1 * 2
  predict_xy_min, predict_xy_max = xywh2minmax(predict_xy, predict_wh)  # ? * 7 * 7 * 2 * 1 * 2, ? * 7 * 7 * 2 * 1 * 2

  iou_scores = iou(predict_xy_min, predict_xy_max, label_xy_min, label_xy_max)  # ? * 7 * 7 * 2 * 1
  best_ious = K.max(iou_scores, axis=4)  # ? * 7 * 7 * 2
  best_box = K.max(best_ious, axis=3, keepdims=True)  # ? * 7 * 7 * 1

  box_mask = K.cast(best_ious >= best_box, K.dtype(best_ious))  # ? * 7 * 7 * 2

  
  response_mask = K.cast(response_mask, K.dtype(box_mask))

  no_object_loss = 0.5 * (1 - box_mask * response_mask)
  no_object_loss = no_object_loss * K.square(0 - predict_trust)
  object_loss = box_mask * response_mask * K.square(1 - predict_trust)
  confidence_loss = no_object_loss + object_loss
  confidence_loss = K.sum(confidence_loss)

  class_loss = response_mask * K.square(label_class - predict_class)
  class_loss = K.sum(class_loss)

  _label_box = K.reshape(label_box, [-1, 7, 7, 1, 4])
  _predict_box = K.reshape(predict_box, [-1, 7, 7, 2, 4])

  label_xy, label_wh = yolo_head(_label_box)  # ? * 7 * 7 * 1 * 2, ? * 7 * 7 * 1 * 2
  predict_xy, predict_wh = yolo_head(_predict_box)  # ? * 7 * 7 * 2 * 2, ? * 7 * 7 * 2 * 2

  box_mask = K.expand_dims(box_mask)
  response_mask = K.expand_dims(response_mask)

  box_loss  = 5 * box_mask * response_mask * K.square((label_xy - predict_xy) / 448)
  box_loss += 5 * box_mask * response_mask * K.square((K.sqrt(label_wh) - K.sqrt(predict_wh)) / 448)
  box_loss = K.sum(box_loss)

  loss = confidence_loss + class_loss + box_loss

  return loss
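
The `iou` function above works on tensors of corner coordinates. The same computation for a single pair of boxes is sketched below in plain Python so the numbers can be checked by hand; the function name `iou_corners` and the sample boxes are made up for illustration.

```python
def iou_corners(box_a, box_b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    ix_min, iy_min = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix_max, iy_max = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    # Width and height of the overlap are clamped at 0 for disjoint boxes.
    inter = max(ix_max - ix_min, 0.0) * max(iy_max - iy_min, 0.0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Two 2x2 boxes overlapping in a 1x1 square: IoU = 1 / (4 + 4 - 1) = 1/7
print(round(iou_corners((0, 0, 2, 2), (1, 1, 3, 3)), 4))  # 0.1429
print(iou_corners((0, 0, 1, 1), (2, 2, 3, 3)))            # 0.0
```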

In [None]:
# Instantiate an optimizer to train the model.

optimiserID = 1 # 1: SGD (default), 2: Adam
optimizer = keras.optimizers.SGD(learning_rate=1e-3)
if optimiserID ==2:
   optimizer = keras.optimizers.Adam()#learning_rate=0.0001
# Instantiate a loss function.

if lossFunctionID == 1: # 1: SparseCategoricalCrossentropy (default), 2: MSE, 3: CategoricalCrossentropy, 4: YOLO loss
  loss_fn = keras.losses.SparseCategoricalCrossentropy(from_logits=True)

# Prepare the metrics.
train_acc_metric = keras.metrics.SparseCategoricalAccuracy()
val_acc_metric   = keras.metrics.SparseCategoricalAccuracy()
tst_acc_metric   = keras.metrics.SparseCategoricalAccuracy()
if lossFunctionID==2:
   loss_fn = keras.losses.MeanSquaredError()
   # Prepare the metrics.
   train_acc_metric = keras.metrics.MeanSquaredError()
   val_acc_metric   = keras.metrics.MeanSquaredError()
   tst_acc_metric   = keras.metrics.MeanSquaredError()

elif lossFunctionID==3:
   loss_fn = keras.losses.CategoricalCrossentropy()
   # Prepare the metrics.
   train_acc_metric = keras.metrics.CategoricalCrossentropy()
   val_acc_metric   = keras.metrics.CategoricalCrossentropy()
   tst_acc_metric   = keras.metrics.CategoricalCrossentropy()
elif lossFunctionID==4:
   print("loss: yolo_loss")
   loss_fn = yolo_loss
  #  # Prepare the metrics.
  #  train_acc_metric = keras.metrics.CategoricalCrossentropy()
  #  val_acc_metric   = keras.metrics.CategoricalCrossentropy()
  #  tst_acc_metric   = keras.metrics.CategoricalCrossentropy()
print("optimiser, loss, and metrics are defined .... ")


loss: yolo_loss
optimiser, loss, and metrics are defined .... 


In [None]:
# TODO
# Pytorch
class YoloLoss(nn.Module):
    """
    Calculate the loss for yolo (v1) model
    """

    def __init__(self, S=7, B=2, C=20):
        super(YoloLoss, self).__init__()
        self.mse = nn.MSELoss(reduction="sum")

        """
        S is split size of image (in paper 7),
        B is number of boxes (in paper 2),
        C is number of classes (in paper and VOC dataset is 20),
        """
        self.S = S
        self.B = B
        self.C = C

        # These are from Yolo paper, signifying how much we should
        # pay loss for no object (noobj) and the box coordinates (coord)
        self.lambda_noobj = 0.5
        self.lambda_coord = 5

    def forward(self, predictions, target):
        # predictions have shape (BATCH_SIZE, S*S*(C+B*5)) when passed in
        predictions = predictions.reshape(-1, self.S, self.S, self.C + self.B * 5)

        # Calculate IoU for the two predicted bounding boxes with target bbox
        iou_b1 = intersection_over_union(predictions[..., 21:25], target[..., 21:25])
        iou_b2 = intersection_over_union(predictions[..., 26:30], target[..., 21:25])
        ious = torch.cat([iou_b1.unsqueeze(0), iou_b2.unsqueeze(0)], dim=0)

        # Take the box with highest IoU out of the two prediction
        # Note that bestbox will be indices of 0, 1 for which bbox was best
        iou_maxes, bestbox = torch.max(ious, dim=0)
        exists_box = target[..., 20].unsqueeze(3)  # in paper this is Iobj_i

        # ======================== #
        #   FOR BOX COORDINATES    #
        # ======================== #

        # Set boxes with no object in them to 0. We only take out one of the two 
        # predictions, which is the one with highest Iou calculated previously.
        box_predictions = exists_box * (
            (
                bestbox * predictions[..., 26:30]
                + (1 - bestbox) * predictions[..., 21:25]
            )
        )

        box_targets = exists_box * target[..., 21:25]

        # Take sqrt of width, height of boxes so that errors on small boxes
        # weigh more than the same absolute errors on large boxes
        box_predictions[..., 2:4] = torch.sign(box_predictions[..., 2:4]) * torch.sqrt(
            torch.abs(box_predictions[..., 2:4] + 1e-6)
        )
        box_targets[..., 2:4] = torch.sqrt(box_targets[..., 2:4])

        box_loss = self.mse(
            torch.flatten(box_predictions, end_dim=-2),
            torch.flatten(box_targets, end_dim=-2),
        )

        # ==================== #
        #   FOR OBJECT LOSS    #
        # ==================== #

        # pred_box is the confidence score for the bbox with highest IoU
        pred_box = (
            bestbox * predictions[..., 25:26] + (1 - bestbox) * predictions[..., 20:21]
        )

        object_loss = self.mse(
            torch.flatten(exists_box * pred_box),
            torch.flatten(exists_box * target[..., 20:21]),
        )

        # ======================= #
        #   FOR NO OBJECT LOSS    #
        # ======================= #

        #max_no_obj = torch.max(predictions[..., 20:21], predictions[..., 25:26])
        #no_object_loss = self.mse(
        #    torch.flatten((1 - exists_box) * max_no_obj, start_dim=1),
        #    torch.flatten((1 - exists_box) * target[..., 20:21], start_dim=1),
        #)

        no_object_loss = self.mse(
            torch.flatten((1 - exists_box) * predictions[..., 20:21], start_dim=1),
            torch.flatten((1 - exists_box) * target[..., 20:21], start_dim=1),
        )

        no_object_loss += self.mse(
            torch.flatten((1 - exists_box) * predictions[..., 25:26], start_dim=1),
            torch.flatten((1 - exists_box) * target[..., 20:21], start_dim=1)
        )

        # ================== #
        #   FOR CLASS LOSS   #
        # ================== #

        class_loss = self.mse(
            torch.flatten(exists_box * predictions[..., :20], end_dim=-2,),
            torch.flatten(exists_box * target[..., :20], end_dim=-2,),
        )

        loss = (
            self.lambda_coord * box_loss  # first two rows in paper
            + object_loss  # third row in paper
            + self.lambda_noobj * no_object_loss  # fourth row
            + class_loss  # fifth row
        )

        return loss
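
For reference, the five rows mentioned in the comments of `YoloLoss.forward` correspond to the sum-squared-error loss of the YOLO (v1) paper. Here $\mathbb{1}_{ij}^{\text{obj}}$ indicates that box predictor $j$ in cell $i$ is responsible for an object, $\mathbb{1}_{i}^{\text{obj}}$ indicates that an object appears in cell $i$, and hatted symbols are predictions:

$$
\begin{aligned}
\mathcal{L} ={}& \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
&+ \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left( C_i - \hat{C}_i \right)^2 \\
&+ \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}} \left( C_i - \hat{C}_i \right)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}} \sum_{c \in \text{classes}} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
$$

In the class above, $\lambda_{\text{coord}} = 5$ and $\lambda_{\text{noobj}} = 0.5$, $C_i$ is the box confidence, and $p_i(c)$ is the class probability.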


## Define required functions


In [None]:
# define training parameters and file paths 

# model log files path
modelPath   = "./modelClassification.h5"
logFilePath = "./training_log.csv"
figPath     = "./training_log.png"

logFile = open(logFilePath,'w')
logFile.write("epoch \t trnLoss \t valLoss \t trnAcc \t valAcc \t time \n" )
logFile.close()
# Using optimised tensorflow functions provides more speed

@tf.function
def train_step(model,x, y):
    print("train_step")

    with tf.GradientTape() as tape:
        #print("get result")
        model_output = model(x, training=True)
        # print("model_output: ", model_output.shape)
        # print("y: ", y.shape)
        #print("get loss value ")
        #y = keras.utils.to_categorical(y)
        loss_value = loss_fn(y, model_output)
        # print("loss_value: ", loss_value)
    grads = tape.gradient(loss_value, model.trainable_weights)
    # print("grads")
    optimizer.apply_gradients(zip(grads, model.trainable_weights))
    # print("optimizer")
    # print("y: ", y.shape)
    # print("model_output: ", model_output.shape)
    # train_acc_metric.update_state(y, model_output)
    # print("update_state")
    return loss_value

@tf.function
def val_step(model,x, y):
    val_logits = model(x, training=False)
    #y = keras.utils.to_categorical(y)
    loss_value = loss_fn(y, val_logits)
    # val_acc_metric.update_state(y, val_logits)
    return loss_value

# plotting function to monitor the curves
def iaPlotLoss(logPath,figPath=None):
    with open(logPath,'r') as f:
        lst = f.readlines()
    # first line is labels; drop the first (epoch) and last (time) columns:
    labels = lst[0].split()[1:-1]
    x  = [ int(  ln.split()[0]) for ln in lst[1:]] # epoch
    y1 = [ float(ln.split()[1]) for ln in lst[1:]] # lossTrain
    y2 = [ float(ln.split()[2]) for ln in lst[1:]] # lossValidation
    y3 = [ float(ln.split()[3]) for ln in lst[1:]] # accTrain
    y4 = [ float(ln.split()[4]) for ln in lst[1:]] # accValidation
    #plotting    
    plt.clf()
    fig, ax = plt.subplots()    
    l1, = ax.plot(x, y1) ;     l2, = ax.plot(x, y2) ;
    l3, = ax.plot(x, y3) ;     l4, = ax.plot(x, y4) ;
    ax.legend((l1, l2,l3,l4), labels, loc='upper right', shadow=True)
    plt.xlabel('epoch')
    if figPath:
        plt.savefig(figPath, bbox_inches='tight')
    else:
        plt.show()
        plt.close()
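
The log file is plain tab-separated text, so one row can be inspected without plotting. The sketch below uses the header written to `logFilePath` earlier and a made-up data row; the numeric values are hypothetical. The header has six columns, and the four curve labels are the ones between the first (epoch) and last (time) column.

```python
header = "epoch \t trnLoss \t valLoss \t trnAcc \t valAcc \t time \n"
row    = "3 \t 0.4210 \t 0.5032 \t 0.8710 \t 0.8405 \t 12.50 \n"  # hypothetical values

labels = header.split()[1:-1]            # drop 'epoch' and 'time'
fields = row.split()
epoch  = int(fields[0])
curves = dict(zip(labels, map(float, fields[1:-1])))

print(labels)  # ['trnLoss', 'valLoss', 'trnAcc', 'valAcc']
print(epoch, curves["valAcc"])  # 3 0.8405
```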


In [None]:
if NNID==1:
    # Load the saved model 
    model = keras.models.load_model(modelPath, compile=False)

    start_time = time.time() 
    # Run the evaluation loop over the whole test set.
    for x_batch_tst, y_batch_tst in tst_dataset:
        output = model.predict(x_batch_tst)
        #y = keras.utils.to_categorical(y_batch_tst)
        tst_acc_metric.update_state(y_batch_tst, output)

    tst_acc = tst_acc_metric.result()

    # compute the total time required for the evaluation
    end_time = time.time() - start_time

    print("test accuracy : %.4f \t time:  %.2f" % (  float(tst_acc), end_time))

In [None]:
# TODO Pytorch
def intersection_over_union(boxes_preds, boxes_labels, box_format="midpoint"):
    """
    Calculates intersection over union
    Parameters:
        boxes_preds (tensor): Predictions of Bounding Boxes (BATCH_SIZE, 4)
        boxes_labels (tensor): Correct labels of Bounding Boxes (BATCH_SIZE, 4)
        box_format (str): midpoint/corners, if boxes (x,y,w,h) or (x1,y1,x2,y2)
    Returns:
        tensor: Intersection over union for all examples
    """

    if box_format == "midpoint":
        box1_x1 = boxes_preds[..., 0:1] - boxes_preds[..., 2:3] / 2
        box1_y1 = boxes_preds[..., 1:2] - boxes_preds[..., 3:4] / 2
        box1_x2 = boxes_preds[..., 0:1] + boxes_preds[..., 2:3] / 2
        box1_y2 = boxes_preds[..., 1:2] + boxes_preds[..., 3:4] / 2
        box2_x1 = boxes_labels[..., 0:1] - boxes_labels[..., 2:3] / 2
        box2_y1 = boxes_labels[..., 1:2] - boxes_labels[..., 3:4] / 2
        box2_x2 = boxes_labels[..., 0:1] + boxes_labels[..., 2:3] / 2
        box2_y2 = boxes_labels[..., 1:2] + boxes_labels[..., 3:4] / 2

    if box_format == "corners":
        box1_x1 = boxes_preds[..., 0:1]
        box1_y1 = boxes_preds[..., 1:2]
        box1_x2 = boxes_preds[..., 2:3]
        box1_y2 = boxes_preds[..., 3:4]  # (N, 1)
        box2_x1 = boxes_labels[..., 0:1]
        box2_y1 = boxes_labels[..., 1:2]
        box2_x2 = boxes_labels[..., 2:3]
        box2_y2 = boxes_labels[..., 3:4]

    x1 = torch.max(box1_x1, box2_x1)
    y1 = torch.max(box1_y1, box2_y1)
    x2 = torch.min(box1_x2, box2_x2)
    y2 = torch.min(box1_y2, box2_y2)

    # .clamp(0) is for the case when they do not intersect
    intersection = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)

    box1_area = abs((box1_x2 - box1_x1) * (box1_y2 - box1_y1))
    box2_area = abs((box2_x2 - box2_x1) * (box2_y2 - box2_y1))

    return intersection / (box1_area + box2_area - intersection + 1e-6)


def non_max_suppression(bboxes, iou_threshold, threshold, box_format="corners"):
    """
    Does Non Max Suppression given bboxes
    Parameters:
        bboxes (list): list of lists containing all bboxes, each
        specified as [class_pred, prob_score, x1, y1, x2, y2]
        iou_threshold (float): IoU at or above which a lower-scoring bbox of the same class is suppressed
        threshold (float): minimum prob_score for a bbox to be kept (independent of IoU)
        box_format (str): "midpoint" or "corners" used to specify bboxes
    Returns:
        list: bboxes after performing NMS given a specific IoU threshold
    """

    assert type(bboxes) == list

    bboxes = [box for box in bboxes if box[1] > threshold]
    bboxes = sorted(bboxes, key=lambda x: x[1], reverse=True)
    bboxes_after_nms = []

    while bboxes:
        chosen_box = bboxes.pop(0)

        bboxes = [
            box
            for box in bboxes
            if box[0] != chosen_box[0]
            or intersection_over_union(
                torch.tensor(chosen_box[2:]),
                torch.tensor(box[2:]),
                box_format=box_format,
            )
            < iou_threshold
        ]

        bboxes_after_nms.append(chosen_box)

    return bboxes_after_nms
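
The behaviour of non-max suppression is easier to follow on a handful of boxes. The sketch below re-implements the same greedy procedure in plain Python for corner-format boxes, so it runs without PyTorch; the names `iou_corners` and `nms` and the sample boxes are made up for illustration.

```python
def iou_corners(a, b):
    """IoU of two boxes given as [x1, y1, x2, y2]."""
    iw = max(min(a[2], b[2]) - max(a[0], b[0]), 0.0)
    ih = max(min(a[3], b[3]) - max(a[1], b[1]), 0.0)
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(bboxes, iou_threshold, threshold):
    """Greedy NMS on boxes [class_pred, prob_score, x1, y1, x2, y2]."""
    # 1. drop low-confidence boxes, 2. visit the rest from highest score down
    bboxes = sorted((b for b in bboxes if b[1] > threshold),
                    key=lambda b: b[1], reverse=True)
    kept = []
    while bboxes:
        chosen = bboxes.pop(0)
        kept.append(chosen)
        # keep a box if it is another class or overlaps the chosen one too little
        bboxes = [b for b in bboxes
                  if b[0] != chosen[0]
                  or iou_corners(chosen[2:], b[2:]) < iou_threshold]
    return kept

boxes = [
    [0, 0.9, 0.0, 0.0, 1.0, 1.0],
    [0, 0.8, 0.1, 0.0, 1.1, 1.0],  # same class, IoU with the first is about 0.82
    [1, 0.7, 0.1, 0.0, 1.1, 1.0],  # same place, different class
    [0, 0.1, 3.0, 3.0, 4.0, 4.0],  # below the score threshold
]
kept = nms(boxes, iou_threshold=0.5, threshold=0.2)
print([b[:2] for b in kept])  # [[0, 0.9], [1, 0.7]]
```

Suppression only happens within a class, which is why the third box survives even though it overlaps the first one heavily.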


def mean_average_precision(
    pred_boxes, true_boxes, iou_threshold=0.5, box_format="midpoint", num_classes=20
):
    """
    Calculates mean average precision 
    Parameters:
        pred_boxes (list): list of lists containing all bboxes with each bboxes
        specified as [train_idx, class_prediction, prob_score, x1, y1, x2, y2]
        true_boxes (list): Similar as pred_boxes except all the correct ones 
        iou_threshold (float): IoU above which a predicted bbox counts as correct
        box_format (str): "midpoint" or "corners" used to specify bboxes
        num_classes (int): number of classes
    Returns:
        float: mAP value across all classes given a specific IoU threshold 
    """

    # list storing all AP for respective classes
    average_precisions = []

    # used for numerical stability later on
    epsilon = 1e-6

    for c in range(num_classes):
        detections = []
        ground_truths = []

        # Go through all predictions and targets,
        # and only add the ones that belong to the
        # current class c
        for detection in pred_boxes:
            if detection[1] == c:
                detections.append(detection)

        for true_box in true_boxes:
            if true_box[1] == c:
                ground_truths.append(true_box)

        # find the amount of bboxes for each training example
        # Counter here finds how many ground truth bboxes we get
        # for each training example, so let's say img 0 has 3,
        # img 1 has 5 then we will obtain a dictionary with:
        # amount_bboxes = {0:3, 1:5}
        amount_bboxes = Counter([gt[0] for gt in ground_truths])

        # We then go through each key, val in this dictionary
        # and convert to the following (w.r.t same example):
        # amount_bboxes = {0:torch.tensor[0,0,0], 1:torch.tensor[0,0,0,0,0]}
        for key, val in amount_bboxes.items():
            amount_bboxes[key] = torch.zeros(val)

        # sort by box probabilities which is index 2
        detections.sort(key=lambda x: x[2], reverse=True)
        TP = torch.zeros((len(detections)))
        FP = torch.zeros((len(detections)))
        total_true_bboxes = len(ground_truths)
        
        # If none exists for this class then we can safely skip
        if total_true_bboxes == 0:
            continue

        for detection_idx, detection in enumerate(detections):
            # Only take out the ground_truths that have the same
            # training idx as detection
            ground_truth_img = [
                bbox for bbox in ground_truths if bbox[0] == detection[0]
            ]

            num_gts = len(ground_truth_img)
            best_iou = 0

            for idx, gt in enumerate(ground_truth_img):
                iou = intersection_over_union(
                    torch.tensor(detection[3:]),
                    torch.tensor(gt[3:]),
                    box_format=box_format,
                )

                if iou > best_iou:
                    best_iou = iou
                    best_gt_idx = idx

            if best_iou > iou_threshold:
                # only detect ground truth detection once
                if amount_bboxes[detection[0]][best_gt_idx] == 0:
                    # true positive and add this bounding box to seen
                    TP[detection_idx] = 1
                    amount_bboxes[detection[0]][best_gt_idx] = 1
                else:
                    FP[detection_idx] = 1

            # if IOU is lower then the detection is a false positive
            else:
                FP[detection_idx] = 1

        TP_cumsum = torch.cumsum(TP, dim=0)
        FP_cumsum = torch.cumsum(FP, dim=0)
        recalls = TP_cumsum / (total_true_bboxes + epsilon)
        precisions = torch.divide(TP_cumsum, (TP_cumsum + FP_cumsum + epsilon))
        precisions = torch.cat((torch.tensor([1]), precisions))
        recalls = torch.cat((torch.tensor([0]), recalls))
        # torch.trapz for numerical integration
        average_precisions.append(torch.trapz(precisions, recalls))

    return sum(average_precisions) / len(average_precisions)
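
The average precision of one class is the area under its precision-recall curve, which `mean_average_precision` integrates with the trapezoidal rule. The sketch below repeats that last step in plain Python for a hand-checkable case. The function name `average_precision` and the flags are made up; the input is assumed to be the true-positive flags of the detections, already sorted by descending score.

```python
from itertools import accumulate

def average_precision(tp_flags, total_true_bboxes, epsilon=1e-6):
    """Area under the precision-recall curve for one class.

    tp_flags: 1 for a true positive, 0 for a false positive,
              ordered by descending confidence.
    """
    tp_cum = list(accumulate(tp_flags))
    fp_cum = list(accumulate(1 - t for t in tp_flags))
    # Start the curve at recall 0, precision 1, as in the function above.
    recalls = [0.0] + [tp / (total_true_bboxes + epsilon) for tp in tp_cum]
    precisions = [1.0] + [tp / (tp + fp + epsilon)
                          for tp, fp in zip(tp_cum, fp_cum)]
    # Trapezoidal rule over consecutive (recall, precision) points.
    return sum((recalls[k] - recalls[k - 1]) * (precisions[k] + precisions[k - 1]) / 2
               for k in range(1, len(recalls)))

# Three detections (hit, miss, hit) against two ground-truth boxes:
# precision = 1, 1/2, 2/3 and recall = 1/2, 1/2, 1
ap = average_precision([1, 0, 1], total_true_bboxes=2)
print(round(ap, 3))  # 0.792
```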


def plot_image(image, boxes):
    """Plots predicted bounding boxes on the image"""
    im = np.array(image)
    height, width, _ = im.shape

    # Create figure and axes
    fig, ax = plt.subplots(1)
    # Display the image
    ax.imshow(im)

    # box[0] is x midpoint, box[2] is width
    # box[1] is y midpoint, box[3] is height

    # Create a Rectangle patch
    for box in boxes:
        box = box[2:]
        assert len(box) == 4, "Got more values than in x, y, w, h, in a box!"
        upper_left_x = box[0] - box[2] / 2
        upper_left_y = box[1] - box[3] / 2
        rect = patches.Rectangle(
            (upper_left_x * width, upper_left_y * height),
            box[2] * width,
            box[3] * height,
            linewidth=1,
            edgecolor="r",
            facecolor="none",
        )
        # Add the patch to the Axes
        ax.add_patch(rect)

    plt.show()

def get_bboxes(
    loader,
    model,
    iou_threshold,
    threshold,
    pred_format="cells",
    box_format="midpoint",
    device="cuda",
):
    all_pred_boxes = []
    all_true_boxes = []

    # make sure model is in eval before get bboxes
    model.eval()
    train_idx = 0

    for batch_idx, (x, labels) in enumerate(loader):
        x = x.to(device)
        labels = labels.to(device)

        with torch.no_grad():
            predictions = model(x)

        batch_size = x.shape[0]
        true_bboxes = cellboxes_to_boxes(labels)
        bboxes = cellboxes_to_boxes(predictions)

        for idx in range(batch_size):
            nms_boxes = non_max_suppression(
                bboxes[idx],
                iou_threshold=iou_threshold,
                threshold=threshold,
                box_format=box_format,
            )


            #if batch_idx == 0 and idx == 0:
            #    plot_image(x[idx].permute(1,2,0).to("cpu"), nms_boxes)
            #    print(nms_boxes)

            for nms_box in nms_boxes:
                all_pred_boxes.append([train_idx] + nms_box)

            for box in true_bboxes[idx]:
                # most grid cells hold no object (confidence 0), so keep only those above the threshold
                if box[1] > threshold:
                    all_true_boxes.append([train_idx] + box)

            train_idx += 1

    model.train()
    return all_pred_boxes, all_true_boxes



def convert_cellboxes(predictions, S=7):
    """
    Converts bounding boxes output by YOLO, with an image split size of S,
    from cell-relative ratios to whole-image ratios. This version is
    vectorized, which makes it quite hard to read, so it can be treated
    as a black box. A more intuitive alternative uses two for loops over
    range(S) and converts the cells one by one, which is slower but
    more readable.
    """

    predictions = predictions.to("cpu")
    batch_size = predictions.shape[0]
    # use the split size S instead of a hardcoded 7 (30 = 20 classes + 2 boxes * 5 values)
    predictions = predictions.reshape(batch_size, S, S, 30)
    bboxes1 = predictions[..., 21:25]
    bboxes2 = predictions[..., 26:30]
    scores = torch.cat(
        (predictions[..., 20].unsqueeze(0), predictions[..., 25].unsqueeze(0)), dim=0
    )
    # index (0 or 1) of the more confident box in each cell
    best_box = scores.argmax(0).unsqueeze(-1)
    best_boxes = bboxes1 * (1 - best_box) + best_box * bboxes2
    # column index of every cell; permuting it below gives the row index
    cell_indices = torch.arange(S).repeat(batch_size, S, 1).unsqueeze(-1)
    x = 1 / S * (best_boxes[..., :1] + cell_indices)
    y = 1 / S * (best_boxes[..., 1:2] + cell_indices.permute(0, 2, 1, 3))
    w_h = 1 / S * best_boxes[..., 2:4]
    converted_bboxes = torch.cat((x, y, w_h), dim=-1)
    predicted_class = predictions[..., :20].argmax(-1).unsqueeze(-1)
    best_confidence = torch.max(predictions[..., 20], predictions[..., 25]).unsqueeze(
        -1
    )
    converted_preds = torch.cat(
        (predicted_class, best_confidence, converted_bboxes), dim=-1
    )

    return converted_preds


def cellboxes_to_boxes(out, S=7):
    converted_pred = convert_cellboxes(out, S).reshape(out.shape[0], S * S, -1)
    converted_pred[..., 0] = converted_pred[..., 0].long()
    all_bboxes = []

    for ex_idx in range(out.shape[0]):
        bboxes = []

        for bbox_idx in range(S * S):
            bboxes.append([x.item() for x in converted_pred[ex_idx, bbox_idx, :]])
        all_bboxes.append(bboxes)

    return all_bboxes

def save_checkpoint(state, filename="my_checkpoint.pth.tar"):
    print("=> Saving checkpoint")
    torch.save(state, filename)


def load_checkpoint(checkpoint, model, optimizer):
    print("=> Loading checkpoint")
    model.load_state_dict(checkpoint["state_dict"])
    optimizer.load_state_dict(checkpoint["optimizer"])
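
The docstring of `convert_cellboxes` mentions a slower but more readable alternative that uses two loops over `range(S)`. The following is a minimal sketch of that idea for a single image, written in plain Python so it can be read without knowing tensor indexing. The function name `convert_cellboxes_loops` and the nested-list input are assumptions made for illustration. It assumes the same 30-value cell layout as the helpers above: 20 class scores, then confidence and (x, y, w, h) for each of the two boxes.

```python
def convert_cellboxes_loops(cells, S=7, C=20):
    """Convert one image's S x S grid of cell predictions to image-relative boxes.

    cells[row][col] is a flat sequence of C + 10 numbers:
    C class scores, then (confidence, x, y, w, h) for each of two boxes.
    Returns S*S entries of [class_index, confidence, x, y, w, h].
    """
    boxes = []
    for row in range(S):
        for col in range(S):
            cell = cells[row][col]
            class_scores = cell[:C]
            conf1, box1 = cell[C], cell[C + 1:C + 5]
            conf2, box2 = cell[C + 5], cell[C + 6:C + 10]
            # keep whichever of the two boxes is more confident (box 1 on ties)
            if conf2 > conf1:
                conf, (x, y, w, h) = conf2, box2
            else:
                conf, (x, y, w, h) = conf1, box1
            predicted_class = max(range(C), key=lambda k: class_scores[k])
            boxes.append([
                predicted_class,
                conf,
                (x + col) / S,  # the column index shifts x
                (y + row) / S,  # the row index shifts y
                w / S,
                h / S,
            ])
    return boxes


# Hypothetical input: an empty grid with one object in row 2, column 3
cells = [[[0.0] * 30 for _ in range(7)] for _ in range(7)]
cells[2][3][5] = 1.0                            # class 5
cells[2][3][25:30] = [0.9, 0.5, 0.5, 1.0, 2.0]  # the second box is more confident
print(convert_cellboxes_loops(cells)[2 * 7 + 3])
```

The loops make the role of the cell indices explicit: the column index is added to x and the row index to y before dividing by `S`, which is what the `cell_indices` tensor and its permutation do in the vectorized version.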

## Creating the model

In [None]:
# define the output layer of YOLO

class Yolo_Reshape(tf.keras.layers.Layer):
  def __init__(self, target_shape):
    super(Yolo_Reshape, self).__init__()
    self.target_shape = tuple(target_shape)

  def get_config(self):
    config = super().get_config().copy()
    config.update({
        'target_shape': self.target_shape
    })
    return config

  def call(self, input):
    # grids 7x7
    S = [self.target_shape[0], self.target_shape[1]]
    # classes
    C = number_of_classes
    # no of bounding boxes per grid
    B = 2

    idx1 = S[0] * S[1] * C
    idx2 = idx1 + S[0] * S[1] * B
    
    # class probabilities
    class_probs = K.reshape(input[:, :idx1], (K.shape(input)[0],) + tuple([S[0], S[1], C]))
    class_probs = K.softmax(class_probs)

    #confidence
    confs = K.reshape(input[:, idx1:idx2], (K.shape(input)[0],) + tuple([S[0], S[1], B]))
    confs = K.sigmoid(confs)

    # boxes
    boxes = K.reshape(input[:, idx2:], (K.shape(input)[0],) + tuple([S[0], S[1], B * 4]))
    boxes = K.sigmoid(boxes)

    outputs = K.concatenate([class_probs, confs, boxes])
    return outputs
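
To make the slicing in `Yolo_Reshape.call` easier to follow, here is a small sketch that only computes the offsets `idx1` and `idx2` for the default setting used in this notebook (a 7x7 grid, 20 classes, 2 boxes per cell). The helper name `yolo_flat_layout` is hypothetical and not part of the notebook.

```python
def yolo_flat_layout(S=(7, 7), C=20, B=2):
    """Return the (start, stop) offsets of each block in the flat output vector."""
    n_cells = S[0] * S[1]
    idx1 = n_cells * C              # end of the class probabilities
    idx2 = idx1 + n_cells * B       # end of the box confidences
    total = idx2 + n_cells * B * 4  # end of the box coordinates
    return {
        "class_probs": (0, idx1),
        "confs": (idx1, idx2),
        "boxes": (idx2, total),
    }


layout = yolo_flat_layout()
for name, (start, stop) in layout.items():
    print(f"{name}: [{start}:{stop}] -> {stop - start} values")
# class_probs: [0:980] -> 980 values
# confs: [980:1078] -> 98 values
# boxes: [1078:1470] -> 392 values
```

The total of 1470 values matches the size of the last dense layer in `getYoloDNNModel`. Note that after the concatenation each cell holds 20 class values, then 2 confidences, then 8 box values, which is a different ordering from the one assumed by the PyTorch helper `convert_cellboxes` (confidences at indices 20 and 25).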

In [None]:
# Simple DNN
# just two convolutional layers followed by a dense layer
def getSimpleDNNModel(input_shape,number_of_pixels,number_of_classes):
    nF        = 16 # number of filters
    inputs = keras.Input(shape=input_shape, name="images") 
    # Create CNN model
    x11  = layers.Conv2D(nF, (3, 3), activation='relu', input_shape=input_shape) (inputs)
    x13  = layers.MaxPooling2D((2, 2)) (x11)
    x21  = layers.Conv2D(2*nF, (3, 3), activation='relu') (x13)
    x23  = layers.MaxPooling2D((2, 2))(x21)
    #dense layer for classification
    x31 = layers.Flatten()(x23)# convert from 3d to 1d
    outputs = layers.Dense(number_of_classes, name="predictions")(x31)
    model = keras.Model(inputs=inputs, outputs=outputs)
    return model

# this is a better model for CIFAR10
def getDNNModel(input_shape,number_of_pixels,number_of_classes):
    nF        = 64 # number of filters
    inputs = keras.Input(shape=input_shape, name="images") 
    # Create CNN model
    x11  = layers.Conv2D(nF, (3, 3), activation='relu', input_shape=input_shape) (inputs)
    x12  = layers.BatchNormalization()(x11)
    x13  = layers.MaxPooling2D((2, 2)) (x12)
    x14  = layers.Dropout(0.25)(x13)
    x21  = layers.Conv2D(2*nF, (3, 3), activation='relu') (x14)
    x22  = layers.BatchNormalization()(x21)
    x23  = layers.MaxPooling2D((2, 2))(x22)
    x24  = layers.Dropout(0.25)(x23)
    x31  = layers.Conv2D(2*nF, (3, 3), activation='relu')(x24)
    #dense layer for classification
    x41 = layers.Flatten()(x31)# convert from 3d to 1d
    #x7 = layers.Dense(2*nF, activation='relu')(x6)
    #x8 = layers.Dense(2*nF, activation='relu')(x7)
    x42  = layers.Dropout(0.50)(x41)
    outputs = layers.Dense(number_of_classes, name="predictions")(x42)
    model = keras.Model(inputs=inputs, outputs=outputs)
    return model

def getSimpleDNN3DModel(input_shape,number_of_pixels,number_of_classes):
    nF        = 16 # number of filters
    inputs = keras.Input(shape=input_shape, name="images") 
    # Create CNN model
    x11  = layers.Conv3D(nF, (3, 3, 3), activation='relu', input_shape=input_shape) (inputs)
    x13  = layers.MaxPooling3D((2, 2, 2)) (x11)
    x21  = layers.Conv3D(2*nF, (3, 3, 3), activation='relu') (x13)
    x23  = layers.MaxPooling3D((2, 2 ,2))(x21)
    #dense layer for classification
    x31 = layers.Flatten()(x23)# convert from 3d to 1d
    outputs = layers.Dense(number_of_classes, name="predictions")(x31)
    model = keras.Model(inputs=inputs, outputs=outputs)
    return model

#========================================
#              Yolo  
#========================================

def getYoloDNNModel(input_shape,number_of_pixels,number_of_classes):
    # input_shape = [height, width, num of channels] (Keras adds the batch dimension itself)
    nF        = 16 # base number of filters
    lrelu = layers.LeakyReLU(alpha=0.1)
    inputs = keras.Input(shape=input_shape, name="images")
    # Create Yolo model

    x011  = layers.Conv2D(nF*4, (7, 7), 2, padding = 'same', activation=lrelu, input_shape=input_shape) (inputs)
    x012  = layers.MaxPooling2D((2, 2), 2, padding = 'same') (x011)
   
    x021  = layers.Conv2D(nF*12, (3, 3), 1, padding = 'same', activation=lrelu) (x012)
    x022  = layers.MaxPooling2D((2, 2), 2, padding = 'same') (x021)
    
    x030  = layers.Conv2D(nF*8, (1, 1), 1, padding = 'same', activation=lrelu)  (x022)
    x040  = layers.Conv2D(nF*16, (3, 3), 1, padding = 'same', activation=lrelu) (x030)
    x050  = layers.Conv2D(nF*16, (1, 1), 1, padding = 'same', activation=lrelu) (x040)
    
    x061  = layers.Conv2D(nF*32, (3, 3), 1, padding = 'same', activation=lrelu) (x050)
    x062  = layers.MaxPooling2D((2, 2), padding = 'same') (x061)
    
    x070  = layers.Conv2D(nF*16, (1, 1), 1, padding = 'same', activation=lrelu) (x062)
    x080  = layers.Conv2D(nF*32, (3, 3), 1, padding = 'same', activation=lrelu) (x070)
    x090  = layers.Conv2D(nF*16, (1, 1), 1, padding = 'same', activation=lrelu) (x080)
    x100  = layers.Conv2D(nF*32, (3, 3), 1, padding = 'same', activation=lrelu) (x090)
    x110  = layers.Conv2D(nF*16, (1, 1), 1, padding = 'same', activation=lrelu) (x100)
    x120  = layers.Conv2D(nF*32, (3, 3), 1, padding = 'same', activation=lrelu) (x110)
    x130  = layers.Conv2D(nF*16, (1, 1), 1, padding = 'same', activation=lrelu) (x120)
    x140  = layers.Conv2D(nF*32, (3, 3), 1, padding = 'same', activation=lrelu) (x130)
    x150  = layers.Conv2D(nF*32, (1, 1), 1, padding = 'same', activation=lrelu) (x140)

    x161  = layers.Conv2D(nF*64, (3, 3), 1, padding = 'same', activation=lrelu) (x150)
    x162  = layers.MaxPooling2D((2, 2), 2, padding = 'same') (x161)

    x170  = layers.Conv2D(nF*32, (1, 1), 1, padding = 'same', activation=lrelu) (x162)
    x180  = layers.Conv2D(nF*64, (3, 3), 1, padding = 'same', activation=lrelu) (x170)
    x190  = layers.Conv2D(nF*32, (1, 1), 1, padding = 'same', activation=lrelu) (x180)
    x200  = layers.Conv2D(nF*64, (3, 3), 1, padding = 'same', activation=lrelu) (x190)
    x210  = layers.Conv2D(nF*64, (3, 3), 1, padding = 'same', activation=lrelu) (x200)
    x220  = layers.Conv2D(nF*64, (3, 3), 2, padding = 'same', activation=lrelu) (x210)
    x230  = layers.Conv2D(nF*64, (3, 3), 1, padding = 'same', activation=lrelu) (x220)
    x240  = layers.Conv2D(nF*64, (3, 3), 1, padding = 'same', activation=lrelu) (x230)


    # x011  = layers.Conv2D(filters=64, kernel_size= (7, 7), strides=(1, 1), input_shape=input_shape, padding = 'same', activation=lrelu) (inputs)
    # x012  = layers.MaxPooling2D(pool_size=(2, 2), strides=(2, 2), padding = 'same') (x011)

    # x021  = layers.Conv2D(filters=192, kernel_size= (3, 3), padding = 'same', activation=lrelu) (x012)
    # x022  = layers.MaxPooling2D(pool_size=(2, 2), strides=(2, 2), padding = 'same') (x021)

    # x030  = layers.Conv2D(filters=128, kernel_size= (1, 1), padding = 'same', activation=lrelu) (x022)
    # x040  = layers.Conv2D(filters=256, kernel_size= (3, 3), padding = 'same', activation=lrelu) (x030)
    # x050  = layers.Conv2D(filters=256, kernel_size= (1, 1), padding = 'same', activation=lrelu) (x040)
    # x061  = layers.Conv2D(filters=512, kernel_size= (3, 3), padding = 'same', activation=lrelu) (x050)
    # x062  = layers.MaxPooling2D(pool_size=(2, 2), strides=(2, 2), padding = 'same') (x061)

    # x070  = layers.Conv2D(filters=256, kernel_size= (1, 1), padding = 'same', activation=lrelu) (x062)
    # x080  = layers.Conv2D(filters=512, kernel_size= (3, 3), padding = 'same', activation=lrelu) (x070)
    # x090  = layers.Conv2D(filters=256, kernel_size= (1, 1), padding = 'same', activation=lrelu) (x080)
    # x100  = layers.Conv2D(filters=512, kernel_size= (3, 3), padding = 'same', activation=lrelu) (x090)
    # x110  = layers.Conv2D(filters=256, kernel_size= (1, 1), padding = 'same', activation=lrelu) (x100)
    # x120  = layers.Conv2D(filters=512, kernel_size= (3, 3), padding = 'same', activation=lrelu) (x110)
    # x130  = layers.Conv2D(filters=256, kernel_size= (1, 1), padding = 'same', activation=lrelu) (x120)
    # x140  = layers.Conv2D(filters=512, kernel_size= (3, 3), padding = 'same', activation=lrelu) (x130)
    # x150  = layers.Conv2D(filters=512, kernel_size= (1, 1), padding = 'same', activation=lrelu) (x140)
    # x161  = layers.Conv2D(filters=1024, kernel_size= (3, 3), padding = 'same', activation=lrelu) (x150)
    # x162  = layers.MaxPooling2D(pool_size=(2, 2), strides=(2, 2), padding = 'same') (x161)

    # x170  = layers.Conv2D(filters=512, kernel_size= (1, 1), padding = 'same', activation=lrelu) (x162)
    # x180  = layers.Conv2D(filters=1024, kernel_size= (3, 3), padding = 'same', activation=lrelu) (x170)
    # x190  = layers.Conv2D(filters=512, kernel_size= (1, 1), padding = 'same', activation=lrelu) (x180)
    # x200  = layers.Conv2D(filters=1024, kernel_size= (3, 3), padding = 'same', activation=lrelu) (x190)
    # x210  = layers.Conv2D(filters=1024, kernel_size= (3, 3), padding = 'same', activation=lrelu) (x200)
    # x220  = layers.Conv2D(filters=1024, kernel_size= (3, 3), strides=(2, 2), padding = 'same') (x210)

    # x230  = layers.Conv2D(filters=1024, kernel_size= (3, 3), activation=lrelu) (x220)
    # x240  = layers.Conv2D(filters=1024, kernel_size= (3, 3), activation=lrelu) (x230)





    x250   = layers.Flatten()(x240)# convert from 3d to 1d
    x260   = layers.Dense(512)(x250)
    x270   = layers.Dense(1024)(x260)
    x280 = layers.Dense(1470, activation='sigmoid')(x270)
    outputs = Yolo_Reshape(target_shape=(7,7,30))(x280)


    # build the Keras model from the input and output tensors
    model = keras.Model(inputs=inputs, outputs=outputs)
    # print(model.summary())

    return model    


print("DNN model is defined ...")    



DNN model is defined ...


In [None]:
# TODO
# Pytorch
""" 
Information about the architecture config:
A tuple is structured as (kernel_size, filters, stride, padding).
"M" is max pooling with a 2x2 kernel and stride 2.
A list holds two conv tuples followed by an int giving the number of repeats.
"""

architecture_config = [
    (7, 64, 2, 3),
    "M",
    (3, 192, 1, 1),
    "M",
    (1, 128, 1, 0),
    (3, 256, 1, 1),
    (1, 256, 1, 0),
    (3, 512, 1, 1),
    "M",
    [(1, 256, 1, 0), (3, 512, 1, 1), 4],
    (1, 512, 1, 0),
    (3, 1024, 1, 1),
    "M",
    [(1, 512, 1, 0), (3, 1024, 1, 1), 2],
    (3, 1024, 1, 1),
    (3, 1024, 2, 1),
    (3, 1024, 1, 1),
    (3, 1024, 1, 1),
]


class CNNBlock(nn.Module):
    def __init__(self, in_channels, out_channels, **kwargs):
        super(CNNBlock, self).__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, bias=False, **kwargs)
        self.batchnorm = nn.BatchNorm2d(out_channels)
        self.leakyrelu = nn.LeakyReLU(0.1)

    def forward(self, x):
        return self.leakyrelu(self.batchnorm(self.conv(x)))


class Yolov1(nn.Module):
    def __init__(self, in_channels=3, **kwargs):
        super(Yolov1, self).__init__()
        self.architecture = architecture_config
        self.in_channels = in_channels
        self.darknet = self._create_conv_layers(self.architecture)
        self.fcs = self._create_fcs(**kwargs)

    def forward(self, x):
        x = self.darknet(x)
        return self.fcs(torch.flatten(x, start_dim=1))

    def _create_conv_layers(self, architecture):
        layers = []
        in_channels = self.in_channels

        for x in architecture:
            if isinstance(x, tuple):
                layers += [
                    CNNBlock(
                        in_channels, x[1], kernel_size=x[0], stride=x[2], padding=x[3],
                    )
                ]
                in_channels = x[1]

            elif isinstance(x, str):
                layers += [nn.MaxPool2d(kernel_size=(2, 2), stride=(2, 2))]

            elif isinstance(x, list):
                conv1 = x[0]
                conv2 = x[1]
                num_repeats = x[2]

                for _ in range(num_repeats):
                    layers += [
                        CNNBlock(
                            in_channels,
                            conv1[1],
                            kernel_size=conv1[0],
                            stride=conv1[2],
                            padding=conv1[3],
                        )
                    ]
                    layers += [
                        CNNBlock(
                            conv1[1],
                            conv2[1],
                            kernel_size=conv2[0],
                            stride=conv2[2],
                            padding=conv2[3],
                        )
                    ]
                    in_channels = conv2[1]

        return nn.Sequential(*layers)

    def _create_fcs(self, split_size, num_boxes, num_classes):
        S, B, C = split_size, num_boxes, num_classes

        # In original paper this should be
        # nn.Linear(1024*S*S, 4096),
        # nn.LeakyReLU(0.1),
        # nn.Linear(4096, S*S*(B*5+C))

        return nn.Sequential(
            nn.Flatten(),
            nn.Linear(1024 * S * S, 496),
            nn.Dropout(0.0),
            nn.LeakyReLU(0.1),
            nn.Linear(496, S * S * (C + B * 5)),
        )
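
The first linear layer in `_create_fcs` expects `1024 * S * S` inputs. The sketch below walks through `architecture_config` and tracks the feature map size, assuming a 448x448 input (the size used by the `Resize` transform in the training cell) and the usual convolution output formula `(n + 2p - k) // s + 1`. The helper names `conv_out` and `trace_shape` are hypothetical.

```python
architecture_config = [
    (7, 64, 2, 3), "M",
    (3, 192, 1, 1), "M",
    (1, 128, 1, 0), (3, 256, 1, 1), (1, 256, 1, 0), (3, 512, 1, 1), "M",
    [(1, 256, 1, 0), (3, 512, 1, 1), 4],
    (1, 512, 1, 0), (3, 1024, 1, 1), "M",
    [(1, 512, 1, 0), (3, 1024, 1, 1), 2],
    (3, 1024, 1, 1), (3, 1024, 2, 1), (3, 1024, 1, 1), (3, 1024, 1, 1),
]


def conv_out(size, kernel, stride, padding):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_shape(config, size=448, channels=3):
    """Follow the config and return the final (spatial size, channels)."""
    for item in config:
        if isinstance(item, tuple):
            k, f, s, p = item
            size, channels = conv_out(size, k, s, p), f
        elif isinstance(item, str):
            size = size // 2  # 2x2 max pooling with stride 2
        elif isinstance(item, list):
            conv1, conv2, repeats = item
            for _ in range(repeats):
                for k, f, s, p in (conv1, conv2):
                    size, channels = conv_out(size, k, s, p), f
    return size, channels


print(trace_shape(architecture_config))  # (7, 1024)
```

Under these assumptions the convolutional part ends with a 7x7 map of 1024 channels, which is why the flattened input to the fully connected layers has `1024 * S * S` values with `S = 7`.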

## Training the model

In [None]:
if not usePytorch:
    epochs = 10 # number of epochs (full passes over the training data)

    if NNID>=2:
      # input_shape = [h,w,c] 

      if NNID==2:
        model = getSimpleDNNModel(input_shape, number_of_pixels,number_of_classes)
      elif NNID==3: # advanced 
        model = getDNNModel(input_shape, number_of_pixels,number_of_classes)
      elif NNID==4: # 3D 
        input_shape = [h,w,c,1] 
        model = getSimpleDNN3DModel(input_shape, number_of_pixels,number_of_classes)
      elif NNID==5: # Yolo
        grid_w=7
        grid_h=7
        cell_w=64
        cell_h=64
        img_w=grid_w*cell_w
        img_h=grid_h*cell_h
        channel=3
        input_shape = [img_w,img_h,channel]
        number_of_classes = 20
        number_of_pixels = img_w * img_h * channel
        model = getYoloDNNModel(input_shape, number_of_pixels, number_of_classes)
        

        print("===================================================")
        print("               Training Loop           ")
        print("===================================================")
        total_time_start = time.time()
        # loop over the epochs;
        # in each epoch, loop through all the training batches
        for epoch in range(epochs):
            #print("\nStart of epoch %d" % (epoch,))
            start_time = time.time()

            # Iterate over the batches of the dataset.
            train_dataset = My_Custom_Generator(X_train, Y_train, batch_size)
            for step, (x_batch_train, y_batch_train) in enumerate(train_dataset):
                #print(train_dataset.shape)
                #print(x_batch_train.shape,y_batch_train.shape)
                doAug = 0
                if doAug: 
                    #do augmentation
                    new_train_batch = doAugmentation(x_batch_train , y_batch_train , batch_size)                
                    for stp, (new_x_batch_train, new_y_batch_train) in enumerate(new_train_batch):
                        #print(stp)
                        #print(new_train_batch.shape)
                        #print(new_x_batch_train.shape,new_y_batch_train.shape)
                        #model.summary()
                        loss_value = train_step(model,new_x_batch_train, new_y_batch_train)
                        train_acc = train_acc_metric.result()
                        train_acc_metric.reset_states()
                        print("   epoch:%d \t stp %d trnLoss: %.4f " % (epoch, stp, float(loss_value)))
                else:                    
                    # print(step)
                    # print(x_batch_train.shape)
                    # print(y_batch_train.shape)
                    # model.summary()
                    lossFunctionID = 4
                    loss_value = train_step(model,x_batch_train, y_batch_train)
                    train_acc = train_acc_metric.result()
                    train_acc_metric.reset_states()

            # Run a validation loop at the end of each epoch.
            val_dataset = My_Custom_Generator(X_val, Y_val, batch_size)
            for x_batch_val, y_batch_val in val_dataset:
                val_loss_value = val_step(model, x_batch_val, y_batch_val)

            val_acc = val_acc_metric.result()
            val_acc_metric.reset_states()
            
            # compute time required for each epoch
            end_time = time.time() - start_time

            print("epoch:%d \t trnLoss: %.4f \t valLoss: %.4f \t trnAcc: %.4f \t valAcc: %.4f \t time:  %.2f" % (epoch, float(loss_value),float(val_loss_value), float(train_acc), float(val_acc), end_time))
            with open(logFilePath, 'a') as logFile:
                logFile.write("%d \t %.4f \t  %.4f \t %.4f \t  %.4f \t  %.2f \n" % (epoch, float(loss_value),float(val_loss_value), float(train_acc), float(val_acc), end_time))
            if epoch % 5 == 0:
                # plot the result and save an intermediate model
                iaPlotLoss(logFilePath)
                model.save(modelPath)
        # save the final model
        model.save(modelPath)     

        # plot the result        
        iaPlotLoss(logFilePath)
        total_time_end = time.time() - total_time_start
        print("Training this dataset took ", total_time_end," seconds!") 
        print("Training this dataset took ", total_time_end/60.0," minutes!") 


In [None]:
# PyTorch
if usePytorch:
  if NNID == 5:
    seed = 123
    torch.manual_seed(seed)

    # Hyperparameters etc. 
    LEARNING_RATE = 2e-5
    DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
    BATCH_SIZE = 1 # 64 in original paper but I don't have that much vram, grad accum?
    WEIGHT_DECAY = 0
    EPOCHS = 300
    NUM_WORKERS = 2
    PIN_MEMORY = True
    LOAD_MODEL = False
    LOAD_MODEL_FILE = "overfit.pth.tar"
    IMG_DIR = "data/images"
    LABEL_DIR = "data/labels"

    class Compose(object):
      def __init__(self, transforms):
          self.transforms = transforms

      def __call__(self, img, bboxes):
          for t in self.transforms:
              img, bboxes = t(img), bboxes

          return img, bboxes


    transform = Compose([transforms.Resize((448, 448)), transforms.ToTensor(),])


    def train_fn(train_loader, model, optimizer, loss_fn):
        loop = tqdm(train_loader, leave=True)
        mean_loss = []

        for batch_idx, (x, y) in enumerate(loop):
            x, y = x.to(DEVICE), y.to(DEVICE)
            out = model(x)
            loss = loss_fn(out, y)
            mean_loss.append(loss.item())
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            # update progress bar
            loop.set_postfix(loss=loss.item())

        print(f"Mean loss was {sum(mean_loss)/len(mean_loss)}")


    model = Yolov1(split_size=7, num_boxes=2, num_classes=20).to(DEVICE)
    optimizer = optim.Adam(
        model.parameters(), lr=LEARNING_RATE, weight_decay=WEIGHT_DECAY
    )
    loss_fn = YoloLoss()

    if LOAD_MODEL:
        load_checkpoint(torch.load(LOAD_MODEL_FILE), model, optimizer)

    train_dataset = VOCDataset(
        "train.csv",
        transform=transform,
        img_dir=IMG_DIR,
        label_dir=LABEL_DIR,
    )

    test_dataset = VOCDataset(
        "test.csv", transform=transform, img_dir=IMG_DIR, label_dir=LABEL_DIR,
    )

    train_loader = DataLoader(
        dataset=train_dataset,
        batch_size=BATCH_SIZE,
        num_workers=NUM_WORKERS,
        pin_memory=PIN_MEMORY,
        shuffle=True,
        drop_last=True,
    )

    test_loader = DataLoader(
        dataset=test_dataset,
        batch_size=BATCH_SIZE,
        num_workers=NUM_WORKERS,
        pin_memory=PIN_MEMORY,
        shuffle=True,
        drop_last=True,
    )

    for epoch in range(EPOCHS):
        # for x, y in train_loader:
        #    x = x.to(DEVICE)
        #    for idx in range(8):
        #        bboxes = cellboxes_to_boxes(model(x))
        #        bboxes = non_max_suppression(bboxes[idx], iou_threshold=0.5, threshold=0.4, box_format="midpoint")
        #        plot_image(x[idx].permute(1,2,0).to("cpu"), bboxes)

        #    import sys
        #    sys.exit()

        pred_boxes, target_boxes = get_bboxes(
            train_loader, model, iou_threshold=0.5, threshold=0.4, device=DEVICE
        )

        mean_avg_prec = mean_average_precision(
            pred_boxes, target_boxes, iou_threshold=0.5, box_format="midpoint"
        )
        # print("epoch:%d \t trnLoss: %.4f \t valLoss: %.4f \t trnAcc: %.4f \t valAcc: %.4f \t time:  %.2f" % (epoch, float(loss_value),float(val_loss_value), float(train_acc), float(val_acc), end_time))
        print(f"Epoch: {epoch} Train mAP: {mean_avg_prec}")

        #if mean_avg_prec > 0.9:
        #    checkpoint = {
        #        "state_dict": model.state_dict(),
        #        "optimizer": optimizer.state_dict(),
        #    }
        #    save_checkpoint(checkpoint, filename=LOAD_MODEL_FILE)
        #    import time
        #    time.sleep(10)

        train_fn(train_loader, model, optimizer, loss_fn)




Epoch: 0 Train mAP: 0.0


100%|██████████| 16550/16550 [15:43<00:00, 17.54it/s, loss=5.31]

Mean loss was 13.303371761479047





Epoch: 1 Train mAP: 6.750193279003724e-05


100%|██████████| 16550/16550 [15:55<00:00, 17.32it/s, loss=7.64]

Mean loss was 11.759896752603824





Epoch: 2 Train mAP: 0.00020331663836259395


 13%|█▎        | 2195/16550 [02:10<14:10, 16.87it/s, loss=14.7]


KeyboardInterrupt: ignored

## Evaluation

In [None]:
if NNID==2:
    # Load the saved model 
    model = keras.models.load_model(modelPath, compile=False)

    start_time = time.time() 
    # Run the evaluation loop over the test dataset.
    for x_batch_tst, y_batch_tst in tst_dataset:
        output = model.predict(x_batch_tst)
        #y = keras.utils.to_categorical(y_batch_tst)
        tst_acc_metric.update_state(y_batch_tst, output)

    tst_acc = tst_acc_metric.result()

    # compute the time required for the evaluation
    end_time = time.time() - start_time

    print("test accuracy : %.4f \t time:  %.2f" % (  float(tst_acc), end_time))

In [None]:
#TODO: show examples using the above datasets
# https://machinelearningmastery.com/diagnose-overfitting-underfitting-lstm-models/

# More resources:

* 3Blue1Brown Neural Network [video tutorials](https://www.youtube.com/playlist?list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi) 
* Deep Learning Video Lectures by Prof. Andreas Maier [Winter 20/21](https://www.youtube.com/watch?v=SCFToE1vM2U&list=PLpOGQvPCDQzvJEPFUQ3mJz72GJ95jyZTh)
* Some of the code in this notebook is taken from [here](https://www.tensorflow.org/guide/keras/writing_a_training_loop_from_scratch)
* Calculating number of parameters in [CNN](https://towardsdatascience.com/understanding-and-calculating-the-number-of-parameters-in-convolution-neural-networks-cnns-fc88790d530d)
* Some of the code in this notebook is taken from [here](https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/tutorials/images/cnn.ipynb)
* https://nihcc.app.box.com/v/ChestXray-NIHCC
* https://www.tensorflow.org/datasets/catalog/patch_camelyon
* https://danielnouri.org/notes/2014/12/17/using-convolutional-neural-nets-to-detect-facial-keypoints-tutorial/
* https://elix-tech.github.io/ja/2016/06/02/kaggle-facial-keypoints-ja.html
* https://fairyonice.github.io/achieving-top-23-in-kaggles-facial-keypoints-detection-with-keras-tensorflow.html
* https://tensorflow-object-detection-api-tutorial.readthedocs.io/en/latest/training.html
* https://colab.research.google.com/github/tensorflow/hub/blob/master/examples/colab/object_detection.ipynb
* https://github.com/nicknochnack/TFODCourse
* https://www.analyticsvidhya.com/blog/2018/10/a-step-by-step-introduction-to-the-basic-object-detection-algorithms-part-1/
* https://github.com/enggen/Deep-Learning-Coursera
* https://github.com/prateeshreddy/Deep-Learning-Coursera
* https://github.com/JudasDie/deeplearning.ai
* https://github.com/aladdinpersson/Machine-Learning-Collection/tree/master/ML/Pytorch/object_detection/YOLO
* https://youtu.be/n9_XyCGr-MI
* https://www.maskaravivek.com/post/yolov1/