# Introduction to Neural Computation Coursework- Group 16

Jon Duffy, Ed Wong, Shiyu Fan, Yawen Sun

## Introduction
For this task we investigate using a neural network to classify and localise objects in images. We have X preprocessed training images of 400x400 pixels, each paired with a text file that specifies the classes and, where an object is present, the bounding box around it. These images are from the VOC data set.

We also have Y testing images in the same format. Our task is to generate an accompanying text file for each test image that accurately specifies the classes and, where an object is present, its bounding box.

Below is the text file for image 2007_000027, which lists all 20 classes.
A “1 0” following the class name indicates that the class is not present in the image.
If an object is present, the numbers that follow encode its bounding box as runs of pixels. The pixels are numbered from top to bottom, then left to right: 1 is pixel (1,1), 2 is pixel (2,1), and so on. The numbers come in pairs, where the first is the starting pixel and the second gives the number of consecutive pixels in the run. Together, these runs form the bounding box around the object.

This text file corresponds to this image:

<img src="files/2007_000027.jpg">
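
The pixel-run encoding described above can be decoded into corner coordinates. The following is a minimal sketch, not our actual parser: the function name `rle_to_bbox` is hypothetical, and it assumes that the run length counts the starting pixel itself (so that “1 0” marks no pixels) and that the first coordinate of pixel (2,1) is the row.

```python
def rle_to_bbox(pairs, height=400):
    """Convert run-length pairs into a 1-indexed (x1, y1, x2, y2) box.

    pairs  -- flat list [start, length, start, length, ...] where pixel
              numbers are 1-indexed and run down each column first
              (assumed layout, see the lead-in)
    height -- number of pixel rows in the image (400 for our data)
    Returns None when no pixels are marked, e.g. for the "1 0" entry.
    """
    columns, rows = [], []
    for start, length in zip(pairs[0::2], pairs[1::2]):
        # Visit every pixel of the run so that runs wrapping into the
        # next column are handled correctly.
        for pixel in range(start - 1, start - 1 + length):
            col, row = divmod(pixel, height)
            columns.append(col + 1)
            rows.append(row + 1)
    if not columns:
        return None
    return min(columns), min(rows), max(columns), max(rows)


# A box covering columns 2-3 and rows 11-13 of a 400-pixel-high image.
print(rle_to_bbox([411, 3, 811, 3]))
# The "class not present" entry.
print(rle_to_bbox([1, 0]))
```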

Classification and localisation are active areas of research within machine learning. The VOC data set used here is a standard benchmark for measuring the performance of neural networks on these tasks.

We aim to experiment with the activation functions, anchor box dimensions and learning rate to improve the network's classification and localisation, and to understand the impact of these changes on this problem. The quality of the neural network will be measured by various cross-validation methods.

## Design
Our initial approach was to create a neural network from scratch so that we could experiment with every detail of the network, such as the topology, the types of layers (for example fully connected and convolutional layers) and the learning rate. However, we came across a few disadvantages and bottlenecks:

- Working from the ground up leaves much more room for error.
- Implementing a network from scratch means training all of the weights and biases over many epochs of the training set. For object detection and classification, it is generally accepted that deep neural networks perform better than shallow ones, which means many more parameters to compute and update. As we have 14,626 training images, this presents a computational resource bottleneck. Even with cloud-based services such as Google Cloud or Amazon Web Services, we would still run into another barrier.
- That barrier is time, a precious resource in this project. According to CITE, a neural network could take months to train to a good standard from scratch, even with many GPUs.

As a result, we decided to change our approach: take an existing network with pretrained weights and update these weights by training on our dataset.

### Researching Existing Networks
When researching existing neural networks, we came across many variations, for instance R-CNN and SSD300. However, the network that caught our eye was [YOLOv2](https://pjreddie.com/darknet/yolo/), because of [this video](https://www.youtube.com/watch?v=VOC3huqHrss) demonstration in which the network was applied to a movie clip in real time. The clip showcased the speed of the network, a factor which was crucial for us.

After some further research, we analysed the topology of the network. The original YOLO design consists of 24 convolutional layers followed by 2 fully connected layers; YOLOv2 replaces the fully connected layers with convolutional ones that predict boxes relative to anchor boxes. We could experiment with the topology by adding layers, removing layers, changing the dimensions of the layers and so on. However, due to time constraints, we instead fix the topology and alter other features, i.e. the hyperparameters of the network.

Pretrained weights from ImageNet were used on the convolutional layers. This essentially means that the authors only needed to train the weights and biases in the final layers, which saves time and resources. We could take a similar approach by loading pretrained weights into the convolutional layers of an implementation of this network and training only the final layers on our training data. However, due to time limitations, it is more practical to take the YOLOv2 network in its current state and update its weights by training it for a few epochs on our own training set.

Let us briefly describe how this network both classifies and localises objects in an image. First, the image is divided into a grid of equally sized cells, e.g. 9x9. Each grid cell is assigned several anchor boxes, and each anchor box attempts to locate an object. Using multiple anchor boxes per cell allows multiple objects to be detected in a single cell (so the number of anchor boxes limits the number of objects that can be detected in the same cell). The network is trained with a loss function that balances the trade-off between localisation error and classification error, and this loss value drives backpropagation.

The output of the network is a high-dimensional tensor whose components refer to the grid cell, the anchor box, the class and the probability of that class occurring.
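
To make the shape of this output concrete, here is a small sketch under stated assumptions. The helper names are hypothetical, and the layout (4 box coordinates, 1 objectness score and one score per class for every anchor box) follows the published YOLOv2 description rather than anything we verified in the YAD2K code.

```python
def yolo_output_shape(grid_size, num_anchors, num_classes):
    """Shape of the prediction tensor: one vector per anchor per cell.

    Each vector holds 4 box coordinates, 1 objectness score and
    num_classes class scores (assumed YOLOv2 layout).
    """
    return (grid_size, grid_size, num_anchors, 5 + num_classes)


def responsible_cell(center_x, center_y, image_size, grid_size):
    """Return the (row, column) of the grid cell containing a box centre."""
    cell_size = image_size / grid_size
    # Clamp so that a centre on the far edge still maps to the last cell.
    col = min(int(center_x // cell_size), grid_size - 1)
    row = min(int(center_y // cell_size), grid_size - 1)
    return row, col


# 20 VOC classes, 5 anchor boxes, 13x13 grid.
print(yolo_output_shape(13, 5, 20))
# A box centred at (215, 90) in a 400x400 image split into a 10x10 grid.
print(responsible_cell(215, 90, image_size=400, grid_size=10))
```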

In the original YOLO design, the layers of the network consist of convolutional layers (which extract the features) and end with fully connected layers (which calculate the probabilities). This information was found by exploring the network, and from [this paper](https://arxiv.org/abs/1506.02640).

We found a Python implementation of YOLOv2 called [YAD2K](https://github.com/allanzelener/YAD2K). This follows the YOLOv2 network design but is built on Keras, TensorFlow and NumPy.

### How We Will Train
We separate the whole set of images into 8 subsets, each containing 2000 images. The training cycle works through 3 stages, and each stage splits the data into 1800 training images and 200 validation images:

- The first stage runs for 5 epochs and fine-tunes only the last layer.
- The second stage runs for 30 epochs and tunes all layers. As it does not need to save weights at the end of each epoch, it runs without early stopping or checkpointing.
- The third stage runs for 30 epochs with checkpointing and early stopping, so that training stops as soon as the validation loss starts increasing.
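
The subset and validation split described above could be produced along the following lines. This is only a sketch: `make_subsets` is a hypothetical helper, and the shuffling with a fixed seed is an assumption rather than a description of how our splits were actually drawn.

```python
import random


def make_subsets(image_ids, subset_size=2000, val_fraction=0.1, seed=0):
    """Split image ids into subsets, each with a train/validation split.

    Returns a list of (train_ids, val_ids) tuples. With the defaults,
    each full subset has 1800 training and 200 validation images.
    """
    ids = list(image_ids)
    # Shuffling is an assumption; a fixed seed keeps the split repeatable.
    random.Random(seed).shuffle(ids)
    subsets = []
    for start in range(0, len(ids), subset_size):
        chunk = ids[start:start + subset_size]
        num_val = int(len(chunk) * val_fraction)
        subsets.append((chunk[num_val:], chunk[:num_val]))
    return subsets


subsets = make_subsets(range(16000))
print(len(subsets), len(subsets[0][0]), len(subsets[0][1]))
```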
### Our Version of This Network
We aim to change the following features and analyse how they affect the network's ability to detect and classify:

- Activation functions
- Anchor box coordinates
- Learning rate

## Implementation- input code
Describe how you implemented your neural network and the associated performance analysis mechanisms. Explain why you chose to do it that way. Remember to cite any sources you used. Include comments in the source code that explains how the code works. [20%]

### Parser for File Types and Data Format
In order to implement the neural network we have chosen, our first task was to write a parser for the annotation files. We then needed to write code to modify the output so that the network generates the specific format we require.

<convert.py>

<test_copy.py>

However, the data that we have leaves some room for interpretation in certain cases. For example, if many objects of the same class overlap, the annotation format does not always allow us to distinguish one object from another. We can visualise our data as below, which shows where this becomes an issue:
<img src="files/figures/figure3_1.png">

- Top left: Original.
- Top right: We do not know if there is an object inside the “silhouette”.
- Bottom: We do not know how far this box extends.

This situation should occur rarely; however, it should still be noted when converting the data format.

The data format that we convert to and from is much more efficient and robust than the one we were given. It simply states a set of coordinates per object in an image. For example, the example from the introduction can be expressed as:

`2007_000027.jpg_person, 143 80 287 279`

where the integers are (x_1, y_1, x_2, y_2), the corners of the bounding box around an object of that class in that image.
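
Reading and writing a line in this format is straightforward. The sketch below is illustrative only and is not the contents of `convert.py`: the function names are hypothetical, and it assumes that class names contain no underscore, so that the last underscore separates the image name from the class.

```python
def format_box(image_name, class_name, box):
    """Write one object as '<image>_<class>, x1 y1 x2 y2'."""
    x1, y1, x2, y2 = box
    return f"{image_name}_{class_name}, {x1} {y1} {x2} {y2}"


def parse_box(line):
    """Read one line back into (image_name, class_name, (x1, y1, x2, y2))."""
    key, coords = line.split(",", 1)
    # Assumes the class name has no underscore, so the final underscore
    # is the boundary between image name and class.
    image_name, class_name = key.strip().rsplit("_", 1)
    x1, y1, x2, y2 = (int(value) for value in coords.split())
    return image_name, class_name, (x1, y1, x2, y2)


line = format_box("2007_000027.jpg", "person", (143, 80, 287, 279))
print(line)
print(parse_box(line))
```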

We can pinpoint where we lose information in our code for `test_yolo.py`.

A further issue we found when testing was that the network did not perform correctly when given a greyscale image. We therefore had to re-save each black-and-white image in a form that our network can take as input.

### Computational Resources
Training a neural network to classify and localise images requires a large amount of computational resource. We quickly learnt that it was not feasible to train on our laptops, for two reasons. Firstly, the amount of resource required meant that training would take an infeasibly long time given the time constraints of this assignment. Secondly, running from a laptop requires the laptop to be open and awake for the processes to run. We decided to turn to the cloud to solve this problem. This allowed us to rent powerful virtual machines with professional GPUs (rather than the gaming GPUs found in consumer desktops and laptops). We also installed and configured [Jenkins](https://jenkins.io/) on our VM. Jenkins is a job orchestration tool that can be controlled via a web interface, which allowed us to define and schedule jobs on the neural network. As the Jenkins user runs the tasks, we did not need to worry about maintaining an open session and could check the status of jobs via the web UI.

We initially set up on Amazon Web Services (AWS); however, $300 worth of free credits was available for setting up a new account on Google Cloud (GC). AWS provides Amazon Machine Images (AMIs) that are pre-built and ready for machine learning, with many key tools, such as TensorFlow and Keras, pre-installed. This made setup on AWS very fast. When we moved to GC, however, we had to install display drivers and many tools before we could get started.

Even on a powerful cloud instance with 80GB of RAM, when we tried to train the network on all the images, RAM would max out within seconds and the process would be killed by the operating system. To work around this, we broke the training data into smaller batches and trained on these. Jenkins made this easy, as we were able to queue up many training jobs and check the web UI at any time of day for updates on progress.

During training we came across a bug in the YAD2K code. When drawing the bounding boxes on images during the training cycle, it encountered a type mismatch (int -> float). Updating a single function argument fixed this issue, and the training jobs were then able to run successfully.

We retrain on 2000 images at a time, 8 times in total, and combine the results into yolo.weights for testing.

### Performance Analysis
We now discuss the accuracy measurement used to give a confidence for a particular object. First, we introduce the notion of “Intersection over Union” (IoU). While training our network, we can judge our results and analyse the accuracy because we have ground truth boxes. Below, the green box is the position of the object given by the training data (the ground truth) and the red box is the position predicted after training.
<img src="files/figures/figure3_2.png">
This visual representation from Adrian Rosebrock (2016) demonstrates the usefulness of IoU measurements.
<img src="files/figures/figure3_3.png">
We multiply this IoU value by the probability that the object belongs to a certain class:

$$\Pr(\text{Class}_i) \cdot \text{IoU}^{\text{truth}}_{\text{pred}}$$

<img src="files/figures/figure3_4.png">
From this figure, our network is 80% confident that the object in the pink box is a truck. Our aim is to maximise this confidence.
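
The IoU and the class confidence defined above can be computed directly from corner coordinates. This is a minimal sketch assuming boxes in the (x_1, y_1, x_2, y_2) form used earlier; the function names are our own and are not taken from YAD2K.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap, zero if the boxes do not intersect.
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def class_confidence(class_probability, predicted_box, truth_box):
    """Pr(Class_i) multiplied by the IoU of prediction and ground truth."""
    return class_probability * iou(predicted_box, truth_box)


truth = (0, 0, 10, 10)
prediction = (5, 5, 15, 15)
print(iou(truth, prediction))
print(class_confidence(0.9, prediction, truth))
```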


## Experiments
Describe the experiments you carried out to optimise your network’s generalization performance, and present the results you obtained. Explain in detail how you used the training and testing data sets. The results should be presented in a statistically rigorous manner. [45%]

Now that we have a functioning network that we can train, we turn to experimenting. Let’s recap what we will be experimenting with:

- Varying Activation Functions
- Adjusting Anchor Box Sizes
- Learning Rates

### Varying Activation Functions
Our exploration of the network began with the activation functions. While looking through the code of this neural network, we changed the activation functions and ran some tests to see how this would affect the output. In principle this makes little sense, as changing the activation functions alters a network whose weights were trained with the original ones. Altering the functions in a pre-trained network should therefore be expected to harm the accuracy of the outputs.

The original activation functions were “leaky” (with gradient 0.1), with the exception of the penultimate layer, where the activation function was “linear”. We changed these activation functions (and their derivatives) and tried the following:

- Changed all activations to “linear”.
- Changed to ReLU activations.
- Changed to leaky activations with varying gradients (0.0001, 0.05, 0.11, 0.105, 0.1075, 0.115, 0.2, 0.5, 0.8).

These changes were chosen arbitrarily; their purpose was simply to see what the network would do.
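
For reference, the three families of activation tried above are all instances of one function with a different gradient on the negative side. The sketch below is illustrative and is not the Darknet or YAD2K implementation.

```python
def leaky_relu(x, negative_gradient=0.1):
    """Leaky activation: identity for x > 0, scaled by the gradient otherwise."""
    return x if x > 0 else negative_gradient * x


def leaky_relu_derivative(x, negative_gradient=0.1):
    """Derivative of the leaky activation, used during backpropagation."""
    return 1.0 if x > 0 else negative_gradient


# A gradient of 0 gives ReLU and a gradient of 1 gives the linear activation.
for gradient in (0.0, 0.1, 0.11, 1.0):
    print(gradient, leaky_relu(-2.0, gradient), leaky_relu(3.0, gradient))
```

Seen this way, moving from 0.1 to 0.11 only rescales negative pre-activations by 10%, which may be why that variant stayed close to the original behaviour while larger changes did not.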

When we made these changes, we found some interesting results. Unsurprisingly, many of these functions over-classified in the wrong areas, producing nonsensical outputs. On the other hand, some detected nothing at all. These outputs are to be expected for the reasons stated above.

However, something unexpected happened when we used a leaky activation with gradient 0.11, only a slight variation on the original. On the particular image we were testing, the network managed to detect a motorbike in the top left-hand side, an object that was not detected by the original trained network!

Although the bounding boxes on the originally detected objects were slightly askew, it was a surprise to see that this network seemed better at detection. What seems to have happened is that the effective detection threshold decreased, allowing more questionable objects to be detected that were in fact correct. However, this was only tested on a single image and should be explored further before drawing any conclusions.

Below are the results.

Normal yolo.cfg:
<img src="files/figures/figure4_1.png">
Linear activations:
<img src="files/figures/figure4_2.png">
ReLU activations and leaky activations with 0.0001 and 0.05 (none of these classified anything):
<img src="files/figures/figure4_3.png">
Leaky with 0.5:
<img src="files/figures/figure4_4.png">
Leaky with 0.8:
<img src="files/figures/figure4_5.png">
Leaky with 0.2:
<img src="files/figures/figure4_6.png">
Leaky 0.11:
<img src="files/figures/figure4_7.png">
Leaky 0.115:
<img src="files/figures/figure4_8.png">
Leaky 0.105:
<img src="files/figures/figure4_9.png">
Leaky 0.1075:
<img src="files/figures/figure4_10.png">
We can liken changing the network after it has been trained to moving football goalposts after a ball has been kicked. It is then very difficult, if not impossible, to reason mathematically that the network is moving in the right direction (that is, that our goalposts will align with the trajectory of the ball).

Unfortunately, due to time constraints, we were unable to explore this branch further and delve into the mathematics of why this happens. Regardless, it was an interesting aside to “move the network to optimize” instead of the traditional gradient descent approach.
### Varying Anchor Box Sizes
Another possible parameter to experiment with is the number of anchor boxes and their dimensions. We consider several approaches:

- Pick varying sizes by hand. This method has obvious disadvantages, as it is not statistically rigorous, so we omit it.
- Use the k-means algorithm to select the number of boxes. This is a possible option; however, due to lack of time, implementing this algorithm may hinder progress, so we choose not to use it. YOLOv2 uses 5 anchor boxes, and according to [Vivek Yadav](https://medium.com/@vivek.yadav/part-1-generating-anchor-boxes-for-yolo-like-network-for-vehicle-detection-using-kitti-dataset-b2fe033e5807), after some k-means experimentation, 5 anchor boxes is the best choice for this network (note that they use a different dataset, but the general idea is the same).
- Use the k-means algorithm to select the dimensions of the boxes. As we are choosing 5 anchor boxes, we could run k-means with k = 5 to calculate the best dimensions for them. However, our data set is very similar to the one this network has already been trained on, so the anchor boxes are presumably already well suited to it, and we omit this experiment to save time.

The 5 anchor box dimensions we will use are:

- (0.57273, 0.677385)
- (1.87446, 2.06253)
- (3.33843, 5.47434)
- (7.88282, 3.52778)
- (9.77052, 9.16828)
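
Although we did not run it ourselves, the k-means selection of anchor dimensions discussed above can be sketched as follows. This is an illustrative version, not code from YAD2K: the function names are hypothetical, boxes are (width, height) pairs, and the distance is 1 - IoU as proposed in the YOLOv2 paper, which is equivalent to assigning each box to the centroid with the highest IoU.

```python
import random


def wh_iou(box, centroid):
    """IoU of two (width, height) boxes aligned at a shared corner."""
    intersection = min(box[0], centroid[0]) * min(box[1], centroid[1])
    union = box[0] * box[1] + centroid[0] * centroid[1] - intersection
    return intersection / union


def kmeans_anchors(boxes, k=5, iterations=50, seed=0):
    """Cluster (width, height) boxes into k anchors, sorted by area."""
    centroids = random.Random(seed).sample(boxes, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            # Highest IoU is the same as smallest (1 - IoU) distance.
            best = max(range(k), key=lambda i: wh_iou(box, centroids[i]))
            clusters[best].append(box)
        updated = []
        for index, members in enumerate(clusters):
            if members:
                mean_w = sum(w for w, _ in members) / len(members)
                mean_h = sum(h for _, h in members) / len(members)
                updated.append((mean_w, mean_h))
            else:
                # Keep an empty cluster's centroid where it was.
                updated.append(centroids[index])
        if updated == centroids:
            break
        centroids = updated
    return sorted(centroids, key=lambda c: c[0] * c[1])


boxes = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.2),
         (10.0, 10.0), (9.0, 11.0), (11.0, 9.0)]
print(kmeans_anchors(boxes, k=2))
```

In practice the input would be the widths and heights of all ground truth boxes, expressed in grid-cell units so that the result is comparable with the dimensions listed above.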
### Learning Rate
The learning rates we try are:

- 0.001 (control, using the original parameters)
- 0.0001
- 0.0005
- 0.0015
- 0.0025

### A Note on the Number of Epochs
Initially, we thought that we could experiment with the number of epochs that our network runs through. However, our network implementation already uses “early stopping”, which helps prevent the network from overfitting to a particular dataset.
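
The idea behind early stopping can be shown with a few lines of plain Python. This is a simplified sketch of the rule, not the Keras callback that YAD2K uses; the `patience` parameter (how many non-improving epochs to tolerate) and the loss values are hypothetical.

```python
def early_stopping(val_losses, patience=1):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses.

    Training stops once the validation loss has failed to improve on its
    best value for `patience` consecutive epochs. Epochs are 0-indexed.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    # The loss kept improving often enough to use every epoch.
    return len(val_losses) - 1, best_epoch


losses = [0.9, 0.7, 0.6, 0.65, 0.5]
print(early_stopping(losses, patience=1))
print(early_stopping(losses, patience=2))
```

With a patience of 1 the run stops as soon as the validation loss rises, which matches the behaviour of our third training stage, while a larger patience would have let this example reach the lower loss in the final epoch.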


## Conclusions
Summarize your key findings, including which factors proved most crucial, and what was the best generalization performance you achieved. [10%]
From our experiments and implementations, we have found that 
FINDINGS FROM EXPERIMENTS

Our findings could be improved if we had more time and resources, in particular more computational power, which would allow us to train these networks on our own training data from scratch.

We could also design our own datasets consisting of simple data, such as coloured shapes, to see how well our network generalises (is this what I mean?).

Also, it is important to note tha

## Description of contribution
Describe each group member's contribution to the overall project. [5%]
Jon- 
Ed-
Shiyu- 
Yawen- 
That dude we never met- 