# Assignment 7 Report

This is an outline for your report to ease the amount of work required to create it. Jupyter notebook supports markdown, and I recommend that you check out this [cheat sheet](https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet) if you are not familiar with markdown.

Before delivery, **remember to convert this file to PDF**. You can do it in two ways:
1. Print the webpage (ctrl+P or cmd+P)
2. Export with LaTeX. This is somewhat more difficult, but you'll get a somewhat "prettier" PDF. Go to File -> Download as -> PDF via LaTeX. You might have to install nbconvert and pandoc through conda: `conda install nbconvert pandoc`.

# Task 1

## Task 1a)

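Intersection over union (IoU) measures how well a predicted bounding box matches the ground truth bounding box. It is particularly useful for evaluating how well a neural network localizes objects with bounding boxes, and it is used to decide whether a prediction counts as a correct detection when computing precision.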

![Intersection over union](report_material/intersection_over_union.PNG "Intersection over union")

In intersection over union, we measure how much the predicted bounding box overlaps with the real bounding box. Say the red bounding box is our prediction and the blue one is the ground truth. To measure how well the neural network is able to predict this bounding box, we take the area of the intersection of the two boxes (area A) and divide it by the area of the union of the two boxes (areas A+B+C).
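As a minimal sketch of this ratio, assuming axis-aligned boxes given as `(xmin, ymin, xmax, ymax)` tuples (the function name and box format are my own choices for illustration, not part of the assignment code):

```python
def iou(box_a, box_b):
    """Intersection over union of two (xmin, ymin, xmax, ymax) boxes."""
    # Corners of the overlapping rectangle (area A in the figure).
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    # Width and height are clamped to zero when the boxes do not overlap.
    intersection = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])

    # Union = both areas minus the part counted twice (areas A+B+C).
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


prediction = (0, 0, 2, 2)      # hypothetical predicted box
ground_truth = (1, 1, 3, 3)    # hypothetical ground truth box
print(iou(prediction, ground_truth))  # 1/7, roughly 0.143
```

Two identical boxes give an IoU of 1, and two boxes that do not overlap give 0.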

## Task 1b)

**True positive (TP)**: We have predicted a result as a positive, and the ground truth is also positive (correct prediction).

**False positive (FP)**: We have predicted the result as a positive, but the ground truth is actually negative (incorrect prediction).

**Precision**: Out of all the positives we report, how many of them are really positive? i.e. how many of our positive predictions are correct? 
$$Precision =\frac{TP}{TP+FP}$$

**Recall**: How well are we able to find all the positives? High recall means we have few false negatives (FN), i.e. we are able to catch almost all positive cases.
$$Recall = \frac{TP}{TP+FN}$$
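The two formulas can be sketched in a few lines of Python. The counts below are hypothetical numbers chosen for illustration, and returning 0 when the denominator is 0 is my own convention, not something given in the assignment:

```python
def precision(tp, fp):
    """Fraction of the reported positives that are actually positive."""
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0


def recall(tp, fn):
    """Fraction of the actual positives that we managed to find."""
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0


# Hypothetical counts: 8 correct detections, 2 false alarms, 4 missed objects.
tp, fp, fn = 8, 2, 4
print(precision(tp, fp))  # 0.8
print(recall(tp, fn))     # 8/12, roughly 0.667
```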


## Task 1c)

Here I have plotted the precision recall curves for both classes. The red boxes mark the precision values that are summed when computing AP.

![Precision recall for class 1](report_material/precision_recall_curve_class_1_boxes.png "Precision recall for class 1")

For class 1, we have:
$$
AP=\frac{1}{11}\sum_{r\in \{0.0, 0.1, \ldots, 1.0\}}AP_r = \frac{1}{11}(5\cdot 1.0 + 0.75 + 0.5 + 0.4 + 0.3 + 0.2) = \frac{7.15}{11} = 0.65
$$


![Precision recall for class 2](report_material/precision_recall_curve_class_2_boxes.png "Precision recall for class 2")

For class 2:
$$
AP=\frac{1}{11}(3\cdot 1.0+0.8+0.6+0.55+0.5+0.4+0.3) = \frac{6.25}{11}=0.568
$$

The mean average precision is the mean of the two APs:
$$
mAP = \frac{1}{2}\frac{6.25+7.15}{11}=0.609
$$
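To illustrate the procedure, here is a small sketch of 11-point interpolated AP, assuming the usual convention of taking the highest precision at any recall greater than or equal to each recall level. The precision/recall points in the example are hypothetical and are not read from my plots; only the last lines reuse the two sums from the calculations above.

```python
def average_precision_11_point(precisions, recalls):
    """11-point interpolated AP from paired precision/recall values."""
    recall_levels = [i / 10 for i in range(11)]  # 0.0, 0.1, ..., 1.0
    total = 0.0
    for level in recall_levels:
        # Highest precision among points whose recall reaches this level.
        candidates = [p for p, r in zip(precisions, recalls) if r >= level]
        total += max(candidates) if candidates else 0.0
    return total / 11


# Hypothetical curve: precision 1.0 up to recall 0.5, then 0.5 up to recall 1.0.
ap_example = average_precision_11_point([1.0, 1.0, 0.5], [0.0, 0.5, 1.0])
print(ap_example)  # (6 * 1.0 + 5 * 0.5) / 11

# mAP is the mean of the per-class APs; here with the two sums from the report.
report_aps = [7.15 / 11, 6.25 / 11]
mean_ap = sum(report_aps) / len(report_aps)
print(round(mean_ap, 3))  # 0.609
```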

# Task 2

### Task 2f)
![Precision recall for task 2f](task2/precision_recall_curve.png "Precision recall for task 2f")

# Task 3

### Task 3a)
This is called **non-max suppression**. It removes duplicate predictions that point to the same object: the prediction with the highest confidence score is kept, and every other prediction whose IoU with it is over a certain threshold is discarded.
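A minimal greedy sketch of the idea, assuming boxes in `(xmin, ymin, xmax, ymax)` format and an IoU threshold of 0.5 (both are my assumptions for illustration). In a real detector this would typically be run separately for each class.

```python
def iou(a, b):
    """Intersection over union of two (xmin, ymin, xmax, ymax) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes that survive, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with a kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Hypothetical predictions: the first two boxes cover the same object.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(non_max_suppression(boxes, scores))  # [0, 2]
```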

### Task 3b)
This is false: The earlier layers have a higher resolution, and high-resolution feature maps are better at detecting small objects. 

### Task 3c)
It might make sense to use a different set of bounding boxes depending on what class we are trying to identify. For example, a wide bounding box would make sense for a car but not for a pedestrian. By using anchors, we are essentially saying "Since we are looking for either cars or pedestrians, I'm pretty certain that the bounding box will be either wide and flat, or tall and narrow".

### Task 3d)
In SSD, additional convolutional feature layers are added after the base network. These layers decrease in size progressively, and predictions are made from each of them, which allows the network to detect objects at multiple scales. YOLO does not have this gradual decrease in feature layer sizes.

## Task 3e)
We have $38\times 38$ locations for our anchors. There are 6 anchors per location. So, we have $38 \cdot 38 \cdot 6 =8664$ anchor boxes in the feature map.

## Task 3f)
Same as last time, except that now we repeat the operation for every feature map resolution. This means we have $6\cdot(38\cdot38+19\cdot19+10\cdot10+5\cdot5+3\cdot3+1\cdot1)=11640$ anchor boxes in total.
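The two counts from Task 3e) and 3f) can be checked with a short script (the variable names are my own):

```python
feature_map_sizes = [38, 19, 10, 5, 3, 1]  # side lengths of the feature maps
anchors_per_location = 6

# Task 3e): only the 38x38 feature map.
anchors_first_map = 38 * 38 * anchors_per_location

# Task 3f): every location of every feature map gets 6 anchors.
anchors_total = sum(anchors_per_location * size * size for size in feature_map_sizes)

print(anchors_first_map)  # 8664
print(anchors_total)      # 11640
```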

# Task 4

## Task 4b)
![Precision recall for task 2f](task2/precision_recall_curve.png "Precision recall for task 2f")

`Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.748`

## Task 4c)
I used a lot of the things I learned from the last lecture. Batch normalization and a larger network did the most for my results, together with some other small changes.

The gist of it is: 
- Implement batch normalization (after conv2d, but before ReLU).
- Increase the network size:
    - Change `output_channels=[128, 256, 512, 256, 128, 64]`
    - Add more layers within the blocks already made (no new blocks were implemented)
- I also changed the bounding boxes a little to only use aspect ratios of 2:1 and 1:2, since I imagine that only the digit "1" has an aspect ratio close to 3:1 or 1:3.
- Change image normalization: `mean = [0.485, 0.456, 0.406], std = [0.229, 0.224, 0.225]`
- Use default Adam optimizer

At epoch 28, which was after 9048 iterations, I got a reported
```
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.88808
```
## Task 4d)
The stride for the $5\times5$ feature map is $64\times64$. 
```
[32, 32], [32, 96], [32, 160], [32, 224], [32, 288],
[96, 32], [96, 96], [96, 160], [96, 224], [96, 288],
[160, 32], [160, 96], [160, 160], [160, 224], [160, 288],
[224, 32], [224, 96], [224, 160], [224, 224], [224, 288],
[288, 32], [288, 96], [288, 160], [288, 224], [288, 288],
```
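The centers listed above can be reproduced with a small sketch. It assumes that each center lies in the middle of its $64\times64$ cell, which is how I read the values; the variable names are my own.

```python
feature_map_size = 5
stride = 64

# Center of cell (i, j): half a stride into the cell, plus i or j whole strides.
centers = [
    [[stride // 2 + stride * i, stride // 2 + stride * j]
     for j in range(feature_map_size)]
    for i in range(feature_map_size)
]

for row in centers:
    print(row)
# First row: [[32, 32], [32, 96], [32, 160], [32, 224], [32, 288]]
```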
For aspect ratios I will assume that we are still looking at the $5\times5$ feature map. Then, I end up with:


**Square of side *min_size*:** $162^2$

**Square of side $\sqrt{minSize\cdot nextMinSize}$**: $\sqrt{162\cdot 162}^2=162^2$

**$[minSize\cdot \sqrt{aspectRatio}, minSize/\sqrt{aspectRatio}]$**: $[162\cdot \sqrt{\frac{1}{2}}, 162 / \sqrt{\frac{1}{2}}] =[114.55129, 229.10259] \approx [115, 229]$

**$[minSize / \sqrt{aspectRatio}, minSize \cdot \sqrt{aspectRatio}]$**: $[162 / \sqrt{\frac{1}{2}}, 162 \cdot \sqrt{\frac{1}{2}}] =[229.10259, 114.55129] \approx [229, 115]$
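The four shapes can be recomputed with a short sketch. It plugs in the same values as the calculation above (`min_size = 162`, `next_min_size = 162`, aspect ratio $\frac{1}{2}$); the function name is hypothetical.

```python
import math


def anchor_shapes(min_size, next_min_size, aspect_ratio):
    """Return the four anchor shapes for one feature map as pairs."""
    root = math.sqrt(aspect_ratio)
    mid_size = math.sqrt(min_size * next_min_size)
    return [
        (min_size, min_size),             # square of side min_size
        (mid_size, mid_size),             # square of side sqrt(min * next_min)
        (min_size * root, min_size / root),
        (min_size / root, min_size * root),
    ]


shapes = anchor_shapes(162, 162, 1 / 2)
rounded = [(round(a), round(b)) for a, b in shapes]
print(rounded)  # [(162, 162), (162, 162), (115, 229), (229, 115)]
```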


By looking at the anchor box plot, I think my calculations look correct.

![Anchor boxes](report_material/task4d.PNG "Anchor Boxes")

## Task 4e)
I think the network in general has a lot of problems with numbers that overlap, as seen in the image below:

![Network predictions](report_material/4.PNG "Network predictions")

My network, for some reason, also seems to predict class 5 (the number 4) a whole lot. I find this pretty strange, and it makes me wonder how I could get such a high mAP. Perhaps, to have the really good model do the predictions, I should use a checkpoint other than 9999, but I'm not sure how to do that.

## Task 4f)

![VGG16 Network predictions](report_material/task4f.PNG "VGG16 Network predictions")