# Classification + Localization

### Problem Statement

You are given an image of dimension $H$ by $W$ and a set of $C$ class labels (e.g., `cat, tree, sky`). Your goal is to predict the image's class and to return a set of coordinates that bounds the object in the image. Note that we assume there is only one "thing" in the image; if there were multiple things, the task would be **Object Detection**.

For instance,
```
(image)               # classification, (x, y, w, h)
[[0.23, 0.05], ---->  (cat, [1, 0, 1, 1])
 [0.15, 0.03]]        
```

### Approach

First, generate a feature map using one of the larger convolutional neural nets (VGG, ResNet, etc.). Then attach 2 separate fully connected layers on top of it:
- Layer 1 is a fully connected layer `(feat_map, C)` which predicts a score for each of the $C$ classes; the highest score gives the class label (Classification Layer)
- Layer 2 is a fully connected layer `(feat_map, 4)` which predicts the bounding box dimensions (Bounding Box Layer)
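
The two heads can be sketched in plain Python to make the shapes concrete. This is only an illustration, not the lecture's implementation: the feature vector is random stand-in data rather than the output of a real backbone, the helper names `linear` and `init_layer` and the sizes `feat_dim` and `num_classes` are made up for the example, and the classification head is assumed to emit one score per class.

```python
import random

def linear(x, weights, bias):
    """Fully connected layer: computes weights @ x + bias for a 1-D input x."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def init_layer(in_dim, out_dim, rng):
    """Small random weights and zero biases (hypothetical initialization)."""
    weights = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)]
               for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias

rng = random.Random(0)
feat_dim, num_classes = 8, 3  # illustrative sizes, not from the lecture

# Stand-in for the flattened feature map a CNN backbone would produce.
features = [rng.random() for _ in range(feat_dim)]

cls_w, cls_b = init_layer(feat_dim, num_classes, rng)  # Classification Layer
box_w, box_b = init_layer(feat_dim, 4, rng)            # Bounding Box Layer

# Both heads read the same shared features.
class_scores = linear(features, cls_w, cls_b)  # one raw score per class
box = linear(features, box_w, box_b)           # (x, y, w, h)
```

Both heads consume the same features, so during training the backbone receives gradient from both tasks.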

Given that we have 2 outputs, we now also have 2 losses:
- Classification Layer will use Softmax Loss (SL)
- Bounding Box Layer will use L2 Loss

Having 2 outputs like this means we use a **Multi-Task Loss (MTL)**: the gradient is computed from a weighted combination of the losses from each task, e.g.,

$$\text{Multi-Task Loss} = \alpha \cdot \text{Softmax Loss} + (1 - \alpha) \cdot \text{L2 Loss} $$

$$\alpha \in (0, 1)$$
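
As a rough sketch of how the weighted sum could be computed for a single example, the snippet below uses only the standard library. The function names are hypothetical, the softmax loss is taken to be the cross-entropy of the softmax over raw class scores, and the L2 loss is assumed to be the sum of squared coordinate differences (some implementations use the mean instead).

```python
import math

def softmax_loss(scores, target):
    """Cross-entropy of softmax(scores) against the integer class index `target`."""
    m = max(scores)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum_exp - scores[target]

def l2_loss(pred_box, true_box):
    """Sum of squared differences between predicted and true (x, y, w, h)."""
    return sum((p - t) ** 2 for p, t in zip(pred_box, true_box))

def multi_task_loss(scores, target, pred_box, true_box, alpha=0.5):
    """alpha * softmax loss + (1 - alpha) * L2 loss, with alpha in (0, 1)."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    return (alpha * softmax_loss(scores, target)
            + (1 - alpha) * l2_loss(pred_box, true_box))

# Illustrative values: 3 class scores, true class index 0, and two boxes.
scores = [2.0, 0.5, -1.0]
pred_box = [0.9, 0.1, 1.0, 1.0]
true_box = [1.0, 0.0, 1.0, 1.0]
loss = multi_task_loss(scores, 0, pred_box, true_box, alpha=0.5)
```

A larger `alpha` puts more weight on getting the class right, while a smaller one favors box accuracy. In practice `alpha` is a hyperparameter to tune.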

### References
- [CS 231n Lecture 11 - Detection and Segmentation](https://www.youtube.com/watch?v=nDPWywWRIRo)