# one note about AP score in detection

For all detection methods, to get an AP score, we simply set some threshold and generate the set of bounding boxes scoring above it (the score of each box has to be kept). Ideally, the boxes above the threshold are exactly those in the ground truth. For images with only background, it is desirable that the threshold is set such that the background classifier has the highest score. Some of my comments on detection papers are about how to vary the threshold to generate AP. Just ignore them (I marked them with `//`).

Check <http://host.robots.ox.ac.uk/pascal/VOC/voc2010/index.html#devkit>, download the dev kit, and check `VOCevaldet.m` to see how AP is computed. Essentially, it sorts all boxes by confidence and computes recall and precision at each (discrete) threshold level. Note that we will likely never achieve 100% recall, as some ground truth boxes are simply missed by the system.
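A minimal sketch of that computation, in plain Python rather than the MATLAB of the devkit. It assumes each detection has already been labelled as a true or false positive (the IoU matching against ground truth is not shown), and it assumes AP is taken as the area under the monotonically decreasing precision envelope, which is my understanding of the VOC2010-style measure; treat both as assumptions rather than a port of `VOCevaldet.m`. The function name and arguments are made up for illustration.

```python
def average_precision(scores, is_true_positive, num_ground_truth):
    """Return (AP, recall list, precision list) for one class."""
    # Sort detections by decreasing confidence.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    recall, precision = [], []
    tp = fp = 0
    for i in order:
        if is_true_positive[i]:
            tp += 1
        else:
            fp += 1
        # One (recall, precision) point per discrete threshold level.
        recall.append(tp / num_ground_truth)
        precision.append(tp / (tp + fp))
    # Pad the curve and make precision monotonically decreasing.
    mrec = [0.0] + recall + [1.0]
    mpre = [0.0] + precision + [0.0]
    for i in range(len(mpre) - 2, -1, -1):
        mpre[i] = max(mpre[i], mpre[i + 1])
    # Area under the envelope; flat recall segments contribute zero.
    ap = sum((mrec[i + 1] - mrec[i]) * mpre[i + 1] for i in range(len(mrec) - 1))
    return ap, recall, precision


# Three detections, three ground truth boxes, one of which is never found.
ap, recall, precision = average_precision([0.9, 0.8, 0.7], [True, False, True], 3)
print(ap, recall[-1])
```

In this toy case recall tops out at 2/3 because one ground truth box is never detected, which is the "never 100% recall" situation described above.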

# Classical approaches

## Bag-of-Words

### the original one

J. Sivic and A. Zisserman, “Video Google: a text retrieval approach to object matching in videos,” presented at the ICCV 2003: 9th International Conference on Computer Vision, 2003, vol. 2, pp. 1470–1477.

The classical paper for Bag of Words (BOW) methods.

Based on the introduction, it seems that previous methods did not use bag of words and simply matched descriptors by brute force.

By using bag of words, precomputation becomes possible, since vectors are quantized.
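A rough sketch of why quantization enables precomputation: once every descriptor is mapped to a visual word id, an inverted file from word to frames can be built offline, and a query only needs lookups. The vocabulary, the frame ids, and the plain Euclidean nearest-word rule are all illustrative assumptions (the paper clusters in a whitened space); none of the names below come from the paper.

```python
from collections import defaultdict


def nearest_word(descriptor, vocabulary):
    """Index of the closest cluster centre (squared Euclidean distance)."""
    return min(
        range(len(vocabulary)),
        key=lambda w: sum((d - c) ** 2 for d, c in zip(descriptor, vocabulary[w])),
    )


def build_inverted_file(frames, vocabulary):
    """Offline step: map each visual word to the set of frames containing it."""
    inverted = defaultdict(set)
    for frame_id, descriptors in frames.items():
        for descriptor in descriptors:
            inverted[nearest_word(descriptor, vocabulary)].add(frame_id)
    return inverted


def candidate_frames(query_descriptors, inverted, vocabulary):
    """Query step: union of the frames listed under the query's visual words."""
    hits = set()
    for descriptor in query_descriptors:
        hits |= inverted.get(nearest_word(descriptor, vocabulary), set())
    return hits


# Toy 2-D "descriptors" and a two-word vocabulary.
vocabulary = [(0.0, 0.0), (10.0, 10.0)]
frames = {"f1": [(0.1, 0.2)], "f2": [(9.0, 9.0), (0.0, 1.0)], "f3": [(11.0, 10.0)]}
inverted = build_inverted_file(frames, vocabulary)
print(sorted(candidate_frames([(9.5, 10.0)], inverted, vocabulary)))
```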

Section 2.

Here, first, two types of region finders are used. Both output elliptical regions. Then, I guess after an affine transformation back to a normalized patch, the patch's SIFT feature is computed, regardless of region type.

> The regions detected in each frame of the video are tracked using a simple constant velocity dynamical model and correlation. Any region which does not survive for more than three frames is rejected.

So only regions that are tracked for more than three frames make it to the next step.

Section 3.

Even after this pruning, only a subset of the frames is selected.

When doing clustering, a whitened space is used. The cov matrix is estimated using all descriptors, but when computing distance between two tracks, mean descriptors are used.

Section 4

I wonder what the documents in this database are, since the user only selects a subregion, yet a frame has many regions.

Section 6.

Here, spatial consistency is enforced locally, for each matched region pair on its own, rather than globally through some geometric model. It's a good compromise.
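A simplified sketch of this kind of local check: each match is supported by other matches that fall among the spatial nearest neighbours of both of its regions, and matches with no support can be dropped. The neighbourhood size `k`, the function names, and the toy coordinates are assumptions for illustration; this is not the paper's exact procedure.

```python
def k_nearest(points, index, k):
    """Indices of the k points closest to points[index] (excluding itself)."""
    others = [j for j in range(len(points)) if j != index]
    others.sort(
        key=lambda j: (points[j][0] - points[index][0]) ** 2
        + (points[j][1] - points[index][1]) ** 2
    )
    return set(others[:k])


def spatial_votes(query_points, frame_points, matches, k=2):
    """matches is a list of (query_region_index, frame_region_index) pairs."""
    votes = []
    for qi, fi in matches:
        q_neighbours = k_nearest(query_points, qi, k)
        f_neighbours = k_nearest(frame_points, fi, k)
        # Count other matches lying in both neighbourhoods.
        support = sum(
            1
            for qj, fj in matches
            if (qj, fj) != (qi, fi) and qj in q_neighbours and fj in f_neighbours
        )
        votes.append(support)
    return votes


# Regions 0-2 form a consistent cluster; region 3 is matched but isolated.
query_points = [(0, 0), (1, 0), (0, 1), (10, 10), (10, 11), (11, 10)]
frame_points = [(5, 5), (6, 5), (5, 6), (20, 0), (20, 1), (21, 0)]
matches = [(0, 0), (1, 1), (2, 2), (3, 3)]
print(spatial_votes(query_points, frame_points, matches))
```

No global transformation is estimated anywhere, which is the compromise noted above.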

Section 6.2

> The idea is implemented here by first retrieving frames using the weighted frequency vector alone, and then re-ranking them based on a measure of spatial consistency.

Notice that the vector obtained from a subregion is compared with vectors obtained from whole frames. This looks unfair (one small document, one large one), but based on <http://www.cse.cuhk.edu.hk/~taoyf/course/wst540/notes/lec5.pdf> or <http://www.cs.cmu.edu/~knigam/15-505/information-retrieval-lecture.ppt>, this seems to be the case for text retrieval as well.
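One way to see why the size mismatch is tolerable: with tf-idf weights and a normalized dot product (cosine similarity), the ranking depends on the direction of the vectors, not on how many words the query or the frame contains. The sketch below uses the standard text-retrieval weighting (term frequency times log inverse document frequency), which I believe matches the paper's scheme; the word ids and frames are made up.

```python
import math
from collections import Counter


def tfidf_vector(word_ids, doc_freq, num_docs):
    """Sparse tf-idf vector: (count / length) * log(N / document frequency)."""
    counts = Counter(word_ids)
    total = len(word_ids)
    return {
        w: (c / total) * math.log(num_docs / doc_freq[w])
        for w, c in counts.items()
        if doc_freq.get(w, 0) > 0
    }


def cosine(u, v):
    """Normalized dot product of two sparse vectors stored as dicts."""
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


# Three whole frames (large documents) as lists of visual word ids.
frames = {"a": [0, 0, 1, 2, 5], "b": [3, 3, 4, 5], "c": [1, 4, 5]}
doc_freq = Counter(w for words in frames.values() for w in set(words))
vectors = {f: tfidf_vector(words, doc_freq, len(frames)) for f, words in frames.items()}

# A small query taken from a subregion of frame "a".
query = tfidf_vector([0, 2], doc_freq, len(frames))
ranking = sorted(frames, key=lambda f: -cosine(query, vectors[f]))
print(ranking[0])
```

Repeating the query words (`[0, 2, 0, 2]`) gives exactly the same cosine scores, which is the sense in which document length drops out.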

Other notes

* p.1470: The benefits of this approach is that matches are effectively pre-computed so that at run-time frames and shots containing any particular object can be retrieved with no delay. -- Highlighted Feb 19, 2017
* p.1470: retrieve those key frames and shots of a video containing a particular object with the ease, speed and accuracy with which Google retrieves text documents (web pages) containing particular words. -- Highlighted Feb 19, 2017
* p.1470: An inverted file is structured like an ideal book index -- Highlighted Feb 19, 2017
* p.1471: In addition the match on the ordering and separation of the words may be used to rank the returned documents. -- Highlighted Feb 19, 2017
* p.1471: Although previous work has borrowed ideas from the text retrieval literature for image retrieval from databases (e.g. [15] used the weighting and inverted file schemes) to the best of our knowledge this is the first systematic application of these ideas to object retrieval in videos. -- Highlighted Feb 19, 2017
* p.1471: Note, both region detection and the description is computed on monochrome versions of the frames, colour information is not currently used in this work. -- Highlighted Feb 19, 2017
* p.1471: The regions detected in each frame of the video are tracked using a simple constant velocity dynamical model and correlation. Any region which does not survive for more than three frames is rejected. -- Highlighted Feb 19, 2017
* p.1472: To reject unstable regions the 10% of tracks with the largest diagonal covariance matrix are rejected. This generates an average of about 1000 regions per frame. -- Highlighted Feb 19, 2017
* p.1472: Instead a subset of 48 shots is selected (these shots are discussed in more detail in section 5.1) covering about 10k frames which represent about 10% of all the frames in the movie. -- Highlighted Feb 19, 2017
* p.1473: In our case the query vector is given by the visual words contained in a user specified sub-part of a frame,  -- Highlighted Feb 19, 2017
* p.1473: In the retrieval tests the entire frame is used as a query region. -- Highlighted Feb 19, 2017
* p.1475: The idea is implemented here by first retrieving frames using the weighted frequency vector alone, and then re-ranking them based on a measure of spatial consistency. -- Highlighted Feb 19, 2017
* p.1475: In our case the inverted file has an entry for each visual word, which stores all the matches, i.e. occurrences of the same word in all frames.  -- Highlighted Feb 19, 2017


~~~
@inproceedings{Sivic:2003jx,
author = {Sivic, Josef and Zisserman, Andrew},
title = {{Video Google: a text retrieval approach to object matching in videos}},
booktitle = {ICCV 2003: 9th International Conference on Computer Vision},
year = {2003},
pages = {1470--1477},
publisher = {IEEE},
keywords = {classics},
doi = {10.1109/ICCV.2003.1238663},
isbn = {0-7695-1950-4},
read = {Yes},
rating = {5},
date-added = {2017-02-20T02:26:30GMT},
date-modified = {2017-02-20T04:12:05GMT},
url = {http://ieeexplore.ieee.org/document/1238663/},
local-url = {file://localhost/Users/yimengzh/Documents/Papers3_revised/Library.papers3/Articles/2003/Sivic/ICCV%202003%202003%20Sivic.pdf},
file = {{ICCV 2003 2003 Sivic.pdf:/Users/yimengzh/Documents/Papers3_revised/Library.papers3/Articles/2003/Sivic/ICCV 2003 2003 Sivic.pdf:application/pdf;ICCV 2003 2003 Sivic.pdf:/Users/yimengzh/Documents/Papers3_revised/Library.papers3/Articles/2003/Sivic/ICCV 2003 2003 Sivic.pdf:application/pdf}},
uri = {\url{papers3://publication/doi/10.1109/ICCV.2003.1238663}}
}
~~~

# Deep Learning approaches

## Unsuccessful ones

### one classifier to a grid of classifiers.

C. Szegedy, A. Toshev, and D. Erhan, “Deep Neural Networks for Object Detection,” presented at the Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States., 2013, pp. 2553–2561.

First widely known attempt at deep-learning-based object detection, made by changing the output layer from one classifier to many classifiers, each one predicting whether an object exists or not.

It's kind of ugly.

* Ugly point 1: the output resolution is coarse. To remedy this, some tricks are used, such as soft target values (Eq. 1), 5 types of masks, refinement, etc.
* Ugly point 2: a CNN can only have a fixed number of outputs, but there can be a variable number of boxes. To handle this, they use 5 masks, and use Eq. (2) and (3) to help locate the boxes.

There are two types of networks trained: a mask generator and a classifier (for post-processing; see the pruning at the end of 5.2). I'm not sure whether there are 5 networks or 1 network (with 5 output layers), but that should not matter.

The mask generator network has an input size of 225x225. Not sure why, but it is probably not relevant; using 227 or 224 should do as well.

`//` To generate an AP score, I guess they have to generate a set of boxes with different scores. This can be done simply by setting a very low threshold in the filtering mentioned at the end of 5.2. I believe the classifier filtering and non-maximum suppression are monotonic operations, so the boxes kept with a higher threshold are a subset of the boxes obtained with a lower threshold.
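A small sketch supporting the monotonicity claim for the generic case of score thresholding followed by greedy NMS (not the paper's exact pipeline, which also involves a DNN classifier for pruning). In greedy NMS a box is only ever suppressed by higher-scoring boxes, so raising the score threshold cannot bring back a box or remove a suppressor. The box format `(x1, y1, x2, y2)` and the function names are assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def detect(boxes, scores, score_threshold, iou_threshold=0.5):
    """Keep boxes above the score threshold, then apply greedy NMS."""
    candidates = sorted(
        (i for i in range(len(boxes)) if scores[i] >= score_threshold),
        key=lambda i: -scores[i],
    )
    kept = []
    for i in candidates:
        # A box survives only if it overlaps no already-kept, higher-scoring box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in kept):
            kept.append(i)
    return kept


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.6, 0.3]
high = detect(boxes, scores, score_threshold=0.5)
low = detect(boxes, scores, score_threshold=0.1)
print(high, low, set(high) <= set(low))
```

So one low-threshold run, with scores kept, is enough to trace the whole precision/recall curve.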

On pp. 5, note that multiple scales of images are used for more robust detection. There are three scales. The first scale is the full image (maybe resized to 225x225). It's not clearly specified how to choose the other two, except that one is half the size of the other (their absolute sizes are not specified). But the point is that a sliding-window-like approach is needed here.

other notes

* p.2554: These approaches are traditionally challenged by the difficulty of training and use specially designed learning procedures. Moreover, at inference time they combine bottom-up and top-down processes. -- Highlighted Feb 17, 2017
* p.2554: Both approaches, however, use the NNs as local or semi-local classifiers either over superpixels or at each pixel location. Our approach, however, uses the full image as an input and performs localization through regression. As such, it is a more efficient application of NNs. -- Highlighted Feb 17, 2017
* p.2557: We further prune them by applying a DNN classifier by [14] trained on the classes of interest and retaining the positively classified ones w.r.t to the class of the current detector. -- Highlighted Feb 17, 2017

~~~
@inproceedings{Szegedy:2013va,
author = {Szegedy, Christian and Toshev, Alexander and Erhan, Dumitru},
title = {{Deep Neural Networks for Object Detection}},
booktitle = {Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States.},
year = {2013},
editor = {Burges, Christopher J C and Bottou, L{\'e}on and Ghahramani, Zoubin and Weinberger, Kilian Q},
pages = {2553--2561},
keywords = {deep learning},
read = {Yes},
rating = {3},
date-added = {2017-02-17T21:04:25GMT},
date-modified = {2017-02-18T19:55:17GMT},
url = {http://papers.nips.cc/paper/5207-deep-neural-networks-for-object-detection},
local-url = {file://localhost/Users/yimengzh/Documents/Papers3_revised/Library.papers3/Articles/2013/Szegedy/NIPS%202013%202013%20Szegedy.pdf},
file = {{NIPS 2013 2013 Szegedy.pdf:/Users/yimengzh/Documents/Papers3_revised/Library.papers3/Articles/2013/Szegedy/NIPS 2013 2013 Szegedy.pdf:application/pdf;NIPS 2013 2013 Szegedy.pdf:/Users/yimengzh/Documents/Papers3_revised/Library.papers3/Articles/2013/Szegedy/NIPS 2013 2013 Szegedy.pdf:application/pdf}},
uri = {\url{papers3://publication/uuid/4CB40EEA-D7BC-46C0-A548-3C86E3DBE37A}}
}
~~~

## Misc. successful ones (not R-CNN)

### Merging classification and bbox across all locations and scales

P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, “OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks,” arXiv, vol. cs.CV, Dec. 2013.

Very principled and simple approach to object detection using deep learning. Sadly, there is probably no complete code for it, unlike R-CNN. Also, it performs worse than R-CNN, as mentioned in the R-CNN paper. That's probably why it didn't become popular.

The essential idea is to do sliding window over different parts and scales of the image, using convolution for classification as well as for an ensemble of bounding box predictions (instead of a naive sliding window, which might be more accurate but is more expensive). So basically every part of the image is passed through classification, plus some bbox adjustment.

From <https://cs.stanford.edu/people/karpathy/rcnn/>:

> Other CNN-based detection systems I'm aware of include Overfeat (from Pierre Sermanet et al at NYU) and "Generic Object Detection with Dense Neural Patterns and Regionlets" (from Will Zou et al at Stanford), but neither have nice code (for detection) available online.

pp. 1: the bootstrapping here should be the hard-example data mining method from the Deformable Part Model (DPM) paper. In Section 5, the authors say bootstrapping is bad, but I think this bootstrapping is exact; it's just an implementation trick for the SVM in the DPM paper. Maybe because this is a CNN, the nice properties of SVMs don't hold.

top of pp. 1. "Even with this, however, many viewing windows may contain a perfectly identifiable portion of the object (say, the head of a dog), but not the entire object, nor even the center of the object. This leads to decent classification but poor localization and detection." I think this just means that a high-scoring window is not necessarily the bbox. So idea 3 needs to be considered. In their system, all three ideas are used.

On pp. 2, the introduction is well written in terms of previous work. In paragraph 3, the pure sliding window approach is introduced, corresponding to the first idea. Paragraph 4 says that many intrinsic parameters of objects of interest can be predicted using a CNN (here the parameters would be the bbox, corresponding to the second idea). Paragraph 5 covers methods where CNNs are used to generate region proposals, as a preprocessing step for more expensive classification methods (maybe some part-based models?).


pp. 3. Section 2. Notice the difference between detection and localization in ILSVRC. For localization, there is no penalty for wrong guesses; for detection, false positives are penalized, and an image may contain no object at all.

pp. 5 Section 3.3: the subsampling ratio is essentially the stride of the higher layer units, measured in input pixels. In this paper, since the later layers all have 1x1 stride, the subsampling ratio can be computed as the stride of conv1 times all the pooling strides. Here 2x3x2x3 refers to the "accurate" network in Table 3.
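As a quick arithmetic check of the note above, assuming the four factors are the conv1 stride followed by three pooling strides, in the order written (the helper name is made up):

```python
from math import prod


def subsampling_ratio(strides):
    """Product of all strides between the input pixels and the top layer units."""
    return prod(strides)


# conv1 stride, then the pooling strides, as written in the note: 2x3x2x3.
ACCURATE_MODEL_STRIDES = [2, 3, 2, 3]
print(subsampling_ratio(ACCURATE_MODEL_STRIDES))  # 36
```

So neighbouring output units of the top layer are 36 input pixels apart.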

On pp. 6, Figure 3 shows their trick to increase spatial resolution. It's really simple: basically, do the pooling for the last conv layer using different offsets. Somehow the description of the method here is very confusing, but I think Figure 3 looks good. In the paragraph below Fig. 3, I think "skip kernel" simply means interleaving outputs. It says "Or equivalently, as applying the final pooling layer and fully-connected stack at every possible offset, and assembling the results by interleaving the outputs." I think it's not every possible offset in the input, only every offset at the stride of the last conv layer before pooling.
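A 1-D sketch of the pooling part of this trick only (the classifier stack that follows is left out): pooling with a non-overlapping window of size 3 at offsets 0, 1, 2 and interleaving the results gives the same values as pooling with stride 1, so the offsets range over positions of the last conv layer, not over input pixels. The length-20 input mirrors the 1-D illustration in Figure 3 as I remember it; the function names are made up.

```python
def max_pool_1d(x, size, stride, offset=0):
    """Max pooling over windows x[i:i+size], starting at `offset`."""
    return [max(x[i:i + size]) for i in range(offset, len(x) - size + 1, stride)]


def offset_pool_interleaved(x, size):
    """Non-overlapping pooling at every offset in [0, size), then interleave."""
    pooled = [max_pool_1d(x, size, size, offset) for offset in range(size)]
    length = min(len(p) for p in pooled)
    out = []
    for k in range(length):
        for p in pooled:
            out.append(p[k])
    return out


# A toy layer-5 feature map with 20 positions along one dimension.
feature_map = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9, 3, 2, 3, 8, 4]
print(offset_pool_interleaved(feature_map, 3) == max_pool_1d(feature_map, 3, 1))
```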

In pp. 8, localization is mentioned. Essentially, here, instead of doing classification at locations and scales, we predict 1000 bounding boxes at each location and scale (I guess bbox coordinates can go outside the input image...)


At the bottom of pp. 11, it says their method is better than non-maximum suppression. I guess this is because in NMS the scores of windows are fixed beforehand, whereas here they keep changing during merging.


In Section 5, the detection part is essentially the same, except that the classifier should have 1001 outputs, one of them for background. For background, no bbox merging is needed.


`//` mAP can probably be obtained by using a very large k in Section 4.3, and I assume their merging procedure is monotonic, such that the set of boxes obtained with a lower k is a subset of the set of boxes obtained with a higher k.

Section 6. Potential improvements look sensible to me, although I don't get the reason of (iii). But anyway.

other notes

* p.2: Our dense sliding window method, however, is able to outperform object proposal methods on the ILSVRC13 detection dataset. -- Highlighted Feb 18, 2017
* p.2: Although they demonstrated an impressive localization performance, there has been no published work describing how their approach. Our paper is thus the first to provide a clear explanation how ConvNets can be used for localization and detection for ImageNet data. -- Highlighted Feb 18, 2017
* p.2: In this paper we use the terms localization and detection in a way that is consistent with their use in the ImageNet 2013 competition, namely that the only difference is the evaluation criterion used and both involve predicting the bounding box for each object in the image. -- Highlighted Feb 18, 2017

~~~
@article{Sermanet:2013vi,
author = {Sermanet, P and Eigen, D and Zhang, X and Mathieu, M and Fergus, R and LeCun, Y},
title = {{OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks}},
journal = {ArXiv e-prints},
year = {2013},
volume = {cs.CV},
month = dec,
keywords = {deep learning},
read = {Yes},
rating = {4},
date-added = {2017-02-17T21:06:55GMT},
date-modified = {2017-02-19T04:53:33GMT},
url = {http://arxiv.org/abs/1312.6229},
local-url = {file://localhost/Users/yimengzh/Documents/Papers3_revised/Library.papers3/Articles/2013/Sermanet/arXiv%202013%20Sermanet.pdf},
file = {{arXiv 2013 Sermanet.pdf:/Users/yimengzh/Documents/Papers3_revised/Library.papers3/Articles/2013/Sermanet/arXiv 2013 Sermanet.pdf:application/pdf}},
uri = {\url{papers3://publication/uuid/E9245B0C-8283-4CC0-A828-BBEE830F6D69}}
}
~~~

## R-CNN methods

### Original R-CNN

R. B. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” arXiv, vol. cs.CV, Nov. 2013.

General idea: region proposal + CNN feature + SVM. It's kind of a patchwork, so pure CNN approaches were proposed later on.

Hoiem may have a good analysis tool for detection errors. Might be worth looking at.

Relationship to OverFeat (section 4.6):

OverFeat can be seen (roughly) as a special case of R-CNN, as mentioned in Section 4.6. But maybe it fails due to some details, such as SVM training, region proposals, etc.


pp. 2 left

> However, units high up in our network, which has five convolutional layers, have very large receptive fields (195 × 195 pixels) and strides (32×32 pixels) in the input image, which makes precise localization within the sliding-window paradigm an open technical challenge.

As in OverFeat, multiscale + bbox regression can always be combined with the sliding-window approach to get precise localization. Maybe the authors think this would make the system too complicated, or that it doesn't give good performance in practice, as shown by OverFeat.
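The 195 × 195 receptive field and 32 × 32 stride in the quote can be reproduced with the usual layer-by-layer recurrence. The kernel sizes and strides below are the standard AlexNet-style settings up to pool5, which I am assuming here; they are not taken from the notes.

```python
def receptive_field_and_stride(layers):
    """layers: list of (name, kernel_size, stride), ordered from input to output."""
    receptive_field, jump = 1, 1
    for _name, kernel, stride in layers:
        # Each layer widens the field by (kernel - 1) steps of the current jump.
        receptive_field += (kernel - 1) * jump
        # The distance between neighbouring units, in input pixels, grows by the stride.
        jump *= stride
    return receptive_field, jump


# Assumed AlexNet-like settings: (name, kernel size, stride).
ALEXNET_LIKE = [
    ("conv1", 11, 4), ("pool1", 3, 2),
    ("conv2", 5, 1), ("pool2", 3, 2),
    ("conv3", 3, 1), ("conv4", 3, 1),
    ("conv5", 3, 1), ("pool5", 3, 2),
]
print(receptive_field_and_stride(ALEXNET_LIKE))  # (195, 32)
```

With a stride of 32 input pixels between neighbouring pool5 units, a plain sliding window at this layer is spatially very coarse, which is the localization problem the quote refers to.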

pp. 6, end of 3.2

> The network appears to learn a representation that combines a small number of class-tuned features together with a distributed representation of shape, texture, color, and material properties. The subsequent fully connected layer fc6 has the ability to model a large set of compositions of these rich features.

Seems that this is the case for all CNNs.

pp. 6 3.2

> Much of the CNN’s representational power comes from its convolutional layers, rather than from the much larger densely connected layers.

This looks great.

pp. 7 top

> which suggests that the pool5 features learned from ImageNet are general and that most of the improvement is gained from learning domain-specific non-linear classifiers on top of them.

This is good insight as well

pp. 7 above 3.3

> when compared internally to their private DPM baselines—both use non-public implementations of DPM that underperform the open source version [20]

Well probably this is a way to publish papers...



pp. 10 Section 5

Here they first use CPMC <http://www.maths.lth.se/matematiklth/personal/sminchis/code/cpmc/> to get segments (which are not rectangular) in the image, and then find the bbox containing each segment to send to the CNN. Since two segments for different objects may have nearly the same bbox (say a person holding a very tall tripod: person and tripod may have very similar bboxes), they replace the background with the mean color to disambiguate. See <https://people.eecs.berkeley.edu/~rbg/slides/rcnn-cvpr14-slides.pdf>.
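A toy sketch of the disambiguation step, using nested lists instead of real images. Two segments share the same bounding box, but after the pixels outside each segment are replaced by a mean color, the two crops sent to the CNN differ. The function names, the tiny 2x2 "image", and the particular `mean_color` value are all illustrative assumptions.

```python
def bounding_box(mask):
    """Tight box (x1, y1, x2, y2), inclusive, around the nonzero mask pixels."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    return min(cols), min(rows), max(cols), max(rows)


def mask_background_with_mean(image, mask, mean_color):
    """Replace every pixel outside the segment with the mean color."""
    return [
        [pixel if keep else mean_color for pixel, keep in zip(image_row, mask_row)]
        for image_row, mask_row in zip(image, mask)
    ]


# Grayscale toy image and two different segments with the same bounding box.
image = [[10, 20], [30, 40]]
segment_a = [[1, 0], [1, 1]]
segment_b = [[0, 1], [1, 1]]
mean_color = 25

print(bounding_box(segment_a) == bounding_box(segment_b))
print(mask_background_with_mean(image, segment_a, mean_color))
print(mask_background_with_mean(image, segment_b, mean_color))
```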



![](./_rcnn/rcnn_segmentation.png)

* p.1: We find that R-CNN outperforms OverFeat by a large margin on the 200-class ILSVRC2013 detection dataset. -- Highlighted Feb 19, 2017
* p.2: One approach frames localization as a regression problem. However, work from Szegedy et al. [38], concurrent with our own, indicates that this strategy may not fare well in practice (they report a mAP of 30.5% on VOC 2007 compared to the 58.5% achieved by our method) -- Highlighted Feb 19, 2017
* p.2: As an immediate consequence of this analysis, we demonstrate that a simple bounding-box regression method significantly reduces mislocalizations, which are the dominant error mode. -- Highlighted Feb 19, 2017
* p.2: At test time, our method generates around 2000 category-independent region proposals for the input image, extracts a fixed-length feature vector from each proposal using a CNN, and then classifies each region with category-specific linear SVMs. -- Highlighted Feb 19, 2017
* p.2: OverFeat uses a sliding-window CNN for detection and until now was the best performing method on ILSVRC2013 detection. We show that R-CNN significantly outperforms OverFeat, with a mAP of 31.4% versus 24.3%. -- Highlighted Feb 19, 2017
* p.3: The features used in the UVA detection system [39], for example, are two orders of magnitude larger than ours (360k vs. 4k-dimensional). -- Highlighted Feb 19, 2017
* p.3: Prior to warping, we dilate the tight bounding box so that at the warped size there are exactly p pixels of warped image context around the original box (we use p = 16). -- Highlighted Feb 19, 2017
* p.3: Aside from replacing the CNN’s ImageNet-specific 1000-way classification layer with a randomly initialized (N + 1)-way classification layer (where N is the number of object classes, plus 1 for background) -- Highlighted Feb 19, 2017
* p.4: We bias the sampling towards positive windows because they are extremely rare compared to background. -- Highlighted Feb 19, 2017
* p.4: indicating that there is significant nuance in how CNNs can be applied to object detection, leading to greatly varying outcomes. -- Highlighted Feb 19, 2017
* p.6: Much of the CNN’s representational power comes from its convolutional layers, rather than from the much larger densely connected layers. -- Highlighted Feb 19, 2017
* p.6: The network appears to learn a representation that combines a small number of class-tuned features together with a distributed representation of shape, texture, color, and material properties. The subsequent fully connected layer fc6 has the ability to model a large set of compositions of these rich features. -- Highlighted Feb 19, 2017
* p.7: which suggests that the pool5 features learned from ImageNet are general and that most of the improvement is gained from learning domain-specific non-linear classifiers on top of them. -- Highlighted Feb 19, 2017
* p.7: Based on the error analysis, we implemented a simple method to reduce localization errors. -- Highlighted Feb 19, 2017
* p.7: when compared internally to their private DPM baselines—both use nonpublic implementations of DPM that underperform the open source version [20] -- Highlighted Feb 19, 2017
* p.8: Unlike val and test, the train images (due to their large number) are not exhaustively annotated. -- Highlighted Feb 19, 2017
* p.9: This recall is notably lower than in PASCAL, where it is approximately 98%, indicating significant room for improvement in the region proposal stage. -- Highlighted Feb 19, 2017
* p.9: OverFeat can be seen (roughly) as a special case of R-CNN. -- Highlighted Feb 19, 2017
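
The p.3 highlight about dilating the box before warping (p = 16) can be worked out as follows. This is a sketch assuming a 227 × 227 warped input, so that the original box maps to the central 195 × 195 pixels and the context fills the remaining 16-pixel border. Clipping at the image boundary is not handled, and the function name is made up.

```python
def dilate_box_for_context(box, warped_size=227, context_pixels=16):
    """Grow (x1, y1, x2, y2) so the warped crop has `context_pixels` of border."""
    x1, y1, x2, y2 = box
    # After warping, the original box should span warped_size - 2 * context_pixels.
    scale = warped_size / (warped_size - 2 * context_pixels)
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    half_w = (x2 - x1) * scale / 2.0
    half_h = (y2 - y1) * scale / 2.0
    return (cx - half_w, cy - half_h, cx + half_w, cy + half_h)


# A 195 x 195 box needs no rescaling, so it simply grows by 16 pixels per side.
print(dilate_box_for_context((0, 0, 195, 195)))
```

For a box of another size, the added context in original pixels scales with the box, because the warp itself rescales each axis.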

~~~
@article{Girshick:2013vu,
author = {Girshick, Ross B and Donahue, J and Darrell, Trevor and Malik, Jitendra},
title = {{Rich feature hierarchies for accurate object detection and semantic segmentation}},
journal = {ArXiv e-prints},
year = {2013},
volume = {cs.CV},
month = nov,
annote = {General idea: region proposal + CNN feature + SVM. It's kind of a patchwork, so later on pure CNN approches have been proposed.

Hoiem may have a good analysis tool for detection errors. Might worth looking at.

Relationship to OverFeat (section 4.6):

OverFeat can be seen (roughly) as a special case of R-CNN, as mentioned in Section 4.6. But maybe it fails due to some details, such as SVM training, region proposals, etc.


pp. 2 left

> However, units high up in our network, which has five convolutional layers, have very large receptive fields (195 {\texttimes} 195 pixels) and strides (32{\texttimes}32 pixels) in the input image, which makes precise localization within the sliding-window paradigm an open technical challenge.

As in OverFeat, multiscale + bbox regression could always be used with this approach for precise localization. Maybe the authors think this would make the system too complicated, or that it doesn't perform well in practice, as shown by OverFeat.

pp. 6, end of 3.2

> The network appears to learn a representation that combines a small number of class-tuned features together with a distributed representation of shape, texture, color, and material properties. The subsequent fully connected layer fc6 has the ability to model a large set of compositions of these rich features.

Seems that this is the case for all CNNs.

pp. 6 3.2

> Much of the CNN{\textquoteright}s representational power comes from its convolutional layers, rather than from the much larger densely connected layers.

This looks great.

pp. 7 top

> which suggests that the pool5 features learned from ImageNet are general and that most of the improvement is gained from learning domain-specific non-linear classifiers on top of them.

This is good insight as well

pp. 7 above 3.3

> when compared internally to their private DPM baselines{\textemdash}both use non-public implementations of DPM that underperform the open source version [20]

Well probably this is a way to publish papers...



pp. 10 Section 5

Here they first use CPMC <http://www.maths.lth.se/matematiklth/personal/sminchis/code/cpmc/> to get segments (which are not rectangular) in the image, and then find a bbox containing the segment to send to CNN. Since two segments for different objects may have same bbox (say a person holding a very tall tripod, person and tripod may have very similar bbox), they replace background with mean color to disambiguate this. See <https://people.eecs.berkeley.edu/{\textasciitilde}rbg/slides/rcnn-cvpr14-slides.pdf>.



},
keywords = {classics, deep learning},
read = {Yes},
rating = {5},
date-added = {2017-02-19T15:46:40GMT},
date-modified = {2017-02-19T19:13:46GMT},
url = {http://arxiv.org/abs/1311.2524},
local-url = {file://localhost/Users/yimengzh/Documents/Papers3_revised/Library.papers3/Articles/2013/Girshick/arXiv%202013%20Girshick.pdf},
file = {{arXiv 2013 Girshick.pdf:/Users/yimengzh/Documents/Papers3_revised/Library.papers3/Articles/2013/Girshick/arXiv 2013 Girshick.pdf:application/pdf}},
uri = {\url{papers3://publication/uuid/0EC70D5F-CEA5-43DE-A6E2-AF820F2E22A4}}
}
~~~

### Use region proposals on feature maps instead of images

K. He, X. Zhang, S. Ren, and J. Sun, “Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition,” presented at the Computer Vision - ECCV 2014 - 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part III, 2014, vol. 8691, pp. 346–361.

Notes: p. 347: the quote below is essentially what SPP does. Note that max pooling is used. In the detection part, multiple scales are similarly used.

> In other words, we perform some information “aggregation” at a deeper stage of the network hierarchy (between convolutional layers and fully-connected layers) to avoid the need for cropping or warping at the beginning.
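
To make the "aggregation" step concrete, here is a minimal sketch of spatial pyramid max pooling on a single-channel 2D map. The pyramid levels `(1, 2, 4)` and the floor/ceil bin boundaries are my own assumptions for illustration, not necessarily the paper's exact configuration. The point is only that the output length is fixed regardless of the input size.

```python
import math

def spp_max_pool(feature_map, levels=(1, 2, 4)):
    """Max-pool a 2D map (list of rows) into a fixed-length vector.

    For each pyramid level n, the map is split into n x n bins whose
    boundaries scale with the input size, so the output length
    sum(n * n for n in levels) does not depend on the input size.
    """
    height, width = len(feature_map), len(feature_map[0])
    pooled = []
    for n in levels:
        for i in range(n):
            # Bin i covers rows [floor(i*h/n), ceil((i+1)*h/n)).
            r0, r1 = (i * height) // n, math.ceil((i + 1) * height / n)
            for j in range(n):
                c0, c1 = (j * width) // n, math.ceil((j + 1) * width / n)
                pooled.append(max(feature_map[r][c]
                                  for r in range(r0, r1)
                                  for c in range(c0, c1)))
    return pooled

# Two inputs of different sizes give vectors of the same length.
small = [[r * 5 + c for c in range(5)] for r in range(4)]    # 4 x 5
large = [[r * 9 + c for c in range(9)] for r in range(13)]   # 13 x 9
assert len(spp_max_pool(small)) == len(spp_max_pool(large)) == 21
```

In a real network this would be applied per channel on the last conv feature map, and the fixed-length result is what goes into the fully-connected layers.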

~~~
@inproceedings{He:2014dz,
author = {He, Kaiming and Zhang, Xiangyu and Ren, Shaoqing and Sun, Jian},
title = {{Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition}},
booktitle = {Computer Vision - ECCV 2014 - 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part III},
year = {2014},
editor = {Fleet, David J and Pajdla, Tom{\'a}s and Schiele, Bernt and Tuytelaars, Tinne},
pages = {346--361},
publisher = {Springer},
annote = {p. 347: the quote below is essentially what SPP does. Note that max pooling is used. In the detection part, multiple scales are similarly used.

> In other words, we perform some information {\textquotedblleft}aggregation{\textquotedblright} at a deeper stage of the network hierarchy (between convolutional layers and fully-connected layers) to avoid the need for cropping or warping at the beginning.},
keywords = {deep learning},
doi = {10.1007/978-3-319-10578-9_23},
language = {English},
read = {Yes},
rating = {3},
date-added = {2017-02-16T19:48:05GMT},
date-modified = {2017-02-20T05:15:50GMT},
abstract = {Existing deep convolutional neural networks (CNNs) require a fixed-size (e.g.~224{\texttimes}224) input image. This requirement is {\textquotedblleft}artificial{\textquotedblright} and may hurt the recognition accuracy for the images or sub-images},
url = {http://dx.doi.org/10.1007/978-3-319-10578-9_23},
local-url = {file://localhost/Users/yimengzh/Documents/Papers3_revised/Library.papers3/Articles/2014/He/ECCV%202014%20Part%20III%202014%20He.pdf},
file = {{ECCV 2014 Part III 2014 He.pdf:/Users/yimengzh/Documents/Papers3_revised/Library.papers3/Articles/2014/He/ECCV 2014 Part III 2014 He.pdf:application/pdf;ECCV 2014 Part III 2014 He.pdf:/Users/yimengzh/Documents/Papers3_revised/Library.papers3/Articles/2014/He/ECCV 2014 Part III 2014 He.pdf:application/pdf}},
uri = {\url{papers3://publication/doi/10.1007/978-3-319-10578-9_23}}
}
~~~

### Fast R-CNN

R. B. Girshick, “Fast R-CNN,” presented at the 2015 IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1440–1448.

Notes: This is an improvement over SPPNet, see section 1.2 for contributions.

On p. 1442, in the loss equation Eq. (1), u is the ground-truth class label: 1 to K for a positive example (1-1000 for ImageNet) and 0 for background, i.e. a negative example.

On p. 1442, the backpropagation derivation is based on a 1d network.

Section 5.

There are some insights about network design

1. Multi-task loss is helpful
2. Scale invariance may not help much. Not sure this is challenged later.
3. Region proposal is better than dumb sliding window.

Other notes
> p.1442: When the regression targets are unbounded, training with L2 loss can require careful tuning of learning rates in order to prevent exploding gradients. Eq. 3 eliminates this sensitivity. -- Highlighted Feb 20, 2017
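
A small sketch of the smooth L1 loss (Eq. 3) and how it enters the multi-task loss of Eq. (1), as I understand them. The function names and the plain-float interface are mine. `lam` stands for the balancing weight lambda, and `cls_loss` is assumed to be computed elsewhere.

```python
def smooth_l1(x):
    """Smooth L1 (Eq. 3): quadratic for |x| < 1, linear otherwise."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5

def smooth_l1_grad(x):
    """Derivative of smooth L1; bounded in [-1, 1], unlike the L2 gradient x."""
    return x if abs(x) < 1.0 else (1.0 if x > 0 else -1.0)

def loc_loss(pred_box, target_box):
    """Sum of smooth L1 over the four box regression coordinates."""
    return sum(smooth_l1(p - t) for p, t in zip(pred_box, target_box))

def multitask_loss(cls_loss, pred_box, target_box, u, lam=1.0):
    """L = L_cls + lam * [u >= 1] * L_loc; background RoIs (u == 0) skip L_loc."""
    if u >= 1:
        return cls_loss + lam * loc_loss(pred_box, target_box)
    return cls_loss
```

The bounded gradient is what the highlighted sentence refers to: a far-off regression target contributes a gradient of magnitude 1 instead of one proportional to the error.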

~~~
@inproceedings{Girshick:2015ib,
author = {Girshick, Ross B},
title = {{Fast R-CNN}},
booktitle = {2015 IEEE International Conference on Computer Vision (ICCV)},
year = {2015},
pages = {1440--1448},
publisher = {IEEE},
annote = {This is an improvement over SPPNet, see section 1.2 for contributions.

On p. 1442, in the loss equation Eq. (1), u is the ground-truth class label: 1 to K for a positive example (1-1000 for ImageNet) and 0 for background, i.e. a negative example.

On p. 1442, the backpropagation derivation is based on a 1d network.

Section 5.

There are some insights about network design

1. Multi-task loss is helpful
2. Scale invariance may not help much. Not sure this is challenged later.
3. Region proposal is better than dumb sliding window.},
keywords = {deep learning},
doi = {10.1109/ICCV.2015.169},
isbn = {978-1-4673-8391-2},
read = {Yes},
rating = {4},
date-added = {2017-02-14T21:55:24GMT},
date-modified = {2017-02-20T15:41:28GMT},
url = {http://ieeexplore.ieee.org/document/7410526/},
local-url = {file://localhost/Users/yimengzh/Documents/Papers3_revised/Library.papers3/Articles/2015/Girshick/ICCV%202015%202015%20Girshick.pdf},
file = {{ICCV 2015 2015 Girshick.pdf:/Users/yimengzh/Documents/Papers3_revised/Library.papers3/Articles/2015/Girshick/ICCV 2015 2015 Girshick.pdf:application/pdf;ICCV 2015 2015 Girshick.pdf:/Users/yimengzh/Documents/Papers3_revised/Library.papers3/Articles/2015/Girshick/ICCV 2015 2015 Girshick.pdf:application/pdf}},
uri = {\url{papers3://publication/doi/10.1109/ICCV.2015.169}}
}
~~~

### Faster R-CNN

S. Ren, K. He, R. B. Girshick, and J. Sun, “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks,” presented at the Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, 2015, pp. 91–99.

Fast R-CNN plus a region proposal network (RPN); the two are merged into one network that shares convolutional features.

[20,22,24] are some region proposal works.

RPN

This network first generates a conv map, then there's a 3x3 convolution layer, giving an intermediate vector for each location in the conv map. Based on this intermediate vector, we get k pairs of objectness scores and box coordinates (one pair per anchor).

Scores in RPN are used for NMS, and to select top proposals. See the second-to-last paragraph of p. 6.
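
How the RPN scores drive NMS and top-N selection can be sketched as below. This is a plain greedy NMS written from scratch for illustration, not the authors' code. The defaults `iou_threshold=0.7` and `top_n=300` are values I recall from the paper and should be treated as assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms_top_n(boxes, scores, iou_threshold=0.7, top_n=300):
    """Greedy NMS: visit boxes by descending score, keep a box only if it
    does not overlap an already kept box too much; stop after top_n boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
            if len(keep) == top_n:
                break
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms_top_n(boxes, scores, iou_threshold=0.5)  # box 1 is suppressed by box 0
```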


They use anchors so that 1) multiple bboxes can be generated from the same feature vector, and 2) ground truth can be assigned to each output during training, which would be ambiguous when outputting k boxes without anchors.
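
A sketch of what the anchors look like at one location, and how a predicted offset is decoded relative to its anchor. The 3 scales x 3 aspect ratios (k = 9) follow my reading of the paper. The function names and the (x1, y1, x2, y2) box layout are my own choices.

```python
import math

def make_anchors(center_x, center_y,
                 scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return k = len(scales) * len(ratios) boxes centered at one location.

    Each anchor has area scale**2 and height/width equal to `ratio`.
    """
    anchors = []
    for scale in scales:
        for ratio in ratios:
            w = scale / math.sqrt(ratio)
            h = scale * math.sqrt(ratio)
            anchors.append((center_x - w / 2, center_y - h / 2,
                            center_x + w / 2, center_y + h / 2))
    return anchors

def decode(anchor, deltas):
    """Apply predicted (tx, ty, tw, th) to an anchor box."""
    xa, ya = (anchor[0] + anchor[2]) / 2, (anchor[1] + anchor[3]) / 2
    wa, ha = anchor[2] - anchor[0], anchor[3] - anchor[1]
    tx, ty, tw, th = deltas
    x, y = xa + tx * wa, ya + ty * ha
    w, h = wa * math.exp(tw), ha * math.exp(th)
    return (x - w / 2, y - h / 2, x + w / 2, y + h / 2)
```

Because every output slot is tied to a fixed reference box, each ground-truth box can be matched to specific slots by IoU, which is the point 2) above.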

On p. 5, they discuss optimization: they use four-step alternating training so that RPN and Fast R-CNN share the same conv layers.

~~~
@inproceedings{Ren:2015ug,
author = {Ren, Shaoqing and He, Kaiming and Girshick, Ross B and Sun, Jian},
title = {{Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks}},
booktitle = {Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada},
year = {2015},
editor = {Cortes, Corinna and Lawrence, Neil D and Lee, Daniel D and Sugiyama, Masashi and Garnett, Roman},
pages = {91--99},
annote = {Fast R-CNN plus a region proposal network (RPN); the two are merged into one network that shares convolutional features.

[20,22,24] are some region proposal works.

RPN

This network first generates a conv map, then there's a 3x3 convolution layer, giving an intermediate vector for each location in the conv map. Based on this intermediate vector, we get k pairs of objectness scores and box coordinates (one pair per anchor).

Scores in RPN are used for NMS, and to select top proposals. See second to last paragraph of pp. 6.




They use anchors so that 1) multiple bboxes can be generated from the same feature vector, and 2) ground truth can be assigned to each output during training, which would be ambiguous when outputting k boxes without anchors.

On p. 5, they discuss optimization: they use four-step alternating training so that RPN and Fast R-CNN share the same conv layers.},
keywords = {deep learning},
read = {Yes},
rating = {4},
date-added = {2017-02-16T21:08:25GMT},
date-modified = {2017-02-20T17:27:31GMT},
url = {http://papers.nips.cc/paper/5638-faster-r-cnn-towards-real-time-object-detection-with-region-proposal-networks},
local-url = {file://localhost/Users/yimengzh/Documents/Papers3_revised/Library.papers3/Articles/2015/Ren/NIPS%202015%202015%20Ren.pdf},
file = {{NIPS 2015 2015 Ren.pdf:/Users/yimengzh/Documents/Papers3_revised/Library.papers3/Articles/2015/Ren/NIPS 2015 2015 Ren.pdf:application/pdf;NIPS 2015 2015 Ren.pdf:/Users/yimengzh/Documents/Papers3_revised/Library.papers3/Articles/2015/Ren/NIPS 2015 2015 Ren.pdf:application/pdf}},
uri = {\url{papers3://publication/uuid/93B2D728-8E69-4E4E-8516-E46B3A47F094}}
}
~~~

## Multi-scale, context, top-down, etc.: bells and whistles for R-CNN

### Inside-Outside net. Skip connection + RNN context

S. Bell, C. L. Zitnick, K. Bala, and R. B. Girshick, “Inside-Outside Net: Detecting Objects in Context with Skip Pooling and Recurrent Neural Networks,” presented at the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 2874–2883.

Notes: My notes are based on the arxiv version <https://arxiv.org/abs/1512.04143> v1.

The idea looks great: use an RNN to provide global context, and lower-layer feature maps for precision. It would be even better to show what those contexts really are.

Notice that when combining feature maps from different scales, they use L2 normalization. I think this is needed mainly because they send this combined feature map into the pretrained fc6 + fc7 classifier. Check Section 3.1.

Actually, while they emphasize the importance of using L2 normalization, I still doubt whether this would work at all. I mean, simply matching L2 norm may not make it work, as fc6 only expects pool5 feature originally. Maybe they have some tricks in learning rate when finetuning? Anyway.
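
To see why normalization matters when concatenating features from different layers, here is a toy sketch. It normalizes each layer's pooled feature as one flat vector and applies a single `scale`, which is a simplification on my part; the paper's exact normalization axis and its learned scale are not reproduced here.

```python
import math

def l2_normalize(vec, eps=1e-12):
    """Scale a vector to (approximately) unit L2 norm."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / (norm + eps) for v in vec]

def combine_features(per_layer_features, scale=1.0):
    """L2-normalize each layer's feature, concatenate, then rescale."""
    combined = []
    for feat in per_layer_features:
        combined.extend(l2_normalize(feat))
    return [scale * v for v in combined]

# A lower layer with large activations and a higher layer with small ones.
lower = [30.0, 40.0]   # norm 50
higher = [0.3, 0.4]    # norm 0.5
combined = combine_features([lower, higher])
```

Without the normalization, `lower` would dominate the concatenated vector by a factor of 100. After it, both layers contribute at the same magnitude, and `scale` is left to bring the result into the range that fc6 expects.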

~~~
@inproceedings{Bell:2016fb,
author = {Bell, Sean and Zitnick, C Lawrence and Bala, Kavita and Girshick, Ross B},
title = {{Inside-Outside Net: Detecting Objects in Context with Skip Pooling and Recurrent Neural Networks}},
booktitle = {2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2016},
pages = {2874--2883},
publisher = {IEEE},
annote = {My notes are based on the arxiv version <https://arxiv.org/abs/1512.04143> v1.

The idea looks great: use an RNN to provide global context, and lower-layer feature maps for precision. It would be even better to show what those contexts really are.

Notice that when combining feature maps from different scales, they use L2 normalization. I think this is needed mainly because they send this combined feature map into the pretrained fc6 + fc7 classifier. Check Section 3.1.

Actually, while they emphasize the importance of using L2 normalization, I still doubt whether this would work at all. I mean, simply matching L2 norm may not make it work, as fc6 only expects pool5 feature originally. Maybe they have some tricks in learning rate when finetuning? Anyway.


},
keywords = {context, deep learning, hierarchical, recurrent},
doi = {10.1109/CVPR.2016.314},
isbn = {978-1-4673-8851-1},
read = {Yes},
rating = {4},
date-added = {2017-04-25T18:16:29GMT},
date-modified = {2017-04-25T19:00:29GMT},
url = {http://ieeexplore.ieee.org/document/7780683/},
local-url = {file://localhost/Users/yimengzh/Documents/Papers3_revised/Library.papers3/Articles/Unknown/Bell/CVPR%202016%20%20Bell.pdf},
file = {{CVPR 2016  Bell.pdf:/Users/yimengzh/Documents/Papers3_revised/Library.papers3/Articles/Unknown/Bell/CVPR 2016  Bell.pdf:application/pdf;CVPR 2016  Bell.pdf:/Users/yimengzh/Documents/Papers3_revised/Library.papers3/Articles/Unknown/Bell/CVPR 2016  Bell.pdf:application/pdf}},
uri = {\url{papers3://publication/doi/10.1109/CVPR.2016.314}}
}
~~~

### Top down modulation

A. Shrivastava, R. Sukthankar, J. Malik, and A. Gupta, “Beyond Skip Connections: Top-Down Modulation for Object Detection,” arXiv, vol. cs.CV, Dec. 2016.

Notes: a great top-down feedback paper for object detection.

The analysis part shows the importance of top-down modulation. More importantly, there should be some conv mechanism to select between bottom-up and top-down features (Figure 4; the convs around T_3,2 and L_2).

Skip connections by themselves work less impressively (Section 6.3; Table 4, skip-pool baseline compared to baseline, as they have roughly the same number of parameters).

Some details.

#### Above 4.1.3

> For ResNet101, we change the pooling stride after conv3_x block to 1, and use atrous [6, 34] convolution to maintain the receptive field.

To understand this, you first need to understand the structure of the ResNet-101 blocks. It has the following blocks: conv1, conv2_x, conv3_x, conv4_x (the last block is used for classification in this work).

Each block first uses a 1x1 conv with stride 2 to downsample the input, effectively discarding 3/4 of the data.

In this work, to keep the feature map large, they use a 1x1 conv with stride 1 in the downsampling part instead. However, this means they need to dilate the subsequent conv operations: what used to be a 2x2 block of inputs to a conv (after stride-2 downsampling) is now scattered over the 4 corners of a 3x3 block.

Therefore, they use atrous (also called dilated) conv later on.
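
The argument above can be checked in 1D. In this sketch the stride-2 downsampling is plain subsampling (a stand-in for the 1x1 stride-2 conv), and the signal and weights are arbitrary numbers I picked. Removing the stride and dilating the following conv by 2 reproduces the original outputs at every other position.

```python
def conv1d(signal, weights, dilation=1):
    """Valid 1D correlation; taps are `dilation` samples apart."""
    span = (len(weights) - 1) * dilation
    return [sum(w * signal[i + k * dilation] for k, w in enumerate(weights))
            for i in range(len(signal) - span)]

signal = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8]
weights = [1, -2, 1]

# Original design: stride-2 downsampling, then an ordinary conv.
low_res = conv1d(signal[::2], weights)

# Modified design: keep full resolution, use dilation 2 instead.
high_res = conv1d(signal, weights, dilation=2)

# Every other output of the dilated conv equals the low-resolution output,
# so the pretrained weights still see the same input pattern.
assert high_res[::2] == low_res
```

The dilated version produces twice as many outputs, which is exactly the larger feature map they want, while each output keeps the same field of view as before.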

This approach can be found in earlier works, such as R-FCN <https://arxiv.org/abs/1605.06409> (R-FCN: Object Detection via Region-based Fully Convolutional Networks).

Also, the author of the paper told me it's standard inside Google.

> Hey Yimeng,
> 
> The details followed the standard architecture used inside Google. Some details are available at: https://arxiv.org/abs/1611.10012. 
> 
> Reducing the stride make the FoV smaller; so, to maintain the FoV, we increase make the following layer’s atrous rate 2. Atrous convolution does not increase the output size in this case, but only increases the FoV.
> 
> These tricks were used because Google researchers found it to perform better than the vanilla version. Does this make sense? If not, let me know. We can discuss it futher.

He also said it could be applied to VGG as well, although he didn't use it.

> It might help with VGG as well. I used the standard implementations of Google researchers.

#### end of 4.1.3

> To counter this, we apply RPN at a stride which ensures that computation remains exactly the same (e.g., using stride of 4 for $T_2^{out}$).

This means that, no matter how many levels of top-down modules they have, they always compute the RPN loss at the same number of locations as in the original network without any top-down modules.

The reason for computing at a stride of 4 for $T_2$ is that its feature map size is 150x250, about 4 times larger in each dimension than 37x63, the original feature map size without top-down modules. Check Figure 2.
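
The arithmetic behind the stride choice, using the two feature map sizes quoted above. `num_locations` is a hypothetical helper of mine; the exact count in the real implementation depends on padding and rounding conventions that I did not check.

```python
import math

def num_locations(height, width, stride):
    """Number of positions visited when taking every `stride`-th cell."""
    return math.ceil(height / stride) * math.ceil(width / stride)

baseline = num_locations(37, 63, stride=1)     # original map, no top-down modules
top_down = num_locations(150, 250, stride=4)   # T_2 output, RPN applied at stride 4
dense = num_locations(150, 250, stride=1)      # T_2 output, RPN at every position

print(baseline, top_down, dense)  # 2331 2394 37500
```

So applying the RPN densely on the larger map would cost about 16x more positions, while stride 4 brings it back to roughly the baseline count.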

~~~
@article{Shrivastava:2016tr,
author = {Shrivastava, Abhinav and Sukthankar, Rahul and Malik, Jitendra and Gupta, Abhinav},
title = {{Beyond Skip Connections: Top-Down Modulation for Object Detection}},
journal = {ArXiv e-prints},
year = {2016},
volume = {cs.CV},
month = dec,
annote = {great top down feedback paper for object detection.

analysis part shows the importance of having top-down modulation. More importantly, we should have some conv mechanism to select bottom up and top-down feature (Figure 4; conv around T_3,2 and L_2).

Skip connection itself works less impressively. (Section 6.3; Table 4, skip-pool baseline compared to baseline; as they have roughly same number of parameters).



some detail.


Above 4.1.3

> For ResNet101, we change the pooling stride after conv3_x block to 1, and use atrous [6, 34] convolution to maintain the receptive field.

To understand this, you first need to understand the structure of the ResNet-101 blocks. It has the following blocks: conv1, conv2_x, conv3_x, conv4_x (the last block is used for classification in this work).

Each block first uses a 1x1 conv with stride 2 to downsample the input, effectively discarding 3/4 of the data.

In this work, to keep the feature map large, they use a 1x1 conv with stride 1 in the downsampling part instead. However, this means they need to dilate the subsequent conv operations: what used to be a 2x2 block of inputs to a conv (after stride-2 downsampling) is now scattered over the 4 corners of a 3x3 block.

Therefore, they use atrous (also called dilated) conv later on.

This approach can be found in earlier works, such as R-FCN <https://arxiv.org/abs/1605.06409> (R-FCN: Object Detection via Region-based Fully Convolutional Networks).

Also the author of the paper told me it's standard inside Google.

> Hey Yimeng,
> 
> The details followed the standard architecture used inside Google. Some details are available at: https://arxiv.org/abs/1611.10012. 
> 
> Reducing the stride make the FoV smaller; so, to maintain the FoV, we increase make the following layer{\textquoteright}s atrous rate 2. Atrous convolution does not increase the output size in this case, but only increases the FoV.
> 
> These tricks were used because Google researchers found it to perform better than the vanilla version. Does this make sense? If not, let me know. We can discuss it futher.

Also he said it could be applied to VGG as well. although he didn't use it.

> It might help with VGG as well. I used the standard implementations of Google researchers.



end of 4.1.3

> To counter this, we apply RPN at a stride which ensures that computation remains exactly the same (e.g., using stride of 4 for $T_2^{out}$).

This means that, no matter how many levels of top-down modules they have, they always compute the RPN loss at the same number of locations as in the original network without any top-down modules.

The reason for computing at a stride of 4 for $T_2$ is that its feature map size is 150x250, about 4 times larger in each dimension than 37x63, the original feature map size without top-down modules. Check Figure 2.},
keywords = {context, deep learning, hierarchical},
read = {Yes},
rating = {5},
date-added = {2017-04-25T18:27:10GMT},
date-modified = {2017-04-25T18:49:00GMT},
url = {http://arxiv.org/abs/1612.06851},
local-url = {file://localhost/Users/yimengzh/Documents/Papers3_revised/Library.papers3/Articles/2016/Shrivastava/arXiv%202016%20Shrivastava.pdf},
file = {{arXiv 2016 Shrivastava.pdf:/Users/yimengzh/Documents/Papers3_revised/Library.papers3/Articles/2016/Shrivastava/arXiv 2016 Shrivastava.pdf:application/pdf}},
uri = {\url{papers3://publication/uuid/B959A444-0282-43C9-9CA7-C00DD49E1AC1}}
}
~~~

## Face detection

### Tiny faces

P. Hu and D. Ramanan, “Finding Tiny Faces,” arXiv, vol. cs.CV, Dec. 2016.

Essentially, they train different classifiers for faces of different sizes in pixels (raw size). For faces of different raw sizes, the classifiers will be of similar size in raw pixels (after rescaling the input image), but they will contain different amounts of context relative to the face.

Here, for a face of size h x w and an image scale sigma, they simply use a detector of size h sigma x w sigma, after upsampling the features so that each pixel has a hypercolumn feature. However, the smaller the sigma, the more context is actually given; the higher the sigma, the less context is given. (See Figure 7)
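
A toy sketch of what "each pixel has a hypercolumn feature" means: upsample the coarser layer to the finer layer's resolution and concatenate channel vectors per pixel. I use nearest-neighbor upsampling and nested lists to keep it short; the interpolation used in the paper may well differ, so treat this only as an illustration of the data layout.

```python
def upsample_nearest(fmap, factor):
    """Nearest-neighbor upsampling of an H x W x C map stored as nested lists."""
    out = []
    for row in fmap:
        wide = [cell for cell in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out

def hypercolumns(fine, coarse, factor):
    """Per-pixel concatenation of a fine map and an upsampled coarse map."""
    up = upsample_nearest(coarse, factor)
    return [[f + c for f, c in zip(fine_row, up_row)]
            for fine_row, up_row in zip(fine, up)]

# 4 x 4 map with 1 channel (fine) and 2 x 2 map with 2 channels (coarse).
fine = [[[r * 4 + c] for c in range(4)] for r in range(4)]
coarse = [[[10, 11], [20, 21]],
          [[30, 31], [40, 41]]]
hc = hypercolumns(fine, coarse, factor=2)   # 4 x 4 map with 3 channels
```

Each pixel of `hc` then carries both high-resolution detail (from `fine`) and a cue with a larger receptive field (from `coarse`), which is the "foveal" descriptor idea.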

This seems contradictory to the earlier result in 3.1: 3.2 claims that less context is better for small faces and more context is better for big faces, while 3.1 claims context always helps. But maybe these two are separate issues.

3.2 is probably due to dataset bias favoring the amount of context for medium-sized faces, given that hypercolumn features are used. 3.1 compares hypercolumn features against single-layer feature maps.


pp. 2

> a delicate mixture of scale-specific detectors that are used in a scale-invariant fashion (by processing an image pyramid to capture large scale variations).

"scale-invariant" refers to the image pyramid; "scale-specific" refers to the different detectors.

pp. 3

> we select a range of object sizes and aspects through cross-validation

I don't see any cross-validation, only clustering.

Section 3.1 Context

End of the left column: "our results suggest that we can build..." just means that the same feature map can be extracted for all the detectors, all of which use this hypercolumn feature map.

Section 3.2 Resolution

In face images: mostly small faces, fewer medium, fewest large. In ImageNet: mostly medium-sized objects. These can explain some of the results here.

4 Approach: scale-specific detection

Pruning: I think the reason they say the 31x25 face template is redundant is that, looking at Fig 8, the 1x and 2x curves are roughly the same over a pretty big range. The reason that using all the classifiers is somehow worse than using A+B might be some correlation between different classifiers.

Fig 9: the CNN features are the same; the templates differ across scales. For 1x and 0.5x, only A templates are used; for 2x, A+B.

other notes

* p.2: we find that interpolating the lowest layer of the pyramid is particularly crucial for finding small objects [5]. -- Highlighted Feb 22, 2017
* p.2: a delicate mixture of scale-specific detectors that are used in a scale-invariant fashion (by processing an image pyramid to capture large scale variations). -- Highlighted Feb 22, 2017
* p.2: Instead of a “one-size-fits-all” approach, we train separate detectors tuned for different scales (and aspect ratios). -- Highlighted Feb 22, 2017
* p.2: We demonstrate that convolutional deep features extracted from multiple layers (also known as “hypercolumn” features [8, 14]) are effective “foveal” descriptors that capture both high-resolution detail and coarse low-resolution cues across large receptive field (Fig. 2 (e)). We show that high-resolution components of our foveal descriptors (extracted from lower convolutional layers) are crucial for such accurate localization in Fig. 5. -- Highlighted Feb 22, 2017
* p.3: Our work differs in our exploration of context for tiny object detection. -- Highlighted Feb 22, 2017
* p.3: We show that context is mostly useful for finding low-resolution faces. -- Highlighted Feb 22, 2017
* p.4: In Fig. 5, we compare between descriptors with and without foveal structure, which shows that high-resolution components of our foveal descriptors are crucial for accurate detection on small instances. -- Highlighted Feb 22, 2017

~~~
@article{Hu:2016uc,
author = {Hu, Peiyun and Ramanan, Deva},
title = {{Finding Tiny Faces}},
journal = {ArXiv e-prints},
year = {2016},
volume = {cs.CV},
month = dec,
annote = {Essentially, they train different classifiers for faces of different sizes in pixels (raw size). For different sizes, the classifiers will be of similar size in raw pixels (after rescaling the input image), but they will contain different amounts of context relative to the face.

Here, for a face of size h x w and an image scale sigma, they simply use a detector of size h sigma x w sigma, after upsampling the features so that each pixel has a hypercolumn feature. However, the smaller the sigma, the more context is actually given; the higher the sigma, the less context is given. This seems contradictory to the earlier result in 3.1, as 3.2 claims that less context is better for small faces and more context is better for big faces. But this is probably due to dataset bias, with all of them using the same feature map. And Section 1 is comparing different feature maps.

pp. 2

> a delicate mixture of scale-specific detectors that are used in a scale-invariant fashion (by processing an image pyramid to capture large scale variations).

"scale-invariant" refers to image pyramid, scale-specific refers to different detectors.

pp. 3

> we select a range of object sizes and aspects through cross-validation

Don't see any cross-validation. Only see clustering.

Section 3.1 Context

end of left column, "our results suggest that we can build..." just means we can extract same feature map for all of our detectors, all using this hypercolumn feature map.

Section 3.2 Resolution

in face image: most small face, fewer medium, fewest large; in ImageNet: most medium objects. These can explain some results here.

4 Approach: scale-specific detection

Pruning: I think the reason they say 31x25 face template is redundant is that by looking at Fig 8, you know 1x and 2x curves are roughly the same in a pretty big range. The reason that somehow after using all the classifiers is worse than using A+B, might be due to some correlation between different classifiers.

Fig 9. the CNN features are the same; templates are different for different scales. For 1x and 0.5x, only A templates; for 2x, A+B.},
keywords = {deep learning},
read = {Yes},
rating = {4},
date-added = {2017-02-20T19:01:42GMT},
date-modified = {2017-03-02T15:35:36GMT},
url = {http://arxiv.org/abs/1612.04402},
local-url = {file://localhost/Users/yimengzh/Documents/Papers3_revised/Library.papers3/Articles/2016/Hu/arXiv%202016%20Hu.pdf},
file = {{arXiv 2016 Hu.pdf:/Users/yimengzh/Documents/Papers3_revised/Library.papers3/Articles/2016/Hu/arXiv 2016 Hu.pdf:application/pdf}},
uri = {\url{papers3://publication/uuid/0F31F332-25C7-441C-9567-DD1F68C2BBE4}}
}
~~~

### cascade of CNNs

H. Li, Z. Lin, X. Shen, J. Brandt, and G. Hua, “A convolutional neural network cascade for face detection,” presented at the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 5325–5334.

Notes: Essential idea: use a coarse feature map from a coarse CNN to quickly reject unlikely positions and sizes of faces, and handle the difficult cases with a finer-scale CNN.

I think it's similar to many other cascade frameworks, such as Viola-Jones face detection or Ross Girshick's Cascade DPM.

A small difference, apart from using CNN features, is that the fc features of the coarser CNN are combined with those of the finer CNN. Check Figure 2.

Some details, such as how to regress the box and how to decide whether to reject, are kind of inelegant.
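
The coarse-to-fine rejection idea can be sketched generically as below. The two scoring functions are toy stand-ins for the coarse and fine CNNs, and the thresholds are arbitrary; nothing here reflects the paper's actual networks, calibration, or NMS steps.

```python
def cascade_detect(windows, stages):
    """Run (score_fn, threshold) stages in order, dropping windows that fail.

    Returns the surviving windows and how many windows each stage scored.
    """
    calls = []
    survivors = list(windows)
    for score_fn, threshold in stages:
        calls.append(len(survivors))
        survivors = [w for w in survivors if score_fn(w) >= threshold]
    return survivors, calls

# Toy stand-ins: windows are integers, "faces" are multiples of 10.
def cheap_score(w):
    return 1.0 if w % 5 == 0 else 0.0    # fast, lets some false positives through

def expensive_score(w):
    return 1.0 if w % 10 == 0 else 0.0   # slow but accurate

faces, calls = cascade_detect(range(100),
                              [(cheap_score, 0.5), (expensive_score, 0.5)])
```

Here the expensive stage only scores the 20 windows that survive the cheap stage instead of all 100, which is where the speedup of any cascade comes from.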

~~~
@inproceedings{Li:2015cb,
author = {Li, Haoxiang and Lin, Zhe and Shen, Xiaohui and Brandt, Jonathan and Hua, Gang},
title = {{A convolutional neural network cascade for face detection}},
booktitle = {2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2015},
pages = {5325--5334},
publisher = {IEEE},
annote = {Essential idea: use a coarse feature map from a coarse CNN to quickly reject unlikely positions and sizes of faces, and handle the difficult cases with a finer-scale CNN.

I think it's similar to many other cascade frameworks, such as Viola-Jones face detection or Ross Girshick's Cascade DPM.

A small difference, apart from using CNN features, is that the fc features of the coarser CNN are combined with those of the finer CNN. Check Figure 2.

Some details, such as how to regress the box and how to decide whether to reject, are kind of inelegant.},
keywords = {deep learning, hierarchical},
doi = {10.1109/CVPR.2015.7299170},
isbn = {978-1-4673-6964-0},
read = {Yes},
rating = {3},
date-added = {2017-04-25T17:30:06GMT},
date-modified = {2017-04-25T17:36:18GMT},
url = {http://ieeexplore.ieee.org/document/7299170/},
local-url = {file://localhost/Users/yimengzh/Documents/Papers3_revised/Library.papers3/Articles/2015/Li/CVPR%202015%202015%20Li.pdf},
file = {{CVPR 2015 2015 Li.pdf:/Users/yimengzh/Documents/Papers3_revised/Library.papers3/Articles/2015/Li/CVPR 2015 2015 Li.pdf:application/pdf;CVPR 2015 2015 Li.pdf:/Users/yimengzh/Documents/Papers3_revised/Library.papers3/Articles/2015/Li/CVPR 2015 2015 Li.pdf:application/pdf}},
uri = {\url{papers3://publication/doi/10.1109/CVPR.2015.7299170}}
}
~~~


## neural network module for learning localization

M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu, “Spatial Transformer Networks,” arXiv, vol. cs.CV, Jun. 2015.

Notes: STN is a great way to localize objects within a CNN.

It's differentiable, and easy to use.

It's important for attention-based models.


The experiments in this paper are so interesting.


some details.

Section 3.4

>  However, with sampling kernels with a fixed, small spatial support (such as the bilinear kernel), downsampling with a spatial transformer can cause aliasing effects.

This is also what I found when downsampling with TensorFlow's bilinear sampling operation. For downsampling, it's better to use mean (area) averaging.
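
A 1D illustration of the aliasing problem. The bilinear sampler below only ever looks at the two nearest samples, so when downsampling by 4 it skips most of the input. Sampling at positions `i * factor` is my assumed alignment convention; other conventions shift the positions but have the same issue.

```python
import math

def bilinear_sample(signal, pos):
    """Linearly interpolate `signal` at a (possibly fractional) position."""
    left = min(int(math.floor(pos)), len(signal) - 1)
    right = min(left + 1, len(signal) - 1)
    frac = pos - left
    return (1 - frac) * signal[left] + frac * signal[right]

def downsample_bilinear(signal, factor):
    """Downsample by sampling at every `factor`-th position."""
    return [bilinear_sample(signal, i * factor)
            for i in range(len(signal) // factor)]

def downsample_mean(signal, factor):
    """Downsample by averaging non-overlapping blocks of `factor` samples."""
    return [sum(signal[i:i + factor]) / factor
            for i in range(0, len(signal) - factor + 1, factor)]

stripes = [0.0, 1.0] * 8      # fine stripes, length 16
shifted = stripes[1:] + [0.0]  # same stripes moved by one sample
```

With these inputs, `downsample_bilinear(stripes, 4)` gives all zeros and `downsample_bilinear(shifted, 4)` gives all ones, so a one-sample shift flips the output completely. `downsample_mean` gives 0.5 everywhere in both cases.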

~~~
@article{Jaderberg:2015vo,
author = {Jaderberg, M and Simonyan, K and Zisserman, Andrew and Kavukcuoglu, K},
title = {{Spatial Transformer Networks}},
journal = {ArXiv e-prints},
year = {2015},
volume = {cs.CV},
month = jun,
annote = {STN is a great way to localize objects within CNN.

It's differentiable, and easy to use.

It's important for those attention based models.


The experiments in this paper are so interesting.


some details.

Section 3.4

>  However, with sampling kernels with a fixed, small spatial support (such as the bilinear kernel), downsampling with a spatial transformer can cause aliasing effects.

This is also what I found when downsampling with TensorFlow's bilinear sampling operation. For downsampling, it's better to use mean (area) averaging.

},
keywords = {deep learning},
read = {Yes},
rating = {5},
date-added = {2017-03-04T17:24:40GMT},
date-modified = {2017-04-24T15:13:44GMT},
url = {http://arxiv.org/abs/1506.02025},
local-url = {file://localhost/Users/yimengzh/Documents/Papers3_revised/Library.papers3/Articles/2015/Jaderberg/arXiv%202015%20Jaderberg.pdf},
file = {{arXiv 2015 Jaderberg.pdf:/Users/yimengzh/Documents/Papers3_revised/Library.papers3/Articles/2015/Jaderberg/arXiv 2015 Jaderberg.pdf:application/pdf}},
uri = {\url{papers3://publication/uuid/C20F59FE-CB4F-4FA0-AE3E-57A7EBC915C7}}
}
~~~

## computation speedup

### share work among feature maps for region proposals by precomputing some feature pyramid as in DPM and older models.

F. Iandola, M. Moskewicz, S. Karayev, R. B. Girshick, T. Darrell, and K. Keutzer, “DenseNet: Implementing Efficient ConvNet Descriptor Pyramids,” arXiv, vol. cs.CV, Apr. 2014.

Notes: essentially, a cheaper way to compute features for region proposals: compute feature maps of the whole image at multiple scales, then crop them to the region proposals.


Actually, this method dates back to older models such as DPM (Fig. 2 in the paper). Maybe it was not implemented for CNNs before because CNN features are global rather than local (like HOG), so cropping from a shared map is less accurate (CNNs have pooling, strides, etc.) than sliding-window HOG features.

It's not elegant (handling of aspect ratios, etc.). Later on, much better methods (Faster R-CNN, etc.) came into use, and people found that a single scale might be sufficient.
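A rough sketch of the "compute once per scale, then crop" idea, not the paper's implementation. The feature extractor here is a stand-in (plain `STRIDE` x `STRIDE` mean pooling instead of a conv net), and `STRIDE`, `features`, `resize_nearest`, `build_pyramid`, and `crop_proposal` are all hypothetical names. The rounding in `crop_proposal` is one arbitrary choice; it is exactly where the inaccuracy from strides and pooling shows up.

```python
import numpy as np

STRIDE = 4  # assumed total stride of the stand-in feature extractor

def features(image):
    """Stand-in for a conv net: non-overlapping STRIDE x STRIDE mean pooling."""
    h, w = image.shape
    h, w = h - h % STRIDE, w - w % STRIDE
    blocks = image[:h, :w].reshape(h // STRIDE, STRIDE, w // STRIDE, STRIDE)
    return blocks.mean(axis=(1, 3))

def resize_nearest(image, scale):
    """Nearest-neighbour resize of a 2-D image by `scale`."""
    h, w = image.shape
    rows = np.minimum((np.arange(int(round(h * scale))) / scale).astype(int), h - 1)
    cols = np.minimum((np.arange(int(round(w * scale))) / scale).astype(int), w - 1)
    return image[np.ix_(rows, cols)]

def build_pyramid(image, scales):
    """Run the feature extractor once per scale on the whole image."""
    return {s: features(resize_nearest(image, s)) for s in scales}

def crop_proposal(pyramid, box, scale):
    """Crop the feature cells covering `box` = (x0, y0, x1, y1) in image coordinates."""
    x0, y0, x1, y1 = box
    to_cell = scale / STRIDE  # image pixels -> feature cells at this scale
    r0, c0 = int(np.floor(y0 * to_cell)), int(np.floor(x0 * to_cell))
    r1, c1 = int(np.ceil(y1 * to_cell)), int(np.ceil(x1 * to_cell))
    return pyramid[scale][r0:r1, c0:c1]

rng = np.random.default_rng(0)
image = rng.random((64, 64))
pyramid = build_pyramid(image, scales=(0.5, 1.0, 2.0))

box = (16, 8, 48, 40)  # a made-up proposal
crop_full = crop_proposal(pyramid, box, 1.0)
crop_half = crop_proposal(pyramid, box, 0.5)
print(crop_full.shape, crop_half.shape)
```

Each proposal costs only an array slice; the expensive feature computation is shared by all proposals at the same scale.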

~~~
@article{Iandola:2014tj,
author = {Iandola, F and Moskewicz, M and Karayev, S and Girshick, Ross B and Darrell, Trevor and Keutzer, K},
title = {{DenseNet: Implementing Efficient ConvNet Descriptor Pyramids}},
journal = {ArXiv e-prints},
year = {2014},
volume = {cs.CV},
month = apr,
annote = {essentially, a cheaper way to compute features for region proposals: compute feature maps of the whole image at multiple scales, then crop them to the region proposals.


Actually, this method dates back to older models such as DPM. Maybe it was not implemented for CNNs before because CNN features are global rather than local (like HOG), so cropping from a shared map is less accurate (CNNs have pooling, strides, etc.) than sliding-window HOG features.

It's not elegant (handling of aspect ratios, etc.). Later on, much better methods (Faster R-CNN, etc.) came into use, and people found that a single scale might be sufficient.},
keywords = {deep learning},
read = {Yes},
rating = {2},
date-added = {2017-04-24T15:21:02GMT},
date-modified = {2017-04-24T15:32:26GMT},
url = {http://arxiv.org/abs/1404.1869},
local-url = {file://localhost/Users/yimengzh/Documents/Papers3_revised/Library.papers3/Articles/2014/Iandola/arXiv%202014%20Iandola.pdf},
file = {{arXiv 2014 Iandola.pdf:/Users/yimengzh/Documents/Papers3_revised/Library.papers3/Articles/2014/Iandola/arXiv 2014 Iandola.pdf:application/pdf}},
uri = {\url{papers3://publication/uuid/744F7A4F-AB1A-43F1-8F08-25A025987E62}}
}
~~~

## other detection works

### top down modulation in facial keypoint localization

S. Honari, J. Yosinski, P. Vincent, and C. Pal, “Recombinator Networks: Learning Coarse-to-Fine Feature Aggregation,” presented at the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 5743–5752.

Notes: My notes are based on the arXiv version <https://arxiv.org/abs/1511.07356> (v2).

I think in general, the idea is similar to that in the top-down modulation paper (A. Shrivastava, R. Sukthankar, J. Malik, and A. Gupta, “Beyond Skip Connections: Top-Down Modulation for Object Detection,” arXiv, vol. cs.CV, Dec. 2016.). Instead of fusing multiple feature maps at the end, they fuse them incrementally through layers.
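To make the contrast concrete, here is a toy sketch of the two fusion orders as I understand them. It is not the paper's architecture: `mix` is a stand-in for a learned conv layer (a random 1x1 conv plus ReLU), upsampling is nearest-neighbour, and all names and shapes are made up for illustration.

```python
import numpy as np

def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of a (C, H, W) feature map."""
    return fmap.repeat(2, axis=1).repeat(2, axis=2)

def mix(fmap, out_channels, rng):
    """Stand-in for a learned conv layer: random 1x1 conv followed by ReLU."""
    weights = rng.standard_normal((out_channels, fmap.shape[0]))
    return np.maximum(np.einsum("oc,chw->ohw", weights, fmap), 0.0)

def fuse_at_end(branches, out_channels, rng):
    """Upsample every branch to the finest resolution, then fuse once."""
    target = branches[0].shape[1]
    upsampled = []
    for fmap in branches:
        while fmap.shape[1] < target:
            fmap = upsample2x(fmap)
        upsampled.append(fmap)
    return mix(np.concatenate(upsampled, axis=0), out_channels, rng)

def fuse_incrementally(branches, out_channels, rng):
    """Start at the coarsest branch; upsample, concatenate with the next finer branch, mix, repeat."""
    fused = branches[-1]
    for finer in reversed(branches[:-1]):
        stacked = np.concatenate([upsample2x(fused), finer], axis=0)
        fused = mix(stacked, out_channels, rng)
    return fused

rng = np.random.default_rng(0)
# Branches ordered fine to coarse: 16x16, 8x8, 4x4, four channels each.
branches = [rng.random((4, size, size)) for size in (16, 8, 4)]

late = fuse_at_end(branches, out_channels=3, rng=rng)
incremental = fuse_incrementally(branches, out_channels=3, rng=rng)
print(late.shape, incremental.shape)
```

Both variants end at the finest resolution. The difference is that in the incremental version the coarse information passes through a mixing layer at every scale on its way down, instead of being merged in a single step.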

~~~
@inproceedings{Honari:2016eb,
author = {Honari, Sina and Yosinski, Jason and Vincent, Pascal and Pal, Christopher},
title = {{Recombinator Networks: Learning Coarse-to-Fine Feature Aggregation}},
booktitle = {2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2016},
pages = {5743--5752},
publisher = {IEEE},
annote = {My notes are based on the arXiv version <https://arxiv.org/abs/1511.07356> (v2).

I think in general, the idea is similar to that in the top-down modulation paper (A. Shrivastava, R. Sukthankar, J. Malik, and A. Gupta, {\textquotedblleft}Beyond Skip Connections: Top-Down Modulation for Object Detection,{\textquotedblright} arXiv, vol. cs.CV, Dec. 2016.). Instead of fusing multiple feature maps at the end, they fuse them incrementally through layers.

},
keywords = {context, deep learning, hierarchical},
doi = {10.1109/CVPR.2016.619},
isbn = {978-1-4673-8851-1},
read = {Yes},
rating = {4},
date-added = {2017-04-25T18:59:48GMT},
date-modified = {2017-04-25T19:07:22GMT},
url = {http://ieeexplore.ieee.org/document/7780988/},
uri = {\url{papers3://publication/doi/10.1109/CVPR.2016.619}}
}
~~~