# Classical features before deep learning

## GIST

### Modeling the Shape of the Scene: A Holistic Representation of the Spatial Envelope

A. Oliva and A. Torralba, “Modeling the Shape of the Scene: A Holistic Representation of the Spatial Envelope,” IJCV, vol. 42, no. 3, pp. 145–175, 2001.

Classical paper about GIST.

However, the paper does not describe how the GIST descriptor itself is computed. Instead, it learns directions in the DFT (or windowed DFT) of an image that correspond to "spatial envelope properties" humans can understand, such as roughness and openness. This is why the authors prefer spatial envelope properties to PCA components, as they say on p. 154:

> However, the contribution of each feature cannot be understood as they stand, and more importantly, they are not directly meaningful to human observers.

To find a spatial envelope direction (say, the direction discriminating openness), they need openness labels, say, ranging from -1 to 1, which I think is pretty subjective. They then solve the regression problem in Eq. (12).

I feel that what people call the GIST descriptor (on their website <http://people.csail.mit.edu/torralba/code/spatialenvelope/>) might be an approximation of PCA on the spectrogram (the WFT in the paper). Since here we only need features, whether they are interpretable to humans matters less.

For a description of the GIST code, see <https://www.quora.com/Computer-Vision-What-is-a-GIST-descriptor>.

> Given an input image, a GIST descriptor is computed by
>
> 1. Convolve the image with 32 Gabor filters at 4 scales, 8 orientations, producing 32 feature maps of the same size of the input image.
> 2. Divide each feature map into 16 regions (by a 4x4 grid), and then average the feature values within each region.
> 3. Concatenate the 16 averaged values of all 32 feature maps, resulting in a 16x32=512 GIST descriptor.
>
> Intuitively, GIST summarizes the gradient information (scales and orientations) for different parts of an image, which provides a rough description (the gist) of the scene.

or, to be more succinct, see Section 2.1 of "Evaluation of GIST descriptors for web-scale image search" (DOI 10.1145/1646396.1646421)

> To compute the color GIST description the image is segmented by a 4 by 4 grid for which orientation histograms are extracted.

or there's a picture from <http://graphics.cs.cmu.edu/courses/15-463/2012_fall/Lectures/InternetData1.ppt>

![](./_gist/gist_cmu.png)
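
The three steps in the Quora description can be sketched in NumPy. This is a rough illustration, not the released code: the filter bank uses Gaussian bumps in polar frequency coordinates as a stand-in for Gabor filters, and the function names (`gabor_bank`, `gist_descriptor`), center frequencies, and bandwidths are all assumptions made for the example.

```python
import numpy as np

def gabor_bank(size, n_scales=4, n_orients=8):
    """Frequency-domain, Gabor-like transfer functions (illustrative parameters)."""
    fy = np.fft.fftfreq(size)[:, None]
    fx = np.fft.fftfreq(size)[None, :]
    radius = np.hypot(fx, fy)
    angle = np.arctan2(fy, fx)
    bank = []
    for s in range(n_scales):
        f0 = 0.25 / (2 ** s)  # assumed: center frequency halves at each scale
        for o in range(n_orients):
            theta = np.pi * o / n_orients
            # wrapped angular distance to the filter orientation
            d_theta = np.angle(np.exp(1j * (angle - theta)))
            radial = np.exp(-((radius - f0) ** 2) / (2 * (0.4 * f0) ** 2))
            angular = np.exp(-(d_theta ** 2) / (2 * (np.pi / n_orients) ** 2))
            bank.append(radial * angular)
    return np.stack(bank)  # (n_scales * n_orients, size, size)

def gist_descriptor(image, grid=4, n_scales=4, n_orients=8):
    size = image.shape[0]
    assert image.shape == (size, size) and size % grid == 0
    bank = gabor_bank(size, n_scales, n_orients)
    spectrum = np.fft.fft2(image)
    # Step 1: filter in the frequency domain, keep the response magnitude
    responses = np.abs(np.fft.ifft2(spectrum[None] * bank))
    # Step 2: average each feature map over a grid x grid layout
    cell = size // grid
    pooled = responses.reshape(len(bank), grid, cell, grid, cell).mean(axis=(2, 4))
    # Step 3: concatenate everything into one vector
    return pooled.reshape(-1)

rng = np.random.default_rng(0)
desc = gist_descriptor(rng.random((64, 64)))
print(desc.shape)  # (512,)
```

With 4 scales, 8 orientations, and a 4x4 grid this gives the 16x32=512 dimensions mentioned in the quote.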

Below Eq. (12), the paper mentions the case where the spatial envelope attribute is binary. PRML also mentions this when discussing LDA.
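
The link between regression on a binary attribute and LDA can be checked numerically. The sketch below assumes ordinary least squares with an intercept (which may differ in detail from the paper's Eq. (12)) and uses synthetic two-class data; it only shows that the regression weights point in the same direction as the Fisher discriminant.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two synthetic classes with a shared covariance (hypothetical data)
cov = np.array([[2.0, 0.6], [0.6, 1.0]])
x_neg = rng.multivariate_normal([0.0, 0.0], cov, size=300)
x_pos = rng.multivariate_normal([2.0, 1.0], cov, size=200)
X = np.vstack([x_neg, x_pos])
y = np.concatenate([-np.ones(300), np.ones(200)])

# Least-squares regression on the binary labels, with an intercept column
A = np.hstack([X, np.ones((len(X), 1))])
w_reg = np.linalg.lstsq(A, y, rcond=None)[0][:2]

# Fisher LDA direction: inverse within-class scatter times the mean difference
m_neg, m_pos = x_neg.mean(axis=0), x_pos.mean(axis=0)
S_w = (x_neg - m_neg).T @ (x_neg - m_neg) + (x_pos - m_pos).T @ (x_pos - m_pos)
w_lda = np.linalg.solve(S_w, m_pos - m_neg)

# Cosine similarity between the two directions (1 up to rounding error)
cosine = w_reg @ w_lda / (np.linalg.norm(w_reg) * np.linalg.norm(w_lda))
print(float(cosine))
```

The two vectors differ only in scale, so for classification purposes the regression solution and the LDA solution are interchangeable here.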

On p. 155, I don't understand why they need positive and negative parts. Maybe this is the only way to get filters in the spatial domain. Anyway, it's just a visualization technique.

End of p. 151: unlocalized (energy spectrum) vs. localized (spectrogram). The energy spectrum only has frequency, while the spectrogram has position/time x frequency.
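
The difference in what the two representations index can be seen from their array shapes. A minimal sketch, assuming non-overlapping square tiles and a Hamming taper (both simplifications of my own, not the paper's exact windowing):

```python
import numpy as np

def energy_spectrum(image):
    """Squared magnitude of the global DFT: frequency only, no position."""
    return np.abs(np.fft.fft2(image)) ** 2

def spectrogram(image, window=16):
    """Squared magnitude of a windowed DFT on non-overlapping tiles.

    Output axes: (tile row, tile col, freq y, freq x), i.e. position x frequency.
    """
    h, w = image.shape
    taper = np.outer(np.hamming(window), np.hamming(window))
    # Split the image into (h/window) x (w/window) tiles of window x window pixels
    tiles = image.reshape(h // window, window, w // window, window).swapaxes(1, 2)
    return np.abs(np.fft.fft2(tiles * taper)) ** 2

image = np.random.default_rng(0).random((64, 64))
print(energy_spectrum(image).shape)  # (64, 64)
print(spectrogram(image).shape)      # (4, 4, 16, 16)
```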

Other notes

* p.154: However, the contribution of each feature cannot be understood as they stand, and more importantly, they are not directly meaningful to human observers. -- Highlighted Feb 15, 2017
* p.155: The functions h+ and h− are not uniquely constrained by the DST as the phase function can have any value. We fix the phase function at zero in order to have localized spatial functions. -- Highlighted Feb 15, 2017
* p.155: In such a case, the regression parameters (Eq. (12)) are equivalent to the parameters obtained by applying a linear discriminant analysis (see Ripley, 1996; Swets and Weng, 1996). -- Highlighted Feb 15, 2017

~~~
@article{Oliva:2001ck,
author = {Oliva, Aude and Torralba, Antonio},
title = {{Modeling the Shape of the Scene: A Holistic Representation of the Spatial Envelope}},
journal = {International Journal of Computer Vision},
year = {2001},
volume = {42},
number = {3},
pages = {145--175},
publisher = {Kluwer Academic Publishers},
keywords = {classics},
doi = {10.1023/A:1011139631724},
language = {English},
read = {Yes},
rating = {5},
date-added = {2017-02-15T18:42:13GMT},
date-modified = {2017-02-17T16:15:52GMT},
abstract = {In this paper, we propose a computational model of the recognition of real world scenes that bypasses the segmentation and the processing of individual objects or regions. The procedure is based on a},
url = {http://link.springer.com/article/10.1023/A:1011139631724},
local-url = {file://localhost/Users/yimengzh/Documents/Papers3_revised/Library.papers3/Articles/2001/Oliva/IJCV%202001%20Oliva.pdf},
file = {{IJCV 2001 Oliva.pdf:/Users/yimengzh/Documents/Papers3_revised/Library.papers3/Articles/2001/Oliva/IJCV 2001 Oliva.pdf:application/pdf;IJCV 2001 Oliva.pdf:/Users/yimengzh/Documents/Papers3_revised/Library.papers3/Articles/2001/Oliva/IJCV 2001 Oliva.pdf:application/pdf}},
uri = {\url{papers3://publication/doi/10.1023/A:1011139631724}}
}
~~~

# Feature benchmarks, comparisons

## good old features

### Benchmark of feature encoding methods for image classification

K. Chatfield, V. Lempitsky, A. Vedaldi, and A. Zisserman, “The devil is in the details: an evaluation of recent feature encoding methods,” presented at the British Machine Vision Conference 2011, 2011, pp. 76.1–76.12.

Their toolkit <http://www.robots.ox.ac.uk/~vgg/research/encoding_eval/> might be useful for evaluating unsupervised models in general.

* p.4: where s is a constant chosen to balance s_k with u_k numerically -- Highlighted Feb 17, 2017
* p.5: In the experiments the spatial regions are obtained by dividing the image in 1 × 1, 3 × 1 (three horizontal stripes), and 2 × 2 (four quadrants) grids, for a total of 8 regions -- Highlighted Feb 17, 2017
* p.7: However, as we learned from personal communication with the authors, [27] used nontrivial modifications not discussed in the paper to achieve those results (these include using LDA to compute the SVM kernel and second order information as in the Fisher encoding). According to the authors, the performance achieved with our implementation is representative of their method, given that we did not apply these additional modifications. -- Highlighted Feb 16, 2017
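
The region count in the p.5 highlight (1 + 3 + 4 = 8) can be reproduced with a small helper. `pyramid_regions` is a hypothetical name, and grids are written here as (columns, rows), so the paper's "3 × 1 (three horizontal stripes)" becomes `(1, 3)`.

```python
def pyramid_regions(width, height, grids=((1, 1), (1, 3), (2, 2))):
    """Return (x0, y0, x1, y1) boxes for each grid, given as (columns, rows)."""
    boxes = []
    for cols, rows in grids:
        for r in range(rows):
            for c in range(cols):
                boxes.append((
                    c * width // cols, r * height // rows,
                    (c + 1) * width // cols, (r + 1) * height // rows,
                ))
    return boxes

regions = pyramid_regions(300, 300)
print(len(regions))  # 8
```

Each region gets its own encoding, and the per-region encodings are stacked into the final image representation.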

~~~
@inproceedings{Chatfield:2011ks,
author = {Chatfield, Ken and Lempitsky, Victor and Vedaldi, Andrea and Zisserman, Andrew},
title = {{The devil is in the details: an evaluation of recent feature encoding methods}},
booktitle = {British Machine Vision Conference 2011},
year = {2011},
pages = {76.1--76.12},
publisher = {British Machine Vision Association},
keywords = {benchmark},
doi = {10.5244/C.25.76},
isbn = {1-901725-43-X},
read = {Yes},
rating = {5},
date-added = {2017-02-15T22:30:15GMT},
date-modified = {2017-02-17T19:45:18GMT},
url = {http://www.bmva.org/bmvc/2011/proceedings/paper76/index.html},
local-url = {file://localhost/Users/yimengzh/Documents/Papers3_revised/Library.papers3/Articles/2011/Chatfield/BMVC%202011%202011%20Chatfield.pdf},
file = {{BMVC 2011 2011 Chatfield.pdf:/Users/yimengzh/Documents/Papers3_revised/Library.papers3/Articles/2011/Chatfield/BMVC 2011 2011 Chatfield.pdf:application/pdf;BMVC 2011 2011 Chatfield.pdf:/Users/yimengzh/Documents/Papers3_revised/Library.papers3/Articles/2011/Chatfield/BMVC 2011 2011 Chatfield.pdf:application/pdf}},
uri = {\url{papers3://publication/doi/10.5244/C.25.76}}
}
~~~

## CNN features rule them all.

K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman, “Return of the Devil in the Details: Delving Deep into Convolutional Nets,” arXiv, vol. cs.CV, May 2014.

Notes: Essentially, CNN features are the best. CNN training tricks such as data augmentation may also help shallow features, but the gap between shallow and CNN features remains big.

~~~
@article{Chatfield:2014ww,
author = {Chatfield, Ken and Simonyan, K and Vedaldi, A and Zisserman, Andrew},
title = {{Return of the Devil in the Details: Delving Deep into Convolutional Nets}},
journal = {ArXiv e-prints},
year = {2014},
volume = {cs.CV},
month = may,
keywords = {benchmark, deep learning},
read = {Yes},
rating = {4},
date-added = {2017-02-28T18:41:04GMT},
date-modified = {2017-03-27T01:00:00GMT},
url = {http://arxiv.org/abs/1405.3531},
local-url = {file://localhost/Users/yimengzh/Documents/Papers3_revised/Library.papers3/Articles/2014/Chatfield/arXiv%202014%20Chatfield.pdf},
file = {{arXiv 2014 Chatfield.pdf:/Users/yimengzh/Documents/Papers3_revised/Library.papers3/Articles/2014/Chatfield/arXiv 2014 Chatfield.pdf:application/pdf}},
uri = {\url{papers3://publication/uuid/59D43402-21F4-4016-8F5B-E5D4C6913567}}
}
~~~

# Deep Learning features

## DeCAF (I guess it's just AlexNet)

J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell, “DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition,” presented at the 31st International Conference on Machine Learning (ICML 2014), Beijing, China, June 2014, vol. 32, pp. 647–655.

Notes: perhaps the most influential paper showing the usefulness of CNN features across many tasks. It's also famous as the predecessor of Caffe.

DeCAF just means a feature vector taken from a CNN layer, say fc7.

p. 2 has the main result:

> Our main result is the empirical validation that a generic visual feature based on a convolutional network weights trained on ImageNet outperforms a host of conventional visual representations on standard benchmark object recognition tasks


Section 3.1

They use an AlexNet model they trained themselves, which gets a 42.9% error rate. It's the same as bvlc_alexnet in Caffe. See <https://github.com/BVLC/caffe/tree/rc4/models/bvlc_alexnet>

I think in practice they use 227x227 input, as indicated by the decaf github project's wiki <https://github.com/UCBAIR/decaf-release/wiki/imagenet>

as well as code in <https://github.com/UCBAIR/decaf-release/blob/6fa4cdfbd0d0b8d486d7146bf1e32edd3662fec4/decaf/scripts/imagenet.py>

For 224 vs 227 they think it's just some trick to speed up GPU computation. See <https://github.com/UCBAIR/decaf-release/wiki/imagenet>
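
The wiki's remark that a 224x224 input leaves the last row/column with only 8 pixels instead of 11 is plain sliding-window arithmetic. A minimal sketch, assuming the first layer uses an 11x11 kernel with stride 4 (the stride comes from the AlexNet paper, not from these notes) and that a final clipped window is still counted; `conv_windows` is a hypothetical helper.

```python
import math

def conv_windows(input_size, kernel=11, stride=4):
    """Return (number of windows, pixels covered by the last window) along one axis.

    Windows are counted with a ceiling so that a final, clipped window is kept.
    """
    n = math.ceil((input_size - kernel) / stride) + 1
    last_start = (n - 1) * stride
    last_cover = min(kernel, input_size - last_start)
    return n, last_cover

for size in (224, 227):
    print(size, conv_windows(size))
# 224 (55, 8)
# 227 (55, 11)
```

Both sizes give 55 output positions per axis; only 227 lets the last window see a full 11 pixels.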

"The Decaf implementation uses input images of size 227x227, while the cuda-convnet code uses images of size 224x224. We did 227x227 simply to have a full convolution (if the size is 224x224, the last row/column will only have height/width 8 instead of 11). We believe that cuda-convnet chose 224 for speed consideration as that creates good performance for GPUs. The performance difference should not be big.

Since we trained our network using GPU and are running on CPU, we actually observed some performance differences between them. We are not clear yet what caused it (it might be a bug in our code, admittedly).
"


p. 4, footnote 5: for t-SNE to work on very high-dimensional features, a random projection might be needed first.
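
The claim that random projections roughly preserve distances is easy to check. A minimal sketch, assuming a Gaussian projection matrix and synthetic 4000-dimensional data (smaller than the 16K-dimensional LLC features, to keep the example quick); the target of 512 dimensions follows the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_points, d_in, d_out = 20, 4000, 512
X = rng.standard_normal((n_points, d_in))

# Gaussian random projection, scaled so squared norms are preserved in expectation
R = rng.standard_normal((d_in, d_out)) / np.sqrt(d_out)
Z = X @ R

def pairwise_distances(A):
    """Euclidean distances between all rows of A."""
    sq = (A ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * A @ A.T
    return np.sqrt(np.maximum(d2, 0))

# Ratio of projected to original distance for every distinct pair
iu = np.triu_indices(n_points, k=1)
ratio = pairwise_distances(Z)[iu] / pairwise_distances(X)[iu]
print(ratio.min(), ratio.max())
```

The ratios stay close to 1, which is all t-SNE needs since it only looks at pairwise distances.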

p. 4, Figure 3 shows that the fully connected layers take most of the time. I wonder if that still applies.

p. 6, Section 4.3: the Deformable Part Descriptors sound like averaging features from different parts with some learned weights. Anyway, this is not important.


other notes

* p.648: A key question for such learning problems is to find a feature representation that captures the object category related information while discarding noise irrelevant to object category information such as illumination. -- Highlighted Feb 14, 2017
* p.650: in large networks such as the current ImageNet CNN model, the last few fully-connected layers require the most computation time as they involve large transform matrices. This is particularly important when one considers classification into a larger number of categories or with larger hidden-layer sizes, suggesting that certain sparse approaches such as Bayesian output coding (Hsu et al., 2009) may be necessary to carry out classification into even larger number of object categories. -- Highlighted Feb 17, 2017
* p.650: Some of the features were very high dimensional (e.g. LLC had 16K dimension), in which case we preprocess them by randomly projecting them down to 512 dimensions – random projections are cheap to apply and tend to preserve distances well, which is all the t-SNE algorithm cares about. -- Highlighted Feb 17, 2017

~~~
@inproceedings{Donahue:2014ta,
author = {Donahue, Jeff and Jia, Yangqing and Vinyals, Oriol and Hoffman, Judy and Zhang, Ning and Tzeng, Eric and Darrell, Trevor},
title = {{DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition}},
booktitle = {Proceedings of the 31st International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014},
year = {2014},
pages = {647--655},
publisher = {JMLR.org},
keywords = {deep learning},
read = {Yes},
rating = {3},
date-added = {2017-02-14T20:38:53GMT},
date-modified = {2017-02-17T20:55:28GMT},
url = {http://jmlr.org/proceedings/papers/v32/donahue14.html},
local-url = {file://localhost/Users/yimengzh/Documents/Papers3_revised/Library.papers3/Articles/2014/Donahue/ICML%202014%202014%20Donahue.pdf},
file = {{ICML 2014 2014 Donahue.pdf:/Users/yimengzh/Documents/Papers3_revised/Library.papers3/Articles/2014/Donahue/ICML 2014 2014 Donahue.pdf:application/pdf;ICML 2014 2014 Donahue.pdf:/Users/yimengzh/Documents/Papers3_revised/Library.papers3/Articles/2014/Donahue/ICML 2014 2014 Donahue.pdf:application/pdf}},
uri = {\url{papers3://publication/uuid/F1F8335F-D616-413E-B778-A41B7EB95AB5}}
}
~~~

## Hypercolumn for fine-grained tasks

B. Hariharan, P. Arbelaez, R. B. Girshick, and J. Malik, “Hypercolumns for object segmentation and fine-grained localization,” presented at the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 447–456.


Hypercolumns, a multilayer representation, help fine-grained tasks.

p. 449, "Interpolating into a grid of classifiers"

Based on the description here, the SDS paper always uses a separate classifier at each location of a 10x10 grid.

Notice that here they seem to share classifiers across object categories, probably because the hypercolumn vector already provides some class information, which could be used to adapt the classifier at one location to different classes.

p. 450, "Efficient classification using convolutions and upsampling"

Essentially, they say they can swap the order of upsampling (for higher layers) and the linear classifier. I think they only ever need upsampling here, since they first resize every input region to 50x50, and I think the number of spatial locations (columns) in each feature map can only be smaller than 50x50.
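
The swap works because both bilinear upsampling and a linear classifier are linear maps, so they commute. A minimal NumPy sketch for one layer, with a hand-written `upsample_bilinear` and made-up sizes (256 channels on a 6x6 grid); a full hypercolumn classifier would sum such per-layer scores over several layers.

```python
import numpy as np

def upsample_bilinear(fmap, out_size):
    """Bilinearly resize a (C, h, w) feature map to (C, out_size, out_size)."""
    c, h, w = fmap.shape
    ys = np.linspace(0, h - 1, out_size)
    xs = np.linspace(0, w - 1, out_size)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[None, :, None]  # vertical interpolation weights
    wx = (xs - x0)[None, None, :]  # horizontal interpolation weights
    top = fmap[:, y0][:, :, x0] * (1 - wx) + fmap[:, y0][:, :, x1] * wx
    bottom = fmap[:, y1][:, :, x0] * (1 - wx) + fmap[:, y1][:, :, x1] * wx
    return top * (1 - wy) + bottom * wy

rng = np.random.default_rng(0)
fmap = rng.standard_normal((256, 6, 6))  # a coarse, high-layer feature map
weights = rng.standard_normal(256)       # linear classifier over channels

# Option A: upsample all 256 channels, then classify every pixel
score_a = np.tensordot(weights, upsample_bilinear(fmap, 50), axes=1)
# Option B: classify on the coarse grid (a 1x1 convolution), then upsample one map
coarse = np.tensordot(weights, fmap, axes=1)
score_b = upsample_bilinear(coarse[None], 50)[0]

print(np.allclose(score_a, score_b))  # True
```

Option B upsamples a single score map instead of 256 channels, which is where the savings come from.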

other notes

* p.447: We borrow the term “hypercolumn” from neuroscience, where it is used to describe a set of V1 neurons sensitive to edges at multiple orientations and multiple frequencies arranged in a columnar structure [24]. However, our hypercolumn includes not just edge detectors but also more semantic units and is thus a more general notion. -- Highlighted Feb 22, 2017

~~~
@inproceedings{Hariharan:2015ig,
author = {Hariharan, Bharath and Arbelaez, Pablo and Girshick, Ross B and Malik, Jitendra},
title = {{Hypercolumns for object segmentation and fine-grained localization}},
booktitle = {2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2015},
pages = {447--456},
publisher = {IEEE},
keywords = {deep learning, hierarchical},
doi = {10.1109/CVPR.2015.7298642},
isbn = {978-1-4673-6964-0},
read = {Yes},
rating = {4},
date-added = {2017-02-20T19:09:40GMT},
date-modified = {2017-02-22T14:37:14GMT},
url = {http://ieeexplore.ieee.org/document/7298642/},
local-url = {file://localhost/Users/yimengzh/Documents/Papers3_revised/Library.papers3/Articles/2015/Hariharan/CVPR%202015%202015%20Hariharan.pdf},
file = {{CVPR 2015 2015 Hariharan.pdf:/Users/yimengzh/Documents/Papers3_revised/Library.papers3/Articles/2015/Hariharan/CVPR 2015 2015 Hariharan.pdf:application/pdf}},
uri = {\url{papers3://publication/doi/10.1109/CVPR.2015.7298642}}
}
~~~

## population codes at higher layers can act as part detectors

J. Wang, Z. Zhang, C. Xie, V. Premachandran, and A. L. Yuille, “Unsupervised learning of object semantic parts from internal states of CNNs by population encoding,” arXiv, vol. cs.LG, Nov. 2015.

Notes: Essentially, they found part detectors in the population codes (visual concepts, VCs) of higher CNN layers.

The main problem is that they don't propose a way to train the network so that more VCs are learned. This is essentially addressed in the paper "Learning Deep Parsimonious Representations", co-authored by Raquel Urtasun.


In Section 6.4, they show that each VC may correspond to multiple semantic parts as defined by humans, and vice versa. This is understandable.

This is supported by Figure 7, where they treat each VC as a multi-part detector. They also found that combining VCs for one part increases performance (I'm not sure how it's implemented: maybe a mean over scores, a max over scores, an average of different VCs, etc.). Anyway, the key point is that VCs and parts are not one-to-one.

In the end, they say future work is on compositional models (citation [24]). But aren't fc layers already doing that?

other notes
* p.1: Note, we use the term “semantic parts” to mean object parts, like wheels and windows of cars, which are defined in terms of the three-dimensional object. -- Highlighted Mar 28, 2017

~~~
@article{Wang:2015wd,
author = {Wang, J and Zhang, Z and Xie, C and Premachandran, V and Yuille, Alan L},
title = {{Unsupervised learning of object semantic parts from internal states of CNNs by population encoding}},
journal = {ArXiv e-prints},
year = {2015},
volume = {cs.LG},
month = nov,
keywords = {deep learning},
read = {Yes},
rating = {3},
date-added = {2017-03-27T21:52:29GMT},
date-modified = {2017-03-28T19:44:27GMT},
url = {http://arxiv.org/abs/1511.06855},
local-url = {file://localhost/Users/yimengzh/Documents/Papers3_revised/Library.papers3/Articles/2015/Wang/arXiv%202015%20Wang.pdf},
file = {{arXiv 2015 Wang.pdf:/Users/yimengzh/Documents/Papers3_revised/Library.papers3/Articles/2015/Wang/arXiv 2015 Wang.pdf:application/pdf}},
uri = {\url{papers3://publication/uuid/57BC5B33-8860-4F56-BB50-7296A5877A19}}
}
~~~

## CNN + Spatial Pyramid

Y. Gong, L. Wang, R. Guo, and S. Lazebnik, “Multi-scale Orderless Pooling of Deep Convolutional Activation Features,” presented at the Computer Vision - ECCV 2014 - 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part VII, 2014, vol. 8695, pp. 392–407.

Notes: Essentially, it's a Spatial Pyramid Matching method using FC7 features.

In Section 3, they argue why FC7 has some spatial specificity. I think this is consistent with later works, such as "Understanding Deep Image Representations by Inverting Them".

~~~
@inproceedings{Gong:2014jk,
author = {Gong, Yunchao and Wang, Liwei and Guo, Ruiqi and Lazebnik, Svetlana},
title = {{Multi-scale Orderless Pooling of Deep Convolutional Activation Features}},
booktitle = {Computer Vision - ECCV 2014 - 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part VII},
year = {2014},
editor = {Fleet, David J and Pajdla, Tom{\'a}s and Schiele, Bernt and Tuytelaars, Tinne},
pages = {392--407},
publisher = {Springer},
keywords = {deep learning},
doi = {10.1007/978-3-319-10584-0_26},
language = {English},
read = {Yes},
rating = {3},
date-added = {2017-05-05T20:06:51GMT},
date-modified = {2017-05-05T20:42:11GMT},
abstract = {Deep convolutional neural networks (CNN) have shown their promise as a universal representation for recognition. However, global CNN activations lack geometric invariance, which limits their robustnes},
url = {http://dx.doi.org/10.1007/978-3-319-10584-0_26},
local-url = {file://localhost/Users/yimengzh/Documents/Papers3_revised/Library.papers3/Articles/2014/Gong/ECCV%202014%20Part%20VII%202014%20Gong.pdf},
file = {{ECCV 2014 Part VII 2014 Gong.pdf:/Users/yimengzh/Documents/Papers3_revised/Library.papers3/Articles/2014/Gong/ECCV 2014 Part VII 2014 Gong.pdf:application/pdf;ECCV 2014 Part VII 2014 Gong.pdf:/Users/yimengzh/Documents/Papers3_revised/Library.papers3/Articles/2014/Gong/ECCV 2014 Part VII 2014 Gong.pdf:application/pdf}},
uri = {\url{papers3://publication/doi/10.1007/978-3-319-10584-0_26}}
}
~~~