# Population Coding

## Axis based code for faces in posterior IT

Chang, L. & Tsao, D. Y. The Code for Facial Identity in the Primate Brain. Cell 169, 1013–1028.e14 (2017).

### summary

The "highlights" on the first page give a good summary of this paper.


* Facial images can be linearly reconstructed using responses of ∼200 face cells
* Face cells display flat tuning along dimensions orthogonal to the axis being coded
* The axis model is more efficient, robust, and flexible than the exemplar model
* Face patches ML/MF and AM carry complementary information about faces

### notes

Notes: Very clear exposition, but Doug thinks the exemplar code model (grandmother cells) is a strawman. 

* Proof of success: Decoding faces, and predicting neural response when stimulated by a novel face. 
* Apparently faces, as a stimulus class, are relatively easy to parametrize? (end of paragraph 2 of intro)
> *we recorded responses of cells in face patches middle lateral (ML)/middle fundus (MF) and anterior medial (AM) to a large set of realistic faces parameterized by 50 dimensions.*
>*a hierarchical relationship between ML/MF and AM and suggest that AM is the final output stage of IT face processing. In particular, a population of sparse cells has been found in AM, which appear to encode exemplars for specific individuals, as they respond to faces of only a few specific individuals*
* So their past work yielded a local population of cells that looked promising. Doug thinks this fully captures and predicts the current result, though.
> *A prediction of this model is that each cell should have a linear null space, orthogonal to the preferred axis, in which all faces elicit the same response. We confirm this prediction, even for sparse AM cells that had previously been assumed to explicitly code exemplars of specific identities.*
    
    The quote above states the main impact of the paper.
* Face Generation: First, each of 16 faces was labelled with 'landmarks', which became 200 shape and 200 appearance dimensions. PCA then reduced these to 25 shape and 25 appearance dimensions. Sampling from this metric space produces artificial but realistic faces (there's a supplemental movie). Further, the dimensions were not spatially localized, so a change along one axis could affect both hairline and chin, for example. 2000 faces were sampled from this space. This landmark method seems like a good general way to parametrize and sample other stimulus spaces?
* This seems like a place where Ben's DCA, or something like it, could have been used to find the transformation from stimulus space to neural representational space.
* Other important info:
>*same set of 2,000 stimuli were presented
to each cell from three to five times each. We recorded 205 cells
in total from two monkeys: 51 cells from ML/MF and 64 cells from
AM for monkey 1; 55 cells from ML/MF, and 35 cells from AM for
monkey 2.*

### Results

> *On average, each cell was significantly tuned
along 6.1 feature dimensions (covering the range [0 17] with
SD = 3.8). We next compared the relative sensitivity to shape
or appearance for each neuron: a ‘‘shape preference index’’
was computed based on the vector length of the STA for shape
versus appearance dimensions.*
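
To make the quoted analysis concrete, here is a minimal sketch of a spike-triggered average (STA) and a shape preference index on synthetic data. The index form `(S - A) / (S + A)` is an assumption about how "vector length of the STA for shape versus appearance" is turned into a single number; the simulated cell, its axis, and the rectified-linear response are all hypothetical.

```python
import numpy as np

def spike_triggered_average(features, rates):
    """Response-weighted average of the stimulus features.

    features: (n_faces, n_dims) face parameters, assumed zero-mean
    rates:    (n_faces,) mean firing rate of one cell to each face
    """
    features = np.asarray(features, dtype=float)
    rates = np.asarray(rates, dtype=float)
    return rates @ features / rates.sum()

def shape_preference_index(sta, n_shape=25):
    """(S - A) / (S + A), with S and A the STA vector lengths (assumed form)."""
    s = np.linalg.norm(sta[:n_shape])
    a = np.linalg.norm(sta[n_shape:])
    return (s - a) / (s + a)

rng = np.random.default_rng(0)
faces = rng.standard_normal((2000, 50))            # synthetic stand-in for the 2,000 faces
true_axis = np.zeros(50)
true_axis[25:] = rng.standard_normal(25)           # hypothetical appearance-only cell
rates = np.maximum(faces @ true_axis + 5.0, 0.0)   # rectified (ramp-like) response

sta = spike_triggered_average(faces, rates)
spi = shape_preference_index(sta)
# Cosine similarity between the recovered STA and the simulated preferred axis.
cos = sta @ true_axis / (np.linalg.norm(sta) * np.linalg.norm(true_axis))
print(f"shape preference index: {spi:.2f}, cosine(STA, true axis): {cos:.2f}")
```

For an appearance-tuned cell like this one, the index should come out negative, and the STA should point close to the simulated axis.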

* The dichotomy between identity and things like view (specialization) from their and other previous work on face patches is echoed again here. 
    * (p. 1015) They found that the tuning of AM cells to appearance holds regardless of view and identity. It's pretty robust against these factors, and this is consistent.
      > The shape preference indices computed with subsets of stimuli were highly correlated ... the tuning of AM cells to appearance dimensions indicates invariance to a much larger set of transformations in articulated shape than just view changes, consistent with ... 
    * (p. 1015) I don't think the following quote is valid. Just because shape parameters can encode more than view changes doesn't mean appearance neurons will be invariant to those additional changes. However, the split-set testing mentioned above does lend some support to the claim here (`the tuning of AM cells to appearance dimensions indicates invariance ...`).
      > Importantly, because shape dimensions encompass a much larger set of transformations than just view changes,
* They found 'ramp-like tuning' (Fig. 1J; sigmoid-like, with both upper and lower bounds), but what is the significance of this by itself? Is that expected?
    * Yimeng: I think this is expected behavior of real neurons?
* Then they used linear regression on these cells for 1999 faces to predict on the held out face. This is very different from normal population decoding though, because the cells were not simultaneously recorded, so we have no information about the relationship between cells. In other words, there might be an opportunity here.
    * They say a linear model might work well because of the ramp-shaped tuning. Well, it would be even better if the tuning had no ramp and were completely linear. A ramp is somewhat OK because it has a linear part, but the more linear, the better.
      > If a face cell has ramp-shaped tuning to different features, this means that its response can be roughly approximated by a linear combination of the facial features
    * Somehow, in Figure 2C, you will find that performance improves when using all cells. This might indicate that there is actually some information shared between AM and ML/MF cells (the `ML/MF` and `AM` lines are just `All` models with many coefficients set to zero).
    * They say they first estimate $S, C$ in $R=SF+C$, and then invert it. But I think it's probably easier to simply estimate the inverted model from the beginning. But in any case, it will be a linear model.
* Some human psychophysics to say that not only do we decode better than chance, but the faces we predict are also perceptually similar. Gimmicky, in my opinion.
* The null space argument here is interesting. On the one hand, it is trying to cash in on a currently fashionable trend (null spaces and orthogonal axes), but on the other, it seems to be the intuitive way to disprove an exemplar code. **For the tang data, it might be powerful to use the same test, and get an opposite result, in order to prove more of an exemplar code in V1**
    * Check the caption for Fig. 4A to see how these tuning figures are generated. Essentially, they randomly pick some axes orthogonal to the STA direction, and then compute the tuning of neurons along those directions (I think you need to perform some projection here; no stimuli will land exactly on the axes). These figures are averages over neurons / images.
    * They mainly study AM cells here. Since AM cells are known not to be tuned to shape, they just remove all shape parameters when doing this experiment.
    * Details are in the caption for Fig. 4A. However, some contradict what's in e4 of STAR Methods. For example, what's the number of dimensions (25 or 50) for the max pooling model? Since this model considers different positions (shape parameters), it must lie in 50-d space, but they say explicitly that AM cells are analyzed in 25-d space. So this might be unfair. But this is a detail.
    * In STAR Methods, there are two ways to compute orthogonal axes. I think the 300-out-of-2000 way is used for Fig. 4A-B, and the other way (compute axes among orthogonalized faces) is used for 4D-E.
* They try to rule out some rival explanations to the axes model with simulated units, but I don't know these alternatives well enough to check if they successfully defended their position exhaustively.
    * Yimeng: essentially I see those alternatives as variants on exemplar/distance based models. Check e4 in STAR methods on how these models are computed.
    * They further rule out these models in Figure S4I-K. They tried different ratios of stds for the Gaussian distribution in the alternative models. But I should say, as this aspect ratio increases, there will be less and less difference between the axis model and the Gaussian-based models. Still, by eyeballing, it seems that an axis model is favored over a Gaussian model with an extreme ratio of stds.
    * Suppose our space is 2-d, and the STA direction is $(1,0)$. Then the axis model means that the response to image $(a,b)$ is given by a function of $a$ (their dot product). However, Gaussian models say it's given by a function of $\sqrt{(a-1)^2+b^2/K}$, where $K$ is the squared ratio of the stds along the two directions. As $K$ goes to infinity, this reduces to $|a-1|$, which is a function of $a$ only.
* They ruled out the possibility that adaptation explains their conclusion, by studying STAs from stimuli with different adaptation strengths.
* They reconciled this axis model with the previous exemplar view by showing that
    1. AM neurons, while axis based, can still exhibit sparse responses.
    2. Figure 4F. If the exemplar model is correct, then there should be a correlation between the two sparseness measures used here (`A(0.67)/A(0)` and Vinje and Gallant sparseness). The lack of correlation for AM cells is further evidence that they are not exemplar based.
* Something more conclusive: they compared an axis model against a distance/exemplar model for predicting responses to novel real faces, and reported the variance explained. A tuning nonlinearity was involved, though, and I haven't seen the details of which nonlinearity was used or whether it is the same for both encoding model classes. They also compared against eigenfaces. The axis model beat both.
   * No idea why they do not simply invert the decoding model. My guess is that simply inverting the decoding model may not work well, because of the ramp-like nonlinearity in neurons. This also suggests that the decoding model may actually not be that good at recovering face parameters **numerically**. For decoding purposes (getting high classification accuracy, etc.), it seems to be enough, though.
   * They claim their model works better than DiCarlo's (by Yamins et al.). I think there are two reasons.
      1. There's this particular 3rd-order nonlinearity, which may account for some 'ramp-like' behavior better than regression on higher CNN layers.
      2. They focus on faces, whereas previous work used more generic object datasets.
   * I think in general, CNN layer responses are similar to those of AM **without** the ramp-like nonlinearity. This might be something worth trying.
   * (p. 1022) They presented test faces more often, so that they get a more reliable PSTH when computing explained variance.
   > To obtain high signal quality, the 100 faces were repeated ten times more frequently than the rest of the 1900 faces.
* The spread of metamers: one of their pillars of argument is the disambiguation via metamers, done in an online experimental session. 
* One interesting result is view invariance in AM cells, where they decoded the face identity even in profile.
    * Yimeng: check the technical details part. It's using the transformed profile face parameters.
* Then on to the theoretical benefits of this coding scheme (they cite the paper I recently discussed at scabby). Fewer neurons are needed, inputs to each neuron are more distributed (is this sensitive to the network implementation though?), and the code is robust to noise because of this property. In their own words:

> *added a large amount of random noise to the inputs (Figure 7C2, lower). We found that, for dimensionality higher
than three, axis models perform better than distance models
(Figure 7C3). Finally, an axis metric endows downstream areas
reading out the activity of AM with greater flexibility to discriminate along a variety of different dimensions. If there is a linear
relationship between facial features and responses, then one
can linearly decode the facial features (Figure 3) and use these
decoded features flexibly for any purpose, not only for face identification (e.g., by ‘‘Jennifer Aniston’’ cells in the hippocampus;
Quiroga et al., 2005) but also for other tasks such as gender
discrimination or recognition of daily changes in a familiar face
(Figure 7D). In sum, axis coding is more flexible, efficient, and
robust to noise for representation of objects in a high-dimensional space compared to exemplar coding.*
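
The linear decoding discussed in the notes above (fit $R=SF+C$ and invert it, versus estimating the inverse model directly) can be sketched on synthetic data. This is only an illustration under stated assumptions: the cells are simulated as exactly linear with Gaussian noise, all sizes except the 50 dimensions and 205 cells are arbitrary, and a simple train/test split stands in for the leave-one-face-out procedure. Rows are faces here, so the encoding model is written as `R = F @ S + C`.

```python
import numpy as np

rng = np.random.default_rng(0)
n_faces, n_dims, n_cells = 500, 50, 205

F = rng.standard_normal((n_faces, n_dims))         # face parameters, one row per face
S_true = rng.standard_normal((n_dims, n_cells))    # hypothetical preferred axes
C_true = rng.uniform(5, 10, n_cells)               # hypothetical baseline rates
R = F @ S_true + C_true + 0.5 * rng.standard_normal((n_faces, n_cells))

train, test = slice(0, 450), slice(450, None)

def add_bias(X):
    """Append a column of ones so the least-squares fit includes an offset."""
    return np.hstack([X, np.ones((X.shape[0], 1))])

# (a) Encoding first: least-squares fit of R = F S + C, then invert with a pseudoinverse.
W_enc, *_ = np.linalg.lstsq(add_bias(F[train]), R[train], rcond=None)
S_hat, C_hat = W_enc[:-1], W_enc[-1]
F_hat_a = (R[test] - C_hat) @ np.linalg.pinv(S_hat)

# (b) Decoding directly: least-squares fit of F = R W + b.
W_dec, *_ = np.linalg.lstsq(add_bias(R[train]), F[train], rcond=None)
F_hat_b = add_bias(R[test]) @ W_dec

corr_a = np.corrcoef(F_hat_a.ravel(), F[test].ravel())[0, 1]
corr_b = np.corrcoef(F_hat_b.ravel(), F[test].ravel())[0, 1]
print(f"invert encoding model: r = {corr_a:.3f}; direct decoder: r = {corr_b:.3f}")
```

With truly linear simulated cells both routes recover the held-out face parameters well. Differences between them would only show up with nonlinear (ramp-like) tuning or strongly correlated noise, neither of which is modeled here.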

### Discussion

A CNN showed the same axis coding, which doesn't seem surprising; it is a stretch to use this as support, though. They make the analogy to grid and place cells, as the idea of finding the correct basis for a group of cells. Much of the power of this paper is that they used 'correct' and rigorous methods and arguments, and made a lot of connections to fashionable things. From a science perspective, I was persuaded that single cells in IT care about axes of face properties, rather than identities, but apparently this isn't that radical?

### technical details

#### generation of face stimuli.

Check p. e2 of STAR Methods. Essentially, for each face from the database, they first find 28 landmarks on the image, and then align all faces w.r.t. these landmarks by image warping. PCA on the offset vectors of these landmarks gives the shape information, and PCA on the aligned faces gives the appearance information. Notice that they say the aligned face vectors have length 17304. I'm not sure how this is computed. Since $17304 = 2^3 \times 3 \times 7 \times 103$ and $103$ is prime, one side of the image must be a multiple of $103$; the most natural factorization is $17304 = 103 \times 168$ (others, such as $206 \times 84$, are more elongated). But none of the faces shown in the paper have an aspect ratio of 168 by 103. So some implementation details are missing, and there might be some cropping, etc. in the shown images.
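
A rough sketch of this kind of PCA-based face space, using random arrays in place of real landmark and image data. The database size, the independent Gaussian sampling of coordinates, and the helper name `pca_basis` are assumptions for illustration; the final rendering step (warping each sampled texture to its sampled landmarks) is not shown.

```python
import numpy as np

def pca_basis(X, n_components):
    """Return the mean, the top principal axes (as rows), and the std along each axis."""
    mean = X.mean(axis=0)
    _, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
    std = s[:n_components] / np.sqrt(X.shape[0] - 1)
    return mean, Vt[:n_components], std

rng = np.random.default_rng(0)
n_db = 100                                      # arbitrary database size for this sketch
landmarks = rng.standard_normal((n_db, 56))     # 28 (x, y) landmarks per face, synthetic
aligned = rng.standard_normal((n_db, 17304))    # landmark-aligned image vectors, synthetic

shape_mean, shape_axes, shape_std = pca_basis(landmarks, 25)
app_mean, app_axes, app_std = pca_basis(aligned, 25)

# Sample 2,000 faces as points in the 50-d space (25 shape + 25 appearance coordinates),
# with each coordinate drawn independently and scaled by the std along its axis.
coords = rng.standard_normal((2000, 50))
new_landmarks = shape_mean + (coords[:, :25] * shape_std) @ shape_axes
# Only a few textures are reconstructed here to keep memory use small.
new_textures = app_mean + (coords[:5, 25:] * app_std) @ app_axes
print(new_landmarks.shape, new_textures.shape)
```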

##### face parameterization space.

> Pairwise correlations between the 2000-long vectors for each dimension were further removed by orthogonalization.

This space is whitened. Otherwise, one could not reasonably perform STA on it or obtain a Gaussian-like distribution. Remember, STA assumes that the stimulus distribution is spherically symmetric (e.g., white Gaussian).
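
One way to remove pairwise correlations between the 2000-long coordinate vectors is a QR (Gram-Schmidt style) orthogonalization of the centered coordinate matrix. This is a guess at the procedure, not something the quoted sentence specifies, and the correlated input below is synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
mix = np.eye(50) + 0.05 * rng.standard_normal((50, 50))   # hypothetical mild cross-talk
X = rng.standard_normal((2000, 50)) @ mix                 # slightly correlated coordinates

Xc = X - X.mean(axis=0)                   # center each dimension
Q, _ = np.linalg.qr(Xc)                   # orthonormal columns spanning the same space
X_orth = Q * np.sqrt(X.shape[0] - 1)      # rescale each column to unit sample variance

corr = np.corrcoef(X_orth, rowvar=False)
max_offdiag = np.abs(corr - np.eye(50)).max()
print(f"largest remaining pairwise correlation: {max_offdiag:.2e}")
```

After this step every pair of dimensions is uncorrelated across the 2,000 faces and each has unit variance, which is the whitened condition the STA argument relies on.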

> The correspondence between frontal coordinates and profile coordinates was identified using linear regression for shape dimensions and appearance dimensions independently; the resulting linear transformation was applied to the profile coordinates, to produce new profile coordinates registered to the frontal coordinates.

In **The Axis Coding Model Is Tolerant to View Changes**, Fig. 6B shows very high correlation between the STA on frontal faces and the STA on profile (one side of the face) faces. This is achievable because the profile faces are expressed in the frontal face space, as shown in the previous quote. Right after the quoted sentences, they say they fit a linear model from frontal to profile, but I think in practice they may have reversed it, since the profile-to-frontal direction is what is actually used. Nevertheless, it's essential to know that these profile face parameters are transformed, and this shows that frontal parameters and profile parameters live in the same space, up to some linear transformation.
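
A minimal sketch of this registration step, assuming the map is fitted from profile to frontal coordinates (the direction that is actually applied) and separately for the shape and appearance blocks. The synthetic relationship between the two views (a random rotation plus noise) and the helper names are hypothetical.

```python
import numpy as np

def fit_linear_map(src, dst):
    """Least-squares affine map W such that dst is approximately [src, 1] @ W."""
    A = np.hstack([src, np.ones((src.shape[0], 1))])
    W, *_ = np.linalg.lstsq(A, dst, rcond=None)
    return W

def apply_linear_map(src, W):
    """Apply a map fitted by fit_linear_map to new source coordinates."""
    return np.hstack([src, np.ones((src.shape[0], 1))]) @ W

rng = np.random.default_rng(0)
n_ids = 200                                   # arbitrary number of identities
frontal = rng.standard_normal((n_ids, 50))    # frontal-view coordinates (synthetic)

# Hypothetical profile coordinates: each 25-d block is a rotated, noisy copy of frontal.
rot_shape, _ = np.linalg.qr(rng.standard_normal((25, 25)))
rot_app, _ = np.linalg.qr(rng.standard_normal((25, 25)))
profile = np.hstack([frontal[:, :25] @ rot_shape, frontal[:, 25:] @ rot_app])
profile += 0.1 * rng.standard_normal(profile.shape)

# Register profile coordinates to the frontal space, one block at a time.
registered = np.empty_like(profile)
for block in (slice(0, 25), slice(25, 50)):
    W = fit_linear_map(profile[:, block], frontal[:, block])
    registered[:, block] = apply_linear_map(profile[:, block], W)

corr = np.corrcoef(registered.ravel(), frontal.ravel())[0, 1]
print(f"correlation between registered profile and frontal coordinates: {corr:.3f}")
```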


~~~
@article{Chang:2017il,
abstract = {Primates recognize complex objects such as faces with remarkable speed and reliability. Here, we reveal the brain’s code for facial identity. Experiments in macaques demonstrate an extraordinarily simple transformation between faces and responses of cells in face patches. By formatting faces as points in a high-dimensional linear space, we discovered that each face cell’s firing rate is proportional to the projection of an incoming face stimulus onto a single axis in this space, allowing a face cell ensemble to encode the location of any face in the space. Using this code, we could precisely decode faces from neural population responses and predict neural firing rates to faces. Furthermore, this code disavows the long-standing assumption that face cells encode specific facial identities, confirmed by engineering faces with drastically different appearance that elicited identical responses in single face cells. Our work suggests that other objects could be encoded by analogous metric coordinate systems.},
author = {Chang, Le and Tsao, Doris Y},
doi = {10.1016/j.cell.2017.05.011},
file = {:Users/faisal/Documents/Papers/PIIS009286741730538X.pdf:pdf},
issn = {0092-8674},
journal = {Cell},
keywords = {decoding,electrophysiology,face processing,inferior temporal cortex,primate vision},
number = {6},
pages = {1013--1028.e14},
pmid = {28575666},
publisher = {Elsevier Inc.},
title = {{The Code for Facial Identity in the Primate Brain}},
url = {http://dx.doi.org/10.1016/j.cell.2017.05.011},
volume = {169},
year = {2017},
}
~~~

## Increasing nonlinearity, or configuration coding in posterior IT over the course of exposure

Brincat, S. L. & Connor, C. E. Dynamic shape synthesis in posterior inferotemporal cortex. Neuron 49, 17–24 (2006).

### Notes

We start from the Gaussian models for tuning peaks and parts from the 2004 paper, and then observe how the weights of the fitted models evolve over the 750 ms during which the stimuli are on screen. So, for each neuron, we fit the response in the form $R_i=aG_1+bG_2+cG_1G_2$. In practice, instead of just 2 base Gaussians, 1-6 are used, with the exact number chosen via stepwise regression. The Gaussians are in the curvature-orientation-position domain.
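
As an illustration of tracking the fitted weights over time, the sketch below fits $R=aG_1+bG_2+cG_1G_2$ by least squares in each time bin. Everything here is synthetic and simplified: $G_1$ and $G_2$ are random stand-ins for the outputs of two Gaussian part-tuning functions, only two parts are used, there is no stepwise selection, and the "ground truth" weights (fixed linear terms, growing product term) are made up to mimic a linear-to-nonlinear transition.

```python
import numpy as np

rng = np.random.default_rng(0)
n_stim, n_bins = 300, 5

G1 = rng.uniform(0, 1, n_stim)             # stand-in for Gaussian tuning output, part 1
G2 = rng.uniform(0, 1, n_stim)             # stand-in for Gaussian tuning output, part 2
X = np.column_stack([G1, G2, G1 * G2])     # regressors: two linear terms and their product

# Hypothetical ground truth per time bin: a = b = 1 throughout, c ramps from 0 to 4.
true_w = np.array([[1.0, 1.0, c] for c in np.linspace(0.0, 4.0, n_bins)])
R = X @ true_w.T + 0.05 * rng.standard_normal((n_stim, n_bins))

# One least-squares fit per time bin; the columns of W_hat are the bins.
W_hat, *_ = np.linalg.lstsq(X, R, rcond=None)
a_hat, b_hat, c_hat = W_hat

for t in range(n_bins):
    print(f"bin {t}: a = {a_hat[t]:.2f}, b = {b_hat[t]:.2f}, c = {c_hat[t]:.2f}")
```

The relative size of `c_hat` against `a_hat` and `b_hat` in each bin is the kind of quantity a linear / nonlinear / mixed classification could be based on.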

Neurons are classified as linear, nonlinear, or mixed, based on the ratio of those weights and a heuristic numeric criterion. For the key figures and some additional context/significance, the [presentation slides for lab meeting are attached.](./_slides/6_19_17PaperPresentation.pdf)

#### Findings
There is a 62 ms lag between the latency peaks of the population linear and nonlinear components, and this arises both across cells and within cells. So nonlinear cells have slightly later latencies than linear and mixed cells, but more importantly, the mixed cells tend to transition from linear to nonlinear (figure 2).

#### Model
The model here is a network where units vary in how much they weight feedforward input (their own part tuning) relative to recurrent input (difference-of-Gaussians connectivity, with excitation from horizontal units with similar tuning and inhibition outside that). This structure of recurrent input leads to the nonlinear response to the configuration of parts, and to the time course of this response.

~~~
@article{Brincat2006,
abstract = {How does the brain synthesize low-level neural signals for simple shape parts into coherent representations of complete objects? Here, we present evidence for a dynamic process of object part integration in macaque posterior inferotemporal cortex (IT). Immediately after stimulus onset, neural responses carried information about individual object parts (simple contour fragments) only. Subsequently, information about specific multipart configurations emerged, building gradually over the course of ∼60 ms, producing a sparser and more explicit representation of object shape. We show that this gradual transformation can be explained by a recurrent network process that effectively compares parts signals across neurons to generate inferences about multipart shape configurations. © 2006 Elsevier Inc.},
author = {Brincat, Scott L. and Connor, Charles E.},
doi = {10.1016/j.neuron.2005.11.026},
file = {:Users/faisal/Documents/Papers/PIIS0896627305010068.pdf:pdf;:Users/faisal/Downloads/mmc1.pdf:pdf},
isbn = {0896-6273 (Print)$\backslash$r0896-6273 (Linking)},
issn = {08966273},
journal = {Neuron},
mendeley-groups = {Vision},
number = {1},
pages = {17--24},
pmid = {16387636},
title = {{Dynamic shape synthesis in posterior inferotemporal cortex}},
volume = {49},
year = {2006},
}
~~~