# What makes neurons picky?
## Tom George, April 2020

### Introduction
Neurons in different regions of the brain do not all behave the same. Take the difference between the visual cortex and the prefrontal cortex (PFC). In the visual cortex, neurons respond highly selectively to specific visual stimuli, from oriented lines and edges in area V1 to complex shapes, textures and faces in higher visual areas. If a neuron is known to respond (fire above baseline) when a face is present in a visual scene, it will probably respond only weakly, or not at all, when the subject is shown, e.g., an apple or a dog. Such neurons are specifically tuned to pick out one thing. Conversely, neurons in the prefrontal cortex (the area of the brain known to be involved in decision making and planning complex cognitive behaviours) show incredibly complex response patterns. Their firing rates have been shown to contain information about a broad range of task-relevant stimuli. To put this in context, a neuron in the PFC may be important when you are trying to remember a phone number which has just been read to you, but it is likely that the same neuron is also important in many other, very different cognitive tasks, e.g. planning when to cross a road.

But what causes this? Why are neurons 'mixed selective' in the PFC but 'selective' in the visual cortex? Both brain regions perform their function well but are subject to different constraints. First and foremost, the space of all possible cognitive tasks the PFC might be required to perform is practically infinite (complex tasks can be composed recursively from simpler tasks), whereas visual scenes (although rich and varied) are generally built from a basic set of polygons, colours and textures. Rigotti et al. (2013) show that mixed selectivity offers a significant computational advantage over selectivity by vastly increasing the repertoire of possible input-output functions. Another valid hypothesis might be that neurons are, by default, mixed selective, but when a task is performed often enough it becomes more efficient to dedicate a neuron (or, more likely, a subset of neurons) to performing just that task. Perhaps, then, the visual system contains selective neurons because tasks such as recognising faces, emotions, family members or food types are performed regularly. This argument can be extended to a higher level: perhaps highly specialised brain regions (the visual cortex, the motor cortex, the olfactory cortex) developed throughout evolution because otherwise more generalised computing regions in the brain (for example the PFC) were becoming overwhelmed by having to perform the same tasks over and over again.

In this project I want to explore and test these hypotheses from a computational perspective. To do this we will build a deep neural network (representing a general but simple function approximator which we can probe) and train it on multiple tasks simultaneously. We can then analyse the hidden neurons (which stand in for the neurons in the PFC or the visual cortex that 'do' the computation leading to our conscious percepts or behaviour) to see whether they are mixed-selective or selective, and test what affects this. It's not meant to be a detailed, biologically plausible replica of any single brain area but rather the simplest model in which we can probe neuronal selectivity and the parameters which affect it.

### Simple model trained on simple tasks
To start we will train a very simple model on very simple tasks. For this purpose we will build a vanilla deep neural network with just two numerical input neurons and one output neuron. The goal is to learn, simultaneously, to add and to multiply the two inputs. Along with the numerical inputs we pass a task-context input: a one-hot vector ([1,0] for task 1 and [0,1] for task 2) informing the network which task to perform. The model has four hidden layers and is trained by mini-batch backpropagation.

• Task 1: $x_0 + x_1$

• Task 2: $x_0 \cdot x_1$

The model class $\texttt{simple_network}$ in $\texttt{networks.py}$ does this. It is initialised with a list of hyperparameters.
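
The internals of $\texttt{simple_network}$ are not shown here, so as a rough illustration of the setup described above, the sketch below builds a mixed dataset for the two tasks with the one-hot context appended to the inputs. The function name `make_dataset`, the input range $[-1, 1]$ and the column layout are assumptions made for this example, not necessarily what $\texttt{networks.py}$ does.

```python
import numpy as np

def make_dataset(n, second_task="prod", fraction=0.5, seed=0):
    """Return inputs [x0, x1, c0, c1] and targets for a random mix of two tasks.

    Hypothetical helper: the input range and layout are assumptions.
    """
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1.0, 1.0, size=(n, 2))        # assumed input range
    is_task1 = rng.random(n) < fraction            # True -> task 1 (addition)

    # One-hot task context: [1, 0] for task 1, [0, 1] for task 2.
    context = np.where(is_task1[:, None], [1.0, 0.0], [0.0, 1.0])

    if second_task == "prod":
        y2 = x[:, 0] * x[:, 1]
    elif second_task == "add1.5":
        y2 = x[:, 0] + 1.5 * x[:, 1]
    else:
        raise ValueError(f"unknown second_task: {second_task!r}")

    y = np.where(is_task1, x[:, 0] + x[:, 1], y2)
    inputs = np.concatenate([x, context], axis=1)
    return inputs, y

inputs, targets = make_dataset(1000)
print(inputs.shape, targets.shape)
```

Each row carries both the numbers and the task label, so a single network sees examples of both tasks within the same batch.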



In [1]:
import numpy as np
np.seterr(all='ignore')
from networks import simple_network
simple_hyperparameters = {'N_train' : 1000, #size of training dataset 
                          'N_test' : 100, #size of test set
                          'lr' : 0.001, #SGD learning rate 
                          'epochs' : 10, #training epochs
                          'batch_size' : 10,  #batch size (large batches will probably fail)
                          'context_location' : 'start',  #where to feed in the task context 'start' vs 'end'
                          'train_mode' : 'random', #training mode 'random' vs 'replay' 
                          'second_task' : 'prod', #first task adds x+y, second task 'prod' = xy or 'add1.5' = x+1.5y
                          'fraction' : 0.50, #fraction of training data for tasks 1 vs task 2
                          'hidden_size' : 50} #hidden layer width

simple_model = simple_network(simple_hyperparameters)

Now train the model and plot the training accuracy on both tasks throughout training. In fact the analysis is better if we train a large number of models and plot the average result. A model is saved if and only if the error on both tasks is less than some threshold (0.05) after training.  

In [None]:
from utils import plot_training, train_multiple
models = train_multiple(simple_network, simple_hyperparameters, N_models = 10)
plot_training(models)

Training 50 models


 20%|██        | 10/50 [00:16<01:01,  1.53s/it]

We now have trained models which can perform both of the two tasks. But how do we know the selectivity preferences of the hidden neurons? For this we define the 'importance', $\mathcal{I}_i(A)$, of neuron $i$ on task A. Intuitively, we define this as the squared change in the expected loss (over task A's test set) when hidden neuron $i$ is set to zero:
\begin{equation}
\mathcal{I}_i(A) = \big(   \mathop{\mathbb{E}}_{z\sim\mathcal{D_{A}}}[\ell(z;\mathbf{h})] - \mathop{\mathbb{E}}_{z\sim\mathcal{D_{A}}}[\ell(z;\mathbf{h}|h_{i}=0)]  \big)^2 .
\end{equation}
Here $\mathbf{h}$ represents a vector containing the state of all hidden neurons in the network, and $\mathbf{h}|h_{i}=0$ (written $\mathbf{h}_{h_i = 0}$ below) represents the same network where $h_i$ is set to zero and the effect of this is then propagated forwards. We can Taylor expand the loss function about $\mathbf{h}$, with $\mathsf{H}$ the Hessian of $\ell$ with respect to $\mathbf{h}$:
\begin{equation}
\mathop{\mathbb{E}}_{z\sim\mathcal{D_{A}}}[\ell(z;\mathbf{h}|h_{i}=0)]  = \mathop{\mathbb{E}}_{z\sim\mathcal{D_{A}}}\Big[\ell(z;\mathbf{h}) + (\mathbf{h}_{h_i = 0} - \mathbf{h})^{\mathsf{T}}\frac{\partial \ell}{\partial \mathbf{h}} + \frac{1}{2} (\mathbf{h}_{h_i = 0} - \mathbf{h})^{\mathsf{T}} \mathsf{H} (\mathbf{h}_{h_i = 0} - \mathbf{h}) + \dots \Big]
\end{equation}
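Keeping only the direct effect of silencing $h_i$ (the $i$-th component of $\mathbf{h}_{h_i = 0} - \mathbf{h}$ is $-h_i$), the importance of a neuron is, to first order, given by:
\begin{equation}
\mathcal{I}_i(A) \approx \bigg( \mathop{\mathbb{E}}_{z\sim\mathcal{D_{A}}} \bigg[h_i \frac{\partial \ell}{\partial h_i}\bigg] \bigg)^2
\end{equation}
which is straightforward to compute in PyTorch: after calling $\texttt{.retain_grad()}$ on the hidden activations and running a backward pass, the gradient of the loss with respect to each activation is available through its $\texttt{.grad}$ attribute.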
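
To make the two definitions concrete, here is a minimal NumPy sketch that computes both the exact ablation importance and its first-order estimate. It uses a hypothetical, untrained toy network (2 inputs, 5 ReLU hidden units, linear output, squared-error loss) with the gradient written out by hand, not the notebook's $\texttt{simple_network}$. Because $h_i$ varies from example to example, the product $h_i \, \partial\ell/\partial h_i$ is averaged over the test set before squaring.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy network: 2 inputs -> 5 ReLU hidden units -> 1 linear output.
W1 = rng.normal(size=(2, 5))
w2 = rng.normal(size=5)
w2[0] = 0.0          # unit 0 has no outgoing weight, so it cannot matter

x = rng.uniform(-1.0, 1.0, size=(200, 2))   # stand-in for task A's test set
y = x[:, 0] + x[:, 1]                       # task A targets (addition)

def loss(h, y):
    """Per-example squared error given hidden activations h."""
    return (h @ w2 - y) ** 2

h = np.maximum(x @ W1, 0.0)                 # hidden activations, shape (200, 5)

# Exact importance: squared change in mean loss when unit i is silenced.
exact = np.empty(5)
for i in range(5):
    h_ablated = h.copy()
    h_ablated[:, i] = 0.0
    exact[i] = (loss(h, y).mean() - loss(h_ablated, y).mean()) ** 2

# First-order estimate: d loss / d h_i = 2 * (prediction - y) * w2_i.
dl_dh = 2.0 * (h @ w2 - y)[:, None] * w2[None, :]
first_order = (h * dl_dh).mean(axis=0) ** 2

print(np.round(exact, 4))
print(np.round(first_order, 4))
```

The two arrays generally differ, since the estimate drops the second-order term of the expansion; it is most trustworthy when silencing a unit changes the output only slightly.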

To study whether a neuron is selective to one task or mixed selective we will define the 'relative importance', $\mathcal{RI}_i(A,B)$, of neuron $i$ over tasks $A$ and $B$:
\begin{equation}
\mathcal{RI}_i(A,B) = \frac{\mathcal{I}_i(A) - \mathcal{I}_i(B)}{\mathcal{I}_i(A)+ \mathcal{I}_i(B)}
\end{equation}
If the relative importance is close to 1 then we can assume the neuron is entirely selective to task A and unimportant for task B, and vice versa if $\mathcal{RI}_i(A,B)$ is close to -1. If $\mathcal{RI}_i(A,B) \approx 0$ then the neuron is equally important to both tasks. By plotting a histogram over all the neurons in a hidden layer (and over many identically trained but independently initialised models) we can get a broad picture of how the computational effort required to solve the two tasks is shared amongst the neurons. The function $\texttt{plot_FTV()}$ plots these relative importance histograms for neurons in all four hidden layers and across all models (the input and output layers are not shown). Note: 'FTV' stands for 'fractional task variance', following Yang et al. (2019); it is the name I originally used for what is called 'relative importance' here.
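
As a small worked example of the definition above, the sketch below computes the relative importance from two importance vectors and bins it into a histogram. The helper name `relative_importance` and the tolerance `tol` used to flag neurons that matter for neither task are assumptions for illustration; $\texttt{plot_FTV()}$ may handle these details differently.

```python
import numpy as np

def relative_importance(imp_a, imp_b, tol=1e-12):
    """Return (RI, active): RI in [-1, 1] and a mask of neurons that matter at all.

    Neurons whose total importance is below `tol` get RI = nan and active = False.
    """
    imp_a = np.asarray(imp_a, dtype=float)
    imp_b = np.asarray(imp_b, dtype=float)
    total = imp_a + imp_b
    active = total > tol
    ri = np.full(total.shape, np.nan)
    ri[active] = (imp_a[active] - imp_b[active]) / total[active]
    return ri, active

# Four hypothetical neurons: task-A only, task-B only, shared, and unused.
imp_a = [4.0, 0.0, 1.0, 0.0]
imp_b = [0.0, 2.0, 1.0, 0.0]
ri, active = relative_importance(imp_a, imp_b)
print(ri)                     # values: 1, -1, 0, nan
print(1.0 - active.mean())    # fraction of neurons important for neither task

# Histogram over the neurons that matter for at least one task.
counts, edges = np.histogram(ri[active], bins=20, range=(-1.0, 1.0))
```

A bimodal histogram (mass near -1 and +1) indicates task-selective neurons, while mass near 0 indicates mixed selectivity.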

In [None]:
from utils import plot_FTV
plot_FTV(models)

On the x-axis of each plot is the relative importance. On the y-axis is the proportion of neurons with the corresponding $\mathcal{RI}$. The colour changes smoothly from green on the right (representing neurons which are mostly important for task 1 but not task 2) to orange on the left (representing neurons which are mostly important for task 2 but not task 1). The percentage in the top right corner shows the proportion of neurons which are not important for either task 1 or task 2 and so do not show in the histogram.

In the first layer the FTV is strongly bimodal. Most neurons are either exclusively important for task 1 or for task 2, and relatively few neurons are mixed. In fact I found this to be common to all layers immediately after the task-context input. The task-context vector appears to function by dividing the neurons in that layer into two non-overlapping subgroups, one dedicated to task 1 and one dedicated to task 2.

To test this theory, let's relocate the context: rather than adding it in at the beginning, we add it in at the fourth hidden layer (i.e. we append the context vector to the outputs of the third hidden layer).
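
The difference between the two context locations can be sketched as a plain forward pass. This is an assumed, bias-free NumPy toy with ReLU hidden layers, written only to show where the context vector enters; the helper names `init_weights` and `forward` are hypothetical and the real $\texttt{simple_network}$ may differ.

```python
import numpy as np

def init_weights(hidden_size, context_location, rng):
    """Weight matrices for 4 hidden layers plus a linear output (no biases)."""
    n_in = 4 if context_location == "start" else 2
    sizes = [n_in] + [hidden_size] * 4 + [1]
    weights = []
    for layer, (fan_in, fan_out) in enumerate(zip(sizes[:-1], sizes[1:]), start=1):
        if context_location == "end" and layer == 4:
            fan_in += 2        # room for the context appended before hidden layer 4
        weights.append(rng.normal(scale=fan_in ** -0.5, size=(fan_in, fan_out)))
    return weights

def forward(x, context, weights, context_location="start"):
    """Forward pass; the context enters at the input or just before hidden layer 4."""
    h = np.concatenate([x, context], axis=1) if context_location == "start" else x
    for layer, W in enumerate(weights[:-1], start=1):
        if context_location == "end" and layer == 4:
            h = np.concatenate([h, context], axis=1)   # append to layer-3 outputs
        h = np.maximum(h @ W, 0.0)                     # ReLU hidden layer
    return h @ weights[-1]                             # linear output

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=(8, 2))
context = np.tile([1.0, 0.0], (8, 1))                  # all examples are task 1

for location in ("start", "end"):
    weights = init_weights(50, location, rng)
    print(location, forward(x, context, weights, location).shape)
```

With `"end"`, hidden layers 1 to 3 never see the task label directly, so any task selectivity found there must come from the training signal alone.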

In [None]:
simple_hyperparameters['context_location'] = 'end'
models = train_multiple(simple_network, simple_hyperparameters, N_models = 10)
plot_FTV(models)

As expected, neurons immediately after the context is added (layer 4) are strongly bimodal whereas neurons in earlier layers are generally mixed. Interestingly some task selectivity has 'leaked' backwards. A surplus of neurons in layers 1 to 3 are in fact specialised to one task or the other even though 'which task' information is not passed in until further downstream. This is the power of backpropagation. 

Now let's try a different second task. Finding the product of two numbers is quite different from finding their sum. But what if we make the second task also an addition, like task 1, so that instead of $x_0 + x_1$ task 2 is to find $x_0 + 1.5x_1$? Will this make a significant difference?

In [None]:
simple_hyperparameters['context_location'] = 'start'
simple_hyperparameters['second_task'] = 'add1.5'
models = train_multiple(simple_network, simple_hyperparameters, N_models = 10)
plot_FTV(models, title=r'second task = $x_0 + 1.5x_1$')

The difference is stark: now the bimodality in the initial layer is rapidly lost as we go deeper into the network and the $\mathcal{RI}$ distribution becomes unimodal. What we can conclude here is that the network effectively 'learns' that the two tasks it is meant to perform are nearly the same and so shares the computational effort for them amongst the neurons.

What about training bias? Will the above networks still learn mixed selective representations if one task (say, task 1) is trained on more often than task 2? Let's check this by biasing training so that task 1 comes up 10 times more frequently than task 2.
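
For reference, a 10:1 ratio of task 1 to task 2 corresponds to a task-1 share of $10/11 \approx 0.91$ of the training data. The short check below samples task labels at that share; the variable names are illustrative only.

```python
import numpy as np

ratio = 10                           # task 1 appears 10x as often as task 2
fraction = ratio / (ratio + 1)       # share of task-1 examples, about 0.909

rng = np.random.default_rng(0)
is_task1 = rng.random(100_000) < fraction
empirical_ratio = is_task1.sum() / (~is_task1).sum()

print(round(fraction, 3), round(empirical_ratio, 2))
```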

In [None]:
simple_hyperparameters['fraction'] = 0.91 #i.e. 91% of the training data is now task 1
models = train_multiple(simple_network, simple_hyperparameters, N_models = 10)
plot_FTV(models, title=r'training bias')