<a href="https://colab.research.google.com/github/NeuromatchAcademy/course-content/blob/master/tutorials/W2D1_DeepLearning/student/W2D1_Tutorial1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tutorial 1: Decoding Neural Responses
**Week 2, Day 1: Deep Learning**

**By Neuromatch Academy**

**Content creators**: Jorge A. Menendez, Carsen Stringer

**Content reviewers**: Roozbeh Farhoodi,  Madineh Sarvestani, Kshitij Dwivedi, Spiros Chavlis, Ella Batty, Michael Waskom


**Our 2021 Sponsors, including Presenting Sponsor Facebook Reality Labs**

<p align='center'><img src='https://github.com/NeuromatchAcademy/widgets/blob/master/sponsors.png?raw=True'/></p>

---
# Tutorial Objectives
In this tutorial, we'll use deep learning to decode stimulus information from the responses of sensory neurons. Specifically, we'll look at the activity of ~20,000 neurons in mouse primary visual cortex responding to oriented gratings recorded in [this study](https://www.biorxiv.org/content/10.1101/679324v2.abstract). Our task will be to decode the orientation of the presented stimulus from the responses of the whole population of neurons. We could do this in a number of ways, but here we'll use deep learning. Deep learning is particularly well-suited to this problem for a number of reasons:
* The data are very high-dimensional: the neural response to a stimulus is a ~20,000 dimensional vector. Many machine learning techniques fail in such high dimensions, but deep learning actually thrives in this regime, as long as you have enough data (which we do here!).
* As you'll be able to see below, different neurons can respond quite differently to stimuli. This complex pattern of responses will, therefore, require non-linear methods to be decoded, which we can easily do with non-linear activation functions in deep networks.
* Deep learning architectures are highly flexible, meaning we can easily adapt the architecture of our decoding model to optimize decoding. Here, we'll focus on a single architecture, but you'll see that it can easily be modified with few changes to the code.

More concretely, our goal will be to learn how to:
* Build a deep feed-forward network using PyTorch
* Evaluate the network's outputs using PyTorch built-in loss functions
* Compute gradients of the loss with respect to each parameter of the network using automatic differentiation
* Implement gradient descent to optimize the network's parameters
* (Bonus) Learn how to regularize the optimization


In [None]:
# @title Video 1: Decoding from neural data using feed-forward networks in pytorch
from ipywidgets import widgets

out2 = widgets.Output()
with out2:
  from IPython.display import IFrame
  class BiliVideo(IFrame):
      def __init__(self, id, page=1, width=400, height=300, **kwargs):
          self.id=id
          src = 'https://player.bilibili.com/player.html?bvid={0}&page={1}'.format(id, page)
          super(BiliVideo, self).__init__(src, width, height, **kwargs)

  video = BiliVideo(id="BV1Xa4y1a7Jz", width=854, height=480, fs=1)
  print('Video available at https://www.bilibili.com/video/{0}'.format(video.id))
  display(video)

out1 = widgets.Output()
with out1:
  from IPython.display import YouTubeVideo
  video = YouTubeVideo(id="SlrbMvvBOzM", width=854, height=480, fs=1, rel=0)
  print('Video available at https://youtube.com/watch?v=' + video.id)
  display(video)

out = widgets.Tab([out1, out2])
out.set_title(0, 'Youtube')
out.set_title(1, 'Bilibili')

display(out)

---
# Setup


In [None]:
import os
import numpy as np

import torch
from torch import nn
from torch import optim

import matplotlib as mpl
from matplotlib import pyplot as plt

In [None]:
#@title Data retrieval and loading
import hashlib
import requests

fname = "W3D4_stringer_oribinned1.npz"
url = "https://osf.io/683xc/download"
expected_md5 = "436599dfd8ebe6019f066c38aed20580"

if not os.path.isfile(fname):
  try:
    r = requests.get(url)
  except requests.ConnectionError:
    print("!!! Failed to download data !!!")
  else:
    if r.status_code != requests.codes.ok:
      print("!!! Failed to download data !!!")
    elif hashlib.md5(r.content).hexdigest() != expected_md5:
      print("!!! Data download appears corrupted !!!")
    else:
      with open(fname, "wb") as fid:
        fid.write(r.content)

In [None]:
#@title Figure Settings
%config InlineBackend.figure_format = 'retina'
plt.style.use("https://raw.githubusercontent.com/NeuromatchAcademy/course-content/master/nma.mplstyle")

In [None]:
#@title Plotting Functions

def plot_data_matrix(X, ax):
  """Visualize data matrix of neural responses using a heatmap

  Args:
    X (torch.Tensor or np.ndarray): matrix of neural responses to visualize
        with a heatmap
    ax (matplotlib axes): where to plot

  """

  cax = ax.imshow(X, cmap=mpl.cm.pink, vmin=np.percentile(X, 1), vmax=np.percentile(X, 99))
  cbar = plt.colorbar(cax, ax=ax, label='normalized neural response')

  ax.set_aspect('auto')
  ax.set_xticks([])
  ax.set_yticks([])

def plot_decoded_results(train_loss, test_loss, test_labels, predicted_test_labels):
  """ Plot decoding results in the form of network training loss and test predictions

  Args:
    train_loss (list): training error over iterations
    test_loss (list): test error over iterations
    test_labels (torch.Tensor): n_test x 1 tensor with orientations of the
      stimuli corresponding to each row of the test data, in degrees
    predicted_test_labels (torch.Tensor): n_test x 1 tensor with predicted orientations of the
      stimuli from decoding neural network

  """

  # Plot results
  fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

  # Plot the training loss over iterations of GD
  ax1.plot(train_loss)
  # Plot the testing loss over iterations of GD
  ax1.plot(test_loss)
  ax1.legend(['train loss', 'test loss'])

  # Plot true stimulus orientation vs. predicted class
  ax2.plot(stimuli_test.squeeze(), predicted_test_labels, '.')

  ax1.set_xlim([0, None])
  ax1.set_ylim([0, None])
  ax1.set_xlabel('iterations of gradient descent')
  ax1.set_ylabel('negative log likelihood')
  ax2.set_xlabel('true stimulus orientation ($^o$)')
  ax2.set_ylabel('decoded orientation bin')
  ax2.set_xticks(np.linspace(0, 360, n_classes + 1))
  ax2.set_yticks(np.arange(n_classes))
  class_bins = [f'{i * 360 / n_classes: .0f}$^o$ - {(i + 1) * 360 / n_classes: .0f}$^o$' for i in range(n_classes)]
  ax2.set_yticklabels(class_bins);

  # Draw bin edges as vertical lines
  ax2.set_ylim(ax2.get_ylim())  # fix y-axis limits
  for i in range(n_classes):
    lower = i * 360 / n_classes
    upper = (i + 1) * 360 / n_classes
    ax2.plot([lower, lower], ax2.get_ylim(), '-', color="0.7", linewidth=1, zorder=-1)
    ax2.plot([upper, upper], ax2.get_ylim(), '-', color="0.7", linewidth=1, zorder=-1)

  plt.tight_layout()

In [None]:
#@title Helper Functions

def load_data(data_name=fname, bin_width=1):
  """Load mouse V1 data from Stringer et al. (2019)

  Data from study reported in this preprint:
  https://www.biorxiv.org/content/10.1101/679324v2.abstract

  These data comprise time-averaged responses of ~20,000 neurons
  to ~4,000 stimulus gratings of different orientations, recorded
  through Calcium imaging. The responses have been normalized by
  spontaneous levels of activity and then z-scored over stimuli, so
  expect negative numbers. They have also been binned and averaged
  to each degree of orientation.

  This function returns the relevant data (neural responses and
  stimulus orientations) in a torch.Tensor of data type torch.float32
  in order to match the default data type for nn.Parameters in
  Google Colab.

  This function will actually average responses to stimuli with orientations
  falling within bins specified by the bin_width argument. This helps
  produce individual neural "responses" with smoother and more
  interpretable tuning curves.

  Args:
    data_name (str): path to the .npz data file to load
    bin_width (float): size of stimulus bins over which to average neural
      responses

  Returns:
    resp (torch.Tensor): n_stimuli x n_neurons matrix of neural responses,
        each row contains the responses of each neuron to a given stimulus.
        As mentioned above, neural "response" is actually an average over
        responses to stimuli with similar angles falling within specified bins.
    stimuli: (torch.Tensor): n_stimuli x 1 column vector with orientation
        of each stimulus, in degrees. This is actually the mean orientation
        of all stimuli in each bin.

  """
  with np.load(data_name) as dobj:
    data = dict(**dobj)
  resp = data['resp']
  stimuli = data['stimuli']

  if bin_width > 1:
    # Bin neural responses and stimuli
    bins = np.digitize(stimuli, np.arange(0, 360 + bin_width, bin_width))
    stimuli_binned = np.array([stimuli[bins == i].mean() for i in np.unique(bins)])
    resp_binned = np.array([resp[bins == i, :].mean(0) for i in np.unique(bins)])
  else:
    resp_binned = resp
    stimuli_binned = stimuli

  # Return as torch.Tensor
  resp_tensor = torch.tensor(resp_binned, dtype=torch.float32)
  stimuli_tensor = torch.tensor(stimuli_binned, dtype=torch.float32).unsqueeze(1)  # add singleton dimension to make a column vector

  return resp_tensor, stimuli_tensor


def identityLine():
  """
  Plot the identity line y=x
  """
  ax = plt.gca()
  lims = np.array([ax.get_xlim(), ax.get_ylim()])
  minval = lims[:, 0].min()
  maxval = lims[:, 1].max()
  equal_lims = [minval, maxval]
  ax.set_xlim(equal_lims)
  ax.set_ylim(equal_lims)
  line = ax.plot([minval, maxval], [minval, maxval], color="0.7")
  line[0].set_zorder(-1)

def get_data(n_stim, train_data, train_labels):
  """ Return n_stim randomly drawn stimuli/resp pairs

  Args:
    n_stim (scalar): number of stimuli to draw
    train_data (torch.Tensor): n_train x n_neurons tensor with neural
      responses to train on
    train_labels (torch.Tensor): n_train x 1 tensor with orientations of the
      stimuli corresponding to each row of train_data, in degrees

  Returns:
    (torch.Tensor, torch.Tensor): n_stim x n_neurons tensor of neural responses and n_stim x 1 of orientations respectively
  """
  n_stimuli = train_labels.shape[0]
  istim = np.random.choice(n_stimuli, n_stim)
  r = train_data[istim]  # neural responses to this stimulus
  ori = train_labels[istim]  # true stimulus orientation

  return r, ori

def stimulus_class(ori, n_classes):
  """Get stimulus class from stimulus orientation

  Args:
    ori (torch.Tensor): orientations of stimuli to return classes for
    n_classes (int): total number of classes

  Returns:
    torch.Tensor: 1D tensor with the classes for each stimulus

  """
  bins = np.linspace(0, 360, n_classes + 1)
  return torch.tensor(np.digitize(ori.squeeze(), bins)) - 1  # minus 1 to accommodate Python indexing

---
# Section 1: Load and visualize data

In the next cell, we have provided code to load the data and plot the matrix of neural responses.

Next to it, we plot the tuning curves of three randomly selected neurons.

In [None]:
#@title

#@markdown Execute this cell to load and visualize data

# Load data
resp_all, stimuli_all = load_data()  # argument to this function specifies bin width
n_stimuli, n_neurons = resp_all.shape

print(f'{n_neurons} neurons in response to {n_stimuli} stimuli')

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(2 * 6, 5))

# Visualize data matrix
plot_data_matrix(resp_all[:100, :].T, ax1)  # plot responses of all neurons to the first 100 stimuli
ax1.set_xlabel('stimulus')
ax1.set_ylabel('neuron')

# Plot tuning curves of three random neurons
ineurons = np.random.choice(n_neurons, 3, replace=False)  # pick three random neurons
ax2.plot(stimuli_all, resp_all[:, ineurons])
ax2.set_xlabel('stimulus orientation ($^o$)')
ax2.set_ylabel('neural response')
ax2.set_xticks(np.linspace(0, 360, 5))

plt.tight_layout()

We will split our data into a training set and test set. In particular, we will have a training set of orientations (`stimuli_train`) and the corresponding responses (`resp_train`). Our testing set will have held-out orientations (`stimuli_test`) and the corresponding responses (`resp_test`).

In [None]:
#@title
#@markdown Execute this cell to split into training and test sets

# Set random seeds for reproducibility
np.random.seed(4)
torch.manual_seed(4)

# Split data into training set and testing set
n_train = int(0.6 * n_stimuli)  # use 60% of all data for training set
ishuffle = torch.randperm(n_stimuli)
itrain = ishuffle[:n_train]  # indices of data samples to include in training set
itest = ishuffle[n_train:]  # indices of data samples to include in testing set
stimuli_test = stimuli_all[itest]
resp_test = resp_all[itest]
stimuli_train = stimuli_all[itrain]
resp_train = resp_all[itrain]

---
# Section 2: Deep feed-forward networks in *pytorch* 

We'll now build a simple deep neural network that takes as input a vector of neural responses and outputs a single number representing the decoded stimulus orientation.

To keep things simple, we'll build a deep network with **one** hidden layer that is **fully connected**, which means *every* output is a weighted sum of *every* input. See Bonus Section 1 for a deeper discussion of what this choice entails, and when one might want to use deeper/shallower and wider/narrower architectures.

Let $\mathbf{r}^{(n)} = \begin{bmatrix} r_1^{(n)} & r_2^{(n)} & \ldots & r_N^{(n)} \end{bmatrix}^T$ denote the vector of neural responses (of neurons $1, \ldots, N$) to the $n$th stimulus. The network we will use is described by the following set of equations:
\begin{align}
    \mathbf{h}^{(n)} &= \mathbf{W}^{in} \mathbf{r}^{(n)} + \mathbf{b}^{in}, && [\mathbf{W}^{in}: M \times N], \\
    y^{(n)} &= \mathbf{W}^{out} \mathbf{h}^{(n)} + \mathbf{b}^{out},  && [\mathbf{W}^{out}: 1 \times M],
\end{align}
where $y^{(n)}$ denotes the scalar output of the network: the decoded orientation of the $n$th stimulus. 

The $M$-dimensional vector $\mathbf{h}^{(n)}$ denotes the activations of the **hidden layer** of the network. The blue components of the diagram below denote the **parameters** of the network, which we will later optimize with gradient descent. These include all the weights and biases $\mathbf{W}^{in}, \mathbf{b}^{in}, \mathbf{W}^{out}, \mathbf{b}^{out}$.

<p align="center">
  <img src="https://github.com/NeuromatchAcademy/course-content/blob/master/tutorials/static/one-layer-network.png?raw=true" width="450" />
</p>
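
To make the shapes in these equations concrete, here is a minimal sketch that evaluates them directly with `torch` tensors. The sizes `N = 5` and `M = 3` and the random parameter values are placeholders chosen for illustration, not values from the tutorial.

```python
import torch

torch.manual_seed(0)

N, M = 5, 3  # illustrative sizes: N input neurons, M hidden units

# Randomly chosen parameters, for illustration only
W_in = torch.randn(M, N)   # input weights, M x N
b_in = torch.randn(M)      # input biases, length M
W_out = torch.randn(1, M)  # output weights, 1 x M
b_out = torch.randn(1)     # output bias, length 1

r = torch.randn(N)         # responses of N neurons to one stimulus

h = W_in @ r + b_in        # hidden layer activations, length M
y = W_out @ h + b_out      # network output, stored as a length-1 tensor

print(h.shape, y.shape)
```

The hidden activations `h` have one entry per hidden unit, and the output `y` holds a single number: the decoded orientation.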





### Section 2.1: Introduction to PyTorch

Here, we'll use the **PyTorch** package to build, run, and train deep networks of this form in Python. There are two core components to the PyTorch package: 

1. The first is the `torch.Tensor` data type used in PyTorch. `torch.Tensor`s are effectively just like `numpy` arrays, except that they have some important attributes and methods needed for automatic differentiation (to be discussed below). They also come with infrastructure for easily storing and computing with them on GPUs, a capability we won't touch on here but which can be really useful in practice.

2. The second core ingredient is the PyTorch `nn.Module` class. This is the class we'll use for constructing deep networks, so that we can then easily train them using built-in PyTorch functions. Keep in mind that `nn.Module` classes can actually be used to build, run, and train any model -- not just deep networks!

  The next cell contains code for building the deep network we defined above using the `nn.Module` class. It contains three key ingredients:

  * `__init__()` method to initialize its parameters, like in any other Python class. In this case, it takes two arguments:
    * `n_inputs`: the number of input units. This should always be set to the number of neurons whose activities are being decoded (i.e. the dimensionality of the input to the network). 
    * `n_hidden`: the number of hidden units. This is a parameter that we are free to vary in deciding how to build our network. See Bonus Section 1 for a discussion of how this architectural choice affects the computations the network can perform.

  * `nn.Linear` modules, which are built-in PyTorch classes containing all the weights and biases for a given network layer (documentation [here](https://pytorch.org/docs/master/generated/torch.nn.Linear.html)). This class takes two arguments to initialize:
    * \# of inputs to that layer
    * \# of outputs from that layer

    For the input layer, for example, we have:
    * \# of inputs = \# of neurons whose responses are to be decoded ($N$, specified by `n_inputs`)
    * \# of outputs = \# of hidden layer units ($M$, specified by `n_hidden`)
    
    PyTorch will initialize all weights and biases randomly.

  * `forward()` method, which takes as argument an input to the network and returns the network output. In our case, this comprises computing the output $y$ from a given input $\mathbf{r}$ using the above two equations. See the next cell for code implementing this computation using the built-in PyTorch `nn.Linear` classes.
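
As a quick sketch of what an `nn.Linear` module holds, the snippet below builds a single layer and inspects its parameters. The sizes (4 inputs, 3 outputs) are arbitrary values chosen for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)

layer = nn.Linear(4, 3)  # 4 inputs --> 3 outputs (illustrative sizes)

# The weight matrix is stored as (# of outputs) x (# of inputs)
print(layer.weight.shape)  # torch.Size([3, 4])
print(layer.bias.shape)    # torch.Size([3])

# Applying the layer computes x @ W^T + b for each row of x
x = torch.randn(2, 4)  # batch of 2 input vectors
out = layer(x)
manual = x @ layer.weight.T + layer.bias

print(out.shape)  # torch.Size([2, 3])
print(torch.allclose(out, manual, atol=1e-6))

# .parameters() iterates over the weights and biases of the module
for name, p in layer.named_parameters():
  print(name, tuple(p.shape))
```

Note that the weight matrix has one row per output unit, which matches the $M \times N$ shape of $\mathbf{W}^{in}$ in the equations above.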

In [None]:
class DeepNet(nn.Module):
  """Deep Network with one hidden layer

  Args:
    n_inputs (int): number of input units
    n_hidden (int): number of units in hidden layer

  Attributes:
    in_layer (nn.Linear): weights and biases of input layer
    out_layer (nn.Linear): weights and biases of output layer

  """

  def __init__(self, n_inputs, n_hidden):
    super().__init__()  # needed to invoke the properties of the parent class nn.Module
    self.in_layer = nn.Linear(n_inputs, n_hidden) # neural activity --> hidden units
    self.out_layer = nn.Linear(n_hidden, 1) # hidden units --> output

  def forward(self, r):
    """Decode stimulus orientation from neural responses

    Args:
      r (torch.Tensor): vector of neural responses to decode, must be of
        length n_inputs. Can also be a tensor of shape n_stimuli x n_inputs,
        containing n_stimuli vectors of neural responses

    Returns:
      torch.Tensor: network outputs for each input provided in r. If
        r is a vector, then y is a 1D tensor of length 1. If r is a 2D
        tensor then y is a 2D tensor of shape n_stimuli x 1.

    """
    h = self.in_layer(r)  # hidden representation
    y = self.out_layer(h)
    return y

The next cell contains code for initializing and running this network. We use it to decode stimulus orientation from a vector of neural responses to a single stimulus drawn at random from the training set. Note that when the initialized network is called as a function on an input (e.g. `net(r)`), its `.forward()` method is called. This is a special property of the `nn.Module` class.

Note that the decoded orientations at this point will be nonsense, since the network has been initialized with random weights. Below, we'll learn how to optimize these weights for good stimulus decoding.

In [None]:
# Set random seeds for reproducibility
np.random.seed(1)
torch.manual_seed(1)

# Initialize a deep network with M=200 hidden units
net = DeepNet(n_neurons, 200)

# Get neural responses (r) to and orientation (ori) of one stimulus in the dataset
r, ori = get_data(1, resp_train, stimuli_train)  # using helper function get_data

# Decode orientation from these neural responses using initialized network
out = net(r)  # compute output from network, equivalent to net.forward(r)

print('decoded orientation: %.2f degrees' % out)
print('true orientation: %.2f degrees' % ori)

---
### Section 2.2: Activation functions

In [None]:
# @title Video 2: Nonlinear activation functions
from ipywidgets import widgets

out2 = widgets.Output()
with out2:
  from IPython.display import IFrame
  class BiliVideo(IFrame):
      def __init__(self, id, page=1, width=400, height=300, **kwargs):
          self.id=id
          src = 'https://player.bilibili.com/player.html?bvid={0}&page={1}'.format(id, page)
          super(BiliVideo, self).__init__(src, width, height, **kwargs)

  video = BiliVideo(id="BV1m5411h7V5", width=854, height=480, fs=1)
  print('Video available at https://www.bilibili.com/video/{0}'.format(video.id))
  display(video)

out1 = widgets.Output()
with out1:
  from IPython.display import YouTubeVideo
  video = YouTubeVideo(id="JAdukDCQALA", width=854, height=480, fs=1, rel=0)
  print('Video available at https://youtube.com/watch?v=' + video.id)
  display(video)

out = widgets.Tab([out1, out2])
out.set_title(0, 'Youtube')
out.set_title(1, 'Bilibili')

display(out)

Note that the deep network we constructed above comprises solely **linear** operations on each layer: each layer is just a weighted sum of all the elements in the previous layer, plus a bias. It turns out that linear hidden layers like this aren't particularly useful, since a sequence of linear transformations is equivalent to a single linear transformation. We can see this from the above equations by substituting the first into the second to obtain
\begin{equation}
    y^{(n)} = \mathbf{W}^{out} \left( \mathbf{W}^{in} \mathbf{r}^{(n)} + \mathbf{b}^{in} \right) + \mathbf{b}^{out} = \mathbf{W}^{out}\mathbf{W}^{in} \mathbf{r}^{(n)} + \left( \mathbf{W}^{out}\mathbf{b}^{in} + \mathbf{b}^{out} \right)
\end{equation}
In other words, the output is still just a weighted sum of elements in the input -- the hidden layer has done nothing to change this.
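
This collapse can be checked numerically. The sketch below, using small illustrative layer sizes that are not taken from the tutorial, compares two stacked `nn.Linear` layers against the single linear map given by the equation above.

```python
import torch
from torch import nn

torch.manual_seed(0)

N, M = 6, 4  # illustrative sizes
in_layer = nn.Linear(N, M)
out_layer = nn.Linear(M, 1)

r = torch.randn(3, N)  # 3 example input vectors

with torch.no_grad():
  # Two linear layers applied in sequence
  y_two_layers = out_layer(in_layer(r))

  # Equivalent single linear map: W = W_out W_in, b = W_out b_in + b_out
  W = out_layer.weight @ in_layer.weight                 # 1 x N
  b = out_layer.weight @ in_layer.bias + out_layer.bias  # length 1
  y_one_layer = r @ W.T + b

print(torch.allclose(y_two_layers, y_one_layer, atol=1e-5))
```

Up to floating point error, the two outputs agree for any choice of weights, which is why the hidden layer adds nothing without a non-linearity.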

To extend the set of computable input/output transformations to more than just weighted sums, we'll incorporate a **non-linear activation function** in the hidden units. This is done by simply modifying the equation for the hidden layer activations to be
\begin{equation}
    \mathbf{h}^{(n)} = \phi(\mathbf{W}^{in} \mathbf{r}^{(n)} + \mathbf{b}^{in})
\end{equation}
where $\phi$ is referred to as the activation function. Using a non-linear activation function will ensure that the hidden layer performs a non-linear transformation of the input, which will make our network much more powerful (or *expressive*, see Bonus Section 1). In practice, deep networks *always* use non-linear activation functions.



#### Coding Exercise 2.2: Nonlinear Activations 

Create a new class `DeepNetReLU` by modifying our above deep network model to use a non-linear activation function. We'll use the linear rectification function:
\begin{equation}
  \phi(x) = 
  \begin{cases}
    x & \text{if } x > 0 \\
    0 & \text{else}
  \end{cases}
\end{equation}
which can be implemented in PyTorch using `torch.relu()`. Hidden layers with this activation function are typically referred to as "**Re**ctified **L**inear **U**nits", or **ReLU**'s.
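
As a small illustration of how `torch.relu()` acts elementwise, the snippet below applies it to a few made-up values: negative entries are set to zero and non-negative entries pass through unchanged.

```python
import torch

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])  # made-up example values
rectified = torch.relu(x)

print(rectified.tolist())  # [0.0, 0.0, 0.0, 0.5, 2.0]
```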

Initialize this network with 10 hidden units and run on an example stimulus.

**Hint**: you only need to modify the `forward()` method of the above `DeepNet()` class.


In [None]:
class DeepNetReLU(nn.Module):
  """ network with a single hidden layer h with a RELU """

  def __init__(self, n_inputs, n_hidden):
    super().__init__()  # needed to invoke the properties of the parent class nn.Module
    self.in_layer = nn.Linear(n_inputs, n_hidden) # neural activity --> hidden units
    self.out_layer = nn.Linear(n_hidden, 1) # hidden units --> output

  def forward(self, r):

    ############################################################################
    ## TO DO for students: write code for computing network output using a
    ## rectified linear activation function for the hidden units
    # Fill out function and remove
    raise NotImplementedError("Student exercise: complete DeepNetReLU forward")
    ############################################################################

    h = ...
    y = ...

    return y


# Set random seeds for reproducibility
np.random.seed(1)
torch.manual_seed(1)

# Get neural responses (r) to and orientation (ori) of one stimulus in the dataset
r, ori = get_data(1, resp_train, stimuli_train)

# Initialize deep network with M=10 hidden units
net = DeepNetReLU(n_neurons, 10)

# Decode orientation from these neural responses using initialized network
# net(r) is equivalent to net.forward(r)
out = net(r)

print('decoded orientation: %.2f degrees' % out)
print('true orientation: %.2f degrees' % ori)

[*Click for solution*](https://github.com/NeuromatchAcademy/course-content/tree/master//tutorials/W2D1_DeepLearning/solutions/W2D1_Tutorial1_Solution_1e3c4101.py)



You should see that the decoded orientation is 0.13 $^{\circ}$ while the true orientation is 139.00 $^{\circ}$.

---
# Section 3: Loss functions and gradient descent


In [None]:
# @title Video 3: Loss functions & gradient descent
from ipywidgets import widgets

out2 = widgets.Output()
with out2:
  from IPython.display import IFrame
  class BiliVideo(IFrame):
      def __init__(self, id, page=1, width=400, height=300, **kwargs):
          self.id=id
          src = 'https://player.bilibili.com/player.html?bvid={0}&page={1}'.format(id, page)
          super(BiliVideo, self).__init__(src, width, height, **kwargs)

  video = BiliVideo(id="BV19k4y1271n", width=854, height=480, fs=1)
  print('Video available at https://www.bilibili.com/video/{0}'.format(video.id))
  display(video)

out1 = widgets.Output()
with out1:
  from IPython.display import YouTubeVideo
  video = YouTubeVideo(id="aEtKpzEuviw", width=854, height=480, fs=1, rel=0)
  print('Video available at https://youtube.com/watch?v=' + video.id)
  display(video)

out = widgets.Tab([out1, out2])
out.set_title(0, 'Youtube')
out.set_title(1, 'Bilibili')

display(out)

### Section 3.1: Loss functions

Because the weights of the network are currently randomly chosen, the outputs of the network are nonsense: the decoded stimulus orientation is nowhere close to the true stimulus orientation. We'll shortly write some code to change these weights so that the network does a better job of decoding.

But to do so, we first need to define what we mean by "better". One simple way of defining this is to use the squared error
\begin{equation}
    L = (y - \tilde{y})^2
\end{equation}
where $y$ is the network output and $\tilde{y}$ is the true stimulus orientation. When the decoded stimulus orientation is far from the true stimulus orientation, $L$ will be large. We thus refer to $L$ as the **loss function**, as it quantifies how *bad* the network is at decoding stimulus orientation.

PyTorch comes with a number of built-in loss functions. The one corresponding to the squared error is `nn.MSELoss`. Calling `nn.MSELoss()` creates a loss function that takes as arguments a **batch** of network outputs $y^{(1)}, y^{(2)}, \ldots, y^{(P)}$ and corresponding target outputs $\tilde{y}^{(1)}, \tilde{y}^{(2)}, \ldots, \tilde{y}^{(P)}$, and computes the **mean squared error (MSE)**
\begin{equation}
    L = \frac{1}{P}\sum_{n=1}^P \left(y^{(n)} - \tilde{y}^{(n)}\right)^2
\end{equation}
where $P$ is the number of different stimuli in a batch, called the *batch size*.
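
To see that the formula and the built-in loss agree, here is a minimal sketch on a made-up batch of $P = 3$ outputs and targets. The values are invented for illustration and have nothing to do with the neural data.

```python
import torch
from torch import nn

y = torch.tensor([[1.0], [2.0], [3.0]])       # network outputs (made-up values)
y_true = torch.tensor([[1.0], [4.0], [0.0]])  # target outputs (made-up values)

# Mean squared error written out from the formula
mse_manual = torch.mean((y - y_true) ** 2)

# The same quantity from the built-in loss
mse_builtin = nn.MSELoss()(y, y_true)

print(mse_manual.item(), mse_builtin.item())
```

The squared errors here are 0, 4, and 9, so both values equal $13/3 \approx 4.33$.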



#### Coding Exercise 3.1: Computing MSE 


Evaluate the mean squared error for a deep network with $M=10$ rectified linear units, on the decoded orientations from neural responses to 20 random stimuli.

In [None]:
# Set random seeds for reproducibility
np.random.seed(1)
torch.manual_seed(1)

# Initialize a deep network with M=10 hidden units
net = DeepNetReLU(n_neurons, 10)

# Get neural responses to 20 random stimuli in the data set
r, ori = get_data(20, resp_train, stimuli_train)

# Decode orientation from these neural responses
out = net(r)

###################################################
## TO DO for students: evaluate mean squared error
###################################################

# Initialize PyTorch mean squared error loss function (Hint: look at nn.MSELoss)
loss_fn = ...

# Evaluate mean squared error
loss = ...

# Uncomment once above is filled in
# print('mean squared error: %.2f' % loss)

[*Click for solution*](https://github.com/NeuromatchAcademy/course-content/tree/master//tutorials/W2D1_DeepLearning/solutions/W2D1_Tutorial1_Solution_8dd6b0af.py)



You should see a mean squared error of 42949.14.

---
### Section 3.2: Optimization with gradient descent

Our goal is now to modify the weights to make the mean squared error loss $L$ as small as possible over the whole data set. To do this, we'll use the **gradient descent (GD)** algorithm, which consists of iterating three steps:
1. **Evaluate the loss** on the training data,
```
out = net(train_data)
loss = loss_fn(out, train_labels)
```
where `train_data` are the network inputs in the training data (in our case, neural responses), and `train_labels` are the target outputs for each input (in our case, true stimulus orientations).
2. **Compute the gradient of the loss** with respect to each of the network weights. In PyTorch, we can do this with one line of code:
```
loss.backward()
```
3. **Update the network weights** by descending the gradient. In PyTorch, we can do this using built-in optimizers. We'll use the `optim.SGD` optimizer (documentation [here](https://pytorch.org/docs/stable/optim.html#torch.optim.SGD)) which updates parameters along the negative gradient, scaled by a learning rate (see Bonus Section 2.1 for details). To initialize this optimizer, we have to tell it
  * which parameters to update, and
  * what learning rate to use

  For example, to optimize *all* the parameters of a network `net` using a learning rate of .001, the optimizer would be initialized as follows
  ```
  optimizer = optim.SGD(net.parameters(), lr=.001)
  ```
  where `.parameters()` is a method of the `nn.Module` class that returns a [Python generator object](https://wiki.python.org/moin/Generators) over all the parameters of that `nn.Module` class (in our case, $\mathbf{W}^{in}, \mathbf{b}^{in}, \mathbf{W}^{out}, \mathbf{b}^{out}$).
  
  After computing all the parameter gradients in step 2, we can then update each of these parameters using the `.step()` method of this optimizer,
  ```
  optimizer.step()
  ```
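
As a sanity check on what `.step()` does, the sketch below runs one update on a tiny stand-in network and compares the result with the rule "old weight minus learning rate times gradient". The network, data, and learning rate are all hypothetical, and the comparison assumes plain `optim.SGD` with no momentum or weight decay.

```python
import torch
from torch import nn, optim

torch.manual_seed(0)

net = nn.Linear(3, 1)  # tiny stand-in network (hypothetical, for illustration)
optimizer = optim.SGD(net.parameters(), lr=0.1)

x = torch.randn(5, 3)       # made-up inputs
target = torch.randn(5, 1)  # made-up targets

loss = nn.MSELoss()(net(x), target)
loss.backward()  # fills in .grad for every parameter

w_before = net.weight.detach().clone()
grad = net.weight.grad.clone()

optimizer.step()  # plain SGD update: w <- w - lr * grad

updated_ok = torch.allclose(net.weight.detach(), w_before - 0.1 * grad, atol=1e-6)
print(updated_ok)
```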

Finally, an important detail to remember is that the gradients of each parameter need to be cleared before calling `.backward()`, or else PyTorch will try to accumulate gradients across iterations. This can again be done using built-in optimizers via the method `zero_grad()`, as follows:
```
optimizer.zero_grad()
```
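
The accumulation behaviour is easy to see on a single scalar parameter. In this minimal sketch we clear the gradient directly on the tensor with `w.grad.zero_()` rather than through an optimizer, purely to keep the example small.

```python
import torch

w = torch.tensor(1.0, requires_grad=True)

# First backward pass: d(3w)/dw = 3
(3 * w).backward()
g1 = w.grad.item()
print(g1)  # 3.0

# Second backward pass without clearing: the new gradient is added to the old one
(3 * w).backward()
g2 = w.grad.item()
print(g2)  # 6.0

# Clearing the gradient first gives the gradient of this pass alone
w.grad.zero_()
(3 * w).backward()
g3 = w.grad.item()
print(g3)  # 3.0
```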

Putting all this together, each iteration of the GD algorithm will contain a block of code that looks something like this:
```
# Get outputs from network
out = net(train_data)

# Evaluate loss
loss = loss_fn(out, train_labels)

# Compute gradients
optimizer.zero_grad()  # clear gradients
loss.backward()

# Update weights
optimizer.step()
```
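
To see these steps run end to end, here is a minimal sketch on a toy problem unrelated to the neural data: fitting $y = 2x$ with a single `nn.Linear` unit. The data, learning rate, and number of iterations are made up for illustration.

```python
import torch
from torch import nn, optim

torch.manual_seed(0)

# Toy data (made up for illustration): targets are exactly 2 * input
x = torch.linspace(-1, 1, 20).unsqueeze(1)  # 20 x 1 inputs
y_true = 2 * x                              # 20 x 1 targets

net = nn.Linear(1, 1)
loss_fn = nn.MSELoss()
optimizer = optim.SGD(net.parameters(), lr=0.1)

losses = []
for i in range(200):
  out = net(x)                 # get outputs from network
  loss = loss_fn(out, y_true)  # evaluate loss

  optimizer.zero_grad()        # clear gradients
  loss.backward()              # compute gradients
  optimizer.step()             # update weights

  losses.append(loss.item())

print(f'first loss: {losses[0]:.4f}, last loss: {losses[-1]:.6f}')
print(f'learned weight: {net.weight.item():.3f}')
```

The loss shrinks over iterations and the learned weight approaches 2, the slope used to generate the toy targets.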

In the next exercise, we'll give you a code skeleton for implementing the GD algorithm. Your job will be to fill in the blanks.

For the mathematical details of the GD algorithm, see Bonus Section 2.1. Note, in particular, that here we are using the gradient descent algorithm, rather than the more commonly used *stochastic* gradient descent algorithm. See Bonus Section 2.2 for a more detailed discussion of how these differ and when one might need to use the stochastic variant.

#### Coding Exercise 3.2: Gradient descent in PyTorch

Complete the function `train()` that uses the gradient descent algorithm to optimize the weights of a given network. This function takes as input arguments
* `net`: the PyTorch network whose weights to optimize
* `loss_fn`: the PyTorch loss function to use to evaluate the loss
* `train_data`: the training data to evaluate the loss on (i.e. neural responses to decode)
* `train_labels`: the target outputs for each data point in `train_data` (i.e. true stimulus orientations)

We will then train a neural network on our data and plot the loss (mean squared error) over time. When we run this function, behind the scenes PyTorch is actually changing the parameters inside this network to make the network better at decoding, so its weights will now be different than they were at initialization.


**Hint:** all the code you need for doing this is provided in the above description of the GD algorithm.

In [None]:
def train(net, loss_fn, train_data, train_labels,
          n_iter=50, learning_rate=1e-4,
          test_data=None, test_labels=None,
          L2_penalty=0, L1_penalty=0):
  """Run gradient descent to optimize parameters of a given network

  Args:
    net (nn.Module): PyTorch network whose parameters to optimize
    loss_fn: built-in PyTorch loss function to minimize
    train_data (torch.Tensor): n_train x n_neurons tensor with neural
      responses to train on
    train_labels (torch.Tensor): n_train x 1 tensor with orientations of the
      stimuli corresponding to each row of train_data
    n_iter (int, optional): number of iterations of gradient descent to run
    learning_rate (float, optional): learning rate to use for gradient descent
    test_data (torch.Tensor, optional): n_test x n_neurons tensor with neural
      responses to test on
    test_labels (torch.Tensor, optional): n_test x 1 tensor with orientations of
      the stimuli corresponding to each row of test_data
    L2_penalty (float, optional): l2 penalty regularizer coefficient
    L1_penalty (float, optional): l1 penalty regularizer coefficient

  Returns:
    (list): training loss over iterations

  """

  # Initialize PyTorch SGD optimizer
  optimizer = optim.SGD(net.parameters(), lr=learning_rate)

  # Placeholder to save the loss at each iteration
  train_loss = []
  test_loss = []

  # Loop over epochs
  for i in range(n_iter):

    ######################################################################
    ## TO DO for students: fill in missing code for GD iteration
    raise NotImplementedError("Student exercise: write code for GD iterations")
    ######################################################################

    # compute network output from inputs in train_data
    out = ...  # compute network output from inputs in train_data

    # evaluate loss function
    if L2_penalty==0 and L1_penalty==0:
      # normal loss function
      loss = loss_fn(out, train_labels)
    else:
      # custom loss function from bonus exercise 3.3
      loss = loss_fn(out, train_labels, net.in_layer.weight,
                     L2_penalty, L1_penalty)

    # Clear previous gradients
    ...

    # Compute gradients
    ...

    # Update weights
    ...

    # Store current value of loss
    train_loss.append(loss.item())  # .item() needed to transform the tensor output of loss_fn to a scalar

    # Get loss for test_data, if given (we will use this in the bonus exercise 3.2 and 3.3)
    if test_data is not None:
      out_test = net(test_data)
      # evaluate loss function
      if L2_penalty==0 and L1_penalty==0:
        # normal loss function
        loss_test = loss_fn(out_test, test_labels)
      else:
        # (BONUS code) custom loss function from Bonus exercise 3.3
        loss_test = loss_fn(out_test, test_labels, net.in_layer.weight,
                            L2_penalty, L1_penalty)
      test_loss.append(loss_test.item())  # .item() needed to transform the tensor output of loss_fn to a scalar

    # Track progress
    if (i + 1) % max(n_iter // 5, 1) == 0:  # max() avoids a modulo by zero when n_iter < 5
      if test_data is None:
        print(f'iteration {i + 1}/{n_iter} | loss: {loss.item():.3f}')
      else:
        print(f'iteration {i + 1}/{n_iter} | loss: {loss.item():.3f} | test_loss: {loss_test.item():.3f}')

  if test_data is None:
    return train_loss
  else:
    return train_loss, test_loss

# Set random seeds for reproducibility
np.random.seed(1)
torch.manual_seed(1)

# Initialize network with 10 hidden units
net = DeepNetReLU(n_neurons, 10)

# Initialize built-in PyTorch MSE loss function
loss_fn = nn.MSELoss()

# Run GD on data
train_loss = train(net, loss_fn, resp_train, stimuli_train)

# Plot the training loss over iterations of GD
with plt.xkcd():
  plt.plot(train_loss)
  plt.xlim([0, None])
  plt.ylim([0, None])
  plt.xlabel('iterations of gradient descent')
  plt.ylabel('mean squared error')
  plt.show()

[*Click for solution*](https://github.com/NeuromatchAcademy/course-content/tree/master//tutorials/W2D1_DeepLearning/solutions/W2D1_Tutorial1_Solution_41f14a37.py)

*Example output:*

<img alt='Solution hint' align='left' width=560 height=416 src=https://raw.githubusercontent.com/NeuromatchAcademy/course-content/master/tutorials/W2D1_DeepLearning/static/W2D1_Tutorial1_Solution_41f14a37_3.png>



---
# Section 4: Investigating model and evaluating performance



## Section 4.1: Peering inside the decoding model

We have built a model to perform decoding that takes as input neural activity and outputs the estimated angle of the stimulus. We can imagine that an animal that needs to determine angles would have a brain area that acts like the hidden layer in our model. It transforms the neural activity from visual cortex and outputs a decision. Decisions about orientations of edges could include figuring out how to jump onto a branch, how to avoid obstacles, or determining the type of an object, e.g. food or predator.

What sort of connectivity would this brain area have with visual cortex? Determining this experimentally would be very difficult, but perhaps we can look at the model we have and see if its structure constrains the type of connectivity we'd expect.

Below we will visualize the weights from the neurons in visual cortex to the hidden units $W_{in}$, and the weights from the hidden units to the output orientation $W_{out}$.

In [None]:
W_in = net.in_layer.weight.detach().numpy() # we can run .detach() and .numpy() to get a numpy array
print('shape of W_in:')
print(W_in.shape)

W_out = net.out_layer.weight.detach().numpy() # we can run .detach() and .numpy() to get a numpy array
print('shape of W_out:')
print(W_out.shape)

plt.figure(figsize=(10,4))
plt.subplot(1,2,1)
plt.imshow(W_in, aspect='auto', cmap='bwr', vmin=-1e-2, vmax=1e-2)
plt.xlabel('neurons')
plt.ylabel('hidden units')
plt.colorbar()
plt.title('$W_{in}$')

plt.subplot(1,2,2)
plt.imshow(W_out.T, cmap='bwr', vmin=-3, vmax=3)
plt.xticks([])
plt.xlabel('output')
plt.ylabel('hidden units')
plt.colorbar()
plt.title('$W_{out}$')

plt.show()

### Coding Exercise 4.1: Visualizing weights

It's difficult to see any structure in this weight matrix. How might we visualize it in a better way? 

Perhaps we can sort the neurons by their preferred orientation. We will use the `resp_all` matrix which is 360 stimuli (360$^\circ$ of angles) by number of neurons. How do we find the preferred orientation? 

Let's visualize one column of this `resp_all` matrix first as we did at the beginning of the notebook. Can you see how we might want to first process this tuning curve before choosing the preferred orientation?

In [None]:
idx = 235
plt.plot(resp_all[:,idx])
plt.ylabel('neural response')
plt.xlabel(r'stimulus orientation ($^\circ$)')
plt.title(f'neuron {idx}')
plt.show()

Looking at this tuning curve, there is a bit of noise across orientations, so let's smooth with a Gaussian filter and then find the position of the maximum for each neuron. After getting the position of the maximum, aka the "preferred orientation", for each neuron, we will re-sort the $W_{in}$ matrix. The position of the maximum along an axis of a matrix can be computed using NumPy's `.argmax(axis=_)` method; make sure you specify the right axis though! Next, to get the indices that would sort an array, you will need to use the `.argsort()` method.
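
As a small illustration of these two methods, the sketch below uses made-up toy arrays (`toy_resp` and `toy_weights` are hypothetical stand-ins, not the tutorial's data) with stimuli along rows and neurons along columns:

```python
import numpy as np

# Toy stand-in for a response matrix: 4 stimuli (rows) x 3 neurons (columns)
toy_resp = np.array([[0.1, 0.9, 0.2],
                     [0.2, 0.3, 0.8],
                     [0.7, 0.1, 0.3],
                     [0.3, 0.2, 0.1]])

# axis=0 searches down each column, giving one peak position per neuron
peak_position = toy_resp.argmax(axis=0)

# argsort returns the neuron indices ordered by increasing peak position
order = peak_position.argsort()

# Reorder the columns (neurons) of a toy 2 x 3 weight matrix
toy_weights = np.array([[10, 20, 30],
                        [40, 50, 60]])
toy_weights_sorted = toy_weights[:, order]

print(peak_position, order)
print(toy_weights_sorted)
```

Indexing the columns with `order` places the neuron with the earliest peak first, which is the kind of reordering the exercise asks for.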

In [None]:
from scipy.ndimage import gaussian_filter1d

# first let's smooth the tuning curves to make sure we get an accurate peak
# that isn't just noise
resp_smoothed = gaussian_filter1d(resp_all, 5, axis=0)

###################################################
## TO DO for students: find preferred orientation and resort W_in
###################################################

# find position of max response for each neuron
# aka preferred orientation for each neuron
preferred_orientation = ...

# resort W_in matrix by preferred orientation
isort = ...
W_in_sorted = ...

# plot resorted W_in matrix
plt.figure(figsize=(10,4))
plt.subplot(1,2,1)
# plt.imshow(W_in_sorted, aspect='auto', cmap='bwr', vmin=-1e-2, vmax=1e-2)
# plt.colorbar()
plt.xlabel('sorted neurons')
plt.ylabel('hidden units')
plt.title('$W_{in}$')

plt.subplot(1,2,2)
plt.imshow(W_out.T, cmap='bwr', vmin=-3, vmax=3)
plt.xticks([])
plt.xlabel('output')
plt.ylabel('hidden units')
plt.colorbar()
plt.title('$W_{out}$')

[*Click for solution*](https://github.com/NeuromatchAcademy/course-content/tree/master//tutorials/W2D1_DeepLearning/solutions/W2D1_Tutorial1_Solution_7dcde31a.py)

*Example output:*

<img alt='Solution hint' align='left' width=681 height=272 src=https://raw.githubusercontent.com/NeuromatchAcademy/course-content/master/tutorials/W2D1_DeepLearning/static/W2D1_Tutorial1_Solution_7dcde31a_0.png>



### Think! 4: Interpreting weights

We have just visualized how the model transforms neural activity to hidden layer activity. How should we interpret these matrices? Here are some guiding questions to explore:
* Why are some of the $W_{in}$ weights close to zero for some of the hidden units? Do these correspond to close to zero weights in $W_{out}$?
* Note how each hidden unit seems to have strongest weights to two groups of neurons in $W_{in}$, corresponding to two different sets of preferred orientations. Why do you think that is? What does this tell us about the structure of the tuning curves of the neurons?
* How can we better understand the hidden units? What simple visualization would you try next?

[*Click for solution*](https://github.com/NeuromatchAcademy/course-content/tree/master//tutorials/W2D1_DeepLearning/solutions/W2D1_Tutorial1_Solution_cfa88623.py)



## Section 4.2: Generalization performance with test data

Note that gradient descent is essentially an algorithm for fitting the network's parameters to a given set of training data. Selecting this training data is thus crucial for ensuring that the optimized parameters **generalize** to unseen data they weren't trained on. In our case, for example, we want to make sure that our trained network is good at decoding stimulus orientations from neural responses to any orientation, not just those in our data set.

To ensure this, we have split up the full data set into a **training set** and a **testing set**. In Coding Exercise 3.2, we trained a deep network by optimizing the parameters on a training set. We will now evaluate how good the optimized parameters are by using the trained network to decode stimulus orientations from neural responses in the testing set. Good decoding performance on this testing set should then be indicative of good decoding performance on the neurons' responses to any other stimulus orientation. This procedure is commonly used in machine learning (not just in deep learning) and is typically referred to as **cross-validation**.
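
The split itself was done when the data were loaded. As a rough sketch of the idea only (the sample count and the 20% held-out fraction here are assumptions for illustration, not the tutorial's actual values), a random split of sample indices could look like this:

```python
import random

# Hypothetical data set of 10 sample indices (stand-ins for stimulus presentations)
n_samples = 10
indices = list(range(n_samples))

random.seed(0)
random.shuffle(indices)

# Hold out 20% of the samples for testing and train on the rest
n_test = n_samples // 5
test_idx = sorted(indices[:n_test])
train_idx = sorted(indices[n_test:])

print('train:', train_idx)
print('test: ', test_idx)
```

The key property is that no sample appears in both sets, so the test loss measures performance on data the optimizer never saw.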

We will compute the MSE on the test data and plot the decoded stimulus orientations as a function of the true stimulus.


In [None]:
#@title
#@markdown Execute this cell to evaluate and plot test error

out = net(resp_test)  # decode stimulus orientation for neural responses in testing set
ori = stimuli_test  # true stimulus orientations
test_loss = loss_fn(out, ori)  # MSE on testing set (Hint: use loss_fn initialized in previous exercise)

plt.plot(ori, out.detach(), '.')  # N.B. need to use .detach() to pass network output into plt.plot()
identityLine()  # draw the identity line y=x; deviations from this indicate bad decoding!
plt.title('MSE on testing set: %.2f' % test_loss.item())  # N.B. need to use .item() to turn test_loss into a scalar
plt.xlabel('true stimulus orientation ($^o$)')
plt.ylabel('decoded stimulus orientation ($^o$)')
axticks = np.linspace(0, 360, 5)
plt.xticks(axticks)
plt.yticks(axticks)
plt.show()

**PyTorch Note**:

An important thing to note in the code snippet for plotting the decoded orientations is the `.detach()` method. The PyTorch `nn.Module` class is special in that, behind the scenes, each of the variables inside it is linked to the others in a computational graph, for the purposes of automatic differentiation (the algorithm used in `.backward()` to compute gradients). As a result, if you want to do anything that is not a `torch` operation to the parameters or outputs of an `nn.Module` class, you'll need to first "detach" it from its computational graph. This is what the `.detach()` method does. In the hidden code above, we need to call it on the outputs of the network so that we can plot them with the `plt.plot()` function.
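
A minimal sketch of this behavior, using a hypothetical tiny `nn.Linear` layer in place of the tutorial's network:

```python
import torch

# Hypothetical tiny layer standing in for the tutorial's network
layer = torch.nn.Linear(3, 1)
out = layer(torch.ones(2, 3))

# The output is attached to the computational graph
print(out.requires_grad)

# A detached copy is not, and can be converted to a NumPy array
detached = out.detach()
as_array = detached.numpy()
print(detached.requires_grad, as_array.shape)

# Converting the attached tensor directly raises an error
try:
    out.numpy()
    raised = False
except RuntimeError:
    raised = True
print('conversion without detach failed:', raised)
```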

If interested, please see Bonus Section 3 to think more about model criticism (3.1), improve the loss function accordingly (3.2), and add regularization (3.3).

---
# Summary

We have now covered a number of common and powerful techniques for applying deep learning to decoding from neural data, some of which are common to almost any machine learning problem:
* Building and training deep networks using the **PyTorch** `nn.Module` class and built-in **optimizers**
* Choosing and evaluating **loss functions**
* Testing a trained model on unseen data via **cross-validation**, by splitting the data into a **training set and testing set**

An important aspect of this tutorial was the `train()` function we wrote in Coding Exercise 3.2. Note that it can be used to train *any* network to minimize *any* loss function (cf. Bonus Coding Exercise 3.2) on *any* training data. This is the power of using PyTorch to train neural networks and, for that matter, **any other model**! There is nothing in the `nn.Module` class that forces us to use `nn.Linear` layers that implement neural network operations. You can actually put anything you want inside the `.__init__()` and `.forward()` methods of this class. As long as its parameters and computations involve only `torch.Tensor`s, and the model is differentiable, you'll then be able to optimize the parameters of this model in exactly the same way we optimized the deep networks here.
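
As a sketch of this generality, the hypothetical `ExpDecay` model below contains no `nn.Linear` layers at all: it is a one-parameter exponential decay fitted to synthetic data. The model, the data, and the learning rate are all assumptions chosen for illustration.

```python
import torch
from torch import nn, optim

class ExpDecay(nn.Module):
  """Hypothetical one-parameter model y = exp(-k * t); not a neural network."""

  def __init__(self):
    super().__init__()
    self.k = nn.Parameter(torch.tensor(0.5))  # decay rate to be learned

  def forward(self, t):
    return torch.exp(-self.k * t)

# Synthetic data generated with a true decay rate of 1.5
t = torch.linspace(0, 2, 20)
y_true = torch.exp(-1.5 * t)

model = ExpDecay()
loss_fn = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.5)

# The same three optimizer calls used for the deep networks
for _ in range(500):
  loss = loss_fn(model(t), y_true)
  optimizer.zero_grad()
  loss.backward()
  optimizer.step()

print(f'learned k = {model.k.item():.3f}')
```

Wrapping the parameter in `nn.Parameter` is what registers it with `model.parameters()`, so the optimizer knows to update it.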

What kinds of conclusions can we draw from these sorts of analyses? If we can decode the stimulus well from visual cortex activity, that means that there is information about this stimulus available in visual cortex. Whether or not the animal uses that information to make decisions is not determined from an analysis like this. In fact mice perform poorly in orientation discrimination tasks compared to monkeys and humans, even though they have information about these stimuli in their visual cortex. Why do you think they perform poorly in orientation discrimination tasks?

See [this paper](https://www.biorxiv.org/content/10.1101/679324v2) for some potential hypotheses, but this is totally an open question!

---
# Bonus

## Bonus Section 1: Neural network *depth*, *width* and *expressivity*

Two important architectural choices that always have to be made when constructing deep feed-forward networks like those used here are
* the number of hidden layers, or the network's *depth*
* the number of units in each layer, or the layer *widths*

Here, we restricted ourselves to networks with a single hidden layer with a width of $M$ units, but it is easy to see how this code could be adapted to arbitrary depths. Adding another hidden layer simply requires adding another `nn.Linear` module to the `__init__()` method and incorporating it into the `.forward()` method.

The depth and width of a network determine the set of input/output transformations that it can perform, often referred to as its *expressivity*. The deeper and wider the network, the more *expressive* it is; that is, the larger the class of input/output transformations it can compute. In fact, it turns out that an infinitely wide *or* infinitely deep network can in principle [compute (almost) *any* input/output transformation](https://en.wikipedia.org/wiki/Universal_approximation_theorem).

A classic mathematical demonstration of the power of depth is given by the so-called [XOR problem](https://medium.com/@jayeshbahire/the-xor-problem-in-neural-networks-50006411840b#:~:text=The%20XOr%2C%20or%20%E2%80%9Cexclusive%20or,value%20if%20they%20are%20equal.). This toy problem demonstrates how even a single hidden layer can drastically expand the set of input/output transformations a network can perform, relative to a shallow network with no hidden layers. The key intuition is that the hidden layer allows you to represent the input in a new format, which can then allow you to do almost anything you want with it. The *wider* this hidden layer, the more flexibility you have in this representation. In particular, if you have more hidden units than input units, then the hidden layer representation of the input is higher-dimensional than the raw data representation. This higher dimensionality effectively gives you more "room" to perform arbitrary computations in. It turns out that even with just this one hidden layer, if you make it wide enough you can actually approximate any input/output transformation you want. See [here](http://neuralnetworksanddeeplearning.com/chap4.html) for a neat visual demonstration of this.
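
To make the XOR example concrete, here is a minimal sketch of a network with one hidden layer of two ReLU units that computes XOR exactly. The weights and biases are hand-picked for illustration rather than learned.

```python
def relu(x):
  return max(0.0, x)

def xor_net(x1, x2):
  # Hidden layer: two ReLU units with hand-picked weights and biases
  h1 = relu(x1 + x2)
  h2 = relu(x1 + x2 - 1.0)
  # Linear output layer
  return h1 - 2.0 * h2

outputs = {(x1, x2): xor_net(x1, x2) for x1 in (0, 1) for x2 in (0, 1)}
for inputs, output in outputs.items():
  print(inputs, '->', output)
```

No network without a hidden layer can do this, because a single linear readout of `x1` and `x2` cannot be large for `(0, 1)` and `(1, 0)` while being small for both `(0, 0)` and `(1, 1)`.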

In practice, however, it turns out that increasing depth seems to grant more expressivity with fewer units than increasing width does (for reasons that are not well understood). It is for this reason that truly *deep* networks are almost always used in machine learning, which is why this set of techniques is often referred to as *deep* learning.

That said, there is a cost to making networks deeper and wider. The bigger your network, the more parameters (i.e. weights and biases) it has, which need to be optimized! The extra expressivity afforded by higher width and/or depth thus carries with it (at least) two problems:
* optimizing more parameters usually requires more data
* a more highly parameterized network is more prone to overfit to the training data, so requires more sophisticated optimization algorithms to ensure generalization
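
To get a feel for how quickly the parameter count grows, the sketch below counts weights and biases for a few fully connected architectures. The helper `count_parameters` and the layer sizes are hypothetical, with the input size set to roughly the number of recorded neurons.

```python
def count_parameters(layer_sizes):
  """Number of weights and biases in a fully connected feed-forward network."""
  total = 0
  for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    total += n_in * n_out  # weights
    total += n_out         # biases
  return total

n_inputs = 20000  # roughly the number of recorded neurons (illustrative)
shallow = count_parameters([n_inputs, 10, 1])
wider = count_parameters([n_inputs, 100, 1])
deeper = count_parameters([n_inputs, 10, 10, 1])

print(shallow, wider, deeper)
```

With such a high-dimensional input, widening the first hidden layer is far more expensive in parameters than adding a second small hidden layer.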

## Bonus Section 2: Gradient descent

### Bonus Section 2.1: Gradient descent equations

Here we provide the equations for the three steps of the gradient descent algorithm, as applied to our decoding problem:

1. **Evaluate the loss** on the training data. For a mean squared error loss, this is given by
\begin{equation}
    L = \frac{1}{P}\sum_{n=1}^P (y^{(n)} - \tilde{y}^{(n)})^2
\end{equation}
where $y^{(n)}$ denotes the stimulus orientation decoded from the population response $\mathbf{r}^{(n)}$ to the $n$th stimulus in the training data, and $\tilde{y}^{(n)}$ is the true orientation of that stimulus. $P$ denotes the total number of data samples in the training set. In the syntax of our `train()` function above, $\mathbf{r}^{(n)}$ is given by `train_data[n, :]` and $\tilde{y}^{(n)}$ by `train_labels[n]`.

2. **Compute the gradient of the loss** with respect to each of the network weights. In our case, this entails computing the quantities
\begin{equation}
    \frac{\partial L}{\partial \mathbf{W}^{in}}, \frac{\partial L}{\partial \mathbf{b}^{in}}, \frac{\partial L}{\partial \mathbf{W}^{out}}, \frac{\partial L}{\partial \mathbf{b}^{out}}
\end{equation}
Usually, we would require lots of math in order to derive each of these gradients, and lots of code to compute them. But this is where PyTorch comes to the rescue! Using a cool technique called [automatic differentiation](https://en.wikipedia.org/wiki/Automatic_differentiation), PyTorch automatically calculates these gradients when the `.backward()` function is called.

  More specifically, when this function is called on a particular variable (e.g. `loss`, as above), PyTorch will compute the gradients with respect to each network parameter. These are computed and stored behind the scenes, and can be accessed through the `.grad` attribute of each of the network's parameters. As we saw above, however, we actually never need to look at or call these gradients when implementing gradient descent, as this can be taken care of by PyTorch's built-in optimizers, like `optim.SGD`.

3. **Update the network weights** by descending the gradient:
\begin{align}
    \mathbf{W}^{in} &\leftarrow \mathbf{W}^{in} - \alpha \frac{\partial L}{\partial \mathbf{W}^{in}} \\
    \mathbf{b}^{in} &\leftarrow \mathbf{b}^{in} - \alpha \frac{\partial L}{\partial \mathbf{b}^{in}} \\
    \mathbf{W}^{out} &\leftarrow \mathbf{W}^{out} - \alpha \frac{\partial L}{\partial \mathbf{W}^{out}} \\
    \mathbf{b}^{out} &\leftarrow \mathbf{b}^{out} - \alpha \frac{\partial L}{\partial \mathbf{b}^{out}}
\end{align}
where $\alpha$ is called the **learning rate**. This **hyperparameter** of the gradient descent algorithm controls how far we descend the gradient on each iteration. It should be as large as possible so that fewer iterations are needed, but not so large that parameter updates skip over minima in the loss landscape.
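
The effect of the learning rate can be seen on a toy one-dimensional loss $L(w) = w^2$, whose minimum is at $w = 0$. This is a sketch with a hypothetical helper `gradient_descent` and made-up learning rates, not the tutorial's network:

```python
def gradient_descent(alpha, n_iter=20, w0=1.0):
  """Minimize L(w) = w**2 by gradient descent; returns the final w."""
  w = w0
  for _ in range(n_iter):
    grad = 2.0 * w        # dL/dw
    w = w - alpha * grad  # descend the gradient
  return w

w_small = gradient_descent(alpha=0.01)  # slow but steady progress
w_good = gradient_descent(alpha=0.4)    # converges quickly
w_large = gradient_descent(alpha=1.1)   # overshoots further on every step

print(w_small, w_good, w_large)
```

With the largest learning rate, each update jumps past the minimum and lands further away than it started, so the iterates diverge.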

While the equations written down here are specific to the network and loss function considered in this tutorial, the code provided above for implementing these three steps is completely general: no matter what loss function or network you are using, exactly the same commands can be used to implement these three steps.

The way that the gradients are calculated is called **backpropagation**. For a single training sample (and omitting the output bias $\mathbf{b}^{out}$ for simplicity), we have the loss function
\begin{align}
    L &= (y - \tilde{y})^2 \\
     &= (\mathbf{W}^{out} \mathbf{h} - \tilde{y})^2
\end{align}
where $\mathbf{h} = \phi(\mathbf{W}^{in} \mathbf{r} + \mathbf{b}^{in})$.
The derivative $\frac{\partial L}{\partial \mathbf{W}^{out}}$ is simple to calculate because $\mathbf{W}^{out}$ is on the outside of the equation (it is also a vector in this case, not a matrix, so the derivative is standard):
\begin{equation}
\frac{\partial L}{\partial \mathbf{W}^{out}} = 2 (y - \tilde{y}) \mathbf{h}^\top
\end{equation}
Now let's compute the derivative with respect to $\mathbf{W}^{in}$ using the chain rule. Because of the ReLU $\phi$, the $i$th row of this derivative is nonzero only if the $i$th hidden unit is active ($h_i > 0$):
\begin{align}
\frac{\partial L}{\partial \mathbf{W}^{in}_{i}} &= \begin{cases}
    \frac{\partial L}{\partial h_i} \frac{\partial h_i}{\partial \mathbf{W}^{in}_{i}}  & \text{if }  h_i > 0 \\
    0 & \text{else}
  \end{cases} \\
  &= \begin{cases}
    2 (y - \tilde{y}) W^{out}_i \mathbf{r}^\top  & \text{if }  h_i > 0 \\
    0 & \text{else}
  \end{cases}
\end{align}
where $\mathbf{W}^{in}_{i}$ denotes the $i$th row of $\mathbf{W}^{in}$ and $W^{out}_i$ the $i$th element of $\mathbf{W}^{out}$.
It is most efficient to compute the derivative once for the last layer, then reuse it when computing the derivative for the layer before it, and so on backwards through the network using the chain rule. Each of these operations is relatively fast, making training of deep networks feasible.

The command `loss.backward()` computes these gradients for the defined `loss` with respect to each network parameter. The computation is done using [automatic differentiation](https://en.wikipedia.org/wiki/Automatic_differentiation), which implements backpropagation. Note that this works no matter how big/small the network is, allowing us to perform gradient descent for any deep network model built using PyTorch.
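
Hand-derived gradients are easy to get wrong, so it can help to check them numerically. The sketch below assumes a single sample, a scalar output $y = \mathbf{W}^{out} \mathbf{h}$ with no output bias, and a ReLU hidden layer. Under those assumptions the chain rule gives $2 (y - \tilde{y}) \mathbf{h}^\top$ for $\mathbf{W}^{out}$, and $2 (y - \tilde{y}) W^{out}_i \mathbf{r}^\top$ for each row $i$ of $\mathbf{W}^{in}$ whose hidden unit is active. The code compares these expressions with finite differences on random toy values (all sizes and values are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 5, 3                    # toy numbers of inputs and hidden units
r = rng.normal(size=(N, 1))    # one input vector
y_true = 0.7                   # its target
W_in = rng.normal(size=(M, N))
b_in = rng.normal(size=(M, 1))
W_out = rng.normal(size=(1, M))

def loss(W_in, W_out):
  h = np.maximum(W_in @ r + b_in, 0)  # ReLU hidden layer
  y = (W_out @ h).item()              # scalar output (no output bias)
  return (y - y_true) ** 2

# Analytic gradients from the chain rule
h = np.maximum(W_in @ r + b_in, 0)
y = (W_out @ h).item()
grad_W_out = 2 * (y - y_true) * h.T                       # 1 x M
grad_W_in = 2 * (y - y_true) * (W_out.T * (h > 0)) @ r.T  # M x N

def numerical_grad(f, W, eps=1e-6):
  """Central finite-difference estimate of the gradient of f at W."""
  grad = np.zeros_like(W)
  for idx in np.ndindex(*W.shape):
    W_plus, W_minus = W.copy(), W.copy()
    W_plus[idx] += eps
    W_minus[idx] -= eps
    grad[idx] = (f(W_plus) - f(W_minus)) / (2 * eps)
  return grad

num_W_out = numerical_grad(lambda W: loss(W_in, W), W_out)
num_W_in = numerical_grad(lambda W: loss(W, W_out), W_in)

print(np.abs(grad_W_out - num_W_out).max())
print(np.abs(grad_W_in - num_W_in).max())
```

This kind of check is only practical for tiny networks, which is exactly why automatic differentiation is so useful for real ones.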

### Bonus Section 2.2: *Stochastic* gradient descent (SGD) vs. gradient descent (GD)

In this tutorial, we used the gradient descent algorithm, which differs in a subtle yet very important way from the more commonly used **stochastic gradient descent (SGD)** algorithm. The key difference is in the very first step of each iteration, where in the GD algorithm we evaluate the loss *at every data sample in the training set*. In SGD, on the other hand, we evaluate the loss only at a random subset of data samples from the full training set, called a **mini-batch**. At each iteration, we randomly sample a mini-batch to perform steps 1-3 on. All the above equations still hold, but now the $P$ data samples $\mathbf{r}^{(n)}, \tilde{y}^{(n)}$ denote a mini-batch of $P$ random samples from the training set, rather than the whole training set.
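
The difference amounts to one extra sampling step per iteration. The sketch below contrasts the two on a toy one-parameter model $y = w r$ with noiseless synthetic data; the data, batch size, and learning rate are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training set: P_total samples from a noiseless linear model y = 3 * r
P_total, batch_size = 200, 20
r = rng.normal(size=P_total)
y_true = 3.0 * r

def mse_grad(w, r_batch, y_batch):
  """Gradient of the mean squared error of the model y = w * r."""
  return np.mean(2 * (w * r_batch - y_batch) * r_batch)

w_gd, w_sgd, alpha = 0.0, 0.0, 0.1
for _ in range(200):
  # GD: use every sample in the training set
  w_gd -= alpha * mse_grad(w_gd, r, y_true)

  # SGD: use a fresh random mini-batch on each iteration
  batch = rng.choice(P_total, size=batch_size, replace=False)
  w_sgd -= alpha * mse_grad(w_sgd, r[batch], y_true[batch])

print(w_gd, w_sgd)
```

Each SGD step here touches only a tenth of the data, yet both estimates approach the true weight of 3 on this easy problem.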

There are several reasons why one might want to use SGD instead of GD. The first is that the training set might be too big, so that we can't actually evaluate the loss on every single data sample in it. In this case, GD is simply infeasible, so we have no choice but to turn to SGD, which bypasses the restrictive memory demands of GD by sub-sampling the training set into smaller mini-batches.

But, even when GD is feasible, SGD turns out to be generally better. The stochasticity induced by the extra random sampling step in SGD effectively adds some noise in the search for local minima of the loss function. This can be really useful for escaping poor local minima, so that the minimum that is eventually converged to is more likely to be a good one. This is particularly important when networks are wider and/or deeper, in which case the large number of parameters can lead to overfitting.

Here, we used only GD because (1) it is simpler, and (2) it suffices for the problem being considered here. Because we have so many neurons in our data set, decoding is not too challenging and doesn't require a particularly deep or wide network. The small number of parameters in our deep networks therefore can be optimized without a problem using GD.

## Bonus Section 3: Evaluating & improving models

---
### Bonus Section 3.1: Model criticism

Let's now take a step back and think about how our model is succeeding/failing and how to improve it.

In [None]:
#@title
#@markdown Execute this cell to plot decoding error

out = net(resp_test)  # decode stimulus orientation for neural responses in testing set
ori = stimuli_test  # true stimulus orientations
error = out - ori  # decoding error


plt.plot(ori, error.detach(), '.')   # plot decoding error as a function of true orientation (make sure all arguments to plt.plot() have been detached from PyTorch network!)

# Plotting
plt.xlabel('true stimulus orientation ($^o$)')
plt.ylabel('decoding error ($^o$)')
plt.xticks(np.linspace(0, 360, 5))
plt.yticks(np.linspace(-360, 360, 9))
plt.show()

#### Bonus Think! 3.1: Delving into error problems

In the cell above, we plotted the *decoding error* for each neural response in the testing set. The decoding error is defined as the decoded stimulus orientation minus the true stimulus orientation:
\begin{equation}
  \text{decoding error} = y^{(n)} - \tilde{y}^{(n)}
\end{equation}

In particular, we plot decoding error as a function of the true stimulus orientation.


  * Are some stimulus orientations harder to decode than others?
  * If so, in what sense? Are the decoded orientations for these stimuli more variable and/or are they biased?
  * Can you explain this variability/bias? What makes these stimulus orientations different from the others?
  * (Will be addressed in next exercise) Can you think of a way to modify the deep network in order to avoid this?

[*Click for solution*](https://github.com/NeuromatchAcademy/course-content/tree/master//tutorials/W2D1_DeepLearning/solutions/W2D1_Tutorial1_Solution_de0cd867.py)



### Bonus Section 3.2: Improving the loss function 
As illustrated in the previous exercise, the squared error is not a good loss function for circular quantities like angles, since two angles that are very close (e.g. $1^o$ and $359^o$) might actually have a very large squared error.
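
To make the problem concrete, the sketch below compares the ordinary squared error with one that respects the wrap-around at $360^o$. The helper `circular_difference` is hypothetical and only illustrates the issue; it is not the fix adopted in this section.

```python
def circular_difference(a, b):
  """Smallest signed difference a - b between two angles, in degrees."""
  return (a - b + 180.0) % 360.0 - 180.0

decoded, true = 359.0, 1.0

# Treats the angles as points on a line
squared_error = (decoded - true) ** 2

# Respects the wrap-around at 360 degrees
circular_error = circular_difference(decoded, true) ** 2

print(squared_error, circular_error)
```

Two angles that are only $2^o$ apart on the circle are penalized as if they were $358^o$ apart.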

Here, we'll avoid this problem by changing our loss function to treat our decoding problem as a **classification problem**. Rather than estimating the *exact* angle of the stimulus, we'll now aim to construct a decoder that classifies the stimulus into one of $C$ classes, corresponding to different bins of angles of width $b = \frac{360}{C}$. The true class $\tilde{y}^{(n)}$ of stimulus $n$ is now given by
\begin{equation}
  \tilde{y}^{(n)} =
  \begin{cases}
    1 &\text{if angle of stimulus $n$ is in the range } [0, b) \\
    2 &\text{if angle of stimulus $n$ is in the range } [b, 2b) \\
    3 &\text{if angle of stimulus $n$ is in the range } [2b, 3b) \\
    \vdots \\
    C &\text{if angle of stimulus $n$ is in the range } [(C-1)b, 360)
  \end{cases}
\end{equation}

We have a helper function `stimulus_class` that will extract `n_classes` stimulus classes for us from the stimulus orientations.
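
The implementation of `stimulus_class` is not shown here. As a rough sketch of how such binning could work, the hypothetical function `bin_orientation` below maps an angle in degrees to a class index. Note that it uses 0-indexed classes ($0, \ldots, C-1$), as is conventional in code, rather than the $1, \ldots, C$ used in the equation above.

```python
def bin_orientation(angle, n_classes):
  """Return the 0-indexed class of an angle in degrees, using bins of width 360 / n_classes."""
  bin_width = 360.0 / n_classes
  return int((angle % 360.0) // bin_width)

n_classes = 12  # bins of width 30 degrees
classes = [bin_orientation(a, n_classes) for a in (0.0, 29.9, 30.0, 359.0)]
print(classes)
```

Taking the angle modulo 360 first means that an angle of exactly $360^o$ falls in the same class as $0^o$.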

To decode the stimulus class from neural responses, we'll use a deep network that outputs a $C$-dimensional vector of probabilities $\mathbf{p} = \begin{bmatrix} p_1, p_2, \ldots, p_C \end{bmatrix}^T$, corresponding to the estimated probabilities of the stimulus belonging to each class $1, 2, \ldots, C$. 

To ensure the network's outputs are indeed probabilities (i.e. they are positive numbers between 0 and 1, and sum to 1), we'll use a [softmax function](https://en.wikipedia.org/wiki/Softmax_function) to transform the real-valued outputs of the output layer into probabilities. Letting $\sigma(\cdot)$ denote this softmax function, the equations describing our network are
\begin{align}
    \mathbf{h}^{(n)} &= \phi(\mathbf{W}^{in} \mathbf{r}^{(n)} + \mathbf{b}^{in}), && [\mathbf{W}^{in}: M \times N], \\
    \mathbf{p}^{(n)} &= \sigma(\mathbf{W}^{out} \mathbf{h}^{(n)} + \mathbf{b}^{out}),  && [\mathbf{W}^{out}: C \times M],
\end{align}
The decoded stimulus class is then given by that assigned the highest probability by the network:
\begin{equation}
  y^{(n)} = \underset{i}{\arg\max} \,\, p_i^{(n)}
\end{equation}
The softmax function can be implemented in PyTorch simply using `torch.softmax()`.

Often *log* probabilities are easier to work with than actual probabilities, because probabilities tend to be very small numbers that computers have trouble representing. We'll therefore actually use the logarithm of the softmax as the output of our network,
\begin{equation}
    \mathbf{l}^{(n)} = \log \left( \mathbf{p}^{(n)} \right)
\end{equation}
which can be implemented in PyTorch together with the softmax via an `nn.LogSoftmax` layer. The nice thing about the logarithmic function is that it's *monotonic*, so if one probability is larger/smaller than another, then its logarithm is also larger/smaller than the other's. We therefore have that
\begin{equation}
  y^{(n)} = \underset{i}{\arg\max} \,\, p_i^{(n)} = \underset{i}{\arg\max} \, \log p_i^{(n)} = \underset{i}{\arg\max} \,\, l_i^{(n)}
\end{equation}
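
The sketch below illustrates both points on made-up numbers, using a hypothetical pure-Python `log_softmax` helper (in the network itself, `nn.LogSoftmax` does this job). It computes log probabilities without ever forming very small probabilities, and checks that the most likely class is the same whether we compare probabilities or their logarithms.

```python
import math

def log_softmax(z):
  """Numerically stable log of the softmax of a list of real numbers."""
  z_max = max(z)
  log_norm = z_max + math.log(sum(math.exp(v - z_max) for v in z))
  return [v - log_norm for v in z]

z = [2.0, -1.0, 0.5]          # made-up real-valued outputs for C = 3 classes
l = log_softmax(z)            # log probabilities
p = [math.exp(v) for v in l]  # the corresponding probabilities

best_by_p = max(range(len(p)), key=lambda i: p[i])
best_by_l = max(range(len(l)), key=lambda i: l[i])
print(best_by_p, best_by_l)

# With extreme inputs, one probability underflows to zero
# but its log probability is still representable
extreme = log_softmax([1000.0, 0.0])
print(extreme)
```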

See the next cell for code for constructing a deep network with one hidden layer of ReLUs that outputs a vector of log probabilities.

In [None]:
# Deep network for classification
class DeepNetSoftmax(nn.Module):
  """Deep Network with one hidden layer, for classification

  Args:
    n_inputs (int): number of input units
    n_hidden (int): number of units in hidden layer
    n_classes (int): number of outputs, i.e. number of classes to output
      probabilities for

  Attributes:
    in_layer (nn.Linear): weights and biases of input layer
    out_layer (nn.Linear): weights and biases of output layer

  """

  def __init__(self, n_inputs, n_hidden, n_classes):
    super().__init__()  # needed to invoke the properties of the parent class nn.Module
    self.in_layer = nn.Linear(n_inputs, n_hidden)  # neural activity --> hidden units
    self.out_layer = nn.Linear(n_hidden, n_classes)  # hidden units --> outputs
    self.logprob = nn.LogSoftmax(dim=1)  # probabilities across columns should sum to 1 (each output row corresponds to a different input)

  def forward(self, r):
    """Predict stimulus orientation bin from neural responses

    Args:
      r (torch.Tensor): n_stimuli x n_inputs tensor with neural responses to n_stimuli

    Returns:
      torch.Tensor: n_stimuli x n_classes tensor with predicted class log probabilities

    """
    h = torch.relu(self.in_layer(r))
    logp = self.logprob(self.out_layer(h))
    return logp

What should our loss function now be? Ideally, we want the probabilities outputted by our network to be such that the probability of the true stimulus class is high. One way to formalize this is to say that we want to maximize the *log* probability of the true stimulus class $\tilde{y}^{(n)}$ under the class probabilities predicted by the network,
\begin{equation}
  \log \left( \text{predicted probability of stimulus } n \text{ being of class } \tilde{y}^{(n)} \right) = \log p^{(n)}_{\tilde{y}^{(n)}} = l^{(n)}_{\tilde{y}^{(n)}}
\end{equation}
To turn this into a loss function to be *minimized*, we can then simply multiply it by -1: maximizing the log probability is the same as minimizing the *negative* log probability. Summing over a batch of $P$ inputs, our loss function is then given by
\begin{equation}
  L = -\sum_{n=1}^P \log p^{(n)}_{\tilde{y}^{(n)}} = -\sum_{n=1}^P l^{(n)}_{\tilde{y}^{(n)}}
\end{equation}
In the deep learning community, this loss function is typically referred to as the **cross-entropy**, or **negative log likelihood**. The corresponding built-in loss function in PyTorch is `nn.NLLLoss()` (documentation [here](https://pytorch.org/docs/master/generated/torch.nn.NLLLoss.html)). Note that by default `nn.NLLLoss()` averages over the batch rather than summing, which rescales the loss by $1/P$ but does not change its minimizer.
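
To make the formula concrete, the sketch below compares `nn.NLLLoss` against picking out the log probability of the true class by hand. The logits and class labels are invented for illustration; only the PyTorch calls themselves come from the library.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Made-up network outputs: log probabilities for 4 stimuli and 3 classes
logits = torch.randn(4, 3)
logp = nn.LogSoftmax(dim=1)(logits)
y_true = torch.tensor([0, 2, 1, 2])  # made-up true class of each stimulus

# By hand: pick the log probability of the true class in each row, sum, negate
manual = -logp[torch.arange(4), y_true].sum()

# Built-in: reduction='sum' matches the summed loss; the default averages instead
builtin_sum = nn.NLLLoss(reduction='sum')(logp, y_true)
builtin_mean = nn.NLLLoss()(logp, y_true)

print(manual.item(), builtin_sum.item())       # these two should agree
print(builtin_mean.item(), manual.item() / 4)  # default is the batch average
```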



#### Bonus Coding Exercise 3.2: A new loss function
In the next cell, we've provided most of the code to train and test a network to decode stimulus orientations via classification, by minimizing the negative log likelihood. Fill in the missing pieces.

Once you've done this, have a look at the plotted results. Does changing the loss function from mean squared error to a classification loss solve our problems? Note that errors may still occur -- but are these errors as bad as the ones that our network above was making?

In [None]:
def decode_orientation(net, n_classes, loss_fn,
                       train_data, train_labels, test_data, test_labels,
                       n_iter=1000, L2_penalty=0, L1_penalty=0):
  """ Initialize, train, and test deep network to decode binned orientation from neural responses

  Args:
    net (nn.Module): deep network to run
    n_classes (scalar): number of classes in which to bin orientation
    loss_fn (function): loss function to run
    train_data (torch.Tensor): n_train x n_neurons tensor with neural
      responses to train on
    train_labels (torch.Tensor): n_train x 1 tensor with orientations of the
      stimuli corresponding to each row of train_data, in radians
    test_data (torch.Tensor): n_test x n_neurons tensor with neural
      responses to test on
    test_labels (torch.Tensor): n_test x 1 tensor with orientations of the
      stimuli corresponding to each row of test_data, in radians
    n_iter (int, optional): number of iterations to run optimization
    L2_penalty (float, optional): l2 penalty regularizer coefficient
    L1_penalty (float, optional): l1 penalty regularizer coefficient

  Returns:
    (tuple): training loss over iterations, test loss over iterations, and
      tensor with the predicted orientation class of each test stimulus
  """

  # Bin stimulus orientations in training and test sets
  train_binned_labels = stimulus_class(train_labels, n_classes)
  test_binned_labels = stimulus_class(test_labels, n_classes)


  # Run GD on training set data, using learning rate of 0.1
  # (add optional arguments test_data and test_binned_labels!)
  train_loss, test_loss = train(net, loss_fn, train_data, train_binned_labels,
                                learning_rate=0.1, test_data=test_data,
                                test_labels=test_binned_labels, n_iter=n_iter,
                                L2_penalty=L2_penalty, L1_penalty=L1_penalty)

  # Decode neural responses in testing set data
  out = net(test_data)
  out_labels = np.argmax(out.detach(), axis=1)  # predicted classes

  frac_correct = (out_labels==test_binned_labels).sum() / len(test_binned_labels)
  print(f'>>> fraction correct = {frac_correct:.3f}')

  return train_loss, test_loss, out_labels

# Set random seeds for reproducibility
np.random.seed(1)
torch.manual_seed(1)

n_classes = 20

##############################################################################
## TODO for students: fill out missing pieces below to initialize network and
# define loss function
##############################################################################

# Initialize network
net = ... # use M=20 hidden units

# Initialize built-in PyTorch negative log likelihood loss function
loss_fn = ...

# Uncomment below to train network and run it on test data
# this function uses the train function you wrote before
#train_loss, test_loss, predicted_test_labels = decode_orientation(net, n_classes, loss_fn,
#                                                                  resp_train, stimuli_train, resp_test, stimuli_test)

# Plot results
#plot_decoded_results(train_loss, test_loss, stimuli_test, predicted_test_labels)
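
The accuracy check inside `decode_orientation` takes the most probable class for each test stimulus and compares it with the true bin. Here is a minimal sketch of that step on made-up values. It uses `torch.argmax` in place of `np.argmax`, which is a substitution for illustration rather than the tutorial's own code.

```python
import torch

# Made-up log probabilities for 4 test stimuli and 3 classes
out = torch.log(torch.tensor([[0.7, 0.2, 0.1],
                              [0.1, 0.8, 0.1],
                              [0.3, 0.3, 0.4],
                              [0.6, 0.3, 0.1]]))
true_classes = torch.tensor([0, 1, 1, 0])  # made-up true bins

predicted = torch.argmax(out, dim=1)  # most probable class per row
frac_correct = (predicted == true_classes).float().mean().item()

print(predicted)                                 # tensor([0, 1, 2, 0])
print(f'fraction correct = {frac_correct:.3f}')  # fraction correct = 0.750
```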

[*Click for solution*](https://github.com/NeuromatchAcademy/course-content/tree/master//tutorials/W2D1_DeepLearning/solutions/W2D1_Tutorial1_Solution_6386844c.py)

*Example output:*

<img alt='Solution hint' align='left' width=1134 height=413 src=https://raw.githubusercontent.com/NeuromatchAcademy/course-content/master/tutorials/W2D1_DeepLearning/static/W2D1_Tutorial1_Solution_6386844c_5.png>



How do the weights $W_{in}$ from the neurons to the hidden layer look now?

In [None]:
W_in = net.in_layer.weight.detach().numpy() # we can run detach and numpy to get a numpy array
print('shape of W_in:')
print(W_in.shape)

plt.figure(figsize=(6,4))
plt.imshow(W_in[:,isort], aspect='auto', cmap='bwr', vmin=-5e-3, vmax=5e-3)
plt.xlabel('neurons')
plt.ylabel('hidden units')
plt.colorbar()
plt.title('$W_{in}$')

plt.show()

### Bonus Section 3.3: Regularization

In [None]:
# @title Video 4: Regularization
from ipywidgets import widgets

out2 = widgets.Output()
with out2:
  from IPython.display import IFrame
  class BiliVideo(IFrame):
      def __init__(self, id, page=1, width=400, height=300, **kwargs):
          self.id=id
          src = 'https://player.bilibili.com/player.html?bvid={0}&page={1}'.format(id, page)
          super(BiliVideo, self).__init__(src, width, height, **kwargs)

  video = BiliVideo(id="BV1na4y1a7ug", width=854, height=480, fs=1)
  print('Video available at https://www.bilibili.com/video/{0}'.format(video.id))
  display(video)

out1 = widgets.Output()
with out1:
  from IPython.display import YouTubeVideo
  video = YouTubeVideo(id="Qnn5OPHKo5w", width=854, height=480, fs=1, rel=0)
  print('Video available at https://youtube.com/watch?v=' + video.id)
  display(video)

out = widgets.Tab([out1, out2])
out.set_title(0, 'Youtube')
out.set_title(1, 'Bilibili')

display(out)


As discussed in the lecture, it is often important to incorporate regularization terms into the loss function to avoid overfitting. In particular, in this case, we will use these terms to enforce sparsity in the linear layer from neurons to hidden units. 

Here we'll consider the classic L2 regularization penalty $\mathcal{R}_{L2}$, which is the sum of squares of the weights from neurons to hidden units $\sum_{ij} {\mathbf{W}^{in}_{ij}}^2$ times a constant that we call `L2_penalty`.

We will also add an L1 regularization penalty $\mathcal{R}_{L1}$ to enforce sparsity of the weights, which is the sum of the absolute values of the same weights $\sum_{ij} |{\mathbf{W}^{in}_{ij}}|$ times a constant that we call `L1_penalty`.

We will add both of these to the negative log likelihood loss:
\begin{equation}
    L = -\sum_{n=1}^P l^{(n)}_{\tilde{y}^{(n)}} + \mathcal{R}_{L2} + \mathcal{R}_{L1}
\end{equation}

The parameters `L2_penalty` and `L1_penalty` are inputs to the train function.
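
To get a feel for the size of each penalty, here is a small worked example in NumPy on a made-up 2 x 2 weight matrix that can be checked by hand. The matrix `W` is invented for illustration. The penalty values are the ones passed to `decode_orientation` in the cell further down.

```python
import numpy as np

# Made-up 2 x 2 weight matrix, small enough to check by hand
W = np.array([[1.0, -2.0],
              [0.0,  3.0]])

L2_penalty = 1e-2  # same role as the `L2_penalty` argument
L1_penalty = 5e-4  # same role as the `L1_penalty` argument

sum_squares = np.sum(W ** 2)      # 1 + 4 + 0 + 9 = 14
sum_abs = np.sum(np.absolute(W))  # 1 + 2 + 0 + 3 = 6

R_L2 = L2_penalty * sum_squares
R_L1 = L1_penalty * sum_abs

print(f'R_L2 = {R_L2:.4f}, R_L1 = {R_L1:.4f}')  # R_L2 = 0.1400, R_L1 = 0.0030
```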

#### Bonus Coding Exercise 3.3: Add regularization to training 

We will create a new loss function that adds L1 and L2 regularization. 
In particular, you will:
* add L2 loss penalty to the weights 
* add L1 loss penalty to the weights


We will then train the network using this loss function. Full training will take a few minutes. If you want to train for just a few steps while you are still working on your code, you can decrease the `n_iter` input from 1000.

Hint: since we are using `torch` instead of `np`, we will use `torch.abs` instead of `np.absolute`. You can use `torch.sum` or `.sum()` to sum over a tensor.


In [None]:
def regularized_loss(output, target, weights, L2_penalty=0, L1_penalty=0):
  """loss function with L2 and L1 regularization

  Args:
    output (torch.Tensor): output of network
    target (torch.Tensor): true stimulus class the network is trying to predict
    weights (torch.Tensor): linear layer weights from neurons to hidden units (net.in_layer.weight)
    L2_penalty : scaling factor for sum of squared weights
    L1_penalty : scaling factor for sum of absolute weights

  Returns:
    (torch.Tensor) negative log likelihood with L1 and L2 penalties added

  """

  ##############################################################################
  # TO DO: add L1 and L2 regularization to the loss function
  raise NotImplementedError("Student exercise: complete regularized_loss")
  ##############################################################################

  loss_fn = nn.NLLLoss()
  loss = loss_fn(output, target)

  L2 = L2_penalty * ...
  L1 = L1_penalty * ...
  loss += L1 + L2

  return loss

# Set random seeds for reproducibility
np.random.seed(1)
torch.manual_seed(1)

n_classes = 20

# Initialize network
net = DeepNetSoftmax(n_neurons, 20, n_classes)  # use M=20 hidden units

# Use the regularized negative log likelihood loss defined above
loss_fn = regularized_loss

# Here you can play with L2_penalty > 0, L1_penalty > 0
train_loss, test_loss, predicted_test_labels = decode_orientation(net, n_classes, loss_fn,
                                                                  resp_train, stimuli_train,
                                                                  resp_test, stimuli_test,
                                                                  n_iter=1000,
                                                                  L2_penalty=1e-2,
                                                                  L1_penalty=5e-4)

# Plot results
plot_decoded_results(train_loss, test_loss, stimuli_test, predicted_test_labels)

[*Click for solution*](https://github.com/NeuromatchAcademy/course-content/tree/master//tutorials/W2D1_DeepLearning/solutions/W2D1_Tutorial1_Solution_e53cd8ce.py)

*Example output:*

<img alt='Solution hint' align='left' width=1134 height=413 src=https://raw.githubusercontent.com/NeuromatchAcademy/course-content/master/tutorials/W2D1_DeepLearning/static/W2D1_Tutorial1_Solution_e53cd8ce_5.png>



Adding the L1 and L2 regularization penalties increased the accuracy by a small amount, which suggests that the unregularized network was overfitting a little. What errors are still being made by the model?

Let's see how the weights look after adding `L1_penalty > 0`.

In [None]:
W_in = net.in_layer.weight.detach().numpy() # we can run detach and numpy to get a numpy array
print('shape of W_in:')
print(W_in.shape)

plt.figure(figsize=(6,4))
plt.imshow(W_in[:,isort], aspect='auto', cmap='bwr', vmin=-5e-3, vmax=5e-3)
plt.xlabel('neurons')
plt.ylabel('hidden units')
plt.colorbar()
plt.title('$W_{in}$')

plt.show()

The weights appear to be sparser than before.
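
One way to put a number on "sparser" is to count the fraction of weights whose magnitude falls below a small threshold. In the sketch below, the helper `fraction_near_zero` and its threshold are hypothetical choices made for illustration. The two arrays are synthetic stand-ins, not trained weights.

```python
import numpy as np

def fraction_near_zero(W, threshold=1e-3):
    """Fraction of entries of W with absolute value below threshold.

    Hypothetical helper for illustration; the threshold is an arbitrary choice.
    """
    return np.mean(np.abs(W) < threshold)

rng = np.random.default_rng(0)

# Synthetic stand-ins for two weight matrices (not trained weights)
dense_W = rng.normal(scale=5e-3, size=(20, 1000))
sparse_W = dense_W * (rng.random(size=(20, 1000)) < 0.1)  # zero out roughly 90% of entries

print(f'dense:  {fraction_near_zero(dense_W):.2f}')
print(f'sparse: {fraction_near_zero(sparse_W):.2f}')
```

With the tutorial's variables, calling `fraction_near_zero(W_in)` after training with and without `L1_penalty > 0` would give a comparable pair of numbers.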