# Computational Cognitive Neuroscience - Homework 1 (Neural Networks)
**Start date: 21st January 2021**

**Due date (UPDATED!): 1st February 2021**

This homework set focuses on and expands upon chapters 3, 4 and 7 of the Rojas [book](http://page.mi.fu-berlin.de/rojas/neural/chapter/K7.pdf). The task is to implement simple perceptrons and neural networks in pure Python, from scratch.

## Submission instructions
Submission is by email to hermanni.halva@helsinki.fi. Follow these instructions to submit:  
1. Title of the email: "ccn homework 1 - student_number"
2. When you have completed the exercises, save the notebook. Attach it to the email.
3. Also download a pdf of the notebook and attach it.

## IMPORTANT
1. Don't share your code and answers with others.
2. It's your responsibility to ensure that the notebook has finished running all the cells, that all the plots display properly, etc. before submitting it. I will not re-run any code.
3. Submit your work by the deadline.
4. If you are confused, think there is a mistake, or find things too difficult, just ask on GitHub and I'll help.

In [None]:
# set-up -- do not change
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms
np.random.seed(1)

## Question 1 - simple perceptron learning algorithm (15 pts)

In this question you are given data, shown below, which comes from two different sources. Your task is to train a simple perceptron (ch. 3, Rojas) using the perceptron learning algorithm (ch. 4, Rojas).

More specifically, in the forward pass of the model we calculate, for a single observation $x_i \in \mathbb{R}^d$: 
$$ \text{score}_i = x_i^Tw +b,$$
where $w$ is the vector of edge weights and $b$ is a scalar bias. The score determines whether an observation is classified as $1$ or $0$:
$$\hat{y}_i=1 \quad \text{if} \quad \text{score}_i > 0$$
$$\hat{y}_i=0 \quad \text{if} \quad \text{score}_i \le 0.$$
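
As a small illustration of the two formulas above, the sketch below scores several observations at once by stacking them as rows of a matrix. The toy data, weights and bias are made-up values for illustration only and are not part of the homework data.

```python
import numpy as np

# Toy data (assumed values): 4 observations in d = 2 dimensions, one per row.
X = np.array([[0.2, 0.1],
              [0.9, 0.8],
              [0.4, 0.4],
              [0.7, 0.9]])
w = np.array([1.0, 1.0])   # hypothetical weight vector
b = -1.0                   # hypothetical bias

scores = X @ w + b                   # shape (4,): score_i = x_i^T w + b for every row
y_hat = (scores > 0).astype(float)   # 1 where score > 0, otherwise 0
print(scores, y_hat)
```

The matrix product computes all the scores in one step, which is the matrix/vector style that the rest of this homework asks for.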

Model training is, in this implementation, done by the perceptron learning algorithm described on page 85 of Rojas. Training progress is tracked with a simple loss function $$\text{loss} = \sum_i |y_i-\hat{y}_i|,$$ where $y_i$ is the true label and $\hat{y}_i$ is the predicted label.
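
For intuition, here is a minimal sketch of the classic perceptron update rule applied one observation at a time, which I assume corresponds to the algorithm referenced above. The names `perceptron_epoch` and `toy_loss` and the toy data are hypothetical. Note that this is not the matrix/vector form required in the exercise.

```python
import numpy as np

def perceptron_epoch(X, y, w, b):
    """One pass over the data, updating after each misclassified point."""
    for x_i, y_i in zip(X, y):
        y_hat = 1.0 if x_i @ w + b > 0 else 0.0
        # (y_i - y_hat) is +1 for a missed positive, -1 for a missed negative, 0 if correct
        w = w + (y_i - y_hat) * x_i
        b = b + (y_i - y_hat)
    return w, b

def toy_loss(X, y, w, b):
    """Sum of absolute differences between true and predicted labels."""
    y_hat = (X @ w + b > 0).astype(float)
    return np.sum(np.abs(y - y_hat))

# Toy, linearly separable data (assumed values, not the homework data).
X_toy = np.array([[0.2, 0.3], [0.3, 0.2], [0.1, 0.1],
                  [0.8, 0.9], [0.9, 0.7], [0.7, 0.8]])
y_toy = np.array([0., 0., 0., 1., 1., 1.])

w, b = np.zeros(2), 0.0
for epoch in range(1000):                 # generous cap on the number of passes
    if toy_loss(X_toy, y_toy, w, b) == 0:
        break
    w, b = perceptron_epoch(X_toy, y_toy, w, b)
final_loss = toy_loss(X_toy, y_toy, w, b)
print(w, b, final_loss)
```

Because the toy data is linearly separable, the updates stop once every point is classified correctly and the loss reaches zero.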

**Your task is** to fill in the missing lines of code in the second cell below (the first cell is data generation -- do not change it). Once the code is properly implemented, you should be able to reach a loss of zero very quickly. Note that the algorithm must be implemented in matrix/vector form over the whole dataset rather than one data point at a time.



In [None]:
# create 2D data -- do not change
N = 100
mu1 = np.array([0.3, 0.3])
cov1 = np.array([[0.01, 0.],[0., 0.01]])
mu2 = np.array([0.8, 0.8])
cov2 = np.array([[0., 0.01],[0.01, 0.]])

mvn1 = np.random.multivariate_normal(mu1, cov1, size=N)
mvn2 = np.random.multivariate_normal(mu2, cov1, size=N)

x = np.concatenate((mvn1, mvn2), 0)
y = np.concatenate((np.zeros(N), np.ones(N)))

# plot the data -- do not change
sns.scatterplot(x=mvn1[:,0], y=mvn1[:,1], s=10, color="Blue")
sns.scatterplot(x=mvn2[:,0], y=mvn2[:,1], s=10, color="Green")

In [None]:
def simple_loss(predicted_labels, true_labels):
    return #!IMPLEMENT

# implement the missing parts
class simple_perceptron_unit:
    def __init__(self, d):
        # initialize weights and bias at random values -- no need to change
        self.w = np.random.random(size=(d,))
        self.b = np.random.random()
        
    def forward_pass(self, x):
        score = #[[CALCULATE THE SCORE]]
        return #[[OUTPUT CLASSIFICATION USING THRESHOLD AS IN INSTRUCTIONS]]
    
    def train(self, x, y):
        # x is the input, y the labels
        # [[IMPLEMENT THE PERCEPTRON TRAINING ALGORITHM]]
            
# train the model -- don't change
spu = simple_perceptron_unit(2)
spu.train(x, y)
print(spu.w, spu.b)

## Question 2  - modular view of neural networks (20 pts)

As discussed in ch. 7, it is best to view neural networks as a sequential graph of modules/functions. Each module takes inputs (data + parameters) and gives outputs. This allows learning to happen through the backpropagation algorithm, as gradients for the entire network can easily be calculated using the chain rule. In particular, all we have to do is calculate, for each module separately, the gradients of its outputs with respect to its input data and parameters (if there are any). Once we know these, we can build a model with modules in any order we desire and easily backpropagate through the chain of modules. **Your task** is to implement the following neural network building blocks by defining their:
1. forward pass (input->output)
2. backward pass (gradients of module output wrt. module inputs, and wrt. parameters if they exist)

**Requirements**
- everything should be in matrix/vector form such that it can handle multiple observations of high dimension in parallel
- you are only allowed to use native Python packages as well as numpy and scipy. No autograd, no pytorch/tensorflow/jax or any other autodiff packages. You need to implement the backpropagation rules manually.

I have given you the first one as an example. **Please see the slides** from Thursday for formal definitions of all these modules.
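
One way to sanity-check a module's backward pass is to compare it with finite differences. The sketch below does this for a stand-alone copy of the sigmoid module. The names `SigmoidDemo` and `numerical_grad`, and the test loss $L = \sum y \odot \text{dL\_dy}$, are assumptions made for this illustration and not part of the required interface.

```python
import numpy as np

class SigmoidDemo:
    """Stand-alone copy of the sigmoid module, so this check is self-contained."""
    def forward_pass(self, X):
        self.sigmoid = 1.0 / (1.0 + np.exp(-X))
        return self.sigmoid

    def backprop(self, dL_dy):
        return self.sigmoid * (1.0 - self.sigmoid) * dL_dy

def numerical_grad(f, X, eps=1e-6):
    """Central finite differences of the scalar function f w.r.t. each entry of X."""
    grad = np.zeros_like(X)
    for idx in np.ndindex(*X.shape):
        X_plus, X_minus = X.copy(), X.copy()
        X_plus[idx] += eps
        X_minus[idx] -= eps
        grad[idx] = (f(X_plus) - f(X_minus)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
dL_dy = rng.normal(size=(4, 3))

module = SigmoidDemo()
module.forward_pass(X)
analytic = module.backprop(dL_dy)

# Test loss L = sum(y * dL_dy), chosen so that dL/dy equals dL_dy exactly.
numeric = numerical_grad(lambda Z: np.sum(SigmoidDemo().forward_pass(Z) * dL_dy), X)
max_abs_error = np.max(np.abs(analytic - numeric))
print(max_abs_error)
```

The same kind of check could be applied to the other modules once they are implemented, by swapping in their forward and backward passes.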

In [None]:
class sigmoid_activation:
    def __init__(self):
        self.X = 0
    
    def forward_pass(self, X):
        # if you wish to use any results from forward pass later in the backward pass, use self.
        # to store it. This is useful for sigmoid, as can be seen here. But it's not always needed. 
        sigmoid = (1.+np.exp(-X))**-1
        self.sigmoid = sigmoid
        return sigmoid
    
    def backprop(self, dL_dy):
        # dL_dy = dL/dy is the derivative of the loss function with respect to
        # the output of the sigmoid function.
        dL_dX = self.sigmoid*(1.-self.sigmoid) * dL_dy
        return dL_dX

    
class relu_activation:
  def __init__(self):
    self.X = 0
  
  def forward_pass(self, X):
    #[[IMPLEMENT FORWARD PASS]]
    return

  def backprop(self, dL_dy):
    dL_dX = #[[IMPLEMENT]]
    return dL_dX

    
class linear_layer:
  def __init__(self, W, b):
    self.W = W
    self.b = b
    self.X = 0
    
  def forward_pass(self, X):
    #[[IMPLEMENT FORWARD PASS]]
    return

  def backprop(self, dL_dy):
    dL_dW = #[[IMPLEMENT]]
    dL_dX = #[[IMPLEMENT]]
    dL_db = #[[IMPLEMENT]]
    return dL_dW, dL_db, dL_dX
  


class softmax_crossentropy_loss:
  def __init__(self):
    self.loss = 0
  
  def forward_pass(self, scores, true_labels):
    #[[IMPLEMENT FORWARD PASS]]
    return 
  
  def backprop(self, scores, true_labels):
    dL_dscores = #[[IMPLEMENT]]
    return dL_dscores


# evaluation below -- do not change
x = np.random.normal(size=(1000, 100))
dL_dy = np.random.normal(size=(1000, 100))
W = np.random.normal(size=(100, 50))
b = np.random.normal(size=(50, 1))
dL_dy_2 = np.random.normal(size=(1000, 50))
yhats = np.random.normal(size=(100, 10))
ys = np.random.choice([0., 1.], size=(100, 10))


sigmoid = sigmoid_activation()
print("sigmoid fwd pass", sigmoid.forward_pass(x))
print("sigmoid backprop", sigmoid.backprop(dL_dy))

relu = relu_activation()
print("relu fwd pass", relu.forward_pass(x))
print("relu backprop", relu.backprop(dL_dy))

linear = linear_layer(W, b)
print("linear layer fwd pass", linear.forward_pass(x))
print("linear layer backprop", linear.backprop(dL_dy_2))

smxe = softmax_crossentropy_loss()
print("softmaxXE fwd pass", smxe.forward_pass(yhats, ys))
#print("softmaxXE backprop", smxe.backprop(yhats, ys))

## Question 3 - building your own neural network (20 pts)

In this question you will build your own neural network using a popular deep learning package, [PyTorch](https://pytorch.org/). The benefit of this and other similar packages (TensorFlow, JAX etc.) is that they have readily implemented modules (similar to the ones you created in the previous question) and, importantly, they use autograd, which automatically calculates the gradients of those modules for you and thus makes backprop easy. All you need to do is plug these modules together into your desired model.

The data you will use is the famous [MNIST](https://en.wikipedia.org/wiki/MNIST_database) image data. The idea is to train the neural network to be able to differentiate between images of the different digits 0-9. 

**Your task** is to implement a deep neural network that successfully performs this task. The structure of your neural network should be:

input data (batch_size X 784 matrix) $\rightarrow$ linear layer (9216 units) $\rightarrow$ relu activation $\rightarrow$ linear layer (124 units) $\rightarrow$ output layer (10 unit linear layer)

**Important:**
   1. You should implement training in minibatches defined by batch_size (in order to do SGD).
   2. Each minibatch is by default in the shape (batch_size, 1, 28, 28), since each image is 28x28 pixels with 1 color channel. You need to flatten each observation into a vector of length 784, i.e. reshape each batch into a (batch_size, 784) matrix.
   3. The loss function to use is softmax cross-entropy. Note that it is implemented outside of the neural network, i.e. it is not included in the 'class Net()' below.
   4. You should reach a test accuracy of at least 98% after 5 epochs. Play around with the batch_size and learning_rate settings to reach this.
   5. I have given you the skeleton code, so all you need to do is fill in the missing parts, denoted by [[DIRECTIONS]]. If you still find it too difficult, you can just cheat and look here, but try to understand how things work: https://github.com/pytorch/examples/blob/master/mnist/main.py. Of course, the network and code there are a bit different, but they should give you more than enough.
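
The reshaping in point 2 and the counting of correct predictions can be sketched with NumPy arrays as stand-ins for the PyTorch tensors. All data below is random and purely illustrative. PyTorch tensors offer analogous `reshape` and `argmax` methods (with `dim` in place of `axis`).

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for one MNIST minibatch: 32 images, 1 channel, 28x28 pixels.
batch = rng.normal(size=(32, 1, 28, 28))
flat = batch.reshape(batch.shape[0], -1)     # -> shape (32, 784)

# Stand-ins for output-layer scores and integer class labels.
scores = rng.normal(size=(32, 10))
true_labels = rng.integers(0, 10, size=32)

predicted = scores.argmax(axis=1)            # most likely class per observation
n_correct = int((predicted == true_labels).sum())
print(flat.shape, n_correct)
```

Using `-1` for the second dimension lets the reshape infer 784 from the image size, so the same line works for any batch size.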

In [None]:
# setting & data preparation -- do NOT change
torch.manual_seed(1)
transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.1307,), (0.3081,))
        ])
train_data = datasets.MNIST('../data', train=True,
                            download=True,
                            transform=transform)
test_data = datasets.MNIST('../data', train=False,
                          transform=transform)
log_interval = 100

# define the structure of your neural network
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        #[[IMPLEMENT: SEE LINK GIVEN IF CONFUSED]]
        
    def forward(self, x):
        #[[IMPLEMENT FORWARD PASS OF YOUR NETWORK]]
        return

# define training loop for a specific epoch
def train(model, train_loader, optimizer, current_epoch):
    model.train() # needed for training
    for batch_idx, (data, true_labels) in enumerate(train_loader): # this loops over minibatches for one epoch
        data = #[[RESHAPE DATA AS DESCRIBED ABOVE]]
        optimizer.zero_grad() # this clears gradients from previous run
        scores = #[[IMPLEMENT TO CALCULATE OUTPUT LAYER SCORES]]
        loss = #[[IMPLEMENT LOSS FOR THE MINIBATCH]]
        loss.backward() # this performs the backprop automatically
        optimizer.step() # this performs the gradient descent step
        if batch_idx % log_interval == 0: # print results
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                current_epoch, batch_idx * len(data), len(train_loader.dataset),
                100. * batch_idx / len(train_loader), loss.item()))
 

def test(model, test_loader):
    model.eval() # needed for evaluation
    test_loss = 0 # initialize loss
    correct_predictions = 0 # variable to count correct predictions
    with torch.no_grad(): # since no gradients needed in testing the model
        for data, true_labels in test_loader: # testing loop readily defined
            data = #[[RESHAPE YOUR DATA HERE]]
            scores = #[[CALCULATE SCORES FOR THE BATCH]]
            test_loss += #[[CALCULATE TOTAL LOSS FOR THE BATCH (SUM ACROSS BATCH)]]
            predicted_labels = #[[CHOOSE MOST LIKELY LABEL AS THE PREDICTION FOR EACH OBSERVATION]] 
            correct_predictions += #[[CALCULATE TOTAL NUMBER OF CORRECT PREDICTIONS FOR THE BATCH]]

    test_loss /= len(test_loader.dataset) # calculates average log loss

    print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
        test_loss, correct_predictions, len(test_loader.dataset),
        100. * correct_predictions / len(test_loader.dataset)))
    
# set values to desired
train_batch_size = #[[CHOOSE MINIBATCH SIZE]]
test_batch_size = 1000
epochs = #[[SET NUMBER OF TRAINING EPOCHS]]
learning_rate = #[[SET LEARNING RATE]]

# set data-loaders -- do not change
train_loader = torch.utils.data.DataLoader(train_data, batch_size=train_batch_size)
test_loader = torch.utils.data.DataLoader(test_data, batch_size=test_batch_size)

# train and evaluate your model
model = #[[INITIALIZE YOUR MODEL]]
optimizer = optim.SGD(model.parameters(), lr=learning_rate, momentum=0.9) # do not change
for epoch in range(1, epochs + 1):
    train(model, train_loader, optimizer, epoch)
    test(model, test_loader)

## Bonus question (optional) - softmax cross-entropy loss and its derivatives (15 pts)

This question is not compulsory for full points, but the points you gain here can be used to cover any points you lost in the previous question.


Consider the softmax cross-entropy loss function which you implemented in Question 2.
Formally, recall its definition from the slides, summed over the $N$ observations in the dataset:
$$
\mathrm{loss}
~~=~~
-\sum_{i=1}^N \log{ \underbrace{\left(\frac{\exp(\mathbf{z}_{i}[y_i])}{\sum_{c=1}^{10} \exp(\mathbf{z}_{i}[c])}\right)}_{\text{softmax output}}}
~~=~~
\sum_{i=1}^N \left( -\mathbf{z}_{i}[y_i] + \log{\left( \sum_{c=1}^{10} \exp(\mathbf{z}_{i}[c]) \right)} \right)$$
where $\mathbf{z}_i \in \mathbb{R}^{10}$ is the input to the softmax layer for observation $i$ and the notation $\mathbf{z}_i{[c]}$ denotes the $c$-th entry of vector $\mathbf{z}_i$. The dataset is $\{(\mathbf{x}_i, y_i)\}_{i=1}^N$, where $y_i$ is the correct label for that observation.
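
The equality of the two forms of the loss above can be checked numerically. The sketch below uses random scores and 0-based integer labels (the formula indexes classes from 1) and subtracts the row maximum before exponentiating, a common stabilisation trick that I am assuming here rather than taking from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
N, C = 5, 10
Z = rng.normal(size=(N, C))        # row i plays the role of z_i
y = rng.integers(0, C, size=N)     # integer labels, 0-based

# First form: minus the log of the softmax probability of the correct class.
softmax = np.exp(Z) / np.exp(Z).sum(axis=1, keepdims=True)
loss_a = -np.log(softmax[np.arange(N), y]).sum()

# Second form: -z_i[y_i] + log-sum-exp, with the row max subtracted for stability.
m = Z.max(axis=1, keepdims=True)
logsumexp = m[:, 0] + np.log(np.exp(Z - m).sum(axis=1))
loss_b = (-Z[np.arange(N), y] + logsumexp).sum()

print(loss_a, loss_b)
```

Both expressions give the same value up to floating-point error. The second form is usually preferred in code because it avoids overflow when the scores are large.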

**Your task** is to do the following derivations (no coding needed in this question) and write them below (LaTeX works in the notebook).


1. Given the cross-entropy loss above, compute the derivative of the loss function with respect to the entire minibatch input matrix $\mathbf{Z}$, i.e. a matrix with shape (batch_size, 10). Your answer should be fully vectorized. [5 pts]
$$\frac{\partial loss}{\partial \mathbf{Z}} = ?$$

2. Consider a simple model (input data X --> linear layer --> softmax cross-entropy). Compute the derivatives below, in a fully vectorized and general form with respect to
  * the input $\mathbf{X}$
  $$\frac{\partial loss}{\partial \mathbf{X}} = ?$$
  * the parameters of the linear layer: weights $\mathbf{W}$ and bias $\mathbf{b}$
  $$\frac{\partial loss}{\partial \mathbf{W}} = ?$$
  $$\frac{\partial loss}{\partial \mathbf{b}} = ?$$

