# Classifying cell types with neural networks

In this notebook, we will build a neural network that classifies cell types in the retinal bipolar dataset for Shekhar et al., 2016. These cells have been manually annotated, and here we will show that a neural network can recapitulate these cell type labels.

## 1. Imports

In [1]:
!pip install --user scprep



In [0]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

import scprep

## 2. Loading the retinal bipolar data

We'll use the same retinal bipolar data you saw in preprocessing and visualization.

In [0]:
scprep.io.download.download_google_drive("1pRYn62SOmmJxwVU0sSW7eBagRL2RJmx0", "shekhar_data.pkl")
scprep.io.download.download_google_drive("1FlNktWuJCka3pXOvNIFfRitGluZy2ftt", "shekhar_clusters.pkl")

In [0]:
data_raw = pd.read_pickle("shekhar_data.pkl")
clusters = pd.read_pickle("shekhar_clusters.pkl")

#### Converting data to `numpy` format

PyTorch expects data to be stored as a NumPy array.

In [0]:
data = scprep.reduce.pca(data_raw, n_components=100, method='dense').to_numpy()
labels, cluster_names = pd.factorize(clusters['CELLTYPE'])

In [6]:
num_classes = len(np.unique(labels))
num_classes

28

#### Splitting the data into training and validation sets

We'll allocate 80\% of our data for training and 20\% for testing. You can also do this with scikit-learn:

```python
from sklearn.model_selection import train_test_split
data_training, data_validation, labels_training, labels_validation = train_test_split(
    data, labels, test_size=0.2)
```

In [7]:
# first let's split our data into training and validation sets
train_test_split = int(.8 * data.shape[0])

data_training = data[:train_test_split, :]
labels_training = labels[:train_test_split]
data_validation = data[train_test_split:, :]
labels_validation = labels[train_test_split:]
data_training.shape, data_validation.shape

((15018, 100), (3755, 100))

## 3. Moving Our Data to PyTorch Tensors 

By moving our data from numpy arrays to PyTorch Tensors, we can take advantage of the variety of tensor operations available. 

In [0]:
train_tensor = torch.from_numpy(data_training)
train_labels = torch.from_numpy(labels_training)

valid_tensor = torch.from_numpy(data_validation)
valid_labels = torch.from_numpy(labels_validation)


let's go ahead and check that our tensors are the expected sizes. We can do this identically to how we've previously done it with numpy arrays

In [9]:
# check shapes
print("train tensor shape: {}".format(train_tensor.shape))
print("train labels shape: {}".format(train_labels.shape))

print("valid tensor shape: {}".format(valid_tensor.shape))
print("valid labels shape: {}".format(valid_labels.shape))


train tensor shape: torch.Size([15018, 100])
train labels shape: torch.Size([15018])
valid tensor shape: torch.Size([3755, 100])
valid labels shape: torch.Size([3755])


## Exercise 1 - Tensor Operations 1

1. Create a tensor called x of values (1,20) using torch.arange(). Check the PyTorch documentation for [help](https://pytorch.org/docs/master/generated/torch.arange.html)

2. Reshape this tensor to shape (4,5)

2. Add the constant 5 to x and save this tensor as y

3. Power the values of y to 3rd power and save this tensor as z

4. Print the first row of z



In [10]:
# create x using torch.arange()
x = 

# reshape to (4,5)
x = 

# add 5
y = 

# power z to the 3rd power
z = 

# print the first row of z
print()

SyntaxError: ignored

## Exercise 2 - Tensor Operations 2

1. Subset the training tensor by taking the last 5 rows

2. Double the values and print the result.


In [0]:
# Get the last five rows of `data_training`
data_last5 = 

# Multiply by two
last5_double =

# Print the result
last5_double

## 4. Building a one-layer neural network

Now we know how to write simple recipes in PyTorch, we can create a more complex instruction set defining a simple neural network with a single hidden layer.

In [0]:
class layer(nn.Module):
    def __init__(self, input_size, output_size, activation=None):
        super(layer, self).__init__()

        self.weight = torch.randn(input_size, hidden_size).double().requires_grad_()
        self.bias = torch.randn(hidden_size).double().requires_grad_()
        self.activation = activation

    def forward(self, x):
        output = torch.matmul(x, self.weight) + self.bias
        output = self.activation(output)
        return output


In [14]:
layer_1 = layer(100, 100, activation=nn.ReLU())
layer_2 = layer(100,num_classes, activation=nn.Softmax())

# create a hidden (middle) layer
hidden_layer = layer_1(train_tensor)

# create the output layer used to classify
output = layer_2(hidden_layer)

  # This is added back by InteractiveShellApp.init_path()


#### Build the loss function

In order to train our neural network, we need to define a loss function which tells us how well (or how poorly) our classifier performed.

Here, we'll use the cross-entropy loss which we discussed in lecture.

In [0]:
def to_one_hot(y_tensor, c_dims):
    """converts a N-dimensional input to a NxC dimnensional one-hot encoding
    """
    y_tensor = torch.LongTensor(y_tensor)
    y_tensor = y_tensor.type(torch.LongTensor).view(-1, 1)
    c_dims = c_dims if c_dims is not None else int(torch.max(y_tensor)) + 1
    y_one_hot = torch.zeros(y_tensor.size()[0], c_dims).scatter_(1, y_tensor, 1)
    y_one_hot = y_one_hot.view(*y_tensor.shape, -1)
    return y_one_hot.squeeze()


In [0]:
# convert our integer class labels to a binary "one-hot" matrix

labels_one_hot = to_one_hot(train_labels, num_classes)

# compute cross entropy
loss = labels_one_hot * torch.log(output+ 1e-6) + (1 - labels_one_hot) * torch.log(1 - output + 1e-6)
loss = -1 * loss.sum()

#### Create the optimizer

PyTorch does all of the heavy lifting for us. The optimizer takes the loss value and calculates how we should change the network weights to improve our results.  

**Note Dan/Scott/Matt**: I am guessing that different optimizers will be discussed in the lectures.  If not the choice in the next code block will not make a lot of sense.

In [0]:
# now we need an optimizer that we'll give this loss, and it'll take responsibility
# for updating the network to make this score go down
learning_rate = 0.00001

optimizer = optim.SGD([layer_1.weight, layer_1.bias,
                       layer_2.weight, layer_2.bias],
                       lr=learning_rate)


# how many data points do we want to calculate at once?
batch_size = 10

#### Train the network

Let's train the network for 100 _epochs_. An epoch is defined as having optimized our weights over all of our data points exactly once.

In [28]:
# train the network for 100 epochs
step = 0
for epoch in range(100):
    # randomize the order in which we see the data in each epoch
    random_order_indices = np.random.choice(train_tensor.shape[0], train_tensor.shape[0], replace=False)
    
    # iterate through the data in batches of size `batch_size`
    for batch_indices in np.array_split(random_order_indices, random_order_indices.shape[0] // batch_size):
      
        train_data_batch = train_tensor[batch_indices]
        train_labels_batch = train_labels[batch_indices]
        train_onehot = to_one_hot(train_labels_batch, num_classes)

        step += 1

        # get pass batch through layers
        hidden_layer = layer_1(train_data_batch)
        output = layer_2(hidden_layer)

        # compute cross entropy
        loss = train_onehot * torch.log(output+ 1e-6) + (1 - train_onehot) * torch.log(1 - output + 1e-6)
        loss = -1 * loss.sum()

        # backpropagate the loss
        loss.backward()

        # update parameters
        optimizer.step()

        # reset gradients
        optimizer.zero_grad()

        # evaluate accuracy on both the training and validation datasets every 50 steps
        if step % 50 == 0:

          # don't track gradients
          with torch.no_grad():

              # compute the predicted outputs
              train_prediction = output.argmax(1).numpy()

              # compute the accuracy over the batch
              acc_training = np.mean(train_prediction == train_labels_batch.numpy())

              # compute the loss on all the validation data
              loss_np = []
              output_np = []
              labels_np = []

              random_order_indices = np.random.choice(valid_tensor.shape[0], valid_tensor.shape[0], replace=False)
              
              for batch_indices in np.array_split(random_order_indices, random_order_indices.shape[0] // batch_size):
                  valid_data_batch = valid_tensor[batch_indices]
                  valid_labels_batch = valid_labels[batch_indices]
                    
                  # pass through layers
                  valid_hidden = layer_1(valid_data_batch)
                  valid_output = layer_2(valid_hidden)

                  # compute the predicted outputs

                  prediction_np = valid_output.argmax(1).numpy()

                  output_np = np.concatenate(prediction_np.reshape(-1,1), axis=0)
                  labels_np = np.concatenate(valid_labels_batch.numpy().reshape(-1,1), axis=0)


              # compute the accuracy over the whole dataset
              acc_validation = np.mean(output_np == labels_np)
              
              print('Step {} loss: {:.3f} training accuracy: {:.3f} validation accuracy: {:.3f} '.format(
                  step, loss.item(), acc_training, acc_validation))
          

  # This is added back by InteractiveShellApp.init_path()


Step 50 loss: 237.345 training accuracy: 0.100 validation accuracy: 0.000 
Step 100 loss: 248.679 training accuracy: 0.100 validation accuracy: 0.100 
Step 150 loss: 276.310 training accuracy: 0.000 validation accuracy: 0.000 
Step 200 loss: 248.679 training accuracy: 0.100 validation accuracy: 0.000 
Step 250 loss: 261.845 training accuracy: 0.000 validation accuracy: 0.200 
Step 300 loss: 269.399 training accuracy: 0.000 validation accuracy: 0.100 
Step 350 loss: 259.555 training accuracy: 0.000 validation accuracy: 0.100 
Step 400 loss: 185.768 training accuracy: 0.300 validation accuracy: 0.100 
Step 450 loss: 266.108 training accuracy: 0.000 validation accuracy: 0.000 
Step 500 loss: 240.138 training accuracy: 0.100 validation accuracy: 0.000 
Step 550 loss: 241.397 training accuracy: 0.100 validation accuracy: 0.100 
Step 600 loss: 238.543 training accuracy: 0.100 validation accuracy: 0.000 
Step 650 loss: 177.536 training accuracy: 0.300 validation accuracy: 0.100 
Step 700 loss

KeyboardInterrupt: ignored

### Discussion

How did our network do? Is the classification accuracy high? How many iterations did it take for the training accuracy to stop increasing? How many iterations did it take for the training loss to stop decreasing?

#### _Breakpoint_  - once you get here, please help those around you!

## Exercise 3 - network width

Create a network with a wider hidden layer and compare its performance to the network with 10 hidden neurons we just built

#### _Breakpoint_  - once you get here, please help those around you!

## Exercise 4

Create a network with *two* hidden layers and compare its performance to the network with one hidden layer we just built

#### _Breakpoint_  - once you get here, please help those around you!

## Exercise 5

Create a network with *five* hidden layers and compare its performance to the network with one hidden layer we just built

#### Re-Cap
1. Power of PyTorch is to allow us to setup the neural networks using nn.Module

2. WE can use the same neural network over and over with different data without having to re-write the code.