# Classifying cell types with neural networks

In this notebook, we will build a neural network that classifies cell types in the retinal bipolar dataset from Shekhar et al., 2016. These cells have been manually annotated, and here we will show that a neural network can recapitulate these cell type labels.

## 1. Imports

In [None]:
!pip install --user scprep

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import tensorflow as tf
import scprep

## 2. Loading the retinal bipolar data

We'll use the same retinal bipolar data you saw in preprocessing and visualization.

In [None]:
scprep.io.download.download_google_drive("1pRYn62SOmmJxwVU0sSW7eBagRL2RJmx0", "shekhar_data.pkl")
scprep.io.download.download_google_drive("1FlNktWuJCka3pXOvNIFfRitGluZy2ftt", "shekhar_clusters.pkl")

In [None]:
data_raw = pd.read_pickle("shekhar_data.pkl")
clusters = pd.read_pickle("shekhar_clusters.pkl")

#### Converting data to `numpy` format

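Tensorflow expects data to be stored as a NumPy array. We first reduce the data to 100 principal components with PCA, then convert the result to NumPy. We also convert the cell type names to integer labels.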

In [None]:
data = scprep.reduce.pca(data_raw, n_components=100, method='dense').to_numpy()
labels, cluster_names = pd.factorize(clusters['CELLTYPE'])

In [None]:
num_classes = len(np.unique(labels))
num_classes

#### Splitting the data into training and validation sets

We'll allocate 80\% of our data for training and 20\% for validation. You can also do this with scikit-learn:

```python
from sklearn.model_selection import train_test_split
data_training, data_validation, labels_training, labels_validation = train_test_split(
    data, labels, test_size=0.2)
```

In [None]:
# first let's split our data into training and validation sets
train_test_split = int(.8 * data.shape[0])

data_training = data[:train_test_split, :]
labels_training = labels[:train_test_split]
data_validation = data[train_test_split:, :]
labels_validation = labels[train_test_split:]
data_training.shape, data_validation.shape
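
The slice above keeps the rows in their stored order, so the validation set is simply the last 20\% of cells. If the rows happen to be ordered (for example by batch or cluster), it may help to shuffle first. The following is a minimal sketch of a shuffled split, assuming toy arrays in place of the real data; `toy_data`, `toy_labels`, and the seed are hypothetical and chosen only for illustration.

```python
import numpy as np

# Toy stand-ins for `data` and `labels` (hypothetical values, for illustration only)
rng = np.random.default_rng(42)
toy_data = rng.normal(size=(50, 4))
toy_labels = rng.integers(0, 3, size=50)

# Shuffle the row indices once, then slice the shuffled order 80/20
shuffled = rng.permutation(toy_data.shape[0])
n_train = int(0.8 * toy_data.shape[0])
train_idx, val_idx = shuffled[:n_train], shuffled[n_train:]

# Index data and labels with the same indices so they stay aligned
toy_data_training, toy_labels_training = toy_data[train_idx], toy_labels[train_idx]
toy_data_validation, toy_labels_validation = toy_data[val_idx], toy_labels[val_idx]

print(toy_data_training.shape, toy_data_validation.shape)
```

Indexing both arrays with the same shuffled indices keeps every cell paired with its own label.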

## 3. Computational graphs

Tensorflow works with an abstract computational graph. Let's create some simple operations with the first ten data points.

In [None]:
# let's make an object in this graph corresponding to our first 10 points
data_tf = tf.constant(data_training[:10, :], dtype=tf.float32)

# and now their corresponding labels
labels_tf = tf.constant(labels_training[:10], dtype=tf.int32)

# look at the output
print(labels_tf)

In [None]:
# compare this to the numpy data we started with
print(labels_training[:10])

In [None]:
# and now go back to the original cluster names
cluster_names[labels_training[:10]]

Note that `labels_training` is a NumPy array containing actual numbers, each corresponding to an entry in `cluster_names`. `labels_tf`, on the other hand, is a Tensorflow tensor, and is **just a set of instructions**. In this case, the instructions are extremely simple: "take the values from this array and make them into a constant".

#### Tensorflow's `Session`

Tensorflow tensors such as `labels_tf` are just **instructions** for how to do computation, *not* the actual computations themselves.

In order to perform the computations as instructed, we need to start a computation session and ask it to generate the output by using `Session.run()`.

For more information on computational graphs and session, see this [blog post](https://www.easy-tensorflow.com/tf-tutorials/basics/graph-and-session).

In [None]:
sess = tf.InteractiveSession()

In [None]:
sess.run(labels_tf)

When we take the *instructions* in `data_tf` and `labels_tf` and run them with `sess.run()`, we get back the values encoded by those instructions.

In [None]:
np.allclose(sess.run(data_tf), data_training[:10])
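
The idea of "instructions first, values later" does not depend on Tensorflow. As a rough analogy only, the sketch below builds a tiny deferred computation in plain Python. The `Deferred` class and `constant` helper are hypothetical names invented for this illustration and are not part of any library.

```python
class Deferred:
    """A hypothetical stand-in for a graph node: it stores a recipe, not a value."""

    def __init__(self, fn, *inputs):
        self.fn = fn
        self.inputs = inputs

    def run(self):
        # Evaluate the inputs first, then apply this node's own operation
        values = [node.run() for node in self.inputs]
        return self.fn(*values)


def constant(value):
    # A node with no inputs that simply returns the stored value
    return Deferred(lambda: value)


a = constant(3.0)
b = Deferred(lambda v: 10 * v + 3, a)  # nothing is computed here
c = Deferred(lambda v: v / 2, b)       # still nothing is computed

print(type(c).__name__)  # Deferred: an instruction, not a number
print(c.run())           # 16.5: the value appears only when we run the recipe
```

Here `run()` plays a role loosely similar to `sess.run()`: it walks back through the inputs and only then produces a number.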

#### Writing mathematical recipes

We can think of each instruction as a step in a recipe. When we `run` the instructions, the Tensorflow `Session` executes the recipe with the data it has been given.

In [None]:
# we can now give instructions for computations on this data and then ask for the output
w = 10 * data_tf + 3
x = w / 2
y = x + w
z = y**2

sess.run(z)

Note that the output of `sess.run(z)` is a NumPy array that holds the value of `z` after execution. We do not, however, have any NumPy arrays corresponding to `w`, `x`, or `y`.

### Discussion
1. How could we find the value of `w`, `x` or `y`?
2. Can you think of an advantage of writing these computations as instructions, rather than storing all of the intermediate values?

## Exercise 1 - Computational graphs for arithmetic

Print the last 5 rows of the data matrix with their values doubled (using tensorflow operations).

In [None]:
# =================
# Get the last five rows of `data_training`
data_last5 = 
# Create a `tf.constant` storing `data_last5`
tf_last5 = 
# Multiply by two
tf_last5_double =
# Use `sess.run()` to compute the result
data_last5_double =
# Print the result
data_last5_double
# =================

## 4. Building a one-layer neural network

Now that we know how to write simple recipes in Tensorflow, we can create a more complex instruction set defining a simple neural network with a single hidden layer.

#### Build the network architecture

In [None]:
# this function applies the simple feedforward operation
def layer(x, n_dim, name, activation=None):
    # create the weight matrix
    W = tf.get_variable(dtype=tf.float32, shape=[x.get_shape()[-1], n_dim], name='W{}'.format(name))
    # create the bias vector
    b = tf.get_variable(dtype=tf.float32, shape=[n_dim], name='b{}'.format(name))
    # X2 = X1 * W + b
    output = tf.matmul(x, W) + b
    if activation:
        # nonlinear activation function
        output = activation(output)
    return output

# create a hidden (middle) layer
hidden_layer_tf = layer(data_tf, n_dim=100, name='hidden', activation=tf.nn.relu)

# create the output layer used to classify
output_tf = layer(hidden_layer_tf, n_dim=num_classes, name='output', activation=tf.nn.softmax)

The output of this instruction set, `output_tf`, is a Tensorflow *instruction* encoding the entire sequence of mathematical operations needed to get from the input of the neural network to the output. It has not yet computed any values!
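
To make the arithmetic inside `layer` concrete, here is a minimal NumPy sketch of the same forward pass: a matrix multiply plus bias, followed by ReLU in the hidden layer and softmax in the output layer. The weights are random toy values, and the choice of 5 output classes is an arbitrary assumption for illustration rather than the number of cell types in the dataset.

```python
import numpy as np

def relu(x):
    # ReLU keeps positive values and sets negative values to zero
    return np.maximum(x, 0)

def softmax(x):
    # Subtract the row maximum for numerical stability, then normalize each row
    shifted = x - x.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
toy_input = rng.normal(size=(10, 100))             # 10 cells, 100 features
W_hidden = rng.normal(scale=0.1, size=(100, 100))  # toy hidden-layer weights
b_hidden = np.zeros(100)
W_output = rng.normal(scale=0.1, size=(100, 5))    # 5 classes, chosen arbitrarily
b_output = np.zeros(5)

# Same structure as layer(): X2 = activation(X1 * W + b)
hidden = relu(toy_input @ W_hidden + b_hidden)
output = softmax(hidden @ W_output + b_output)

print(output.shape)        # (10, 5)
print(output.sum(axis=1))  # each row sums to 1, up to floating point error
```

Because softmax normalizes each row, every row of the output can be read as a set of class probabilities for one cell.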

**Note Dan/Scott/Matt**: Are we going to explain `relu` and the other activation functions in the talks? At least the functions used in the notebook(s) should be emphasized a bit more.

In [None]:
output_tf

#### Build the loss function

In order to train our neural network, we need to define a loss function which tells us how well (or how poorly) our classifier performed.

Here, we'll use the cross-entropy loss which we discussed in lecture.

In [None]:
# convert our integer class labels to a binary "one-hot" matrix
labels_one_hot = tf.one_hot(labels_tf, num_classes)

# compute cross entropy
loss_tf = labels_one_hot * tf.log(output_tf + 1e-6) + (1 - labels_one_hot) * tf.log(1 - output_tf + 1e-6)
loss_tf = -1 * tf.reduce_sum(loss_tf)
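
The loss above applies a binary cross-entropy term to every entry of the one-hot matrix and sums the result. The sketch below reproduces that formula in NumPy on two hand-written toy predictions, to show that confident correct outputs give a much smaller loss than confident wrong ones. The function name and the toy probabilities are assumptions made for this illustration.

```python
import numpy as np

def summed_binary_cross_entropy(probs, labels, num_classes, eps=1e-6):
    # One-hot encode the integer labels: row i has a 1 in column labels[i]
    one_hot = np.eye(num_classes)[labels]
    # Same per-entry terms as `loss_tf`, summed over all cells and classes
    terms = one_hot * np.log(probs + eps) + (1 - one_hot) * np.log(1 - probs + eps)
    return -terms.sum()

toy_labels = np.array([0, 2])

# Predictions that put most probability on the correct class
confident_right = np.array([[0.98, 0.01, 0.01],
                            [0.01, 0.01, 0.98]])
# Predictions that put most probability on a wrong class
confident_wrong = np.array([[0.01, 0.98, 0.01],
                            [0.98, 0.01, 0.01]])

loss_right = summed_binary_cross_entropy(confident_right, toy_labels, 3)
loss_wrong = summed_binary_cross_entropy(confident_wrong, toy_labels, 3)

print(loss_right < loss_wrong)  # True
```

The small `eps` term mirrors the `1e-6` in the Tensorflow code and avoids taking the logarithm of zero.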

#### Create the optimizer and tell it to minimize the loss

Tensorflow does all of the heavy lifting for us. The _optimizer_ takes the loss value and calculates how we should change the network weights to improve our results.  

**Note Dan/Scott/Matt**: I am guessing that different optimizers will be discussed in the lectures. If not, the choice in the next code block will not make a lot of sense.

In [None]:
# now we need an optimizer that we'll give this loss, and it'll take responsibility
# for updating the network to make this score go down
learning_rate = 0.00001
opt = tf.train.GradientDescentOptimizer(learning_rate)

# this will be the tf instruction we call for when we want to take a single step to train our network
train_op = opt.minimize(loss_tf)
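
To see what a single optimizer step does, it can help to run plain gradient descent by hand on a problem with one parameter. This is a simplified sketch, not what Tensorflow does internally: the toy loss, its hand-written gradient, and the learning rate of 0.1 are all assumptions chosen so the example converges quickly.

```python
def toy_loss(w):
    # A toy one-parameter loss with its minimum at w = 3
    return (w - 3.0) ** 2

def toy_loss_gradient(w):
    # Derivative of the toy loss with respect to w
    return 2.0 * (w - 3.0)

toy_learning_rate = 0.1  # hypothetical value chosen for this toy problem
toy_w = 0.0              # starting guess

for _ in range(100):
    # One gradient descent step: move against the gradient
    toy_w = toy_w - toy_learning_rate * toy_loss_gradient(toy_w)

print(toy_w)            # very close to 3, the minimum of the toy loss
print(toy_loss(toy_w))  # very close to 0
```

In the network, Tensorflow computes the gradient for every weight automatically, and each call to `train_op` applies one such update to all of them.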

#### Initialize variables

In [None]:
# last thing: we need to set our network weights to random values to start
sess.run(tf.global_variables_initializer())

...and that's it! We've built a one-layer neural network!

#### Evaluating network performance

Let's see how our network does at classifying data.

In [None]:
output_np, labels_np = sess.run([output_tf, labels_tf])

output_np

How do we convert this output matrix into a classification? For each row, we take the index of the column with the largest value. This is the network's best guess for the label of that data point.

In [None]:
# network outputs
np.argmax(output_np, axis=1)

How do these compare to the correct answers?

In [None]:
# true output labels
labels_np

Doesn't look great. We can calculate this rather than having to eyeball the data each time.

In [None]:
# count the number we got right
print('Correct: {} / {}'.format((np.argmax(output_np, axis=1) == labels_np).sum(), output_np.shape[0]))

Okay, so we're not doing well yet. But here's the power of neural networks - we'll update the weights based on our performance until we start doing well!

#### Training the network

Here's the important part: we can optimize the weights of the network based on the desired outputs and iterate until we get good performance.

In [None]:
# let's take 1000 gradient steps
for step in range(1000):
    # run the instruction telling tf to take one step
    sess.run(train_op)

    if step % 100 == 0:
        # print the performance every 100 steps
        output_np, labels_np = sess.run([output_tf, labels_tf])
        print('Training step {} correct: {} / {}'.format(step, (np.argmax(output_np, axis=1) == labels_np).sum(), output_np.shape[0]))

Okay, so our network can classify these ten points pretty well. But how can we do this for thousands or millions of points?

### Start again with placeholders so we can use all of the data

The power of tensorflow is that we are able to define computations as we did above, but with 'placeholders' instead of actual data. We just have to define the shape and type of the variable, and then we don't have to give it actual data until we call `sess.run`.

This is powerful because we can call the same computation over and over again with different data without having to rewrite the tensorflow code.

So now let's start over and do it with `tf.placeholder`! Conveniently, we don't have to specify the number of rows in our dataset and can instead just use `None` to indicate this may vary from batch to batch.

For more information on placeholders, check out this [tutorial](https://databricks.com/tensorflow/placeholders).
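
One loose way to think about a placeholder is as the argument of a function: the computation is written once, and the data is supplied at call time. The NumPy sketch below illustrates this analogy only; `toy_forward`, `toy_W`, and the batch sizes are hypothetical and not part of the notebook's network.

```python
import numpy as np

rng = np.random.default_rng(1)
toy_W = rng.normal(size=(100, 3))  # fixed weights, shared across every call

def toy_forward(data_batch):
    # `data_batch` plays the role of the placeholder: any number of rows,
    # but always 100 columns, much like shape=[None, 100]
    return data_batch @ toy_W

small_batch = rng.normal(size=(10, 100))
large_batch = rng.normal(size=(256, 100))

print(toy_forward(small_batch).shape)  # (10, 3)
print(toy_forward(large_batch).shape)  # (256, 3)
```

The same function handles both batches because only the number of columns is fixed, which is the role `None` plays in the placeholder's shape.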

#### Build the computational graph

In [None]:
# close the old session before clearing the graph
sess.close()

# clear out all of the existing instructions and start again
tf.reset_default_graph()

# start a new session on the fresh graph
sess = tf.InteractiveSession()

# how many data points do we want to calculate at once?
batch_size = 10

# create a placeholder which we can fill with data
data_tf = tf.placeholder(shape=[None, data.shape[1]], dtype=tf.float32, name='data_tf')
# and a placeholder for the corresponding labels
labels_tf = tf.placeholder(shape=[None], dtype=tf.int32, name='labels_tf')

# create the instructions to calculate the middle (hidden) layer
hidden_layer_tf = layer(data_tf, n_dim=10, name='hidden', activation=tf.nn.relu)

# create the instructions to calculate the output layer
output_tf = layer(hidden_layer_tf, n_dim=num_classes, name='output', activation=tf.nn.softmax)

# convert our numerical cluster labels to a matrix
labels_one_hot = tf.one_hot(labels_tf, num_classes)

# compute the cross entropy
loss_tf = labels_one_hot * tf.log(output_tf + 1e-6) + (1 - labels_one_hot) * tf.log(1 - output_tf + 1e-6)
loss_tf = - tf.reduce_sum(loss_tf)

# build the optimizer
learning_rate = 0.001
# we'll use the AdamOptimizer, which adapts the step size for each weight
# and often converges faster than plain gradient descent
opt = tf.train.AdamOptimizer(learning_rate)

# and finally the instruction to tell tf to modify the weights to minimize the loss
train_op = opt.minimize(loss_tf)

# and finally initialize everything!
sess.run(tf.global_variables_initializer())

#### Train the network

Let's train the network for 100 _epochs_. An epoch is defined as having optimized our weights over all of our data points exactly once.

In [None]:
# train the network for 100 epochs
step = 0
for epoch in range(100):
    # randomize the order in which we see the data in each epoch
    random_order_indices = np.random.choice(data_training.shape[0], data_training.shape[0], replace=False)
    
    # iterate through the data in batches of size `batch_size`
    for batch_indices in np.array_split(random_order_indices, random_order_indices.shape[0] // batch_size):
        data_batch = data_training[batch_indices]
        labels_batch = labels_training[batch_indices]
        step += 1

        # update the weights to minimize the loss on this batch
        _, loss_training = sess.run([train_op, loss_tf], {data_tf: data_batch, labels_tf: labels_batch})

        # evaluate accuracy on both the training and validation datasets every 50 steps
        if step % 50 == 0:
            # compute the accuracy on the training batch
            # compute the predicted outputs
            output_np = sess.run(output_tf, {data_tf: data_batch})
            # store the maximum index of each row (the prediction)
            prediction_np = np.argmax(output_np, axis=1)
            # compute the accuracy over the batch
            acc_training = np.mean(prediction_np == labels_batch)

            # compute the accuracy on all the validation data
            output_np = []
            labels_np = []
            random_order_indices = np.random.choice(data_validation.shape[0], data_validation.shape[0], replace=False)
            for batch_indices in np.array_split(random_order_indices, random_order_indices.shape[0] // batch_size):
                data_batch = data_validation[batch_indices]
                labels_batch = labels_validation[batch_indices]
                # compute the predicted outputs of each batch
                output_np_ = sess.run(output_tf, {data_tf: data_batch})
                # store the maximum index of each row (the prediction)
                output_np.append(np.argmax(output_np_, axis=1))
                # store the true labels
                labels_np.append(labels_batch)

            output_np = np.concatenate(output_np, axis=0)
            labels_np = np.concatenate(labels_np, axis=0)
            # compute the accuracy over the whole dataset
            acc_validation = np.mean(output_np == labels_np)
            print('Step {} loss: {:.3f} training accuracy: {:.3f} validation accuracy: {:.3f} '.format(
                step, loss_training, acc_training, acc_validation))
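
The batching logic in the loop above can be checked in isolation. The following sketch, using a hypothetical helper `split_into_minibatches` and a toy dataset size of 103, shows that shuffling the indices and splitting them with `np.array_split` visits every row exactly once per epoch, even when the dataset size is not a multiple of `batch_size`.

```python
import numpy as np

def split_into_minibatches(n_samples, batch_size, rng):
    # Shuffle all row indices, then split them into roughly equal batches
    order = rng.permutation(n_samples)
    n_batches = max(n_samples // batch_size, 1)
    return np.array_split(order, n_batches)

rng = np.random.default_rng(0)
batches = split_into_minibatches(n_samples=103, batch_size=10, rng=rng)

seen = np.concatenate(batches)
print(len(batches))                                   # 10
print(sorted(set(len(b) for b in batches)))           # [10, 11]
print(np.array_equal(np.sort(seen), np.arange(103)))  # True: each row seen once
```

Note that `np.array_split` spreads the leftover rows across the first few batches, so some batches hold one more row than `batch_size`.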

### Discussion

How did our network do? Is the classification accuracy high? How many iterations did it take for the training accuracy to stop increasing? How many iterations did it take for the training loss to stop decreasing?

#### _Breakpoint_  - once you get here, please help those around you!

## Exercise 3 - network width

Create a network with a wider hidden layer and compare its performance to the network with 10 hidden neurons we just built.

In [None]:
# reset everything: close the old session, clear the graph, then start a new session
sess.close()
tf.reset_default_graph()
sess = tf.InteractiveSession()

# ===================
# Copy the code from above for both building the graph and training the network
# Change n_dim in the hidden layer from 10 to something larger

# ===================

#### _Breakpoint_  - once you get here, please help those around you!

## Exercise 4

Create a network with *two* hidden layers and compare its performance to the network with one hidden layer we just built.

In [None]:
# reset everything: close the old session, clear the graph, then start a new session
sess.close()
tf.reset_default_graph()
sess = tf.InteractiveSession()

# ===================
# Copy the code from above and add another hidden layer whose input is the output of the first layer
# The second hidden layer should be used as input to the output layer

# ===================

#### _Breakpoint_  - once you get here, please help those around you!

## Exercise 5

Create a network with *five* hidden layers and compare its performance to the network with one hidden layer we just built.

In [None]:
# reset everything: close the old session, clear the graph, then start a new session
sess.close()
tf.reset_default_graph()
sess = tf.InteractiveSession()

# ===================
# Copy the code from above and add another three hidden layers
# Chain the output from each layer to the input at the next

# ===================

#### Re-Cap
1. The power of TensorFlow is that it lets us set up neural networks using placeholders (`tf.placeholder`) instead of fixed data.
2. We can use the same neural network over and over with different data without having to rewrite the code.