# Classifying cell types with neural networks

In this notebook, we will build a neural network that classifies cell types in the retinal bipolar dataset from Shekhar et al., 2016. These cells have been manually annotated, and we will show that a neural network can recapitulate these cell type labels.

## 1. Imports

In [None]:
!pip install --user scprep

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import tensorflow as tf
import scprep



## 2. Loading the retinal bipolar data

We'll use the same retinal bipolar data you saw in preprocessing and visualization.

In [3]:
scprep.io.download.download_google_drive("1kxsMav1ly_S6pQ1vKeAtlFFW3QVvilz0", "shekhar_data.pkl")
scprep.io.download.download_google_drive("1J4K8bo8Pys-8xayO5vtMK3t5wJ0_TG2Y", "shekhar_clusters.pkl")

In [2]:
data = pd.read_pickle("shekhar_data.pkl")
clusters = pd.read_pickle("shekhar_clusters.pkl")

#### Converting data to `numpy` format

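TensorFlow expects data to be stored as a NumPy array, so we reduce the expression matrix to 100 principal components and convert the result with `to_numpy()`. We also convert the cell type names to integer labels with `pd.factorize`.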

In [3]:
data = scprep.reduce.pca(data, n_components=100, method='dense').to_numpy()
labels, cluster_names = pd.factorize(clusters['CELLTYPE'])

In [6]:
num_classes = len(np.unique(labels))
num_classes

28
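
To make the label encoding concrete, here is a small sketch that mimics what `pd.factorize` returns by default: one integer code per cell, numbered in order of first appearance, plus the array of unique names. The helper `factorize_in_order` and the toy list are hypothetical and only for illustration; the notebook itself uses `pd.factorize` on `clusters['CELLTYPE']`.

```python
import numpy as np

def factorize_in_order(values):
    """Return (codes, uniques), numbering each label by order of first appearance."""
    uniques = []
    index_of = {}
    codes = []
    for value in values:
        if value not in index_of:
            # first time we see this label: give it the next free integer
            index_of[value] = len(uniques)
            uniques.append(value)
        codes.append(index_of[value])
    return np.array(codes), np.array(uniques)

# hypothetical toy labels standing in for clusters['CELLTYPE']
toy_celltypes = ['BC5A', 'BC1B', 'BC6', 'Rod BC', 'BC6', 'BC1B']
toy_labels, toy_names = factorize_in_order(toy_celltypes)

print(toy_labels)             # [0 1 2 3 2 1]
print(toy_names[toy_labels])  # indexing the names with the codes recovers the strings
```

The round trip in the last line is the same trick used later in the notebook to turn integer labels back into cluster names.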

#### Splitting the data into training and validation sets

We'll allocate 80\% of our data for training and 20\% for validation.

In [7]:
# first let's split our data into training and validation sets
train_test_split = int(.8 * data.shape[0])

data_training = data[:train_test_split, :]
labels_training = labels[:train_test_split]
data_validation = data[train_test_split:, :]
labels_validation = labels[train_test_split:]
data_training.shape, data_validation.shape

((15018, 100), (3755, 100))
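
The split above takes the first 80% of rows in file order. If the rows happen to be ordered (for example by batch or by cell type), the validation set may not be representative. Whether that applies to this dataset is an assumption we have not checked. A minimal sketch of a shuffled split, with a hypothetical helper `shuffled_split` and an arbitrary seed:

```python
import numpy as np

def shuffled_split(data, labels, train_fraction=0.8, seed=0):
    """Shuffle rows once, then split into training and validation sets."""
    rng = np.random.RandomState(seed)
    order = rng.permutation(data.shape[0])
    n_train = int(train_fraction * data.shape[0])
    train_idx, val_idx = order[:n_train], order[n_train:]
    # index data and labels with the same indices so rows stay paired
    return data[train_idx], labels[train_idx], data[val_idx], labels[val_idx]

# toy stand-ins for `data` and `labels`
toy_data = np.arange(40, dtype=float).reshape(20, 2)
toy_labels = np.arange(20)
x_train, y_train, x_val, y_val = shuffled_split(toy_data, toy_labels)
print(x_train.shape, x_val.shape)  # (16, 2) (4, 2)
```

Fixing the seed keeps the split reproducible between runs, which makes accuracy numbers easier to compare.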

## 3. Computational graphs

TensorFlow works with an abstract computational graph: we first describe the operations we want, and only later run them on actual data.

In [17]:
# let's make an object in this graph corresponding to our first 10 points
data_tf = tf.constant(data_training[:10, :], dtype=tf.float32)

# and now their corresponding labels
labels_tf = tf.constant(labels_training[:10], dtype=tf.int32)

# look at the output
print(labels_tf)

Tensor("Const_3:0", shape=(10,), dtype=int32)


In [9]:
# compare this to the numpy data we started with
print(labels_training[:10])

[0 1 2 3 2 1 4 0 3 5]


In [10]:
# and now go back to the original cluster names
cluster_names[labels_training[:10]]

Index(['BC5A', 'BC1B', 'BC6', 'Rod BC', 'BC6', 'BC1B', 'BC3B', 'BC5A',
       'Rod BC', 'Muller Glia'],
      dtype='object')

`data_training` is a NumPy array, holding actual numbers.
`data_tf` is a TensorFlow tensor, and is just a set of instructions, i.e. "grab the numbers from this array and make them a constant".

#### Tensorflow's `Session`

TensorFlow tensors are just *instructions* for how to do a computation, not the actual computation itself.
To perform the computation as instructed, we need to start a session and ask for the output by "running" the tensor.
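
To see the difference between describing a computation and running it, here is a toy sketch in plain Python. The `Node` class and the `run` function are hypothetical and far simpler than TensorFlow's real graph and session, but they show the same idea: arithmetic builds up a graph, and nothing is computed until we ask for the result.

```python
class Node:
    """A toy graph node: stores an operation and its inputs, not a value."""
    def __init__(self, op, inputs=(), value=None):
        self.op = op
        self.inputs = inputs
        self.value = value

    def __add__(self, other):
        return Node('add', (self, as_node(other)))

    def __mul__(self, other):
        return Node('mul', (self, as_node(other)))

def as_node(x):
    """Wrap a plain number as a constant node."""
    return x if isinstance(x, Node) else Node('const', value=x)

def run(node):
    """Walk the graph and compute the value, loosely like `sess.run`."""
    if node.op == 'const':
        return node.value
    left, right = (run(i) for i in node.inputs)
    return left + right if node.op == 'add' else left * right

a = as_node(4)
b = a * 10 + 3  # builds a graph; nothing has been computed yet
print(b.op)     # add
print(run(b))   # 43
```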

In [11]:
sess = tf.InteractiveSession()

In [12]:
sess.run(labels_tf)

array([0, 1, 2, 3, 2, 1, 4, 0, 3, 5], dtype=int32)

In [13]:
np.allclose(sess.run(data_tf), data_training[:10])

True

In [14]:
# we can now give instructions for computations on this data and then ask for the output
w = 10 * data_tf + 3
x = w / 2
y = x + w
z = y**2

sess.run(z)

array([[4.55465302e+02, 5.86087031e+04, 4.70691553e+03, 5.71816406e+04,
        2.26201950e+02, 1.89646558e+03, 2.08491040e+03, 9.73422229e-01,
        2.76444971e+03, 2.08067505e+02, 1.80798450e+03, 2.04972794e+02,
        2.91574658e+03, 5.56084277e+03, 1.89451157e+02, 4.93480377e+02,
        8.17330933e+02, 1.20037830e+03, 5.04254181e+02, 3.29946198e+02,
        3.34681976e+02, 1.81121780e+02, 1.18960498e+03, 1.11654945e+02,
        5.72496729e+03, 1.48646570e+03, 1.88747818e+02, 8.97228165e+01,
        1.19735825e+02, 1.76366806e+02, 7.00738342e+02, 3.63830902e+02,
        1.83918190e+01, 7.48567009e+00, 1.55218701e+03, 3.49349976e+03,
        3.63029724e+02, 7.77578857e+02, 1.76109778e+03, 6.54872192e+02,
        4.90719109e+01, 1.01245728e+02, 1.73189758e+02, 2.09980408e+02,
        4.55323877e+03, 2.87891455e+03, 9.27585602e+01, 1.58389636e+03,
        5.67447021e+02, 2.02089600e+03, 3.45625758e+00, 3.18488483e+01,
        1.56000809e+02, 1.67953014e+01, 1.14625403e+03, 2.178075

Note that the output is now a NumPy array holding the value of `z`. We do not have NumPy arrays
corresponding to `w`, `x`, or `y`. If all we want is the output `z`, we don't need them!
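
As a sanity check on the arithmetic, the same chain of operations can be written directly in NumPy. The small `toy` array below is a hypothetical stand-in for `data_training[:10]`. Since `y = x + w = 1.5 * w`, the whole chain collapses to a single expression.

```python
import numpy as np

# hypothetical stand-in for data_training[:10]
toy = np.array([[0.0, 1.0], [-1.0, 2.0]])

w = 10 * toy + 3
x = w / 2
y = x + w
z = y ** 2

# y = 1.5 * w, so z can also be computed in one step
z_direct = (1.5 * (10 * toy + 3)) ** 2

print(z)
print(np.allclose(z, z_direct))  # True
```

In NumPy every intermediate (`w`, `x`, `y`) is computed and stored as soon as its line runs, whereas in the TensorFlow version only the tensor we ask for is returned.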

## Exercise 1 - Print the last 5 rows of the data matrix with their values doubled (using tensorflow operations)

In [60]:
# =================
# Get the last five rows of `data_training`
data_last5 = 
# Create a tensorflow constant storing `data_last5`
tf_last5 = 
# Multiply by two
tf_last5_double =
# Use `sess` to compute the result
data_last5_double =
# Print the result
data_last5_double
# =================


## 4. Building a one-layer neural network

#### Build the network architecture

In [15]:
# this function applies the simple feedforward operation
def layer(x, n_dim, name, activation=None):
    # create the weight matrix
    W = tf.get_variable(dtype=tf.float32, shape=[x.get_shape()[-1], n_dim], name='W{}'.format(name))
    # create the bias vector
    b = tf.get_variable(dtype=tf.float32, shape=[n_dim], name='b{}'.format(name))
    # X2 = X1 * W + b
    output = tf.matmul(x, W) + b
    if activation:
        # nonlinear activation function
        output = activation(output)
    return output

# create a hidden (middle) layer
hidden_layer_tf = layer(data_tf, n_dim=100, name='hidden', activation=tf.nn.relu)

# create the output layer used to classify
output_tf = layer(hidden_layer_tf, n_dim=num_classes, name='output', activation=tf.nn.softmax)



#### Build the loss function

In [18]:
# we need a loss/score to tell our network how good or bad these results are
# let's use cross-entropy like we talked about
labels_one_hot = tf.one_hot(labels_tf, num_classes)

loss_tf = labels_one_hot * tf.log(output_tf + 1e-6) + (1 - labels_one_hot) * tf.log(1 - output_tf + 1e-6)
loss_tf = -1 * tf.reduce_sum(loss_tf)
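
To get a feel for what this loss measures, here is a NumPy sketch of the same formula. Note that it sums a binary cross-entropy term over every class, which is slightly different from the categorical cross-entropy `-sum(one_hot * log(output))`. The helper name and the toy probabilities are hypothetical.

```python
import numpy as np

def summed_binary_cross_entropy(probs, labels, num_classes, eps=1e-6):
    """NumPy version of the loss above: per-class binary cross-entropy, summed."""
    one_hot = np.eye(num_classes)[labels]
    terms = one_hot * np.log(probs + eps) + (1 - one_hot) * np.log(1 - probs + eps)
    return -terms.sum()

labels = np.array([0, 2])
# predictions that put most of the probability on the true class
confident_right = np.array([[0.98, 0.01, 0.01], [0.01, 0.01, 0.98]])
# predictions that put most of the probability on a wrong class
confident_wrong = np.array([[0.01, 0.98, 0.01], [0.98, 0.01, 0.01]])

loss_right = summed_binary_cross_entropy(confident_right, labels, 3)
loss_wrong = summed_binary_cross_entropy(confident_wrong, labels, 3)
print(loss_right, loss_wrong)
```

The loss is close to zero when the network is confidently right and large when it is confidently wrong, which is exactly the signal the optimizer needs.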

#### Create the optimizer and tell it to minimize the loss

In [68]:
# now we need an optimizer that we'll give this loss, and it'll take responsibility
# for updating the network to make this score go down
learning_rate = .00001
opt = tf.train.GradientDescentOptimizer(learning_rate)

# this will be the tf object we call for when we want to take a single step to train our network
train_op = opt.minimize(loss_tf)
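
Conceptually, each call to a gradient descent training step moves every weight a small distance against the gradient of the loss. The sketch below shows that update rule in NumPy on a toy quadratic loss. The function `gradient_descent` and the toy loss are hypothetical, and in TensorFlow the gradients are computed for us automatically.

```python
import numpy as np

def gradient_descent(grad_fn, w_init, learning_rate, n_steps):
    """Repeatedly apply w <- w - learning_rate * gradient."""
    w = np.array(w_init, dtype=float)
    for _ in range(n_steps):
        w = w - learning_rate * grad_fn(w)
    return w

# toy loss: sum((w - target)**2), whose gradient is 2 * (w - target)
target = np.array([3.0, -1.0])
grad_fn = lambda w: 2 * (w - target)

w_one_step = gradient_descent(grad_fn, [0.0, 0.0], learning_rate=0.1, n_steps=1)
w_final = gradient_descent(grad_fn, [0.0, 0.0], learning_rate=0.1, n_steps=100)
print(w_one_step)  # one step moves part of the way towards the target
print(w_final)     # many steps end up very close to the target
```

The learning rate controls the step size: too small and training is slow, too large and the updates can overshoot.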

#### Initialize variables

In [None]:
# last thing: we need to set our network weights to random values to start
sess.run(tf.global_variables_initializer())

...and that's it! We've built a one-layer neural network!
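
To make the forward pass concrete, here is a NumPy sketch of the same architecture: a ReLU hidden layer followed by a softmax output layer. The weights are random, and the names `dense`, `relu` and `softmax` are hypothetical helpers written for this illustration, not TensorFlow functions.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

def softmax(x):
    # subtract the row maximum for numerical stability
    shifted = x - x.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def dense(x, W, b, activation=None):
    """Same operation as `layer`: x * W + b, then an optional nonlinearity."""
    out = x @ W + b
    return activation(out) if activation is not None else out

rng = np.random.RandomState(0)
n_cells, n_features, n_hidden, n_classes = 10, 100, 100, 28
x = rng.normal(size=(n_cells, n_features))  # stand-in for 10 cells x 100 PCs
W1, b1 = rng.normal(scale=0.1, size=(n_features, n_hidden)), np.zeros(n_hidden)
W2, b2 = rng.normal(scale=0.1, size=(n_hidden, n_classes)), np.zeros(n_classes)

hidden = dense(x, W1, b1, activation=relu)
output = dense(hidden, W2, b2, activation=softmax)
print(output.shape)        # (10, 28)
print(output.sum(axis=1))  # each row sums to 1
```

Each row of `output` is a probability distribution over the 28 cell types, so the predicted class is the column with the largest value.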

#### Evaluating network performance

Let's see how our network does at classifying data.

In [None]:
output_np, labels_np = sess.run([output_tf, labels_tf])

output_np

In [76]:
# network outputs
np.argmax(output_np, axis=1)

[24 17 17  4 12  4  9 26 12  7]


In [77]:
# true output labels
labels_np

[0 1 2 3 2 1 4 0 3 5]


In [78]:
# count the number we got right
print('Correct: {} / {}'.format((np.argmax(output_np, axis=1) == labels_np).sum(), output_np.shape[0]))

Correct: 0 / 10
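
Getting zero correct is not surprising for an untrained network: with 28 classes, uniform random guessing is right only 1 time in 28. The counting above can be wrapped in a small helper, sketched here with a hypothetical `accuracy` function and the predictions and labels printed above.

```python
import numpy as np

def accuracy(probs, labels):
    """Fraction of rows whose highest-scoring class matches the label."""
    return float(np.mean(np.argmax(probs, axis=1) == labels))

predicted = np.array([24, 17, 17, 4, 12, 4, 9, 26, 12, 7])
true_labels = np.array([0, 1, 2, 3, 2, 1, 4, 0, 3, 5])

# build one-hot "probabilities" that reproduce the predictions above
probs = np.eye(28)[predicted]

print(accuracy(probs, true_labels))  # 0.0
print(1 / 28)                        # expected accuracy of uniform random guessing
```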


#### Training the network

Here's the important part: we can optimize the weights of the network based on the desired outputs and iterate until we get good performance.

In [80]:
for step in range(1000):
    sess.run(train_op)

    if step % 100 == 0:
        output_np, labels_np = sess.run([output_tf, labels_tf])
        print('Training step {} correct: {} / {}'.format(step, (np.argmax(output_np, axis=1) == labels_np).sum(), output_np.shape[0]))

Training step 0 correct: 0 / 10
Training step 100 correct: 2 / 10
Training step 200 correct: 5 / 10
Training step 300 correct: 5 / 10
Training step 400 correct: 9 / 10
Training step 500 correct: 10 / 10
Training step 600 correct: 10 / 10
Training step 700 correct: 10 / 10
Training step 800 correct: 10 / 10
Training step 900 correct: 10 / 10


Okay, so our network can classify these ten points pretty well. But how can we do this for thousands or millions of points?

### Start again with placeholders so we can use all of the data

The power of tensorflow is that we are able to define computations as we did above, but with 'placeholders' instead of actual data. We just have to define the shape and type of the variable, and then we don't have to give it actual data until we call `sess.run`.

This is powerful because we can call the same computation over and over again with different data without having to rewrite the tensorflow code.

So now let's start over and do it with `tf.placeholder`! Conveniently, we don't have to specify the number of rows in our dataset and can instead just use `None` to indicate this may vary from batch to batch.

#### Build the computational graph

In [None]:
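# close the old session and clear the existing graph so we can start from scratch;
# the new session is attached to the new, empty graph
sess.close()
tf.reset_default_graph()
sess = tf.InteractiveSession()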
batch_size = 10
data_tf = tf.placeholder(shape=[None, data.shape[1]], dtype=tf.float32, name='data_tf')
labels_tf = tf.placeholder(shape=[None], dtype=tf.int32, name='labels_tf')


hidden_layer_tf = layer(data_tf, n_dim=10, name='hidden', activation=tf.nn.relu)

output_tf = layer(hidden_layer_tf, n_dim=num_classes, name='output', activation=tf.nn.softmax)

labels_one_hot = tf.one_hot(labels_tf, num_classes)

loss_tf = labels_one_hot * tf.log(output_tf + 1e-6) + (1 - labels_one_hot) * tf.log(1 - output_tf + 1e-6)
loss_tf = - tf.reduce_sum(loss_tf)

learning_rate = .001
# we'll use the AdamOptimizer, which adapts the step size for each weight
# and usually converges faster than plain gradient descent
opt = tf.train.AdamOptimizer(learning_rate)

train_op = opt.minimize(loss_tf)

sess.run(tf.global_variables_initializer())

#### Train the network

In [None]:
# now let's train our network with new data each step
step = 0
for epoch in range(100):
    random_order = np.random.choice(data_training.shape[0], data_training.shape[0], replace=False)
    data_randomized = data_training[random_order]
    labels_randomized = labels_training[random_order]
    
    for data_batch, labels_batch in zip(np.array_split(data_randomized, data_randomized.shape[0] // batch_size), np.array_split(labels_randomized, labels_randomized.shape[0] // batch_size)):
        step += 1

        sess.run(train_op, {data_tf: data_batch, labels_tf: labels_batch})

        # evaluate accuracy on both the training and validation datasets every once in a while
        if step % 10 == 0:
            loss_np = sess.run(loss_tf, {data_tf: data_batch, labels_tf: labels_batch})
            output_np = []
            labels_np = []
            for data_batch, labels_batch in zip(np.array_split(data_training, data_training.shape[0] // batch_size), np.array_split(labels_training, labels_training.shape[0] // batch_size)):
                output_np_ = sess.run(output_tf, {data_tf: data_batch})
                output_np.append(output_np_)
                labels_np.append(labels_batch)
            output_np = np.concatenate(output_np, axis=0)
            labels_np = np.concatenate(labels_np, axis=0)
            acc_training = (np.argmax(output_np, axis=1) == labels_np).sum() / output_np.shape[0]

            output_np = []
            labels_np = []
            for data_batch, labels_batch in zip(np.array_split(data_validation, data_validation.shape[0] // batch_size), np.array_split(labels_validation, labels_validation.shape[0] // batch_size)):
                output_np_ = sess.run(output_tf, {data_tf: data_batch})
                output_np.append(output_np_)
                labels_np.append(labels_batch)
            output_np = np.concatenate(output_np, axis=0)
            labels_np = np.concatenate(labels_np, axis=0)
            acc_validation = (np.argmax(output_np, axis=1) == labels_np).sum() / output_np.shape[0] 
            print('Step {} loss: {:.3f} training accuracy: {:.3f} validation accuracy: {:.3f} '.format(step, loss_np, acc_training, acc_validation))
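
The shuffling and batching at the top of each epoch can be pulled out into a small generator, which makes the training loop easier to read. This is a sketch with a hypothetical helper `iterate_minibatches`. It uses the same `np.array_split` approach as the loop above, so when the number of rows is not a multiple of `batch_size`, some batches come out slightly larger than `batch_size`.

```python
import numpy as np

def iterate_minibatches(data, labels, batch_size, rng):
    """Yield (data_batch, labels_batch) pairs covering the data once, in random order."""
    order = rng.permutation(data.shape[0])
    n_batches = max(1, data.shape[0] // batch_size)
    for idx in np.array_split(order, n_batches):
        yield data[idx], labels[idx]

rng = np.random.RandomState(0)
# toy stand-ins for data_training and labels_training
toy_data = np.arange(46, dtype=float).reshape(23, 2)
toy_labels = np.arange(23)

batches = list(iterate_minibatches(toy_data, toy_labels, batch_size=10, rng=rng))
batch_sizes = [len(y) for _, y in batches]
print(batch_sizes)  # [12, 11]
```

Inside the training loop, each yielded pair would be fed to the placeholders through the feed dictionary, as in the cell above.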

### Discussion

How did our network do? Is the classification accuracy high? How many iterations did it take for the training accuracy to stop increasing? How many iterations did it take for the training loss to stop decreasing?

#### _Breakpoint_  - once you get here, please help those around you!

## Exercise 3 - network width

Create a network with a wider hidden layer and compare its performance to the network with 10 hidden neurons we just built.

In [None]:
# ===================
# Copy the code from above and change n_dim in the hidden layer from 10 to something larger

# ===================

#### _Breakpoint_  - once you get here, please help those around you!

## Exercise 4

Create a network with *two* hidden layers and compare its performance to the network with one hidden layer we just built.

In [None]:
# ===================
# Copy the code from above and add another hidden layer whose input is the output of the first layer
# The second hidden layer should be used as input to the output layer

# ===================

#### _Breakpoint_  - once you get here, please help those around you!

## Exercise 5

Create a network with *five* hidden layers and compare its performance to the network with one hidden layer we just built.

In [None]:
# ===================
# Copy the code from above and add another three hidden layers
# Chain the output from each layer to the input at the next

# ===================