# mnist

following along with [this](https://www.tensorflow.org/tutorials/layers). I'll be executing what I want in this directory, but also (as instructed) following along while building `cnn_mnist.py` in the neighboring directory

In [1]:
import importlib

import numpy as np
import tensorflow as tf

import cnn_mnist
import utils



In [2]:
tf.logging.set_verbosity(tf.logging.INFO)

## getting started

they do a few things worth noting:

+ they use the alias `tf`, common across `tensorflow` scripts
+ they set the `logging` for `tf` to `INFO` level
+ they add a dunder-main block calling `tf.app.run()`

## intro to convolutional neural networks

they break a cnn into three "components":

1. convolutional layers, filters which summarize regions of data
    1. refer to `relu` as a way of introducing nonlinearities
1. pooling layers, downsample to reduce dimensionality
    1. they say it's to reduce processing time, implying it's not *a priori* desirable (not sure if it is tbh)
1. dense layers
    1. these take filter/pool features and condense them to predictions/classifications
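
to make the first component concrete, here is a minimal pure-python sketch (no `tensorflow`) of a single filter sliding over a toy image, followed by `relu`. the 4x4 image, the 2x2 edge filter, and the helper names are all made up for illustration; a real conv layer learns its filter values, and it assumes nothing about square inputs the way this sketch does.

```python
def relu(x):
    """rectified linear unit: negative values become 0."""
    return max(0.0, x)


def conv2d_valid(image, kernel):
    """slide a square `kernel` over a square `image` (lists of lists), no padding, stride 1."""
    k = len(kernel)
    out_size = len(image) - k + 1
    out = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            # weighted sum of the k x k region whose top-left corner is (i, j)
            total = sum(
                image[i + di][j + dj] * kernel[di][dj]
                for di in range(k)
                for dj in range(k)
            )
            row.append(relu(total))
        out.append(row)
    return out


# toy 4x4 "image" with a vertical edge between columns 1 and 2
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# hypothetical 2x2 filter that responds to dark-to-light vertical edges
kernel = [
    [-1, 1],
    [-1, 1],
]

feature_map = conv2d_valid(image, kernel)
for row in feature_map:
    print(row)
```

the only non-zero column in the output is the one where the filter sits on top of the edge, which is the "summarize a region" idea in miniature.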

## building the cnn mnist classifier

architecture:

1. conv, 32 5x5, relu
1. pooling, 2x2, stride 2
1. conv, 64 5x5, relu
1. pooling, 2x2, stride 2
1. dense, 1024 nodes, dropout 0.4
1. dense, 10, logits for predicted proba
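
as a sanity check on that architecture, here is a rough pure-python trace of the tensor shapes through each layer. it assumes `same` padding with stride 1 for the conv layers and `valid` padding for the pools (which is what the tutorial code further down uses); the helper names are hypothetical and `None` stands in for the unknown batch size.

```python
import math


def conv_same(shape, filters, stride=1):
    """conv with 'same' padding: spatial dims become ceil(dim / stride)."""
    batch, height, width, _ = shape
    return (batch, math.ceil(height / stride), math.ceil(width / stride), filters)


def pool_valid(shape, pool_size=2, stride=2):
    """pooling with 'valid' padding: only full windows count, channels are untouched."""
    batch, height, width, channels = shape
    return (
        batch,
        (height - pool_size) // stride + 1,
        (width - pool_size) // stride + 1,
        channels,
    )


shapes = {"input": (None, 28, 28, 1)}
shapes["conv1"] = conv_same(shapes["input"], filters=32)
shapes["pool1"] = pool_valid(shapes["conv1"])
shapes["conv2"] = conv_same(shapes["pool1"], filters=64)
shapes["pool2"] = pool_valid(shapes["conv2"])

# the dense layers work on flattened records, so height/width/channels collapse
_, height, width, channels = shapes["pool2"]
flat_size = height * width * channels
shapes["dense"] = (None, 1024)
shapes["logits"] = (None, 10)

for name, shape in shapes.items():
    print(name, shape)
print("flattened size going into dense:", flat_size)
```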

this is assisted by the `tf.layers` module, specifically

In [3]:
tf.layers.conv2d?

In [4]:
tf.layers.max_pooling2d?

In [5]:
tf.layers.dense?

each is `tensor --> tensor`, so we add ops to the graph by passing each layer's output into the next layer.

here's the full code:

In [12]:
importlib.reload(cnn_mnist)

cnn_mnist.cnn_model_fn??

In [7]:
tf.estimator.Estimator?

let's take a deep dive into the code for each layer

### input layer

the first argument to `cnn_model_fn` is the feature collection, which can either be a single tensor or a dictionary of tensors. in this instance we are assuming (in code) that it is a dictionary (see `features["x"]`). whatever shape the input tensor is, we want to reshape it to have a `[batch_size, image_height, image_width, channels]` shape (this is what is required for the 2d convolutional and pooling layers). `mnist` images are monocolored (`channels = 1`) and are 28 x 28 pixels (`image_{height,width} = 28`)

we also use the automatic shape calculation sentinel `-1` so that we don't need to know the `batch_size` ahead of time

```python
input_layer = tf.reshape(features['x'], [-1, 28, 28, 1])
```
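
the `-1` bookkeeping is simple enough to sketch by hand. this is a pure-python stand-in, not what `tf.reshape` actually runs, and `resolve_shape` is a made-up helper: it fills in the one unknown dimension from the total element count. it assumes each flattened mnist record has 28 * 28 = 784 values.

```python
import math


def resolve_shape(num_elements, shape):
    """replace the single -1 in `shape` with whatever size makes the element counts match."""
    if shape.count(-1) != 1:
        raise ValueError("this sketch expects exactly one -1 in shape")
    known = math.prod(d for d in shape if d != -1)
    if num_elements % known != 0:
        raise ValueError("cannot fit %d elements into %r" % (num_elements, shape))
    return [num_elements // known if d == -1 else d for d in shape]


# a batch of 100 flattened images holds 100 * 784 values
print(resolve_shape(100 * 784, [-1, 28, 28, 1]))  # [100, 28, 28, 1]
```

if the element count doesn't divide evenly (say, a record with 785 values), there is no valid batch size and the sketch raises.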

### convolutional layer #1

this is pretty straightforward thanks to `tf.layers.conv2d`

In [15]:
tf.layers.conv2d?

we wish to make 32 5x5 filters with `relu` activation and padding, so:

```python
conv1 = tf.layers.conv2d(
    inputs=input_layer,
    filters=32,
    kernel_size=[5, 5],
    padding="same",
    activation=tf.nn.relu
)
```

#### `conv1` output size

we use `same` padding, so the height and width dimensions of the output shape don't change. all that changes is the channel dimension, and our output shape is `[-1, 28, 28, 32]`
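
for the curious, here is a small sketch of how much zero padding `same` implies. the formula is the one `tensorflow` documents for `SAME` padding as far as I know; the function name is mine.

```python
import math


def same_padding(in_size, kernel_size, stride=1):
    """return (out_size, pad_before, pad_after) along one spatial dimension."""
    out_size = math.ceil(in_size / stride)
    pad_total = max((out_size - 1) * stride + kernel_size - in_size, 0)
    pad_before = pad_total // 2
    pad_after = pad_total - pad_before  # any odd leftover goes on the far side
    return out_size, pad_before, pad_after


print(same_padding(28, 5))  # (28, 2, 2)
print(same_padding(14, 5))  # (14, 2, 2)
```

so a 5x5 filter on a 28 pixel side needs two rows/columns of zeros on each edge to keep the output at 28.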


### pooling layer #1

here we take the convolution and max pool:

In [17]:
tf.layers.max_pooling2d?

we are going to do a 2x2 pool with a stride of 2 and *valid* padding (that is, pool only over "valid" values and don't create artificial 0 values around the boundary. yes, the naming convention is awful and stupid):

```python
pool1 = tf.layers.max_pooling2d(
    inputs=conv1,
    pool_size=[2, 2],
    strides=2,
    #padding='valid' is implicit
)
```

#### `pool1` output size

because we use a 2x2 pool with a stride of 2 and `valid` padding, our new tensor size post-pooling is `{height,width} / stride`. the number of channels is unchanged, so the new output size is `[-1, 14, 14, 32]`
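
a toy version of the pooling step, in plain python, to see the halving happen. the 4x4 grid and the `max_pool` helper are made up; the sketch assumes a square, single-channel input.

```python
def max_pool(grid, pool_size=2, stride=2):
    """max over each pool_size x pool_size window; 'valid' means only full windows are used."""
    out_size = (len(grid) - pool_size) // stride + 1
    return [
        [
            max(
                grid[i * stride + di][j * stride + dj]
                for di in range(pool_size)
                for dj in range(pool_size)
            )
            for j in range(out_size)
        ]
        for i in range(out_size)
    ]


grid = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 2, 7, 8],
]
pooled = max_pool(grid)
print(pooled)  # [[4, 2], [2, 8]]
```

a 4x4 grid becomes 2x2; by the same arithmetic a 28x28 feature map becomes 14x14.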

### convolutional layer #2 and pooling layer #2

we repeat the process on this transformed tensor, but this time we double the number of filters to 64

```python
conv2 = tf.layers.conv2d(
    inputs=pool1,
    filters=64,
    kernel_size=[5, 5],
    padding="same",
    activation=tf.nn.relu
)

pool2 = tf.layers.max_pooling2d(inputs=conv2, pool_size=[2, 2], strides=2)
```

#### `conv2` and `pool2` output size

for the convolution layer, we again use `same` padding, so the height and width sizes are fixed. the number of filters increases to 64, so the output tensor of `conv2` has size `[-1, 14, 14, 64]`.

for the pool layer, everything is the same -- `stride=2` halves the height and width sizes and our output from the `pool2` layer is `[-1, 7, 7, 64]`. that is

In [18]:
7 * 7 * 64

3136

elements per record in a batch

### dense layer

the steps above have "built features"

**note**: this is a pet peeve of mine. it's often stated thus as if what *follows* is *not* further feature engineering. what you have are complicated features that you could interpret as still being images, and can mentally relate to the input images. that's not a feature any more than `[1.3436147, -473.147387, ...]` is

now we go through the normal deep neural shit. start by flattening to `batch_size, x_size`: `[-1, 7 * 7 * 64]`:

```python
pool2_flat = tf.reshape(pool2, [-1, 7 * 7 * 64])
```

and then passing that into a 1024 unit `dense` layer:

```python
dense = tf.layers.dense(inputs=pool2_flat, units=1024, activation=tf.nn.relu)
```

for generalizability, we include a dropout layer. everything I've ever seen suggests a dropout value of 0.5, but I guess we know better and use 0.4

```python
dropout = tf.layers.dropout(
    inputs=dense,
    rate=0.4,
    training=mode == tf.estimator.ModeKeys.TRAIN
)
```

the `training` argument is actually super helpful, because it handles the control flow logic of applying random dropout during *training* (when we want it, to promote generalizability) but **not** during evaluation or testing
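
here is a rough pure-python sketch of what that `training` switch does. it assumes "inverted" dropout, where the kept units are scaled up by `1 / (1 - rate)` so the expected sum is the same in both modes (my understanding of what `tf.layers.dropout` documents). the function and variable names are made up.

```python
import random


def dropout(values, rate, training, rng=random):
    """zero each value with probability `rate` while training; pass through otherwise."""
    if not training:
        return list(values)
    keep_prob = 1.0 - rate
    # kept units are scaled up so the expected total is unchanged
    return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in values]


rng = random.Random(0)  # seeded only so the sketch is repeatable
activations = [1.0] * 10000

train_out = dropout(activations, rate=0.4, training=True, rng=rng)
eval_out = dropout(activations, rate=0.4, training=False)

dropped_fraction = sum(1 for v in train_out if v == 0.0) / len(train_out)
print("fraction dropped in training:", round(dropped_fraction, 3))
print("eval output unchanged:", eval_out == activations)
```

the dropped fraction should land near 0.4 in training mode, and nothing is touched in eval mode.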

#### `dense` and `dropout` layers output size

final size: `[-1, 1024]`

### logits layer

finally, given those 1024 features, make a prediction for each of the 10 classes

```python
logits = tf.layers.dense(inputs=dropout, units=10)
```

#### `logits` output size

`[-1, 10]`
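
with every layer defined, we can count trainable parameters by hand. this sketch assumes each conv and dense layer has a bias term (the default for `tf.layers`, as far as I know) and that the pooling and dropout layers have no parameters; the helper names are hypothetical.

```python
def conv_params(kernel_h, kernel_w, in_channels, filters):
    """one kernel_h x kernel_w x in_channels weight block per filter, plus one bias each."""
    return kernel_h * kernel_w * in_channels * filters + filters


def dense_params(in_units, out_units):
    """a full weight matrix plus one bias per output unit."""
    return in_units * out_units + out_units


params = {
    "conv1": conv_params(5, 5, 1, 32),
    "conv2": conv_params(5, 5, 32, 64),
    "dense": dense_params(7 * 7 * 64, 1024),
    "logits": dense_params(1024, 10),
}
total_params = sum(params.values())

for name, count in params.items():
    print(name, count)
print("total", total_params)  # total 3274634
```

almost all of the roughly 3.3 million parameters live in the first dense layer, not in the convolutions.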

### generate predictions

the `logits` values are raw, unnormalized scores for each class (not probabilities). `tf.argmax` picks the index of the highest score as the predicted class, and `tf.nn.softmax` converts the scores into prediction probabilities

```python
predictions = {
    "classes": tf.argmax(input=logits, axis=1),
    "probabilities": tf.nn.softmax(logits, name="softmax_tensor")
}
```
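
for a single record, those two ops boil down to the following (plain python, with made-up scores for a three-class problem):

```python
import math


def softmax(scores):
    """exponentiate and normalize; subtracting the max first avoids overflow."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # hypothetical raw scores
probabilities = softmax(logits)
predicted_class = max(range(len(logits)), key=lambda i: logits[i])  # argmax

print([round(p, 3) for p in probabilities])
print(predicted_class)  # 0
```

softmax preserves the ordering of the scores, so the argmax of the logits and the argmax of the probabilities always agree.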

if the model was invoked in the `PREDICT` mode, we're done -- just return what we've built above

```python
if mode == tf.estimator.ModeKeys.PREDICT:
    return tf.estimator.EstimatorSpec(mode=mode, predictions=predictions)
```

### calculate loss

if the model was invoked in either the `EVAL` or `TRAIN` mode, then we will need to be able to return the `loss` for the current weights, biases, hyperparameters (etc). this is a multi-class prediction problem, so the natural choice is crossentropy:

```python
onehot_labels = tf.one_hot(indices=tf.cast(labels, tf.int32), depth=10)
loss = tf.losses.softmax_cross_entropy(
    onehot_labels=onehot_labels, logits=logits
)
```

our `labels` tensor is a `[batch_size]` shaped tensor of integers, whereas our `logits` tensor is `[batch_size, number_of_labels]` shaped. this is the motivation for converting the labels into one-hot tensors. below is a quick diversion into just what that looks like

In [27]:
# stupidest labels
labels = list(range(10)) + list(range(10))
labels

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [32]:
utils.inspect(tf.cast(labels, tf.int32))

[0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9]


In [28]:
onehot_labels = tf.one_hot(indices=tf.cast(labels, tf.int32), depth=10, name='one_hot_labels')
onehot_labels

<tf.Tensor 'one_hot_labels:0' shape=(20, 10) dtype=float32>

In [29]:
utils.inspect(onehot_labels)

[[1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]]


*note*: this is easy because the labels are already the integers 0 - 9. if they had been arbitrary values (strings, say), we would have had to do some bullshit to map them to integer indices first. not sure how we would have done that tbh
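
one common way to handle that case, sketched in plain python: build a lookup from each distinct label to an integer index and encode with it. the animal labels are made up, and this is just one possible approach, not something the tutorial does.

```python
raw_labels = ["cat", "dog", "cat", "bird"]  # hypothetical non-integer labels

# sort so the mapping is deterministic from run to run
classes = sorted(set(raw_labels))
class_to_index = {name: index for index, name in enumerate(classes)}

int_labels = [class_to_index[name] for name in raw_labels]
print(classes)     # ['bird', 'cat', 'dog']
print(int_labels)  # [1, 2, 1, 0]
```

the resulting integers can go straight into `tf.one_hot` with `depth=len(classes)`, and `classes[i]` recovers the original label from a predicted index.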

now, suppose we had managed to make `logits` predictions that were always mostly right but just enough wrong:

In [39]:
logits = onehot_labels * .9 + .01

In [40]:
utils.inspect(logits)

[[0.90999997 0.01       0.01       0.01       0.01       0.01
  0.01       0.01       0.01       0.01      ]
 [0.01       0.90999997 0.01       0.01       0.01       0.01
  0.01       0.01       0.01       0.01      ]
 [0.01       0.01       0.90999997 0.01       0.01       0.01
  0.01       0.01       0.01       0.01      ]
 [0.01       0.01       0.01       0.90999997 0.01       0.01
  0.01       0.01       0.01       0.01      ]
 [0.01       0.01       0.01       0.01       0.90999997 0.01
  0.01       0.01       0.01       0.01      ]
 [0.01       0.01       0.01       0.01       0.01       0.90999997
  0.01       0.01       0.01       0.01      ]
 [0.01       0.01       0.01       0.01       0.01       0.01
  0.90999997 0.01       0.01       0.01      ]
 [0.01       0.01       0.01       0.01       0.01       0.01
  0.01       0.90999997 0.01       0.01      ]
 [0.01       0.01       0.01       0.01       0.01       0.01
  0.01       0.01       0.90999997 0.01      ]
 [0.01       

then the loss is the crossentropy loss:

In [42]:
loss = tf.losses.softmax_cross_entropy(
    onehot_labels=onehot_labels,
    logits=logits
)
loss

<tf.Tensor 'softmax_cross_entropy_loss_1/value:0' shape=() dtype=float32>

In [43]:
utils.inspect(loss)

1.5388281
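
that number can be checked by hand. every row of the fake `logits` has one 0.91 in the true-class slot and nine 0.01s, so every row has the same loss, and the batch mean equals the single-row value. a minimal check in plain python:

```python
import math

logits_row = [0.91] + [0.01] * 9  # true class sits in slot 0
true_class = 0

# softmax cross entropy for one row: log(sum(exp(logits))) - logit of the true class
log_sum_exp = math.log(sum(math.exp(x) for x in logits_row))
row_loss = log_sum_exp - logits_row[true_class]

print(round(row_loss, 4))  # 1.5388
```

the loss is nowhere near 0 even though the argmax is always right: treated as logits, 0.91 vs 0.01 is a small gap, so softmax only gives the true class about 21% of the probability mass.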


### configure the training op

one valid mode is `tf.estimator.ModeKeys.TRAIN`. if we are meant to train, we should create a training operation. there are a bunch of ways to do this, and one is:

```python
if mode == tf.estimator.ModeKeys.TRAIN:
    optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001)
    train_op = optimizer.minimize(
        loss=loss,
        global_step=tf.train.get_global_step()
    )
    return tf.estimator.EstimatorSpec(mode=mode, loss=loss, train_op=train_op)
```
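
what `minimize` does on each call is, roughly, "compute the gradient of the loss, nudge the weights against it, bump the global step". here is a toy pure-python sketch of that loop on a one-parameter quadratic loss. the loss function and the learning rate of 0.1 are made up for illustration (the tutorial uses 0.001 on the real network).

```python
def loss_fn(w):
    """toy loss with its minimum at w = 3."""
    return (w - 3.0) ** 2


def grad_fn(w):
    """derivative of loss_fn with respect to w."""
    return 2.0 * (w - 3.0)


learning_rate = 0.1
w = 0.0
global_step = 0

for _ in range(100):
    w -= learning_rate * grad_fn(w)  # plain gradient descent update
    global_step += 1

print(global_step, round(w, 6), round(loss_fn(w), 6))
```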

### add evaluation metrics

if we haven't already exited, we must be in the `tf.estimator.ModeKeys.EVAL` mode.

we *could* be done at this point, but we're greedy. when we evaluate, we decide to actually evaluate *something*, so we choose to calculate the accuracy of our predictions. we could have calculated a ton of stuff. for a sampling:

In [46]:
[_ for _ in dir(tf.metrics) if not _[0] == '_']

['accuracy',
 'auc',
 'average_precision_at_k',
 'false_negatives',
 'false_negatives_at_thresholds',
 'false_positives',
 'false_positives_at_thresholds',
 'mean',
 'mean_absolute_error',
 'mean_cosine_distance',
 'mean_iou',
 'mean_per_class_accuracy',
 'mean_relative_error',
 'mean_squared_error',
 'mean_tensor',
 'percentage_below',
 'precision',
 'precision_at_k',
 'precision_at_thresholds',
 'precision_at_top_k',
 'recall',
 'recall_at_k',
 'recall_at_thresholds',
 'recall_at_top_k',
 'root_mean_squared_error',
 'sensitivity_at_specificity',
 'sparse_average_precision_at_k',
 'sparse_precision_at_k',
 'specificity_at_sensitivity',
 'true_negatives',
 'true_negatives_at_thresholds',
 'true_positives',
 'true_positives_at_thresholds']

the `EVAL` mode supports calculation of various metric operations, so we push our accuracy calculation in as an eval metric operation:

```python
eval_metric_ops = {
    "accuracy": tf.metrics.accuracy(
        labels=labels, predictions=predictions["classes"]
    )
}

return tf.estimator.EstimatorSpec(
    mode=mode,
    loss=loss,
    eval_metric_ops=eval_metric_ops
)
```
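
one thing worth knowing: `tf.metrics.accuracy` is a *streaming* metric. as I understand it, it keeps running `total` and `count` values and updates them batch by batch, so the final number covers the whole eval set rather than just the last batch. a rough pure-python sketch of that bookkeeping, with a made-up class name and made-up batches:

```python
class StreamingAccuracy:
    """accumulates correct / seen counts across batches."""

    def __init__(self):
        self.total = 0  # number of correct predictions so far
        self.count = 0  # number of records seen so far

    def update(self, labels, predictions):
        self.total += sum(1 for l, p in zip(labels, predictions) if l == p)
        self.count += len(labels)

    def result(self):
        return self.total / self.count if self.count else 0.0


metric = StreamingAccuracy()
metric.update(labels=[1, 2, 3], predictions=[1, 2, 0])  # 2 of 3 correct
metric.update(labels=[4, 5], predictions=[4, 5])        # 2 of 2 correct
print(metric.result())  # 0.8
```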

## training and evaluating the cnn mnist classifier

the model is defined. dope. not dope enough, though. let's put all of what we need to do into a glorious `main` function

### load training and test data

go get the data. here we are hacking things a bit by using the `load_dataset` for mnist. we shouldn't do this. oh well.

within the `main` function (first 5 lines), go get the data and unpack it into more useful separate feature tensors

```python
# Load training and eval data
mnist = tf.contrib.learn.datasets.load_dataset("mnist")
train_data = mnist.train.images # Returns np.array
train_labels = np.asarray(mnist.train.labels, dtype=np.int32)
eval_data = mnist.test.images # Returns np.array
eval_labels = np.asarray(mnist.test.labels, dtype=np.int32)
```

### create the estimator

next steps in `main`: creating a `tf.estimator.Estimator` object implementing the `cnn_model_fn` we defined above

```python
mnist_classifier = tf.estimator.Estimator(
    model_fn=cnn_model_fn,
    model_dir='/tmp/mnist_convnet_model'
)
```

if you are running this within a `docker` container and want to access the model checkpoint information (you probably do), consider moving that `model_dir` value to a different location accessible from the base container

### set up a logging hook

logging is cool, right?

```python
# Set up logging for predictions
tensors_to_log = {
    # [printed label name]: [tensor name in graph]
    "probabilities": "softmax_tensor"
}
logging_hook = tf.train.LoggingTensorHook(
    tensors=tensors_to_log,
    every_n_iter=50
)
```

this is an alternative to creating explicit summary ops and looking at the summaries via `tensorboard`. in *this* instance, we will actually print the given tensor to the logs (so you will see this in the running logs for the cli). I prefer the tensorboard method only for eventual usability, but this is something I intend to add in the future as well.

for what it's worth, there is an `every_n_secs` option if you don't care for fixed iterations

In [56]:
tf.train.LoggingTensorHook?

### train the model

we have data loaded into the scope of `main`, and we have an estimator that can train on features of that general shape. what remains is to connect the two -- this is done by defining an `input_fn` for ingesting the features and generating `features, labels` pairs (as expected inputs to the `model_fn`). this is a common use case, so a canned version of this function already exists for us:

```python
# Train the model
train_input_fn = tf.estimator.inputs.numpy_input_fn(
    x={"x": train_data},
    y=train_labels,
    batch_size=100,
    num_epochs=None,
    shuffle=True
)
```
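
as a mental model only (the real `numpy_input_fn` is built on queues and differs in the details), the arguments above behave something like this pure-python generator: shuffle the record order each epoch, hand out `batch_size` records at a time, and with `num_epochs=None` never stop. `batch_generator` and the toy data are hypothetical.

```python
import itertools
import random


def batch_generator(features, labels, batch_size, num_epochs=None, shuffle=True, seed=0):
    """yield (feature_batch, label_batch) pairs; num_epochs=None repeats forever."""
    rng = random.Random(seed)
    epochs = itertools.count() if num_epochs is None else range(num_epochs)
    for _ in epochs:
        order = list(range(len(labels)))
        if shuffle:
            rng.shuffle(order)
        for start in range(0, len(order), batch_size):
            chunk = order[start:start + batch_size]
            yield [features[i] for i in chunk], [labels[i] for i in chunk]


features = list(range(10))
labels = [x % 2 for x in features]

# one unshuffled pass, like the eval setup: 10 records in batches of 4
one_pass = list(batch_generator(features, labels, batch_size=4, num_epochs=1, shuffle=False))
print([len(batch_x) for batch_x, _ in one_pass])  # [4, 4, 2]

# endless shuffled batches, like the training setup: the caller decides when to stop
endless = batch_generator(features, labels, batch_size=4)
first_five = list(itertools.islice(endless, 5))
print(len(first_five))  # 5
```

because the training generator never runs dry, something else has to end training; in the tutorial that is the `steps` argument to `train`.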

having defined that input function, we simply pass it to the `train` method of the estimator

In [57]:
tf.estimator.Estimator.train?

```python
mnist_classifier.train(
    input_fn=train_input_fn,
    steps=20000,
    hooks=[logging_hook]
)
```

this will run until the number of steps has been exhausted
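
a bit of arithmetic on what 20000 steps means. the 55000 figure is an assumption about the size of `mnist.train` as returned by this loader (I believe it holds 5000 of the 60000 training images back for validation).

```python
num_train_examples = 55000  # assumption, see above
batch_size = 100
steps = 20000

examples_seen = batch_size * steps
epochs = examples_seen / num_train_examples

print(examples_seen)     # 2000000
print(round(epochs, 1))  # 36.4
```

so under that assumption each training image is seen about 36 times.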

### evaluate the model

after training, we have a separate test set we can use to evaluate the performance of our trained model on out-of-sample records. just like with training above, we do this by connecting the ingested `numpy` arrays of test data with the estimator using an evaluation `input_fn`

```python
# Evaluate the model and print results
eval_input_fn = tf.estimator.inputs.numpy_input_fn(
    x={"x": eval_data},
    y=eval_labels,
    num_epochs=1,
    shuffle=False)
```

and this is evaluated by the estimator

In [58]:
tf.estimator.Estimator.evaluate?

```python
eval_results = mnist_classifier.evaluate(input_fn=eval_input_fn)
print(eval_results)
```

### run the model

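it is not explained *at all*, but from reading the source, `tf.app.run()` parses any command-line flags and then, by default, looks up a function named `main` in the `__main__` module and calls it. that lookup is why it works even though I am not in any way *declaring* that we should use `main`. it does not appear to create a session context itself; the estimator handles sessions on its own.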

In [61]:
tf.app.run??
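
my reading of that source, boiled down to a pure-python stand-in. this is a simplification, not the real implementation: the real function also parses flags and wraps the call in `sys.exit`, both of which I've left out. the `run` and `main` below are hypothetical.

```python
import sys


def run(main=None, argv=None):
    """rough stand-in for tf.app.run: find a `main` function and call it with argv."""
    # with no explicit main, fall back to whatever the executed script named `main`
    main = main or sys.modules["__main__"].main
    argv = sys.argv if argv is None else argv
    return main(argv)


def main(unused_argv):
    print("this is where we would load data, train, and evaluate")
    return 0


# passing main and argv explicitly here; run() alone relies on the lookup instead
exit_code = run(main=main, argv=["cnn_mnist.py"])
print(exit_code)  # 0
```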

invocation is done from the shell via `python cnn_mnist.py`. execute that in a neighboring terminal

*note*: on the gpu machine with *no* gpu access but little competition for cpu resources, this took several (closer to 30 than 0) minutes

the final output I received was

```
INFO:tensorflow:Loss for final step: 0.13426968.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2018-07-10-18:44:57
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /home/zlamberty/tmp/mnist_convnet_model/model.ckpt-20000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Finished evaluation at 2018-07-10-18:45:00
INFO:tensorflow:Saving dict for global step 20000: accuracy = 0.9705, global_step = 20000, loss = 0.09808666
{'accuracy': 0.9705, 'loss': 0.09808666, 'global_step': 20000}
```

## additional resources

links to other tutorials

# summary

this was a pretty good tutorial on the basic structure of a `tensorflow` program using the `tf.layers` and `tf.estimator` apis. all told, our script has a little over 200 lines of code and runs in about a half an hour to achieve an accuracy of about 97% (97.05% in the run logged above) with no hyperparameter tuning (not bad!).

the general program flow was implemented again:

1. define some way of ingesting data (preferably with the `tf.data.Dataset` api, not done that way here)
1. define a model via a `model_fn` function
    1. should support `tf.estimator.ModeKeys.{TRAIN,EVAL,PREDICT}`
1. define a way of stitching the above two together (an `input_fn`)
    1. usually separate functions are provided for `TRAIN` and `EVAL`
1. create an instance of the model and invoke the desired modes
    1. `estimator.train(...)`
    1. `estimator.evaluate(...)`
