# Neural Bag-of-Words Model

In this notebook, we'll move beyond linear classifiers and implement a neural network for our classification task. 

We'll also introduce the [TensorFlow Estimator API](https://www.tensorflow.org/extend/estimators), which provides a high-level interface similar to scikit-learn. This involves a few new concepts, such as the idea of a `model_fn` and an `input_fn`, but it greatly simplifies experiments and reduces the need to write tedious data-feeding code.

## Outline

- **Part (d):** Model architecture
- **Part (e):** Implementing the Neural BOW model
- **Introduction to `tf.Estimator`**
- **Part (f):** Training, evaluation, and tuning

As with the first half of the assignment, exercises are interspersed throughout the notebook. In particular, Part (d) has 4 questions, Part (e) asks you to write code in `models.py`, and Part (f) has 4 questions plus one optional implementation exercise.

In [4]:
from __future__ import division
import os, sys, re, json, time, datetime, shutil
import itertools, collections
from importlib import reload
from IPython.display import display, HTML

# NLTK for NLP utils and corpora
import nltk

# NumPy and TensorFlow
import numpy as np
import pandas as pd
import tensorflow as tf
assert(tf.__version__.startswith("1.4"))

# Helper libraries
from w266_common import utils, vocabulary, tf_embed_viz, treeviz
from w266_common import patched_numpy_io
# Code for this assignment
import sst, models, models_test

# Monkey-patch NLTK with better Tree display that works on Cloud or other display-less server.
print("Overriding nltk.tree.Tree pretty-printing to use custom GraphViz.")
treeviz.monkey_patch(nltk.tree.Tree, node_style_fn=sst.sst_node_style, format='svg')

Overriding nltk.tree.Tree pretty-printing to use custom GraphViz.


# Part (d): Model Architecture

The neural bag-of-words classifier is one of the simplest neural models for text classification. It takes its name from the bag-of-words assumption common to linear models, in which the weights for each input word are summed to make a prediction. For our neural version, we'll instead sum the _vector representations_ of each word, and then add feed-forward (hidden) layers to make a deep network.

Here's a diagram:

![Neural Bag-of-Words Model](images/neural_bow.png)

We'll use the following notation:
- $w^{(i)} \in \mathbb{Z}$ for the $i^{th}$ word of the sequence (as an integer index)
- $x^{(i)} \in \mathbb{R}^d$ for the vector representation (embedding) of $w^{(i)}$
- $x \in \mathbb{R}^d$ for the fixed-length vector given by summing all the $x^{(i)}$ for an example
- $h^{(j)}$ for the hidden state after the $j^{th}$ fully-connected layer
- $y \in \{1,\ldots,\mathtt{num\_classes}\}$ for the target label

Our model is defined as:
- **Embedding layer:** $x^{(i)} = W_{embed}[w^{(i)}]$
- **Summing vectors:** $x = \sum_{i=1}^n x^{(i)}$
- **Hidden layer(s):** $h^{(j)} = f(h^{(j-1)} W^{(j)} + b^{(j)})$ where $h^{(-1)} = x$ and $j = 0,1,\ldots,J-1$
- **Output layer:** $\hat{y} = \hat{P}(y) = \mathrm{softmax}(h^{(final)} W_{out} + b_{out})$ where $h^{(final)} = h^{(J-1)}$ is the output of the last hidden layer.

As per usual, we define the logits to be the argument of the softmax:

$$ \mathrm{logits} = h^{(final)}W_{out} + b_{out} $$
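
The equations above can be traced in a few lines of NumPy. This is only a sketch for building intuition, not the TensorFlow code you will write in `models.py`: the choice of `tanh` for the nonlinearity $f$, the helper names `softmax` and `bow_forward`, and the tiny random weights are all assumptions made for illustration.

```python
import numpy as np

def softmax(z):
    """Softmax over the last axis, shifted by the max for numerical stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def bow_forward(ids, W_embed, hidden, W_out, b_out, f=np.tanh):
    """Forward pass for a single example.

    ids:    1-D integer array of word indices w^(i)
    hidden: list of (W, b) pairs, one per hidden layer
    f:      nonlinearity (tanh is an assumption, not fixed by the notebook)
    """
    x = W_embed[ids].sum(axis=0)   # embedding lookup, then sum -> shape [d]
    h = x                          # h^(-1) = x
    for W, b in hidden:
        h = f(h @ W + b)           # h^(j) = f(h^(j-1) W^(j) + b^(j))
    logits = h @ W_out + b_out     # argument of the softmax
    return logits, softmax(logits)

# Toy dimensions and random weights, purely illustrative.
rng = np.random.RandomState(0)
V, d, h1, k = 10, 4, 3, 2
W_embed = rng.randn(V, d)
hidden = [(rng.randn(d, h1), np.zeros(h1))]
W_out, b_out = rng.randn(h1, k), np.zeros(k)

logits, probs = bow_forward(np.array([1, 5, 7]), W_embed, hidden, W_out, b_out)
print(logits.shape, probs.sum())
```

Writing the vector on the left of each weight matrix, as in `h @ W`, mirrors the notation above.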

We'll refer to the first part of this model (**Embedding layer**, **Summing vectors**, and **Hidden layer(s)**) as the **Encoder**: it has the role of encoding the input sequence into a fixed-length vector representation that we pass to the output layer.

We'll also use these as shorthand for important dimensions:
- `V`: the vocabulary size (equal to `ds.vocab.size`)
- `embed_dim`: the embedding dimension $d$
- `hidden_dims`: a list of dimensions for the output of each hidden layer (i.e. $\mathrm{dim}(h^{(j)})$&nbsp;=&nbsp;`hidden_dims[j]`)
- `num_classes`: the number of target classes (2 for the binary task)

## Part (d) Short Answer Questions

Answer the following in the cell below. 

1. Let `embed_dim = d`, `hidden_dims = [h1, h2]`, and `num_classes = k`. In terms of these values and the vocabulary size `V`, write down the shapes of the following variables: $W_{embed}$, $W^{(0)}$, $b^{(0)}$, $W^{(1)}$, $b^{(1)}$, $W_{out}$, $b_{out}$. (*Hint: $W_{embed}$ has a row for each word in the vocabulary.*)
<p>
2. Using your answer to 1., how many parameters (matrix or vector elements) are in the embedding layer? How about in the hidden layers? And the output layer?  
<p>
<p>
3. Recall that logistic regression can be thought of as a single-layer neural network. What should we set as the values of `embed_dim` and `hidden_dims` such that this model implements logistic regression?
<p>
4. Suppose that we have two examples, `[foo bar baz]` and `[baz bar foo]`. Will this model make the same predictions on these? Why or why not?

## Part (d) Answers
<a id="answers_d1234"></a>

1. 
$W_{embed}$ = `(V by d)`, 
$W^{(0)}$ = `(d by h1)`,
$b^{(0)}$ = `(1 by h1)`,
$W^{(1)}$ = `(h1 by h2)`,
$b^{(1)}$ = `(1 by h2)`,
$W_{out}$ = `(h2 by k)`,
$b_{out}$ = `(1 by k)`
2. 
 - In the embedding layer $W_{embed}$, there are `V*d` parameters.
 - To produce the first hidden layer representation $h^{(0)}$, our weight matrix needs `d*h1` parameters and our bias term needs `1*h1` parameters. To produce the second hidden layer representation $h^{(1)}$, our weight matrix needs `h1*h2` parameters and our bias term needs `1*h2` parameters. In total, the hidden layers require us to estimate `(d+1)*h1 + (h1+1)*h2` parameters.
 - To produce the outputs, the output layer's weight matrix needs `h2*k` parameters and the bias term needs `1*k` parameters. In total, the output layer needs `(h2+1)*k` parameters.
3. `embed_dim` can be any positive integer `d`. We set `hidden_dims = []` because there are no hidden layers. $W_{out}$ should then have shape `(d, k)`, and the shape of $b_{out}$ remains the same.
4. Yes, the model will make the same sentiment predictions (argmax of softmax output). This is because when we sum the embeddings $x^{(i)}$, the summing operation doesn't take the order of tokens into account.
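
The parameter counts in answer 2 generalize to any number of hidden layers. Here is a small sketch that tallies them; `count_params` is a hypothetical helper, not part of the assignment code. The example call uses the vocabulary size reported when the dataset is loaded (16,474) together with the default hyperparameters used later in this notebook.

```python
def count_params(V, embed_dim, hidden_dims, num_classes):
    """Return (embedding, hidden, output) parameter counts for the BOW model."""
    n_embed = V * embed_dim
    n_hidden = 0
    fan_in = embed_dim
    for fan_out in hidden_dims:
        n_hidden += (fan_in + 1) * fan_out   # weight matrix plus bias vector
        fan_in = fan_out
    n_out = (fan_in + 1) * num_classes       # output weights plus bias
    return n_embed, n_hidden, n_out

print(count_params(V=16474, embed_dim=50, hidden_dims=[25], num_classes=2))
# (823700, 1275, 52)
```

Almost all of the parameters sit in the embedding matrix, which is typical for models with a large vocabulary and small hidden layers.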

## Training with Minibatches

Modern hardware (especially GPUs) performs most efficiently when processing a large amount of data in parallel. Because of this, we usually feed data to a neural network in batches - that is, running several examples at a time, in parallel. If each example is represented by a vector $x \in \mathbb{R}^d$, then we can feed in a batch of $m$ examples as a matrix $X \in \mathbb{R}^{m \times d}$, where each row is an example. Note that if we write our matrix-vector products with the vector on the left, as in the equations above, the batch dimension carries through while the rows remain independent:

$$ H = f(X W + b) $$

is equivalent to computing in parallel $H_i = f(X_i W + b)$ for each $i = 0, \ldots, m - 1$. Most TensorFlow operations are designed to handle batching seamlessly, so long as $m$ = `batch_size` is the first dimension of the input data.
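
A quick NumPy check makes the equivalence concrete. The dimensions, the random data, and the use of `tanh` for $f$ are arbitrary choices for this sketch.

```python
import numpy as np

rng = np.random.RandomState(0)
m, d, h = 4, 3, 5            # batch size, input dim, hidden dim (toy values)
X = rng.randn(m, d)          # one example per row
W = rng.randn(d, h)
b = rng.randn(h)
f = np.tanh                  # nonlinearity chosen for illustration

# Batched: one matrix product; b is broadcast across the rows.
H_batch = f(X @ W + b)

# Row by row: the same computation applied to each example independently.
H_rows = np.stack([f(X[i] @ W + b) for i in range(m)])

print(np.allclose(H_batch, H_rows))  # True
```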

### Padding Sequences

Unlike the Naive Bayes classifier, which took long ($d = V \approx 16,000$) sparse vectors as input, our neural network will operate directly on a _sequence_ of ids (as stored in `ds.train.ids`). This can be variable-length (depending on the length of the sequence), but we'll need to coerce it into a fixed-length vector for training.

The easiest thing to do here is to pad the vectors with a dummy index, which we can zero-out inside our model. Consider the inputs:
```
[great movies] (2 tokens)
[this is a terrible movie] (5 tokens)
```
We'll convert these to IDs, then pad with a dummy index `0` to get a 2 x 5 matrix:
```
[[144, 104,  0,   0,  0 ]
 [ 20,  10,  6, 937, 21]]
```

For SST, we'll arbitrarily choose to pad to length 40, and clip any examples longer than that. _(Recall from Part (a) that this clips fewer than 5% of the dataset.)_

The `ds.as_padded_array` function is implemented for you, and will handle clipping and padding automatically. Note the second return value, `*_ns`: this is a vector containing the original (clipped) sequence lengths. We'll use this inside the model to mask the dummy indices so they don't bias our predictions.
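
To illustrate the idea, here is a minimal NumPy sketch of padding, clipping, and masking. It is not the implementation of `ds.as_padded_array`: `pad_and_clip` is a hypothetical helper, and the random embedding matrix exists only to show why the mask matters.

```python
import numpy as np

def pad_and_clip(seqs, max_len, pad_id=0):
    """Hypothetical helper: return (ids, ns) for a list of id sequences.

    ids: [batch, max_len] int array, padded with pad_id
    ns:  [batch] int array of (clipped) sequence lengths
    """
    ids = np.full((len(seqs), max_len), pad_id, dtype=np.int32)
    ns = np.zeros(len(seqs), dtype=np.int32)
    for i, s in enumerate(seqs):
        s = s[:max_len]                  # clip sequences that are too long
        ids[i, :len(s)] = s
        ns[i] = len(s)
    return ids, ns

ids, ns = pad_and_clip([[144, 104], [20, 10, 6, 937, 21]], max_len=5)

# mask[i, j] is True only for real tokens, i.e. positions j < ns[i].
mask = np.arange(ids.shape[1])[None, :] < ns[:, None]

# Index 0 is also a valid row of the embedding matrix, so padded positions
# must be zeroed out before summing or they would leak into x.
W_embed = np.random.RandomState(0).randn(1000, 4)    # illustrative weights
x = (W_embed[ids] * mask[:, :, None]).sum(axis=1)    # [batch, d]
print(ids)
print(ns, x.shape)
```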

In [5]:
import sst
ds = sst.SSTDataset(V=20000).process(label_scheme="binary")

Loading SST from data/sst/trainDevTestTrees_PTB.zip
Training set:     8,544 trees
Development set:  1,101 trees
Test set:         2,210 trees
Building vocabulary - 16,474 words
Processing to phrases...  Done!
Splits: train / dev / test : 98,794 / 13,142 / 26,052


In [6]:
max_len = 40
train_x, train_ns, train_y = ds.as_padded_array('train', max_len=max_len, root_only=True)
dev_x,   dev_ns,   dev_y   = ds.as_padded_array('dev',   max_len=max_len, root_only=True)
test_x,  test_ns,  test_y  = ds.as_padded_array('test',  max_len=max_len, root_only=True)

In [7]:
print("Examples:\n", train_x[:3])
print("Original sequence lengths: ", train_ns[:3])
print("Target labels: ", train_y[:3])
print("")
print("Padded:\n", " ".join(ds.vocab.ids_to_words(train_x[0])))
print("Un-padded:\n", " ".join(ds.vocab.ids_to_words(train_x[0,:train_ns[0]])))

Examples:
 [[   4  606   10 3416    9   26    4 2821 1263   11  108   63 5543   64
     7   13   75   11  277    9   84    6 4243   69 3417   40 1869 2822
     5 8181 1682 5544   48  846 8182    3    0    0    0    0]
 [   4 2823 1870 5545    8   63    4 3418    8    4 2441   64 5546   10
    46  905   13    6 5547    8  680   67   29 3419 2113 5548 1030  847
    11 5549  623    8 8183 5550   11 8184    3    0    0    0]
 [8185 5551 2114 8186    6 8187    8 1530   36    6  167  769 1264    5
     6  167   34  296 8188    9    4   51   36   16    4  307 3420  345
   624    4 1031    5 4244    5  447    8    4  273    3    0]]
Original sequence lengths:  [36 37 39]
Target labels:  [1 1 1]

Padded:
 the rock is destined to be the 21st century 's new `` conan '' and that he 's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal . <s> <s> <s> <s>
Un-padded:
 the rock is destined to be the 21st century 's new `` conan '' and that he 's going 

# Part (e): Implementing the Neural BOW Model

In order to better manage the model code, we'll implement our BOW model in `models.py`. In particular, you'll need to implement the following functions:

- `embedding_layer(...)`: constructs an embedding layer
- `BOW_encoder(...)`: constructs the encoder stack as described above
- `softmax_output_layer(...)`: constructs a softmax output layer

**Follow the instructions in the code (function docstrings and comments) carefully!**

In particular, for unit tests to work, you shouldn't change (or add) any `tf.name_scope` or `tf.variable_scope` calls, and must name the variables exactly as documented. (Your model may work just fine, of course, but the test harness will throw all sorts of errors!)

To aid debugging and readability, we've adopted a convention that TensorFlow tensors are represented by variables ending in an underscore, such as `W_embed_` or `train_op_`.

**Before you start**, be sure to answer the short-answer questions in Part (d). (_We guarantee that this section will be **much** harder if you don't!_)

You may find the following TensorFlow API functions useful:
- [`tf.nn.embedding_lookup`](https://www.tensorflow.org/versions/master/api_docs/python/tf/nn/embedding_lookup)
- [`tf.nn.sparse_softmax_cross_entropy_with_logits`](https://www.tensorflow.org/versions/master/api_docs/python/tf/nn/sparse_softmax_cross_entropy_with_logits)
- [`tf.reduce_mean`](https://www.tensorflow.org/versions/master/api_docs/python/tf/reduce_mean) and [`tf.reduce_sum`](https://www.tensorflow.org/versions/master/api_docs/python/tf/reduce_sum)
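
As a reference point for the loss, here is a NumPy sketch of the quantity that a sparse softmax cross-entropy computes: the negative log-probability of the correct class, taken directly from the logits. This illustrates the math only and is not TensorFlow's implementation; the helper name is made up, and labels are assumed to be 0-indexed integers, as the TensorFlow op expects.

```python
import numpy as np

def sparse_softmax_xent(logits, labels):
    """Per-example cross-entropy from unnormalized logits and integer labels.

    logits: [batch, num_classes] float array
    labels: [batch] int array with values in [0, num_classes)
    """
    # Log-softmax, shifted by the row max for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Pick out the log-probability of the true class for each row.
    return -log_probs[np.arange(len(labels)), labels]

logits = np.array([[2.0, 0.0],    # fairly confident in class 0
                   [0.0, 0.0]])   # uniform over both classes
labels = np.array([0, 1])
losses = sparse_softmax_xent(logits, labels)
print(losses, losses.mean())      # the batch loss is usually the mean
```

The second example has a loss of $\log 2 \approx 0.693$, which is what a model that guesses uniformly on a binary task would score.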

**Do your work in `models.py`.** When ready, run the cell below to run the unit tests.

In [12]:
reload(models)
utils.run_tests(models_test, ["TestLayerBuilders", "TestNeuralBOW"])

test_embedding_layer (models_test.TestLayerBuilders) ... ok
test_softmax_output_layer (models_test.TestLayerBuilders) ... ok
test_BOW_encoder (models_test.TestNeuralBOW) ... ok

----------------------------------------------------------------------
Ran 3 tests in 0.056s

OK


# Training a Neural Network (the hard way)

In Assignment 1, we trained our simple model with a home-spun training loop, setting up `feed_dict`-s and making 
calls to `session.run()`. For demonstration, let's do the same here.

We've implemented a wrapper function, `models.classifier_model_fn`, which uses the functions you wrote in **Part (e)** to build a model graph. It takes as input `features` and `labels` which contain input and target tensors, as well as `model` and `params` which configure the model. 

**Exercise (not graded):** Read through the code for `classifier_model_fn()` in `models.py`. Where is the code you wrote in Part (e) called? Where is the loss function set up, and what loss is used? How is the optimizer set up, and what options are available? What types of predictions are returned in the `predictions` dict?

Using this function directly, we can write a simple training loop similar to Assignment 1's `train_nn()`:

In [32]:
import models; reload(models)

x, ns, y = train_x, train_ns, train_y
batch_size = 32

# Specify model hyperparameters as used by model_fn
model_params = dict(V=ds.vocab.size, embed_dim=50, hidden_dims=[25], num_classes=len(ds.target_names),
                    encoder_type='bow',
                    lr=0.1, optimizer='adagrad', beta=0.01)
model_fn = models.classifier_model_fn

total_batches = 0
total_examples = 0
total_loss = 0
loss_ema = np.log(2)  # track exponential-moving-average of loss
ema_decay = np.exp(-1/10)  # decay parameter for moving average = np.exp(-1/history_length)
with tf.Graph().as_default(), tf.Session() as sess:
    ##
    # Construct the graph here. No session.run calls - just wiring up Tensors.
    ##
    # Add placeholders so we can feed in data.
    x_ph_  = tf.placeholder(tf.int32, shape=[None, x.shape[1]])  # [batch_size, max_len]
    ns_ph_ = tf.placeholder(tf.int32, shape=[None])              # [batch_size]
    y_ph_  = tf.placeholder(tf.int32, shape=[None])              # [batch_size]
    
    # Construct the graph using model_fn
    features = {"ids": x_ph_, "ns": ns_ph_}  # note that values are Tensors
    estimator_spec = model_fn(features, labels=y_ph_, mode=tf.estimator.ModeKeys.TRAIN,
                              params=model_params)
    loss_     = estimator_spec.loss
    train_op_ = estimator_spec.train_op
    
    ##
    # Done constructing the graph, now we can make session.run calls.
    ##
    sess.run(tf.global_variables_initializer())
    
    # Run a single epoch
    t0 = time.time()
    for (bx, bns, by) in utils.multi_batch_generator(batch_size, x, ns, y):
        # feed NumPy arrays into the placeholder Tensors
        feed_dict = {x_ph_: bx, ns_ph_: bns, y_ph_: by}
        batch_loss, _ = sess.run([loss_, train_op_], feed_dict=feed_dict)
        
        # Compute some statistics
        total_batches += 1
        total_examples += len(bx)
        total_loss += batch_loss * len(bx)  # re-scale, since batch loss is mean
        # Compute moving average to smooth out noisy per-batch loss
        loss_ema = ema_decay * loss_ema + (1 - ema_decay) * batch_loss
        
        if (total_batches % 25 == 0):
            print("{:5,} examples, moving-average loss {:.2f}".format(total_examples, 
                                                                      loss_ema))    
    print("Completed one epoch in {:s}".format(utils.pretty_timedelta(since=t0)))

  800 examples, moving-average loss 0.67
1,600 examples, moving-average loss 0.53
2,400 examples, moving-average loss 0.46
3,200 examples, moving-average loss 0.50
4,000 examples, moving-average loss 0.61
4,800 examples, moving-average loss 0.47
5,600 examples, moving-average loss 0.46
6,400 examples, moving-average loss 0.43
Completed one epoch in 0:00:01
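
The loop above reports an exponential moving average (EMA) of the loss rather than the raw per-batch value. The sketch below isolates that update so its smoothing effect is easier to see. The helper `ema_trace` and the synthetic noisy "loss" are assumptions for illustration; the decay `exp(-1/10)` (about 0.905) and the starting value `log(2)` are the same as in the cell above.

```python
import numpy as np

def ema_trace(values, history_length=10, init=np.log(2)):
    """Return the running exponential moving average of a sequence."""
    decay = np.exp(-1.0 / history_length)
    ema = init
    out = []
    for v in values:
        ema = decay * ema + (1 - decay) * v   # same update as the training loop
        out.append(ema)
    return np.array(out)

# Synthetic noisy signal standing in for per-batch losses.
rng = np.random.RandomState(0)
noisy = 0.5 + 0.2 * rng.randn(200)
smooth = ema_trace(noisy)

# The smoothed trace fluctuates far less than the raw values.
print(noisy.std(), smooth[50:].std())
```

A larger `history_length` gives a smoother but slower-reacting curve.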


# Training a Neural Network with tf.Estimator

As you see above, there's a lot of boilerplate involved with training a model - we need to instantiate the graph, manage a TensorFlow session, and manually feed data for each batch. This can get tedious, especially as we add support for checkpointing, saving models, and tracking statistics during training. To streamline this process, we can use a high-level API like `tf.Estimator`.

The Estimator API allows us to define custom models, then provides an `Estimator` object that exposes `train()`, `evaluate()`, and `predict()` functions through an interface similar to scikit-learn's. Take a few minutes to skim through the main documentation:

- [TensorFlow Estimator API](https://www.tensorflow.org/extend/estimators)
- [Estimators in 'Effective TensorFlow'](https://github.com/vahidk/EffectiveTensorflow#tf_learn) (advanced)

### Model Functions (model_fn)

The Estimator API is a functional interface, built around the idea of a `model_fn`. A `model_fn` is just a function that follows a specific interface, and when called constructs a graph of TensorFlow variables and ops that constitutes your model. Here's an example of what one looks like:

```python
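def my_model_fn(features, labels, mode, params):
    # my_network and my_loss_fn are placeholders for your own model code.
    x_ = features['x']
    logits_ = my_network(x_, hidden_dims=params['hidden_dims'],
                         foo=params['foo'], bar=params['bar'])

    predictions_dict = {"max": tf.argmax(logits_, 1)}
    if mode == tf.estimator.ModeKeys.PREDICT:
        # No labels at inference time, so return before building the loss.
        return tf.estimator.EstimatorSpec(mode=mode,
                                          predictions=predictions_dict)

    loss_ = my_loss_fn(logits_, labels)
    # Illustrative optimizer; params['lr'] is assumed to hold the learning rate.
    optimizer = tf.train.GradientDescentOptimizer(params['lr'])
    train_op_ = optimizer.minimize(loss_,
                                   global_step=tf.train.get_global_step())
    eval_metrics = {"accuracy": tf.metrics.accuracy(labels,
                                                    predictions_dict['max'])}
    return tf.estimator.EstimatorSpec(mode=mode,
                                      predictions=predictions_dict,
                                      loss=loss_,
                                      train_op=train_op_,
                                      eval_metric_ops=eval_metrics)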
```
You can read more about the arguments here: 
- [Constructing the model_fn](https://www.tensorflow.org/extend/estimators#constructing_the_model_fn)

The Estimator API takes a pointer to this _function_, then calls it internally to instantiate your model in the appropriate context. This allows it to handle things like writing and restoring checkpoints automatically, as well as feeding data to the model during training and evaluation. 

### Input Functions (input_fn)

Data feeding is handled by an `input_fn`, which takes the place of the placeholder variables and `feed_dict` we'd otherwise need. The `input_fn` is defined separately from the `model_fn`, and builds the part of the graph up to `features` and `labels`.

We won't write our own `input_fn` in this assignment, but instead we can just use the existing `numpy_input_fn` implementation. This takes NumPy arrays as inputs, and creates an `input_fn` that will generate minibatches:

```python
train_input_fn = tf.estimator.inputs.numpy_input_fn(
                    x={"ids": train_x, "ns": train_ns}, y=train_y,
                    batch_size=32, num_epochs=20, shuffle=True
                 )
```

You can read more about `input_fn`-s here: 
- [Building Input Functions with tf.Estimator](https://www.tensorflow.org/get_started/input_fn)

**Note:** for this assignment, we'll use a patched version of `tf.estimator.inputs.numpy_input_fn` included with this assignment. This version allows us to seed the random number generator so that training data is shuffled but deterministic.
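
To make the role of an `input_fn` less abstract, here is a plain-Python sketch of the behavior it provides: seeded, shuffled minibatches over a dict of NumPy arrays. This is not how `numpy_input_fn` is implemented (the real one is queue-based and differs in details, such as how the final partial batch of an epoch is handled); the generator name and its arguments are hypothetical.

```python
import numpy as np

def minibatches(x, y, batch_size, num_epochs, shuffle=True, seed=None):
    """Yield (features, labels) batches; x is a dict of equal-length arrays."""
    rng = np.random.RandomState(seed)   # seeding makes the shuffle repeatable
    n = len(y)
    for _ in range(num_epochs):
        order = rng.permutation(n) if shuffle else np.arange(n)
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            # Slice every feature array with the same indices.
            yield {name: arr[idx] for name, arr in x.items()}, y[idx]

# Toy data with the same feature keys the model_fn in this assignment expects.
toy_x = {"ids": np.arange(20).reshape(10, 2), "ns": np.full(10, 2)}
toy_y = np.arange(10)

for features, labels in minibatches(toy_x, toy_y, batch_size=4, num_epochs=1, seed=42):
    print(features["ids"].shape, labels)
```

Calling the generator twice with the same `seed` produces the same batches in the same order, which is the property the patched `numpy_input_fn` is meant to provide.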

### Building an Estimator

With a `model_fn` and an `input_fn` in hand, we can now build and train an Estimator with just a couple of lines:

```python
model_params = dict(...)   # passed as 'params' to the model_fn
model = tf.estimator.Estimator(model_fn=my_model_fn, 
                               params=model_params,
                               model_dir="/tmp/my_model_checkpoints")
model.train(input_fn=train_input_fn)
```

The last line will kick off a train loop, ingesting data until the `input_fn` runs dry (20 epochs, for the one above). We can then evaluate on labeled data by calling `model.evaluate(input_fn=...)`, and run inference on unlabeled data by calling `model.predict(input_fn=...)` with appropriate `input_fn`-s.

_**Note:** You might be wondering why TensorFlow adds all this boilerplate on top of the actual model. It doesn't seem necessary for small-scale experiments like this assignment, but as soon as you scale up to models that take hours, days, or even weeks to train, having robust checkpoint management, live dashboards, and distributed data queues really starts to pay off!_

# Part (f): Training and Evaluation

The cell below defines some model params and sets up a checkpoint directory for TensorBoard.

Use the following default parameters to start, as given below in `model_params`:
```python
embed_dim = 50
hidden_dims = [25]  # single hidden layer
optimizer = 'adagrad'
lr = 0.1  # learning rate
beta = 0.01  # L2 regularization
```

**Note:** Due to a bug in TensorFlow, if you re-use the same checkpoint directory (even after deleting the contents) it will sometimes fail to write the event data for TensorBoard. To work around this, the code below creates a new checkpoint directory each time with a name derived from the timestamp. You may want to delete these after a few runs, since they can take up ~35MB each. To do so just run:

```sh
# On command line
rm -rfv /tmp/tf_bow_sst_*
```

In [13]:
import models; reload(models)

# Specify model hyperparameters as used by model_fn
model_params = dict(V=ds.vocab.size, embed_dim=50, hidden_dims=[25], num_classes=len(ds.target_names),
                    encoder_type='bow',
                    lr=0.1, optimizer='adagrad', beta=0.01)

checkpoint_dir = "/tmp/tf_bow_sst_" + datetime.datetime.now().strftime("%Y%m%d-%H%M")
if os.path.isdir(checkpoint_dir):
    shutil.rmtree(checkpoint_dir)
# Write vocabulary to file, so TensorBoard can label embeddings.
# creates checkpoint_dir/projector_config.pbtxt and checkpoint_dir/metadata.tsv
ds.vocab.write_projector_config(checkpoint_dir, "Encoder/Embedding_Layer/W_embed")

model = tf.estimator.Estimator(model_fn=models.classifier_model_fn, 
                               params=model_params,
                               model_dir=checkpoint_dir)
print("")
print("To view training (once it starts), run:\n")
print("    tensorboard --logdir='{:s}' --port 6006".format(checkpoint_dir))
print("\nThen in your browser, open: http://localhost:6006")

Vocabulary (16,474 words) written to '/tmp/tf_bow_sst_20180211-0524/metadata.tsv'
Projector config written to /tmp/tf_bow_sst_20180211-0524/projector_config.pbtxt
INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': '/tmp/tf_bow_sst_20180211-0524', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f85e261d978>, '_task_type': 'worker', '_task_id': 0, '_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}

To view training (once it starts), run:

    tensorboard --logdir='/tmp/tf_bow_sst_20180211-0524' --port 6006

Then in your browser, open: http://localhost:6006


Now run the cell below to start training! If you run TensorBoard from the command line, you should see loss curves in the "Scalars" tab as training progresses. We've set it up to run an evaluation on the dev set every `train_params['eval_every']` epochs, and this should appear in the same tab as a blue line after a couple of minutes.

Using the default `model_params` above and the following training params, as given in `train_params` below:
```python
batch_size = 32
total_epochs = 20
eval_every = 2  # every 2 epochs, eval the dev set
```
Your model should train very quickly - 20 epochs in under two minutes on a single-core GCE instance.

After 20 epochs, your loss curves should look something like this:

![Loss curves](images/tensorboard_curves.png)

Don't worry if they don't match exactly - colors may vary, and the red dot labeled "eval\_test" won't appear until you run the evaluation cell below. There are also some other curves that you might see: "global\_step/sec" is the number of minibatches per second that the model processes, and the "enqueue\_input/..." plot has to do with the feeder queues that the Estimator API uses to stream data to the model.
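
The step counter in the training log can be sanity-checked with a little arithmetic. The sketch below assumes 6,920 root-level training examples (the size of `train_x` quoted in the Part (f).2 answer) and that batches are drawn continuously across epoch boundaries. The second point is an assumption, though it agrees with the checkpoint steps that appear in the log.

```python
import math

n_train = 6920        # root-level training examples (assumed, see above)
batch_size = 32
eval_every = 2        # epochs per call to model.train()
total_epochs = 20

# Each train() call consumes eval_every epochs of data in batches of 32.
steps_per_call = math.ceil(n_train * eval_every / batch_size)
n_calls = total_epochs // eval_every

print(steps_per_call, steps_per_call * n_calls)  # 433 4330
```

Since `global_step` counts minibatches rather than epochs, a checkpoint at step 433 corresponds to the end of the first two-epoch round, and 4330 to the end of training.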

In [14]:
# Training params, just used in this cell for the input_fn-s
train_params = dict(batch_size=32, total_epochs=20, eval_every=2)
assert(train_params['total_epochs'] % train_params['eval_every'] == 0)

# Construct and train the model, saving checkpoints to the directory above.
# Input function for training set batches
# Do 'eval_every' epochs at once, followed by evaluating on the dev set.
# NOTE: use patched_numpy_io.numpy_input_fn instead of tf.estimator.inputs.numpy_input_fn
train_input_fn = patched_numpy_io.numpy_input_fn(
                    x={"ids": train_x, "ns": train_ns}, y=train_y,
                    batch_size=train_params['batch_size'], 
                    num_epochs=train_params['eval_every'], shuffle=True, seed=42
                 )

# Input function for dev set batches. As above, but:
# - Don't randomize order
# - Iterate exactly once (one epoch)
dev_input_fn = tf.estimator.inputs.numpy_input_fn(
                    x={"ids": dev_x, "ns": dev_ns}, y=dev_y,
                    batch_size=128, num_epochs=1, shuffle=False
                )

for _ in range(train_params['total_epochs'] // train_params['eval_every']):
    # Train for a few epochs, then evaluate on dev
    model.train(input_fn=train_input_fn)
    eval_metrics = model.evaluate(input_fn=dev_input_fn, name="dev")

INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Saving checkpoints for 1 into /tmp/tf_bow_sst_20180211-0524/model.ckpt.
INFO:tensorflow:loss = 1.10304, step = 1
INFO:tensorflow:global_step/sec: 97.6156
INFO:tensorflow:loss = 0.591804, step = 101 (1.029 sec)
INFO:tensorflow:global_step/sec: 101.871
INFO:tensorflow:loss = 0.56995, step = 201 (0.982 sec)
INFO:tensorflow:global_step/sec: 102.549
INFO:tensorflow:loss = 0.467347, step = 301 (0.975 sec)
INFO:tensorflow:global_step/sec: 97.9221
INFO:tensorflow:loss = 0.412307, step = 401 (1.021 sec)
INFO:tensorflow:Saving checkpoints for 433 into /tmp/tf_bow_sst_20180211-0524/model.ckpt.
INFO:tensorflow:Loss for final step: 0.37697.
INFO:tensorflow:Starting evaluation at 2018-02-11-05:38:14
INFO:tensorflow:Restoring parameters from /tmp/tf_bow_sst_20180211-0524/model.ckpt-433
INFO:tensorflow:Finished evaluation at 2018-02-11-05:38:15
INFO:tensorflow:Saving dict for global step 433: accuracy = 0.716743, cross_entropy_loss = 0.589964

INFO:tensorflow:global_step/sec: 114.556
INFO:tensorflow:loss = 0.124997, step = 3232 (0.873 sec)
INFO:tensorflow:global_step/sec: 109.388
INFO:tensorflow:loss = 0.122526, step = 3332 (0.915 sec)
INFO:tensorflow:global_step/sec: 103.469
INFO:tensorflow:loss = 0.11915, step = 3432 (0.966 sec)
INFO:tensorflow:Saving checkpoints for 3464 into /tmp/tf_bow_sst_20180211-0524/model.ckpt.
INFO:tensorflow:Loss for final step: 0.0926181.
INFO:tensorflow:Starting evaluation at 2018-02-11-05:38:55
INFO:tensorflow:Restoring parameters from /tmp/tf_bow_sst_20180211-0524/model.ckpt-3464
INFO:tensorflow:Finished evaluation at 2018-02-11-05:38:55
INFO:tensorflow:Saving dict for global step 3464: accuracy = 0.752294, cross_entropy_loss = 0.677709, global_step = 3464, loss = 0.824788
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Restoring parameters from /tmp/tf_bow_sst_20180211-0524/model.ckpt-3464
INFO:tensorflow:Saving checkpoints for 3465 into /tmp/tf_bow_sst_20180211-0524/model.ckpt.

## Part (f).1: Evaluating Your Model

To evaluate on the test set, we just need to construct another `input_fn`, then call `model.evaluate`. 

**1.)** Fill in the cell below, and run it to compute accuracy on the test set. With the default parameters, you should get accuracy around 77%.

In [17]:
#### YOUR CODE HERE ####
# Code for Part (f).1
# replace with an input_fn, similar to dev_input_fn
test_input_fn = tf.estimator.inputs.numpy_input_fn(
                    x={"ids": test_x, "ns": test_ns}, y=test_y,
                    batch_size=128, num_epochs=1, shuffle=False)  


eval_metrics =  model.evaluate(input_fn=test_input_fn, name="test")  # replace with result of model.evaluate(...)

#### END(YOUR CODE) ####
print("Accuracy on test set: {:.02%}".format(eval_metrics['accuracy']))
eval_metrics

INFO:tensorflow:Starting evaluation at 2018-02-11-05:45:07
INFO:tensorflow:Restoring parameters from /tmp/tf_bow_sst_20180211-0524/model.ckpt-4330
INFO:tensorflow:Finished evaluation at 2018-02-11-05:45:07
INFO:tensorflow:Saving dict for global step 4330: accuracy = 0.7743, cross_entropy_loss = 0.61913, global_step = 4330, loss = 0.763519
Accuracy on test set: 77.43%


{'accuracy': 0.77429986,
 'cross_entropy_loss': 0.61912978,
 'global_step': 4330,
 'loss': 0.76351869}

We can also evaluate the old-fashioned way, by calling `model.predict(...)` and working with the predicted labels directly:

In [18]:
from sklearn.metrics import accuracy_score
predictions = list(model.predict(test_input_fn))  # list of dicts
y_pred = [p['max'] for p in predictions]
acc = accuracy_score(test_y, y_pred)  # sklearn's argument order is (y_true, y_pred)
print("Accuracy on test set: {:.02%}".format(acc))

INFO:tensorflow:Restoring parameters from /tmp/tf_bow_sst_20180211-0524/model.ckpt-4330
Accuracy on test set: 77.43%
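
Working with raw predictions also makes it easy to go beyond a single accuracy number. The sketch below computes accuracy and a confusion matrix by hand with NumPy. The helper names and the toy labels are made up for illustration; in the notebook you would pass `test_y` and `y_pred` instead.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float((y_true == y_pred).mean())

def confusion_matrix(y_true, y_pred, num_classes):
    """cm[t, p] counts examples with true label t and predicted label p."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# Toy labels, not results from the model.
toy_true = [1, 0, 1, 1, 0]
toy_pred = [1, 0, 0, 1, 1]
print(accuracy(toy_true, toy_pred))             # 0.6
print(confusion_matrix(toy_true, toy_pred, 2))
```

The off-diagonal entries show which kind of error dominates, for example whether the model mislabels positive reviews as negative more often than the reverse.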


## Part (f).2: Evaluating on "Interesting" examples

Write your answer in the cell below.

**Question 2.)** In the cell below, repeat what you did above, but evaluate the model on the "interesting" examples. Does the neural bag-of-words model perform well here, as compared to the test set as a whole? How about compared to the Naive Bayes baseline? Explain why this might be, in terms of the phenomena you found in Part (c).2.

## Part (f).2 Answers
<a id="answers_f2"></a>

**2.)** Compared to the test set as a whole, our model performs worse on the interesting examples. Mechanically, the Naive Bayes model simply sums the weights for all word occurrences to make its final prediction, and the neural BOW model simply sums the word embeddings (collapsing them into a single vector) before the feed-forward layers. Both models therefore throw away any information about word order or the sentiment of structured subphrases, so they don't work well when contrast words twist the overall sentiment of a sentence. Compared to the Naive Bayes baseline, the neural BOW model's accuracy is a few percentage points lower. This is possibly because the number of parameters we are estimating is `V*d + (d+1)*h1 + (h1+1)*k`, which is much higher than the `V + 1` in the NB model. We only have 6920 examples in `train_x`, so both our embeddings and our weights are not well trained.

In [22]:
df = ds.test

gb = df.groupby(by=['root_id'])
interesting_ids = []   # root ids, index into ds.test_trees
interesting_idxs = []  # DataFrame indices, index into ds.test
# This groups the DataFrame by sentence
for root_id, idxs in gb.groups.items():
    # Get the average score of all the phrases for this sentence
    mean = df.loc[idxs].label.mean()
    if (mean > 0.4 and mean < 0.6):
        interesting_ids.append(root_id)
        interesting_idxs.extend(idxs)
        
print("Found {:,} interesting examples".format(len(interesting_ids)))
print("Interesting ids (into ds.test_trees): ", interesting_ids)
print("")

# This will extract only the "interesting" sentences we found above
test_x_interesting, test_ns_interesting, test_y_interesting = ds.as_padded_array("test", root_only=True, 
                                                                                 df_idxs=interesting_idxs)
#### YOUR CODE HERE ####
# Code for Part (f).2
test_input_fn_interesting = tf.estimator.inputs.numpy_input_fn(
                    x={"ids": test_x_interesting, "ns": test_ns_interesting}, y=test_y_interesting,
                    batch_size=128, num_epochs=1, shuffle=False)  


eval_metrics_interesting =  model.evaluate(input_fn=test_input_fn_interesting, name="test")

acc = eval_metrics_interesting['accuracy']

#### END(YOUR CODE) ####
print("Accuracy on interesting examples: {:.02%}".format(acc))

Found 246 interesting examples
Interesting ids (into ds.test_trees):  [0, 27, 31, 32, 75, 80, 90, 96, 117, 124, 138, 140, 141, 160, 166, 186, 187, 205, 210, 212, 227, 232, 254, 269, 271, 285, 296, 307, 312, 327, 335, 373, 397, 399, 406, 407, 410, 426, 447, 511, 512, 516, 521, 534, 539, 563, 577, 588, 606, 610, 611, 637, 640, 645, 655, 662, 664, 713, 720, 721, 724, 739, 755, 758, 763, 776, 791, 793, 796, 802, 805, 810, 818, 840, 858, 887, 898, 899, 909, 910, 912, 929, 930, 961, 970, 973, 974, 975, 979, 1008, 1032, 1036, 1066, 1067, 1076, 1098, 1101, 1108, 1114, 1131, 1138, 1142, 1159, 1183, 1185, 1189, 1193, 1198, 1206, 1214, 1215, 1235, 1241, 1243, 1244, 1261, 1267, 1273, 1275, 1279, 1280, 1293, 1296, 1302, 1303, 1312, 1318, 1319, 1321, 1322, 1324, 1326, 1328, 1338, 1341, 1346, 1359, 1363, 1371, 1383, 1398, 1402, 1413, 1443, 1452, 1456, 1458, 1462, 1464, 1480, 1481, 1486, 1487, 1488, 1507, 1509, 1513, 1516, 1527, 1537, 1552, 1576, 1582, 1587, 1594, 1597, 1602, 1607, 1608, 1615, 1619, 1

## Part (f): Tuning Your Model

Our default model from Part (e) performs decently, but doesn't manage to beat even the Naive Bayes baseline. We might be able to fix that with a bit of tuning.

Answer the following in the cell below.

**Question 3.)** Look at your training curves in TensorBoard, after 20 epochs with the default parameters. Do you think that the model would benefit from more training time?
<p>
**Question 4.)** Based on the accuracy trace (on the dev set) and the cross entropy loss curves on the training and dev sets, do you think the model is overfitting?

## Answers for Part (f).3 and 4
<a id="answers_f34"></a>

**3.)** From my TensorBoard graphs, the accuracy and cross entropy loss curves seem to have stabilized already. Further improvement from more training time is unlikely, or marginal at best.

**4.)** The model does seem to be overfitting. The cross entropy loss on the training set drops below 0.1, while the cross entropy loss on the dev set stabilizes above 0.6. Also, the dev set cross entropy stabilizes early in training, while the training set cross entropy keeps dropping for a while.

## Regularization & Tuning

The baseline model uses L2 regularization to combat overfitting, but this isn't particularly effective with neural networks, since a deep network can still learn spurious logical relationships even with small values for the connection weights. Instead, it's common to use _dropout_, in which we randomly "drop out" a subset of the activations by setting them to zero. This prevents units from co-adapting too easily, and often leads to improved generalization.
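To make the mechanism concrete, here is a minimal NumPy sketch of dropout on a batch of activations. It is an illustration only, not the solution for `models.py`: `dropout_sketch` is a hypothetical name, and the sketch assumes the common "inverted" variant, in which the surviving activations are rescaled by `1 / (1 - rate)` at training time so that nothing needs to change at evaluation time.

```python
import numpy as np

def dropout_sketch(activations, rate, is_training, rng):
    """Inverted dropout: zero a random subset of units and rescale the rest."""
    if not is_training or rate == 0.0:
        # At evaluation time the activations pass through unchanged.
        return activations
    keep_prob = 1.0 - rate
    # Each unit is kept independently with probability keep_prob.
    keep_mask = rng.random_sample(activations.shape) < keep_prob
    # Rescale the kept units so the expected activation is unchanged.
    return activations * keep_mask / keep_prob

rng = np.random.RandomState(42)
h = np.ones((4, 6))  # toy batch of hidden activations
h_train = dropout_sketch(h, rate=0.25, is_training=True, rng=rng)
h_eval = dropout_sketch(h, rate=0.25, is_training=False, rng=rng)
print(h_train)
```

Every entry of `h_train` is either zero or `1 / 0.75`, while `h_eval` equals `h`. This is why the layer needs to know whether it is being run for training or for evaluation.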

**(optional) 5.)** In `models.py`, implement dropout by filling in the missing block in the implementation of `fully_connected_layers(...)`. You'll also need to modify your implementation of `BOW_encoder(...)` to pass the `dropout_rate` and `is_training` parameters to `fully_connected_layers(...)`. 

**_Do not_** apply dropout to the softmax layer, or to the embeddings.

**Hint:** use [`tf.layers.dropout`](https://www.tensorflow.org/api_docs/python/tf/layers/dropout).


We've replicated the training code in the cell below - modify `model_params` and `train_params`, and see if you can improve performance with a bit of tuning (_but don't spend too much time on this!_). Some things that might be worth trying:

- Enable dropout, and experiment with `dropout_rate`
- Train for more epochs (40 or 60). (_But, what happens if you train for too long?_)
- Use more hidden layers
- Use larger embedding and hidden dimensions
- Re-generate the training set with `root_only=False`, which will give a set with fine-grained labels
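
If you want to try several of these settings systematically, one option is to enumerate a small grid of candidates up front. The sketch below is an assumption about how you might organize this, not part of the assignment code: `search_space` and `iter_configs` are hypothetical names, the values are arbitrary examples, and the keys mirror the `model_params` entries used in the training cell.

```python
import itertools

# Hypothetical search space; keys mirror entries of model_params.
search_space = {
    "dropout_rate": [0.0, 0.25, 0.5],
    "embed_dim": [50, 100],
    "hidden_dims": [[25], [50, 50]],
}

def iter_configs(space):
    """Yield one dict per combination of the listed values."""
    keys = sorted(space)
    for values in itertools.product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))

configs = list(iter_configs(search_space))
print("{:d} candidate settings".format(len(configs)))
```

Each candidate can be merged into the base settings with `dict(model_params, **config)` before constructing a new estimator. Keep the grid small, since every entry costs a full training run.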


**Note:** As it turns out, Naive Bayes is actually a pretty strong model for this dataset and it won't be easy to get a neural model to beat it *(see Table 1 from [Socher et al. 2013](http://nlp.stanford.edu/~socherr/EMNLP2013_RNTN.pdf) - our model is closest in design to the *VecAvg* model)*. Don't worry if tuning doesn't seem to help much for this particular problem.

In [37]:
# Run this if you implement dropout
reload(models)
utils.run_tests(models_test, ["TestFCWithDropout"])

test_fc_with_dropout (models_test.TestFCWithDropout) ... ok

----------------------------------------------------------------------
Ran 1 test in 0.044s

OK


In [40]:
import models; reload(models)

# Specify model hyperparameters as used by model_fn
model_params = dict(V=ds.vocab.size, embed_dim=70, hidden_dims=[35,35], num_classes=len(ds.target_names),
                    encoder_type='bow',
                    lr=0.1, optimizer='adagrad', beta=0.01, dropout_rate=0.25)  # fill this in

# Specify training schedule
train_params = dict(batch_size=32, total_epochs=60, eval_every=2)  # fill this in

assert(train_params['total_epochs'] % train_params['eval_every'] == 0)

###
# Don't change anything below this line
###
checkpoint_dir = "/tmp/tf_bow_sst_" + datetime.datetime.now().strftime("%Y%m%d-%H%M")
if os.path.isdir(checkpoint_dir): shutil.rmtree(checkpoint_dir)
ds.vocab.write_projector_config(checkpoint_dir, "Encoder/Embedding_Layer/W_embed")

model = tf.estimator.Estimator(model_fn=models.classifier_model_fn, params=model_params, model_dir=checkpoint_dir)
print("\nTo view training (once it starts), run:\n")
print("    tensorboard --logdir='{:s}' --port 6006".format(checkpoint_dir))
print("\nThen in your browser, open: http://localhost:6006\n")

train_input_fn = patched_numpy_io.numpy_input_fn(
                    x={"ids": train_x, "ns": train_ns}, y=train_y,
                    batch_size=train_params['batch_size'], 
                    num_epochs=train_params['eval_every'], shuffle=True, seed=42)
dev_input_fn = patched_numpy_io.numpy_input_fn(
                    x={"ids": dev_x, "ns": dev_ns}, y=dev_y,
                    batch_size=128, num_epochs=1, shuffle=False)
for _ in range(train_params['total_epochs'] // train_params['eval_every']):
    model.train(input_fn=train_input_fn)
    model.evaluate(input_fn=dev_input_fn, name="dev")

Vocabulary (16,474 words) written to '/tmp/tf_bow_sst_20180211-0753/metadata.tsv'
Projector config written to /tmp/tf_bow_sst_20180211-0753/projector_config.pbtxt
INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': '/tmp/tf_bow_sst_20180211-0753', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f85e0347240>, '_task_type': 'worker', '_task_id': 0, '_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}

To view training (once it starts), run:

    tensorboard --logdir='/tmp/tf_bow_sst_20180211-0753' --port 6006

Then in your browser, open: http://localhost:6006

INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Saving checkpoints for

INFO:tensorflow:global_step/sec: 81.6812
INFO:tensorflow:loss = 0.110953, step = 2999 (1.224 sec)
INFO:tensorflow:Saving checkpoints for 3031 into /tmp/tf_bow_sst_20180211-0753/model.ckpt.
INFO:tensorflow:Loss for final step: 0.0841192.
INFO:tensorflow:Starting evaluation at 2018-02-11-07:54:21
INFO:tensorflow:Restoring parameters from /tmp/tf_bow_sst_20180211-0753/model.ckpt-3031
INFO:tensorflow:Finished evaluation at 2018-02-11-07:54:21
INFO:tensorflow:Saving dict for global step 3031: accuracy = 0.738532, cross_entropy_loss = 0.828486, global_step = 3031, loss = 0.97414
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Restoring parameters from /tmp/tf_bow_sst_20180211-0753/model.ckpt-3031
INFO:tensorflow:Saving checkpoints for 3032 into /tmp/tf_bow_sst_20180211-0753/model.ckpt.
INFO:tensorflow:loss = 0.0986938, step = 3032
INFO:tensorflow:global_step/sec: 79.2198
INFO:tensorflow:loss = 0.099547, step = 3132 (1.266 sec)
INFO:tensorflow:global_step/sec: 83.4601
INFO:tensorf

INFO:tensorflow:Loss for final step: 0.0808796.
INFO:tensorflow:Starting evaluation at 2018-02-11-07:55:15
INFO:tensorflow:Restoring parameters from /tmp/tf_bow_sst_20180211-0753/model.ckpt-6062
INFO:tensorflow:Finished evaluation at 2018-02-11-07:55:15
INFO:tensorflow:Saving dict for global step 6062: accuracy = 0.744266, cross_entropy_loss = 0.781448, global_step = 6062, loss = 0.914353
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Restoring parameters from /tmp/tf_bow_sst_20180211-0753/model.ckpt-6062
INFO:tensorflow:Saving checkpoints for 6063 into /tmp/tf_bow_sst_20180211-0753/model.ckpt.
INFO:tensorflow:loss = 0.0906378, step = 6063
INFO:tensorflow:global_step/sec: 73.1613
INFO:tensorflow:loss = 0.0948631, step = 6163 (1.371 sec)
INFO:tensorflow:global_step/sec: 76.584
INFO:tensorflow:loss = 0.0989347, step = 6263 (1.306 sec)
INFO:tensorflow:global_step/sec: 77.0977
INFO:tensorflow:loss = 0.099958, step = 6363 (1.297 sec)
INFO:tensorflow:global_step/sec: 77.1594
INF

INFO:tensorflow:Finished evaluation at 2018-02-11-07:56:08
INFO:tensorflow:Saving dict for global step 9093: accuracy = 0.745413, cross_entropy_loss = 0.759242, global_step = 9093, loss = 0.888405
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Restoring parameters from /tmp/tf_bow_sst_20180211-0753/model.ckpt-9093
INFO:tensorflow:Saving checkpoints for 9094 into /tmp/tf_bow_sst_20180211-0753/model.ckpt.
INFO:tensorflow:loss = 0.0881446, step = 9094
INFO:tensorflow:global_step/sec: 80.5323
INFO:tensorflow:loss = 0.0935572, step = 9194 (1.245 sec)
INFO:tensorflow:global_step/sec: 83.6153
INFO:tensorflow:loss = 0.0961791, step = 9294 (1.196 sec)
INFO:tensorflow:global_step/sec: 84.1875
INFO:tensorflow:loss = 0.0982487, step = 9394 (1.188 sec)
INFO:tensorflow:global_step/sec: 82.8961
INFO:tensorflow:loss = 0.0997345, step = 9494 (1.206 sec)
INFO:tensorflow:Saving checkpoints for 9526 into /tmp/tf_bow_sst_20180211-0753/model.ckpt.
INFO:tensorflow:Loss for final step: 0.0789016.

INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Restoring parameters from /tmp/tf_bow_sst_20180211-0753/model.ckpt-12124
INFO:tensorflow:Saving checkpoints for 12125 into /tmp/tf_bow_sst_20180211-0753/model.ckpt.
INFO:tensorflow:loss = 0.0867562, step = 12125
INFO:tensorflow:global_step/sec: 72.4472
INFO:tensorflow:loss = 0.0927769, step = 12225 (1.384 sec)
INFO:tensorflow:global_step/sec: 76.1289
INFO:tensorflow:loss = 0.0947497, step = 12325 (1.314 sec)
INFO:tensorflow:global_step/sec: 75.1643
INFO:tensorflow:loss = 0.0974639, step = 12425 (1.330 sec)
INFO:tensorflow:global_step/sec: 76.4365
INFO:tensorflow:loss = 0.098924, step = 12525 (1.308 sec)
INFO:tensorflow:Saving checkpoints for 12557 into /tmp/tf_bow_sst_20180211-0753/model.ckpt.
INFO:tensorflow:Loss for final step: 0.0778168.
INFO:tensorflow:Starting evaluation at 2018-02-11-07:57:08
INFO:tensorflow:Restoring parameters from /tmp/tf_bow_sst_20180211-0753/model.ckpt-12557
INFO:tensorflow:Finished evaluation at 20

In [41]:
test_input_fn = patched_numpy_io.numpy_input_fn(
                    x={"ids": test_x, "ns": test_ns}, y=test_y,
                    batch_size=128, num_epochs=1, shuffle=False)
eval_metrics = model.evaluate(input_fn=test_input_fn, name="test")
print("Accuracy on test set: {:.02%}".format(eval_metrics['accuracy']))

INFO:tensorflow:Starting evaluation at 2018-02-11-07:58:11
INFO:tensorflow:Restoring parameters from /tmp/tf_bow_sst_20180211-0753/model.ckpt-12990
INFO:tensorflow:Finished evaluation at 2018-02-11-07:58:12
INFO:tensorflow:Saving dict for global step 12990: accuracy = 0.769358, cross_entropy_loss = 0.704299, global_step = 12990, loss = 0.830208
Accuracy on test set: 76.94%
