# Recurrent Neural Network Language Model

This is the "working notebook", with skeleton code to load and train your model, as well as run unit tests. See [rnnlm-instructions.ipynb](rnnlm-instructions.ipynb) for the main writeup.

Run the cell below to import packages.

In [2]:
from __future__ import absolute_import
from __future__ import print_function
from __future__ import division

import json, os, re, shutil, sys, time
from importlib import reload
import collections, itertools
import unittest
from IPython.display import display, HTML

# NLTK for NLP utils and corpora
import nltk

# NumPy and TensorFlow
import numpy as np
import tensorflow as tf
assert(tf.__version__.startswith("1."))

# Helper libraries
from w266_common import utils, vocabulary, tf_embed_viz

# Your code
import rnnlm; reload(rnnlm)
import rnnlm_test; reload(rnnlm_test)



<module 'rnnlm_test' from '/home/nconidas/w266/assignment/a3/lstm/rnnlm_test.py'>

## (a) RNNLM Inputs and Parameters  

### Questions for Part (a)
You should use big-O notation when appropriate (e.g. computing $\exp(\mathbf{v})$ for a vector $\mathbf{v}$ of length $n$ is $O(n)$ operations).  Assume for problems a(1-5) that:   
> -- the cell is one layer,  
> -- the embedding feature length and hidden-layer feature length are both H, and   
> -- the batch and max_time dimensions can be ignored for the moment.  

1. Let $\text{CellFunc}$ be a simple RNN cell (see async Section 5.8). Write the cell equation in terms of nonlinearities and matrix multiplication. How many parameters (matrix or vector elements) are there for this cell, in terms of `V` and `H`?
<p>
2. How many parameters are in the embedding layer? In the output layer? (By parameters, we mean total number of matrix elements across all train-able tensors. A $m \times n$ matrix has $mn$ elements.)
<p>
3. How many calculations (floating point operations) are required to compute $\hat{P}(w^{(i+1)})$ for a given *single* target word $w^{(i+1)}$, assuming $w^{(i)}$ given and $h^{(i-1)}$ already computed? How about for *all* target words?
<p>
4. How does your answer to 3. change if we approximate $\hat{P}(w^{(i+1)})$ with a sampled softmax with $k$ samples? How about if we use a hierarchical softmax? (*Recall that hierarchical softmax makes a series of left/right decisions using a binary classifier $P_s(\text{right}) = \sigma(u_s \cdot o^{(i)} + b_s)$ at each split $s$ in the tree.*)
<p>
5. If you have an LSTM with $H = 200$ and use sampled softmax with $k = 100$, what part of the network takes up the most computation time during training? (*Choose "embedding layer", "recurrent layer", or "output layer"*.)

Note: for $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{n \times l}$, computing the matrix product $AB$ takes $O(mnl)$ time.
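
As a rough illustration of the note above, the sketch below multiplies two small matrices with the schoolbook triple loop and counts scalar multiplications. The function name `matmul_count` is made up for this example. The count comes out to exactly $mnl$, which is where the $O(mnl)$ bound comes from.

```python
def matmul_count(A, B):
    """Multiply A (m x n) by B (n x l) naively; return (product, multiply count)."""
    m, n, l = len(A), len(B), len(B[0])
    assert all(len(row) == n for row in A), "inner dimensions must match"
    C = [[0] * l for _ in range(m)]
    mults = 0
    for i in range(m):
        for j in range(l):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
                mults += 1
    return C, mults

# A [1, 3] row vector times a [3, 2] matrix: 1 * 3 * 2 = 6 multiplications.
product, mults = matmul_count([[1, 2, 3]], [[1, 0], [0, 1], [1, 1]])
print(product, mults)
```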

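1. $h^{(i)} = \tanh(M W + b)$, where $M = [x^{(i)}, h^{(i-1)}]$ has shape [1, 2H], the weight matrix $W$ has shape [2H, H], and the bias $b$ has shape [1, H]. The cell therefore has $2H^2 + H$ parameters, independent of `V`.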

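2. Embedding layer: a [V, H] matrix, so $VH$ parameters. Output layer: a [H, V] weight matrix, so $HV$ parameters (plus $V$ more if a bias vector is used).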

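3. 

  * #### One Target Word 
      * Recurrent Layer: [1,2H] x [2H,H] = O(2H^2) = O(H^2)
      * Output Layer: the logit for the target word alone is [1,H] x [H,1] = O(H), but a full softmax also needs the other V - 1 logits for its normalizer, which is O(HV). Therefore this takes O(H^2 + HV) time with a full softmax, or O(H^2) if only the unnormalized score is needed.
  * #### All Target Words 
      * Recurrent Layer: [1,2H] x [2H,H] = O(2H^2) = O(H^2)
      * Output Layer: [1,H] x [H,V] = O(HV). Therefore this takes O(H^2 + HV) time
      * V may be orders of magnitude larger than H, so the O(HV) term usually dominates.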

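4. 
  * Instead of computing a logit for every word in V, sampled softmax only does it for the target and k sampled words, so the output layer runs in O(Hk) time.
  
  * Hierarchical softmax makes one O(H) binary decision per level of the tree, about log V levels for a balanced tree, so it runs in O(H log V) time.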
 
5. 
  * #### Recurrent Layer 
      * An LSTM has four gate matrices: [1,400] x [400, 800] = 320,000 multiply-adds
  * #### Output Layer
      * [1,200] x [200, 100] = 20,000 multiply-adds
      
  * The Recurrent Layer should take the longest 
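
The counts discussed in the answers above can be checked with a few lines of arithmetic. This is a back-of-the-envelope sketch, not a profiler: the function names are invented for this example, constant factors and nonlinearities are ignored, the LSTM is assumed to be the standard four-gate formulation, and hierarchical softmax is assumed to use a balanced binary tree. `H = 200` and `k = 100` come from question 5, and `V = 10000` is the vocabulary size used later in this notebook.

```python
import math

def rnn_cell_params(H):
    """Simple RNN cell: a [2H, H] weight matrix plus a length-H bias."""
    return 2 * H * H + H

def lstm_cell_params(H):
    """Four-gate LSTM cell: each gate has [2H, H] weights and a length-H bias."""
    return 4 * (2 * H * H + H)

def output_cost(H, V, mode="full", k=None):
    """Approximate multiply-adds for the output layer at one time step."""
    if mode == "full":
        return H * V                        # one dot product per vocabulary word
    if mode == "sampled":
        return H * k                        # one dot product per sampled word
    if mode == "hierarchical":
        return H * math.ceil(math.log2(V))  # one dot product per tree level
    raise ValueError("unknown mode: " + mode)

H, V, k = 200, 10000, 100
print("simple RNN cell params:", rnn_cell_params(H))
print("LSTM cell params:      ", lstm_cell_params(H))
print("full softmax cost:     ", output_cost(H, V))
print("sampled softmax cost:  ", output_cost(H, V, mode="sampled", k=k))
print("hierarchical cost:     ", output_cost(H, V, mode="hierarchical"))
```

With these assumptions the LSTM's recurrent step is more than ten times the cost of a 100-sample output layer, while a full softmax over 10,000 words would dwarf both.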

    


## (b) Implementing the RNNLM

### Aside: Shapes Review

Before we start, let's review our understanding of the shapes involved in this assignment and how they manifest themselves in the TF API.

As in the [instructions](rnnlm-instructions.ipynb) notebook, $w$ is a matrix of wordids with shape batch_size x max_time.  Passing this through the embedding layer, we retrieve the word embedding for each, resulting in $x$ having shape batch_size x max_time x embedding_dim.  I find it useful to draw this out on a piece of paper.  When you do, you should end up with a rectangular prism with batch_size height, max_time width and embedding_dim depth.  Many tensors in this assignment share this shape (e.g. $o$, the output from the LSTM, which represents the hidden layer going into the softmax to make a prediction at every time step in every batch).

![Three Dimensional Shape](common_shape.png)

Since batch size and sentence length are only resolved when we run the graph, we construct the placeholder using "None" in the dimensions we don't know.  The .shape property renders these as ?s.  This should be familiar to you from batch size handling in earlier assignments, only now there are two dimensions of variable length.

See the next cell for a concrete example (though in practice, we'd use a TensorFlow variable that we can train for the embeddings rather than a static array).  Notice how the shape of x_val matches the shape described earlier in this cell.

In [3]:
tf.reset_default_graph()

wordid_ph = tf.placeholder(tf.int32, shape=[None, None])
embedding_matrix = np.array([[1, 1, 1], [2, 2, 2], [3, 3, 3]])
x = tf.nn.embedding_lookup(embedding_matrix, wordid_ph)

print('wordid placeholder shape:', wordid_ph.shape)
print('x shape:', x.shape)

sess = tf.Session()
# Two sentences, each with four words.
wordids = [[1, 2, 1, 2], [0, 0, 0, 0]]
x_val = sess.run(x, feed_dict={wordid_ph: wordids})
print('Embeddings shape:', x_val.shape)  # 2 sentences, 4 words, 3 embedding dims
print('Embeddings value:\n', x_val)

wordid placeholder shape: (?, ?)
x shape: (?, ?, 3)
Embeddings shape: (2, 4, 3)
Embeddings value:
 [[[2 2 2]
  [3 3 3]
  [2 2 2]
  [3 3 3]]

 [[1 1 1]
  [1 1 1]
  [1 1 1]
  [1 1 1]]]
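
The same shapes can be reproduced without building a graph. The sketch below uses NumPy integer-array indexing as an analogue of the lookup; it is only a convenient way to check shapes, not a description of what `tf.nn.embedding_lookup` does internally. The name `x_np` is made up for this example.

```python
import numpy as np

embedding_matrix = np.array([[1, 1, 1], [2, 2, 2], [3, 3, 3]])
wordids = np.array([[1, 2, 1, 2], [0, 0, 0, 0]])  # [batch_size, max_time]

# Integer-array indexing picks one row of the matrix per word id,
# so the result gains a trailing embedding_dim axis.
x_np = embedding_matrix[wordids]
print(x_np.shape)
```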


### Implementing the RNNLM

In order to better manage the model parameters, we'll implement our RNNLM in the `RNNLM` class in `rnnlm.py`. We've given you a skeleton of starter code for this, but the bulk of the implementation is left to you.

In [19]:
reload(rnnlm)

TF_GRAPHDIR = "/tmp/w266/a3_graph"

# Clear old log directory.
shutil.rmtree(TF_GRAPHDIR, ignore_errors=True)

lm = rnnlm.RNNLM(V=10000, H=200, num_layers=2)
lm.BuildCoreGraph()
lm.BuildTrainGraph()
lm.BuildSamplerGraph()

summary_writer = tf.summary.FileWriter(TF_GRAPHDIR, lm.graph)

(?, ?, 10000)


The code above will load your implementation, construct the graph, and write a logdir for TensorBoard. You can bring up TensorBoard with:
```
cd assignment/a3
tensorboard --logdir /tmp/w266/a3_graph --port 6006
```
As usual, check http://localhost:6006/ and visit the "Graphs" tab to inspect your implementation. Remember, judicious use of `tf.name_scope()` and/or `tf.variable_scope()` will greatly improve the visualization, and make code easier to debug.

We've provided a few unit tests below to verify some *very* basic properties of your model.

In [81]:
reload(rnnlm); reload(rnnlm_test)
utils.run_tests(rnnlm_test, ["TestRNNLMCore", "TestRNNLMTrain", "TestRNNLMSampler"])

test_shapes_embed (rnnlm_test.TestRNNLMCore) ... ok
test_shapes_output (rnnlm_test.TestRNNLMCore) ... ok
test_shapes_recurrent (rnnlm_test.TestRNNLMCore) ... ok
test_shapes_train (rnnlm_test.TestRNNLMTrain) ... ok
test_shapes_sample (rnnlm_test.TestRNNLMSampler) ... 

Tensor("Reshape_1:0", shape=(?, ?, 1), dtype=int64)


ok

----------------------------------------------------------------------
Ran 5 tests in 2.025s

OK


Note that the error messages are intentionally somewhat spare, and that passing tests are no guarantee of model correctness! Your best chance of success is through careful coding and understanding of how the model works.

## (c) Training your RNNLM (5 points)

We'll give you data loader functions in **`utils.py`**. They work similarly to the loaders in the Week 5 notebook.

Particularly, `utils.rnnlm_batch_generator` will return an iterator that yields minibatches in the correct format. Batches will be of size `[batch_size, max_time]`, and consecutive batches will line up along rows so that the final state $h^{\text{final}}$ of one batch can be used as the initial state $h^{\text{init}}$ for the next.

For example, using a toy corpus:  
*(Ignore the ugly formatter code.)*

In [6]:
toy_corpus = "<s> Mary had a little lamb . <s> The lamb was white as snow . <s>"
toy_corpus = np.array(toy_corpus.split())

html = "<h3>Input words w:</h3>"
html += "<table><tr><th>Batch 0</th><th>Batch 1</th></tr><tr>"
bi = utils.rnnlm_batch_generator(toy_corpus, batch_size=2, max_time=4)
for i, (w,y) in enumerate(bi):
    cols = ["w_%d" % d for d in range(w.shape[1])]
    html += "<td>{:s}</td>".format(utils.render_matrix(w, cols=cols, dtype=object))
html += "</tr></table>"
display(HTML(html))

html = "<h3>Target words y:</h3>"
html += "<table><tr><th>Batch 0</th><th>Batch 1</th></tr><tr>"
bi = utils.rnnlm_batch_generator(toy_corpus, batch_size=2, max_time=4)
for i, (w,y) in enumerate(bi):
    cols = ["y_%d" % d for d in range(y.shape[1])]
    html += "<td>{:s}</td>".format(utils.render_matrix(y, cols=cols, dtype=object))
html += "</tr></table>"
display(HTML(html))

Input words w:

Batch 0:

|   | w_0 | w_1 | w_2 | w_3 |
|---|-----|-----|-----|-----|
| 0 | `<s>` | Mary | had | a |
| 1 | `<s>` | The | lamb | was |

Batch 1:

|   | w_0 | w_1 | w_2 |
|---|-----|-----|-----|
| 0 | little | lamb | . |
| 1 | white | as | snow |


Target words y:

Batch 0:

|   | y_0 | y_1 | y_2 | y_3 |
|---|-----|-----|-----|-----|
| 0 | Mary | had | a | little |
| 1 | The | lamb | was | white |

Batch 1:

|   | y_0 | y_1 | y_2 |
|---|-----|-----|-----|
| 0 | lamb | . | `<s>` |
| 1 | as | snow | . |


Note that the data we feed to our model will be word indices, but the shape will be the same.
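
To see why consecutive batches line up along rows, here is a plain-Python sketch that reproduces the toy output above. It is a guess at the slicing logic that is consistent with that output, not the actual code in `utils.py`, and the name `toy_batch_generator` is made up for this example.

```python
def toy_batch_generator(tokens, batch_size, max_time):
    """Yield (w, y) batches whose rows continue from one batch to the next."""
    # Keep a multiple of batch_size tokens, leaving one extra for the last target.
    clip_len = ((len(tokens) - 1) // batch_size) * batch_size
    row_len = clip_len // batch_size
    inputs, targets = tokens[:clip_len], tokens[1:clip_len + 1]
    # Row r holds one contiguous stretch of the corpus.
    w_rows = [inputs[r * row_len:(r + 1) * row_len] for r in range(batch_size)]
    y_rows = [targets[r * row_len:(r + 1) * row_len] for r in range(batch_size)]
    # Slice the rows into windows of at most max_time columns.
    for start in range(0, row_len, max_time):
        yield ([row[start:start + max_time] for row in w_rows],
               [row[start:start + max_time] for row in y_rows])

toy_tokens = "<s> Mary had a little lamb . <s> The lamb was white as snow . <s>".split()
for w, y in toy_batch_generator(toy_tokens, batch_size=2, max_time=4):
    print("w:", w)
    print("y:", y)
```

Because each row is a contiguous slice of the corpus, row 0 of batch 1 picks up exactly where row 0 of batch 0 stopped, which is what makes carrying the hidden state across batches meaningful.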

### 1. Implement the `run_epoch` function
We've given you some starter code for logging progress; fill this in with actual call(s) to `session.run` with the appropriate arguments to run a training step. 

Be sure to handle the initial state properly at the beginning of an epoch, and remember to carry over the final state from each batch and use it as the initial state for the next.

**Note:** we provide a `train=True` flag to enable train mode. If `train=False`, this function can also be used for scoring the dataset - see `score_dataset()` below.
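
The state hand-off described above can be sketched without TensorFlow. In this toy version, `toy_step` is a made-up stand-in for the `session.run` call (it just accumulates row sums), and the "hidden state" is a plain list with one number per batch row. It is meant only to show the control flow: start from a clean state on the first batch, then feed each batch's final state in as the next batch's initial state.

```python
def toy_step(h, w_batch):
    """Hypothetical stand-in for one session.run call: returns (cost, final_h)."""
    final_h = [h_row + sum(w_row) for h_row, w_row in zip(h, w_batch)]
    cost = float(sum(final_h))
    return cost, final_h

def toy_run_epoch(batches, batch_size):
    """Thread the state through every batch; return (average cost, last state)."""
    total_cost, total_batches, h = 0.0, 0, None
    for i, w_batch in enumerate(batches):
        if i == 0:
            h = [0] * batch_size        # clean initial state at the start of the epoch
        cost, h = toy_step(h, w_batch)  # final state becomes the next initial state
        total_cost += cost
        total_batches = i + 1
    return total_cost / total_batches, h

toy_batches = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
avg_cost, last_h = toy_run_epoch(toy_batches, batch_size=2)
print(avg_cost, last_h)
```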

#### Questions

1.  Explain what this function does.  Be sure to include the role of `batch_iterator` and what's going on with `h` in the inner loop.

In [7]:
def run_epoch(lm, session, batch_iterator,
              train=False, verbose=False,
              tick_s=10, learning_rate=None):
    assert(learning_rate is not None)
    start_time = time.time()
    tick_time = start_time  # for showing status
    total_cost = 0.0  # total cost, summed over all words
    total_batches = 0
    total_words = 0

    if train:
        train_op = lm.train_step_
        use_dropout = True
        loss = lm.train_loss_
    else:
        train_op = tf.no_op()
        use_dropout = False  # no dropout at test time
        loss = lm.loss_  # true loss, if train_loss is an approximation

    for i, (w, y) in enumerate(batch_iterator):
        # At first batch in epoch, get a clean initial state.
        if i == 0:
            h = session.run(lm.initial_h_, {lm.input_w_: w})

        feed_dict = {
            lm.input_w_: w,
            lm.target_y_: y,
            lm.initial_h_: h,
            lm.learning_rate_: learning_rate,
            lm.use_dropout_: use_dropout
        }
        ops = [loss, lm.final_h_, train_op]        
        #### YOUR CODE HERE ####
        # session.run(...) the ops with the feed_dict constructed above.
        # Ensure "cost" becomes the value of "loss".
        # Hint: see "ops" for other variables that need updating in this loop.
        cost,h,_ = session.run(ops,feed_dict=feed_dict)

        #### END(YOUR CODE) ####
        total_cost += cost
        total_batches = i + 1
        total_words += w.size  # w.size = batch_size * max_time

        ##
        # Print average loss-so-far for epoch
        # If using train_loss_, this may be an underestimate.
        if verbose and (time.time() - tick_time >= tick_s):
            avg_cost = total_cost / total_batches
            avg_wps = total_words / (time.time() - start_time)
            print("[batch {:d}]: seen {:d} words at {:.1f} wps, loss = {:.3f}".format(
                i, total_words, avg_wps, avg_cost))
            tick_time = time.time()  # reset time ticker

    return total_cost / total_batches

In [8]:
def score_dataset(lm, session, ids, name="Data"):
    # For scoring, we can use larger batches to speed things up.
    bi = utils.rnnlm_batch_generator(ids, batch_size=100, max_time=100)
    cost = run_epoch(lm, session, bi, 
                     learning_rate=0.0, train=False, 
                     verbose=False, tick_s=3600)
    print("{:s}: avg. loss: {:.03f}  (perplexity: {:.02f})".format(name, cost, np.exp(cost)))
    return cost
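
Since `score_dataset` reports perplexity as `np.exp(cost)`, the loss is an average cross-entropy in nats. The short sketch below shows the conversion in both directions using only the standard library; the helper names are made up for this example. A useful sanity check is that a model guessing uniformly over `V` words has loss $\log V$ and perplexity exactly `V`.

```python
import math

def perplexity(avg_loss):
    """Perplexity for an average per-word cross-entropy loss measured in nats."""
    return math.exp(avg_loss)

def loss_for_perplexity(ppl):
    """Inverse conversion: the average loss that corresponds to a perplexity."""
    return math.log(ppl)

# Uniform guessing over a 10,000-word vocabulary.
uniform_loss = math.log(10000)
print("uniform loss: {:.3f}".format(uniform_loss))
print("uniform perplexity: {:.1f}".format(perplexity(uniform_loss)))
```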

You can use the cell below to verify your implementation of `run_epoch`, and to test your RNN on a (very simple) toy dataset:

In [None]:
reload(rnnlm); reload(rnnlm_test)
th = rnnlm_test.RunEpochTester("test_toy_model")
th.setUp(); th.injectCode(run_epoch, score_dataset)
unittest.TextTestRunner(verbosity=2).run(th)

Note that as above, this is a *very* simple test case that does not guarantee model correctness.

### 2. Run Training

We'll give you the outline of the training procedure, but you'll need to fill in a call to your `run_epoch` function. 

At the end of training, we use a `tf.train.Saver` to save a copy of the model to `/tmp/w266/a3_model/rnnlm_trained`. You'll want to load this from disk to work on later parts of the assignment; see **part (d)** for an example of how this is done.

#### Tuning Hyperparameters
With a sampled softmax loss, the default hyperparameters should train 5 epochs in around 15 minutes on a single-core GCE instance, and reach a training set perplexity between 120 and 140.

However, it's possible to do significantly better. Try experimenting with multiple RNN layers (`num_layers` > 1) or a larger hidden state - though you may also need to adjust the learning rate and number of epochs for a larger model.

You can also experiment with a larger vocabulary. This will look worse for perplexity, but will be a better model overall as it won't treat so many words as `<unk>`.

#### Notes on Speed

To speed things up, you may want to re-start your GCE instance with more CPUs. Using a 16-core machine will train *very* quickly if using a sampled softmax loss, almost as fast as a GPU. (Because of the sequential nature of the model, GPUs aren't actually much faster than CPUs for training and running RNNs.) The training code will print the words-per-second processed; with the default settings on a single core, you can expect around 8000 WPS, or up to more than 25000 WPS on a fast multi-core machine.

You might also want to modify the code below to only run `score_dataset` at the very end, after all epochs are completed. This will speed things up significantly, since `score_dataset` uses the full softmax loss and so can often take longer than a whole training epoch!
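
To turn a words-per-second figure into wall-clock time, divide the number of training tokens by the throughput. The sketch below does this for the two rates quoted above, using the 924,077-token training set size that the corpus loader reports further down. It covers training steps only and ignores time spent in `score_dataset`; the helper name `epoch_seconds` is made up for this example.

```python
def epoch_seconds(num_tokens, words_per_second):
    """Rough time for one pass over the training data, in seconds."""
    return num_tokens / words_per_second

train_tokens = 924077  # training-set tokens reported when the corpus is loaded
for wps in (8000, 25000):
    print("{:d} wps: about {:.0f} s per epoch".format(
        wps, epoch_seconds(train_tokens, wps)))
```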

#### Submitting your model
You should submit your trained model along with the assignment. Do:
```
cp /tmp/w266/a3_model/rnnlm_trained* .
git add rnnlm_trained*
git commit -m "Adding trained model."
```
Unless you train a very large model, these files should be < 50 MB and no problem for git to handle. If you do also train a large model, please only submit the smaller one.

In [15]:
# Load the dataset
V = 10000
vocab, train_ids, test_ids = utils.load_corpus("brown", split=0.8, V=V, shuffle=42)

[nltk_data] Downloading package brown to /home/nconidas/nltk_data...
[nltk_data]   Package brown is already up-to-date!
Vocabulary: 10,000 types
Loaded 57,340 sentences (1.16119e+06 tokens)
Training set: 45,872 sentences (924,077 tokens)
Test set: 11,468 sentences (237,115 tokens)


In [16]:
# Training parameters
max_time = 25
batch_size = 100
learning_rate = 0.01
num_epochs = 5

# Model parameters
model_params = dict(V=vocab.size, 
                    H=200, 
                    softmax_ns=200,
                    num_layers=2)

TF_SAVEDIR = "/tmp/w266/a3_model"
checkpoint_filename = os.path.join(TF_SAVEDIR, "rnnlm")
trained_filename = os.path.join(TF_SAVEDIR, "rnnlm_trained")

In [26]:
# Will print status every this many seconds
print_interval = 5

lm = rnnlm.RNNLM(**model_params)
lm.BuildCoreGraph()
lm.BuildTrainGraph()

# Explicitly add global initializer and variable saver to LM graph
with lm.graph.as_default():
    initializer = tf.global_variables_initializer()
    saver = tf.train.Saver()
    
# Clear old log directory
shutil.rmtree(TF_SAVEDIR, ignore_errors=True)
if not os.path.isdir(TF_SAVEDIR):
    os.makedirs(TF_SAVEDIR)

with tf.Session(graph=lm.graph) as session:
    # Seed RNG for repeatability
    tf.set_random_seed(42)

    session.run(initializer)
    
    
    for epoch in range(1,num_epochs+1):
        t0_epoch = time.time()
        bi = utils.rnnlm_batch_generator(train_ids, batch_size, max_time)
        print("[epoch {:d}] Starting epoch {:d}".format(epoch, epoch))
        #### YOUR CODE HERE ####
        # Run a training epoch.
        run_epoch(lm, session, bi, train=True,
                  verbose=True, tick_s=1.0, learning_rate=learning_rate)
        
        #### END(YOUR CODE) ####
        print("[epoch {:d}] Completed in {:s}".format(epoch, utils.pretty_timedelta(since=t0_epoch)))
    
        # Save a checkpoint
        saver.save(session, checkpoint_filename, global_step=epoch)
    
        ##
        # score_dataset will run a forward pass over the entire dataset
        # and report perplexity scores. This can be slow (around 1/2 to 
        # 1/4 as long as a full epoch), so you may want to comment it out
        # to speed up training on a slow machine. Be sure to run it at the 
        # end to evaluate your score.
      #  print("[epoch {:d}]".format(epoch), end=" ")
      #  score_dataset(lm, session, train_ids, name="Train set")
      #  print("[epoch {:d}]".format(epoch), end=" ")
      #  score_dataset(lm, session, test_ids, name="Test set")
      #  print("")
    print("[epoch {:d}]".format(epoch), end=" ")
    score_dataset(lm, session, train_ids, name="Train set")
    print("[epoch {:d}]".format(epoch), end=" ")
    score_dataset(lm, session, test_ids, name="Test set")
    print("")
    # Save final model
    saver.save(session, trained_filename)

[epoch 1] Starting epoch 1
[batch 0]: seen 2500 words at 2437.0 wps, loss = 7.950
[batch 2]: seen 7500 words at 2836.4 wps, loss = 7.549
[batch 4]: seen 12500 words at 2936.2 wps, loss = 7.463
[batch 6]: seen 17500 words at 2978.0 wps, loss = 7.324
[batch 8]: seen 22500 words at 2997.1 wps, loss = 7.137
[batch 10]: seen 27500 words at 3032.5 wps, loss = 6.981
[batch 12]: seen 32500 words at 3067.7 wps, loss = 6.874
[batch 14]: seen 37500 words at 3071.3 wps, loss = 6.764
[batch 16]: seen 42500 words at 3088.3 wps, loss = 6.652
[batch 18]: seen 47500 words at 3096.7 wps, loss = 6.561
[batch 20]: seen 52500 words at 3111.6 wps, loss = 6.490
[batch 22]: seen 57500 words at 3120.9 wps, loss = 6.408
[batch 24]: seen 62500 words at 3133.7 wps, loss = 6.337
[batch 26]: seen 67500 words at 3141.4 wps, loss = 6.275
[batch 28]: seen 72500 words at 3150.6 wps, loss = 6.205
[batch 30]: seen 77500 words at 3159.0 wps, loss = 6.139
[batch 32]: seen 82500 words at 3164.5 wps, loss = 6.076
[batch 34]:

[batch 280]: seen 702500 words at 3157.3 wps, loss = 4.742
[batch 282]: seen 707500 words at 3156.6 wps, loss = 4.739
[batch 284]: seen 712500 words at 3156.2 wps, loss = 4.735
[batch 286]: seen 717500 words at 3157.0 wps, loss = 4.733
[batch 288]: seen 722500 words at 3157.0 wps, loss = 4.730
[batch 290]: seen 727500 words at 3156.1 wps, loss = 4.727
[batch 292]: seen 732500 words at 3155.3 wps, loss = 4.725
[batch 294]: seen 737500 words at 3154.9 wps, loss = 4.722
[batch 296]: seen 742500 words at 3154.3 wps, loss = 4.719
[batch 298]: seen 747500 words at 3154.2 wps, loss = 4.716
[batch 300]: seen 752500 words at 3154.7 wps, loss = 4.713
[batch 302]: seen 757500 words at 3154.6 wps, loss = 4.710
[batch 304]: seen 762500 words at 3154.3 wps, loss = 4.707
[batch 306]: seen 767500 words at 3153.7 wps, loss = 4.704
[batch 308]: seen 772500 words at 3153.6 wps, loss = 4.702
[batch 310]: seen 777500 words at 3154.4 wps, loss = 4.699
[batch 312]: seen 782500 words at 3155.2 wps, loss = 4.6

[batch 173]: seen 435000 words at 3152.2 wps, loss = 4.140
[batch 175]: seen 440000 words at 3151.8 wps, loss = 4.139
[batch 177]: seen 445000 words at 3150.5 wps, loss = 4.138
[batch 179]: seen 450000 words at 3150.1 wps, loss = 4.139
[batch 181]: seen 455000 words at 3151.6 wps, loss = 4.138
[batch 183]: seen 460000 words at 3153.2 wps, loss = 4.136
[batch 185]: seen 465000 words at 3154.0 wps, loss = 4.135
[batch 187]: seen 470000 words at 3155.2 wps, loss = 4.135
[batch 189]: seen 475000 words at 3156.5 wps, loss = 4.135
[batch 191]: seen 480000 words at 3157.5 wps, loss = 4.133
[batch 193]: seen 485000 words at 3159.0 wps, loss = 4.133
[batch 195]: seen 490000 words at 3160.4 wps, loss = 4.132
[batch 197]: seen 495000 words at 3161.8 wps, loss = 4.131
[batch 199]: seen 500000 words at 3163.1 wps, loss = 4.130
[batch 201]: seen 505000 words at 3164.7 wps, loss = 4.130
[batch 203]: seen 510000 words at 3165.8 wps, loss = 4.129
[batch 205]: seen 515000 words at 3167.0 wps, loss = 4.1

[batch 63]: seen 160000 words at 3148.7 wps, loss = 3.974
[batch 65]: seen 165000 words at 3149.0 wps, loss = 3.976
[batch 67]: seen 170000 words at 3152.3 wps, loss = 3.975
[batch 69]: seen 175000 words at 3156.8 wps, loss = 3.975
[batch 71]: seen 180000 words at 3161.0 wps, loss = 3.974
[batch 73]: seen 185000 words at 3164.7 wps, loss = 3.973
[batch 75]: seen 190000 words at 3168.3 wps, loss = 3.974
[batch 77]: seen 195000 words at 3172.1 wps, loss = 3.973
[batch 79]: seen 200000 words at 3175.7 wps, loss = 3.971
[batch 81]: seen 205000 words at 3178.1 wps, loss = 3.972
[batch 83]: seen 210000 words at 3179.9 wps, loss = 3.971
[batch 85]: seen 215000 words at 3177.5 wps, loss = 3.969
[batch 87]: seen 220000 words at 3177.7 wps, loss = 3.969
[batch 89]: seen 225000 words at 3178.0 wps, loss = 3.967
[batch 91]: seen 230000 words at 3176.3 wps, loss = 3.969
[batch 93]: seen 235000 words at 3175.5 wps, loss = 3.968
[batch 95]: seen 240000 words at 3176.4 wps, loss = 3.967
[batch 97]: se

[batch 343]: seen 860000 words at 3179.2 wps, loss = 3.932
[batch 345]: seen 865000 words at 3179.7 wps, loss = 3.932
[batch 347]: seen 870000 words at 3180.0 wps, loss = 3.931
[batch 349]: seen 875000 words at 3180.0 wps, loss = 3.931
[batch 351]: seen 880000 words at 3180.3 wps, loss = 3.930
[batch 353]: seen 885000 words at 3180.5 wps, loss = 3.930
[batch 355]: seen 890000 words at 3180.7 wps, loss = 3.930
[batch 357]: seen 895000 words at 3180.6 wps, loss = 3.930
[batch 359]: seen 900000 words at 3180.9 wps, loss = 3.929
[batch 361]: seen 905000 words at 3181.1 wps, loss = 3.929
[batch 363]: seen 910000 words at 3181.5 wps, loss = 3.929
[batch 365]: seen 915000 words at 3180.9 wps, loss = 3.928
[batch 367]: seen 920000 words at 3180.4 wps, loss = 3.929
[batch 369]: seen 925000 words at 3180.0 wps, loss = 3.928
[batch 371]: seen 930000 words at 3179.5 wps, loss = 3.928
[batch 373]: seen 935000 words at 3178.2 wps, loss = 3.928
[batch 375]: seen 940000 words at 3177.5 wps, loss = 3.9

[batch 235]: seen 590000 words at 3118.8 wps, loss = 3.934
[batch 237]: seen 595000 words at 3119.2 wps, loss = 3.934
[batch 239]: seen 600000 words at 3119.8 wps, loss = 3.933
[batch 241]: seen 605000 words at 3119.8 wps, loss = 3.932
[batch 243]: seen 610000 words at 3119.8 wps, loss = 3.931
[batch 245]: seen 615000 words at 3119.5 wps, loss = 3.930
[batch 247]: seen 620000 words at 3119.9 wps, loss = 3.929
[batch 249]: seen 625000 words at 3120.0 wps, loss = 3.928
[batch 251]: seen 630000 words at 3120.3 wps, loss = 3.927
[batch 253]: seen 635000 words at 3120.4 wps, loss = 3.926
[batch 255]: seen 640000 words at 3119.3 wps, loss = 3.925
[batch 257]: seen 645000 words at 3119.5 wps, loss = 3.924
[batch 259]: seen 650000 words at 3119.5 wps, loss = 3.924
[batch 261]: seen 655000 words at 3119.2 wps, loss = 3.923
[batch 263]: seen 660000 words at 3118.9 wps, loss = 3.923
[batch 265]: seen 665000 words at 3118.9 wps, loss = 3.922
[batch 267]: seen 670000 words at 3118.9 wps, loss = 3.9

[batch 127]: seen 320000 words at 3103.0 wps, loss = 3.814
[batch 129]: seen 325000 words at 3103.8 wps, loss = 3.814
[batch 131]: seen 330000 words at 3104.1 wps, loss = 3.813
[batch 133]: seen 335000 words at 3103.2 wps, loss = 3.813
[batch 135]: seen 340000 words at 3102.7 wps, loss = 3.813
[batch 137]: seen 345000 words at 3102.9 wps, loss = 3.813
[batch 139]: seen 350000 words at 3103.0 wps, loss = 3.812
[batch 141]: seen 355000 words at 3103.1 wps, loss = 3.811
[batch 143]: seen 360000 words at 3103.1 wps, loss = 3.812
[batch 145]: seen 365000 words at 3103.0 wps, loss = 3.811
[batch 147]: seen 370000 words at 3102.3 wps, loss = 3.811
[batch 149]: seen 375000 words at 3102.2 wps, loss = 3.810
[batch 151]: seen 380000 words at 3102.4 wps, loss = 3.810
[batch 153]: seen 385000 words at 3102.8 wps, loss = 3.809
[batch 155]: seen 390000 words at 3102.4 wps, loss = 3.808
[batch 157]: seen 395000 words at 3102.5 wps, loss = 3.807
[batch 159]: seen 400000 words at 3102.4 wps, loss = 3.8

## (d) Sampling Sentences (5 points)

If you haven't already done so in **part (b)**, implement the `BuildSamplerGraph()` method in `rnnlm.py`. See the function docstring for more information.

#### Implement the `sample_step()` method below (5 points)
This should access the Tensors you create in `BuildSamplerGraph()`. Given an input batch and initial states, it should return a vector of shape `[batch_size,1]` containing sampled indices for the next word of each batch sequence.

Run the method using the provided code to generate 10 sentences.
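
Sampling the next word means drawing an index at random in proportion to the model's softmax probabilities, rather than always taking the most likely word. The sketch below shows that idea with the standard library only. It does not use TensorFlow and is not the op you build in `BuildSamplerGraph()`; the helper names and the toy logits are made up for this example.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_index(logits, rng):
    """Draw one word index with probability proportional to exp(logit)."""
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

rng = random.Random(42)            # seeded for repeatability
toy_logits = [0.0, 1.0, 2.0, 0.5]  # hypothetical scores for a 4-word vocabulary
samples = [sample_index(toy_logits, rng) for _ in range(5)]
print(samples)
```

Drawing from the distribution instead of taking the argmax is what lets repeated runs from the same start token produce different sentences.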

In [95]:
def sample_step(lm, session, input_w, initial_h):
    """Run a single RNN step and return sampled predictions.
  
    Args:
      lm : rnnlm.RNNLM
      session: tf.Session
      input_w : [batch_size] vector of indices
      initial_h : [batch_size, hidden_dims] initial state
    
    Returns:
      final_h : final hidden state, compatible with initial_h
      samples : [batch_size, 1] vector of indices
    """
    # Reshape input to column vector
    input_w = np.array(input_w, dtype=np.int32).reshape([-1,1])
    
    #### YOUR CODE HERE ####
    # Run sample ops

    feed_dict = {lm.input_w_:input_w,
                 lm.initial_h_:initial_h}
    ops = [lm.pred_samples_, lm.final_h_]        

    samples,final_h = session.run(ops,feed_dict=feed_dict)    

    #### END(YOUR CODE) ####
    # Note indexing here: 
    #   [batch_size, max_time, 1] -> [batch_size, 1]
    return final_h, samples[:,-1,:]

In [96]:
# Same as above, but as a batch
max_steps = 20
num_samples = 10
random_seed = 42

reload(rnnlm)  # reload before constructing the model so the latest code is used
lm = rnnlm.RNNLM(**model_params)
lm.BuildCoreGraph()
lm.BuildSamplerGraph()

with lm.graph.as_default():
    saver = tf.train.Saver()

with tf.Session(graph=lm.graph) as session:
    # Seed RNG for repeatability
    tf.set_random_seed(random_seed)
    
    # Load the trained model
    saver.restore(session, trained_filename)

    # Make initial state for a batch with batch_size = num_samples
    w = np.repeat([[vocab.START_ID]], num_samples, axis=0)
    h = session.run(lm.initial_h_, {lm.input_w_: w})
    # We'll take one step for each sequence on each iteration 
    for i in range(max_steps):
        h, y = sample_step(lm, session, w[:,-1:], h)
        w = np.hstack((w,y))

    # Print generated sentences
    for row in w:
        for i, word_id in enumerate(row):
            print(vocab.id_to_word[word_id], end=" ")
            if (i != 0) and (word_id == vocab.START_ID):
                break
        print("")

Tensor("Reshape_1:0", shape=(?, ?, 1), dtype=int64)
INFO:tensorflow:Restoring parameters from /tmp/w266/a3_model/rnnlm_trained
<s> mantle <unk> one , for the <unk> catholic <unk> skyros with some of the battery during the land . <s> 
<s> this evidence ) develops the artist and token <unk> typical aid inquirer . <s> 
<s> don't congregational . <s> 
<s> it <unk> way to the space to check more to the rest of cause , had picturesque orchestra friday familiar 
<s> his , part-time and <unk> job in writing or atmosphere . <s> 
<s> he may butyrate a <unk> cheek of the public of the antique edge that a way of the same opposition 
<s> yes , should close himself down and went out down him . <s> 
<s> there were no day l . <s> 
<s> the <unk> that were greatly supplementary to miss experience , and they may make that i have often been trouble 
<s> we need financial and do for us . <s> 


## (e) Linguistic Properties (5 points)

Now that we've trained our RNNLM, let's test a few properties of the model to see how well it learns linguistic phenomena. We'll do this with a scoring task: given two or more test sentences, our model should score the more plausible (or more correct) sentence with a higher log-probability.
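
The scoring idea can be illustrated with a toy model before bringing in the RNNLM. The sketch below scores a sentence under a hand-written bigram table: it pads the sentence, pairs each token with the one that follows it (inputs are all tokens but the last, targets are all but the first), and sums the log-probabilities. The table, the `floor` value for unseen pairs, and the function name are all invented for this example.

```python
import math

# Hypothetical bigram probabilities P(next | current), for illustration only.
toy_bigram = {
    ("<s>", "the"): 0.5,
    ("the", "dog"): 0.2,
    ("dog", "barks"): 0.1,
    ("barks", "</s>"): 0.4,
}

def score_tokens(tokens, bigram, floor=1e-6):
    """Return (total, per-token average) log-probability of a token list."""
    padded = ["<s>"] + tokens + ["</s>"]
    log_probs = [math.log(bigram.get(pair, floor))
                 for pair in zip(padded[:-1], padded[1:])]
    total = sum(log_probs)
    return total, total / len(log_probs)

total, average = score_tokens("the dog barks".split(), toy_bigram)
print("total: {:.2f}  average: {:.2f}".format(total, average))
```

A total log-probability penalizes longer sentences, since every extra token adds a negative term, whereas a per-token average does not. Which of the two your model reports depends on how the loss is defined in your `rnnlm.py`, so keep that in mind when comparing sentences of different lengths.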

We'll define a scoring function to help us:

In [97]:
def score_seq(lm, session, seq, vocab):
    """Score a sequence of words. Returns total log-probability."""
    padded_ids = vocab.words_to_ids(utils.canonicalize_words(["<s>"] + seq + ["</s>"], 
                                                             wordset=vocab.word_to_id))
    w = np.reshape(padded_ids[:-1], [1,-1])
    y = np.reshape(padded_ids[1:],  [1,-1])
    h = session.run(lm.initial_h_, {lm.input_w_: w})
    feed_dict = {lm.input_w_:w,
                 lm.target_y_:y,
                 lm.initial_h_:h,
                 lm.dropout_keep_prob_: 1.0}
    # Return log(P(seq)) = -1*loss
    return -1*session.run(lm.loss_, feed_dict)

def load_and_score(inputs, sort=False):
    """Load the trained model and score the given words."""
    lm = rnnlm.RNNLM(**model_params)
    lm.BuildCoreGraph()
    
    with lm.graph.as_default():
        saver = tf.train.Saver()

    with tf.Session(graph=lm.graph) as session:  
        # Load the trained model
        saver.restore(session, trained_filename)

        if isinstance(inputs[0], str) or isinstance(inputs[0], bytes):
            inputs = [inputs]

        # Actually run scoring
        results = []
        for words in inputs:
            score = score_seq(lm, session, words, vocab)
            results.append((score, words))

        # Sort if requested
        if sort: results = sorted(results, reverse=True)

        # Print results
        for score, words in results:
            print("\"{:s}\" : {:.02f}".format(" ".join(words), score))

Now we can test as:

In [98]:
sents = ["once upon a time",
         "the quick brown fox jumps over the lazy dog"]
load_and_score([s.split() for s in sents])

INFO:tensorflow:Restoring parameters from /tmp/w266/a3_model/rnnlm_trained
"once upon a time" : -7.49
"the quick brown fox jumps over the lazy dog" : -7.23


### 1. Number agreement

Compare **"the boy and the girl [are/is]"**. Which is more plausible according to your model?

If your model doesn't order them correctly (*this is OK*), why do you think that might be? (answer in cell below)

In [99]:
#### YOUR CODE HERE ####

sents = ["the boy and the girl are",
         "the boy and the girl is"]
load_and_score([s.split() for s in sents])

#### END(YOUR CODE) ####

INFO:tensorflow:Restoring parameters from /tmp/w266/a3_model/rnnlm_trained
"the boy and the girl are" : -5.56
"the boy and the girl is" : -5.47


### 2. Type/semantic agreement

Compare:
- **"peanuts are my favorite kind of [nut/vegetable]"**
- **"when I'm hungry I really prefer to [eat/drink]"**

Of each pair, which is more plausible according to your model?

How would you expect a 3-gram language model to perform at this example? How about a 5-gram model? (answer in cell below)
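
When thinking about the n-gram question, it helps to look at exactly which words an n-gram model can see when it predicts the final word. The sketch below extracts that context window for a few values of `n`; the helper name `ngram_context` is made up for this example.

```python
def ngram_context(words, n):
    """Words an n-gram model conditions on when predicting the last word."""
    history = words[:-1]
    return tuple(history[-(n - 1):]) if n > 1 else ()

sentence = "peanuts are my favorite kind of nut".split()
for n in (3, 5, 7):
    print(n, ngram_context(sentence, n))
```

For this sentence, the word "peanuts" only enters the context once `n` reaches 7.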

In [100]:
#### YOUR CODE HERE ####


sents = ["peanuts are my favorite kind of nut",
         "peanuts are my favorite kind of vegetable",
         "when I'm hungry I really prefer to eat",
         "when I'm hungry I really prefer to drink"]
load_and_score([s.split() for s in sents])

#### END(YOUR CODE) ####

INFO:tensorflow:Restoring parameters from /tmp/w266/a3_model/rnnlm_trained
"peanuts are my favorite kind of nut" : -6.87
"peanuts are my favorite kind of vegetable" : -6.63
"when I'm hungry I really prefer to eat" : -7.17
"when I'm hungry I really prefer to drink" : -7.26


### 3. Adjective ordering (just for fun)

Let's repeat the exercise from Week 2:

![Adjective Order](adjective_order.jpg)
*source: https://twitter.com/MattAndersonBBC/status/772002757222002688?lang=en*

We'll consider a toy example (literally), and consider all possible adjective permutations.

Note that this is somewhat sensitive to training, and even a good language model might not get it all correct. Why might the NN fail, if the trigram model from Week 2 was able to solve it?

In [101]:
prefix = "I have lots of".split()
noun = "toys"
adjectives = ["square", "green", "plastic"]
inputs = []
for adjs in itertools.permutations(adjectives):
    words = prefix + list(adjs) + [noun]
    inputs.append(words)
    
load_and_score(inputs, sort=True)

INFO:tensorflow:Restoring parameters from /tmp/w266/a3_model/rnnlm_trained
"I have lots of plastic green square toys" : -7.73
"I have lots of green square plastic toys" : -7.73
"I have lots of green plastic square toys" : -7.74
"I have lots of plastic square green toys" : -7.76
"I have lots of square green plastic toys" : -7.85
"I have lots of square plastic green toys" : -7.92
